# Exercise 6: Feature Engineering and Model Selection
## Theory
### Task 1: MC
---
Multiple answers are possible.

**A regularized model has**
 - [ ] lower bias
 - [x] lower variance
 - [x] higher bias
 - [ ] higher variance

**Which of the following regularization techniques can be used for feature selection?**
 - [x] L1
 - [ ] L2
 - [ ] Early stopping
 - [ ] All of the above
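
A minimal sketch of why an L1 penalty can act as a feature selector while an L2 penalty does not. The synthetic data from `make_regression`, the penalty strength `alpha=1.0`, and the use of `Lasso` and `Ridge` as stand-ins for L1- and L2-regularized linear models are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumption): only 3 of the 10 features carry signal
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

# Count coefficients that are exactly zero under each penalty
n_zero_l1 = int(np.sum(lasso.coef_ == 0))
n_zero_l2 = int(np.sum(ridge.coef_ == 0))
print(f'exactly-zero coefficients: L1 = {n_zero_l1}, L2 = {n_zero_l2}')
```

Features whose coefficient is exactly zero can be dropped without changing the model's predictions. L2 only shrinks coefficients towards zero.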

**Standardizing the training data by $\frac{x-\mu}{\sigma}$ can decrease model performance if**
 - [ ] the data is centered
 - [x] the data is sparse
 - [x] the data has outliers
 - [ ] All of the above

**How should you standardize (assuming StandardScaler) your test data before evaluating performance?**
 - [ ] not at all
 - [x] using the mean and variance computed from the training data
 - [ ] by recomputing the mean and variance for the test data
 - [ ] All of the above
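
As a small illustration of the checked answer, the sketch below fits a `StandardScaler` on training data only and reuses its statistics for the test data. The toy arrays `Xtrain_toy` and `Xtest_toy` are hypothetical and not part of the exercise data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Toy data (assumption): train and test drawn from the same distribution
Xtrain_toy = rng.normal(loc=5.0, scale=2.0, size=(100, 1))
Xtest_toy = rng.normal(loc=5.0, scale=2.0, size=(20, 1))

scaler = StandardScaler().fit(Xtrain_toy)     # statistics from the training data only
Xtest_scaled = scaler.transform(Xtest_toy)    # reuse them for the test data

# The test data is shifted and scaled by the *training* mean and standard deviation
expected = (Xtest_toy - Xtrain_toy.mean(axis=0)) / Xtrain_toy.std(axis=0)
print(np.allclose(Xtest_scaled, expected))
```

Recomputing the statistics on the test data would apply a different transformation than the one the model was trained with.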

**Cross-validation**
 - [ ] the number of folds determines how often the score is computed during cross-validation
 - [x] the cross-validated score is reported as the average score over all splits
 - [ ] when using KFold cross-validation, every fold contains data of all classes
 - [ ] All of the above
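
The following sketch illustrates two of the statements above on hypothetical toy data (`X_toy`, `y_toy`): an unshuffled `KFold` can produce folds that contain only one class, and `cross_val_score` returns one score per split, whose mean is usually reported.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Toy data (assumption): 20 samples, sorted by class label
X_toy = np.arange(20, dtype=float).reshape(-1, 1)
y_toy = np.array([0] * 10 + [1] * 10)

# Unshuffled KFold splits by position, so a fold can contain a single class
kf = KFold(n_splits=2)
fold_classes = [set(y_toy[test_idx]) for _, test_idx in kf.split(X_toy)]
print(fold_classes)

# cross_val_score returns one score per split; the reported score is their mean
kf_shuffled = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X_toy, y_toy, cv=kf_shuffled)
print(len(scores), scores.mean())
```

If every fold should reflect the class proportions, `StratifiedKFold` is the usual choice in scikit-learn.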

**Consider binary classification. A model that always returns 1 as its predicted class label is more likely to have**
 - [ ] low accuracy
 - [ ] low precision
 - [ ] high recall
 - [x] All of the above
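
A minimal sketch of the metrics for such a constant classifier, assuming hypothetical toy labels in which the positive class is the minority (2 out of 10 samples):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy labels (assumption): imbalanced, 2 positives out of 10 samples
y_true = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.ones_like(y_true)  # the model always predicts class 1

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
print(acc, prec, rec)
```

Recall is perfect because no positive sample is missed, while accuracy and precision both equal the fraction of positive samples. They are therefore only low when positives are rare.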


### [Optional] Bias-variance decomposition for Regression
---
Prove that for the mean squared error, we have
 - $Err[(y - \hat f(x;D))^2] =  Bias(\hat f(x))^2 + Var(\hat f(x))$,
 
where Err denotes the expectation of the squared error w.r.t. the data distribution.

Note: If you are interested, you can also derive the bias-variance decomposition for a classification error.

<img src="img/bvdc.png" alt="Drawing" style="width: 1024px;"/>

[Source](https://en.wikipedia.org/wiki/Mean_squared_error#Proof_of_variance_and_bias_relationship)
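
A sketch of the derivation in text form, under the assumption that the target $y$ at a fixed $x$ is deterministic (no label noise). The shorthand $\bar f(x) = E_D[\hat f(x;D)]$ is introduced here for convenience, with $Bias(\hat f(x)) = \bar f(x) - y$ and $Var(\hat f(x)) = E_D[(\hat f(x;D) - \bar f(x))^2]$.

$$
\begin{aligned}
E_D[(y - \hat f(x;D))^2]
  &= E_D\big[\big((y - \bar f(x)) + (\bar f(x) - \hat f(x;D))\big)^2\big] \\
  &= (y - \bar f(x))^2 + 2\,(y - \bar f(x))\,E_D[\bar f(x) - \hat f(x;D)] + E_D[(\bar f(x) - \hat f(x;D))^2] \\
  &= Bias(\hat f(x))^2 + Var(\hat f(x))
\end{aligned}
$$

The cross term vanishes because $E_D[\bar f(x) - \hat f(x;D)] = \bar f(x) - \bar f(x) = 0$. If $y$ is noisy, an additional irreducible noise term appears on the right-hand side.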

## Programming
---
In the programming exercises, we are going to work on the speeddating dataset from the lecture. For that, you are given the preprocessing pipeline presented there:

In [15]:
import pandas as pd
import numpy as np
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

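def maybe_convert_to_int(col):
  """Convert a column to int, encoding missing values ('?') as -1.

  Columns that cannot be converted (e.g. non-numeric strings) are returned unchanged."""
  try:
    col = pd.to_numeric(col.replace('?', -1), errors='raise').astype(int)
  except Exception:
    return col
  return col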


df = pd.read_csv('data/speeddating.csv', low_memory=False)
df = df.apply(maybe_convert_to_int, axis=0)

cat_cols = make_column_selector(dtype_include=object)
num_cols = make_column_selector(dtype_include=np.number)

num_pipe = make_pipeline(SimpleImputer(missing_values=-1, strategy='mean'), StandardScaler())
cat_pipe = make_pipeline(OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan))

ct = make_column_transformer(
  (num_pipe, num_cols),
  (cat_pipe, cat_cols),
)


#df = df[num_cols(df) + cat_cols(df)]
#df = df.drop(['decision', 'decision_o'], axis=1)
#Xtrain, Xtest, ytrain, ytest = train_test_split(df.drop('match', axis=1), df['match'])
len(df.columns)


123

### [Optional] Sklearn Feature Selection using RandomForest
Repeat the feature selection exercise from the lecture (using SelectFromModel), but with a RandomForestClassifier. Use the train and test split as well as the column transformer defined in the cell above.

In [26]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
# Train on all features
Xtrain_t = ct.fit_transform(Xtrain)
forest = RandomForestClassifier().fit(Xtrain_t, ytrain)
score1 = forest.score(ct.transform(Xtest), ytest)
# print(f'Test score after training on all features is {score1}')

# Select the 10 most important features; prefit=True reuses the fitted forest,
# so the threshold and the importances come from the same model
threshold = np.sort(forest.feature_importances_)[-10]
sfm = SelectFromModel(forest, threshold=threshold, prefit=True)
mask = sfm.get_support()
# assumes Xtrain's columns are ordered like the transformer output (numeric, then categorical)
selected_features = np.array(Xtrain.columns)[mask]
# print(f"Features selected by SelectFromModel: {selected_features}")

# Retrain on the selected features only
ct.fit(Xtrain[selected_features])
transformed_reduced_dataset = ct.transform(Xtrain[selected_features])
forest = RandomForestClassifier().fit(transformed_reduced_dataset, ytrain)
score2 = forest.score(ct.transform(Xtest[selected_features]), ytest)
print(f'Test score after training only on selected features is {score2}')
print(f'Accuracy difference is {score2 - score1} when trained only on {len(selected_features)} features')


Test score after training only on selected features is 0.8491646778042959
Accuracy difference is -0.010023866348448651 when trained only on 10 features


### Task 1: Feature Selection
---
Look at Algorithm 2 on page 7 of the paper [Feature Selection Based on L1-Norm Support Vector Machine and Effective Recognition System for Parkinson’s Disease Using Voice Recordings](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8672565).

The goal of this exercise is to implement a similar pipeline but for the speeddating dataset from the lecture. You can download it at
https://www.openml.org/d/40536. 

We are going to repeatedly perform feature selection using an L1-penalized SVM whose parameter 'C' has been optimized with GridSearchCV, until the feature set can no longer be reduced.

For this purpose, you should
1. Run GridSearchCV to pick the best 'C' for your LinearSVC. You can use, for example, ```param_grid = [{'C': [0.01, 0.1, 1., 10.]}]```
2. Fit the ```LinearSVC(penalty='l1', dual=False)``` using the current set of features (do not forget to reduce your training dataset and apply the preprocessing again)
3. Reduce the number of features using ```SelectFromModel(lsvc, prefit=True)```'s transform method
4. Repeat until the number of returned features does not change or for a maximum number of iterations, e.g. 10

For debugging, you can use only a subset of the features, e.g.
```selected_features = ['attractive_o', 'funny_o', 'shared_interests_o', 'attractive_partner']``` to start with. But note that this set will not be further reduced by SelectFromModel.
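
The four steps above can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: `make_classification` stands in for the preprocessed speeddating features, the features are tracked by column index instead of by name, and the names `selected` and `history` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the preprocessed dataset (assumption)
X, y = make_classification(n_samples=300, n_features=30, n_informative=5,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)

param_grid = [{'C': [0.01, 0.1, 1., 10.]}]
selected = np.arange(X.shape[1])   # indices of the currently selected features
history = [len(selected)]          # number of features after each reduction

for _ in range(10):                # maximum number of iterations
    Xcur = X[:, selected]
    # Step 1: pick the best C for the current feature set
    base = LinearSVC(penalty='l1', dual=False, max_iter=5000)
    search = GridSearchCV(base, param_grid).fit(Xcur, y)
    # Step 2: fit the L1-penalized SVM with the best C
    lsvc = LinearSVC(penalty='l1', dual=False, max_iter=5000,
                     **search.best_params_).fit(Xcur, y)
    # Step 3: keep only the features with non-negligible coefficients
    mask = SelectFromModel(lsvc, prefit=True).get_support()
    # Step 4: stop if nothing was removed (or if everything would be removed)
    if mask.all() or not mask.any():
        break
    selected = selected[mask]
    history.append(len(selected))

print(history)
```

On the real dataset the preprocessing has to be refit on the reduced columns in every iteration, as required in step 2.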


In [32]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectFromModel

param_grid = [{'C': [0.01, 0.1,]}]

selected_features = np.array(Xtrain.columns)
previous_number_of_features = np.inf
max_it = 10
i = 0
iteration_num_features_not_changed = 0
estimator = LinearSVC(penalty='l1', dual=False, max_iter=5000)

while i < max_it and iteration_num_features_not_changed < 2:
  # count consecutive iterations in which the number of features did not change
  if previous_number_of_features == len(selected_features):
    iteration_num_features_not_changed += 1
  else:
    iteration_num_features_not_changed = 0
  previous_number_of_features = len(selected_features)

  # Grid search for the best C on the current feature set
  # (note: this overwrites Xtrain with the reduced frame)
  Xtrain = Xtrain[selected_features]
  X = ct.fit_transform(Xtrain)
  clf = GridSearchCV(estimator, param_grid).fit(X, ytrain)

  # use the best params to fit the L1-penalized SVM used for feature selection
  estimator = LinearSVC(**clf.best_params_, penalty='l1', dual=False, max_iter=5000).fit(X, ytrain)
  sfm = SelectFromModel(estimator, prefit=True)
  mask = sfm.get_support()

  # reduce features
  selected_features = np.array(Xtrain.columns)[mask]

  print(f'len = {len(selected_features)}')
  i += 1



len = 47
len = 47
len = 47


In [30]:
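# Scratch cell: how * and ** unpack arguments, as in LinearSVC(**clf.best_params_)
def f(arg1='str', arg2=1):
    print(arg1, arg2)

def g(a,b):
    print(a,b)

t = (1,2)
d = {'b': 1,
    'a': 2}
# g(**d) passes the dict entries as keyword arguments, g(*t) the tuple entries positionally;
# g returns None, so the outer print shows "None None"
print(g(**d), g(*t))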


2 1
1 2
None None


### [Optional] Hyperparameter optimization and Cross-validation
---
Having implemented the feature selection algorithm, you can implement the whole proposed system, i.e., implement Algorithm 1 on page 4 of the paper for our dataset. You can and should modify or simplify it whenever necessary.

In [3]:
import pandas as pd
import numpy as np
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


"""Step 1: Preprocessing     [x]"""
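def maybe_convert_to_int(col):
  """Convert a column to int, encoding missing values ('?') as -1.

  Columns that cannot be converted (e.g. non-numeric strings) are returned unchanged."""
  try:
    col = pd.to_numeric(col.replace('?', -1), errors='raise').astype(int)
  except Exception:
    return col
  return col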


df = pd.read_csv('data/speeddating.csv', low_memory=False)
df = df.apply(maybe_convert_to_int, axis=0)

cat_cols = make_column_selector(dtype_include=object)
num_cols = make_column_selector(dtype_include=np.number)

num_pipe = make_pipeline(SimpleImputer(missing_values=-1, strategy='mean'), StandardScaler())
cat_pipe = make_pipeline(OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan))

ct = make_column_transformer(
  (num_pipe, num_cols),
  (cat_pipe, cat_cols),
)

# ct.transform(Xtrain) will change the column order to num_cols(df) + cat_cols(df)
# we reorder df beforehand, so that Xtrain.columns has the same order as the transformer output
df = df[num_cols(df) + cat_cols(df)]
df = df.drop(['decision', 'decision_o'], axis=1)
Xtrain, Xtest, ytrain, ytest = train_test_split(df.drop('match', axis=1), df['match'])

""" Step 2: Feature Selection [x] """
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

param_grid = [{'C': [0.01, 0.1,]}]
selected_features = np.array(Xtrain.columns)

prev = -np.inf
idle = 0

max_it = 10
i = 0

estimator = LinearSVC(penalty='l1', dual=False, max_iter=5000)

while i < max_it and idle < 2:

  # Grid Search best params
  Xtrain = Xtrain[selected_features]
  X = ct.fit_transform(Xtrain)
  clf = GridSearchCV(estimator, param_grid).fit(X, ytrain)

  # use best params to do feature selection
  estimator = LinearSVC(**clf.best_params_, penalty='l1', dual=False, max_iter=5000).fit(X, ytrain)
  sfm = SelectFromModel(estimator, prefit=True)
  mask = sfm.get_support()

  # reduce features
  selected_features = np.array(Xtrain.columns)[mask]
  idle += 1 if prev == len(selected_features) else 0

  prev = len(selected_features)
  print(f'len = {len(selected_features)}')
  i += 1

""" Steps 3-7: [Optional] """


# Step 3: 10-fold cross validation split
from sklearn.model_selection import KFold, cross_val_score
kf = KFold(n_splits=10)
# Step 4: train classifier
cv_score = cross_val_score(estimator=estimator, X=X, y=ytrain, cv=kf)
# Step 5: cross_val_score: test set
test_scores = cross_val_score(estimator, ct.transform(Xtest[selected_features]), ytest)
# Step 6: cross_val_score.mean()
print(f'mean_test_score = {test_scores.mean()}')
# Step 7:
print(f'best performance = {max(test_scores)}')

len = 46


KeyboardInterrupt: 