# Exercise 6: Feature Engineering and Model Selection
## Theory
### Task 1: MC
---
Multiple answers are possible.

**A regularized model has**
 - [ ] lower bias
 - [ ] lower variance
 - [ ] higher bias
 - [ ] higher variance

**Which of the following regularization techniques can be used for feature selection?**
 - [ ] L1
 - [ ] L2
 - [ ] Early stopping
 - [ ] All of the above

**Standardizing the training data by $\frac{x-\mu}{\sigma}$ can decrease model performance if**
 - [ ] the data is centered
 - [ ] the data is sparse
 - [ ] the data has outliers
 - [ ] All of the above

**How should you standardize (assuming StandardScaler) your test data before evaluating performance?**
 - [ ] not at all
 - [ ] using the mean and variance computed from the training data
 - [ ] by recomputing the mean and variance for the test data
 - [ ] All of the above

**Cross-validation**
 - [ ] the number of folds determines how often the score is computed during cross-validation
 - [ ] the cross-validated score is reported as the average of the scores over all splits
 - [ ] when using KFold cross-validation, every fold contains data of all classes
 - [ ] All of the above
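
To explore the cross-validation statements empirically, a small sketch like the following can help. The toy labels `y` and the helper `classes_per_test_fold` are made up for illustration and are not part of the lecture material.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Toy labels sorted by class: 8 samples of class 0 followed by 4 of class 1.
y = np.array([0] * 8 + [1] * 4)
X = np.arange(len(y)).reshape(-1, 1)

def classes_per_test_fold(splitter):
    """Return the set of class labels present in each test fold."""
    return [set(y[test_idx].tolist()) for _, test_idx in splitter.split(X, y)]

kfold_classes = classes_per_test_fold(KFold(n_splits=4))
strat_classes = classes_per_test_fold(StratifiedKFold(n_splits=4))
print("KFold:          ", kfold_classes)
print("StratifiedKFold:", strat_classes)

# cross_val_score returns one score per split; the reported CV score is
# usually their mean.
scores = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=4)
print("scores:", scores, "mean:", scores.mean())
```

Comparing the two printed lists shows how the choice of splitter affects which classes end up in each fold.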

**Consider binary classification. A model that always predicts class label 1 is more likely to have**
 - [ ] low accuracy
 - [ ] low precision
 - [ ] high recall
 - [ ] All of the above
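
The metrics of such a constant classifier depend on the class balance, so it can be instructive to compute them for a few settings. The sketch below is only an illustration; the helper `constant_one_metrics` and its `positive_fraction` parameter are hypothetical names.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

def constant_one_metrics(positive_fraction, n=1000):
    """Metrics of a classifier that predicts 1 for every sample."""
    n_pos = int(round(positive_fraction * n))
    y_true = np.array([1] * n_pos + [0] * (n - n_pos))
    y_pred = np.ones_like(y_true)  # always predict the positive class
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
    }

# Vary the share of positive samples and watch how each metric reacts.
for frac in (0.1, 0.5, 0.9):
    print(frac, constant_one_metrics(frac))
```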


### [Optional] Bias-variance decomposition for Regression
---
Prove that for the mean squared error (MSE), we have
 - $Err[(y - \hat f(x;D))^2] =  Bias(\hat f(x))^2 + Var(\hat f(x))$,

where Err denotes the expectation of the MSE with respect to the data distribution.

Note: If you are interested, you can also compute the Bias-variance decomposition for a classification error.
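
Before attempting the proof, the identity can be checked numerically. The sketch below assumes that the target $y$ at the query point is noiseless (with label noise, an additional irreducible error term appears). The target function `true_f`, the linear fit, and all sample sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    """Hypothetical ground-truth function."""
    return np.sin(2 * np.pi * x)

def fit_and_predict(x0, n_train=20, degree=1, noise=0.3):
    """Draw one training set D, fit a polynomial, and predict at x0."""
    x = rng.uniform(0, 1, n_train)
    y = true_f(x) + rng.normal(0, noise, n_train)
    coeffs = np.polyfit(x, y, degree)
    return np.polyval(coeffs, x0)

x0 = 0.25
y0 = true_f(x0)  # noiseless target at x0 (assumption)

# Predictions at x0 from models trained on many independent datasets D.
preds = np.array([fit_and_predict(x0) for _ in range(2000)])

mse = np.mean((y0 - preds) ** 2)
bias_sq = (y0 - preds.mean()) ** 2
variance = preds.var()  # population variance (ddof=0)

print(f"MSE          = {mse:.4f}")
print(f"Bias^2 + Var = {bias_sq + variance:.4f}")
```

Increasing `degree` lets you observe how the two terms trade off against each other.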

## Programming
---
In the programming exercises, we are going to work on the speeddating dataset from the lecture. For that, you are given the preprocessing pipeline presented there:

In [3]:
import pandas as pd
import numpy as np
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def maybe_convert_to_int(col):
  try:
    col = pd.to_numeric(col.replace('?', -1), errors='raise').astype(int)
  except Exception as e:
    return col
  return col


df = pd.read_csv('data/speeddating.csv', low_memory=False)
df = df.apply(maybe_convert_to_int, axis=0)

cat_cols = make_column_selector(dtype_include=object)
num_cols = make_column_selector(dtype_include=np.number)

num_pipe = make_pipeline(SimpleImputer(missing_values=-1, strategy='mean'), StandardScaler())
cat_pipe = make_pipeline(OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan))

ct = make_column_transformer(
  (num_pipe, num_cols),
  (cat_pipe, cat_cols),
)

# ct.transform(Xtrain) reorders the columns to num_cols(df) + cat_cols(df).
# We reorder df beforehand so that Xtrain already has this column order after train_test_split.
df = df[num_cols(df) + cat_cols(df)]
df = df.drop(['decision', 'decision_o'], axis=1)
Xtrain, Xtest, ytrain, ytest = train_test_split(df.drop('match', axis=1), df['match'])
Xtrain

| Unnamed: 0 | has_null | wave | age | age_o | d_age | samerace | importance_same_race | importance_same_religion | pref_o_attractive | pref_o_sincere | ... | d_concerts | d_music | d_shopping | d_yoga | d_interests_correlate | d_expected_happy_with_sd_people | d_expected_num_interested_in_me | d_expected_num_matches | d_like | d_guess_prob_liked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1425 | 0 | 4 | 28 | 26 | 2 | 0 | 1 | 1 | 10 | 20 | ... | [9-10] | [9-10] | [6-8] | [0-5] | [0.33-1] | [7-10] | [10-20] | [5-18] | [6-8] | [7-10] |
| 5046 | 1 | 14 | 33 | 30 | 3 | 0 | 2 | 1 | 31 | 10 | ... | [0-5] | [0-5] | [9-10] | [0-5] | [0-0.33] | [0-4] | [0-3] | [5-18] | [0-5] | [0-4] |
| 7303 | 1 | 19 | 30 | 34 | 4 | 0 | 2 | 8 | 15 | 20 | ... | [6-8] | [6-8] | [9-10] | [9-10] | [0.33-1] | [5-6] | [0-3] | [0-2] | [6-8] | [5-6] |
| 316 | 1 | 2 | 26 | 24 | 2 | 1 | -1 | -1 | 25 | 15 | ... | [0-5] | [0-5] | [0-5] | [0-5] | [-1-0] | [0-4] | [0-3] | [3-5] | [0-5] | [0-4] |
| 1114 | 0 | 4 | 28 | 28 | 0 | 1 | 6 | 9 | 20 | 20 | ... | [6-8] | [6-8] | [6-8] | [0-5] | [0.33-1] | [5-6] | [4-9] | [0-2] | [6-8] | [0-4] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5928 | 1 | 15 | 24 | 27 | 3 | 1 | 1 | 6 | 20 | 20 | ... | [6-8] | [6-8] | [6-8] | [0-5] | [0-0.33] | [0-4] | [0-3] | [0-2] | [0-5] | [0-4] |
| 5330 | 1 | 14 | 26 | 34 | 8 | 0 | 1 | 1 | 30 | 20 | ... | [0-5] | [0-5] | [6-8] | [0-5] | [0-0.33] | [5-6] | [0-3] | [0-2] | [6-8] | [7-10] |
| 129 | 1 | 1 | 22 | 26 | 4 | 0 | 3 | 5 | 15 | 15 | ... | [9-10] | [9-10] | [6-8] | [0-5] | [-1-0] | [0-4] | [4-9] | [0-2] | [9-10] | [7-10] |
| 7662 | 1 | 21 | 28 | 23 | 5 | 0 | 2 | 1 | 40 | 30 | ... | [0-5] | [0-5] | [6-8] | [0-5] | [-1-0] | [5-6] | [0-3] | [3-5] | [6-8] | [0-4] |


### [Optional] Sklearn Feature Selection using RandomForest
Repeat the Feature Selection Exercise from the lecture (using SelectFromModel) but with RandomForestClassifier using the train and test split, as well as the column transformer defined in the cell above.
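
As a starting point, the general pattern might look like the sketch below. It uses synthetic data from `make_classification` as a stand-in for `ct.fit_transform(Xtrain)` and `ct.transform(Xtest)`, so the feature counts and accuracies say nothing about the speeddating data. The choice `threshold="mean"` and the forest settings are assumptions, not values from the lecture.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed training and test data (assumption).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit the forest, then keep features whose importance reaches the mean importance.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
selector = SelectFromModel(forest, prefit=True, threshold="mean")

X_tr_sel = selector.transform(X_tr)
X_te_sel = selector.transform(X_te)
mask = selector.get_support()  # boolean mask over the original features

# Refit on the reduced feature set and compare test accuracy.
reduced = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr_sel, y_tr)
print("kept features:", mask.sum(), "of", mask.size)
print("full accuracy:   ", forest.score(X_te, y_te))
print("reduced accuracy:", reduced.score(X_te_sel, y_te))
```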

### Task 1: Feature Selection
---
Look at Algorithm 2 on page 7 of the paper [Feature Selection Based on L1-Norm Support Vector Machine and Effective Recognition System for Parkinson’s Disease Using Voice Recordings](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8672565).

The goal of this exercise is to implement a similar pipeline but for the speeddating dataset from the lecture. You can download it at
https://www.openml.org/d/40536. 

We are going to repeatedly perform feature selection with an L1-penalized SVM, whose parameter 'C' is optimized using GridSearchCV, until the feature set can no longer be reduced.

For this purpose, you should
1. Run GridSearchCV to pick the best 'C' for your LinearSVC. You can use, for example, ```param_grid = [{'C': [0.01, 0.1, 1., 10.]}]```
2. Fit the ```LinearSVC(penalty='l1', dual=False)``` using the current set of features (do not forget to reduce your training dataset and apply the preprocessing again)
3. Reduce the number of features using ```SelectFromModel(lsvc, prefit=True)``` (for example via its ```transform``` or ```get_support``` method)
4. Repeat until the number of returned features no longer changes, or until a maximum number of iterations (e.g. 10) is reached

For debugging, you can use only a subset of the features, e.g.
```selected_features = ['attractive_o', 'funny_o', 'shared_interests_o', 'attractive_partner']``` to start with. But note that this set will not be further reduced by SelectFromModel.
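
The control flow of the four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference solution: it runs on synthetic data from `make_classification`, uses a plain `StandardScaler` in place of the column transformer `ct`, and sets `max_iter=5000` only to reduce convergence warnings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the speeddating training data (assumption).
X, y = make_classification(n_samples=400, n_features=30, n_informative=5,
                           n_redundant=0, random_state=0)
param_grid = [{'C': [0.01, 0.1, 1., 10.]}]

selected = np.arange(X.shape[1])  # indices of the currently selected features
for iteration in range(10):       # maximum number of iterations
    # Reduce the data to the current features and apply the preprocessing again.
    X_cur = StandardScaler().fit_transform(X[:, selected])

    # Step 1: pick the best C via cross-validation.
    search = GridSearchCV(LinearSVC(penalty='l1', dual=False, max_iter=5000),
                          param_grid, cv=5)
    search.fit(X_cur, y)

    # Step 2: with refit=True (the default), best_estimator_ is already
    # fitted on the current features using the best C.
    lsvc = search.best_estimator_

    # Step 3: keep only the features with non-negligible coefficients.
    mask = SelectFromModel(lsvc, prefit=True).get_support()
    print(f"iteration {iteration}: C={search.best_params_['C']}, "
          f"{mask.sum()} of {len(selected)} features kept")

    # Step 4: stop once the feature set no longer shrinks (or would become empty).
    if mask.all() or not mask.any():
        break
    selected = selected[mask]

print("final feature indices:", selected)
```

On the real data, `selected` would hold column names instead of indices, and the preprocessing step would refit the column transformer on `Xtrain[selected_features]`.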


In [3]:
pass
param_grid = [{'C': [0.01, 0.1, 1., 10.]}]

### [Optional] Hyperparameter optimization and Cross-validation
---
Having implemented the feature selection algorithm, you can implement the whole proposed system, i.e., implement Algorithm 1 on page 4 for our dataset. You can and should modify or simplify it wherever necessary.

In [4]:
pass