# Lecture 6 - Feature Engineering and Model selection

1. Bias-variance tradeoff
2. Regularization (Lasso, Ridge, Early Stopping)
3. Feature Engineering
    1. Vectorization and Standardization
    2. Feature selection
    3. Feature extraction/creation
4. Model Selection
    1. Cross Validation
    2. GridSearch & RandomSearch
    3. score metrics (Classification/Regression)

## 1. Bias-variance tradeoff
The error of a machine learning model can be separated into:
 - reducible error
 - irreducible error

Irreducible errors can not be changed, even when using different models. 

The reducible error can be decomposed into **bias and variance**.

### Bias
Let $\hat\theta = g(x_1,...,x_N)$ be a point estimator of the true $\theta$ that generated the data.

Bias: (inherent) inability of a model to capture relation between data and parameters.
 - Bias: $\mathbb{E}_{x|\theta}[\hat\theta] - \theta$,

Example: Linear regression model has inherent bias of assuming linear relationship between X and y. In reality, there may be additional unknown- and/or non-linear relationships.

### Variance
Variance: how much does our estimator change as function of the data sample. I.e. how much does it overfit a particular dataset.
 - Variance: $Var[\hat\theta] = \mathbb{E}[(\hat\theta - \mathbb{E}[\hat\theta])^2]$

Example: Decision tree with too high depth. Neural Network with too many Hidden units.

<img src="img\ufof.png" alt="Drawing" style="width: 512px;"/>

## Bias-variance tradeoff 
**=> Generalization performance**

We can't do anything about the irreducible error.
The reducible error can be decomposed into **bias and variance**.

Applying such a decomposition for the Mean Squared error loss, it can be shown that
 - $\mathbb{E}[(y - \hat f(x;D))^2] =  Bias(\hat f(x))^2 + Var(\hat f(x)) + \sigma^2$, where $\sigma^2$ is the irreducible error (the variance of the noise in $y$)
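
A small simulation can make the decomposition concrete. The sketch below is only an illustration under assumptions that are not part of the lecture: a sine function as ground truth, Gaussian noise with $\sigma = 0.3$, and polynomial fits of degree 1 and 5 via `np.polyfit`. It refits each model on many freshly sampled datasets and estimates squared bias and variance on a fixed grid of test points.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                      # assumed ground-truth function
sigma = 0.3                     # assumed noise level (irreducible error = sigma**2)
x_test = np.linspace(0.0, 3.0, 50)

def bias2_and_variance(degree, n_datasets=300, n_samples=30):
    """Estimate squared bias and variance of a polynomial fit of the given degree."""
    preds = np.empty((n_datasets, x_test.size))
    for d in range(n_datasets):
        # draw a fresh training set D and refit the model on it
        x = rng.uniform(0.0, 3.0, n_samples)
        y = f(x) + rng.normal(0.0, sigma, n_samples)
        coefs = np.polyfit(x, y, degree)
        preds[d] = np.polyval(coefs, x_test)
    mean_pred = preds.mean(axis=0)
    bias2 = np.mean((mean_pred - f(x_test)) ** 2)   # averaged over test points
    variance = np.mean(preds.var(axis=0))           # averaged over test points
    return bias2, variance

results = {deg: bias2_and_variance(deg) for deg in (1, 5)}
for deg, (b2, var) in results.items():
    print(f"degree={deg}: bias^2={b2:.4f}, variance={var:.4f}")
```

In this setup the straight line should show the larger squared bias, while the degree-5 polynomial should show the larger variance.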

**Cross validation can be used to optimize the bias-variance tradeoff of your model.**

### Underfitting/Overfitting
<img src="img\bias-variance-tradeoff.png" alt="Drawing" style="width: 512px;"/>

### Model improvements
Techniques to decrease Variance (reduce overfitting):
- dimensionality reduction
- feature selection
- regularization
- more data
- ensemble learning

## 2. Regularization
1. Lasso
2. Ridge
3. Early Stopping

<img src="img\bias-variance-tradeoff.png" alt="Drawing" style="width: 512px;"/>

### 1. L1 - Regularization (Lasso)

Penalizes the model by the l1-norm of its coefficients by adding the term
 - $\lambda \sum^p_{j=1} |w_j|$

to the objective function.
 
Higher values for $\lambda$
=> more coefficients close to or equal to zero 
=> more features irrelevant for computing the target

**=> L1 regularization can be used for feature selection!**
- e.g. remove features $j$, for which corresponding weight $w_j$ is zero. Then, train other models with reduced feature set 
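
The sparsity effect can be sketched on synthetic data. The data, the true weights, and the penalty values below are assumptions chosen for illustration. Note that scikit-learn's `Lasso` calls the penalty strength `alpha` rather than $\lambda$ and scales the squared-error term by $\frac{1}{2n}$, so the values are not directly comparable to the formula in these notes.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.array([3.0, -2.0, 1.5] + [0.0] * 7)   # only the first 3 features matter
y = X @ true_w + rng.normal(scale=0.5, size=200)

alphas = [0.01, 0.1, 1.0, 5.0]
n_zero = []
for alpha in alphas:
    w = Lasso(alpha=alpha).fit(X, y).coef_
    n_zero.append(int(np.sum(w == 0)))
    print(f"alpha={alpha}: {n_zero[-1]} of {w.size} coefficients are exactly zero")

# features whose weight survives a moderate penalty could be kept for other models
kept = np.flatnonzero(Lasso(alpha=0.1).fit(X, y).coef_)
print("kept feature indices:", kept)
```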

**For example, for regression**

- minimize $\sum^n_{i=1}(y_i - \sum^p_{j=1}x_{ij}w_j)^2 + \lambda \sum^p_{j=1} |w_j|$ 
 
**or for classification (via L1-SVMs)**

- minimize $\lVert \mathbf{w} \rVert_1 + C \lVert \xi \rVert_1$ subject to
 $D(A\mathbf{w} + \mathbf{1}b) +\xi \geq \mathbf{1}$,
 $\xi \geq 0$
 
As $\lambda$ increases, bias __ creases
As $\lambda$ increases, variance __ creases

### 2. L2 - Regularization (Ridge)

Penalizes the model by the l2-norm of its coefficients by adding the term
 - $\lambda \sum^p_{j=1} w_j^2 = \lambda \lVert \mathbf{w} \rVert^2_2$

Examples from $L1$ - Regularization apply analogously. 

However, $L2$ - Regularization is not used for feature selection!
 - because square penalty does not enforce sparsity of solution
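
A quick check on synthetic data (the data and `alpha` values are assumptions for illustration) suggests what this looks like in practice: with scikit-learn's `Ridge`, the coefficient vector shrinks as the penalty grows, but individual weights do not become exactly zero.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.array([3.0, -2.0, 1.5] + [0.0] * 7)   # only the first 3 features matter
y = X @ true_w + rng.normal(scale=0.5, size=200)

norms, zeros = [], []
for alpha in [0.1, 10.0, 1000.0]:
    w = Ridge(alpha=alpha).fit(X, y).coef_
    norms.append(float(np.linalg.norm(w)))
    zeros.append(int(np.sum(w == 0)))
    print(f"alpha={alpha}: ||w||_2={norms[-1]:.3f}, exact zeros={zeros[-1]}")
```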

### 3. Early Stopping
Another, less explicit, regularization technique

**Early Stopping = Only train for a limited number of iterations!**

Examples:
 - Decision Trees will have less depth
 - Neural Networks will memorize less of the training data
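
For iteratively trained models, early stopping is commonly implemented by monitoring a held-out validation score and stopping once it no longer improves. The loop below is a minimal sketch of that idea, not the lecture's own implementation; the synthetic data, the `patience` of 5 epochs, and the use of `SGDClassifier.partial_fit` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

clf = SGDClassifier(random_state=0)
patience, max_epochs = 5, 200            # assumed stopping rule
best_score, best_epoch, epochs_run = -np.inf, 0, 0

for epoch in range(1, max_epochs + 1):
    clf.partial_fit(X_tr, y_tr, classes=np.unique(y))   # one pass over the training data
    epochs_run = epoch
    score = clf.score(X_val, y_val)                     # validation accuracy
    if score > best_score:
        best_score, best_epoch = score, epoch
    elif epoch - best_epoch >= patience:
        break                                           # no improvement for `patience` epochs

print(f"stopped after {epochs_run} epochs, best validation accuracy {best_score:.3f}")
```

scikit-learn also exposes built-in variants of this idea, for example the `early_stopping` parameter of `SGDClassifier` and `MLPClassifier`.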

## 3. Feature Engineering
1. Vectorization and Standardization
 - numerical stability
 - better training results due to improved geometry of the problem
2. Feature extraction/creation
 - introduce potential feature relations
3. Feature selection
 - interpretability
 - shorter training times
 - enhanced generalization by reducing overfitting/variance

The process of creating features for machine learning from raw data is what we refer to as feature engineering.

Let's get practical with something just as interesting: Speed-dating!
On the speeddating dataset we will investigate the following instances of Feature Engineering:

 - Data Imputation
 - Standardization
 - Vectorization of categorical features 

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv('data/speeddating.csv', low_memory=False)
df.head(3)

Unnamed: 0,has_null,wave,gender,age,age_o,d_age,d_d_age,race,race_o,samerace,...,d_expected_num_interested_in_me,d_expected_num_matches,like,guess_prob_liked,d_like,d_guess_prob_liked,met,decision,decision_o,match
0,0,1,female,21,27,6,[4-6],'Asian/Pacific Islander/Asian-American',European/Caucasian-American,0,...,[0-3],[3-5],7,6,[6-8],[5-6],0,1,0,0
1,0,1,female,21,22,1,[0-1],'Asian/Pacific Islander/Asian-American',European/Caucasian-American,0,...,[0-3],[3-5],7,5,[6-8],[5-6],1,1,0,0
2,1,1,female,21,22,1,[0-1],'Asian/Pacific Islander/Asian-American','Asian/Pacific Islander/Asian-American',1,...,[0-3],[3-5],7,?,[6-8],[0-4],1,1,1,1


In [2]:
# realize that numbers are stored as strings...
df.select_dtypes(exclude=int).head(3)

Unnamed: 0,has_null,wave,gender,age,age_o,d_age,d_d_age,race,race_o,samerace,...,d_expected_num_interested_in_me,d_expected_num_matches,like,guess_prob_liked,d_like,d_guess_prob_liked,met,decision,decision_o,match
0,0,1,female,21,27,6,[4-6],'Asian/Pacific Islander/Asian-American',European/Caucasian-American,0,...,[0-3],[3-5],7,6,[6-8],[5-6],0,1,0,0
1,0,1,female,21,22,1,[0-1],'Asian/Pacific Islander/Asian-American',European/Caucasian-American,0,...,[0-3],[3-5],7,5,[6-8],[5-6],1,1,0,0
2,1,1,female,21,22,1,[0-1],'Asian/Pacific Islander/Asian-American','Asian/Pacific Islander/Asian-American',1,...,[0-3],[3-5],7,?,[6-8],[0-4],1,1,1,1


In [7]:
# parse each column to numeric (after replacing missing values with -1) if possible
def maybe_convert_to_int(col):
    try:
        col = pd.to_numeric(col.replace('?', -1), errors='raise').astype(int)
    except Exception as e:
        return col
    return col
df = df.apply(maybe_convert_to_int, axis=0)
df.select_dtypes(exclude=int).head(3)

Unnamed: 0,gender,d_d_age,race,race_o,d_importance_same_race,d_importance_same_religion,field,d_pref_o_attractive,d_pref_o_sincere,d_pref_o_intelligence,...,d_concerts,d_music,d_shopping,d_yoga,d_interests_correlate,d_expected_happy_with_sd_people,d_expected_num_interested_in_me,d_expected_num_matches,d_like,d_guess_prob_liked
0,female,[4-6],'Asian/Pacific Islander/Asian-American',European/Caucasian-American,[2-5],[2-5],Law,[21-100],[16-20],[16-20],...,[9-10],[9-10],[6-8],[0-5],[0-0.33],[0-4],[0-3],[3-5],[6-8],[5-6]
1,female,[0-1],'Asian/Pacific Islander/Asian-American',European/Caucasian-American,[2-5],[2-5],Law,[21-100],[0-15],[0-15],...,[9-10],[9-10],[6-8],[0-5],[0.33-1],[0-4],[0-3],[3-5],[6-8],[5-6]
2,female,[0-1],'Asian/Pacific Islander/Asian-American','Asian/Pacific Islander/Asian-American',[2-5],[2-5],Law,[16-20],[16-20],[16-20],...,[9-10],[9-10],[6-8],[0-5],[0-0.33],[0-4],[0-3],[3-5],[6-8],[0-4]


In [4]:
# column selectors
from sklearn.compose import make_column_selector, make_column_transformer

cat_cols = make_column_selector(dtype_include=object)
num_cols = make_column_selector(dtype_include=np.number)

In [5]:
#!pip install --upgrade scikit-learn --user

In [6]:
# pipeline
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder

num_pipe = make_pipeline(
    # 1. Imputation
    SimpleImputer(missing_values=-1, strategy='mean'),
    # 2. Standardization
    StandardScaler()
)
cat_pipe = make_pipeline(
    # 3. Vectorization of categorical values
    OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan)
)

transform = make_column_transformer(
    (num_pipe, num_cols), 
    (cat_pipe, cat_cols)
)
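
To see what such a transformer produces, it can help to run the same building blocks on a tiny frame. The toy data below is hypothetical (it is not the speed-dating data), with `-1` marking a missing numeric value as in the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# hypothetical toy frame: one numeric column with a missing value, one categorical column
toy = pd.DataFrame({
    "age": [20, 30, -1, 40],
    "gender": pd.Series(["female", "male", "female", "male"], dtype=object),
})

toy_transform = make_column_transformer(
    # numeric columns: replace -1 by the column mean, then standardize
    (make_pipeline(SimpleImputer(missing_values=-1, strategy="mean"), StandardScaler()),
     make_column_selector(dtype_include=np.number)),
    # categorical columns: map each category to an integer code
    (OrdinalEncoder(), make_column_selector(dtype_include=object)),
)
toy_out = toy_transform.fit_transform(toy)
print(toy_out)
```

The missing age is imputed with the mean of the observed ages (30), so it ends up at 0 after standardization, and the two gender categories are encoded as 0 and 1.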


We split the sample into a train and a test dataset. Only the train dataset will be used in the following exploratory analysis. This is a way to emulate a real situation where predictions are performed on an unknown target, and we don’t want our analysis and decisions to be biased by our knowledge of the test data.

In [8]:
# split + transform
from sklearn.model_selection import train_test_split

df = df.drop(['decision', 'decision_o'], axis=1)
df = df[num_cols(df) + cat_cols(df)]
Xtrain, Xtest, ytrain, ytest = train_test_split(df.drop('match', axis=1), df['match'])

# nice display
from sklearn import set_config
set_config(display='diagram')

transform.fit(Xtrain)

In [9]:
# print transformed
transform.transform(Xtrain)[0]

array([ 3.83069323e-01, -4.09555349e-02,  4.60371382e-01, -1.01610546e-01,
       -4.67942952e-01, -8.14278351e-01,  1.13954796e+00, -9.52152456e-01,
       -5.90570814e-01, -3.28621070e-01,  7.13392969e-01,  4.31624700e-01,
       -9.25743326e-01,  1.32292344e+00, -8.97041658e-02, -9.63773506e-02,
       -8.99751930e-01,  3.15199302e-01, -4.44747719e-01,  2.69353649e-01,
       -1.86307252e-01,  9.08305231e-02, -3.51088173e-02, -5.04181487e-02,
       -1.08293736e-01,  5.13567426e-01,  6.59632890e-01,  1.20868451e+00,
        1.88703445e-01,  1.49219611e+00,  1.36827911e+00, -9.36488286e-02,
        1.64116258e+00,  1.71286473e+00,  3.13605444e-01,  1.32428502e-01,
       -7.28894651e-01,  1.38552374e+00,  1.53016375e-01,  7.30165084e-01,
        1.29549864e-01, -5.19861557e-03, -7.70938149e-01,  1.66082905e+00,
        1.21212293e+00,  1.30175375e+00, -3.45777488e-01, -5.18578011e-01,
        9.59809529e-02,  6.41258819e-01, -8.49871085e-01,  1.20805371e+00,
       -1.77394836e+00, -

## 3.1 Vectorization and Standardization
Let us inspect the _sklearn.preprocessing.StandardScaler_ used above and compare it to other useful standardizers:
 1. _sklearn.preprocessing.MaxAbsScaler_
 2. _sklearn.preprocessing.MinMaxScaler_
 3. _sklearn.preprocessing.RobustScaler_
 

### _sklearn.preprocessing.StandardScaler_ 
- Computes $\frac{x - \mu}{\sigma}$ 
 
It is almost always beneficial, if not crucial, to standardize data to have zero mean and unit variance, if that data would otherwise be on different scales. 

In particular, algorithms that involve computation of hyperplanes (SVMs, Perceptron, MLPs, ...) are very susceptible to data on different scales, as their geometry relies on data to be on the same scale. 


**In short: you almost always want to use the _StandardScaler_ on your `numerical data`**

Exceptions are:
 - sparse data (Use MaxAbsScaler)
 - data with many/large outliers (Use RobustScaler)

> Remark: If you are scaling your training data in the range $[0,1]$, consider scaling it to $[-1,1]$ instead, so that it is distributed around the origin.
(c.f. http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-16.html)
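
The remark about the target range can be tried with `MinMaxScaler`, whose `feature_range` parameter controls the output interval. The single-column data below is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [2.0], [4.0], [9.0]])   # toy feature column

unit = MinMaxScaler().fit_transform(X)                           # default range [0, 1]
centered = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)  # range [-1, 1]

print(unit.ravel())
print(centered.ravel())
```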

### _sklearn.preprocessing.MaxAbsScaler_ 
Use this one if you have _sparse_ data!

In [10]:
from sklearn.preprocessing import MaxAbsScaler
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
# divide each column by its maximum absolute value; zeros stay zeros, so sparsity is preserved
transformer = MaxAbsScaler().fit(X)
transformer.transform(X)

array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])

### _sklearn.preprocessing.RobustScaler_ 
> "This Scaler removes the median and scales the data according to the quantile range" (c.f. sklearn docs)
https://en.wikipedia.org/wiki/Interquartile_range#/media/File:Boxplot_vs_PDF.svg

- Use this one if you have many/large outliers in your data!

In [11]:
# RobustScaler Example
from sklearn.preprocessing import RobustScaler
X = [[ 1., -2.,  2.],
     [ -2.,  1.,  3.],
     [ 40.,  1., -2.]]
# subtract the column median and divide by the interquartile range
transformer = RobustScaler().fit(X)
transformer.transform(X)

array([[ 0.        , -2.        ,  0.        ],
       [-0.14285714,  0.        ,  0.4       ],
       [ 1.85714286,  0.        , -1.6       ]])
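
The practical difference shows up most clearly on a single feature with one extreme value. The following sketch (the data is made up for illustration) compares how much spread the five "normal" values keep under each scaler.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# one feature; the last value is an outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0]).reshape(-1, 1)

standard = StandardScaler().fit_transform(x).ravel()
robust = RobustScaler().fit_transform(x).ravel()

# range covered by the five inliers after scaling
standard_spread = standard[:5].max() - standard[:5].min()
robust_spread = robust[:5].max() - robust[:5].min()
print(f"inlier spread: StandardScaler={standard_spread:.3f}, RobustScaler={robust_spread:.3f}")
```

With `StandardScaler`, the outlier inflates the standard deviation and squeezes the inliers into a narrow band. The median and interquartile range used by `RobustScaler` are much less affected.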

## 3.2 Feature selection

We can perform feature selection:
- Using trained models
    - Select features based on a LinearSVC estimator (c.f. L1-Regularization for feature selection) using coef_ attribute
    - Select features based on a RandomForest (using feature_importances_ attribute)

- Using Variance or score function
    - sklearn.feature_selection.VarianceThreshold
    - sklearn.feature_selection.SelectKBest / sklearn.feature_selection.SelectPercentile
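
The model-free selectors can be sketched as follows. The synthetic data, the appended constant column, and the choice of `f_classif` with `k=2` are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

X, y = make_classification(n_samples=200, n_features=6, n_informative=2,
                           n_redundant=0, random_state=0)
X = np.hstack([X, np.ones((200, 1))])    # append a constant (zero-variance) column

# 1. drop features whose variance does not exceed the threshold
vt = VarianceThreshold(threshold=0.0)
X_var = vt.fit_transform(X)

# 2. keep the k features with the highest univariate score (here: ANOVA F-value)
skb = SelectKBest(score_func=f_classif, k=2)
X_best = skb.fit_transform(X_var, y)

print(X.shape, "->", X_var.shape, "->", X_best.shape)
print("kept by SelectKBest:", np.flatnonzero(skb.get_support()))
```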

### Using Models
Try out `LinearSVC.coef_` and `RandomForest.feature_importances_` as 'scores'

- [Model-based feature selection](https://woolulu.tistory.com/66)
- [Selecting by variable importance](https://data-newbie.tistory.com/360)
- [Model-based feature selection](https://wikidocs.net/26409)

In [10]:
# fit LinearSVC on all features

In [11]:
# SelectFromModel using LinearSVC
from sklearn.svm import LinearSVC
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(transform.fit_transform(Xtrain), ytrain)

lsvc.score(transform.transform(Xtest), ytest)

0.8539379474940334

In [12]:
# Reduce number of features
from sklearn.feature_selection import SelectFromModel
sfm = SelectFromModel(lsvc, prefit=True)

Xnew = sfm.transform(transform.transform(Xtrain))

In [13]:
# rerun training on reduced feature set
mask = sfm.get_support()
selected_features = np.array(Xtrain.columns)[mask]

# refit transformer on reduced dataset
transform.fit(Xtrain[selected_features])

# retrain classifier
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(Xnew, ytrain)

# evaluate on test set
lsvc.score(transform.transform(Xtest[selected_features]), ytest)
print(selected_features)

['age' 'd_age' 'importance_same_religion' 'pref_o_intelligence'
 'pref_o_funny' 'pref_o_ambitious' 'attractive_o' 'funny_o' 'ambitous_o'
 'shared_interests_o' 'intellicence_important' 'funny_important'
 'ambtition_important' 'funny' 'attractive_partner' 'funny_partner'
 'ambition_partner' 'shared_interests_partner' 'dining' 'art' 'gaming'
 'clubbing' 'tv' 'shopping' 'expected_num_interested_in_me'
 'expected_num_matches' 'like' 'guess_prob_liked' 'met' 'd_d_age' 'race'
 'race_o' 'd_importance_same_race' 'field' 'd_pref_o_attractive'
 'd_pref_o_sincere' 'd_attractive_important' 'd_sincere' 'd_intelligence'
 'd_funny' 'd_sports' 'd_exercise' 'd_movies' 'd_concerts' 'd_yoga'
 'd_expected_happy_with_sd_people']


In [14]:
# SelectFromModel using RandomForestClassifier: Homework

## 3.3 Feature extraction/creation
Creating new features!

- `sklearn.preprocessing.PolynomialFeatures`
- `sklearn.preprocessing.FunctionTransformer`

Create **new features** and possibly concatenate them

### sklearn.preprocessing.PolynomialFeatures

In [12]:
# Polynomials
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[0, 1],
              [2, 3],
              [4 ,5]])
poly = PolynomialFeatures(2)
poly.fit_transform(X)
# only interactions
poly = PolynomialFeatures(interaction_only=True)
poly.fit_transform(X)

array([[ 1.,  0.,  1.,  0.],
       [ 1.,  2.,  3.,  6.],
       [ 1.,  4.,  5., 20.]])

### sklearn.preprocessing.FunctionTransformer

In [22]:
# FunctionTransformer text example
from sklearn.preprocessing import FunctionTransformer
text = "This is some arbitPrary text. We use it to demonstrate scikit learn's FunctionTransformer API. \
        The number of sentences in the text is three."

def text_stats(document):
    return [{'length': len(document), 'num_sentences': document.count('.')}]

text_stats_transformer = FunctionTransformer(text_stats)
print(text_stats_transformer.fit_transform(text))

[{'length': 148, 'num_sentences': 3}]


In [14]:
#  FunctionTransformer log transform example
large_numbers = np.array([1e10, 1e20,1e30])

def apply_log(large_numbers):
    return np.log(large_numbers)

log_transform = FunctionTransformer(apply_log)
log_transform.fit_transform(large_numbers)

array([23.02585093, 46.05170186, 69.07755279])

## 4. Model Selection
1. Cross Validation
2. GridSearch & RandomSearch
3. score metrics (Classification/Regression)

### 1. Cross Validation
**Must have** to estimate generalization error.

Motivation: What can go wrong with the following procedure?
1. Train until convergence on the training data
2. Evaluate model performance on test set
3. Tweak estimator parameters and repeat until the estimator performs 'optimally' on the test set

Answer: next slide 

- Problem: Overfitting on the **test set**, i.e. information _leakage_

- Solution: Validation set

- Next Problem: reduction of data set size due to three splits

- Solution: **Cross-validation**!!! 

- In short: CV is used when the dataset is too small.

**How**

1. fit using $k-1$ folds as training data
2. remaining fold is validation set
3. repeat as many times as you have splits

<img src="img/grid_search_cross_validation.png" alt="Drawing" style="width: 400px;"/>

If $n_{splits}$ is the number of total splits (training episodes), then performance of the model is measured using $\frac{1}{n_{splits}} \sum_{i=1}^{n_{splits}} score(split_i)$
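
The procedure can be written out by hand in a few lines. The sketch below uses synthetic data and a `LogisticRegression` as stand-ins (both are assumptions) and compares the manual loop against `cross_val_score`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)
cv = KFold(n_splits=5)

fold_scores = []
for train_idx, val_idx in cv.split(X):
    # fit a fresh copy on k-1 folds, score on the remaining fold
    fitted = clone(model).fit(X[train_idx], y[train_idx])
    fold_scores.append(fitted.score(X[val_idx], y[val_idx]))

cv_score = np.mean(fold_scores)          # average over all splits
reference = cross_val_score(model, X, y, cv=cv)
print(cv_score, reference.mean())
```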

### Folding example: Stratified folds
> Note: Do not underestimate nuances of different Cross-validation techniques!

<img src="img/stratified.png" alt="Drawing" style="width: 512px;"/>

[Another useful example is GroupKFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupKFold.html#sklearn.model_selection.GroupKFold)
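
What the two splitters guarantee can be checked on a toy label vector. The labels and group ids below are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold

X = np.zeros((12, 1))                    # the features do not matter for the split itself
y = np.array([0] * 9 + [1] * 3)          # imbalanced labels: 75% / 25%
groups = np.repeat(np.arange(4), 3)      # hypothetical group ids, 3 samples per group

# StratifiedKFold: every validation fold keeps the class ratio of the full data
strat_counts = [np.bincount(y[test], minlength=2).tolist()
                for _, test in StratifiedKFold(n_splits=3).split(X, y)]

# GroupKFold: no group appears in both the training and the validation part
group_overlap = [set(groups[train]) & set(groups[test])
                 for train, test in GroupKFold(n_splits=2).split(X, y, groups)]

print(strat_counts)
print(group_overlap)
```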

### Cross validation using sklearn
>**Cross-validation = Folding (or splitting) + Scoring**

**Folding** is done by [cross-validation-generator](https://scikit-learn.org/stable/glossary.html#term-cross-validation-generator), e.g.
KFold, StratifiedKFold, GroupKFold,...
 - `get_n_splits()` returns the number of splits
 - `split()` yields one pair of index arrays (`train_indices`, `test_indices`) per split

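```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(8).reshape(4, 2)  # toy data with 4 samples
kf = KFold(n_splits=2, random_state=None, shuffle=False)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
```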

**Scoring** is done by [metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values) 

**Combine** them using either 
- `sklearn.model_selection.cross_val_score` or 
- `sklearn.model_selection.cross_validate`. 

The latter returns, among other information, fit and score times along with the individual split scores, and it can use _multiple_ metrics. 

The former can only use a single metric and _only_ returns the array of per-split scores. 
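
A minimal sketch of `cross_validate` with two metrics, using synthetic data and a `LogisticRegression` as assumed stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

results = cross_validate(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring=["accuracy", "f1"])

# one entry per metric (prefixed with "test_"), each holding one score per split
for key in sorted(results):
    print(key, results[key])
```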

In [23]:
# Example of Cross validation
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.svm import SVC

# Folding
cv = KFold(n_splits=5)

# classifier
clf = SVC().fit(transform.fit_transform(Xtrain), ytrain)

# scoring
scoring = make_scorer(accuracy_score)

cross_val_score(clf,
                transform.transform(Xtrain),
                ytrain,
                cv=cv,
                scoring=scoring)

array([0.8265712 , 0.84725537, 0.82895784, 0.84156051, 0.82563694])
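
One design consideration: when preprocessing is fitted on all training data before cross-validation, each validation fold has already influenced the scaler. A common way to avoid this is to put the preprocessing and the estimator into one pipeline, so that both are refitted on the training part of every split. The sketch below shows the pattern on synthetic data (an assumption, not the speed-dating data).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# scaler and classifier are cloned and refitted inside every split
model = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(model, X, y, cv=KFold(n_splits=5), scoring="accuracy")
print(scores, scores.mean())
```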

### 2. GridSearch & RandomSearch
Hyperparameter Optimization

In [16]:
# define parametergrids
param_grid = [
  {'criterion': ['gini', 'entropy'], 'min_samples_split': [2, 6]},
]

In [24]:
# demo GridSearchCV (alternative is e.g. RandomizedSearchCV)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
selected_features = ['attractive_o', 'funny_o', 'shared_interests_o', 'attractive_partner']
clf = GridSearchCV(RandomForestClassifier(),
                   param_grid,
                   cv=None,
                   scoring=None)
#transform.fit(Xtrain[selected_features])

#clf.fit(transform.transform(Xtrain[selected_features]), ytrain)
#print(clf.best_params_)
#clf.cv_results_
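
A runnable sketch of both search strategies on synthetic data follows. The data, the small forest size, and `n_iter=3` are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
param_grid = {"criterion": ["gini", "entropy"], "min_samples_split": [2, 6]}
forest = RandomForestClassifier(n_estimators=20, random_state=0)

# grid search: evaluates every combination (2 x 2 = 4 candidates)
grid = GridSearchCV(forest, param_grid, cv=3).fit(X, y)

# randomized search: evaluates only n_iter sampled candidates
rand = RandomizedSearchCV(forest, param_grid, n_iter=3, cv=3, random_state=0).fit(X, y)

print("grid:  ", grid.best_params_, round(grid.best_score_, 3))
print("random:", rand.best_params_, round(rand.best_score_, 3))
```

Randomized search trades completeness for cost, which tends to matter once the grid has many parameters or many values per parameter.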

### 3. Score Metrics - Classification
<img src="img/precision.png" alt="Drawing" style="width: 512px;"/>

In [19]:
# compare precision, recall, harmonic mean of them (F1)
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
y_pred = [1,1,1,1]
y_true = [1,1,1,0]
metrics = [accuracy_score, f1_score, precision_score, recall_score]

print([metric(y_true, y_pred) for metric in metrics])
# use accuracy whenever you care about true positives and true negatives
# use f1_score whenever you care about false positives and false negatives
# Conclusion: Consider F1 for unbalanced datasets!

[0.75, 0.8571428571428571, 0.75, 1.0]
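
The effect of class imbalance can be made more extreme with a made-up example: a classifier that always predicts the majority class looks good on accuracy but has no true positives at all.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

y_true = [0] * 95 + [1] * 5      # 5% positives
y_pred = [0] * 100               # always predict the majority class

acc = accuracy_score(y_true, y_pred)
# zero_division=0 defines precision (0/0 here) as 0 instead of raising a warning
f1 = f1_score(y_true, y_pred, zero_division=0)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"accuracy={acc}, f1={f1}")
print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
```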


### 3. Score Metrics - Regression

In [20]:
# create two label sets, one scaled version of other
y_true_1 = [0., .2, .5, .8, 1.]
y_true_2 = [y * 100 for y in y_true_1]

y_pred_1 = [0.1, .21, .51, .81, 1.1]
y_pred_2 = [y * 100 for y in y_pred_1]

In [21]:
# demo MSE is not scale invariant
from sklearn.metrics import r2_score, mean_squared_error
print(mean_squared_error(y_true_1, y_pred_1))
print(mean_squared_error(y_true_2, y_pred_2))

print(r2_score(y_true_1, y_pred_1))
print(r2_score(y_true_2, y_pred_2))

0.004060000000000004
40.60000000000006
0.9701470588235294
0.9701470588235294
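
Why $R^2$ is unaffected by rescaling becomes visible when it is computed by hand as $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$: both sums of squares pick up the same factor, which cancels in the ratio. The helper `r2_manual` below is a hypothetical name used for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed directly from the definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = np.array([0.0, 0.2, 0.5, 0.8, 1.0])
y_pred = np.array([0.1, 0.21, 0.51, 0.81, 1.1])

print(r2_manual(y_true, y_pred), r2_score(y_true, y_pred))
print(r2_manual(100 * y_true, 100 * y_pred))   # scaling by 100 leaves R^2 unchanged
```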


# The End - happy to see you in the exercise session!