# Module 03

## Session 06 Model Performance, Evaluation Method and Hyperparameter Tuning

# Algorithm Chains 1

Data: generate synthetic data
* y ~ Normal(mean = 0, std = 1), n = 100 samples
* X ~ Normal(mean = 0, std = 1), n = 100 samples, 10,000 features
* y and X are uncorrelated (there is no true relationship between them)

Information leakage
* model X and y using ridge regression + percentile feature selection (F-statistic)
* do feature selection first: apply fit and transform to the full X
* compute R-squared using cross-validation on the already selected features

No information leakage
* model X and y using ridge regression + percentile feature selection (F-statistic)
* make a pipeline: feature selection + regression
* compute R-squared using cross-validation on the pipeline, so that selection is refitted on each training fold


# Data

In [1]:
import pandas as pd
import numpy as np

In [4]:
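# Seeded generator so the experiment is reproducible
rnd = np.random.RandomState(seed=2020)
# 100 samples, 10,000 pure-noise features; y is drawn independently of X
X = rnd.normal(size=(100, 10000))
y = rnd.normal(size=(100,))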
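The following check is not part of the original notebook. It is a minimal sketch of why this data set is a trap for feature selection: although no feature is truly related to y, with 10,000 noise features and only 100 samples some of them show a noticeable sample correlation with y purely by chance. The names `corr`, `mean_abs_corr`, and `max_abs_corr` are introduced here for illustration.

```python
import numpy as np

# Same generator settings and draw order as the data cell above
rnd = np.random.RandomState(seed=2020)
X = rnd.normal(size=(100, 10000))
y = rnd.normal(size=(100,))

# Pearson correlation of every column of X with y
Xc = X - X.mean(axis=0)
yc = y - y.mean()
corr = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

# Typical and most extreme chance correlation (illustrative names)
mean_abs_corr = np.abs(corr).mean()
max_abs_corr = np.abs(corr).max()
print(mean_abs_corr, max_abs_corr)
```

A selector that keeps the top 5% of features by F-statistic picks exactly these chance correlations, which is what makes the order of selection and cross-validation matter in the next two sections.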

# Information Leakage

In [5]:
from sklearn.feature_selection import SelectPercentile, f_regression

In [6]:
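# Keep the 5% of features with the highest F-statistic against y
select = SelectPercentile(score_func=f_regression, percentile=5)
# Leak: the selector is fitted on all 100 samples, including rows later used for validation
X_selected = select.fit_transform(X, y)
X_selected.shape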

(100, 500)

In [7]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge

In [8]:
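# The folds are created only after selection, so every validation fold
# has already influenced which features were kept
cv_score = cross_val_score(
    Ridge(),
    X_selected,
    y,
    cv=5
)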

In [9]:
cv_score

array([0.92165345, 0.89445668, 0.92827414, 0.93088559, 0.91044624])
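These scores look excellent even though y has no relationship to X. One way to see that they are an artifact of the leak is to score the same model on samples that took no part in the selection. The sketch below is an addition to the notebook: it assumes a hypothetical fresh draw (`X_new`, `y_new`) from the same noise distribution, and the names `leaky_cv_mean` and `fresh_r2` are illustrative.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Same data as in the notebook
rnd = np.random.RandomState(seed=2020)
X = rnd.normal(size=(100, 10000))
y = rnd.normal(size=(100,))

# Leaky procedure: select features on all rows, then cross-validate
select = SelectPercentile(score_func=f_regression, percentile=5)
X_selected = select.fit_transform(X, y)
leaky_cv_mean = cross_val_score(Ridge(), X_selected, y, cv=5).mean()

# Hypothetical fresh sample from the same distribution (assumption for this sketch)
X_new = rnd.normal(size=(500, 10000))
y_new = rnd.normal(size=(500,))

# Fit on the selected features, then score on data the selector never saw
model = Ridge().fit(X_selected, y)
fresh_r2 = model.score(select.transform(X_new), y_new)

print(leaky_cv_mean, fresh_r2)
```

The cross-validated R-squared stays high while the R-squared on the fresh sample should collapse toward zero or below, which indicates that the cross-validation estimate was inflated by selecting features on the validation rows.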

# No Information Leakage

In [10]:
from sklearn.pipeline import Pipeline

In [11]:
# Inside a pipeline the selector is refitted on the training part of each fold only
select = SelectPercentile(score_func=f_regression, percentile=5)

pipe_model = Pipeline(
    [
        ('select', select),
        ('model', Ridge())
    ]
)

In [12]:
cv_score = cross_val_score(
    pipe_model,
    X,
    y,
    cv=5
)

In [13]:
cv_score

array([-0.0330359 , -0.07840338, -0.06731106, -0.04544685, -0.07314928])
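With selection inside the pipeline, the R-squared drops to slightly below zero, which is what a model should score on pure noise. To make explicit what `cross_val_score` does with the pipeline, the sketch below (an addition, not from the original notebook) spells out the per-fold steps by hand. It assumes that `cv=5` with a regressor corresponds to an unshuffled `KFold(n_splits=5)`, which is scikit-learn's documented default; the names `pipe_scores` and `manual_scores` are illustrative.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Same data as in the notebook
rnd = np.random.RandomState(seed=2020)
X = rnd.normal(size=(100, 10000))
y = rnd.normal(size=(100,))

# Pipeline version, as in the cells above
pipe_model = Pipeline([
    ('select', SelectPercentile(score_func=f_regression, percentile=5)),
    ('model', Ridge()),
])
pipe_scores = cross_val_score(pipe_model, X, y, cv=5)

# Manual version: selection and regression are both fitted on the training fold only
manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    fold_select = SelectPercentile(score_func=f_regression, percentile=5)
    X_train = fold_select.fit_transform(X[train_idx], y[train_idx])
    X_test = fold_select.transform(X[test_idx])  # transform only, no refit
    fold_model = Ridge().fit(X_train, y[train_idx])
    manual_scores.append(fold_model.score(X_test, y[test_idx]))
manual_scores = np.array(manual_scores)

print(pipe_scores)
print(manual_scores)
```

The two arrays should agree, which shows that the pipeline is simply a convenient way to guarantee that no step ever sees the validation fold during fitting.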