## Cross-Validation
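A single train/test split doesn't guarantee an unbiased estimate of performance, because the score depends on which rows
happen to land in the test set. Cross-validation repeats the split several times and averages the scores, which shows how
the model reacts to different subsets of the data. It's useful in the following scenarios: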
- trying out different algorithms and comparing their results -> finding the best model
- finding the best hyperparameters for algorithms that require tuning (e.g., Lasso, Ridge)
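
Both scenarios come down to scoring each candidate on the same folds and comparing the averages. A minimal sketch of that idea follows, assuming synthetic data from `make_regression` and arbitrary `alpha` values; neither comes from this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# synthetic regression data (an assumption, used so the sketch runs anywhere)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=101)

# use the same folds for every candidate so the comparison is fair
cv = KFold(n_splits=5, shuffle=True, random_state=101)

# hypothetical candidates: three algorithms with arbitrary alpha values
candidates = {
    'linear': LinearRegression(),
    'ridge(alpha=1.0)': Ridge(alpha=1.0),
    'lasso(alpha=1.0)': Lasso(alpha=1.0),
}

results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='neg_mean_squared_error')
    results[name] = -scores.mean()  # flip the sign to get a plain MSE
    print(f'{name}: mean MSE = {results[name]:.2f}')

best_name = min(results, key=results.get)
print('lowest mean MSE:', best_name)
```

Reusing one `KFold` object with a fixed `random_state` means every candidate is scored on identical splits, so differences in the mean come from the models rather than from the folds.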

<b>When to use it?</b> To evaluate models and parameter settings! Cross-validation only estimates how accurately the model
predicts unseen data, and <b>doesn't improve the model!</b>

In [7]:
# load the modules
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

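# NOTE: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this import needs scikit-learn < 1.2
from sklearn.datasets import load_boston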
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.metrics import make_scorer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

%matplotlib inline
sns.set()

In [4]:
# first, load the data
df = load_boston()
boston = pd.DataFrame(df.data, columns=df.feature_names)
boston['target'] = df.target

# instantiate x, y values
X = boston.iloc[:, :-1].values
y = boston.iloc[:, -1].values
labels = df.feature_names

In [4]:
# without cross-validation first
# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)
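
To see why a single split is not enough, it can help to repeat the split with different seeds and watch the test error move. The sketch below is only illustrative: it assumes synthetic data from `make_regression` rather than the Boston data, and the seed range is arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# synthetic regression data (an assumption, not the notebook's dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=101)

split_mse = []
for seed in range(5):
    # same model, same data, only the random split changes
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    split_mse.append(mean_squared_error(y_te, model.predict(X_te)))

# each seed gives a different estimate of the test error
print([round(m, 2) for m in split_mse])
```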


## Cross-Validation w/ KFold & StratifiedKFold
<b>IMPORTANT!</b> StratifiedKFold works only with categorical (class-label) targets, so it can't be applied directly to a continuous regression target!

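- both are cross-validation iterators that yield train/test indices
- KFold splits the observations into consecutive folds, and draws them at random only when shuffle=True
- StratifiedKFold takes the distribution of the target variable into account, so that each training and
test fold keeps roughly the same class proportions as the original set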
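
The difference between the two iterators is easiest to see on an imbalanced target. The following sketch assumes a made-up label vector (90 zeros, 10 ones) and counts how many minority samples land in each test fold; it also checks that a continuous target is rejected.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# imbalanced toy target: 90 samples of class 0 followed by 10 of class 1 (made-up data)
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((100, 1))  # the feature values are irrelevant for the split itself

kfold = KFold(n_splits=5, shuffle=True, random_state=101)
stratified = StratifiedKFold(n_splits=5, shuffle=True, random_state=101)

# number of class-1 samples that land in each test fold
kfold_counts = [int(y_demo[test].sum()) for _, test in kfold.split(X_demo)]
stratified_counts = [int(y_demo[test].sum()) for _, test in stratified.split(X_demo, y_demo)]

print('KFold          :', kfold_counts)       # counts vary with the shuffle
print('StratifiedKFold:', stratified_counts)  # 2 per fold, matching the 10% class share

# a continuous target is rejected by StratifiedKFold
y_continuous = np.linspace(0.0, 1.0, 100)
try:
    list(stratified.split(X_demo, y_continuous))
    rejected = False
except ValueError:
    rejected = True
print('continuous target rejected:', rejected)
```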

Always provide
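- the observations and the target (X and y) for StratifiedKFold, which stratifies on y
- number of folds via n_splits (a common choice is 10; decrease if many observations, increase otherwise)
- shuffle=True (RECOMMENDED), together with random_state for reproducible folds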

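<b>IMPORTANT</b>: don't rerun cross-validation on the same data over and over while tweaking the model; use it to get a
sense of how the parameters perform. Otherwise, the scores themselves start to guide your choices, which leads to
data snooping/leakage.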
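
One common way to limit snooping is to set a hold-out set aside first, cross-validate only on the remaining data, and score the hold-out once at the end. The sketch below assumes synthetic data, a `Ridge` model, and a hypothetical list of `alpha` candidates; none of these come from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# synthetic regression data (an assumption, used so the sketch runs anywhere)
X_demo, y_demo = make_regression(n_samples=300, n_features=20, noise=20.0, random_state=101)

# set the hold-out aside before any tuning happens
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=101
)

cv = KFold(n_splits=10, shuffle=True, random_state=101)
alphas = [0.01, 0.1, 1.0, 10.0]  # hypothetical candidate values

# cross-validate each candidate on the development data only
cv_mse = {}
for alpha in alphas:
    scores = cross_val_score(Ridge(alpha=alpha), X_dev, y_dev, cv=cv,
                             scoring='neg_mean_squared_error')
    cv_mse[alpha] = -scores.mean()

best_alpha = min(cv_mse, key=cv_mse.get)

# refit on all development data and touch the hold-out exactly once
final_model = Ridge(alpha=best_alpha).fit(X_dev, y_dev)
holdout_mse = mean_squared_error(y_holdout, final_model.predict(X_holdout))
print(f'best alpha = {best_alpha}, CV MSE = {cv_mse[best_alpha]:.2f}, hold-out MSE = {holdout_mse:.2f}')
```

Because the hold-out rows never take part in choosing `best_alpha`, the final score is not influenced by the tuning loop.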



In [11]:
# KFold
lm = LinearRegression()
cv_iterator = KFold(n_splits=10, shuffle=True, random_state=101)
stratified_cv_iterator = StratifiedKFold(n_splits=10, shuffle=True, random_state=101)

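# polynomial expansions of the features (degree 2: all terms; degree 3: interaction terms only)
second_order = PolynomialFeatures(degree=2, interaction_only=False)
third_order = PolynomialFeatures(degree=3, interaction_only=True)

over_param_X = second_order.fit_transform(X)
extra_over_param_X = third_order.fit_transform(X)
# scikit-learn reports MSE as a negative number so that a higher score is always better
cv_score = cross_val_score(lm, over_param_X, y, cv=cv_iterator, scoring='neg_mean_squared_error', n_jobs=1)
print(cv_score)
# flip the sign back to get the MSE of each fold
print(f'CV score: mean = {np.mean(np.abs(cv_score))}, std = {np.std(np.abs(cv_score))}')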

[-11.67358627 -22.84201607  -8.76179083 -16.13545457 -12.90825425
  -7.77900085 -12.97982147 -22.18260731 -35.93064666 -13.75241178]
CV score: mean = 16.494559006096857, std = 8.01415830187463
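
The scores are negative because `neg_mean_squared_error` is the negated MSE. Taking the square root of the MSE gives an RMSE in the same units as the target, which is often easier to read. A small sketch, assuming the fold scores printed above are copied in by hand:

```python
import numpy as np

# fold scores copied from the output above
cv_score = np.array([
    -11.67358627, -22.84201607, -8.76179083, -16.13545457, -12.90825425,
    -7.77900085, -12.97982147, -22.18260731, -35.93064666, -13.75241178,
])

mse_per_fold = -cv_score               # undo the sign flip
rmse_per_fold = np.sqrt(mse_per_fold)  # back to the target's units

print(f'MSE : mean = {mse_per_fold.mean():.3f}, std = {mse_per_fold.std():.3f}')
print(f'RMSE: mean = {rmse_per_fold.mean():.3f}, std = {rmse_per_fold.std():.3f}')
```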
