# Supervised Learning with scikit-learn (cont.)

### CROSS VALIDATION
Cross-validation (CV) is an iterative procedure in which every data point is used for both training and testing: each point is held out for testing in one iteration and used for training in the others.

##### Why Cross Validation?
Say you have fitted a linear regression model and want to evaluate its $R^2$ score. That score depends heavily on how the data happened to be divided by the train-test split. Because the split is random, a single $R^2$ value may not be representative of the model's ability to generalize to unseen data. Cross-validation addresses this by splitting the data into training and testing sets multiple times and scoring the model on each split.


Cross-validation works by splitting the data into **folds**. In each iteration, known as a **split**, one fold is designated as test data and the remaining folds as training data. The test fold changes from split to split, so the model is trained and evaluated on a different partition of the data each time. For each split, the metric of interest (such as $R^2$) is calculated, so $n$ splits yield $n$ metric values. From these values we can compute statistics (e.g. mean, median, confidence intervals).
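To make the fold/split mechanics concrete, here is a minimal sketch on a toy array. The array `X_toy` and its size are illustrative assumptions, not part of the dataset used in this notebook; shuffling is left off so the fold boundaries are easy to read.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 6 samples, 1 feature (illustrative values only)
X_toy = np.arange(6).reshape(-1, 1)

# 3 folds, no shuffling, so each fold is a contiguous block of rows
kf = KFold(n_splits=3)

splits = []
for split_no, (train_idx, test_idx) in enumerate(kf.split(X_toy), start=1):
    # Each split holds out one fold for testing and trains on the rest
    splits.append((train_idx.tolist(), test_idx.tolist()))
    print(f"split {split_no}: train={train_idx.tolist()} test={test_idx.tolist()}")

# split 1: train=[2, 3, 4, 5] test=[0, 1]
# split 2: train=[0, 1, 4, 5] test=[2, 3]
# split 3: train=[0, 1, 2, 3] test=[4, 5]
```

Every row index appears in a test fold exactly once and in a training set in the other two splits.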

CV with $k$ folds is called $k$-fold cross-validation. Note that **as $k$ increases**, CV becomes more **computationally expensive**, because the model is fitted once per fold and each fit uses a larger share of the data.
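A rough way to see where the cost comes from is to count the fits and the training rows per fit for a few values of $k$. This is only a sketch: the dataset size `n_samples` is a hypothetical value chosen so the folds divide evenly, and actual run time also depends on the model being fitted.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 120  # hypothetical dataset size, chosen for even fold sizes
X_dummy = np.zeros((n_samples, 1))

fit_summary = {}
for k in (2, 5, 10):
    kf = KFold(n_splits=k)
    # One model fit per split; record how many rows each fit trains on
    train_sizes = [len(train_idx) for train_idx, _ in kf.split(X_dummy)]
    fit_summary[k] = (len(train_sizes), train_sizes[0])
    print(f"k={k}: {len(train_sizes)} fits, {train_sizes[0]} training rows per fit")

# k=2: 2 fits, 60 training rows per fit
# k=5: 5 fits, 96 training rows per fit
# k=10: 10 fits, 108 training rows per fit
```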


In [1]:
import pandas as pd 
import numpy as np
from sklearn.linear_model import LinearRegression

# The KFold class and the cross_val_score function are required for CV
from sklearn.model_selection import cross_val_score, KFold

In [5]:
# Using the same women's health dataset
womens_health = pd.read_csv('../Data/diabetes_clean.csv')
womens_health.head()

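| | pregnancies | glucose | diastolic | triceps | insulin | bmi | dpf | age | diabetes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |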


In [13]:
# Setting up multiple linear regression 
predictors_X = womens_health.drop('glucose', axis=1).values
target_y = womens_health['glucose'].values

# Create the CV generator with 6 folds (the default n_splits is 5),
# shuffling the rows with a fixed seed so the splits are reproducible
folds = KFold(n_splits=6, shuffle=True, random_state=12)

glc_reg = LinearRegression()

# Get the results of the cross-validation
# Pass the model, predictors, and the CV generator
cv_result = cross_val_score(glc_reg, predictors_X, target_y, cv=folds)

# Note that the scores reported are R^2 values: with no `scoring` argument,
# cross_val_score uses the estimator's own score method, which is R^2 for LinearRegression

In [None]:
# Printing the R^2 scores from cross validation
print(cv_result)

[0.373619   0.27942465 0.38496821 0.33442763 0.24902435 0.29531594]
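Each of the six values above comes from one fit-and-score pass over a split. As a rough illustration of what `cross_val_score` does with its default settings, the loop below reproduces it by hand. It runs on synthetic stand-in data (`X_demo`, `y_demo` and the coefficients are assumptions made for this sketch, not the diabetes dataset), so its scores will differ from the ones printed above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumed for illustration; not the diabetes dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=120)

# Same CV settings as in the cell above
folds_demo = KFold(n_splits=6, shuffle=True, random_state=12)

manual_scores = []
for train_idx, test_idx in folds_demo.split(X_demo):
    # Fit a fresh model on the training folds only
    model = LinearRegression()
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # Score (R^2) on the held-out fold
    manual_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))
manual_scores = np.array(manual_scores)

# The helper function should give the same six scores
auto_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=folds_demo)
print(np.allclose(manual_scores, auto_scores))
```

Because `random_state` is an integer, `folds_demo` yields the same splits each time it is used, which is why the manual and automatic scores can be compared directly.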


We can use the scores we've obtained to calculate statistics such as the mean score, the standard deviation, and a 95% interval (computed here as the 2.5% and 97.5% quantiles of the scores).

In [16]:
print(f'CV Average R^2: {np.mean(cv_result)}\nCV R^2 stddev.: {np.std(cv_result)}')
print(f'Score confidence interval: {np.quantile(cv_result, [0.025, 0.975])}')

CV Average R^2: 0.3194632957255576
CV R^2 stddev.: 0.04932122304665104
Score confidence interval: [0.25282439 0.38354956]
