### KFold Cross Validation -- The Verbose Way

In class we discussed cross validation, with the idea that it gives us a more thorough way of testing the validity of our model.  

To illustrate how it works, we used the function `cross_val_score()` from the `model_selection` module, which returns our individual validation scores from each round.  

The `cross_val_score()` function is useful and concise, but has the drawback that it doesn't allow you to see the index positions for each of your folds, which can be problematic if you want to examine the data in a specific validation set.  

This notebook discusses an alternative way to do KFold cross validation, using the `KFold` class, which is a more deliberate way of applying the technique.

### Boston Housing Example

To demonstrate how the class works, we'll continue to use the Boston housing dataset.

In [162]:
# import our modules
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.linear_model import LinearRegression

# import the dataset
df = pd.read_csv('../data/housing.csv')

# declare X & y
X = df.iloc[:, :-1]
y = df['PRICE']

# initialize model
lreg = LinearRegression()

In class we split the data into randomized training and test sets, then ran 10-fold cross validation on the training set to gauge how sensitive our model is to potential outliers and to variation in out-of-sample data.  Sample code looked like this:

In [163]:
# define training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2019)

In [164]:
# use cross_val_score to get our validation scores
scores = cross_val_score(estimator=lreg, X=X_train, y=y_train, cv=10)
scores

array([0.74469313, 0.72965599, 0.78482663, 0.65675571, 0.66315517,
       0.78283033, 0.81788242, 0.79596427, 0.5480151 , 0.75188945])

What stands out about these scores is their variation: 0.55 to 0.82 is a fairly wide range of r-squared values for a single dataset.  What also stands out is the validation score in the 9th fold, which is much lower than the others.  
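
As a quick check of the spread, the scores printed above can be summarized with a few NumPy calls. This is only a sketch: the array is copied by hand from the output above, and the variable names `score_range` and `worst_fold` are illustrative.

```python
import numpy as np

# validation scores copied from the cross_val_score output above
scores = np.array([0.74469313, 0.72965599, 0.78482663, 0.65675571, 0.66315517,
                   0.78283033, 0.81788242, 0.79596427, 0.5480151, 0.75188945])

score_range = scores.max() - scores.min()   # gap between the best and worst fold
worst_fold = int(scores.argmin()) + 1       # 1-based fold number of the lowest score

print(f"mean={scores.mean():.3f}  std={scores.std():.3f}  range={score_range:.3f}")
print(f"lowest score is in fold {worst_fold}")
```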

A natural question then is "what's so different about the data in that particular fold?"  

If you limit yourself to the `cross_val_score()` function you won't be able to look into the issue further.

### KFold Class

The `KFold` class lets you examine the data within a particular fold when a result warrants a closer look.

Let's take a look at how it works.

In [165]:
# initialize it
kfold = KFold(n_splits=10)

When initialized, KFold has three parameters that can be set:

 - **n_splits**: integer, the number of folds to use in cross validation
 - **shuffle**: boolean, whether or not to shuffle your indices or use them in sequential order
 - **random_state**: the seed value to use when shuffling indices.  Only used if `shuffle` is set to `True`
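
To see what `shuffle` and `random_state` change, here is a minimal sketch on a made-up six-element array (`toy` and the other variable names are illustrative and not part of the housing example).

```python
import numpy as np
from sklearn.model_selection import KFold

toy = np.arange(6)  # six hypothetical samples

# default (shuffle=False): validation folds are consecutive blocks of positions
ordered = [val.tolist() for _, val in KFold(n_splits=3).split(toy)]

# shuffle=True with a fixed seed: folds are randomized, but reproducible
shuffled_a = [val.tolist()
              for _, val in KFold(n_splits=3, shuffle=True, random_state=0).split(toy)]
shuffled_b = [val.tolist()
              for _, val in KFold(n_splits=3, shuffle=True, random_state=0).split(toy)]

print(ordered)                    # [[0, 1], [2, 3], [4, 5]]
print(shuffled_a == shuffled_b)   # the same seed reproduces the same folds
```

Either way, every sample lands in exactly one validation fold; shuffling only changes which one.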
 
The `KFold` class has only one method of consequence:  `split()`, which returns a generator that yields, for each fold, a pair of arrays: the index positions of the training set, followed by the index positions of the validation set.

In [166]:
# we'll use kfold to split our data into its indices
splits = list(kfold.split(X_train))

Two notes about the above line:

 - the `list()` function is used because a generator object doesn't allow you to look up values at an index position
 - you could also pass `y` into the `split()` method.  `KFold` only uses its input to count the number of samples, so the resulting indices are the same.
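
Both notes can be checked on a small made-up array. In this sketch `X_toy`, `y_toy` and `kf` are hypothetical stand-ins for the housing variables.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(20).reshape(10, 2)  # 10 hypothetical samples, 2 features
y_toy = np.arange(10)

kf = KFold(n_splits=5)

# a generator cannot be indexed by position...
try:
    kf.split(X_toy)[0]
    subscriptable = True
except TypeError:
    subscriptable = False

# ...but a list built from it can
folds = list(kf.split(X_toy))
train_idx, val_idx = folds[0]

# passing y as well yields the same index arrays
folds_with_y = list(kf.split(X_toy, y_toy))
same_indices = all(
    np.array_equal(a_train, b_train) and np.array_equal(a_val, b_val)
    for (a_train, a_val), (b_train, b_val) in zip(folds, folds_with_y)
)

print(subscriptable, val_idx, same_indices)
```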

The `splits` variable allows you then to access individual index values for each fold.

In [167]:
# the list has 10 items -- one tuple per fold, holding the index positions of the training and validation sets
len(splits)

10

In [168]:
# each item in the list is a tuple of two arrays - the index positions for the training set first, then those for the validation set
splits[0]

(array([ 41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,  53,
         54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,  66,
         67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,  79,
         80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,  92,
         93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103, 104, 105,
        106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118,
        119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131,
        132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144,
        145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157,
        158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170,
        171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183,
        184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196,
        197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209,
        210, 211, 212, 213, 214, 215, 216, 217, 218

In [169]:
# these are the indices for the TRAINING set in ROUND 1 of KFold
splits[0][0]

array([ 41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,  53,
        54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,  66,
        67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,  79,
        80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,  92,
        93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103, 104, 105,
       106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118,
       119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131,
       132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144,
       145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157,
       158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170,
       171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183,
       184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196,
       197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209,
       210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 22

In [170]:
# and these are the indices for the validation set in round 1
splits[0][1]

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40])

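Let's make a small observation:  the values in the `splits` variable are ordered integer positions (0, 1, 2, ...), but the index labels on `X_train` and `y_train` are out of order, because `train_test_split()` shuffled the rows.  

Positional lookups with `.iloc` work either way, but we'll reset the index on our training set so it's monotonically increasing, like the values in `splits`.  That way labels and positions coincide, `.loc` and `.iloc` return the same rows, and it's easier to cross-reference a fold against the results from `cross_val_score()`.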
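
The difference between positions and labels is easy to see on a tiny made-up Series. The values and the name `prices` are hypothetical; only the out-of-order labels are meant to resemble a shuffled training set.

```python
import pandas as pd

# made-up values with out-of-order labels, as after a shuffled split
prices = pd.Series([24.0, 21.5, 19.8, 30.1], index=[249, 51, 151, 486])

positions = [0, 1]                       # what KFold.split() hands back

first_two = prices.iloc[positions]       # positional lookup: rows labeled 249 and 51
label_zero_exists = 0 in prices.index    # False: no row carries the label 0

# after resetting, labels and positions coincide
renumbered = prices.reset_index(drop=True)
same_rows = renumbered.loc[positions].tolist() == renumbered.iloc[positions].tolist()

print(first_two.tolist(), label_zero_exists, same_rows)
```

Note that `reset_index(drop=True)` keeps a Series as a Series, whereas a plain `reset_index()` on a Series returns a DataFrame with the old labels in an `index` column.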

In [171]:
# notice how this has an unordered index, because it was shuffled?
X_train.head()

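|     | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 249 | 0.19073 | 22.0 | 5.86 | 0 | 0.431 | 6.718 | 17.5 | 7.8265 | 7 | 330 | 19.1 | 393.74 | 6.56 |
| 51 | 0.04337 | 21.0 | 5.64 | 0 | 0.439 | 6.115 | 63.0 | 6.8147 | 4 | 243 | 16.8 | 393.97 | 9.43 |
| 151 | 1.49632 | 0.0 | 19.58 | 0 | 0.871 | 5.404 | 100.0 | 1.5916 | 5 | 403 | 14.7 | 341.6 | 13.28 |
| 486 | 5.69175 | 0.0 | 18.1 | 0 | 0.583 | 6.114 | 79.8 | 3.5459 | 24 | 666 | 20.2 | 392.68 | 14.98 |
| 235 | 0.33045 | 0.0 | 6.2 | 0 | 0.507 | 6.086 | 61.5 | 3.6519 | 8 | 307 | 17.4 | 376.75 | 10.88 |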


In [172]:
# this will get the index values increasing stepwise from 0 to 403
# (note: reset_index() also turns the y_train Series into a DataFrame with a 'PRICE' column)
X_train = X_train.reset_index()
y_train = y_train.reset_index()

# this will get rid of the 'index' column that pops up after you reset the index
X_train = X_train.drop('index', axis=1)
y_train = y_train.drop('index', axis=1)

How does one get this information to work for actual cross validation?  

The following for-loop would do the trick:

In [174]:
lreg = LinearRegression()
cross_val_scores = []

# split() yields (training indices, validation indices) for each fold
for train_idx, val_idx in kfold.split(X_train):
    lreg.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
    score = lreg.score(X_train.iloc[val_idx], y_train.iloc[val_idx])
    cross_val_scores.append(score)

Now if one looks at the validation scores returned from this, they exactly match what we had from the `cross_val_score()` function:

In [175]:
cross_val_scores

[0.7446931264241542,
 0.7296559896509521,
 0.7848266265881662,
 0.6567557105078086,
 0.6631551694581892,
 0.782830328560734,
 0.8178824192596486,
 0.7959642722807119,
 0.5480150991902277,
 0.7518894508603533]
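
The same equivalence can be reproduced without the housing file. The sketch below uses synthetic data from `make_regression` (an assumption made purely for illustration) and compares a manual `KFold` loop against `cross_val_score()` given the same splitter.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# synthetic stand-in for the training data
X_syn, y_syn = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

kf = KFold(n_splits=5)
model = LinearRegression()

# manual loop: fit on the training indices, score on the validation indices
manual_scores = []
for train_idx, val_idx in kf.split(X_syn):
    model.fit(X_syn[train_idx], y_syn[train_idx])
    manual_scores.append(model.score(X_syn[val_idx], y_syn[val_idx]))

# library shortcut using the same splitter
library_scores = cross_val_score(LinearRegression(), X_syn, y_syn, cv=kf)

print(np.allclose(manual_scores, library_scores))
```

The two agree because, for a regressor, `cross_val_score()` falls back to the estimator's own `score()` method (r-squared) and, given an integer `cv`, uses unshuffled `KFold` splits.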

So now, if we wanted to, we could use the values from the `splits` variable to get the index values for the 9th fold and go from there.

In [176]:
# these are the index values for the validation set in the 9th fold
val_idx = splits[8][1]

In [177]:
# and we can look at the original values in our dataset
X_train.iloc[val_idx]

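|     | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 324 | 0.07978 | 40.0 | 6.41 | 0 | 0.447 | 6.482 | 32.1 | 4.1403 | 4 | 254 | 17.6 | 396.9 | 7.19 |
| 325 | 2.73397 | 0.0 | 19.58 | 0 | 0.871 | 5.597 | 94.9 | 1.5257 | 5 | 403 | 14.7 | 351.85 | 21.45 |
| 326 | 0.04203 | 28.0 | 15.04 | 0 | 0.464 | 6.442 | 53.6 | 3.6659 | 4 | 270 | 18.2 | 395.01 | 8.16 |
| 327 | 4.54192 | 0.0 | 18.1 | 0 | 0.77 | 6.398 | 88.0 | 2.5182 | 24 | 666 | 20.2 | 374.56 | 7.79 |
| 328 | 0.06162 | 0.0 | 4.39 | 0 | 0.442 | 5.898 | 52.3 | 8.0136 | 3 | 352 | 18.8 | 364.61 | 12.67 |
| 329 | 0.03113 | 0.0 | 4.39 | 0 | 0.442 | 6.014 | 48.5 | 8.0136 | 3 | 352 | 18.8 | 385.64 | 10.53 |
| 330 | 0.04684 | 0.0 | 3.41 | 0 | 0.489 | 6.417 | 66.1 | 3.0923 | 2 | 270 | 17.8 | 392.18 | 8.81 |
| 331 | 0.08187 | 0.0 | 2.89 | 0 | 0.445 | 7.82 | 36.9 | 3.4952 | 2 | 276 | 18.0 | 393.53 | 3.57 |
| 332 | 0.03932 | 0.0 | 3.41 | 0 | 0.489 | 6.405 | 73.9 | 3.0921 | 2 | 270 | 17.8 | 393.55 | 8.2 |
| 333 | 0.01432 | 100.0 | 1.32 | 0 | 0.411 | 6.816 | 40.5 | 8.3248 | 5 | 256 | 15.1 | 392.9 | 3.95 |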


In [178]:
# and for the target variable
y_train.iloc[val_idx]

|     | PRICE |
| --- | --- |
| 324 | 29.1 |
| 325 | 15.4 |
| 326 | 22.9 |
| 327 | 25.0 |
| 328 | 17.2 |
| 329 | 17.5 |
| 330 | 22.6 |
| 331 | 43.8 |
| 332 | 22.0 |
| 333 | 31.6 |


To give an example of how this is helpful for future diagnostics, let's take a look at our errors for the validation set with the low r-squared value.

In [179]:
# get the index values for the training and validation sets in the 9th fold
val_idx  = splits[8][1]
train_idx = splits[8][0]

In [180]:
# fit on the training indices
lreg.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [181]:
# get the predictions for the validation index
preds = lreg.predict(X_train.iloc[val_idx])

In [185]:
# let's take a look at the absolute errors for the validation set, largest first
np.abs((y_train.loc[val_idx] - preds)).sort_values(by='PRICE', ascending=False)

|     | PRICE |
| --- | --- |
| 342 | 30.353583 |
| 347 | 26.234526 |
| 331 | 6.868811 |
| 326 | 5.608272 |
| 332 | 5.259861 |
| 346 | 5.198142 |
| 362 | 4.968544 |
| 336 | 4.954204 |
| 330 | 4.530361 |
| 348 | 4.512531 |


Well, it's fairly obvious what's going on here: about 80% of your squared error is being caused by two samples (rows 342 and 347).  This is a natural starting point for follow-up analysis.
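
One way to put a number on a claim like this is to compute the share of total squared error contributed by the largest residuals. The helper below, `top_k_squared_error_share`, is a hypothetical function written for this sketch, and the demo arrays are made up rather than taken from the housing data.

```python
import numpy as np

def top_k_squared_error_share(y_true, y_pred, k=2):
    """Fraction of the total squared error that comes from the k largest errors."""
    # ravel() lets this accept both 1-D arrays and single-column 2-D arrays
    sq_err = (np.ravel(y_true).astype(float) - np.ravel(y_pred).astype(float)) ** 2
    top_k = np.sort(sq_err)[-k:]
    return top_k.sum() / sq_err.sum()

# made-up targets and predictions with one badly missed sample
y_true_demo = np.array([10.0, 12.0, 11.0, 50.0, 9.0])
y_pred_demo = np.array([11.0, 11.0, 12.0, 20.0, 10.0])

share = top_k_squared_error_share(y_true_demo, y_pred_demo, k=1)
print(f"{share:.1%} of the squared error comes from the single worst sample")
```

Applied to the notebook's own variables, the call would look like `top_k_squared_error_share(y_train.iloc[val_idx], preds, k=2)`.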