### [Cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html)

Cross-validation is essential in model development: it lets us compare the performance of alternative algorithms and different hyperparameter settings *without* making use of the test data. Keeping the test data untouched during model selection matters because the final model can then be assessed on data that played no part in any modelling decision.

`KFold` is a simple way to get the data indices for cross-validation, which we can loop over:

In [None]:
# Using only the first 100 data points
X = diabetes.data[:100]
y = diabetes.target[:100]

In [None]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

lm = LinearRegression()

# Fit on each training fold and score on the corresponding held-out fold
for train, test in kf.split(X):
    print("training set indices:")
    print(train)
    print("test set indices:")
    print(test)
    lm.fit(X[train], y[train])
    y_pred = lm.predict(X[test])
    print("r2 = %.2f" % r2_score(y[test], y_pred))
    print()

If we just want to calculate a metric, the convenience function `cross_val_score` handles the splitting, fitting and scoring for us.

In [None]:
from sklearn.model_selection import cross_val_score
lm = LinearRegression()
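# For a regressor, cv=5 means 5 consecutive (unshuffled) folds;
# pass cv=kf instead to reuse the shuffled splitter defined above
score = cross_val_score(lm, X, y, cv=5, scoring='r2')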
print("Cross-validated r2:")
print(score)

We would usually quote the mean score under cross-validation:

In [None]:
print("mean r2 =", np.mean(score))

The standard deviation of the cross-validation scores is also useful: it shows how much the score varies from fold to fold, which gives a rough indication of the uncertainty in the estimated performance on unseen data.

In [None]:
print("sd =", np.std(score))
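Returning to the model comparison mentioned at the start of this section, the same scoring can be applied to several candidate models using identical folds. The cell below is a minimal sketch rather than a recommended setup: the `Ridge` models and their `alpha` values are assumptions chosen purely for illustration, and the data are reloaded with `load_diabetes` so that the cell runs on its own.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Same subset as above: the first 100 data points
diabetes = load_diabetes()
X = diabetes.data[:100]
y = diabetes.target[:100]

# One shared splitter, so every candidate is scored on identical folds
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Illustrative candidates (the alpha values are arbitrary assumptions)
candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge(alpha=0.1)": Ridge(alpha=0.1),
    "Ridge(alpha=1.0)": Ridge(alpha=1.0),
}

results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    results[name] = (np.mean(scores), np.std(scores))
    print("%s: mean r2 = %.3f, sd = %.3f" % (name, results[name][0], results[name][1]))
```

Because all candidates share the same `KFold` object with a fixed `random_state`, differences in the mean score reflect the models rather than the particular split. The test data are never touched in this comparison.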

Besides basic *k*-fold cross-validation, there are many alternative procedures that may be more suitable, depending on the structure of your particular data set.

For example, there may be definable subgroups within the data that we might want to leave out of training one at a time, to assess how good the predictor is at extrapolating beyond known groups.
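The leave-one-group-out idea described above can be sketched with scikit-learn's `LeaveOneGroupOut` splitter. Note that the group labels here are hypothetical: the diabetes data used in this notebook have no natural grouping, so the cell simply assigns the 100 points to four artificial groups of 25 to show the mechanics. With real data, `groups` would hold something like a patient, site or batch identifier.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# Same subset as above: the first 100 data points
diabetes = load_diabetes()
X = diabetes.data[:100]
y = diabetes.target[:100]

# Hypothetical group labels (assumption): four artificial groups of 25 points
groups = np.repeat(np.arange(4), 25)

logo = LeaveOneGroupOut()

# Each split holds out every sample belonging to exactly one group
held_out_groups = []
for train, test in logo.split(X, y, groups=groups):
    held_out = np.unique(groups[test]).tolist()
    held_out_groups.append(held_out)
    print("held-out group:", held_out, "| training size:", len(train))

# cross_val_score accepts the same splitter together with the group labels
group_scores = cross_val_score(
    LinearRegression(), X, y, groups=groups, cv=logo, scoring="r2"
)
print("r2 per held-out group:")
print(group_scores)
```

Each score measures performance on a group the model never saw during training, so a large drop relative to ordinary *k*-fold scores would suggest the predictor does not transfer well to new groups.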