### K-Fold Cross-Validation

This notebook gives a quick demo of using KFold CV in sklearn.

We'll try different models (Logistic Regression, SVM, RandomForest) on the digits dataset and then use k-fold cross-validation to see which model gives us the best score.

Here's a quick example of what we're trying to do using k-fold cv:

Suppose your dataset has 1000 samples. You divide it into 5 separate chunks (called folds), each of which has 200 samples. You then train your model on the data from 4 folds and use the 5th fold to test its performance. You repeat this 5 times, each time choosing a different fold for testing and using the remaining 4 folds for training. This gives you 5 test scores (one for each test fold), and averaging them gives a more robust estimate of how the model actually performs than a single train/test split does.

Note: In this case, our k (the number of folds) was 5.
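The procedure described above can be sketched as a manual loop. This is only an illustration, assuming the digits dataset and an `SVC` with default settings; the names `fold_scores` and `mean_score` are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# Load the digits data as plain NumPy arrays so we can index by fold.
X, y = load_digits(return_X_y=True)

kf = KFold(n_splits=5)  # k = 5 folds
fold_scores = []

for train_idx, test_idx in kf.split(X):
    # Train a fresh model on the 4 training folds...
    model = SVC()
    model.fit(X[train_idx], y[train_idx])
    # ...and score it on the 1 held-out fold.
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

# Average the 5 per-fold scores into a single estimate.
mean_score = np.mean(fold_scores)
print("Scores: ", fold_scores)
print("Average score: ", mean_score)
```

A fresh model is created inside the loop so that nothing learned from one fold carries over to the next.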

In [48]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

In [49]:
X, y = load_digits(return_X_y=True, as_frame=True)

#### Perform the train test split

In [50]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#### Trying the logistic regression model

In [64]:
# solver selects the optimization algorithm used to fit the model
# multi_class='ovr' fits one binary one-vs-rest classifier per digit class
# (note: the multi_class parameter is deprecated since scikit-learn 1.5)
logreg_model = LogisticRegression(solver='liblinear', multi_class='ovr')
logreg_model.fit(X_train, y_train)
logreg_model.score(X_test, y_test)

0.9611111111111111

#### Trying SVM

In [52]:
svm_model = SVC()
svm_model.fit(X_train, y_train)
svm_model.score(X_test, y_test)

0.9861111111111112

#### Trying RandomForestClassifier

In [53]:
randomforest_model = RandomForestClassifier()
randomforest_model.fit(X_train, y_train)
randomforest_model.score(X_test, y_test)

0.9777777777777777
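Each score above comes from a single train/test split, so it depends on which samples happened to land in the test set. As a rough illustration of why averaging over folds helps, the sketch below repeats the split with a few different `random_state` values (the seeds 0, 1 and 2 are arbitrary choices for this example) and scores an `SVC` on each.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

split_scores = []
for seed in (0, 1, 2):  # arbitrary seeds, chosen only for illustration
    # Same 80/20 split ratio as above, but a different shuffle each time.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    split_scores.append(SVC().fit(X_tr, y_tr).score(X_te, y_te))

# The scores will generally differ from split to split.
print(split_scores)
```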

#### Quick toy example of what the folds actually look like

In [65]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=3) # 3-fold cross-validation
print(kf)

for train_index, test_index in kf.split([1,2,3,4,5,6,7,8,9]):
    print(train_index, test_index)

KFold(n_splits=3, random_state=None, shuffle=False)
[3 4 5 6 7 8] [0 1 2]
[0 1 2 6 7 8] [3 4 5]
[0 1 2 3 4 5] [6 7 8]
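With `shuffle=False` (the default shown in the printed `KFold` above), the folds are contiguous blocks of indices. If the data is ordered in some way, it may be preferable to shuffle before splitting. A minimal sketch, assuming `random_state=0` purely for reproducibility:

```python
from sklearn.model_selection import KFold

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# shuffle=True randomizes which samples go into each fold;
# random_state makes the shuffle repeatable.
kf_shuffled = KFold(n_splits=3, shuffle=True, random_state=0)

test_indices = []
for train_index, test_index in kf_shuffled.split(data):
    print(train_index, test_index)
    test_indices.extend(test_index.tolist())
```

Even with shuffling, every sample still appears in exactly one test fold.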


### Performing the k-fold cross validation

In [56]:
from sklearn.model_selection import cross_val_score

#### Cross validation for the logistic regression model

In [61]:
scores = cross_val_score(logreg_model, X, y, cv=5)
print("Scores: ", scores)
print("Average score: ", scores.mean())

Scores:  [0.92222222 0.88333333 0.95264624 0.95821727 0.89415042]
Average score:  0.9221138966264315


#### Cross validation for the SVM model

In [62]:
scores = cross_val_score(svm_model, X, y, cv=5)
print("Scores: ", scores)
print("Average score: ", scores.mean())

Scores:  [0.96111111 0.94444444 0.98328691 0.98885794 0.93871866]
Average score:  0.9632838130609718


#### Cross validation for the random forest model

In [63]:
scores = cross_val_score(randomforest_model, X, y, cv=5)
print("Scores: ", scores)
print("Average score: ", scores.mean())

Scores:  [0.92777778 0.91388889 0.95264624 0.95821727 0.92479109]
Average score:  0.935464252553389


Across 5-fold cross-validation, SVM has the highest average score (about 0.963), ahead of the random forest (about 0.935) and logistic regression (about 0.922), so it is the best-performing of the three models on this dataset.
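The three comparisons can also be run in one loop. The sketch below is one possible way to do it, not the notebook's original code. Two things to note: when `cv` is an integer and the estimator is a classifier, `cross_val_score` uses `StratifiedKFold` rather than plain `KFold`, so the splitter is written out explicitly here; and the logistic regression uses the default solver with `max_iter=5000` (an assumption for this example), so its scores may differ from the ones printed above.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Explicit version of what cv=5 does for classifiers:
# each fold keeps roughly the same class proportions.
cv = StratifiedKFold(n_splits=5)

models = {
    "logistic regression": LogisticRegression(max_iter=5000),  # assumed settings
    "svm": SVC(),
    "random forest": RandomForestClassifier(random_state=0),
}

mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv)
    mean_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.4f}")

# Pick the model with the highest average cross-validation score.
best_model = max(mean_scores, key=mean_scores.get)
print("Best model:", best_model)
```

Using the same `cv` object for every model means all three are evaluated on identical folds, which keeps the comparison fair.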