KFold Cross Validation

In [1]:
""" 
When choosing a machine learning model for a given problem, we need a reliable way to compare candidates. K-fold cross validation evaluates a model by splitting the dataset into K folds, training on K-1 of them, testing on the remaining one, and repeating this so that each fold serves as the test set once. Because every sample is used for testing exactly once, the averaged score depends less on one lucky or unlucky split than a single train_test_split does. In this tutorial, we cover the basics of cross validation and KFold. We also look at the cross_val_score function of the sklearn library, which provides a convenient way to run cross validation on a model.

- Cross validation is a technique to evaluate model performance.

- Three ways to evaluate a model
    1. Use all available data for training and test on the same dataset
        - Problem: the model has already seen every test sample, so the score is overly optimistic
    2. Split the available dataset into training and test sets [70:30]
        - Problem: the split may be unrepresentative, like studying only simple maths questions (the 70%) and then being tested on advanced ones (the 30%)
    3. K-fold cross validation
        - We can do 10-fold or 15-fold cross validation depending on the problem
"""

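To see why a single split can mislead, the sketch below repeats the 70:30 split with a few different seeds and scores the same SVC each time. The seed values and variable names are illustrative assumptions and are not part of the original notebook.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()

# Score the same model on several different 70:30 splits.
# The seeds 0..4 are arbitrary; any values would do.
split_scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        digits.data, digits.target, test_size=0.3, random_state=seed
    )
    split_scores.append(SVC().fit(X_tr, y_tr).score(X_te, y_te))

print(split_scores)
print("spread:", max(split_scores) - min(split_scores))
```

The scores usually differ somewhat from split to split, and that variation is what averaging over K folds is meant to smooth out.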

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

In [3]:
import numpy as np
import matplotlib.pyplot as plt

In [4]:
from sklearn.datasets import load_digits
digits = load_digits()

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.3)

In [6]:
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.9629629629629629
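The warning above suggests scaling the data or raising max_iter. A minimal sketch of both, assuming a StandardScaler pipeline and an illustrative max_iter=1000 (neither appears in the original notebook), might look like this. It uses its own variable names so it does not overwrite the split created earlier.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()
# random_state=0 is an assumption added for reproducibility.
X_tr, X_te, y_tr, y_te = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0
)

# Scale the features, then fit logistic regression with a higher iteration cap.
# max_iter=1000 is an illustrative value, not a tuned one.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)
scaled_lr_score = scaled_lr.score(X_te, y_te)
print(scaled_lr_score)
```

Putting the scaler inside the pipeline means it is fitted on the training rows only, which also matters later when the same estimator is passed to cross validation.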

In [7]:
svm = SVC()
svm.fit(X_train, y_train)
svm.score(X_test, y_test)

0.9888888888888889

In [8]:
rf = RandomForestClassifier(n_estimators=40)
rf.fit(X_train, y_train)    
rf.score(X_test, y_test)

0.9777777777777777

In [9]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=3)
kf

KFold(n_splits=3, random_state=None, shuffle=False)

In [10]:
# Basic example
kf.split([1,2,3,4,5,6,7,8,9]) # returns a generator of (train_indices, test_indices) pairs, one per fold

<generator object _BaseKFold.split at 0x000001E5882F4890>

In [11]:
# Iterate over the generator to get the train and test indices of each fold
for train_index, test_index in kf.split([1,2,3,4,5,6,7,8,9]):
    print(train_index, test_index)

[3 4 5 6 7 8] [0 1 2]
[0 1 2 6 7 8] [3 4 5]
[0 1 2 3 4 5] [6 7 8]
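The index pattern printed above can be reproduced by hand. The sketch below is not sklearn's implementation; manual_kfold_indices is a hypothetical helper that mimics unshuffled KFold with plain NumPy and then compares its folds with those of KFold.

```python
import numpy as np
from sklearn.model_selection import KFold

def manual_kfold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds."""
    indices = np.arange(n_samples)
    # np.array_split gives the earlier folds one extra sample when
    # n_samples is not divisible by n_splits.
    for test_idx in np.array_split(indices, n_splits):
        train_idx = np.setdiff1d(indices, test_idx)
        yield train_idx, test_idx

manual_folds = list(manual_kfold_indices(9, 3))
sklearn_folds = list(KFold(n_splits=3).split(np.arange(9)))
for (m_train, m_test), (s_train, s_test) in zip(manual_folds, sklearn_folds):
    print(m_train, m_test, np.array_equal(m_test, s_test))
```

Each sample index appears in exactly one test fold and in the training set of every other fold.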


In [12]:
# Utility function: fit the model on the training split and return its accuracy on the test split
def get_score(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

In [13]:
get_score(LogisticRegression(), X_train, X_test, y_train, y_test)

0.9629629629629629

In [14]:
get_score(SVC(), X_train, X_test, y_train, y_test)

0.9888888888888889

In [15]:
get_score(RandomForestClassifier(), X_train, X_test, y_train, y_test)

0.9814814814814815

Now, we evaluate the performance of all the above algorithms using kfold.

In [16]:
""" 
The splitting of data into folds may be governed by criteria such as ensuring that each fold has the same proportion of observations with a given categorical value, such as the class outcome value. This is called stratified cross-validation.
"""

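The effect of stratification is easiest to see on a small imbalanced example. The toy labels below (8 samples of class 0 followed by 4 of class 1) and the helper class_counts_per_fold are assumptions made up for illustration; the sketch counts how many samples of each class land in every test fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels (an assumption for illustration): 8 samples of class 0, then 4 of class 1.
y_toy = np.array([0] * 8 + [1] * 4)
X_toy = np.zeros((len(y_toy), 1))  # the feature values do not affect the split

def class_counts_per_fold(splitter):
    """Return [count of class 0, count of class 1] for each test fold."""
    return [
        np.bincount(y_toy[test_idx], minlength=2).tolist()
        for _, test_idx in splitter.split(X_toy, y_toy)
    ]

plain_counts = class_counts_per_fold(KFold(n_splits=4))
stratified_counts = class_counts_per_fold(StratifiedKFold(n_splits=4))
print("KFold:          ", plain_counts)
print("StratifiedKFold:", stratified_counts)
```

With the plain split, two of the test folds contain no class-1 samples at all, while the stratified split keeps the 2:1 class ratio in every fold.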

In [17]:
# Manual cross-validation loop, for demonstration only; cross_val_score does the same in one call
from sklearn.model_selection import StratifiedKFold
folds = StratifiedKFold(n_splits=3)

scores_logistic = []        
scores_svm = []
scores_rf = []

for train_index, test_index in folds.split(digits.data, digits.target):
    X_train, X_test = digits.data[train_index], digits.data[test_index]
    y_train, y_test = digits.target[train_index], digits.target[test_index]
    scores_logistic.append(get_score(LogisticRegression(), X_train, X_test, y_train, y_test))
    scores_svm.append(get_score(SVC(), X_train, X_test, y_train, y_test))
    scores_rf.append(get_score(RandomForestClassifier(n_estimators=40), X_train, X_test, y_train, y_test))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [18]:
scores_logistic

[0.9215358931552587, 0.9415692821368948, 0.9165275459098498]

In [19]:
scores_svm

[0.9649415692821369, 0.9799666110183639, 0.9649415692821369]

In [20]:
scores_rf

[0.9348914858096828, 0.9515859766277128, 0.9248747913188647]
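Three per-fold numbers per model are hard to compare by eye, so a common next step is to summarise each list by its mean and standard deviation. The sketch below copies the scores printed above; the dictionary layout and variable names are illustrative assumptions.

```python
import numpy as np

# Per-fold accuracies copied from the outputs above.
fold_scores = {
    "LogisticRegression": [0.9215358931552587, 0.9415692821368948, 0.9165275459098498],
    "SVC": [0.9649415692821369, 0.9799666110183639, 0.9649415692821369],
    "RandomForestClassifier": [0.9348914858096828, 0.9515859766277128, 0.9248747913188647],
}

# Mean and standard deviation of the fold scores for each model.
summary = {name: (np.mean(s), np.std(s)) for name, s in fold_scores.items()}
for name, (mean, std) in summary.items():
    print(f"{name}: mean={mean:.4f}, std={std:.4f}")

best_model = max(summary, key=lambda name: summary[name][0])
print("highest mean accuracy:", best_model)
```

On these three folds, SVC has the highest mean accuracy of the three models.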

cross_val_score() 

In [21]:
# For classifiers, cross_val_score uses stratified k-fold by default
from sklearn.model_selection import cross_val_score

In [22]:
cross_val_score(SVC(), digits.data, digits.target)

array([0.96111111, 0.94444444, 0.98328691, 0.98885794, 0.93871866])
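cross_val_score also accepts a splitter object for cv instead of an integer, which is one way to control shuffling and make the folds reproducible. The sketch below assumes a StratifiedKFold with shuffle=True and an arbitrary random_state=0; these settings are not from the original notebook.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

digits = load_digits()

# Shuffle before splitting and fix the seed so the folds are reproducible.
# n_splits=5 mirrors the default; random_state=0 is an arbitrary choice.
cv_splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
shuffled_scores = cross_val_score(SVC(), digits.data, digits.target, cv=cv_splitter)
print(shuffled_scores, shuffled_scores.mean())
```

Because the default folds are taken in dataset order, shuffled folds can give somewhat different per-fold scores than the call above.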

In [23]:
# cv=7 runs 7-fold cross validation; the default is cv=5
cross_val_score(RandomForestClassifier(), digits.data, digits.target, cv=7)

array([0.93385214, 0.95719844, 0.89494163, 0.94941634, 0.9766537 ,
       0.93359375, 0.921875  ])

In [24]:
""" 
Uses of k-fold cross validation:
    - To decide which machine learning model performs best
    - To tune the parameters of a particular model, e.g. to decide the number of trees in a random forest
"""


In [25]:
score_1 = cross_val_score(RandomForestClassifier(n_estimators=5), digits.data, digits.target, cv=10)
np.average(score_1)

0.879810676598386

In [26]:
score_2 = cross_val_score(RandomForestClassifier(n_estimators=20), digits.data, digits.target, cv=10)
np.average(score_2)

0.9426877715704529

In [27]:
score_3 = cross_val_score(RandomForestClassifier(n_estimators=30), digits.data, digits.target, cv=10)
np.average(score_3)

0.9398944754810674

In [28]:
score_4 = cross_val_score(RandomForestClassifier(n_estimators=40), digits.data, digits.target, cv=10)
np.average(score_4)

0.9476753569211669

In [29]:
"""
Conclusion: 
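    Here we used cross_val_score to tune our random forest classifier. Of the values tried (5, 20, 30 and 40 trees), 40 gave the highest mean accuracy (about 0.948), though the gain over 20 trees (about 0.943) is small and the forests are not seeded, so results vary slightly between runs.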
"""

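The same comparison can be run in one step with GridSearchCV, which is a different tool from the repeated cross_val_score calls used above. This is a sketch under assumptions: the grid reuses the four tree counts tried in the notebook, and random_state=0 is added so that repeated runs build the same forests.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

digits = load_digits()

# random_state=0 is an assumption added so that repeated runs give the same forests.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [5, 20, 30, 40]},
    cv=10,
)
search.fit(digits.data, digits.target)

# Report the mean and standard deviation of the 10 fold scores per setting.
results = search.cv_results_
for params, mean, std in zip(
    results["params"], results["mean_test_score"], results["std_test_score"]
):
    print(params, f"mean={mean:.4f}", f"std={std:.4f}")
print("best:", search.best_params_)
```

Looking at the standard deviation alongside the mean helps judge whether the gap between two settings is larger than the fold-to-fold noise.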