# ML Tutorial Day 12

## K-Fold Cross Validation

K-fold cross validation helps us evaluate and compare the performance of different models. The basic procedure for creating a machine learning model is:
1. Preparing the data
2. Training the model
3. Testing the model

We can do the training and testing in various ways:
1. Use all available data for training and then test on the same dataset. Here the model's score can't be trusted because the model has already seen every data point and may simply have memorized them. When a new data point is fed in, the model might perform poorly.

2. Split the available dataset into training and testing sets
We divide the dataset into two sets: the model is trained on the training set and then tested on the unseen test set. This way the model's score is more trustworthy, but there is still an issue: the test set from a single split might have little to no similarity with the training data, so the score depends on which split we happened to get and the model might appear to perform poorly.

3. K-fold cross validation
We divide our entire dataset into several folds (smaller datasets) and then successively train and test the model on different combinations of them. Suppose we divide our dataset into 10 smaller sets. In the first iteration, we train the model on sets 1 to 9, test it on set 10 and note the score. Next we train the model on sets 2 to 10, test it on set 1 and note the score. We continue in this fashion, holding out a different set each time and noting the score. In the end, we take the average of all the scores to get the overall model score. This technique is good because every part of the data is used for testing once, so the averaged score depends less on any single split.
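
The fold rotation described in item 3 can be sketched with plain NumPy. This is only an illustrative sketch: the helper name `k_fold_indices` and the sample count of 20 are assumptions made for the example, not part of scikit-learn or of this tutorial's code.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs, holding out one fold per iteration."""
    indices = np.arange(n_samples)
    folds = np.array_split(indices, k)  # k contiguous chunks of nearly equal size
    for i in range(k):
        test_idx = folds[i]
        # every fold except the i-th one is used for training
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# hypothetical example: 20 samples split into 10 folds
splits = list(k_fold_indices(n_samples=20, k=10))
for train_idx, test_idx in splits[:2]:
    print("test fold:", test_idx, "train size:", len(train_idx))
```

Each sample lands in the test fold exactly once, and the final model score would be the mean of the per-fold scores.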

In [101]:
# importing relevant libraries
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.datasets import load_digits

# loading the dataset
digit = load_digits()

In [102]:
from sklearn.model_selection import train_test_split as tts

# creating training-testing split
X_train, X_test, y_train, y_test = tts(digit.data, digit.target, test_size = 0.2)

In [103]:
# training and testing a logistic regression model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
logreg.score(X_test, y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.9833333333333333
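
The convergence warning above suggests scaling the data and raising `max_iter`. A minimal sketch of that suggestion follows, assuming a `StandardScaler` placed in a pipeline in front of the classifier. The values `max_iter=1000` and `random_state=0`, and the names `scaled_logreg` and `scaled_score`, are choices made for this example only.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_tr, X_te, y_tr, y_te = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0  # fixed seed (assumption)
)

# scale the features first, then fit logistic regression with a larger iteration budget
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_tr, y_tr)
scaled_score = scaled_logreg.score(X_te, y_te)
print(scaled_score)
```

Putting the scaler inside the pipeline means it is fitted on the training data only, which also keeps it safe to pass to cross validation later.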

In [104]:
# training and testing a support vector machine model
svm = SVC()
svm.fit(X_train, y_train)
svm.score(X_test, y_test)

0.9805555555555555

In [105]:
# training and testing a random forest model
randfor = RandomForestClassifier()
randfor.fit(X_train, y_train)
randfor.score(X_test, y_test)

0.9722222222222222

In [106]:
# implementing the kfold cross validation
from sklearn.model_selection import KFold
kf = KFold(n_splits = 3)
kf

KFold(n_splits=3, random_state=None, shuffle=False)

In [107]:
# creating example dataset to demonstrate kfold split
for train_index, test_index in kf.split([1,2,3,4,5,6,7,8,9]):
    print(train_index, test_index)

[3 4 5 6 7 8] [0 1 2]
[0 1 2 6 7 8] [3 4 5]
[0 1 2 3 4 5] [6 7 8]


In [108]:
# function to get the score for a model
def get_score(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

In [109]:
get_score(svm, X_train, X_test, y_train, y_test)

0.9805555555555555

In [110]:
# using StratifiedKFold because, when dividing the dataset into folds, it keeps the proportion of each class roughly the same in every fold.
# Suppose we divide the dataset into 3 folds and one fold contains samples of only one class; the score on that fold would then be misleading
from sklearn.model_selection import StratifiedKFold
folds = StratifiedKFold(n_splits = 3)
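
To see the difference between the two splitters, the sketch below counts the classes that end up in each test fold. The toy labels (`y_toy`, sorted by class) and the dummy features (`X_toy`) are made up for illustration and are not the digits data.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# toy labels sorted by class: 6 samples each of classes 0, 1 and 2 (illustrative data)
y_toy = np.repeat([0, 1, 2], 6)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect how the folds are cut

# per-class sample counts inside each test fold
plain_counts = [np.bincount(y_toy[test], minlength=3).tolist()
                for _, test in KFold(n_splits=3).split(X_toy)]
strat_counts = [np.bincount(y_toy[test], minlength=3).tolist()
                for _, test in StratifiedKFold(n_splits=3).split(X_toy, y_toy)]

print("KFold test-fold class counts:          ", plain_counts)
print("StratifiedKFold test-fold class counts:", strat_counts)
```

With sorted labels and no shuffling, each plain `KFold` test fold contains a single class, while each `StratifiedKFold` test fold contains all three classes in equal numbers.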

In [111]:
# storing the scores of kfold cross validation for each model
scores_l = []
scores_svm = []
scores_rf = []

# note: this loop splits with the plain KFold object `kf` defined earlier;
# to use the stratified splitter instead, iterate over folds.split(digit.data, digit.target)
for train_index, test_index in kf.split(digit.data):
    X_train, X_test = digit.data[train_index], digit.data[test_index]
    y_train, y_test = digit.target[train_index], digit.target[test_index]
    scores_l.append(get_score(LogisticRegression(), X_train, X_test, y_train, y_test))
    scores_svm.append(get_score(SVC(), X_train, X_test, y_train, y_test))
    scores_rf.append(get_score(RandomForestClassifier(n_estimators = 40), X_train, X_test, y_train, y_test))


In [112]:
# average scores for logistic regression model
scores_l
sum(scores_l)/len(scores_l)

0.9265442404006677

In [113]:
# average scores for support vector machine model
scores_svm
sum(scores_svm)/len(scores_svm)

0.9677239844184752

In [114]:
# average score for random forest model
scores_rf
sum(scores_rf)/len(scores_rf)

0.9337785197551475

In [115]:
# to find out which model performs best, we can use the cross_val_score function, which runs the cross validation loop for us (5 folds by default)
from sklearn.model_selection import cross_val_score

# cross validation score for logistic regression
a = cross_val_score(LogisticRegression(), digit.data, digit.target)
print(sum(a)/len(a))

0.9137650882079852



In [116]:
# cross validation score for support vector machine
a = cross_val_score(SVC(), digit.data, digit.target)
print(sum(a)/len(a))

0.9632838130609718


In [117]:
# cross validation score for random forest
a = cross_val_score(RandomForestClassifier(), digit.data, digit.target)
print(sum(a)/len(a))

0.9382544103992572


We can use `cross_val_score` not only to compare different models, but also to compare the same model with different parameter values, which is known as hyperparameter tuning.
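
As a rough sketch of such tuning, the loop below compares a random forest with a few values of `n_estimators`. The candidate values `(5, 20, 40)`, the fixed `random_state=0`, and the names `mean_scores` and `best_n` are assumptions chosen for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

digits = load_digits()

mean_scores = {}
for n in (5, 20, 40):  # candidate values chosen for illustration
    model = RandomForestClassifier(n_estimators=n, random_state=0)
    scores = cross_val_score(model, digits.data, digits.target, cv=5)
    mean_scores[n] = scores.mean()  # average accuracy over the 5 folds

# keep the setting with the highest mean cross validation score
best_n = max(mean_scores, key=mean_scores.get)
print(mean_scores, "best:", best_n)
```

The same pattern works for any estimator and any parameter: fit nothing by hand, let `cross_val_score` return one score per fold, and compare the means.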