## Cross Validation: Methods to evaluate model performance

### Option 1: 
Use all available data for training, then test on that same dataset. This is a poor way to evaluate a model because it has already seen every test case, so the score overstates how well it will do on new data.
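A minimal sketch of why Option 1 is misleading. It assumes scikit-learn's `DecisionTreeClassifier` as a stand-in model (it is not used elsewhere in this notebook) and a fixed `random_state` for repeatability. The score on data the model was trained on tends to be higher than the score on data it has never seen.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Hold out 30% of the data so there are examples the model never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Assumption: a decision tree is used only for illustration.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

seen_score = model.score(X_train, y_train)   # evaluated on the training data (Option 1)
unseen_score = model.score(X_test, y_test)   # evaluated on held-out data
print(seen_score, unseen_score)
```

The gap between the two printed numbers is the optimism that testing on the training data hides.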

### Option 2: 
Split the available dataset into training and test sets. This solves the problem with Option 1, but a single random split can produce a training set that differs significantly from the test set, so the score depends on which split you happened to draw.
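As a rough illustration of that dependence, the sketch below repeats the split with five different `random_state` seeds (the seed values are arbitrary choices for this example) and scores the same model each time. The scores usually differ slightly from split to split.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

split_scores = []
for seed in range(5):
    # Each seed produces a different random 70/30 split of the same data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    split_scores.append(SVC().fit(X_train, y_train).score(X_test, y_test))

print(split_scores)
# Spread between the best and worst split.
print(max(split_scores) - min(split_scores))
```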

### Option 3: K Fold Cross Validation:

Divide the dataset into K equal-sized sets (folds), then perform K iterations:
- In each iteration, use a different fold as the test set and the remaining K - 1 folds as the training set.
- After all K iterations, average the scores.
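The steps above can be sketched in plain Python. This is an illustrative version, not scikit-learn's implementation: `k_fold_indices`, `cross_validate`, and the `evaluate` callback are hypothetical names introduced for this example, and the folds are taken as consecutive blocks with no shuffling.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k consecutive folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


def cross_validate(evaluate, n_samples, k):
    """Average the score returned by evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# Each of the 9 samples lands in the test set exactly once.
splits = list(k_fold_indices(9, 3))
for train, test in splits:
    print(train, test)
```

In practice `evaluate` would fit a model on the training indices and return its score on the test indices.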

In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

## Perform a train/test split and train Logistic Regression, Support Vector Machine, and Random Forest models. Model accuracy changes each time the training/testing sample sets are re-drawn; K-Fold Cross Validation reduces this dependence on a single split by averaging over K splits.

In [16]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.3)

In [17]:
lr = LogisticRegression(solver="newton-cg", multi_class="auto")
lr.fit(x_train, y_train)
lr.score(x_test, y_test)

0.9666666666666667

In [18]:
svm = SVC()
svm.fit(x_train, y_train)
svm.score(x_test, y_test)

0.9851851851851852

In [19]:
rf = RandomForestClassifier(n_estimators=40)
rf.fit(x_train, y_train)
rf.score(x_test, y_test)

0.9722222222222222

## Perform train/test split using KFold and StratifiedKFold.

In [22]:
def get_score(model, x_train, x_test, y_train, y_test):
    model.fit(x_train, y_train)
    return model.score(x_test, y_test)

In [24]:
get_score(LogisticRegression(solver="newton-cg", multi_class="auto"), x_train, x_test, y_train, y_test)

0.9666666666666667

**StratifiedKFold**<br>
Each fold keeps roughly the same class proportions as the full dataset. For example, if a balanced dataset of 150 instances in 3 classes is divided into 5 folds, each fold contains 30 instances, 10 from each class.

**KFold**<br>
Each fold takes 30 instances without looking at the class labels, so the number of instances per class in a fold may or may not be equal.

**When to use**<br>
Classification task: use StratifiedKFold.<br>
Regression task: use KFold.

If the dataset contains a large number of instances, both StratifiedKFold and KFold can be used for a classification task.
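To make the difference concrete, the sketch below counts the classes in each test fold for both splitters. The toy labels (150 instances, 3 classes of 50 each, sorted by class) are an assumption made for this example, and `class_counts_per_fold` is a hypothetical helper.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels: 150 instances, 3 classes, 50 each, sorted by class.
y = np.repeat([0, 1, 2], 50)
X = np.zeros((len(y), 1))  # feature values do not affect how the folds are formed


def class_counts_per_fold(splitter):
    """Return, for each test fold, the number of instances of each class."""
    return [
        np.bincount(y[test_idx], minlength=3).tolist()
        for _, test_idx in splitter.split(X, y)
    ]


kfold_counts = class_counts_per_fold(KFold(n_splits=5))
stratified_counts = class_counts_per_fold(StratifiedKFold(n_splits=5))

print(kfold_counts)       # unshuffled KFold: some folds miss whole classes
print(stratified_counts)  # StratifiedKFold: every fold has the same class mix
```

Because the labels are sorted and `KFold` does not shuffle by default, its first test fold contains only class 0, while every `StratifiedKFold` test fold contains 10 instances of each class.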

In [20]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=3)

In [21]:
for train_index, test_index in kf.split([1,2,3,4,5,6,7,8,9]):
    print(train_index, test_index)

[3 4 5 6 7 8] [0 1 2]
[0 1 2 6 7 8] [3 4 5]
[0 1 2 3 4 5] [6 7 8]


In [25]:
from sklearn.model_selection import StratifiedKFold
folds = StratifiedKFold(n_splits=3)

In [30]:
scores_lr = []
scores_svm = []
scores_rf = []

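# Note: this loop splits with the plain KFold object `kf`. To stratify by class, use the
# StratifiedKFold object defined above: folds.split(digits.data, digits.target).
for train_index, test_index in kf.split(digits.data):
    # Select the rows for this fold's training and test sets.
    x_train, x_test = digits.data[train_index], digits.data[test_index]
    y_train, y_test = digits.target[train_index], digits.target[test_index]
    # Fit a fresh model of each type on this fold and record its test accuracy.
    scores_lr.append(get_score(LogisticRegression(solver="newton-cg", multi_class="auto"), x_train, x_test, y_train, y_test))
    scores_svm.append(get_score(SVC(), x_train, x_test, y_train, y_test))
    scores_rf.append(get_score(RandomForestClassifier(n_estimators=40), x_train, x_test, y_train, y_test))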

In [31]:
scores_lr

[0.9282136894824707, 0.9415692821368948, 0.9165275459098498]

In [32]:
scores_svm

[0.9666110183639399, 0.9816360601001669, 0.9549248747913188]

In [33]:
scores_rf

[0.9298831385642737, 0.9348914858096828, 0.9198664440734557]
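The final step of K-Fold Cross Validation is to average the per-fold scores. A small sketch, with the fold scores copied from the three outputs above into a dictionary (`fold_scores` and `mean_scores` are names introduced for this example):

```python
import numpy as np

# Fold scores copied from the outputs above.
fold_scores = {
    "LogisticRegression": [0.9282136894824707, 0.9415692821368948, 0.9165275459098498],
    "SVC": [0.9666110183639399, 0.9816360601001669, 0.9549248747913188],
    "RandomForestClassifier": [0.9298831385642737, 0.9348914858096828, 0.9198664440734557],
}

# Average the K per-fold scores for each model.
mean_scores = {name: float(np.mean(scores)) for name, scores in fold_scores.items()}
best_model = max(mean_scores, key=mean_scores.get)

for name, mean in mean_scores.items():
    print(f"{name}: {mean:.4f}")
print("highest mean accuracy:", best_model)
```

On these three folds SVC has the highest mean accuracy. Random Forest scores vary from run to run because no `random_state` was fixed.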

## You can use 'cross_val_score' to run the cross validation above with a single function call. Look at the resulting arrays to evaluate model performance. It defaults to K = 5 folds, and for classifiers it uses StratifiedKFold.

In [38]:
from sklearn.model_selection import cross_val_score

In [39]:
cross_val_score(LogisticRegression(solver="newton-cg", multi_class="auto"), digits.data, digits.target)

array([0.925     , 0.87777778, 0.93871866, 0.93314763, 0.89693593])

In [36]:
cross_val_score(SVC(), digits.data, digits.target)

array([0.96111111, 0.94444444, 0.98328691, 0.98885794, 0.93871866])

In [37]:
cross_val_score(RandomForestClassifier(n_estimators=40), digits.data, digits.target)

array([0.93611111, 0.90833333, 0.95543175, 0.95821727, 0.91086351])
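The fold count and splitter can also be set explicitly through the `cv` parameter, and the resulting array summarized with its mean and standard deviation. A minimal sketch, assuming 3 stratified folds to mirror the manual loop earlier in this notebook:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Pass a splitter object instead of relying on the default of 5 folds.
cv = StratifiedKFold(n_splits=3)
scores = cross_val_score(SVC(), X, y, cv=cv)

# The mean summarizes accuracy; the standard deviation shows fold-to-fold spread.
print(scores)
print(scores.mean(), scores.std())
```

Passing an integer such as `cv=3` is a shorter way to change only the number of folds.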