Model Training - Cross Validation

![](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png)

source: scikit-learn

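To guard against overfitting (a model that fits its training data closely but predicts poorly on data it has not seen), cross-validation has been proposed: the model is trained and evaluated on several different splits of the data.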

In [1]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm
import pandas as pd

In [2]:
iris = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", header=None)
iris.columns = ['sep_len','sep_width','pet_len','pet_width','flower']
iris

Unnamed: 0,sep_len,sep_width,pet_len,pet_width,flower
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


In [3]:
iris['flower'].value_counts()

Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: flower, dtype: int64

In [5]:
iris.describe()

Unnamed: 0,sep_len,sep_width,pet_len,pet_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.054,3.758667,1.198667
std,0.828066,0.433594,1.76442,0.763161
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


In [6]:
iris.isna().sum()

sep_len      0
sep_width    0
pet_len      0
pet_width    0
flower       0
dtype: int64

In [7]:
iris.columns

Index(['sep_len', 'sep_width', 'pet_len', 'pet_width', 'flower'], dtype='object')

In [8]:
from sklearn.model_selection import train_test_split

X = iris[['sep_len', 'sep_width', 'pet_len', 'pet_width']]
Y = iris['flower']

# an integer test_size is an absolute number of samples (30 of the 150 rows)
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=30, random_state=100)

print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)

(120, 4) (120,)
(30, 4) (30,)


Build a classification model

In [9]:
from sklearn.linear_model import LogisticRegression
class_model_1 = LogisticRegression()
class_model_1

In [10]:
class_model_1.fit(X_train, Y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [11]:
class_model_1.score(X_test, Y_test)

0.9666666666666667
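
The convergence warning above suggests either raising `max_iter` or scaling the data. A minimal sketch of doing both follows. It assumes scikit-learn's bundled copy of the iris data (`load_iris`) in place of the CSV loaded earlier, and `max_iter=1000` is an arbitrary generous limit, not a tuned value.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled iris copy stands in for the CSV used in this notebook.
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=30, random_state=100
)

# Standardise the features, then fit logistic regression with a higher iteration cap.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

scaled_accuracy = scaled_model.score(X_te, y_te)
print(scaled_accuracy)
```

Putting the scaler inside a pipeline means it is fitted on the training rows only, which matters once the same estimator is passed to cross-validation.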

## Export Model

In [12]:
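# serialize the fitted model with pickle (the models/ directory must already exist)
import pickle

with open('models/iris-flower-model.pkl', 'wb') as model_file:
    pickle.dump(class_model_1, model_file)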

In [13]:
# to use the model anywhere else:
# import the pickle library
# load the pkl file

with open("models/iris-flower-model.pkl", 'rb') as model_file:
    modelPickled = pickle.load(model_file)

In [15]:
modelPickled.predict([[1,1,1,1]])



array(['Iris-setosa'], dtype=object)
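
A quick way to check an export is to confirm that the restored model predicts exactly what the original did. The sketch below is illustrative only: it assumes the bundled `load_iris` data, writes to a temporary directory, and `iris-demo.pkl` is a hypothetical file name rather than the path used above.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Assumption: bundled iris data; max_iter=1000 is an arbitrary generous limit.
X_demo, y_demo = load_iris(return_X_y=True)
fitted = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "iris-demo.pkl"  # hypothetical file name
    # Context managers close the file handles even if pickling fails.
    with open(model_path, "wb") as fh:
        pickle.dump(fitted, fh)
    with open(model_path, "rb") as fh:
        restored = pickle.load(fh)

# The round-tripped model should give identical predictions.
same_predictions = np.array_equal(fitted.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Pickle files should only be loaded from sources you trust, and ideally with the same scikit-learn version that wrote them.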

### **Cross Validation**

Cross-validation uses the training data to estimate how well the model will perform on data it has not seen before, i.e. how well it generalizes.
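
To make the idea concrete, here is a rough sketch of the loop that `cross_val_score` performs: split, fit a fresh copy of the model on the training folds, score on the held-out fold, repeat. It assumes the bundled `load_iris` data and 5 stratified folds; for a classifier and an integer `cv`, scikit-learn uses stratified folds by default.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Assumption: bundled iris data; max_iter=1000 is an arbitrary generous limit.
X_demo, y_demo = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in folds.split(X_demo, y_demo):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

# The library call should reproduce the hand-written loop.
library_scores = cross_val_score(model, X_demo, y_demo, cv=folds)
print(manual_scores, library_scores)
```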

In [76]:
from sklearn.model_selection import cross_val_score
from sklearn import metrics

logRegModel = LogisticRegression()

cv_score = cross_val_score(logRegModel, X_train, Y_train, cv=8)
cv_score

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

array([0.93333333, 1.        , 0.86666667, 1.        , 0.93333333,
       0.93333333, 1.        , 1.        ])

In [77]:
#compute mean and standard deviation of scores
print("Accuracy: ", cv_score.mean(), cv_score.std())

Accuracy:  0.9583333333333333 0.04639803635691684


In [33]:
cv_score_b = cross_val_score(class_model_1, X, Y, cv=4,
                             scoring='f1_macro') 
# F1 score: the harmonic mean of precision and recall
# f1_macro: computes the F1 score for each class and returns the unweighted average of those scores
cv_score_b

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


array([0.97333333, 0.97316157, 0.9465812 , 1.        ])

In [34]:
#compute mean and standard deviation of scores
print("F1 macro: ", cv_score_b.mean(), cv_score_b.std())

F1 macro:  0.9732690243197489 0.018886509026731838
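
The macro-averaged F1 used in this cell can be reproduced by hand. The sketch below uses made-up toy labels (`y_true`, `y_pred` are hypothetical, not taken from the iris data), computes F1 per class from true positives, false positives and false negatives, and compares the plain average with `f1_score(..., average="macro")`.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical toy labels, chosen only to illustrate the calculation.
y_true = np.array(["a", "a", "a", "b", "b", "c"])
y_pred = np.array(["a", "a", "b", "b", "c", "c"])

per_class_f1 = []
for label in np.unique(y_true):
    tp = np.sum((y_pred == label) & (y_true == label))
    fp = np.sum((y_pred == label) & (y_true != label))
    fn = np.sum((y_pred != label) & (y_true == label))
    # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall
    per_class_f1.append(2 * tp / (2 * tp + fp + fn))

macro_f1_by_hand = float(np.mean(per_class_f1))  # unweighted mean over classes
macro_f1_sklearn = f1_score(y_true, y_pred, average="macro")
print(per_class_f1, macro_f1_by_hand, macro_f1_sklearn)
```

Because every class counts equally in the macro average, a rare class that is predicted badly pulls the score down as much as a common one would.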


In [36]:
from sklearn.model_selection import cross_val_predict

predicted_a = cross_val_predict(class_model_1, X, Y, cv=2)


predicted_b = cross_val_predict(class_model_1, X, Y, cv=4)



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [37]:
print("2 Fold - ", metrics.accuracy_score(Y, predicted_a))
print("4 Fold - ", metrics.accuracy_score(Y, predicted_b))

2 Fold -  0.96
4 Fold -  0.9733333333333334
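
Since `cross_val_predict` returns one out-of-fold prediction per row, those predictions can also feed a confusion matrix to show which classes are being mixed up. A small sketch, assuming the bundled `load_iris` data and 4 folds:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict

# Assumption: bundled iris data; max_iter=1000 is an arbitrary generous limit.
X_demo, y_demo = load_iris(return_X_y=True)

# Each row is predicted by a model that did not see it during fitting.
oof_pred = cross_val_predict(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=4)

cm = confusion_matrix(y_demo, oof_pred)  # rows: true class, columns: predicted class
oof_accuracy = accuracy_score(y_demo, oof_pred)
print(cm)
print(oof_accuracy)
```

The diagonal of the matrix counts the correct predictions, so its sum divided by the number of rows equals the accuracy.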


### **K-Fold**

In [42]:

from sklearn.model_selection import KFold

In [81]:
kf = KFold(n_splits=3, shuffle=True)

a = [1,2,3,4,5,6,7,8]

# each iteration yields the train and test *indices* into a
for train, test in kf.split(a):
    print(train, test,'\n')

[0 4 5 6 7] [1 2 3] 

[0 1 2 3 4] [5 6 7] 

[1 2 3 5 6 7] [0 4] 



In [82]:
kf = KFold(n_splits=4, shuffle=True)

counter = 1

for train, test in kf.split(a):
    
    print('Iteration ---',counter)
    print('Train[index]--Test[index]')
    print(train, test, '\n')

    counter=counter+1

Iteration --- 1
Train[index]--Test[index]
[0 1 2 3 4 6] [5 7] 

Iteration --- 2
Train[index]--Test[index]
[0 2 3 4 5 7] [1 6] 

Iteration --- 3
Train[index]--Test[index]
[1 3 4 5 6 7] [0 2] 

Iteration --- 4
Train[index]--Test[index]
[0 1 2 5 6 7] [3 4] 
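
Because `shuffle=True` is used without a `random_state`, the splits printed above change from run to run. A minimal sketch of making them reproducible follows. The seed `0` is an arbitrary choice, and the check at the end confirms that every index lands in a test fold exactly once.

```python
from sklearn.model_selection import KFold

a_demo = [1, 2, 3, 4, 5, 6, 7, 8]

# Assumption: any fixed integer seed works; 0 is arbitrary.
kf_fixed = KFold(n_splits=4, shuffle=True, random_state=0)

splits_first = [(tr.tolist(), te.tolist()) for tr, te in kf_fixed.split(a_demo)]
splits_second = [(tr.tolist(), te.tolist()) for tr, te in kf_fixed.split(a_demo)]

# Collect every test index across the four folds.
all_test_indices = sorted(i for _, te in splits_first for i in te)

print(splits_first == splits_second)
print(all_test_indices)
```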



**Stratified K-Fold**

Each fold keeps approximately the same class proportions as the full dataset.

In [85]:
from sklearn.model_selection import StratifiedKFold

a = [1,2,3,4,6,7,8,9,10]
b = [2,2,1,1,1,2,1,1,1]

sk_fold = StratifiedKFold(n_splits=3)

counter = 1

for train, test in sk_fold.split(a, b):
    
    print('Iteration ---',counter)
    print('Train[index]--Test[index]')
    print(train, test, '\n')

    counter=counter+1

Iteration --- 1
Train[index]--Test[index]
[1 4 5 6 7 8] [0 2 3] 

Iteration --- 2
Train[index]--Test[index]
[0 2 3 5 7 8] [1 4 6] 

Iteration --- 3
Train[index]--Test[index]
[0 1 2 3 4 6] [5 7 8] 
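
One way to confirm the balancing is to count the labels in each test fold. The sketch below reuses the same toy lists under new names (`a_demo`, `b_demo`). With six samples of class 1 and three of class 2 spread over three folds, each test fold should hold two of class 1 and one of class 2.

```python
from collections import Counter

from sklearn.model_selection import StratifiedKFold

a_demo = [1, 2, 3, 4, 6, 7, 8, 9, 10]
b_demo = [2, 2, 1, 1, 1, 2, 1, 1, 1]  # class labels: six 1s, three 2s

fold_counts = []
for _, test_idx in StratifiedKFold(n_splits=3).split(a_demo, b_demo):
    # Count how many samples of each class ended up in this test fold.
    fold_counts.append(Counter(b_demo[i] for i in test_idx))

print(fold_counts)
```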



**Grouped K-Fold**

All samples from the same group are kept in the same fold, so the model is never tested on a group it was trained on.

In [70]:
from sklearn.model_selection import GroupKFold

a = [1,2,3,4,6,7,8,9,10]
b = [2,2,1,1,1,2,1,1,1]
grp = [1,1,1,1,1,2,2,2,1]

grpk_fold = GroupKFold(n_splits=2) # n_splits cannot exceed the number of distinct groups (2 here)

counter = 1

for train, test in grpk_fold.split(a, b, groups=grp):
    
    print('Iteration ---',counter)
    print('Train[index]--Test[index]')
    print(train, test, '\n')

    counter=counter+1

Iteration --- 1
Train[index]--Test[index]
[5 6 7] [0 1 2 3 4 8] 

Iteration --- 2
Train[index]--Test[index]
[0 1 2 3 4 8] [5 6 7] 
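
The property that matters here is that no group appears on both sides of a split. A short sketch that checks this directly, reusing the toy data under new names (`a_demo`, `grp_demo`):

```python
from sklearn.model_selection import GroupKFold

a_demo = [1, 2, 3, 4, 6, 7, 8, 9, 10]
grp_demo = [1, 1, 1, 1, 1, 2, 2, 2, 1]  # group id for each sample

overlaps = []
for train_idx, test_idx in GroupKFold(n_splits=2).split(a_demo, groups=grp_demo):
    train_groups = {grp_demo[i] for i in train_idx}
    test_groups = {grp_demo[i] for i in test_idx}
    # An empty intersection means the split leaks no group into both sides.
    overlaps.append(train_groups & test_groups)

print(overlaps)
```

In practice the group id would be something like a patient, user, or recording session, where rows from the same source are too similar to count as unseen data.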

