### Cross Validation Techniques
Comparing the RandomForest, SVM, and Logistic Regression algorithms for classifying the flowers in the
dataset returned by load_iris().

### KFold vs StratifiedKFold 

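KFold is a cross-validator that divides the dataset into k consecutive folds; each fold serves once as the test set while the remaining k - 1 folds form the training set.

StratifiedKFold additionally ensures that each fold keeps approximately the same proportion of observations per label as the full dataset.

So StratifiedKFold is a variation of KFold that takes the class labels into account when building the folds.

Therefore, we should prefer StratifiedKFold over KFold for classification tasks, especially when the class distribution is imbalanced or the samples are ordered by class.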
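To make the difference concrete, here is a minimal sketch on a hypothetical imbalanced label vector (`y_toy`, `X_toy`, and the helper `fold_class_counts` are illustrative names, not part of the notebook). It counts how many samples of each class end up in every test fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 18 samples of class 0 followed by 6 of class 1.
y_toy = np.array([0] * 18 + [1] * 6)
X_toy = np.zeros((len(y_toy), 1))  # placeholder features; only the labels matter here


def fold_class_counts(splitter, X, y):
    """Return [count of class 0, count of class 1] for each test fold."""
    return [np.bincount(y[test_idx], minlength=2).tolist()
            for _, test_idx in splitter.split(X, y)]


kfold_counts = fold_class_counts(KFold(n_splits=3), X_toy, y_toy)
stratified_counts = fold_class_counts(StratifiedKFold(n_splits=3), X_toy, y_toy)

print("KFold test-fold class counts:          ", kfold_counts)
print("StratifiedKFold test-fold class counts:", stratified_counts)
```

With unshuffled KFold, the first two test folds contain no minority-class samples at all, while every stratified test fold keeps the 3:1 class ratio of the full label vector.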

In [1]:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedStratifiedKFold
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import numpy as np


In [2]:
iris_dataset = load_iris()
dir(iris_dataset)

['DESCR',
 'data',
 'data_module',
 'feature_names',
 'filename',
 'frame',
 'target',
 'target_names']

In [3]:
print(f'iris dataset feature names: {iris_dataset.feature_names}')
print()
print(f'iris dataset target names: {iris_dataset.target_names}')

iris dataset feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

iris dataset target names: ['setosa' 'versicolor' 'virginica']


In [4]:
print('iris dataset data values :', iris_dataset.data)
print()
print('iris dataset target values :', iris_dataset.target)

iris dataset data values : [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.6 1.4 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.

In [5]:
df_iris = pd.DataFrame(iris_dataset.data , columns = iris_dataset.feature_names)
df_iris

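| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |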


In [6]:
y = iris_dataset.target
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

#### As we have seen, the score obtained with the train_test_split method is not stable: it changes from run to run. The random_state parameter makes the output reproducible, but a problem remains: the score comes from a single split, so we do not get an average score over different train/test partitions of the dataset.
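The instability described above can be observed with a small sketch like the following. The seeds 0 to 4, the 30% test size, and the choice of SVC are arbitrary assumptions made for illustration; the exact scores depend on them.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

holdout_scores = []
for seed in range(5):  # arbitrary seeds, chosen only for illustration
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    # Each seed yields a different split, hence a potentially different score.
    holdout_scores.append(SVC().fit(X_train, y_train).score(X_test, y_test))

print("hold-out accuracy per seed:", holdout_scores)
print("spread (max - min):", max(holdout_scores) - min(holdout_scores))
```

Fixing a single seed hides this spread rather than removing it, which is the motivation for averaging over several folds.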

In [7]:
def accu_score(model,X_train,X_test,y_train,y_test):
    # Fit the model on the training fold and return its accuracy on the test fold.
    model.fit(X_train,y_train)
    return model.score(X_test,y_test)


### KFold vs StratifiedKFold Testing:
    

In [8]:
kf = KFold(n_splits=3)
skf = StratifiedKFold(n_splits=3)
re_skf = RepeatedStratifiedKFold(n_splits=3,n_repeats = 3)
#models creation
L_Reg = LogisticRegression(n_jobs=3)
svmc = SVC()
rfc = RandomForestClassifier(n_jobs=3)
lst_L_Reg = list()
lst_svmc = list()
lst_rfc = list()
i=0
for train_index,test_index in kf.split(iris_dataset.data):
    X_train,X_test,y_train,y_test = iris_dataset.data[train_index],iris_dataset.data[test_index],\
    iris_dataset.target[train_index],iris_dataset.target[test_index]
    i=i+1
    print(f'y_train iteration {i}: \n {y_train}')
    print(f'y_test iteration {i}: \n {y_test}')
    lst_L_Reg.append(accu_score(L_Reg,X_train,X_test,y_train,y_test))
    lst_svmc.append(accu_score(svmc,X_train,X_test,y_train,y_test))
    lst_rfc.append(accu_score(rfc,X_train,X_test,y_train,y_test))
print(f'mean accuracy score of Logistic Regression model : {np.mean(lst_L_Reg)}')
print(f'mean accuracy score of SVM classifier model : {np.mean(lst_svmc)}')
print(f'mean accuracy score of RandomForest Classifier model : {np.mean(lst_rfc)}')

y_train iteration 1: 
 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
y_test iteration 1: 
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0]
y_train iteration 2: 
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
y_test iteration 2: 
 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1]
y_train iteration 3: 
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
y_test iteration 3: 
 [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 

In [17]:
i=0
lst_L_Reg = list()
lst_svmc = list()
lst_rfc = list()
for train_index,test_index in skf.split(iris_dataset.data,iris_dataset.target):
    X_train,X_test,y_train,y_test = iris_dataset.data[train_index],iris_dataset.data[test_index],\
    iris_dataset.target[train_index],iris_dataset.target[test_index]
    i=i+1
    print(f'y_train iteration {i}: \n {y_train} \n counts of each label: \n {pd.Series(y_train).value_counts()}')
    print(f'y_test iteration {i}: \n {y_test} \n counts of each label: \n {pd.Series(y_test).value_counts()}')
    
    lst_L_Reg.append(accu_score(L_Reg,X_train,X_test,y_train,y_test))
    lst_svmc.append(accu_score(svmc,X_train,X_test,y_train,y_test))
    lst_rfc.append(accu_score(rfc,X_train,X_test,y_train,y_test))
print(f'mean accuracy score of Logistic Regression model : {np.mean(lst_L_Reg)}')
print(f'mean accuracy score of SVM classifier model : {np.mean(lst_svmc)}')
print(f'mean accuracy score of RandomForest Classifier model : {np.mean(lst_rfc)}')

y_train iteration 1: 
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2] 
 counts of each label: 
 2    34
0    33
1    33
dtype: int64
y_test iteration 1: 
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2] 
 counts of each label: 
 0    17
1    17
2    16
dtype: int64
y_train iteration 2: 
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2] 
 counts of each label: 
 1    34
0    33
2    33
dtype: int64
y_test iteration 2: 
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2] 
 counts of each label: 
 0    17
2    17
1    16
dtype: int64
y_train iteration 3: 
 [0 0 0 0 0 0 0 0 0 0 

#### RepeatedStratifiedKFold

RepeatedStratifiedKFold repeats stratified k-fold cross-validation n_repeats times, with a different randomization in each repetition. Unless random_state is fixed, each run of the procedure therefore splits the dataset into different stratified folds, and the performance results differ from run to run.

RepeatedStratifiedKFold has the benefit of improving the estimate of the model's performance, at the cost of fitting and evaluating many more models. If, for example, 5 repeats (i.e., n_repeats=5) of 10-fold cross-validation were used, 50 different models would need to be fitted (trained) and evaluated, which might be computationally expensive depending on the dataset's size, type of machine learning algorithm, device specifications, etc. However, the fits are independent of each other, so they can be executed on different cores or different machines, which can dramatically speed up the process. For instance, passing n_jobs=-1 to cross_val_score uses all the cores available on your system.
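The 5 x 10 = 50 arithmetic can be checked directly. The sketch below assumes SVC as the estimator and random_state=0 for reproducibility; neither choice comes from the notebook. n_jobs is left at 1 here, and setting it to -1 would distribute the fits over all available cores.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 10 folds repeated 5 times; random_state=0 is an assumption for reproducibility.
rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
n_fits = rskf.get_n_splits()  # n_splits * n_repeats
print("number of train/test splits:", n_fits)

# One model is fitted and scored per split; n_jobs=-1 would use all cores.
scores = cross_val_score(SVC(), X, y, cv=rskf, n_jobs=1)
print("number of scores:", len(scores))
print("mean accuracy:", scores.mean(), "std:", scores.std())
```

Reporting the standard deviation next to the mean shows how much the estimate varies across the 50 splits.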

In [18]:
i=0
lst_L_Reg = list()
lst_svmc = list()
lst_rfc = list()
for train_index,test_index in re_skf.split(iris_dataset.data,iris_dataset.target):
    X_train,X_test,y_train,y_test = iris_dataset.data[train_index],iris_dataset.data[test_index],\
    iris_dataset.target[train_index],iris_dataset.target[test_index]
    i=i+1
    print(f'y_train iteration {i}: \n {y_train} \n counts of each label: \n',pd.Series(y_train).value_counts())
    print(f'y_test iteration {i}: \n  {y_test}\n counts of each label: \n{pd.Series(y_test).value_counts()}')
    
    lst_L_Reg.append(accu_score(L_Reg,X_train,X_test,y_train,y_test))
    lst_svmc.append(accu_score(svmc,X_train,X_test,y_train,y_test))
    lst_rfc.append(accu_score(rfc,X_train,X_test,y_train,y_test))
print(f'mean accuracy score of Logistic Regression model : {np.mean(lst_L_Reg)}')
print(f'mean accuracy score of SVM classifier model : {np.mean(lst_svmc)}')
print(f'mean accuracy score of RandomForest Classifier model : {np.mean(lst_rfc)}')

y_train iteration 1: 
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2] 
 counts of each label: 
 2    34
0    33
1    33
dtype: int64
y_test iteration 1: 
  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2]
 counts of each label: 
0    17
1    17
2    16
dtype: int64
y_train iteration 2: 
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2] 
 counts of each label: 
 1    34
0    33
2    33
dtype: int64
y_test iteration 2: 
  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2]
 counts of each label: 
0    17
2    17
1    16
dtype: int64
y_train iteration 3: 
 [0 0 0 0 0 0 0 0 0 0 0 

As we can see, StratifiedKFold takes the y labels into account when splitting the dataset, so the class distribution is approximately preserved in every split: each training and test fold contains almost the same number of samples from each class. With KFold, in contrast, the folds are imbalanced: because the iris samples are ordered by class and KFold does not shuffle by default, each training set contains labels from only 2 classes and the test set contains only the remaining class.
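If KFold has to be used anyway, one possible mitigation (not used in this notebook) is to shuffle before splitting. The sketch below lists which classes appear in each test fold; `classes_per_test_fold` is a hypothetical helper and random_state=0 is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)


def classes_per_test_fold(splitter):
    """Return the sorted list of distinct labels found in each test fold."""
    return [np.unique(y[test_idx]).tolist() for _, test_idx in splitter.split(X)]


plain = classes_per_test_fold(KFold(n_splits=3))
shuffled = classes_per_test_fold(KFold(n_splits=3, shuffle=True, random_state=0))

print("classes per test fold, shuffle=False:", plain)
print("classes per test fold, shuffle=True: ", shuffled)
```

Shuffling mixes the classes across folds but does not guarantee equal proportions, so StratifiedKFold remains the safer default for classification.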

### class sklearn.model_selection.KFold(n_splits=5, *, shuffle=False, random_state=None)

K-Folds cross-validator

Provides train/test indices to split data in train/test sets. Split dataset into k consecutive folds (without shuffling by default).

Each fold is then used once as a validation while the k - 1 remaining folds form the training set.

*Parameters*:

*n_splits* : int, default=5

Number of folds. Must be at least 2.

Changed in version 0.22: n_splits default value changed from 3 to 5.

*shuffle* : bool, default=False

Whether to shuffle the data before splitting into batches. Note that the samples within each split will not be shuffled.

*random_state* : int, RandomState instance or None, default=None

When shuffle is True, random_state affects the ordering of the indices, which controls the randomness of each fold. Otherwise, this parameter has no effect. Pass an int for reproducible output across multiple function calls. See Glossary.


split method:

*split*:(X, y=None, groups=None)
Generate indices to split data into training and test set.

Parameters:
'X':array-like of shape (n_samples, n_features)
Training data, where n_samples is the number of samples and n_features is the number of features.

'y':array-like of shape (n_samples,), default=None
The target variable for supervised learning problems.

'groups':array-like of shape (n_samples,), default=None
Group labels for the samples used while splitting the dataset into train/test set.

Yields:
'train':ndarray
The training set indices for that split.

'test':ndarray
The testing set indices for that split.
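As a small illustration of what split yields, the sketch below runs KFold on six hypothetical samples (`X_small` is an illustrative array, not notebook data) and collects the index arrays.

```python
import numpy as np
from sklearn.model_selection import KFold

X_small = np.arange(12).reshape(6, 2)  # 6 hypothetical samples, 2 features each

# split() yields (train_indices, test_indices) pairs, one per fold.
splits = [(train_idx.tolist(), test_idx.tolist())
          for train_idx, test_idx in KFold(n_splits=3).split(X_small)]

for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
# train: [2, 3, 4, 5] test: [0, 1]
# train: [0, 1, 4, 5] test: [2, 3]
# train: [0, 1, 2, 3] test: [4, 5]
```

The test folds are consecutive blocks of indices because shuffle defaults to False.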

### Stratified K-Folds cross-validator

Provides train/test indices to split data in train/test sets.

This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.

Parameters:	

*n_splits* : int, default=5

Number of folds. Must be at least 2.

Changed in version 0.22: n_splits default value changed from 3 to 5.

*shuffle* : bool, default=False

Whether to shuffle each stratification of the data before splitting into batches.

*random_state* : int, RandomState instance or None, optional, default=None

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when shuffle == True.

*split method*

split(X, y, groups=None)
Generate indices to split data into training and test set.

Parameters:	
*X* : array-like, shape (n_samples, n_features)

Training data, where n_samples is the number of samples and n_features is the number of features.

Note that providing y is sufficient to generate the splits and hence np.zeros(n_samples) may be used as a placeholder for X instead of actual training data.

*y* : array-like, shape (n_samples,)

The target variable for supervised learning problems. Stratification is done based on the y labels.

*groups* : object

Always ignored, exists for compatibility.

Yields:

*train* : ndarray

The training set indices for that split.

*test* : ndarray

The testing set indices for that split.
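The note about using a placeholder for X can be sketched as follows, with a hypothetical balanced label vector `y_labels`. Only the labels drive the stratification, so an array of zeros stands in for the features.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # hypothetical labels
X_placeholder = np.zeros(len(y_labels))        # stands in for the real features

skf = StratifiedKFold(n_splits=2)
# Count how many samples of each class land in every test fold.
test_label_counts = [np.bincount(y_labels[test_idx]).tolist()
                     for _, test_idx in skf.split(X_placeholder, y_labels)]

print("per-class counts in each test fold:", test_label_counts)
# per-class counts in each test fold: [[2, 2], [2, 2]]
```

The returned indices can then be applied to the real feature matrix, as the notebook does with iris_dataset.data.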

*If cv is an int (or None) and the estimator is a classifier with binary or multiclass y, StratifiedKFold is used; otherwise the KFold strategy is used.*

*cv*:int, cross-validation generator or an iterable, default=None

Determines the cross-validation splitting strategy. Possible inputs for cv are:

None, to use the default 5-fold cross validation,

int, to specify the number of folds in a (Stratified)KFold,

CV splitter,

An iterable that generates (train, test) splits as arrays of indices.

For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, 
StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with 
shuffle=False so the splits will be the same across calls.
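The rule above can be checked with a short sketch: for a classifier, passing cv=5 should give the same scores as passing an explicit unshuffled StratifiedKFold(n_splits=5). SVC is used here as an assumption because it is deterministic on this data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# An integer cv with a classifier and multiclass y resolves to StratifiedKFold.
scores_int = cross_val_score(SVC(), X, y, cv=5)
scores_skf = cross_val_score(SVC(), X, y, cv=StratifiedKFold(n_splits=5))

print("cv=5:                    ", scores_int)
print("StratifiedKFold(5):      ", scores_skf)
print("identical:", np.allclose(scores_int, scores_skf))
```

Passing an explicit splitter is mainly useful when shuffling or a fixed random_state is wanted, since the integer shortcut always uses shuffle=False.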

In [20]:
print(f'Logistic Regression accuracy scores: {cross_val_score(L_Reg,iris_dataset.data,iris_dataset.target)} mean score: {np.mean(cross_val_score(L_Reg,iris_dataset.data,iris_dataset.target))}')
print(f'svm classifier accuracy scores: {cross_val_score(svmc,iris_dataset.data,iris_dataset.target)} mean score: {np.mean(cross_val_score(svmc,iris_dataset.data,iris_dataset.target))}')
print(f'rf classifier accuracy scores: {cross_val_score(rfc,iris_dataset.data,iris_dataset.target)} mean score: {np.mean(cross_val_score(rfc,iris_dataset.data,iris_dataset.target))}')

# With cv left at its default, cross_val_score runs 5-fold cross-validation, which is
# StratifiedKFold for classifiers; compared with the earlier runs it gives a slight
# improvement in accuracy for the L_Reg and svmc models.
# Note: each print calls cross_val_score twice. The random forest is not seeded, so its
# printed mean comes from a different run than its printed per-fold scores.

Logistic Regression accuracy scores: [0.96666667 1.         0.93333333 0.96666667 1.        ] mean score: 0.9733333333333334
svm classifier accuracy scores: [0.96666667 0.96666667 0.96666667 0.93333333 1.        ] mean score: 0.9666666666666666
rf classifier accuracy scores: [0.96666667 0.96666667 0.93333333 0.9        1.        ] mean score: 0.9666666666666668
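A variant of the comparison that calls cross_val_score only once per model, so that the reported mean always belongs to the reported scores, could look like the sketch below. The max_iter=1000 and random_state=0 settings are assumptions added for convergence and reproducibility; they are not the notebook's settings, so the numbers may differ from the output above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Assumed settings: max_iter=1000 and random_state=0 are not from the notebook.
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
}

summary = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # computed once, reused below
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: scores={scores} mean={scores.mean():.4f} std={scores.std():.4f}")
```

Keeping the standard deviation alongside the mean helps judge whether differences of about one percentage point between models are meaningful on a dataset of 150 samples.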


### Conclusion:

Logistic Regression performed better than svmc and rfc, with a mean accuracy of 97.33 percent in classifying the flowers of the load_iris dataset; the SVM classifier and the random forest classifier each achieved a mean accuracy of 96.67 percent.