# Scikit Learn Tutorial #10 - Cross Validation and Model Selection

<table align="left"><td>
  <a target="_blank"  href="https://colab.research.google.com/github/TannerGilbert/Tutorials/blob/master/Scikit-Learn-Tutorial/10.%20Cross%20Validation%20and%20Model%20Selection.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab
  </a>
</td><td>
  <a target="_blank"  href="https://github.com/TannerGilbert/Tutorials/blob/master/Scikit-Learn-Tutorial/10.%20Cross%20Validation%20and%20Model%20Selection.ipynb">
    <img width=32px src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
</td></table>

![Scikit Learn Logo](http://scikit-learn.org/stable/_static/scikit-learn-logo-small.png)

## What is Model Selection?

Model selection is the process of choosing one model over the others. To do this, we need a metric that defines when one model is better than another. Which metric is appropriate depends on your problem.

## Model Selection using Scikit Learn

### Loading in Datasets

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'label'])
le = LabelEncoder()
iris['label'] = le.fit_transform(iris['label'])
iris.head()

|   | sepal_length | sepal_width | petal_length | petal_width | label |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


Splitting the data into independent variables (X) and the dependent variable (y):

In [2]:
X = np.array(iris.drop(['label'], axis=1))
y = np.array(iris['label'])

### Importing Models
For the comparison we will load a few different models, ranging from a simple linear model like logistic regression to more mathematically advanced ones like an SVM.

In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

models = [
    ('LR', LogisticRegression()),
    ('NB', GaussianNB()),
    ('SVM', SVC()),
    ('KNN', KNeighborsClassifier()),
    ('DT', DecisionTreeClassifier()),
]

### Comparing Models 
The metric for this comparison is the accuracy score. Accuracy is a very simple metric that can be misleading, for example on imbalanced datasets, so it shouldn't be the only metric you rely on for most problems.

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

for name, model in models:
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    print(name, accuracy)

LR 0.9666666666666667
NB 0.9666666666666667
SVM 1.0
KNN 0.9833333333333333
DT 0.9666666666666667
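
To illustrate why accuracy alone can be misleading, here is a minimal sketch on a made-up, heavily imbalanced label set (the 95/5 split and the placeholder features are assumptions for illustration, not part of the iris data). A baseline that always predicts the majority class reaches a high accuracy while never detecting the minority class, which the macro-averaged F1 score makes visible.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels: 95 samples of class 0, 5 samples of class 1
y_true = np.array([0] * 95 + [1] * 5)
X_dummy = np.zeros((100, 1))  # placeholder features; the baseline ignores them

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy, y_true)
y_pred = baseline.predict(X_dummy)

baseline_accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 silences the warning for the class that is never predicted
baseline_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print("accuracy:", baseline_accuracy)
print("macro F1:", round(baseline_f1, 3))
```

The accuracy looks good even though the model has learned nothing useful, so it is worth checking at least one additional metric when the classes are not balanced.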


### Cross Validation

In the example above we used <i>train_test_split</i> to create our training and testing sets. Even though this is a perfectly valid method, it has a few drawbacks. Because we are splitting the original dataset, we have less data to train on, which could be a problem. A solution to this problem is a procedure called cross-validation (CV for short). A basic approach to CV, called k-fold CV, works as follows:

1. The dataset is split into k smaller sets (folds).
2. A model is trained using k-1 of the folds as training data.
3. The remaining fold is used for validation.
4. Steps 2 and 3 are repeated until every fold has been used for validation once.

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach doesn't "waste data" but it is a lot more computationally expensive than something like <i>train_test_split</i>.
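
The four steps above can be sketched by hand with plain NumPy indexing. This is only an illustration of the procedure, not how Scikit Learn implements it internally; loading iris through `load_iris`, the choice of `KNeighborsClassifier`, the seed, and `k = 5` are assumptions made to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

k = 5
rng = np.random.default_rng(42)
indices = rng.permutation(len(X))   # shuffle so every fold contains all classes
folds = np.array_split(indices, k)  # step 1: split into k smaller sets

fold_scores = []
for i in range(k):
    test_idx = folds[i]                                    # step 3: held-out fold
    train_idx = np.concatenate(folds[:i] + folds[i + 1:])  # step 2: remaining k-1 folds

    model = KNeighborsClassifier()
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))  # accuracy on the held-out fold

# The reported performance is the average over all k folds
mean_score = float(np.mean(fold_scores))
print(fold_scores, mean_score)
```

Every sample is used for validation exactly once and for training k-1 times, which is why no data is "wasted".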

In [5]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
scores = cross_val_score(clf, X, y, cv=5)
scores

array([1.        , 0.96666667, 0.93333333, 0.9       , 1.        ])

In [6]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.96 (+/- 0.08)


By default, the score computed at each CV iteration comes from the estimator's score method (accuracy for classifiers). It is possible to change this by using the scoring parameter.

In [7]:
# 'f1_macro' averages the per-class F1 scores with equal weight
scores = cross_val_score(clf, X, y, cv=5, scoring='f1_macro')
scores

array([1.        , 0.96658312, 0.93333333, 0.89769821, 1.        ])
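
Besides a predefined name, the scoring parameter also accepts a scorer object. The sketch below builds one with `make_scorer` and checks that it gives the same result as the `'f1_macro'` string. Loading the data with `load_iris` and setting `max_iter=1000` are assumptions made so the example runs on its own without convergence warnings.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# Wrap a metric function into a scorer; extra keyword arguments are passed to the metric
macro_f1 = make_scorer(f1_score, average="macro")

scores_from_name = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
scores_from_scorer = cross_val_score(clf, X, y, cv=5, scoring=macro_f1)

print(scores_from_name)
print(scores_from_scorer)
```

Building the scorer yourself is mainly useful when the metric needs custom arguments or when you want to use your own metric function.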

We can also use more than one scoring method.

In [8]:
from sklearn.model_selection import cross_validate  # allows us to use multiple scoring metrics

scoring = ['precision_macro', 'recall_macro']
# cross_validate can also return the training scores; here we only keep the test scores
scores = cross_validate(clf, X, y, cv=5, scoring=scoring,
                        return_train_score=False)
scores

{'fit_time': array([0.0010004 , 0.00100017, 0.00099993, 0.00100017, 0.        ]),
 'score_time': array([0.00099969, 0.00099969, 0.00100017, 0.00099993, 0.00200009]),
 'test_precision_macro': array([1.        , 0.96969697, 0.93333333, 0.92307692, 1.        ]),
 'test_recall_macro': array([1.        , 0.96666667, 0.93333333, 0.9       , 1.        ])}
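
The dictionary returned by `cross_validate` holds one array per metric, so it usually helps to reduce each array to a mean and a standard deviation. A minimal sketch, assuming the same two metrics as above and again using `load_iris` and `max_iter=1000` only to make the example self-contained:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

scores = cross_validate(clf, X, y, cv=5,
                        scoring=["precision_macro", "recall_macro"])

# Keep only the metric entries (keys starting with "test_"), skipping the timing entries
summary = {}
for key, values in scores.items():
    if key.startswith("test_"):
        summary[key] = (values.mean(), values.std())
        print(f"{key}: {values.mean():.3f} (+/- {values.std() * 2:.3f})")
```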

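Instead of passing an integer to cv, we can also create the folds ourselves with KFold, which yields the train and test indices for each split: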

In [9]:
from sklearn.model_selection import KFold

kfold = KFold(n_splits=3, shuffle=True, random_state=42)

for train, test in kfold.split(X):
    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]
    print(X_train[:5])

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.4 3.9 1.7 0.4]]
[[4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [4.6 3.4 1.4 0.3]]
[[5.1 3.5 1.4 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]]
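
Putting both topics together, cross-validation can replace the single train/test split in the model comparison from earlier. The following is a sketch under a few assumptions: the data is loaded with `load_iris`, `StratifiedKFold` is used so that each fold keeps the class proportions, and `max_iter=1000` and `random_state=42` are set only for reproducibility and to avoid convergence warnings.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = [
    ("LR", LogisticRegression(max_iter=1000)),
    ("NB", GaussianNB()),
    ("SVM", SVC()),
    ("KNN", KNeighborsClassifier()),
    ("DT", DecisionTreeClassifier(random_state=42)),
]

# Use the same folds for every model so the scores are comparable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, model in models:
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")

# Pick the model with the highest mean CV score
best_name = max(results, key=results.get)
print("Best model:", best_name)
```

On a dataset as small and easy as iris the models end up very close to each other, so the differences should be read together with the spread across folds rather than as a definitive ranking.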


## Resources

<ul>
    <li><a href="http://scikit-learn.org/stable/modules/cross_validation.html">Cross Validation (Scikit Learn Documentation)</a></li>
    <li><a href="https://en.wikipedia.org/wiki/Cross-validation_(statistics)">Cross Validation (Wikipedia)</a></li>
</ul>

## Conclusion

That was a quick overview of model selection and cross-validation and how to implement them in Scikit Learn.
I hope you liked this tutorial. If you did, consider subscribing to my <a href="https://www.youtube.com/channel/UCBOKpYBjPe2kD8FSvGRhJwA">YouTube Channel</a> or following me on social media. If you have any questions, feel free to contact me.