# How can we choose the best model using Cross-Validation?

Based on Machine Learning Practices by Kevin Markham

- What is the __drawback of using the train/test split procedure__ for model evaluation?
- How does __K-fold cross-validation__ overcome this limitation?
- How can cross-validation be used for selecting __tuning parameters__, choosing between models, and selecting features?
- What are some possible __improvements__ to cross-validation?

_Let's explore these questions using the Iris dataset_

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

#### Using KNN Model

__Testing accuracy for one specific train/test split (random_state=4):__

In [2]:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state = 4) 

In [5]:
# KNN with K = 5
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
# Fraction of test observations that were classified correctly
metrics.accuracy_score(y_test, y_pred)

0.9736842105263158

__Testing accuracy for a different train/test split (random_state=3):__

In [7]:
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same data and model as before; only the random split changes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

# KNN with K = 5
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
metrics.accuracy_score(y_test, y_pred)

0.9473684210526315

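___The testing accuracy changed even though the data and the model are the same: the estimate depends on which observations happen to land in the test set.___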

In [13]:
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Repeat the split with 20 different seeds to see how much the accuracy varies
for num in range(20):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=num)
    # KNN with K = 5
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    print('Random_State: %d Testing Accuracy: %s' % (num, metrics.accuracy_score(y_test, y_pred)))

Random_State: 0 Testing Accuracy: 0.9736842105263158
Random_State: 1 Testing Accuracy: 1.0
Random_State: 2 Testing Accuracy: 1.0
Random_State: 3 Testing Accuracy: 0.9473684210526315
Random_State: 4 Testing Accuracy: 0.9736842105263158
Random_State: 5 Testing Accuracy: 0.9473684210526315
Random_State: 6 Testing Accuracy: 0.9736842105263158
Random_State: 7 Testing Accuracy: 0.8947368421052632
Random_State: 8 Testing Accuracy: 0.9210526315789473
Random_State: 9 Testing Accuracy: 1.0
Random_State: 10 Testing Accuracy: 0.9736842105263158
Random_State: 11 Testing Accuracy: 0.9736842105263158
Random_State: 12 Testing Accuracy: 0.9736842105263158
Random_State: 13 Testing Accuracy: 0.8947368421052632
Random_State: 14 Testing Accuracy: 0.9736842105263158
Random_State: 15 Testing Accuracy: 0.9736842105263158
Random_State: 16 Testing Accuracy: 0.9210526315789473
Random_State: 17 Testing Accuracy: 0.9473684210526315
Random_State: 18 Testing Accuracy: 1.0
Random_State: 19 Testing Accuracy: 0.9473684

___To get a more stable estimate, we can create many train/test splits, calculate the testing accuracy for each, and average the results. This is the idea behind cross-validation.___
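
The averaging idea can be sketched as follows. This is a minimal illustration, not part of the original notebook: it reuses the same 20 seeds and KNN with K = 5, and the names `scores`, `mean_accuracy`, and `std_accuracy` are chosen here for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Collect one testing accuracy per random split
scores = []
for seed in range(20):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, knn.predict(X_test)))

# The mean is the averaged estimate; the standard deviation shows the spread across splits
mean_accuracy = np.mean(scores)
std_accuracy = np.std(scores)
print('Mean accuracy: %.4f (std: %.4f)' % (mean_accuracy, std_accuracy))
```

Note that random splits like these can overlap, so some observations may be tested many times and others never. K-fold cross-validation avoids this by making every observation appear in a test set exactly once.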

### Cross-Validation

![alt](https://github.com/emunozlorenzo/MasterDataScience/blob/master/06_Machine_Learning_on_my_own/Img/cross.png)
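
As a hedged sketch of what K-fold cross-validation looks like in scikit-learn, the example below uses `cross_val_score` with KNN (K = 5) on the Iris data. The choice of 10 folds and the variable name `cv_scores` are assumptions made for this illustration. With an integer `cv` and a classifier, scikit-learn uses stratified folds, so each fold keeps roughly the same class proportions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=5)

# cv=10 (assumed here): split the data into 10 folds, train on 9 and test on 1, ten times
cv_scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')

print(cv_scores)         # one accuracy value per fold
print(cv_scores.mean())  # averaged estimate of out-of-sample accuracy
```

Passing the full `X` and `y` is intentional: `cross_val_score` handles the splitting itself, so no separate `train_test_split` call is needed.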