
# Train/test split
- Split the dataset into two pieces: a training set and a testing set.
- Train the model on the training set.
- Test the model on the testing set, and evaluate how well we did.

In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# read in the iris data
iris = load_iris()

# create X (features) and y (response)
X = iris.data
y = iris.target

# split into training and testing sets; a different random_state value produces a different split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

# check classification accuracy of KNN with K=5
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))

0.9736842105263158
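The split above uses a single `random_state`, so the reported accuracy reflects one particular partition of the data. As a minimal sketch (the seed values 1 to 5 are arbitrary choices for illustration, not taken from the original notebook), the same procedure can be repeated with several seeds to see how much the estimate depends on the split:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

X, y = load_iris(return_X_y=True)

# Repeat the same train/test procedure with several seeds (arbitrary values).
accuracies = {}
for seed in range(1, 6):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    accuracies[seed] = metrics.accuracy_score(y_test, knn.predict(X_test))

# Print one accuracy per seed so the spread is visible.
for seed, acc in accuracies.items():
    print("random_state=%d: accuracy=%0.4f" % (seed, acc))
```

Any spread between these numbers comes from the split alone, since the data and the model are identical in every run.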


# K-fold cross-validation
1. Split the dataset into K equal partitions (or "folds").
2. Use fold 1 as the testing set and the union of the other folds as the training set.
3. Calculate testing accuracy.
4. Repeat steps 2 and 3 until each of the K folds has been used as the testing set exactly once.
5. Use the average testing accuracy as the estimate of out-of-sample accuracy.
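The steps above can be sketched by hand with `KFold`, which yields the train and test indices for each fold. This is only an illustration of the procedure, not how the notebook computes its scores: the choice of 5 folds, `shuffle=True`, and `random_state=0` are assumptions made here (shuffling is used because the iris rows are ordered by class).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Step 1: split the row indices into K folds (K=5 and the seed are arbitrary).
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Step 2: one fold is the testing set, the remaining folds form the training set.
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X[train_idx], y[train_idx])
    # Step 3: testing accuracy for this fold.
    fold_scores.append(knn.score(X[test_idx], y[test_idx]))

# Final step: average the per-fold accuracies.
print(fold_scores)
print("Mean accuracy: %0.2f" % np.mean(fold_scores))
```

Note that `cross_val_score` with an integer `cv` and a classifier uses stratified folds, so its per-fold scores will generally differ from the ones produced by this plain `KFold` loop.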

In [3]:
from sklearn.model_selection import cross_val_score

# 10-fold cross-validation with K=5 for KNN (the n_neighbors parameter)
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print(scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[1.         0.93333333 1.         1.         0.86666667 0.93333333
 0.93333333 1.         1.         1.        ]
Accuracy: 0.97 (+/- 0.09)


In [6]:
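# use ShuffleSplit as the cross-validation strategy:
# 10 independent random splits, each holding out 30% of the data for testing
# (unlike K-fold, the test sets of different splits may overlap)

from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
scores = cross_val_score(knn, X, y, cv=cv, scoring='accuracy')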
print(scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[0.97777778 0.95555556 0.95555556 0.93333333 0.97777778 0.95555556
 0.97777778 0.97777778 0.97777778 0.97777778]
Accuracy: 0.97 (+/- 0.03)


# Another way to run K-fold cross-validation
##### Using the `cross_validate` function
- It differs from `cross_val_score` in two ways:
  - It allows specifying multiple metrics for evaluation.
  - It returns a dict containing fit times and score times along with the test scores.

In [17]:
from sklearn.model_selection import cross_validate
# evaluate two metrics per fold; each appears in the result as 'test_<metric name>'
scoring = ['accuracy', 'recall_macro']
scores = cross_validate(knn, X, y, cv=4, scoring=scoring)
print(scores.keys())
print(scores['fit_time'])
print(scores['score_time'])
print(scores['test_accuracy'])
print(scores['test_recall_macro'])
print("Accuracy: %0.2f (+/- %0.2f)" % (scores['test_accuracy'].mean(), scores['test_accuracy'].std() * 2))
print("recall_macro: %0.2f (+/- %0.2f)" % (scores['test_recall_macro'].mean(), scores['test_recall_macro'].std() * 2))



dict_keys(['fit_time', 'score_time', 'test_accuracy', 'test_recall_macro'])
[0.00080967 0.00069928 0.0004921  0.00042677]
[0.00498986 0.0045197  0.0036006  0.00251913]
[0.97368421 0.94736842 0.94594595 1.        ]
[0.97435897 0.94444444 0.94871795 1.        ]
Accuracy: 0.97 (+/- 0.04)
recall_macro: 0.97 (+/- 0.04)
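To make the relationship between the two functions concrete, here is a small sketch that runs both with a single metric. With a single string for `scoring`, `cross_validate` stores the scores under the key `test_score`; the `return_train_score=True` argument is an extra option used here for illustration and does not appear in the cell above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

# Same estimator, same folds, same metric for both calls.
single = cross_val_score(knn, X, y, cv=4, scoring='accuracy')
results = cross_validate(knn, X, y, cv=4, scoring='accuracy',
                         return_train_score=True)

# cross_validate returns a dict; cross_val_score returns only the test scores.
print(sorted(results.keys()))
print(np.allclose(single, results['test_score']))
```

Because both calls use the same deterministic folds and estimator, the test scores should agree, and `cross_validate` additionally reports timings and (when requested) training scores.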


# Comparing cross-validation to train/test split

### Advantages of cross-validation:

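- More reliable estimate of out-of-sample accuracy: averaging over K testing folds makes the estimate less dependent on which observations happen to land in a single testing set
- More "efficient" use of data (every observation is used for both training and testing)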
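The claim that every observation is used for both training and testing can be checked by counting how often each row index appears on each side of a K-fold split. This is a minimal sketch: the sample count of 150 (the size of the iris dataset), K=10, and the placeholder feature array are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples, k = 150, 10
placeholder_X = np.zeros((n_samples, 1))  # only the number of rows matters here

test_counts = np.zeros(n_samples, dtype=int)
train_counts = np.zeros(n_samples, dtype=int)

# Tally, for every row, how many folds use it for testing and for training.
for train_idx, test_idx in KFold(n_splits=k).split(placeholder_X):
    test_counts[test_idx] += 1
    train_counts[train_idx] += 1

print(np.unique(test_counts))   # distinct values of "times used for testing"
print(np.unique(train_counts))  # distinct values of "times used for training"
```

Each row is tested exactly once and used for training in the other K-1 folds, whereas a single train/test split never tests on the training rows and never trains on the testing rows.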

### Advantages of train/test split:
- Runs roughly K times faster than K-fold cross-validation, because the model is fit and evaluated once instead of K times