# Cross Validation

In [12]:
import numpy as np
from sklearn import cross_validation
from sklearn import datasets
from sklearn import neighbors
from sklearn import metrics

iris = datasets.load_iris()
iris.data.shape, iris.target.shape

((150, 4), (150,))

## Computing cross-validated scores
The [cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.cross_val_score.html#sklearn.cross_validation.cross_val_score) function fits the estimator on each training split and returns one score per cross-validation fold.


In [8]:
knn = neighbors.KNeighborsClassifier()
scores = cross_validation.cross_val_score(knn, iris.data, iris.target)

In [9]:
print(scores)
print("Accuracy: {} Standard Deviation: {}".format(scores.mean(), scores.std()))

[ 0.98039216  0.98039216  1.        ]
Accuracy: 0.986928104575 Standard Deviation: 0.00924322589786
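
The `sklearn.cross_validation` module used above was deprecated in scikit-learn 0.18 and removed in 0.20. The following is a minimal sketch of the same computation with `sklearn.model_selection`, assuming a recent scikit-learn release. `cv=3` is passed explicitly because newer releases default to 5 folds, and the per-fold scores may differ slightly from the output above.

```python
from sklearn import datasets, neighbors
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
knn = neighbors.KNeighborsClassifier()

# One accuracy value per fold; the classifier is refit on each training split.
scores = cross_val_score(knn, iris.data, iris.target, cv=3)

print(scores)
print("Accuracy: {:.4f} Standard Deviation: {:.4f}".format(scores.mean(), scores.std()))
```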


By default, the score is computed with the estimator's score method. This can be changed with the [scoring parameter](http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter).

In [15]:
knn = neighbors.KNeighborsClassifier()
scores = cross_validation.cross_val_score(knn, iris.data, iris.target, cv=5, scoring='f1_weighted')
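
The cell above computes the scores without displaying them. As a hedged sketch, assuming a recent scikit-learn release where `cross_val_score` lives in `sklearn.model_selection`, the two scorers can be compared side by side. The `results` dictionary is an illustrative name, not part of the original notebook.

```python
from sklearn import datasets, neighbors
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
knn = neighbors.KNeighborsClassifier()

# Run the same 5-fold cross validation under two different scorers.
results = {}
for scoring in ("accuracy", "f1_weighted"):
    results[scoring] = cross_val_score(
        knn, iris.data, iris.target, cv=5, scoring=scoring
    )

for scoring, fold_scores in results.items():
    print("{}: mean={:.4f} std={:.4f}".format(scoring, fold_scores.mean(), fold_scores.std()))
```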

[cross_val_predict](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.cross_val_predict.html#sklearn.cross_validation.cross_val_predict) generates a cross-validated estimate for each input data point. For each element in the input, it returns the prediction that was obtained for that element when it was in the test set. Only cross-validation strategies that assign every element to a test set exactly once can be used (otherwise, an exception is raised).

In [12]:
predicted = cross_validation.cross_val_predict(knn, iris.data, iris.target, cv=10)

In [13]:
print(predicted)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1
 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
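
Because every sample receives exactly one out-of-fold prediction, the predictions can be compared directly with the true labels. A minimal sketch, assuming a recent scikit-learn release (`sklearn.model_selection` in place of `sklearn.cross_validation`), that summarizes them with the `metrics` module imported at the top of this notebook:

```python
from sklearn import datasets, metrics, neighbors
from sklearn.model_selection import cross_val_predict

iris = datasets.load_iris()
knn = neighbors.KNeighborsClassifier()

# Each prediction comes from a model that did not see that sample during fitting.
predicted = cross_val_predict(knn, iris.data, iris.target, cv=10)

accuracy = metrics.accuracy_score(iris.target, predicted)
confusion = metrics.confusion_matrix(iris.target, predicted)

print("Accuracy: {:.4f}".format(accuracy))
print(confusion)  # rows are true classes, columns are predicted classes
```

Note that this pooled accuracy is not the same quantity as the mean of the per-fold scores from cross_val_score, although the two are usually close.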


# Cross Validation Iterators

## K-Fold

In [14]:
from sklearn.cross_validation import KFold

In [15]:
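kf = KFold(4, n_folds=2)  # 4 samples split into 2 folds
for train_index, test_index in kf:
    print(train_index)  # indices used for fitting
    print(test_index)   # indices held out for evaluation
    print("------")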

[2 3]
[0 1]
------
[0 1]
[2 3]
------
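
In recent scikit-learn releases, `KFold` lives in `sklearn.model_selection`, takes `n_splits` instead of the sample count and `n_folds`, and yields indices through a `split` method. A minimal sketch of the same 4-sample, 2-fold split under that assumption (the array `X` and the `splits` list are illustrative names):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(4)  # four placeholder samples

kf = KFold(n_splits=2)  # no shuffling, so folds are contiguous blocks
splits = list(kf.split(X))

for train_index, test_index in splits:
    print(train_index)
    print(test_index)
    print("------")
```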


# Grid Search

To find the names and current values of all parameters for a given estimator, use `estimator.get_params()`.
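
For example, a short sketch that lists the parameters of the KNN classifier used in this notebook. The keys it prints are the names that can appear in a parameter grid.

```python
from sklearn import neighbors

knn = neighbors.KNeighborsClassifier()

# get_params() returns a dict mapping parameter names to their current values.
params = knn.get_params()
for name, value in sorted(params.items()):
    print("{} = {!r}".format(name, value))
```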

A search consists of:
* an estimator (a regressor or classifier such as sklearn.svm.SVC());
* a parameter space;
* a method for searching or sampling candidates;
* a cross-validation scheme; and
* a score function.

## Grid Search CV
[GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html#sklearn.grid_search.GridSearchCV) exhaustively generates candidates from a grid of parameter values and evaluates each one with cross validation.
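
To illustrate what "exhaustively" means, the sketch below enumerates the candidates of a small grid with `ParameterGrid`, assuming a recent scikit-learn release where it is available from `sklearn.model_selection`. The two-parameter grid is a hypothetical example, not the one used in this notebook.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: 3 values of n_neighbors times 2 values of weights.
example_grid = {"n_neighbors": [1, 2, 3], "weights": ["uniform", "distance"]}

# Every combination of values becomes one candidate setting.
candidates = list(ParameterGrid(example_grid))

print(len(candidates))
for candidate in candidates:
    print(candidate)
```

The number of candidates is the product of the list lengths, so grids grow quickly as parameters are added.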

In [26]:
from sklearn import grid_search
param_grid = [{'n_neighbors':[1,2,3,4,5,6,7,8,9,10]}]

In [28]:
knn = neighbors.KNeighborsClassifier()
rgr = grid_search.GridSearchCV(knn, param_grid)
rgr.fit(iris.data, iris.target)

GridSearchCV(cv=None, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_neighbors=5, p=2, weights='uniform'),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid=[{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}],
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=0)

In [29]:
rgr.best_params_

{'n_neighbors': 5}

In [30]:
rgr.best_score_

0.98666666666666669

In [31]:
rgr.best_estimator_

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_neighbors=5, p=2, weights='uniform')

In [33]:
rgr.grid_scores_

[mean: 0.96667, std: 0.03333, params: {'n_neighbors': 1},
 mean: 0.95333, std: 0.00872, params: {'n_neighbors': 2},
 mean: 0.98000, std: 0.01601, params: {'n_neighbors': 3},
 mean: 0.97333, std: 0.00897, params: {'n_neighbors': 4},
 mean: 0.98667, std: 0.00924, params: {'n_neighbors': 5},
 mean: 0.97333, std: 0.00897, params: {'n_neighbors': 6},
 mean: 0.97333, std: 0.00897, params: {'n_neighbors': 7},
 mean: 0.98000, std: 0.00058, params: {'n_neighbors': 8},
 mean: 0.97333, std: 0.00897, params: {'n_neighbors': 9},
 mean: 0.97333, std: 0.00897, params: {'n_neighbors': 10}]
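
The `sklearn.grid_search` module and the `grid_scores_` attribute were removed in scikit-learn 0.20. A minimal sketch of the same search, assuming a recent release: `GridSearchCV` is imported from `sklearn.model_selection` and the per-candidate results are read from `cv_results_`. `cv=3` is set explicitly to mirror the older default, and the exact scores may differ from the output above.

```python
from sklearn import datasets, neighbors
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
param_grid = {"n_neighbors": list(range(1, 11))}

search = GridSearchCV(neighbors.KNeighborsClassifier(), param_grid, cv=3)
search.fit(iris.data, iris.target)

print(search.best_params_)
print(search.best_score_)

# cv_results_ is a dict of arrays with one entry per candidate.
results = search.cv_results_
for params, mean, std in zip(
    results["params"], results["mean_test_score"], results["std_test_score"]
):
    print("mean: {:.5f}, std: {:.5f}, params: {}".format(mean, std, params))
```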

# Lab
* You'll be building a regressor on the advertising dataset
* Build a linear and a KNN regressor on the advertising dataset
* Use cross validation to select the best parameters and features to use
* Build a KNN classifier to determine survivors vs. non-survivors on the Titanic dataset
* Read in the sklearn handwritten digits dataset and build a classifier