## Question #1: Load up one of our favorite datasets, the digits dataset, and let's go step by step through using grid search with pipelines to create the optimal classifier

Load up the digits dataset, and create matrix X and vector y containing the predictor and response variables

In [4]:
from sklearn import datasets
digits = datasets.load_digits()
X = digits.data
y = digits.target

Break up the dataset into a training and testing set

In [5]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y)
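The split above is random, so the scores reported later will differ slightly from run to run. A minimal sketch of a reproducible split follows, assuming a scikit-learn version where `train_test_split` lives in `sklearn.model_selection`. The `test_size=0.25`, `random_state=0`, and `stratify=y` arguments are illustrative choices, not part of the original exercise.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X, y = digits.data, digits.target

# random_state fixes the shuffle; stratify=y keeps the digit classes
# in roughly the same proportions in both halves (both are assumptions here).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes it easier to compare the plain pipeline against the grid-searched one on exactly the same held-out rows.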

Before fitting an SVM, we generally want to scale the data first. Rather than scaling and fitting by hand every time, let's put both steps into a Pipeline.

Import Pipeline from sklearn.pipeline, as well as StandardScaler and SVC

In [9]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

Read the help documents with Pipeline??

In [10]:
Pipeline??

Following the instructions, pass in the steps you want the pipeline to perform. (The argument should be a list of tuples (a, b), where a is whatever name you want to give the step and b is the actual object.)

In [12]:
pipeline = Pipeline([("scaler", StandardScaler()),
                     ("svc", SVC())])

Excellent! Now fit the pipeline just like you would a regular classifier on the training set

In [13]:
pipeline.fit(X_train,y_train)

Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False))])

Now score it on the testing set

In [16]:
pipeline.score(X_test,y_test)

0.98444444444444446
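To see what the pipeline saves us from doing by hand, here is a sketch that performs the scale-then-fit steps manually and compares the result with the pipeline. It assumes a recent scikit-learn (`sklearn.model_selection`) and uses `random_state=0` purely for repeatability, so the numbers will not match the output above.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)

# Manual version: learn the scaling statistics on the training set only,
# then apply the same transformation to both training and test data.
scaler = StandardScaler().fit(X_train)
svc = SVC().fit(scaler.transform(X_train), y_train)
manual_score = svc.score(scaler.transform(X_test), y_test)

# Pipeline version: fit() and score() chain the same steps for us.
pipeline = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])
pipeline.fit(X_train, y_train)
pipeline_score = pipeline.score(X_test, y_test)

print(manual_score, pipeline_score)
```

The two scores should agree, since the pipeline simply chains the same fit and transform calls while making it harder to accidentally fit the scaler on test data.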

Now, perhaps we want to optimize the _hyperparameters_ of the various transformers and classifiers. For that we'll want to use GridSearchCV. Import it now from sklearn.model_selection (older releases kept it in sklearn.grid_search, which was removed in scikit-learn 0.20)

In [17]:
from sklearn.model_selection import GridSearchCV

Remember, you perform a GridSearch by passing in a dictionary of all the parameters you might want to search on, as well as your classifier. You can even use a pipeline as a classifier! 

So here, define your param_grid with the parameters you want to search over. Let's start by searching just over the C parameter of the SVC. In the param_grid dictionary, each _key_ is the name you gave the object in the pipeline, followed by two underscores, followed by the parameter name, and each _value_ is a list of all the values to try. Let's search through [.001, .01, .1, 1, 10, 100]

In [40]:
import numpy as np
param_grid = {'svc__C': 10.**np.arange(-3,3)}
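If you are unsure which keys are valid, the pipeline can list them for you. The sketch below rebuilds the same two-step pipeline and uses `get_params()` and `set_params()`, which every scikit-learn estimator provides, to show the `<step name>__<parameter name>` convention in action.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipeline = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Every tunable parameter is exposed as <step name>__<parameter name>.
svc_params = sorted(name for name in pipeline.get_params()
                    if name.startswith("svc__"))
print(svc_params)

# The same naming works for setting a value directly,
# which is what GridSearchCV does for each candidate.
pipeline.set_params(svc__C=10.0)
print(pipeline.named_steps["svc"].C)
```

Any name printed in the list can be used as a key in param_grid.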

Great, now instantiate the grid search object and fit it on the training set 

In [37]:
clf_grid = GridSearchCV(pipeline, param_grid)
clf_grid.fit(X_train,y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False))]),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'svc__gamma': array([  0.01,   0.1 ,   1.  ,  10.  ]), 'svc__C': array([  1.00000e-03,   1.00000e-02,   1.00000e-01,   1.00000e+00,
         1.00000e+01,   1.00000e+02])},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=0)

Score it on the testing set. How does it do vs the basic classifier? Why do you think? What are the parameters of the best fit classifier?

In [38]:
# print() with a single argument works in both Python 2 and Python 3
print(clf_grid.score(X_test,y_test))
print(clf_grid.best_params_)

0.984444444444
{'svc__gamma': 0.01, 'svc__C': 10.0}
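To help answer the questions above, it can be useful to look at the cross-validated score of every candidate rather than only the winner. The following is a sketch under a few assumptions: a scikit-learn version that provides `cv_results_` (0.18 or later), a fixed `random_state=0`, and `cv=3`, none of which come from the original cells, so the numbers will differ from the output shown above.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)

pipeline = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])
param_grid = {"svc__C": 10.0 ** np.arange(-3, 3)}

clf_grid = GridSearchCV(pipeline, param_grid, cv=3)
clf_grid.fit(X_train, y_train)

# Mean cross-validated accuracy for each candidate value of C.
results = clf_grid.cv_results_
for params, mean_score in zip(results["params"], results["mean_test_score"]):
    print(params, round(mean_score, 4))

print("best CV score:", clf_grid.best_score_)
print("best params:", clf_grid.best_params_)
print("test score:", clf_grid.score(X_test, y_test))
```

If the best candidate turns out to be close to the SVC defaults, the grid search will score about the same as the basic pipeline on the test set.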


Improve the grid search by searching over more parameters. Perhaps choose either _gamma_, which is a parameter of the rbf kernel (the default for SVC), or the various kernel types. Note: a single param_grid dictionary tries every combination of its values, so to pair each kernel with only its own parameters, pass param_grid as a list of dictionaries, one per kernel.

Train the new gridsearch and score on the test set. How does it perform?

In [42]:
param_grid = {'svc__C': 10.**np.arange(-3,3),
             'svc__gamma': 10.**np.arange(-2,2)}
clf_grid2 = GridSearchCV(pipeline, param_grid)
clf_grid2.fit(X_train,y_train)


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False))]),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'svc__gamma': array([  0.01,   0.1 ,   1.  ,  10.  ]), 'svc__C': array([  1.00000e-03,   1.00000e-02,   1.00000e-01,   1.00000e+00,
         1.00000e+01,   1.00000e+02])},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=0)
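GridSearchCV also accepts a list of dictionaries as param_grid, in which case each dictionary is searched as its own grid. That gives one way to compare kernels while only tuning the parameters that apply to each. The sketch below assumes a recent scikit-learn, and the specific values of C and gamma, `cv=3`, and `random_state=0` are illustrative choices rather than recommendations.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)

pipeline = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# One dictionary per kernel: gamma is only searched for the rbf kernel.
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10],
     "svc__gamma": [0.001, 0.01, 0.1]},
]

clf_grid3 = GridSearchCV(pipeline, param_grid, cv=3)
clf_grid3.fit(X_train, y_train)

candidates = clf_grid3.cv_results_["params"]
print(len(candidates), "candidates tried")
print(clf_grid3.best_params_)
print(clf_grid3.score(X_test, y_test))
```

Here the linear kernel contributes 3 candidates and the rbf kernel 9, instead of the 18 that a single combined dictionary would generate.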