# Finding optimal SVM hyperparameters using Grid Search CV

**Methods:**
>1. Load data
>2. Create param grid
>3. Run grid search cv with verbosity in SVM

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.svm as skl_svm
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
import sklearn.model_selection as skl_cv
import sklearn.ensemble as skl_ensemble
import seaborn as sns
import os

import time
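# sklearn.grid_search was removed in scikit-learn 0.20; GridSearchCV now lives in model_selection
from sklearn.model_selection import GridSearchCV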

## 1. Load data

In [2]:
digit_data = pd.read_csv('Data/train.csv')
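# .ix was removed from pandas; use .iloc (positional) and .loc (label-based) instead
X = digit_data.iloc[:, 1:]   # every column after the first one
y = digit_data['label']
# .loc slices include the end label, so this keeps rows 0 through 5000 (5001 rows), as .ix did
X_subset = X.loc[0:5000, :]
y_subset = y.loc[0:5000]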

X_train, X_test, y_train, y_test = skl_cv.train_test_split(X_subset, y_subset, test_size=0.2)

## 2. Create param grid

In [7]:
svm_gs_params = [
    {'C': np.logspace(-5,5, num=10), 'kernel': ['rbf'], 'gamma':np.logspace(-3,3,num=6)}, 
]
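Before running a search it helps to know how many candidates the grid contains. The sketch below counts them with `ParameterGrid`; the fold count `n_folds = 3` is a hypothetical value chosen for illustration, so substitute whatever `cv=` you pass to `GridSearchCV`.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
svm_gs_params = [
    {'C': np.logspace(-5, 5, num=10), 'kernel': ['rbf'], 'gamma': np.logspace(-3, 3, num=6)},
]

grid = ParameterGrid(svm_gs_params)
n_candidates = len(grid)   # 10 C values x 6 gamma values x 1 kernel
n_folds = 3                # hypothetical fold count, not taken from the notebook

print(n_candidates)            # 60
print(n_candidates * n_folds)  # 180 fits, plus one final refit on all the data
```

Each candidate is fitted once per fold, so the cost of the search grows with the product of the grid size and the number of folds.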

## 3. Run grid search cv with verbosity in SVM

In [10]:
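# Baseline RBF SVM with default C and gamma; verbose=True prints the [LibSVM] progress marker
svm_classifier = skl_svm.SVC(kernel='rbf', verbose=True)

# Grid search wrapper around the classifier (defined here, but not fitted in this cell)
svm_gs_clf = GridSearchCV(svm_classifier, param_grid=svm_gs_params)

# Fit the bare classifier first to get a baseline
svm_classifier.fit(X_train, y_train)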

[LibSVM]

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=True)
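The cell above fits the bare `SVC`, not the `GridSearchCV` object. For reference, here is a minimal sketch of fitting the search itself with progress output. It makes two assumptions that are not from this notebook: the data is scikit-learn's small built-in 8x8 digits set rather than `Data/train.csv`, and the grid is deliberately tiny so that it finishes quickly.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Stand-in data (assumption): built-in 8x8 digits, not the notebook's train.csv
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Small illustrative grid (assumption): 4 C values x 3 gamma values
param_grid = {
    'C': np.logspace(-1, 2, num=4),
    'gamma': np.logspace(-4, -2, num=3),
    'kernel': ['rbf'],
}

# verbose=1 makes GridSearchCV report how many fits it is going to run
search = GridSearchCV(SVC(), param_grid=param_grid, cv=3, verbose=1)
search.fit(X_train, y_train)   # fit the search object, not the bare SVC

print(search.best_params_)
print(search.best_score_)            # mean cross-validated accuracy of the best candidate
print(search.score(X_test, y_test))  # accuracy of the refit best model on held-out data
```

Note that `verbose` on `GridSearchCV` reports search progress, while `verbose=True` on `SVC` only prints LibSVM's own solver output.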

In [14]:
(svm_classifier.predict(X_test) == y_test.values).mean()

0.10589410589410589

Roughly 10% accuracy. That is no better than guessing among ten digits, so something is wrong.
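One way to investigate a score like this is to look at which labels the classifier actually predicts. The sketch below uses synthetic labels; the `prediction_summary` helper and the data are illustrative assumptions and not part of the notebook.

```python
import numpy as np

def prediction_summary(y_true, y_pred):
    """Return the accuracy and a {label: count} dict of the predicted labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    accuracy = float((y_true == y_pred).mean())
    labels, counts = np.unique(y_pred, return_counts=True)
    return accuracy, dict(zip(labels.tolist(), counts.tolist()))

# Synthetic stand-in: ten balanced classes and a "classifier" that always answers 1
y_true = np.repeat(np.arange(10), 100)
y_pred = np.full_like(y_true, 1)

accuracy, predicted_counts = prediction_summary(y_true, y_pred)
print(accuracy)          # 0.1
print(predicted_counts)  # {1: 1000}
```

In this notebook the equivalent call would be `prediction_summary(y_test.values, svm_classifier.predict(X_test))`. If a single label dominates the counts, the model has collapsed to a constant prediction.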

In [13]:
(svm_classifier.predict(X_train) == y_train.values).mean()

1.0

The training data is predicted perfectly while the test accuracy is at chance level, so the SVM appears to be overfitting the training data. Let's lower C to regularize more strongly:

In [16]:
svm_classifier = skl_svm.SVC(kernel='rbf', verbose=True, C=0.0001)

svm_classifier.fit(X_train, y_train)

(svm_classifier.predict(X_test) == y_test.values).mean()

[LibSVM]

0.10589410589410589
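The score is identical to the previous run, which suggests that C is not the only factor. One possible explanation, not verified in this notebook, is the scale of the inputs: with `gamma='auto'` (1 / n_features) and raw pixel values in 0-255, the RBF kernel between any two different images can underflow to zero, so each training point only resembles itself. The sketch below illustrates this with uniformly random "pixels", which is an assumption; real digit images are sparser. The `rbf_kernel_matrix` helper is written here for illustration only.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma):
    """K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Clamp tiny negative values caused by floating-point rounding
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

rng = np.random.default_rng(0)
X_raw = rng.integers(0, 256, size=(5, 784)).astype(float)  # synthetic 28x28 "images"
gamma = 1.0 / X_raw.shape[1]                                # what gamma='auto' means

K_raw = rbf_kernel_matrix(X_raw, gamma)             # effectively the identity matrix
K_scaled = rbf_kernel_matrix(X_raw / 255.0, gamma)  # off-diagonal entries well above zero

print(np.round(K_raw, 3))
print(np.round(K_scaled, 3))
```

If this is what is happening here, rescaling the pixels (for example dividing by 255) or searching over much smaller `gamma` values would be worth trying before concluding that the RBF kernel cannot work.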

Lowering C still gives chance-level accuracy on the test data, so I will switch to a polynomial kernel, starting from the best conditions I found previously:

## Poly SVM

In [21]:
svc_clf = skl_svm.SVC(
    C=2.8e-5, 
    degree=2, 
    gamma='auto', 
    kernel='poly', 
    tol=0.001
    )

svc_poly_gs_params = [
    {'C': np.logspace(-10, 10, num=20)} 
]
start_time = time.time()
gs_svc_poly_clf = GridSearchCV(svc_clf, param_grid=svc_poly_gs_params, cv=8, n_jobs=-1)
# Fit on the training split only, so that X_test stays unseen for the check below
gs_svc_poly_clf.fit(X_train, y_train)
end_time = time.time()

print('Elapsed time:', end_time - start_time, 'seconds')
(gs_svc_poly_clf.predict(X_test) == y_test.values).mean()

KeyboardInterrupt: 
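The run above was interrupted before it finished. When a search is too slow to complete, a small trial run can show which candidates are expensive. The sketch below is one way to do that under stated assumptions: it uses the first 300 samples of scikit-learn's built-in 8x8 digits set rather than this notebook's data, and a reduced four-value grid for `C`.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in data (assumption): a 300-sample slice of the built-in 8x8 digits
X, y = load_digits(return_X_y=True)
X_small, y_small = X[:300], y[:300]

# Reduced grid (assumption): four C values instead of twenty
param_grid = {'C': np.logspace(-4, 2, num=4)}

search = GridSearchCV(
    SVC(kernel='poly', degree=2, gamma='auto'),
    param_grid=param_grid, cv=3, verbose=1)
search.fit(X_small, y_small)

# cv_results_ records the mean fit time and mean CV score of every candidate
results = search.cv_results_
for params, fit_time, score in zip(
        results['params'], results['mean_fit_time'], results['mean_test_score']):
    print('C=%g  mean fit time=%.4fs  mean CV accuracy=%.3f'
          % (params['C'], fit_time, score))
```

Timings from a small sample do not scale linearly to the full data, but they can indicate which end of the `C` range dominates the runtime.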