<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Codealong Exploring SVMs Using Sklearn

_Authors: Joseph Nelson (DC)_

---

In [3]:
from sklearn.linear_model import LogisticRegression
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, linear_model, datasets
from sklearn.model_selection import cross_val_score

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

### 1. Load the handwritten digits dataset.

In [4]:
data = datasets.load_digits()
y = data.target
X = data.data

In [5]:
y.shape

(1797L,)

In [6]:
X.shape

(1797L, 64L)

In [7]:
print(y[:10])
print(X[[0]])

[0 1 2 3 4 5 6 7 8 9]
[[ 0.  0.  5. 13.  9.  1.  0.  0.  0.  0. 13. 15. 10. 15.  5.  0.  0.  3.
  15.  2.  0. 11.  8.  0.  0.  4. 12.  0.  0.  8.  8.  0.  0.  5.  8.  0.
   0.  9.  8.  0.  0.  4. 11.  0.  1. 12.  7.  0.  0.  2. 14.  5. 10. 12.
   0.  0.  0.  0.  6. 13. 10.  0.  0.  0.]]


In [8]:
import pandas as pd
# Class balance: the fraction of samples belonging to each digit
y_pd = pd.DataFrame(y)
y_pd[0].value_counts()/len(y)

3    0.101836
5    0.101280
1    0.101280
6    0.100723
4    0.100723
9    0.100167
7    0.099610
0    0.099054
2    0.098497
8    0.096828
Name: 0, dtype: float64

### 2. Cross-validate a logistic regression on the data.

In [22]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# 10-fold cross-validated accuracy of a default logistic regression
logreg = LogisticRegression()
accs = cross_val_score(logreg, X, y, cv=10)
print(accs)
print(np.mean(accs))

[0.90810811 0.95081967 0.89502762 0.91111111 0.94972067 0.96648045
 0.97765363 0.9494382  0.85875706 0.94318182]
0.9310298346839012


### 3. Cross-validate a SVM on the data.

In [19]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

In [38]:
from sklearn.model_selection import GridSearchCV
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

Cs = [0.001, 0.01, 0.1, 1, 10]
gammas = [0.001, 0.01, 0.1, 1]
param_grid = {'C': Cs, 'gamma' : gammas, 'kernel':['linear','rbf']}
grid_search = GridSearchCV(svm.SVC(), param_grid, cv=5, scoring='f1_weighted')

grid_search.fit(X_train, y_train)
grid_search.best_params_

{'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
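
Because `gamma` has no effect on the linear kernel, the grid in the cell above refits each linear model once per `gamma` value. One way to avoid that, sketched here with the same candidate values, is to pass `GridSearchCV` a list of grids, one per kernel. The names `param_grid_split`, `search`, `X_tr` and friends are illustrative only, and `random_state=0` is an assumption added just to make the split repeatable.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, train_test_split

digits = datasets.load_digits()
X_tr, X_te, y_tr, y_te = train_test_split(
    digits.data, digits.target, test_size=0.33, random_state=0)

Cs = [0.001, 0.01, 0.1, 1, 10]
gammas = [0.001, 0.01, 0.1, 1]

# One grid per kernel: gamma is only searched where it matters (rbf).
param_grid_split = [
    {'kernel': ['linear'], 'C': Cs},
    {'kernel': ['rbf'], 'C': Cs, 'gamma': gammas},
]
search = GridSearchCV(svm.SVC(), param_grid_split, cv=5, scoring='f1_weighted')
search.fit(X_tr, y_tr)

# 5 linear candidates + 20 rbf candidates, instead of 40 in a single flat grid
n_candidates = len(search.cv_results_['params'])
print(search.best_params_)
print(n_candidates)
```

The winning parameters may differ from the output above, since the train/test split is not the same.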

In [39]:
best_svm = grid_search.best_estimator_
best_svm.score(X_test, y_test)

0.9882154882154882

In [40]:
from sklearn.metrics import classification_report

def print_cm_cr(y_true, y_pred):
    """Print the confusion matrix and the classification report."""
    # Cross-tabulate actual vs. predicted labels, with row and column totals
    confusion = pd.crosstab(y_true,
                            y_pred,
                            rownames=['Actual'],
                            colnames=['Predicted'],
                            margins=True)
    print(confusion)
    print('')
    print(classification_report(y_true, y_pred))

In [41]:
# best_estimator_ was already refit on X_train by GridSearchCV (refit=True by default)
y_pred = best_svm.predict(X_test)
print_cm_cr(y_test, y_pred)

Predicted   0   1   2   3   4   5   6   7   8   9  All
Actual                                                
0          55   0   0   0   0   0   0   0   0   0   55
1           0  55   0   0   0   0   0   0   0   0   55
2           0   0  67   0   0   0   0   0   0   0   67
3           0   0   0  63   0   0   0   1   0   0   64
4           0   0   0   0  52   0   0   0   0   2   54
5           0   0   0   0   0  52   0   0   0   1   53
6           0   0   0   0   0   1  69   0   0   0   70
7           0   0   0   0   0   0   0  53   0   0   53
8           0   2   0   0   0   0   0   0  60   0   62
9           0   0   0   0   0   0   0   0   0  61   61
All        55  57  67  63  52  53  69  54  60  64  594

             precision    recall  f1-score   support

          0       1.00      1.00      1.00        55
          1       0.96      1.00      0.98        55
          2       1.00      1.00      1.00        67
          3       1.00      0.98      0.99        64
          4       

In [23]:
SVM = svm.SVC(kernel='rbf',C=10, gamma=0.001)
accs = cross_val_score(SVM, X, y, cv=5)
print(accs)
print(np.mean(accs))

[0.97802198 0.95027624 0.98328691 0.98879552 0.96338028]
0.9727521858182637


In [30]:
# Note: X includes the training rows, so these predictions are not a held-out evaluation
SVM.fit(X_train, y_train).predict(X)

array([0, 1, 2, ..., 8, 9, 8])

## The Gaussian (RBF) SVM has two main parameters: gamma and C

---

### gamma

Intuitively, the gamma parameter defines how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close’. 
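
For reference, the Gaussian (RBF) kernel used by `svm.SVC(kernel='rbf')` makes this reach explicit. Here $x$ and $x'$ stand for any two samples (the notation is chosen for this note):

```latex
K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right)
```

As gamma grows, the kernel value falls off faster with distance, so each training example only influences points very close to it.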

The higher the value of gamma, the more closely the model tries to fit the training set exactly, which can cause over-fitting.
- small gamma: each training point's influence reaches far, so the decision boundary is smooth and constrained. Can under-fit: high bias and low variance.
- big gamma: each training point's influence is very local, so the boundary bends around individual points. Can over-fit: low bias and high variance.

<img src="http://www.analyticsvidhya.com/wp-content/uploads/2015/10/SVM_15.png">
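
The effect of gamma can be checked on the digits data by comparing accuracy on the training set with cross-validated accuracy. This is a rough sketch, not a tuned experiment: the three gamma values, `C=1`, and the names `X_d`, `y_d` and `results` are choices made here for illustration.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

digits = datasets.load_digits()
X_d, y_d = digits.data, digits.target

results = {}
for gamma in [0.0001, 0.001, 0.1]:
    model = svm.SVC(kernel='rbf', C=1, gamma=gamma)
    # Accuracy on held-out folds
    cv_acc = np.mean(cross_val_score(model, X_d, y_d, cv=5))
    # Accuracy on the same data the model was trained on
    train_acc = model.fit(X_d, y_d).score(X_d, y_d)
    results[gamma] = (train_acc, cv_acc)
    print(gamma, round(train_acc, 3), round(cv_acc, 3))
```

A large gap between the two numbers at the largest gamma is the over-fitting described above: the model memorizes the training points but generalizes poorly.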



### C

C is the penalty parameter of the error term. It controls the trade-off between a smooth decision boundary and classifying the training points correctly. C can be thought of as the parameter of the soft-margin cost function: it caps the influence that any individual support vector can have.

- small C: margin violations are cheap, so the margin is softer and the decision surface is smooth and simple. Can under-fit: high bias and low variance.
- big C: margin violations are expensive, so the margin is harder and the model works to classify every training point correctly. Can over-fit: low bias and high variance.
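
The soft-margin cost function can be written out as follows. The notation is an assumption made for this note: labels $y_i \in \{-1, +1\}$, a feature map $\phi$ induced by the kernel, and slack variables $\xi_i$ that measure how far each training point violates the margin.

```latex
\min_{w,\, b,\, \xi} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \left( w^\top \phi(x_i) + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0
```

The first term rewards a wide margin and the second charges for violations, so C sets the exchange rate between the two.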

<img src="http://www.analyticsvidhya.com/wp-content/uploads/2015/10/SVM_18.png">
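
To see C at work, the sketch below fits an RBF SVM on the digits data for three values of C and records the training accuracy and the total number of support vectors. The C values, the fixed `gamma=0.001`, and the names `train_acc` and `n_sv` are illustrative choices, not part of the original exercise.

```python
from sklearn import datasets, svm

digits = datasets.load_digits()
X_d, y_d = digits.data, digits.target

train_acc = {}
n_sv = {}
for C in [0.01, 1, 100]:
    model = svm.SVC(kernel='rbf', C=C, gamma=0.001).fit(X_d, y_d)
    train_acc[C] = model.score(X_d, y_d)
    # n_support_ holds the number of support vectors per class
    n_sv[C] = int(model.n_support_.sum())
    print(C, round(train_acc[C], 3), n_sv[C])
```

With a very small C the penalty is so weak that a large share of the training points end up as support vectors. Raising C tightens the fit to the training data.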


### 4. Fit an SVM modifying the default gamma and C.

In [5]:
# A:

### 5. Gridsearch an optimal gamma with C=1.

In [6]:
# A:
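
One possible approach to this exercise, sketched under the assumption that the digits data is the target and that a logarithmic grid from 1e-5 to 1 is a reasonable search range. The names `gamma_grid` and `gamma_search` are hypothetical.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()

# Six candidate values: 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1
gamma_grid = {'gamma': np.logspace(-5, 0, 6)}

# C is held fixed at 1, so only gamma is searched
gamma_search = GridSearchCV(svm.SVC(kernel='rbf', C=1), gamma_grid, cv=5)
gamma_search.fit(digits.data, digits.target)

print(gamma_search.best_params_)
print(gamma_search.best_score_)
```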

### 6. Gridsearch the optimal C, gamma, and kernel.

In [7]:
# A:

### 7. Import the iris dataset.

In [58]:
iris = datasets.load_iris()
y = iris.target
X = iris.data
print(y.shape)
print(X.shape)

(150L,)
(150L, 4L)


### 8. Cross-validate a default logistic regression and default SVM on the iris data.

In [59]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# 10-fold cross-validated accuracy of a default logistic regression
logreg = LogisticRegression()
accs = cross_val_score(logreg, X, y, cv=10)
print(accs)
print(np.mean(accs))

[1.         1.         1.         0.93333333 0.93333333 0.93333333
 0.8        0.93333333 1.         1.        ]
0.9533333333333334


In [65]:
SVM1 = svm.SVC(kernel='rbf')
accs = cross_val_score(SVM1, X, y, cv=5)
print(accs)
print(np.mean(accs))

[0.96666667 1.         0.96666667 0.96666667 1.        ]
0.9800000000000001


In [66]:
from sklearn.model_selection import GridSearchCV
from sklearn import svm

Cs = [0.001, 0.01, 0.1, 1, 10]
gammas = [0.001, 0.01, 0.1, 1]
param_grid = {'C': Cs, 'gamma' : gammas}
grid_search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=5)
grid_search.fit(X, y)
grid_search.best_params_

{'C': 1, 'gamma': 0.1}

In [67]:
SVM2 = svm.SVC(kernel='rbf',C=1, gamma=0.1)
accs = cross_val_score(SVM2, X, y, cv=5)
print(accs)
print(np.mean(accs))

[0.96666667 1.         0.96666667 0.96666667 1.        ]
0.9800000000000001


### 9. [Bonus] Compare three SVMs with different kernels on the iris data visually.
- Gaussian
- Linear
- Poly of degree 3

In [68]:
SVM1 = svm.SVC(kernel='linear')
accs = cross_val_score(SVM1, X, y, cv=5)
print(accs)
print(np.mean(accs))

[0.96666667 1.         0.96666667 0.96666667 1.        ]
0.9800000000000001


In [73]:
SVM1 = svm.SVC(kernel='poly', degree=3)
accs = cross_val_score(SVM1, X, y, cv=5)
print(accs)
print(np.mean(accs))

[1.         1.         0.9        0.93333333 1.        ]
0.9666666666666666


### 10. [Bonus] Compare SVM kernels visually on fake data using sklearn's `make_circles`.

Load `make_circles` from here:
```python
from sklearn.datasets import make_circles
```

Compare the linear, rbf, and poly kernels.

In [11]:
# A:
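
A possible starting point for this exercise is sketched below. It compares the three kernels by cross-validated accuracy rather than by plotting, and the `make_circles` settings (`n_samples=300`, `noise=0.1`, `factor=0.4`, `random_state=0`) as well as the names `X_c`, `y_c` and `kernel_scores` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score

# Two concentric rings: not separable by a straight line
X_c, y_c = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)

kernel_scores = {}
for kernel in ['linear', 'rbf', 'poly']:
    model = svm.SVC(kernel=kernel)  # poly uses its default degree of 3
    kernel_scores[kernel] = np.mean(cross_val_score(model, X_c, y_c, cv=5))
    print(kernel, round(kernel_scores[kernel], 3))
```

To make the comparison visual, each fitted model's predictions could be drawn over a grid of points with `plt.contourf`, with the samples scattered on top.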