
## Support Vector Machines Lab

Week 6 | 4.2

---

In this lab you will explore several datasets with SVM classifiers and compare them to logistic regression and kNN classifiers.

Your datasets folder has these four datasets to choose from for the lab:

**Breast cancer**

    ./DSI-SF-4/datasets/breast_cancer_wisconsin

**Spambase**

    ./DSI-SF-4/datasets/spam

**Car evaluation**

    ./DSI-SF-4/datasets/car_evaluation
    
**Mushroom**

    ./DSI-SF-4/datasets/mushroom


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

from ipywidgets import *
from IPython.display import display

from sklearn.svm import SVC

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

## 1. Breast Cancer



### Load the Data

- Are there any missing values? Impute or clean if so.
- Select a classification target and predictors.
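
Missing values are not always stored as `NaN`. A placeholder string in an otherwise numeric column will pass an `isnull()` check and leave the column with an `object` dtype. The sketch below uses a small made-up frame (the values are hypothetical, not taken from the lab data) to show one way to surface and handle such placeholders.

```python
import pandas as pd

# Toy frame (hypothetical values): a numeric column read in as strings
toy = pd.DataFrame({'Bare_Nuclei': ['1', '10', '?', '4'],
                    'Class': [2, 4, 2, 4]})

# isnull() finds nothing, because '?' is an ordinary string
n_null_before = toy.isnull().sum().sum()

# Coerce to numeric: anything that cannot be parsed becomes NaN
toy['Bare_Nuclei'] = pd.to_numeric(toy['Bare_Nuclei'], errors='coerce')
n_null_after = toy['Bare_Nuclei'].isnull().sum()

# Option A: drop the affected rows
dropped = toy.dropna(subset=['Bare_Nuclei'])

# Option B: impute with the column median
imputed = toy.fillna({'Bare_Nuclei': toy['Bare_Nuclei'].median()})

print(n_null_before, n_null_after)
```

If the placeholder is known in advance, passing `na_values='?'` to `pd.read_csv` is another option that converts it to `NaN` at load time.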

In [2]:
cancer = pd.read_csv('/Users/kiefer/github-repos/DSI-SF-4/datasets/breast_cancer_wisconsin/breast_cancer.csv')

In [3]:
cancer.isnull().sum()

Sample_code_number             0
Clump_Thickness                0
Uniformity_of_Cell_Size        0
Uniformity_of_Cell_Shape       0
Marginal_Adhesion              0
Single_Epithelial_Cell_Size    0
Bare_Nuclei                    0
Bland_Chromatin                0
Normal_Nucleoli                0
Mitoses                        0
Class                          0
dtype: int64

In [4]:
cancer.Clump_Thickness.unique()

array([ 5,  3,  6,  4,  8,  1,  2,  7, 10,  9])

In [5]:
cancer.dtypes

Sample_code_number              int64
Clump_Thickness                 int64
Uniformity_of_Cell_Size         int64
Uniformity_of_Cell_Shape        int64
Marginal_Adhesion               int64
Single_Epithelial_Cell_Size     int64
Bare_Nuclei                    object
Bland_Chromatin                 int64
Normal_Nucleoli                 int64
Mitoses                         int64
Class                           int64
dtype: object

In [6]:
cancer.Bare_Nuclei.unique()

array(['1', '10', '2', '4', '3', '9', '7', '?', '5', '8', '6'], dtype=object)

In [7]:
cancer.Bare_Nuclei.value_counts()

1     402
10    132
5      30
2      30
3      28
8      21
4      19
?      16
9       9
7       8
6       4
Name: Bare_Nuclei, dtype: int64

In [8]:
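# Drop rows where Bare_Nuclei is the '?' placeholder, then convert the column to int.
# .copy() avoids pandas' SettingWithCopyWarning when assigning to the filtered frame.
cancer = cancer[cancer.Bare_Nuclei != '?'].copy()
cancer['Bare_Nuclei'] = cancer.Bare_Nuclei.astype(int)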

In [9]:
# Binary target: Class == 4 (malignant in this dataset's coding) becomes 1, everything else 0
cancer.Class = cancer.Class.map(lambda x: 1 if x == 4 else 0)

### Modeling

For details on the SVM classifier, see here:

http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

- What is the baseline accuracy?
- Initialize and train a linear SVM. What is the average accuracy score with 5-fold cross-validation?
- Repeat using a radial basis function (RBF) kernel. Compare the scores. Which one is better?
- Print a confusion matrix and classification report for your best model using separate training and testing data.

Classification report:

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
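
As a minimal sketch of how `classification_report` is called, the example below uses two short made-up label arrays. The arrays and the `target_names` labels are assumptions for illustration only.

```python
from sklearn.metrics import classification_report

# Hypothetical true and predicted labels (0 = negative class, 1 = positive class)
y_true_toy = [0, 0, 0, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1, 0]

# Returns a formatted string with per-class precision, recall, F1 and support
report = classification_report(y_true_toy, y_pred_toy,
                               target_names=['benign', 'malignant'])
print(report)
```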

Confusion matrix:

```python
df_confusion = pd.crosstab(y_true, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
```
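
The `pd.crosstab` call above and `sklearn.metrics.confusion_matrix` should report the same counts. The crosstab adds labelled margins, while scikit-learn returns a bare array with rows as actual classes and columns as predicted classes. A small sketch on made-up labels:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration
y_true_toy = np.array([0, 0, 0, 1, 1, 1])
y_pred_toy = np.array([0, 0, 1, 1, 1, 0])

# Labelled table with row/column totals
df_confusion_toy = pd.crosstab(y_true_toy, y_pred_toy,
                               rownames=['Actual'], colnames=['Predicted'],
                               margins=True)

# Plain 2x2 array: rows = actual class, columns = predicted class
cm = confusion_matrix(y_true_toy, y_pred_toy)

print(df_confusion_toy)
print(cm)
```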

In [10]:
baseline_acc = 1. - cancer.Class.mean()
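
The baseline is the accuracy obtained by always predicting the majority class. `1. - mean` equals that baseline only when the negative class (0) is the majority, so a slightly more defensive version takes the larger of the two class proportions. A sketch with hypothetical labels:

```python
import numpy as np

# Hypothetical binary labels: four negatives, two positives
y_toy = np.array([0, 0, 0, 0, 1, 1])

p_positive = y_toy.mean()                    # proportion of class 1
baseline = max(p_positive, 1 - p_positive)   # accuracy of always guessing the majority class

print(baseline)
```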

In [11]:
from sklearn.svm import SVC

In [12]:
linear_svm = SVC(kernel='linear')

In [14]:
y = cancer.Class.values
X = cancer[[col for col in cancer.columns if col not in ['Class', 'Sample_code_number']]]

In [15]:
linear_svm.fit(X, y)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [20]:
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split

In [17]:
linear_svm_scores = cross_val_score(linear_svm, X, y, cv=5)

In [18]:
print(linear_svm_scores)
print(np.mean(linear_svm_scores))

[ 0.94890511  0.94890511  0.97810219  0.97080292  0.98518519]
0.96638010273


In [19]:
rbf_svm = SVC(kernel='rbf')
rbf_scores = cross_val_score(rbf_svm, X, y, cv=5)
print(rbf_scores)
print(np.mean(rbf_scores))

[ 0.90510949  0.91240876  0.96350365  0.98540146  0.98518519]
0.95032170857
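
The columns inspected above take values from 1 to 10, so feature scale may matter little in this dataset. In general, though, SVMs (and the RBF kernel in particular) are sensitive to feature scale, so a kernel comparison is often run with a scaler inside a pipeline. The sketch below assumes scikit-learn's built-in breast cancer dataset as a stand-in. It is not the same file as the lab's `breast_cancer.csv`, and its scores should not be compared to the ones printed above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data (assumption): scikit-learn's bundled dataset, not the lab CSV
X_demo, y_demo = load_breast_cancer(return_X_y=True)

kernel_scores = {}
for kernel in ['linear', 'rbf']:
    # The scaler is refit on the training part of every fold,
    # so no statistics from the held-out fold leak into training
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    kernel_scores[kernel] = cross_val_score(model, X_demo, y_demo, cv=5)
    print('%s: mean accuracy %.3f' % (kernel, kernel_scores[kernel].mean()))
```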


In [21]:
Xtrain, Xtest, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y)

In [28]:
linear_svm = SVC(kernel='linear', probability=True)
linear_svm.fit(Xtrain, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [29]:
y_pred = linear_svm.predict(Xtest)

In [30]:
df_confusion = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
df_confusion

| Actual / Predicted | 0 | 1 | All |
|---|---|---|---|
| 0 | 144 | 3 | 147 |
| 1 | 6 | 73 | 79 |
| All | 150 | 76 | 226 |


In [33]:
Xtest_pp = linear_svm.predict_proba(Xtest)

In [39]:
def custom_predict(pred_prob, threshold_for_positive=0.5):
    predicted = [1 if x[1] >= threshold_for_positive else 0 for x in pred_prob]
    return np.array(predicted)
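
The same thresholding can be written without a Python loop. This is a sketch of an equivalent vectorized version. `threshold_predict` and the toy probabilities are hypothetical names and values introduced here, and the function assumes the positive class is in column 1, as in the output of `predict_proba` for a 0/1 target.

```python
import numpy as np

def threshold_predict(pred_prob, threshold_for_positive=0.5):
    """Label a row 1 when its positive-class probability meets the threshold."""
    pred_prob = np.asarray(pred_prob)
    # Column 1 holds P(class == 1); the comparison yields booleans, cast to 0/1
    return (pred_prob[:, 1] >= threshold_for_positive).astype(int)

# Hypothetical predicted probabilities: columns are [P(class 0), P(class 1)]
toy_proba = np.array([[0.97, 0.03],
                      [0.90, 0.10],
                      [0.40, 0.60]])

default_pred = threshold_predict(toy_proba)
lenient_pred = threshold_predict(toy_proba, threshold_for_positive=0.05)

print(default_pred)
print(lenient_pred)
```

Lowering the threshold can only move predictions from 0 to 1, never the other way, which is why it trades false negatives for false positives.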

In [49]:
y_pred_lenient = custom_predict(Xtest_pp, threshold_for_positive=0.05)

In [50]:
len(y_pred_lenient), len(y_test)

(226, 226)

In [51]:
df_confusion = pd.crosstab(y_test, y_pred_lenient, rownames=['Actual'], colnames=['Predicted'], margins=True)
df_confusion

| Actual / Predicted | 0 | 1 | All |
|---|---|---|---|
| 0 | 122 | 25 | 147 |
| 1 | 0 | 79 | 79 |
| All | 122 | 104 | 226 |
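
The two confusion matrices above make the threshold trade-off concrete. With the default prediction there are 73 true positives, 3 false positives and 6 false negatives. With the 0.05 threshold there are 79 true positives, 25 false positives and no false negatives. A short worked calculation of precision and recall from those counts (`precision_recall` is a helper defined here for illustration):

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall for the positive class from raw counts."""
    precision = tp / float(tp + fp)   # of the predicted positives, how many are real
    recall = tp / float(tp + fn)      # of the real positives, how many were found
    return precision, recall

# Counts read off the two confusion matrices above
default_precision, default_recall = precision_recall(tp=73, fp=3, fn=6)
lenient_precision, lenient_recall = precision_recall(tp=79, fp=25, fn=0)

print('default: precision %.3f, recall %.3f' % (default_precision, default_recall))
print('lenient: precision %.3f, recall %.3f' % (lenient_precision, lenient_recall))
```

Recall rises from about 0.92 to 1.00 while precision falls from about 0.96 to about 0.76. Whether that is a good trade depends on the relative cost of a missed positive versus a false alarm.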


## 2. Perform the steps above with the car or mushroom dataset

Repeat each step.

## 3. Compare SVM, kNN and logistic regression using spam data

You should:

- Gridsearch optimal parameters for each model (for SVM, just gridsearch C and kernel).
- Cross-validate scores.
- Examine confusion matrices and classification reports.
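
For the logistic regression part of the comparison, a grid search might look like the sketch below. It runs on synthetic data from `make_classification` rather than the spam data, and the parameter grid (`C` and `penalty`) is an assumption chosen for illustration, not a recommendation from the lab.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption), not the spam dataset
X_toy, y_toy = make_classification(n_samples=300, n_features=20, random_state=0)

# Hypothetical grid: regularization strength and penalty type
logreg_params = {
    'C': np.logspace(-3, 1, 5),
    'penalty': ['l1', 'l2']
}

# The liblinear solver supports both l1 and l2 penalties
logreg_gs = GridSearchCV(LogisticRegression(solver='liblinear'), logreg_params, cv=5)
logreg_gs.fit(X_toy, y_toy)

print(logreg_gs.best_params_)
print(logreg_gs.best_score_)
```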

Bonus: 

Plot "learning curves" for the best models of each. This is a great way to see how training/testing size affects the scores. Look at the documentation for how to use this function in sklearn.

http://scikit-learn.org/stable/modules/learning_curve.html#learning-curves
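
A minimal sketch of `learning_curve`, assuming synthetic data from `make_classification` and a linear SVC as the estimator. It prints the mean training and cross-validation scores at each training size instead of plotting, so the numbers can then be passed to whichever plotting call you prefer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# Synthetic stand-in data (assumption), not one of the lab datasets
X_toy, y_toy = make_classification(n_samples=400, n_features=20, random_state=0)

# Fit the model on 20%, 40%, ..., 100% of each training fold
train_sizes, train_scores, test_scores = learning_curve(
    SVC(kernel='linear'), X_toy, y_toy,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5)

# Each score array has one row per training size and one column per CV fold
for n, tr, te in zip(train_sizes,
                     train_scores.mean(axis=1),
                     test_scores.mean(axis=1)):
    print('n=%d  train=%.3f  cv=%.3f' % (n, tr, te))
```

A large gap between the training and cross-validation scores that narrows as the training size grows usually suggests that more data would help. Two curves that have already converged suggest it would not.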

In [52]:
spam = pd.read_csv('/Users/kiefer/github-repos/DSI-SF-4/datasets/spam/spam_words_wide.csv')

In [53]:
spam.shape

(5572, 1001)

In [54]:
spam.head()

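|   | is_spam | getzed | 86021 | babies | sunoco | ultimately | thk | voted | spatula | fiend | ... | itna | borin | thoughts | iccha | videochat | freefone | pist | reformat | strict | 69698 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |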


In [62]:
y = spam.is_spam
X = spam.iloc[:, np.random.choice(range(1, spam.shape[1]), replace=False, size=50)]

In [63]:
X.head()

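|   | greatness | blame | duo | apologetic | teresa | room | hoped | shuhui | handsomes | pulls | ... | weddin | flute | 84 | nichols | 09050090044 | regular | becz | cliffs | reformat | devouring |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |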


In [64]:
svc_params = {
    'C':np.logspace(-3,1,5),
    'kernel':['linear','rbf','sigmoid'],
    'probability':[True]
}

svc_gs = GridSearchCV(SVC(), svc_params, cv=5, verbose=2)
svc_gs.fit(X, y)

Fitting 5 folds for each of 15 candidates, totalling 75 fits
[CV] kernel=linear, C=0.001, probability=True ........................
[CV] ......... kernel=linear, C=0.001, probability=True, total=   1.7s
[CV] kernel=linear, C=0.001, probability=True ........................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.0s remaining:    0.0s


[CV] ......... kernel=linear, C=0.001, probability=True, total=   1.8s
[CV] kernel=linear, C=0.001, probability=True ........................
[CV] ......... kernel=linear, C=0.001, probability=True, total=   1.7s
[CV] kernel=linear, C=0.001, probability=True ........................
[CV] ......... kernel=linear, C=0.001, probability=True, total=   2.3s
[CV] kernel=linear, C=0.001, probability=True ........................
[CV] ......... kernel=linear, C=0.001, probability=True, total=   1.8s
[CV] kernel=rbf, C=0.001, probability=True ...........................
[CV] ............ kernel=rbf, C=0.001, probability=True, total=   2.3s
[CV] kernel=rbf, C=0.001, probability=True ...........................
[CV] ............ kernel=rbf, C=0.001, probability=True, total=   2.1s
[CV] kernel=rbf, C=0.001, probability=True ...........................
[CV] ............ kernel=rbf, C=0.001, probability=True, total=   2.2s
[CV] kernel=rbf, C=0.001, probability=True ...........................
[CV] .

[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:  3.0min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'kernel': ['linear', 'rbf', 'sigmoid'], 'C': array([  1.00000e-03,   1.00000e-02,   1.00000e-01,   1.00000e+00,
         1.00000e+01]), 'probability': [True]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=2)

In [65]:
svc_gs.best_params_

{'C': 1.0, 'kernel': 'linear', 'probability': True}

In [66]:
best_svc = svc_gs.best_estimator_

In [67]:
svc_gs.best_score_

0.86826992103374012

In [68]:
y.mean()

0.13406317300789664

In [69]:
1 - y.mean()

0.8659368269921034
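
The best cross-validated SVC score above (about 0.868) is only marginally higher than the majority-class baseline (about 0.866), so on these 50 randomly chosen word columns the model is barely doing better than always predicting "not spam". One way to get a baseline scored under the same cross-validation scheme is `DummyClassifier`. The sketch below assumes synthetic, imbalanced data from `make_classification` rather than the spam data.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption) with roughly 87% / 13% class balance
X_toy, y_toy = make_classification(n_samples=500, weights=[0.87, 0.13],
                                   random_state=0)

# Always predicts the most frequent class seen during fit
dummy = DummyClassifier(strategy='most_frequent')
dummy_scores = cross_val_score(dummy, X_toy, y_toy, cv=5)

print(dummy_scores.mean())
print(max(y_toy.mean(), 1 - y_toy.mean()))
```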

In [None]:
knn_params = {
    'n_neighbors':range(1,100,10),
    'weights':['uniform','distance']
}

from sklearn.neighbors import KNeighborsClassifier

knn_gs = GridSearchCV(KNeighborsClassifier(), knn_params, cv=5, verbose=2)
knn_gs.fit(X, y)

Fitting 5 folds for each of 20 candidates, totalling 100 fits
[CV] n_neighbors=1, weights=uniform ..................................
[CV] ................... n_neighbors=1, weights=uniform, total=   0.5s
[CV] n_neighbors=1, weights=uniform ..................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.0s remaining:    0.0s


[CV] ................... n_neighbors=1, weights=uniform, total=   0.4s
[CV] n_neighbors=1, weights=uniform ..................................
[CV] ................... n_neighbors=1, weights=uniform, total=   0.4s
[CV] n_neighbors=1, weights=uniform ..................................
[CV] ................... n_neighbors=1, weights=uniform, total=   0.4s
[CV] n_neighbors=1, weights=uniform ..................................
[CV] ................... n_neighbors=1, weights=uniform, total=   0.4s
[CV] n_neighbors=1, weights=distance .................................
[CV] .................. n_neighbors=1, weights=distance, total=   0.4s
[CV] n_neighbors=1, weights=distance .................................
[CV] .................. n_neighbors=1, weights=distance, total=   0.4s
[CV] n_neighbors=1, weights=distance .................................
[CV] .................. n_neighbors=1, weights=distance, total=   0.5s
[CV] n_neighbors=1, weights=distance .................................
[CV] .