
## Support Vector Machines Lab

Week 6 | 4.2

---

In this lab you will fit SVM classifiers to several datasets and compare them against logistic regression and kNN classifiers.

Your datasets folder has these four datasets to choose from for the lab:

**Breast cancer**

    ./DSI-SF-4/datasets/breast_cancer_wisconsin

**Spambase**

    ./DSI-SF-4/datasets/spam

**Car evaluation**

    ./DSI-SF-4/datasets/car_evaluation
    
**Mushroom**

    ./DSI-SF-4/datasets/mushroom


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

from ipywidgets import *
from IPython.display import display

from sklearn.svm import SVC

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

## 1. Breast Cancer



### Load the Data

- Are there any missing values? Impute or clean if so.
- Select a classification target and predictors.

In [18]:
b_cancer = pd.read_csv('/Users/austinwhaley/Desktop/DSI-SF-4-austinmwhaley/datasets/breast_cancer_wisconsin/breast_cancer.csv')

In [22]:
# No nulls are reported, but Bare_Nuclei marks missing values with '?',
# so pandas reads the column as strings.
# Coerce '?' to NaN, then impute with the mean of the observed values.
bare_nuclei = pd.to_numeric(b_cancer['Bare_Nuclei'], errors='coerce')
b_cancer['Bare_Nuclei'] = bare_nuclei.fillna(bare_nuclei.mean())

In [23]:
b_cancer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
Sample_code_number             699 non-null int64
Clump_Thickness                699 non-null int64
Uniformity_of_Cell_Size        699 non-null int64
Uniformity_of_Cell_Shape       699 non-null int64
Marginal_Adhesion              699 non-null int64
Single_Epithelial_Cell_Size    699 non-null int64
Bare_Nuclei                    699 non-null float64
Bland_Chromatin                699 non-null int64
Normal_Nucleoli                699 non-null int64
Mitoses                        699 non-null int64
Class                          699 non-null int64
dtypes: float64(1), int64(10)
memory usage: 60.1 KB


In [None]:
# Target y is b_cancer['Class']: 2 = benign, 4 = malignant


In [84]:
# Drop the target and the ID column; the remaining 9 columns are the predictors
X = b_cancer.drop(['Class', 'Sample_code_number'], axis=1)
y = b_cancer['Class']

print(len(X), len(y))
print(X.shape, y.shape)

699 699
(699, 9) (699,)


## 2. Modeling

For details on the SVM classifier, see here:

http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

- What's the baseline for the accuracy?
- Initialize and train a linear svm. What's the average accuracy score with a 5-fold cross validation?
- Repeat using a radial basis function (rbf) classifier. Compare the scores. Which one is better?
- Print a confusion matrix and classification report for your best model using training & testing data.
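
The baseline in the first bullet is the accuracy you would get by always predicting the most frequent class. A minimal sketch, assuming a small hypothetical label series (`y_demo`) in place of the lab data:

```python
import pandas as pd

# Hypothetical labels standing in for a real target column
# (2 = benign, 4 = malignant); these values are made up for illustration.
y_demo = pd.Series([2, 2, 2, 4, 2, 4, 2, 4])

# Share of each class; the largest share is the majority-class baseline accuracy
class_shares = y_demo.value_counts(normalize=True)
baseline_demo = class_shares.max()

print(class_shares)
print(baseline_demo)
```

A model is only useful if its cross-validated accuracy clearly exceeds this number.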

Classification report:

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html

Confusion matrix:

```python
df_confusion = pd.crosstab(y_true, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
```
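
As a rough illustration of how the two tools fit together, the sketch below builds the crosstab on a handful of hypothetical labels (not taken from the lab data) and compares it with scikit-learn's `confusion_matrix` and `classification_report`:

```python
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted labels, made up for illustration
y_true = pd.Series([2, 2, 4, 4, 2, 4], name='Actual')
y_pred = pd.Series([2, 4, 4, 4, 2, 2], name='Predicted')

# pandas crosstab: rows are actual classes, columns are predicted classes,
# and margins=True adds row and column totals under the label 'All'
df_confusion = pd.crosstab(y_true, y_pred, rownames=['Actual'],
                           colnames=['Predicted'], margins=True)
print(df_confusion)

# scikit-learn equivalent, without the margin totals
cm = confusion_matrix(y_true, y_pred, labels=[2, 4])
print(cm)

# Per-class precision, recall, and F1
print(classification_report(y_true, y_pred))
```

In both layouts the diagonal holds the correct predictions and the off-diagonal cells hold the errors.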

In [85]:
# Class counts: 2 (benign) = 458, 4 (malignant) = 241, 699 total
# Baseline accuracy = share of the majority class

baseline = 458. / 699.
baseline

0.6552217453505007

In [86]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75)

In [87]:
from sklearn.svm import LinearSVC

# Use a distinct instance name so the LinearSVC class is not shadowed
linear_svc = LinearSVC()
linear_svc.fit(X_train, y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [88]:
# 5-fold cross-validated accuracy on the training split
scores = cross_val_score(linear_svc, X_train, y_train, cv=5)
print(scores, '\n')
print(np.mean(scores))

[ 0.96190476  0.95238095  0.95238095  0.98095238  0.97115385] 

0.963754578755


In [89]:
from sklearn.svm import SVC

# Default kernel is 'rbf'; use a distinct instance name so the SVC class is not shadowed
rbf_svc = SVC()
rbf_svc.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [90]:
# 5-fold cross-validated accuracy on the training split
scores = cross_val_score(rbf_svc, X_train, y_train, cv=5)
print(scores, '\n')
print(np.mean(scores))

[ 0.94285714  0.96190476  0.96190476  0.97142857  0.98076923] 

0.963772893773


In [91]:
# Predict on the held-out test split
y_pred = rbf_svc.predict(X_test)

In [92]:
# Confusion matrix on the held-out test split
df_confusion = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
df_confusion

| Actual \ Predicted | 2 | 4 | All |
|---|---|---|---|
| 2 | 107 | 11 | 118 |
| 4 | 1 | 56 | 57 |
| All | 108 | 67 | 175 |


In [93]:
print(classification_report(y_test, y_pred))

             precision    recall  f1-score   support

          2       0.99      0.91      0.95       118
          4       0.84      0.98      0.90        57

avg / total       0.94      0.93      0.93       175



## 2. Perform the steps above with the car or mushroom dataset

Repeat each step.

In [65]:
mush = pd.read_csv('/Users/austinwhaley/Desktop/DSI-SF-4-austinmwhaley/datasets/mushroom/mushroom.csv')

In [66]:
mush.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises?,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


In [82]:
# SVC cannot fit raw string features, so one-hot encode the categorical predictor
X = pd.get_dummies(mush[['odor']])
# Target: 'e' = edible, 'p' = poisonous (string class labels are fine for scikit-learn)
y = mush['class']

In [76]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75)

In [79]:
from sklearn.svm import SVC

# Use a distinct instance name so the SVC class is not shadowed
svc = SVC()
svc.fit(X_train, y_train)

## 3. Compare SVM, kNN and logistic regression using spam data

You should:

- Gridsearch optimal parameters for each model (for SVM, just gridsearch C and kernel).
- Cross-validate scores.
- Examine confusion matrices and classification reports.
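
One possible way to set up the steps above is sketched below. It assumes synthetic data from `make_classification` as a stand-in for the spam predictors and target, and the parameter grids are illustrative guesses rather than tuned recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); swap in the spam predictors and target
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.75, random_state=0)

# One grid search per model; each uses 5-fold cross-validation internally
searches = {
    'svm': GridSearchCV(SVC(), {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}, cv=5),
    'knn': GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [3, 5, 11]}, cv=5),
    'logreg': GridSearchCV(LogisticRegression(max_iter=1000), {'C': [0.1, 1, 10]}, cv=5),
}

results = {}
for name, search in searches.items():
    search.fit(X_tr, y_tr)
    # Best parameters, mean cross-validated accuracy, and held-out test accuracy
    results[name] = (search.best_params_, search.best_score_, search.score(X_te, y_te))
    print(name, results[name])
```

After fitting, each search's `best_estimator_` can be passed to `predict` on the test split to build the confusion matrix and classification report.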

Bonus: 

Plot "learning curves" for the best model of each type. This is a great way to see how the training set size affects the training and validation scores. Look at the documentation for how to use this function in sklearn.

http://scikit-learn.org/stable/modules/learning_curve.html#learning-curves
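
A minimal sketch of the `learning_curve` call, again assuming synthetic stand-in data and an arbitrary RBF SVM rather than a tuned best model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); swap in the spam predictors and target
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Fit the model on 20%, 40%, ..., 100% of each training fold under 5-fold CV
train_sizes, train_scores, test_scores = learning_curve(
    SVC(kernel='rbf', C=1.0), X_demo, y_demo,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5)

# Average across folds: one mean score per training set size
mean_train = train_scores.mean(axis=1)
mean_test = test_scores.mean(axis=1)
for n, tr, te in zip(train_sizes, mean_train, mean_test):
    print(n, round(tr, 3), round(te, 3))
```

Plotting `mean_train` and `mean_test` against `train_sizes` with `plt.plot` gives the learning curve; a persistent gap between the two lines suggests overfitting.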