# DAT 19: Homework 4 Assignment - SVMs, Trees, RF

## Instructions

The goal of this homework is to review and bring together what we have learned about Support Vector Machines, Decision Trees, Random Forests, and ensembles. 

Please do all your analysis to answer the questions below in this Jupyter notebook. Show your work.

**Please submit your completed notebook by 6:30PM on Wednesday, February 17.**

## About the Data

Use the cancer_uci.csv dataset in the Data directory of our course repo. This is the [Breast Cancer Wisconsin](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29) dataset from the UCI ML Repository.

## Homework Assignment

**1) Load the data and check for balance between the two classes. If the split is more uneven than 60/40, rebalance the classes to 50/50. We've provided some help here to get you started.**

In [1]:
# imports
import numpy as np
import pandas as pd

from sklearn.svm import SVC 
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
# sklearn.cross_validation was replaced by sklearn.model_selection (scikit-learn 0.18+)
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
canc = pd.read_csv("../data/cancer_uci.csv", index_col=0)
canc.head()

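| | Sample_code_number | Clump_Thickness | Uniformity_of_Cell_Size | Uniformity_of_Cell_Shape | Marginal_Adhesion | Single_Epithelial_Cell_Size | Bare_Nuclei | Bland_Chromatin | Normal_Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | Benign |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | Benign |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | Benign |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | Benign |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | Benign |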


In [3]:
canc.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 699 entries, 0 to 698
Data columns (total 11 columns):
Sample_code_number             699 non-null int64
Clump_Thickness                699 non-null int64
Uniformity_of_Cell_Size        699 non-null int64
Uniformity_of_Cell_Shape       699 non-null int64
Marginal_Adhesion              699 non-null int64
Single_Epithelial_Cell_Size    699 non-null int64
Bare_Nuclei                    699 non-null object
Bland_Chromatin                699 non-null int64
Normal_Nucleoli                699 non-null int64
Mitoses                        699 non-null int64
Class                          699 non-null object
dtypes: int64(9), object(2)
memory usage: 65.5+ KB


In [4]:
canc.Bare_Nuclei.value_counts()

1     402
10    132
5      30
2      30
3      28
8      21
4      19
?      16
9       9
7       8
6       4
Name: Bare_Nuclei, dtype: int64

In [5]:
canc.Class.value_counts()
# What are the frequencies of each class?

Benign       458
Malignant    241
Name: Class, dtype: int64

The classes are imbalanced (roughly 66% benign to 34% malignant, which is more uneven than 60/40), so we need to decide whether to undersample, keeping only 241 of the benign rows, or oversample, drawing malignant rows with replacement until the two classes are the same size. First, let's convert the labels to binary 0/1 for classification.

In [6]:
canc.Class = canc.Class.map({'Benign':0,'Malignant':1})
canc.Class.value_counts()

0    458
1    241
Name: Class, dtype: int64

In [7]:
# Drop the rows where Bare_Nuclei is missing (coded as '?')
mask = canc.Bare_Nuclei != '?'
canc = canc[mask]
canc.Class.value_counts()

0    444
1    239
Name: Class, dtype: int64

Undersampling would mean discarding almost half of the benign examples (205 of 444), and we don't want to lose that much information. So let's oversample instead. Here is a pattern for oversampling:

In [8]:
# Separate your two classes:
mal_example = canc[canc.Class == 1]
benign_example = canc[canc.Class == 0]

# Oversample the malignant class to have a 50/50 ratio:
mal_over_example = mal_example.sample(len(benign_example), replace=True)

# Recombine the two frames:
over_sample = pd.concat([mal_over_example,benign_example])

# Sanity check the length:
print(len(over_sample))
print(over_sample.Class.value_counts())

888
1    444
0    444
Name: Class, dtype: int64


**2) Are the features normalized? If not, use the scikit-learn standard scaler to normalize them.**

In [9]:
canc.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Sample_code_number | 683 | 1076720.22694 | 620644.047655 | 63375 | 877617 | 1171795 | 1238705 | 13454352 |
| Clump_Thickness | 683 | 4.442167 | 2.820761 | 1 | 2 | 4 | 6 | 10 |
| Uniformity_of_Cell_Size | 683 | 3.150805 | 3.065145 | 1 | 1 | 1 | 5 | 10 |
| Uniformity_of_Cell_Shape | 683 | 3.215227 | 2.988581 | 1 | 1 | 1 | 5 | 10 |
| Marginal_Adhesion | 683 | 2.830161 | 2.864562 | 1 | 1 | 1 | 4 | 10 |
| Single_Epithelial_Cell_Size | 683 | 3.234261 | 2.223085 | 1 | 2 | 2 | 4 | 10 |
| Bland_Chromatin | 683 | 3.445095 | 2.449697 | 1 | 2 | 3 | 5 | 10 |
| Normal_Nucleoli | 683 | 2.869693 | 3.052666 | 1 | 1 | 1 | 4 | 10 |
| Mitoses | 683 | 1.603221 | 1.732674 | 1 | 1 | 1 | 1 | 10 |
| Class | 683 | 0.349927 | 0.477296 | 0 | 0 | 0 | 1 | 1 |


In [10]:
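# Features are every column except the last one (Class).
# This selection still includes Sample_code_number, which is an identifier rather than a measurement.
data = over_sample.iloc[:,:-1]
labels = over_sample.Class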

In [11]:
datax = StandardScaler().fit_transform(data)

**3) Train a linear SVM, using the cross-validated accuracy as the score (use the scikit-learn method).**

In [12]:
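# Hold out 33% of the rows for testing.
# This splits `data`, the unscaled features, so the standardized `datax` from step 2 is not what the model sees.
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.33, random_state=42)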

In [13]:
model = SVC(kernel='linear',C=1).fit(X_train,y_train)

In [22]:
model.score(X_test,y_test)

0.58503401360544216
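
The score above comes from a single train/test split. A cross-validated accuracy, as question 3 asks for, could be sketched with `cross_val_score` as below. This is a minimal sketch, not the course solution: `X_demo` and `y_demo` are synthetic stand-ins generated with `make_classification` (an assumption made so the cell runs without `cancer_uci.csv`), and the five-fold setting is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the nine cytology features (hypothetical data, not the course file)
X_demo, y_demo = make_classification(n_samples=400, n_features=9,
                                     n_informative=5, random_state=0)

# Putting the scaler in the pipeline means each fold fits it on that fold's training rows only
svm_pipeline = make_pipeline(StandardScaler(), SVC(kernel='linear', C=1))

# Five-fold cross-validated accuracy: one score per fold
cv_scores = cross_val_score(svm_pipeline, X_demo, y_demo, cv=5, scoring='accuracy')
print("mean accuracy: %.3f (std %.3f)" % (cv_scores.mean(), cv_scores.std()))
```

With the real data, `X_demo` and `y_demo` would be replaced by the feature frame and the `Class` column.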

**4) Display the confusion matrix, classification report, and AUC.**

In [14]:
z = model.predict(X_test)
z

array([0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1,
       1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0,
       0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1,
       0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1,
       0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0], dtype=int64)

In [15]:
print(classification_report(y_test, z))

             precision    recall  f1-score   support

          0       0.57      0.74      0.64       148
          1       0.62      0.43      0.51       146

avg / total       0.59      0.59      0.58       294



In [16]:
cm = confusion_matrix(y_test, z)

In [17]:
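# In scikit-learn's confusion_matrix, rows are the actual classes and columns are the predicted classes
cm_df = pd.DataFrame(cm, index=['Actual benign', 'Actual malignant'], 
                     columns=['Predicted benign', 'Predicted malignant'])

cm_df

| | Predicted benign | Predicted malignant |
|---|---|---|
| Actual benign | 109 | 39 |
| Actual malignant | 83 | 63 |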


The ROC curve plots the true positive rate against the false positive rate of a binary classifier as its decision threshold is varied. We compare models using the Area Under the Curve (AUC) statistic: an AUC of 1 is a perfect model, 0.5 is no better than random guessing, and anything below 0.5 is worse than guessing.

In [None]:
# display AUC here
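
One way the AUC could be computed is sketched below, using `roc_curve` and `auc` from the imports in the first cell. The curve needs continuous scores rather than the hard 0/1 predictions in `z`, so the sketch takes them from `decision_function`. The data here is synthetic (`make_classification`), and the names `Xd_train`, `Xd_test`, `yd_train`, `yd_test`, and `svm_demo` are hypothetical; none of this reproduces the notebook's own numbers.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical, used only so the example runs on its own)
X_demo, y_demo = make_classification(n_samples=400, n_features=9,
                                     n_informative=5, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

# Fit the scaler on the training rows only, then apply it to both splits
scaler = StandardScaler().fit(Xd_train)
svm_demo = SVC(kernel='linear', C=1).fit(scaler.transform(Xd_train), yd_train)

# Signed distance to the separating hyperplane serves as the ranking score
scores = svm_demo.decision_function(scaler.transform(Xd_test))

# roc_curve sweeps the threshold over the scores; auc integrates the resulting curve
fpr, tpr, thresholds = roc_curve(yd_test, scores)
roc_auc = auc(fpr, tpr)
print("AUC: %.3f" % roc_auc)
```

The `fpr` and `tpr` arrays can be passed straight to a plotting call to draw the curve itself.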

**5) Repeat steps 2 through 4 using a Decision Tree model. Are the results better or worse than the SVM?**

In [24]:
treeclf = DecisionTreeClassifier(max_depth=3, random_state=1)
treeclf.fit(X_train,y_train)


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            random_state=1, splitter='best')
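
To finish this question, the tree needs the same evaluation as the SVM: a confusion matrix, a classification report, and an AUC. A rough sketch of those steps follows, again on synthetic data from `make_classification` rather than the course file; the variable names are hypothetical. Trees split on per-feature thresholds, so the scaling step is generally not needed for them.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical, not cancer_uci.csv)
X_demo, y_demo = make_classification(n_samples=400, n_features=9,
                                     n_informative=5, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

# Same hyperparameters as the tree in the cell above
tree_demo = DecisionTreeClassifier(max_depth=3, random_state=1).fit(Xd_train, yd_train)
tree_pred = tree_demo.predict(Xd_test)

# Rows are actual classes, columns are predicted classes
tree_cm = confusion_matrix(yd_test, tree_pred)
print(tree_cm)
print(classification_report(yd_test, tree_pred))

# Column 1 of predict_proba is the estimated probability of the positive class
tree_auc = roc_auc_score(yd_test, tree_demo.predict_proba(Xd_test)[:, 1])
print("AUC: %.3f" % tree_auc)
```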

**6) Repeat steps 2 through 4 using a Random Forest model. Are the results better or worse than the SVM?**

In [19]:
#your code here
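
A possible starting point for the Random Forest is sketched below. It assumes synthetic data from `make_classification` and an arbitrary choice of 100 trees; neither comes from the assignment. Besides the cross-validated accuracy, the fitted forest exposes `feature_importances_`, which can help when comparing it with the single tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (hypothetical, not cancer_uci.csv)
X_demo, y_demo = make_classification(n_samples=400, n_features=9,
                                     n_informative=5, random_state=0)

# 100 trees is an illustrative setting, not a tuned value
forest_demo = RandomForestClassifier(n_estimators=100, random_state=1)

# Five-fold cross-validated accuracy
forest_scores = cross_val_score(forest_demo, X_demo, y_demo, cv=5, scoring='accuracy')
print("mean accuracy: %.3f" % forest_scores.mean())

# Fit once on all rows to inspect which features the forest relies on
forest_demo.fit(X_demo, y_demo)
importances = forest_demo.feature_importances_
print(importances)
```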

### Extra Credit Questions
**The following questions are strongly encouraged, but not required for this homework assignment.**

**7) Combine the SVM and the Decision Tree model using the [Voting Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html). Are the results better than either of these base classifiers alone?**

In [20]:
#your code here
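
The combination could look roughly like the sketch below, which scores the SVM, the tree, and a hard-voting ensemble of the two under the same five-fold cross-validation. The data is synthetic and the names `svm_demo`, `tree_demo`, and `voter` are hypothetical. Hard voting is used because `SVC` does not produce class probabilities unless it is created with `probability=True`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical, not cancer_uci.csv)
X_demo, y_demo = make_classification(n_samples=400, n_features=9,
                                     n_informative=5, random_state=0)

# The SVM gets its own scaling step; the tree works on the raw features
svm_demo = make_pipeline(StandardScaler(), SVC(kernel='linear', C=1))
tree_demo = DecisionTreeClassifier(max_depth=3, random_state=1)

# Majority vote over the predicted labels of the two base models
voter = VotingClassifier(estimators=[('svm', svm_demo), ('tree', tree_demo)],
                         voting='hard')

# Score each base model and the ensemble with the same cross-validation
results = {}
for name, clf in [('svm', svm_demo), ('tree', tree_demo), ('vote', voter)]:
    results[name] = cross_val_score(clf, X_demo, y_demo, cv=5,
                                    scoring='accuracy').mean()
    print("%s: %.3f" % (name, results[name]))
```

With only two voters, every disagreement is a tie, and scikit-learn resolves ties in favor of the lower class label. Adding a third model or switching to soft voting avoids that.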

**8) Train an SVM using the RBF kernel. Is this model better or worse?**

In [21]:
#your code here
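
A side-by-side comparison of the two kernels might be sketched as follows. The data is synthetic, `C=1` matches the linear model above, and `gamma` is left at the library default; these are illustrative assumptions, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical, not cancer_uci.csv)
X_demo, y_demo = make_classification(n_samples=400, n_features=9,
                                     n_informative=5, random_state=0)

# Score both kernels with the same scaling step and the same folds
kernel_scores = {}
for kernel in ['linear', 'rbf']:
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1))
    kernel_scores[kernel] = cross_val_score(clf, X_demo, y_demo, cv=5,
                                            scoring='accuracy').mean()
    print("%s: %.3f" % (kernel, kernel_scores[kernel]))
```

The RBF kernel is sensitive to both `C` and `gamma`, so a fair comparison would usually search over those values rather than rely on a single setting.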