# DAT 19: Homework 4 Assignment - SVMs, Trees, RF

## Instructions

The goal of this homework is to review and bring together what we have learned about Support Vector Machines, Decision Trees, Random Forests, and ensembles. 

Please do all your analysis to answer the questions below in this Jupyter notebook. Show your work.

**Please submit your completed notebook by 6:30PM on Wednesday, February 17.**

## About the Data

Use the cancer_uci.csv dataset in the Data directory of our course repo. This is the [Breast Cancer Wisconsin](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original)) dataset from the UCI ML Repository.

## Homework Assignment

**1) Load the data and check for balance between the two classes. If the split is more uneven than 60/40, rebalance the classes to 50/50. We've provided some help here to get you started.**

In [1]:
import pandas as pd
canc = pd.read_csv("../data/cancer_uci.csv", index_col=0)
canc.head()
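print(canc.Class.value_counts())
# The first positional argument of value_counts is `normalize`, so any truthy
# value (the original call passed 'Benign') returns proportions instead of counts.
print(canc.Class.value_counts(normalize=True))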

Benign       458
Malignant    241
Name: Class, dtype: int64
Benign       0.655222
Malignant    0.344778
Name: Class, dtype: float64
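
The proportions above can be reproduced on a toy column with the same class counts. This is a minimal sketch; the `labels` Series is a stand-in built for illustration, not the real data file.

```python
import pandas as pd

# Toy column with the same class counts as the output above (illustrative only).
labels = pd.Series(["Benign"] * 458 + ["Malignant"] * 241, name="Class")

counts = labels.value_counts()                     # absolute counts
proportions = labels.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(proportions.round(6))
```

Passing `normalize=True` by keyword makes the intent explicit and avoids relying on a truthy positional argument.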


The classes are imbalanced (roughly 66/34), so we need to decide whether to undersample, keeping only 241 rows from the Benign class, or oversample, drawing Malignant rows with replacement until the two classes match. First, let's convert the labels to binary 1/0 for classification.

In [2]:
canc.Class = canc.Class.map({'Benign':0,'Malignant':1})
canc.Class.value_counts()

0    458
1    241
Name: Class, dtype: int64

To undersample, we would throw away almost half of our benign examples, which would greatly alter our dataset and we don't want to lose that much info. So let's oversample! Here is a pattern for how to oversample:

In [3]:
# Separate your two classes:
mal_example = canc[canc.Class == 1]
benign_example = canc[canc.Class == 0]

# Oversample the malignant class to have a 50/50 ratio:
mal_over_example = mal_example.sample(458,replace=True)

# Recombine the two frames:
over_sample = pd.concat([mal_over_example,benign_example])

# Sanity check the length:
print(len(over_sample))
over_sample.Class.value_counts()

916


1    458
0    458
Name: Class, dtype: int64
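
One caveat with this pattern: rows duplicated by oversampling can land in both the training and the test split if the split happens afterwards, which tends to inflate test scores. A hedged sketch of the alternative order (split first, then oversample only the training part) is below. The `toy` frame and its `feature` column are hypothetical stand-ins, not the cancer data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the cancer data: 20 benign (0) and 10 malignant (1) rows.
toy = pd.DataFrame({"feature": range(30), "Class": [0] * 20 + [1] * 10})

# Split first so that no test row is ever duplicated into the training set.
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["Class"], random_state=0
)

# Oversample the minority class inside the training split only.
train_mal = train[train["Class"] == 1]
train_ben = train[train["Class"] == 0]
train_mal_over = train_mal.sample(len(train_ben), replace=True, random_state=0)
train_balanced = pd.concat([train_mal_over, train_ben])

print(train_balanced["Class"].value_counts())
print("rows shared with test:", set(train_balanced.index) & set(test.index))
```

The test split keeps the original class ratio, so its metrics describe performance on data distributed like the source file.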

In [4]:
over_sample.info()

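# 'Bare_Nuclei' has 17 instances of the non-numeric placeholder '?',
# which is why pandas reads the column as object instead of int64.
over_sample['Bare_Nuclei'].value_counts()

# The '?' rows could be dropped or imputed (e.g. with the column mean); here they are dropped.
# .copy() avoids a SettingWithCopyWarning on the assignment below.
over_sample = over_sample[over_sample['Bare_Nuclei'] != '?'].copy()
over_sample['Bare_Nuclei'] = over_sample['Bare_Nuclei'].astype(float)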

#over_sample['Bare_Nuclei'].replace(to_replace='nan')
#over_sample['Bare_Nuclei'] = over_sample['Bare_Nuclei'].fillna(over_sample['Bare_Nuclei'].mean())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 916 entries, 184 to 695
Data columns (total 11 columns):
Sample_code_number             916 non-null int64
Clump_Thickness                916 non-null int64
Uniformity_of_Cell_Size        916 non-null int64
Uniformity_of_Cell_Shape       916 non-null int64
Marginal_Adhesion              916 non-null int64
Single_Epithelial_Cell_Size    916 non-null int64
Bare_Nuclei                    916 non-null object
Bland_Chromatin                916 non-null int64
Normal_Nucleoli                916 non-null int64
Mitoses                        916 non-null int64
Class                          916 non-null int64
dtypes: int64(10), object(1)
memory usage: 85.9+ KB
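
For comparison, the two options mentioned in the cell (dropping versus imputing) can be sketched on a toy column. The values below are made up for illustration, and the median is just one possible fill value.

```python
import pandas as pd

# Toy column mimicking Bare_Nuclei: numeric strings plus the '?' placeholder.
bare = pd.Series(["1", "10", "?", "3", "?", "2"], name="Bare_Nuclei")

# Coerce to numbers; anything unparseable (here '?') becomes NaN.
bare_num = pd.to_numeric(bare, errors="coerce")

dropped = bare_num.dropna()                   # option 1: discard incomplete rows
imputed = bare_num.fillna(bare_num.median())  # option 2: fill with the median

print(len(dropped), imputed.tolist())
```

Dropping is simpler; imputing keeps every row, at the cost of inventing values for the missing entries.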


**2) Are the features normalized? If not, use the scikit-learn standard scaler to normalize them.**

In [5]:
#your code here
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
bc_feature_cols = over_sample.drop(['Class','Sample_code_number'], axis=1)
bc_features = scaler.fit_transform(bc_feature_cols)
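
To answer the "are the features normalized?" part directly, one can compare column means and standard deviations before and after scaling. The sketch below uses random integers on a 1 to 10 scale as an assumed stand-in for the feature columns.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy features on a 1-10 scale (illustrative stand-in, not the real columns).
rng = np.random.default_rng(0)
raw = rng.integers(1, 11, size=(200, 3)).astype(float)

scaled = StandardScaler().fit_transform(raw)

# Raw columns: means and spreads are far from 0 and 1.
print(raw.mean(axis=0), raw.std(axis=0))
# Scaled columns: means are approximately 0, standard deviations approximately 1.
print(scaled.mean(axis=0), scaled.std(axis=0))
```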

**3) Train a linear SVM, using the cross validated accuracy as the score (use the scikit-learn method).**

In [6]:
from sklearn import svm
# sklearn.cross_validation was removed in scikit-learn 0.20; these helpers live in model_selection
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

#bc_features = over_sample.drop(['Class','Sample_code_number'], axis=1)
bc_target = over_sample['Class']

X_train, X_test, y_train, y_test = train_test_split(bc_features, bc_target, test_size=.20, random_state=0)


svc_classifier = svm.SVC(kernel='linear', C=1)
svc_clf_fit = svc_classifier.fit(X_train,y_train)
predictions = svc_clf_fit.predict(X_test)


cv_accuracy_score = cross_val_score(svc_classifier, bc_features, bc_target, cv=10).mean()
cv_accuracy_score

0.97452854261843025
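
The score above comes from features that were scaled once on the full data set. A possible refinement, sketched here on synthetic data from `make_classification` (an assumption, not the cancer file), is to put the scaler and the SVM in a pipeline so the scaler is refit inside every fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with nine features (not the real data).
X, y = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=0
)

# The scaler is refit on the training part of every fold, so the held-out
# part never influences the scaling statistics.
model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1))
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")

print(scores.mean(), scores.std())
```

Reporting the standard deviation alongside the mean gives a sense of how much the accuracy varies from fold to fold.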

**4) Display the confusion matrix, classification report, and AUC.**

In [7]:
#your code here
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

print(classification_report(y_test,predictions))
print(roc_auc_score(y_test,predictions))
cm = confusion_matrix(y_test,predictions)
# confusion_matrix returns rows = actual classes, columns = predicted classes
cm_df = pd.DataFrame(cm, index=['Actual Benign', 'Actual Malignant'],
                     columns=['Predicted Benign', 'Predicted Malignant'])
cm_df

             precision    recall  f1-score   support

          0       0.97      0.98      0.97        93
          1       0.98      0.97      0.97        88

avg / total       0.97      0.97      0.97       181

0.972201857283


| | Predicted Benign | Predicted Malignant |
|---|---|---|
| Actual Benign | 91 | 2 |
| Actual Malignant | 3 | 85 |
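
A note on the AUC figure: `roc_auc_score` is computed here from hard 0/1 predictions, which collapses the ROC curve to a single threshold. Feeding it continuous scores (for an SVC these would typically come from `decision_function`) uses the full ranking. The hand-made numbers below are purely illustrative and show that the two can differ.

```python
from sklearn.metrics import roc_auc_score

# Hand-made example: true labels and continuous scores for four samples.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.6, 0.35, 0.8]

# Thresholding at 0.5 discards the ranking information within each group.
hard_labels = [int(s > 0.5) for s in scores]

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, hard_labels)
print(auc_from_scores, auc_from_labels)  # 0.75 0.5
```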


**5) Repeat steps 2 through 4 using a Decision Tree model. Are the results better or worse than the SVM?**

In [8]:
#your code here
from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier(max_depth=3, random_state=1)
tree_clf_fit = tree_clf.fit(X_train, y_train)
tree_predictions = tree_clf_fit.predict(X_test)

tree_cross_val_accuracy = cross_val_score(tree_clf, bc_features, bc_target, cv=10).mean()
tree_cross_val_accuracy

0.95789795722379978

In [9]:
print(classification_report(y_test,tree_predictions))
print(roc_auc_score(y_test, tree_predictions))
cm = confusion_matrix(y_test,tree_predictions)
# confusion_matrix returns rows = actual classes, columns = predicted classes
cm_df = pd.DataFrame(cm, index=['Actual Benign', 'Actual Malignant'],
                     columns=['Predicted Benign', 'Predicted Malignant'])
cm_df

             precision    recall  f1-score   support

          0       0.98      0.94      0.96        93
          1       0.93      0.98      0.96        88

avg / total       0.96      0.96      0.96       181

0.95637829912


| | Predicted Benign | Predicted Malignant |
|---|---|---|
| Actual Benign | 87 | 6 |
| Actual Malignant | 2 | 86 |


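Results are worse than the SVM on cross-validated accuracy (0.958 vs. 0.975) and AUC (0.956 vs. 0.972), but the Decision Tree was not tuned. "Better or worse" is also a bit of a trick question: in my opinion, we care more about false negatives (predicted benign, actually malignant) than false positives (predicted malignant, actually benign). By that measure the tree holds up on this split: malignant recall is 0.98 vs. 0.97 for the SVM, at the cost of more false positives (benign recall 0.94 vs. 0.98).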
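
Since the argument above hinges on telling false negatives from false positives, it helps to pull them out of the matrix by name. In scikit-learn's `confusion_matrix`, rows are actual classes and columns are predicted classes; the tiny label lists below are made up to show the unpacking.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Tiny hand-made example: 0 = benign, 1 = malignant.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print("false positives:", fp, "false negatives:", fn)
print("malignant recall:", recall_score(y_true, y_pred))  # tp / (tp + fn)
```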

**6) Repeat steps 2 through 4 using a Random Forest model. Are the results better or worse than the SVM?**

In [10]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(max_depth=5, n_estimators=100, max_features=3)
rf_clf_fit = rf_clf.fit(X_train,y_train)

rf_predictions = rf_clf_fit.predict(X_test)

rf_cross_val_accuracy = cross_val_score(rf_clf, bc_features, bc_target, cv=10).mean()
rf_cross_val_accuracy


0.98341825465420984

In [11]:
print(classification_report(y_test,rf_predictions))
print(roc_auc_score(y_test, rf_predictions))
cm = confusion_matrix(y_test,rf_predictions)
# confusion_matrix returns rows = actual classes, columns = predicted classes
cm_df = pd.DataFrame(cm, index=['Actual Benign', 'Actual Malignant'],
                     columns=['Predicted Benign', 'Predicted Malignant'])
cm_df

             precision    recall  f1-score   support

          0       1.00      0.98      0.99        93
          1       0.98      1.00      0.99        88

avg / total       0.99      0.99      0.99       181

0.989247311828


| | Predicted Benign | Predicted Malignant |
|---|---|---|
| Actual Benign | 91 | 2 |
| Actual Malignant | 0 | 88 |


Better than the Decision Tree model on both counts. Compared with the SVM, the Random Forest has the same number of false positives (2) but no false negatives (malignant recall 1.00 vs. 0.97), which is the error type we care about most here.
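
The forest above is built without a `random_state`, so its scores may shift slightly between runs. A minimal sketch of a reproducible version follows, again on synthetic `make_classification` data (an assumption); it also prints the feature importances, which sum to 1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with nine features (not the real data).
X, y = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=0
)

# Fixing random_state makes the bootstrap samples and feature subsets repeatable.
rf = RandomForestClassifier(
    n_estimators=100, max_depth=5, max_features=3, random_state=0
)

first = cross_val_score(rf, X, y, cv=10).mean()
second = cross_val_score(rf, X, y, cv=10).mean()
print(first, second)  # identical: both the forest and the folds are deterministic

rf.fit(X, y)
print(rf.feature_importances_)
```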

### Extra Credit Questions
**The following questions are strongly encouraged, but not required for this homework assignment.**

**7) Combine the SVM and the Decision Tree model using the [Voting Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html). Are the results better than either of these base classifiers alone?**

In [14]:
from sklearn.ensemble import VotingClassifier

clf1 = tree_clf 
clf2 = svc_classifier 

voting_clf = VotingClassifier(estimators=[('Decision Tree', clf1), ('SVM', clf2)], voting='hard')
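# Caution: this fits on the full data set, which includes X_test, so the
# test-set metrics below are optimistic. Fitting on X_train, y_train instead
# would give a clean hold-out estimate.
voting_clf_fit = voting_clf.fit(bc_features, bc_target)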
vot_predictions = voting_clf_fit.predict(X_test)
voting_cross_val_accuracy = cross_val_score(voting_clf, bc_features, bc_target, cv=10).mean()

print(str((voting_cross_val_accuracy)))
print(classification_report(y_test,vot_predictions))
print(roc_auc_score(y_test, vot_predictions))
cm = confusion_matrix(y_test,vot_predictions)
# confusion_matrix returns rows = actual classes, columns = predicted classes
cm_df = pd.DataFrame(cm, index=['Actual Benign', 'Actual Malignant'],
                     columns=['Predicted Benign', 'Predicted Malignant'])
cm_df

0.970096308186
             precision    recall  f1-score   support

          0       0.97      0.98      0.97        93
          1       0.98      0.97      0.97        88

avg / total       0.97      0.97      0.97       181

0.972201857283


| | Predicted Benign | Predicted Malignant |
|---|---|---|
| Actual Benign | 91 | 2 |
| Actual Malignant | 3 | 85 |


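Same test-set results as the SVM model (identical classification report and AUC) and better than the Decision Tree, although the cross-validated accuracy (0.970) is slightly below the SVM's 0.975.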
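
With only two estimators and `voting='hard'`, every disagreement is a tie, and scikit-learn resolves ties by picking the class that sorts first (here class 0). The sketch below checks that behavior on synthetic data; the data set and the split are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with nine features (not the real data).
X, y = make_classification(
    n_samples=400, n_features=9, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

vote = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=1)),
        ("svm", SVC(kernel="linear", C=1)),
    ],
    voting="hard",
)
vote.fit(X_train, y_train)  # fit on the training split only

tree_pred = vote.named_estimators_["tree"].predict(X_test)
svm_pred = vote.named_estimators_["svm"].predict(X_test)
vote_pred = vote.predict(X_test)

# Where the two models disagree, the tie goes to the lower class label (0).
disagree = tree_pred != svm_pred
print(int(disagree.sum()), "disagreements, resolved as:", set(vote_pred[disagree].tolist()))
```

In a setting where false negatives matter most, a tie-break toward class 0 (benign) is worth keeping in mind; an odd number of estimators avoids ties altogether.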

**8) Train an SVM using the RBF kernel. Is this model better or worse?**

In [15]:
rbf_svc_classifier = svm.SVC(C=1)
rbf_svc_clf_fit = rbf_svc_classifier.fit(X_train,y_train)
rbf_predictions = rbf_svc_clf_fit.predict(X_test)


rbf_cv_accuracy_score = cross_val_score(rbf_svc_classifier, bc_features, bc_target, cv=10).mean()

print(str(rbf_cv_accuracy_score))
print(confusion_matrix(y_test,rbf_predictions))
print(classification_report(y_test,rbf_predictions))
print(roc_auc_score(y_test, rbf_predictions))
cm = confusion_matrix(y_test,rbf_predictions)
# confusion_matrix returns rows = actual classes, columns = predicted classes
cm_df = pd.DataFrame(cm, index=['Actual Benign', 'Actual Malignant'],
                     columns=['Predicted Benign', 'Predicted Malignant'])
cm_df

0.974528542618
[[90  3]
 [ 2 86]]
             precision    recall  f1-score   support

          0       0.98      0.97      0.97        93
          1       0.97      0.98      0.97        88

avg / total       0.97      0.97      0.97       181

0.972507331378


| | Predicted Benign | Predicted Malignant |
|---|---|---|
| Actual Benign | 90 | 3 |
| Actual Malignant | 2 | 86 |


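Essentially on par with the linear SVM and the Voting Classifier. The cross-validated accuracy matches the linear SVM (0.975) and the AUC is marginally higher (0.9725 vs. 0.9722). The test set again shows five errors in total, but with one fewer false negative (2 vs. 3) and one more false positive (3 vs. 2).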
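
The RBF model above uses default settings apart from `C=1`. A hedged sketch of how `C` and `gamma` could be tuned with a grid search is shown below. The data is synthetic and the grid values are arbitrary choices for illustration, so the selected parameters say nothing about the cancer data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with nine features (not the real data).
X, y = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=0
)

# Scaling sits inside the pipeline so it is refit within each CV fold.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Step names come from make_pipeline (lowercased class names).
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, search.best_score_)
```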