# DAT 19: Homework 4 Assignment - SVMs, Trees, RF

## Instructions

The goal of this homework is to review and bring together what we have learned about Support Vector Machines, Decision Trees, Random Forests, and ensembles. 

Please do all your analysis to answer the questions below in this Jupyter notebook. Show your work.

**Please submit your completed notebook by 6:30PM on Wednesday, February 17.**

## About the Data

Use the cancer_uci.csv dataset in the Data directory of our course repo. This is the [Breast Cancer Wisconsin](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original)) dataset from the UCI ML Repository.

## Homework Assignment

**1) Load the data and check for balance between the two classes. If the classes are more imbalanced than 60/40, rebalance them to 50/50. We've provided some help here to get you started.**

In [44]:
import pandas as pd
import numpy as np

canc = pd.read_csv("../data/cancer_uci.csv", index_col=0)
canc.head()
canc.Class.value_counts()
# What are the frequencies of each class?

Benign       458
Malignant    241
Name: Class, dtype: int64
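
The balance check can be made explicit by turning the counts into shares. The sketch below hard-codes the two counts printed above rather than reading the CSV, and it reads the 60/40 threshold as "the majority class should hold no more than 60% of the rows", which is an interpretation of the prompt rather than something the assignment spells out.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"Benign": 458, "Malignant": 241})

# Share of each class in the full dataset.
shares = counts / counts.sum()
print(shares.round(3))

# Assumed reading of the threshold: rebalance when the majority share exceeds 60%.
needs_rebalancing = bool(shares.max() > 0.60)
print(needs_rebalancing)
```

On the real frame, `canc.Class.value_counts(normalize=True)` should give the same shares directly.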

The classes are imbalanced (roughly 66/34), so we need to decide whether to undersample, keeping only 241 rows from the Benign category, or oversample, artificially inflating the volume of Malignant data by drawing rows with replacement. First, let's convert the labels to binary 1/0 for classification.

In [45]:
canc.Class = canc.Class.map({'Benign':0,'Malignant':1})
canc.Class.value_counts()

0    458
1    241
Name: Class, dtype: int64

To undersample, we would throw away almost half of our benign examples (217 of 458). That would greatly alter our dataset, and we don't want to lose that much information. So let's oversample! Here is a pattern for how to oversample:

In [46]:
# Separate your two classes:
mal_example = canc[canc.Class == 1]
benign_example = canc[canc.Class == 0]

# Oversample the malignant class to have a 50/50 ratio:
mal_over_example = mal_example.sample(458,replace=True)

# Recombine the two frames:
over_sample = pd.concat([mal_over_example,benign_example])

# Sanity check the length:
print(len(over_sample))

916
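
One caveat with this pattern: when rows are duplicated before the data is split, copies of the same malignant row can end up in both the training and the test set, which tends to make test scores look better than they are. A possible alternative, sketched here on a hypothetical stand-in frame (`toy` and its `feature` column are made up for illustration), is to split first and oversample only the training part.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for `canc`: 458 benign (0) and 241 malignant (1) rows.
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    "feature": rng.randint(1, 11, size=699),
    "Class": [0] * 458 + [1] * 241,
})

# Split first, keeping the class ratio the same in both parts.
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["Class"], random_state=1
)

# Oversample the minority class inside the training part only.
train_mal = train[train["Class"] == 1]
train_ben = train[train["Class"] == 0]
train_mal_over = train_mal.sample(len(train_ben), replace=True, random_state=1)
train_balanced = pd.concat([train_mal_over, train_ben])

print(train_balanced["Class"].value_counts())
```

The test part keeps the original class ratio and shares no rows with the balanced training frame.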


**2) Are the features normalized? If not, use the scikit-learn standard scaler to normalize them.**

In [47]:
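# review column dtypes and non-null counts
over_sample.info()

# convert Bare_Nuclei to numeric; entries that cannot be parsed become NaN
over_sample['Bare_Nuclei'] = pd.to_numeric(over_sample['Bare_Nuclei'], errors='coerce')

# fill the resulting nulls with a single random draw from a normal distribution
# (mean 4.636, standard deviation 3.962, taken from the describe() call below);
# note that the same drawn value is used for every missing entry
# over_sample.Bare_Nuclei[over_sample.Bare_Nuclei.notnull()].describe()
over_sample['Bare_Nuclei'] = over_sample['Bare_Nuclei'].fillna(np.random.normal(4.636, 3.962))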

<class 'pandas.core.frame.DataFrame'>
Int64Index: 916 entries, 263 to 695
Data columns (total 11 columns):
Sample_code_number             916 non-null int64
Clump_Thickness                916 non-null int64
Uniformity_of_Cell_Size        916 non-null int64
Uniformity_of_Cell_Shape       916 non-null int64
Marginal_Adhesion              916 non-null int64
Single_Epithelial_Cell_Size    916 non-null int64
Bare_Nuclei                    916 non-null object
Bland_Chromatin                916 non-null int64
Normal_Nucleoli                916 non-null int64
Mitoses                        916 non-null int64
Class                          916 non-null int64
dtypes: int64(10), object(1)
memory usage: 85.9+ KB
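
Note that `info()` reports `Bare_Nuclei` as fully non-null even though its dtype is `object`: placeholder strings count as present values until the column is converted. The sketch below illustrates this on a small made-up series; the `'?'` placeholder is an assumption about how the file marks missing values, and median imputation is shown as one simple alternative, not as the approach this notebook takes.

```python
import numpy as np
import pandas as pd

# Hypothetical column that mixes digit strings with a '?' placeholder.
bare_nuclei = pd.Series(["1", "10", "?", "4", "?", "2"])

# As strings, nothing looks missing.
print(bare_nuclei.isnull().sum())

# After conversion, unparseable entries such as '?' become NaN.
numeric = pd.to_numeric(bare_nuclei, errors="coerce")
n_missing = int(numeric.isnull().sum())

# Impute with the median of the observed values (1, 2, 4, 10).
filled = numeric.fillna(numeric.median())
print(n_missing)
print(filled.tolist())
```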




In [48]:
# scale the features
from sklearn.preprocessing import StandardScaler

# assign features and target
features = over_sample.drop(['Class', 'Sample_code_number'], axis = 1)
target = over_sample.Class

# standardize the features
scaler = StandardScaler()
scal_features = scaler.fit_transform(features)
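
To answer the "are the features normalized?" part of the question, it can help to compare column means and standard deviations before and after scaling. This is a minimal sketch on made-up integer features in the 1 to 10 range (an assumption standing in for the real columns); `raw` and `scaled` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical raw features on a 1-10 integer scale.
rng = np.random.RandomState(0)
raw = rng.randint(1, 11, size=(100, 3)).astype(float)

scaled = StandardScaler().fit_transform(raw)

# Before scaling, column means sit far from 0 and spreads differ from 1.
print(raw.mean(axis=0).round(2), raw.std(axis=0).round(2))

# After scaling, each column has mean ~0 and population standard deviation ~1.
print(scaled.mean(axis=0).round(2), scaled.std(axis=0).round(2))
```

On the notebook's data, `features.describe()` gives the same before-scaling view.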

**3) Train a linear SVM, using the cross validated accuracy as the score (use the scikit-learn method).**

In [49]:
# import relevant libraries (sklearn.cross_validation was replaced by sklearn.model_selection)
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

# hold out 20% of the scaled features as a test set
svc_X_train, svc_X_test, svc_y_train, svc_y_test = train_test_split(scal_features, target, test_size=0.2, random_state=1)

# fit a linear SVM on the training split
model_svc = SVC(C=1, kernel='linear', probability=True).fit(svc_X_train, svc_y_train)

# mean 5-fold cross-validated accuracy on the full scaled data
print(cross_val_score(SVC(C=1, kernel='linear', probability=True), scal_features, target, cv=5).mean())

0.978225035834
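
Because the scaler here was fit on all rows before cross-validation, each validation fold has already influenced the scaling. One common way to avoid that is to wrap the scaler and the model in a pipeline so the scaler is refit on the training folds of every split. The sketch below uses synthetic data from `make_classification` as a stand-in for the real features, so its score says nothing about this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the nine features and the binary target.
X, y = make_classification(n_samples=300, n_features=9, random_state=1)

# The scaler is refit on the training folds of every split.
pipe = make_pipeline(StandardScaler(), SVC(C=1, kernel="linear"))
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```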


**4) Display the confusion matrix, classification report, and AUC.**

In [50]:
# import relevant libraries
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve
from sklearn.metrics import auc

# confusion matrix and classification report on the held-out test set
svc_preds = model_svc.predict(svc_X_test)
print(metrics.confusion_matrix(svc_y_test, svc_preds))
print(classification_report(svc_y_test, svc_preds))

# use the predicted probability of the positive class (column 1) to compute the AUC
model_svc_predicted_proba = model_svc.predict_proba(svc_X_test)
svc_fpr, svc_tpr, svc_thresholds = roc_curve(svc_y_test, model_svc_predicted_proba[:, 1])
svc_roc_auc = auc(svc_fpr, svc_tpr)
print(svc_roc_auc)

[[91  4]
 [ 0 89]]
             precision    recall  f1-score   support

          0       1.00      0.96      0.98        95
          1       0.96      1.00      0.98        89

avg / total       0.98      0.98      0.98       184

0.994559432289
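
The numbers in the classification report can be recomputed by hand from the confusion matrix, which is a useful check on how to read it. In scikit-learn's layout, rows are true classes and columns are predicted classes, so the matrix above unpacks as shown in this short sketch.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows are true classes (0, 1), columns are predicted classes (0, 1).
cm = np.array([[91, 4],
               [0, 89]])
tn, fp, fn, tp = cm.ravel()

precision_malignant = tp / float(tp + fp)   # 89 / 93
recall_malignant = tp / float(tp + fn)      # 89 / 89
recall_benign = tn / float(tn + fp)         # 91 / 95
accuracy = (tp + tn) / float(cm.sum())      # 180 / 184

print(round(precision_malignant, 2), round(recall_malignant, 2))
print(round(recall_benign, 2), round(accuracy, 3))
```

The four benign cases predicted as malignant are false positives; no malignant case was missed in this test split.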


**5) Repeat steps 2 through 4 using a Decision Tree model. Are the results better or worse than the SVM?**

In [51]:
# train and predict using Decision Tree model
from sklearn.tree import DecisionTreeClassifier

# train the model and make predictions using scal_features
tree_X_train, tree_X_test, tree_y_train, tree_y_test = train_test_split(scal_features, target, test_size = 0.2, random_state=1)

model_tree = DecisionTreeClassifier(random_state=1)
model_tree.fit(tree_X_train, tree_y_train)

print(cross_val_score(model_tree, scal_features, target, cv=5).mean()) # cross validated score

# print confusion matrix and classification report
tree_preds = model_tree.predict(tree_X_test)
print(metrics.confusion_matrix(tree_y_test, tree_preds))
print(classification_report(tree_y_test, tree_preds))

# calculate decision tree predicted probability of test features and use to generate AUC
model_tree_predicted_proba = model_tree.predict_proba(tree_X_test)
tree_fpr, tree_tpr, tree_thresholds = roc_curve(tree_y_test, model_tree_predicted_proba[:, 1])
tree_roc_auc = auc(tree_fpr, tree_tpr)
print(tree_roc_auc)

0.962924032489
[[92  3]
 [ 1 88]]
             precision    recall  f1-score   support

          0       0.99      0.97      0.98        95
          1       0.97      0.99      0.98        89

avg / total       0.98      0.98      0.98       184

0.978592548788


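Based on the above scores, the Decision Tree model does not perform quite as well as the SVM model: its cross-validated accuracy is lower (0.963 vs. 0.978) and so is its test-set AUC (0.979 vs. 0.995), although the two classification reports are nearly identical.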

**6) Repeat steps 2 through 4 using a Random Forest model. Are the results better or worse than the SVM?**

In [52]:
# train and predict using Random Forest model
from sklearn.ensemble import RandomForestClassifier

# train the model and make predictions using scal_features
forest_X_train, forest_X_test, forest_y_train, forest_y_test = train_test_split(scal_features, target, test_size = 0.2, random_state=1)

# initialize a random forest classifier and fit it to the training set
model_forest = RandomForestClassifier(random_state=1)
model_forest.fit(forest_X_train, forest_y_train)

print(cross_val_score(model_forest, scal_features, target, cv=5).mean()) # cross validated score

# print confusion matrix and classification report
forest_preds = model_forest.predict(forest_X_test)
print(metrics.confusion_matrix(forest_y_test, forest_preds))
print(classification_report(forest_y_test, forest_preds))

# calculate random forest predicted probability of test features and use to generate AUC
model_forest_predicted_proba = model_forest.predict_proba(forest_X_test)
forest_fpr, forest_tpr, forest_thresholds = roc_curve(forest_y_test, model_forest_predicted_proba[:, 1])
forest_roc_auc = auc(forest_fpr, forest_tpr)
print(forest_roc_auc)

0.981473960822
[[92  3]
 [ 0 89]]
             precision    recall  f1-score   support

          0       1.00      0.97      0.98        95
          1       0.97      1.00      0.98        89

avg / total       0.98      0.98      0.98       184

0.993554109994


Based on the above scores, the Random Forest model performs slightly better than the SVM model: its cross-validated accuracy is slightly higher (0.981 vs. 0.978) and its test-set AUC is in line (0.994 vs. 0.995).
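
Collecting the three models' scores in one table makes the comparison easier to read. The values below are copied from the printed outputs above and rounded to four decimals; the `summary` frame and its column names are just illustrative.

```python
import pandas as pd

# Scores copied from the printed outputs above (rounded to four decimals).
summary = pd.DataFrame(
    {
        "cv_accuracy": [0.9782, 0.9629, 0.9815],
        "test_auc": [0.9946, 0.9786, 0.9936],
    },
    index=["Linear SVM", "Decision Tree", "Random Forest"],
)

print(summary.sort_values("cv_accuracy", ascending=False))

# Which model leads on each metric?
best_cv = summary["cv_accuracy"].idxmax()
best_auc = summary["test_auc"].idxmax()
print(best_cv, best_auc)
```

The gaps between the SVM and the Random Forest are small enough that a different random seed or split could plausibly reverse their order.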

### Extra Credit Questions
**The following questions are strongly encouraged, but not required for this homework assignment.**

**7) Combine the SVM and the Decision Tree model using the [Voting Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html). Are the results better than either of these base classifiers alone?**

In [53]:
#your code here
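
As a starting point, here is one possible shape for the comparison, sketched on synthetic data from `make_classification` rather than on the cancer features, so the printed scores are not results for this homework. Soft voting is an assumption chosen here because, with only two voters, hard voting would often tie; it needs `probability=True` on the SVM.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled features and the binary target.
X, y = make_classification(n_samples=300, n_features=9, random_state=1)

svc = SVC(C=1, kernel="linear", probability=True, random_state=1)
tree = DecisionTreeClassifier(random_state=1)

# Soft voting averages the predicted class probabilities of the two models.
vote = VotingClassifier(estimators=[("svc", svc), ("tree", tree)], voting="soft")

# Mean 5-fold cross-validated accuracy for each base model and for the ensemble.
results = {}
for name, model in [("svc", svc), ("tree", tree), ("vote", vote)]:
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(name, results[name])
```

Swapping `X, y` for `scal_features, target` would apply the same comparison to the notebook's data.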

**8) Train an SVM using the RBF kernel. Is this model better or worse?**

In [54]:
#your code here
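
A minimal sketch of how the two kernels could be compared under identical conditions follows. It again uses synthetic data from `make_classification` as a stand-in, so the scores do not answer the question for the cancer dataset; `C=1` and the default `gamma` are assumptions, and both would normally be tuned for the RBF kernel.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the features and the binary target.
X, y = make_classification(n_samples=300, n_features=9, random_state=1)

# Same folds, same scaling, same C; only the kernel changes.
results = {}
for kernel in ["linear", "rbf"]:
    model = make_pipeline(StandardScaler(), SVC(C=1, kernel=kernel))
    results[kernel] = cross_val_score(model, X, y, cv=5).mean()

print(results)
```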