## Challenge: Decision Tree vs. Random Forest

Pick a dataset. It could be one you've worked with before or it could be a new one. Then build the best decision tree you can.

Now try to match that performance with the simplest random forest you can. For our purposes, measure simplicity by runtime, and compare the forest's runtime to the decision tree's. Runtime is an imperfect proxy for simplicity, but it will do here.
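Since a single timing can be noisy, one way to make the runtime comparison a little more robust is to repeat the measurement and keep the median. The sketch below is only an illustration: `median_runtime` is a hypothetical helper that does not appear in the notebook, and the toy workload stands in for a cross-validation call.

```python
import statistics
import time


def median_runtime(fn, repeats=5):
    """Call fn() `repeats` times and return the median wall-clock time in seconds.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Toy workload standing in for a call such as cross_val_score(...).
elapsed = median_runtime(lambda: sum(i * i for i in range(100_000)))
print(f"median runtime: {elapsed:.4f} seconds")
```

In the cells below, the same idea would apply by wrapping each `cross_val_score` call in a function and passing it to the helper.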

In [2]:
import pandas as pd
import numpy as np
import scipy
import matplotlib.pyplot as plt
import time
%matplotlib inline

### Dataset: Diagnostic Wisconsin Breast Cancer Dataset

__Source:__ https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

In [4]:
df = pd.read_csv('breast_cancer_data.csv')
df.head()

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | NaN |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | NaN |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | NaN |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | NaN |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | NaN |


In [3]:
#check how balanced outcome is
df['diagnosis'].value_counts()

B    357
M    212
Name: diagnosis, dtype: int64
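To put the accuracy scores further down in context, it can help to know what a model that always predicts the majority class would score. A minimal sketch, using only the class counts printed above (the variable names are illustrative and not part of the notebook):

```python
# Class counts copied from the value_counts() output above.
counts = {"B": 357, "M": 212}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of always predicting the majority class.
baseline_accuracy = counts[majority_label] / total

print(f"majority class: {majority_label}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
```

Any classifier worth keeping should clear this baseline by a comfortable margin.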

In [3]:
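#classes are mildly imbalanced: 357 benign vs. 212 malignant (about 63% / 37%)
#data is also pretty clean out of the box

#remove last column, which contains only NaNs
#(the positional axis argument to drop() is no longer accepted in pandas 2.0+)
df = df.drop(columns='Unnamed: 32')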

#change diagnosis to 0 (benign) or 1 (malignant)
df['diagnosis'] = df['diagnosis'].map(lambda x: 1 if x == 'M' else 0)

#check data types and change anything necessary
#df.dtypes

## Decision Tree

In [15]:
from sklearn import tree
from sklearn.model_selection import cross_val_score
from IPython.display import Image
import pydotplus
import graphviz

#set data and target
#note: X still contains the 'id' column, which is an identifier rather than a measurement
X = df.drop(columns='diagnosis')
Y = df['diagnosis']

#start timer (time.clock() was removed in Python 3.8; perf_counter() measures wall-clock time)
start_time = time.perf_counter()

decision_tree = tree.DecisionTreeClassifier(
    criterion='entropy',
    max_features=1,
    max_depth=12
)

#10-fold cross-validation
cv = 10
scores_tree = cross_val_score(decision_tree,X,Y,cv=cv)

print('score array:\n', scores_tree)
print('\nruntime:\n',time.perf_counter() - start_time, "seconds")
print('\nscore array mean:\n', np.mean(scores_tree))
print('\nscore array std dev:\n', np.std(scores_tree))

score array:
 [0.86206897 0.87931034 0.87719298 0.9122807  0.94736842 0.94736842
 0.89473684 0.92857143 0.92857143 0.94642857]

runtime:
 0.029609999999999914 seconds

score array mean:
 0.9123898107337307

score array std dev:
 0.030506110135382954


## Random Forest

In [17]:
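#import before starting the timer so import time is not counted
from sklearn import ensemble

#start timer (time.clock() was removed in Python 3.8)
start_time = time.perf_counter()

#default settings; the default n_estimators is 10 before scikit-learn 0.22 and 100 from 0.22 on
rfc = ensemble.RandomForestClassifier()

scores_rfc = cross_val_score(rfc,X,Y,cv=cv)

print('score array:\n', scores_rfc)
print('\nruntime:\n',time.perf_counter() - start_time, "seconds")
print('\nscore array mean:\n', np.mean(scores_rfc))
print('\nscore std dev:\n', np.std(scores_rfc))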

score array:
 [0.96551724 0.9137931  0.94736842 0.94736842 1.         0.96491228
 0.94736842 0.96428571 0.94642857 0.96428571]

runtime:
 0.2122989999999998 seconds

score array mean:
 0.9561327888687234

score std dev:
 0.02083190642065962
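The two score arrays can also be compared fold by fold. With an integer `cv` and a classifier, `cross_val_score` uses unshuffled stratified folds, so both models should have been evaluated on the same splits. The sketch below simply copies the printed numbers from the two outputs above; nothing is re-run.

```python
import statistics

# Fold scores copied from the two cross_val_score outputs above.
scores_tree = [0.86206897, 0.87931034, 0.87719298, 0.9122807, 0.94736842,
               0.94736842, 0.89473684, 0.92857143, 0.92857143, 0.94642857]
scores_rfc = [0.96551724, 0.9137931, 0.94736842, 0.94736842, 1.0,
              0.96491228, 0.94736842, 0.96428571, 0.94642857, 0.96428571]

# Per-fold accuracy gain of the forest over the single tree.
gains = [rf - dt for rf, dt in zip(scores_rfc, scores_tree)]
folds_won = sum(g > 0 for g in gains)
mean_gain = statistics.mean(gains)

# Runtimes copied from the printed output (seconds).
runtime_tree = 0.029609999999999914
runtime_rfc = 0.2122989999999998
runtime_ratio = runtime_rfc / runtime_tree

print(f"forest wins {folds_won} of {len(gains)} folds")
print(f"mean gain: {mean_gain:.4f}")
print(f"runtime ratio (forest / tree): {runtime_ratio:.1f}x")
```

This is a single run on one dataset, so the fold-wise gains are a description of this experiment rather than a general guarantee.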


## Conclusion

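- In this run, the Random Forest is more accurate: mean 10-fold accuracy of 0.956 versus 0.912 for the single Decision Tree

- RF scores also vary less across folds (standard deviation 0.021 versus 0.031)

- RF takes about seven times longer to run (0.212 s versus 0.030 s)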
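One way to look for the "simplest" forest is to sweep `n_estimators` and record accuracy and runtime for each size. The following is a rough sketch under stated assumptions, not the notebook's method: it uses scikit-learn's bundled copy of the Wisconsin diagnostic data in place of `breast_cancer_data.csv`, and `sweep_forest_sizes`, the size grid, `cv=5`, and the seed are illustrative choices.

```python
import time

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Assumption: the bundled dataset stands in for breast_cancer_data.csv.
# Its target is coded 0 = malignant, 1 = benign (the reverse of the
# notebook's mapping) and it has no 'id' column.
X, y = load_breast_cancer(return_X_y=True)


def sweep_forest_sizes(X, y, sizes=(1, 2, 5, 10), cv=5, seed=0):
    """Return (n_estimators, mean CV accuracy, runtime in seconds) per size.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    results = []
    for n in sizes:
        model = RandomForestClassifier(n_estimators=n, random_state=seed)
        start = time.perf_counter()
        scores = cross_val_score(model, X, y, cv=cv)
        results.append((n, scores.mean(), time.perf_counter() - start))
    return results


results = sweep_forest_sizes(X, y)
for n, mean_score, seconds in results:
    print(f"n_estimators={n:>3}  accuracy={mean_score:.3f}  runtime={seconds:.3f}s")
```

The smallest size whose accuracy matches the decision tree's would then be a candidate for the simplest forest, with its runtime as the figure to compare.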