# Solution: Decision Trees

## Read in the Breast Cancer Dataset, __`data/bcan.csv`__
* look out for missing data, and remember to tell __`pandas`__ how it is encoded (here, as `?`)
* at this point you might want to simply drop the missing data

In [None]:
import pandas as pd
data = pd.read_csv('data/bcan.csv', na_values='?')
data.head()

In [None]:
data.info()

In [None]:
data.dropna(inplace=True)
data.info()
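To see what `na_values='?'` changes, here is a minimal sketch on a made-up three-row CSV. The column names `a` and `b` are hypothetical and not taken from `bcan.csv`. Without the argument, the column holding `?` is read as text; with it, the `?` becomes `NaN` and the column is numeric, so `dropna()` can remove the row.

```python
import io
import pandas as pd

# Hypothetical toy data: one missing value encoded as '?'
csv_text = "a,b\n1,5\n2,?\n3,7\n"

raw = pd.read_csv(io.StringIO(csv_text))                    # '?' kept as text
clean = pd.read_csv(io.StringIO(csv_text), na_values='?')   # '?' parsed as NaN

print(raw.dtypes)
print(clean.dtypes)

# Dropping rows with any missing value removes the '?' row
dropped = clean.dropna()
print(len(clean), '->', len(dropped))
```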

## Drop the __`id`__ since it's not going to be in the model
* also note that __`Diag`__ is the diagnosis, which is what we're going to predict
* ...so we'll need to capture that column and then remove it from the dataframe

In [None]:
# We drop the id because at best it's useless, and at worst it's _leakage_ data
# and of course we drop the target since that's what we're trying to predict...
X = data.drop(columns=['id', 'Diag'])
y = data.Diag

## Remember that __`sklearn`__ wants the features in a 2-d matrix, and the targets in a 1-d array

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

model = DecisionTreeClassifier(max_depth=4)
model_lr = LogisticRegression()
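As a quick check of the 2-d / 1-d requirement, the sketch below uses a small hypothetical dataframe (the names `toy`, `f1`, `f2` and `target` are made up for illustration). Dropping the target column leaves a 2-d dataframe, while selecting a single column gives a 1-d series.

```python
import pandas as pd

# Hypothetical toy frame, not the breast cancer data
toy = pd.DataFrame({'f1': [1, 2, 3], 'f2': [4, 5, 6], 'target': [0, 1, 0]})

X_toy = toy.drop(columns=['target'])   # DataFrame: 2-d (rows, features)
y_toy = toy['target']                  # Series: 1-d (rows,)
print(X_toy.shape, y_toy.shape)

# With a single feature, double brackets keep the result 2-d
one_col = toy[['f1']]
print(one_col.shape)
```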

## Fit the model
* if you get an error here, figure out why

In [None]:
model.fit(X, y)

In [None]:
# Trying another classifier, Logistic Regression
model_lr.fit(X, y)
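One likely cause of an error at this step is a feature column that still contains the `?` placeholder as text. The sketch below reproduces that on a hypothetical four-row frame (`X_bad` and `y_bad` are made-up names); it assumes the placeholder was never converted to `NaN` on read.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: 'f1' was read as text because of the '?' placeholder
X_bad = pd.DataFrame({'f1': ['1', '?', '3', '4'], 'f2': [1.0, 2.0, 3.0, 4.0]})
y_bad = [0, 1, 0, 1]

try:
    DecisionTreeClassifier(max_depth=2).fit(X_bad, y_bad)
    failed = False
except ValueError as err:
    # sklearn cannot convert the string '?' to a float
    failed = True
    print('fit failed:', err)
```

Reading the file with `na_values='?'` and then dropping the affected rows avoids this.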

## Use __`export_graphviz`__ to generate the dot file
* __`feature_names`__ should be the list of column headers
* __`class_names`__ should be the targets

In [None]:
# Let's take a look at the decision tree...
from sklearn.tree import export_graphviz
export_graphviz(model, out_file="bcan_tree.dot",
               feature_names=list(X.columns),
               class_names=['benign', 'malignant'],
               rounded=True,
               filled=True)

## Generate a PNG image from the dot file and open it

In [None]:
!dot -Tpng bcan_tree.dot -o bcan_tree.png
from IPython.display import Image
Image('bcan_tree.png')
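If the Graphviz `dot` command is not installed, `sklearn.tree.export_text` gives a plain-text view of the same kind of tree. This is a minimal sketch on synthetic data from `make_classification`, with made-up feature names `f0` to `f3`; it is not the breast cancer tree.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (an assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=4,
                                     n_informative=2, n_redundant=0,
                                     random_state=0)
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Hypothetical feature names, one per column
names = ['f0', 'f1', 'f2', 'f3']
report = export_text(demo_tree, feature_names=names)
print(report)
```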

## How well did the model do?

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=100, random_state=123)
model.fit(X_train, y_train)
model.score(X_test, y_test)
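Note that `test_size=100` is an integer, so `train_test_split` holds out exactly 100 rows; a float would instead be read as a fraction of the data. The sketch below shows both forms on a made-up array of 400 rows (`X_rows` and `y_rows` are hypothetical names).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 400 rows, 2 features
X_rows = np.arange(800).reshape(400, 2)
y_rows = np.arange(400) % 2

# Integer test_size: an absolute number of test rows
_, X_abs, _, _ = train_test_split(X_rows, y_rows, test_size=100, random_state=123)

# Float test_size: a fraction of all rows (0.25 of 400)
_, X_frac, _, _ = train_test_split(X_rows, y_rows, test_size=0.25, random_state=123)

print(len(X_abs), len(X_frac))
```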

In [None]:
# Did we do better with Logistic Regression? Score it on the same held-out rows as the tree
model_lr.fit(X_train, y_train)
model_lr.score(X_test, y_test)

## Do we do better if we increase the tree depth?

In [None]:
# yes, at least on the training data; whether the test score improves has to be checked at each depth

## Why might we not want to do that?

In [None]:
# Too deep and we overfit: the tree memorizes the training rows and generalizes worse to new data
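One way to look at the depth question is to compare training and test accuracy across several values of `max_depth`. The sketch below does this on synthetic data with some label noise (`make_classification` with `flip_y=0.1`), which is an assumption for illustration and not the breast cancer data. A fully grown tree typically fits the training rows perfectly while the test score lags behind.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 10% of labels flipped at random
X_syn, y_syn = make_classification(n_samples=600, n_features=10,
                                   n_informative=4, flip_y=0.1,
                                   random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=100, random_state=123)

# Record (train accuracy, test accuracy) per depth; None means no depth limit
results = {}
for depth in [1, 2, 4, 8, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(Xs_train, ys_train)
    results[depth] = (tree.score(Xs_train, ys_train),
                      tree.score(Xs_test, ys_test))
    print(depth, results[depth])
```

The gap between the two numbers at each depth is a rough indicator of overfitting.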

## Determine the feature importances as we did in the demo

In [None]:
model.feature_importances_
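The raw array is easier to read when each value is paired with its column name. A minimal sketch, assuming synthetic data and made-up feature names `f0` to `f4` rather than the breast cancer columns:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data wrapped in a dataframe so the columns have names
X_arr, y_arr = make_classification(n_samples=300, n_features=5,
                                   n_informative=3, n_redundant=0,
                                   random_state=0)
X_imp = pd.DataFrame(X_arr, columns=['f0', 'f1', 'f2', 'f3', 'f4'])

imp_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_imp, y_arr)

# Pair each importance with its column name and sort, largest first
importances = pd.Series(imp_tree.feature_importances_, index=X_imp.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

For the notebook's own model, the same pattern would be `pd.Series(model.feature_importances_, index=X.columns)`.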