## Learning Objectives

Students will be able to:

- Explain how a decision tree is created.
- Build a decision tree model in scikit-learn.
- Interpret a tree diagram.
- Decide whether or not a decision tree is an appropriate model for a given problem.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import plot_confusion_matrix, accuracy_score, precision_score, recall_score, roc_auc_score, f1_score

# allow plots to appear in the notebook
%matplotlib inline
plt.rcParams['figure.figsize'] = (6, 4)
plt.rcParams['font.size'] = 14

In [None]:
# Helper function to print common classification metrics.

def print_metrics(y, preds):
    accuracy = accuracy_score(y, preds)
    precision = precision_score(y, preds)
    recall = recall_score(y, preds)
    # Note: ROC-AUC is computed here from hard class labels, not predicted
    # probabilities, so it reflects a single threshold only.
    roc = roc_auc_score(y, preds)
    f1 = f1_score(y, preds)

    print('Classification metrics:')
    print(f'Accuracy Score:  {accuracy}')
    print(f'Precision Score: {precision}')
    print(f'Recall Score:    {recall}')
    print(f'ROC-AUC Score:   {roc}')
    print(f'F1 Score:        {f1}')

In [None]:
# Read in the data.
path = './data/titanic.csv'
titanic = pd.read_csv(path)

# Encode female as 0 and male as 1.
titanic['Sex'] = titanic.Sex.map({'female':0, 'male':1})

# Fill in the missing values for age with the median age.
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())

# Create a DataFrame of dummy variables for Embarked, dropping the first
# column (Embarked_C) so that it serves as the baseline category.
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked', drop_first=True)

# Concatenate the original DataFrame and the dummy DataFrame.
titanic = pd.concat([titanic, embarked_dummies], axis=1)

# Display the first rows of the updated DataFrame.
titanic.head()

In [None]:
# Define X and y.
feature_cols = ['Pclass', 'Sex', 'Age', 'Embarked_Q', 'Embarked_S']

X = titanic[feature_cols]
y = titanic.Survived

## Massively overfit tree with no tuning

In [None]:
tree_n_clf = DecisionTreeClassifier(max_depth=None, random_state=1)
tree_n_clf.fit(X, y)

This diagram was produced with the `export_graphviz` function from `sklearn.tree`.
![Tree for Titanic data](img/tree_titanic-all.png)
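
The notebook does not show how the diagram was generated, so the following is only a minimal sketch of one way to call `export_graphviz`. It uses the iris dataset as a stand-in (an assumption, so that the example runs without `titanic.csv`), and the names `demo_clf` and `dot_source` are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Stand-in data (assumption): iris replaces the Titanic features here.
iris = load_iris()
demo_clf = DecisionTreeClassifier(max_depth=2, random_state=1)
demo_clf.fit(iris.data, iris.target)

# With out_file=None, export_graphviz returns the DOT source as a string
# instead of writing it to a file.
dot_source = export_graphviz(
    demo_clf,
    out_file=None,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    filled=True,
)

# Show the beginning of the DOT description of the tree.
print(dot_source[:80])
```

The DOT text can be saved to a file and rendered with the Graphviz command-line tool, for example `dot -Tpng tree.dot -o tree.png`.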

In [None]:
plot_confusion_matrix(tree_n_clf, X, y, display_labels=['Died', 'Survived']);

In [None]:
print_metrics(y=y, preds=tree_n_clf.predict(X))
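
The metrics above are computed on the same rows the tree was fitted to, so they cannot reveal overfitting by themselves. One common check is to hold out part of the data and compare scores. The sketch below does this on noisy synthetic data (an assumption; all variable names are hypothetical), not on the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise (assumption, not the Titanic data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, flip_y=0.2, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

# An unrestricted tree keeps splitting until every training leaf is pure.
deep_clf = DecisionTreeClassifier(max_depth=None, random_state=1)
deep_clf.fit(X_train, y_train)

train_acc = accuracy_score(y_train, deep_clf.predict(X_train))
test_acc = accuracy_score(y_test, deep_clf.predict(X_test))

print(f'Training accuracy: {train_acc:.3f}')
print(f'Held-out accuracy: {test_acc:.3f}')
```

A large gap between the two numbers is the usual sign that the tree has memorized noise in the training set.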

## Lightly tuned tree

In [None]:
tree_3_clf = DecisionTreeClassifier(max_depth=3, random_state=1)
tree_3_clf.fit(X, y)

This diagram was produced with the `export_graphviz` function from `sklearn.tree`.
![Tree for Titanic data](img/tree_titanic-3.png)

In [None]:
plot_confusion_matrix(tree_3_clf, X, y, display_labels=['Died', 'Survived']);

In [None]:
print_metrics(y, tree_3_clf.predict(X))
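
Here `max_depth=3` is set by hand. A more systematic way to pick the depth is to compare cross-validated scores for several candidates. This is a minimal sketch on synthetic data (an assumption); the candidate depths and the name `depth_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (assumption, not the Titanic data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, flip_y=0.2, random_state=1
)

# Mean 5-fold cross-validated accuracy for each candidate depth.
depth_scores = {}
for depth in [1, 2, 3, 5, None]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=1)
    scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring='accuracy')
    depth_scores[depth] = scores.mean()

for depth, score in depth_scores.items():
    print(f'max_depth={depth}: mean CV accuracy = {score:.3f}')
```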

In [None]:
# Compute the feature importances: each feature's normalized total reduction
# in Gini impurity across the splits that use it.

pd.DataFrame({
    'feature': feature_cols,
    'importance': tree_3_clf.feature_importances_,
}).sort_values(by='importance', ascending=False)
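
To make the quantity behind these importances concrete, the sketch below computes Gini impurity and the impurity decrease of a single split by hand. It is an illustration with hypothetical helper names (`gini_impurity`, `impurity_decrease`) and toy labels, not scikit-learn's implementation, which additionally weights each split by the share of samples reaching the node and normalizes the totals to sum to 1.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum over classes of p_k squared."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (
        len(left) / n * gini_impurity(left)
        + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children

# Toy node with three samples of each class, split perfectly in two.
parent = np.array([0, 0, 0, 1, 1, 1])
left = np.array([0, 0, 0])
right = np.array([1, 1, 1])

print(gini_impurity(parent))                   # 0.5
print(impurity_decrease(parent, left, right))  # 0.5
```

A split that separates the classes perfectly removes all of the parent's impurity, which is why the decrease equals the parent's own Gini value here.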