# CART / Decision Trees

Our first notebook will look at the Classification and Regression Trees (CART) implementation in the `scikit-learn` library.

CART is a fancy way of saying "a decision tree," with some work happening behind the scenes to find the best solution given available information.  In practice, this generates a series of if-else statements which lead us to a single decision.
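To make the if-else idea concrete, here is a minimal hand-written sketch of what a small fitted tree amounts to. The feature names match the dataset used later in this notebook, but the function name `predict_by_hand` and every threshold are made up for illustration; none of them come from a trained model.

```python
def predict_by_hand(thall: float, cp: float, age: float) -> int:
    """Toy decision tree written as nested if-else statements.

    All thresholds are hypothetical and only show the shape of the logic;
    a real tree learns its splits from the training data.
    """
    if thall <= 2.5:
        # Left branch: one more question about chest pain type
        if cp <= 0.5:
            return 0
        return 1
    else:
        # Right branch: one more question about age
        if age <= 55.0:
            return 1
        return 0


print(predict_by_hand(thall=2, cp=0, age=40))
```

Each path from the first `if` down to a `return` corresponds to one leaf of the tree.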


In [None]:
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Create an untrained decision tree classifier with default settings
clf = tree.DecisionTreeClassifier()

## Load Data

For this demo, we will load a dataset of individuals and whether they have a high chance of heart attack (output = 1).

In [None]:
heart_attack_data = "../data/HeartAttackData.csv"
df = pd.read_csv(heart_attack_data, header=0)

# Review the data
df

These measures aren't very self-explanatory, so let's explain them here.

- `age` = Age of patient
- `sex` = Sex of the patient (0 = female, 1 = male)
- `cp` = Type of chest pain.
  - 1 = Typical angina
  - 2 = Atypical angina
  - 3 = Non-anginal pain
  - 4 = Asymptomatic
- `trtbps` = Resting blood pressure (mm Hg)
- `chol` = Serum cholesterol level (mg/dl)
- `fbs` = Fasting blood sugar above 120 mg/dl (1 = true, 0 = false)
- `restecg` = Resting ECG result
  - 0 = Normal
  - 1 = ST-T wave abnormality
  - 2 = Probable or definite left ventricular hypertrophy
- `thalachh` = Maximum heart rate achieved
- `exng` = Exercise-induced angina (1 = yes, 0 = no)
- `oldpeak` = ST depression induced by exercise relative to rest
- `slp` = Slope of the peak exercise ST segment
- `caa` = Number of major vessels (0-3)
- `thall` = Thallium stress test result (ranges from 0-3)
- `output` = Diagnosis of heart disease (0 = < 50% diameter narrowing, 1 = > 50% diameter narrowing)

Let's double-check the distinct values for the `output` feature:

In [None]:
df[['output']].drop_duplicates()
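Beyond the distinct values, it can help to know how often each one occurs, since a heavily imbalanced label makes raw accuracy harder to interpret. The sketch below uses a small made-up `toy_output` series rather than the real data; the same `value_counts()` call should work on `df['output']`.

```python
import pandas as pd

# Made-up labels standing in for df['output']
toy_output = pd.Series([1, 0, 1, 1, 0, 1], name="output")

# Absolute count of rows per class
counts = toy_output.value_counts()

# Share of rows per class (fractions that sum to 1)
shares = toy_output.value_counts(normalize=True)

print(counts)
print(shares)
```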

As a quick note, the implementation of CART that scikit-learn (sklearn) uses requires all inputs be numeric features.  Fortunately, this dataset happens to include only numeric features, so we don't need to do any special processing.
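If you want to verify the all-numeric claim rather than take it on faith, a check along these lines is one option. The toy frame and its `smoker` column are invented for this example and are not part of the heart attack dataset; `pd.get_dummies` is shown as one common way to one-hot encode a text column, not the only one.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy frame: `smoker` is a made-up text column
toy = pd.DataFrame({"age": [63, 41, 57], "smoker": ["yes", "no", "yes"]})

# Collect the columns a scikit-learn tree could not use directly
non_numeric = [col for col in toy.columns if not is_numeric_dtype(toy[col])]
print(non_numeric)

# One-hot encode the offending columns so every feature is numeric
encoded = pd.get_dummies(toy, columns=non_numeric)
print(encoded.columns.tolist())
```

Running the same list comprehension against `df` should return an empty list for this dataset.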

## Split Labels from Features

Let's now create two variables:  `y`, which is the thing we want to predict (output: `{ 0, 1 }`); and `X`, which is everything we can use to predict the specific value of `y`.

Selecting columns in pandas like this preserves row order, so the rows of `X` and `y` stay aligned. That is something we might have to worry about if we split the data up in SQL, where row order is not guaranteed without an `ORDER BY` clause.

In [None]:
y = df['output']
X = df.drop('output', axis=1)

## Split into Training & Test Datasets

The sklearn library has a function called `train_test_split` which breaks our data out into training and test datasets.  This allows us to train a model on one set of data and then see how it would perform on a completely different set of data.  This gives us a better idea of how our model might perform than simply using accuracy from the training dataset, as models tend to **overfit**:  they latch onto the peculiarities of the training dataset.  If those peculiarities do not also exist in the broader population, then the trained model may come up with the wrong answer.

Having a separate test dataset that the trained model knows nothing about gives us a better idea of realistic behavior.  It also allows us to measure how much overfitting the trained model does, as we can compare the training accuracy to the test accuracy; if there is a substantial difference between the two, our model is overfitting quite a bit.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1740)
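To see what the split produces, here is a small sketch on synthetic data. The frame `X_demo`, the series `y_demo`, and the feature names `f1` to `f3` are made up for illustration; only the `test_size` and `random_state` values mirror the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart attack data: 100 rows, 3 made-up features
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.Series(rng.integers(0, 2, size=100), name="output")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1740
)

# With 100 rows and test_size=0.30, 70 rows go to training and 30 to test
print(X_tr.shape, X_te.shape)

# Features and labels are shuffled together, so their indexes still match
print(X_tr.index.equals(y_tr.index))
```

If the classes are imbalanced, passing `stratify=y` to `train_test_split` keeps the class proportions roughly equal across the two sets.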

## Perform Classification

We'll train the model on our training data, ignoring the test data for now.  With sklearn, this is easy:  use the `fit()` method.

In [None]:
clf = clf.fit(X_train, y_train)
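Behind that single call, the tree repeatedly searches for the split that leaves the purest groups. By default, `DecisionTreeClassifier` scores purity with Gini impurity. The sketch below shows the idea for a single feature; it is a simplified illustration with made-up helper names (`gini`, `best_threshold`) and toy data, not scikit-learn's actual implementation, which is optimized and considers every feature.

```python
import numpy as np


def gini(labels: np.ndarray) -> float:
    """Gini impurity: 1 minus the sum of squared class proportions."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return float(1.0 - np.sum(proportions ** 2))


def best_threshold(feature: np.ndarray, labels: np.ndarray):
    """Try midpoints between sorted unique values of one feature.

    Returns (threshold, weighted impurity) for the best split, or
    (None, impurity of the unsplit data) if no split improves on it.
    """
    values = np.unique(feature)
    best = (None, gini(labels))
    for low, high in zip(values[:-1], values[1:]):
        threshold = float((low + high) / 2.0)
        left = labels[feature <= threshold]
        right = labels[feature > threshold]
        # Weight each side's impurity by its share of the rows
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, float(score))
    return best


# Made-up toy data: the classes separate cleanly between 2.0 and 3.0
toy_feature = np.array([1.0, 2.0, 3.0, 4.0])
toy_labels = np.array([0, 0, 1, 1])
print(best_threshold(toy_feature, toy_labels))  # (2.5, 0.0)
```

A full tree applies this search to every feature, picks the best split overall, and then recurses into each resulting group.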

## How'd we do?

Let's use the `accuracy_score` function in sklearn to see just how well we did.

In [None]:
predicted = clf.predict(X_test)
accuracy_score(y_test, predicted)

Looks like we predicted the correct answer about 75.8% of the time.  Not too bad for a few lines of work!
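Test accuracy alone does not tell us how much the tree overfits; for that, we compare it against training accuracy, as described in the section on splitting the data. Here is a hedged sketch of that comparison on synthetic data from `make_classification`, since the exact numbers for the heart attack data depend on the file. The helper `accuracy_gap` is a made-up name, and `max_depth=3` is an arbitrary value chosen to show one way of limiting tree growth.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the heart attack dataset
X_syn, y_syn = make_classification(n_samples=300, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.30, random_state=0
)


def accuracy_gap(model):
    """Fit the model, then return (train accuracy, test accuracy, gap)."""
    model.fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, model.predict(X_tr))
    test_acc = accuracy_score(y_te, model.predict(X_te))
    return train_acc, test_acc, train_acc - test_acc


# An unrestricted tree versus one limited to three levels of splits
full = accuracy_gap(DecisionTreeClassifier(random_state=0))
shallow = accuracy_gap(DecisionTreeClassifier(max_depth=3, random_state=0))

print("full tree    (train, test, gap):", full)
print("shallow tree (train, test, gap):", shallow)
```

To run the same check on this notebook's model, call `clf.predict(X_train)` and score it against `y_train`, then compare that figure to the test accuracy above.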

## Viewing the Tree

Decision trees are a great starting point for us because they are intuitive and we can visualize them easily.  Let's use a different library to visualize our resulting tree and see which factors the algorithm focused on.

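In order to run this section, you will need the `graphviz` library.  You can get it from conda or pip.  The pip package provides only the Python interface, so with pip you will also need the Graphviz executables installed on your system: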

`conda install python-graphviz`

`pip install graphviz`

In [None]:
import graphviz
dot_data = tree.export_graphviz(
    clf, out_file=None, feature_names=X.columns.values.tolist(),
    class_names=['No heart attack', 'Heart attack'],
    filled=True, rounded=True, special_characters=True)
graph = graphviz.Source(dot_data)
graph

Nodes in **blue** are those where most cases had heart attacks, and nodes in **orange** are those where most cases had no heart attacks.  The deeper the color, the more one-sided the node; leaf nodes that contain only one class get the strongest shade.  The starting point in our analysis is `thall <= 2.5`, as that seems to split things fairly well by itself.
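If Graphviz is not available, scikit-learn can also print a tree as indented text with `export_text`, and a fitted tree exposes `feature_importances_` to summarize which features drove the splits. The sketch below uses synthetic data and made-up feature names (`f0` to `f3`), and limits the tree to `max_depth=2` only to keep the printout short.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with made-up feature names
feature_names = ["f0", "f1", "f2", "f3"]
X_syn, y_syn = make_classification(n_samples=200, n_features=4, random_state=0)

# A deliberately shallow tree so the text output stays readable
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_syn, y_syn)

# Indented if-else style rules, one line per split or leaf
rules = export_text(small_tree, feature_names=feature_names)
print(rules)

# Relative contribution of each feature to impurity reduction
importances = dict(zip(feature_names, small_tree.feature_importances_))
print(importances)
```

For this notebook's model, the equivalent call would be `export_text(clf, feature_names=X.columns.tolist())`.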