# Decision Trees

## 1. Import libraries

In [9]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
import sklearn.tree as tree
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

## 2. Load and Explore Data

In [11]:
data_path = '/Users/danielchen/Desktop/GitHub/coursera-machine-learning/Data/drug200.csv'
data = pd.read_csv(data_path)
data.head()

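| | Age | Sex | BP | Cholesterol | Na_to_K | Drug |
|---|---|---|---|---|---|---|
| 0 | 23 | F | HIGH | HIGH | 25.355 | drugY |
| 1 | 47 | M | LOW | HIGH | 13.093 | drugC |
| 2 | 47 | M | LOW | HIGH | 10.114 | drugC |
| 3 | 28 | F | NORMAL | HIGH | 7.798 | drugX |
| 4 | 61 | F | LOW | HIGH | 18.043 | drugY |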


## 3. Pre-processing

As the preview above shows, the categorical variables (`Sex`, `BP`, and `Cholesterol`) are stored as strings. `sklearn`'s decision trees do not support categorical variables natively and expect numeric input, so we need to recode these columns as numbers.

In [13]:
X = data.drop('Drug', axis=1).values
X[0:5]

array([[23, 'F', 'HIGH', 'HIGH', 25.355],
       [47, 'M', 'LOW', 'HIGH', 13.093],
       [47, 'M', 'LOW', 'HIGH', 10.114],
       [28, 'F', 'NORMAL', 'HIGH', 7.798],
       [61, 'F', 'LOW', 'HIGH', 18.043]], dtype=object)

In [17]:
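le = preprocessing.LabelEncoder()

# Columns 1-3 of X hold Sex, BP, and Cholesterol.
column_indexes = list(range(1, 4))
# Possible values of each column. LabelEncoder sorts its classes alphabetically,
# so the codes follow alphabetical order (e.g. HIGH=0, LOW=1, NORMAL=2),
# not the order listed here.
labels = [
    ['F', 'M'],
    ['LOW', 'NORMAL', 'HIGH'],
    ['NORMAL', 'HIGH'],
]

# Refit the encoder on each column's values, then overwrite the column with its codes.
for column_index, label in zip(column_indexes, labels):
    le.fit(label)
    X[:, column_index] = le.transform(X[:, column_index])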


We can confirm that all of the strings have been replaced with numeric values by printing the unique values in each encoded column.

In [30]:
for column_index in column_indexes:
    print(np.unique(X[:, column_index]))

[0 1]
[0 1 2]
[0 1]
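
One caveat about this encoding: `LabelEncoder` sorts its classes alphabetically, so the order of the lists passed to `fit` does not determine the codes. If an explicit order such as LOW < NORMAL < HIGH is wanted, `OrdinalEncoder` with its `categories` parameter is one option. The following is a minimal sketch on a toy column (`toy_bp` is a made-up array for illustration, not the notebook's `X`):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# LabelEncoder stores its classes in sorted order, whatever order fit() receives.
le = LabelEncoder()
le.fit(['LOW', 'NORMAL', 'HIGH'])
label_classes = list(le.classes_)
label_codes = le.transform(['LOW', 'NORMAL', 'HIGH'])

# OrdinalEncoder accepts an explicit category order for each column.
# toy_bp is a hypothetical single-column array, used only for this sketch.
toy_bp = np.array([['LOW'], ['NORMAL'], ['HIGH']], dtype=object)
oe = OrdinalEncoder(categories=[['LOW', 'NORMAL', 'HIGH']])
ordinal_codes = oe.fit_transform(toy_bp).ravel()

print(label_classes, label_codes, ordinal_codes)
```

Whether the order matters depends on the model. A tree splits on thresholds, so an ordered encoding lets a single split separate LOW from NORMAL and HIGH.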


And now we can define the target variable.

In [32]:
y = data.Drug.values

## 4. Train Test Split

Next we split the data into training and testing sets.

In [33]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

In [34]:
data_splits = [X_train, X_test, y_train, y_test]
for split in data_splits:
    print(split.shape)

(140, 5)
(60, 5)
(140,)
(60,)
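
The shapes above follow directly from `test_size=0.3` on 200 rows, and fixing `random_state` makes the split repeatable. A small sketch of both points, using placeholder arrays (`X_demo` and `y_demo` are assumptions that only mimic the shape of the data, not its contents):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the notebook's data (200 rows, 5 features).
X_demo = np.arange(200 * 5).reshape(200, 5)
y_demo = np.arange(200)

# Two calls with identical arguments, including random_state.
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=3)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=3)

# 30% of 200 rows go to the test set, the remaining 70% to the training set.
shapes = [part.shape for part in split_a]
# The same random_state yields the same partition on every call.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

print(shapes, same_split)
```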


## 5. Modeling

We'll create an instance of the `DecisionTreeClassifier` class and specify `entropy` as the criterion so that splits are chosen by information gain. We also cap the tree at a depth of 4. Entropy measures the randomness, or uncertainty, of the class labels in a node. If entropy is 0, every sample in the node belongs to the same class, so the node is pure. If the samples are split evenly between two classes, entropy is 1 bit; more generally, an even split across k classes gives an entropy of log2(k). The more a split lowers entropy, the more certain we are about the class, and the more information we gain.
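
To make these numbers concrete, here is a minimal sketch that computes entropy and information gain by hand. The helper names `entropy` and `information_gain` and the toy label lists are assumptions for illustration; they are not part of `sklearn`, which computes these quantities internally.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # Sum of p * log2(1/p) over the classes that are present.
    return float(np.sum(p * np.log2(1.0 / p)))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

pure = entropy(['drugY'] * 4)                        # a single class
two_way = entropy(['drugX', 'drugX', 'drugY', 'drugY'])  # even split, 2 classes
four_way = entropy(['drugA', 'drugB', 'drugC', 'drugX'])  # even split, 4 classes

# A split that separates the two classes perfectly removes all uncertainty.
gain = information_gain(
    ['drugX', 'drugX', 'drugY', 'drugY'],
    [['drugX', 'drugX'], ['drugY', 'drugY']],
)

print(pure, two_way, four_way, gain)
```

The pure node has entropy 0, the even two-class node has entropy 1, and the even four-class node has entropy 2, which is why the "entropy is 1" rule of thumb applies only to the two-class case.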

In [35]:
drug_tree = DecisionTreeClassifier(criterion='entropy', max_depth=4)
drug_tree

In [36]:
drug_tree.fit(X_train, y_train)

## 6. Prediction

In [39]:
y_hat = drug_tree.predict(X_test)
print(y_test[0:5], y_hat[0:5])

['drugY' 'drugX' 'drugX' 'drugX' 'drugX'] ['drugY' 'drugX' 'drugX' 'drugX' 'drugX']
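
Comparing the first five rows gives only a spot check. One common way to summarize agreement is `accuracy_score` from `sklearn.metrics`. The sketch below applies it to just the five labels printed above, so the result says nothing about the full test set; in the notebook itself you would pass `y_test` and `y_hat` instead.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# The five true and predicted labels printed above, copied by hand.
y_test_head = np.array(['drugY', 'drugX', 'drugX', 'drugX', 'drugX'])
y_hat_head = np.array(['drugY', 'drugX', 'drugX', 'drugX', 'drugX'])

# Fraction of positions where the prediction matches the true label.
head_accuracy = accuracy_score(y_test_head, y_hat_head)
print(head_accuracy)
```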
