# Decision Trees

In [1]:
import pandas as pd
import numpy as np 
from sklearn.tree import DecisionTreeClassifier

In [2]:
my_data = pd.read_csv('drug200.csv')
print(my_data.head())
print(my_data.shape)
print(my_data)

   Age Sex      BP Cholesterol  Na_to_K   Drug
0   23   F    HIGH        HIGH   25.355  drugY
1   47   M     LOW        HIGH   13.093  drugC
2   47   M     LOW        HIGH   10.114  drugC
3   28   F  NORMAL        HIGH    7.798  drugX
4   61   F     LOW        HIGH   18.043  drugY
(200, 6)
     Age Sex      BP Cholesterol  Na_to_K   Drug
0     23   F    HIGH        HIGH   25.355  drugY
1     47   M     LOW        HIGH   13.093  drugC
2     47   M     LOW        HIGH   10.114  drugC
3     28   F  NORMAL        HIGH    7.798  drugX
4     61   F     LOW        HIGH   18.043  drugY
..   ...  ..     ...         ...      ...    ...
195   56   F     LOW        HIGH   11.567  drugC
196   16   M     LOW        HIGH   12.006  drugC
197   52   M  NORMAL        HIGH    9.894  drugX
198   23   M  NORMAL      NORMAL   14.020  drugX
199   40   F     LOW      NORMAL   11.349  drugX

[200 rows x 6 columns]


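This dataset is an example of a multiclass classification problem: each patient record carries one of several drug labels. We can use the training part of the dataset to build a decision tree, and then use the tree to predict the drug class for an unknown patient, that is, to suggest which drug to prescribe to a new patient.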

In [3]:
my_data.columns
x  = my_data[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']].values

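Some of the features in this dataset are categorical (Sex, BP, and Cholesterol). sklearn decision trees cannot handle string-valued categorical variables, so these features must be converted to numerical values.<br>
One option is pandas.get_dummies(), which converts categorical variables to dummy/indicator columns. The next cell instead uses LabelEncoder, which replaces each category with an integer code.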

In [4]:
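from sklearn import preprocessing

# LabelEncoder sorts the class labels before assigning integer codes,
# so the order of the list passed to fit() does not change the mapping.
le_sex = preprocessing.LabelEncoder()
le_sex.fit(['F','M'])                    # F -> 0, M -> 1
x[:,1] = le_sex.transform(x[:,1])

le_BP = preprocessing.LabelEncoder()
le_BP.fit([ 'LOW', 'NORMAL', 'HIGH'])    # HIGH -> 0, LOW -> 1, NORMAL -> 2
x[:,2] = le_BP.transform(x[:,2])

le_chol = preprocessing.LabelEncoder()
le_chol.fit(['NORMAL', 'HIGH'])          # HIGH -> 0, NORMAL -> 1
x[:,3] = le_chol.transform(x[:,3])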

In [5]:
x[0:6]

array([[23, 0, 0, 0, 25.355],
       [47, 1, 1, 0, 13.093],
       [47, 1, 1, 0, 10.114],
       [28, 0, 2, 0, 7.798],
       [61, 0, 1, 0, 18.043],
       [22, 0, 2, 0, 8.607]], dtype=object)
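
The integer codes in the array above follow the sorted order of the labels: for BP, HIGH becomes 0, LOW becomes 1, and NORMAL becomes 2. A minimal sketch for checking this mapping, and for comparing it with one-hot (dummy) encoding, is shown below. The `bp` column is a made-up toy series for illustration, not data from drug200.csv.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column for illustration only (not the drug200.csv data).
bp = pd.Series(['HIGH', 'LOW', 'NORMAL', 'LOW'], name='BP')

# LabelEncoder stores the sorted labels in classes_; the code is the index.
le = LabelEncoder().fit(['LOW', 'NORMAL', 'HIGH'])
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# Integer codes for the toy column.
codes = le.transform(bp)
print(codes.tolist())

# One-hot alternative: one indicator column per category, no implied order.
dummies = pd.get_dummies(bp, prefix='BP')
print(list(dummies.columns))
```

Integer codes imply an ordering (here HIGH < LOW < NORMAL) that has no clinical meaning. A tree can still separate the categories with several threshold splits, but indicator columns avoid the artificial ordering at the cost of more features.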

**Now we can fill the target variable**

In [6]:
y = my_data['Drug']
y

0      drugY
1      drugC
2      drugC
3      drugX
4      drugY
       ...  
195    drugC
196    drugC
197    drugX
198    drugX
199    drugX
Name: Drug, Length: 200, dtype: object

## Setting up the Decision Tree

In [7]:
from sklearn.model_selection import train_test_split

train_test_split returns four arrays. We will name them:<br>
x_trainset, x_testset, y_trainset, y_testset

train_test_split needs the following arguments:<br>
x, y, test_size=0.3, and random_state=3.

x and y are the feature matrix and target vector to be split, test_size is the fraction of rows held out for testing (0.3 of 200 rows gives 60 test rows), and random_state fixes the shuffle seed so that we obtain the same split on every run.

In [8]:
x_trainset, x_testset, y_trainset, y_testset = train_test_split(x, y, test_size =0.3, random_state=3)
print(x_trainset.shape)
print(y_trainset.shape)
print(x_testset.shape)
print(y_testset.shape)
# Ensure the dimensions match!!!

(140, 5)
(140,)
(60, 5)
(60,)
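
To illustrate what random_state does, the sketch below splits a synthetic array twice with the same seed and checks that both splits are identical. `X_demo` and `y_demo` are stand-ins with the same shape as the notebook's data (200 rows, 5 features), not the real values.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and target (200 rows, 5 features).
X_demo = np.arange(1000).reshape(200, 5)
y_demo = np.arange(200)

# Two independent calls with the same random_state.
a_train, a_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=3)
b_train, b_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=3)

print(a_train.shape, a_test.shape)      # same shapes as in the cell above
print(np.array_equal(a_test, b_test))   # identical rows in identical order
```

Without a fixed random_state, each call shuffles differently, so accuracy figures would vary from run to run.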


### Modelling

We will first create an instance of DecisionTreeClassifier called drugtree.
Inside the classifier, specify criterion="entropy" so that each split is chosen by information gain, and max_depth=4 to limit the depth of the tree.

In [9]:
drugtree = DecisionTreeClassifier(criterion = 'entropy', max_depth=4)
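
As a rough illustration of what criterion="entropy" measures, the sketch below computes Shannon entropy and information gain by hand for a toy set of labels. The helper names `entropy` and `information_gain` are hypothetical and written for this example; they are not part of sklearn, and the label lists are made up.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(c) / total * entropy(c) for c in children)
    return entropy(parent) - weighted

# Toy node with two equally frequent classes: entropy is 1 bit.
parent = ['drugX'] * 4 + ['drugY'] * 4

# A split that separates the classes perfectly removes all uncertainty.
pure_gain = information_gain(parent, [['drugX'] * 4, ['drugY'] * 4])

# A split that leaves both children as mixed as the parent gains nothing.
mixed = ['drugX'] * 2 + ['drugY'] * 2
no_gain = information_gain(parent, [mixed, mixed])

print(entropy(parent), pure_gain, no_gain)
```

At each node the tree considers candidate thresholds and keeps the one with the largest gain.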

Next, we will fit the model with the training feature matrix x_trainset and the training response vector y_trainset.

In [10]:
drugtree.fit(x_trainset, y_trainset)

DecisionTreeClassifier(criterion='entropy', max_depth=4)

### Prediction

Make some predictions on the testing dataset and store them in a variable called predtree.<br>
Printing predtree and y_testset lets us visually compare the predictions with the actual values.

In [11]:
predtree = drugtree.predict(x_testset)
print(predtree)
print(y_testset.values)

['drugY' 'drugX' 'drugX' 'drugX' 'drugX' 'drugC' 'drugY' 'drugA' 'drugB'
 'drugA' 'drugY' 'drugA' 'drugY' 'drugY' 'drugX' 'drugY' 'drugX' 'drugX'
 'drugB' 'drugX' 'drugX' 'drugY' 'drugY' 'drugY' 'drugX' 'drugB' 'drugY'
 'drugY' 'drugA' 'drugX' 'drugB' 'drugC' 'drugC' 'drugX' 'drugX' 'drugC'
 'drugY' 'drugX' 'drugX' 'drugX' 'drugA' 'drugY' 'drugC' 'drugY' 'drugA'
 'drugY' 'drugY' 'drugY' 'drugY' 'drugY' 'drugB' 'drugX' 'drugY' 'drugX'
 'drugY' 'drugY' 'drugA' 'drugX' 'drugY' 'drugX']
['drugY' 'drugX' 'drugX' 'drugX' 'drugX' 'drugC' 'drugY' 'drugA' 'drugB'
 'drugA' 'drugY' 'drugA' 'drugY' 'drugY' 'drugX' 'drugY' 'drugX' 'drugX'
 'drugB' 'drugX' 'drugX' 'drugY' 'drugY' 'drugY' 'drugX' 'drugB' 'drugY'
 'drugY' 'drugA' 'drugX' 'drugB' 'drugC' 'drugC' 'drugX' 'drugX' 'drugC'
 'drugY' 'drugX' 'drugX' 'drugX' 'drugA' 'drugY' 'drugC' 'drugY' 'drugA'
 'drugY' 'drugY' 'drugY' 'drugY' 'drugX' 'drugB' 'drugX' 'drugY' 'drugX'
 'drugY' 'drugY' 'drugA' 'drugX' 'drugY' 'drugX']
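
Predicting for a single new patient works the same way, provided the row is encoded with the same fitted encoders, uses the same column order, and is passed as a 2-D array. The sketch below shows the pattern on a made-up two-feature toy set. The names `le_sex_demo`, `toy_tree`, and `new_patient`, and all values, are assumptions for illustration and are not taken from drug200.csv.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy rows (Sex, Na_to_K) and labels, for illustration only.
sex = ['F', 'M', 'M', 'F']
na_to_k = [25.0, 9.0, 18.0, 11.0]
labels = ['drugY', 'drugX', 'drugY', 'drugX']

# Encode Sex once and keep the fitted encoder for later reuse.
le_sex_demo = LabelEncoder().fit(['F', 'M'])
X_toy = np.column_stack([le_sex_demo.transform(sex), na_to_k])

toy_tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
toy_tree.fit(X_toy, labels)

# One new patient: same encoder, same column order, shape (1, n_features).
new_patient = np.array([[le_sex_demo.transform(['F'])[0], 20.0]])
pred = toy_tree.predict(new_patient)
print(pred)
```

In this toy set only Na_to_K separates the two classes, so the tree splits on it and the high Na_to_K value lands in the drugY leaf.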


### Evaluation

Next, we will import metrics from sklearn and check the accuracy of our model.

In [12]:
from sklearn import metrics
import matplotlib.pyplot as plt
print(f"Decision Tree's Accuracy: {metrics.accuracy_score(y_testset, predtree)}")

Decision Tree's Accuracy: 0.9833333333333333
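
Accuracy is the fraction of test rows where the prediction equals the true label; the score above corresponds to 59 correct predictions out of 60. The sketch below reproduces the calculation by hand on toy labels and adds a confusion matrix, which shows which classes are confused with each other. The `y_true` and `y_pred` arrays are made up for illustration.

```python
import numpy as np
from sklearn import metrics

# Toy labels for illustration (not the notebook's test set).
y_true = np.array(['drugY', 'drugX', 'drugX', 'drugC', 'drugY', 'drugA'])
y_pred = np.array(['drugY', 'drugX', 'drugY', 'drugC', 'drugY', 'drugA'])

# Accuracy by hand: mean of the element-wise matches.
manual_acc = float(np.mean(y_true == y_pred))
sk_acc = metrics.accuracy_score(y_true, y_pred)
print(manual_acc, sk_acc)

# Rows are true classes, columns are predicted classes, in the order of `labels`.
labels = ['drugA', 'drugC', 'drugX', 'drugY']
cm = metrics.confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```

The diagonal of the matrix counts correct predictions; the single off-diagonal entry is the drugX row that was predicted as drugY.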


## Visualization

In [13]:
from io import StringIO
import pydotplus
import matplotlib.image as mpimg
from sklearn import tree

In [14]:
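import os
# Machine-specific: this must point to the bin directory of a local Graphviz install.
# If the executables are not found, pydotplus raises an InvocationException when rendering.
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'
dotdata = StringIO()
filename = "drugtree.png"
featureNames = list(my_data.columns[0:5])  # the five feature column names
# export_graphviz writes the DOT source of the fitted tree into dotdata.
tree.export_graphviz(drugtree, feature_names=featureNames, out_file=dotdata,
                     class_names=list(np.unique(y_trainset)),
                     filled=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dotdata.getvalue())
graph.write_png(filename)
img = mpimg.imread(filename)
plt.figure(figsize=(100, 200))
plt.imshow(img,interpolation='nearest')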

InvocationException: GraphViz's executables not found
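
The error above means that pydotplus could not locate the Graphviz binaries on this machine, so no image was produced. One possible workaround, sketched here under the assumption that a reasonably recent scikit-learn (0.21 or later) is installed, is to use sklearn.tree.plot_tree and sklearn.tree.export_text, which need only matplotlib and no Graphviz install. The toy data and the name `demo_tree` are made up so that the block runs on its own; in the notebook, drugtree and the five feature names would be passed instead.

```python
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# Toy data for illustration only: columns are Age and Na_to_K.
X_toy = [[20, 25.0], [30, 10.0], [40, 20.0], [50, 8.0]]
y_toy = ['drugY', 'drugX', 'drugY', 'drugX']
feature_names = ['Age', 'Na_to_K']

demo_tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
demo_tree.fit(X_toy, y_toy)

# Plain-text rendering of the split rules.
rules = tree.export_text(demo_tree, feature_names=feature_names)
print(rules)

# Matplotlib rendering; class_names follow the sorted order in classes_.
fig, ax = plt.subplots(figsize=(6, 4))
annotations = tree.plot_tree(demo_tree, feature_names=feature_names,
                             class_names=list(demo_tree.classes_),
                             filled=True, ax=ax)
```

In a notebook the figure is displayed inline after this cell; `fig.savefig("drugtree.png")` would write it to disk if a file is needed.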