# Decision Trees

This notebook discusses decision trees, a popular classification algorithm.

It's inspired by IBM's Machine Learning with Python course and [this blog](https://careerfoundry.com/en/blog/data-analytics/what-is-a-decision-tree/#4-pros-and-cons-of-decision-trees).

# What is a Decision Tree?

A decision tree is a type of flowchart that shows a clear pathway to a decision. 

In terms of data analytics, it is a type of algorithm that includes conditional ‘control’ statements to classify data. 

A decision tree starts at a single point (or ‘node’) which then branches (or ‘splits’) in two or more directions. Each branch offers different possible outcomes, incorporating a variety of decisions and chance events until a final outcome is achieved. 

When shown visually, their appearance is tree-like…hence the name!
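To make the flowchart idea concrete, here is a minimal sketch of a decision tree written by hand as nested conditionals. The function name, the threshold and the outcomes are hypothetical, chosen only for illustration; a real tree learns its splits from data.

```python
def recommend_drug(na_to_k: float, bp: str) -> str:
    """Walk a tiny hand-written decision tree from the root to a leaf.

    The threshold and outcomes are hypothetical, for illustration only.
    """
    # Root node: split on a numeric feature.
    if na_to_k > 15.0:
        return "drugY"
    # Internal node: split on a categorical feature.
    if bp == "HIGH":
        return "drugA"
    # Leaf reached when neither condition above holds.
    return "drugX"


print(recommend_drug(25.355, "HIGH"))   # drugY
print(recommend_drug(7.798, "NORMAL"))  # drugX
```

Each `if` plays the role of a node, and each `return` is a leaf holding the final outcome.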

## Advantages

* Good for interpreting data in a highly visual way.
* Good for handling a combination of numerical and non-numerical data.
* Easy to define rules, e.g. ‘yes, no, if, then, else…’
* Requires minimal preparation or data cleaning before use.
* Great way to choose between best, worst, and likely case scenarios.
* Can be easily combined with other decision-making techniques.

## Disadvantages

* Overfitting (where a model interprets meaning from irrelevant data) can become a problem if a decision tree’s design is too complex.
* They can be less well suited to continuous variables (i.e. variables that can take any value within a range), since each split reduces such a variable to a threshold test.
* In predictive analysis, calculations can quickly grow cumbersome, especially when a decision path includes many chance variables.
* When using an imbalanced dataset (i.e. where one class of data dominates over another) it is easy for outcomes to be biased in favor of the dominant class.
* Generally, decision trees provide lower prediction accuracy compared to other predictive algorithms.
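The overfitting risk listed above can be illustrated with a short sketch. It assumes purely random, synthetic data (the variable names are hypothetical), so there is nothing real to learn, yet an unrestricted tree still fits the training data perfectly.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(200, 5))     # features carry no signal
y_noise = rng.integers(0, 2, size=200)  # labels are coin flips

deep_tree = DecisionTreeClassifier(random_state=0).fit(X_noise, y_noise)
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_noise, y_noise)

# The unrestricted tree memorises the noise it was trained on.
print("deep tree training accuracy:", deep_tree.score(X_noise, y_noise))
print("deep tree depth:", deep_tree.get_depth())
# Capping the depth stops the tree from memorising every row.
print("shallow tree training accuracy:", shallow_tree.score(X_noise, y_noise))
```

Limiting `max_depth` is one common way to keep a tree's design from becoming too complex.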

# Practical Problem

Imagine that you are a medical researcher compiling data for a study. You have collected data about a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of 5 medications: Drug A, Drug B, Drug C, Drug X, and Drug Y.

Part of your job is to build a model to find out which drug might be appropriate for a future patient with the same illness. The features of this dataset are the patients' Age, Sex, Blood Pressure, Cholesterol, and Na_to_K, and the target is the drug that each patient responded to.

This is an example of a multiclass classification problem. You can use the training part of the dataset to build a decision tree, and then use the tree to predict the class of an unknown patient, that is, to suggest a drug for a new patient.

First, let's import the libraries.

In [1]:
import numpy as np 
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# The Data

Read the data using pandas.

In [2]:
df = pd.read_csv('Datasets/Drug Data.csv', delimiter=",")
df.head()

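| | Age | Sex | BP | Cholesterol | Na_to_K | Drug |
|---|---|---|---|---|---|---|
| 0 | 23 | F | HIGH | HIGH | 25.355 | drugY |
| 1 | 47 | M | LOW | HIGH | 13.093 | drugC |
| 2 | 47 | M | LOW | HIGH | 10.114 | drugC |
| 3 | 28 | F | NORMAL | HIGH | 7.798 | drugX |
| 4 | 61 | F | LOW | HIGH | 18.043 | drugY |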


## Pre-processing

Select the feature columns into `X`, leaving out the target column `Drug`, which is handled separately below.

In [3]:
X = df[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']]
X.head()

| | Age | Sex | BP | Cholesterol | Na_to_K |
|---|---|---|---|---|---|
| 0 | 23 | F | HIGH | HIGH | 25.355 |
| 1 | 47 | M | LOW | HIGH | 13.093 |
| 2 | 47 | M | LOW | HIGH | 10.114 |
| 3 | 28 | F | NORMAL | HIGH | 7.798 |
| 4 | 61 | F | LOW | HIGH | 18.043 |


## Label Encoding

As you may have noticed, some features in this dataset are categorical, such as **Sex** and **BP**.

Unfortunately, scikit-learn's decision trees do not handle categorical (string-valued) features directly.

So we convert them into numerical values using `LabelEncoder` from the `sklearn` library.

In [4]:
from sklearn import preprocessing

# Work on an independent copy so the assignments below do not trigger
# pandas' SettingWithCopyWarning.
X = X.copy()

# LabelEncoder sorts its classes, so the codes are alphabetical:
# F=0, M=1; HIGH=0, LOW=1, NORMAL=2 (BP); HIGH=0, NORMAL=1 (Cholesterol).
le_sex = preprocessing.LabelEncoder()
le_sex.fit(['F', 'M'])
X['Sex'] = le_sex.transform(X['Sex'])

le_BP = preprocessing.LabelEncoder()
le_BP.fit(['LOW', 'NORMAL', 'HIGH'])
X['BP'] = le_BP.transform(X['BP'])

le_Chol = preprocessing.LabelEncoder()
le_Chol.fit(['NORMAL', 'HIGH'])
X['Cholesterol'] = le_Chol.transform(X['Cholesterol'])

X.head()


| | Age | Sex | BP | Cholesterol | Na_to_K |
|---|---|---|---|---|---|
| 0 | 23 | 0 | 0 | 0 | 25.355 |
| 1 | 47 | 1 | 1 | 0 | 13.093 |
| 2 | 47 | 1 | 1 | 0 | 10.114 |
| 3 | 28 | 0 | 2 | 0 | 7.798 |
| 4 | 61 | 0 | 1 | 0 | 18.043 |
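Note that in the encoded output HIGH blood pressure became 0 and NORMAL became 2, even though the encoder was fitted with the list in LOW, NORMAL, HIGH order. The sketch below shows why: `LabelEncoder` sorts its classes. It also shows `OrdinalEncoder` with an explicit category order as one possible alternative when the order matters; that alternative is a suggestion, not something the notebook uses.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

bp_values = ["LOW", "NORMAL", "HIGH"]

# LabelEncoder ignores the order passed to fit() and sorts the classes.
le = LabelEncoder().fit(bp_values)
print(list(le.classes_))                 # ['HIGH', 'LOW', 'NORMAL']
print(le.transform(bp_values).tolist())  # [1, 2, 0]

# OrdinalEncoder accepts an explicit category order per column.
oe = OrdinalEncoder(categories=[["LOW", "NORMAL", "HIGH"]])
bp_column = np.array(bp_values, dtype=object).reshape(-1, 1)
encoded = oe.fit_transform(bp_column)
print(encoded.ravel().tolist())          # [0.0, 1.0, 2.0]
```

For a decision tree, which only compares values against thresholds, either encoding can work, but an order-preserving one lets a single split separate LOW from the rest.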


Now we can define the target variable.


In [5]:
y = df["Drug"]
y.head()

0    drugY
1    drugC
2    drugC
3    drugX
4    drugY
Name: Drug, dtype: object
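Since imbalanced classes were listed as a weakness of decision trees, it can be worth checking the class distribution of the target before modeling. The sketch below uses only the five target values shown in the output above (the name `y_sample` is hypothetical); on the real data the same call would be `y.value_counts()`.

```python
import pandas as pd

# The five target values shown in the output above.
y_sample = pd.Series(["drugY", "drugC", "drugC", "drugX", "drugY"], name="Drug")

counts = y_sample.value_counts()
print(counts)                 # number of patients per drug
print(counts / counts.sum())  # class proportions
```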

## Split into train/test

We will use a __train/test split__ to separate the data into training and testing sets.

First, let's import the function.

In [6]:
from sklearn.model_selection import train_test_split

We will split the data in a 70/30 train/test ratio, with the random state set to 3 so that the split is reproducible.

In [7]:
X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=0.3, random_state=3)
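As a side note, `train_test_split` also accepts a `stratify` argument, which keeps the class proportions the same in both subsets. The sketch below is only an illustration on hypothetical stand-in data (`X_demo`, `y_demo`); the notebook itself uses the plain split above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 200 rows, 5 features, imbalanced labels.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(200, 5))
y_demo = np.array(["drugY"] * 100 + ["drugX"] * 60 + ["drugC"] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=3, stratify=y_demo
)

print(X_tr.shape, X_te.shape)  # (140, 5) (60, 5)

# With stratify, each class keeps its share of the data in the test set.
test_counts = {str(k): int(v) for k, v in zip(*np.unique(y_te, return_counts=True))}
print(test_counts)
```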

# Modeling

To build the decision tree, we use the `DecisionTreeClassifier` from `sklearn`.

We set the criterion to entropy rather than the default Gini impurity. In practice the two usually give similar trees; entropy is slightly more expensive to compute, but time and space are not an issue here.

In [8]:
drugTree = DecisionTreeClassifier(criterion='entropy')
drugTree # it shows the default parameters

DecisionTreeClassifier(criterion='entropy')
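To make the two criteria concrete, here is a small sketch that computes entropy and Gini impurity from the class counts at a node. The helper functions are hypothetical and written for illustration; they are not part of scikit-learn.

```python
import numpy as np


def entropy(counts):
    """Shannon entropy (in bits) of a node with the given class counts."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()  # drop empty classes, then normalise
    return float((p * np.log2(1.0 / p)).sum())


def gini(counts):
    """Gini impurity of a node with the given class counts."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return float(1.0 - (p ** 2).sum())


print(entropy([5, 5]), gini([5, 5]))    # 1.0 0.5  (evenly mixed node)
print(entropy([10, 0]), gini([10, 0]))  # 0.0 0.0  (pure node)
```

Both measures are zero for a pure node and largest for an evenly mixed one, which is why they tend to prefer similar splits.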

Next, we fit the model on the training feature matrix _X_trainset_ and response vector _y_trainset_.

In [9]:
drugTree.fit(X_trainset,y_trainset)

DecisionTreeClassifier(criterion='entropy')
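One way to see what a fitted tree has learned is `export_text` from `sklearn.tree`, which prints the splits as indented rules. The sketch below uses a hypothetical toy dataset with a single feature so that it runs on its own.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: a single feature that separates two drugs.
X_toy = [[7.8], [10.1], [18.0], [25.4]]
y_toy = ["drugX", "drugX", "drugY", "drugY"]

toy_tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
toy_tree.fit(X_toy, y_toy)

# export_text renders the learned splits as indented if/else rules.
rules = export_text(toy_tree, feature_names=["Na_to_K"])
print(rules)
```

On the notebook's own model, the equivalent call would be `export_text(drugTree, feature_names=list(X.columns))`.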

# Prediction and Evaluation

Now, we need to make predictions on the testing dataset and then use metrics to evaluate our model.

In [10]:
y_hat = drugTree.predict(X_testset)
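The same `predict` call works for a single new patient, as long as the row is encoded the same way and has the same columns as the training data. The sketch below is a self-contained illustration: it trains on only the five encoded rows shown earlier, and the new patient's values are hypothetical.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# The five encoded rows shown earlier, used here as a tiny training set.
train = pd.DataFrame(
    {
        "Age": [23, 47, 47, 28, 61],
        "Sex": [0, 1, 1, 0, 0],
        "BP": [0, 1, 1, 2, 1],
        "Cholesterol": [0, 0, 0, 0, 0],
        "Na_to_K": [25.355, 13.093, 10.114, 7.798, 18.043],
    }
)
labels = ["drugY", "drugC", "drugC", "drugX", "drugY"]

model = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(train, labels)

# Hypothetical new patient, already encoded with the same scheme.
new_patient = pd.DataFrame(
    {"Age": [50], "Sex": [0], "BP": [1], "Cholesterol": [0], "Na_to_K": [30.0]}
)

prediction = model.predict(new_patient)
print(prediction)                        # predicted class label
print(model.classes_)                    # column order of predict_proba
print(model.predict_proba(new_patient))  # per-class probabilities
```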

Next, we need to evaluate the predictions.

We will use the `metrics` module from `sklearn`.

In [11]:
from sklearn import metrics

In [12]:
print("DecisionTrees's Accuracy: ", metrics.accuracy_score(y_testset, y_hat))
print("\n")
print("DecisionTrees's Jaccard Index (micro): ", metrics.jaccard_score(y_testset, y_hat, average='micro'))
print("DecisionTrees's Jaccard Index (macro): ", metrics.jaccard_score(y_testset, y_hat, average='macro'))
print("\n")
print("DecisionTrees's precision Score (micro): ", metrics.precision_score(y_testset, y_hat, average='micro'))
print("DecisionTrees's precision Score (macro): ", metrics.precision_score(y_testset, y_hat, average='macro'))
print("\n")
print("DecisionTrees's recall Score (micro): ", metrics.recall_score(y_testset, y_hat, average='micro'))
print("DecisionTrees's recall Score (macro): ", metrics.recall_score(y_testset, y_hat, average='macro'))
print("\n")
print("DecisionTrees's F1 Score (micro): ", metrics.f1_score(y_testset, y_hat, average='micro'))
print("DecisionTrees's F1 Score (macro): ", metrics.f1_score(y_testset, y_hat, average='macro'))

DecisionTrees's Accuracy:  0.9833333333333333


DecisionTrees's Jaccard Index (micro):  0.9672131147540983
DecisionTrees's Jaccard Index (macro):  0.9817805383022774


DecisionTrees's precision Score (micro):  0.9833333333333333
DecisionTrees's precision Score (macro):  0.9913043478260869


DecisionTrees's recall Score (micro):  0.9833333333333333
DecisionTrees's recall Score (macro):  0.9904761904761905


DecisionTrees's F1 Score (micro):  0.9833333333333333
DecisionTrees's F1 Score (macro):  0.9906775067750677


These are good scores on the held-out test set, so the model predicts which drug a patient responds to with high accuracy on this data.
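The micro-averaged precision, recall, and F1 above all equal the accuracy. That is expected for single-label multiclass problems, where micro averaging pools every prediction before computing the score, while macro averaging computes the score per class and then takes the unweighted mean. A minimal sketch with made-up labels:

```python
from sklearn import metrics

y_true = ["a", "a", "a", "a", "b", "b"]
y_pred = ["a", "a", "a", "a", "a", "b"]

# Per-class precision: "a" = 4 / 5 = 0.8, "b" = 1 / 1 = 1.0
macro = metrics.precision_score(y_true, y_pred, average="macro")  # (0.8 + 1.0) / 2
micro = metrics.precision_score(y_true, y_pred, average="micro")  # 5 correct / 6 predictions
accuracy = metrics.accuracy_score(y_true, y_pred)

print(macro, micro, accuracy)
```

Macro averages give every class equal weight, so they are the more informative ones when some drugs are rare.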

# Grid Search

Let's try to optimize the max depth of the decision tree and get (hopefully) the best model.

First, let's import `GridSearchCV`.

In [13]:
from sklearn.model_selection import GridSearchCV

Now let's create the estimator and the parameter grid.

In [14]:
dtreeClassifier = DecisionTreeClassifier(criterion='entropy')
dtreeClassifier.get_params()

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'entropy',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'random_state': None,
 'splitter': 'best'}

In [15]:
hyper_params = {'max_depth': [3, 4, 5, 6, 7, None]}

Now we pass the estimator and the parameter grid to `GridSearchCV` to create the search object.

In [16]:
grid_dtree = GridSearchCV(dtreeClassifier, hyper_params, scoring='f1_micro')

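Now fit the search. Because of `scoring='f1_micro'`, each candidate is ranked by its cross-validated F1 micro score (5-fold by default in recent scikit-learn versions). Note that the search is fit on the full dataset (`X`, `y`), so the test set used earlier is not held out from the search.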

In [17]:
grid_dtree.fit(X, y)

GridSearchCV(estimator=DecisionTreeClassifier(criterion='entropy'),
             param_grid={'max_depth': [3, 4, 5, 6, 7, None]},
             scoring='f1_micro')

Now, let's see the best parameters and the best score.

In [18]:
grid_dtree.best_params_

{'max_depth': 4}

In [19]:
grid_dtree.best_score_

0.985

So a maximum depth of 4 gave the best cross-validated F1 micro score, 0.985.
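As a variation, the search can be run on the training set only, so that the test set stays untouched until the final evaluation. The sketch below uses hypothetical synthetic data from `make_classification` in place of the drug dataset, and also prints the mean cross-validated score for every candidate depth from `cv_results_`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data standing in for the drug dataset.
X_syn, y_syn = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=3)

depths = [3, 4, 5, 6, 7, None]
search = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    {"max_depth": depths},
    scoring="f1_micro",
    cv=5,
)
search.fit(X_tr, y_tr)  # the test set is not seen during the search

# Mean cross-validated score for every candidate depth.
mean_scores = search.cv_results_["mean_test_score"]
for params, score in zip(search.cv_results_["params"], mean_scores):
    print(params, round(float(score), 3))

# With the default refit=True, best_estimator_ is already refit on X_tr, y_tr.
held_out = search.score(X_te, y_te)
print("held-out F1 (micro):", held_out)
```

Looking at all the candidate scores, rather than only `best_params_`, shows whether the winning depth is clearly better or merely tied with its neighbours.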

# Evaluation of Best Decision Tree from Grid Search

Now let's take the best estimator from the search, refit it on the training set only, and compute the same scores as before.

In [20]:
best_dtree = grid_dtree.best_estimator_
best_dtree.fit(X_trainset, y_trainset)

DecisionTreeClassifier(criterion='entropy', max_depth=4)

Now, let's predict on our test set.

In [21]:
y_hat_b = best_dtree.predict(X_testset)

Finally, the scores:

In [22]:
print("DecisionTrees's Accuracy: ", metrics.accuracy_score(y_testset, y_hat_b))
print("\n")
print("DecisionTrees's Jaccard Index (micro): ", metrics.jaccard_score(y_testset, y_hat_b, average='micro'))
print("DecisionTrees's Jaccard Index (macro): ", metrics.jaccard_score(y_testset, y_hat_b, average='macro'))
print("\n")
print("DecisionTrees's precision Score (micro): ", metrics.precision_score(y_testset, y_hat_b, average='micro'))
print("DecisionTrees's precision Score (macro): ", metrics.precision_score(y_testset, y_hat_b, average='macro'))
print("\n")
print("DecisionTrees's recall Score (micro): ", metrics.recall_score(y_testset, y_hat_b, average='micro'))
print("DecisionTrees's recall Score (macro): ", metrics.recall_score(y_testset, y_hat_b, average='macro'))
print("\n")
print("DecisionTrees's F1 Score (micro): ", metrics.f1_score(y_testset, y_hat_b, average='micro'))
print("DecisionTrees's F1 Score (macro): ", metrics.f1_score(y_testset, y_hat_b, average='macro'))

DecisionTrees's Accuracy:  0.9833333333333333


DecisionTrees's Jaccard Index (micro):  0.9672131147540983
DecisionTrees's Jaccard Index (macro):  0.9817805383022774


DecisionTrees's precision Score (micro):  0.9833333333333333
DecisionTrees's precision Score (macro):  0.9913043478260869


DecisionTrees's recall Score (micro):  0.9833333333333333
DecisionTrees's recall Score (macro):  0.9904761904761905


DecisionTrees's F1 Score (micro):  0.9833333333333333
DecisionTrees's F1 Score (macro):  0.9906775067750677


# Conclusion

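There you have it: a model that predicts which drug a patient should take with high accuracy on the test set. Limiting the maximum depth to 4 produced the same test scores as the unrestricted tree.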

# Author
By Abhinav Garg