# Decision Tree

In this notebook, we train two decision tree classifiers on the `Cleveland Heart Disease` dataset obtained from the `UCI` repository. The dataset has already been through the required pre-processing steps covered before.

In [1]:
# Import Statements

# For accessing the datasets
import pandas as pd

# For splitting the dataset into train and test sets
from sklearn.model_selection import train_test_split

# For implementing the decision tree
from sklearn.tree import DecisionTreeClassifier

# For evaluating the performance of the classifier
from sklearn import metrics

# For Normalization/Scaling of the features
from sklearn.preprocessing import StandardScaler

## 1. Reading the dataset

In [2]:
df = pd.read_csv('cleveland_data.csv')
df.head()

Unnamed: 0,age,gender,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,63,1,1,145,233,1,2,150,0,2.3,3,0,6,0
1,67,1,4,160,286,0,2,108,1,1.5,2,3,3,2
2,67,1,4,120,229,0,2,129,1,2.6,2,2,7,1
3,37,1,3,130,250,0,0,187,0,3.5,3,0,3,0
4,41,0,2,130,204,0,2,172,0,1.4,1,0,3,0


## 2. Feature Selection

We list the columns that we consider relevant for making accurate predictions, along with the target variable.

In [126]:
# The features are the first 13 columns

# Using all 13 features gives about 60% test accuracy
features = list(df.columns[:13])
features

['age',
 'gender',
 'cp',
 'trestbps',
 'chol',
 'fbs',
 'restecg',
 'thalach',
 'exang',
 'oldpeak',
 'slope',
 'ca',
 'thal']

In [127]:
# Target Variable is the last column
target = 'num'

In [128]:
# Load the X and the Y variables

X = df[features]
Y = df[target]

## 3. Test Train Split

We split the entire dataset into a training dataset and a testing dataset.

* The `Train` dataset will be used during the training phase to optimize the algorithm
* The `Test` dataset will be used to validate the model on unseen data
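
As a rough illustration of what a random 80/20 split like the one used in this notebook does, the sketch below shuffles row indices and cuts them into two disjoint parts. It is a plain-Python stand-in and does not reproduce the exact rows that `train_test_split` selects. The helper `split_indices` is hypothetical, and the row count of 297 is an assumption chosen because it is consistent with the 60 test rows reported later.

```python
import math
import random

def split_indices(n_rows, test_size=0.2, seed=101):
    """Shuffle row indices and cut them into train and test parts."""
    n_test = math.ceil(n_rows * test_size)  # round the test share up
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)    # fixed seed gives a repeatable shuffle
    return indices[n_test:], indices[:n_test]

# Assumed row count (not stated in the notebook)
train_idx, test_idx = split_indices(297)
print(len(train_idx), len(test_idx))
```

Because the two index lists never overlap, no row seen during training is used again for testing.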

In [129]:
# Randomly split the rows into an 80% train set and a 20% test set
# random_state fixes the shuffle so that the split is reproducible

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=101)

## 4. Building and Training the Classifier Models

In this section, we build the decision tree models and train them on the train set.

### 4.1 `ID3` Decision Tree

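This uses `information gain`, computed from `entropy`, to select the attribute to split on at each level of the tree. Note that scikit-learn's `DecisionTreeClassifier` implements an optimized version of CART, so `criterion="entropy"` approximates ID3-style attribute selection rather than running the original ID3 algorithm.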
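
To make the entropy criterion concrete, here is a minimal sketch of how entropy and information gain could be computed for one candidate binary split. The function names and the toy labels are illustrative assumptions, not part of scikit-learn or of this notebook's data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Toy labels (hypothetical): a balanced parent node and two candidate splits
parent = [0, 0, 0, 0, 1, 1, 1, 1]
print(information_gain(parent, [0, 0, 0, 1], [0, 1, 1, 1]))  # partial separation
print(information_gain(parent, [0, 0, 0, 0], [1, 1, 1, 1]))  # perfect separation
```

The split with the larger information gain is the one an entropy-based tree would prefer at that node.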

In [130]:
# Initialize the classifier with a maximum depth of 6 levels and at least 5 samples per leaf to limit overfitting

id3_classifier = DecisionTreeClassifier(criterion="entropy", random_state=42, max_depth=6, min_samples_leaf=5)

In [131]:
# Fit the model on the train set
# Training Phase

id3_classifier = id3_classifier.fit(X_train, Y_train)

### 4.2 `CART` Decision Tree

This uses the `Gini index` (Gini impurity) to select the attribute to split on at each level of the tree.
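
As a small illustration of the Gini criterion, the sketch below computes the Gini impurity of a node and the size-weighted impurity of a candidate binary split. The function names and toy labels are assumptions made for this example only.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(left, right):
    """Size-weighted Gini impurity of the two children of a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Toy labels (hypothetical)
print(gini([0, 0, 1, 1]))                          # evenly mixed node
print(gini([0, 0, 0, 0]))                          # pure node
print(weighted_gini([0, 0, 0, 1], [0, 1, 1, 1]))   # candidate split
```

A Gini-based tree prefers the split with the lowest weighted impurity, which plays the same role as the highest information gain in the entropy case.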

In [155]:
# Initialize the classifier model
# criterion="gini" is the default value, so this argument can be omitted when creating the model.
# The tree is limited to a maximum depth of 6 levels and at least 5 samples per leaf to limit overfitting

cart_classifier = DecisionTreeClassifier(criterion="gini", random_state=101, max_depth=6, min_samples_leaf=5)

In [156]:
# Fit the model on the train set
# Training Phase

cart_classifier = cart_classifier.fit(X_train, Y_train)

## 5. Evaluation of the Models

Here we evaluate the models trained above on the train and test sets, and report the accuracy, confusion matrix, and classification report of each.

### 5.1 `ID3` Decision Tree

In [134]:
# Predict the outputs for the train set

id3_train_y_pred = id3_classifier.predict(X_train)

In [135]:
# Calculate the training accuracy of this model

id3_train_accuracy = metrics.accuracy_score(Y_train, id3_train_y_pred)
print("Train Accuracy of the ID3 Model : ", id3_train_accuracy)

Train Accuracy of the ID3 Model :  0.7468354430379747


In [136]:
# Predict the outputs for the test set

id3_y_pred = id3_classifier.predict(X_test)

In [137]:
# Calculate the test accuracy of this model

id3_accuracy = metrics.accuracy_score(Y_test, id3_y_pred)
print("Test Accuracy of the ID3 Model : ", id3_accuracy)

Test Accuracy of the ID3 Model :  0.55


In [138]:
# Print the Confusion Matrix of the results from the test set

print("Confusion Matrix of the ID3 model is : \n", metrics.confusion_matrix(Y_test, id3_y_pred))

Confusion Matrix of the ID3 model is : 
 [[30  4  1  5  0]
 [ 6  0  1  2  1]
 [ 2  1  1  1  2]
 [ 0  0  1  2  0]
 [ 0  0  0  0  0]]


In [158]:
# Print the Classification Report

print("Classification Report of the ID3 model is : \n", metrics.classification_report(Y_test, id3_y_pred))

Classification Report of the ID3 model is : 
               precision    recall  f1-score   support

           0       0.79      0.75      0.77        40
           1       0.00      0.00      0.00        10
           2       0.25      0.14      0.18         7
           3       0.20      0.67      0.31         3
           4       0.00      0.00      0.00         0

    accuracy                           0.55        60
   macro avg       0.25      0.31      0.25        60
weighted avg       0.57      0.55      0.55        60
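
The per-class figures in the report follow directly from the confusion matrix, where rows are true classes and columns are predicted classes. The sketch below recomputes precision, recall, and accuracy from the ID3 matrix printed above; the helper `per_class_metrics` is a hypothetical name, and treating an empty denominator as 0.0 is an assumption made here to match the zeros shown in the report.

```python
# ID3 confusion matrix copied from the output above
cm = [
    [30, 4, 1, 5, 0],
    [6, 0, 1, 2, 1],
    [2, 1, 1, 1, 2],
    [0, 0, 1, 2, 0],
    [0, 0, 0, 0, 0],
]

def per_class_metrics(cm):
    """Return a (precision, recall) pair for every class."""
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]                               # correctly predicted as class k
        actual = sum(cm[k])                         # row sum: true members of class k
        predicted = sum(cm[i][k] for i in range(n)) # column sum: predicted as class k
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        results.append((precision, recall))
    return results

id3_metrics = per_class_metrics(cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))

for k, (p, r) in enumerate(id3_metrics):
    print(k, round(p, 2), round(r, 2))
print("accuracy", round(accuracy, 2))
```

Class 4 has no true samples in the test set, which is why its support and recall are both zero.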



### 5.2 `CART` Decision Tree

In [159]:
# Predict the outputs for the train set

cart_train_y_pred = cart_classifier.predict(X_train)

In [160]:
# Calculate the training accuracy of this model

cart_train_accuracy = metrics.accuracy_score(Y_train, cart_train_y_pred)
print("Train Accuracy of the CART Model : ", cart_train_accuracy)

Train Accuracy of the CART Model :  0.7510548523206751


In [161]:
# Predict the outputs for the test set

cart_y_pred = cart_classifier.predict(X_test)

In [162]:
# Calculate the test accuracy of this model

cart_accuracy = metrics.accuracy_score(Y_test, cart_y_pred)
print("Test Accuracy of the CART Model : ", cart_accuracy)

Test Accuracy of the CART Model :  0.6


In [163]:
# Print the Confusion Matrix of the results from the test set

print("Confusion Matrix of the CART model is : \n", metrics.confusion_matrix(Y_test, cart_y_pred))

Confusion Matrix of the CART model is : 
 [[33  5  1  1  0]
 [ 6  1  2  1  0]
 [ 2  1  1  2  1]
 [ 0  0  2  1  0]
 [ 0  0  0  0  0]]


In [165]:
# Print the Classification Report

print("Classification Report of the CART model is : \n", metrics.classification_report(Y_test, cart_y_pred))

Classification Report of the CART model is : 
               precision    recall  f1-score   support

           0       0.80      0.82      0.81        40
           1       0.14      0.10      0.12        10
           2       0.17      0.14      0.15         7
           3       0.20      0.33      0.25         3
           4       0.00      0.00      0.00         0

    accuracy                           0.60        60
   macro avg       0.26      0.28      0.27        60
weighted avg       0.59      0.60      0.59        60
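
The gap between the `macro avg` and `weighted avg` rows comes from class imbalance: the macro average treats every class equally, while the weighted average weights each class by its support. A minimal sketch using the CART confusion matrix printed above (variable names are illustrative, and a class with no true samples is assumed to have recall 0.0):

```python
# CART confusion matrix copied from the output above (rows = true classes)
cart_cm = [
    [33, 5, 1, 1, 0],
    [6, 1, 2, 1, 0],
    [2, 1, 1, 2, 1],
    [0, 0, 2, 1, 0],
    [0, 0, 0, 0, 0],
]

# Support is the number of true samples per class (row sums)
support = [sum(row) for row in cart_cm]

# Recall per class: diagonal entry divided by the row sum
recall = [row[k] / s if s else 0.0
          for k, (row, s) in enumerate(zip(cart_cm, support))]

# Macro average: plain mean over classes
macro_recall = sum(recall) / len(recall)

# Weighted average: mean weighted by class support
weighted_recall = sum(r * s for r, s in zip(recall, support)) / sum(support)

print(round(macro_recall, 2), round(weighted_recall, 2))
```

The weighted recall equals the overall accuracy, which is dominated by class 0 with 40 of the 60 test rows, whereas the macro figure exposes the weak results on the smaller classes.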



## 6. Results

In [147]:
# Print Training Accuracies

print("Train Accuracy of the ID3 Model : ", id3_train_accuracy)
print("Train Accuracy of the CART Model : ", cart_train_accuracy)

Train Accuracy of the ID3 Model :  0.7468354430379747
Train Accuracy of the CART Model :  0.7510548523206751


In [148]:
# Print Test set Accuracies

print("Test Accuracy of the ID3 Model : ", id3_accuracy)
print("Test Accuracy of the CART Model : ", cart_accuracy)

Test Accuracy of the ID3 Model :  0.55
Test Accuracy of the CART Model :  0.6


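**Conclusion :**
The `CART` model, which uses the `Gini` index, achieves slightly higher test accuracy (0.60) than the `ID3` model, which uses `Entropy` (0.55). On the 60-row test set this is a difference of three correct predictions. Both models reach about 0.75 accuracy on the train set and both classify classes 1 to 4 poorly, so the difference between the two criteria should be read with caution.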