# Diabetes Decision Tree Notebook

I created this notebook while working through a tutorial on decision trees. Most of the code comes straight from that tutorial, so it is not solely my own work. My additions are a validation set and evaluation on additional metrics. Link to tutorial: https://www.datacamp.com/community/tutorials/decision-tree-classification-python

# Step 1 - Import libraries, import data, & prepare data for modeling

In [1]:
# import libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

In [2]:
# import data
pima = pd.read_csv('diabetes.csv')
pima.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [6]:
# assign features and label
x = pima[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 
          'BMI', 'DiabetesPedigreeFunction', 'Age']]
y = pima['Outcome']

In [8]:
# train test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

In [9]:
# train validate split
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=1)
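
The two successive 80/20 splits above leave roughly 64% of the rows for training, 16% for validation, and 20% for testing. The arithmetic can be sketched as follows. The row count of 768 is an assumption (the notebook never prints the shape of `pima`), and the sketch relies on `train_test_split` rounding a fractional `test_size` up to the next whole row.

```python
import math

n_rows = 768     # assumption: row count of diabetes.csv, not shown in the notebook
test_size = 0.2  # same fraction used in both splits above

# First split: hold out the test set.
n_test = math.ceil(test_size * n_rows)
n_remaining = n_rows - n_test

# Second split: carve the validation set out of the remaining rows.
n_val = math.ceil(test_size * n_remaining)
n_train = n_remaining - n_val

print('train:', n_train, 'validation:', n_val, 'test:', n_test)
```

In the notebook itself, `len(x_train)`, `len(x_val)`, and `len(x_test)` report the actual sizes.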

# Step 2 - Create & train the model

In [10]:
# create model (no random_state is set, so results can vary slightly between runs)
tree = DecisionTreeClassifier()

In [11]:
# fit the model to the data
tree.fit(x_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [13]:
# predict responses for the VALIDATION set
val_predict = tree.predict(x_val)

# Step 3 - Evaluate the model on the validation set

In [14]:
# evaluate model on 'accuracy'

print('Accuracy:', metrics.accuracy_score(y_val, val_predict))

Accuracy: 0.6747967479674797


In [15]:
# evaluate model on 'precision,' 'recall,' and 'f1 score'

print('Precision:', metrics.precision_score(y_val, val_predict))
print('Recall:', metrics.recall_score(y_val, val_predict))
print('F1 Score:', metrics.f1_score(y_val, val_predict))

Precision: 0.54
Recall: 0.6136363636363636
F1 Score: 0.574468085106383
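
To make the relationship between these scores concrete, here is a minimal sketch that recomputes them from confusion-matrix counts. The counts are an assumption: the notebook never prints them, and they were inferred as the smallest whole numbers consistent with the scores reported above.

```python
# Confusion-matrix counts inferred from the reported validation scores
# (assumption: these are not printed anywhere in the notebook).
tp, fp, fn, tn = 27, 23, 17, 56

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('F1 Score:', f1)
```

Running `metrics.confusion_matrix(y_val, val_predict)` in the notebook would show the actual counts.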


# Step 4 - "Prune" Tree (optimize tree by adjusting hyperparameters)

In [16]:
# create new 'pruned' model
pruned_tree = DecisionTreeClassifier(criterion='entropy', max_depth=3)
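
The new model changes two hyperparameters: `max_depth=3` caps how deep the tree can grow, and `criterion='entropy'` replaces the default Gini impurity as the measure of how mixed a node is. As a rough illustration of the two impurity measures, here is a sketch with hand-written helper functions. The function names and the class counts are hypothetical and are not part of scikit-learn or the dataset.

```python
import math

def gini(counts):
    """Gini impurity of a node, given the number of samples in each class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy (base 2) of a node, given the number of samples in each class."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Hypothetical (negative, positive) class counts for three nodes.
for counts in [(50, 50), (90, 10), (100, 0)]:
    print(counts, 'gini:', round(gini(counts), 3), 'entropy:', round(entropy(counts), 3))
```

Both measures are highest for an evenly mixed node and fall to zero for a pure one, so they often lead to similar splits.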

In [17]:
# fit the new model to the training data
pruned_tree.fit(x_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=3,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [18]:
# predict responses for the validation set with the pruned model
pruned_val_predict = pruned_tree.predict(x_val)

In [19]:
# evaluate the pruned model on the same four metrics
print('Accuracy:', metrics.accuracy_score(y_val, pruned_val_predict))
print('Precision:', metrics.precision_score(y_val, pruned_val_predict))
print('Recall:', metrics.recall_score(y_val, pruned_val_predict))
print('F1 Score:', metrics.f1_score(y_val, pruned_val_predict))

Accuracy: 0.7154471544715447
Precision: 0.5882352941176471
Recall: 0.6818181818181818
F1 Score: 0.631578947368421


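All four validation metrics improved with the pruned tree: accuracy rose from about 0.675 to 0.715, precision from 0.540 to 0.588, recall from 0.614 to 0.682, and F1 score from 0.574 to 0.632.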
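
The test set created in Step 1 has not been used so far. One way to finish the workflow is to compare a few `max_depth` values on the validation set and then score the chosen model once on the test set. The sketch below shows that idea under stated assumptions: it uses synthetic stand-in data from `make_classification` so that it runs without `diabetes.csv`, and the depth values and the choice of F1 as the selection metric are illustrative, not taken from the tutorial.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): 768 synthetic rows with 8 features.
x, y = make_classification(n_samples=768, n_features=8, random_state=1)

# Same two-stage split as in Step 1.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=1)

# Compare candidate depths using the validation set only.
val_f1 = {}
for depth in (2, 3, 4, 5, None):
    model = DecisionTreeClassifier(criterion='entropy', max_depth=depth, random_state=1)
    model.fit(x_train, y_train)
    val_f1[depth] = metrics.f1_score(y_val, model.predict(x_val))

best_depth = max(val_f1, key=val_f1.get)

# Refit with the chosen depth and score once on the untouched test set.
final_tree = DecisionTreeClassifier(criterion='entropy', max_depth=best_depth, random_state=1)
final_tree.fit(x_train, y_train)
test_accuracy = metrics.accuracy_score(y_test, final_tree.predict(x_test))

print('Best max_depth on validation:', best_depth)
print('Test accuracy:', test_accuracy)
```

With the real data, `x` and `y` would be the feature and label selections from Step 1. Keeping the test set out of the depth comparison means its score is not biased by the tuning.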