# **Decision Tree example**


### To run this code:

1- Use Jupyter notebook: https://jupyter.org/


2- Use Google Colab: https://colab.research.google.com/



### Decision Trees illustrative example

http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

# Diabetes Dataset
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

https://www.kaggle.com/uciml/pima-indians-diabetes-database

The dataset contains the following variables:

1. Pregnancies: Number of times pregnant
2. Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. BloodPressure: Diastolic blood pressure (mm Hg)
4. SkinThickness: Triceps skin fold thickness (mm)
5. Insulin: 2-hour serum insulin (mu U/ml)
6. BMI: Body mass index (weight in kg/(height in m)^2)
7. DiabetesPedigreeFunction: Diabetes pedigree function
8. Age: Age (years)
9. Outcome: Class variable (0 = no diabetes, 1 = diabetes)

Can we build a decision tree model that accurately predicts whether a patient in the dataset has diabetes?

In [None]:
# Import modules and libraries

import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics # Import scikit-learn metrics module for accuracy calculation
import matplotlib.pyplot as plt
import numpy as np

In [None]:
# Read dataset into a pandas dataframe
data_url = 'https://github.com/abdelrahman-ayad/MiCM-introML-W21/raw/main/notebooks/Data/diabetes.csv'
data = pd.read_csv(data_url)


# Display the first 20 rows
data.head(20)

In [None]:
# Plot a histogram of every column
plt.rcParams['figure.figsize'] = [15, 10]
hist = data.hist()

### Task 1: Separate data into features and targets (i.e., x and y)

In [None]:
# Code here to separate the data into features and targets
y = data['Outcome']
x = data[data.columns.drop('Outcome')]
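An equivalent way to write the same separation is `DataFrame.drop(columns=...)`. The sketch below is only an illustration: it uses a tiny made-up frame (the variable names `toy`, `toy_x`, and `toy_y` and all values are assumptions, not part of the dataset) so that it runs without downloading anything.

```python
import pandas as pd

# Tiny made-up frame standing in for the diabetes data (values are illustrative only)
toy = pd.DataFrame({
    'Glucose': [100, 150, 120, 90],
    'BMI': [22.0, 35.5, 28.0, 24.5],
    'Outcome': [0, 1, 1, 0],
})

# Target: the single column we want to predict
toy_y = toy['Outcome']

# Features: every column except the target
toy_x = toy.drop(columns='Outcome')

# Check that this matches the indexing form used in the notebook
same = toy_x.equals(toy[toy.columns.drop('Outcome')])
print(list(toy_x.columns), toy_y.shape, same)
```

Both forms leave the original frame unchanged, so either can be used for this task.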

### Task 2: Split the data into training and testing (70/30)

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0) # 70% training and 30% test
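The split above is purely random, so the share of positive cases can differ somewhat between the training and test parts. One optional variation, not required by the task, is to pass `stratify` to `train_test_split` so both parts keep roughly the same class proportions. The sketch below uses synthetic data (the 40/60 class balance and the `demo_*` and `xs_*`/`ys_*` names are assumptions made for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 40 of them positive (assumed proportions)
rng = np.random.default_rng(0)
demo_x = rng.normal(size=(100, 3))
demo_y = np.array([1] * 40 + [0] * 60)

# stratify=demo_y keeps the class proportions close to identical in both parts
xs_train, xs_test, ys_train, ys_test = train_test_split(
    demo_x, demo_y, test_size=0.3, random_state=0, stratify=demo_y
)

print(xs_train.shape, xs_test.shape)
print(f'positive fraction: train={ys_train.mean():.2f}, test={ys_test.mean():.2f}')
```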

### Task 3: Build a decision tree (DT) model

### Use the entropy criterion with a maximum depth of 10 and a minimum of 5 samples per leaf
### Hint: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
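With the entropy criterion, a candidate split is scored by how much it reduces the Shannon entropy of the class labels (the information gain). The sketch below works through that arithmetic by hand on made-up labels. The helpers `entropy` and `information_gain` are hypothetical illustration functions, not part of scikit-learn, and the labels are invented.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

# Made-up labels: an evenly mixed parent node and one candidate split of it
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left = np.array([0, 0, 0, 1])    # mostly class 0
right = np.array([0, 1, 1, 1])   # mostly class 1

print(f'parent entropy: {entropy(parent):.3f} bits')
print(f'information gain: {information_gain(parent, left, right):.3f} bits')
```

An evenly mixed node has the highest possible entropy for two classes (1 bit), and a split is more useful the further it pushes the children toward a single class.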


In [None]:
# Create a DT model
DTmodel = DecisionTreeClassifier(criterion = 'entropy', max_depth=10, min_samples_leaf=5)

# Fit the model on the training data
DTmodel.fit(x_train, y_train)

# Predict the outcome for the test data
y_pred = DTmodel.predict(x_test)

# Compare the predictions with the actual y_test to get the accuracy
correct = y_test == y_pred
incorrect = np.logical_not(correct)
accuracy = np.sum(correct)/y_test.shape[0]
print(f'Accuracy is {accuracy*100:.1f}%.')
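The same accuracy can also be obtained from the `metrics` module imported at the top of the notebook. As a minimal sketch, the example below compares the manual calculation with `metrics.accuracy_score` on made-up labels (`true_labels` and `pred_labels` are illustrative arrays, not model output).

```python
import numpy as np
from sklearn import metrics

# Made-up labels and predictions (illustrative only)
true_labels = np.array([0, 1, 1, 0, 1, 0, 0, 1])
pred_labels = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Manual accuracy: fraction of predictions that match the true labels
manual_acc = np.mean(true_labels == pred_labels)

# The same quantity computed by scikit-learn
sk_acc = metrics.accuracy_score(true_labels, pred_labels)

print(f'manual: {manual_acc:.3f}, scikit-learn: {sk_acc:.3f}')
```

In the notebook itself, the equivalent call would be `metrics.accuracy_score(y_test, y_pred)`.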

### Task 4: Compute and plot the confusion matrix

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
plt.rcParams['figure.figsize'] = [8, 8]
cm = confusion_matrix(y_test, y_pred)
cm_display = ConfusionMatrixDisplay(cm, display_labels = ['Not diabetes', 'Diabetes']).plot()
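Accuracy alone hides which kind of error the model makes, and the four cells of a binary confusion matrix can be turned into per-class rates. The sketch below shows one way to do that on made-up labels. The arrays and the choice of sensitivity, specificity, and precision as summary metrics are assumptions for illustration, not results from the diabetes model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions (illustrative only)
true_labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
pred_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0])

# For binary labels 0/1, rows are true classes and columns are predicted classes,
# so ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(true_labels, pred_labels).ravel()

sensitivity = tp / (tp + fn)   # share of actual positives that were detected
specificity = tn / (tn + fp)   # share of actual negatives that were detected
precision = tp / (tp + fp)     # share of predicted positives that are correct

print(f'sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, precision={precision:.2f}')
```

For a screening problem such as diabetes prediction, sensitivity is often the number to watch, because a false negative means a missed case.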