# Module 9 Exercises - Decision Trees

In [21]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

### Exercise 1:

Using the diabetes.csv file from the Module 8 Exercises notebook, load the file as a dataframe. Repeat the steps from exercises 1 & 2 in the Module 8 Exercise notebook to prepare your dataset for modeling.

In [4]:
# Load diabetes data.
location = "datasets/diabetes.csv"
df = pd.read_csv(location)

In [6]:
# Drop rows where a zero stands in for a missing measurement; these can skew predictions.
df = df[~(df.BloodPressure == 0)]
df = df[~(df.SkinThickness == 0)]
df = df[~(df.Insulin == 0)]
# Drop the two columns that are left out of the model.
model_df = df.drop(['DiabetesPedigreeFunction', 'BMI'], axis=1)
model_df.head(15)

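|    | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | Age | Outcome |
| -- | ----------- | ------- | ------------- | ------------- | ------- | --- | ------- |
| 3  | 1           | 89      | 66            | 23            | 94      | 21  | 0       |
| 4  | 0           | 137     | 40            | 35            | 168     | 33  | 1       |
| 6  | 3           | 78      | 50            | 32            | 88      | 26  | 1       |
| 8  | 2           | 197     | 70            | 45            | 543     | 53  | 1       |
| 13 | 1           | 189     | 60            | 23            | 846     | 59  | 1       |
| 14 | 5           | 166     | 72            | 19            | 175     | 51  | 1       |
| 16 | 0           | 118     | 84            | 47            | 230     | 31  | 1       |
| 18 | 1           | 103     | 30            | 38            | 83      | 33  | 0       |
| 19 | 1           | 115     | 70            | 30            | 96      | 32  | 1       |
| 20 | 3           | 126     | 88            | 41            | 235     | 27  | 0       |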


In [7]:
y = model_df['Outcome']
X = model_df.drop(['Outcome'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=15)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(295, 6)
(99, 6)
(295,)
(99,)
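
The shapes above follow from the row count: 295 + 99 = 394 rows remain after cleaning, and `test_size=0.25` gives 98.5, which scikit-learn rounds up to 99 test rows. A minimal sketch that reproduces the split sizes, assuming placeholder arrays of zeros (`X_demo`, `y_demo` are hypothetical names, not the diabetes data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 394 rows and 6 feature columns, matching the shapes printed above.
X_demo = np.zeros((394, 6))
y_demo = np.zeros(394, dtype=int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=15
)

# 0.25 * 394 = 98.5 is rounded up to 99 test rows; the other 295 rows are for training.
print(X_tr.shape, X_te.shape)
```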


### Exercise 2:

Using the decision tree function in the scikit-learn library (sklearn), fit the model with the training dataset. Then score the model for training; how well did it do?

In [16]:
# Target variable and feature matrix (same definitions as used for the split above).
y = model_df['Outcome']
X = model_df.drop(['Outcome'], axis=1)

# Create a decision tree classifier with default settings.
# Setting arguments such as max_depth can help prevent overfitting.
model = tree.DecisionTreeClassifier()

# Fit the model on the training data.
model.fit(X_train, y_train)

# Predict outcomes for the test data.
y_predict = model.predict(X_test)

y_predict

array([1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1], dtype=int64)
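
The exercise also asks for the score on the training data, which the cell above does not compute. A minimal sketch of how that could look, assuming synthetic stand-in data from `make_classification` rather than the diabetes dataset (all `*_demo`, `X_tr`, `full_tree`, and `shallow_tree` names are hypothetical). With the notebook's own variables, the equivalent call would be `model.score(X_train, y_train)`.

```python
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the diabetes dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=4, random_state=15
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=15
)

# An unconstrained tree keeps splitting until its leaves are pure,
# so it typically scores 1.0 on the data it was trained on.
full_tree = tree.DecisionTreeClassifier(random_state=15).fit(X_tr, y_tr)

# Limiting the depth is one way to restrain overfitting.
shallow_tree = tree.DecisionTreeClassifier(max_depth=3, random_state=15).fit(X_tr, y_tr)

for name, clf in [("unconstrained", full_tree), ("max_depth=3", shallow_tree)]:
    train_score = clf.score(X_tr, y_tr)  # mean accuracy on the training rows
    test_score = clf.score(X_te, y_te)   # mean accuracy on the held-out rows
    print(f"{name}: train={train_score:.3f}, test={test_score:.3f}")
```

A large gap between the training and test scores is the usual sign that the tree has memorized the training rows rather than learned a general rule.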

### Exercise 3:

Now use the test dataset on the decision tree function and get its score.

In [17]:
# Check the accuracy of the model on the test data.
accuracy_score(y_test, y_predict)

0.70707070707070707

### Exercise 4:

Make a confusion matrix for the predicted outcomes to compare it against the "true" outcomes. How many values for each outcome did the model get incorrect?

In [18]:
#look at true and false predictions
pd.DataFrame(
    confusion_matrix(y_test, y_predict),
    columns=['Predicted Not Diabetes', 'Predicted Diabetes'],
    index=['True Not Diabetes', 'True Diabetes']
)

|                   | Predicted Not Diabetes | Predicted Diabetes |
| ----------------- | ---------------------- | ------------------ |
| True Not Diabetes | 53                     | 8                  |
| True Diabetes     | 21                     | 17                 |

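
The incorrect predictions are the off-diagonal cells of the confusion matrix. A short sketch that reads them off, with the counts copied from the output above (the variable names are illustrative):

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows are true outcomes, columns are predicted outcomes.
cm = np.array([[53, 8],
               [21, 17]])

false_positives = cm[0, 1]  # true "Not Diabetes" predicted as "Diabetes"
false_negatives = cm[1, 0]  # true "Diabetes" predicted as "Not Diabetes"

# Correct predictions sit on the diagonal; accuracy is their share of all rows.
accuracy = np.trace(cm) / cm.sum()

print(f"Not Diabetes misclassified: {false_positives}")
print(f"Diabetes misclassified: {false_negatives}")
print(f"Accuracy: {accuracy:.4f}")
```

So the model got 8 of the 61 non-diabetic patients and 21 of the 38 diabetic patients wrong, and the 70 correct predictions out of 99 match the accuracy score from Exercise 3.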

### Exercise 5:

Get a classification report on the model for the predicted data. Which outcome is the model more accurate at predicting?

In [22]:
print(classification_report(y_test, y_predict))

             precision    recall  f1-score   support

          0       0.72      0.87      0.79        61
          1       0.68      0.45      0.54        38

avg / total       0.70      0.71      0.69        99



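The decision tree was more accurate at predicting patients without diabetes (outcome 0): it identified 87% of them (recall 0.87, f1-score 0.79), compared with only 45% of patients with diabetes (recall 0.45, f1-score 0.54).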
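
The per-class figures in the report can be reproduced by hand from the Exercise 4 confusion matrix. A minimal sketch, using a hypothetical helper `prf` that is not part of scikit-learn:

```python
# Counts from the Exercise 4 confusion matrix.
tn, fp = 53, 8    # true outcome 0: predicted 0, predicted 1
fn, tp = 21, 17   # true outcome 1: predicted 0, predicted 1

def prf(true_pos, false_pos, false_neg):
    """Return (precision, recall, f1) for one class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# When outcome 0 is treated as the class of interest, the roles of the counts swap.
report = {
    0: prf(tn, fn, fp),
    1: prf(tp, fp, fn),
}
for label, scores in report.items():
    print(label, [round(s, 2) for s in scores])
```

Recall is the share of each true outcome the model found, while precision is the share of each predicted outcome that was right.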

### Exercise 6:

Compare the predictions from the decision tree model to the logistic regression model in the Module 8 Exercise notebook. Which model was best at predicting the outcome of diabetes for a patient?

The decision tree was slightly better at identifying patients with diabetes, but overall the two models were nearly tied:
    
Decision Tree:

|             | precision | recall | f1-score | support |
| ----------- | --------- | ------ | -------- | ------- |
| 0           | 0.72      | 0.87   | 0.79     | 61      |
| 1           | 0.68      | 0.45   | 0.54     | 38      |
| avg / total | 0.70      | 0.71   | 0.69     | 99      |

Logistic Regression:

|             | precision | recall | f1-score | support |
| ----------- | --------- | ------ | -------- | ------- |
| 0           | 0.71      | 0.89   | 0.79     | 61      |
| 1           | 0.70      | 0.42   | 0.52     | 38      |
| avg / total | 0.70      | 0.71   | 0.71     | 99      |
 
The outcome was almost identical for both models. The decision tree did slightly better at identifying patients with diabetes (recall 0.45 vs. 0.42, f1-score 0.54 vs. 0.52), while logistic regression did slightly better at identifying patients without diabetes (recall 0.89 vs. 0.87).
