# VNU 2021 April 10th

**Exercise 3**
Author: Associate Professor Lemai Nguyen

Reference:
**Pima Indians Diabetes Database**
Predict the onset of diabetes based on diagnostic measures

**Context**: This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

**Data**: The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

**Acknowledgements**

Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261--265). IEEE Computer Society Press.

**Inspiration**

Can you build a machine learning model to accurately predict whether or not the patients in the dataset have diabetes?

**Source**: https://www.kaggle.com/uciml/pima-indians-diabetes-database


**Loading Libraries and Functions**

Read about DecisionTreeClassifier at:
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

Train Test Split:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=train_test_split#sklearn.model_selection.train_test_split

Classification metrics:
https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics

In [2]:
# load libraries
import pandas as pd #for data manipulation and analysis
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for model evaluation


In [1]:
!pip install pydotplus #interface for graph visualisation
!pip install graphviz #for graph visualisation



**Loading Data**


1.   Load the dataset
2.   Explore the data



In [None]:
# load dataset diabetes.csv

records = pd.read_csv("..")

#explore the dataset
print(records)

print('Sample size:', records.shape[0])
print('Number of columns:', records.shape[1]) 
print('General stats:')
records.describe()
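
Beyond `describe()`, two quick checks are often useful before modelling: missing values per column and the class balance of the target. The sketch below runs on a small hypothetical frame (`toy`, with made-up values and only three columns), not on the real diabetes.csv.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real CSV; all values are made up.
toy = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Fraction of rows belonging to each class of the target
class_share = toy["Outcome"].value_counts(normalize=True)

print(toy.dtypes)
print(missing_per_column)
print(class_share)
```

A strongly unbalanced `class_share` is a hint that accuracy alone will be a misleading metric later on.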

**Feature Selection**

Select predictors (attributes) for Classification
Set role (Target)

In [None]:
#feature selection - select the first 8 attributes
features = records.columns[..] #these are the selected attributes to be the predictors

#set input X to records with selected features
X=records[]  #Input data

#set target to Outcome
y=records[''] # Target variable

print(records.head())
print(X.head())
print(y.head())
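
As a reminder of how positional column selection works in pandas, here is a minimal sketch on a hypothetical four-column frame (`toy`, with invented column names). It is an illustration of slicing `columns`, not the solution for the diabetes dataset.

```python
import pandas as pd

# Hypothetical frame: three predictor columns followed by a target column
toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "c": [7, 8, 9],
    "target": [0, 1, 0],
})

# columns[:3] takes the first three column labels; the slice end is exclusive
predictor_names = toy.columns[:3]

X_toy = toy[predictor_names]  # DataFrame holding only the predictor columns
y_toy = toy["target"]         # Series holding the target

print(list(predictor_names))
print(X_toy.shape, y_toy.shape)
```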

**Split the Dataset**

Split arrays or matrices into random train and test subsets
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=train_test_split#sklearn.model_selection.train_test_split

In [None]:
# Split dataset into training set and test set using train_test_split with parameters X, y, test_size=
X_train, X_test, y_train, y_test = (..)  # 80% training and 20% testing 

#inspect the split datasets
print(X_train.head())
print(y_train.head())

print('Training dataset size:',X_train.shape[0])
print('Test dataset size:',X_test.shape[0])
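
The split is random, so results change from run to run unless a seed is fixed. The sketch below uses synthetic data (`X_demo`, `y_demo` are made up) to show two optional `train_test_split` parameters: `random_state` for reproducibility and `stratify` to keep the class ratio similar in both subsets. Whether to use them in this exercise is left to you.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 3 features, 65 negatives and 35 positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 65 + [1] * 35)

# Hold out 20% for testing, fix the seed, and preserve the class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print("positive share in train:", y_tr.mean())
print("positive share in test:", y_te.mean())
```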

**Training a DT Model**

1.   Train a model using the training dataset
2.   Make prediction using the model for the test dataset

Read about DecisionTreeClassifier at: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html




In [36]:
# Create Decision Tree classifier object
clf = DecisionTreeClassifier(criterion="entropy", max_depth=5) #default criterion is gini

# Train Decision Tree Classifier with the training dataset
clf = clf.fit(X_train, y_train)

#Make predictions for the test dataset
y_pred = clf.predict(X_test)
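
With `criterion="entropy"`, the tree chooses splits by how much they reduce Shannon entropy (information gain). The following is a plain-Python sketch of that calculation on made-up label lists; the helper names `entropy` and `information_gain` are hypothetical and are not part of scikit-learn.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting `parent` into `left` and `right`."""
    weight_left = len(left) / len(parent)
    weight_right = len(right) / len(parent)
    children = weight_left * entropy(left) + weight_right * entropy(right)
    return entropy(parent) - children

parent = [0, 0, 0, 0, 1, 1, 1, 1]          # perfectly mixed node
weak_gain = information_gain(parent, [0, 0, 0, 1], [0, 1, 1, 1])
pure_gain = information_gain(parent, [0, 0, 0, 0], [1, 1, 1, 1])

print("parent entropy:", entropy(parent))
print("gain of a weak split:", weak_gain)
print("gain of a pure split:", pure_gain)
```

The split that separates the classes completely recovers all of the parent's entropy, while the weak split recovers only a small part of it.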


**Inspect Predictions**

In [None]:
#inspection
inspection=pd.DataFrame({'Actual':y_test, 'Predicted':y_pred})
inspection.head(20)

**Model Evaluation**



1.   Calculate Accuracy, Precision, Recall, F1


Classification metrics: https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics







In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve
# The plot_* helper functions were removed in scikit-learn 1.2; the Display classes replace them
from sklearn.metrics import ConfusionMatrixDisplay, PrecisionRecallDisplay, RocCurveDisplay

# Model evaluation: calculate Accuracy, Precision, Recall and F1
print("Accuracy: ", metrics.accuracy_score(y_test,y_pred))
print("Precision: ", metrics.precision_score(y_test,y_pred))
print("Recall: ", metrics.recall_score(y_test,y_pred))
print("F1: ", metrics.f1_score(y_test,y_pred))
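
To see what these four numbers mean, the sketch below computes them by hand from counts of true/false positives and negatives. The label lists are made up for illustration and class 1 is assumed to be the positive class.

```python
# Made-up actual and predicted labels (1 = positive class)
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

pairs = list(zip(y_true_demo, y_pred_demo))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)                    # share of correct predictions
precision = tp / (tp + fp)                           # predicted positives that are right
recall = tp / (tp + fn)                              # actual positives that were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(tp, fp, fn, tn)
print(accuracy, precision, recall, f1)
```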



In [None]:
#print confusion matrix and evaluation report
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
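
In scikit-learn's confusion matrix, rows are the actual classes and columns are the predicted classes, so for binary labels 0/1 the flattened order is TN, FP, FN, TP. A small sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted labels
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

cm = confusion_matrix(y_true_demo, y_pred_demo)

# Row 0 = actual 0, row 1 = actual 1; column 0 = predicted 0, column 1 = predicted 1
tn, fp, fn, tp = cm.ravel()

print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```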

**Plot ROC (Receiver operating characteristic) curve and confusion matrix**

ROC curve
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.RocCurveDisplay.html

Confusion matrix
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html

In [None]:

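# plot_roc_curve and plot_confusion_matrix were removed in scikit-learn 1.2;
# the Display classes (scikit-learn >= 1.0) draw the same plots
metrics.RocCurveDisplay.from_estimator(clf, X_test, y_test)
plt.show()

metrics.ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, cmap=plt.cm.Greens, values_format='.5g')
plt.show()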
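
A ROC curve is built from predicted probabilities rather than hard 0/1 predictions: each threshold on the probability gives one (false positive rate, true positive rate) point. The sketch below computes those points and the area under the curve on synthetic data from `make_classification`; the data and all parameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_tr, y_tr)

# Probability of the positive class (column 1) for every test row
scores = tree.predict_proba(X_te)[:, 1]

# One (fpr, tpr) point per threshold, plus the area under the curve
fpr, tpr, thresholds = roc_curve(y_te, scores)
auc = roc_auc_score(y_te, scores)

print("number of ROC points:", len(fpr))
print("AUC:", auc)
```

A shallow tree produces only a few distinct probabilities, so its ROC curve has few points and looks piecewise linear.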

**Visualise the DT model**

In [None]:
from sklearn.tree import export_graphviz
from io import StringIO  # sklearn.externals.six is no longer available in scikit-learn
from IPython.display import Image  
import pydotplus

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True,feature_names = features,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png('Diabetes.png')
Image(graph.create_png())
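
If Graphviz is not installed, the tree can also be inspected as plain text with `sklearn.tree.export_text`. This is an alternative to the PNG rendering, sketched here on synthetic data with hypothetical feature names (`f0` to `f3`).

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with four features and hypothetical names
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
names = ["f0", "f1", "f2", "f3"]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
tree.fit(X_demo, y_demo)

# Render the learned split rules as indented text
rules = export_text(tree, feature_names=names)
print(rules)
```

Each indented line is a split condition, and each `class:` line is the prediction made at a leaf.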

**Tuning the decision tree model hyperparameters**

In [None]:
#Let's optimise the decision tree
# Create Decision Tree classifier object with criterion="entropy", define your own max_depth
clf = 

# Train Decision Tree Classifier using fit
clf = clf.

#Predict the response for test dataset using predict
y_pred = clf.

# Model Accuracy, how often is the classifier correct?
# metrics.accuracy_score(y_test,y_pred))
# metrics.precision_score(y_test,y_pred))
# metrics.recall_score(y_test,y_pred))
# metrics.f1_score(y_test,y_pred))

print("Accuracy: ")
print("Precision: ")
print("Recall: ")
print("F1: ")

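# plot_roc_curve and plot_confusion_matrix were removed in scikit-learn 1.2;
# the Display classes (scikit-learn >= 1.0) draw the same plots
metrics.RocCurveDisplay.from_estimator(clf, X_test, y_test)
plt.show()

metrics.ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, cmap=plt.cm.Greens, values_format='.5g')
plt.show()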
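
Trying `max_depth` values one at a time and checking the test set each time risks tuning to that particular test set. One common alternative, sketched below on synthetic data, is to compare depths with cross-validation. The candidate depths, the 5 folds and the F1 scoring are illustrative assumptions, not values prescribed by this exercise.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a training set
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

depths = [1, 2, 3, 5, 8, None]  # None lets the tree grow until leaves are pure
mean_scores = {}
for depth in depths:
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=0)
    # Mean F1 over 5 cross-validation folds
    mean_scores[depth] = cross_val_score(tree, X_demo, y_demo, cv=5, scoring="f1").mean()

best_depth = max(mean_scores, key=mean_scores.get)

for depth, score in mean_scores.items():
    print("max_depth =", depth, "mean F1 =", round(score, 3))
print("best max_depth:", best_depth)
```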

**Evaluate the new DT**

In [None]:
#print confusion matrix and evaluation report
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

**Visualise the new DT**

In [None]:
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True, feature_names = features,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png('Diabetes.png')
Image(graph.create_png())

**Create Random Forest Model**

In [None]:
from sklearn.ensemble import RandomForestClassifier

#Create a RandomForestClassifier object, set your own parameters n_estimators, max_depth, and criterion
rfclf = 

#Train the RandomForestClassifier
rfclf.

#Make predictions
y_pred = rfclf.

#Evaluate the RandomForestClassifier
print(classification_report(y_test, y_pred))
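
A fitted random forest also exposes `feature_importances_`, which indicates how much each predictor contributed to the splits across all trees. The sketch below shows this on synthetic data; the parameter values (`n_estimators=50`, `max_depth=4`) are arbitrary examples rather than recommended settings for the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with eight features (not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Example parameter values only
forest = RandomForestClassifier(
    n_estimators=50, max_depth=4, criterion="entropy", random_state=0
)
forest.fit(X_tr, y_tr)

importances = forest.feature_importances_  # one value per feature, summing to 1
accuracy = forest.score(X_te, y_te)        # mean accuracy on the held-out rows

for index, value in enumerate(importances):
    print("feature", index, "importance", round(value, 3))
print("test accuracy:", accuracy)
```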

**Evaluate the Random Forest Model**

In [None]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy: ", metrics.accuracy_score(y_test,y_pred))
print("Precision: ", metrics.precision_score(y_test,y_pred))
print("Recall: ", metrics.recall_score(y_test,y_pred))
print("F1: ", metrics.f1_score(y_test,y_pred))

# Can you plot the ROC curve and draw a confusion matrix? Don't forget to call plt.show()
