# Machine Learning Workshop - Predicting Patient Diabetes - Part 2

## More data, more fun!

## Accessing this material

All material used in this workshop can be downloaded at **[https://github.com/alexgcsa/ml_workshop_Oct2023](https://github.com/alexgcsa/ml_workshop_Oct2023)**.

Instructions for local installation are detailed in the *README.md* file.


## Outline
- Task Overview;
- Download Data and Requirement Files;
- Patient Data Exploration;
- Supervised Machine Learning:
    - Classification -- Identifying diabetic and non-diabetic (healthy) patients.


## Task Overview

The problem we will solve is, once again, **predicting patient diabetes with machine learning using medical and demographic data**.

This time, however, we have more data and more variables to consider.

Similar to Part 1, we will use a modified/compacted version of a dataset from **[Kaggle](https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset/)**. This dataset now includes **8** variables (features) instead of **6**:
 1. age;
 2. gender -- Distinguished in this data between male (1) and female (0);
 3. If the patient has or does not have hypertension (0: not having; 1: having);
 4. If the patient has or does not have heart disease (0: not having; 1: having);
 5. Body Mass Index (BMI);
 6. Smoking history -- Never or ever smoked (0: never; 1: ever);
 7. Blood glucose level;
 8. HbA1c level -- Average blood glucose (sugar) levels for the last two to three months.


Together with these features, patients are differentiated by their IDs and diabetes status (non-diabetic patient = **800**; diabetic patient = **300**). **Therefore, we doubled the patient data for Part 2 of this Workshop**.

We can use this data and these features to build **machine learning models** to predict diabetes in patients based on their medical history and demographic information. 

This can be useful for healthcare professionals in identifying patients at risk of developing diabetes and in developing personalised treatment plans.

## Download Data and Requirement Files

In [None]:
!wget https://raw.githubusercontent.com/alexgcsa/ml_workshop_Oct2023/master/data/diabetes_extra.csv
!wget https://raw.githubusercontent.com/alexgcsa/ml_workshop_Oct2023/master/requirements.txt
!wget https://raw.githubusercontent.com/alexgcsa/ml_workshop_Oct2023/master/utils.py

## Installing dependencies

In [None]:
!pip install -r requirements.txt

## Loading libraries

In [None]:
%load_ext autoreload
%autoreload 2

from plotly.offline import iplot, init_notebook_mode
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

from sklearn import tree
from sklearn import ensemble
from sklearn.model_selection import cross_val_predict, train_test_split


import numpy as np
import pandas as pd
import time

import utils

import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

## Exploring the data

In [None]:
# Load data from file to a DataFrame structure:
df_data = pd.read_csv("diabetes_extra.csv") 

print(df_data.shape) # .shape gives the dimensions of the dataframe as (rows, columns)
df_data.head(10) # .head(10) displays the first 10 rows of the dataframe

In [None]:
df_data.tail(10) # .tail(10) displays the last 10 rows of the dataframe

### Plotting the distribution of diabetic patients *versus* non-diabetic patients

In [None]:
# Counting the frequencies:
diabetes_counts = dict(df_data["class"].value_counts())

# Getting the classes:
classes = list(diabetes_counts.keys())

# Getting the values per class:
values = list(diabetes_counts.values())

# Making the plot:
fig = plt.bar(classes, values, color='purple', width=0.5)

#### What can you conclude from the plot above? Did the class distribution change from Part 1?

### Plotting the distribution of patient features

In [None]:
columns = ["gender", "age", "hypertension", "heart_disease", "bmi", "smoking_history"]

nrows = 3
ncols = 2

fig = make_subplots(rows=nrows, cols=ncols, start_cell="bottom-left")

# With start_cell="bottom-left", row 1 is the bottom row of the grid, so the
# column index is computed in reverse row order to keep the first features on top.
for i in range(nrows):
    for j in range(ncols):
        col_index = (nrows - i - 1) * ncols + j
        if col_index < len(columns):
            column_name = columns[col_index]
            fig.add_trace(go.Histogram(x=df_data[column_name], name=column_name), row=i + 1, col=j + 1)


fig.show(renderer="colab")

#### What can you conclude from the several plots above? How do they differ from Part 1?

### Supervised Machine Learning (Classification)

Your task is to build a classifier to differentiate diabetic patients from non-diabetic patients.

First, use the code below to select the columns holding the basic features from **df_data**.

In [None]:
# The target class and the main features from the dataset:
target = "class"
features =  ["gender", "age", "hypertension", "heart_disease", "bmi", "smoking_history"]

# Select them from data:
X = df_data[features]
y = df_data[target]

#### Data Splitting

In [None]:
# What is the best TEST_SIZE?
TEST_SIZE = 0.25

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE,
                                                    random_state=42)

utils.plot_train_test_class(y_train, y_test)

#### Why is data splitting necessary?

In the code above, we used the function **train_test_split**. This function divides our data into two subsets: the **training set** and the **testing set**. The training set is used to train (i.e., build) the model. The testing set is used to evaluate the model, assessing its predictive performance.

This step is very important because it helps ensure that the assessment of a machine learning model is valid. Using the testing set (a subset completely unseen during the training of the model), you can estimate the model's performance when it is applied to new data points.

**What do you think of having the same proportions of Diabetic and Non-Diabetic patients in the training and testing sets? Is this a fair approach? If it were you, would you change this? Why?**
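One way to control the class proportions explicitly is the `stratify` argument of `train_test_split`. The sketch below is illustrative only: it assumes synthetic labels (using the 800/300 class counts quoted in the Task Overview) and a placeholder feature column rather than the workshop data, and compares a plain random split with a stratified one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels for illustration only (not the workshop data):
# 800 non-diabetic (0) and 300 diabetic (1) patients.
y_toy = np.array([0] * 800 + [1] * 300)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

# Plain random split: class proportions may drift between the two subsets.
_, _, y_train_plain, y_test_plain = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)

# Stratified split: each subset keeps the overall class proportions.
_, _, y_train_strat, y_test_strat = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

overall_rate = y_toy.mean()
print(f"Overall diabetic fraction:       {overall_rate:.3f}")
print(f"Plain split (train / test):      {y_train_plain.mean():.3f} / {y_test_plain.mean():.3f}")
print(f"Stratified split (train / test): {y_train_strat.mean():.3f} / {y_test_strat.mean():.3f}")
```

Whether matching proportions is the right choice depends on the question being asked; the sketch only shows how to enforce it.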

##### [K-Fold Cross-Validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics))

Cross-validation is a resampling method that uses different data portions to train and validate a model on different iterations. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice.
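The following is a minimal sketch of how k-fold cross-validation partitions the data, using a toy label vector rather than the workshop data. It assumes `StratifiedKFold`, which is the splitter scikit-learn uses when a classifier is passed to `cross_val_predict` with an integer `cv`; the shuffling and the toy arrays are choices made here for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels for illustration only: 15 negatives and 5 positives.
y_toy = np.array([0] * 15 + [1] * 5)
X_toy = np.zeros((len(y_toy), 1))  # placeholder features; only the labels matter here

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

times_validated = np.zeros(len(y_toy), dtype=int)
fold_sizes = []
positives_per_fold = []

for fold, (train_idx, val_idx) in enumerate(skf.split(X_toy, y_toy), start=1):
    # Each sample index appears in the validation part of exactly one fold.
    times_validated[val_idx] += 1
    fold_sizes.append(len(val_idx))
    positives_per_fold.append(int(y_toy[val_idx].sum()))
    print(f"Fold {fold}: train on {len(train_idx)} samples, validate on {len(val_idx)}")

print("Times each sample was used for validation:", times_validated)
```

Because every sample is held out exactly once, the cross-validated predictions cover the whole training set without any model ever predicting a sample it was trained on.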

### First model: Decision Tree

Decision trees are composed of if/else questions that are arranged hierarchically. By following these questions, the model reaches a decision. In our case, the output is either a 'diabetic' or a 'non-diabetic' label. The decision that leads to a prediction is based on the features (patient demographics and medical information) we used as input for the ML algorithm.
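To make the if/else structure concrete, here is a small sketch that fits a one-question tree and prints its rule with scikit-learn's `export_text`. The single `bmi` feature and its values are hypothetical toy data chosen for illustration; they are not taken from the workshop dataset and carry no clinical meaning.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: one feature (BMI) and made-up labels, for illustration only.
X_toy = [[20.0], [22.0], [30.0], [35.0]]
y_toy = [0, 0, 1, 1]

# max_depth=1 restricts the tree to a single if/else question.
toy_tree = DecisionTreeClassifier(max_depth=1, random_state=42)
toy_tree.fit(X_toy, y_toy)

# Print the learned rule as indented if/else text.
rules = export_text(toy_tree, feature_names=["bmi"])
print(rules)

# Follow the rule for two new (hypothetical) patients.
toy_predictions = toy_tree.predict([[21.0], [33.0]])
print(toy_predictions)
```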

##### Training the decision tree on the set of features

In [None]:
# What happens when you change the max depth of the decision tree?
clf_dt = tree.DecisionTreeClassifier(max_depth=3, random_state=42)

# What happens if you change the number of folds in the cross-validation from 5 to 3 (or 10)?
y_pred_train_dt = cross_val_predict(clf_dt, X_train, y_train, cv=5)
clf_dt.fit(X_train, y_train)
y_pred_test_dt = clf_dt.predict(X_test)

##### Confusion Matrix Cases:
 1. Patient(s) is (are) Diabetic (1), and the model predicted them as Diabetic (1). **Correct Prediction (True Positive [TP])**.
 2. Patient(s) is (are) Non-Diabetic (0), and the model predicted them as Diabetic (1). **Wrong Prediction (False Positive [FP], Type I Error)**.
 3. Patient(s) is (are) Non-Diabetic (0), and the model predicted them as Non-Diabetic (0). **Correct Prediction (True Negative [TN])**.
 4. Patient(s) is (are) Diabetic (1), and the model predicted them as Non-Diabetic (0). **Wrong Prediction (False Negative [FN], Type II Error)**.
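
The four cases can be counted directly from paired true and predicted labels. This is a small sketch with made-up label lists (not workshop results), written in plain Python so each case is visible.

```python
# Made-up labels for illustration only: 1 = Diabetic, 0 = Non-Diabetic.
y_true = [1, 0, 0, 1, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]

tp = fp = tn = fn = 0
for actual, predicted in zip(y_true, y_pred):
    if actual == 1 and predicted == 1:
        tp += 1  # Diabetic predicted as Diabetic
    elif actual == 0 and predicted == 1:
        fp += 1  # Non-Diabetic predicted as Diabetic (Type I Error)
    elif actual == 0 and predicted == 0:
        tn += 1  # Non-Diabetic predicted as Non-Diabetic
    else:
        fn += 1  # Diabetic predicted as Non-Diabetic (Type II Error)

print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")
```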

##### Performance Metrics:

Given the Confusion Matrix, we can calculate a range of metrics, such as:
 1. **Precision** = $TP/(TP + FP)$
 2. **Recall** = $TP/(TP + FN)$
 3. **F1-score** = $2TP/(2TP + FP + FN)$ = $2 * (Precision * Recall)/(Precision + Recall)$

    

**Precision** measures how precise the model is among the patients it predicted as positive (i.e., Diabetic Patients): how many of them are actually positive (i.e., truly Diabetic Patients).

**Recall** calculates how many of the Actual Positives (Truly Diabetic Patients) our model identifies by labelling them as Positive (True Positive, i.e., Diabetic).

**F1-score** is a balance between Precision and Recall. It is useful when there is an uneven class distribution (i.e., when there is a large number of Actual Negatives/Non-Diabetic against Actual Positives/Diabetic; or vice-versa).
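
As a worked example of the three formulas, the sketch below plugs in hypothetical confusion-matrix counts (chosen for illustration, not produced by any model in this workshop) and checks that both forms of the F1-score agree.

```python
# Hypothetical confusion-matrix counts, for illustration only.
tp, fp, fn, tn = 40, 10, 20, 130

precision = tp / (tp + fp)  # share of predicted Diabetic patients that are truly Diabetic
recall = tp / (tp + fn)     # share of truly Diabetic patients that the model found

# The two equivalent forms of the F1-score.
f1_from_counts = 2 * tp / (2 * tp + fp + fn)
f1_from_pr = 2 * (precision * recall) / (precision + recall)

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-score:  {f1_from_counts:.3f}")
```

Note that the True Negatives (`tn`) do not enter any of the three metrics, which is why they remain informative when Non-Diabetic patients dominate the data.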

##### Estimate the predictive performance given the metrics Precision, Recall and F1-Score, which are calculated using the Confusion Matrix below:

In [None]:
utils.gen_train_test_performances(y_train, y_pred_train_dt, y_test, y_pred_test_dt, ['Not-Diabetic', 'Diabetic'])

##### Which metric do you consider most important for predicting diabetes? How would you define a good metric for it clinically? How can you compare a good clinical prediction to the models provided in Part 1?

##### [**ROC AUC**](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
ROC AUC stands for the "Area Under the Receiver Operating Characteristic (ROC) Curve".

The ROC curve is a graphical plot that illustrates the diagnostic ability of a classification model as its discrimination (probability) threshold is varied.

It plots the true positive rate against the false positive rate at various threshold settings. The AUC summarises this curve as a single number: the area under it.
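
The sketch below computes the ROC points and the AUC for four made-up patients with scikit-learn's `roc_curve` and `roc_auc_score`. The labels and scores are illustrative assumptions (the scores stand in for predicted probabilities of the Diabetic class); they are not outputs of the workshop models.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels and predicted scores, for illustration only.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# One (false positive rate, true positive rate) point per threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, thr in zip(fpr, tpr, thresholds):
    print(f"threshold={thr:.2f}  FPR={f:.2f}  TPR={t:.2f}")

auc_value = roc_auc_score(y_true, y_score)

# Equivalent reading of the AUC: the fraction of (positive, negative) pairs
# in which the positive patient receives the higher score.
pos_scores = y_score[y_true == 1]
neg_scores = y_score[y_true == 0]
pairwise_auc = np.mean(pos_scores[:, None] > neg_scores[None, :])

print(f"AUC from roc_auc_score: {auc_value:.2f}")
print(f"AUC from pairwise view: {pairwise_auc:.2f}")
```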

##### What do the ROC AUC plots look like?

In [None]:
utils.gen_train_test_roc(y_train, y_pred_train_dt, y_test, y_pred_test_dt, ['Not-Diabetic', 'Diabetic'])

##### How much better are these ROC AUC plots when compared to Part 1?

##### What does the decision tree look like?

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(9,6), dpi=300)
# Passing the feature names as a plain list works across scikit-learn versions
tree.plot_tree(clf_dt, feature_names=list(X_train.columns), class_names=['Not-Diabetic', 'Diabetic'], filled=True)
plt.show()

##### Which features are more important for the model?

In [None]:
utils.feat_importance(clf_dt)
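
For reference, fitted scikit-learn trees expose their importances through the `feature_importances_` attribute. The sketch below is an assumption about the kind of information such a plot is built from, not a description of what `utils.feat_importance` does internally; the synthetic data and the feature names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only (not the workshop data).
X_toy, y_toy = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
toy_names = ["feature_0", "feature_1", "feature_2", "feature_3"]  # hypothetical names

toy_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_toy, y_toy)

# Impurity-based importances: non-negative and normalised to sum to 1.
importances = pd.Series(toy_tree.feature_importances_, index=toy_names)
print(importances.sort_values(ascending=False))
```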

##### What can you conclude from the predictive performance of the Decision Tree? What can you say about the important features above? Are they different from those found by the Decision Tree in Part 1?

### Second model: Random Forest

A Random Forest (RF) is also a tree-based model. However, it is an ensemble of many randomised decision trees, each built from a different sample of the data and considering different subsets of the features. The final prediction of the model combines the predictions of all individual trees (in scikit-learn's classifier, by averaging their predicted class probabilities).
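
The averaging can be checked directly on a fitted scikit-learn forest. This is a minimal sketch on synthetic data (an assumption for illustration, not the workshop data): it compares the forest's `predict_proba` output with the mean of the probabilities predicted by the individual trees stored in `estimators_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration only (not the workshop data).
X_toy, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=42
)

forest = RandomForestClassifier(n_estimators=25, random_state=42).fit(X_toy, y_toy)

# Class probabilities predicted by the whole forest for the first 5 samples.
forest_proba = forest.predict_proba(X_toy[:5])

# The same samples, predicted by each tree separately and then averaged.
tree_probas = np.stack([t.predict_proba(X_toy[:5]) for t in forest.estimators_])
mean_tree_proba = tree_probas.mean(axis=0)

print("Forest probabilities:\n", forest_proba)
print("Mean of tree probabilities:\n", mean_tree_proba)
```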

##### Training the random forest on the set of features

In [None]:
# What happens when you change the number of trees (n_estimators) in the random forest?
clf_rf = ensemble.RandomForestClassifier(n_estimators=300, random_state=42)

# What happens if you change the number of folds in the cross-validation from 5 to 3 (or 10)?
y_pred_train_rf = cross_val_predict(clf_rf, X_train, y_train, cv=5)
clf_rf.fit(X_train, y_train)
y_pred_test_rf = clf_rf.predict(X_test)

##### Estimate the predictive performance of the Random Forest using Precision, Recall and F1-Score:

In [None]:
utils.gen_train_test_performances(y_train, y_pred_train_rf, y_test, y_pred_test_rf, ['Not-Diabetic', 'Diabetic'])

##### What do the ROC AUC plots look like?

In [None]:
utils.gen_train_test_roc(y_train, y_pred_train_rf, y_test, y_pred_test_rf, ['Not-Diabetic', 'Diabetic'])

##### How comparable are these results to those of the Random Forest built in Part 1?

##### Which features are more important for the model?

In [None]:
utils.feat_importance(clf_rf)

##### What can you conclude from the predictive performance of the Random Forest? What about its most important features? How can we compare the performance of the Random Forest to that of the Decision Tree?

##### Do these most important features differ from those found in Part 1?

### Important Questions:
1. How did including more data and more features affect the predictive performance of the Decision Tree and Random Forest?
2. How do the predictive results compare to each other? Did anything change?
3. Which classification model (Random Forest or Decision Tree) has been more influenced by this data change while predicting diabetes in patients?
   