# Machine Learning Applications for Health (COMP90089_2022_SM2)
# Tutorial 8: Deep Learning with MIMIC-IV clinical data

> ### Goal: Predict the mortality risk for Sepsis Cohort

#### Deep Learning Neural Networks


* **Data** set: query the cohort in MIMIC-IV 
* Create the machine learning model with **Keras Library**
* **Compile** the model
* **Fit** the model using the training data set
* **Evaluate** the performance of the model
* **Predict** testing data set (unseen data)


This tutorial is based on this [source](https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/).

### Set up the main **libraries**: keras, numpy, pandas.

In [None]:
# !pip install -q keras #Uncomment and run this cell to install Keras

In [1]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

from sklearn.model_selection import train_test_split 
from sklearn import metrics

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt

# Access data using Google BigQuery.
from google.colab import auth
from google.cloud import bigquery

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None) ##This is only to show all columns when printing a DataFrame

* Authenticate with the BigQuery platform, then define the function used to run queries.

In [2]:
# authenticate
auth.authenticate_user()

In [3]:
# Set up environment variables
project_id = 'CHANGE-ME' ##Change only this variable with your project ID in BigQuery Platform.
if project_id == 'CHANGE-ME': #No Need to change this one!
  raise ValueError('You must change project_id to your GCP project.')
os.environ["GOOGLE_CLOUD_PROJECT"] = project_id

# Read data from BigQuery into pandas dataframes.
def run_query(query, project_id=project_id):
  return pd.io.gbq.read_gbq(
      query,
      project_id=project_id,
      dialect='standard')

# set the dataset
dataset = 'mimiciv'


## **Data set**
We'll use a cohort derived from MIMIC-IV.

* The query below retrieves the data from the **BigQuery Platform**.
* We retrieve patients with **Sepsis**, a life-threatening complication caused by the body's response to an infection. Sepsis may develop when the immune system goes into **overdrive in response to an infection**.
* We then join the date of death, age, and gender from the patients table.


In [None]:
##We are retrieving patients using sepsis3 Table and joining it to patients Table.

df = run_query("""
SELECT sep.subject_id, sep.sofa_score, sep.respiration, sep.coagulation, sep.liver,
       sep.cardiovascular, sep.cns, sep.renal, pt.dod, pt.anchor_age, pt.gender
FROM `physionet-data.mimiciv_derived.sepsis3` AS sep
INNER JOIN `physionet-data.mimiciv_hosp.patients` AS pt
ON sep.subject_id = pt.subject_id
ORDER BY sep.subject_id
""")
print(df)

* Analyse the data as we did in the previous Tutorial: missing values, transform categorical into numerical, check dtype of each column.
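
The three checks listed above can be sketched on a toy table before applying them to the real cohort. The values below are invented for illustration; only the column names `sofa_score`, `dod` and `gender` come from the query.

```python
import pandas as pd

# Toy stand-in for the query result; all values are invented for illustration.
toy = pd.DataFrame({
    "sofa_score": [2, 5, 3],
    "dod": [None, "2150-01-01", None],   # missing value = no recorded death
    "gender": ["F", "M", "F"],
})

# 1. Count missing values per column.
missing_per_column = toy.isna().sum()

# 2. Turn the date of death into a binary label: 1 if a date is present, else 0.
toy["dod"] = toy["dod"].notna().astype(int)

# 3. One-hot encode the categorical column as integer 0/1 columns.
toy = pd.concat(
    [toy.drop(columns="gender"), pd.get_dummies(toy["gender"], dtype=int)],
    axis=1,
)

# 4. Inspect the resulting dtypes: every column should now be numeric.
print(missing_per_column)
print(toy.dtypes)
```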

In [None]:
dataset = df.copy()

#Replace Date of Death with a binary label: 1 if a date of death is recorded, 0 otherwise
dataset['dod'] = dataset['dod'].notna().astype(int)

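#Transform Gender column from categorical data to binary (0/1) indicator columns:
gender_categorical = pd.get_dummies(dataset['gender'], dtype=int) #dtype=int keeps the columns numeric (newer pandas defaults to bool)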

#Concatenate both Data frames:
final_sepsis = pd.concat([dataset,gender_categorical], axis = 1)

#Final Data set to work with:
final_sepsis = final_sepsis.drop(['subject_id','gender'], axis = 1)
print(final_sepsis)

In [None]:
#Check the final dtype of each column. Are they properly defined now?
final_sepsis.info() #info() prints its summary directly and returns None, so it is not wrapped in print()

* Split the data set into Training and Testing 

In [None]:
# split into input (X) and output (y) variables
target = 'dod'
X = final_sepsis.drop(labels = target, axis = 1) #Remove the target column from the dataset and create the independent(features) variables set
y = final_sepsis[target]

#Adjust the size of the testing set: we'll use 10% of the entire data. 
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size = 0.1, random_state = 1)

#Check the number of columns (features):
print(X_train.columns)
print(len(X_train.columns))
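
Conceptually, a hold-out split shuffles the row positions once and sets a fixed fraction aside for testing. The sketch below illustrates that idea with NumPy only. It is not scikit-learn's implementation, the function name `holdout_indices` is hypothetical, and the rows it selects will differ from those chosen by `train_test_split` even with the same seed.

```python
import numpy as np

def holdout_indices(n_rows, test_size=0.1, seed=1):
    """Return (train_idx, test_idx) for a shuffled hold-out split."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)           # random order of row positions
    n_test = int(np.ceil(n_rows * test_size))    # number of rows held out
    return shuffled[n_test:], shuffled[:n_test]

# n_rows=1000 is a made-up data set size used only for this illustration.
train_idx, test_idx = holdout_indices(n_rows=1000)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split reproducible, which is the role `random_state = 1` plays in the cell above.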

## Define Keras Model
* Models in Keras are defined as a sequence of layers.

* Create a Sequential model and add [layers](https://keras.io/api/layers/core_layers/) one at a time until satisfied with the network architecture. Read more [here](https://keras.io/api/models/sequential/).

Note: **the shape of the input to the model** is defined as an argument on the first hidden layer.

This means that the line of code that adds the first Dense layer is doing two things:

* Defining the input (the number of feature columns in the training set) as a vector with the **input_shape** argument.
* Defining the first hidden layer.
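
To see why the input size matters, it helps to count the trainable parameters it implies. The sketch below assumes the 10 input features and the layer sizes used in this tutorial (120, 30 and 1 nodes); the helper `dense_param_count` is a hypothetical name, not part of Keras.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Count weights and biases for a stack of fully-connected layers."""
    total = 0
    for n_units in layer_sizes:
        total += n_inputs * n_units + n_units   # weight matrix + bias vector
        n_inputs = n_units                      # this layer feeds the next one
    return total

n_params = dense_param_count(n_inputs=10, layer_sizes=[120, 30, 1])
print(n_params)
```

Once the model is built, `model.summary()` should report the same total, which is a quick way to confirm the input shape was set as intended.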

In [16]:
# Define the keras model. In this example we are using a fully-connected network structure (Dense class) with three layers (two hidden layers and one output layer)
model = Sequential()

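#The first hidden layer has 120 nodes and uses the relu activation function.
#input_shape is read from the training data (10 features here) instead of being hard-coded.
model.add(Dense(120, input_shape=(X_train.shape[1],), activation='relu')) #'relu': rectified linear activation function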

#The second hidden layer has 30 nodes and uses the relu activation function.
model.add(Dense(30, activation='relu')) #'relu': rectified linear activation function

#The output layer has 1 node and uses the sigmoid activation function.
model.add(Dense(1, activation='sigmoid')) # Using a sigmoid on the output layer ensures your network output is between 0 and 1
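
What these three layers compute can be sketched as a plain NumPy forward pass. This is an illustration only: the weights are random rather than trained, the input rows are invented, and Keras' own weight initialisation differs.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Randomly initialised weights; a trained model would hold learned values.
W1, b1 = rng.normal(size=(10, 120)) * 0.1, np.zeros(120)
W2, b2 = rng.normal(size=(120, 30)) * 0.1, np.zeros(30)
W3, b3 = rng.normal(size=(30, 1)) * 0.1, np.zeros(1)

x = rng.normal(size=(5, 10))   # 5 made-up patients, 10 features each
h1 = relu(x @ W1 + b1)         # first hidden layer: 120 activations per row
h2 = relu(h1 @ W2 + b2)        # second hidden layer: 30 activations per row
p = sigmoid(h2 @ W3 + b3)      # output layer: one value in (0, 1) per row

print(p.shape)
```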


### Compile Keras Model
* Now that the model is defined, you can compile it.
Remember that training a network means **finding the best set of weights** to map inputs to outputs in your dataset.

* Specify the [**loss function**](https://machinelearningmastery.com/how-to-choose-loss-functions-when-training-deep-learning-neural-networks/) to use to evaluate a set of weights, the **optimizer** used to search through different weights for the network, and the [**metrics**](https://keras.io/api/metrics/) to evaluate the training.
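
As an illustration of what a loss function measures, here is a minimal NumPy sketch of binary cross-entropy. The clipping constant `eps` is an assumption added for numerical safety, and the example probabilities are invented.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)  # avoid log(0)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
uninformative = binary_cross_entropy([1, 0], [0.5, 0.5])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])

print(confident_right, uninformative, confident_wrong)
```

The loss grows as the predicted probability moves away from the true label, so confident wrong predictions are penalised most heavily.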

In [18]:
# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

#loss='binary_crossentropy': cross-entropy loss, suited to binary classification problems.
#optimizer='adam': an adaptive variant of stochastic gradient descent.

### Fit Keras Model
* You have defined your model and compiled it to get ready for efficient computation.

* Now it is time to execute the model on some data: train or fit your model on your loaded data by calling the fit() function on the model.

Training occurs over epochs, and each epoch is split into batches. Read more [here](https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/).

* **Epoch:** One pass through all of the rows in the training dataset
* **Batch:** One or more samples considered by the model within an epoch before weights are updated
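
The two definitions above determine how many times the weights are updated during training. The sketch below works through the arithmetic, assuming a made-up training set of 10,050 rows together with the `batch_size=100` and `epochs=200` used in this tutorial; the function name is hypothetical.

```python
import math

def updates_during_training(n_rows, batch_size, epochs):
    """Return (weight updates per epoch, weight updates over all epochs)."""
    batches_per_epoch = math.ceil(n_rows / batch_size)  # the last batch may be smaller
    return batches_per_epoch, batches_per_epoch * epochs

# n_rows is an invented training-set size, not the size of the Sepsis cohort.
per_epoch, total = updates_during_training(n_rows=10_050, batch_size=100, epochs=200)
print(per_epoch, total)
```

A smaller batch size gives more updates per epoch at the cost of noisier gradient estimates.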

In [None]:
# fit the keras model on the dataset
classifier = model.fit(X_train, y_train, epochs=200, batch_size=100, verbose=1) #set verbose = 1 to see the fitting process on screen.

### Let's visualise the behavior of the model in terms of loss and accuracy

In [None]:
# Get the history data for the classifier:
print(classifier.history.keys())

In [None]:
# Plot Accuracy over the epochs
plt.plot(classifier.history['accuracy'])
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.title('model accuracy')
plt.legend(['train'], loc='upper left')
plt.show()

# Plot Loss over the epochs
plt.plot(classifier.history['loss'])
plt.ylabel('loss')
plt.xlabel('epoch')
plt.title('model loss')
plt.legend(['train'], loc='upper left')
plt.show()

### Evaluate Keras Model

* Ideally, you would like the loss to go to zero and the accuracy to go to 1.0 (i.e., 100%). This is not possible for any but the most trivial machine learning problems.
* Instead, you will always have some error in your model. The goal is to **choose a model configuration and training configuration that achieve the lowest loss and highest accuracy possible** for a given dataset.

In [None]:
# evaluate the keras model on the training data (this measures fit, not performance on unseen data)
loss, accuracy = model.evaluate(X_train, y_train)
print('Accuracy: %.2f%%' % (accuracy*100))
print(loss,accuracy)


### Make Predictions and Assess the Performance on Testing set (Blind)

In [None]:
#Predict the testing set: threshold the predicted probabilities at 0.5 and flatten to a 1-D array of labels
predictions = (model.predict(X_test) > 0.5).astype(int).ravel()

#Accuracy classification score
acc = float(round(metrics.accuracy_score(y_test, predictions),3))

#Compute the balanced accuracy.
bacc = float(round(metrics.balanced_accuracy_score(y_test, predictions),3))

#Compute the Matthews correlation coefficient (MCC)
mcc = float(round(metrics.matthews_corrcoef(y_test, predictions),3))

#Compute the F1 score, also known as balanced F-score or F-measure.
f1 = float(round(metrics.f1_score(y_test, predictions),3))

#Show results as a DataFrame:
results = {'Accuracy' : [acc], 'Balanced Accuracy' : [bacc], 'MCC' : [mcc], 'F1-Score' : [f1]}
df_results = pd.DataFrame.from_dict(data = results, orient='columns')
print(df_results)
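
All four scores above can be derived from the confusion matrix, which helps when interpreting them on imbalanced data. The sketch below uses the standard formulas with invented counts (900 survivors, 100 deaths); it illustrates the definitions and is not a replacement for the `sklearn.metrics` functions.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Compute common binary classification scores from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # recall on the positive class
    specificity = tn / (tn + fp)          # recall on the negative class
    precision = tp / (tp + fp)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }

# Invented counts: the model finds only 20 of the 100 deaths.
scores = binary_metrics(tp=20, fp=10, tn=890, fn=80)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With these counts the plain accuracy looks high while balanced accuracy, F1 and MCC stay low, which is why the tutorial reports all four.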

* Discussion: How does this result compare with the previous tutorial's unsupervised analysis of the same Sepsis data?