### Submission guidelines

1. Fill in your name in the notebook in the top cell.
2. Fill in the gaps in the code where indicated. <br> Make sure that you:<br> - fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE" <br> - **do not leave any `raise NotImplementedErrors`** in the code
3. Do **NOT change the variable names**, however, you can add comments in the code.
4. Do **NOT remove any of the cells** of the notebook!
5. Discussion is allowed, but every student needs to hand in a personal version of the lab. Plagiarism will be sanctioned!
6. Before submitting, restart your kernel & **make sure that every cell runs**.<br>Code that doesn't run will not be scored.<br>The notebooks with all source code, and optional extra files need to be handed in using Ufora.<br> Make sure all your notebooks are already executed when you upload them (i.e. there should be output after the cells). 
7. **Zip** your lab assignment folder and name the archive: `Surname_Name.zip` <br> Keep the same folder structure as the provided lab assignment!<br><span style='color: red'>Do not rename any of the notebooks or files</span>!<br>



In [None]:
NAME = ""


Final tip: make sure you have answered every question and filled in all the required code by running through the notebook and searching for *YOUR ANSWER HERE* and *YOUR CODE HERE*!

Good luck!

---

# Part 2: Data preparation
In Lab2 you saw how to extract data from the MIMIC-III database with SQL commands. In this lab we will use machine learning to predict the mortality of patients in the ICU based on the vital signs recorded during their first 48h in the ICU.

The raw data you extracted in Lab2 cannot be used directly for machine learning: the structured data first needs to be cleaned. This notebook will guide you through the steps of converting the raw data into a dataset suitable for machine learning.

**All the required packages for this lab are listed in the requirements.txt file, so we can install them all with a single pip command:**

In [None]:
!pip install -r requirements.txt

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython import display

We have already extracted the raw dataset for you, so you don't need to run any SQL queries to obtain it. It is stored in the **.parquet** format, a compressed data format that keeps the dataset from taking up too much space.

In [None]:
### Path to your 'icustays_events.parquet' file
path_to_data_file = 'data/icustays_events.parquet'

# Load data as a pandas DataFrame using the ICU stay ID as the DataFrame index, for facilitating data manipulation.
data = pd.read_parquet(path_to_data_file).set_index('icustay_id')

A quick preview of the data can be obtained using the 'head' function, which returns the first rows of any given DataFrame:

In [None]:
data.head(5)

We can take a look at all the columns that are in the dataset:

In [None]:
data.columns

The dataset contains information about:
- patient demographics: age, gender, weight
- physiological vital signs: heart rate, mean blood pressure, respiratory rate, ...
- lab test results: sodium, glucose, urea, creatinine, ...
- Mortality in hospital and mortality at 90 days

Each observation/row is associated with a time stamp (column 'hours_in_icu'), indicating the number of hours since ICU admission at which the observation was made. Each ICU stay has several observations for the same variable/column. The dataset contains the observations of the first **48h** in the ICU.
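
To make this long format concrete, the sketch below builds a tiny made-up table with the same layout (an `icustay_id` index, an `hours_in_icu` column and one vital sign) and counts the observations per stay. The values and the name `toy` are illustrative assumptions, not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in the same long format as the real dataset
toy = pd.DataFrame(
    {
        "icustay_id": [1, 1, 1, 2, 2],
        "hours_in_icu": [0.5, 3.0, 12.0, 1.0, 40.0],
        "heart_rate": [88.0, np.nan, 95.0, 110.0, 102.0],
    }
).set_index("icustay_id")

# Number of rows (time stamps) recorded per ICU stay
rows_per_stay = toy.groupby(level="icustay_id").size()

# Number of non-missing heart rate values per ICU stay
hr_per_stay = toy.groupby(level="icustay_id")["heart_rate"].count()

print(rows_per_stay)
print(hr_per_stay)
```

Note that `size` counts rows while `count` skips missing values, so the two can differ for the same stay.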

We can print the number of ICU stays by calculating the length of the unique indexes, the amount of missing data using the 'info' function and summary statistics using the 'describe' function:

In [None]:
print('Number of ICU stays: ' + str(len(data.index.unique())))
print('Number of survivors: ' + str(len(data[data['mortality_icu']==0].index.unique())))
print('Number of non-survivors: ' + str(len(data[data['mortality_icu']==1].index.unique())))
print('Mortality: ' + str(round(100*len(data[data['mortality_icu']==1].index.unique()) / len(data.index.unique()),1)) + '%')
print()
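# info() prints its report itself and returns None, so it is called directly.
# 'show_counts' replaces the 'null_counts' argument, which newer pandas versions no longer accept.
data.info(show_counts=True)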
display.display(data.describe())

In [None]:
# Feel free to use this cell to perform other exploratory operations, plots, etc.

The dataset consists of 52 799 unique ICU stays and 3 293 990 observations. From the Non-Null Count column we can see that all columns except 'hours_in_icu', 'mortality_icu', 'age' and 'gender' have missing values. Looking at the maximum and minimum values, it is possible to spot the presence of outliers (e.g. a minimum pH of 0). Both missing data and outliers are very common in ICU databases and need to be taken into consideration before applying ML algorithms.
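
As a compact alternative to reading the `info` and `describe` output by eye, the following sketch computes the fraction of missing values and the observed range per column. It runs on a small made-up frame (`toy` and its values are assumptions for illustration); the same two calls could be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column and an impossible pH
toy = pd.DataFrame(
    {
        "ph": [7.40, 0.0, np.nan, 7.35],
        "heart_rate": [80.0, 95.0, 120.0, np.nan],
    }
)

# Fraction of missing values per column
missing_fraction = toy.isna().mean()

# Observed range per column; implausible extremes hint at outliers
value_range = toy.agg(["min", "max"])

print(missing_fraction)
print(value_range)
```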

## Variable selection

There should be a trade-off between the potential value of the variable in the model and the amount of data available. We already saw the amount of missing data for every column, but we still do not know how much information is missing at the patient level. In order to do so, we are going to aggregate data by ICU stay and look at the number of non-null values, using the 'groupby' function together with the 'mean' operator. This will give an indication of how many ICU stays have at least one observation for each variable. 

We will consider every ICU stay as an independent sample. 

In [None]:
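# info() prints its report itself and returns None, so wrapping it in print() would only add a stray 'None'
data.groupby('icustay_id').mean().info(show_counts=True)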
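
The same patient-level information can also be expressed as a single number per variable: the share of ICU stays with at least one observation. The sketch below illustrates the idea on made-up data; `toy`, `has_observation` and `coverage` are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: glucose is measured in every stay, albumin in only one
toy = pd.DataFrame(
    {
        "icustay_id": [1, 1, 2, 2, 3],
        "glucose": [100.0, 120.0, np.nan, 90.0, 140.0],
        "albumin": [np.nan, 3.1, np.nan, np.nan, np.nan],
    }
).set_index("icustay_id")

# Count non-missing values per stay, then flag stays with at least one
has_observation = toy.groupby(level="icustay_id").count() > 0

# Share of ICU stays with at least one observation, per variable
coverage = has_observation.mean()
print(coverage)
```

A variable with low coverage, like `albumin` in this toy example, would force many stays to be dropped later on.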

**Bilirubin**, **albumin** and **central_venous_pressure** will be discarded due to the large amount of missing data. The other variables will be kept.

Let us start with the **time-variant** variables and set aside age and gender for now:

In [None]:
ts_variables = ['sodium', 'glucose', 'urea', 'creatinine', 'hemoglobin', 'white_blood_cells', 'ph', 'po2', 'pco2', 'weight', 'glasgow_coma_scale', 'heart_rate', 
               'mean_blood_pressure', 'respiratory_rate', 'oxygen_saturation', 'temperature', 'systolic_blood_pressure', 'diastolic_blood_pressure']
static_variables = ['age', 'gender']
label = ['mortality_icu']

## Removal of outliers

We can use boxplots to see how many outliers there are and how far they lie from the rest of the data.

In [None]:
fig, axes = plt.subplots(4, 5, figsize=(15, 15))

for idx, variable in enumerate(ts_variables):
    a = data.boxplot(variable, ax=axes.flatten()[idx])
    
plt.show()

In some cases, the outliers deviate so far from the norm that it is not even possible to visualize the distribution of the data (minimum, first quartile, median, third quartile, maximum) using boxplots. A lot of outliers are unrealistically high or low. Ideally, we want to remove values that are probably wrong due to incorrect input or measurement (such as negative temperatures or a weight above 1000kg), but keep extreme values that are related to the patient's poor health condition. Choosing good threshold values for outlier removal ideally requires expert knowledge, to avoid discarding useful information. In our case we choose thresholds that cut off the values that (according to the boxplots above) seem to be very extreme.

In [None]:
nulls_before = data.isnull().sum().sum()

data.loc[data['glucose']>2000, 'glucose'] = np.nan
data.loc[data['creatinine']>40, 'creatinine'] = np.nan
data.loc[data['hemoglobin']>40, 'hemoglobin'] = np.nan
data.loc[(data['ph']>7.8) | (data['ph']<6.8), 'ph'] = np.nan
data.loc[data['po2']>1000, 'po2'] = np.nan
data.loc[data['pco2']>1000, 'pco2'] = np.nan
data.loc[data['weight']>500, 'weight'] = np.nan
data.loc[data['heart_rate']>400, 'heart_rate'] = np.nan
data.loc[(data['mean_blood_pressure']>300) | (data['mean_blood_pressure']<0), 'mean_blood_pressure'] = np.nan
data.loc[data['respiratory_rate']>300, 'respiratory_rate'] = np.nan
data.loc[(data['oxygen_saturation']>100) | (data['oxygen_saturation']<0), 'oxygen_saturation'] = np.nan
data.loc[(data['temperature']>50) | (data['temperature']<20), 'temperature'] = np.nan
data.loc[data['systolic_blood_pressure']>300, 'systolic_blood_pressure'] = np.nan
data.loc[data['diastolic_blood_pressure']>300, 'diastolic_blood_pressure'] = np.nan
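# MIMIC-III obfuscates the ages of patients older than 89 (they appear as roughly 300 years);
# they are replaced here by 91.4, the median age of that group.
data.loc[data['age'] > 100, 'age'] = 91.4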

nulls_now = data.isnull().sum().sum()
print('Number of observations removed: ' + str(nulls_now - nulls_before))
print('Observations corresponding to outliers: ' + str(round((nulls_now - nulls_before)*100/data.shape[0],2)) + '%')
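
The thresholds could also be kept in a single table, which makes them easier to review with a clinical expert. The sketch below is one possible way to do this, not the method used in this lab: the dictionary `bounds` is a hypothetical helper that copies three of the thresholds from the cell above, and `toy` is made-up data.

```python
import numpy as np
import pandas as pd

# Hypothetical (low, high) plausibility bounds, copied from three thresholds above
bounds = {
    "ph": (6.8, 7.8),
    "temperature": (20, 50),
    "heart_rate": (-np.inf, 400),  # no lower bound
}

# Made-up observations, some of them implausible
toy = pd.DataFrame(
    {
        "ph": [7.4, 0.0, 7.9],
        "temperature": [36.8, 55.0, 37.2],
        "heart_rate": [80.0, 120.0, 900.0],
    }
)

for column, (low, high) in bounds.items():
    # Values strictly outside the bounds are treated as measurement errors
    outlier = (toy[column] < low) | (toy[column] > high)
    toy.loc[outlier, column] = np.nan

n_removed = int(toy.isna().sum().sum())
print(toy)
print("Values set to NaN:", n_removed)
```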

The same code as before can be used to verify the data distribution after exclusion of outliers. Setting by = 'mortality_icu' shows the boxplots partitioned by outcome.

In [None]:
fig, axes = plt.subplots(4, 5, figsize=(15,15))

for idx, variable in enumerate(ts_variables):
    a = data.boxplot(variable, ax=axes.flatten()[idx], by='mortality_icu')

fig.tight_layout()
plt.show()
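
For a numeric companion to the grouped boxplots, the quartiles per outcome can be computed directly. This is a small sketch on made-up values (`toy` is an assumption); on the real data the same call could be applied to any column in `ts_variables`.

```python
import pandas as pd

# Hypothetical toy data: three survivors (0) and three non-survivors (1)
toy = pd.DataFrame(
    {
        "mortality_icu": [0, 0, 0, 1, 1, 1],
        "heart_rate": [70.0, 80.0, 90.0, 100.0, 110.0, 120.0],
    }
)

# Quartiles per outcome group: the values a boxplot draws as its box and median line
quartiles = (
    toy.groupby("mortality_icu")["heart_rate"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```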

## Feature extraction

The next step before ML is to extract relevant features from the time series. We cannot simply feed the raw time series to our model, because the number of observations and the times at which they were taken differ for every ICU stay. Moreover, there are a lot of observations in 48h at the ICU, which could lead to overfitting of our models. A simpler solution is to use only a portion of the information available, ideally the most relevant information for the prediction task.

Feature construction addresses the problem of finding the transformation of variables containing the greatest amount of useful information. We will use four statistical features in order to construct/extract information from the time series:
- Maximum
- Minimum
- Standard deviation
- Mean

These features summarize the extremes (highest and lowest values), the variation and the average of the patient's condition during the first 48h in the ICU. Using the 'groupby' function to aggregate data by ICU stay, together with the 'max', 'min', 'std' and 'mean' operators, these features can be easily extracted.

In [None]:
ts_data_max = data.groupby(['icustay_id'])[ts_variables].max()
ts_data_max.columns = ['max_' + str(col) for col in ts_data_max.columns]

ts_data_min = data.groupby(['icustay_id'])[ts_variables].min()
ts_data_min.columns = ['min_' + str(col) for col in ts_data_min.columns]

ts_data_sd = data.groupby(['icustay_id'])[ts_variables].std()
ts_data_sd.columns = ['sd_' + str(col) for col in ts_data_sd.columns]

ts_data_mean = data.groupby(['icustay_id'])[ts_variables].mean()
ts_data_mean.columns = ['mean_' + str(col) for col in ts_data_mean.columns]

ts_features = pd.concat([ts_data_min, ts_data_max, ts_data_sd, ts_data_mean],axis=1)

# Dropping all the rows that contain nan's
ts_features = ts_features.dropna()

print('Extracted features: ')
display.display(list(ts_features.columns))
print('')
print('Number of rows with nan\'s dropped: ' + str(len(data.index.unique()) - ts_features.shape[0]))
print('Number of ICU stays: ' + str(ts_features.shape[0]))
print('Number of features: ' + str(ts_features.shape[1]))
ts_features.head()
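
The four statistics can also be computed in a single `agg` call. The sketch below shows this alternative on made-up data (`toy` is an assumption) and also illustrates why some stays disappear after `dropna()`: a stay with a single observation of a variable has an undefined standard deviation. Note that this variant names the columns with a `std_` prefix rather than the `sd_` prefix used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: stay 2 has only one temperature measurement
toy = pd.DataFrame(
    {
        "icustay_id": [1, 1, 1, 2, 2],
        "heart_rate": [80.0, 90.0, 100.0, 60.0, 70.0],
        "temperature": [36.5, 37.0, 37.5, 38.0, np.nan],
    }
).set_index("icustay_id")

# One call computes all four statistics; columns become (variable, statistic) pairs
features = toy.groupby(level="icustay_id").agg(["min", "max", "std", "mean"])

# Flatten the two-level column index into names such as 'max_heart_rate'
features.columns = [f"{stat}_{variable}" for variable, stat in features.columns]

print(features)
print("Stays left after dropna():", len(features.dropna()))
```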

We still have to add our time-invariant features and the labels to the dataset:

In [None]:
static_features = data[static_variables].groupby(by='icustay_id').mean()
mortality = data[['mortality_icu']].groupby(by='icustay_id').mean()

all_features = pd.concat([ts_features, static_features, mortality], axis=1).dropna()
print('Number of ICU stays: ' + str(all_features.shape[0]))
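
Taking the mean of a static variable only makes sense if it really is constant within each stay. The following sketch shows one way to check that assumption on made-up data; `toy` and the 0/1 encoding of `gender` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: static variables repeated on every row of a stay
toy = pd.DataFrame(
    {
        "icustay_id": [1, 1, 2, 2],
        "age": [64.0, 64.0, 71.0, 71.0],
        "gender": [0, 0, 1, 1],
    }
).set_index("icustay_id")

grouped = toy.groupby(level="icustay_id")

# A static variable should take exactly one distinct value per stay
is_constant = (grouped.nunique() == 1).all()

# When that holds, the first row of each stay equals the per-stay mean
static_first = grouped.first()
static_mean = grouped.mean()

print(is_constant)
```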

Now we can save this cleaned dataset in order to use it for the machine learning part of this Lab.

In [None]:
all_features.to_parquet('cleaned_dataset.parquet')

Continue with the notebook called **Part 3 - Mortality Clustering and Classification.ipynb**.