# Exercise 3: Data preparation 

In this exercise, we will prepare the dataset for model fitting in four steps:
1. Feature/label split
2. One-hot encoding for categorical variables
3. Convert data table to numpy arrays 
4. Train/test set split 

First, our data consists of two parts: features (clinical variables) and labels (ICU mortality). We separate them so that each can be processed on its own in the following steps.

In [66]:
import pandas as pd

In [67]:
data = pd.read_csv("./imputed_data.csv", index_col=0)

In [68]:
data.head()

Unnamed: 0,subject_id,hadm_id,icustay_id,gender,age,first_careunit,aniongap_min,aniongap_max,albumin_min,albumin_max,...,pt_min,pt_max,sodium_min,sodium_max,bun_min,bun_max,wbc_min,wbc_max,vent,mort_icu
0,3,145834,211552,M,76.5268,MICU,15.0,23.0,1.8,1.8,...,13.5,15.7,136.0,153.0,41.0,53.0,11.3,24.4,1,0
1,4,185777,294638,F,47.845,MICU,15.0,15.0,2.8,2.8,...,12.8,12.8,141.0,141.0,10.0,10.0,9.7,9.7,0,0
2,6,107064,228232,F,65.9407,SICU,20.0,23.0,3.0,3.0,...,12.6,14.6,134.0,138.0,62.0,65.0,10.6,10.6,0,0
3,7,118037,236754,F,0.0017,NICU,13.043277,15.757741,3.115281,3.200476,...,14.893293,16.910052,136.661065,140.08174,24.039024,28.386141,22.8,22.8,0,0
4,8,159514,262299,M,0.0012,NICU,13.043277,15.757741,3.115281,3.200476,...,14.893293,16.910052,136.661065,140.08174,24.039024,28.386141,18.7,18.7,1,0


In [69]:
# Select the clinical variables: the fourth column (index 3, gender) through the second-to-last column
features = data.iloc[:, 3:-1]

In [70]:
features.head()

Unnamed: 0,gender,age,first_careunit,aniongap_min,aniongap_max,albumin_min,albumin_max,bicarbonate_min,bicarbonate_max,bilirubin_min,...,inr_max,pt_min,pt_max,sodium_min,sodium_max,bun_min,bun_max,wbc_min,wbc_max,vent
0,M,76.5268,MICU,15.0,23.0,1.8,1.8,11.0,25.0,0.8,...,1.7,13.5,15.7,136.0,153.0,41.0,53.0,11.3,24.4,1
1,F,47.845,MICU,15.0,15.0,2.8,2.8,21.0,21.0,1.9,...,1.1,12.8,12.8,141.0,141.0,10.0,10.0,9.7,9.7,0
2,F,65.9407,SICU,20.0,23.0,3.0,3.0,15.0,18.0,0.2,...,1.4,12.6,14.6,134.0,138.0,62.0,65.0,10.6,10.6,0
3,F,0.0017,NICU,13.043277,15.757741,3.115281,3.200476,22.670118,25.078192,2.043297,...,1.656829,14.893293,16.910052,136.661065,140.08174,24.039024,28.386141,22.8,22.8,0
4,M,0.0012,NICU,13.043277,15.757741,3.115281,3.200476,22.670118,25.078192,2.043297,...,1.656829,14.893293,16.910052,136.661065,140.08174,24.039024,28.386141,18.7,18.7,1


In [71]:
# Select ICU mortality from the dataset (last column)
labels = data.mort_icu

In [72]:
labels.head()

0    0
1    0
2    0
3    0
4    0
Name: mort_icu, dtype: int64
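
Selecting columns by position works for this file, but it silently breaks if the column order ever changes. As a minimal sketch of an alternative, the same split can be expressed by column name. The toy table below is made up for illustration and only reuses a few of the column names shown above; `id_columns` and `label_column` are hypothetical helper names.

```python
import pandas as pd

# Toy table with a few of the columns from the real dataset (values are illustrative)
toy = pd.DataFrame({
    "subject_id": [3, 4],
    "hadm_id": [145834, 185777],
    "icustay_id": [211552, 294638],
    "gender": ["M", "F"],
    "age": [76.5268, 47.845],
    "vent": [1, 0],
    "mort_icu": [0, 0],
})

id_columns = ["subject_id", "hadm_id", "icustay_id"]  # identifiers, not clinical variables
label_column = "mort_icu"                             # the outcome we want to predict

# Features: everything except the identifiers and the label
toy_features = toy.drop(columns=id_columns + [label_column])
# Labels: the outcome column only
toy_labels = toy[label_column]

print(toy_features.columns.tolist())
print(toy_labels.tolist())
```

Applied to `data`, this should select the same columns as `data.iloc[:, 3:-1]`, provided the identifiers and the label are the only non-feature columns.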

Next, we take the categorical variables, gender and first care unit, and encode them numerically without imposing an arbitrary ordering. For example, gender is recorded as 'M' for male and 'F' for female in the original data, but machine learning models cannot learn directly from strings. For a two-level variable we could simply map the levels to 0 and 1. For a feature with more levels, say 5, mapping the levels to the integers 0 to 4 may lead the algorithm to place more importance on the categories that happen to receive larger numbers. Instead, we replace each categorical column with one binary column per level, so the single gender column becomes 2 columns. This is called **one-hot encoding** in machine learning.

In [73]:
# Perform one-hot encoding 
features = pd.get_dummies(features)

In [74]:
features.head()

Unnamed: 0,age,aniongap_min,aniongap_max,albumin_min,albumin_max,bicarbonate_min,bicarbonate_max,bilirubin_min,bilirubin_max,creatinine_min,...,wbc_max,vent,gender_F,gender_M,first_careunit_CCU,first_careunit_CSRU,first_careunit_MICU,first_careunit_NICU,first_careunit_SICU,first_careunit_TSICU
0,76.5268,15.0,23.0,1.8,1.8,11.0,25.0,0.8,0.8,2.4,...,24.4,1,0,1,0,0,1,0,0,0
1,47.845,15.0,15.0,2.8,2.8,21.0,21.0,1.9,1.9,0.5,...,9.7,0,1,0,0,0,1,0,0,0
2,65.9407,20.0,23.0,3.0,3.0,15.0,18.0,0.2,0.2,10.0,...,10.6,0,1,0,0,0,0,0,1,0
3,0.0017,13.043277,15.757741,3.115281,3.200476,22.670118,25.078192,2.043297,2.33418,1.325493,...,22.8,0,1,0,0,0,0,1,0,0
4,0.0012,13.043277,15.757741,3.115281,3.200476,22.670118,25.078192,2.043297,2.33418,1.325493,...,18.7,1,0,1,0,0,0,1,0,0
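
To see what the encoding does on a small scale, here is a minimal sketch on a made-up three-row table. The values are illustrative, not taken from the dataset. Passing `dtype=int` is an assumption about the environment: recent pandas versions return boolean dummy columns by default, and `dtype=int` reproduces the 0/1 output shown above.

```python
import pandas as pd

# Toy table: one numeric column and two categorical columns
toy = pd.DataFrame({
    "age": [76.5, 47.8, 65.9],
    "gender": ["M", "F", "F"],
    "first_careunit": ["MICU", "SICU", "MICU"],
})

# Each categorical column becomes one 0/1 column per level;
# numeric columns are left untouched and stay at the front
encoded = pd.get_dummies(toy, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one 1 among the `gender_*` columns and exactly one 1 among the `first_careunit_*` columns, so no level is ranked above another.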


Next, we convert the data tables into arrays so that they can be fed directly into the algorithms.

In [75]:
# Convert features and labels to NumPy arrays.
features = features.values
labels = labels.values
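
A NumPy array carries no column names, so it can help to save them before converting. The sketch below shows this on a toy table; `feature_names` is a hypothetical variable name. It also uses `to_numpy(dtype=float)` rather than `.values`: in pandas versions where the dummy columns are boolean, mixing them with float columns may otherwise produce an array of generic Python objects instead of floats.

```python
import pandas as pd

# Toy table, one-hot encoded as in the previous step (values are illustrative)
toy = pd.DataFrame({"age": [76.5, 47.8], "gender": ["M", "F"]})
toy_encoded = pd.get_dummies(toy)

# Keep the column names: the array itself will not remember them
feature_names = toy_encoded.columns.tolist()

# Request a float array explicitly so every column ends up numeric
X = toy_encoded.to_numpy(dtype=float)

print(feature_names)
print(X.dtype, X.shape)
```

The saved names are useful later, for example when matching a fitted model's coefficients or feature importances back to clinical variables.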

In the last step of data preparation, we split the dataset into two parts: a training set and a testing set. The training set is used to fit the model, while the testing set is held out to evaluate the performance of the fitted model.

In [76]:
# Use scikit-learn to split the data into training and testing sets
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets.
# test_size=0.25 holds out 25% of all rows for testing,
# so the testing set is 1/3 the size of the training set.
# The data is shuffled before it is split, so we set random_state
# to make the split reproducible.
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.25, random_state=2018
)

In [77]:
# Check out the size of train_features and train_labels
print("The size of train features is ", train_features.shape)
print("The size of train labels is ", train_labels.shape)

The size of train features is  (46149, 46)
The size of train labels is  (46149,)
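
It is also worth checking the testing set and comparing the outcome rate in the two parts, since ICU mortality is likely to be the minority class. The sketch below uses synthetic data (random features and a roughly 10% positive rate, both made up) and passes `stratify=y`, an optional `train_test_split` argument that is not used in the cell above, to keep the class proportions nearly equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real arrays: 1000 rows, 5 features, about 10% positives
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.1).astype(int)

# stratify=y preserves the proportion of each class in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=2018, stratify=y
)

print("Train/test shapes:", X_train.shape, X_test.shape)
print("Positive rate in train:", y_train.mean())
print("Positive rate in test:", y_test.mean())
```

Without `stratify`, the two rates can drift apart by chance, which matters most when the positive class is rare.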
