# Heart Failure Prediction - Basic Models

## Overview

Preparing the data, computing basic statistics and constructing simple models are essential steps for data science practice. In this activity, we will use clinical data as raw input to perform **Heart Failure Prediction**. 

In [1]:
import os
import sys

DATA_PATH = "lib/data/"
TRAIN_DATA_PATH = DATA_PATH + "train/"
VAL_DATA_PATH = DATA_PATH + "val/"

## Raw Data

For this project, we will be using a clinical dataset synthesized from [MIMIC-III](https://www.nature.com/articles/sdata201635).

Navigate to `TRAIN_DATA_PATH`. There are three CSV files which will be the input data in this homework. 

In [2]:
!ls $TRAIN_DATA_PATH

event_feature_map.csv  events.csv  hf_events.csv


**events.csv**

The data provided in *events.csv* are event sequences. Each line of this file consists of a tuple with the format *(pid, event_id, vid, value)*. 

For example, 

```
33,DIAG_244,0,1
33,DIAG_414,0,1
33,DIAG_427,0,1
33,LAB_50971,0,1
33,LAB_50931,0,1
33,LAB_50812,1,1
33,DIAG_425,1,1
33,DIAG_427,1,1
33,DRUG_0,1,1
33,DRUG_3,1,1
```

- **pid**: De-identified patient identifier. For example, the patient in the example above has pid 33. 
- **event_id**: Clinical event identifier. For example, DIAG_244 means the patient was diagnosed with the disease with ICD9 code [244](http://www.icd9data.com/2013/Volume1/240-279/240-246/244/244.htm); LAB_50971 means that the laboratory test with code 50971 was conducted on the patient; and DRUG_0 means that a drug with code 0 was prescribed to the patient. Corresponding lab (drug) names can be found in `{DATA_PATH}/lab_list.txt` (`{DATA_PATH}/drug_list.txt`).
- **vid**: Visit identifier. For example, the patient above has two visits in total. Note that vid is ordinal. That is, visits with a larger vid occur after those with a smaller vid.
- **value**: Contains the value associated with an event (always 1 in the synthesized dataset).
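As a quick illustration of the event format, the sketch below loads the ten sample rows shown above and counts events by type within each visit. The column names follow the ones used by the code in this notebook; the `event_type` and `code` columns are names introduced here for illustration only.

```python
import io
import pandas as pd

# The sample rows from above, with the column names used elsewhere in this notebook.
sample = io.StringIO(
    "pid,event_id,vid,value\n"
    "33,DIAG_244,0,1\n"
    "33,DIAG_414,0,1\n"
    "33,DIAG_427,0,1\n"
    "33,LAB_50971,0,1\n"
    "33,LAB_50931,0,1\n"
    "33,LAB_50812,1,1\n"
    "33,DIAG_425,1,1\n"
    "33,DIAG_427,1,1\n"
    "33,DRUG_0,1,1\n"
    "33,DRUG_3,1,1\n"
)
events_sample = pd.read_csv(sample)

# Split event_id into its type prefix (DIAG, LAB, DRUG) and its code.
# "event_type" and "code" are illustrative column names, not part of the dataset.
events_sample[["event_type", "code"]] = events_sample.event_id.str.split(
    "_", n=1, expand=True
)

# Number of events of each type in each visit (missing combinations become 0).
type_counts = (
    events_sample.groupby(["vid", "event_type"]).size().unstack(fill_value=0)
)
print(type_counts)
```

For this sample, visit 0 contains only diagnoses and lab tests, while drugs first appear in visit 1.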

**hf_events.csv**

The data provided in *hf_events.csv* contains the pid of each patient who has been diagnosed with heart failure (i.e., DIAG_398, DIAG_402, DIAG_404, DIAG_428) in at least one visit. Each line is a tuple with the format *(pid, vid, label)*. For example,

```
156,0,1
181,1,1
```

The vid indicates the index of the first visit with heart failure of that patient and a label of 1 indicates the presence of heart failure. **Note that only patients with heart failure are included in this file. Patients who are not mentioned in this file have never been diagnosed with heart failure.**
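Because patients without heart failure are absent from *hf_events.csv*, a label for every patient has to be derived by combining both files. The following is a minimal sketch of one way to do that, using small made-up tables (`events_toy`, `hf_toy`) in place of the real files and assuming the third column is named `label`.

```python
import pandas as pd

# Toy stand-ins for events.csv and hf_events.csv (values are illustrative only).
events_toy = pd.DataFrame({"pid": [156, 156, 181, 181, 200], "vid": [0, 1, 0, 1, 0]})
hf_toy = pd.DataFrame({"pid": [156, 181], "vid": [0, 1], "label": [1, 1]})

# Every patient seen in the events gets label 1 if listed in hf_toy, else 0.
labels = (
    events_toy[["pid"]]
    .drop_duplicates()
    .merge(hf_toy[["pid", "label"]], on="pid", how="left")
    .fillna({"label": 0})      # patients missing from hf_toy are normal patients
    .astype({"label": int})
)
print(labels)
```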

**event_feature_map.csv**

The *event_feature_map.csv* is a map from an event_id to an integer index. This file contains *(idx, event_id)* pairs for all event ids.

## 1 Descriptive Statistics

Before starting analytic modeling, it is good to get descriptive statistics of the input raw data. We will write code that computes various metrics on the data described previously.

The terms used in the results are defined below:

- **Event count**: Number of events recorded for a given patient.
- **Encounter count**: Number of visits recorded for a given patient.

Note that every line in the input file is an event, while each visit consists of multiple events.
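To make the difference between the two counts concrete, here is a small sketch on made-up data: the event count is the number of rows per patient, and the encounter count is the number of distinct visits per patient. The table `events_toy` and the list `hf_pids_toy` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Toy events: patient 33 has 4 events over 2 visits, patient 40 has 2 events in 1 visit.
events_toy = pd.DataFrame({
    "pid": [33, 33, 33, 33, 40, 40],
    "event_id": ["DIAG_244", "LAB_50971", "DIAG_425", "DRUG_0", "DIAG_414", "LAB_50931"],
    "vid": [0, 0, 1, 1, 0, 0],
    "value": 1,
})
hf_pids_toy = [33]  # hypothetical: treat patient 33 as a heart failure patient

# One row per patient: number of events and number of distinct visits.
per_patient = events_toy.groupby("pid").agg(
    event_count=("event_id", "size"),
    encounter_count=("vid", "nunique"),
)

# Summarize each count separately for heart failure and normal patients.
per_patient["is_hf"] = per_patient.index.isin(hf_pids_toy)
summary = per_patient.groupby("is_hf")[["event_count", "encounter_count"]].agg(
    ["mean", "max", "min"]
)
print(summary)
```

Grouping once and then splitting by a boolean column avoids a Python-level loop over patients, which can matter on larger event tables.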

In [3]:
import time
import pandas as pd
import numpy as np
import datetime

def read_csv(filepath=TRAIN_DATA_PATH):

    '''
    Read the events.csv and hf_events.csv files. 
    Variables returned from this function are passed as input to the metric functions.
    '''
    
    events = pd.read_csv(filepath + 'events.csv')
    hf = pd.read_csv(filepath + 'hf_events.csv')

    return events, hf

def event_count_metrics(events, hf):

    '''
    Return the event count metrics.
    Event count is defined as the number of events recorded for a given patient.
    '''
    event_counts_df = events.pid.value_counts().rename_axis('pid').reset_index(name='event_counts')
    hf_event_counts, norm_event_counts = [], []
    for index, row in event_counts_df.iterrows():
        if row.pid in list(hf.pid):
            hf_event_counts.append(row.event_counts)
        else:
            norm_event_counts.append(row.event_counts)

    avg_hf_event_count = np.mean(hf_event_counts)
    max_hf_event_count = np.max(hf_event_counts)
    min_hf_event_count = np.min(hf_event_counts)
    avg_norm_event_count = np.mean(norm_event_counts)
    max_norm_event_count = np.max(norm_event_counts)
    min_norm_event_count = np.min(norm_event_counts)    

    return avg_hf_event_count, max_hf_event_count, min_hf_event_count, \
           avg_norm_event_count, max_norm_event_count, min_norm_event_count

def encounter_count_metrics(events, hf):

    '''
    Return the encounter count metrics.
    Encounter count is defined as the number of visits recorded for a given patient. 
    '''
    encounter_counts_df = events.drop_duplicates(subset=['pid', 'vid']).pid.value_counts().rename_axis('pid').reset_index(name='encounter_counts')
    hf_encounter_counts, norm_encounter_counts = [], []
    for index, row in encounter_counts_df.iterrows():
        if row.pid in list(hf.pid):
            hf_encounter_counts.append(row.encounter_counts)
        else:
            norm_encounter_counts.append(row.encounter_counts)

    avg_hf_encounter_count = np.mean(hf_encounter_counts)
    max_hf_encounter_count = np.max(hf_encounter_counts)
    min_hf_encounter_count = np.min(hf_encounter_counts)
    avg_norm_encounter_count = np.mean(norm_encounter_counts)
    max_norm_encounter_count = np.max(norm_encounter_counts)
    min_norm_encounter_count = np.min(norm_encounter_counts)     

    return avg_hf_encounter_count, max_hf_encounter_count, min_hf_encounter_count, \
           avg_norm_encounter_count, max_norm_encounter_count, min_norm_encounter_count

In [4]:
events, hf = read_csv(TRAIN_DATA_PATH)

#Compute the event count metrics
start_time = time.time()
event_count = event_count_metrics(events, hf)
end_time = time.time()
print(("Time to compute event count metrics: " + str(end_time - start_time) + "s"))
print(event_count)

#Compute the encounter count metrics
start_time = time.time()
encounter_count = encounter_count_metrics(events, hf)
end_time = time.time()
print(("Time to compute encounter count metrics: " + str(end_time - start_time) + "s"))
print(encounter_count)

Time to compute event count metrics: 1.9538638591766357s
(188.9375, 2046, 28, 118.64423076923077, 1014, 6)
Time to compute encounter count metrics: 2.034022331237793s
(2.8060810810810812, 34, 2, 2.189423076923077, 11, 1)


## 2 Feature construction

It is common practice to convert raw data into a standard data format before running machine learning models. Here we will implement the necessary Python functions and work with the *events.csv*, *hf_events.csv* and *event_feature_map.csv* files provided in the **TRAIN_DATA_PATH** folder.

Some related concepts:

<img src="img/window.jpg" width="600"/>

- **Index vid**: Index vid is evaluated as follows:
  - For heart failure patients: Index vid is the vid of the first visit with heart failure for that patient (i.e., vid field in *hf_events.csv*). 
  - For normal patients: Index vid is the vid of the last visit for that patient (i.e., vid field in *events.csv*). 
- **Observation Window**: The time interval you will use to identify relevant events. Only events present in this window should be included while constructing feature vectors.
- **Prediction Window**: A fixed time interval that is to be used to make the prediction.

In the example above, the index vid is 3. Visits with vid 0, 1, 2 are within the observation window. The prediction window is between visit 2 and 3.
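The definitions above can be sketched on a toy example. Patient 1 mirrors the figure (first heart failure visit at vid 3), while patient 2 is a made-up normal patient with three visits. The names `events_toy`, `hf_toy`, `index_vid` and `observed` are illustrative and not part of the provided code.

```python
import pandas as pd

# Patient 1: heart failure first recorded at visit 3. Patient 2: normal patient.
events_toy = pd.DataFrame({
    "pid": [1, 1, 1, 1, 2, 2, 2],
    "vid": [0, 1, 2, 3, 0, 1, 2],
})
hf_toy = pd.DataFrame({"pid": [1], "vid": [3]})

# Default index vid: the last visit of each patient.
last_vid = events_toy.groupby("pid").vid.max()

# Heart failure patients use their first heart failure visit instead.
first_hf_vid = hf_toy.set_index("pid").vid
index_vid = first_hf_vid.reindex(last_vid.index).fillna(last_vid).astype(int)

# Observation window: visits strictly before the index vid.
in_window = events_toy.vid < events_toy.pid.map(index_vid)
observed = events_toy[in_window].groupby("pid").vid.apply(list).to_dict()
print(index_vid.to_dict(), observed)
```

Note that for a normal patient the last visit itself becomes the index vid, so it falls outside the observation window.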

### 2.1 Compute the index vid

We will use the above definitions to compute the index vid for all patients. 

In [7]:
import pandas as pd
import datetime


def read_csv(filepath=TRAIN_DATA_PATH):
    
    '''
    Read the events.csv, hf_events.csv and event_feature_map.csv files.
    '''

    events = pd.read_csv(filepath + 'events.csv')
    hf = pd.read_csv(filepath + 'hf_events.csv')
    feature_map = pd.read_csv(filepath + 'event_feature_map.csv')

    return events, hf, feature_map


def calculate_index_vid(events, hf):
    
    '''
    Steps:
        1. Create list of normal patients (hf_events.csv only contains information about heart failure patients).
        2. Split events into two groups based on whether the patient has heart failure or not.
        3. Calculate index vid for each patient.
    '''

    indx_vid = events[['pid', 'vid']].drop_duplicates(subset=['pid'], keep = 'last', ignore_index=True)
    hf_pid_df = hf.set_index('pid')
    hf_pids = list(hf.pid)
    for index, row in indx_vid.iterrows():
        if row.pid in hf_pids:
            # Use .loc: chained assignment (indx_vid.vid[index] = ...) may not update indx_vid
            indx_vid.loc[index, 'vid'] = hf_pid_df.vid[row.pid]
    indx_vid.rename(columns={"vid": "indx_vid"}, inplace=True)
    
    return indx_vid

### 2.2 Filter events

Remove the events that occur outside the observation window. That is, keep only the events in visits before the index vid.

In [9]:
def filter_events(events, indx_vid):
    '''
    Steps:
        1. Join indx_vid with events on pid.
        2. Filter events occurring in the observation window [:, index vid).
    '''
    events_indx_vid = pd.merge(events, indx_vid, how="left", on=["pid"])
    filtered_events = events_indx_vid[events_indx_vid.vid < events_indx_vid.indx_vid][['pid', 'event_id', 'value']].reset_index(drop=True)
    return filtered_events

### 2.3 Aggregate events

To create features suitable for machine learning, we will need to aggregate the events for each patient as follows:

- **count** occurrences for each event.

Each event type will become a feature, and we will directly use event_id as the feature name. For example, given the raw event sequence below for a patient,

```
33,DIAG_244,0,1
33,LAB_50971,0,1
33,LAB_50931,0,1
33,LAB_50931,0,1
33,DIAG_244,1,1
33,DIAG_427,1,1
33,DRUG_0,1,1
33,DRUG_3,1,1
33,DRUG_3,1,1
```

We can get feature value pairs *(event_id, value)* for this patient with ID *33* as
```
(DIAG_244, 2.0)
(LAB_50971, 1.0)
(LAB_50931, 2.0)
(DIAG_427, 1.0)
(DRUG_0, 1.0)
(DRUG_3, 2.0)
```

Next, replace each *event_id* with the *feature_id* provided in *event_feature_map.csv*.

```
(146, 2.0)
(1434, 1.0)
(1429, 2.0)
(304, 1.0)
(898, 1.0)
(1119, 2.0)
```

Lastly, it is important to normalize different features to the same scale. We will use the [min-max normalization](http://stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range) approach. (Note: we define $\min(x)$ to always be 0, so the scaling equation becomes $x / \max(x)$.)
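As a small worked example of counting followed by $x / \max(x)$ scaling, the sketch below uses a made-up table of already filtered and mapped events for two patients. The table `filtered_toy` and its values are illustrative only.

```python
import pandas as pd

# Toy events after filtering and mapping event_id to feature_id (illustrative values).
filtered_toy = pd.DataFrame({
    "pid":        [33, 33, 33, 34, 34, 34, 34, 34],
    "feature_id": [146, 146, 304, 146, 304, 304, 304, 304],
})

# Step 1: count occurrences of each feature for each patient.
counts = (
    filtered_toy.groupby(["pid", "feature_id"])
    .size()
    .rename("feature_value")
    .reset_index()
)

# Step 2: divide each count by the largest count of that feature across patients.
feature_max = counts.groupby("feature_id").feature_value.transform("max")
counts["feature_value"] = counts.feature_value / feature_max
print(counts)
```

Feature 146 occurs at most twice and feature 304 at most four times, so patient 33 ends up with values 1.0 and 0.25, and patient 34 with 0.5 and 1.0.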

In [11]:
def aggregate_events(filtered_events_df, hf_df, feature_map_df):
    
    '''
    Steps:
        1. Replace event_id's with index available in event_feature_map.csv.
        2. Aggregate events using count to calculate feature value.
        3. Normalize the values obtained above using min-max normalization(the min value will be 0 in all scenarios).
    '''
    filtered_events_df = pd.merge(filtered_events_df, feature_map_df, how="left", on=["event_id"])
    filtered_events_df = filtered_events_df.rename(columns={'idx':'feature_id'})
    aggregated_events = filtered_events_df.groupby(['pid', 'feature_id']).size().reset_index().rename(columns={0:'feature_value'})
    feature_max = aggregated_events.groupby(['feature_id'])['feature_value'].max().reset_index().rename(columns={'feature_value':'max_value'})
    aggregated_events = pd.merge(aggregated_events, feature_max, how = 'left', on = 'feature_id')
    aggregated_events.feature_value = aggregated_events.feature_value/aggregated_events.max_value
    aggregated_events.drop(columns = ['max_value'], inplace = True)
    
    return aggregated_events

### 2.4 Save in SVMLight format

If the dimensionality of a feature vector is large but the feature vector is sparse (i.e. it has only a few nonzero elements), sparse representation should be employed. Here we will use the provided data for each patient to construct a feature vector and represent the feature vector in SVMLight format.

```
<line> .=. <target> <feature>:<value> <feature>:<value>
<target> .=. 1 | 0
<feature> .=. <integer>
<value> .=. <float>
```

The target value and each of the feature/value pairs are separated by a space character. Feature/value pairs MUST be ordered by increasing feature number (the sorting is done in `save_svmlight()`). Features with value zero can be skipped. For example, the feature vectors in SVMLight format will look like: 

```
1 2:0.5 3:0.12 10:0.9 2000:0.3
0 4:1.0 78:0.6 1009:0.2
1 33:0.1 34:0.98 1000:0.8 3300:0.2
1 34:0.1 389:0.32
```

Here, 1 or 0 indicates whether the patient has heart failure or not (i.e. the label), and it is followed by a series of feature-value pairs **sorted** by the feature index (idx) value.
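The formatting rules above can be sketched with a small helper. `to_svmlight_line` is a hypothetical function written for this illustration (it is not the `utils.bag_to_svmlight` helper used later), and the six-decimal formatting is an assumption.

```python
def to_svmlight_line(label, features):
    """Hypothetical helper: format one patient as a single SVMLight line.

    `features` is a list of (feature_id, value) pairs in any order.
    """
    # Skip zero-valued features and sort the rest by increasing feature id.
    pairs = sorted((fid, val) for fid, val in features if val != 0)
    body = " ".join(f"{int(fid)}:{val:.6f}" for fid, val in pairs)
    return f"{label} {body}"


line = to_svmlight_line(1, [(1434, 1.0), (146, 0.5), (304, 0.0), (898, 0.25)])
print(line)  # prints: 1 146:0.500000 898:0.250000 1434:1.000000
```

The zero-valued feature 304 is dropped and the remaining pairs come out in increasing order of feature id, as the format requires.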

The *utils.py* script will be useful here. 

In [13]:
# %load lib/utils.py

In [14]:
import utils
import collections

def create_features(events_in, hf_in, feature_map_in):

    indx_vid = calculate_index_vid(events_in, hf_in)

    #Filter events in the observation window
    filtered_events = filter_events(events_in, indx_vid)

    #Aggregate the event values for each patient 
    aggregated_events = aggregate_events(filtered_events, hf_in, feature_map_in)

    pid_is_hf = list(hf_in.pid)
    pid_all = list(aggregated_events.pid.unique())
    
    patient_features, hf = {}, {}
    for pid in pid_all:
        patient_features[pid] = aggregated_events[aggregated_events.pid==pid].drop(columns=['pid']).to_records(index=False).tolist()
    for pid in pid_is_hf:
        hf[pid] = 1

    return patient_features, hf

def save_svmlight(patient_features, hf, op_file):

    hf_pids = hf.keys()
    # The context manager closes the file even if an error occurs while writing
    with open(op_file, 'wb') as deliverable:
        for pid in sorted(patient_features.keys()):
            label = 1 if pid in hf_pids else 0
            # Feature/value pairs must be ordered by increasing feature id
            features = sorted(patient_features[pid])
            feature_value = utils.bag_to_svmlight(features)
            deliverable.write(bytes(f"{label} {feature_value} \n", 'utf-8'))

Now we can put together the whole pipeline:

In [16]:
def main():
    events_in, hf_in, feature_map_in = read_csv(TRAIN_DATA_PATH)
    patient_features, hf = create_features(events_in, hf_in, feature_map_in)
    save_svmlight(patient_features, hf, 'features_svmlight.train')
    
    events_in, hf_in, feature_map_in = read_csv(VAL_DATA_PATH)
    patient_features, hf = create_features(events_in, hf_in, feature_map_in)
    save_svmlight(patient_features, hf, 'features_svmlight.val')
    
main()

## 3 Predictive Modeling

### 3.1 Model Creation

Now we have constructed feature vectors for patients to be used as training data in various predictive models (classifiers). We can use this training data (*features_svmlight.train*) in 3 predictive models. 

**Step - a. Implement Logistic Regression, SVM and Decision Tree. Skeleton code is provided in the following code cell.**

In [17]:
import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

import utils

RANDOM_STATE = 545510477


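# Each function fits on the training data and predicts on X_test when it is given,
# otherwise on the training data itself.
def logistic_regression_pred(X_train, Y_train, X_test=None):
    model = LogisticRegression(random_state=RANDOM_STATE).fit(X_train, Y_train)
    return model.predict(X_train if X_test is None else X_test)

def svm_pred(X_train, Y_train, X_test=None):
    model = LinearSVC(random_state=RANDOM_STATE).fit(X_train, Y_train)
    return model.predict(X_train if X_test is None else X_test)

def decisionTree_pred(X_train, Y_train, X_test=None):
    model = DecisionTreeClassifier(random_state=RANDOM_STATE, max_depth=5).fit(X_train, Y_train)
    return model.predict(X_train if X_test is None else X_test)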


def classification_metrics(Y_pred, Y_true):
    tn, fp, fn, tp = confusion_matrix(Y_true, Y_pred).ravel()
    acc = (tp + tn)/(tn + fp + fn + tp)
    precision = tp/(tp + fp)
    recall = tp/(tp + fn)
    f1score = 2*precision*recall/(precision+recall)
    
    return acc, precision, recall, f1score


def display_metrics(classifierName, Y_pred, Y_true):
    print("______________________________________________")
    print(("Classifier: "+classifierName))
    acc, precision, recall, f1score = classification_metrics(Y_pred,Y_true)
    print(("Accuracy: "+str(acc)))
    print(("Precision: "+str(precision)))
    print(("Recall: "+str(recall)))
    print(("F1-score: "+str(f1score)))
    print("______________________________________________")
    print("")

    
def main():
    X_train, Y_train = utils.get_data_from_svmlight("features_svmlight.train")

    display_metrics("Logistic Regression", logistic_regression_pred(X_train, Y_train), Y_train)
    display_metrics("SVM",svm_pred(X_train, Y_train),Y_train)
    display_metrics("Decision Tree", decisionTree_pred(X_train, Y_train), Y_train)

    
main()

______________________________________________
Classifier: Logistic Regression
Accuracy: 0.856338028169014
Precision: 0.8357933579335793
Recall: 0.937888198757764
F1-score: 0.8839024390243903
______________________________________________

______________________________________________
Classifier: SVM
Accuracy: 0.9070422535211268
Precision: 0.896484375
Recall: 0.9503105590062112
F1-score: 0.9226130653266331
______________________________________________

______________________________________________
Classifier: Decision Tree
Accuracy: 0.703420523138833
Precision: 0.6657355679702048
Recall: 0.9868875086266391
F1-score: 0.7951070336391437
______________________________________________



**Step - b. Evaluate your predictive models on a separate test dataset in *features_svmlight.val* (binary labels are provided in that svmlight file as the first field). Skeleton code is provided in the following code cell.**

In [19]:
def main():
    X_train, Y_train = utils.get_data_from_svmlight("features_svmlight.train")
    X_test, Y_test = utils.get_data_from_svmlight("features_svmlight.val")

    display_metrics("Logistic Regression", logistic_regression_pred(X_train, Y_train, X_test), Y_test)
    display_metrics("SVM", svm_pred(X_train, Y_train, X_test), Y_test)
    display_metrics("Decision Tree", decisionTree_pred(X_train, Y_train, X_test), Y_test)


main()

______________________________________________
Classifier: Logistic Regression
Accuracy: 0.6937086092715232
Precision: 0.7345360824742269
Recall: 0.776566757493188
F1-score: 0.7549668874172186
______________________________________________

______________________________________________
Classifier: SVM
Accuracy: 0.640728476821192
Precision: 0.7038043478260869
Recall: 0.7057220708446866
F1-score: 0.7047619047619047
______________________________________________

______________________________________________
Classifier: Decision Tree
Accuracy: 0.6821192052980133
Precision: 0.6611418047882136
Recall: 0.9782016348773842
F1-score: 0.789010989010989
______________________________________________



### 3.2 Model Validation

In order to fully utilize the available data and obtain more reliable results, we use cross-validation to evaluate and improve our predictive models. 

- K-fold: Divide all the data into $k$ groups of samples. Each time, $\frac{1}{k}$ of the samples will be used as test data and the remaining samples as training data.
- Randomized K-fold: In each iteration, randomly shuffle the whole dataset, then use a fixed percentage of the data for training and the rest for testing. 
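The two splitting schemes can be sketched with scikit-learn's `KFold` and `ShuffleSplit` on a toy array of 10 samples. This only shows which indices land in the test set; `X_toy` and the `random_state` value are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X_toy = np.arange(20).reshape(10, 2)  # 10 samples, 2 features

# K-fold: consecutive, non-overlapping test folds that together cover every sample.
kf = KFold(n_splits=5)
kfold_tests = [test.tolist() for _, test in kf.split(X_toy)]
print(kfold_tests)

# Randomized: each iteration reshuffles and holds out 20% of the samples,
# so a sample may appear in several test sets or in none.
ss = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
shuffle_tests = [sorted(test.tolist()) for _, test in ss.split(X_toy)]
print(shuffle_tests)
```

With K-fold every sample is tested exactly once, whereas the randomized scheme gives no such guarantee but lets the test percentage be chosen independently of the number of iterations.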

**Implement the two cross-validation strategies.**
- **K-fold:** Use the number of iterations k=5; 
- **Randomized K-fold**: Use a test data percentage of 20\% and k=5 iterations.

In [21]:
from sklearn.model_selection import KFold, ShuffleSplit
from numpy import mean

import utils

RANDOM_STATE = 545510477

def get_f1_kfold(X, Y, k=5):
    
    """
    First get the train indices and test indices for each iteration.
    Then train the classifier accordingly.
    Report the mean f1 score of all the folds.
    """

    n = len(Y)
    size = n//k
    f1scores = []
    for i in range(k):
        test_idx = [idx for idx in range(i*size,min((i+1)*size, n))]
        train_idx = [idx for idx in range(i*size)]
        train_idx += [idx for idx in range(min((i+1)*size, n),n)]
        X_test = X[test_idx,]
        Y_test = Y[test_idx]
        X_train = X[train_idx,]
        Y_train = Y[train_idx]                
        model = LinearSVC(random_state=RANDOM_STATE).fit(X_train, Y_train)
        Y_pred = model.predict(X_test)
        acc, precision, recall, f1score = classification_metrics(Y_pred,Y_test)
        f1scores.append(f1score)
    return np.mean(f1scores)

def get_f1_randomisedCV(X, Y, iterNo=5, test_percent=0.20):

    """
    First get the train indices and test indices for each iteration.
    Then train the classifier accordingly.
    Report the mean f1 score of all the iterations.
    """
    f1scores = []
    rs = ShuffleSplit(n_splits=iterNo, test_size=test_percent, random_state=RANDOM_STATE)
    rs.get_n_splits(X)
    for train_index, test_index in rs.split(X):            
        X_test = X[test_index,]
        Y_test = Y[test_index]
        X_train = X[train_index,]
        Y_train = Y[train_index]                
        model = LinearSVC(random_state=RANDOM_STATE).fit(X_train, Y_train)
        Y_pred = model.predict(X_test)
        acc, precision, recall, f1score = classification_metrics(Y_pred,Y_test)
        f1scores.append(f1score)
    return np.mean(f1scores)    

    
def main():
    X,Y = utils.get_data_from_svmlight("features_svmlight.train")
    print("Classifier: SVM")
    f1_k = get_f1_kfold(X,Y)
    print(("Average F1 Score in KFold CV: "+str(f1_k)))
    f1_r = get_f1_randomisedCV(X,Y)
    print(("Average F1 Score in Randomised CV: "+str(f1_r)))


main()

Classifier: SVM
Average F1 Score in KFold CV: 0.7258461959533061
Average F1 Score in Randomised CV: 0.7195678940019832
