## Initial steps
1. File -> New -> Terminal
2. In the terminal, type the following commands:
        mkdir med264/
        cd med264/
        wget https://archive.physionet.org/users/shared/challenge-2019/training_setB.zip
        unzip training_setB.zip

## Install & Load Dependencies

In [1]:
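dependencies  <- c("tidyverse", "tidymodels")

for (package in dependencies) { 
    # Install the package only if it is not already available
    if (!requireNamespace(package, quietly = TRUE)) {
        install.packages(package)
    }
    library(package, character.only = TRUE)
}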



Installing package into ‘/home/aaron/R/x86_64-pc-linux-gnu-library/3.6’
(as ‘lib’ is unspecified)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.3     ✔ purrr   0.3.4
✔ tibble  3.1.1     ✔ dplyr   1.0.5
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Installing package into ‘/home/aaron/R/x86_64-pc-linux-gnu-library/3.6’
(as ‘lib’ is unspecified)

── Attaching packages ────────────────────────────────────── tidymodels 0.1.3 ──

✔ broom        0.7.6      ✔ rsam

## Data processing
1. List the PSV files in the input directory
2. Read the files and combine them into a single tibble
3. Split the records into training and testing sets, and separate the labels from the features

In [3]:
# Name of the input directory
input_directory = 'training_setB'

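# Read all PSV files in the input directory and combine them into one tibble
# (the pattern is a regular expression, so the dot is escaped and anchored)
all_data  <- list.files(path = input_directory, pattern = "\\.psv$", full.names = TRUE) %>% 
    map(., read_delim, delim = "|", col_types = cols()) %>% 
    bind_rows() %>% 
    mutate(ID = row_number())  # one ID per row (hourly record), not per patient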

# Split the data into training and testing sets
# 80% of rows -> Training
# 20% of rows -> Testing
set.seed(264)  # arbitrary seed so that the random split is reproducible
train_all  <- slice_sample(all_data, prop = 0.8)

test_all  <- all_data %>%
    anti_join(train_all, by="ID")

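# Classification models in parsnip expect the outcome to be a factor
train_labels  <- train_all %>% transmute(SepsisLabel = factor(SepsisLabel, levels = c(0, 1)))
test_labels  <- test_all %>% transmute(SepsisLabel = factor(SepsisLabel, levels = c(0, 1)))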

train_data <- select(train_all, -c("SepsisLabel", "ID"))
test_data <- select(test_all, -c("SepsisLabel", "ID"))
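
Because `ID` is a row number, the split above works on individual rows, so rows from the same patient file can land in both the training and testing sets. If that is not desired, one option is to split by patient instead. The following is a minimal sketch on toy data; the `patient` column is hypothetical (with the real files it could be created by naming the list of files and calling `bind_rows(.id = "patient")`), and the seed is arbitrary.

```r
library(dplyr)

set.seed(264)  # arbitrary seed so the split can be reproduced

# Toy stand-in for all_data: 5 hypothetical patients with 4 hourly rows each
toy_data <- tibble(
  patient     = rep(paste0("p", 1:5), each = 4),
  HR          = rnorm(20, mean = 80, sd = 10),
  SepsisLabel = rbinom(20, size = 1, prob = 0.2)
)

# Sample 80% of the patients (not 80% of the rows) for training
patients       <- unique(toy_data$patient)
train_patients <- sample(patients, size = floor(0.8 * length(patients)))

toy_train <- filter(toy_data, patient %in% train_patients)
toy_test  <- filter(toy_data, !patient %in% train_patients)

# Number of patients that appear on both sides of the split
length(intersect(toy_train$patient, toy_test$patient))
```

With five toy patients, four go to training and one to testing, and no patient is shared between the two sets.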

## Standardization of data
1. Compute the mean and standard deviation on the training set ONLY
2. Use these statistics to standardize both the training and testing sets

In [4]:
x_mean  <- colMeans(train_data, na.rm = TRUE)
x_sd  <- apply(train_data, 2, sd, na.rm = TRUE)

# Standardize every column with the training mean and standard deviation
# Then replace missing entries (NA/NaN) with 0, which is the training mean after standardization
train_data <- sweep(train_data, 2, x_mean, "-") %>%
    sweep(., 2, x_sd, "/")  %>%
    `[<-`(., is.na(.), value = 0) %>% 
    cbind(train_labels)

test_data <- sweep(test_data, 2, x_mean, "-") %>%
    sweep(., 2, x_sd, "/") %>% 
    `[<-`(., is.na(.), value = 0) %>% 
    cbind(test_labels)
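
To see what this step does, here is a small worked example in base R on toy data. The data frames and the helper `standardize()` are made up for illustration; the logic mirrors the cell above: statistics come from the training set only, and missing values become 0 after standardization.

```r
# Toy training and testing features, each with one missing heart rate
toy_train <- data.frame(HR = c(60, 80, 100, NA), Temp = c(36, 37, 38, 37))
toy_test  <- data.frame(HR = c(90, NA), Temp = c(39, 37))

# Statistics are computed on the training set only
mu    <- colMeans(toy_train, na.rm = TRUE)
sigma <- apply(toy_train, 2, sd, na.rm = TRUE)

# Hypothetical helper: center, scale, then set missing entries to 0
standardize <- function(df, mu, sigma) {
  z <- sweep(df, 2, mu, "-")
  z <- sweep(z, 2, sigma, "/")
  z[is.na(z)] <- 0  # 0 is the training mean on the standardized scale
  z
}

toy_train_z <- standardize(toy_train, mu, sigma)
toy_test_z  <- standardize(toy_test, mu, sigma)

toy_train_z$HR  # -1, 0, 1, 0 (training mean 80, SD 20)
toy_test_z$HR   # 0.5, 0 (scaled with the training statistics)
```

The test value of 90 becomes 0.5 because it is scaled with the training mean and standard deviation, not with statistics of the test set.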
    

## Model training
1. Fit a simple logistic regression model to the training data

### Exercise 1
1. Use a different penalty for Logistic Regression (LR), plot the ROC curve and report the AUC
2. Use a Support Vector Machine classifier instead of LR, plot the ROC curve and report the AUC
3. Use a Random Forest classifier instead of LR, plot the ROC curve and report the AUC

        https://cran.r-project.org/web/packages/caret/caret.pdf
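
The link above documents caret, while this notebook uses parsnip (part of tidymodels). As a possible starting point for the exercise, the sketch below only declares model specifications in parsnip; nothing is fitted. The engines (`kernlab`, `ranger`, `glmnet`) and all hyperparameter values are assumptions chosen for illustration, and each engine needs its package installed before calling `fit()`.

```r
library(tidymodels)  # assumes the tidymodels meta-package is installed

# Support vector machine with a radial basis function kernel
svm_spec <- svm_rbf(mode = "classification", cost = 1) %>%
  set_engine("kernlab")

# Random forest with 500 trees
rf_spec <- rand_forest(mode = "classification", trees = 500) %>%
  set_engine("ranger")

# Logistic regression with a lasso (L1) penalty instead of ridge
lasso_spec <- logistic_reg(penalty = 0.01, mixture = 1) %>%
  set_engine("glmnet")

svm_spec
```

Each specification can then be passed to `fit(SepsisLabel ~ ., data = train_data)` in the same way as the logistic regression model.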

In [22]:
model  <- logistic_reg(penalty = 1, mixture = 0) %>%  # Mixture=0 corresponds to Ridge Regression (L2) 
    set_engine("stan", iter = 150, algorithm = "optimizing") %>% # Optimizing corresponds to LBFGS 
    set_mode("classification") %>% 
    fit(SepsisLabel~., data = train_data)


ERROR: Error: This engine requires some package installs: 'rstanarm'


In [None]:
install.packages("rstanarm")
library(rstanarm)

Installing package into ‘/home/aaron/R/x86_64-pc-linux-gnu-library/3.6’
(as ‘lib’ is unspecified)

also installing the dependencies ‘V8’, ‘rstan’, ‘shinystan’
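
The `stan` engine fits a Bayesian model through rstanarm and, according to the parsnip documentation, has no tuning parameters, so `penalty` and `mixture` are likely not used by it. If a ridge-penalized fit is the goal, the `glmnet` engine is one alternative that does use both arguments. The sketch below assumes the glmnet package is installed and uses toy data in place of `train_data`; the column names `x1` and `x2` are hypothetical.

```r
library(tidymodels)  # assumes tidymodels and glmnet are installed

set.seed(264)

# Toy stand-in for the standardized training data
toy <- tibble(
  x1 = rnorm(200),
  x2 = rnorm(200),
  SepsisLabel = factor(rbinom(200, 1, plogis(1.5 * x1)), levels = c(0, 1))
)

# Same penalty and mixture as above; mixture = 0 is a pure ridge (L2) penalty
ridge_fit <- logistic_reg(penalty = 1, mixture = 0) %>%
  set_engine("glmnet") %>%
  set_mode("classification") %>%
  fit(SepsisLabel ~ ., data = toy)

# Class probabilities for each row: columns .pred_0 and .pred_1
ridge_probs <- predict(ridge_fit, new_data = toy, type = "prob")
head(ridge_probs)
```

Note that glmnet's penalty scale is not the same as the regularization strength in other libraries, so the value 1 is only a placeholder here.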




## Model prediction
1. Use the trained model to get output probability scores
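
A minimal sketch of this step, assuming a fitted parsnip model: `predict()` with `type = "prob"` returns one probability column per class. The toy data, the helper `make_toy()`, and the unpenalized `glm` engine are stand-ins so that the example runs on its own; with the notebook's objects the call would be applied to the fitted model and `test_data`.

```r
library(tidymodels)  # assumes the tidymodels meta-package is installed

set.seed(264)

# Hypothetical helper that simulates features and a binary label
make_toy <- function(n) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  tibble(
    x1 = x1,
    x2 = x2,
    SepsisLabel = factor(rbinom(n, 1, plogis(1.5 * x1)), levels = c(0, 1))
  )
}
toy_train <- make_toy(200)
toy_test  <- make_toy(50)

# Plain logistic regression through the "glm" engine
toy_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(SepsisLabel ~ ., data = toy_train)

# Probability scores for the test rows, kept next to the true labels
test_scores <- predict(toy_fit, new_data = toy_test, type = "prob") %>%
  bind_cols(select(toy_test, SepsisLabel))

head(test_scores)
```

The column `.pred_1` holds the predicted probability of the positive class and is the score used for the ROC analysis.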

## Performance Metrics
1. Compute the area under the ROC curve (AUC)
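
One way to compute the AUC in tidymodels is `roc_auc()` from yardstick. The sketch below uses six hand-made scores so the result can be checked by counting; the tibble `toy_scores` is hypothetical and stands in for a table of true labels and predicted probabilities.

```r
library(tibble)
library(yardstick)

# Six toy predictions: true label and predicted probability of class 1
toy_scores <- tibble(
  SepsisLabel = factor(c(0, 0, 1, 0, 1, 1), levels = c(0, 1)),
  .pred_1     = c(0.1, 0.2, 0.3, 0.4, 0.8, 0.9)
)

# "1" is the second factor level, so it must be declared as the event
auc <- roc_auc(toy_scores, truth = SepsisLabel, .pred_1, event_level = "second")

auc$.estimate  # 8 of the 9 positive/negative pairs are ranked correctly: 8/9
```

Without `event_level = "second"`, yardstick treats the first factor level ("0") as the event, which would report 1/9 for these scores instead.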

## Plot ROC curves
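
A minimal sketch of the plot, assuming yardstick and ggplot2 are installed: `roc_curve()` computes sensitivity and specificity at each threshold, and `autoplot()` draws the curve. The tibble `toy_scores` is a hypothetical stand-in for the true labels and predicted probabilities of the test set.

```r
library(tibble)
library(ggplot2)
library(yardstick)

# Six toy predictions: true label and predicted probability of class 1
toy_scores <- tibble(
  SepsisLabel = factor(c(0, 0, 1, 0, 1, 1), levels = c(0, 1)),
  .pred_1     = c(0.1, 0.2, 0.3, 0.4, 0.8, 0.9)
)

# One row per threshold, with the matching sensitivity and specificity
roc_points <- roc_curve(toy_scores, truth = SepsisLabel, .pred_1,
                        event_level = "second")

# ROC curve as a ggplot object, which can be restyled like any other plot
roc_plot <- autoplot(roc_points)
roc_plot
```

To compare several models in one figure, the `roc_curve()` outputs can be combined with `bind_rows(.id = "model")` and plotted with `ggplot()` directly.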

## Exercise 2 - Interpretability

Methods such as LIME [1] and SHAP [2] reveal which features contribute most to an individual prediction, that is, at a local level.

The Python library SHAP (https://shap.readthedocs.io/en/latest/) uses Shapley values to determine the top contributing features.

Use the Python library SHAP to show the force plot, dependence plot, and summary plot for each of the models developed above.
Some examples using SHAP: https://shap.readthedocs.io/en/latest/examples.html
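
To build intuition before using the library, the sketch below computes attributions by hand in base R; it does not use the SHAP package. For a linear model, and under the assumption that features are independent, the Shapley value of feature j for one row reduces to coef_j * (x_j - mean(x_j)) on the log-odds scale. The toy data and variable names are hypothetical.

```r
set.seed(264)

# Toy data: two features and a binary outcome
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$y <- rbinom(100, 1, plogis(1.2 * toy$x1 - 0.5 * toy$x2))

# Plain logistic regression with base R
fit <- glm(y ~ x1 + x2, data = toy, family = binomial())

# Per-row, per-feature contribution on the log-odds scale:
# coefficient times the feature's deviation from its mean
beta <- coef(fit)[c("x1", "x2")]
X    <- as.matrix(toy[, c("x1", "x2")])
phi  <- sweep(sweep(X, 2, colMeans(X), "-"), 2, beta, "*")

# Contributions add up to the row's log-odds minus the average log-odds
eta <- predict(fit, type = "link")
max(abs(rowSums(phi) - (eta - mean(eta))))

head(phi)
```

The additivity check at the end is the property that force plots visualize: each prediction is the average prediction plus the sum of its feature contributions.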

[1] https://christophm.github.io/interpretable-ml-book/lime.html

[2] https://towardsdatascience.com/explain-your-model-with-the-shap-values-bc36aac4de3d