# LSTM Time Series Deep Learning and Adversarial Attacks

## 1. Overview

### 1.1 Summary of Prior Work

This project builds on results originally published in:

[Sun, M., Tang, F., Yi, J., Wang, F. and Zhou, J., 2018, July. Identify susceptible locations in medical records via adversarial attacks on deep predictive models. In *Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining* (pp. 793-801)](https://dl.acm.org/doi/10.1145/3219819.3219909) 

The original paper trained a Long Short-Term Memory (LSTM) time series model using patient vital-sign and lab data from the Medical Information Mart for Intensive Care (MIMIC-III) database to predict in-hospital mortality. MIMIC-III consists of de-identified data from patients admitted to critical care units at Beth Israel Deaconess Medical Center between 2001 and 2012. The full dataset contains data from 46,520 patients and 58,976 hospital admissions. After filtering to remove patients with age < 18 years, and hospital stays < 24 hours, 

The input features consisted of data from 13 lab measurements (Blood Urea Nitrogen, HCO<sub>3</sub>, Na, PaCO<sub>2</sub>, Glucose, Creatinine, Albumin, Mg, K, Ca, Platelets, and Lactate), and 6 vital signs (Heart Rate, Respiration Rate, Systolic Blood Pressure, Diastolic Blood Pressure, Oxygen Saturation, and Temperature). The prediction target was a binary variable representing in-hospital mortality. Input data were collected over time spans ranging from 6 to 48 hours. Each patient was represented by a 48 (hours) x 19 (measurements) input feature matrix of floating point values, and a binary output label. Missing data were imputed by interpolating over time where possible. If no data were available for a particular patient and measurement, the global mean of that measurement (across all patients and times) was used.
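The imputation rule described above can be sketched with pandas as follows. This is a minimal illustration and not the original preprocessing code: the function name `impute_patient`, the toy column names, and the choice of linear interpolation with nearest-value filling at the edges are all assumptions.

```python
import numpy as np
import pandas as pd


def impute_patient(features: pd.DataFrame, global_means: pd.Series) -> pd.DataFrame:
    """Fill gaps in one patient's (hours x measurements) table.

    Assumptions: gaps are filled by linear interpolation along the time
    index, values before the first or after the last observation are copied
    from the nearest observation, and columns with no observations at all
    fall back to the global mean of that measurement.
    """
    filled = features.interpolate(method="linear", limit_direction="both")
    return filled.fillna(global_means)


# Toy example: 5 hourly rows, 2 measurements (hypothetical column names).
patient = pd.DataFrame(
    {
        "heart_rate": [80.0, np.nan, 90.0, np.nan, np.nan],
        "lactate": [np.nan] * 5,  # never measured for this patient
    }
)
global_means = pd.Series({"heart_rate": 85.0, "lactate": 2.0})
imputed = impute_patient(patient, global_means)
print(imputed)
```

In this toy case the `lactate` column has no observations, so every row receives the global mean, while `heart_rate` is filled from the patient's own measurements.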

Sun et al. implemented an LSTM model using TensorFlow and, after hyperparameter tuning via cross-validation, selected a final model architecture consisting of:
* A single-layer, bidirectional LSTM with an input size of 19, 128 hidden states per direction, and Tanh activation of the outputs.
* A dropout layer with dropout probability = 50%.
* A fully-connected layer with an input size of 256 (for the 2 x 128 LSTM outputs), 32 output nodes, and ReLU activation.
* A final 2-node layer with softmax activation.
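For orientation, the layer list above can be written as a short PyTorch module. This is only a sketch under stated assumptions: Sun et al. used TensorFlow, the class name `BiLSTMMortalityNet` is hypothetical, and feeding only the final time step of the LSTM output to the later layers is an assumption rather than something the list specifies.

```python
import torch
import torch.nn as nn


class BiLSTMMortalityNet(nn.Module):
    """Hypothetical PyTorch rendering of the architecture listed above."""

    def __init__(self, input_size: int = 19, hidden_size: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=1,
            bidirectional=True,
            batch_first=True,
        )
        self.dropout = nn.Dropout(p=0.5)
        self.fc_hidden = nn.Linear(2 * hidden_size, 32)
        self.fc_out = nn.Linear(32, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, time steps, 19 measurements).
        lstm_out, _ = self.lstm(x)
        # Assumption: only the output at the final time step is used.
        last_step = torch.tanh(lstm_out[:, -1, :])
        hidden = torch.relu(self.fc_hidden(self.dropout(last_step)))
        return torch.softmax(self.fc_out(hidden), dim=-1)


model = BiLSTMMortalityNet()
model.eval()  # disable dropout for a deterministic forward pass
with torch.no_grad():
    probs = model(torch.randn(4, 48, 19))
print(probs.shape)
```

Each row of `probs` holds the two class probabilities for one input sample.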

This model was trained using a binary cross-entropy loss and an Adam optimizer with learning rate = 1e-4, momentum decay rate = 0.999, and moving average decay rate = 0.5. An adversarial attack algorithm was then used to identify small perturbations which, when applied to real, correctly classified input features, caused the trained model to misclassify the perturbed input. The attack algorithm used L1 regularization to favor adversarial examples with sparse perturbations, which simulate the structure of data entry errors in real medical data.
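To make the role of the L1 term concrete, the sketch below evaluates one common form of a sparse attack objective: a margin loss that stays positive while the model still prefers the original class, plus an L1 penalty on the perturbation. It is an illustration only. The exact objective used by Sun et al. is not reproduced here, and the names `attack_objective`, `l1_weight`, and `kappa` are hypothetical.

```python
import numpy as np


def attack_objective(logits, orig_label, perturbation, l1_weight, kappa=0.0):
    """Margin loss that rewards misclassification, plus an L1 sparsity penalty.

    Assumptions: a binary classifier with two logits; `l1_weight` is the
    regularization strength and `kappa` is a confidence margin.
    """
    target_label = 1 - orig_label
    # Positive while the model still prefers the original class.
    margin = max(logits[orig_label] - logits[target_label], -kappa)
    return margin + l1_weight * np.abs(perturbation).sum()


perturbation = np.zeros((48, 19))
perturbation[10, 3] = 0.2  # one altered entry, similar to a single data entry error
loss = attack_objective(
    logits=np.array([1.5, 0.5]),
    orig_label=0,
    perturbation=perturbation,
    l1_weight=0.1,
)
print(loss)
```

Minimizing an objective of this shape pushes the model toward the wrong class while the L1 term drives most entries of the perturbation to exactly zero.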

### 1.2 Focus of Current Project

The current project follows the general approach of Sun et al. and adds the following modifications and extensions:

* A streamlined `preprocess` sub-package is used to convert database SQL query outputs to the tensor form used for model input. This package reduces RAM consumption by saving large intermediate data structures to disk (primarily pandas DataFrames saved as `.pickle` files) instead of returning them to global scope. It also performs the query-output-to-tensor conversion 90% faster than the original code, primarily through the use of vectorized operations.
* A GPU-compatible adversarial attack algorithm that allows attacks to run on batches of samples.
* Improved LSTM predictive performance achieved through hyperparameter tuning with the Optuna Tree-structured Parzen Estimator (TPE) algorithm.
* Hyperparameter tuning of the attack algorithm (also using TPE) to find adversarial perturbations with higher sparsity and lower magnitude.
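As a toy illustration of the kind of vectorization mentioned in the first bullet, the sketch below fills an (hours x measurements) matrix from long-format records, first with a Python loop and then with a single NumPy indexed assignment. The arrays and function names are hypothetical and are not taken from the project's `preprocess` code.

```python
import numpy as np

NUM_HOURS, NUM_FEATURES = 48, 19

# Hypothetical long-format records: one entry per (hour, feature, value).
hours = np.array([0, 0, 5, 47])
feature_ids = np.array([2, 7, 2, 18])
values = np.array([1.0, 2.5, 3.0, 4.5])


def build_matrix_loop(hours, feature_ids, values):
    """Reference version: one Python-level assignment per record."""
    matrix = np.full((NUM_HOURS, NUM_FEATURES), np.nan)
    for hour, feature_id, value in zip(hours, feature_ids, values):
        matrix[hour, feature_id] = value
    return matrix


def build_matrix_vectorized(hours, feature_ids, values):
    """Same result with a single fancy-indexed assignment."""
    matrix = np.full((NUM_HOURS, NUM_FEATURES), np.nan)
    matrix[hours, feature_ids] = values
    return matrix


loop_result = build_matrix_loop(hours, feature_ids, values)
vectorized_result = build_matrix_vectorized(hours, feature_ids, values)
print(np.array_equal(loop_result, vectorized_result, equal_nan=True))
```

The vectorized version moves the per-record work out of the Python interpreter, which is typically where most of the time goes when the number of records is large.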

### 1.3 Structure of Current Project

The project is stru




### 1.4 Project Setup

Before running this notebook, review the project README at https://github.com/duanegoodner/lstm_adversarial_attack, and complete all steps in the "How to run this project" section.

## 2. Imports

### 2.1 Standard Library Packages

In [1]:
import pprint
import sys

### 2.2 External Packages

In [2]:
import numpy as np
import torch

### 2.3 Internal Project Modules and Sub-packages
To help gain a sense of project structure, we import internal packages and modules as needed (i.e., immediately before the code cells where they are first used). For now, we import the project `src` path defined in `lstm_adversarial_attack/notebooks/src_paths` and add it to `sys.path` so that project code can be imported into this notebook. We also import the project config modules.

In [3]:
import src_paths
sys.path.append(str(src_paths.lstm_adversarial_attack_pkg))
import lstm_adversarial_attack.config_paths as cfg_paths
import lstm_adversarial_attack.config_settings as cfg_set
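Because `sys.path.append` adds an entry on every call, re-running the cell above adds the same path again. A small guard avoids that. The helper below is a sketch; `add_to_sys_path` and the directory layout are hypothetical and are not part of the project.

```python
import sys
from pathlib import Path


def add_to_sys_path(package_parent: Path) -> None:
    """Append a directory to sys.path only if it is not already present."""
    path_str = str(package_parent.resolve())
    if path_str not in sys.path:
        sys.path.append(path_str)


# Hypothetical layout: the notebook sits in <project>/notebooks and the
# importable package sits in <project>/src.
hypothetical_src_dir = Path.cwd().parent / "src"
add_to_sys_path(hypothetical_src_dir)
add_to_sys_path(hypothetical_src_dir)  # second call is a no-op
```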

## 3. Database Queries
We need to run four queries on the PostgreSQL database. The paths to files containing the queries are stored in a list as `DB_QUERIES` in the project `config_paths` file:

In [4]:
pprint.pprint(cfg_paths.DB_QUERIES)

[PosixPath('/home/devspace/project/src/mimiciii_queries/icustay_detail.sql'),
 PosixPath('/home/devspace/project/src/mimiciii_queries/pivoted_bg.sql'),
 PosixPath('/home/devspace/project/src/mimiciii_queries/pivoted_lab.sql'),
 PosixPath('/home/devspace/project/src/mimiciii_queries/pivoted_vital.sql')]


To connect to the database and execute the queries, we instantiate a `MimiciiiDatabaseAccess` object from module `mimiciii_database` of project sub-package `query_db` and use its `.connect()`, `.run_sql_queries()`, and `.close_connection()` methods.

In [5]:
import lstm_adversarial_attack.query_db.mimiciii_database as mdb

db_access = mdb.MimiciiiDatabaseAccess(
    dotenv_path=cfg_paths.DB_DOTENV_PATH, output_dir=cfg_paths.DB_OUTPUT_DIR
)
db_access.connect()
db_query_results = db_access.run_sql_queries(
    sql_query_paths=cfg_paths.DB_QUERIES
)
db_access.close_connection()

Query 1 of 4
Executing: /home/devspace/project/src/mimiciii_queries/icustay_detail.sql
Done. Query time = 0.58 seconds
Writing result to csv: /home/devspace/project/data/mimiciii_query_results/icustay_detail.csv
Done. csv write time = 0.49 seconds

Query 2 of 4
Executing: /home/devspace/project/src/mimiciii_queries/pivoted_bg.sql
Done. Query time = 17.22 seconds
Writing result to csv: /home/devspace/project/data/mimiciii_query_results/pivoted_bg.csv
Done. csv write time = 3.58 seconds

Query 3 of 4
Executing: /home/devspace/project/src/mimiciii_queries/pivoted_lab.sql
Done. Query time = 25.36 seconds
Writing result to csv: /home/devspace/project/data/mimiciii_query_results/pivoted_lab.csv
Done. csv write time = 5.74 seconds

Query 4 of 4
Executing: /home/devspace/project/src/mimiciii_queries/pivoted_vital.sql
Done. Query time = 64.41 seconds
Writing result to csv: /home/devspace/project/data/mimiciii_query_results/pivoted_vital.csv
Done. csv write time = 25.88 seconds



The result of each `.sql` query is saved to a `.csv` file. The path to each of these files is shown in the cell output above. The output directory is defined by the variable `DB_OUTPUT_DIR` in the project `config_paths` file.
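The general pattern of reading a query from a `.sql` file, executing it, and writing the result to a `.csv` file named after the query can be sketched with the standard library alone. The example below uses an in-memory SQLite database as a stand-in for the PostgreSQL MIMIC-III database, and `run_query_to_csv` is a hypothetical helper, not a method of `MimiciiiDatabaseAccess`.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def run_query_to_csv(connection, sql_path: Path, output_dir: Path) -> Path:
    """Execute the SQL in `sql_path` and write the result to <stem>.csv."""
    cursor = connection.execute(sql_path.read_text())
    header = [column[0] for column in cursor.description]
    csv_path = output_dir / f"{sql_path.stem}.csv"
    with csv_path.open("w", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(header)
        writer.writerows(cursor)
    return csv_path


# Stand-in setup: an in-memory SQLite table instead of the MIMIC-III database.
work_dir = Path(tempfile.mkdtemp())
query_file = work_dir / "toy_query.sql"
query_file.write_text("SELECT id, age FROM stays WHERE age >= 18 ORDER BY id")

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE stays (id INTEGER, age INTEGER)")
connection.executemany(
    "INSERT INTO stays VALUES (?, ?)", [(1, 17), (2, 45), (3, 70)]
)
try:
    result_path = run_query_to_csv(connection, query_file, work_dir)
finally:
    connection.close()  # always release the connection, even on failure
print(result_path.read_text())
```

Wrapping the query in `try` / `finally` guarantees that the connection is closed even if a query fails partway through.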

## 4. Preprocessor

### 4.1 Instantiate a Preprocessor object
We import the `preprocessor` module from internal sub-package `preprocess` and instantiate a `Preprocessor` object.

In [9]:
import lstm_adversarial_attack.preprocess.preprocessor as pre
preprocessor = pre.Preprocessor()

We can get a general idea of how the `lstm_adversarial_attack.preprocess` sub-package works by looking at the `Preprocessor` object's `.preprocess_modules` data member.

In [8]:
pprint.pprint([item.__class__ for item in preprocessor.preprocess_modules])

[<class 'lstm_adversarial_attack.preprocess.prefilter.Prefilter'>,
 <class 'lstm_adversarial_attack.preprocess.icustay_measurement_combiner.ICUStayMeasurementCombiner'>,
 <class 'lstm_adversarial_attack.preprocess.sample_list_builder.FullAdmissionListBuilder'>,
 <class 'lstm_adversarial_attack.preprocess.feature_builder.FeatureBuilder'>,
 <class 'lstm_adversarial_attack.preprocess.feature_finalizer.FeatureFinalizer'>]


* `Prefilter` reads the database query outputs into pandas DataFrames, removes all data related to patients younger than 18 years of age, ensures consistent column naming formats, and takes care of datatype details.
* ICUStayMeasurementCombiner 
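The kind of work attributed to the prefilter step above (age filtering, consistent column names, datatype fixes) might look like the following pandas sketch. The function `prefilter_icustay` and the toy column names are hypothetical and do not come from the project's `Prefilter` class.

```python
import pandas as pd


def prefilter_icustay(icustay: pd.DataFrame, min_age: int = 18) -> pd.DataFrame:
    """Normalize column names, fix dtypes, and drop patients under `min_age`."""
    # Consistent naming: strip whitespace and lowercase every column name.
    cleaned = icustay.rename(columns=lambda name: name.strip().lower())
    # Datatype details: parse timestamp strings into datetime values.
    cleaned["admittime"] = pd.to_datetime(cleaned["admittime"])
    # Keep only adult patients.
    return cleaned[cleaned["age"] >= min_age].reset_index(drop=True)


raw = pd.DataFrame(
    {
        "ICUSTAY_ID": [1, 2, 3],
        " Age": [16, 18, 70],
        "AdmitTime": ["2101-01-01 08:00", "2101-02-03 12:30", "2101-03-05 23:15"],
    }
)
filtered = prefilter_icustay(raw)
print(filtered)
```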

In [7]:
preprocessed_resources = preprocessor.preprocess()


Running preprocess module 1 of 5: Prefilter
Incoming resources:
/home/devspace/project/data/mimiciii_query_results/icustay_detail.csv
/home/devspace/project/data/mimiciii_query_results/pivoted_bg.csv
/home/devspace/project/data/mimiciii_query_results/pivoted_vital.csv
/home/devspace/project/data/mimiciii_query_results/pivoted_lab.csv
Done with Prefilter. Results saved to:
path: /home/devspace/project/data/preprocess_checkpoints/1_prefilter/icustay.pickle, data_type: DataFrame
path: /home/devspace/project/data/preprocess_checkpoints/1_prefilter/bg.pickle, data_type: DataFrame
path: /home/devspace/project/data/preprocess_checkpoints/1_prefilter/vital.pickle, data_type: DataFrame
path: /home/devspace/project/data/preprocess_checkpoints/1_prefilter/lab.pickle, data_type: DataFrame

Running preprocess module 2 of 5: ICU Stay Data + Measurement Data Combiner
Incoming resources:
/home/devspace/project/data/preprocess_checkpoints/1_prefilter/icustay.pickle
/home/devspace/project/data/preproce

## 5. LSTM Hyperparameter Tuning

### 5.1 Model Architecture and Tuning Parameters

In [None]:
import lstm_adversarial_attack.x19_mort_general_dataset as xmd
dataset = xmd.X19MGeneralDataset.from_feature_finalizer_output()

In [None]:
print(f"Number of samples in dataset = {len(dataset)}")
print(f"Type returned by dataset.__getitem__ = {type(dataset[0])}")
print(
    f"Length of each tuple returned by dataset.__getitem__ = {len(dataset[0])}"
)
print(
    "\nObject type, dimensionality, and datatype of each element in a tuple"
    " returned by dataset.__getitem__:"
)
print(tuple([(type(item), item.dim(), item.dtype) for item in dataset[0]]))
print(f"The 'input size' (# columns) of each feature matrix is "
     f"{np.unique([item.shape[1] for item in dataset[:][0]]).item()}")
print(f"The various sequence lengths (# rows) among the input feature matrices are\n"
     f"{np.unique([item.shape[0] for item in dataset[:][0]])}")



# Table of sequence lengths (column 0) and number of samples with that length (column 1)
unique_sequence_lengths, sequence_length_counts = np.unique(
    [item.shape[0] for item in dataset[:][0]], return_counts=True
)
print(np.column_stack((unique_sequence_lengths, sequence_length_counts)))

# Table of class labels (column 0) and number of samples with that label (column 1)
unique_labels, label_counts = np.unique([dataset[:][1]], return_counts=True)
print(np.column_stack((unique_labels, label_counts)))
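Because the feature matrices have different numbers of rows, they cannot be stacked directly into a single batch tensor. One standard PyTorch approach, shown here as a sketch that is not taken from the project code, is to zero-pad the sequences to a common length and then pack them so that the true lengths are preserved. The three random tensors are hypothetical stand-ins for dataset samples.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_sequence

# Three hypothetical feature matrices with different numbers of time steps.
sequences = [torch.randn(48, 19), torch.randn(30, 19), torch.randn(6, 19)]
lengths = torch.tensor([seq.shape[0] for seq in sequences])

# Zero-pad to the longest sequence: result is (batch, max_len, 19).
padded = pad_sequence(sequences, batch_first=True)

# Packing records the true lengths so an LSTM skips the padded steps.
packed = pack_padded_sequence(
    padded, lengths, batch_first=True, enforce_sorted=False
)
print(padded.shape, packed.data.shape)
```

Passing `enforce_sorted=False` lets the batch keep its original sample order instead of requiring sequences sorted by decreasing length.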


In [None]:
len(dataset[0])

In [None]:
print(type(dataset[0][0]))

In [None]:
import torch
import lstm_adversarial_attack.tune_train.tuner_driver as td

if torch.cuda.is_available():
    cur_device = torch.device("cuda:0")
else:
    cur_device = torch.device("cpu")

print(f"cur_device is {cur_device}")

In [None]:
tuner_driver = td.TunerDriver(device=cur_device)
pprint.pprint(tuner_driver.tuner.tuning_ranges)

In [None]:
pprint.pprint(tuner_driver.tuner.dataset[0][0].shape)

In [None]:
my_completed_study = tuner_driver(num_trials=30)