# LSTM Time Series Deep Learning and Adversarial Attacks

## 1. About this Notebook

See the project [README](https://github.com/duanegoodner/lstm_adversarial_attack) for general information on the dataset and approach used in this notebook.

The implementation details for this project are encapsulated in classes and methods defined in modules under the project `src` directory, and intermediate data structures and logs are saved in the project `data` directory. Most of the code in this notebook simply instantiates top-level classes and calls their methods without revealing implementation details. Please look at the code in the `src` and `data` directories if you are interested in lower-level details. The import paths and the terminal output shown in this notebook provide some guidance on where to look within those directories.

## 2. Project Setup

Before running this notebook, review the project README at https://github.com/duanegoodner/lstm_adversarial_attack, and complete all steps in the "How to run this project" section.

## 3. Imports
Most of the necessary standard library and external package imports are handled by modules in the `src` directory, but we need to import a few here.

### 3.1 Standard Library Imports

In [1]:
import pprint
import sys

### 3.2 External Packages

In [2]:
import numpy as np
import torch

### 3.3 Internal Project Modules and Sub-packages
To help convey the project structure, we import internal packages and modules as needed (i.e. immediately before the notebook code cells where they are first used). For now, we import the project `src` path defined in `lstm_adversarial_attack/notebooks/src_paths`, add it to `sys.path` (so we can easily import project code), and import the project config files.

In [3]:
import src_paths
sys.path.append(str(src_paths.lstm_adversarial_attack_pkg))
import lstm_adversarial_attack.config_paths as cfg_paths
import lstm_adversarial_attack.config_settings as cfg_set

## 3. Database Queries
We need to run four queries on the MIMIC-III PostgreSQL database. The paths to files containing the queries are stored in a list as `DB_QUERIES` in the project `config_paths` file:

In [4]:
pprint.pprint(cfg_paths.DB_QUERIES)

[PosixPath('/home/devspace/project/src/mimiciii_queries/icustay_detail.sql'),
 PosixPath('/home/devspace/project/src/mimiciii_queries/pivoted_bg.sql'),
 PosixPath('/home/devspace/project/src/mimiciii_queries/pivoted_lab.sql'),
 PosixPath('/home/devspace/project/src/mimiciii_queries/pivoted_vital.sql')]


To connect to the database and execute the queries, we instantiate a `MimiciiiDatabaseAccess` object from module `mimiciii_database` of project sub-package `query_db` and use its `.connect()`, `.run_sql_queries()` and `.close_connection()` methods.

In [5]:
import lstm_adversarial_attack.query_db.mimiciii_database as mdb

db_access = mdb.MimiciiiDatabaseAccess(
    dotenv_path=cfg_paths.DB_DOTENV_PATH, output_dir=cfg_paths.DB_OUTPUT_DIR
)
db_access.connect()
db_query_results = db_access.run_sql_queries(
    sql_query_paths=cfg_paths.DB_QUERIES
)
db_access.close_connection()

Query 1 of 4
Executing: /home/devspace/project/src/mimiciii_queries/icustay_detail.sql
Done. Query time = 0.48 seconds
Writing result to csv: /home/devspace/project/data/mimiciii_query_results/icustay_detail.csv
Done. csv write time = 0.52 seconds

Query 2 of 4
Executing: /home/devspace/project/src/mimiciii_queries/pivoted_bg.sql
Done. Query time = 16.95 seconds
Writing result to csv: /home/devspace/project/data/mimiciii_query_results/pivoted_bg.csv
Done. csv write time = 4.32 seconds

Query 3 of 4
Executing: /home/devspace/project/src/mimiciii_queries/pivoted_lab.sql
Done. Query time = 24.69 seconds
Writing result to csv: /home/devspace/project/data/mimiciii_query_results/pivoted_lab.csv
Done. csv write time = 6.34 seconds

Query 4 of 4
Executing: /home/devspace/project/src/mimiciii_queries/pivoted_vital.sql
Done. Query time = 63.48 seconds
Writing result to csv: /home/devspace/project/data/mimiciii_query_results/pivoted_vital.csv
Done. csv write time = 27.32 seconds



The result of each `.sql` query is saved to a `.csv` file. The path to each of these files is shown in the terminal output above. The output directory for the query results is defined by the variable `DB_OUTPUT_DIR` in the project `config_paths` file.
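The query-then-write-to-csv pattern seen in the output above can be sketched with the standard library alone. This is only an illustration, not the project's `MimiciiiDatabaseAccess` implementation: it uses an in-memory SQLite database as a stand-in for the MIMIC-III PostgreSQL database, and the function `run_query_to_csv`, the `stays` table, and the file `demo_stays.sql` are all hypothetical names introduced for this example.

```python
import csv
import sqlite3
import tempfile
import time
from pathlib import Path


def run_query_to_csv(
    connection: sqlite3.Connection, sql_path: Path, output_dir: Path
) -> Path:
    """Run the query stored in sql_path and write its rows to a csv file."""
    start = time.time()
    cursor = connection.execute(sql_path.read_text())
    header = [column[0] for column in cursor.description]
    rows = cursor.fetchall()
    print(f"Done. Query time = {time.time() - start:.2f} seconds")

    # Name the output after the query file, e.g. demo_stays.sql -> demo_stays.csv
    output_path = output_dir / f"{sql_path.stem}.csv"
    with output_path.open("w", newline="") as out_file:
        writer = csv.writer(out_file)
        writer.writerow(header)
        writer.writerows(rows)
    return output_path


# Hypothetical in-memory table standing in for a MIMIC-III table
work_dir = Path(tempfile.mkdtemp())
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stays (icustay_id INTEGER, age INTEGER)")
conn.executemany(
    "INSERT INTO stays VALUES (?, ?)", [(1, 64), (2, 17), (3, 80)]
)

query_path = work_dir / "demo_stays.sql"
query_path.write_text("SELECT icustay_id, age FROM stays ORDER BY icustay_id")

csv_path = run_query_to_csv(conn, query_path, work_dir)
conn.close()
csv_lines = csv_path.read_text().splitlines()
```

Keeping each query in its own `.sql` file and deriving the output name from the file stem is what lets a single list of paths, such as `DB_QUERIES`, drive the whole step.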

## 4. Preprocessor

### 4.1 Instantiate a Preprocessor object
We import the `preprocessor` module from the internal sub-package `preprocess`, instantiate a `Preprocessor` object, and then examine its `.preprocess_modules` data member.

In [7]:
import lstm_adversarial_attack.preprocess.preprocessor as pre
preprocessor = pre.Preprocessor()

We can get a general idea of how the `lstm_adversarial_attack.preprocess` sub-package works by looking at the `Preprocessor` object's `.preprocess_modules` data member.

In [8]:
pprint.pprint([item.__class__ for item in preprocessor.preprocess_modules])

[<class 'lstm_adversarial_attack.preprocess.prefilter.Prefilter'>,
 <class 'lstm_adversarial_attack.preprocess.icustay_measurement_combiner.ICUStayMeasurementCombiner'>,
 <class 'lstm_adversarial_attack.preprocess.sample_list_builder.FullAdmissionListBuilder'>,
 <class 'lstm_adversarial_attack.preprocess.feature_builder.FeatureBuilder'>,
 <class 'lstm_adversarial_attack.preprocess.feature_finalizer.FeatureFinalizer'>]


* Prefilter reads the database query outputs into pandas DataFrames, removes all data related to patients younger than 18 years of age, ensures consistent column naming formats, and takes care of datatype details.
* ICUStayMeasurementCombiner performs various joins (aka "merges" in the language of pandas) to combine lab and vital sign measurement data with ICU stay data.
* FullAdmissionListBuilder generates a list consisting of one FullAdmissionData object per ICU stay. The attributes of a FullAdmissionData object include ICU stay info and a DataFrame containing the measurement and timestamp data for all vital sign and lab data associated with the ICU stay.
* FeatureBuilder resamples the time series DataFrame to one-hour intervals, imputes missing data, winsorizes measurement values (with cutoffs at the 5th and 95th global percentiles), and normalizes the measurement values so all data are between 0 and 1.
* FeatureFinalizer selects the data observation time window (by default, it starts at hospital admission time and ends 48 hours after admission). This module outputs the features of the entire dataset as a list of numpy arrays, and the mortality labels as a list of integers. These data structures (saved as .pickle files) will be convenient starting points when the `tune_train` and `attack` sub-packages need to create PyTorch Datasets.
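The winsorizing and normalizing steps described for FeatureBuilder can be illustrated with a short numpy sketch. It is not taken from the project code: the function name `winsorize_and_normalize` is hypothetical, and it assumes that "global percentiles" means percentiles computed per measurement column over all rows of the data.

```python
import numpy as np


def winsorize_and_normalize(
    values: np.ndarray, low_pct: float = 5.0, high_pct: float = 95.0
) -> np.ndarray:
    """Clip each column to its percentile cutoffs, then min-max scale."""
    low = np.percentile(values, low_pct, axis=0)
    high = np.percentile(values, high_pct, axis=0)
    clipped = np.clip(values, low, high)
    # Guard against constant columns, where high == low
    span = np.where(high > low, high - low, 1.0)
    return (clipped - low) / span


rng = np.random.default_rng(seed=0)
# Made-up data: 1000 measurements of 3 features, with one extreme outlier
measurements = rng.normal(loc=80.0, scale=10.0, size=(1000, 3))
measurements[0, 0] = 1e6
scaled = winsorize_and_normalize(measurements)
```

Clipping before scaling matters here: without it, the single outlier would set the maximum and squeeze every other value in that column toward 0.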

Now that we have some background info, we are ready to run the Preprocessor.

In [9]:
preprocessed_resources = preprocessor.preprocess()


Running preprocess module 1 of 5: Prefilter
Incoming resources:
/home/devspace/project/data/mimiciii_query_results/icustay_detail.csv
/home/devspace/project/data/mimiciii_query_results/pivoted_bg.csv
/home/devspace/project/data/mimiciii_query_results/pivoted_vital.csv
/home/devspace/project/data/mimiciii_query_results/pivoted_lab.csv
Done with Prefilter. Results saved to:
path: /home/devspace/project/data/preprocess_checkpoints/1_prefilter/icustay.pickle, data_type: DataFrame
path: /home/devspace/project/data/preprocess_checkpoints/1_prefilter/bg.pickle, data_type: DataFrame
path: /home/devspace/project/data/preprocess_checkpoints/1_prefilter/vital.pickle, data_type: DataFrame
path: /home/devspace/project/data/preprocess_checkpoints/1_prefilter/lab.pickle, data_type: DataFrame

Running preprocess module 2 of 5: ICU Stay Data + Measurement Data Combiner
Incoming resources:
/home/devspace/project/data/preprocess_checkpoints/1_prefilter/icustay.pickle
/home/devspace/project/data/preproce

## 5. PyTorch Dataset object

### 5.1 Create the Dataset
We import module `x19_mort_general_dataset` and use it, along with files saved by the Preprocessor's FeatureFinalizer module, to instantiate a PyTorch Dataset.

In [12]:
import lstm_adversarial_attack.x19_mort_general_dataset as xmd
dataset = xmd.X19MGeneralDataset.from_feature_finalizer_output()
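For readers unfamiliar with PyTorch Datasets, the following is a rough sketch of how a list of numpy feature arrays and a list of integer labels might be wrapped in one. The class `ListBackedDataset` is hypothetical and is not the project's `X19MGeneralDataset`; the sample shapes are made up for illustration.

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class ListBackedDataset(Dataset):
    """Hypothetical Dataset wrapping feature arrays and integer labels."""

    def __init__(self, features: list, labels: list):
        assert len(features) == len(labels)
        self.features = features
        self.labels = labels

    def __len__(self) -> int:
        return len(self.features)

    def __getitem__(self, idx: int) -> tuple:
        # Features become a float32 matrix (time steps x measurements);
        # the label becomes a zero-dimensional int64 tensor.
        feature = torch.tensor(self.features[idx], dtype=torch.float32)
        label = torch.tensor(self.labels[idx], dtype=torch.int64)
        return feature, label


# Two made-up samples with different sequence lengths
demo_features = [np.zeros((48, 19)), np.ones((30, 19))]
demo_labels = [0, 1]
demo_dataset = ListBackedDataset(features=demo_features, labels=demo_labels)
first_features, first_label = demo_dataset[0]
```

Unlike the project's Dataset, this sketch only supports integer indexing, so slices such as `demo_dataset[:]` are not handled.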

### 5.2 Examine the Dataset

In [23]:
print(f"Number of samples in dataset = {len(dataset)}\n")
print(f"Type returned by dataset.__getitem__ = {type(dataset[0])}\n")
print(
    f"Length of each tuple returned by dataset.__getitem__ = {len(dataset[0])}"
)
print(
    "\nObject type, dimensionality, and datatype of each element in a tuple"
    " returned by dataset.__getitem__:"
)
print(tuple([(type(item), item.dim(), item.dtype) for item in dataset[0]]))
print(f"\ninput size (# columns) of each feature matrix is:\n"
     f"{np.unique([item.shape[1] for item in dataset[:][0]]).item()}\n")

print("Distribution of input sequence lengths (# rows):")
print("length\tcounts")
unique_sequence_lengths, sequence_length_counts = np.unique(
    [item.shape[0] for item in dataset[:][0]], return_counts=True
)
print(
    np.concatenate(
        (
            unique_sequence_lengths.reshape(-1, 1),
            sequence_length_counts.reshape(-1, 1),
        ),
        axis=1,
    )
)

print("\nLabel counts:")
print("value\tcounts")
unique_labels, label_counts = np.unique([dataset[:][1]], return_counts=True)
print(
    np.concatenate(
        (unique_labels.reshape(-1, 1), label_counts.reshape(-1, 1)), axis=1
    )
)


Number of samples in dataset = 41951

Type returned by dataset.__getitem__ = <class 'tuple'>

Length of each tuple returned by dataset.__getitem__ = 2

Object type, dimensionality, and datatype of each element in a tuple returned by dataset.__getitem__:
((<class 'torch.Tensor'>, 2, torch.float32), (<class 'torch.Tensor'>, 0, torch.int64))

input size (# columns) of each feature matrix is:
19

Distribution of input sequence lengths (# rows):
length	counts
[[    6     1]
 [   13     1]
 [   14     1]
 [   16     2]
 [   17     4]
 [   18     3]
 [   19     8]
 [   20    12]
 [   21    25]
 [   22    49]
 [   23    84]
 [   24   144]
 [   25   126]
 [   26   110]
 [   27    93]
 [   28    95]
 [   29    84]
 [   30    90]
 [   31    75]
 [   32    99]
 [   33    98]
 [   34   113]
 [   35   148]
 [   36   152]
 [   37   189]
 [   38   199]
 [   39   220]
 [   40   178]
 [   41   231]
 [   42   203]
 [   43   211]
 [   44   191]
 [   45   185]
 [   46   221]
 [   47   474]
 [   48 37832]]
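Because the sequence lengths above range from 6 to 48 rows, samples cannot be stacked directly into a single batch tensor. How the project handles this is not shown in this notebook; one common PyTorch approach, sketched here with made-up data and an arbitrary hidden size of 32, is to zero-pad the sequences and then pack them so the LSTM ignores the padding.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_sequence

# Three made-up feature matrices with 48, 30 and 6 time steps, 19 columns each
sequences = [torch.rand(48, 19), torch.rand(30, 19), torch.rand(6, 19)]
lengths = torch.tensor([seq.shape[0] for seq in sequences])

# Zero-pad to the longest sequence: shape (batch, max_length, num_features)
padded = pad_sequence(sequences, batch_first=True)

# Pack so the LSTM skips the padded time steps
packed = pack_padded_sequence(
    padded, lengths, batch_first=True, enforce_sorted=False
)

lstm = torch.nn.LSTM(input_size=19, hidden_size=32, batch_first=True)
_, (final_hidden, _) = lstm(packed)
```

With packing, `final_hidden` holds each sequence's hidden state at its own last real time step rather than at the padded position 48.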


## 6. Model Hyperparameter Tuning

### 6.1 Check for GPU
Model hyperparameter tuning (along with training, and model attacks) is implemented in PyTorch, and we really need a GPU to run things in a reasonable amount of time. We check for a GPU a

In [26]:
import torch

if torch.cuda.is_available():
    cur_device = torch.device("cuda:0")
else:
    cur_device = torch.device("cpu")

print(f"cur_device is {cur_device}")

cur_device is cpu


### 6.2 Instantiate and Examine TunerDriver
We then instantiate a TunerDriver object, passing the device (ideally a GPU) to its constructor, and print the hyperparameter ranges that its tuner will search.

In [27]:
import lstm_adversarial_attack.tune_train.tuner_driver as td
tuner_driver = td.TunerDriver(device=cur_device)
pprint.pprint(tuner_driver.tuner.tuning_ranges)

X19MLSTMTuningRanges(log_lstm_hidden_size=(5, 7),
                     lstm_act_options=('ReLU', 'Tanh'),
                     dropout=(0, 0.5),
                     log_fc_hidden_size=(4, 8),
                     fc_act_options=('ReLU', 'Tanh'),
                     optimizer_options=('Adam', 'RMSprop', 'SGD'),
                     learning_rate=(1e-05, 0.1),
                     log_batch_size=(5, 8))
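To make the ranges above more concrete, here is a sketch of drawing one configuration from them using only the standard library. It rests on two assumptions that this notebook does not confirm: that the `log_*` entries are base-2 exponents (so `log_batch_size=(5, 8)` would mean batch sizes from 32 to 256), and that the learning rate is drawn log-uniformly. The function `sample_hyperparameters` is hypothetical and is not how the project's tuner samples.

```python
import math
import random

# Ranges copied from the output above
tuning_ranges = {
    "log_lstm_hidden_size": (5, 7),
    "lstm_act_options": ("ReLU", "Tanh"),
    "dropout": (0, 0.5),
    "log_fc_hidden_size": (4, 8),
    "fc_act_options": ("ReLU", "Tanh"),
    "optimizer_options": ("Adam", "RMSprop", "SGD"),
    "learning_rate": (1e-05, 0.1),
    "log_batch_size": (5, 8),
}


def sample_hyperparameters(ranges: dict, rng: random.Random) -> dict:
    """Draw one hypothetical configuration from the tuning ranges."""
    lr_low, lr_high = ranges["learning_rate"]
    return {
        # Assumption: log_* entries are base-2 exponents
        "lstm_hidden_size": 2 ** rng.randint(*ranges["log_lstm_hidden_size"]),
        "lstm_act": rng.choice(ranges["lstm_act_options"]),
        "dropout": rng.uniform(*ranges["dropout"]),
        "fc_hidden_size": 2 ** rng.randint(*ranges["log_fc_hidden_size"]),
        "fc_act": rng.choice(ranges["fc_act_options"]),
        "optimizer": rng.choice(ranges["optimizer_options"]),
        # Assumption: learning rate is drawn log-uniformly
        "learning_rate": 10 ** rng.uniform(
            math.log10(lr_low), math.log10(lr_high)
        ),
        "batch_size": 2 ** rng.randint(*ranges["log_batch_size"]),
    }


sampled = sample_hyperparameters(tuning_ranges, random.Random(0))
```

Sampling the exponent rather than the size itself spreads trials evenly across orders of magnitude, which is a common reason for expressing ranges in log form.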


The default 

In [None]:
pprint.pprint(tuner_driver.tuner.dataset[0][0].shape)

In [None]:
my_completed_study = tuner_driver(num_trials=30)