# Problem Set 10 Solutions

This notebook walks through an implementation of a Long Short-Term Memory (LSTM) network. An LSTM is a type of recurrent neural network (RNN) architecture designed to mitigate the vanishing gradient problem in standard RNNs. It is well suited to sequential data, making it a popular choice for tasks such as time series analysis and natural language processing.

#### A brief note: 
As we move through the latter half of the Deep Learning module, we will be covering some of the newer, state-of-the-art models and architectures. These models are increasingly complex, and building them from scratch requires more advanced coding skills. As you may have noticed, implementations of LSTMs and RNNs are not on Homework 2. This doesn't mean they aren't important! 

This Problem Set will be largely *understanding*-based, with fewer "coding" questions than the previous few. Of course, as we walk through the code, feel free to ask any questions about the code itself.

---

## Data

The data for this Problem Set is the training set of [The 2012 PhysioNet/Computing in Cardiology Challenge](https://physionet.org/content/challenge-2012/1.0.0/). The goal of the challenge (and this model) is to aid in the prediction of mortality rates in the ICU. Though the challenge offers multiple possible targets, in this code we focus on **Length of Stay** (LOS). 

Predicting LOS is valuable for many reasons: it aids in resource allocation and planning, helping healthcare providers allocate beds, staffing, and equipment efficiently, ensuring that critical care resources are available when needed \[[1](https://onlinelibrary.wiley.com/doi/full/10.1111/imj.14962)\]. LOS prediction can improve patient care and outcomes by allowing medical teams to make timely decisions regarding treatment, discharge, or transfer to lower-acuity settings \[[2](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1067452/)\]. It can help identify patients at risk of prolonged stays, enabling interventions to prevent complications and reduce healthcare costs. Furthermore, LOS prediction can facilitate hospital management, optimize bed turnover, and reduce wait times, ultimately benefiting both patients and healthcare institutions \[[3](https://doi.org/10.1097/MLR.0000000000001596)\].



This dataset consists of records from 4,000 ICU stays. All patients were adults and stayed for at least 48 hours. **A great deal of preprocessing has been performed on this data to make it suitable for LSTM.**

In order, the steps performed are:
1. Loading the data from its original form (a large folder of CSV files named `ID.txt`, one for each patient) into one larger CSV called `patient_data.csv` containing the columns Patient_ID, Time, Parameter, Value. (Script: `icu_data.py`)
2. Loading this new csv and subsetting 5 of the original 37 features: SaO2, NIDiasABP, HR, Urine, PaO2
    - These features were chosen at random
3. Performing numerical/type conversion for the data: changing Time from the form HH:MM to "minutes" and ensuring all data is a native type in Python
4. Aggregating the data - converting the columns from Patient_ID, Time, Parameter, Value to Patient_ID, Parameter, Time 0, Time 1, Time 2, ... Time 2880
5. Selecting only patients that had measurements for *all* 5 of our features
6. Filling the NaN values in our timepoints with backfill (and forward fill for our final timepoints).
7. Saving as a JSON file with key/value pairs of Patient_ID/time-series data.
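
Steps 3, 4, and 6 above can be sketched on a toy example. This is only an illustration under stated assumptions, not the contents of `icu_data.py`: the values are made up, the helper `to_minutes` and the column name `Minute` are hypothetical, and only four time points are used instead of 2881.

```python
import pandas as pd

# Toy long-format records (made-up values) with the columns
# Patient_ID, Time, Parameter, Value described above.
long_df = pd.DataFrame({
    "Patient_ID": [1, 1, 1, 1],
    "Time": ["00:00", "00:02", "00:01", "00:03"],
    "Parameter": ["HR", "HR", "SaO2", "SaO2"],
    "Value": [80.0, 84.0, 97.0, 95.0],
})


def to_minutes(hhmm):
    """Convert an 'HH:MM' string to a number of minutes (hypothetical helper)."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


# Step 3: convert Time from HH:MM to minutes.
long_df["Minute"] = long_df["Time"].map(to_minutes)

# Step 4: one row per (patient, parameter), one column per minute.
wide = long_df.pivot_table(index=["Patient_ID", "Parameter"],
                           columns="Minute", values="Value")
wide = wide.reindex(columns=range(4))  # ensure every minute has a column

# Step 6: backfill along the time axis, then forward fill the trailing gaps.
filled = wide.bfill(axis=1).ffill(axis=1)
print(filled)
```

Here the HR row goes from `[80, NaN, 84, NaN]` to `[80, 84, 84, 84]`: the gap at minute 1 is filled from the later measurement, and the trailing gap is filled from the last observed value.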

**What issues may there be with this preprocessing, if any? How might you fix them?**

[Type Here]

In [None]:
import json
import numpy as np
import pandas as pd

I have saved the patient data into a JSON file called `patient_data.json` (if you cloned the GitHub repository, it should exist in the `data` directory relative to our current one). **Load this data into a variable called `patient_data`. Print the number of entries in this dataset.**
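
One possible way to do this is sketched below. The helper name `load_patient_data` is hypothetical, and the path assumes the `data/patient_data.json` layout described above.

```python
import json
from pathlib import Path


def load_patient_data(path):
    """Read a JSON file mapping Patient_ID -> time-series data into a dict."""
    with open(path, "r") as handle:
        return json.load(handle)


# Assumed location; adjust it if your directory layout differs.
data_path = Path("data") / "patient_data.json"
if data_path.exists():
    patient_data = load_patient_data(data_path)
    print(len(patient_data))  # number of patients (keys) in the file
```

Because the JSON object is loaded as a Python dictionary, `len` counts its keys, which are the patient IDs stored as strings.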

In [None]:
# CODE HERE

In [None]:
# Loading the patient outcomes (the Length of stay):
outcomes = pd.read_csv('data/outcomes.csv')
outcomes

This data has 4000 rows and 2 columns. **Is this different from the number of items in our `patient_data` variable? If so, why? How can we fix it?** (Hint: it is different)

[Type Here]

In [None]:
# Fixing the targets by removing unused IDs and preparing the features array
## Start with two empty arrays
features = []
targets = []


for patient_id in patient_data.keys():  # Recall: keys are our patient ids
    outcome = outcomes.Length_of_stay[outcomes.RecordID == int(patient_id)]  # Take LOS where the record matches
    feature_array = np.array(patient_data[patient_id])                       # Take features where record matches
    # Add to our arrays
    targets.append(int(outcome.iloc[0]))  # .iloc[0] pulls the single matching value out of the Series
    features.append(feature_array)
    
# Convert from python lists to numpy arrays.
features = np.array(features)
targets = np.array(targets)
features.shape

In [None]:
targets.shape

The output of `features.shape` above shows that our feature data is 3-dimensional: 1394x5x2881. **What does each of these dimensions represent? What is the interpretation of the value `features[0, 1, 2]`?**

[Type Here]

---

## Building the Model

I will be using PyTorch's API for this Problem Set, though TensorFlow will of course be somewhat similar.

In [None]:
import torch 
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(42)

Looking ahead at the documentation, I noticed that torch's LSTM module expects data to be in the form $\left(\textrm{Number of Participants} \times \textrm{Sequence Length} \times \textrm{Number of Features}\right)$. 

**Create a variable `X` with the type `torch.Tensor`, and use the Tensor method `permute` to change the dimensions.
Verify that the shape has changed by printing off `X.shape`.**
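
As a rough illustration of how `permute` reorders axes, here is a sketch on a small stand-in array. The names `toy_features` and `toy_X` and the sizes (4 patients, 5 features, 7 time points) are made up for the example.

```python
import numpy as np
import torch

# Stand-in for `features`: 4 patients, 5 features, 7 time points (random values).
toy_features = np.random.rand(4, 5, 7)

# Convert to a float32 tensor, the default dtype of nn.LSTM weights.
toy_X = torch.tensor(toy_features, dtype=torch.float32)

# Keep axis 0 (patients) in place and swap the feature and time axes.
toy_X = toy_X.permute(0, 2, 1)
print(toy_X.shape)  # torch.Size([4, 7, 5])
```

Note that `permute` only reorders the axes; the value at `toy_features[0, 1, 2]` is now found at `toy_X[0, 2, 1]`.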

In [None]:
# CODE HERE

`y` is already the right shape, so I'll just convert that to a Tensor.

In [None]:
y = torch.Tensor(targets)
y.shape

Of course, we need to split the data. 

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)
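# Compute the statistics before overwriting X_train so the test set is scaled with the same values
train_mean, train_std = X_train.mean(), X_train.std()
X_train = (X_train - train_mean) / train_std
X_test = (X_test - train_mean) / train_std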

### Defining an LSTM model.

PyTorch has implemented an LSTM layer so we don't have to do a lot of the hard work. We have a bit of Object Oriented Programming ahead with the class definition. 

As a reminder from our intro to Python, a `class` in Python contains a collection of functions and values of our choosing. When creating a class, we can specify and access these values through the `self` variable given to each class function.

In [None]:
class LSTMModel(nn.Module):
    # The method that is called when an object is created with this class
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        # Start with PyTorch's model initialization by calling the parent class (nn.Module) constructor
        super().__init__() 
        
        # Use our passed parameters to create a class variable lstm
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        
        # One fully connected layer that takes our output of the LSTM and produces a prediction
        self.fc = nn.Linear(hidden_size, output_size)
        
        
    # The method that is called when we do the "forward pass" of our model: features in -> predictions out
    def forward(self, x):
        # LSTM layers return a tuple, and we only care about the output of the layer.
        out, _ = self.lstm(x)  # The _ is a Python convention for an output we don't use.
        
        out = out[:, -1, :]  # Take the last value in the output of our LSTM (the output of the last "block")
        out = self.fc(out)   # Pass through our FC layer
        return out    

In our forward pass, we "throw out" the second part of a tuple returned by the LSTM layer. Looking at the documentation, we are throwing out two variables (joined in a tuple) called `h_n` and `c_n`. **Using your knowledge of the components of LSTM, what is this part we are throwing out?**
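
To make the return value concrete, the sketch below runs a small LSTM on random data and prints the shapes of each piece. The sizes are toy values chosen for illustration and are not the ones used in this Problem Set.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes chosen for illustration only.
toy_lstm = nn.LSTM(input_size=5, hidden_size=8, num_layers=2, batch_first=True)
toy_batch = torch.randn(3, 10, 5)  # (batch, sequence length, features)

out, (h_n, c_n) = toy_lstm(toy_batch)

print(out.shape)  # torch.Size([3, 10, 8]): one vector per time step
print(h_n.shape)  # torch.Size([2, 3, 8]): one vector per layer
print(c_n.shape)  # torch.Size([2, 3, 8]): one vector per layer

# For a unidirectional LSTM, the last time step of `out`
# matches the final layer's entry in h_n.
print(torch.allclose(out[:, -1, :], h_n[-1]))  # True
```

Note that `h_n` and `c_n` are indexed by layer first, even when `batch_first=True`.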

[Type Here]

---

### Hyperparameters

In [None]:
input_size = 5         # Number of features in
hidden_size = 64       # Number of hidden units in each LSTM layer
num_layers = 2         # How many LSTM layers to add 
output_size = 1        # Number of features out
learning_rate = 0.001  # How fast the model trains
num_epochs = 10        # How many times to loop through the training data      
batch_size = 32        # How many training samples to work through before updating our weights

**Which of these can we change? Which are dependent on our data?**

[Type Here]

---

### Preparing for Training

In [None]:
model = LSTMModel(input_size, hidden_size, num_layers, output_size)  # Instance of our defined class!
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

#### Datasets and DataLoaders:
Datasets and DataLoaders are PyTorch's standard way of loading data. Though generally unnecessary for the class, those of you who want to implement some of the more advanced ML algorithms in your own projects may benefit from their efficiency. Generally speaking, a Dataset pairs up your input features and targets and defines how each sample is loaded into memory, and a DataLoader lets you iterate over the data in batches using Python's `for ... in` syntax. In more advanced applications where you want to augment your data, this is usually done inside the Dataset (for example, many torchvision datasets accept a `transform` argument) rather than in the DataLoader.
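
A minimal sketch of how the two fit together, using made-up toy tensors (10 samples, batch size 4) rather than our patient data:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data: 10 samples, 7 time steps, 5 features, with one target per sample.
toy_X = torch.randn(10, 7, 5)
toy_y = torch.arange(10, dtype=torch.float32)

toy_dataset = TensorDataset(toy_X, toy_y)  # pairs each sample with its target
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=False)

print(len(toy_dataset))  # 10 samples
print(len(toy_loader))   # 3 batches: 4 + 4 + 2

batch_shapes = []
for batch_inputs, batch_targets in toy_loader:
    # Each iteration yields one batch of inputs and the matching targets.
    batch_shapes.append((tuple(batch_inputs.shape), tuple(batch_targets.shape)))

print(batch_shapes)
```

The last batch is smaller because 10 is not a multiple of 4; passing `drop_last=True` to the DataLoader would discard it instead.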

In [None]:
train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

In [None]:
losses = []
epochs = []
for epoch in range(num_epochs):
    model.train()  # Set the model to training mode
    total_loss = 0

    for inputs, batch_targets in train_loader:  # Named batch_targets so we don't overwrite the `targets` array
        # Forward pass
        outputs = model(inputs)
        
        # Calculate the loss
        loss = criterion(outputs, batch_targets.view(-1, 1))  # Reshape targets to the same shape as our output
        
        # Backpropagation and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # Update our loss. loss.item() extracts the actual value of our loss (as a float)
        total_loss += loss.item()

    # Print the average loss for the current epoch
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {total_loss / len(train_loader)}')
    losses.append(total_loss/len(train_loader))
    epochs.append(epoch+1)

In [None]:
# Set the model to evaluation mode
model.eval()

# Forward pass on the test data
with torch.no_grad():
    test_outputs = model(X_test)

# Calculate the loss (MSE)
criterion = nn.MSELoss()
test_loss = criterion(test_outputs, y_test.view(-1, 1))  # Reshape y_test like in our training loop

# Convert to numpy arrays
test_outputs = test_outputs.numpy()
y_test = y_test.numpy()


# Print or return the test loss
print(f'Test Loss (MSE): {test_loss.item()}')

**Evaluate the model. Is it overfitting or underfitting? Besides MSE, what are some other metrics, plots, or methods we may want to calculate/implement to evaluate our model?**
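
As one possible starting point, the sketch below computes a few common regression metrics with NumPy. The function name `regression_metrics` and the toy numbers are hypothetical, and this is not meant as the complete answer to the question above.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, R^2, and the MSE of always predicting the mean."""
    # Flatten both inputs so an (N, 1) prediction array and an (N,) target
    # array are compared element by element instead of being broadcast.
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()

    errors = y_pred - y_true
    ss_res = float(np.sum(errors ** 2))
    # ss_tot is zero when every target is identical, in which case R^2 is undefined.
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))

    return {
        "mae": float(np.mean(np.abs(errors))),
        "rmse": float(np.sqrt(np.mean(errors ** 2))),
        "r2": 1.0 - ss_res / ss_tot,
        "baseline_mse": ss_tot / len(y_true),
    }


# Toy values for illustration only.
toy_true = [5, 10, 15]
toy_pred = [[6], [10], [13]]
metrics = regression_metrics(toy_true, toy_pred)
print(metrics)
```

After the evaluation cell above, something like `regression_metrics(y_test, test_outputs)` could be compared against the training loss. The `baseline_mse` entry shows how well a model that always predicts the mean LOS would do, which helps put the test MSE in context.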

[Type Here]

---