# Lab 3

In the previous labs, we performed mortality prediction based on the diagnosis codes of the last visit using a DNN. This practice ignores a large amount of information in a patient's previous visits. Thus, starting from this lab, we will work with sequential visit data. That is, each patient will have a sequence of visits.

However, an MLP is poorly suited to data with such rich structure. This lab introduces convolutional neural networks (CNNs), a powerful family of neural networks designed for precisely this purpose.

Table of Contents:
- Convolution Operation
- Padding and Stride
- Multiple Input and Multiple Output Channels
- Pooling
- CNN with PyTorch
- Assignment

Some contents of this lab are adapted from [Dive into Deep Learning](https://d2l.ai) and [Official PyTorch Tutorials](https://pytorch.org/tutorials/).

In [None]:
import os
import random
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import pandas as pd

In [None]:
# set seed
seed = 24
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
os.environ["PYTHONHASHSEED"] = str(seed)

In [None]:
DATA_PATH = "../LAB3-lib/data"
assert os.path.isdir(DATA_PATH)
!ls {DATA_PATH}

## 1. Convolution Operation

Although we will deal with sequential (1D) data in the assignment, let us first start with image (2D) data to build our intuition.

Convolutional neural networks are efficient architectures for exploring structure in image data.

A convolution operation takes an input tensor and a kernel tensor and produces an output tensor. Strictly speaking, the operation described here is cross-correlation, which is what deep learning libraries call convolution. Let us ignore channels for now and see how this works with two-dimensional data and hidden representations. In the figure below, the input is a two-dimensional tensor with a height of 3 and width of 3. We mark the shape of the tensor as 3x3 or (3, 3). The height and width of the kernel are both 2. The shape of the kernel window (or convolution window) is given by the height and width of the kernel (here it is 2x2).

<img src='./img/convolution.svg'>

In the two-dimensional cross-correlation operation, we begin with the convolution window positioned at the top-left corner of the input tensor and slide it across the input tensor, both from left to right and top to bottom. When the convolution window slides to a certain position, the input subtensor contained in that window and the kernel tensor are multiplied elementwise and the resulting tensor is summed up yielding a single scalar value. This result gives the value of the output tensor at the corresponding location. Here, the output tensor has a height of 2 and width of 2 and the four elements are derived from the two-dimensional cross-correlation operation:

$$
\begin{split}0\times0+1\times1+3\times2+4\times3=19,\\
1\times0+2\times1+4\times2+5\times3=25,\\
3\times0+4\times1+6\times2+7\times3=37,\\
4\times0+5\times1+7\times2+8\times3=43.\end{split}
$$
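As a sanity check on the arithmetic above, the same four values can be reproduced with PyTorch's built-in `F.conv2d`, which computes cross-correlation despite its name. This is only a sketch that uses the library routine on the 3x3 input and 2x2 kernel from the figure; the reshaping to four dimensions is required by the library, not by the operation itself.

```python
import torch
import torch.nn.functional as F

# F.conv2d expects inputs of shape (batch, channels, height, width) and
# kernels of shape (out_channels, in_channels, kernel_height, kernel_width).
X = torch.arange(9, dtype=torch.float32).reshape(1, 1, 3, 3)
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]]).reshape(1, 1, 2, 2)

# The kernel is not flipped, so this is the cross-correlation described above.
Y = F.conv2d(X, K).squeeze()
print(Y)  # tensor([[19., 25.], [37., 43.]])
```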

### Exercise 1 [10 points]

Calculate the output shape for a convolutional layer: given the input tensor shape $(n_w, n_h)$, the kernel tensor shape $(k_w, k_h)$, calculate the output tensor shape. For example, the output shape for the figure above is $(2, 2)$.

In [None]:
def conv_output_shape_1(n_w, n_h, k_w, k_h):
    
    """
    TODO: Calculate the output tensor shape.
    Note the output should be a tuple with two elements (width, height).
    """
    # your code here
    raise NotImplementedError

In [None]:
'''
AUTOGRADER CELL. DO NOT MODIFY THIS.
'''

assert conv_output_shape_1(n_w=7, n_h=7, k_w=3, k_h=3) == (5, 5)
assert conv_output_shape_1(n_w=7, n_h=9, k_w=4, k_h=2) == (4, 8)



### Exercise 2 [10 points]

Implement the 2D convolution function, which accepts an input tensor X and a kernel tensor K and returns an output tensor Y.

In [None]:
def corr2d(X, K):
    """ TODO: Compute 2D convolution. """
    # your code here
    raise NotImplementedError

In [None]:
'''
AUTOGRADER CELL. DO NOT MODIFY THIS.
'''

X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
assert torch.allclose(corr2d(X, K), torch.tensor([[19., 25.], [37., 43.]]))



## 2. Padding and Stride

In several cases, we incorporate techniques, including padding and strided convolutions, that affect the size of the output. As motivation, note that since kernels generally have width and height greater than 1, after applying many successive convolutions, we tend to wind up with outputs that are considerably smaller than our input. If we start with a 240×240 pixel image, 10 layers of 5×5 convolutions reduce the image to 200×200 pixels, slicing off about 30% of the image and with it obliterating any interesting information on the boundaries of the original image. Padding is the most popular tool for handling this issue.
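The shrinkage described above can be checked empirically. The sketch below stacks ten unpadded 5×5 convolutions and passes a random 240×240 input through them; the single channel and the random weights are arbitrary choices made only for this illustration.

```python
import torch
import torch.nn as nn

# Ten unpadded 5x5 convolutions with one input and one output channel each.
layers = nn.Sequential(*[nn.Conv2d(1, 1, kernel_size=5) for _ in range(10)])

image = torch.randn(1, 1, 240, 240)
with torch.no_grad():
    out = layers(image)

print(out.shape)  # torch.Size([1, 1, 200, 200])

# Fraction of the original pixel area that no longer appears in the output.
lost_fraction = 1 - (out.shape[-2] * out.shape[-1]) / (240 * 240)
print(f"{lost_fraction:.1%}")  # 30.6%
```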

In other cases, we may want to reduce the dimensionality drastically, e.g., if we find the original input resolution to be unwieldy. Strided convolutions are a popular technique that can help in these instances.

### 2.1 Padding

As described above, one tricky issue when applying convolutional layers is that we tend to lose pixels on the perimeter of our image. Since we typically use small kernels, for any given convolution, we might only lose a few pixels, but this can add up as we apply many successive convolutional layers. One straightforward solution to this problem is to add extra pixels of filler around the boundary of our input image, thus increasing the effective size of the image. Typically, we set the values of the extra pixels to zero. In the figure below, we pad a 3×3 input, increasing its size to 5×5. The corresponding output then increases to a 4×4 matrix. The shaded portions are the first output element as well as the input and kernel tensor elements used for the output computation: 0×0+0×1+0×2+0×3=0.

<img src='./img/conv-pad.svg'>
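The padded example can be reproduced with the `padding` argument of `F.conv2d`. This is a sketch using the library routine with the same 3×3 input and 2×2 kernel as before; `padding=1` adds one row or column of zeros on every side.

```python
import torch
import torch.nn.functional as F

X = torch.arange(9, dtype=torch.float32).reshape(1, 1, 3, 3)
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]]).reshape(1, 1, 2, 2)

# padding=1 turns the 3x3 input into a 5x5 input before the window slides.
Y_padded = F.conv2d(X, K, padding=1).squeeze()

print(Y_padded.shape)  # torch.Size([4, 4])
print(Y_padded[0, 0])  # tensor(0.), the shaded first output element
```

The central 2×2 block of `Y_padded` equals the unpadded result, since those windows never touch the padding.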

### 2.2 Stride

When computing the convolution, we start with the convolution window at the top-left corner of the input tensor, and then slide it over all locations both down and to the right. In previous examples, we default to sliding one element at a time. However, sometimes, either for computational efficiency or because we wish to downsample, we move our window more than one element at a time, skipping the intermediate locations.

We refer to the number of rows and columns traversed per slide as the stride. So far, we have used strides of 1, both for height and width. Sometimes, we may want to use a larger stride. The figure below shows a two-dimensional convolution operation with a stride of 3 vertically and 2 horizontally. The shaded portions are the output elements as well as the input and kernel tensor elements used for the output computation:  0×0+0×1+1×2+2×3=8, 0×0+6×1+0×2+0×3=6. We can see that when the second element of the first column is outputted, the convolution window slides down three rows. The convolution window slides two columns to the right when the second element of the first row is outputted. When the convolution window continues to slide two columns to the right on the input, there is no output because the input element cannot fill the window (unless we add another column of padding).
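The strided computation above can be sketched with `F.conv2d` as well. The code assumes the same 3×3 input and 2×2 kernel as in the earlier examples with one ring of zero padding, which is consistent with the two shaded sums quoted in the text; the stride is given as (vertical, horizontal).

```python
import torch
import torch.nn.functional as F

X = torch.arange(9, dtype=torch.float32).reshape(1, 1, 3, 3)
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]]).reshape(1, 1, 2, 2)

# Assumed setup: padding of 1 on every side, stride 3 down and 2 to the right.
Y_strided = F.conv2d(X, K, stride=(3, 2), padding=1).squeeze()
print(Y_strided)  # tensor([[0., 8.], [6., 8.]])
```

The entries `Y_strided[0, 1] = 8` and `Y_strided[1, 0] = 6` are the two sums written out above.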

### Exercise 3 [10 points]

Calculate the output shape for a convolutional layer with padding and stride: given the input tensor shape $(n_w, n_h)$, the kernel tensor shape $(k_w, k_h)$, the padding size $(p_w, p_h)$ (applied to each side of the input), and the stride size $(s_w, s_h)$, calculate the output tensor shape.

In [None]:
def conv_output_shape_2(n_w, n_h, k_w, k_h, p_w, p_h, s_w, s_h):
    
    """
    TODO: Calculate the output tensor shape.
    Note the output should be a tuple with two elements (width, height).
    """
    # your code here
    raise NotImplementedError

In [None]:
'''
AUTOGRADER CELL. DO NOT MODIFY THIS.
'''

assert conv_output_shape_2(n_w=7, n_h=7, k_w=3, k_h=3, p_w=1, p_h=1, s_w=1, s_h=1) == (7, 7)
assert conv_output_shape_2(n_w=7, n_h=7, k_w=3, k_h=3, p_w=0, p_h=0, s_w=2, s_h=2) == (3, 3)
assert conv_output_shape_2(n_w=7, n_h=9, k_w=4, k_h=2, p_w=0, p_h=1, s_w=2, s_h=1) == (2, 10)



## 3. Multiple Input and Multiple Output Channels

Denote by $c_i$ and $c_o$ the number of input and output channels, respectively, and let $k_h$ and $k_w$ be the height and width of the kernel. To get an output with multiple channels, we can create a kernel tensor of shape $c_i \times k_h \times k_w$ for every output channel. We concatenate them on the output channel dimension, so that the shape of the convolution kernel is $c_o \times c_i \times k_h \times k_w$. In convolution operations, the result on each output channel is calculated from the convolution kernel corresponding to that output channel and takes input from all channels in the input tensor.

<img src='./img/conv-channel.svg'>

In the figure above, the numbers of input and output channels are 3 and 2, respectively, so the convolution kernel consists of $2 \times 3$ two-dimensional kernels.
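The kernel shape $c_o \times c_i \times k_h \times k_w$ can be inspected directly on a PyTorch layer. The sketch below assumes a 2×2 kernel and a 3×3 input purely for illustration, and checks that one output channel is the sum of per-input-channel cross-correlations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# 3 input channels, 2 output channels; kernel size 2 is an assumption for this example.
conv = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=2, bias=False)
print(conv.weight.shape)  # torch.Size([2, 3, 2, 2]), i.e. (c_o, c_i, k_h, k_w)

x = torch.randn(1, 3, 3, 3)
with torch.no_grad():
    y = conv(x)
    # Output channel 0: cross-correlate each input channel with its own
    # 2D kernel, then add the three results together.
    manual = sum(
        F.conv2d(x[:, c:c + 1], conv.weight[0:1, c:c + 1]) for c in range(3)
    )

print(y.shape)  # torch.Size([1, 2, 2, 2])
print(torch.allclose(y[:, 0:1], manual, atol=1e-5))
```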

## 4. Pooling

Often, as we process images, we want to gradually reduce the spatial resolution of our hidden representations, aggregating information so that the higher up we go in the network, the larger the receptive field (in the input) to which each hidden node is sensitive.

Like convolutional layers, pooling operators consist of a fixed-shape window that is slid over all regions in the input according to its stride, computing a single output for each location traversed by the fixed-shape window (sometimes known as the pooling window). However, unlike the cross-correlation computation of the inputs and kernels in the convolutional layer, the pooling layer contains no parameters (there is no kernel). Instead, pooling operators are deterministic, typically calculating either the maximum or the average value of the elements in the pooling window. These operations are called maximum pooling (max pooling for short) and average pooling, respectively.

In both cases, as with the cross-correlation operator, we can think of the pooling window as starting from the upper-left of the input tensor and sliding across the input tensor from left to right and top to bottom. At each location that the pooling window hits, it computes the maximum or average value of the input subtensor in the window, depending on whether max or average pooling is employed.

<img src='./img/pooling.svg'>

The output tensor in the figure above has a height of 2 and a width of 2. The four elements are derived from the maximum value in each pooling window:

$$
\begin{split}\max(0, 1, 3, 4)=4,\\
\max(1, 2, 4, 5)=5,\\
\max(3, 4, 6, 7)=7,\\
\max(4, 5, 7, 8)=8.\\\end{split}
$$
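The four maxima above can be cross-checked with the library routine `F.max_pool2d`. Note that in PyTorch the stride of a pooling layer defaults to the pool size, so the sketch sets `stride=1` explicitly to match the sliding window in the figure.

```python
import torch
import torch.nn.functional as F

X = torch.arange(9, dtype=torch.float32).reshape(1, 1, 3, 3)

# 2x2 pooling window moved one element at a time.
Y_pool = F.max_pool2d(X, kernel_size=2, stride=1).squeeze()
print(Y_pool)  # tensor([[4., 5.], [7., 8.]])
```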

### Exercise 4 [10 points]

Implement a 2D max pooling layer from scratch, which accepts an input tensor X and pool size and returns an output tensor Y.

In [None]:
def maxpool2d(X, pool_size):
    # your code here
    raise NotImplementedError

In [None]:
'''
AUTOGRADER CELL. DO NOT MODIFY THIS.
'''

X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
assert torch.allclose(maxpool2d(X, (2, 2)), torch.tensor([[4., 5.], [7., 8.]]))



## 5. CNN with PyTorch

Luckily, PyTorch has all kinds of convolution and pooling operations implemented for us ([link](https://pytorch.org/docs/stable/nn.html#convolution-layers)). For the previous image example, we can use [`nn.Conv2d()`](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html#torch.nn.Conv2d) and [`nn.MaxPool2d`](https://pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html#torch.nn.MaxPool2d).

For example, the code below implements a 2D convolution layer with 3 input channels, 8 output channels, kernel shape (3, 3), stride shape (2, 2), and no padding.

In [None]:
m = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, stride=2, padding=0)

If we have an image of shape (3, 224, 224), after this convolution layer, the output shape will be (8, 111, 111). Let us verify this.

In [None]:
# the first dimension is the batch size (1 in this case, since we only have one image)
img = torch.randn(1, 3, 224, 224)
m(img).shape

In the assignment, on the other hand, we will work with sequential data. That is, each patient will be represented as a sequence of visits, and each visit will be represented as a set of diagnosis codes (encoded as a binary vector with a 1 for each code present; this lab calls it a one-hot vector).

Let $n$ denote the number of visits for a patient and $m$ the total number of diagnosis codes. The patient can then be represented as a matrix of shape $(n, m)$.

For example, suppose there are 30 diagnosis codes in total and a patient has 3 visits. Because `nn.Conv1d` expects input of shape (batch size, channels, length), we treat the codes as channels and place the visit dimension last:

In [None]:
# the first dimension is the batch size (1 in this case, since we only have one patient)
# the second dimension is the total number of diagnosis codes
# the third dimension is the total number of visits
patient = torch.randn(1, 30, 3)

We can then perform 1D convolution to capture the temporal information. The code below implements a 1D convolution layer with 30 input channels, 16 output channels, kernel size 2, stride 1, and no padding.

In [None]:
m = nn.Conv1d(in_channels=30, out_channels=16, kernel_size=2, stride=1, padding=0)

After convolution, we should have a tensor of shape (16, 2) for this patient, i.e., (1, 16, 2) including the batch dimension.

In [None]:
m(patient).shape
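If the patient is stored as a visits-by-codes matrix of shape $(n, m)$, it has to be transposed before it can be fed to `nn.Conv1d`. The following sketch shows one way to do this; the variable names and the random binary data are made up for this example.

```python
import torch
import torch.nn as nn

n_visits, n_codes = 3, 30

# Hypothetical patient: one row per visit, one column per diagnosis code.
visits_by_codes = torch.randint(0, 2, (n_visits, n_codes)).float()

# Swap to (codes, visits) and add a batch dimension: (1, 30, 3).
conv_input = visits_by_codes.T.unsqueeze(0)

conv1d = nn.Conv1d(in_channels=n_codes, out_channels=16, kernel_size=2)
out = conv1d(conv_input)

print(conv_input.shape)  # torch.Size([1, 30, 3])
print(out.shape)         # torch.Size([1, 16, 2])
```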

## Assignment [60 points]

In this assignment, you will use the [MIMIC-III Demo](https://physionet.org/content/mimiciii-demo/) dataset, which contains all intensive care unit (ICU) stays for 100 patients. The task is mortality prediction.

### Load Data

In the previous lab, we preprocessed the data, so in this lab we will use the processed data directly.

In [None]:
!ls {DATA_PATH}

Here are the helper functions and `CustomDataset` from the previous lab.

The only difference is that, starting from this lab, we will use all of a patient's visits instead of only the last visit. For this reason, we will only keep patients with more than one visit.

In [None]:
import csv

# two helper functions

TOTAL_NUM_CODES = 271


def read_csv(filename):
    """ reading csv from filename """
    data = []
    with open(filename, "r") as file:
        csv_reader = csv.DictReader(file, delimiter=',')
        for row in csv_reader:
            data.append(row)
    header = list(data[0].keys())
    return header, data


def to_one_hot(label, num_class):
    """ convert to one hot label """
    one_hot_label = [0] * num_class
    for i in label:
        one_hot_label[i] = 1
    return one_hot_label

In [None]:
from torch.utils.data import Dataset


class CustomDataset(Dataset):
    
    def __init__(self):
        # read the csv
        self._df = pd.read_csv(f'{DATA_PATH}/data.csv')
        # split diagnosis code index by ';' and convert it to integer
        self._df.icd9 = self._df.icd9.apply(lambda x: [int(i) for i in x.split(';')])
        # build data dict
        self._build_data_dict()
        # a list of subject ids
        self._subj_ids = list(self._data.keys())
        # sort the subject ids to maintain a fixed order
        self._subj_ids.sort()
    
    def _build_data_dict(self):
        """ 
        build SUBJECT_ID to ADMISSION dict
            - subject_id
                - icd9: a list of ICD9 code index
                - mortality: 0/1 mortality label
        """
        dict_data = {}
        df = self._df.groupby('subject_id').agg({'mortality': lambda x: x.iloc[0], 'icd9': list}).reset_index()
        for idx, row in df.iterrows():
            subj_id = row.subject_id
            # only keep patients with more than 1 visit
            if len(row.icd9) >= 2:
                dict_data[subj_id] = {}
                dict_data[subj_id]['icd9'] = row.icd9
                dict_data[subj_id]['mortality'] = row.mortality
        self._data = dict_data
    
    def __len__(self):
        """ return the number of samples (i.e. patients). """
        return len(self._subj_ids)
    
    def __getitem__(self, index):
        """ generates one sample of data. """
        # obtain the subject id
        subj_id = self._subj_ids[index]
        # obtain the data dict by subject id
        data = self._data[subj_id]
        # convert the diagnosis code indices of every visit to one-hot format
        x = torch.tensor([to_one_hot(visit, TOTAL_NUM_CODES) for visit in data['icd9']], dtype=torch.float32)
        # mortality label
        y = torch.tensor(data['mortality'], dtype=torch.float32)
        return x, y

In [None]:
dataset = CustomDataset()
print('Size of dataset:', len(dataset))

In [None]:
from torch.utils.data.dataset import random_split


split = int(len(dataset)*0.7)

lengths = [split, len(dataset) - split]
train_dataset, test_dataset = random_split(dataset, lengths)

print("Length of train dataset:", len(train_dataset))
print("Length of test dataset:", len(test_dataset))

Here is an example of $x$, and $y$. 

In [None]:
x, y = train_dataset[0]
print(f'Example x (shape {x.shape}):\n', x)
print(f'Example y:\n', y)

We can see that $x$ is of shape $(2, 271)$, which means that this patient has two visits and there are $271$ diagnosis codes in total. It is in one-hot format. A $1$ in position $i$ means that the diagnosis code with index $i$ appears in that visit.

And $y$ is either $0$ or $1$.

### Padding [20 points]

Note that the first dimension of $x$ can differ between patients (i.e., different patients can have different numbers of visits). Thus we need to implement a padding function (similar to the zero padding in images).

To achieve this goal, we will implement a special collate function. This collate function `collate_fn()` will be called by `DataLoader` after fetching a list of samples using the indices from `CustomDataset` to collate the list of samples into batches.

For example, assume the `DataLoader` gets a list of two samples (here, assume the total number of codes is 3). 

```
[ [ [0, 1, 0], [1, 0, 1] ], 
  [ [0, 0, 1], [0, 1, 1], [0, 1, 1] ] ]
```

where the first patient has two visits `[0, 1, 0]` and `[1, 0, 1]` and the second patient has three visits `[0, 0, 1]`, `[0, 1, 1]`, and `[0, 1, 1]`.

The collate function `collate_fn()` is supposed to pad them so that every patient has the same number of visits: here, 2 patients with 3 visits each, where 3 is the maximum number of visits in the batch. The padded visit is marked with asterisks.

```
[ [ [0, 1, 0], [1, 0, 1], *[0, 0, 0]* ], 
  [ [0, 0, 1], [0, 1, 1],  [0, 1, 1] ] ]
```

In [None]:
def collate_fn(data):
    """
    TODO: Collate the list of samples into batches. For each patient, you need to pad the diagnosis
        sequences to the same shape (total # diagnosis codes, max # visits).
    
    Arguments:
        data: a list of samples fetched from `CustomDataset`
        
    Outputs:
        x: a tensor of shape (# patients, total # diagnosis codes, max # visits) of type torch.float
        y: a tensor of shape (# patients) of type torch.float
        
    Note that you can obtain the list of diagnosis code sequences and the list of mortality labels
        using: `sequences, labels = zip(*data)`
    """

    sequences, labels = zip(*data)

    y = torch.tensor(labels, dtype=torch.float)
    
    num_patients = len(sequences)
    num_visits = [patient.shape[0] for patient in sequences]
    total_num_codes = sequences[0].shape[1]

    max_num_visits = max(num_visits)
    
    x = torch.zeros((num_patients, total_num_codes, max_num_visits), dtype=torch.float)

    for i_patient, patient in enumerate(sequences):
        for j_visit, visit in enumerate(patient):
            # your code here
            raise NotImplementedError
    
    return x, y

In [None]:
'''
AUTOGRADER CELL. DO NOT MODIFY THIS.
'''

from torch.utils.data import DataLoader


loader = DataLoader(train_dataset, batch_size=4, collate_fn=collate_fn)
loader_iter = iter(loader)
x, y = next(loader_iter)

assert x.dtype == torch.float
assert y.dtype == torch.float

assert x.shape[:-1] == (4, 271)
assert y.shape == (4,)

for i in range(4):
    real_x, real_y = train_dataset[i]
    for j in range(real_x.shape[0]):
        visit = real_x[j]
        got = x[i, :, j]
        assert all(visit == got)
        assert real_y == y[i]



We need to pad the sequences to the same length so that we can do batch training on a GPU, which runs much faster. If the sequences had different lengths, we would have to process them one by one, which is extremely slow, especially with a large dataset.

You may also wonder whether this padding adds extra noise to the dataset (because we change the number of visits for some patients). The answer is: it depends. Sometimes, padding will bring in some noise and we need a separate mask to remove the noise later (you will see this in the next lab).

In this lab, however, it does not matter, because a convolution window that covers only zero-padded visits outputs zero: any weight times zero is still zero (assuming the layer has no bias parameter).
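The effect of zero padding on a bias-free convolution can be checked on a toy input. The sketch below uses made-up sizes (3 codes, 2 real visits, 3 padded visits) and compares the convolution output with and without padding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy sizes chosen for illustration only: 3 codes, 4 output channels.
conv = nn.Conv1d(in_channels=3, out_channels=4, kernel_size=2, bias=False)

patient = torch.randint(0, 2, (1, 3, 2)).float()  # 2 real visits
padded = F.pad(patient, (0, 3))                   # append 3 all-zero visits

with torch.no_grad():
    out_real = conv(patient)    # shape (1, 4, 1)
    out_padded = conv(padded)   # shape (1, 4, 4)

# Position 0 covers the two real visits, so it is unchanged by the padding.
print(torch.allclose(out_padded[..., :1], out_real))
# Positions 2 and 3 cover only padded visits, so they are exactly zero.
print(out_padded[..., 2:])
# Position 1 spans the last real visit and the first padded visit,
# so it depends only on the last real visit and can be nonzero.
```

With `bias=True`, the all-padding positions would instead equal the bias of each output channel, which is why the model in this lab disables the bias.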

### Data Loader

Now, we can load the dataset into the data loader.

In [None]:
from torch.utils.data import DataLoader

# how many samples per batch to load
batch_size = 4

# prepare dataloaders
train_loader = DataLoader(train_dataset, batch_size=batch_size, collate_fn=collate_fn, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, collate_fn=collate_fn)

print("# of train batches:", len(train_loader))
print("# of test batches:", len(test_loader))

In [None]:
train_iter = iter(train_loader)
x, y = next(train_iter)

print('Shape of a batch x:', x.shape)
print('Shape of a batch y:', y.shape)

### Build the Model [20 points]

Now, let us build a 1D CNN model. For each patient, the CNN model will take an input tensor of shape (total # of codes, # of visits) and produce a single output value between 0 and 1, the predicted probability of mortality (the label is 0 for non-mortality and 1 for mortality). The detailed model architecture is shown in the table below.

Layers | Configuration | Activation Function
--- | --- | ---
convolution | in channels 271, out channels 32, kernel size 2, stride 1, padding 0, bias False | -
dropout | probability 0.5 | - 
fully connected | input size 32, output size 1 | Sigmoid

Note that you have to set `bias=False` for the convolution layer. Only in this way can we ignore the noise introduced by padding.

In [None]:
"""
TODO: Build the CNN shown above.
HINT: Consider using `nn.Conv1d`, `nn.MaxPool1d`, `nn.Dropout`, `nn.Linear`, `torch.sigmoid`.
"""

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        
        # DO NOT change the names
        self.conv = None
        self.dropout = None
        self.fc = None
        
        # your code here
        raise NotImplementedError

    def forward(self, x):
        """
        TODO: 1. pass x through the convolution layer
              2. pass x through the dropout layer
              3. sum x by the last dimension (i.e., visits)
              4. pass x through the linear and sigmoid layer
        """
        # your code here
        raise NotImplementedError

In [None]:
# initialize the CNN
model = Net()
print(model)

In [None]:
'''
AUTOGRADER CELL. DO NOT MODIFY THIS.
'''

model = Net()

assert model.conv.in_channels == 271
assert model.conv.out_channels == 32
assert model.conv.kernel_size == (2,)
assert model.conv.stride == (1,)
assert model.conv.padding == (0,)
assert model.conv.bias is None
assert model.fc.in_features == 32
assert model.fc.out_features == 1

train_iter = iter(train_loader)
x, y = next(train_iter)
output = model.forward(x)
assert output.shape == (4, 1), "Net() is wrong!"



Now that we have a network, let's see what happens when we pass in some data.

In [None]:
model = Net()

# Grab some data 
train_iter = iter(train_loader)
x, y = next(train_iter)

# Forward pass through the network
output = model.forward(x)

print('Input x shape:', x.shape)
print('Output shape: ', output.shape)

### Train the Network [20 points]

In this step, you will train the CNN model.

In [None]:
"""
TODO: Define the loss (BCELoss), assign it to `criterion`.

REFERENCE: https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html#torch.nn.BCELoss
"""

criterion = None

# your code here
raise NotImplementedError

In [None]:
"""
TODO: Define the optimizer (SGD) with learning rate 0.01, assign it to `optimizer`.

REFERENCE: https://pytorch.org/docs/stable/optim.html
"""

optimizer = None

# your code here
raise NotImplementedError

In [None]:
'''
AUTOGRADER CELL. DO NOT MODIFY THIS.
'''

assert type(criterion) is nn.modules.loss.BCELoss, "criterion is not BCELoss!"
assert type(optimizer) is torch.optim.SGD, "optimizer is not SGD!"
assert optimizer.param_groups[0]['lr'] == 0.01, "learning rate is not 0.01!"



Now we can train the model. The following cells are the same as in the previous lab.

In [None]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score

#input: Y_score,Y_pred,Y_true
#output: accuracy, auc, precision, recall, f1-score
def classification_metrics(Y_score, Y_pred, Y_true):
    acc, auc, precision, recall, f1score = accuracy_score(Y_true, Y_pred), \
                                           roc_auc_score(Y_true, Y_score), \
                                           precision_score(Y_true, Y_pred), \
                                           recall_score(Y_true, Y_pred), \
                                           f1_score(Y_true, Y_pred)
    return acc, auc, precision, recall, f1score


#input: model, loader
def evaluate(model, loader):
    model.eval()
    all_y_true = torch.LongTensor()
    all_y_pred = torch.LongTensor()
    all_y_score = torch.FloatTensor()
    for x, y in loader:
        # pass the input through the model
        y_hat = model(x)
        # convert shape from [batch size, 1] to [batch size]
        y_hat = y_hat.view(y_hat.shape[0])
        y_pred = (y_hat > 0.5).type(torch.float)
        all_y_true = torch.cat((all_y_true, y.to('cpu')), dim=0)
        all_y_pred = torch.cat((all_y_pred,  y_pred.to('cpu')), dim=0)
        all_y_score = torch.cat((all_y_score,  y_hat.to('cpu')), dim=0)
        
    acc, auc, precision, recall, f1 = classification_metrics(all_y_score.detach().numpy(), 
                                                             all_y_pred.detach().numpy(), 
                                                             all_y_true.detach().numpy())
    print(f"acc: {acc:.3f}, auc: {auc:.3f}, precision: {precision:.3f}, recall: {recall:.3f}, f1: {f1:.3f}")
    return

In [None]:
print("model performance before training:")
evaluate(model, train_loader)
evaluate(model, test_loader)

In [None]:
# number of epochs to train the model
# feel free to change this
n_epochs = 10

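for epoch in range(n_epochs):

    # put the model in training mode at the start of every epoch,
    # because evaluate() below switches it to eval mode (which disables dropout)
    model.train()

    train_loss = 0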
    for x, y in train_loader:
        """ Step 1. clear gradients """
        optimizer.zero_grad()
        """  Step 2. perform forward pass using `model`, save the output to y_hat """
        y_hat = model(x)
        """ Step 3. calculate the loss using `criterion`, save the output to loss. """
        # convert shape from [batch size, 1] to [batch size]
        y_hat = y_hat.view(y_hat.shape[0])
        loss = criterion(y_hat, y)
        """ Step 4. backward pass """
        loss.backward()
        """ Step 5. optimization """
        optimizer.step()
        """ Step 6. record loss """
        train_loss += loss.item()
        
    train_loss = train_loss / len(train_loader)
    print('Epoch: {} \tTraining Loss: {:.6f}'.format(epoch+1, train_loss))
    evaluate(model, train_loader)
    evaluate(model, test_loader)

The results are poor because the dataset is very small, and the model overfits the training data quickly.

You are encouraged to try this on the whole MIMIC-III dataset. The result will be much more promising!