# Q3 - Representation Learning

We want low-dimensional representations that transfer to similar
datasets.

For the input data, I use the time-grid representation from Q1.3. (Maybe we should explain why?)

2. Freeze/fix the weights of your pretrained network and compute a single embedding
vector for each patient. Train a logistic regression (i.e. a linear probe) on the training
set to predict the target only from your pretrained embeddings. Compare your results
to the supervised performances obtained in prior tasks. (2 pts)


In [100]:
import pandas as pd
import numpy as np
from project_1.config import PROCESSED_DATA_DIR, PROJ_ROOT
from project_1.loading import *
from project_1.dataset import *


SEED = 42
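
`SEED` is defined above, but the cells shown here do not pass it to PyTorch, so weight initialisation, dropout masks and batch shuffling may differ between runs. A minimal sketch of a seeding helper follows; the name `set_seed` is hypothetical and not part of `project_1`.

```python
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    """Seed the Python, NumPy and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # Safe to call without a GPU; the call is ignored if CUDA is unavailable.
    torch.cuda.manual_seed_all(seed)


# Re-seeding reproduces the same random draw.
set_seed(42)
first = torch.rand(3)
set_seed(42)
second = torch.rand(3)
print(torch.equal(first, second))  # True
```

Calling the helper once before building the model and the `DataLoader` objects should make runs on the same machine more repeatable, although some GPU kernels can remain non-deterministic.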

## Q3.1 - Pretraining and Linear Probes

In [101]:
set_a, set_b, set_c = load_final_data_without_ICU()
death_a, death_b, death_c = load_outcomes()

Shapes of the datasets:
Set A: (183416, 42) Set B: (183495, 42) Set C: (183711, 42)
Shapes of labels:
Set A: (4000, 2) Set B: (4000, 2) Set C: (4000, 2)


I will train an encoder model. To keep the comparison fair (as requested), the architecture builds on Attempt 1 from file 5_fb_rnn, the RNN-type model that achieved the highest performance (2-layer LSTM, 64 hidden units, dropout 0.3).

How do you monitor this pre-training step?

We monitor the reconstruction loss (MSE) over epochs. A decreasing loss indicates that the model is learning to reconstruct its input, which we take as a proxy for learning useful representations.
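
The training loop in this notebook reports the training loss only. As a sketch of how the same MSE could also be tracked on a held-out set, the helper below averages the per-batch reconstruction loss without updating any weights. The names `reconstruction_loss`, `TinyAutoencoder` and `toy_loader` are hypothetical; the toy model only mimics the `(reconstruction, latent)` return convention of the autoencoder used here.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def reconstruction_loss(model, loader, criterion, device="cpu"):
    """Mean per-batch reconstruction loss over a loader, with no gradient updates."""
    model.eval()  # disable dropout while evaluating
    total = 0.0
    with torch.no_grad():
        for data, _ in loader:
            data = data.to(device)
            recon, _ = model(data)
            total += criterion(recon, data).item()
    return total / len(loader)


class TinyAutoencoder(nn.Module):
    """Toy stand-in that returns (reconstruction, latent) like the real model."""

    def __init__(self, n_features=40, latent=8):
        super().__init__()
        self.enc = nn.Linear(n_features, latent)
        self.dec = nn.Linear(latent, n_features)

    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z


torch.manual_seed(0)
# Random data shaped like the notebook's tensors: (patients, time steps, features).
toy_loader = DataLoader(
    TensorDataset(torch.randn(32, 49, 40), torch.zeros(32)), batch_size=16
)
toy_model = TinyAutoencoder()
val_loss = reconstruction_loss(toy_model, toy_loader, nn.MSELoss())
print(f"Validation loss: {val_loss:.4f}")
```

Calling such a helper on the validation loader after each epoch, and switching back with `model.train()` afterwards, would show whether the training loss and the held-out loss start to diverge.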


Because training performance was poor, we opted for a slightly modified version of the LSTM encoder architecture:

Layers:
* LSTM:
  - hidden_size=64
  - num_layers=2 (stacked)
  - dropout=0.3 (between layers)
  - batch_first=True
  - Outputs the final hidden state of the last layer (hn[-1])

* Fully Connected:
  - nn.Linear(64, 64)
  - Projects to the latent space

* Normalization/Activation:
  - LayerNorm (over hidden_size)
  - ReLU activation
  - Dropout (p=0.3)

Hyperparameters:
* input_size=40 (matches data)
* hidden_size=64 (latent dim)
* num_layers=2 (deep LSTM)
* dropout=0.3 (regularization)
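
To make the role of `hn[-1]` concrete, the sketch below traces tensor shapes through a bare `nn.LSTM` configured with the hyperparameters listed above. The batch size of 8 and the random input are assumptions for illustration only; the sequence length of 49 matches the dataset tensor used in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=40, hidden_size=64, num_layers=2,
               dropout=0.3, batch_first=True)
lstm.eval()  # switch off inter-layer dropout for a deterministic trace

x = torch.randn(8, 49, 40)           # (batch, seq_len, input_size)
out, (hn, cn) = lstm(x)

print(tuple(out.shape))              # (8, 49, 64): top-layer output at every step
print(tuple(hn.shape))               # (2, 8, 64): final hidden state of each layer
last_hidden = hn[-1]                 # final hidden state of the top layer
print(tuple(last_hidden.shape))      # (8, 64): one vector per patient

# For a unidirectional LSTM, hn[-1] is the top-layer output at the last step.
print(torch.allclose(last_hidden, out[:, -1, :]))  # True
```

This is why the encoder can summarise a whole 49-step sequence as a single 64-dimensional vector before the fully connected projection.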



In [102]:
train_dataset = create_dataset_from_timeseries(set_a, death_a["In-hospital_death"])
validation_dataset = create_dataset_from_timeseries(set_b, death_b["In-hospital_death"])
test_dataset = create_dataset_from_timeseries(set_c, death_c["In-hospital_death"])

train_dataset.tensors[0].shape # (batch_size, seq_len, input_size)

torch.Size([4000, 49, 40])

In [103]:
# Convert to DataLoader
from torch.utils.data import DataLoader
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
validation_loader = DataLoader(validation_dataset, batch_size=64, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

In [104]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# 1. Define the Encoder
class LSTMEncoder(nn.Module):
    def __init__(self, input_size, hidden_size=64, num_layers=2, dropout=0.3):
        super(LSTMEncoder, self).__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers,
                           batch_first=True, dropout=dropout if num_layers > 1 else 0)
        self.fc = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(hidden_size)
        
    def forward(self, x):
        _, (hidden, _) = self.lstm(x)
        last_hidden = hidden[-1]
        out = self.fc(last_hidden)
        out = self.layer_norm(out)
        out = torch.relu(out)
        out = self.dropout(out)
        return out

# 2. Define the Decoder
class LSTMDecoder(nn.Module):
    def __init__(self, hidden_size, output_size, num_layers=2, dropout=0.3):
        super(LSTMDecoder, self).__init__()
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers,
                           batch_first=True, dropout=dropout if num_layers > 1 else 0)
        self.fc = nn.Sequential(
            nn.Linear(hidden_size, hidden_size*2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size*2, output_size)
        )
        
    def forward(self, x, seq_len):
        x = x.unsqueeze(1).repeat(1, seq_len, 1)
        out, _ = self.lstm(x)
        return self.fc(out)

# 3. Define the Autoencoder
class Seq2SeqAutoencoder(nn.Module):
    def __init__(self, input_size, hidden_size=64, num_layers=2, dropout=0.3):
        super(Seq2SeqAutoencoder, self).__init__()
        self.encoder = LSTMEncoder(input_size, hidden_size, num_layers, dropout)
        self.decoder = LSTMDecoder(hidden_size, input_size, num_layers, dropout)
    
    def forward(self, x):
        latent = self.encoder(x)
        recon_x = self.decoder(latent, x.size(1))
        return recon_x, latent

In [105]:
# 4. Configuration
input_size = 40  # Matches your data's last dimension
hidden_size = 64
num_layers = 2
dropout = 0.3
n_epochs = 100
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# 5. Initialize model
model = Seq2SeqAutoencoder(input_size, hidden_size, num_layers, dropout).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()


In [106]:
# 6. Training Loop with DataLoader (Training Loss Only)
print("Training Autoencoder...")
for epoch in range(n_epochs):
    model.train()
    train_loss = 0.0
    
    for batch_idx, (data, _) in enumerate(train_loader):
        data = data.to(device)
        optimizer.zero_grad()
        
        # Forward pass
        recon_x, _ = model(data)
        loss = criterion(recon_x, data)
        
        # Backward pass
        loss.backward()
        optimizer.step()
        
        train_loss += loss.item()
    
    # Print training progress
    if (epoch+1) % 10 == 0:
        avg_train_loss = train_loss / len(train_loader)
        print(f'Epoch {epoch+1}/{n_epochs} | Train Loss: {avg_train_loss:.4f}')

Training Autoencoder...
Epoch 10/100 | Train Loss: 23.3340
Epoch 20/100 | Train Loss: 19.1827
Epoch 30/100 | Train Loss: 17.8545
Epoch 40/100 | Train Loss: 17.4380
Epoch 50/100 | Train Loss: 16.5605
Epoch 60/100 | Train Loss: 13.7803
Epoch 70/100 | Train Loss: 17.4921
Epoch 80/100 | Train Loss: 13.8805
Epoch 90/100 | Train Loss: 10.9647
Epoch 100/100 | Train Loss: 10.6347


In [116]:
import torch
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

# 1. Freeze the pretrained autoencoder
model.eval()
for param in model.parameters():
    param.requires_grad = False

# 2. Extract embeddings
def get_embeddings(dataloader):
    model.eval()
    embeddings, labels = [], []
    with torch.no_grad():
        for data, target in dataloader:
            data = data.to(device)
            _, latent = model(data)
            embeddings.append(latent.cpu().numpy())
            labels.append(target.numpy())
    return np.vstack(embeddings), np.concatenate(labels)

X_train, y_train = get_embeddings(train_loader)
X_val, y_val = get_embeddings(validation_loader)
X_test, y_test = get_embeddings(test_loader)

# 3. Train Logistic Regression
logreg = LogisticRegression(max_iter=1000, random_state=42)
logreg.fit(X_train, y_train)

# 4. Evaluate and print results
def print_metrics(X, y, set_name):
    y_proba = logreg.predict_proba(X)[:, 1]
    auroc = roc_auc_score(y, y_proba)
    auprc = average_precision_score(y, y_proba)
    print(f"{set_name}: AUROC = {auroc:.4f}, AUPRC = {auprc:.4f}")

print("\nPerformance Metrics:")
print_metrics(X_train, y_train, "Training")
print_metrics(X_val, y_val, "Validation")
print_metrics(X_test, y_test, "Test")

# 5. Print formatted table
print("\nSummary Table:")
print("+------------+--------+--------+")
print("| Dataset    | AUROC  | AUPRC  |")
print("+------------+--------+--------+")
for name, X, y in [("Training", X_train, y_train),
                   ("Validation", X_val, y_val),
                   ("Test", X_test, y_test)]:
    y_proba = logreg.predict_proba(X)[:, 1]
    auroc = roc_auc_score(y, y_proba)
    auprc = average_precision_score(y, y_proba)
    print(f"| {name:<10} | {auroc:.4f} | {auprc:.4f} |")
print("+------------+--------+--------+")


Performance Metrics:
Training: AUROC = 0.6663, AUPRC = 0.2805
Validation: AUROC = 0.6511, AUPRC = 0.2780
Test: AUROC = 0.6145, AUPRC = 0.2477

Summary Table:
+------------+--------+--------+
| Dataset    | AUROC  | AUPRC  |
+------------+--------+--------+
| Training   | 0.6663 | 0.2805 |
| Validation | 0.6511 | 0.2780 |
| Test       | 0.6145 | 0.2477 |
+------------+--------+--------+
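
When reading these numbers, it helps to remember that the chance level for AUROC is 0.5, while the chance level for AUPRC is approximately the fraction of positive labels. The sketch below prints that baseline next to the probe metrics and shows one variation that may be worth trying: standardising the embeddings and using `class_weight="balanced"`. It runs on synthetic data, so `X_demo`, `y_demo` and the 15% positive rate are assumptions, not values taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for (n_patients, 64) embeddings with imbalanced labels.
n_samples, n_dims = 1000, 64
y_demo = (rng.random(n_samples) < 0.15).astype(int)
X_demo = rng.normal(size=(n_samples, n_dims))
X_demo[:, 0] += 2.0 * y_demo  # make one dimension informative

X_fit, y_fit = X_demo[:800], y_demo[:800]
X_eval, y_eval = X_demo[800:], y_demo[800:]

# Linear probe: standardise features, then a class-weighted logistic regression.
probe = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42),
)
probe.fit(X_fit, y_fit)

proba = probe.predict_proba(X_eval)[:, 1]
auroc = roc_auc_score(y_eval, proba)
auprc = average_precision_score(y_eval, proba)
baseline_auprc = y_eval.mean()  # expected AUPRC of a random ranking

print(f"AUROC = {auroc:.4f} (chance: 0.5)")
print(f"AUPRC = {auprc:.4f} (chance: {baseline_auprc:.4f})")
```

On the real embeddings, the same comparison only needs `y_test.mean()` as the AUPRC reference. Whether scaling or class weighting changes the probe's ranking metrics here would have to be checked empirically.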
