# Q3 - Representation Learning

We want a low-dimensional representations that are transferable to similar
datasets

For the input data I decide to use time grid from Q1.3. (Maybe we should explain why?)

## Q3.1 - Pretraining and Linear Probes

### Q3.1.1

In [None]:
import pandas as pd
import numpy as np
from project_1.config import PROCESSED_DATA_DIR, PROJ_ROOT
from project_1.loading import *
from project_1.dataset import *


SEED = 42 #Non l ho usato ovunque sta variabile seed, figa

set_a, set_b, set_c = load_final_data_without_ICU()
death_a, death_b, death_c = load_outcomes()

Shapes of the datasets:
Set A: (183416, 42) Set B: (183495, 42) Set C: (183711, 42)
Shapes of labels:
Set A: (4000, 2) Set B: (4000, 2) Set C: (4000, 2)



#### Model Basis
To ensure fair comparison (as requested), the architecture builds upon **Attempt 1** from `5_fb_rnn` - the top-performing RNN variant in previous experiments:

- **Base Architecture**: 2-layer LSTM
- **Hidden Units**: 64
- **Dropout**: 0.3

#### Monitoring Approach:
**Reconstruction Loss (MSE)** tracked across epochs




#### Modified LSTM Encoder Architecture

Due to the scarce performances in training (*ask ChatGPT to explain this choice over the vanilla LSTM better*), we opted for a slightly modified version of the LSTM encoder architecture.

#### Architecture Overview

##### 1. LSTM Layer
- **Hidden size**: 64
- **Number of layers**: 2 (stacked)
- **Dropout**: 0.3 (between layers)
- **Batch first**: True
- **Output**: Final hidden state (`hn[-1]`)

##### 2. Fully Connected Layer
- `nn.Linear(64, 64)`
- Projects to latent space

##### 3. Normalization & Activation
- LayerNorm (over hidden_size)
- ReLU activation
- Dropout (p=0.3)

##### Hyperparameters
| Parameter       | Value       | Description                     |
|-----------------|-------------|---------------------------------|
| `input_size`    | 40          | Matches input data dimension    |
| `hidden_size`   | 64          | Latent dimension               |
| `num_layers`    | 2           | Deep LSTM architecture         |
| `dropout`       | 0.3         | Regularization                 |

In [None]:
###############################
##BOND!


#####CHECKA se è ok sti dati caricati cosi 
train_dataset = create_dataset_from_timeseries(set_a, death_a["In-hospital_death"])
validation_dataset = create_dataset_from_timeseries(set_b, death_b["In-hospital_death"])
test_dataset = create_dataset_from_timeseries(set_c, death_c["In-hospital_death"])

train_dataset.tensors[0].shape # (batch_size, seq_len, input_size)

torch.Size([4000, 49, 40])

In [103]:
# Convert to DataLoader
from torch.utils.data import DataLoader
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
validation_loader = DataLoader(validation_dataset, batch_size=64, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

#### Setting up the autoencoder

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# 1. Define the Encoder
class LSTMEncoder(nn.Module):
    def __init__(self, input_size, hidden_size=64, num_layers=2, dropout=0.3):
        super(LSTMEncoder, self).__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers,
                           batch_first=True, dropout=dropout if num_layers > 1 else 0)
        self.fc = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(hidden_size)
        
    def forward(self, x):
        _, (hidden, _) = self.lstm(x)
        last_hidden = hidden[-1]
        out = self.fc(last_hidden)
        out = self.layer_norm(out)
        out = torch.relu(out)
        out = self.dropout(out)
        return out

# 2. Define the Decoder
class LSTMDecoder(nn.Module):
    def __init__(self, hidden_size, output_size, num_layers=2, dropout=0.3):
        super(LSTMDecoder, self).__init__()
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers,
                           batch_first=True, dropout=dropout if num_layers > 1 else 0)
        self.fc = nn.Sequential(
            nn.Linear(hidden_size, hidden_size*2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size*2, output_size)
        )
        
    def forward(self, x, seq_len):
        x = x.unsqueeze(1).repeat(1, seq_len, 1)
        out, _ = self.lstm(x)
        return self.fc(out)

# 3. Define the Autoencoder
class Seq2SeqAutoencoder(nn.Module):
    def __init__(self, input_size, hidden_size=64, num_layers=2, dropout=0.3):
        super(Seq2SeqAutoencoder, self).__init__()
        self.encoder = LSTMEncoder(input_size, hidden_size, num_layers, dropout)
        self.decoder = LSTMDecoder(hidden_size, input_size, num_layers, dropout)
    
    def forward(self, x):
        latent = self.encoder(x)
        recon_x = self.decoder(latent, x.size(1))
        return recon_x, latent

#### Configuration of the model

In [105]:
# 4. Configuration
input_size = 40  # Matches your data's last dimension
hidden_size = 64
num_layers = 2
dropout = 0.3
n_epochs = 100
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# 5. Initialize model
model = Seq2SeqAutoencoder(input_size, hidden_size, num_layers, dropout).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()


#### Let's train!! (Puta madre)

In [106]:
# 6. Training Loop with DataLoader (Training Loss Only)
print("Training Autoencoder...")
for epoch in range(n_epochs):
    model.train()
    train_loss = 0.0
    
    for batch_idx, (data, _) in enumerate(train_loader):
        data = data.to(device)
        optimizer.zero_grad()
        
        # Forward pass
        recon_x, _ = model(data)
        loss = criterion(recon_x, data)
        
        # Backward pass
        loss.backward()
        optimizer.step()
        
        train_loss += loss.item()
    
    # Print training progress
    if (epoch+1) % 10 == 0:
        avg_train_loss = train_loss / len(train_loader)
        print(f'Epoch {epoch+1}/{n_epochs} | Train Loss: {avg_train_loss:.4f}')

Training Autoencoder...
Epoch 10/100 | Train Loss: 23.3340
Epoch 20/100 | Train Loss: 19.1827
Epoch 30/100 | Train Loss: 17.8545
Epoch 40/100 | Train Loss: 17.4380
Epoch 50/100 | Train Loss: 16.5605
Epoch 60/100 | Train Loss: 13.7803
Epoch 70/100 | Train Loss: 17.4921
Epoch 80/100 | Train Loss: 13.8805
Epoch 90/100 | Train Loss: 10.9647
Epoch 100/100 | Train Loss: 10.6347


### Q3.1.1

Freeze/fix the weights of your pretrained network and compute a single embedding
vector for each patient. Train a logistic regression (i.e. a linear probe) on the training
set to predict the target only from your pretrained embeddings. Compare your results
to the supervised performances obtained in prior tasks. 

Prima funzionava, bisogna rirunnare da capo figa la puttana triponina


In [142]:
import torch
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

# 1. Freeze the pretrained autoencoder
model.eval()
for param in model.parameters():
    param.requires_grad = False

# 2. Extract embeddings
def get_embeddings(dataloader):
    model.eval()
    embeddings, labels = [], []
    with torch.no_grad():
        for data, target in dataloader:
            data = data.to(device)
            _, latent = model(data)
            embeddings.append(latent.cpu().numpy())
            # labels.append(target.numpy())
    return np.vstack(embeddings), np.concatenate(labels)

X_train, y_train = get_embeddings(train_loader)
X_val, y_val = get_embeddings(validation_loader)
X_test, y_test = get_embeddings(test_loader)

# 3. Train Logistic Regression
logreg = LogisticRegression(max_iter=1000, random_state=42)
logreg.fit(X_train, y_train)

# 4. Evaluate and print results
def print_metrics(X, y, set_name):
    y_proba = logreg.predict_proba(X)[:, 1]
    auroc = roc_auc_score(y, y_proba)
    auprc = average_precision_score(y, y_proba)
    print(f"{set_name}: AUROC = {auroc:.4f}, AUPRC = {auprc:.4f}")

print("\nPerformance Metrics:")
print_metrics(X_train, y_train, "Training")
print_metrics(X_val, y_val, "Validation")
print_metrics(X_test, y_test, "Test")

# 5. Print formatted table
print("\nSummary Table:")
print("+------------+--------+--------+")
print("| Dataset    | AUROC  | AUPRC  |")
print("+------------+--------+--------+")
for name, X, y in [("Training", X_train, y_train),
                   ("Validation", X_val, y_val),
                   ("Test", X_test, y_test)]:
    y_proba = logreg.predict_proba(X)[:, 1]
    auroc = roc_auc_score(y, y_proba)
    auprc = average_precision_score(y, y_proba)
    print(f"| {name:<10} | {auroc:.4f} | {auprc:.4f} |")
print("+------------+--------+--------+")

ValueError: too many values to unpack (expected 2)

## Q3.2 Simulate label scarcity

Train three different supervised (as in Q2.x) models with the same (or as similar as
possible) architecture as your pretrained network, but only use 100, 500, and 1000
patients from the training set and report your full test set performance (2 pts).
● Train three linear probes (as in Q3.1 step 2) using only 100, 500, 1000 labelled
patients and report the full test set C performance. (2 pts).

In [124]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Subset
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# 1. Define supervised model (same architecture as encoder)
class SupervisedLSTM(nn.Module):
    def __init__(self, input_size, hidden_size=64, num_layers=2, dropout=0.3):
        super(SupervisedLSTM, self).__init__()
        # Same architecture as encoder part
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers,
                          batch_first=True, dropout=dropout if num_layers > 1 else 0)
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.LayerNorm(hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, 1)
        )
        
    def forward(self, x):
        _, (hidden, _) = self.lstm(x)
        last_hidden = hidden[-1]
        return torch.sigmoid(self.classifier(last_hidden)).squeeze()

# 2. Create subset datasets
def create_subset(dataset, n_samples):
    indices = torch.randperm(len(dataset))[:n_samples]
    return Subset(dataset, indices)

# Configuration
input_size = 40
hidden_size = 64
num_layers = 2
dropout = 0.3
n_epochs = 100
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Assuming train_dataset and test_loader are already defined
subset_sizes = [100, 500, 1000]
results = {}

# 3. Train and evaluate models
for size in subset_sizes:
    print(f"\nTraining model with {size} patients...")
    
    # Create subset
    subset = create_subset(train_dataset, size)
    subset_loader = DataLoader(subset, batch_size=64, shuffle=True)
    
    # Initialize model
    model = SupervisedLSTM(input_size, hidden_size, num_layers, dropout).to(device)
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    criterion = nn.BCELoss()
    
    # Training loop
    for epoch in range(n_epochs):
        model.train()
        train_loss = 0.0
        
        for data, target in subset_loader:
            data, target = data.to(device), target.float().to(device)
            optimizer.zero_grad()
            
            outputs = model(data)
            loss = criterion(outputs, target)
            loss.backward()
            optimizer.step()
            
            train_loss += loss.item()
        
        if (epoch+1) % 10 == 0:
            avg_loss = train_loss / len(subset_loader)
            print(f'Epoch {epoch+1}/{n_epochs} | Loss: {avg_loss:.4f}')
    
    # Evaluate on test set
    model.eval()
    y_true, y_proba = [], []
    with torch.no_grad():
        for data, target in test_loader:
            data = data.to(device)
            outputs = model(data).cpu().numpy()
            y_proba.extend(outputs)
            y_true.extend(target.numpy())
    
    auroc = roc_auc_score(y_true, y_proba)
    auprc = average_precision_score(y_true, y_proba)
    results[size] = (auroc, auprc)
    print(f"Test Performance - AUROC: {auroc:.4f}, AUPRC: {auprc:.4f}")

# 4. Print final results
print("\nFinal Test Performance:")
print("+--------+--------+--------+")
print("| N      | AUROC  | AUPRC  |")
print("+--------+--------+--------+")
for size, (auroc, auprc) in results.items():
    print(f"| {size:<6} | {auroc:.4f} | {auprc:.4f} |")
print("+--------+--------+--------+")


Training model with 100 patients...
Epoch 10/100 | Loss: 0.4277
Epoch 20/100 | Loss: 0.1752
Epoch 30/100 | Loss: 0.0389
Epoch 40/100 | Loss: 0.0253
Epoch 50/100 | Loss: 0.0155
Epoch 60/100 | Loss: 0.0137
Epoch 70/100 | Loss: 0.0076
Epoch 80/100 | Loss: 0.0066
Epoch 90/100 | Loss: 0.0059
Epoch 100/100 | Loss: 0.0049
Test Performance - AUROC: 0.6382, AUPRC: 0.2443

Training model with 500 patients...
Epoch 10/100 | Loss: 0.1729
Epoch 20/100 | Loss: 0.0736
Epoch 30/100 | Loss: 0.0350
Epoch 40/100 | Loss: 0.0291
Epoch 50/100 | Loss: 0.0179
Epoch 60/100 | Loss: 0.0636
Epoch 70/100 | Loss: 0.0054
Epoch 80/100 | Loss: 0.0025
Epoch 90/100 | Loss: 0.0017
Epoch 100/100 | Loss: 0.0012
Test Performance - AUROC: 0.6935, AUPRC: 0.3100

Training model with 1000 patients...
Epoch 10/100 | Loss: 0.1490
Epoch 20/100 | Loss: 0.0593
Epoch 30/100 | Loss: 0.0270
Epoch 40/100 | Loss: 0.0162
Epoch 50/100 | Loss: 0.0079
Epoch 60/100 | Loss: 0.0049
Epoch 70/100 | Loss: 0.0083
Epoch 80/100 | Loss: 0.0137
Epoch 

Train three linear probes (as in Q3.1 step 2) using only 100, 500, 1000 labelled
patients and report the full test set C performance. (2 pts).

In [143]:
#da fare still