# Multi-Layer Perceptron for SIDER Multi-Label Classification

This notebook implements a Multi-Layer Perceptron (MLP) model for predicting drug side effects using the SIDER dataset. The model performs multi-label classification across 27 different side effect categories.

## 1. Imports and Setup

In [None]:
# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

# Scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import roc_auc_score, classification_report, hamming_loss

# PyTorch and PyTorch Lightning
import torch
import torch.nn as nn
import pytorch_lightning as pl
from torch.utils.data import Dataset, DataLoader, random_split
from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping

# Custom modules
import sys
sys.path.append('..')
from Classification.src.sider_preprocessing import sider_preprocessing
from Classification.src.sider_featurizer import featurizer

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## 2. Data Loading and Preprocessing

In [None]:
# Load the SIDER dataset
df = pd.read_csv('../data/raw/sider.csv')
print(f"Original dataset shape: {df.shape}")
print(f"Number of molecules: {len(df)}")
print(f"Number of side effect categories: {len(df.columns) - 1}")

# Display first few rows
df.head()

In [None]:
# Preprocess the data (clean and canonicalize SMILES)
df_cleaned = sider_preprocessing(df)
print(f"\nCleaned dataset shape: {df_cleaned.shape}")

In [None]:
# Generate molecular features
df_final = featurizer(df=df_cleaned, mol_col='Molecule', fpSize=2048)
print(f"\nFinal dataset shape with features: {df_final.shape}")

## 3. Feature and Target Preparation

In [None]:
# Separate features and targets
X = df_final.iloc[:, 29:].copy()  # Molecular features
y = df_final.iloc[:, 2:29]        # 27 side effect labels

# Select only numeric features
X = X.select_dtypes(include=np.number)

# Remove zero-variance features
selector = VarianceThreshold(threshold=0.0)
X_cleaned_array = selector.fit_transform(X)
X = pd.DataFrame(X_cleaned_array, columns=X.columns[selector.get_support()])

print(f"Feature matrix shape: {X.shape}")
print(f"Target matrix shape: {y.shape}")
print(f"\nNumber of features after variance filtering: {X.shape[1]}")
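
To make the variance filter concrete, here is a minimal sketch on a toy frame. The column names (`bit_0`, `bit_1`, `bit_2`) are made up for illustration and are not taken from the featurizer output. `VarianceThreshold(threshold=0.0)` drops columns whose value never changes, and `get_support()` returns the boolean mask that recovers the surviving column names.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy feature frame: bit_1 is constant, so it carries no information
toy = pd.DataFrame({
    "bit_0": [0, 1, 0, 1],
    "bit_1": [1, 1, 1, 1],   # zero variance
    "bit_2": [0, 0, 1, 1],
})

toy_selector = VarianceThreshold(threshold=0.0)
kept_array = toy_selector.fit_transform(toy)

# get_support() is a boolean mask over the original columns
kept_columns = list(toy.columns[toy_selector.get_support()])
toy_filtered = pd.DataFrame(kept_array, columns=kept_columns, index=toy.index)

print(kept_columns)
print(toy_filtered.shape)
```

Passing `index=toy.index` keeps the row labels of the filtered frame aligned with the original frame, which can matter if features and targets are later joined by index rather than by position.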

In [None]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=None
)

# Feature scaling (important for neural networks)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set size: {X_train_scaled.shape[0]} samples")
print(f"Test set size: {X_test_scaled.shape[0]} samples")
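
The scaler is fitted on the training split only and then reused on the test split. A small sketch with synthetic data (the array sizes and distribution parameters are arbitrary assumptions) shows that `transform` applies the stored training mean and scale, so no test-set statistics leak into preprocessing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
toy_train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
toy_test = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean_ and scale_ from train only
toy_test_scaled = toy_scaler.transform(toy_test)        # reuses the training statistics

# transform() is (x - mean_) / scale_ with the statistics stored during fit
manual = (toy_test - toy_scaler.mean_) / toy_scaler.scale_
print(np.allclose(toy_test_scaled, manual))

# Training columns end up with mean ~0 and std ~1; test columns only approximately so
print(toy_train_scaled.mean(axis=0).round(6), toy_train_scaled.std(axis=0).round(6))
```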

## 4. PyTorch Dataset and DataLoader

In [None]:
class SIDERDataset(Dataset):
    """PyTorch Dataset wrapper for SIDER data."""
    
    def __init__(self, X, y):
        # as_tensor with an explicit dtype also handles integer label arrays
        self.X = torch.as_tensor(np.asarray(X), dtype=torch.float32)
        self.y = torch.as_tensor(np.asarray(y), dtype=torch.float32)
    
    def __len__(self):
        return len(self.X)
    
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

In [None]:
# Create PyTorch datasets
train_dataset = SIDERDataset(X_train_scaled, y_train)
test_dataset = SIDERDataset(X_test_scaled, y_test)

# Create validation split from training data (90-10 split)
train_size = int(0.9 * len(train_dataset))
val_size = len(train_dataset) - train_size
train_subset, val_subset = random_split(
    train_dataset, [train_size, val_size],
    generator=torch.Generator().manual_seed(42)
)

# Create data loaders
batch_size = 32
train_loader = DataLoader(train_subset, batch_size=batch_size, shuffle=True, num_workers=0)
val_loader = DataLoader(val_subset, batch_size=batch_size, shuffle=False, num_workers=0)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, num_workers=0)

print(f"Training samples: {len(train_subset)}")
print(f"Validation samples: {len(val_subset)}")
print(f"Test samples: {len(test_dataset)}")
print(f"Number of batches - Train: {len(train_loader)}, Val: {len(val_loader)}, Test: {len(test_loader)}")
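
The sample and batch counts printed above follow from simple arithmetic. The sketch below reproduces it in plain Python: the validation split uses `int(0.9 * n)` as in the cell above, and a `DataLoader` with the default `drop_last=False` yields `ceil(n / batch_size)` batches. The function name and the example sizes (800 training rows, 200 test rows) are hypothetical, not the real SIDER counts.

```python
import math

def loader_sizes(n_train_full, n_test, train_frac=0.9, batch_size=32):
    """Sample and batch counts for the train/val/test loaders."""
    n_train = int(train_frac * n_train_full)   # same rounding as the notebook
    n_val = n_train_full - n_train
    samples = {"train": n_train, "val": n_val, "test": n_test}
    # With drop_last=False the final, possibly smaller, batch is kept
    batches = {name: math.ceil(n / batch_size) for name, n in samples.items()}
    return samples, batches

toy_samples, toy_batches = loader_sizes(n_train_full=800, n_test=200)
print(toy_samples)
print(toy_batches)
```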

## 5. MLP Model Architecture

In [None]:
class MLP_SIDER(pl.LightningModule):
    """
    Multi-Layer Perceptron for SIDER multi-label classification.
    
    Architecture:
    - Input layer: molecular features
    - Multiple hidden layers with ReLU activation, batch normalization, and dropout
    - Output layer: 27 neurons (one for each side effect)
    - Raw logits as output; the sigmoid is applied inside BCEWithLogitsLoss
      during training and explicitly when computing probabilities
    """
    
    def __init__(self, input_dim, hidden_dims=(512, 256, 128), out_dim=27, dropout=0.3, lr=0.001):
        super().__init__()
        self.save_hyperparameters()
        self.lr = lr
        
        # Build MLP architecture dynamically
        layers = []
        prev_dim = input_dim
        
        # Add hidden layers with activation, batch norm, and dropout
        for hidden_dim in hidden_dims:
            layers.append(nn.Linear(prev_dim, hidden_dim))
            layers.append(nn.ReLU())
            layers.append(nn.BatchNorm1d(hidden_dim))
            layers.append(nn.Dropout(dropout))
            prev_dim = hidden_dim
        
        # Output layer (no activation - using BCEWithLogitsLoss)
        layers.append(nn.Linear(prev_dim, out_dim))
        
        self.network = nn.Sequential(*layers)
        
        # Loss function for multi-label classification
        self.criterion = nn.BCEWithLogitsLoss()
        
        # Store predictions for epoch-end metrics
        self.validation_outputs = []
        self.test_outputs = []
        
    def forward(self, x):
        return self.network(x)
    
    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = self.criterion(logits, y)
        self.log('train_loss', loss, on_step=False, on_epoch=True, prog_bar=True)
        return loss
    
    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = self.criterion(logits, y)
        self.log('val_loss', loss, on_step=False, on_epoch=True, prog_bar=True)
        
        # Store outputs for epoch-end AUC calculation
        probs = torch.sigmoid(logits)
        self.validation_outputs.append({'preds': probs, 'targets': y})
        return loss
    
    def on_validation_epoch_end(self):
        if len(self.validation_outputs) > 0:
            # Concatenate all predictions and targets
            all_preds = torch.cat([x['preds'] for x in self.validation_outputs], dim=0)
            all_targets = torch.cat([x['targets'] for x in self.validation_outputs], dim=0)
            
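            # Calculate macro AUC-ROC. roc_auc_score can raise ValueError when a
            # label has only one class in the pooled validation targets
            # (e.g. during Lightning's short sanity-check run).
            try:
                macro_auc = roc_auc_score(
                    all_targets.cpu().numpy(), 
                    all_preds.cpu().numpy(), 
                    average='macro'
                )
                self.log('val_macro_auc', macro_auc, on_epoch=True, prog_bar=True)
            except ValueError:
                pass  # Skip logging for this epoch if AUC is undefined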
            
            # Clear outputs
            self.validation_outputs.clear()
    
    def test_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        probs = torch.sigmoid(logits)
        self.test_outputs.append({'preds': probs, 'targets': y})
        return {'preds': probs, 'targets': y}
    
    def on_test_epoch_end(self):
        # Concatenate all predictions and targets
        all_preds = torch.cat([x['preds'] for x in self.test_outputs], dim=0)
        all_targets = torch.cat([x['targets'] for x in self.test_outputs], dim=0)
        
        # Calculate metrics
        macro_auc = roc_auc_score(
            all_targets.cpu().numpy(), 
            all_preds.cpu().numpy(), 
            average='macro'
        )
        self.log('test_macro_auc', macro_auc)
        
        # Clear outputs
        self.test_outputs.clear()
    
    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.lr, weight_decay=1e-5)
        # `verbose` is omitted because it is deprecated in recent PyTorch releases
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
            optimizer, mode='min', factor=0.5, patience=5
        )
        return {
            'optimizer': optimizer,
            'lr_scheduler': {
                'scheduler': scheduler,
                'monitor': 'val_loss'
            }
        }
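
The output layer has no activation because the loss consumes raw logits. As a rough illustration (a NumPy sketch of the math, not PyTorch's source), binary cross-entropy on `sigmoid(x)` can be rearranged into the numerically stable form `max(x, 0) - x*y + log(1 + exp(-|x|))`, which is why working from logits is preferred over applying a sigmoid first. The function names below are hypothetical.

```python
import numpy as np

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from logits (stable form)."""
    x = np.asarray(logits, dtype=np.float64)
    y = np.asarray(targets, dtype=np.float64)
    per_element = np.maximum(x, 0.0) - x * y + np.log1p(np.exp(-np.abs(x)))
    return per_element.mean()

def bce_via_sigmoid(logits, targets):
    """Same loss via an explicit sigmoid; breaks down for large |logit|."""
    x = np.asarray(logits, dtype=np.float64)
    y = np.asarray(targets, dtype=np.float64)
    p = 1.0 / (1.0 + np.exp(-x))
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)).mean()

toy_logits = np.array([[2.0, -1.0, 0.5], [-0.3, 1.2, -2.5]])
toy_targets = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])

stable_loss = bce_with_logits(toy_logits, toy_targets)
naive_loss = bce_via_sigmoid(toy_logits, toy_targets)
print(stable_loss, naive_loss)

# A confidently wrong prediction: the stable form stays finite (about 50),
# whereas the sigmoid route would evaluate log(0)
extreme_loss = bce_with_logits(np.array([50.0]), np.array([0.0]))
print(extreme_loss)
```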

## 6. Model Training

In [None]:
# Initialize MLP model
input_dim = X_train_scaled.shape[1]
mlp_model = MLP_SIDER(
    input_dim=input_dim,
    hidden_dims=[512, 256, 128],  # Three hidden layers
    out_dim=27,                    # 27 side effect categories
    dropout=0.3,                   # Dropout for regularization
    lr=0.001                       # Learning rate
)

print(f"Model architecture:")
print(f"  Input dimension: {input_dim}")
print(f"  Hidden layers: [512, 256, 128]")
print(f"  Output dimension: 27")
print(f"  Total parameters: {sum(p.numel() for p in mlp_model.parameters()):,}")
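
The total parameter count can be checked by hand. In this layer layout, each `nn.Linear` contributes `in * out + out` parameters and each `nn.BatchNorm1d` contributes `2 * hidden` learnable values (scale and shift); ReLU and dropout have none, and batch-norm running statistics are buffers, not parameters. The helper below is a hypothetical sketch, and the input width of 2048 is an assumption, since the real value depends on how many features survive variance filtering.

```python
def count_mlp_parameters(input_dim, hidden_dims, out_dim):
    """Learnable parameters of a Linear -> ReLU -> BatchNorm1d -> Dropout stack."""
    total = 0
    prev = input_dim
    for hidden in hidden_dims:
        total += prev * hidden + hidden   # Linear weights + biases
        total += 2 * hidden               # BatchNorm1d scale and shift
        prev = hidden
    total += prev * out_dim + out_dim     # Output Linear layer
    return total

# 2048 is a placeholder input width for illustration
toy_param_count = count_mlp_parameters(2048, [512, 256, 128], 27)
print(f"{toy_param_count:,}")
```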

In [None]:
# Setup callbacks
checkpoint_callback = ModelCheckpoint(
    monitor='val_loss',
    dirpath='mlp_checkpoints/',
    filename='best-mlp-{epoch:02d}-{val_loss:.4f}',
    save_top_k=1,
    mode='min',
    verbose=True
)

early_stop_callback = EarlyStopping(
    monitor='val_loss',
    patience=15,
    mode='min',
    verbose=True
)

# Initialize trainer
trainer = pl.Trainer(
    max_epochs=100,
    callbacks=[checkpoint_callback, early_stop_callback],
    accelerator='auto',
    devices=1,
    log_every_n_steps=10,
    enable_progress_bar=True
)
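
The `patience=15` setting means training stops once `val_loss` has failed to improve for 15 consecutive validation checks. The following is a simplified model of that rule in plain Python, not Lightning's actual implementation; `early_stop_epoch` and the loss sequence are made up for illustration.

```python
def early_stop_epoch(val_losses, patience, min_delta=0.0):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss      # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1        # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Loss improves for two epochs, then stalls
toy_losses = [1.00, 0.90, 0.95, 0.96, 0.97]
stop_at = early_stop_epoch(toy_losses, patience=3)
print(stop_at)
```

Because the checkpoint callback monitors the same metric, the saved checkpoint corresponds to the best epoch rather than the epoch at which training stopped.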

print("Starting training...")

In [None]:
# Train the model
trainer.fit(mlp_model, train_loader, val_loader)

## 7. Model Evaluation

In [None]:
# Evaluate on test set
print("Evaluating MLP on test set...")

mlp_model.eval()
all_preds = []
all_targets = []

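with torch.no_grad():
    # Use x_batch/y_batch so the label DataFrame `y` (needed later for its
    # column names) is not overwritten; move inputs to the model's device
    for x_batch, y_batch in test_loader:
        logits = mlp_model(x_batch.to(mlp_model.device))
        probs = torch.sigmoid(logits)
        all_preds.append(probs.cpu().numpy())
        all_targets.append(y_batch.cpu().numpy())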

all_preds = np.vstack(all_preds)
all_targets = np.vstack(all_targets)

# Calculate metrics
macro_auc = roc_auc_score(all_targets, all_preds, average='macro')
micro_auc = roc_auc_score(all_targets, all_preds, average='micro')

# Binary predictions for additional metrics
binary_preds = (all_preds > 0.5).astype(int)
hamming = hamming_loss(all_targets, binary_preds)

print(f"\n{'='*50}")
print(f"MLP PERFORMANCE METRICS")
print(f"{'='*50}")
print(f"Macro AUC-ROC: {macro_auc:.4f}")
print(f"Micro AUC-ROC: {micro_auc:.4f}")
print(f"Hamming Loss: {hamming:.4f}")
print(f"{'='*50}")
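
The three metrics above summarize different things. On toy data (values invented for illustration), the sketch below shows that macro AUC-ROC is the plain mean of the per-label AUCs, micro AUC-ROC pools every (sample, label) pair into one binary problem, and Hamming loss is the fraction of individual label entries predicted incorrectly at the 0.5 threshold.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, hamming_loss

toy_true = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
])
toy_prob = np.array([
    [0.9, 0.2, 0.6],
    [0.3, 0.7, 0.4],
    [0.6, 0.4, 0.7],
    [0.2, 0.1, 0.8],
])

# Macro: average the AUC of each label column
per_label = [roc_auc_score(toy_true[:, j], toy_prob[:, j]) for j in range(toy_true.shape[1])]
toy_macro = roc_auc_score(toy_true, toy_prob, average="macro")

# Micro: flatten all entries and compute a single AUC
toy_micro = roc_auc_score(toy_true, toy_prob, average="micro")
toy_micro_flat = roc_auc_score(toy_true.ravel(), toy_prob.ravel())

# Hamming loss: share of label entries that differ after thresholding
toy_pred = (toy_prob > 0.5).astype(int)
toy_hamming = hamming_loss(toy_true, toy_pred)
manual_hamming = (toy_pred != toy_true).mean()

print(toy_macro, toy_micro, toy_hamming)
```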

## 8. Per-Label Performance Analysis

In [None]:
# Calculate AUC-ROC for each label
label_aucs = {}
for i, label in enumerate(y.columns):
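    try:
        auc = roc_auc_score(all_targets[:, i], all_preds[:, i])
        label_aucs[label] = auc
    except ValueError:
        # AUC-ROC is undefined when the test set holds a single class for this label
        label_aucs[label] = np.nan

# Sort by AUC, leaving out labels whose AUC is undefined
sorted_labels = sorted(
    [(label, auc) for label, auc in label_aucs.items() if not np.isnan(auc)],
    key=lambda x: x[1], reverse=True
)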

print("\nPER-LABEL AUC-ROC SCORES")
print("="*60)
print(f"{'Label':<50} {'AUC-ROC':>10}")
print("-"*60)

# Top 5 labels
print("\nTop 5 Best Predicted Labels:")
for label, auc in sorted_labels[:5]:
    if not np.isnan(auc):
        print(f"{label:<50} {auc:>10.4f}")

# Bottom 5 labels (listed from better to worse)
print("\nBottom 5 (Worst Predicted) Labels:")
for label, auc in sorted_labels[-5:]:
    if not np.isnan(auc):
        print(f"{label:<50} {auc:>10.4f}")
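
AUC-ROC is undefined for a label whose test targets are all 0 or all 1. One way to handle this without relying on exception behavior is to check the number of distinct classes first. The helper and label names below are hypothetical; `label_b` is deliberately all zeros.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def per_label_auc(y_true, y_score, label_names):
    """AUC-ROC per label; NaN where the label has only one class in y_true."""
    results = {}
    for j, name in enumerate(label_names):
        column = y_true[:, j]
        if np.unique(column).size < 2:
            results[name] = float("nan")   # AUC-ROC is undefined here
        else:
            results[name] = roc_auc_score(column, y_score[:, j])
    return results

toy_true = np.array([[1, 0], [0, 0], [1, 0], [0, 0]])
toy_score = np.array([[0.9, 0.1], [0.2, 0.3], [0.6, 0.2], [0.3, 0.4]])

toy_aucs = per_label_auc(toy_true, toy_score, ["label_a", "label_b"])
# Rank only the labels with a defined AUC
toy_ranked = sorted(
    [(name, auc) for name, auc in toy_aucs.items() if not np.isnan(auc)],
    key=lambda item: item[1], reverse=True,
)
print(toy_aucs)
print(toy_ranked)
```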

## 9. Model Comparison

In [None]:
# Compare with baseline models from the original analysis
model_scores = {
    'Random Forest': 0.6691,
    'XGBoost': 0.6596,
    'GNN (default)': 0.6408,
    'MLP (current)': macro_auc,
    'Logistic Regression': 0.6213,
    'SVM (linear)': 0.6145,
    'Transformer + RF': 0.6070,
    'GNN (optimized)': 0.5886
}

# Sort by performance
sorted_models = sorted(model_scores.items(), key=lambda x: x[1], reverse=True)

print("\n" + "="*60)
print("MODEL PERFORMANCE COMPARISON (MACRO AUC-ROC)")
print("="*60)
print(f"{'Model':<30} {'AUC-ROC':>10} {'Relative to Best':>15}")
print("-"*60)

best_score = sorted_models[0][1]
for model, score in sorted_models:
    relative = ((score - best_score) / best_score) * 100
    marker = " <-- Current Model" if model == 'MLP (current)' else ""
    print(f"{model:<30} {score:>10.4f} {relative:>14.1f}%{marker}")

## 10. Visualization

In [None]:
# Plot model comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Bar plot of model performances
models = [m[0] for m in sorted_models]
scores = [m[1] for m in sorted_models]
colors = ['green' if m == 'MLP (current)' else 'steelblue' for m in models]

ax1.barh(models, scores, color=colors)
ax1.set_xlabel('Macro AUC-ROC')
ax1.set_title('Model Performance Comparison')
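# Widen the default range if any score falls outside it, so no bar or label is clipped
ax1.set_xlim([min(0.55, min(scores) - 0.02), max(0.70, max(scores) + 0.03)])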
ax1.grid(True, alpha=0.3)

# Add value labels on bars
for i, (model, score) in enumerate(zip(models, scores)):
    ax1.text(score + 0.002, i, f'{score:.4f}', va='center')

# Per-label AUC distribution
label_scores = list(label_aucs.values())
label_scores = [s for s in label_scores if not np.isnan(s)]  # Remove NaN values

ax2.hist(label_scores, bins=15, color='steelblue', edgecolor='black', alpha=0.7)
ax2.axvline(macro_auc, color='red', linestyle='--', linewidth=2, label=f'Macro AUC: {macro_auc:.4f}')
ax2.set_xlabel('AUC-ROC')
ax2.set_ylabel('Number of Labels')
ax2.set_title('Distribution of Per-Label AUC-ROC Scores')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
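# Plot training history from the metrics.csv file written by Lightning's CSVLogger.
# Assumption: the file has an 'epoch' column plus one column per logged metric.
# With another logger (e.g. TensorBoard) no such file exists and the plot is skipped.
import os

log_dir = getattr(trainer.logger, 'log_dir', None)
metrics_file = os.path.join(log_dir, 'metrics.csv') if log_dir else None

if metrics_file and os.path.exists(metrics_file):
    # Metrics are logged on separate rows; average per epoch to align them
    history = pd.read_csv(metrics_file).groupby('epoch').mean(numeric_only=True)
    epochs = history.index + 1

    plt.figure(figsize=(12, 4))

    plt.subplot(1, 2, 1)
    for column, name in [('train_loss', 'Training Loss'), ('val_loss', 'Validation Loss')]:
        if column in history:
            plt.plot(epochs, history[column], label=name)
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.title('Training and Validation Loss')
    plt.legend()
    plt.grid(True, alpha=0.3)

    plt.subplot(1, 2, 2)
    if 'val_macro_auc' in history:
        plt.plot(epochs, history['val_macro_auc'], label='Validation Macro AUC')
    plt.xlabel('Epoch')
    plt.ylabel('Macro AUC-ROC')
    plt.title('Validation Performance')
    plt.legend()
    plt.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()
else:
    print("No metrics.csv found; training history plot skipped.")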

## 11. Summary and Conclusions

### Model Performance
This notebook implements a Multi-Layer Perceptron (MLP) for multi-label classification on the SIDER dataset. The model predicts 27 side effect categories from molecular features.

### Key Findings:
1. **Overall Performance**: Section 7 reports the MLP's macro and micro AUC-ROC and Hamming loss on the held-out test set. Macro AUC-ROC averages the per-label AUC-ROC over all 27 labels, so each side effect category counts equally regardless of how common it is.

2. **Architecture**: The model uses three hidden layers (512, 256, 128 neurons) with ReLU activation, batch normalization, and dropout for regularization.

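3. **Comparison with Baselines**: Section 9 ranks the MLP against previously recorded macro AUC-ROC scores for Random Forest, XGBoost, Logistic Regression, a linear SVM, Transformer + RF, and two GNN variants. Those baselines range from 0.5886 to 0.6691.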

### Strengths:
- Can capture non-linear relationships in the data
- Relatively fast training compared to complex models like GNNs
- Regularized through dropout, batch normalization, and a small weight decay (1e-5)

### Potential Improvements:
1. **Hyperparameter Tuning**: Use Bayesian optimization or grid search to find optimal architecture
2. **Feature Engineering**: Explore additional molecular descriptors or fingerprints
3. **Ensemble Methods**: Combine MLP with other models for better performance
4. **Class Imbalance**: Address imbalanced labels with weighted loss functions
5. **Advanced Architectures**: Explore attention mechanisms or residual connections
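
For the class imbalance item, one common option is a per-label positive weight equal to the ratio of negative to positive examples, which `nn.BCEWithLogitsLoss` accepts through its `pos_weight` argument. The sketch below only computes the weights with NumPy; the helper name and toy labels are hypothetical, and in practice the weights would be derived from the training labels only.

```python
import numpy as np

def positive_class_weights(y):
    """Per-label n_negative / n_positive from a binary (n_samples, n_labels) array."""
    y = np.asarray(y)
    n_pos = y.sum(axis=0)
    n_neg = y.shape[0] - n_pos
    # Guard against labels with no positives to avoid division by zero
    return n_neg / np.maximum(n_pos, 1)

# Toy labels: the first label is mostly positive, the second mostly negative
toy_labels = np.array([
    [1, 0],
    [1, 0],
    [1, 1],
    [0, 0],
])
toy_weights = positive_class_weights(toy_labels)
print(toy_weights)

# Possible use with the loss (not executed here):
# criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(toy_weights, dtype=torch.float32))
```

A weight above 1 increases the penalty for missing positives of a rare label, while a weight below 1 does the opposite for labels that are mostly positive.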

In [None]:
# Save the final model
model_path = 'mlp_final_model.pt'
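# Cast NumPy scalars to plain floats so the file holds only tensors and
# built-in Python types, which keeps it loadable with torch.load's
# restricted (weights-only) unpickler
torch.save({
    'model_state_dict': mlp_model.state_dict(),
    'input_dim': input_dim,
    'hidden_dims': [512, 256, 128],
    'macro_auc': float(macro_auc),
    'micro_auc': float(micro_auc)
}, model_path)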

print(f"Model saved to {model_path}")
print(f"\nFinal Results:")
print(f"  Macro AUC-ROC: {macro_auc:.4f}")
print(f"  Micro AUC-ROC: {micro_auc:.4f}")
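
A saved state dict is only useful if it can be loaded back into a network with the same layout. The round trip below is a minimal sketch using a small stand-in `nn.Sequential` (an assumption made to keep the example self-contained; in the notebook the role is played by `MLP_SIDER`, rebuilt from the stored `input_dim` and `hidden_dims`).

```python
import os
import tempfile

import torch
import torch.nn as nn

def build_toy_net(input_dim):
    """Stand-in network; layer sizes are arbitrary."""
    return nn.Sequential(nn.Linear(input_dim, 4), nn.ReLU(), nn.Linear(4, 3))

toy_model = build_toy_net(8)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_model.pt")
    torch.save({"model_state_dict": toy_model.state_dict(), "input_dim": 8}, toy_path)
    checkpoint = torch.load(toy_path, map_location="cpu")

# Rebuild the same architecture from the stored metadata, then load the weights
restored = build_toy_net(checkpoint["input_dim"])
restored.load_state_dict(checkpoint["model_state_dict"])

toy_model.eval()
restored.eval()
toy_input = torch.randn(5, 8)
with torch.no_grad():
    outputs_match = torch.allclose(toy_model(toy_input), restored(toy_input))
print(outputs_match)
```

For inference on new molecules, the fitted variance selector and scaler would also need to be persisted, since the network expects inputs preprocessed the same way as during training.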