# SIDER Side Effect Prediction with MACCS Fingerprints

This notebook demonstrates a simplified approach to predicting drug side effects using the SIDER dataset.
We'll use MACCS fingerprints (similar to the BACE example) and build classification models.

**Key Differences from BACE:**
- BACE: Regression (predict continuous pIC50 values)
- SIDER: Multi-label classification (predict 27 different side-effect categories)

For simplicity, we'll start by predicting a single side effect: 'Cardiac disorders'.
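
As a rough illustration of the difference between the multi-label problem and the single-label one tackled here, the sketch below builds a tiny toy table in the same layout (a `smiles` column plus one 0/1 column per category). The rows are made up for illustration and are not taken from `sider.csv`.

```python
import pandas as pd

# Toy multi-label table: one row per molecule, one 0/1 column per category.
# The values are illustrative only, not real SIDER annotations.
toy = pd.DataFrame({
    "smiles": ["CCO", "c1ccccc1", "CC(=O)O"],
    "Cardiac disorders": [1, 0, 1],
    "Hepatobiliary disorders": [0, 0, 1],
})

label_cols = [c for c in toy.columns if c != "smiles"]
Y = toy[label_cols]             # multi-label target: one column per category
y = toy["Cardiac disorders"]    # single binary target used in this notebook

print(Y.shape, y.shape)
```

Picking one column turns the task into ordinary binary classification, which is why standard scikit-learn classifiers can be used without any multi-output wrapper.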

In [None]:
# Install required packages (no DeepChem or TensorFlow needed - we'll use RDKit directly!)
# This installs only what we need without TensorFlow dependencies
!pip install rdkit scikit-learn xgboost pandas numpy matplotlib

In [None]:
# Suppress Python warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

# Nothing in this notebook imports TensorFlow. This setting only silences
# TensorFlow's C++ log output if another package happens to import it.
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from rdkit import Chem
from rdkit.Chem import Draw, MACCSkeys

## 1. Load and Explore the SIDER Dataset

In [None]:
# Load the SIDER dataset
df = pd.read_csv('../data/raw/sider.csv')
df.head()

In [None]:
# Check dataset shape
print(f"Dataset shape: {df.shape}")
print(f"Number of molecules: {df.shape[0]}")
print(f"Number of side-effect categories: {df.shape[1] - 1}")  # -1 for SMILES column

In [None]:
# See all side-effect categories
side_effect_cols = [col for col in df.columns if col != 'smiles']
print("\nSide-effect categories:")
for i, col in enumerate(side_effect_cols, 1):
    print(f"{i}. {col}")

In [None]:
# Check class distribution for 'Cardiac disorders'
target_col = 'Cardiac disorders'
print(f"\nClass distribution for '{target_col}':")
print(df[target_col].value_counts())
print(f"\nPositive samples: {df[target_col].sum()} ({df[target_col].mean()*100:.1f}%)")

## 2. Canonicalize SMILES

Just like in the BACE example, we'll canonicalize SMILES to ensure consistency.
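
To see why canonicalization matters, here is a minimal sketch (using ethanol as an arbitrary example molecule) showing that several valid SMILES strings for the same structure collapse to one canonical string.

```python
from rdkit import Chem

# Three different ways of writing ethanol
variants = ["CCO", "OCC", "C(O)C"]
canonical = [Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in variants]

print(canonical)
print(len(set(canonical)))  # number of distinct canonical strings
```

Because every variant maps to the same string, canonical SMILES can be used to spot duplicate molecules that a plain string comparison on the raw column would miss.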

In [None]:
def canonicalize_smiles(smiles):
    """Convert non-canonical SMILES to canonical form.
    
    Args:
        smiles: str, non-canonical SMILES of a molecule
    
    Returns:
        canonical_smiles: str, canonical SMILES of the molecule
    """
    try:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return None
        canonical_smiles = Chem.MolToSmiles(mol)
        return canonical_smiles
    except Exception:
        # Non-string input (e.g. NaN) makes RDKit raise instead of returning None
        return None

# Apply canonicalization
df['canonical_smiles'] = df['smiles'].apply(canonicalize_smiles)

# Remove any molecules that failed to canonicalize
initial_count = len(df)
df = df.dropna(subset=['canonical_smiles'])
print(f"Removed {initial_count - len(df)} invalid SMILES")
print(f"Final dataset size: {len(df)} molecules")

In [None]:
# Create a simplified dataframe for our task
df_sider = df[['canonical_smiles', target_col]].copy()
df_sider.head()

## 3. Visualize Sample Molecules

In [None]:
# Visualize some molecules with and without cardiac disorders
# Get up to 3 positive and 3 negative samples (fixed seed for reproducibility)
n_pos = min(3, int(df_sider[target_col].sum()))
positive_samples = df_sider[df_sider[target_col] == 1].sample(n_pos, random_state=42)
negative_samples = df_sider[df_sider[target_col] == 0].sample(3, random_state=42)
df_sample = pd.concat([positive_samples, negative_samples])

smiles = df_sample['canonical_smiles'].values
labels = [f"Cardiac: {val}" for val in df_sample[target_col].values]
molecs = [Chem.MolFromSmiles(s) for s in smiles]

Draw.MolsToGridImage(
    molecs,
    legends=labels,
    subImgSize=(400, 300),
    molsPerRow=3
)

## 4. Generate MACCS Fingerprints

We'll use RDKit's MACCSkeys module directly (no need for DeepChem!).
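
Before applying the fingerprint to the whole dataset, it can help to look at a single molecule. The sketch below uses aspirin as an arbitrary example and inspects which keys are set. It assumes the `MACCSkeys.smartsPatts` lookup table is available in your RDKit version.

```python
from rdkit import Chem
from rdkit.Chem import MACCSkeys

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
fp = MACCSkeys.GenMACCSKeys(mol)

print(fp.GetNumBits())            # length of the bit vector
on_bits = list(fp.GetOnBits())    # indices of the keys present in the molecule
print(on_bits)

# Each key index maps to a (SMARTS pattern, minimum match count) pair
for idx in on_bits[:5]:
    smarts, count = MACCSkeys.smartsPatts[idx]
    print(idx, smarts, count)
```

The vector has 167 positions because RDKit numbers the 166 MACCS keys starting from 1, which leaves bit 0 unused.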

In [None]:
# Generate MACCS fingerprints using RDKit directly
def get_maccs_fingerprint(smiles):
    """
    Generate MACCS keys fingerprint from a SMILES string.
    
    Args:
        smiles: canonical SMILES string
    
    Returns:
        numpy array of 167 binary values (0 or 1)
    """
    try:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return np.zeros(167, dtype=int)
        
        # Generate MACCS fingerprint (167 bits)
        maccs_fp = MACCSkeys.GenMACCSKeys(mol)
        # Convert to numpy array
        return np.array(maccs_fp, dtype=int)
    except Exception:
        # Fall back to an all-zero vector so the feature matrix keeps its shape
        return np.zeros(167, dtype=int)

# Apply to all molecules
print("Generating MACCS fingerprints for all molecules...")
mf_features = np.array([get_maccs_fingerprint(smiles) for smiles in df_sider['canonical_smiles']])

print(f"MACCS fingerprint shape: {mf_features.shape}")
print(f"Number of features per molecule: {mf_features.shape[1]}")

In [None]:
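# Sanity-check the features. An integer array cannot hold NaN, so also count
# all-zero rows, which is what get_maccs_fingerprint returns on failure.
nan_count = np.isnan(mf_features).sum()
zero_rows = (mf_features.sum(axis=1) == 0).sum()
print(f"Total NaN values in features: {nan_count}")
print(f"All-zero fingerprints: {zero_rows}")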

## 5. Feature Selection: Remove Zero-Variance Features

Just like in BACE, we'll remove features that have zero variance (the bit has the same value for every molecule), since they carry no information for the classifier.
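
As a small illustration of what the filter does, the sketch below applies `VarianceThreshold` to a toy binary matrix whose first column is constant. The names `X_toy` and `selector_toy` are made up for this example.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy feature matrix: column 0 is 0 for every row, so its variance is zero
X_toy = np.array([
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 0],
])

selector_toy = VarianceThreshold(threshold=0.0)
X_kept = selector_toy.fit_transform(X_toy)

print(selector_toy.get_support())  # boolean mask of retained columns
print(X_kept.shape)
```

The mask returned by `get_support()` is useful if you later want to map a retained column back to its original MACCS key index.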

In [None]:
from sklearn.feature_selection import VarianceThreshold

# Remove zero-variance features
selector = VarianceThreshold(threshold=0.0)
mf_features_filtered = selector.fit_transform(mf_features)

print(f"Original number of features: {mf_features.shape[1]}")
print(f"Number of features after removing zero-variance: {mf_features_filtered.shape[1]}")
print(f"Removed {mf_features.shape[1] - mf_features_filtered.shape[1]} features")

## 6. Train/Test Split

In [None]:
from sklearn.model_selection import train_test_split

X = mf_features_filtered
y = df_sider[target_col]

# 80/20 train-test split with fixed random seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42, stratify=y
)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
print(f"\nTraining set class distribution:")
print(y_train.value_counts())
print(f"\nTest set class distribution:")
print(y_test.value_counts())
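
The `stratify=y` argument used in the split above keeps the class ratio the same in both subsets. A minimal sketch on toy labels (80 negatives and 20 positives, chosen arbitrarily) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 80 negatives and 20 positives (20% positive)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.8, random_state=42, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())  # positive fraction in each split
```

Without stratification the positive fraction in a small test set can drift noticeably from the full dataset, which makes metrics harder to compare across runs.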

## 7. Train Classification Models

We'll train Random Forest and XGBoost classifiers (adapted from the BACE regression example).

In [None]:
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Initialize models
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
xgb_clf = XGBClassifier(n_estimators=100, random_state=42, n_jobs=-1)

In [None]:
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def train_test_classifier(model, X_train, y_train, X_test, y_test, model_name):
    """
    Train a classification model and evaluate it.
    
    Args:
        model: sklearn/xgboost classifier
        X_train, y_train: training data
        X_test, y_test: test data
        model_name: name of the model for display

    Returns:
        The fitted model.
    """
    # Train model
    print(f"Training {model_name}...")
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    
    # Get probability predictions for ROC-AUC
    y_pred_train_proba = model.predict_proba(X_train)[:, 1]
    y_pred_test_proba = model.predict_proba(X_test)[:, 1]
    
    # Calculate metrics
    train_acc = accuracy_score(y_train, y_pred_train)
    test_acc = accuracy_score(y_test, y_pred_test)
    
    train_f1 = f1_score(y_train, y_pred_train)
    test_f1 = f1_score(y_test, y_pred_test)
    
    train_roc_auc = roc_auc_score(y_train, y_pred_train_proba)
    test_roc_auc = roc_auc_score(y_test, y_pred_test_proba)
    
    # Print results
    print(f"\n{model_name} Results:")
    print(f"  Train Accuracy: {train_acc:.3f} | Test Accuracy: {test_acc:.3f}")
    print(f"  Train F1-Score: {train_f1:.3f} | Test F1-Score: {test_f1:.3f}")
    print(f"  Train ROC-AUC:  {train_roc_auc:.3f} | Test ROC-AUC:  {test_roc_auc:.3f}")
    print()
    
    return model
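
The function above reports F1 and ROC-AUC alongside accuracy because accuracy alone can look good on imbalanced labels. As a hedged illustration on made-up labels (90 negatives, 10 positives), a model that always predicts the majority class behaves like this:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype=int)   # always predict the majority class
y_score = np.full(100, 0.1)         # the same score for every sample

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)  # no positives predicted
auc = roc_auc_score(y_true, y_score)

print(acc, f1, auc)
```

Accuracy comes out high even though the model never identifies a positive, while F1 and ROC-AUC expose that it has no discriminative power.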

In [None]:
# Train and evaluate Random Forest
rf_clf = train_test_classifier(rf_clf, X_train, y_train, X_test, y_test, "Random Forest")

In [None]:
# Train and evaluate XGBoost
xgb_clf = train_test_classifier(xgb_clf, X_train, y_train, X_test, y_test, "XGBoost")

## 8. Cross-Validation

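Let's perform k-fold cross-validation on the training set, as in the BACE example. `cross_val_score` fits a fresh clone of each model per fold, so the models trained above are not reused.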
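
For intuition about the stratified folds used in this section, here is a minimal sketch on toy labels (40 negatives and 10 positives, chosen so the counts divide evenly) that counts the positives landing in each test fold. The names `y_toy` and `skf_toy` are specific to this example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y_toy = np.array([0] * 40 + [1] * 10)
X_toy = np.zeros((50, 1))  # features are irrelevant for the split itself

skf_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_positives = [
    int(y_toy[test_idx].sum()) for _, test_idx in skf_toy.split(X_toy, y_toy)
]

print(fold_positives)
```

Each fold receives the same share of positives, so the per-fold ROC-AUC scores are computed on comparably balanced subsets.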

In [None]:
from sklearn.model_selection import cross_val_score, StratifiedKFold

n_folds = 5

# Create StratifiedKFold (important for imbalanced datasets)
skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)

# Cross-validate Random Forest
rf_cv_scores = cross_val_score(
    rf_clf, X_train, y_train,
    cv=skf,
    scoring='roc_auc',
    n_jobs=-1
)

print(f"Random Forest Cross-Validation (ROC-AUC):")
print(f"  Mean: {rf_cv_scores.mean():.3f}")
print(f"  Std:  {rf_cv_scores.std():.3f}")
print(f"  Scores: {rf_cv_scores}")

In [None]:
# Cross-validate XGBoost
xgb_cv_scores = cross_val_score(
    xgb_clf, X_train, y_train,
    cv=skf,
    scoring='roc_auc',
    n_jobs=-1
)

print(f"XGBoost Cross-Validation (ROC-AUC):")
print(f"  Mean: {xgb_cv_scores.mean():.3f}")
print(f"  Std:  {xgb_cv_scores.std():.3f}")
print(f"  Scores: {xgb_cv_scores}")

## 9. Summary

This notebook demonstrated:
1. Loading and exploring the SIDER dataset
2. Canonicalizing SMILES strings with RDKit
3. Generating MACCS fingerprints using **RDKit directly** (no DeepChem needed!)
4. Removing zero-variance features
5. Training Random Forest and XGBoost **classifiers** (vs. regressors in BACE)
6. Evaluating with classification metrics (Accuracy, F1, ROC-AUC)
7. Cross-validation with stratified k-folds

**Key Differences from BACE notebook:**
- The BACE notebook uses a DeepChem wrapper, while this one calls RDKit's `MACCSkeys.GenMACCSKeys()` directly
- Both produce the same 167-bit MACCS fingerprints!
- This approach avoids TensorFlow dependency (lighter installation)

**Next Steps:**
- Try predicting other side-effect categories
- Implement multi-label classification (predict all 27 side effects simultaneously)
- Try other featurizations (Morgan fingerprints, RDKit descriptors)
- Tune hyperparameters using GridSearchCV
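
As a starting point for the multi-label step, one possible approach (an assumption, not something this notebook implements) is scikit-learn's `MultiOutputClassifier`, which fits one independent classifier per label column. The sketch below uses random toy data in place of real fingerprints and labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(60, 20))   # stand-in for fingerprint bits
Y_toy = rng.integers(0, 2, size=(60, 3))    # stand-in for 3 label columns

# One Random Forest is trained per label column
multi_clf = MultiOutputClassifier(
    RandomForestClassifier(n_estimators=50, random_state=42)
)
multi_clf.fit(X_toy[:50], Y_toy[:50])

Y_pred = multi_clf.predict(X_toy[50:])
print(Y_pred.shape)  # one predicted 0/1 value per sample and label
```

With the real data, `Y_toy` would be replaced by the label columns of the SIDER table, and each column would typically be evaluated with its own ROC-AUC.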