## Introduction

In this notebook, I compare two fundamentally different approaches to molecular representation based on SMILES strings:
- Morgan fingerprints (ECFP) — hand-engineered binary features based on molecular substructures.
- [MolCLR](https://github.com/yuyangw/MolCLR) embeddings — learned continuous representations obtained via contrastive graph-based pretraining.

The goal is to evaluate how each type of feature affects mechanism-of-action (MoA) classification performance, particularly in detecting rare or underrepresented classes.

By training and testing models on identical datasets with only the SMILES representation varied, I aim to assess:
- Which representation yields better overall and per-class metrics
- Whether learned embeddings provide a tangible advantage over traditional descriptors
- How both approaches scale with model complexity and label imbalance

This comparison helps inform the design of future multimodal models by selecting the most informative and generalizable molecular descriptors.

In [1]:
import os
import glob
from datetime import date

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
    

In [2]:
# paths to data
save_path = "result/"

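def load_latest_file(suffix: str) -> pd.DataFrame:
    """Load the newest CSV in `save_path` whose name ends with '<digit>_<suffix>'."""
    pattern = os.path.join(save_path, '*[0-9]_' + suffix)
    candidates = glob.glob(pattern)
    if not candidates:
        raise FileNotFoundError(f"No files matching {pattern!r}")

    # pick the latest file by ctime (creation time on Windows, metadata-change time on Unix)
    latest = max(candidates, key=os.path.getctime)
    return pd.read_csv(latest)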

In [3]:
df_merged_morgan = load_latest_file('merged_exp_agm.csv')
df_merged_molcrl = load_latest_file('merged_molcrl_exp_agm.csv')

In [4]:
metadata_cols = [col for col in df_merged_morgan.columns if col.startswith('Metadata_')]
binary_cols = [col for col in df_merged_morgan.columns if col.startswith('binary_')]
chemical_cols = [col for col in df_merged_morgan.columns if col.startswith('chemical_')]
moa_cols = [col for col in df_merged_morgan.columns if col.startswith('moa_')]
drug_status_cols = [col for col in df_merged_morgan.columns if col.startswith('drug_status_')]
fingerprints_cols = [col for col in df_merged_morgan.columns if col.startswith('fp_')]
morphology_cols = [col for col in df_merged_morgan.columns if col.startswith('morphology_')]

moa_counts = df_merged_morgan[moa_cols].sum().sort_values(ascending=False)
top_moa = moa_counts[moa_counts > 100].index.tolist()
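
The `moa_*` columns are one-hot, and later cells collapse them into a single label with `idxmax(axis=1)`. A small sketch on toy data (the frame below is hypothetical, not the notebook's dataset) shows a caveat worth checking: for a row where all selected columns are zero, `idxmax` returns the first column, so compounds without any of the top MoAs would silently receive that label unless they are filtered out first.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the notebook's moa_ prefix
toy = pd.DataFrame({
    "moa_inhibitor": [1, 0, 0],
    "moa_agonist": [0, 1, 0],
})
cols = ["moa_inhibitor", "moa_agonist"]

# The last row has no positive label, yet idxmax still returns the first column
labels_naive = toy[cols].idxmax(axis=1)

# Keep only rows that carry at least one of the selected labels
has_label = toy[cols].sum(axis=1) > 0
labels_filtered = toy.loc[has_label, cols].idxmax(axis=1)

print(labels_naive.tolist())
print(labels_filtered.tolist())
```

Whether such all-zero rows exist in the merged dataset is not shown here, so this is a check to run rather than a confirmed problem.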

In [5]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix
from sklearn.metrics import precision_recall_curve, average_precision_score, roc_curve, auc, roc_auc_score
from sklearn.preprocessing import StandardScaler

class MultimodalMoAPipeline:
    """
    A pipeline for multimodal classification tasks using morphological, chemical,
    and fingerprint features.

    Parameters
    ----------
    morph_cols : list of str, default=[]
        List of column names representing morphological features.

    chem_cols : list of str, default=[]
        List of column names representing chemical descriptors.

    fp_cols : list of str, default=[]
        List of column names representing fingerprint features.

    use_morph : bool, default=True
        Whether to include morphological features in the model.

    use_chem : bool, default=True
        Whether to include chemical features in the model.

    use_fp : bool, default=True
        Whether to include fingerprint features in the model.

    scaler : str or None, default='standard'
        Type of scaler to apply to features. Options:
            - 'standard': StandardScaler from sklearn
            - None or any other value: no scaling will be applied

        Note: Some models like CatBoost do not require scaling.

    model : sklearn-like classifier, default=None
        A scikit-learn compatible classifier. If None, defaults to
        RandomForestClassifier with predefined parameters.

    random_state : int, default=42
        Random seed for reproducibility.

    use_gridsearch : bool, default=False
        Whether to perform GridSearchCV to tune hyperparameters.
        Only supported for sklearn-compatible estimators.
    """
    def __init__(self, morph_cols=[], chem_cols=[], fp_cols=[],
                 use_morph=True, use_chem=True, use_fp=True,
                 scaler='standard', model=None, random_state=42,
                 use_gridsearch=False):
        self.morph_cols = morph_cols
        self.chem_cols = chem_cols
        self.fp_cols = fp_cols
        self.use_morph = use_morph
        self.use_chem = use_chem
        self.use_fp = use_fp
        
        self.scaler_type = scaler
        self.random_state = random_state
        self.use_gridsearch = use_gridsearch
        
        self.model = model if model is not None else RandomForestClassifier(n_estimators=200, random_state=random_state, class_weight='balanced', min_samples_leaf=3)

    def _get_feature_set(self, df):
        cols = []
        if self.use_morph:
            cols += self.morph_cols
        if self.use_chem:
            cols += self.chem_cols
        if self.use_fp:
            cols += self.fp_cols
        return df[cols].copy()

    def _scale(self, X_train, X_test):
        # Fit the scaler on the training split only, so no test-set statistics leak in
        if self.scaler_type == 'standard':
            self.scaler = StandardScaler()
            X_train = pd.DataFrame(self.scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
            X_test = pd.DataFrame(self.scaler.transform(X_test), columns=X_test.columns, index=X_test.index)
        return X_train, X_test  # unchanged when no scaling is requested

    def fit(self, df, target_col):
        X = self._get_feature_set(df)

        # A list of one-hot columns is collapsed into a single multiclass label.
        # Note: idxmax returns the first column for rows where all columns are zero.
        if isinstance(target_col, list):
            y = df[target_col].idxmax(axis=1)
        else:
            y = df[target_col]

        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            X, y, stratify=y, test_size=0.2, random_state=self.random_state
        )
        self.X_train, self.X_test = self._scale(self.X_train, self.X_test)

        if self.use_gridsearch:
            param_grid = {
                'n_estimators': [100, 200],
                'max_depth': [None, 10, 20],
                'min_samples_leaf': [1, 3, 5],
                'class_weight': ['balanced']
            }
            base_model = RandomForestClassifier(random_state=self.random_state)
            cv_strategy = StratifiedKFold(n_splits=3, shuffle=True, random_state=self.random_state)
            grid = GridSearchCV(base_model, param_grid, scoring='f1_macro', cv=cv_strategy, n_jobs=-1)
            grid.fit(self.X_train, self.y_train)
            print("Best params from GridSearchCV:", grid.best_params_)
            self.model = grid.best_estimator_
        else:
            self.model.fit(self.X_train, self.y_train)

        self.y_pred = self.model.predict(self.X_test)


    def evaluate(self, show_plots=True):
        acc = accuracy_score(self.y_test, self.y_pred)
        f1 = f1_score(self.y_test, self.y_pred, average='macro')

        print(f"\n🎯 Accuracy: {acc:.4f}")
        print(f"🎯 Macro F1-score: {f1:.4f}\n")
        print("Classification Report:\n")
        print(classification_report(self.y_test, self.y_pred))
        
        if not show_plots:
            return

        plt.figure(figsize=(10, 8))
        cm = confusion_matrix(self.y_test, self.y_pred, labels=self.model.classes_)
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                    xticklabels=self.model.classes_, yticklabels=self.model.classes_)
        plt.title("Confusion Matrix")
        plt.xlabel("Predicted")
        plt.ylabel("True")
        plt.tight_layout()
        plt.show()
        
        # PR and ROC curves only for binary classification
        if len(self.model.classes_) == 2:
            y_proba = self.model.predict_proba(self.X_test)[:, 1]

            # Precision-Recall Curve
            precision, recall, thresholds = precision_recall_curve(self.y_test, y_proba)
            avg_precision = average_precision_score(self.y_test, y_proba)

            plt.figure(figsize=(8, 6))
            plt.plot(recall, precision, marker='.')
            plt.xlabel('Recall')
            plt.ylabel('Precision')
            plt.title(f'Precision-Recall Curve (AP = {avg_precision:.4f})')
            plt.grid()
            plt.tight_layout()
            plt.show()

            # ROC Curve
            fpr, tpr, _ = roc_curve(self.y_test, y_proba)
            roc_auc = roc_auc_score(self.y_test, y_proba)

            plt.figure(figsize=(8, 6))
            plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.4f})')
            plt.plot([0, 1], [0, 1], linestyle='--', color='gray')
            plt.xlabel('False Positive Rate')
            plt.ylabel('True Positive Rate')
            plt.title('ROC Curve')
            plt.legend(loc='lower right')
            plt.grid()
            plt.tight_layout()
            plt.show()

    def plot_importance(self, top_n=30):
        if not hasattr(self.model, 'feature_importances_'):
            print("This model does not support feature importances.")
            return

        feat_imp = pd.DataFrame({
            'feature': self.X_train.columns,
            'importance': self.model.feature_importances_
        })

        # Figure out the feature groups
        def get_group(feature):
            if feature in self.morph_cols:
                return 'morphology'
            elif feature in self.chem_cols:
                return 'chemistry'
            elif feature in self.fp_cols:
                return 'fingerprint'
            else:
                return 'other'

        feat_imp['group'] = feat_imp['feature'].apply(get_group)

        # Group by feature group and sum importances
        grouped = feat_imp.groupby('group')['importance'].sum().sort_values(ascending=False)
        print("\n📊 Feature Importance by Group:")
        print(grouped)

        # Sort and select top_n features
        feat_imp = feat_imp.sort_values(by='importance', ascending=False).head(top_n)

        plt.figure(figsize=(12, 8))
        sns.barplot(data=feat_imp, x='importance', y='feature', hue='group', dodge=False, palette='viridis')
        plt.title(f"Top {top_n} Feature Importances by Group")
        plt.tight_layout()
        plt.show()

        return feat_imp
    

In [6]:
from catboost import CatBoostClassifier
from sklearn.utils.class_weight import compute_class_weight

X = df_merged_morgan[morphology_cols + chemical_cols + fingerprints_cols]
y = df_merged_morgan[top_moa].idxmax(axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

classes = y_train.unique()
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weights = dict(zip(classes, weights))

cat_boost_model = CatBoostClassifier(
    iterations=300,
    depth=6,
    learning_rate=0.1,
    loss_function='MultiClass',
    class_weights=class_weights,
    eval_metric='TotalF1',
    random_seed=42,
    verbose=50,
)
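
The `'balanced'` option of `compute_class_weight` assigns each class the weight `n_samples / (n_classes * class_count)`. The sketch below checks that formula on hypothetical label counts (chosen only for illustration, not the notebook's training counts) and uses `np.unique` so that the class order is explicit.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 80 / 12 / 8 samples
y_toy = np.array(["inhibitor"] * 80 + ["antagonist"] * 12 + ["agonist"] * 8)

classes, counts = np.unique(y_toy, return_counts=True)  # sorted class names
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# Manual version of the same heuristic
manual = len(y_toy) / (len(classes) * counts)

class_weights = dict(zip(classes, weights))
print(class_weights)
```

The rarest class receives the largest weight, which is what pushes the loss toward the minor MoA classes.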

In [7]:
cat_boost_pipe = MultimodalMoAPipeline(
    morph_cols=morphology_cols,
    chem_cols=chemical_cols,
    fp_cols=fingerprints_cols,
    use_gridsearch=False,
    model=cat_boost_model
)

In [8]:
cat_boost_pipe.fit(df_merged_morgan, target_col=top_moa)
cat_boost_pipe.evaluate(show_plots=False)

0:	learn: 0.5849560	total: 120ms	remaining: 35.9s
50:	learn: 0.9743312	total: 1.8s	remaining: 8.78s
100:	learn: 0.9974533	total: 3.41s	remaining: 6.71s
150:	learn: 0.9995759	total: 4.9s	remaining: 4.83s
200:	learn: 1.0000000	total: 6.66s	remaining: 3.28s
250:	learn: 1.0000000	total: 8.33s	remaining: 1.63s
299:	learn: 1.0000000	total: 9.9s	remaining: 0us

🎯 Accuracy: 0.8661
🎯 Macro F1-score: 0.3606

Classification Report:

                precision    recall  f1-score   support

   moa_agonist       0.00      0.00      0.00        14
moa_antagonist       1.00      0.08      0.15        12
 moa_inhibitor       0.89      0.97      0.93       198

      accuracy                           0.87       224
     macro avg       0.63      0.35      0.36       224
  weighted avg       0.84      0.87      0.83       224



In [9]:
cat_boost_pipe.fit(df_merged_molcrl, target_col=top_moa)
cat_boost_pipe.evaluate(show_plots=False)

0:	learn: 0.5641783	total: 120ms	remaining: 36s
50:	learn: 0.9682054	total: 2.06s	remaining: 10.1s
100:	learn: 0.9949034	total: 3.89s	remaining: 7.67s
150:	learn: 0.9995759	total: 5.59s	remaining: 5.52s
200:	learn: 1.0000000	total: 7.45s	remaining: 3.67s
250:	learn: 1.0000000	total: 9.23s	remaining: 1.8s
299:	learn: 1.0000000	total: 10.9s	remaining: 0us

🎯 Accuracy: 0.8705
🎯 Macro F1-score: 0.3927

Classification Report:

                precision    recall  f1-score   support

   moa_agonist       0.20      0.07      0.11        14
moa_antagonist       0.50      0.08      0.14        12
 moa_inhibitor       0.89      0.97      0.93       198

      accuracy                           0.87       224
     macro avg       0.53      0.38      0.39       224
  weighted avg       0.83      0.87      0.84       224



## Summary

In this section of the study, I compared two distinct approaches for extracting molecular features from SMILES strings:

- **Morgan fingerprints**: traditional circular substructure fingerprints.
- **MolCLR embeddings**: learned representations using contrastive learning on molecular graphs.

Both representations were evaluated with the same CatBoostClassifier and identical data preprocessing pipelines. The goal was to assess whether modern embedding techniques like MolCLR provide tangible improvements over established fingerprint methods in MoA classification tasks.

### Summary of Results

| SMILES Representation | Accuracy | Macro F1 | Notes |
|------------------------|----------|----------|-------|
| **Morgan fingerprints** | 0.8661   | 0.3606   | Higher precision on `moa_antagonist`, but zero recall on `moa_agonist` |
| **MolCLR embeddings**   | 0.8705   | 0.3927   | Slightly better macro F1; improved balance across minor classes |
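
Two quick checks help put these numbers in context. The sketch below uses only figures from the classification reports above: it computes the accuracy of a trivial classifier that always predicts `moa_inhibitor`, converts each reported accuracy into a count of correct test predictions, and recomputes macro F1 as the unweighted mean of the per-class F1-scores (these are rounded in the report, so the mean matches only to two decimals).

```python
# Test-set support per class, taken from the classification reports
support = {"moa_agonist": 14, "moa_antagonist": 12, "moa_inhibitor": 198}
n_test = sum(support.values())  # 224

# Accuracy of always predicting the majority class
baseline_acc = max(support.values()) / n_test

# Reported accuracies converted into counts of correct predictions
acc = {"morgan": 0.8661, "molclr": 0.8705}
correct = {name: round(a * n_test) for name, a in acc.items()}

# Macro F1 is the unweighted mean of the per-class F1-scores
f1_per_class = {
    "morgan": [0.00, 0.15, 0.93],
    "molclr": [0.11, 0.14, 0.93],
}
macro_f1 = {name: sum(f) / len(f) for name, f in f1_per_class.items()}

print(f"majority-class baseline accuracy: {baseline_acc:.4f}")
print(f"correct predictions: {correct}")
print({name: round(v, 2) for name, v in macro_f1.items()})
```

The baseline comes out at about 0.884, above both models, and the two accuracies differ by a single correct prediction (194 vs. 195 of 224).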

### Analysis

#### Morgan Fingerprints
- Reached an accuracy of 0.8661, which is below the 0.8839 obtained by always predicting `moa_inhibitor` (198 of 224 test samples); the score is driven by the dominant class (F1 of 0.93).
- Achieved perfect precision (1.00) but very low recall (0.08, i.e. 1 of 12 samples) on `moa_antagonist`.
- Did not correctly classify any `moa_agonist` sample (precision and recall both 0.00).
- **Conclusion**: While reliable for the dominant class, this approach generalizes poorly to rare classes.

#### MolCLR Embeddings
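- Reached an accuracy of 0.8705 and a macro F1-score of 0.3927, versus 0.8661 and 0.3606 for Morgan fingerprints. The accuracy gap corresponds to one additional correct prediction out of 224 test samples.
- Correctly identified at least one sample from each of the three classes, though minor-class recall stays very low (0.07 on `moa_agonist`, 0.08 on `moa_antagonist`).
- Minor-class precision is also low (0.20 on `moa_agonist`, 0.50 on `moa_antagonist`), so many of its minor-class predictions are incorrect.
- **Conclusion**: Somewhat better minor-class sensitivity on this split; promising for underrepresented MoAs, but the margin is small.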

### Final Recommendation

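**MolCLR embeddings** score higher than Morgan fingerprints on macro F1 (0.3927 vs. 0.3606) at essentially the same accuracy, and only the MolCLR model correctly classifies any `moa_agonist` sample. The evidence is limited: the margin comes from a single train/test split, both models fit the training set perfectly (training TotalF1 of 1.0), and both stay below the majority-class accuracy of 0.8839 (198 of 224) on the test set. Given the importance of detecting rare mechanisms in practical drug discovery tasks, I tentatively recommend **adopting MolCLR embeddings** for future iterations of this pipeline, subject to confirmation with cross-validation.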
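
Because the comparison rests on one 80/20 split, a repeated stratified cross-validation would give a spread of macro F1 values for each representation rather than a single number. The following is a minimal sketch on synthetic data: `make_classification` stands in for the two feature sets, `RandomForestClassifier` stands in for the CatBoost model, and the helper name `macro_f1_cv` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score


def macro_f1_cv(X, y, n_splits=3, n_repeats=2, seed=42):
    """Return one macro F1 score per fold of a repeated stratified CV."""
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
    model = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=seed)
    return cross_val_score(model, X, y, scoring="f1_macro", cv=cv)


# Synthetic, imbalanced 3-class data standing in for one feature representation
X_a, y = make_classification(
    n_samples=300, n_features=20, n_informative=6, n_classes=3,
    weights=[0.85, 0.08, 0.07], random_state=0,
)
# A second "representation" of the same samples: here just a noisy copy
rng = np.random.default_rng(0)
X_b = X_a + rng.normal(scale=0.5, size=X_a.shape)

scores_a = macro_f1_cv(X_a, y)
scores_b = macro_f1_cv(X_b, y)
print(f"A: {scores_a.mean():.3f} +/- {scores_a.std():.3f}")
print(f"B: {scores_b.mean():.3f} +/- {scores_b.std():.3f}")
```

With the real data, the same folds would be applied to the Morgan and MolCLR feature tables so that differences in macro F1 can be compared against the fold-to-fold variation.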

### Future Directions
- Fine-tuning the MolCLR encoder on domain-specific MOA data
- Exploring hybrid approaches that combine fingerprints and embeddings
- Incorporating domain knowledge into representation learning objectives
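
As a starting point for the hybrid direction, the two representations can be concatenated column-wise per compound before being passed to the classifier. The sketch below uses random stand-in arrays; the dimensions (2048 fingerprint bits, 512 embedding values) and the `fp_`/`emb_` column prefixes are assumptions for illustration, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_compounds = 5

# Stand-in binary fingerprint bits and continuous embeddings (hypothetical sizes)
fp = pd.DataFrame(
    rng.integers(0, 2, size=(n_compounds, 2048)),
    columns=[f"fp_{i}" for i in range(2048)],
)
emb = pd.DataFrame(
    rng.normal(size=(n_compounds, 512)),
    columns=[f"emb_{i}" for i in range(512)],
)

# Rows must refer to the same compounds in the same order before concatenating
assert fp.index.equals(emb.index)
hybrid = pd.concat([fp, emb], axis=1)

# Column prefixes keep the two groups separable, e.g. for per-group importances
fp_cols = [c for c in hybrid.columns if c.startswith("fp_")]
emb_cols = [c for c in hybrid.columns if c.startswith("emb_")]
print(hybrid.shape, len(fp_cols), len(emb_cols))
```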