# Predicting TCR–Neoantigen Binding with ML models (Isolation Forest)

#### Import Libraries

In [3]:
import os
import scanpy as sc
import scipy.io
import scipy.sparse as sp
import pandas as pd 
import numpy as np
import anndata
import matplotlib.pyplot as plt
import seaborn as sns

Load feature matrices for the three samples

In [4]:
# Load feature matrices
X_a = pd.read_parquet("data/feature_matrix_X_a.parquet")
X_b = pd.read_parquet("data/feature_matrix_X_b.parquet")
X_c = pd.read_parquet("data/feature_matrix_X_c.parquet")

# Quick sanity check
print(f"X_a shape: {X_a.shape}")
print(f"X_b shape: {X_b.shape}")
print(f"X_c shape: {X_c.shape}")

X_a shape: (401100, 8)
X_b shape: (399550, 8)
X_c shape: (360150, 8)
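
Before modeling, it can help to confirm that each matrix carries the columns used later in this notebook: the four feature columns plus the grouping keys `dextramer` and `raw_clonotype_id`. The helper below is a hypothetical sketch (`check_required_columns` is not part of the original analysis), demonstrated on a tiny frame with invented values rather than on the real data.

```python
import pandas as pd

# Columns referenced later in this notebook (model features + grouping keys).
REQUIRED_COLS = [
    "dex_norm", "dex_enrich", "clonotype_count", "clonotype_enrichment",
    "dextramer", "raw_clonotype_id",
]

def check_required_columns(df, required=REQUIRED_COLS):
    """Hypothetical helper: return the required columns that df lacks."""
    return [col for col in required if col not in df.columns]

# Toy frame with invented values, only to demonstrate the check.
toy = pd.DataFrame({
    "dex_norm": [0.1, 2.3],
    "dex_enrich": [1.0, 4.2],
    "clonotype_count": [3, 40],
    "dextramer": ["dex_A", "dex_B"],
})
missing = check_required_columns(toy)
print(missing)  # ['clonotype_enrichment', 'raw_clonotype_id']
```

On the real matrices, the same call (for example `check_required_columns(X_a)`) should return an empty list if all six columns are present.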


## Isolation Forest model

Set up the model

In [9]:
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest

def run_isolation_forest(X, feature_cols, contamination=0.001, random_state=42, min_cells=5):
    """
    Applies Isolation Forest to a feature matrix X.

    Parameters:
    - X (pd.DataFrame): Input feature matrix
    - feature_cols (list): List of feature columns to use for modeling
    - contamination (float): Proportion of outliers expected
    - random_state (int): Random seed for reproducibility
    - min_cells (int): Threshold on supporting cells; a (dextramer, clonotype) pair is kept
      only if it has more than min_cells cells (strict inequality)

    Returns:
    - X_ranked (pd.DataFrame): Ranked (dextramer, clonotype) binding predictions
    - X (pd.DataFrame): Input X with added anomaly scores and inlier/outlier prediction
    """
    
    # Work on a copy so the caller's DataFrame is not modified in place
    X = X.copy()

    # Extract model features
    X_model = X[feature_cols]

    # Scale features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_model)

    # Train Isolation Forest
    iso_forest = IsolationForest(
        n_estimators=100,
        contamination=contamination,
        random_state=random_state
    )
    iso_forest.fit(X_scaled)

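    # Score every cell; for decision_function, lower (more negative) = more anomalous
    X["binding_anomaly_score"] = iso_forest.decision_function(X_scaled)
    # predict returns -1 for outliers (candidate binders) and 1 for inliers
    X["is_likely_binder"] = iso_forest.predict(X_scaled)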

    print("Isolation Forest model trained.")

    # Group by (dextramer, clonotype)
    X_grouped = X.groupby(["dextramer", "raw_clonotype_id"]).agg(
        mean_score=("binding_anomaly_score", "mean"),
        min_score=("binding_anomaly_score", "min"),
        count=("binding_anomaly_score", "count")
    ).reset_index()

    # Keep only pairs supported by more than min_cells cells (strict inequality)
    X_grouped = X_grouped[X_grouped["count"] > min_cells]

    # Composite ranking score: 70% weight on the most anomalous cell (min), 30% on the mean
    X_grouped["final_score"] = X_grouped["min_score"] * 0.7 + X_grouped["mean_score"] * 0.3

    # Sort ascending: lower final_score = more anomalous = stronger binding candidate
    X_ranked = X_grouped.sort_values(by="final_score")

    return X_ranked, X
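
The sign conventions of scikit-learn's `IsolationForest` are easy to misread, so here is a minimal sketch on synthetic data. The toy matrix `X_toy`, the two planted outliers, and the `contamination=0.01` setting are assumptions chosen for illustration; they are not the notebook's data or settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: a dense Gaussian background plus two planted, far-away points.
rng = np.random.default_rng(0)
background = rng.normal(0.0, 1.0, size=(500, 2))
planted = np.array([[8.0, 8.0], [-9.0, 7.5]])
X_toy = np.vstack([background, planted])

model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
model.fit(X_toy)

# decision_function: negative values are flagged as outliers, and the more
# negative the value, the more anomalous the point.
scores = model.decision_function(X_toy)
# predict: -1 for outliers, 1 for inliers (same threshold as scores < 0).
labels = model.predict(X_toy)

print("planted scores:", scores[-2:])
print("planted labels:", labels[-2:])
print("number flagged:", int((labels == -1).sum()))
```

Under these conventions, a value of `-1` in the `is_likely_binder` column marks a cell as an outlier, that is, a candidate binder, while `1` marks background.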


Apply the model

In [10]:
# Define feature columns to use
feature_cols = ["dex_norm", "dex_enrich", "clonotype_count", "clonotype_enrichment"]

# Apply to each dataset
X_a_ranked, X_a = run_isolation_forest(X_a, feature_cols)
X_b_ranked, X_b = run_isolation_forest(X_b, feature_cols)
X_c_ranked, X_c = run_isolation_forest(X_c, feature_cols)

# View top predictions
print("🔵 Top predictions for X_a:")
print(X_a_ranked.head(20))

print("\n🟢 Top predictions for X_b:")
print(X_b_ranked.head(20))

print("\n🟣 Top predictions for X_c:")
print(X_c_ranked.head(20))


Isolation Forest model trained.
Isolation Forest model trained.
Isolation Forest model trained.
🔵 Top predictions for X_a:
          dextramer raw_clonotype_id  mean_score  min_score  count  \
2342   Positive_CMV      clonotype39   -0.047885  -0.058837     17   
2522   Positive_CMV      clonotype64   -0.037270  -0.052546      8   
2531   Positive_CMV       clonotype7   -0.026854  -0.041581    216   
17535      SRSF2_29       clonotype7    0.132413  -0.041581    216   
2047   Positive_CMV       clonotype1    0.078245  -0.018140    596   
8867       SRSF2_18       clonotype1    0.127552  -0.034787    596   
9061       SRSF2_18       clonotype2    0.147621  -0.015299    474   
23871      SRSF2_38       clonotype1    0.191869  -0.029197    596   
3895       SRSF2_10       clonotype7    0.146511  -0.009543    216   
15005      SRSF2_26       clonotype1    0.206822  -0.034787    596   
9120       SRSF2_18       clonotype3    0.167509  -0.014971    417   
3411       SRSF2_10       clonotype1 
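
The ordering in the printed table can be checked by hand from the composite score defined in `run_isolation_forest` (0.7 × `min_score` + 0.3 × `mean_score`). The sketch below recomputes it for four of the X_a rows shown above; the `mean_score` and `min_score` values are copied from that output, and the rows are deliberately listed out of order.

```python
import pandas as pd

# Rows copied from the X_a output above (mean_score and min_score as printed).
rows = pd.DataFrame(
    [
        ("SRSF2_18", "clonotype1", 0.127552, -0.034787),
        ("Positive_CMV", "clonotype39", -0.047885, -0.058837),
        ("SRSF2_29", "clonotype7", 0.132413, -0.041581),
        ("Positive_CMV", "clonotype1", 0.078245, -0.018140),
    ],
    columns=["dextramer", "raw_clonotype_id", "mean_score", "min_score"],
)

# Same weighting as in run_isolation_forest.
rows["final_score"] = 0.7 * rows["min_score"] + 0.3 * rows["mean_score"]

# Lower final_score = more anomalous = ranked first.
ranked = rows.sort_values("final_score").reset_index(drop=True)
print(ranked[["dextramer", "raw_clonotype_id", "final_score"]])
```

The recomputed order matches the table. It also shows why a pair such as SRSF2_29 ↔ clonotype7 ranks fourth despite a positive mean score: the 0.7 weight on `min_score` lets a single strongly anomalous cell dominate the composite.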

## Commentary on Isolation Forest Performance

Isolation Forest is attractive because it is unsupervised: training requires no labeled ground truth.
In patient **SRSF2-2 (X_a)**, it clearly identifies the three positive-control binders as strong outliers (the three `Positive_CMV` pairs at the top of the ranking), demonstrating good sensitivity when the true signals are distinct and the noise level is low.

In **SRSF2-9 (X_b)**, the separation is weaker. The validated binder (**SRSF2_31 ↔ clonotype16**) still emerges as the top-ranked candidate, but its anomaly score is only **1.4×** that of the next-best hit.
Moreover, the candidates ranked immediately below it are dominated by **clonotype1**, which suggests many false positives driven by background noise.
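
Both observations, the margin of the top hit and the dominance of one clonotype among the following candidates, can be quantified from a ranked table. The sketch below shows one possible way to do so; `summarize_top_hits` is a hypothetical helper, the scores and names in `toy_ranked` are invented, and the notebook does not show how its own 1.4× figure was computed.

```python
import pandas as pd

def summarize_top_hits(ranked, top_n=5, score_col="final_score"):
    """Hypothetical helper: margin of the top hit and clonotype mix below it."""
    top = ranked.sort_values(score_col).head(top_n)
    best = top[score_col].iloc[0]
    runner_up = top[score_col].iloc[1]
    # The ratio is only meaningful when both scores share a sign (e.g. both negative).
    ratio = best / runner_up if best * runner_up > 0 else float("nan")
    # Fraction of the remaining top hits contributed by each clonotype.
    share = top["raw_clonotype_id"].iloc[1:].value_counts(normalize=True)
    return {"gap": runner_up - best, "ratio": ratio, "clonotype_share": share}

# Invented scores and placeholder names, for illustration only.
toy_ranked = pd.DataFrame({
    "dextramer": ["dex_A", "dex_B", "dex_C", "dex_D", "dex_E"],
    "raw_clonotype_id": ["clonotype_x", "clonotype_y", "clonotype_y",
                         "clonotype_z", "clonotype_y"],
    "final_score": [-0.06, -0.04, -0.035, -0.03, -0.02],
})
summary = summarize_top_hits(toy_ranked)
print(summary["ratio"])
print(summary["clonotype_share"])
```

A small ratio together with a large share for a single clonotype below the top hit is the pattern described above for SRSF2-9. The helper could be applied to `X_b_ranked` in the same way, since that table carries the same `raw_clonotype_id` and `final_score` columns.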

### Conclusion
- **Isolation Forest** performs well when the true binders are strongly separated from background (e.g., SRSF2-2).
- **XGBoost** generally outperforms Isolation Forest, particularly in noisier datasets like SRSF2-9, where explicit supervision allows the model to better distinguish true binders from noise.

