# Interactive Binding Affinity Predictor

Enter your SMILES strings below to predict binding affinity (pKd) using our pre-trained models.

## How to Use
1. Add your SMILES strings to the `MY_SMILES` list in the input cell below
2. Run all cells in order (Kernel > Run All)
3. Read the predictions printed under the prediction cell; they are also saved to `my_predictions.csv`

## Model Information
- **XGBoost** (best): 5-fold CV R² = 0.52, RMSE = 1.30 pKd
- Trained on 16,562 compounds from PDBbind v2020
- Target-agnostic descriptor-based model
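
For readers unfamiliar with the two metrics quoted above, here is a minimal sketch of how R² and RMSE are conventionally defined. The helper names `rmse` and `r_squared` and the toy values are illustrative assumptions, not the evaluation code or data behind the reported numbers.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error, in the same units as the inputs (pKd here)."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up pKd values for illustration only; not taken from PDBbind.
toy_true = [4.0, 6.0, 8.0]
toy_pred = [5.0, 6.0, 7.0]

toy_rmse = rmse(toy_true, toy_pred)
toy_r2 = r_squared(toy_true, toy_pred)
print(f"RMSE = {toy_rmse:.2f} pKd, R2 = {toy_r2:.2f}")
```

In a 5-fold cross-validation these metrics would typically be computed on each held-out fold and then averaged, though the exact aggregation used for the figures above is not stated here.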

In [None]:
# Install dependencies if needed
# !pip install rdkit pandas numpy xgboost ipywidgets

In [None]:
import pandas as pd
import numpy as np
import pickle
from pathlib import Path
from rdkit import Chem
from rdkit.Chem import Descriptors, Draw
from IPython.display import display, HTML
import warnings
warnings.filterwarnings('ignore')

print("Dependencies loaded successfully!")

In [None]:
# Load pre-trained model
MODELS_DIR = Path('../models')

model_data = None
model_name = None

for name in ['xgboost', 'lightgbm', 'random_forest']:
    model_path = MODELS_DIR / f'{name}_model.pkl'
    if model_path.exists():
        with open(model_path, 'rb') as f:
            model_data = pickle.load(f)
        model_name = name
        break

if model_data:
    model = model_data['model']
    feature_cols = model_data['feature_cols']
    metrics = model_data['metrics']
    
    print(f"Loaded: {model_name.replace('_', ' ').title()} model")
    print(f"Features: {', '.join(feature_cols)}")
else:
    raise FileNotFoundError("No pre-trained model found. Run 02_train_and_save_models.py first.")

In [None]:
def calculate_descriptors(smiles):
    """Calculate molecular descriptors for a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None, "Invalid SMILES"
    
    try:
        descriptors = {
            'MolWt': Descriptors.MolWt(mol),
            'LogP': Descriptors.MolLogP(mol),
            'TPSA': Descriptors.TPSA(mol),
            'NumHDonors': Descriptors.NumHDonors(mol),
            'NumHAcceptors': Descriptors.NumHAcceptors(mol),
            'NumRotatableBonds': Descriptors.NumRotatableBonds(mol),
            'NumAromaticRings': Descriptors.NumAromaticRings(mol),
            'FractionCSP3': Descriptors.FractionCSP3(mol),
            'NumHeavyAtoms': mol.GetNumHeavyAtoms(),
            'RingCount': Descriptors.RingCount(mol),
        }
        return descriptors, mol
    except Exception as e:
        return None, str(e)

def predict_affinity(smiles):
    """Predict binding affinity for a SMILES string."""
    result = calculate_descriptors(smiles)
    
    if result[0] is None:
        return None, None, result[1]
    
    descriptors, mol = result
    
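    # Prepare features as a 1 x n_features array, in the column order used for training
    features = np.asarray([[descriptors[col] for col in feature_cols]], dtype=float)
    
    # Predict pKd, then convert to Kd in nM: Kd[nM] = 10^(9 - pKd)
    predicted_pKd = float(model.predict(features)[0])
    predicted_Kd_nM = 10**(9 - predicted_pKd)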
    
    return predicted_pKd, predicted_Kd_nM, mol

def interpret_binding(pkd):
    """Map a pKd value to a (label, color, description) tuple."""
    if pkd > 9:
        return "Very Strong", "green", "pKd > 9 (sub-nanomolar)"
    elif pkd > 7:
        return "Strong", "limegreen", "7 < pKd <= 9 (1-100 nM)"
    elif pkd > 5:
        return "Moderate", "orange", "5 < pKd <= 7 (0.1-10 µM)"
    else:
        return "Weak", "red", "pKd <= 5 (Kd >= 10 µM)"

print("Prediction functions ready!")

---
## Enter Your SMILES Below

Enter one SMILES per line. Examples:
- Ibuprofen: `CC(C)CC1=CC=C(C=C1)C(C)C(=O)O`
- Aspirin: `CC(=O)OC1=CC=CC=C1C(=O)O`
- Caffeine: `CN1C=NC2=C1C(=O)N(C(=O)N2C)C`
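
If your compounds are in a text file or spreadsheet column, it may be easier to paste them as one block than to quote each line by hand. The sketch below assumes a hypothetical helper, `parse_smiles_block`, that is not part of this notebook. It skips blank lines and lines starting with `#`, the same filter the prediction loop applies.

```python
def parse_smiles_block(text):
    """Split pasted text into SMILES strings, one per non-empty line."""
    smiles_list = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks and comment lines; a valid SMILES never starts with '#'
        if not line or line.startswith('#'):
            continue
        smiles_list.append(line)
    return smiles_list

pasted = """
# example compounds
CC(C)CC1=CC=C(C=C1)C(C)C(=O)O
CC(=O)OC1=CC=CC=C1C(=O)O

CN1C=NC2=C1C(=O)N(C(=O)N2C)C
"""

parsed = parse_smiles_block(pasted)
print(f"Parsed {len(parsed)} SMILES")
```

The resulting list could then be assigned to `MY_SMILES` in the input cell. This helper only splits text; it does not check that each string is a chemically valid SMILES, which is left to `Chem.MolFromSmiles` in the prediction step.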

In [None]:
#############################################
# ENTER YOUR SMILES STRINGS BELOW
# One SMILES per line in the list
#############################################

MY_SMILES = [
    "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",  # Ibuprofen
    "CC(=O)OC1=CC=CC=C1C(=O)O",        # Aspirin
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",   # Caffeine
    # Add your SMILES here:
    # "YOUR_SMILES_1",
    # "YOUR_SMILES_2",
]

print(f"Ready to predict {len(MY_SMILES)} compound(s)")

In [None]:
# Run predictions
results = []
molecules = []

print("=" * 70)
print("BINDING AFFINITY PREDICTIONS")
print("=" * 70)
print()

for i, smiles in enumerate(MY_SMILES, 1):
    smiles = smiles.strip()
    if not smiles or smiles.startswith('#'):
        continue
    
    pkd, kd_nm, mol_or_error = predict_affinity(smiles)
    
    smiles_display = smiles[:50] + "..." if len(smiles) > 50 else smiles
    print(f"Compound {i}: {smiles_display}")
    
    if pkd is None:
        print(f"  ERROR: {mol_or_error}")
        print()
        continue
    
    strength, color, description = interpret_binding(pkd)
    
    print(f"  Predicted pKd:  {pkd:.2f}")
    print(f"  Predicted Kd:   {kd_nm:.1f} nM")
    print(f"  Binding:        {strength} ({description})")
    print()
    
    results.append({
        'SMILES': smiles,
        'Predicted_pKd': pkd,
        'Predicted_Kd_nM': kd_nm,
        'Binding_Strength': strength
    })
    molecules.append(mol_or_error)

print("=" * 70)

In [None]:
# Display results as table
if results:
    df_results = pd.DataFrame(results)
    df_results['Predicted_pKd'] = df_results['Predicted_pKd'].round(2)
    df_results['Predicted_Kd_nM'] = df_results['Predicted_Kd_nM'].round(1)
    
    print("\nResults Summary:")
    display(df_results)

In [None]:
# Visualize molecules
if molecules:
    valid_mols = [m for m in molecules if m is not None and isinstance(m, Chem.rdchem.Mol)]
    if valid_mols:
        print("\nMolecule Structures:")
        img = Draw.MolsToGridImage(valid_mols, molsPerRow=3, subImgSize=(300, 300))
        display(img)

In [None]:
# Export results to CSV
if results:
    output_file = 'my_predictions.csv'
    df_results.to_csv(output_file, index=False)
    print(f"Results saved to: {output_file}")

---
## Model Limitations

**Important caveats:**

1. **Target-agnostic**: This model predicts general binding affinity based on molecular properties only. The same compound will get the same prediction regardless of target protein.

2. **Descriptor-based**: Uses 10 molecular descriptors (MW, LogP, TPSA, etc.). Does not consider 3D structure or protein-ligand interactions.

3. **Training data**: Trained on PDBbind v2020 (~16K compounds). May perform poorly on molecules very different from training data.

4. **Data quality warning**: ~25.7% of cross-database binding data shows conflicts (>10-fold differences). Always validate experimentally.

5. **Expected error**: RMSE = 1.30 pKd units (~20-fold error in Kd).
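
The "~20-fold" figure in item 5 follows from the logarithmic definition of pKd: an error of x pKd units corresponds to a factor of 10^x in Kd. The sketch below works through that arithmetic. The helper names are illustrative, and treating one RMSE as a typical error band is a simplification, not a confidence interval.

```python
def pkd_to_kd_nm(pkd):
    """Convert pKd to Kd in nM, using Kd[nM] = 10^(9 - pKd)."""
    return 10 ** (9 - pkd)

def kd_fold_error(rmse_pkd):
    """Multiplicative error in Kd implied by an additive error in pKd."""
    return 10 ** rmse_pkd

def kd_range_nm(pkd, rmse_pkd=1.30):
    """Kd values (nM) at pKd + RMSE and pKd - RMSE: (tighter, weaker)."""
    return pkd_to_kd_nm(pkd + rmse_pkd), pkd_to_kd_nm(pkd - rmse_pkd)

fold = kd_fold_error(1.30)            # about 20
low_nm, high_nm = kd_range_nm(6.0)    # around a predicted Kd of 1000 nM

print(f"Fold error in Kd: {fold:.1f}x")
print(f"Predicted pKd 6.0 spans roughly {low_nm:.0f} nM to {high_nm:.0f} nM")
```

In practice this means a compound predicted at 1 µM could plausibly bind anywhere from tens of nanomolar to tens of micromolar, which spans several of the binding-strength categories used above.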

For publication-quality predictions, consider:
- Protein-specific models
- 3D structural features
- Graph neural networks
- Molecular dynamics-based methods