A machine learning pipeline for predicting protein-ligand binding affinity and binary binding classification. The system combines molecular fingerprints with protein sequence embeddings using a three-model approach: baseline, fusion, and calibrated prediction models.
This project implements a comprehensive protein-ligand binding prediction system designed for drug discovery applications. The pipeline processes protein sequences and ligand SMILES strings to predict binding affinity (px values) and binary binding classification (active/inactive).
Key Features:
- Multi-model ensemble approach for robust predictions
- Molecular fingerprint generation using ECFP4 (Extended Connectivity Fingerprints)
- Protein sequence embedding using ESM2 transformer model
- Temperature scaling for probability calibration
- Comprehensive evaluation with multiple metrics
- Modular architecture supporting both training and inference
Scientific Background: The system predicts binding affinity expressed as px values, which represent the negative logarithm (base 10) of binding affinity in molar concentration. Higher px values indicate stronger binding (e.g., px = 7.0 corresponds to 100 nM binding affinity).
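The px conversion is a one-line formula; as a sanity check (a minimal sketch, not project code — the function name `affinity_to_px` is illustrative):

```python
import math

def affinity_to_px(affinity_molar: float) -> float:
    """px = -log10(binding affinity in molar units); higher px = stronger binding."""
    return -math.log10(affinity_molar)

print(affinity_to_px(100e-9))  # 100 nM -> px = 7.0
print(affinity_to_px(1e-6))    # 1 uM   -> px = 6.0
```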
Baseline Model: ECFP4 molecular fingerprints processed through logistic regression
- Fast inference and interpretable results
- Uses 2048-bit molecular fingerprints with radius 2
- Serves as performance baseline and fallback option
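The fingerprinting step can be sketched with RDKit (a minimal example; `ecfp4_fingerprint` is an illustrative helper, not a function from this repository):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def ecfp4_fingerprint(smiles: str, n_bits: int = 2048, radius: int = 2):
    """2048-bit ECFP4 (Morgan, radius 2); returns None for unparseable SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp, dtype=np.uint8)

fp = ecfp4_fingerprint("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
```

Returning `None` for invalid SMILES lets callers skip or log bad rows instead of crashing mid-batch.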
Fusion Model: Combined molecular and protein features through a neural network
- Concatenates ECFP4 fingerprints with ESM2 protein embeddings
- Multi-layer perceptron with dropout regularization
- Simultaneous binary classification and regression prediction
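The fusion architecture can be sketched in PyTorch as follows (hidden sizes and dropout rate are illustrative assumptions, not the repository's exact architecture; only the input dimensions and the two-head design come from this README):

```python
import torch
import torch.nn as nn

class FusionMLP(nn.Module):
    """Concatenate ECFP4 (2048-d) and ESM2 (320-d) features, pass them through
    a shared MLP with dropout, and emit a classification logit and a px value."""
    def __init__(self, ecfp_dim=2048, prot_dim=320, hidden=512, dropout=0.3):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(ecfp_dim + prot_dim, hidden),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, hidden // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
        )
        self.cls_head = nn.Linear(hidden // 2, 1)  # binding logit (active/inactive)
        self.reg_head = nn.Linear(hidden // 2, 1)  # predicted px value

    def forward(self, ecfp, prot_emb):
        h = self.backbone(torch.cat([ecfp, prot_emb], dim=-1))
        return self.cls_head(h).squeeze(-1), self.reg_head(h).squeeze(-1)
```

Sharing the backbone between the two heads is what makes this multi-task: both losses update the same fused representation.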
Calibrated Model: Temperature scaling applied to fusion model outputs
- Improves probability calibration for better uncertainty estimation
- Uses validation data to learn optimal temperature parameter
- Provides well-calibrated confidence scores
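Temperature scaling divides the logits by a single learned scalar T before the sigmoid. A minimal sketch using grid search over T (the pipeline itself may fit T by gradient descent on the validation NLL; function names here are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_temperature(logits, labels, grid=np.linspace(0.25, 4.0, 376)):
    """Pick the temperature T minimizing negative log-likelihood of
    sigmoid(logit / T) on held-out validation data."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)
    def nll(T):
        p = np.clip(sigmoid(logits / T), 1e-12, 1 - 1e-12)
        return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return min(grid, key=nll)

def calibrate(logits, T):
    """Apply the learned temperature to raw logits."""
    return sigmoid(np.asarray(logits, dtype=float) / T)
```

For an overconfident model, the fitted T exceeds 1, pulling probabilities toward 0.5 without changing the ranking (so AUROC is unaffected while the Brier score improves).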
- Molecular Features: ECFP4 fingerprints (2048 dimensions)
- Protein Features: ESM2-t6-8M embeddings (320 dimensions)
- Optimization: AdamW optimizer with gradient clipping
- Training: Multi-task learning with classification and regression heads
- Evaluation: Comprehensive metrics including AUROC, AUPRC, and calibration scores
- Python 3.8 or higher
- CUDA-compatible GPU (optional, CPU training supported)
- Minimum 8GB RAM (16GB recommended)
Clone the repository:
git clone https://github.com/arishs24/hacknation2025.git
cd hacknation2025
Install dependencies:
pip install -r requirements.txt
Verify installation:
python -c "import torch, rdkit, transformers; print('Dependencies installed successfully')"
Train the complete pipeline:
python train.py
Test trained models:
python test_final_models.py
Run interactive predictions:
python predict.py
The system expects CSV files with the following columns:
- target_entry: Protein identifier
- sequence: Amino acid sequence
- smiles: Ligand SMILES string
- px: Binding affinity (negative log10 molar)
- label: Binary classification (0 or 1)
Default training files:
bindingdb_kinase_top10_train.csv
bindingdb_kinase_top10_val.csv
bindingdb_kinase_top10_test.csv
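A quick schema check before training can catch malformed input early (a minimal sketch; `validate_binding_csv` is an illustrative helper using the column names listed above):

```python
import pandas as pd

REQUIRED_COLUMNS = {"target_entry", "sequence", "smiles", "px", "label"}

def validate_binding_csv(path: str) -> pd.DataFrame:
    """Load a binding CSV and verify it matches the expected schema."""
    df = pd.read_csv(path)
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"{path} is missing columns: {sorted(missing)}")
    if not df["label"].isin([0, 1]).all():
        raise ValueError("label column must contain only 0 or 1")
    return df
```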
Machine Learning:
torch>=1.12.0
torchvision>=0.13.0
transformers>=4.20.0
scikit-learn>=1.1.0
Chemistry and Molecular Processing:
rdkit>=2022.03.0
Data Processing and Visualization:
pandas>=1.4.0
numpy>=1.21.0
matplotlib>=3.5.0
seaborn>=0.11.0
Utility Libraries:
tqdm>=4.64.0
requirements.txt (Complete dependency list):
torch>=1.12.0
torchvision>=0.13.0
transformers>=4.20.0
scikit-learn>=1.1.0
rdkit>=2022.03.0
pandas>=1.4.0
numpy>=1.21.0
matplotlib>=3.5.0
seaborn>=0.11.0
tqdm>=4.64.0
Alternative installation methods:
For conda users:
conda create -n binding-prediction python=3.9
conda activate binding-prediction
conda install pytorch torchvision -c pytorch
conda install rdkit -c conda-forge
pip install transformers scikit-learn pandas matplotlib seaborn tqdm
For Docker users:
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "train.py"]
Minimum Requirements:
- CPU: 4 cores
- RAM: 8GB
- Storage: 5GB free space
- GPU: Optional (CUDA-compatible)
Recommended Requirements:
- CPU: 8+ cores
- RAM: 16GB
- Storage: 10GB free space
- GPU: NVIDIA GPU with 4GB+ VRAM
CPU Training:
- ESM2 protein embedding: 2-5 seconds per sequence
- Model training: 30-60 minutes for full pipeline
- Memory usage: 4-8GB
GPU Training:
- ESM2 protein embedding: 0.5-1 second per sequence
- Model training: 10-20 minutes for full pipeline
- VRAM usage: 2-4GB
hacknation2025/
├── train.py                  # Main training pipeline
├── step1compare.py           # Baseline model comparison
├── predict.py                # Prediction interface
├── test_final_models.py      # Model evaluation
├── data.ipynb                # Data exploration notebook
├── requirements.txt          # Python dependencies
├── README.md                 # Project documentation
├── saved_models/             # Trained model artifacts
│   ├── baseline_logreg.pkl      # Baseline logistic regression
│   ├── fusion_mlp.pth           # Fusion neural network
│   ├── temperature_scaler.pth   # Calibration model
│   └── protein_cache.pkl        # Cached protein embeddings
├── lightning_logs/           # Training logs and metrics
└── __pycache__/              # Python cache files
Edit parameters in train.py:
# Model architecture
BATCH_SIZE = 32
LR = 2e-3
EPOCHS = 32
PROT_MODEL_NAME = "facebook/esm2_t6_8M_UR50D"
# Feature generation
ECFP_BITS = 2048
ECFP_RADIUS = 2
# Training options
EARLY_STOPPING_PATIENCE = 5
PLOT_TRAINING_CURVES = True
USE_STEP1_COMPARE = True
# Pre-trained model loading
LOAD_PRETRAINED = False
LOAD_BASELINE = True
LOAD_FUSION = True
LOAD_SCALER = True
LOAD_PROTEIN_CACHE = True
FINE_TUNE_MODE = False
# Train complete pipeline with default settings
python train.py
# Comprehensive evaluation on test set
python test_final_models.py
# Interactive prediction script
python predict.py
# Or programmatically:
from predict import predict_binding
sequence = "MGSNKSKPKDAS..." # Protein sequence
smiles = "CCN(CC)CCN(C)C(=O)..." # SMILES string
result = predict_binding(sequence, smiles)
print(f"Binding probability: {result['calibrated_probability']:.3f}")
print(f"Predicted px: {result['predicted_px']:.2f}")
import pandas as pd
from predict import batch_predict_binding
# Load data
df = pd.read_csv('new_compounds.csv')
# Batch predictions
results = batch_predict_binding(
    sequences=df['sequence'].tolist(),
    smiles=df['smiles'].tolist()
)
# Save results
df['predicted_probability'] = [r['calibrated_probability'] for r in results]
df['predicted_px'] = [r['predicted_px'] for r in results]
df.to_csv('predictions.csv', index=False)
The system reports comprehensive evaluation metrics:
Classification Metrics:
- AUROC (Area Under ROC Curve)
- AUPRC (Area Under Precision-Recall Curve)
- Accuracy and F1-score
- Brier Score (calibration quality)
Regression Metrics:
- Root Mean Square Error (RMSE)
- Mean Absolute Error (MAE)
- R-squared (R²)
- Pearson correlation coefficient
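Most of these metrics are one-liners with scikit-learn and NumPy; a minimal sketch (the pipeline's own evaluation may also report accuracy, F1, and MAE, and the `evaluate` name is illustrative):

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             brier_score_loss, r2_score)

def evaluate(labels, probs, px_true, px_pred):
    """Compute headline classification and regression metrics."""
    px_true = np.asarray(px_true, dtype=float)
    px_pred = np.asarray(px_pred, dtype=float)
    return {
        "auroc": roc_auc_score(labels, probs),
        "auprc": average_precision_score(labels, probs),
        "brier": brier_score_loss(labels, probs),
        "rmse": float(np.sqrt(np.mean((px_true - px_pred) ** 2))),
        "r2": r2_score(px_true, px_pred),
        "pearson": float(np.corrcoef(px_true, px_pred)[0, 1]),
    }
```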
Typical Performance:
- Baseline AUROC: 0.75-0.85
- Fusion AUROC: 0.80-0.90
- px prediction R²: 0.60-0.80
- Calibration improvement: 10-20% better Brier scores
Installation Problems:
# RDKit installation issues
conda install rdkit -c conda-forge
# PyTorch CUDA compatibility
pip install torch --index-url https://download.pytorch.org/whl/cu118
Memory Issues:
# Reduce batch size in train.py
BATCH_SIZE = 16 # or 8 for limited memory
# Use CPU-only mode
DEVICE = "cpu"
Data Format Errors:
- Ensure CSV files have required columns
- Check SMILES validity using RDKit
- Verify protein sequences contain only standard amino acids
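The sequence check in the last bullet is a one-liner (SMILES validity is checked the same way the pipeline parses ligands: RDKit's `Chem.MolFromSmiles` returns `None` for unparseable input). A minimal sketch with an illustrative helper name:

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids

def is_valid_sequence(seq: str) -> bool:
    """True if seq is non-empty and uses only standard amino acid letters."""
    return bool(seq) and set(seq.upper()) <= STANDARD_AA
```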
Speed Improvements:
- Use pre-computed protein embeddings (enabled by default)
- Enable GPU acceleration if available
- Use smaller ESM2 model for faster embedding
Memory Optimization:
- Reduce batch size for limited RAM
- Clear protein cache periodically for large datasets
- Use gradient checkpointing for very large models
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is released under the MIT License. See LICENSE file for details.
- BindingDB: Comprehensive binding affinity database
- Meta AI: ESM2 protein language model
- RDKit: Open-source molecular informatics toolkit
- PyTorch: Deep learning framework
- scikit-learn: Machine learning library
HackNation 2025 Project | Advanced Protein-Ligand Binding Prediction