arishs24/hacknation2025

Protein-Ligand Binding Prediction Pipeline

A machine learning pipeline that predicts protein-ligand binding affinity and classifies each protein-ligand pair as binding or non-binding. The system combines molecular fingerprints with protein sequence embeddings in three models: a baseline, a fusion model, and a calibrated version of the fusion model.

Project Description

This project implements a protein-ligand binding prediction system for drug discovery applications. The pipeline takes a protein sequence and a ligand SMILES string and predicts both the binding affinity (px value) and a binary label (active/inactive).

Key Features:

  • Multi-model ensemble approach for robust predictions
  • Molecular fingerprint generation using ECFP4 (Extended Connectivity Fingerprints)
  • Protein sequence embedding using ESM2 transformer model
  • Temperature scaling for probability calibration
  • Comprehensive evaluation with multiple metrics
  • Modular architecture supporting both training and inference

Scientific Background: The system predicts binding affinity as px, the negative base-10 logarithm of the affinity expressed as a molar concentration. Higher px values indicate stronger binding: px = 7.0 corresponds to 10^-7 M, i.e. a 100 nM binding affinity.
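The px definition above can be made concrete with a small conversion helper. This is a minimal sketch; the function names px_from_nanomolar and nanomolar_from_px are illustrative and are not taken from the repository.

```python
import math


def px_from_nanomolar(affinity_nm: float) -> float:
    """Convert an affinity in nM to px = -log10(affinity in M)."""
    if affinity_nm <= 0:
        raise ValueError("affinity must be positive")
    return -math.log10(affinity_nm * 1e-9)


def nanomolar_from_px(px: float) -> float:
    """Inverse conversion: px back to an affinity in nM."""
    return 10.0 ** (-px) * 1e9


print(round(px_from_nanomolar(100.0), 2))  # 100 nM -> 7.0
print(round(nanomolar_from_px(9.0), 3))    # px 9 -> 1.0 nM
```

Each unit of px is a tenfold change in concentration, so px = 9 (1 nM) is 100 times tighter than px = 7 (100 nM).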

Architecture Overview

Three-Model Pipeline

  1. Baseline Model: ECFP4 molecular fingerprints processed through logistic regression

    • Fast inference and interpretable results
    • Uses 2048-bit molecular fingerprints with radius 2
    • Serves as performance baseline and fallback option
  2. Fusion Model: Combined molecular and protein features through neural network

    • Concatenates ECFP4 fingerprints with ESM2 protein embeddings
    • Multi-layer perceptron with dropout regularization
    • Simultaneous binary classification and regression prediction
  3. Calibrated Model: Temperature scaling applied to fusion model outputs

    • Improves probability calibration for better uncertainty estimation
    • Uses validation data to learn optimal temperature parameter
    • Provides well-calibrated confidence scores
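The calibration step can be illustrated with a small, dependency-free sketch of temperature scaling for a binary classifier: the logit is divided by a single scalar T chosen to minimize the negative log-likelihood on validation data. The names sigmoid, nll and fit_temperature are hypothetical, and the grid search is an assumption made for brevity; the repository may well fit T with a gradient-based optimizer instead.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def nll(logits, labels, temperature):
    """Mean binary negative log-likelihood of sigmoid(logit / T)."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z / temperature), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)


def fit_temperature(logits, labels, lo=0.05, hi=20.0, steps=400):
    """Pick T from a log-spaced grid (assumed search strategy)."""
    best_t, best_loss = 1.0, nll(logits, labels, 1.0)
    for i in range(steps + 1):
        t = lo * (hi / lo) ** (i / steps)
        loss = nll(logits, labels, t)
        if loss < best_loss:
            best_t, best_loss = t, loss
    return best_t


# Made-up validation logits: very confident, but only 75% correct.
val_logits = [4.0, 4.0, 4.0, 4.0, -4.0, -4.0, -4.0, -4.0]
val_labels = [1, 1, 1, 0, 0, 0, 0, 1]

temperature = fit_temperature(val_logits, val_labels)
print(f"T = {temperature:.2f}")  # T > 1 softens overconfident outputs
print(f"calibrated p = {sigmoid(4.0 / temperature):.2f}")
```

Because dividing by a positive T never changes the ranking of the logits, this step leaves AUROC untouched and only affects how trustworthy the probabilities are.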

Technical Implementation

  • Molecular Features: ECFP4 fingerprints (2048 dimensions)
  • Protein Features: ESM2-t6-8M embeddings (320 dimensions)
  • Optimization: AdamW optimizer with gradient clipping
  • Training: Multi-task learning with classification and regression heads
  • Evaluation: Comprehensive metrics including AUROC, AUPRC, and calibration scores
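As a rough illustration of the fusion input and the multi-task objective listed above, the sketch below concatenates the two feature vectors and adds a classification loss to a regression loss. The function names and the reg_weight parameter are assumptions for illustration; the actual loss weighting used in train.py is not documented here.

```python
import math

ECFP_BITS = 2048  # fingerprint length stated above
PROT_DIM = 320    # ESM2-t6-8M embedding size stated above


def fuse_features(ecfp, protein_embedding):
    """Concatenate ligand and protein features into one input vector."""
    if len(ecfp) != ECFP_BITS or len(protein_embedding) != PROT_DIM:
        raise ValueError("unexpected feature dimensions")
    return list(ecfp) + list(protein_embedding)


def multitask_loss(logit, label, px_pred, px_true, reg_weight=1.0):
    """Binary cross-entropy on the logit plus weighted squared error on px.

    reg_weight is an assumed hyperparameter, not taken from the repository.
    """
    # Stable BCE-with-logits: max(z, 0) - z*y + log(1 + exp(-|z|))
    bce = max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))
    squared_error = (px_pred - px_true) ** 2
    return bce + reg_weight * squared_error


fused = fuse_features([0] * ECFP_BITS, [0.0] * PROT_DIM)
print(len(fused))  # 2048 + 320 = 2368 input features
print(round(multitask_loss(0.0, 1, 6.0, 7.0), 4))  # ln(2) + 1 = 1.6931
```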

Setup Instructions

Prerequisites

  • Python 3.8 or higher
  • CUDA-compatible GPU (optional, CPU training supported)
  • Minimum 8GB RAM (16GB recommended)

Installation

  1. Clone the repository:

    git clone https://github.com/arishs24/hacknation2025.git
    cd hacknation2025
  2. Install dependencies:

    pip install -r requirements.txt
  3. Verify installation:

    python -c "import torch, rdkit, transformers; print('Dependencies installed successfully')"

Quick Start

  1. Train the complete pipeline:

    python train.py
  2. Test trained models:

    python test_final_models.py
  3. Run interactive predictions:

    python predict.py

Data Preparation

The system expects CSV files with the following columns:

  • target_entry: Protein identifier
  • sequence: Amino acid sequence
  • smiles: Ligand SMILES string
  • px: Binding affinity (negative log10 molar)
  • label: Binary classification (0 or 1)
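A quick way to check a file against this schema is sketched below using only the standard library. The helper name load_binding_rows is hypothetical, and the example row is made up for illustration (it is not real BindingDB data).

```python
import csv
import io

REQUIRED_COLUMNS = ("target_entry", "sequence", "smiles", "px", "label")


def load_binding_rows(handle):
    """Read rows from an open CSV file and coerce px/label to numbers."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    rows = []
    for row in reader:
        row["px"] = float(row["px"])
        row["label"] = int(row["label"])
        if row["label"] not in (0, 1):
            raise ValueError("label must be 0 or 1")
        rows.append(row)
    return rows


# Made-up example row, for illustration only.
example = io.StringIO(
    "target_entry,sequence,smiles,px,label\n"
    "EXAMPLE_TARGET,MKTAYIAK,CCO,6.5,1\n"
)
rows = load_binding_rows(example)
print(rows[0]["px"], rows[0]["label"])  # 6.5 1
```

For a file on disk, pass open(path, newline="") as the handle.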

Default training files:

  • bindingdb_kinase_top10_train.csv
  • bindingdb_kinase_top10_val.csv
  • bindingdb_kinase_top10_test.csv

Dependencies and Environment

Core Dependencies

Machine Learning:

torch>=1.12.0
torchvision>=0.13.0
transformers>=4.20.0
scikit-learn>=1.1.0

Chemistry and Molecular Processing:

rdkit>=2022.03.0

Data Processing and Visualization:

pandas>=1.4.0
numpy>=1.21.0
matplotlib>=3.5.0
seaborn>=0.11.0

Utility Libraries:

tqdm>=4.64.0

Environment Files

requirements.txt (Complete dependency list):

torch>=1.12.0
torchvision>=0.13.0
transformers>=4.20.0
scikit-learn>=1.1.0
rdkit>=2022.03.0
pandas>=1.4.0
numpy>=1.21.0
matplotlib>=3.5.0
seaborn>=0.11.0
tqdm>=4.64.0

Alternative installation methods:

For conda users:

conda create -n binding-prediction python=3.9
conda activate binding-prediction
conda install pytorch torchvision -c pytorch
conda install rdkit -c conda-forge
pip install transformers scikit-learn pandas matplotlib seaborn tqdm

For Docker users:

FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "train.py"]

System Requirements

Minimum Requirements:

  • CPU: 4 cores
  • RAM: 8GB
  • Storage: 5GB free space
  • GPU: Optional (CUDA-compatible)

Recommended Requirements:

  • CPU: 8+ cores
  • RAM: 16GB
  • Storage: 10GB free space
  • GPU: NVIDIA GPU with 4GB+ VRAM

Hardware Considerations

CPU Training:

  • ESM2 protein embedding: 2-5 seconds per sequence
  • Model training: 30-60 minutes for full pipeline
  • Memory usage: 4-8GB

GPU Training:

  • ESM2 protein embedding: 0.5-1 second per sequence
  • Model training: 10-20 minutes for full pipeline
  • VRAM usage: 2-4GB

Project Structure

hacknation2025/
├── train.py                           # Main training pipeline
├── step1compare.py                    # Baseline model comparison
├── predict.py                         # Prediction interface
├── test_final_models.py              # Model evaluation
├── data.ipynb                         # Data exploration notebook
├── requirements.txt                   # Python dependencies
├── README.md                          # Project documentation
├── saved_models/                      # Trained model artifacts
│   ├── baseline_logreg.pkl           # Baseline logistic regression
│   ├── fusion_mlp.pth                # Fusion neural network
│   ├── temperature_scaler.pth        # Calibration model
│   └── protein_cache.pkl             # Cached protein embeddings
├── lightning_logs/                    # Training logs and metrics
└── __pycache__/                      # Python cache files

Configuration Options

Training Configuration

Edit parameters in train.py:

# Training hyperparameters and protein encoder
BATCH_SIZE = 32
LR = 2e-3
EPOCHS = 32
PROT_MODEL_NAME = "facebook/esm2_t6_8M_UR50D"

# Feature generation
ECFP_BITS = 2048
ECFP_RADIUS = 2

# Training options
EARLY_STOPPING_PATIENCE = 5
PLOT_TRAINING_CURVES = True
USE_STEP1_COMPARE = True
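To show what a patience setting such as EARLY_STOPPING_PATIENCE typically controls, here is a minimal sketch of patience-based early stopping. It assumes the monitored quantity is a validation loss where lower is better; the class below is illustrative and is not the repository's implementation.

```python
class EarlyStopping:
    """Stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=5)
# Made-up validation losses: the best value is reached at epoch 3.
val_losses = [0.70, 0.62, 0.60, 0.61, 0.63, 0.60, 0.64, 0.65, 0.66]
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        print(f"stopping after epoch {epoch}")
        break
```

With patience 5, training here halts five epochs after the last improvement, so the final epoch in the list is never run.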

Model Loading Options

# Pre-trained model loading
LOAD_PRETRAINED = False
LOAD_BASELINE = True
LOAD_FUSION = True
LOAD_SCALER = True
LOAD_PROTEIN_CACHE = True
FINE_TUNE_MODE = False

Usage Examples

Basic Training

# Train complete pipeline with default settings
python train.py

Model Evaluation

# Comprehensive evaluation on test set
python test_final_models.py

Making Predictions

# Interactive prediction script (run from the shell)
python predict.py

# Or programmatically, from Python:
from predict import predict_binding

sequence = "MGSNKSKPKDAS..."  # Protein sequence
smiles = "CCN(CC)CCN(C)C(=O)..."  # SMILES string

result = predict_binding(sequence, smiles)
print(f"Binding probability: {result['calibrated_probability']:.3f}")
print(f"Predicted px: {result['predicted_px']:.2f}")

Batch Processing

import pandas as pd
from predict import batch_predict_binding

# Load data
df = pd.read_csv('new_compounds.csv')

# Batch predictions
results = batch_predict_binding(
    sequences=df['sequence'].tolist(),
    smiles=df['smiles'].tolist()
)

# Save results
df['predicted_probability'] = [r['calibrated_probability'] for r in results]
df['predicted_px'] = [r['predicted_px'] for r in results]
df.to_csv('predictions.csv', index=False)

Performance Metrics

The system reports comprehensive evaluation metrics:

Classification Metrics:

  • AUROC (Area Under ROC Curve)
  • AUPRC (Area Under Precision-Recall Curve)
  • Accuracy and F1-score
  • Brier Score (calibration quality)
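For readers unfamiliar with two of these metrics, the sketch below computes the Brier score and AUROC from scratch on made-up predictions. It is only meant to show what the numbers measure; the repository presumably relies on scikit-learn for the actual evaluation.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def auroc(scores, labels):
    """Probability that a random positive outranks a random negative.

    Ties count as half. Runs in O(n_pos * n_neg), fine for small examples.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up predictions, for illustration only.
probs = [0.9, 0.8, 0.3, 0.6, 0.2]
labels = [1, 1, 1, 0, 0]
print(round(brier_score(probs, labels), 3))  # 0.188
print(round(auroc(probs, labels), 3))        # 0.833
```

A lower Brier score is better (0 is perfect), while a higher AUROC is better (0.5 is chance level).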

Regression Metrics:

  • Root Mean Square Error (RMSE)
  • Mean Absolute Error (MAE)
  • R-squared (R²)
  • Pearson correlation coefficient
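The four regression metrics can be written out in a few lines, which makes their definitions explicit. This is a standard-library sketch on made-up px values; regression_metrics is a hypothetical helper, not a function from the repository.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, R-squared and Pearson r for paired values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
        "pearson": cov / math.sqrt(ss_tot * var_p),
    }


# Made-up px values, for illustration only.
true_px = [5.0, 6.0, 7.0, 8.0]
pred_px = [5.5, 6.0, 6.5, 8.0]
metrics = regression_metrics(true_px, pred_px)
print({name: round(value, 3) for name, value in metrics.items()})
```

Note that R-squared and Pearson r answer different questions: R-squared penalizes systematic offsets in the predictions, whereas Pearson r only measures linear association.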

Typical Performance:

  • Baseline AUROC: 0.75-0.85
  • Fusion AUROC: 0.80-0.90
  • px prediction R²: 0.60-0.80
  • Calibration improvement: 10-20% lower Brier score after temperature scaling (lower is better)

Troubleshooting

Common Issues

Installation Problems:

# RDKit installation issues
conda install rdkit -c conda-forge

# PyTorch CUDA compatibility
pip install torch --index-url https://download.pytorch.org/whl/cu118

Memory Issues:

# Reduce batch size in train.py
BATCH_SIZE = 16  # or 8 for limited memory

# Fall back to CPU-only mode if GPU memory is exhausted
DEVICE = "cpu"

Data Format Errors:

  • Ensure CSV files have required columns
  • Check SMILES validity using RDKit
  • Verify protein sequences contain only standard amino acids
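The sequence check in the last item needs no external library. A minimal sketch, assuming "standard" means the 20 canonical one-letter residue codes (the helper name invalid_residues is illustrative):

```python
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def invalid_residues(sequence):
    """Return the sorted set of characters that are not standard residues."""
    return sorted(set(sequence.upper()) - STANDARD_AMINO_ACIDS)


print(invalid_residues("MGSNKSKPKDAS"))  # []
print(invalid_residues("MGXSNKB"))       # ['B', 'X']
```

Rows with a non-empty result can be dropped or inspected before training. Ambiguity codes such as X, B and Z are the usual offenders.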

Performance Optimization

Speed Improvements:

  • Use pre-computed protein embeddings (enabled by default)
  • Enable GPU acceleration if available
  • Keep the default ESM2-t6-8M model for fast embedding; it is the smallest ESM2 checkpoint, and larger checkpoints embed more slowly
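The benefit of pre-computed embeddings comes from embedding each distinct protein sequence only once. The sketch below shows the general idea with a dictionary keyed by sequence and persisted with pickle. All function names are hypothetical, fake_embed stands in for the real ESM2 call, and the layout of the repository's protein_cache.pkl may differ.

```python
import pickle


def get_embedding(sequence, cache, embed_fn):
    """Return a cached embedding, computing and storing it on a miss."""
    if sequence not in cache:
        cache[sequence] = embed_fn(sequence)
    return cache[sequence]


def save_cache(cache, path):
    """Persist the cache so later runs can skip the embedding step."""
    with open(path, "wb") as fh:
        pickle.dump(cache, fh)


def load_cache(path):
    """Load a saved cache, or start empty if the file does not exist."""
    try:
        with open(path, "rb") as fh:
            return pickle.load(fh)
    except FileNotFoundError:
        return {}


# Stand-in for the real ESM2 call so the sketch runs without a model.
calls = []


def fake_embed(seq):
    calls.append(seq)
    return [float(len(seq))] * 320


cache = {}
get_embedding("MKTAYIAK", cache, fake_embed)
get_embedding("MKTAYIAK", cache, fake_embed)  # served from the cache
print(len(calls))  # 1
```

This pays off most on datasets with few distinct targets and many ligands per target, where the same sequence appears in many rows.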

Memory Optimization:

  • Reduce batch size for limited RAM
  • Clear protein cache periodically for large datasets
  • Use gradient checkpointing for very large models

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is released under the MIT License. See the LICENSE file for details.

Acknowledgments

  • BindingDB: Comprehensive binding affinity database
  • Meta AI: ESM2 protein language model
  • RDKit: Open-source molecular informatics toolkit
  • PyTorch: Deep learning framework
  • scikit-learn: Machine learning library

HackNation 2025 Project | Advanced Protein-Ligand Binding Prediction
