SOPE-MsL: Synergy-Optimized PLM Embeddings with Multi-Scale Learning for Interpretable Protein–Small Molecule Binding Site Prediction
SOPE-MsL (Synergy-Optimized PLM Embeddings with Multi-Scale Learning) is a deep learning framework for interpretable prediction of protein–small molecule binding sites. It identifies complementary protein language model (PLM) embeddings, fuses them, and processes the fused representation with a multi-scale CNN and attention architecture to produce robust and explainable predictions.
SOPE-MsL uses two large-scale pre-trained PLMs, ProstT5 (link) and Ankh (link), to generate residue-level embeddings. The embeddings are extracted with PyTorch and the Hugging Face transformers library.
To choose the embedding models, we systematically compared several widely used PLMs, including ESM-2 (link), ProtT5 (link), ESM-1b (link), and ProtBERT (link). In this comparison, the combination of Ankh (link) and ProstT5 (link) consistently outperformed both single-model and cross-category combinations, and was therefore selected as the embedding fusion strategy for downstream tasks. By integrating these PLM embeddings with a multi-scale CNN and attention architecture, SOPE-MsL provides a robust and interpretable solution for predicting protein–small molecule interaction sites.
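As a minimal sketch of what fusing two residue-level embeddings can look like, the example below concatenates the Ankh and ProstT5 vectors of each residue along the feature axis. Concatenation is an assumption made for illustration only; the text above does not state which fusion operation SOPE-MsL applies, and the function name `fuse_embeddings` and the toy values are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # one row per residue, one column per feature


def fuse_embeddings(ankh: Matrix, prostt5: Matrix) -> Matrix:
    """Concatenate two per-residue embedding matrices along the feature axis.

    Assumption: fusion by concatenation. The actual SOPE-MsL fusion
    step may differ.
    """
    if len(ankh) != len(prostt5):
        raise ValueError(
            f"residue count mismatch: {len(ankh)} vs {len(prostt5)}"
        )
    # Row i of the result is the Ankh vector of residue i followed by
    # the ProstT5 vector of the same residue.
    return [a + b for a, b in zip(ankh, prostt5)]


# Toy data: 3 residues, 2 features from one model and 3 from the other.
ankh_toy = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
prostt5_toy = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

fused = fuse_embeddings(ankh_toy, prostt5_toy)
print(len(fused), len(fused[0]))  # 3 5
```

Both matrices must cover the same residues in the same order, which is why the length check runs before any rows are joined.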
The source code was developed in Python 3.10 using PyTorch 2.6.0 with CUDA 11.8 support.
The required Python dependencies are listed below:
- Python: 3.10
- PyTorch: 2.6.0+cu118
- Torchvision: (match PyTorch version if needed)
- Torchaudio: 2.6.0+cu118
- CUDA: 11.8 (recommended)
- Transformers: (match your project, e.g., 4.46.x)
- SentencePiece: 0.1.99
- fair-esm: 2.0.0
- scikit-learn: 1.6.1
- pandas: 2.2.3
- matplotlib: 3.10.5
- shap: 0.48.0
- datasets: 2.21.0
- numpy: 2.2.4
- scipy: 1.15.2
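The pinned versions above can be compared against the active environment with a short standard-library script. This is a sketch, not part of the repository: the helper name `find_mismatches` is hypothetical, and only packages with an exact pin in the list are checked. PyTorch, Torchaudio, Torchvision and Transformers are left out because their reported version strings depend on the installed build or are not pinned exactly.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict

# Exact pins copied from the dependency list above.
REQUIRED: Dict[str, str] = {
    "sentencepiece": "0.1.99",
    "fair-esm": "2.0.0",
    "scikit-learn": "1.6.1",
    "pandas": "2.2.3",
    "matplotlib": "3.10.5",
    "shap": "0.48.0",
    "datasets": "2.21.0",
    "numpy": "2.2.4",
    "scipy": "1.15.2",
}


def find_mismatches(required: Dict[str, str]) -> Dict[str, str]:
    """Return a message for each package that is missing or has another version."""
    problems: Dict[str, str] = {}
    for name, wanted in required.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems[name] = f"not installed (want {wanted})"
            continue
        if installed != wanted:
            problems[name] = f"found {installed} (want {wanted})"
    return problems


if __name__ == "__main__":
    for name, message in find_mismatches(REQUIRED).items():
        print(f"{name}: {message}")
```

An empty output means every checked package matches its pin.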
SOPE-MsL uses three benchmark datasets for protein–small molecule binding residue prediction:
- SMB: A curated dataset of protein–small molecule binding sites.
- SJC: A dataset collected from [source/reference].
- UniProtSMB: Binding site annotations derived from UniProt entries.
All datasets are preprocessed into residue-level formats suitable for embedding extraction and model training. Users can also apply the same preprocessing steps to their own protein sequences.
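To illustrate what a residue-level format means in practice, the sketch below pairs every residue of a sequence with a binary binding label. The input convention (a sequence string plus 1-based binding positions), the function name `residue_labels`, and the example sequence are assumptions for illustration; the actual file layout used by the SOPE-MsL datasets may differ.

```python
from typing import Iterable, List, Tuple


def residue_labels(
    sequence: str, binding_positions: Iterable[int]
) -> List[Tuple[str, int]]:
    """Label each residue 1 if its 1-based position is a binding site, else 0."""
    sites = set(binding_positions)
    out_of_range = sorted(p for p in sites if p < 1 or p > len(sequence))
    if out_of_range:
        raise ValueError(f"positions outside the sequence: {out_of_range}")
    return [
        (amino_acid, 1 if position in sites else 0)
        for position, amino_acid in enumerate(sequence, start=1)
    ]


# Hypothetical 5-residue sequence with binding residues at positions 2 and 5.
example = residue_labels("MKTAY", [2, 5])
print(example)
# [('M', 0), ('K', 1), ('T', 0), ('A', 0), ('Y', 1)]
```

Each labelled residue then lines up with one row of the per-residue embedding matrix, so labels and features share the same index.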
- Change to the SOPE-MsL/FeatureExtract directory.
- Run python3 extract_ankh.py to generate residue-level embeddings from the Ankh model. The embeddings are saved to the embedding/ankh_embedding folder.
- Run python3 extract_prostT5.py to generate residue-level embeddings from the ProstT5 model. The embeddings are saved to the embedding/prostt5_embedding folder.
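Before a sequence reaches either tokenizer, it usually needs light formatting. The helpers below follow the input conventions described on the public ProstT5 and Ankh model cards: ProstT5 expects space-separated residues with the rare amino acids U, Z, O and B mapped to X and an `<AA2fold>` prefix for amino-acid input, while Ankh is typically given a list of single-residue tokens. Whether extract_prostT5.py and extract_ankh.py prepare their inputs exactly this way is an assumption, and both function names are hypothetical.

```python
import re
from typing import List


def format_for_prostt5(sequence: str) -> str:
    """Format an amino-acid sequence as ProstT5 input text.

    Rare residues (U, Z, O, B) are replaced by X, residues are separated
    by spaces, and the "<AA2fold>" prefix marks amino-acid input.
    """
    cleaned = re.sub(r"[UZOB]", "X", sequence.upper())
    return "<AA2fold> " + " ".join(cleaned)


def format_for_ankh(sequence: str) -> List[str]:
    """Split a sequence into single-residue tokens for a pre-split tokenizer call."""
    return list(sequence.upper())


print(format_for_prostt5("MKUA"))  # <AA2fold> M K X A
print(format_for_ankh("MKA"))      # ['M', 'K', 'A']
```

Keeping this step in small pure functions makes it easy to apply the same formatting to user-supplied sequences before running the extraction scripts.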
- Navigate to the SOPE-MsL project root directory:
cd path/to/SOPE-MsL
- Run the following command to train and test the model on all protein–small molecule binding residue datasets (SMB, SJC, and UniProtSMB):
python3 main.py
To perform inference using a trained SOPE-MsL model, run the prediction script Prediction.py from the project root directory:
python3 Prediction.py