SOPE-MsL: Synergy-Optimized PLM Embeddings with Multi-Scale Learning for Interpretable Protein–Small Molecule Binding Site Prediction

SOPE-MsL (Synergy-Optimized PLM Embeddings with Multi-Scale Learning) is a deep learning framework for interpretable prediction of protein–small molecule binding sites. It identifies complementary protein language model (PLM) embeddings, fuses them, and processes the result with a multi-scale CNN and attention architecture to produce robust, explainable predictions.

SOPE-MsL uses two large pre-trained PLMs, ProstT5 (link) and Ankh (link), to generate residue-level embeddings. The embeddings are extracted with PyTorch and the Hugging Face transformers library.

To select the embedding source, we systematically compared several widely used PLMs, including ESM-2 (link), ProtT5 (link), ESM-1b (link), and ProtBERT (link). In this comparison, the combination of Ankh (link) and ProstT5 (link) consistently outperformed both single-model embeddings and cross-category combinations, and was therefore selected as the embedding fusion strategy for downstream tasks. By integrating these PLM embeddings with a multi-scale graph learning architecture, SOPE-MsL provides a robust and interpretable solution for predicting protein–small molecule interaction sites.
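
The fusion step can be pictured as joining the two per-residue embedding matrices along the feature axis. The sketch below assumes plain concatenation, which is an illustration only and not necessarily how SOPE-MsL fuses the embeddings; the function name `fuse_embeddings` and the toy array sizes are hypothetical.

```python
import numpy as np


def fuse_embeddings(ankh_emb: np.ndarray, prostt5_emb: np.ndarray) -> np.ndarray:
    """Concatenate two per-residue embedding matrices along the feature axis.

    Both inputs must have shape (sequence_length, embedding_dim) and describe
    the same protein, so their first dimensions must match.
    """
    if ankh_emb.ndim != 2 or prostt5_emb.ndim != 2:
        raise ValueError("expected 2-D arrays of shape (length, dim)")
    if ankh_emb.shape[0] != prostt5_emb.shape[0]:
        raise ValueError(
            f"residue count mismatch: {ankh_emb.shape[0]} vs {prostt5_emb.shape[0]}"
        )
    return np.concatenate([ankh_emb, prostt5_emb], axis=1)


# Toy example with made-up dimensions (assumption: not the real model sizes).
rng = np.random.default_rng(0)
ankh = rng.normal(size=(5, 8))
prostt5 = rng.normal(size=(5, 6))
fused = fuse_embeddings(ankh, prostt5)
print(fused.shape)  # (5, 14)
```

Concatenation keeps every feature from both models and leaves it to the downstream network to weigh them; the length check guards against pairing embeddings that were computed for different sequences.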

1. System Requirements

The source code was developed in Python 3.10 using PyTorch 2.6.0 with CUDA 11.8 support.
The tested environment and package versions are listed below:

Python: 3.10
PyTorch: 2.6.0+cu118
Torchvision: (match PyTorch version if needed)
Torchaudio: 2.6.0+cu118
CUDA: 11.8 (recommended)
Transformers: (match your project, e.g., 4.46.x)
SentencePiece: 0.1.99
fair-esm: 2.0.0
scikit-learn: 1.6.1
pandas: 2.2.3
matplotlib: 3.10.5
shap: 0.48.0
datasets: 2.21.0
numpy: 2.2.4
scipy: 1.15.2
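
Before running the scripts, it can help to compare the installed packages against the pinned versions. The following is a minimal sketch using only the standard library; the helper `check_versions` is hypothetical and not part of the repository, and the package names are assumed to be the usual PyPI distribution names.

```python
from importlib import metadata

# Pinned versions copied from the list above.
# Assumption: these keys are the PyPI distribution names.
PINNED = {
    "torch": "2.6.0+cu118",
    "torchaudio": "2.6.0+cu118",
    "sentencepiece": "0.1.99",
    "fair-esm": "2.0.0",
    "scikit-learn": "1.6.1",
    "pandas": "2.2.3",
    "matplotlib": "3.10.5",
    "shap": "0.48.0",
    "datasets": "2.21.0",
    "numpy": "2.2.4",
    "scipy": "1.15.2",
}


def check_versions(pinned: dict[str, str]) -> dict[str, str]:
    """Return a status per package: 'ok', 'missing', or 'mismatch (found X)'."""
    report = {}
    for name, wanted in pinned.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if found == wanted else f"mismatch (found {found})"
    return report


if __name__ == "__main__":
    for name, status in check_versions(PINNED).items():
        print(f"{name:15s} {PINNED[name]:12s} {status}")
```

A mismatch is not necessarily a problem, but it is a useful first thing to check if results differ from those reported.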

2. Datasets

SOPE-MsL uses three benchmark datasets for protein–small molecule binding residue prediction:

SMB: A curated dataset of protein–small molecule binding sites.
SJC: A dataset collected from [source/reference].
UniProtSMB: Binding site annotations derived from UniProt entries.

All datasets are preprocessed into residue-level formats suitable for embedding extraction and model training. Users can also apply the same preprocessing steps to their own protein sequences.
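
As an illustration of what a residue-level format can look like, the sketch below parses records made of three lines: a `>id` header, the amino-acid sequence, and a string of 0/1 labels with one label per residue. This layout is an assumption chosen for the example, not a description of the files shipped with SOPE-MsL; `Record` and `parse_records` are hypothetical names.

```python
from typing import Iterator, NamedTuple


class Record(NamedTuple):
    protein_id: str
    sequence: str
    labels: list[int]  # 1 = binding residue, 0 = non-binding


def parse_records(lines: list[str]) -> Iterator[Record]:
    """Parse records laid out as three lines: '>id', sequence, 0/1 label string.

    This layout is an assumption for illustration; check the files in the
    repository for the actual format.
    """
    cleaned = [ln.strip() for ln in lines if ln.strip()]
    if len(cleaned) % 3 != 0:
        raise ValueError("expected groups of three non-empty lines")
    for i in range(0, len(cleaned), 3):
        header, seq, lab = cleaned[i : i + 3]
        if not header.startswith(">"):
            raise ValueError(f"bad header: {header!r}")
        # Every residue needs exactly one label, and labels must be 0 or 1.
        if len(seq) != len(lab) or set(lab) - {"0", "1"}:
            raise ValueError(f"labels do not match sequence for {header!r}")
        yield Record(header[1:], seq, [int(c) for c in lab])


example = """\
>P12345
MKTAYIAK
00110000
""".splitlines()

records = list(parse_records(example))
print(records[0].protein_id, sum(records[0].labels))  # P12345 2
```

Checking that the label string and the sequence have the same length catches misaligned annotations early, before they reach embedding extraction or training.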

3. How to use

1. Extract PLM embeddings

Navigate to the SOPE-MsL/FeatureExtract directory:

cd path/to/SOPE-MsL/FeatureExtract

Generate residue-level embeddings from the Ankh model; they are saved to the embedding/ankh_embedding folder:

python3 extract_ankh.py

Generate residue-level embeddings from the ProstT5 model; they are saved to the embedding/prostt5_embedding folder:

python3 extract_prostT5.py
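
ProstT5 expects its input in a specific textual form. The sketch below follows the preprocessing described in the ProstT5 model card (rare residues mapped to X, residues separated by spaces, and an `<AA2fold>` prefix for amino-acid input). Whether extract_prostT5.py applies exactly these steps is an assumption; the helper name `prepare_for_prostt5` is hypothetical.

```python
import re


def prepare_for_prostt5(sequence: str) -> str:
    """Format an amino-acid sequence the way the ProstT5 model card describes.

    Rare or ambiguous residues (U, Z, O, B) are mapped to X, residues are
    separated by spaces, and the '<AA2fold>' prefix marks amino-acid input.
    """
    seq = sequence.strip().upper()
    seq = re.sub(r"[UZOB]", "X", seq)
    return "<AA2fold> " + " ".join(seq)


print(prepare_for_prostt5("MKUB"))  # <AA2fold> M K X X
```

The resulting string is what would be passed to the tokenizer; the prefix token occupies one extra position, so it has to be dropped from the model output to keep one embedding per residue.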

2. Train and Test

Navigate to the SOPE-MsL project root directory:

cd path/to/SOPE-MsL

Run the following command to train and test the model on all protein–small molecule binding residue datasets (SMB, SJC, and UniProtSMB):

python3 main.py

3. Prediction / Inference

To perform inference using a trained SOPE-MsL model, run the prediction script Prediction.py from the project root directory:

python3 Prediction.py
