This repository contains the experiment code for ProtoCol, which performs ColBERT-style late-interaction retrieval for proteins. The current set of experiments instantiates ProtoCol with ESM-2 35M, along with baselines (trained mean-pool single-vector, MinHash Jaccard, MMseqs2, and frozen mean-pool ESM-2 650M) and a per-query latency benchmark.
- Python 3.10+
- A CUDA GPU is strongly recommended (the script runs on CPU but very slowly).

To set up:

- (Recommended) Create and activate a virtual environment:

      python -m venv .venv
      source .venv/bin/activate  # Windows: .venv\Scripts\activate

- Install the Python dependencies:

      pip install -r requirements.txt

- Install MMseqs2 (needed for the MMseqs2 baseline). It is a system binary, not a pip package:

      sudo apt-get install -y mmseqs2

  or, with conda:

      conda install -c bioconda mmseqs2

  The script also tries to `apt-get install` it automatically if `mmseqs` is not on your `PATH`, but installing it ahead of time avoids needing sudo at run time.

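To confirm ahead of time that the MMseqs2 binary is visible on your `PATH`, a check along these lines may help. This is a minimal sketch using only the standard library; the helper name `find_mmseqs` is hypothetical and is not part of `protocol_experiments.py`.

```python
import shutil
from typing import Optional


def find_mmseqs(binary: str = "mmseqs") -> Optional[str]:
    """Return the full path of the MMseqs2 binary, or None if it is not on PATH.

    Hypothetical helper for illustration; the experiment script may perform
    its own check differently.
    """
    return shutil.which(binary)


if __name__ == "__main__":
    path = find_mmseqs()
    if path is None:
        print("mmseqs not found on PATH; install it with apt or conda first.")
    else:
        print(f"Found mmseqs at {path}")
```

If this reports that `mmseqs` is missing, install it with one of the commands above before starting a run.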
The datasets (SCOPe / Pfam) are downloaded automatically on first run via
curl, so an internet connection is required the first time. Model weights are
pulled from the Hugging Face Hub on first use.
The dataset is selected by the `DATASET` constant near the top of
`protocol_experiments.py`:

    DATASET = "scope"  # "scope" or "pfam"

To run on SCOPe, make sure the constant is set to "scope" (this is the default), then run:

    python protocol_experiments.py

To run on Pfam, edit the constant to "pfam":

    DATASET = "pfam"

then run:

    python protocol_experiments.py

To run both back-to-back without manually editing the file, you can override the constant per run from the shell:

    # SCOPe
    python -c "import protocol_experiments as m; m.DATASET='scope'; m.main()"
    # Pfam
    python -c "import protocol_experiments as m; m.DATASET='pfam'; m.main()"

Each run:

- Downloads and parses the dataset (SCOPe superfamilies or Pfam clans as homology labels). This repo also contains a local copy of the SCOPe dataset in case the server is down.
- Splits into train/test with disjoint groups and subsamples the test set (capped at `MAX_TEST_PROTEINS = 3000`) for tractable evaluation.
- Fine-tunes the last 3 layers of ESM-2 35M with ColBERT MaxSim + InfoNCE (3 epochs for SCOPe, 1 epoch for Pfam).
- Evaluates retrieval Recall@{1, 5, 10, 100} for ColBERT (untrained and trained) and all baselines.
- Runs a median per-query latency benchmark across methods.
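
For readers unfamiliar with the MaxSim scoring and Recall@k evaluation named in the steps above, here is a toy sketch in plain Python. It is not the repository's implementation (which presumably operates on batched GPU tensors); the function names are hypothetical, and the Recall@k definition used here (a query counts as a hit if at least one same-label item appears in its top k, with the query itself excluded) is an assumption about what the script measures.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query: Sequence[Vector], doc: Sequence[Vector]) -> float:
    """ColBERT-style late interaction score.

    For each query token embedding, take its best dot product against all
    document token embeddings, then sum those maxima over the query tokens.
    """
    return sum(max(dot(q, d) for d in doc) for q in query)


def recall_at_k(scores: List[List[float]], labels: List[str], k: int) -> float:
    """Fraction of queries with at least one same-label item in the top k.

    scores[i][j] is the score of item j for query i. The query itself
    (j == i) is excluded from its own ranking. This hit-rate definition is
    an assumption, not necessarily the script's exact metric.
    """
    hits = 0
    for i, row in enumerate(scores):
        candidates = [j for j in range(len(row)) if j != i]
        ranked = sorted(candidates, key=lambda j: row[j], reverse=True)
        if any(labels[j] == labels[i] for j in ranked[:k]):
            hits += 1
    return hits / len(scores)


if __name__ == "__main__":
    # Three toy "proteins", each a short list of 2-d token embeddings.
    proteins = [
        [[1.0, 0.0], [0.0, 1.0]],
        [[1.0, 0.0], [0.1, 0.9]],
        [[-1.0, 0.0], [0.0, -1.0]],
    ]
    labels = ["fam_a", "fam_a", "fam_b"]
    scores = [[maxsim(q, d) for d in proteins] for q in proteins]
    print(recall_at_k(scores, labels, k=1))
```

Unlike a mean-pooled single-vector baseline, MaxSim keeps one embedding per token, so a query residue can match its best counterpart anywhere in the candidate sequence.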
Everything is printed to stdout. Each run ends with two tables:
- A retrieval summary (Recall@k per method).
- A combined latency + retrieval summary (Recall@k plus median ms/query).
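
The median ms/query figure in the second table can be understood through a sketch like the following. This is an illustrative timing loop, not the script's actual benchmark code; `median_latency_ms` and the `search` callable are hypothetical names.

```python
import statistics
import time
from typing import Callable, Iterable, TypeVar

Q = TypeVar("Q")


def median_latency_ms(search: Callable[[Q], object], queries: Iterable[Q]) -> float:
    """Time each call to search(query) and return the median in milliseconds.

    Assumption: search() blocks until its result is ready. For GPU code,
    the search function would need to synchronize (for example with
    torch.cuda.synchronize()) before returning, or the timings will
    under-report the real cost.
    """
    timings_ms = []
    for query in queries:
        start = time.perf_counter()
        search(query)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings_ms)


if __name__ == "__main__":
    # Stand-in workload: any per-query retrieval function could go here.
    ms = median_latency_ms(lambda q: sum(range(10_000)), range(100))
    print(f"median latency: {ms:.3f} ms/query")
```

Reporting the median rather than the mean keeps a few slow outliers (such as first-call warm-up) from dominating the result.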
Edit the constants at the top of protocol_experiments.py:
| Constant | Default | Meaning |
|---|---|---|
| `DATASET` | `"scope"` | `"scope"` or `"pfam"` |
| `SEED` | `42` | Random seed |
| `MODEL_NAME` | `facebook/esm2_t12_35M_UR50D` | Backbone encoder |
| `ESM_650M` | `facebook/esm2_t33_650M_UR50D` | Large frozen baseline |
| `MAX_LEN` | `256` | Max tokenized sequence length |
| `MAX_TEST_PROTEINS` | `3000` | Test-set size cap |
| `KS` | `(1, 5, 10, 100)` | Recall@k cutoffs |
| `LATENCY_NUM_QUERIES` | `100` | Queries timed in the latency benchmark |
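
As an alternative to editing the file, several constants could be overridden from Python before calling `main()`, extending the shell one-liner pattern shown earlier. The helper below is a hypothetical sketch: it assumes `main()` reads these module-level constants at call time, and any value derived from a constant at import time would not be recomputed.

```python
import types
from typing import Any


def override_constants(module: types.ModuleType, **overrides: Any) -> None:
    """Set module-level constants, refusing names the module does not define.

    Hypothetical helper for illustration; rejecting unknown names guards
    against silent typos such as MAX_TEST_PROTIENS.
    """
    for name, value in overrides.items():
        if not hasattr(module, name):
            raise AttributeError(f"{module.__name__} has no constant {name!r}")
        setattr(module, name, value)


if __name__ == "__main__":
    # Stand-in module so the sketch runs without the repository.
    fake = types.ModuleType("fake_experiments")
    fake.DATASET = "scope"
    fake.MAX_TEST_PROTEINS = 3000
    override_constants(fake, DATASET="pfam", MAX_TEST_PROTEINS=500)
    print(fake.DATASET, fake.MAX_TEST_PROTEINS)

    # With the real script (assumed usage, run from the repository root):
    #   import protocol_experiments as m
    #   override_constants(m, DATASET="pfam", MAX_TEST_PROTEINS=500)
    #   m.main()
```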