# **S-PLM v2: Quickstart**

This notebook is a **usage example** of **S-PLM v2**.

* **Purpose:**
    1. Process PDB structures into the standardized inputs expected by the model.
    2. Generate **protein-level** and **residue-level** embeddings.
    3. Run sample evaluations and export metrics/logs.
* **Checkpoint:** An S-PLM v2 `.pth` checkpoint, available from the provided [SharePoint link](https://mailmissouri-my.sharepoint.com/:u:/g/personal/wangdu_umsystem_edu/EUZ74fO3NOxHjTvc6uvKwDsB5fELaaw-oiPHFU9CJky_hg?e=4phwL0).



### **Environment Setup**

We **recommend** using an NVIDIA **A100** in Colab; other GPUs/CPU will work but may be slower or run into memory limits.
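Before installing anything, it can help to confirm which accelerator the runtime actually provides. The sketch below is not part of the S-PLM repository; `describe_gpu` is a hypothetical helper, and it assumes the `nvidia-smi` tool is on the `PATH` whenever a GPU runtime is attached.

```python
import shutil
import subprocess


def describe_gpu() -> str:
    """Return a short description of the visible NVIDIA GPUs, or a CPU notice."""
    # nvidia-smi ships with the NVIDIA driver; its absence suggests a CPU runtime.
    if shutil.which("nvidia-smi") is None:
        return "No NVIDIA driver found; running on CPU."
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
            timeout=30,
        )
    except (subprocess.SubprocessError, OSError) as err:
        return f"nvidia-smi failed: {err}"
    return result.stdout.strip() or "nvidia-smi reported no GPUs."


print(describe_gpu())
```

If the reported GPU has less memory than an A100, the truncated inference mode described later is the first thing to try.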


In [1]:
# Clone S-PLM
!git clone -q https://github.com/Yichuan0712/SPLM-V2-GVP /content/SPLMv2

# Install minimal deps
!pip install 'git+https://github.com/facebookresearch/esm.git' -q
!pip install 'git+https://github.com/katsura-jp/pytorch-cosine-annealing-with-warmup' -q
!pip install biopython -q


In [2]:
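# Pin PyTorch to a CUDA 12.1 build so the PyG wheels installed below match it
!pip install -q "torch==2.5.0" "torchvision==0.20.0" "torchaudio==2.5.0" \
  --index-url https://download.pytorch.org/whl/cu121
import torch
TORCH = "2.5.0"
# torch.version.cuda looks like "12.1"; the PyG wheel index expects "cu121"
CUDA = "cu" + torch.version.cuda.replace(".", "")
whl_url = f"https://data.pyg.org/whl/torch-{TORCH}+{CUDA}.html"
print("Using wheel URL:", whl_url)
# Install the compiled PyG extensions from the matching wheel index
!pip install -q pyg_lib torch-scatter torch-sparse torch-cluster torch-spline-conv \
    -f {whl_url}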
!pip install -q torch-geometric
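The cell above builds the wheel index URL from `torch.version.cuda`, which is `None` on CPU-only builds of PyTorch, so the string concatenation would fail there. A minimal sketch of a more defensive version follows; `pyg_wheel_url` is a hypothetical helper, not something the repository provides, and it falls back to the `cpu` wheel index when no CUDA version is reported.

```python
from typing import Optional


def pyg_wheel_url(torch_version: str, cuda_version: Optional[str]) -> str:
    """Build the PyG wheel index URL for a given torch build.

    cuda_version is the value of torch.version.cuda (for example "12.1"),
    or None for a CPU-only build.
    """
    if cuda_version is None:
        suffix = "cpu"
    else:
        suffix = "cu" + cuda_version.replace(".", "")
    return f"https://data.pyg.org/whl/torch-{torch_version}+{suffix}.html"


print(pyg_wheel_url("2.5.0", "12.1"))
print(pyg_wheel_url("2.5.0", None))
```

In the notebook this would be called as `pyg_wheel_url(TORCH, torch.version.cuda)`.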


### **Prepare Checkpoint**

1. **Download the model** from the provided **[SharePoint link](https://mailmissouri-my.sharepoint.com/:u:/g/personal/wangdu_umsystem_edu/EUZ74fO3NOxHjTvc6uvKwDsB5fELaaw-oiPHFU9CJky_hg?e=4phwL0)** to your local machine.
2. **Upload to your Colab runtime** (Files pane → Upload to session storage), then set:




In [3]:
CHECKPOINT_PATH = "/content/checkpoint_0280000_gvp.pth"

3. **Faster alternative to step 2 (recommended):** Mount Google Drive and copy the checkpoint from Drive into the Colab runtime. The cell below assumes the file sits at the top level of My Drive.


In [4]:
from google.colab import drive
import shutil

# Mount Drive, then copy the checkpoint into local session storage
drive.mount('/content/drive', force_remount=True)
shutil.copy("/content/drive/MyDrive/checkpoint_0280000_gvp.pth",
            "/content/checkpoint_0280000_gvp.pth")
CHECKPOINT_PATH = "/content/checkpoint_0280000_gvp.pth"

Mounted at /content/drive
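An interrupted upload or copy can leave a missing or empty file that only fails later, when the model is loaded. The following sketch checks the file up front; `checkpoint_size_mb` is a hypothetical helper introduced here for illustration, not part of the repository.

```python
from pathlib import Path


def checkpoint_size_mb(path: str) -> float:
    """Return the checkpoint size in MiB; raise if the file is missing or empty."""
    ckpt = Path(path)
    if not ckpt.is_file():
        raise FileNotFoundError(f"Checkpoint not found: {ckpt}")
    size = ckpt.stat().st_size
    if size == 0:
        raise ValueError(f"Checkpoint is empty (interrupted upload?): {ckpt}")
    return size / 2**20


# Usage in the notebook, with CHECKPOINT_PATH set by the cells above:
# print(f"{checkpoint_size_mb(CHECKPOINT_PATH):.1f} MiB")
```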


### **Generate Sequence Embeddings**

Use the GVP model to generate embeddings from FASTA sequences, with optional truncation and residue-level outputs.

* **Standard run:** reads a `.fasta` file and writes **protein-level** embeddings to a `.pkl` file.
* **Truncated run:** adds `--truncate_inference 1 --max_length_inference 1022` to handle long sequences.
* **Residue-level run:** adds `--residue_level` to write per-residue embeddings.

**Inputs:** `--input_seq` (FASTA), `--config_path`, `--checkpoint_path`.

**Outputs:** pickled embeddings in the working directory (per protein or per residue, depending on flags).
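The three run modes differ only in a few flags, so they can be assembled from one function. This is a sketch under stated assumptions: `build_seq_embedding_cmd` is a hypothetical helper, and it uses only the flags that appear in this notebook (`--input_seq`, `--config_path`, `--checkpoint_path`, `--result_path`, `--out_file`, `--truncate_inference`, `--max_length_inference`, `--residue_level`).

```python
import sys
from typing import List, Optional


def build_seq_embedding_cmd(
    input_seq: str,
    config_path: str,
    checkpoint_path: str,
    result_path: str = "./",
    out_file: Optional[str] = None,
    max_length: Optional[int] = None,
    residue_level: bool = False,
) -> List[str]:
    """Assemble the argument list for `python -m utils.generate_seq_embedding`."""
    cmd = [
        sys.executable, "-m", "utils.generate_seq_embedding",
        "--input_seq", input_seq,
        "--config_path", config_path,
        "--checkpoint_path", checkpoint_path,
        "--result_path", result_path,
    ]
    if out_file is not None:
        cmd += ["--out_file", out_file]
    if max_length is not None:
        # Truncated run: both flags are passed together, as in this notebook.
        cmd += ["--truncate_inference", "1", "--max_length_inference", str(max_length)]
    if residue_level:
        cmd.append("--residue_level")
    return cmd


print(" ".join(build_seq_embedding_cmd(
    "dataset/protein.fasta",
    "configs/config_plddtallweight_noseq_rotary_foldseek.yaml",
    "/content/checkpoint_0280000_gvp.pth",
    max_length=1022,
    residue_level=True,
)))
```

The resulting list can be passed to `subprocess.run(cmd, cwd="/content/SPLMv2", check=True)`, which avoids changing the notebook's working directory with `os.chdir`.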


In [5]:
import os
os.chdir('/content/SPLMv2')

# standard run
!python3 -m utils.generate_seq_embedding --input_seq /content/SPLMv2/dataset/protein.fasta \
  --config_path /content/SPLMv2/configs/config_plddtallweight_noseq_rotary_foldseek.yaml \
  --checkpoint_path /content/checkpoint_0280000_gvp.pth \
  --result_path ./


In [6]:
import os
os.chdir('/content/SPLMv2')

# truncate_inference with max_length_inference=1022
!python3 -m utils.generate_seq_embedding --input_seq /content/SPLMv2/dataset/protein.fasta \
--config_path /content/SPLMv2/configs/config_plddtallweight_noseq_rotary_foldseek.yaml \
--checkpoint_path /content/checkpoint_0280000_gvp.pth \
--result_path ./ --out_file truncate_protein_embeddings.pkl \
--truncate_inference 1 --max_length_inference 1022

import pickle
with open('truncate_protein_embeddings.pkl', 'rb') as f:
    data = pickle.load(f)
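To see which records a truncated run would affect, the sequence lengths in the FASTA file can be compared against the limit. The sketch below uses a small standard-library parser rather than the repository's own loader; `read_fasta_lengths` and `over_limit` are hypothetical helpers, and it assumes only sequences longer than `--max_length_inference` are shortened (how the script truncates them is not shown in this notebook).

```python
from typing import Dict, Iterable


def read_fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Map each FASTA record ID (first word of the header) to its sequence length."""
    lengths: Dict[str, int] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            current = header[0] if header else ""
            lengths[current] = 0
        elif current is not None:
            # Sequence lines may wrap, so lengths accumulate across lines.
            lengths[current] += len(line)
    return lengths


def over_limit(lengths: Dict[str, int], max_length: int = 1022) -> Dict[str, int]:
    """Return only the records whose length exceeds max_length."""
    return {name: n for name, n in lengths.items() if n > max_length}


demo = [">short example", "MKT", "AYI", ">long", "A" * 1500]
print(over_limit(read_fasta_lengths(demo)))  # prints {'long': 1500}
```

For the real input, pass an open file handle: `read_fasta_lengths(open("/content/SPLMv2/dataset/protein.fasta"))`.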


In [7]:
import os
os.chdir('/content/SPLMv2')

# residue_level representations
!python3 -m utils.generate_seq_embedding --input_seq /content/SPLMv2/dataset/protein.fasta \
--config_path /content/SPLMv2/configs/config_plddtallweight_noseq_rotary_foldseek.yaml \
--checkpoint_path /content/checkpoint_0280000_gvp.pth \
--result_path ./ --out_file truncate_protein_residue_embeddings.pkl \
--truncate_inference 1 --max_length_inference 1022 --residue_level

import pickle
with open('truncate_protein_residue_embeddings.pkl', 'rb') as f:
    data = pickle.load(f)


### **Preprocess PDB**

First preprocess your PDB files using the provided script; only the resulting HDF5 files can be fed into the S-PLM v2 GVP model.


In [8]:
!python /content/SPLMv2/data/preprocess_pdb.py \
  --data /content/SPLMv2/dataset/CATH_4_3_0_non-rep_pdbs/ \
  --save_path /content/CATH_4_3_0_non-rep_gvp/ \
  --max_workers 4

Processing files: 100% 1553/1553 [00:08<00:00, 180.87it/s]
{'protein_complex': 0, 'no_chain_id_a': 455, 'h5_processed': 1535, 'single_amino_acid': 0, 'error': 0}
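The summary above reports 1535 processed entries for 1553 input files. An independent sanity check is to count the files on disk. This sketch assumes the outputs use the `.h5` extension, which the notebook does not confirm; `count_files` is a hypothetical helper, and the count may differ from `h5_processed` if the script writes more than one file per structure.

```python
from pathlib import Path


def count_files(directory: str, suffix: str) -> int:
    """Count files under `directory` (recursively) whose names end with `suffix`."""
    return sum(1 for p in Path(directory).rglob(f"*{suffix}") if p.is_file())


# Usage in the notebook (the ".h5" extension is an assumption):
# print(count_files("/content/CATH_4_3_0_non-rep_gvp/", ".h5"))
```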


### **Generate Structure Embeddings**

Use the GVP model to produce **residue-level structure embeddings** from the **preprocessed HDF5** inputs, save them to `protein_struct_embeddings.pkl`, and print the loaded result for inspection.

**Inputs:** `--hdf5_path` (preprocessed data), `--config_path`, `--checkpoint_path`.

**Output:** `protein_struct_embeddings.pkl` in the current directory (embeddings per protein/chain).

**Note:** You **must preprocess** the PDB files first; the model only accepts the processed HDF5 tensors.


In [9]:
import os
os.chdir('/content/SPLMv2')
!python -m utils.generate_struct_embedding \
  --hdf5_path /content/CATH_4_3_0_non-rep_gvp/ \
  --config_path /content/SPLMv2/configs/config_plddtallweight_noseq_rotary_foldseek.yaml \
  --checkpoint_path /content/checkpoint_0280000_gvp.pth \
  --result_path ./ \
  --residue_level

import pickle
with open('protein_struct_embeddings.pkl', 'rb') as f:
    print(pickle.load(f))

Streaming output truncated to the last 5000 lines.
        0.71435547],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.6845703 ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.5473633 ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.05685425,
        0.5966797 ]], dtype=float32), '4llgM00_3.10.20.510': array([[0.        , 0.        , 0.43237305, ..., 0.        , 0.        ,
        0.32885742],
       [0.        , 0.        , 0.4453125 , ..., 0.        , 0.        ,
        0.6611328 ],
       [0.        , 0.        , 0.4248047 , ..., 0.        , 0.        ,
        0.7089844 ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.5097656 ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.6826172 ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.5913086 ]], dty
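Printing the whole dictionary produces so much text that Colab truncates it. A summary of keys and array shapes is usually easier to read. The sketch below assumes the pickle holds a dict mapping an ID to an array-like value, as the output above suggests; `summarize_embeddings` is a hypothetical helper.

```python
from typing import Any, Dict, Tuple


def summarize_embeddings(embeddings: Dict[str, Any], limit: int = 5) -> Dict[str, Tuple[int, ...]]:
    """Return {key: shape} for the first `limit` entries of an embedding dict."""
    summary: Dict[str, Tuple[int, ...]] = {}
    for key in list(embeddings)[:limit]:
        value = embeddings[key]
        # NumPy arrays and tensors expose .shape; fall back to len() otherwise.
        shape = getattr(value, "shape", None)
        if shape is None:
            shape = (len(value),)
        summary[key] = tuple(shape)
    return summary


# Usage in the notebook:
# import pickle
# with open("protein_struct_embeddings.pkl", "rb") as f:
#     print(summarize_embeddings(pickle.load(f)))
```

With `--residue_level`, the first dimension of each shape would be expected to track the number of residues, though this notebook does not state the exact layout.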

### **Evaluate on CATH**

We use a CATH subset to assess the quality of structure embeddings on a clustering task, both through visualizations of the embedding space and through quantitative metrics such as silhouette scores and other clustering-based measures.
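Clustering metrics need ground-truth labels. The keys printed in the structure-embedding output (for example `4llgM00_3.10.20.510`) appear to combine a domain ID with a CATH code, whose four fields are Class, Architecture, Topology (fold), and Homologous superfamily. The sketch below derives labels from such keys; the key format is an inference from that output, and `cath_levels` is a hypothetical helper rather than the repository's own parser.

```python
from typing import Dict


def cath_levels(key: str) -> Dict[str, str]:
    """Split a key like '4llgM00_3.10.20.510' into CATH hierarchy labels."""
    domain, _, code = key.rpartition("_")
    parts = code.split(".")
    if not domain or len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"Unexpected key format: {key!r}")
    return {
        "domain": domain,
        "class": parts[0],
        "architecture": ".".join(parts[:2]),
        "fold": ".".join(parts[:3]),
        "superfamily": code,
    }


print(cath_levels("4llgM00_3.10.20.510"))
```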

Build the GVP structure model, run the **CATH** evaluation with preprocessed HDF5 inputs, and save metrics/figures.

**Inputs:** `checkpoint_path`, `config_path`, and `cath_path` pointing to the CATH HDF5 directory `dataset/CATH_4_3_0_non-rep_h5/`.

**What it does:**

* Instantiates `StructRepresentModel` and sets `out_figure_path`.
* Calls `evaluate_with_cath_more_struct(...)` to compute clustering/quality metrics (Class/Architecture/Fold level, ARI, silhouette).
* Prints scores to stdout and writes a summary to `scores.txt` under `out_figure_path`.
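For readers unfamiliar with ARI, the following is a from-scratch sketch of the standard Adjusted Rand Index formula. It is not the implementation inside `evaluate_with_cath_more_struct`, and how the repository obtains predicted clusters is not shown in this notebook; `adjusted_rand_index` is a hypothetical name.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(true_labels: Sequence[Hashable], pred_labels: Sequence[Hashable]) -> float:
    """Adjusted Rand Index between two labelings of the same items."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("Label sequences must have the same length.")
    total_pairs = comb(len(true_labels), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs of items that share a cluster in both labelings, in each labeling alone.
    both = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    in_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    in_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = in_true * in_pred / total_pairs
    maximum = (in_true + in_pred) / 2
    if maximum == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (both - expected) / (maximum - expected)


# Same partition under different label names, then an uninformative one.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

ARI is 1 for identical partitions regardless of label names, and close to 0 (possibly negative) for partitions that agree no better than chance.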



In [19]:
!python cath_with_struct.py --checkpoint_path /content/checkpoint_0280000_gvp.pth \
--config_path /content/SPLMv2/configs/config_plddtallweight_noseq_rotary_foldseek.yaml \
--cath_path /content/SPLMv2/dataset/CATH_4_3_0_non-rep_h5/
