# **S-PLM v1: Sequence Embedding Quickstart**

This notebook is a **usage example** of **S-PLM v1**. It demonstrates how to convert amino-acid **sequences** into model **embeddings**, and how to run **downstream training/evaluation**.

* **Purpose:** Produce protein/residue embeddings; fine-tune/evaluate downstream tasks (GO/EC/fold/ER/SS).
* **Checkpoint:** An S-PLM v1 `.pth` checkpoint. Download from the provided [SharePoint link](https://mailmissouri-my.sharepoint.com/:u:/g/personal/wangdu_umsystem_edu/EUf7oNxn1OpCse64KNleK3cBJ396ORTN338eRWRA4Q792A?e=Fgkuzk).



### **Environment Setup**

We **recommend** an NVIDIA **A100** GPU in Colab. Other GPUs or a CPU runtime also work, but they may be slower or run out of memory.


In [None]:
# Clone S-PLM
!git clone -q https://github.com/duolinwang/S-PLM /content/S-PLM

# Install minimal deps
!pip install 'git+https://github.com/facebookresearch/esm.git' -q
!pip install 'git+https://github.com/katsura-jp/pytorch-cosine-annealing-with-warmup' -q
# for downstream tasks only
!pip install torchmetrics -q
!pip uninstall -y accelerate -q
!pip install accelerate==0.34.2 -q


In [None]:
!pip install biopython
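
After the install cells finish, it can help to confirm that the packages are importable before loading the model. The sketch below is a minimal check, not part of S-PLM: `missing_modules` is a hypothetical helper, and the import names in `REQUIRED_MODULES` are assumed from the packages installed above.

```python
import importlib.util

# Import names assumed for the packages installed above; adjust if they differ.
REQUIRED_MODULES = ["torch", "esm", "torchmetrics", "accelerate", "Bio"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("Missing modules:", missing if missing else "none")
```

If anything is reported as missing, re-run the corresponding `pip install` line and restart the runtime.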

### **Prepare Checkpoint**

1. **Download the model** from the provided **[SharePoint link](https://mailmissouri-my.sharepoint.com/:u:/g/personal/wangdu_umsystem_edu/EUf7oNxn1OpCse64KNleK3cBJ396ORTN338eRWRA4Q792A?e=Fgkuzk)** to your local machine.
2. **Upload to your Colab runtime** (Files pane → Upload to session storage), then set:




In [None]:
CHECKPOINT_PATH = "/content/checkpoint_0520000.pth"

3. **Faster alternative to steps 1 and 2 (recommended):** Keep the checkpoint in Google Drive, then mount Drive and copy the file into the Colab runtime.


In [None]:
from google.colab import drive
import shutil

# Mount Drive, then copy the checkpoint into the Colab runtime.
drive.mount('/content/drive', force_remount=True)
CHECKPOINT_PATH = "/content/checkpoint_0520000.pth"
shutil.copy("/content/drive/MyDrive/checkpoint_0520000.pth", CHECKPOINT_PATH)
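
An interrupted upload or copy can leave a missing or truncated file, which only shows up later as a loading error. As a rough sketch, the hypothetical helper `check_checkpoint` below verifies that the file exists and reports its size; the `min_bytes` threshold is an illustrative assumption, not a known property of the checkpoint.

```python
from pathlib import Path


def check_checkpoint(path, min_bytes=1):
    """Return the file size in bytes, or raise if the checkpoint is missing or too small."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"Checkpoint not found: {p}")
    size = p.stat().st_size
    if size < min_bytes:
        raise ValueError(f"Checkpoint looks incomplete ({size} bytes): {p}")
    return size


# Example (run after CHECKPOINT_PATH is set):
# print(f"{check_checkpoint(CHECKPOINT_PATH) / 1e6:.1f} MB")
```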

### **Minimal embedding code**
Use the GVP model to generate embeddings from FASTA sequences, with optional truncation and residue-level output.

* **Standard run:** produces **protein-level** embeddings from `.fasta` to `.pkl`
* **Truncated run:** sets `--truncate_inference 1 --max_length_inference 1022` to handle long sequences

* **Residue-level run:** adds `--residue_level`
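
To decide whether the truncated run is needed, one option is to scan the FASTA file for sequences longer than the 1022-residue limit used above. This is a standard-library sketch; `read_fasta` and `sequences_over_limit` are hypothetical helpers and are not part of S-PLM.

```python
def read_fasta(path):
    """Parse a FASTA file into a dict of {header: sequence}."""
    records, header, chunks = {}, None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record.
                if header is not None:
                    records[header] = "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


def sequences_over_limit(records, max_length=1022):
    """Return {header: length} for sequences longer than max_length residues."""
    return {h: len(s) for h, s in records.items() if len(s) > max_length}


# Example:
# records = read_fasta("/content/S-PLM/sample_fasta/protein.fasta")
# print(sequences_over_limit(records))
```

An empty result suggests the standard run is enough for that file.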

**Inputs:** `--input_seq` (FASTA), `--config_path`, `--checkpoint_path`.

**Outputs:** pickled embeddings in the working directory (per protein or per residue, depending on flags).
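
The internal layout of the pickled output is not described here, so the sketch below makes no assumption about it: it loads a pickle and reports the top-level type, the number of items, and, for a dictionary, the shape or type of each value. `summarize_embeddings` is a hypothetical helper, and the file name in the usage comment is a placeholder. Only unpickle files you produced yourself or otherwise trust.

```python
import pickle


def summarize_embeddings(pkl_path):
    """Load a pickle and describe its top-level structure without assuming a layout."""
    with open(pkl_path, "rb") as handle:
        obj = pickle.load(handle)
    summary = {"type": type(obj).__name__}
    if hasattr(obj, "__len__"):
        summary["n_items"] = len(obj)
    if isinstance(obj, dict):
        # Report the shape (if array-like) or the type name of each value.
        summary["entries"] = {
            key: getattr(value, "shape", type(value).__name__)
            for key, value in obj.items()
        }
    return obj, summary


# Example (placeholder file name):
# embeddings, info = summarize_embeddings("path/to/output.pkl")
# print(info)
```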


In [None]:
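# Update the clone; -C runs git inside the repository regardless of the current directory
!git -C /content/S-PLM pull origin main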

In [None]:
import os
os.chdir('/content/S-PLM')

# standard run
!python utils.generate_seq_embedding.py --input_seq /content/S-PLM/sample_fasta/protein.fasta \
  --config_path /content/S-PLM/configs/SPLM1_representation_config.yaml \
  --checkpoint_path /content/checkpoint_0520000.pth \
  --result_path ./  # --residue_level
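
The truncated and residue-level runs differ from the standard run only in the flags listed earlier. As a sketch, the hypothetical helper `embedding_command` assembles the argument list for each mode; the script name and flags are copied from this notebook, and everything else is illustrative.

```python
import shlex


def embedding_command(input_seq, config_path, checkpoint_path, result_path="./",
                      truncate=False, max_length=1022, residue_level=False):
    """Build the argument list for the embedding script in one of its run modes."""
    cmd = [
        "python", "utils.generate_seq_embedding.py",
        "--input_seq", input_seq,
        "--config_path", config_path,
        "--checkpoint_path", checkpoint_path,
        "--result_path", result_path,
    ]
    if truncate:
        # Truncated run: cap the sequence length at inference time.
        cmd += ["--truncate_inference", "1", "--max_length_inference", str(max_length)]
    if residue_level:
        # Residue-level run: request per-residue instead of per-protein output.
        cmd.append("--residue_level")
    return cmd


cmd = embedding_command(
    "/content/S-PLM/sample_fasta/protein.fasta",
    "/content/S-PLM/configs/SPLM1_representation_config.yaml",
    "/content/checkpoint_0520000.pth",
    truncate=True,
    residue_level=True,
)
print(shlex.join(cmd))
```

The printed command can be pasted after `!` in a cell, or the list can be passed to `subprocess.run(cmd, check=True, cwd="/content/S-PLM")`.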

### **General clustering evaluation (CATH / Kinase)**

We evaluate sequence embedding quality using clustering-based analyses. We report both visualizations (t-SNE scatter plots) and quantitative metrics (Calinski–Harabasz, ARI, silhouette).


### Inputs

* `checkpoint_path`: path to the pretrained model checkpoint (`.pth`)
* `config_path`: path to the YAML config used for the checkpoint
* Evaluation dataset path, passed with the argument that matches the task:

  * `cath_seq`: CATH FASTA file with CATH codes in the headers (e.g., `1.10.10.2080|cath|...`)
  * `kinase_seq`: kinase FASTA file with a `Kinase_group` field in the headers (e.g., `...|Kinase_group=Other|...`)
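
To illustrate how labels can be read from these headers, the sketch below derives the Class, Architecture, and Fold labels from the leading CATH code and extracts the `Kinase_group` value. It assumes `|`-separated headers shaped like the two examples above; `cath_levels` and `kinase_group` are hypothetical helpers, and the repository scripts may parse headers differently.

```python
def cath_levels(header):
    """Split a header such as '1.10.10.2080|cath|...' into Class/Architecture/Fold labels."""
    code = header.lstrip(">").split("|")[0]
    parts = code.split(".")
    if len(parts) < 3:
        raise ValueError(f"Expected a CATH code with at least 3 levels, got {code!r}")
    return {
        "class": parts[0],
        "architecture": ".".join(parts[:2]),
        "fold": ".".join(parts[:3]),
    }


def kinase_group(header):
    """Return the value of the 'Kinase_group=' field from a '|'-separated header."""
    for field in header.lstrip(">").split("|"):
        if field.startswith("Kinase_group="):
            return field.split("=", 1)[1]
    raise ValueError(f"No Kinase_group field in header: {header!r}")


print(cath_levels("1.10.10.2080|cath|example"))
print(kinase_group("example|Kinase_group=Other|more"))
```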

### What it does
* **Computes embeddings** for all samples in the dataset.
* **Runs clustering evaluation** at one or more label granularities (e.g., CATH Class / Architecture / Fold, or Kinase Group).
* **Generates visualizations**:

  * Projects embeddings to 2D using t-SNE and saves scatter plots colored by ground-truth labels.
* **Computes clustering metrics**:

  * Calinski–Harabasz score (full space and t-SNE 2D)
  * Adjusted Rand Index (ARI) using k-means on the t-SNE space
  * Silhouette score in the full embedding space
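* **Saves outputs**, including the t-SNE plots as PNG files (for example `CATH_seqrep_1.png` and `kinase_group_seqrep.png` in the repository directory).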
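
For readers unfamiliar with these metrics, here is a dependency-free sketch of two of them, written from their standard definitions. It is not the repository's implementation; in practice scikit-learn's `adjusted_rand_score` and `calinski_harabasz_score` compute the same quantities. The toy points and labels are made up for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected agreement between two labelings (1.0 means identical partitions)."""
    n = len(labels_true)
    if n != len(labels_pred) or n < 2:
        raise ValueError("need two labelings of equal length with at least 2 samples")
    # Count sample pairs that share a cell, a true class, or a predicted cluster.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)


def calinski_harabasz(points, labels):
    """Ratio of between-cluster to within-cluster dispersion (higher means better separated)."""
    n = len(points)
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    k = len(groups)
    if not 2 <= k < n:
        raise ValueError("need at least 2 clusters and fewer clusters than points")

    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = centroid(points)
    between = within = 0.0
    for rows in groups.values():
        center = centroid(rows)
        between += len(rows) * sq_dist(center, overall)
        within += sum(sq_dist(row, center) for row in rows)
    if within == 0:
        return float("inf")  # zero within-cluster spread
    return (between / (k - 1)) / (within / (n - k))


# Toy data: two tight, well-separated groups.
points = [[0, 0], [0, 2], [10, 0], [10, 2]]
print(calinski_harabasz(points, [0, 0, 1, 1]))
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

ARI ignores how clusters are named, so swapping cluster ids does not change the score.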




In [None]:
import os
os.chdir('/content/S-PLM')

!python cath_with_seq.py \
  --cath_seq ./dataset/Rep_subfamily_basedon_S40pdb.fa \
  --checkpoint_path /content/checkpoint_0520000.pth \
  --config_path /content/S-PLM/configs/SPLM1_representation_config.yaml

In [None]:
from PIL import Image
import matplotlib.pyplot as plt

paths = [
    "/content/S-PLM/CATH_seqrep_1.png",
    "/content/S-PLM/CATH_seqrep_2.png",
    "/content/S-PLM/CATH_seqrep_3.png",
]

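imgs = [Image.open(p) for p in paths]

# Canvas wide enough to hold all plots side by side.
total_width = sum(im.width for im in imgs)
max_height = max(im.height for im in imgs)

new_img = Image.new("RGB", (total_width, max_height), (255, 255, 255))
x = 0
for im in imgs:
    new_img.paste(im, (x, 0))  # place each plot to the right of the previous one
    x += im.width

# Size the figure in inches so that it is total_width x max_height pixels at this dpi.
dpi = 800
plt.figure(figsize=(total_width / dpi, max_height / dpi), dpi=dpi)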
plt.imshow(new_img)
plt.axis("off")
plt.show()

In [None]:
import os
os.chdir('/content/S-PLM')

!python kinase_with_seq.py \
  --kinase_seq /content/S-PLM/dataset/kinase_alllabels.fa \
  --checkpoint_path /content/checkpoint_0520000.pth \
  --config_path /content/S-PLM/configs/SPLM1_representation_config.yaml

In [None]:
from PIL import Image
import matplotlib.pyplot as plt

img = Image.open("/content/S-PLM/kinase_group_seqrep.png")
plt.imshow(img)
plt.axis("off")
plt.show()

### **Downstream Tasks**

S-PLM v1 provides lightweight tuning code for **supervised downstream tasks** including **EC**, **GO**, **fold**, **ER**, and **secondary structure (SS)** prediction. You set hyperparameters in the corresponding config file and launch the task-specific training script with your pretrained checkpoint.

**Task scripts:** `train_ec.py`, `train_er.py`, `train_fold.py`, `train_go.py`, `train_ss.py`

**Configs:** `configs/config_{task}.yaml` (multiple examples provided)

S-PLM supports **fine-tuning top layers, Adapter Tuning, and LoRA**; pick the variant by choosing the matching config.


**Command template**

```bash
accelerate launch train_{task}.py \
  --config_path configs/<config_name>.yaml \
  --resume_path checkpoint_0520000.pth
```
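
The template can also be filled in programmatically. The sketch below uses a hypothetical helper, `train_command`, that checks the task name against the scripts listed above and returns an argument list; the config file name in the example is a placeholder.

```python
import shlex

# Task names taken from the script list above.
TASK_SCRIPTS = {task: f"train_{task}.py" for task in ("ec", "er", "fold", "go", "ss")}


def train_command(task, config_path, resume_path, use_accelerate=True):
    """Build the launch command for one downstream task."""
    if task not in TASK_SCRIPTS:
        raise ValueError(f"Unknown task {task!r}; expected one of {sorted(TASK_SCRIPTS)}")
    launcher = ["accelerate", "launch"] if use_accelerate else ["python"]
    return launcher + [
        TASK_SCRIPTS[task],
        "--config_path", config_path,
        "--resume_path", resume_path,
    ]


# Placeholder config name, for illustration only.
print(shlex.join(train_command("go", "configs/<config_name>.yaml", "checkpoint_0520000.pth")))
```

Passing `use_accelerate=False` yields the plain `python train_{task}.py ...` form.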

**Examples**

```bash
# In a Colab cell, prefix each command with `!`.

# GO (BP / CC / MF)
python train_go.py --config_path configs/bp_config_adapterH_adapterH.yaml --resume_path checkpoint_0520000.pth
python train_go.py --config_path configs/cc_config_adapterH_adapterH.yaml --resume_path checkpoint_0520000.pth
python train_go.py --config_path configs/mf_config_adapterH_adapterH.yaml --resume_path checkpoint_0520000.pth

# Fold classification
python train_fold.py --config_path configs/fold_config_adapterH_finetune.yaml --resume_path checkpoint_0520000.pth

# Secondary structure
python train_ss.py --config_path configs/ss_config_adapterH_finetune.yaml --resume_path checkpoint_0520000.pth
```

**Note:** GPU memory on Colab is limited. To run these tasks in Colab, **reduce the batch size**; see `configs/bp_config_adapterH_adapterH_ColabA10040G.yaml` for an example.

In [None]:
# -sfn replaces an existing link, so this cell can be re-run safely
!ln -sfn /content/S-PLM /S-PLM
!python /content/S-PLM/train_go.py --config_path /content/S-PLM/configs/bp_config_adapterH_adapterH_ColabA10040G.yaml --resume_path /content/checkpoint_0520000.pth