# **S-PLM v1: Sequence Embedding Quickstart**

This notebook is a **concise usage example** of **S-PLM v1**. It demonstrates how to convert amino-acid **sequences** into model **embeddings**.

* **Purpose:** Input protein sequences → output embeddings.
* **Checkpoint:** An S-PLM v1 `.pth` checkpoint. If you have access, download from the provided [SharePoint link](https://mailmissouri-my.sharepoint.com/:u:/g/personal/wangdu_umsystem_edu/EUf7oNxn1OpCse64KNleK3cBJ396ORTN338eRWRA4Q792A?e=Fgkuzk).
* **Workflow:** load YAML config → build `SequenceRepresentation` → load checkpoint → tokenize with `batch_converter` → forward pass → read out protein-level and residue-level embeddings.


### **Environment Setup**

We **recommend** an NVIDIA **A100** runtime in Colab. Other GPUs, or a CPU runtime, also work but may be slower or run out of memory.


In [None]:
# Clone S-PLM
!git clone -q https://github.com/duolinwang/S-PLM /content/S-PLM

# Install minimal deps
!pip install 'git+https://github.com/facebookresearch/esm.git' -q
!pip install 'git+https://github.com/katsura-jp/pytorch-cosine-annealing-with-warmup' -q
# for downstream tasks only
!pip install torchmetrics -q
!pip uninstall -y accelerate -q
!pip install accelerate==0.34.2 -q


### **Prepare Checkpoint**

1. **Download the model** from the provided **[SharePoint link](https://mailmissouri-my.sharepoint.com/:u:/g/personal/wangdu_umsystem_edu/EUf7oNxn1OpCse64KNleK3cBJ396ORTN338eRWRA4Q792A?e=Fgkuzk)** to your local machine.
2. **Upload to your Colab runtime** (Files pane → Upload to session storage), then set:




In [None]:
CHECKPOINT_PATH = "/content/checkpoint_0520000.pth"

3. **Faster alternative to step 2 (recommended):** Store the checkpoint in Google Drive, then mount Drive and copy the file into the Colab runtime.


In [None]:
from google.colab import drive
import shutil
drive.mount('/content/drive', force_remount=True)
shutil.copy("/content/drive/MyDrive/checkpoint_0520000.pth",
            "/content/checkpoint_0520000.pth")
CHECKPOINT_PATH = "/content/checkpoint_0520000.pth"
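
An interrupted upload or copy can leave a missing or truncated file, which only shows up later as a confusing error when the checkpoint is loaded. The sketch below is an optional sanity check using only the standard library. The helper name `check_checkpoint` and the `min_bytes` threshold are hypothetical and not part of S-PLM.

```python
import os

def check_checkpoint(path, min_bytes=1):
    """Return the file size in bytes; raise if the file is missing or too small.

    Hypothetical helper (not part of S-PLM). `min_bytes` is an arbitrary
    lower bound; raise it if you know the expected checkpoint size.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Checkpoint not found: {path}")
    size = os.path.getsize(path)
    if size < min_bytes:
        raise ValueError(f"Checkpoint looks truncated ({size} bytes): {path}")
    return size

# In the notebook, call it on the path set above, for example:
# print(check_checkpoint(CHECKPOINT_PATH) / 1e6, "MB")
```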

### **Minimal embedding code**

* Load the YAML config
* Initialize `SequenceRepresentation`
* Load the checkpoint
* Tokenize sequences with `batch_converter`
* Run a forward pass to obtain `protein_representation` and `residue_representation`
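
The cell that follows hard-codes a short list of sequences. For real inputs you will often start from a FASTA file, so here is a minimal sketch of a reader that produces the `(label, sequence)` tuples that `batch_converter` consumes. The function name `read_fasta` is hypothetical and not part of S-PLM, and the parser handles only plain FASTA text.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a list of (label, sequence) tuples.

    Hypothetical helper. The label is the first whitespace-delimited token
    of each header line; sequence lines are concatenated until the next header.
    """
    records, label, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if label is not None:
                records.append((label, "".join(chunks)))
            header = line[1:].split()
            label = header[0] if header else f"seq_{len(records)}"
            chunks = []
        elif label is not None:
            chunks.append(line)
    if label is not None:
        records.append((label, "".join(chunks)))
    return records

example = """>sp1 demo
MHHHHHHSSGVDLG
TENLYFQ
>sp2
TMQYGALLGGKRLR
""".splitlines()

records = read_fasta(example)
print(records)  # [('sp1', 'MHHHHHHSSGVDLGTENLYFQ'), ('sp2', 'TMQYGALLGGKRLR')]
```

To read from disk, pass an open file handle, e.g. `read_fasta(open("my_proteins.fasta"))` (the file name is a placeholder), and hand the result to `model.batch_converter` in place of `esm2_seq`.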


In [None]:
assert CHECKPOINT_PATH, "Set CHECKPOINT_PATH to your .pth (downloaded or uploaded)."

import sys
import torch
import yaml

# Make the cloned S-PLM repo importable
sys.path.insert(0, "/content/S-PLM")
from utils import load_configs, load_checkpoints_only
from model import SequenceRepresentation

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the configuration file
config_path = "/content/S-PLM/configs/representation_config.yaml"
with open(config_path) as f:
    dict_config = yaml.full_load(f)
configs = load_configs(dict_config)

# Create the model using the configuration file
model = SequenceRepresentation(logging=None, configs=configs).to(device)
model.eval()
load_checkpoints_only(CHECKPOINT_PATH, model)

# Create a list of protein sequences (edit these)
sequences = [
    "MHHHHHHSSGVDLGTENLYFQSNAMDFPQQLEA",
    "CVKQANQALSRFIAPLPFQNTPVVE",
    "TMQYGALLGGKRLR",
]

# batch_converter takes (label, sequence) tuples; here the label is the list index
esm2_seq = [(i, seq) for i, seq in enumerate(sequences)]
batch_labels, batch_strs, batch_tokens = model.batch_converter(esm2_seq)
batch_tokens = batch_tokens.to(device)

with torch.no_grad():
    protein_representation, residue_representation, mask = model(batch_tokens)

protein_embedding = protein_representation.detach().cpu().numpy()  # [N, D]
print("Protein embeddings shape:", protein_embedding.shape)
print("Residue embeddings tensor shape (with padding):", residue_representation.shape)
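
The cell above keeps the embeddings in memory only, and the residue-level tensor is still padded to the longest sequence in the batch. The sketch below shows one way to trim the padding and write both outputs to disk with NumPy. It uses small stand-in arrays so it runs on its own. It assumes that `mask` has shape `[N, L]` and is nonzero at non-padding positions, which this notebook does not state. Whether special tokens such as begin/end markers are also covered by the mask is not specified here, so check that before relying on exact per-residue alignment. The helper `trim_padding` and the output file names are hypothetical.

```python
import os
import tempfile
import numpy as np

def trim_padding(residue_emb, mask):
    """Return one [L_i, D] array per sequence, keeping positions where mask is nonzero.

    Hypothetical helper; assumes residue_emb is [N, L, D] and mask is [N, L].
    """
    return [emb[m.astype(bool)] for emb, m in zip(residue_emb, mask)]

# Stand-in data with the assumed shapes: N=2 sequences, padded length L=4, D=3.
protein_emb = np.arange(6, dtype=np.float32).reshape(2, 3)
residue_emb = np.arange(24, dtype=np.float32).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 1], [1, 1, 0, 0]])

per_residue = trim_padding(residue_emb, mask)

# Write to a temporary directory here; choose a persistent path in practice.
out_dir = tempfile.mkdtemp()
protein_path = os.path.join(out_dir, "protein_embeddings.npy")
residue_path = os.path.join(out_dir, "residue_embeddings.npz")

np.save(protein_path, protein_emb)  # one [N, D] array
np.savez(residue_path, **{f"seq_{i}": a for i, a in enumerate(per_residue)})

restored = np.load(residue_path)
print([restored[k].shape for k in sorted(restored.files)])
```

In the notebook, replace the stand-in arrays with `protein_embedding`, `residue_representation.cpu().numpy()`, and `mask.cpu().numpy()`.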


### **Downstream Tasks**

S-PLM v1 provides lightweight tuning code for **supervised downstream tasks** including **EC**, **GO**, **fold**, **ER**, and **secondary structure (SS)** prediction. You set hyperparameters in the corresponding config file and launch the task-specific training script with your pretrained checkpoint.

* **Task scripts:** `train_ec.py`, `train_er.py`, `train_fold.py`, `train_go.py`, `train_ss.py`
* **Configs:** `configs/config_{task}.yaml` (multiple examples provided)

S-PLM supports **fine-tuning top layers, Adapter Tuning, and LoRA**; pick the variant by choosing the matching config.


**Command template**

```bash
accelerate launch train_{task}.py \
  --config_path configs/<config_name>.yaml \
  --resume_path checkpoint_0520000.pth
```
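
When launching several tasks from a notebook, it can be convenient to fill in this template programmatically. The following is a minimal sketch: `build_train_command` is a hypothetical helper (not part of S-PLM) that only assembles the argument list from the template above, using the task names listed under the task scripts.

```python
import shlex

# Task names taken from the train_{task}.py scripts listed above.
TASKS = {"ec", "er", "fold", "go", "ss"}

def build_train_command(task, config_name, checkpoint="checkpoint_0520000.pth"):
    """Fill in the command template and return it as an argument list.

    Hypothetical helper; it does not check that the config file exists.
    """
    if task not in TASKS:
        raise ValueError(f"Unknown task {task!r}; expected one of {sorted(TASKS)}")
    return [
        "accelerate", "launch", f"train_{task}.py",
        "--config_path", f"configs/{config_name}.yaml",
        "--resume_path", checkpoint,
    ]

cmd = build_train_command("go", "bp_config_adapterH_adapterH")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True, cwd="/content/S-PLM")` so that the relative script and config paths resolve inside the cloned repository.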

**Examples** (written for notebook cells; drop the leading `!` when running in a regular shell)

```bash
# GO (BP / CC / MF)
!python train_go.py --config_path configs/bp_config_adapterH_adapterH.yaml --resume_path checkpoint_0520000.pth
!python train_go.py --config_path configs/cc_config_adapterH_adapterH.yaml --resume_path checkpoint_0520000.pth
!python train_go.py --config_path configs/mf_config_adapterH_adapterH.yaml --resume_path checkpoint_0520000.pth

# Fold classification
!python train_fold.py --config_path configs/fold_config_adapterH_finetune.yaml --resume_path checkpoint_0520000.pth

# Secondary structure
!python train_ss.py --config_path configs/ss_config_adapterH_finetune.yaml --resume_path checkpoint_0520000.pth
```

**Note:** Colab GPU memory is limited, so **reduce the batch size** before training there. `configs/bp_config_adapterH_adapterH_ColabA10040G.yaml` is a ready-made example and is used in the cell below.

In [None]:
# Also expose the repo at /S-PLM (-sfn keeps this cell safe to re-run)
!ln -sfn /content/S-PLM /S-PLM
!python /content/S-PLM/train_go.py --config_path /content/S-PLM/configs/bp_config_adapterH_adapterH_ColabA10040G.yaml --resume_path /content/checkpoint_0520000.pth