CA-MAP

This repository contains code to support the publication entitled "Context-aware Multi-Property Antibody Predictor: a Novel Framework Integrating Text and Protein Language Models". CA-MAP stands for Context-aware Multi-Property Antibody Predictor.

Overview

Recent advances in machine learning have transformed antibody development through in-silico models, accelerating the identification of therapeutic candidates. However, three critical challenges persist: rapid adaptation of property predictors to laboratory-specific assays with incomplete datasets; batch effects that introduce systematic bias; and assay costs that make efficient prediction of unseen properties necessary.

We introduce a novel multi-modal architecture featuring specialized tokenization and embedding projection that integrates text and protein language models (pLMs). Our framework enables prompting without dictionary merging across modalities, creating a compact model capable of in-context learning for multi-property prediction. The orchestrating model uses a Mamba-based architecture with learnable tokens and projector layers, avoiding pLM-to-text projection while enabling inference-time adaptation without retraining.

This project implements a novel approach for antibody property prediction by combining:

Protein Language Models (pLM): ESM2 for sequence embeddings
Sentence Transformers: For property name embeddings
Mamba/Transformer Architecture: For sequence modeling
In-Context Learning: Few-shot prediction using context antibodies

The model predicts antibody developability properties such as hydrophobicity, solubility, stability, and charge heterogeneity from amino acid sequences. It aims to learn correlations among the existing properties provided in the context and to predict new ones. It also aims to learn batch effects.

Architecture

Multi-Modal Token Embedding

Custom Tokenizer: Handles XML-formatted antibody data with placeholders
ESM2 Integration: Protein sequence embeddings via facebook/esm2_t6_8M_UR50D
Sentence Embeddings: Property name embeddings via paraphrase-multilingual-MiniLM-L12-v2
Value Embeddings: Direct property value projections
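
As an illustration of the placeholder idea, the following is a minimal sketch, not the code in tokenizer.py. It assumes the XML prompt layout shown under Data Format, and the placeholder names ([SEQ], [SEQ_L], [PROP_NAME], [PROP_VALUE]) and the function substitute_placeholders are hypothetical. Each modality-specific payload is swapped for a placeholder and collected in order, so that it could later be embedded by the matching encoder.

```python
import re

# Placeholder names are hypothetical; tokenizer.py may use different ones.
SEQ, SEQ_L = "[SEQ]", "[SEQ_L]"
PROP_NAME, PROP_VALUE = "[PROP_NAME]", "[PROP_VALUE]"

# One alternative per payload-carrying element of the prompt format.
PAYLOAD = re.compile(
    r'<seq>(?P<seq>[A-Z]+)</seq>'
    r'|<seq-l>(?P<seql>[A-Z]+)</seq-l>'
    r'|<property name="(?P<name>[^"]+)">(?P<value>[^<]+)</property>'
    r'|<property name="(?P<query>[^"]+)"\s*/>'
)


def substitute_placeholders(prompt):
    """Return (skeleton, payloads): the prompt with placeholders, and the
    removed payloads as (placeholder, payload) pairs in order of appearance."""
    payloads = []

    def swap(match):
        if match.group("seq") is not None:
            payloads.append((SEQ, match.group("seq")))
            return f"<seq>{SEQ}</seq>"
        if match.group("seql") is not None:
            payloads.append((SEQ_L, match.group("seql")))
            return f"<seq-l>{SEQ_L}</seq-l>"
        if match.group("name") is not None:
            payloads.append((PROP_NAME, match.group("name")))
            payloads.append((PROP_VALUE, float(match.group("value"))))
            return f'<property name="{PROP_NAME}">{PROP_VALUE}</property>'
        # Remaining alternative: the self-closing property of the query.
        payloads.append((PROP_NAME, match.group("query")))
        return f'<property name="{PROP_NAME}" />'

    return PAYLOAD.sub(swap, prompt), payloads


example = (
    '<antibody><property name="solubility">0.5</property>'
    '<seq>WVVNG</seq><seq-l>AREST</seq-l></antibody>'
    '<query-antibody><seq>GIDNP</seq><seq-l>FKAQK</seq-l>'
    '<property name="solubility" /></query-antibody>'
)
skeleton, payloads = substitute_placeholders(example)
print(skeleton)
print(payloads)
```

In this sketch the skeleton contains only structural text and placeholders, while the sequences, property names, and values travel separately to their respective embedding paths.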

Model

Main model: State-space model implementation (model_mamba.py)
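
The projector layers mentioned in the Overview can be pictured with the following sketch, which is an assumption-laden illustration and not the implementation in model_mamba.py. It assumes one pooled vector per sequence and per property name, uses a hypothetical model width D_MODEL, and replaces the pretrained encoders with random stand-in features. The widths 320 and 384 are the output sizes of facebook/esm2_t6_8M_UR50D and paraphrase-multilingual-MiniLM-L12-v2. A real implementation would more likely use a deep-learning framework's linear layer; plain Python is used here only to keep the example dependency-free.

```python
import random

D_MODEL = 8  # hypothetical width of the orchestrating model (tiny for readability)


class Linear:
    """Minimal dense layer y = W x + b with randomly initialised weights."""

    def __init__(self, d_in, d_out, rng):
        scale = d_in ** -0.5
        self.w = [[rng.uniform(-scale, scale) for _ in range(d_in)]
                  for _ in range(d_out)]
        self.b = [0.0] * d_out

    def __call__(self, x):
        return [sum(wi * xi for wi, xi in zip(row, x)) + bi
                for row, bi in zip(self.w, self.b)]


rng = random.Random(0)

# One projector per modality, all mapping into the same D_MODEL-wide space.
projectors = {
    "seq": Linear(320, D_MODEL, rng),    # protein sequence embedding (ESM2)
    "name": Linear(384, D_MODEL, rng),   # property name sentence embedding
    "value": Linear(1, D_MODEL, rng),    # scalar property value
}


def embed(kind, features):
    """Project modality-specific features into the shared model width."""
    return projectors[kind](features)


# Stand-in features; real code would obtain these from the pretrained encoders.
seq_vec = embed("seq", [rng.gauss(0, 1) for _ in range(320)])
name_vec = embed("name", [rng.gauss(0, 1) for _ in range(384)])
value_vec = embed("value", [0.95])
print(len(seq_vec), len(name_vec), len(value_vec))
```

After projection, all three kinds of token have the same width and can be placed in one sequence for the state-space model.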

Repository Structure

├── inference_exp/           # Directory containing experiments for inference and evaluation
├── models/                 # Trained model weights
├── runs/                   # TensorBoard logs and cached data (including processed dataset)
├── out/                    # Results of experiments 
├── train.py                 # Main training script
├── model_mamba.py          # Mamba model implementation + Embeddings 
├── dataset.py              # Dataset and data loading utilities
├── datasetDevProp.py       # Developability property data handling
├── tokenizer.py            # Custom tokenizer for multi-modal inputs

Data Format

This codebase follows the in-silico property naming and data format from https://github.com/csi-greifflab/developability_profiling/blob/main/data/native/developability.csv

It then simulates prompts representing antibodies with their properties:

<antibody>
    <property name="thermostability">0.95</property>
    <property name="solubility">0.5</property>
    <seq>WVVNGNRICENWLGTFNYHS</seq>
    <seq-l>ARESTIHWSCIAAYIAPSKEICVITWN</seq-l>
</antibody>
[...]
<antibody>
    <property name="thermostability">0.51</property>
    <seq>MCWDHGEPQPNFAETMDWDIYEV</seq>
    <seq-l>ISQVRPMHSFIITWSDVSYI</seq-l>
</antibody>


<query-antibody>
    <seq>GIDNPWRFNDISVKCWMIY</seq><seq-l>FKAQKNCQTW</seq-l>
    <property name="thermostability" />
</query-antibody>
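
A prompt in this layout could be assembled along the following lines. This is a minimal sketch, not the repository's own prompt generator: the function names (antibody_block, query_block, build_prompt) and the dictionary keys are hypothetical, and the sequences and values are copied from the example above.

```python
from xml.sax.saxutils import escape, quoteattr


def antibody_block(seq, seq_l, properties):
    """Render one context antibody with its known properties."""
    lines = ["<antibody>"]
    for name, value in properties.items():
        lines.append(f"    <property name={quoteattr(name)}>{value}</property>")
    lines.append(f"    <seq>{escape(seq)}</seq>")
    lines.append(f"    <seq-l>{escape(seq_l)}</seq-l>")
    lines.append("</antibody>")
    return "\n".join(lines)


def query_block(seq, seq_l, target):
    """Render the query antibody; the target property is left empty."""
    return "\n".join([
        "<query-antibody>",
        f"    <seq>{escape(seq)}</seq><seq-l>{escape(seq_l)}</seq-l>",
        f"    <property name={quoteattr(target)} />",
        "</query-antibody>",
    ])


def build_prompt(context, query, target):
    """Concatenate context antibodies followed by the query antibody."""
    blocks = [antibody_block(**antibody) for antibody in context]
    blocks.append(query_block(query["seq"], query["seq_l"], target))
    return "\n".join(blocks)


context = [
    {"seq": "WVVNGNRICENWLGTFNYHS",
     "seq_l": "ARESTIHWSCIAAYIAPSKEICVITWN",
     "properties": {"thermostability": 0.95, "solubility": 0.5}},
    {"seq": "MCWDHGEPQPNFAETMDWDIYEV",
     "seq_l": "ISQVRPMHSFIITWSDVSYI",
     "properties": {"thermostability": 0.51}},
]
query = {"seq": "GIDNPWRFNDISVKCWMIY", "seq_l": "FKAQKNCQTW"}

prompt = build_prompt(context, query, "thermostability")
print(prompt)
```

Context antibodies may carry different subsets of properties, as in the second entry, which matches the incomplete-dataset setting described in the Overview.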

Dependencies

A Python environment with CUDA support can be created with conda as follows:

# create environment
conda env create -f requirements.yml
# alternatively, if the mamba package manager is installed (unrelated to the Mamba model architecture), use
mamba env create -f requirements.yml

# activate environment
conda activate py-in-context-env

Note that Jupyter packages are not installed by default.

Training

Edit the parameters in train.py. The main parameters for an initial training run are:

expId: the ID of the experiment (used to name output files, including weights)
rawDataLocation: the dataset containing antibodies and properties, following the format in https://github.com/csi-greifflab/developability_profiling/blob/main/data/native/developability.csv

python train.py

Weights will be saved in models/[expID]_[timestamp_start_training]best_weights.pkl (best weights according to the validation set) and models/[expID][timestamp_start_training].pkl (final weights after all epochs).

Processed data splits and prompts will be saved in runs/tok/[expID]proc_dataset.pkl.

Training information compatible with TensorBoard will be saved in runs/training_info/[expID][timestamp_start_training].

TensorBoard, with training info and loss curves, can be invoked during (or after) training as:

tensorboard --port [your_port] --logdir runs/training_info

Inference/Experiments

Inference scripts are organized by experiment and should be stored in the inference_exp directory. One set of experiments is provided in inference_exp8.py.

Note that it requires running the training first; trained weights are not provided.

cd inference_exp
python inference_exp8.py

This will generate predictions with performance metrics and visualizations in the out/[expID] directory. Example outputs are included in out/exp8. Note that inference_exp8.py will also create CSV files with raw scores and ground truth (not currently included in out/exp8).
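
The metrics computed by the script are not listed here. As a hedged illustration of how raw scores and ground truth from such a CSV file might be compared, the sketch below computes a Pearson correlation and a mean absolute error. The column names (ground_truth, score) are hypothetical, and the three rows are made-up toy values, not results from the model.

```python
import csv
import io
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Toy stand-in for a CSV file; column names and values are invented.
toy_csv = io.StringIO("ground_truth,score\n0.1,0.2\n0.4,0.5\n0.9,0.8\n")
rows = list(csv.DictReader(toy_csv))

truth = [float(row["ground_truth"]) for row in rows]
pred = [float(row["score"]) for row in rows]

r = pearson(truth, pred)
mae = sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)
print(f"Pearson r = {r:.3f}, MAE = {mae:.3f}")
```

To use it on a real file, replace the io.StringIO object with an opened CSV file and adjust the column names to those actually written by the script.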

Citation

Luca Giancardo*, Melih Yilmaz, Edward Lee, Ke Ren, Yue Zhao, Gordon Trang, Kemal Sonmez, Lan Guo, Nina Cheng "Context-aware Multi-Property Antibody Predictor: a Novel Framework Integrating Text and Protein Language Models". (under review).

Pre-print available at https://www.biorxiv.org/content/10.64898/2026.01.07.698270v1

License

This project is licensed under the CC-BY-NC-4.0 License.

amazon-science/ca-map
