To determine whether, and to what extent, training with natively paired antibody sequence data can improve antibody-specific language models (LMs), we trained three baseline antibody language model (BALM) variants: BALM-paired, trained on natively paired data; BALM-shuffled, trained on randomly paired data; and BALM-unpaired, trained on the same antibody sequences but without pairing information. Additionally, we performed full fine-tuning of the state-of-the-art general protein LM ESM-2 using the same natively paired dataset used to train BALM-paired. The Jupyter notebooks in this repository contain all code necessary to re-train each of these models from scratch:
- BALM-paired: downloads training data (if necessary) and trains BALM-paired.
- BALM-shuffled: the training data must first be processed to randomly shuffle the heavy/light pairings (see the sketch after this list); training then uses the same script as BALM-paired.
- BALM-unpaired: downloads training data (if necessary) and trains BALM-unpaired.
- ESM-2 fine-tuning: downloads training data (if necessary) and performs full fine-tuning of ESM-2.
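For the shuffling step, a minimal sketch is shown below. The column names (`heavy_chain`, `light_chain`) and file paths are placeholders, not the actual schema used by the notebooks; adapt them to the real training data format.

```python
# Minimal sketch of generating a randomly shuffled-pair dataset from the
# natively paired data. Column names and file paths are placeholders.
import pandas as pd

def shuffle_pairs(paired_csv: str, shuffled_csv: str, seed: int = 42) -> None:
    """Randomly re-pair heavy and light chains from a natively paired dataset."""
    df = pd.read_csv(paired_csv)
    # Permute the light chains so each heavy chain is paired with a random partner,
    # keeping the same set of sequences as the natively paired dataset.
    df["light_chain"] = (
        df["light_chain"].sample(frac=1.0, random_state=seed).reset_index(drop=True)
    )
    df.to_csv(shuffled_csv, index=False)

shuffle_pairs("paired_train.csv", "shuffled_train.csv")
```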
Weights for each of the aforementioned models can be downloaded from Zenodo.
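Once downloaded, the weights can be loaded for inference or further training. The sketch below assumes the checkpoints are saved in Hugging Face `transformers` format and that the archive has been unpacked to a local directory; the directory name is a placeholder.

```python
# Minimal sketch of loading a downloaded checkpoint with Hugging Face transformers,
# assuming the weights are stored in transformers format.
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_dir = "./BALM-paired"  # placeholder path to the unpacked Zenodo archive
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForMaskedLM.from_pretrained(model_dir)
```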
BALM has been published in Patterns and can be cited as:
Burbach, S.M., & Briney, B. (2024). Improving antibody language models with native pairing.
Patterns. https://doi.org/10.1016/j.patter.2024.100967
The current version of the BALM dataset (v2024.02.20) can be cited as:
Burbach, S.M., & Briney, B. (2023). Improving antibody language models with native pairing (v2024.02.20) [Data set].
Zenodo. https://doi.org/10.5281/zenodo.10684811