# Fine-tuning AlphaBind on your data and running inference
**2024-10-21**

This tutorial shows how to fine-tune AlphaBind on your own protein-protein interaction (PPI) data. It covers the following steps:

1. Preparing your data to be consumed by AlphaBind
2. Fine-tuning AlphaBind to customize it to your PPI data
3. Running inference to predict on a new dataset

## 1) Prepare your data for AlphaBind

To fine-tune the AlphaBind pre-trained model checkpoint, you need to convert your dataset into the format the model expects as input. AlphaBind predictions are based on the AlphaSeq assay, so you must always format your data in the "YM" ("YeastMating") format. This requires 3 critical columns in your dataset:
- `sequence_a`: The peptide sequence of one of the proteins in a PPI
- `sequence_alpha`: The peptide sequence of the second protein in a PPI
- `Kd`: A measurement or proxy measurement of binding affinity. NOTE: The AlphaBind pre-trained model was trained on data expressed as log10-Kd. Your data does not need to be in the same range or scale, but we strongly recommend that it follow the same sign convention (i.e. large negative values represent strong binders, and large positive values represent weak binders).

The above columns are case-sensitive and mandatory. While you do not need explicit AlphaSeq assay data to use AlphaBind, you need to derive the analogs to each of these columns in your native dataset.
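
A quick pre-flight check can catch misnamed columns before any time is spent on embedding or training. The sketch below is illustrative only: `REQUIRED_COLUMNS` and `missing_required_columns` are hypothetical helper names, not part of the `alphabind` package.

```python
# Hypothetical helper, not part of alphabind: the three mandatory column names.
REQUIRED_COLUMNS = ("sequence_a", "sequence_alpha", "Kd")


def missing_required_columns(columns):
    """Return the required column names absent from `columns` (case-sensitive)."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# "kd" does not satisfy the requirement because matching is case-sensitive.
print(missing_required_columns(["sequence_a", "sequence_alpha", "kd"]))
# prints: ['Kd']
```

The same function accepts the `columns` attribute of a pandas DataFrame, since it only needs an iterable of names.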

### NOTE: `sequence_a` and `sequence_alpha` assignment convention

When assigning sequences, we recommend choosing a consistent convention for assigning functional classes of proteins to `sequence_a` vs. `sequence_alpha`. For example, in a dataset of antibody-antigen interactions, provide sequences for all antibodies and their variants in `sequence_a` only, and provide all antigens and their respective variants in `sequence_alpha` only.

**NOTE**: If you intend to run optimization to design your own sequences later, ensure that the protein you want to modify is in `sequence_a`. For example, if you want to generate diverse antibodies against a known target, your `sequence_a` would be the antibodies and your `sequence_alpha` would be the target antigen sequence.

### NOTE: Max sequence length

The AlphaBind model pre-trained checkpoint supports a max combined sequence length (`sequence_a` and `sequence_alpha`) of 600 amino acids (AA). Ensure that the combined length of each PPI pair in your dataset is within this range.
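
One way to enforce this limit is to split the dataset into pairs that fit and pairs that do not before writing the training file. The following is a minimal sketch; the helper names are hypothetical, and treating a pair of exactly 600 AA as acceptable is an assumption about how "within this range" should be read.

```python
MAX_COMBINED_LENGTH = 600  # limit stated for the pre-trained checkpoint


def combined_length(sequence_a, sequence_alpha):
    """Total number of residues across both partners of a PPI pair."""
    return len(sequence_a) + len(sequence_alpha)


def split_by_length(pairs, max_length=MAX_COMBINED_LENGTH):
    """Partition (sequence_a, sequence_alpha) pairs into (kept, too_long)."""
    kept, too_long = [], []
    for pair in pairs:
        # Assumption: a pair of exactly `max_length` residues is accepted.
        if combined_length(*pair) <= max_length:
            kept.append(pair)
        else:
            too_long.append(pair)
    return kept, too_long


# Toy sequences: the first pair is 600 AA in total, the second is 601 AA.
pairs = [("A" * 120, "C" * 480), ("A" * 121, "C" * 480)]
kept, too_long = split_by_length(pairs)
print(len(kept), len(too_long))  # prints: 1 1
```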

### Example: Pre-processing the data from _Mason et al._

As an illustrative example, we will use the data from [Mason et al](https://www.nature.com/articles/s41551-021-00699-9). Here, we will transform the data into something that looks YM-like, even though the original experimental assay is not AlphaSeq.

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler

### A. Basic EDA on the Dataset

The authors of the paper generated 3 libraries of Trastuzumab variants in the IgG format with mutations in the CDRH3 region. The starting sequence was a non-binding variant. The libraries were then transfected into mammalian cells and sorted by FACS. One round of enrichment identified an antigen-binding (Ag+) and a non-binding (Ag-) population. A second round of enrichment on the Ag+ population further refined a population of stronger binders (Ag+2). Thus, three binding populations are defined: Ag+2, Ag+, and Ag-.

The most relevant columns of the dataset are as follows:
- `AASeq`: CDRH3 sequence
- `Enrichment Ratio`: Proxy for affinity
- `VH Seq`: Full antibody sequence
- `Target Seq`: Target antigen sequence

In [None]:
df = pd.read_csv("alphabind/examples/data/raw/DSM_enrichment_full_seqs.csv")
df.head()

### B. Cleaning and preparing the data

In the dataset, we want to ensure the following:
- There is a uniquely assigned score to each PPI (i.e. exactly one `Kd` value for each unique `sequence_a` : `sequence_alpha` pair)
- [_Training Data Only_]: There are no missing or `NaN` values for the `Kd` score.

We check the following:
- Number of unique antibody sequences/CDRH3 sequences
- Number of unique antigens
- Number of NaN counts

In [None]:
print("Unique CDRH3 sequences", df["AASeq"].nunique(), "/", df.shape[0])
print("Unique VH sequences", df["VH Seq"].nunique(), "/", df.shape[0])
print("Unique antigens", df["Target Seq"].nunique(), "/", df.shape[0])
print("Missing affinities?", df["Enrichment Ratio"].isna().sum(), "/", df.shape[0])
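
The counts above do not reveal whether a repeated pair carries conflicting scores, which is what the "exactly one `Kd` per pair" requirement is about. A dependency-free sketch of that check follows; `find_label_problems` is a hypothetical helper and the sequences are made-up toy strings.

```python
import math
from collections import defaultdict


def find_label_problems(rows):
    """Scan rows of (sequence_a, sequence_alpha, kd).

    Returns (conflicting_pairs, n_missing):
    - conflicting_pairs: pairs seen with more than one distinct score
    - n_missing: number of rows whose score is None or NaN
    """
    values_by_pair = defaultdict(set)
    n_missing = 0
    for seq_a, seq_alpha, kd in rows:
        if kd is None or math.isnan(kd):
            n_missing += 1
            continue
        values_by_pair[(seq_a, seq_alpha)].add(kd)
    conflicting = sorted(p for p, v in values_by_pair.items() if len(v) > 1)
    return conflicting, n_missing


rows = [
    ("QVQL", "TQVC", 1.5),
    ("QVQL", "TQVC", 0.5),           # same pair, different score: a conflict
    ("EVQL", "TQVC", 2.0),
    ("EVQL", "TQVC", float("nan")),  # missing score
]
print(find_label_problems(rows))
# prints: ([('QVQL', 'TQVC')], 1)
```

For the dataframe in this example, the rows could be supplied with `df[["VH Seq", "Target Seq", "Enrichment Ratio"]].itertuples(index=False)`. How to resolve a conflict (for example by averaging or dropping the pair) is a judgment call that this tutorial does not prescribe.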

We can inspect the proxy for affinity, `Enrichment Ratio`, by plotting its distribution across weak and strong binders. The histogram below shows a roughly bimodal distribution.

In [None]:
f, ax = plt.subplots(1, 1)
sns.histplot(data=df, x="Enrichment Ratio", bins=20)

### C. Preparing your affinity-proxy data

In the above example, the `Enrichment Ratio` is our proxy for affinity. AlphaSeq data is typically expressed as $\log_{10} K_d$, with values usually falling between -3 and 3. While not strictly necessary, **standard scaling the data** tends to be helpful (although effective ML models can learn simple linear transformations).

Because of the definition of $K_d$, the stronger the binder, the lower the $K_d$. Our optimization algorithms determine "better" sequences by going downhill in affinity; **this means that if you intend to use any of our built-in optimization methods, you must ensure that lower (more negative) values of your affinity proxy correspond to stronger binding**. In this dataset, a higher Enrichment Ratio is "better", so we perform the following operations to make the data more amenable to AlphaBind.

- (a) Standard scale the data
- (b) Flip the sign of the ratio for further optimization
- (c) Re-center at 2

Steps (a) and (c) are optional; step (b) is mandatory if you choose to use our optimization algorithms.
We save the transformed data as `Kd_transformed` in our dataframe.

In [None]:
Kd_center = 2.0
scaler = StandardScaler()

df["Kd_transformed"] = scaler.fit_transform(df[["Enrichment Ratio"]])[:, 0]
df["Kd_transformed"] = -df["Kd_transformed"]
df["Kd_transformed"] = df["Kd_transformed"] + Kd_center

# Plot the new transform
f, ax = plt.subplots(1, 1)
sns.histplot(data=df, x="Kd_transformed", bins=20)
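
To see exactly what the three operations do, they can be reproduced on a handful of numbers without scikit-learn. This is a sketch under stated assumptions: `to_kd_like` is a hypothetical name, the enrichment values are made up, and the population standard deviation is used because that is what `StandardScaler` divides by.

```python
import statistics


def to_kd_like(values, center=2.0):
    """Standard-scale `values`, flip the sign, then shift by `center`."""
    mean = statistics.fmean(values)
    # StandardScaler divides by the population standard deviation (ddof=0).
    std = statistics.pstdev(values)
    return [-(v - mean) / std + center for v in values]


enrichment = [0.5, 1.0, 4.0, 6.5]  # made-up values; higher = better binder
kd_like = to_kd_like(enrichment)

# The best binder (highest enrichment) now has the lowest score.
print(kd_like.index(min(kd_like)) == enrichment.index(max(enrichment)))  # prints: True
# The transformed values are centered on 2 with unit spread.
print(round(statistics.fmean(kd_like), 6), round(statistics.pstdev(kd_like), 6))
# prints: 2.0 1.0
```

Note that the sketch divides by zero if every input value is identical, so a constant column needs separate handling.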

### D. Convert to a YM-format

We now want to convert this dataset into something that can be used for training. We will pull out the necessary columns.

We assign `VH Seq` to `sequence_a`, `Target Seq` to `sequence_alpha`, and the transformed enrichment ratio (`Kd_transformed`) to `Kd`. Note that the column names are case-sensitive and required for fine-tuning. For bookkeeping, you may also include other columns such as `description_a` or `description_alpha` to keep track of what these sequences are; only the three columns above are required.

In [6]:
ymdf = pd.DataFrame()
ymdf["sequence_a"] = df["VH Seq"]
ymdf["sequence_alpha"] = df["Target Seq"]
ymdf["Kd"] = df["Kd_transformed"]
ymdf.to_csv(
    "alphabind/examples/data/preprocessed/train_data.csv",
    index=False,
)

## 2) Train and predict on your dataset

### Prerequisites

#### Data Pre-processing

This tutorial assumes the user has already prepared their data per section `1)` above. At minimum, you will need to ensure that your dataset has a `sequence_a`, `sequence_alpha`, and `Kd` column for training. 

#### Build the `alphabind` Docker image

The following steps depend on the `alphabind` Docker image, which you will need to build yourself due to BioNeMo's licensing restrictions. See the primary [`README.md`](../../../README.md) for detailed instructions on how to build that image.

### A. Configuring `.env` files

Sample config files for each step are provided in `alphabind/examples/finetuning_and_inference/conf`.

These files specify environment variables which are read by Click to provision function arguments for each step. These environment variables are all prefixed with `ALPHABIND_`, but otherwise correspond exactly to the argument names defined in the respective module for each step.

For example, in `embed.env`, the `ALPHABIND_INPUT_FILEPATH` environment variable corresponds to the `input_filepath` argument of the `alphabind.features.build_features` module. For more details on each argument, consult the provided docstring in the source code. For convenience, we also provide a brief overview of the arguments here:
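
The naming convention can be illustrated with a few lines of plain Python. This sketch only mimics the mapping from variable name to argument name; it is not how Click implements it, and `env_to_arguments` is a hypothetical helper.

```python
PREFIX = "ALPHABIND_"


def env_to_arguments(environ, prefix=PREFIX):
    """Map prefixed environment variables to lower-case argument names."""
    return {
        name[len(prefix):].lower(): value
        for name, value in environ.items()
        if name.startswith(prefix)
    }


example_env = {
    "ALPHABIND_INPUT_FILEPATH": "/mnt/data/preprocessed/train_data.csv",
    "ALPHABIND_BATCH_SIZE": "16",
    "HOME": "/root",  # ignored: no ALPHABIND_ prefix
}
print(env_to_arguments(example_env))
# prints: {'input_filepath': '/mnt/data/preprocessed/train_data.csv', 'batch_size': '16'}
```

Environment variables are always strings, so a value such as `16` only becomes an integer once the receiving command-line option converts it.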

#### `embed.env`

- `ALPHABIND_INPUT_FILEPATH`: Path of your training CSV in the Docker guest. For this tutorial, this is `/mnt/data/preprocessed/train_data.csv`
- `ALPHABIND_OUTPUT_FILEPATH`: Destination path for the featurized CSV created from `ALPHABIND_INPUT_FILEPATH`, which includes the pre-computed ESM embeddings. For this tutorial, this is `/mnt/data/embeddings/train_data_featurized.csv`
- `ALPHABIND_EMBEDDING_DIR_PATH`: Destination directory path for the raw embeddings. e.g. `/mnt/data/embeddings/raw`
- `ALPHABIND_BATCH_SIZE`: Batch size to use for computing embeddings. The default is `16`, which works well on `p3.2xlarge` instances.
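
Putting these values together, an `embed.env` for this tutorial might look like the following. This is a sketch assembled from the descriptions above, not a copy of the shipped sample file. Docker's `--env-file` expects plain `KEY=VALUE` lines and treats quotes literally, so the values are left unquoted.

```text
ALPHABIND_INPUT_FILEPATH=/mnt/data/preprocessed/train_data.csv
ALPHABIND_OUTPUT_FILEPATH=/mnt/data/embeddings/train_data_featurized.csv
ALPHABIND_EMBEDDING_DIR_PATH=/mnt/data/embeddings/raw
ALPHABIND_BATCH_SIZE=16
```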

#### `train.env`

- `ALPHABIND_DATASET_CSV_PATH`: Path inside the Docker guest to the pre-featurized training data. In this tutorial that is `/mnt/data/embeddings/train_data_featurized.csv`.
- `ALPHABIND_TX_MODEL_PATH`: Path inside the Docker guest to the AlphaBind model pre-trained checkpoint, `/mnt/models/alphabind_pretrained_checkpoint.pt`.
- `ALPHABIND_MAX_EPOCHS`: Number of epochs to train for. Note that the current code does not use an `EarlyStopping` callback, so the model will always be trained for this many epochs. However, only the single best model by `val_loss` is returned, so the returned model may be from an earlier epoch.
- `ALPHABIND_OUTPUT_MODEL_PATH`: Destination path in the Docker guest for the final best model by `val_loss`, e.g. `/mnt/data/models/model_trained.pt`. **CAUTION**: This path will be overwritten without warning if it already exists.

#### `predict.env`

- `ALPHABIND_PPI_DATASET_PATH`: Path in the Docker guest to the data to predict on. This data does *not* need to be pre-featurized, since the script will perform that step automatically. e.g. `/mnt/data/preprocessed/train_data.csv`
- `ALPHABIND_TRAINED_MODEL_PATH`: Path in the Docker guest to the model to use for prediction, e.g. `/mnt/data/models/model_trained.pt`.
- `ALPHABIND_OUTPUT_DATASET_PATH`: Destination path in the Docker guest for labeled predictions, e.g. `/mnt/data/embeddings/train_data_predicted.csv`.


### B. Example workflow for fine-tuning a model on your data and using it to predict binding affinity

This workflow consists of 3 steps:
- Pre-featurizing (embedding)
- Model training
- Model inference (prediction)

#### Embedding Step (for ESM model types only)

You will first pre-compute embeddings. This will create pre-computed ESM-2nv embeddings for your sequences, which will be used to accelerate model training in the next step. Depending on how many sequences you have, this can be a long step. The provided example `train_data.csv` completes in roughly 20 minutes on a `g5.4xlarge` instance.

You can do this as follows:

In [None]:
! docker run --rm -it --init --entrypoint python --gpus=all --shm-size=64G --env-file=alphabind/examples/finetuning_and_inference/conf/embed.env --name=alphabind_embed -v ./alphabind/examples/data:/mnt/data alphabind:latest -m alphabind.features.build_features

#### Train/Fine-tune your model

**NOTE:** This is frequently a long step (~2 hours on a `g5.4xlarge` instance). We provide an inline implementation in this notebook for convenience, but you may wish to run this step manually in a terminal multiplexer (e.g. `tmux`, `screen`) to make the training job more resilient against, for example, temporary network interruptions to a remote instance.

Here, we illustrate fine-tuning the AlphaBind pre-trained checkpoint on novel experimental data. You can do this as follows:

In [None]:
! docker run --rm -it --init --entrypoint python --gpus=all --shm-size=64G --env-file=alphabind/examples/finetuning_and_inference/conf/train.env --name=alphabind_train -v ./alphabind/examples/data:/mnt/data -v ./alphabind/models:/mnt/models alphabind:latest -m alphabind.models.train_model
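
Every path in the `.env` files is interpreted inside the container, so it helps to be explicit about how the `-v` bind mounts translate repo-relative host paths into guest paths. A minimal sketch of that mapping follows; `MOUNTS` and `host_to_guest` are hypothetical names used only for illustration, and the mount table simply restates the `-v` flags of the command above.

```python
from pathlib import PurePosixPath

# Bind mounts from the `docker run` command: host directory -> guest directory.
MOUNTS = {
    PurePosixPath("alphabind/examples/data"): PurePosixPath("/mnt/data"),
    PurePosixPath("alphabind/models"): PurePosixPath("/mnt/models"),
}


def host_to_guest(host_path, mounts=MOUNTS):
    """Translate a repo-relative host path into its path inside the container."""
    host_path = PurePosixPath(host_path)  # a leading "./" is normalized away
    for host_dir, guest_dir in mounts.items():
        try:
            return guest_dir / host_path.relative_to(host_dir)
        except ValueError:
            continue  # not under this mount; try the next one
    raise ValueError(f"{host_path} is not under any mounted directory")


print(host_to_guest("alphabind/examples/data/preprocessed/train_data.csv"))
# prints: /mnt/data/preprocessed/train_data.csv
```

A file that is not under one of the mounted directories is invisible to the container, which the sketch reports as a `ValueError`.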

##### Checking model performance

Your most-recent training metrics should have been dumped to (relative to the repo root) `./alphabind/examples/data/models/logs/lightning_logs/version_0/metrics.csv` (or more generally, `version_N`, where `N` is the highest present version index). You can monitor this output file dynamically with `tail -f metrics.csv`. There are a few quantities you may want to consider:

- `train_loss`: Mini batch train loss
- `val_loss`: Validation loss per epoch
- `spearman_rho`: Validation Spearman Rho
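
Since the returned model is the one with the lowest `val_loss`, it can be useful to find out which epoch that was. The sketch below reads a metrics file with the standard library. It assumes the usual Lightning CSV logger layout, with `epoch` and `step` columns and empty cells for metrics not logged on a given row; that layout, the helper name `best_epoch_by_val_loss`, and the sample numbers are all assumptions rather than output from a real run.

```python
import csv
import io


def best_epoch_by_val_loss(metrics_file):
    """Return (epoch, val_loss) for the row with the lowest validation loss."""
    best = None
    for row in csv.DictReader(metrics_file):
        raw = row.get("val_loss", "")
        if not raw:
            continue  # training-step rows leave val_loss empty
        candidate = (int(row["epoch"]), float(raw))
        if best is None or candidate[1] < best[1]:
            best = candidate
    return best


# Made-up metrics in the assumed layout, for illustration only.
sample = io.StringIO(
    "epoch,step,train_loss,val_loss,spearman_rho\n"
    "0,49,1.20,,\n"
    "0,99,,0.95,0.41\n"
    "1,149,0.80,,\n"
    "1,199,,0.70,0.55\n"
    "2,249,0.60,,\n"
    "2,299,,0.75,0.53\n"
)
print(best_epoch_by_val_loss(sample))  # prints: (1, 0.7)
```

For a real run, pass an open file handle instead of the in-memory sample, e.g. `with open("metrics.csv", newline="") as fh: best_epoch_by_val_loss(fh)`.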

#### Predict using your trained model

If you have a separate file to run inference on, you can run that as follows (here we simply predict on the original train set for illustration):

In [None]:
! docker run --rm -it --init --entrypoint python --gpus=all --shm-size=64G --env-file=alphabind/examples/finetuning_and_inference/conf/predict.env --name=alphabind_predict -v ./alphabind/examples/data:/mnt/data -v ./alphabind/models:/mnt/models alphabind:latest -m alphabind.models.predict_model

Prepare your prediction data the same way as your training data. You do not need to include a `Kd` column; correctly set up `sequence_a` and `sequence_alpha` columns are sufficient. Prediction runs BioNeMo dynamically under the hood, meaning that embeddings are re-computed for each PPI pair of sequences.
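
As a small illustration of a prediction-only input, the sketch below writes just the two sequence columns and silently drops `Kd` and any other extras. `write_prediction_csv` is a hypothetical helper and the sequences are toy strings, not real proteins.

```python
import csv
import io

PREDICTION_COLUMNS = ["sequence_a", "sequence_alpha"]


def write_prediction_csv(rows, out_file):
    """Write only the two sequence columns, ignoring any other keys."""
    writer = csv.DictWriter(
        out_file, fieldnames=PREDICTION_COLUMNS, extrasaction="ignore"
    )
    writer.writeheader()
    writer.writerows(rows)


rows = [
    {"sequence_a": "EVQLVESGG", "sequence_alpha": "TQVCTGTDM", "Kd": 1.7},
    {"sequence_a": "EVQLVQSGG", "sequence_alpha": "TQVCTGTDM"},
]
buffer = io.StringIO()
write_prediction_csv(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, open the target file with `newline=""` and pass the handle as `out_file`.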

## References

1. Mason, D.M., Friedensohn, S., Weber, C.R. et al. Optimization of therapeutic antibodies by predicting antigen specificity from antibody sequence via deep learning. Nat Biomed Eng 5, 600–612 (2021). https://doi.org/10.1038/s41551-021-00699-9 