# Building Scalable Drug Discovery Applications: Active Learning

Note: This notebook was last tested in an Amazon SageMaker Studio JupyterLab space on an ml.g4dn.xlarge instance.

### How can antibody-based drugs help fight disease?

Antibody drugs, also known as monoclonal antibodies or immunotherapies, are an important class of medications used to treat various diseases. 

Examples of diseases treated with antibody drugs include:

- Cancer: Drugs like trastuzumab (Herceptin) for HER2-positive breast cancer or pembrolizumab (Keytruda) for various cancers.

- Autoimmune disorders: Drugs such as adalimumab (Humira) for rheumatoid arthritis and Crohn's disease.

- Infectious diseases: Antibody cocktails for Ebola virus or COVID-19.

- Allergic conditions: Omalizumab (Xolair) for severe asthma and chronic hives.

- Neurological disorders: Drugs like ocrelizumab (Ocrevus) for multiple sclerosis.

Antibody drugs have revolutionized the treatment of many diseases, offering targeted therapies that often have fewer side effects than traditional small-molecule drugs.

### How can active learning accelerate drug discovery?

Research teams working on antibody-based drugs may need to generate and test many thousands of candidates before they find one with the best mix of properties. This can be very expensive and take a long time.

Active learning is a machine learning technique where the predictive algorithm actively participates in the training data selection process, rather than passively learning from a fixed dataset. The key idea behind active learning is to reduce the amount of labeled training data needed to reach your goal by intelligently picking the right examples.

This is especially useful when training data is expensive or difficult to generate, as it is in drug development! Let's say we want to predict a property for 100,000 drug candidates. We could test them all, but it would take a while. Instead, we can test a few and use them to train an ML model. We then use the model to predict the property value for all candidates. Finally, we use a selection strategy to pick a few more samples to test, and repeat until we reach our goal.

![Active learning can accelerate DMTL cycles](img/active_learning.png)
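
To make the loop concrete, here is a minimal sketch on toy data. Everything in it is an assumption made for illustration: the candidates are random feature vectors, the "lab" is a hidden linear function with noise, the model is ordinary least squares, and the selection strategy is simply "pick the highest predictions". The rest of this notebook uses real sequences and a protein language model instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions): 2,000 candidates with 8 numeric features each,
# and a hidden linear "assay" that plays the role of the lab.
n_candidates, n_features, batch_size, n_rounds = 2000, 8, 50, 4
features = rng.normal(size=(n_candidates, n_features))
hidden_weights = rng.normal(size=n_features)


def run_assay(idx):
    """Pretend lab measurement: hidden linear score plus a little noise."""
    return features[idx] @ hidden_weights + rng.normal(scale=0.1, size=len(idx))


labels = np.full(n_candidates, np.nan)  # NaN means "not yet tested"

# Round 0: nothing is known yet, so test a random batch.
batch = rng.choice(n_candidates, size=batch_size, replace=False)
labels[batch] = run_assay(batch)
best_after_random = np.nanmax(labels)

for _ in range(n_rounds):
    tested = ~np.isnan(labels)
    # Train a simple model (least squares) on everything tested so far.
    weights, *_ = np.linalg.lstsq(features[tested], labels[tested], rcond=None)
    # Predict the property for every candidate.
    predicted = features @ weights
    # Strategy: greedily pick the highest-predicted untested candidates.
    predicted[tested] = -np.inf
    batch = np.argsort(predicted)[-batch_size:]
    labels[batch] = run_assay(batch)

n_tested = int((~np.isnan(labels)).sum())
best_found = np.nanmax(labels)
print(f"Tested {n_tested} of {n_candidates}; best measured value: {best_found:.2f}")
```

Only a fraction of the pool is ever "measured", yet each round concentrates the lab budget on the candidates the model currently believes are best.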

### What is the goal of this notebook?

In this example, we will use active learning to predict a property ("Factor X") for a large number of "nanobody" molecules. A nanobody is a small antibody fragment derived from camelids like camels and llamas. Nanobodies are much smaller than typical antibodies, allowing them to access targets and bind to regions that larger antibody molecules cannot reach. They are also very stable and can be easily engineered to modify their properties.

This workflow generates nanobody molecules based on a commonly used scaffold, NbBCII10 humanized (FGLA mutant). The generated molecules share the same sequence as the scaffold EXCEPT for three so-called “complementarity determining regions” or CDRs, highlighted in orange below. These sequence regions are responsible for much of the binding activity of various antibody formats, including nanobodies.

![Nanobody compared to an IgG antibody](img/nanobody.png)








---
## 1. Setup

In [None]:
%pip install -U -r requirements.txt

In [None]:
import biotite
from biotite.structure.io import pdb
from biotite.database import rcsb
import helpers
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)

SEQ_GENERATION_SIZE = 50000
LAB_SUBMISSION_SIZE = 384

---
## 2. View Nanobody sequence and structure

Let's download and explore the structure of a nanobody. In this case we'll look at caplacizumab (Cablivi), a humanized nanobody that in 2019 became the first nanobody-based drug approved by the US FDA.

In [None]:
pdb_id = "7eow"
stack = biotite.structure.io.pdb.get_structure(
    pdb.PDBFile.read(rcsb.fetch(pdb_id, "pdb"))
)
vf_factor = helpers.clean_structure(stack[0][stack.chain_id == "A"])
caplacizumab = helpers.clean_structure(stack[0][stack.chain_id == "B"])

caplacizumab_seq = biotite.structure.to_sequence(caplacizumab)[0][0]
cdr1 = list(range(25, 32))
cdr2 = list(range(51, 57))
cdr3 = list(range(98, 117))
cdrs = cdr1 + cdr2 + cdr3
cdrs_1_base = [i + 1 for i in cdrs]

preserved_regions = [
    (0, cdr1[0]),
    (cdr1[-1] + 1, cdr2[0]),
    (cdr2[-1] + 1, cdr3[0]),
    (cdr3[-1] + 1, len(caplacizumab_seq)),
]

print(caplacizumab_seq)
print(helpers.format_cdrs(caplacizumab_seq, cdrs))
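
The `preserved_regions` list above is built as the complement of the CDR positions. The sketch below checks that relationship on its own. It assumes each pair is a half-open `(start, stop)` interval in 0-based coordinates (that is how the pairs are constructed, but how `helpers.random_mutation` interprets them is not shown here), and it uses a hypothetical sequence length of 128 rather than the real one.

```python
# Same 0-based index ranges as in the cell above.
cdr1 = list(range(25, 32))
cdr2 = list(range(51, 57))
cdr3 = list(range(98, 117))
cdrs = cdr1 + cdr2 + cdr3

seq_len = 128  # hypothetical sequence length, used only for this illustration

preserved_regions = [
    (0, cdr1[0]),
    (cdr1[-1] + 1, cdr2[0]),
    (cdr2[-1] + 1, cdr3[0]),
    (cdr3[-1] + 1, seq_len),
]

# Expand each half-open (start, stop) pair into explicit positions.
preserved_positions = [
    i for start, stop in preserved_regions for i in range(start, stop)
]

# Every position is either preserved or in a CDR, never both.
assert set(preserved_positions).isdisjoint(cdrs)
assert sorted(preserved_positions + cdrs) == list(range(seq_len))
print(preserved_regions)  # [(0, 25), (32, 51), (57, 98), (117, 128)]
```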

The caplacizumab molecule is shown in blue below. The light-blue regions are the CDRs. These play the largest role in the drug's effect on its target, so they are where we'll focus our attention.

In [None]:
import py3Dmol

view = py3Dmol.view(width=600, height=600)
view.addModel(helpers.to_pdb_string(vf_factor))
view.addModel(helpers.to_pdb_string(caplacizumab))
view.setStyle({"chain": "A"}, {"cartoon": {"color": "orange", "opacity": 0.6}})
view.setStyle({"chain": "B"}, {"cartoon": {"color": "blue", "opacity": 0.6}})
view.addStyle(
    {"chain": "B", "resi": cdrs_1_base},
    {"cartoon": {"color": "#57C4F8", "opacity": 1.0}},
)
view.zoomTo()
view.show()

---
## 3. Generate Sequence Variants

![Generate sequence variants using directed evolution](img/gen.png)

First, let's change some parts of the caplacizumab CDRs. This is similar to a process that natural antibodies go through in your body, known as affinity maturation. Sometimes researchers will also introduce mutations with chemicals or radiation to try and create new drugs. In our case, we'll create mutants computationally. There are some intelligent ways to do this, but for the sake of simplicity we'll do it randomly.

In [None]:
generated_seqs = helpers.random_mutation(
    wt_protein=str(caplacizumab_seq),
    n_output_seqs=SEQ_GENERATION_SIZE,
    preserved_regions=preserved_regions,
    max_mutations=10,
    annotate_hist=True,
)
print(f"Generated {len(generated_seqs)} sequences. Here's a preview:")
n_preview = 10
print(caplacizumab_seq)
for i in generated_seqs.sample(n_preview).itertuples():
    print(helpers.format_cdrs(i.seq, i.mutation, mask=True))
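
For intuition, here is one way random mutation restricted to the non-preserved positions could look. This is a sketch under stated assumptions, not the implementation of `helpers.random_mutation`: the function name `mutate_outside_preserved`, the short made-up sequence, and the choice to apply between 1 and `max_mutations` substitutions are all hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate_outside_preserved(wt, preserved_regions, max_mutations, rng):
    """Return a variant of `wt` with 1..max_mutations random substitutions
    at positions that are not inside any preserved (start, stop) region."""
    preserved = {i for start, stop in preserved_regions for i in range(start, stop)}
    mutable = [i for i in range(len(wt)) if i not in preserved]
    n_mutations = rng.randint(1, min(max_mutations, len(mutable)))
    positions = rng.sample(mutable, n_mutations)
    residues = list(wt)
    for pos in positions:
        # Choose a residue different from the current one so each site really changes.
        residues[pos] = rng.choice([aa for aa in AMINO_ACIDS if aa != wt[pos]])
    return "".join(residues), sorted(positions)


rng = random.Random(0)
wild_type = "EVQLVESGGGLVQPGG"  # short made-up sequence, not caplacizumab
regions = [(0, 5), (9, 16)]  # positions 5-8 are free to mutate
variant, mutated_positions = mutate_outside_preserved(wild_type, regions, 3, rng)
print(wild_type)
print(variant, mutated_positions)
```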

In this case, we have a (secret) way of calculating Factor X values from the amino acid sequence of each mutant. However, in the real world this probably won't be the case. Instead, we'll need to measure it experimentally in a laboratory. Examining ALL of the mutants could be expensive and take a long time. So instead, let's use ML to find the highest-scoring mutants with the smallest amount of lab data.

---
## 4. Select Samples

![Identify candidates for lab testing using a selection model](img/select.png)

Let's select a few samples and use them to train our model. We don't know much about these mutants yet, so we'll start by picking a few at random.

In [None]:
no_lab_data = generated_seqs[generated_seqs["lab_result"].isnull()]
selected_seqs = no_lab_data.sample(n=LAB_SUBMISSION_SIZE)
selected_seqs

---
## 5. Submit to Lab

![Submit selected samples for experimental testing](img/lab.png)

Next we'll submit them to the lab to test for "Factor X". Remember that in the real world this process could take days or even weeks, depending on the test!

In [None]:
lab_results = helpers.submit_seqs_to_lab(selected_seqs["seq"])

for result in lab_results.itertuples():
    generated_seqs.loc[[result.Index], ["lab_result"]] = result.result

display(lab_results.sort_values(by="result", ascending=False))
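
The loop above copies results back one row at a time. When the results share their index with the pool, as they appear to here, pandas can do the same thing in a single index-aligned assignment. The sketch below shows this on toy data; the frames `pool` and `results` are made-up stand-ins for `generated_seqs` and `lab_results`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for generated_seqs and lab_results (made-up values).
pool = pd.DataFrame(
    {"seq": ["AAA", "AAC", "AAD", "AAE"], "lab_result": np.nan},
    index=[10, 11, 12, 13],
)
results = pd.DataFrame({"seq": ["AAC", "AAE"], "result": [0.42, 0.17]}, index=[11, 13])

# Index-aligned assignment: one statement instead of a Python-level loop.
pool.loc[results.index, "lab_result"] = results["result"]
print(pool)
```

Rows 11 and 13 receive their measured values, while the untested rows keep `NaN`.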

---
## 6. Fine-Tune Scoring Model

![Improve the scoring model using experimental results](img/ft.png)

Isn't science great? With these results in hand we're ready to train a model. We don't have enough data to train a model with perfect accuracy (yet), but that's OK. Our job is to get close enough to identify the best mutants to test in the next round. In this case, we'll fine-tune a small protein language model that was previously trained to understand common patterns in naturally occurring protein sequences.

In [None]:
scoring_model = helpers.train_scoring_model(
    lab_results,
    sequence_column="seq",
    results_column="result",
    epochs=3,
    model_name_or_path="facebook/esm2_t12_35M_UR50D",
)
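
Since the model only needs to rank mutants well enough to choose the next batch, a rank correlation between predictions and measurements can be a more relevant check than absolute error. The sketch below computes a Spearman correlation with pandas on made-up numbers; the `held_out` frame and its values are hypothetical, and nothing here reflects what `helpers.train_scoring_model` reports.

```python
import pandas as pd

# Made-up held-out measurements and model predictions, for illustration only.
held_out = pd.DataFrame(
    {
        "result": [0.10, 0.35, 0.20, 0.80, 0.55],
        "prediction": [0.30, 0.42, 0.28, 0.61, 0.60],
    }
)

# Spearman correlation is the Pearson correlation of the ranks.
rank_corr = held_out["result"].rank().corr(held_out["prediction"].rank())
print(f"Spearman rank correlation: {rank_corr:.2f}")  # 0.90
```

A value near 1 means the model orders the sequences almost the same way the lab does, even if the predicted values themselves are off.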

---
## 7. Score Samples

![Predict high-performing variants using a scoring model](img/score.png)

Finally, we use our newly-trained model to predict the value of "Factor X" for all of our samples. This will give us a better idea of our mutants and help us pick another batch for lab testing.

In [None]:
predictions = helpers.run_scoring_model(generated_seqs, batch_size=1024)
generated_seqs["last_prediction"] = predictions
display(generated_seqs.sort_values(by="lab_result", ascending=False)[:5])
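
With predictions in place, picking the next batch can be as simple as taking the untested sequences with the highest predicted score. Here is a small sketch of that greedy strategy on toy data; the `pool` frame and its values are made up, and only the column names mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy pool (made-up values): two sequences already have lab results.
pool = pd.DataFrame(
    {
        "seq": ["AAA", "AAC", "AAD", "AAE", "AAF", "AAG"],
        "lab_result": [0.30, np.nan, np.nan, 0.55, np.nan, np.nan],
        "last_prediction": [0.31, 0.62, 0.12, 0.50, 0.48, 0.71],
    }
)

batch_size = 2
untested = pool[pool["lab_result"].isna()]
# Greedy strategy: the untested rows with the highest predicted score.
next_batch = untested.nlargest(batch_size, "last_prediction")
print(next_batch)
```

Greedy selection exploits what the model already believes. Mixing in a few randomly chosen sequences is one common way to keep exploring regions the model has not seen.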

---
## 8. Put it all together!

Now let's try a few cycles. Feel free to modify the parameters below as you see fit. Remember that our goal is to find the mutant with the largest value of "Factor X", not necessarily to end up with the most accurate model.

In [None]:
SEQ_GENERATION_SIZE = 50000
MAX_MUTATIONS = 10
N_REPS = 3
LAB_SUBMISSION_SIZE = 384
MODEL_ID = "facebook/esm2_t12_35M_UR50D"
# MODEL_ID = "facebook/esm2_t30_150M_UR50D"
# MODEL_ID = "facebook/esm2_t33_650M_UR50D"

In [None]:
#################################
# Generate
#################################
print(f"Generating {SEQ_GENERATION_SIZE} random variants")
generated_seqs = helpers.random_mutation(
    wt_protein=str(caplacizumab_seq),
    n_output_seqs=SEQ_GENERATION_SIZE,
    preserved_regions=preserved_regions,
    max_mutations=MAX_MUTATIONS,
)

for rep in range(N_REPS):

    #################################
    # Select Samples for Lab Analysis
    #################################

    print("#" * 79 + "\n" + f"Starting rep {rep+1} of {N_REPS}" + "\n" + "#" * 79)
    no_lab_data = generated_seqs[generated_seqs["lab_result"].isnull()]
    if rep == 0:
        print(
            f"{rep+1}: Selecting a batch of {LAB_SUBMISSION_SIZE} samples without lab data"
        )
        selected_seqs = no_lab_data.sample(n=LAB_SUBMISSION_SIZE)
    else:
        print(
            f"{rep+1}: Selecting batch of the highest-scoring {LAB_SUBMISSION_SIZE} samples without lab data"
        )
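        # Rank untested sequences by predicted score and keep the top of the list.
        # (Calling .sample() here would pick at random and ignore the ranking.)
        no_lab_data = no_lab_data.sort_values(by="last_prediction", ascending=False)
        selected_seqs = no_lab_data.head(LAB_SUBMISSION_SIZE)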

    #################################
    # Submit to Lab
    #################################

    print(f"{rep+1}: Submitting samples for lab analysis")
    lab_results = helpers.submit_seqs_to_lab(
        selected_seqs["seq"], delay=0.01, intro=True
    )

    for result in lab_results.itertuples():
        generated_seqs.loc[[result.Index], ["lab_result"]] = result.result

    # Skip the fine-tuning and scoring during the last round
    if rep + 1 < N_REPS:

        #################################
        # Fine-Tune on Lab Results
        #################################

        print(f"\n{rep+1}: Training scoring model on lab results")
        scoring_model = helpers.train_scoring_model(
            lab_results,
            sequence_column="seq",
            results_column="result",
            epochs=1,
            model_name_or_path=MODEL_ID if rep == 0 else "output",
        )

        #################################
        # Score all Generated Samples
        #################################

        print(f"\n{rep+1}: Using model to score all {SEQ_GENERATION_SIZE} sequences")
        predictions = helpers.run_scoring_model(generated_seqs, batch_size=1024)
        generated_seqs["last_prediction"] = predictions

    print(f"\n{rep+1}: Top-5 candidates so far")
    top_5 = generated_seqs.sort_values(by="lab_result", ascending=False)[:5]
    display(top_5)

In [None]:
print(
    str(0.39971).rjust(5)
    + "\t"
    + helpers.format_cdrs(str(caplacizumab_seq), cdrs, mask=False)
)
for i in top_5.itertuples():
    print(
        str(round(i.lab_result, 5)).rjust(5)
        + "\t"
        + helpers.format_cdrs(i.seq, i.mutation, mask=True)
    )

## 9. Your turn

Feel free to keep experimenting. Some things to try:

- Fine-tune one of the larger pre-trained models, like `esm2_t30_150M_UR50D` or `esm2_t33_650M_UR50D`
- Randomly generate more or fewer sequences
- Increase the number of possible mutations (as high as 30)
- Try generating a batch of variants, identifying the highest-scoring ones, then generating more variants from those! This method, called "directed evolution", can be a powerful technique for rapidly producing sequences with a desired property.
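
As a starting point for the last idea, here is a minimal sketch of a directed evolution loop. It is a toy under stated assumptions: `toy_score` (a count of tryptophan residues) stands in for a real lab measurement or scoring model, the starting sequence is made up, and `point_mutant` is a hypothetical helper that is not part of `helpers`.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(seq):
    """Made-up objective standing in for a lab assay: count of 'W' residues."""
    return seq.count("W")


def point_mutant(seq, rng):
    """Return a copy of `seq` with one randomly chosen position substituted."""
    pos = rng.randrange(len(seq))
    return seq[:pos] + rng.choice(AMINO_ACIDS) + seq[pos + 1 :]


rng = random.Random(0)
parents = ["ACDEFGHIKL"]  # hypothetical starting sequence
n_children, n_keep, n_rounds = 50, 5, 5
best_per_round = [toy_score(parents[0])]

for _ in range(n_rounds):
    # Generate variants from the current parents.
    children = [point_mutant(rng.choice(parents), rng) for _ in range(n_children)]
    # Keep the parents in the running so the best score can never get worse.
    # Ties are broken alphabetically to keep the run reproducible.
    candidates = sorted(set(parents + children), key=lambda s: (-toy_score(s), s))
    parents = candidates[:n_keep]
    best_per_round.append(toy_score(parents[0]))

print(best_per_round)
```

In this notebook's setting, you would replace `toy_score` with lab results or model predictions and restrict mutations to the CDR positions.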
