# Building Scalable Drug Discovery Applications: Active Learning

![Active learning can accelerate DMTL cycles](img/active_learning.png)

Active learning is an effective way to minimize the amount of experimental data needed to reach an R&D result. It involves measuring a small set of samples, using the results to train a model, and then using that model to predict values for all of the samples. Repeating this cycle over several rounds can identify top-scoring candidates more quickly and cheaply than measuring every sample in the lab.
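
The cycle described above can be sketched in a few lines of plain Python. This is only a toy illustration, not the workflow used later in this notebook: `lab_assay` is a hypothetical stand-in for an experiment, the candidates are plain numbers rather than protein sequences, and the "model" is a simple nearest-neighbour lookup instead of a trained network.

```python
import random


def lab_assay(x):
    """Hypothetical stand-in for an expensive experiment; peaks at x = 0.7."""
    return -((x - 0.7) ** 2)


def predict(x, measured):
    """Toy surrogate model: reuse the result of the closest measured candidate."""
    nearest = min(measured, key=lambda m: abs(m - x))
    return measured[nearest]


def active_learning(candidates, batch_size, n_rounds):
    """Measure a batch, refit the model, pick the next batch, and repeat."""
    measured = {}  # candidate -> lab result
    for round_idx in range(n_rounds):
        unmeasured = [c for c in candidates if c not in measured]
        if round_idx == 0:
            # No model yet, so take the first candidates in the pool.
            batch = unmeasured[:batch_size]
        else:
            # Rank unmeasured candidates by predicted value, best first.
            ranked = sorted(
                unmeasured, key=lambda c: predict(c, measured), reverse=True
            )
            batch = ranked[:batch_size]
        for c in batch:
            measured[c] = lab_assay(c)
    return measured


rng = random.Random(0)
candidates = [rng.random() for _ in range(1000)]
measured = active_learning(candidates, batch_size=20, n_rounds=3)
best = max(measured, key=measured.get)
print(f"Measured {len(measured)} of {len(candidates)} candidates; best x = {best:.3f}")
```

Only a small fraction of the pool is ever "measured"; the surrogate model decides where the remaining lab budget is spent.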

---
## 1. Setup

In [None]:
%pip install -U -r requirements.txt

In [None]:
import biotite
from biotite.structure.io import pdb
from biotite.database import rcsb
import helpers
import numpy as np
import py3Dmol

import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)

LAB_SUBMISSION_SIZE = 1000

---
## 2. View Nanobody Sequence and Structure

First, we'll download the structure of an existing nanobody drug, caplacizumab, bound to its target. We'll use the nanobody as our template for new designs.

In [None]:
pdb_id = "7eow"
stack = pdb.get_structure(
    pdb.PDBFile.read(rcsb.fetch(pdb_id, "pdb"))
)
vf_factor = helpers.clean_structure(stack[0][stack.chain_id == "A"])
caplacizumab = helpers.clean_structure(stack[0][stack.chain_id == "B"])

caplacizumab_seq = biotite.structure.to_sequence(caplacizumab)[0][0]

cdr1 = list(range(25, 32))
cdr2 = list(range(51, 57))
cdr3 = list(range(98, 117))
cdrs = cdr1 + cdr2 + cdr3
cdrs_1_base = [i + 1 for i in cdrs]

preserved_regions = [
    (0, cdr1[0]),
    (cdr1[-1] + 1, cdr2[0]),
    (cdr2[-1] + 1, cdr3[0]),
    (cdr3[-1] + 1, len(caplacizumab_seq)),
]

print(caplacizumab_seq)
print(helpers.format_cdrs(caplacizumab_seq, cdrs))
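
The `preserved_regions` list in the cell above is the complement of the three CDR ranges, written as half-open `(start, stop)` intervals with 0-based indices. As a sanity check, the same intervals can be derived programmatically. The sketch below is an illustration only: `preserved_from_cdrs` is a hypothetical helper, not part of the `helpers` module, and the sequence length of 128 is an assumed placeholder rather than the real length of the caplacizumab chain.

```python
def preserved_from_cdrs(cdr_ranges, seq_len):
    """Return half-open (start, stop) intervals covering everything outside the CDRs."""
    regions = []
    cursor = 0
    for cdr in sorted(cdr_ranges, key=lambda r: r.start):
        regions.append((cursor, cdr.start))  # framework stretch before this CDR
        cursor = cdr.stop  # resume just after the CDR
    regions.append((cursor, seq_len))  # framework stretch after the last CDR
    return regions


cdr_ranges = [range(25, 32), range(51, 57), range(98, 117)]
seq_len = 128  # assumed placeholder length, for illustration only
regions = preserved_from_cdrs(cdr_ranges, seq_len)
print(regions)

# Every position should belong to exactly one CDR or one preserved region.
covered = sorted(
    [i for r in cdr_ranges for i in r]
    + [i for start, stop in regions for i in range(start, stop)]
)
assert covered == list(range(seq_len))
```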

In [None]:
view = py3Dmol.view(width=600, height=600)
view.addModel(helpers.to_pdb_string(vf_factor))
view.addModel(helpers.to_pdb_string(caplacizumab))
view.setStyle({"chain": "A"}, {"cartoon": {"color": "orange", "opacity": 0.6}})
view.setStyle({"chain": "B"}, {"cartoon": {"color": "blue", "opacity": 0.6}})
view.addStyle(
    {"chain": "B", "resi": cdrs_1_base},
    {"cartoon": {"color": "#57C4F8", "opacity": 1.0}},
)
view.zoomTo()
view.show()

---
## 3. Generate Sequence Variants

![Generate sequence variants using directed evolution](img/gen.png)

Next, we'll generate random mutations in the CDRs of the caplacizumab sequence. We're not using any scientific logic to guide us here; the mutations are completely random!
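
To make "completely random" concrete, here is a minimal sketch of one way to mutate only the positions outside a set of preserved regions. It is an assumption about the general approach, not the actual implementation of `helpers.random_mutation`; the function name `mutate_outside_preserved` and the toy wild-type sequence are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate_outside_preserved(seq, preserved_regions, max_mutations, rng):
    """Substitute between 1 and max_mutations residues outside the preserved regions."""
    preserved = set()
    for start, stop in preserved_regions:  # half-open, 0-based intervals
        preserved.update(range(start, stop))
    mutable = [i for i in range(len(seq)) if i not in preserved]
    if not mutable:
        raise ValueError("No mutable positions outside the preserved regions")

    n_mutations = rng.randint(1, min(max_mutations, len(mutable)))
    positions = sorted(rng.sample(mutable, n_mutations))

    residues = list(seq)
    for pos in positions:
        # Always pick a residue that differs from the wild type at this position.
        choices = [aa for aa in AMINO_ACIDS if aa != seq[pos]]
        residues[pos] = rng.choice(choices)
    return "".join(residues), positions


rng = random.Random(0)
wild_type = "A" * 20  # toy sequence, for illustration only
preserved = [(0, 5), (10, 20)]  # positions 5-9 play the role of a CDR
variant, positions = mutate_outside_preserved(wild_type, preserved, 3, rng)
print(wild_type)
print(variant, positions)
```

Sampling positions without replacement guarantees that the number of substitutions equals the number of positions reported.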

In [None]:
generated_seqs = helpers.random_mutation(
    wt_protein=str(caplacizumab_seq),
    n_output_seqs=100000,
    preserved_regions=preserved_regions,
    max_mutations=10,
)
generated_seqs["lab_result"] = np.nan  # np.NaN was removed in NumPy 2.0
print(f"Generated {len(generated_seqs)} sequences")
n_preview = 10
print(caplacizumab_seq)
for i in generated_seqs[:n_preview].itertuples():
    print(helpers.format_cdrs(i.seq, i.mutation, mask=True))

---
## 4. Select Samples

![Identify candidates for lab testing using a selection model](img/select.png)

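From our pool of randomly generated candidates, we now select a batch to submit to the lab. No scoring model exists yet, so for this first round we simply take the first `LAB_SUBMISSION_SIZE` sequences that have no lab result.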

In [None]:
no_lab_data = generated_seqs[generated_seqs["lab_result"].isnull()]
selected_seqs = no_lab_data[["seq"]][:LAB_SUBMISSION_SIZE]
selected_seqs

---
## 5. Submit to Lab

![Submit selected samples for experimental testing](img/lab.png)

In [None]:
lab_results = helpers.submit_seqs_to_lab(selected_seqs["seq"])

for result in lab_results.itertuples():
    generated_seqs.loc[[result.Index], ["lab_result"]] = result.result

display(generated_seqs)

---
## 6. Fine-Tune Scoring Model

![Improve the scoring model using experimental results](img/ft.png)

In [None]:
scoring_model = helpers.train_scoring_model(
    lab_results,
    sequence_column="seq",
    results_column="result",
    epochs=1,
    model_name_or_path="facebook/esm2_t30_150M_UR50D",
)

---
## 7. Score Samples with Regression Model

![Predict high-performing variants using a scoring model](img/score.png)

In [None]:
predictions = helpers.run_scoring_model(scoring_model, generated_seqs)
generated_seqs["prediction"] = predictions
generated_seqs.sort_values(by="prediction", ascending=False)

---
## 8. Your Turn

In [None]:
N_GENERATED_SEQS = 100000
MAX_MUTATIONS = 10
N_REPS = 3
LAB_SUBMISSION_SIZE = 1000

In [None]:
print(f"Generating {N_GENERATED_SEQS} sequences")
generated_seqs = helpers.random_mutation(
    wt_protein=str(caplacizumab_seq),
    n_output_seqs=N_GENERATED_SEQS,
    preserved_regions=preserved_regions,
    max_mutations=MAX_MUTATIONS,
)
generated_seqs["lab_result"] = np.nan  # np.NaN was removed in NumPy 2.0

for rep in range(N_REPS):
    print("#" * 25 + "\n" + f"Starting rep {rep+1} of {N_REPS}" + "\n" + "#" * 25)

    print(f"Select a batch of {LAB_SUBMISSION_SIZE} samples without lab data")
    no_lab_data = generated_seqs[generated_seqs["lab_result"].isnull()]
    if rep == 0:
        selected_seqs = no_lab_data["seq"][:LAB_SUBMISSION_SIZE]
    else:
        print("Selecting samples with highest predicted values")
        selected_seqs = no_lab_data.sort_values(by="last_prediction", ascending=False)[
            "seq"
        ][:LAB_SUBMISSION_SIZE]

    print("Submitting samples for lab analysis")
    lab_results = helpers.submit_seqs_to_lab(selected_seqs, delay=0.01)
    for result in lab_results.itertuples():
        generated_seqs.loc[[result.Index], ["lab_result"]] = result.result

    if rep+1 == N_REPS:
        break

    print("Training scoring model on lab results")
    scoring_model = helpers.train_scoring_model(
        lab_results,
        sequence_column="seq",
        results_column="result",
        epochs=1,
        model_name_or_path="output",
    )

    print(f"Using model to score all {N_GENERATED_SEQS} sequences")
    predictions = helpers.run_scoring_model(scoring_model, generated_seqs)
    generated_seqs["last_prediction"] = predictions

    display(generated_seqs)
    # Show how many rows now have lab results and predictions
    display(generated_seqs.count())

print("Best candidates")
best = generated_seqs.sort_values(by="lab_result", ascending=False)[:5]
display(best)

for i in best.itertuples():
    print(
        str(round(i.lab_result, 5)).rjust(5)
        + "\t"
        + helpers.format_cdrs(i.seq, i.mutation, mask=True)
    )
print(
    str(0.25609).rjust(5)
    + "\t"
    + helpers.format_cdrs(str(caplacizumab_seq), cdrs, mask=False)
)