# II. Building Scalable Drug Discovery Applications: Directed Evolution

Note: This notebook was last tested in an Amazon SageMaker Studio JupyterLab space on an ml.g4dn.xlarge instance.

In notebook one, we started by randomly generating tens of thousands of mutants and then filtered that list down using active learning. In this notebook, we'll try a different approach. We'll still begin by randomly generating sequence variants. However, once we've collected the first round of lab data, we'll use the best-performing mutants as the starting point for an additional round of mutation. In this way, we hope to shift the sequence distribution towards our desired property.

---
## 1. Setup

In [None]:
%pip install -U -r requirements.txt
%pip install evo-prot-grad --no-deps

In [None]:

import helpers
import pandas as pd
import random
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)

SEQ_GENERATION_SIZE = 5000
LAB_SUBMISSION_SIZE = 1000

caplacizumab_seq = "EVQLVESGGGLVQPGGSLRLSCAASGRTFSYNPMGWFRQAPGKGRELVAAISRTGGSTYYPDSVEGRFTISRDNAKRMVYLQMNSLRAEDTAVYYCAAAGVRAEDGRVRTLPSEYTFWGQGTQVTVSS"
cdrs = list(range(25, 32)) + list(range(51, 57)) + list(range(98, 117))
preserved_regions = [(0, 25), (32, 51), (57, 98), (117, 128)]
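
The two index lists above are easy to get out of sync when edited by hand. As a minimal sketch, the check below confirms that the CDR positions and the preserved regions together cover every residue exactly once. It assumes that `preserved_regions` holds half-open `(start, end)` intervals in the style of `range()`; how `helpers` actually interprets them is not shown in this notebook.

```python
# Sanity check: CDR positions and preserved regions should tile the sequence.
# Assumption: preserved_regions holds half-open (start, end) intervals, like range().
SEQ_LEN = 128  # length of caplacizumab_seq above

cdrs = list(range(25, 32)) + list(range(51, 57)) + list(range(98, 117))
preserved_regions = [(0, 25), (32, 51), (57, 98), (117, 128)]

preserved = [i for start, end in preserved_regions for i in range(start, end)]
covered = sorted(cdrs + preserved)

# Every position appears exactly once, so the two sets partition the sequence.
is_partition = covered == list(range(SEQ_LEN))
print(f"CDR positions: {len(cdrs)}, preserved positions: {len(preserved)}")
print(f"Regions partition the sequence: {is_partition}")
```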

---
## 2. Generate Sequence Variants

As in notebook one, we start by randomly generating some mutations of the caplacizumab CDR sequences. However, in this case, we'll create a much smaller number.

In [None]:
generated_seqs = helpers.random_mutation(
    wt_protein=str(caplacizumab_seq),
    n_output_seqs=SEQ_GENERATION_SIZE,
    preserved_regions=preserved_regions,
    max_mutations=10,
    annotate_hist=False,
)
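
To make the idea concrete, here is a minimal sketch of random substitution restricted to the non-preserved positions. The function `mutate_outside_preserved` and the toy sequence are hypothetical and are not the implementation of `helpers.random_mutation`, which may differ (for example, in how it counts or records mutations).

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutate_outside_preserved(wt, preserved_regions, max_mutations, rng):
    """Hypothetical helper: substitute up to max_mutations residues at
    positions that fall outside the half-open preserved intervals."""
    preserved = {i for start, end in preserved_regions for i in range(start, end)}
    mutable = [i for i in range(len(wt)) if i not in preserved]
    n_mutations = rng.randint(1, min(max_mutations, len(mutable)))
    seq = list(wt)
    for pos in rng.sample(mutable, n_mutations):
        # Choose a residue that differs from the wild type at this position.
        seq[pos] = rng.choice([aa for aa in AMINO_ACIDS if aa != wt[pos]])
    return "".join(seq)

rng = random.Random(0)
toy_wt = "ACDEFGHIKLMNPQRSTVWY"      # toy 20-residue "wild type"
toy_preserved = [(0, 5), (10, 20)]   # only positions 5-9 may change
variants = [mutate_outside_preserved(toy_wt, toy_preserved, 3, rng) for _ in range(5)]
for v in variants:
    print(v)
```

Each printed variant keeps the preserved residues of the toy wild type and differs from it at one to three of the mutable positions.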

---
## 3. Score Mutants with Protein Language Model

Once again, we calculate the pseudo-log likelihood ratio for each mutant relative to caplacizumab and remove the mutants that the protein language model considers less evolutionarily likely than the wild type.

In [None]:
wt_seq = str(caplacizumab_seq)
pllrs = helpers.compute_pseudo_log_likelihood_ratio(wt_seq, generated_seqs["seq"])
print(f"Records before filtering: {len(generated_seqs)}")
generated_seqs["pplr"] = pllrs
generated_seqs = generated_seqs[generated_seqs["pplr"] > 1]
print(f"Records after filtering: {len(generated_seqs)}")
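
For intuition, a pseudo-log likelihood is usually computed by masking one position at a time and summing the log-probability the model assigns to the true residue. The sketch below illustrates that idea with a toy frequency-based scorer, `toy_log_prob`, standing in for a masked protein language model. Both function names are hypothetical, and the way `helpers.compute_pseudo_log_likelihood_ratio` combines the mutant and wild-type scores (and why the cutoff above is 1) is not shown here.

```python
import math

def pseudo_log_likelihood(seq, log_prob_fn):
    """Sum over positions of log p(residue | sequence with that position masked)."""
    total = 0.0
    for i, residue in enumerate(seq):
        masked = seq[:i] + "?" + seq[i + 1:]  # "?" marks the masked position
        total += log_prob_fn(masked, i, residue)
    return total

def toy_log_prob(masked_seq, position, residue):
    """Stand-in for a masked language model: add-one-smoothed residue
    frequencies in the unmasked context (20 amino acids)."""
    context = masked_seq.replace("?", "")
    count = context.count(residue)
    return math.log((count + 1) / (len(context) + 20))

wt = "AAAGAAAG"
mutant = "AAAGAAWG"
pll_wt = pseudo_log_likelihood(wt, toy_log_prob)
pll_mut = pseudo_log_likelihood(mutant, toy_log_prob)
print(f"PLL wild type: {pll_wt:.3f}")
print(f"PLL mutant:    {pll_mut:.3f}")
print(f"Difference (mutant - wild type): {pll_mut - pll_wt:.3f}")
```

Under this toy scorer, the mutant with the unusual `W` residue receives a lower pseudo-log likelihood than the wild type, which is the kind of signal the filter relies on.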

---
## 4. Select Samples

Let's select a few samples and use them to train our model. We don't know much about these mutants yet, so we'll start by picking a few at random.

In [None]:
no_lab_data = generated_seqs[generated_seqs["lab_result"].isnull()]
selected_seqs = no_lab_data.sample(n=LAB_SUBMISSION_SIZE)
selected_seqs

---
## 5. Submit to Lab

Next we'll submit them to the lab to test for "Factor X".

In [None]:
lab_results = helpers.submit_seqs_to_lab(selected_seqs["seq"])

for result in lab_results.itertuples():
    generated_seqs.loc[[result.Index], ["lab_result"]] = result.result

display(lab_results.sort_values(by="result", ascending=False))
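
The row-by-row loop above can also be written as a single index-aligned assignment. This is a sketch on toy data with made-up values; it assumes, as the loop's use of `result.Index` suggests, that the lab results keep the index labels of the submitted rows.

```python
import pandas as pd

# Toy stand-ins for generated_seqs and lab_results (values are made up).
candidates = pd.DataFrame(
    {"seq": ["AAA", "AAC", "AAD", "AAE"], "lab_result": [float("nan")] * 4}
)
lab = pd.DataFrame({"seq": ["AAC", "AAE"], "result": [0.42, 0.17]}, index=[1, 3])

# Index-aligned assignment: rows 1 and 3 are filled, the rest stay NaN.
candidates.loc[lab.index, "lab_result"] = lab["result"]
print(candidates)
```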

---
## 6. Fine-Tune Property Prediction Model

Once again, we fine-tune our protein language model on the lab data.

In [None]:
scoring_model = helpers.train_scoring_model(
    lab_results,
    sequence_column="seq",
    results_column="result",
    epochs=1,
    model_name_or_path="facebook/esm2_t12_35M_UR50D",
)

---
## 7. Generate New Mutations Based on Top-Scoring Sequences from Round 1

In [None]:
top_5 = lab_results.sort_values(by="result", ascending=False)[:5]
print("Parents:")
for parent in top_5["seq"]:
    print(helpers.format_cdrs(parent, cdrs))

print("Children:")
generated_seqs = helpers.uniform_crossover(top_5["seq"], 384)
for seq in random.sample(generated_seqs, 10):
    print(helpers.format_cdrs(seq, cdrs))
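
As a rough illustration of what uniform crossover does, the sketch below builds each child by copying every position from a parent chosen uniformly at random. `uniform_crossover_sketch` is a hypothetical stand-in; `helpers.uniform_crossover` may work differently (for example, by recombining pairs of parents or removing duplicates).

```python
import random

def uniform_crossover_sketch(parents, n_children, rng):
    """Hypothetical stand-in: each child position is copied from a parent
    chosen uniformly at random. Parents must have equal length."""
    parents = list(parents)
    length = len(parents[0])
    if any(len(p) != length for p in parents):
        raise ValueError("all parents must have the same length")
    return [
        "".join(rng.choice(parents)[i] for i in range(length))
        for _ in range(n_children)
    ]

rng = random.Random(0)
toy_parents = ["AAAAAA", "CCCCCC", "DDDDDD"]
children = uniform_crossover_sketch(toy_parents, 4, rng)
for child in children:
    print(child)
```

Because the parents here differ at every position, each child is a visible mosaic of `A`, `C`, and `D`. With real top-scoring mutants, the parents share the preserved regions, so only the CDR positions are shuffled.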

---
## 8. Score Samples with Property Prediction Model

Next, we use our newly trained model to predict the value of "Factor X" for each of the new sequences. These predictions help us prioritize which mutants to send for the next round of lab testing.

In [None]:
generated_seqs = pd.DataFrame.from_dict({"seq": generated_seqs})
predictions = helpers.run_scoring_model(generated_seqs, batch_size=1024)
generated_seqs["last_prediction"] = predictions
display(generated_seqs.sort_values(by="last_prediction", ascending=False)[:5])

---
## 9. Put it all together!

Now let's try a few cycles. Feel free to modify the parameters below as you see fit. Remember that our goal is to find the mutant with the largest value of "Factor X", not necessarily to end up with the most accurate model.

In [None]:
SEQ_GENERATION_SIZE = 50000
MAX_MUTATIONS = 10
N_REPS = 3
LAB_SUBMISSION_SIZE = 384
MODEL_ID = "facebook/esm2_t12_35M_UR50D"

In [None]:
#################################
# Generate
#################################
# print(f"Generating {SEQ_GENERATION_SIZE} random variants")
# generated_seqs = helpers.random_mutation(
#     wt_protein=str(caplacizumab_seq),
#     n_output_seqs=SEQ_GENERATION_SIZE,
#     preserved_regions=preserved_regions,
#     max_mutations=MAX_MUTATIONS,
# )

#################################
# Filter by pseudo-log likelihood ratio
#################################
# print("Filtering by pseudo-log likelihood ratio")
# generated_seqs['pplr'] = helpers.compute_pseudo_log_likelihood_ratio(str(caplacizumab_seq), generated_seqs['seq'])
# generated_seqs = generated_seqs[generated_seqs['pplr'] > 1]

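# The crossover DataFrame built in section 8 has no "lab_result" column yet,
# so add an empty one before filtering on it.
if "lab_result" not in generated_seqs.columns:
    generated_seqs["lab_result"] = float("nan")

for rep in range(N_REPS):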

    #################################
    # Select Samples for Lab Analysis
    #################################

    print("#" * 79 + "\n" + f"Starting rep {rep+1} of {N_REPS}" + "\n" + "#" * 79)
    no_lab_data = generated_seqs[generated_seqs["lab_result"].isnull()]
    print(
        f"{rep+1}: Selecting batch of the highest-scoring {LAB_SUBMISSION_SIZE} samples without lab data"
    )
    no_lab_data = no_lab_data.sort_values(by="last_prediction", ascending=False)
    # Take the top rows of the sorted frame; .sample() would discard the ranking.
    selected_seqs = no_lab_data.head(LAB_SUBMISSION_SIZE)

    #################################
    # Submit to Lab
    #################################

    print(f"{rep+1}: Submitting samples for lab analysis")
    lab_results = helpers.submit_seqs_to_lab(
        selected_seqs["seq"], delay=0.01, intro=True
    )

    for result in lab_results.itertuples():
        generated_seqs.loc[[result.Index], ["lab_result"]] = result.result

    # Skip the fine-tuning and scoring during the last round
    if rep + 1 < N_REPS:

        #################################
        # Fine-Tune on Lab Results
        #################################

        print(f"\n{rep+1}: Training scoring model on lab results")
        scoring_model = helpers.train_scoring_model(
            lab_results,
            sequence_column="seq",
            results_column="result",
            epochs=1,
            # model_name_or_path=MODEL_ID if rep == 0 else "output",
            model_name_or_path="output",
        )

        #################################
        # Score all Generated Samples
        #################################

        print(f"\n{rep+1}: Using model to score all {len(generated_seqs)} sequences")
        predictions = helpers.run_scoring_model(generated_seqs, batch_size=1024)
        generated_seqs["last_prediction"] = predictions

    print(f"\n{rep+1}: Top-5 candidates so far")
    top_5 = generated_seqs.sort_values(by="lab_result", ascending=False)[:5]
    display(top_5)

In [None]:
print(
    str(0.39971).rjust(5)
    + "\t"
    + helpers.format_cdrs(str(caplacizumab_seq), cdrs, mask=False)
)
for i in top_5.itertuples():
    print(
        str(round(i.lab_result, 5)).rjust(5)
        + "\t"
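        # Use the CDR positions, as for the wild type above; the crossover
        # DataFrame from section 8 has no "mutation" column.
        + helpers.format_cdrs(i.seq, cdrs, mask=True)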
    )

---
## 10. Your turn

Feel free to keep experimenting. Some things to try:

- Fine-tune one of the larger pre-trained models, like `esm2_t30_150M_UR50D` or `esm2_t33_650M_UR50D`
- Randomly generate more or fewer sequences
- Increase the number of possible mutations (as high as 30)
- Try generating a batch of variants, identifying the highest-scoring ones, then generating more variants from those! This method, called "directed evolution", can be a powerful technique for rapidly producing sequences with a desired property.
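
The last suggestion can be sketched as a self-contained toy loop. Everything here is made up for illustration: `TARGET` and `toy_fitness` stand in for the lab assay, and the loop measures every sequence directly instead of using a fine-tuned model to prioritize candidates as the notebook does.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TARGET = "MKTAYIAKQR"  # made-up optimum for the toy fitness function

def toy_fitness(seq):
    """Stand-in for the lab assay: fraction of positions matching TARGET."""
    return sum(a == b for a, b in zip(seq, TARGET)) / len(TARGET)

def mutate(seq, rng):
    """Substitute one randomly chosen position with a random residue."""
    pos = rng.randrange(len(seq))
    return seq[:pos] + rng.choice(AMINO_ACIDS) + seq[pos + 1:]

def crossover(parents, rng):
    """Uniform crossover: each position comes from a random parent."""
    return "".join(rng.choice(parents)[i] for i in range(len(parents[0])))

rng = random.Random(0)
start = "A" * len(TARGET)
population = [mutate(start, rng) for _ in range(200)]
best_per_round = []

for round_number in range(10):
    # "Lab" step: measure everything, keep the best five as parents.
    ranked = sorted(population, key=toy_fitness, reverse=True)
    parents = ranked[:5]
    best_per_round.append(toy_fitness(parents[0]))
    # Next generation: keep the parents, add recombined and mutated children.
    children = [mutate(crossover(parents, rng), rng) for _ in range(195)]
    population = parents + children

print("Best fitness per round:", best_per_round)
```

Because the parents are carried over into each new population, the best fitness can never decrease from one round to the next. Dropping that step makes the search noisier, which is one of the trade-offs worth experimenting with.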
