# III. Building Scalable Drug Discovery Applications: ML-Guided Directed Evolution

Note: This notebook was last tested on an Amazon SageMaker Studio JupyterLab space on an ml.g4dn.xlarge instance.

In notebook two, we used directed evolution to generate successive rounds of candidates with ever-improving properties. However, we still had to do a lot of filtering after the fact. Wouldn't it be great if we could take the predicted properties into account during the generation step itself?

[EvoProtGrad](https://github.com/NREL/EvoProtGrad) is a framework developed by the National Renewable Energy Laboratory that uses multiple ML models in a "plug and play" fashion to intelligently guide directed evolution campaigns. In this notebook, we'll explore how to use it in our design effort.
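
To make the "plug and play" idea concrete, here is a minimal toy sketch of how several experts can be combined into one score that steers which mutations are kept. It is not EvoProtGrad's implementation: EvoProtGrad uses gradient-based discrete MCMC, while this sketch swaps in plain random single-site proposals with a Metropolis acceptance rule. The two experts (`hydrophobic_expert`, `charge_expert`), the `weights` argument, and the `frozen` argument are hypothetical stand-ins introduced only for illustration.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


# Hypothetical toy experts: each maps a sequence to a log-score (higher is better).
def hydrophobic_expert(seq: str) -> float:
    """Reward hydrophobic residues (toy stand-in for a learned model)."""
    return sum(seq.count(aa) for aa in "AILMFVW") / len(seq)


def charge_expert(seq: str) -> float:
    """Penalize a net charge far from zero (toy stand-in for a learned model)."""
    net = sum(seq.count(aa) for aa in "KR") - sum(seq.count(aa) for aa in "DE")
    return -abs(net) / len(seq)


def combined_score(seq, experts, weights):
    """Product of experts: multiplying probabilities means summing log-scores."""
    return sum(w * expert(seq) for expert, w in zip(experts, weights))


def toy_directed_evolution(wt, experts, weights, n_steps=200, frozen=(), seed=0):
    """Random single-site Metropolis sampler; positions in `frozen` never mutate."""
    rng = random.Random(seed)
    frozen = set(frozen)
    mutable = [i for i in range(len(wt)) if i not in frozen]
    current, current_score = wt, combined_score(wt, experts, weights)
    best, best_score = current, current_score
    for _ in range(n_steps):
        pos = rng.choice(mutable)
        proposal = current[:pos] + rng.choice(AMINO_ACIDS) + current[pos + 1:]
        score = combined_score(proposal, experts, weights)
        # Always accept improvements; accept worse proposals with prob. exp(delta).
        if score >= current_score or rng.random() < math.exp(score - current_score):
            current, current_score = proposal, score
            if score > best_score:
                best, best_score = proposal, score
    return best, best_score


toy_wt = "EVQLVESGGGLVQPGG"  # first 16 residues of the sequence used below
best, best_score = toy_directed_evolution(
    toy_wt,
    experts=[hydrophobic_expert, charge_expert],
    weights=[1.0, 1.0],
    frozen=range(0, 4),  # keep the first four residues fixed
)
print(best, round(best_score, 3))
```

Adding or removing an entry in `experts` changes what the sampler optimizes without touching the sampling loop, which is the property the "plug and play" description refers to.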

In [None]:
%pip install EvoProtGrad/

In [None]:
import helpers
import pandas as pd
import warnings
import evo_prot_grad
from transformers import AutoModelForSequenceClassification, AutoTokenizer

warnings.simplefilter(action="ignore", category=FutureWarning)

SEQ_GENERATION_SIZE = 1000
LAB_SUBMISSION_SIZE = 1000

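caplacizumab_seq = "EVQLVESGGGLVQPGGSLRLSCAASGRTFSYNPMGWFRQAPGKGRELVAAISRTGGSTYYPDSVEGRFTISRDNAKRMVYLQMNSLRAEDTAVYYCAAAGVRAEDGRVRTLPSEYTFWGQGTQVTVSS"
# 0-based residue positions of the three CDR loops (the positions we allow to mutate)
cdrs = list(range(25, 32)) + list(range(51, 57)) + list(range(98, 117))
# Framework regions around the CDRs; passed to EvoProtGrad below so they stay unmutated
preserved_regions = [(0, 25), (32, 51), (57, 98), (117, 128)]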
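
Because the CDR positions and the preserved regions are written by hand, a quick consistency check can catch off-by-one mistakes before a long sampling run. The sketch below assumes each `(start, end)` tuple is half-open (end exclusive), which is the reading under which the values above tile the sequence exactly; whether EvoProtGrad itself treats `end` as inclusive or exclusive is not stated here and is worth confirming in its documentation. The names `cdr_ranges` and `expand_half_open` are introduced only for this illustration.

```python
seq = (
    "EVQLVESGGGLVQPGGSLRLSCAASGRTFSYNPMGWFRQAPGKGRELVAAISRTGGSTYYPDSVEG"
    "RFTISRDNAKRMVYLQMNSLRAEDTAVYYCAAAGVRAEDGRVRTLPSEYTFWGQGTQVTVSS"
)
cdr_ranges = [(25, 32), (51, 57), (98, 117)]  # same ranges as `cdrs` above
preserved_regions = [(0, 25), (32, 51), (57, 98), (117, 128)]


def expand_half_open(regions):
    """Expand (start, end) tuples into positions, ASSUMING end is exclusive."""
    return [i for start, end in regions for i in range(start, end)]


cdr_positions = expand_half_open(cdr_ranges)
preserved_positions = expand_half_open(preserved_regions)

# No position should be both a CDR position and a preserved position.
overlap = set(cdr_positions) & set(preserved_positions)
# Together they should cover every residue exactly once.
covered = sorted(cdr_positions + preserved_positions)

print("overlap:", sorted(overlap))
print("full coverage:", covered == list(range(len(seq))))
for start, end in cdr_ranges:
    print(f"CDR at [{start}, {end}):", seq[start:end])
```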

In [None]:
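# Expert 1: the "amplify" expert, scored with the "mutant_marginal" strategy.
amplify_expert = evo_prot_grad.get_expert(
    "amplify", "mutant_marginal", temperature=1.0, device="cuda:0"
)

# Expert 2: a regression model and tokenizer loaded from the local "output"
# directory, scored with the "attribute_value" strategy.
regression_expert = evo_prot_grad.get_expert(
    "esm_downstream_regression",
    "attribute_value",
    temperature=1.0,
    model=AutoModelForSequenceClassification.from_pretrained(
        "output", trust_remote_code=True
    ),
    tokenizer=AutoTokenizer.from_pretrained("output", trust_remote_code=True),
    device="cuda:0",
)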

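# Build the sampler and run it; the trailing () starts the run.
variants, scores = evo_prot_grad.DirectedEvolution(
    wt_protein=caplacizumab_seq,
    output="all",  # return all variants, not only the best or last
    experts=[amplify_expert, regression_expert],  # experts to compose
    parallel_chains=4,  # number of parallel chains
    n_steps=20,  # MCMC steps per chain
    max_mutations=15,  # maximum number of mutations per variant
    verbose=True,
    preserved_regions=preserved_regions,  # framework regions stay fixed
)()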

In [None]:
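# Flatten the nested variants into one row per sequence, strip the spaces
# between residues, and rank by score (highest first).
generated_seqs = pd.DataFrame.from_dict(
    {
        "seq": [x.replace(" ", "") for xs in variants for x in xs],
        "scores": scores.flatten(),
    }
).sort_values(by="scores", ascending=False)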
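
Since the sampler returns every variant from every chain and step, the table can contain repeated sequences, and it can be useful to see how far each variant has drifted from the wild type. The following sketch runs on toy data rather than the real output, and assumes variants are substitution-only and the same length as the wild type. `count_mutations`, `toy_wt`, and `toy_seqs` are hypothetical names used only here.

```python
import pandas as pd


def count_mutations(wt: str, variant: str) -> int:
    """Count substituted positions (assumes equal length, substitutions only)."""
    if len(wt) != len(variant):
        raise ValueError("variant and wild type must have the same length")
    return sum(a != b for a, b in zip(wt, variant))


toy_wt = "EVQLVESGGG"  # toy wild type, not the full caplacizumab sequence
toy_seqs = pd.DataFrame(
    {
        "seq": ["EVQLVESGGG", "EVQLAESGGG", "EVQLAESGGG", "KVQLAESGGA"],
        "scores": [0.0, 1.2, 1.2, 0.7],
    }
)

summary = (
    toy_seqs.drop_duplicates(subset="seq")  # keep one row per unique sequence
    .assign(n_mutations=lambda df: df["seq"].map(lambda s: count_mutations(toy_wt, s)))
    .sort_values(by="scores", ascending=False)
)
print(summary)
```

On the real data, the same pattern would use `caplacizumab_seq` as the wild type and `generated_seqs` as the input table.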

In [None]:
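# Submit the generated sequences for lab testing.
lab_results = helpers.submit_seqs_to_lab(generated_seqs["seq"], delay=0)

# Copy each lab result onto the row of generated_seqs with the same index label.
for result in lab_results.itertuples():
    generated_seqs.loc[[result.Index], ["lab_result"]] = result.result

# Show the lab results, best first.
display(lab_results.sort_values(by="result", ascending=False))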
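
The row-by-row loop above can usually be written as a single index-aligned assignment, because pandas matches a Series to a DataFrame by index label rather than by row position. The sketch below demonstrates this on toy data. It assumes the lab output is a DataFrame with a `result` column that shares index labels with the generated sequences, which the loop above implies but which should be checked against `helpers.submit_seqs_to_lab`. The frames `generated` and `lab` are hypothetical.

```python
import pandas as pd

# Toy stand-in for generated_seqs: sorted by score, so the index is out of order.
generated = pd.DataFrame(
    {"seq": ["AAA", "AAC", "AAD"], "scores": [0.9, 0.5, 0.1]},
    index=[2, 0, 1],
)
# Hypothetical lab output, indexed by the same labels but in a different order.
lab = pd.DataFrame({"result": [0.3, 0.8, 0.1]}, index=[0, 1, 2])

# Series assignment aligns on index labels, not on row position.
generated["lab_result"] = lab["result"]
print(generated)
```

Rows whose label is missing from the lab output would receive `NaN`, which makes unsubmitted sequences easy to spot.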