## ESM-2 ##

ESM-2 is a state-of-the-art protein model trained on a masked language modelling objective. It is suitable for fine-tuning on a wide range of tasks that take protein sequences as input. For detailed information on the model architecture and training data, please refer to the accompanying paper. You may also be interested in some demo notebooks (PyTorch, TensorFlow) which demonstrate how to fine-tune ESM-2 models on your tasks of interest.

Several ESM-2 checkpoints of varying sizes are available on the Hugging Face Hub.

Larger checkpoints generally achieve somewhat better accuracy, but require much more memory and time to train.

This notebook uses the 30-layer, 150M-parameter checkpoint. Its weights are available here:

https://huggingface.co/facebook/esm2_t30_150M_UR50D
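
The trade-off between checkpoint size and resources can be made concrete with a small helper. The sketch below lists the public ESM-2 checkpoint names and parses the layer count and approximate parameter count encoded in each name. The functions `parse_checkpoint` and `largest_checkpoint_within` are hypothetical helpers written for this illustration; they are not part of transformers or EvoProtGrad.

```python
import re

# Public ESM-2 checkpoint names on the Hugging Face Hub (facebook organisation).
ESM2_CHECKPOINTS = [
    "facebook/esm2_t6_8M_UR50D",
    "facebook/esm2_t12_35M_UR50D",
    "facebook/esm2_t30_150M_UR50D",
    "facebook/esm2_t33_650M_UR50D",
    "facebook/esm2_t36_3B_UR50D",
    "facebook/esm2_t48_15B_UR50D",
]

_UNITS = {"M": 1e6, "B": 1e9}


def parse_checkpoint(name):
    """Return (num_layers, approx_num_parameters) parsed from a checkpoint name."""
    match = re.search(r"esm2_t(\d+)_(\d+)([MB])_", name)
    if match is None:
        raise ValueError(f"Not an ESM-2 checkpoint name: {name}")
    layers, size, unit = match.groups()
    return int(layers), int(int(size) * _UNITS[unit])


def largest_checkpoint_within(max_parameters):
    """Hypothetical helper: largest checkpoint that does not exceed a parameter budget."""
    candidates = [c for c in ESM2_CHECKPOINTS if parse_checkpoint(c)[1] <= max_parameters]
    if not candidates:
        raise ValueError("No checkpoint fits the given budget")
    return max(candidates, key=lambda c: parse_checkpoint(c)[1])


for checkpoint in ESM2_CHECKPOINTS:
    layers, params = parse_checkpoint(checkpoint)
    print(f"{checkpoint}: {layers} layers, ~{params:,} parameters")
```

The parameter counts are the rounded figures from the checkpoint names, so they are only a rough guide to memory requirements.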

In [2]:
import os
import numpy as np
import pandas as pd
import ipywidgets as widgets
from pathlib import Path

from matplotlib import pyplot as plt

# EvoProtGrad and Hugging Face imports
import evo_prot_grad
from transformers import AutoTokenizer, EsmForMaskedLM

# PyTorch
import torch

# Appearance of the Notebook
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
np.set_printoptions(linewidth=110)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 1000)

# Import this module with autoreload
%load_ext autoreload
%autoreload 2
import esm
from esm.evoprotgrad import EvoProtGrad
from esm.evoprotgrad import torch_device
print(f'Project module version: {esm.__version__}')
print(f'PyTorch version:        {torch.__version__}')

### Single Protein Evolution ###

The first method evolves a single protein sequence. The sequence is first written in FASTA format, a widely used text-based format for nucleotide or peptide sequences. Each record starts with a description line beginning with '>', followed by the sequence itself on the subsequent lines.
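
To make the FASTA layout concrete, here is a minimal sketch that writes and reads a single record using only the standard library. The helper names `to_fasta` and `parse_fasta` are illustrative assumptions, not functions from EvoProtGrad, and the parser handles only the simple layout described above.

```python
import textwrap


def to_fasta(header, sequence, line_width=60):
    """Format one sequence as a FASTA record: a '>' header line plus wrapped sequence lines."""
    body = "\n".join(textwrap.wrap(sequence, line_width))
    return f">{header}\n{body}\n"


def parse_fasta(text):
    """Parse FASTA text into a {header: sequence} dictionary."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines belonging to the current record are concatenated
            records[header] += line
    return records


record = to_fasta("Input_Sequence", "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKT", line_width=20)
print(record)
print(parse_fasta(record))
```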

The ESM-2 model and its tokenizer are then loaded as the expert for directed evolution. Because the model was pretrained with a masked language modelling objective on a large corpus of protein sequences, it can estimate how plausible each amino acid is at a given position given the rest of the sequence. The tokenizer converts protein sequences into the token IDs that the ESM-2 model takes as input.

Directed evolution is run with EvoProtGrad's DirectedEvolution class, with the ESM-2 model specified as the expert. The process runs one or more parallel Markov chain Monte Carlo (MCMC) chains. Each chain explores sequence space by proposing mutations at every step. EvoProtGrad scores each proposal with the expert model and preferentially accepts mutations that the expert rates as improvements.

https://huggingface.co/blog/AmelieSchreiber/directed-evolution-with-esm2
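
The propose-and-accept loop described above can be illustrated with a toy sampler. This is a plain Metropolis sampler with uniformly random single-site substitutions and a made-up scoring function (`toy_score`). It is not EvoProtGrad's algorithm, which uses gradients from the expert model to guide its proposals. The sketch only shows how a chain, a mutation budget, and an acceptance rule fit together.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(sequence):
    """Stand-in for an expert score: fraction of hydrophobic residues (illustrative only)."""
    return sum(aa in "AILMFVW" for aa in sequence) / len(sequence)


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def run_chain(wild_type, n_steps=20, max_mutations=10, temperature=0.1, seed=0):
    """Run one Metropolis chain and return the best sequence seen and its score."""
    rng = random.Random(seed)
    current, current_score = wild_type, toy_score(wild_type)
    best, best_score = current, current_score
    for _ in range(n_steps):
        # Propose a single random substitution
        pos = rng.randrange(len(current))
        proposal = current[:pos] + rng.choice(AMINO_ACIDS) + current[pos + 1:]
        if hamming(proposal, wild_type) > max_mutations:
            continue  # reject proposals that exceed the mutation budget
        proposal_score = toy_score(proposal)
        # Metropolis rule: always accept improvements, sometimes accept worse proposals
        delta = proposal_score - current_score
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = proposal, proposal_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score


wild_type = "MALWMRLLPLLALLALWGPDPAAA"
best, score = run_chain(wild_type, n_steps=50, max_mutations=3, seed=1)
print(best, score, hamming(best, wild_type))
```

Occasionally accepting a worse proposal lets the chain escape local optima, and the temperature controls how often that happens.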

In [3]:
def run_evo_prot_grad(raw_protein_sequence):
    # Convert raw protein sequence to the format expected by EvoProtGrad
    # Usually, protein sequences are handled in FASTA format, so we create a mock FASTA string
    fasta_format_sequence = f">Input_Sequence\n{raw_protein_sequence}"

    # Save the mock FASTA string to a temporary file
    temp_fasta_path = "temp_input_sequence.fasta"
    with open(temp_fasta_path, "w") as file:
        file.write(fasta_format_sequence)

    # Load the ESM-2 model and tokenizer as the expert
    esm2_expert = evo_prot_grad.get_expert(
        'esm',
        model=EsmForMaskedLM.from_pretrained("facebook/esm2_t30_150M_UR50D"),
        tokenizer=AutoTokenizer.from_pretrained("facebook/esm2_t30_150M_UR50D"),
        temperature=0.95,
        device='cuda' if torch.cuda.is_available() else 'cpu'  # fall back to CPU when no GPU is available
    )

    # Initialize Directed Evolution with the ESM-2 expert
    directed_evolution = evo_prot_grad.DirectedEvolution(
        wt_fasta=temp_fasta_path,    # path to the temporary FASTA file
        output='all',               # can be 'best', 'last', or 'all' variants
        experts=[esm2_expert],       # list of experts, in this case only ESM-2
        parallel_chains=1,           # number of parallel chains to run
        n_steps=20,                  # number of MCMC steps per chain
        max_mutations=10,            # maximum number of mutations per variant
        verbose=True                # print debug info
    )

    # Run the evolution process
    variants, scores = directed_evolution()

    # Process the results
    #for variant, score in zip(variants, scores):
    #    print(f"Variant: {variant}, Score: {score}")

    return variants, scores

In [4]:
# Get the device for the model
device_dict = torch_device()
display(device_dict)
torch.set_float32_matmul_precision(precision='high')

In [5]:
raw_protein_sequence = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"  # Replace with your protein sequence
variants, scores = run_evo_prot_grad(raw_protein_sequence)
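
Once the run finishes, the returned variants and scores usually need to be ranked. The sketch below makes several assumptions that should be checked against the installed EvoProtGrad version: that `variants` is either a flat list of strings or a list of per-step lists, that `scores` has the matching shape, that sequences may contain spaces between residues, and that a higher score is better. The helper `rank_variants` is hypothetical.

```python
import numpy as np


def rank_variants(variants, scores, top_k=5):
    """Flatten variants and scores and return the top_k (sequence, score) pairs."""
    flat_scores = np.asarray(scores, dtype=float).ravel()
    flat_variants = []
    for item in variants:
        # Accept either a flat list of strings or a list of per-step lists
        group = [item] if isinstance(item, str) else list(item)
        flat_variants.extend(seq.replace(" ", "") for seq in group)
    if len(flat_variants) != len(flat_scores):
        raise ValueError("variants and scores do not line up")
    order = np.argsort(flat_scores)[::-1][:top_k]  # highest score first
    return [(flat_variants[i], float(flat_scores[i])) for i in order]


# Synthetic example: 2 steps x 2 chains
example_variants = [["A C D", "A C E"], ["A C F", "A C G"]]
example_scores = [[0.1, 0.5], [0.3, 0.2]]
for sequence, score in rank_variants(example_variants, example_scores, top_k=2):
    print(sequence, score)
```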

In [4]:
# Run class method
epg = EvoProtGrad()
output_dir = os.path.join(Path.home(), 'data', 'protein_evolution')  # Path.home() also works where HOME is unset
Path(output_dir).mkdir(parents=True, exist_ok=True)
var_df = epg.single_evolute(raw_protein_sequence=raw_protein_sequence, output_dir=output_dir)
display(var_df)

### Paired Protein Evolution ###

The second method extends this approach to a pair of protein sequences joined by a linker, in this case a run of 20 glycine ('G') residues. The linker allows both sequences to be evolved simultaneously as a single input while keeping the two chains distinct and retaining their relational context.

As in single protein evolution, the paired sequence is written in FASTA format, with the ':' separator in the input replaced by the 'G' amino acid string. The resulting sequence is then subjected to the directed evolution process, with the 'G' string marked as a preserved region so that the boundary between the two protein sequences is maintained.

During the evolution process, mutations are proposed and evaluated across both protein sequences, considering their combined context. The preserved region ensures that mutations do not disrupt the separator, maintaining the integrity of the paired format.
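
The linker construction and the preserved region can be sketched independently of EvoProtGrad. The helpers `link_chains` and `linker_is_intact` below are hypothetical, and the span they use is half-open (start inclusive, end exclusive) by assumption. Whether the library's preserved_regions argument treats the end index as inclusive or exclusive should be confirmed in its documentation.

```python
def link_chains(paired_sequence, linker_length=20, delimiter=":"):
    """Join two chains with a poly-G linker and return (linked_sequence, (start, end))."""
    if paired_sequence.count(delimiter) != 1:
        raise ValueError(f"Expected exactly one '{delimiter}' between the two chains")
    chain_a, chain_b = paired_sequence.split(delimiter)
    linker = "G" * linker_length
    start = len(chain_a)  # the linker begins where the delimiter was
    return chain_a + linker + chain_b, (start, start + linker_length)


def linker_is_intact(variant, span):
    """Check that a variant still carries an unmodified poly-G linker in the given span."""
    start, end = span
    return variant[start:end] == "G" * (end - start)


linked, span = link_chains("MKT:AGG")
print(linked, span)
print(linker_is_intact(linked, span))
```

Deriving the start index from the position of ':' avoids a wrong offset when the first chain happens to end in glycine, which a plain substring search for the linker would not handle.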

In [5]:
def run_evo_prot_grad_on_paired_sequence(paired_protein_sequence):
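    # Replace the ':' chain delimiter with a string of 20 'G' amino acids
    if paired_protein_sequence.count(':') != 1:
        raise ValueError("Expected exactly one ':' separating the two protein sequences")
    separator = 'G' * 20
    sequence_with_separator = paired_protein_sequence.replace(':', separator)

    # Determine the start and end indices of the separator.
    # The separator starts where ':' was; searching for 20 'G's can match one
    # position too early if the first sequence ends in 'G'.
    separator_start_index = paired_protein_sequence.index(':')
    separator_end_index = separator_start_index + len(separator)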

    # Format the sequence into FASTA format
    fasta_format_sequence = f">Paired_Protein_Sequence\n{sequence_with_separator}"

    # Save the sequence to a temporary file
    temp_fasta_path = "temp_paired_sequence.fasta"
    with open(temp_fasta_path, "w") as file:
        file.write(fasta_format_sequence)

    # Load the ESM-2 model and tokenizer as the expert
    esm2_expert = evo_prot_grad.get_expert(
        'esm',
        model=EsmForMaskedLM.from_pretrained("facebook/esm2_t30_150M_UR50D"),
        tokenizer=AutoTokenizer.from_pretrained("facebook/esm2_t30_150M_UR50D"),
        temperature=0.95,
        device='cuda' if torch.cuda.is_available() else 'cpu'  # fall back to CPU when no GPU is available
    )

    # Initialize Directed Evolution with the preserved separator region
    directed_evolution = evo_prot_grad.DirectedEvolution(
        wt_fasta=temp_fasta_path,
        output='all',
        experts=[esm2_expert],
        parallel_chains=1,
        n_steps=20,
        max_mutations=10,
        verbose=True,
        preserved_regions=[(separator_start_index, separator_end_index)]  # Preserve the 'G' amino acids string
    )

    # Run the evolution process
    variants, scores = directed_evolution()

    # Process the results, replacing the 'G' amino acids string back to ':'
    #for variant, score in zip(variants, scores):
    #    evolved_sequence = variant.replace(separator, ':')
    #    print(f"Evolved Paired Sequence: {evolved_sequence}, Score: {score}")

    return variants, scores

In [6]:
paired_protein_sequence = "MLTEVMEVWHGLVIAVVSLFLQACFLTAINYLLSRHMAHKSEQILKAASLQVPRPSPGHHHPPAVKEMKETQTERDIPMSDSLYRHDSDTPSDSLDSSCSSPPACQATEDVDYTQVVFSDPGELKNDSPLDYENIKEITDYVNVNPERHKPSFWYFVNPALSEPAEYDQVAM:MASPGSGFWSFGSEDGSGDSENPGTARAWCQVAQKFTGGIGNKLCALLYGDAEKPAESGGSQPPRAAARKAACACDQKPCSCSKVDVNYAFLHATDLLPACDGERPTLAFLQDVMNILLQYVVKSFDRSTKVIDFHYPNELLQEYNWELADQPQNLEEILMHCQTTLKYAIKTGHPRYFNQLSTGLDMVGLAADWLTSTANTNMFTYEIAPVFVLLEYVTLKKMREIIGWPGGSGDGIFSPGGAISNMYAMMIARFKMFPEVKEKGMAALPRLIAFTSEHSHFSLKKGAAALGIGTDSVILIKCDERGKMIPSDLERRILEAKQKGFVPFLVSATAGTTVYGAFDPLLAVADICKKYKIWMHVDAAWGGGLLMSRKHKWKLSGVERANSVTWNPHKMMGVPLQCSALLVREEGLMQNCNQMHASYLFQQDKHYDLSYDTGDKALQCGRHVDVFKLWLMWRAKGTTGFEAHVDKCLELAEYLYNIIKNREGYEMVFDGKPQHTNVCFWYIPPSLRTLEDNEERMSRLSKVAPVIKARMMEYGTTMVSYQPLGDKVNFFRMVISNPAATHQDIDFLIEEIERLGQDL"  # Replace with your paired protein sequences
variants, scores = run_evo_prot_grad_on_paired_sequence(paired_protein_sequence)

In [7]:
print(len(variants))
print(len(scores))
print(variants[0])
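
To inspect the two evolved chains separately, a variant can be split at the known linker position. This is a minimal sketch, assuming that variants keep the wild-type length (substitutions only) and may contain spaces between residues. The function `split_paired_variant` is a hypothetical helper, not part of EvoProtGrad.

```python
def split_paired_variant(variant, separator_start, separator_length=20):
    """Split an evolved linked sequence into its two chains using the linker position."""
    sequence = variant.replace(" ", "")
    separator_end = separator_start + separator_length
    return sequence[:separator_start], sequence[separator_end:]


# Synthetic example: both chains have a glycine next to the linker
chain_a, chain_b = "MKTG", "GAV"
linked = chain_a + "G" * 20 + chain_b
evolved_a, evolved_b = split_paired_variant(linked, separator_start=len(chain_a))
print(f"{evolved_a}:{evolved_b}")
```

Splitting by position rather than by replacing the first run of 20 'G's keeps the chain boundary correct even when a residue next to the linker is, or mutates to, glycine.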