## ESM-2: Paired Protein Evolution ##

The second method extends the single-protein approach to a pair of protein sequences joined by a linker, in this case a run of 20 glycine ('G') residues. The linker lets both sequences evolve simultaneously as one input while keeping each sequence intact and preserving the context each provides for the other.

As in single-protein evolution, the input is written as a FASTA record. The pair is supplied as two sequences separated by ':', and the ':' is replaced with the poly-G linker before the record is written. Directed evolution then runs on this combined sequence, with the linker marked as a preserved region so that the boundary between the two proteins stays fixed.
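
The formatting step described above can be sketched as a small pure function. This is an illustrative helper, not part of the notebook or of EvoProtGrad: the name `to_linked_fasta` and the toy input are assumptions. It returns the linker span as a half-open Python slice, `(start, end)`, which is the same convention the notebook cell further down uses when it computes the separator indices.

```python
LINKER = "G" * 20  # poly-glycine linker, as used in this notebook


def to_linked_fasta(paired, name="Paired_Protein_Sequence", linker=LINKER):
    """Join 'chainA:chainB' with a linker; return (fasta_text, (start, end)).

    The span is half-open, like a Python slice: linked[start:end] == linker.
    """
    chains = paired.split(":")
    if len(chains) != 2 or not all(chains):
        raise ValueError("expected exactly two non-empty chains separated by ':'")
    chain_a, chain_b = chains
    linked = chain_a + linker + chain_b
    # The linker starts right after chain A, wherever chain A happens to end.
    start = len(chain_a)
    end = start + len(linker)
    return f">{name}\n{linked}", (start, end)


# Toy input (hypothetical, three residues per chain)
fasta_text, span = to_linked_fasta("MKT:AVL")
print(fasta_text)
print(span)
```

Taking the start index from the length of the first chain, rather than searching for the run of 'G', keeps the span correct even when the first chain itself ends in glycine.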

During evolution, mutations are proposed and scored across both proteins, so each candidate is evaluated in the context of the full pair. Because the linker is a preserved region, mutations are not proposed inside it and the paired format stays intact.
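
One way to confirm that an evolved variant still respects the paired format is to check the linker span and split the variant back into its two chains. The sketch below is a hypothetical post-processing helper, not part of EvoProtGrad. It assumes that variants have the same length as the wild type (substitutions only) and that residues may be separated by spaces, as in the wild-type printout later in this notebook. The span is treated as a half-open slice.

```python
LINKER = "G" * 20


def linker_intact(variant, span, linker=LINKER):
    """Return True if the linker still occupies variant[start:end]."""
    residues = "".join(variant.split())  # drop spaces between residues, if any
    start, end = span
    return residues[start:end] == linker


def split_chains(variant, span):
    """Return (chain_a, chain_b) with the linker region removed."""
    residues = "".join(variant.split())
    start, end = span
    return residues[:start], residues[end:]


# Toy data (hypothetical): the wild type is 'MKT' + linker + 'AVL'
span = (3, 23)
variant = " ".join("MRT" + LINKER + "AVI")       # two substitutions, linker untouched
broken = "MKT" + "G" * 19 + "A" + "AVL"          # a mutation inside the linker

print(linker_intact(variant, span))              # linker preserved
print(linker_intact(broken, span))               # linker disrupted
print(":".join(split_chains(variant, span)))     # back to the 'A:B' form
```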

Resources

Blog post: https://huggingface.co/blog/AmelieSchreiber/directed-evolution-with-esm2

Model weights: https://huggingface.co/facebook/esm2_t30_150M_UR50D

ESM GitHub repository: https://github.com/facebookresearch/esm

In [2]:
import os
import gc
import numpy as np
import pandas as pd
import ipywidgets as widgets
from pathlib import Path

from matplotlib import pyplot as plt

# EvoProtGrad and Hugging Face imports
import evo_prot_grad
from transformers import AutoTokenizer, EsmForMaskedLM

# PyTorch
import torch

# Appearance of the Notebook
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

# Import this module with autoreload
%load_ext autoreload
%autoreload 2
import esm
from esm.evoprotgrad import set_expert, EvoProtGrad
from esm.evoprotgrad import torch_device
print(f'Project module version: {esm.__version__}')
print(f'PyTorch version:        {torch.__version__}')

Project module version: 0.0.post1.dev23+gc9ac203
PyTorch version:        2.1.2+cu121


### Set the GPU device ###

In [1]:
# Where do we want to put the model weights.
project_dir = os.path.normpath('/n/data1/hms/ccb/projects/esm')
cache_dir = os.path.join(project_dir, 'model_weights')
Path(cache_dir).mkdir(exist_ok=True, parents=True)

# Get the device for the model
device_dict = torch_device()
display(device_dict)
torch.set_float32_matmul_precision(precision='high')

# Now, get the device name
device = device_dict.get('device')
print(device)

# Free up GPU memory
gc.collect()
torch.cuda.empty_cache()

# Available ESM-2 checkpoints, e.g. https://huggingface.co/facebook/esm2_t33_650M_UR50D
esm_checkpoints = {
    't48_15B': 'facebook/esm2_t48_15B_UR50D',
    't36_3B': 'facebook/esm2_t36_3B_UR50D',
    't33_650M': 'facebook/esm2_t33_650M_UR50D',
    't30_150M': 'facebook/esm2_t30_150M_UR50D',
    't12_35M': 'facebook/esm2_t12_35M_UR50D',
    't6_8M': 'facebook/esm2_t6_8M_UR50D',
    'default': 'facebook/esm2_t30_150M_UR50D'
}

!nvidia-smi

NameError: name 'os' is not defined

### Create the expert model ###

In [6]:
checkpoint = 't36_3B'
print(f'Loading model and weights for {checkpoint} model. This can take a while.')
expert = set_expert(checkpoint=checkpoint, device=device, cache_dir=cache_dir)

Loading model and weights for t36_3B model. This can take a while.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [8]:
def run_evo_prot_grad_on_paired_sequence(paired_protein_sequence, expert):
    # Replace ':' with a string of 20 'G' amino acids
    separator = 'G' * 20
    sequence_with_separator = paired_protein_sequence.replace(':', separator)

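    # Determine the start and end indices of the separator. The separator begins
    # where ':' was, so take that position directly; searching for the 'G' run
    # would match too early if the first protein ends in 'G'. Raises ValueError
    # if the input contains no ':'.
    separator_start_index = paired_protein_sequence.index(':')
    separator_end_index = separator_start_index + len(separator)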

    # Format the sequence into FASTA format
    fasta_format_sequence = f">Paired_Protein_Sequence\n{sequence_with_separator}"

    # Save the sequence to a temporary file
    temp_fasta_path = "temp_paired_sequence.fasta"
    with open(temp_fasta_path, "w") as file:
        file.write(fasta_format_sequence)

    # Initialize Directed Evolution with the preserved separator region
    directed_evolution = evo_prot_grad.DirectedEvolution(
        wt_fasta=temp_fasta_path,
        output='all',
        experts=[expert],
        parallel_chains=1,
        n_steps=20,
        max_mutations=10,
        verbose=True,
        preserved_regions=[(separator_start_index, separator_end_index)]  # Preserve the 'G' amino acids string
    )

    # Run the evolution process
    variants, scores = directed_evolution()

    # Process the results, replacing the 'G' amino acids string back to ':'
    # for variant, score in zip(variants, scores):
    #     evolved_sequence = variant.replace(separator, ':')
    #     print(f"Evolved Paired Sequence: {evolved_sequence}, Score: {score}")

    return variants, scores

In [9]:
paired_protein_sequence = "MLTEVMEVWHGLVIAVVSLFLQACFLTAINYLLSRHMAHKSEQILKAASLQVPRPSPGHHHPPAVKEMKETQTERDIPMSDSLYRHDSDTPSDSLDSSCSSPPACQATEDVDYTQVVFSDPGELKNDSPLDYENIKEITDYVNVNPERHKPSFWYFVNPALSEPAEYDQVAM:MASPGSGFWSFGSEDGSGDSENPGTARAWCQVAQKFTGGIGNKLCALLYGDAEKPAESGGSQPPRAAARKAACACDQKPCSCSKVDVNYAFLHATDLLPACDGERPTLAFLQDVMNILLQYVVKSFDRSTKVIDFHYPNELLQEYNWELADQPQNLEEILMHCQTTLKYAIKTGHPRYFNQLSTGLDMVGLAADWLTSTANTNMFTYEIAPVFVLLEYVTLKKMREIIGWPGGSGDGIFSPGGAISNMYAMMIARFKMFPEVKEKGMAALPRLIAFTSEHSHFSLKKGAAALGIGTDSVILIKCDERGKMIPSDLERRILEAKQKGFVPFLVSATAGTTVYGAFDPLLAVADICKKYKIWMHVDAAWGGGLLMSRKHKWKLSGVERANSVTWNPHKMMGVPLQCSALLVREEGLMQNCNQMHASYLFQQDKHYDLSYDTGDKALQCGRHVDVFKLWLMWRAKGTTGFEAHVDKCLELAEYLYNIIKNREGYEMVFDGKPQHTNVCFWYIPPSLRTLEDNEERMSRLSKVAPVIKARMMEYGTTMVSYQPLGDKVNFFRMVISNPAATHQDIDFLIEEIERLGQDL"  # Replace with your paired protein sequences
variants, scores = run_evo_prot_grad_on_paired_sequence(paired_protein_sequence, expert=expert)

>Wildtype sequence: M L T E V M E V W H G L V I A V V S L F L Q A C F L T A I N Y L L S R H M A H K S E Q I L K A A S L Q V P R P S P G H H H P P A V K E M K E T Q T E R D I P M S D S L Y R H D S D T P S D S L D S S C S S P P A C Q A T E D V D Y T Q V V F S D P G E L K N D S P L D Y E N I K E I T D Y V N V N P E R H K P S F W Y F V N P A L S E P A E Y D Q V A M G G G G G G G G G G G G G G G G G G G G M A S P G S G F W S F G S E D G S G D S E N P G T A R A W C Q V A Q K F T G G I G N K L C A L L Y G D A E K P A E S G G S Q P P R A A A R K A A C A C D Q K P C S C S K V D V N Y A F L H A T D L L P A C D G E R P T L A F L Q D V M N I L L Q Y V V K S F D R S T K V I D F H Y P N E L L Q E Y N W E L A D Q P Q N L E E I L M H C Q T T L K Y A I K T G H P R Y F N Q L S T G L D M V G L A A D W L T S T A N T N M F T Y E I A P V F V L L E Y V T L K K M R E I I G W P G G S G D G I F S P G G A I S N M Y A M M I A R F K M F P E V K E K G M A A L P R L I A F T S E H S H F S L K K G A A A L G I G T D S 

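Once the run finishes, the returned `variants` and `scores` can be scanned for the highest-scoring candidate. The sketch below uses toy data and rests on an assumption that the notebook does not confirm: that with `output='all'`, `variants` is indexed as `variants[step][chain]` and `scores` is a 2-D array-like of shape `(n_steps, parallel_chains)`. The helper `best_variant` is hypothetical and works on plain nested lists as well as NumPy arrays.

```python
def best_variant(variants, scores):
    """Return (score, step, chain, sequence) for the highest-scoring variant.

    Assumes variants[step][chain] and scores[step][chain] line up one to one.
    """
    best = None
    for step, row in enumerate(scores):
        for chain, score in enumerate(row):
            candidate = (float(score), step, chain, variants[step][chain])
            if best is None or candidate[0] > best[0]:
                best = candidate
    return best


# Toy data (hypothetical): three steps, one chain
toy_variants = [["M K T"], ["M R T"], ["M R S"]]
toy_scores = [[0.0], [1.5], [0.7]]

score, step, chain, seq = best_variant(toy_variants, toy_scores)
print(f"step {step}, chain {chain}: score {score:.2f}  {seq}")
```

If the actual return layout differs, only the two index expressions inside the loop need to change.
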
In [10]:
!nvidia-smi

Mon May 20 09:27:46 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:DD:00.0 Off |                    0 |
| N/A   27C    P0              69W / 500W |  22451MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    