<a href="https://colab.research.google.com/github/fourmodern/toc_tutorial_colab/blob/main/teachopencadd/t110_esm2_peptide_optimization_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# ESM-2 Protein-Peptide Binding Optimization Tutorial

This notebook demonstrates how to use ESM-2, a protein language model from Meta AI (formerly Facebook AI Research), to generate and optimize peptide binders for a target protein. Peptides are generated with PepMLM, a model built on ESM-2, and then refined with EvoProtGrad using ESM-2 as the scoring model. The workflow includes:
1. Setting up the environment
2. Loading the PepMLM (ESM-2) model
3. Preparing a protein sequence
4. Generating peptide sequences
5. Optimizing the peptide with in silico directed evolution

---


In [None]:

# Step 1: Setup
# Install necessary libraries
!pip install transformers torch esm


Collecting esm
  Downloading esm-3.0.6-py3-none-any.whl.metadata (9.4 kB)
Collecting torchtext (from esm)
  Downloading torchtext-0.18.0-cp310-cp310-manylinux1_x86_64.whl.metadata (7.9 kB)
Collecting biotite==0.41.2 (from esm)
  Downloading biotite-0.41.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.1 kB)
Collecting msgpack-numpy (from esm)
  Downloading msgpack_numpy-0.4.8-py2.py3-none-any.whl.metadata (5.0 kB)
Collecting biopython (from esm)
  Downloading biopython-1.84-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting brotli (from esm)
  Downloading Brotli-1.1.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.5 kB)
Collecting jedi>=0.16 (from ipython->esm)
  Downloading jedi-0.19.1-py2.py3-none-any.whl.metadata (22 kB)
Downloading esm-3.0.6-py3-none-any.whl (2.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.4/2.4 MB 20.7 MB/s


### Step 1: Environment Setup
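We first install the required libraries: `transformers` for loading models from the Hugging Face hub, `torch` (PyTorch) as the compute backend, and `esm`. The code in this notebook loads every model through `transformers` and never imports `esm` directly, so that last package is optional here.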


In [None]:

# Step 2: Load the PepMLM model (built on ESM-2)
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
import pandas as pd
import numpy as np
from torch.distributions import Categorical

# Load the pre-trained PepMLM-650M model and its tokenizer from Hugging Face
model = AutoModelForMaskedLM.from_pretrained("TianlaiChen/PepMLM-650M")
tokenizer = AutoTokenizer.from_pretrained("TianlaiChen/PepMLM-650M")

print("Model and tokenizer loaded successfully.")


Model and tokenizer loaded successfully.



### Step 2: Load the PepMLM (ESM-2) Model
We load `TianlaiChen/PepMLM-650M` from the Hugging Face model hub. PepMLM is built on the 650M-parameter ESM-2 protein language model and is fine-tuned to generate peptide binders conditioned on a target protein sequence, which makes it a better fit for this task than the base ESM-2 checkpoint.


In [None]:

# Step 3: Prepare a protein sequence
# Example target protein sequence (replace with your own, e.g. from UniProt)
protein_seq = "MSGIALSRLAQERKAWRKDHPFGFVAVPTKNPDGTMNLMNWECAIPGKKGTPWEGGLFKLRMLFKDDYPSSPPKCKFEPPLFHPNVYPSGTVCLSILEEDKDWRPAITIKQILLGIQELLNEPNIQDPAQAEAYTIYCQNRVEYEKRVRAQAKKFAPS"

# Tokenize the protein sequence for the model
inputs = tokenizer(protein_seq, return_tensors="pt")

print("Protein sequence tokenized.")
print("Inputs:", inputs)


Protein sequence tokenized.
Inputs: {'input_ids': tensor([[ 0, 20,  8,  6, 12,  5,  4,  8, 10,  4,  5, 16,  9, 10, 15,  5, 22, 10,
         15, 13, 21, 14, 18,  6, 18,  7,  5,  7, 14, 11, 15, 17, 14, 13,  6, 11,
         20, 17,  4, 20, 17, 22,  9, 23,  5, 12, 14,  6, 15, 15,  6, 11, 14, 22,
          9,  6,  6,  4, 18, 15,  4, 10, 20,  4, 18, 15, 13, 13, 19, 14,  8,  8,
         14, 14, 15, 23, 15, 18,  9, 14, 14,  4, 18, 21, 14, 17,  7, 19, 14,  8,
          6, 11,  7, 23,  4,  8, 12,  4,  9,  9, 13, 15, 13, 22, 10, 14,  5, 12,
         11, 12, 15, 16, 12,  4,  4,  6, 12, 16,  9,  4,  4, 17,  9, 14, 17, 12,
         16, 13, 14,  5, 16,  5,  9,  5, 19, 11, 12, 19, 23, 16, 17, 10,  7,  9,
         19,  9, 15, 10,  7, 10,  5, 16,  5, 15, 15, 18,  5, 14,  8,  2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
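
The `input_ids` above can be read with a small lookup table. The sketch below is not the real tokenizer: it hard-codes a vocabulary order that is an assumption inferred from the ids printed above (`0` at the start, `2` at the end, `M` as `20`, and so on), and the names `ESM2_TOKENS` and `encode` are illustrative only. Residues outside the 20 standard amino acids are mapped to `<unk>` here, which may differ from the actual tokenizer.

```python
# Assumed ESM-2 vocabulary order: special tokens first, then residues.
# Only the entries needed for the 20 standard amino acids are listed.
ESM2_TOKENS = [
    "<cls>", "<pad>", "<eos>", "<unk>",
    "L", "A", "G", "V", "S", "E", "R", "T", "I", "D",
    "P", "K", "Q", "N", "F", "Y", "M", "H", "W", "C",
]
TOKEN_TO_ID = {token: index for index, token in enumerate(ESM2_TOKENS)}


def encode(sequence):
    """Map a protein string to token ids, wrapped in <cls> ... <eos>."""
    unk = TOKEN_TO_ID["<unk>"]
    body = [TOKEN_TO_ID.get(residue, unk) for residue in sequence]
    return [TOKEN_TO_ID["<cls>"]] + body + [TOKEN_TO_ID["<eos>"]]


# First six residues of the example protein.
print(encode("MSGIAL"))  # [0, 20, 8, 6, 12, 5, 4, 2]
```

The first seven ids match the start of the tensor printed above; the trailing `2` is the end-of-sequence token that the tokenizer appends after the last residue.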


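### Step 4: Generate Peptide Sequences
PepMLM generates binders by masked language modeling: a run of `<mask>` tokens is appended to the target protein sequence, and the model fills in the masked positions conditioned on the target. Each position is sampled from its `top_k` most likely residues, and every generated peptide is scored by pseudo-perplexity, where a lower value means the model finds the peptide more plausible for that target.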
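
The top-k sampling step used by `generate_peptide_for_single_sequence` can be sketched without tensors. This is a minimal illustration with plain Python lists, not the notebook's implementation: `top_k_sample` and the example logits are hypothetical, and the real code applies the same idea to every masked position at once with `torch`.

```python
import math
import random


def top_k_sample(logits, k, rng=random):
    """Sample one index from the k highest-scoring entries of `logits`."""
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to those k logits (shifted by the max for stability).
    peak = logits[top[0]]
    weights = [math.exp(logits[i] - peak) for i in top]
    # Draw one candidate in proportion to its renormalized probability.
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
logits = [2.0, 0.5, 1.5, -1.0, 0.0]  # made-up scores for five tokens
samples = [top_k_sample(logits, k=3, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))  # only indices among the three best can appear
```

With `k=1` the function always returns the arg-max; larger `k` trades some model confidence for diversity among the generated binders.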


In [None]:

def compute_pseudo_perplexity(model, tokenizer, protein_seq, binder_seq):
    sequence = protein_seq + binder_seq
    original_input = tokenizer.encode(sequence, return_tensors='pt').to(model.device)
    length_of_binder = len(binder_seq)

    # Prepare a batch with each row having one masked token from the binder sequence
    masked_inputs = original_input.repeat(length_of_binder, 1)
    positions_to_mask = torch.arange(-length_of_binder - 1, -1, device=model.device)

    masked_inputs[torch.arange(length_of_binder), positions_to_mask] = tokenizer.mask_token_id

    # Prepare labels for the masked tokens
    labels = torch.full_like(masked_inputs, -100)
    labels[torch.arange(length_of_binder), positions_to_mask] = original_input[0, positions_to_mask]

    # Get model predictions and calculate loss
    with torch.no_grad():
        outputs = model(masked_inputs, labels=labels)
        loss = outputs.loss

    # Loss is already averaged by the model
    avg_loss = loss.item()
    pseudo_perplexity = np.exp(avg_loss)
    return pseudo_perplexity


def generate_peptide_for_single_sequence(protein_seq, peptide_length=15, top_k=3, num_binders=4):

    peptide_length = int(peptide_length)
    top_k = int(top_k)
    num_binders = int(num_binders)

    binders_with_ppl = []

    for _ in range(num_binders):
        # Generate binder
        masked_peptide = '<mask>' * peptide_length
        input_sequence = protein_seq + masked_peptide
        inputs = tokenizer(input_sequence, return_tensors="pt").to(model.device)

        with torch.no_grad():
            logits = model(**inputs).logits
        mask_token_indices = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
        logits_at_masks = logits[0, mask_token_indices]

        # Apply top-k sampling
        top_k_logits, top_k_indices = logits_at_masks.topk(top_k, dim=-1)
        probabilities = torch.nn.functional.softmax(top_k_logits, dim=-1)
        predicted_indices = Categorical(probabilities).sample()
        predicted_token_ids = top_k_indices.gather(-1, predicted_indices.unsqueeze(-1)).squeeze(-1)

        generated_binder = tokenizer.decode(predicted_token_ids, skip_special_tokens=True).replace(' ', '')

        # Compute PPL for the generated binder
        ppl_value = compute_pseudo_perplexity(model, tokenizer, protein_seq, generated_binder)

        # Add the generated binder and its PPL to the results list
        binders_with_ppl.append([generated_binder, ppl_value])

    return binders_with_ppl

def generate_peptide(input_seqs, peptide_length=15, top_k=3, num_binders=4):
    if isinstance(input_seqs, str):  # Single sequence
        binders = generate_peptide_for_single_sequence(input_seqs, peptide_length, top_k, num_binders)
        return pd.DataFrame(binders, columns=['Binder', 'Pseudo Perplexity'])

    elif isinstance(input_seqs, list):  # List of sequences
        results = []
        for seq in input_seqs:
            binders = generate_peptide_for_single_sequence(seq, peptide_length, top_k, num_binders)
            for binder, ppl in binders:
                results.append([seq, binder, ppl])
        return pd.DataFrame(results, columns=['Input Sequence', 'Binder', 'Pseudo Perplexity'])

    else:  # Fail loudly instead of silently returning None
        raise TypeError("input_seqs must be a str or a list of str")
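
The value returned by `compute_pseudo_perplexity` is the exponential of the mean cross-entropy over the binder positions, each evaluated with only that position masked: $\mathrm{PPL} = \exp\left(-\frac{1}{L}\sum_{i=1}^{L}\log p(x_i \mid x_{\setminus i})\right)$. The sketch below reproduces that arithmetic from a list of probabilities. The function name and the example probabilities are illustrative assumptions, not model output.

```python
import math


def pseudo_perplexity(true_token_probs):
    """exp of the mean negative log-probability over the masked positions.

    `true_token_probs[i]` stands for the probability the model assigns to
    the actual residue at binder position i when only that position is masked.
    """
    if not true_token_probs:
        raise ValueError("need at least one masked position")
    nll = [-math.log(p) for p in true_token_probs]
    return math.exp(sum(nll) / len(nll))


# Three hypothetical positions with probabilities 1/2, 1/4 and 1/8:
# the mean negative log-probability is 2*ln(2), so the result is about 4.
print(pseudo_perplexity([0.5, 0.25, 0.125]))
```

A pseudo-perplexity of 1 would mean the model predicts every residue with certainty; larger values mean the model is, on average, choosing among that many equally likely residues.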


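### Step 5: Optimize Peptide Binding Affinity
We first generate a set of candidate binders with `generate_peptide`, then refine a candidate with EvoProtGrad, a library for in silico directed evolution that proposes mutations guided by the gradients of a scoring model (here ESM-2). The aim is a peptide that is more likely to bind the target protein. Note that the ESM-2 score is a sequence-plausibility proxy, not a measured binding affinity.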


In [None]:
results_df = generate_peptide(protein_seq, peptide_length=15, top_k=3, num_binders=5)
print(results_df)

            Binder  Pseudo Perplexity
0  TDDEPEPLLYAALLE          11.393285
1  TQDEPEPLPYLAAEL           9.031120
2  FQDSPELLPLYLALL           9.081228
3  FDDSEELLLYYLLEL          14.481063
4  FEEEEELAPRLRAKL          10.555094
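
One way to move from this table to the optimization step is to pick the binder with the lowest pseudo-perplexity and join it to the target with `:`, the paired format expected by `run_evo_prot_grad_on_paired_sequence`. The sketch below uses plain Python and the values copied from the table above; `best_binder` and `make_paired_sequence` are hypothetical helper names, and choosing by lowest pseudo-perplexity is an assumption about a sensible selection rule rather than a step the notebook prescribes.

```python
# (binder, pseudo-perplexity) pairs copied from the output table above.
candidates = [
    ("TDDEPEPLLYAALLE", 11.393285),
    ("TQDEPEPLPYLAAEL", 9.031120),
    ("FQDSPELLPLYLALL", 9.081228),
    ("FDDSEELLLYYLLEL", 14.481063),
    ("FEEEEELAPRLRAKL", 10.555094),
]


def best_binder(scored_binders):
    """Return the (binder, ppl) pair with the lowest pseudo-perplexity."""
    return min(scored_binders, key=lambda pair: pair[1])


def make_paired_sequence(target, binder):
    """Join target and binder in the 'target:binder' format."""
    return f"{target}:{binder}"


binder, ppl = best_binder(candidates)
print(binder, ppl)  # TQDEPEPLPYLAAEL 9.03112
print(make_paired_sequence("MSGIALSRLAQ", binder))  # truncated target, for illustration
```

Because sampling is stochastic, rerunning the generation cell will produce different binders and scores.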


## In Silico Directed Evolution of the Peptide Binder with EvoProtGrad and ESM-2

In [None]:
!pip install evo_prot_grad
# Optional: free the memory held by the PepMLM model before loading ESM-2
# del model
# torch.cuda.empty_cache()
import torch
import evo_prot_grad
from transformers import AutoTokenizer, EsmForMaskedLM



In [None]:
def run_evo_prot_grad_on_paired_sequence(paired_protein_sequence):
    # Replace ':' with a string of 20 'G' amino acids
    separator = 'G' * 20
    sequence_with_separator = paired_protein_sequence.replace(':', separator)

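    # Determine the start and end indices of the first protein and the separator.
    # The position of ':' is used instead of searching for the separator string,
    # so a target that itself ends in glycine cannot shift the boundary.
    separator_start_index = paired_protein_sequence.index(':')
    first_protein_end_index = separator_start_index
    separator_end_index = separator_start_index + len(separator)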

    # Format the sequence into FASTA format
    fasta_format_sequence = f">Paired_Protein_Sequence\n{sequence_with_separator}"

    # Save the sequence to a temporary file
    temp_fasta_path = "temp_paired_sequence.fasta"
    with open(temp_fasta_path, "w") as file:
        file.write(fasta_format_sequence)

    # Use a smaller ESM-2 model to reduce memory usage
    esm2_expert = evo_prot_grad.get_expert(
        'esm',
        model=EsmForMaskedLM.from_pretrained("facebook/esm2_t6_8M_UR50D"),  # Use a smaller ESM-2 model
        tokenizer=AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D"),
        scoring_strategy='pseudolikelihood_ratio',
        temperature=0.95,
        device='cuda' if torch.cuda.is_available() else 'cpu'
    )

    # Initialize wildtype sequence for the expert with the correct format
    wildtype_sequence = sequence_with_separator.replace(" ", "")  # Make sure the input sequence has no spaces
    esm2_expert.init_wildtype(wildtype_sequence)

    # Initialize Directed Evolution with the preserved first protein and separator region
    directed_evolution = evo_prot_grad.DirectedEvolution(
        wt_fasta=temp_fasta_path,
        output='best',
        experts=[esm2_expert],
        parallel_chains=1,  # Reduce parallel chains to save memory
        n_steps=50,
        max_mutations=15,
        verbose=True,
        preserved_regions=[(0, first_protein_end_index), (separator_start_index, separator_end_index)]
    )

    # Run the evolution process
    variants, scores = directed_evolution()

    # Process the results and split them into Protein 1 and Protein 2
    for variant, score in zip(variants, scores):
        # Remove spaces from the sequence
        evolved_sequence_no_spaces = variant.replace(" ", "")

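        # Split the sequence at the known separator boundaries (robust even if
        # the target protein contains a long glycine run of its own)
        protein_1 = evolved_sequence_no_spaces[:separator_start_index]
        protein_2 = evolved_sequence_no_spaces[separator_end_index:]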

        print(f"Protein: {protein_1}, Evolved Peptide: {protein_2}, Score: {score}")
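
The index bookkeeping in the function above can be checked in isolation. The sketch below mirrors only the string handling: joining the `target:binder` pair with a 20-glycine separator, deriving the two preserved `(start, end)` regions, and recovering the two chains from an evolved sequence. The helper names are hypothetical, and the sketch makes no claim about how EvoProtGrad interprets the region endpoints.

```python
SEPARATOR = "G" * 20


def join_with_separator(paired, separator=SEPARATOR):
    """Replace ':' with the separator and return the preserved regions."""
    target, binder = paired.split(":")
    joined = target + separator + binder
    sep_start = len(target)
    sep_end = sep_start + len(separator)
    # Same tuples as in the function above: the target, then the separator.
    preserved = [(0, sep_start), (sep_start, sep_end)]
    return joined, preserved


def split_variant(variant, sep_start, sep_end):
    """Recover (target, binder) from a possibly space-separated variant."""
    flat = variant.replace(" ", "")
    return flat[:sep_start], flat[sep_end:]


joined, preserved = join_with_separator("AAAG:CC")
print(preserved)  # [(0, 4), (4, 24)]
(_, (sep_start, sep_end)) = preserved
print(split_variant(" ".join(joined), sep_start, sep_end))  # ('AAAG', 'CC')
```

Slicing by index rather than calling `str.find` or `str.split` on the separator matters when the target ends in glycine, as in the toy example: searching for twenty `G` characters would match one position too early.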

In [None]:
# Example usage: target protein and peptide binder joined by ':'
paired_protein_sequence = "MSGIALSRLAQERKAWRKDHPFGFVAVPTKNPDGTMNLMNWECAIPGKKGTPWEGGLFKLRMLFKDDYPSSPPKCKFEPPLFHPNVYPSGTVCLSILEEDKDWRPAITIKQILLGIQELLNEPNIQDPAQAEAYTIYCQNRVEYEKRVRAQAKKFAPS:FDEDDPLAPRLLEEE"  # Replace with your paired protein sequences
run_evo_prot_grad_on_paired_sequence(paired_protein_sequence)

>Wildtype sequence: M S G I A L S R L A Q E R K A W R K D H P F G F V A V P T K N P D G T M N L M N W E C A I P G K K G T P W E G G L F K L R M L F K D D Y P S S P P K C K F E P P L F H P N V Y P S G T V C L S I L E E D K D W R P A I T I K Q I L L G I Q E L L N E P N I Q D P A Q A E A Y T I Y C Q N R V E Y E K R V R A Q A K K F A P S G G G G G G G G G G G G G G G G G G G G F D E D D P L A P R L L E E E


RuntimeError: Number of dimensions of repeat dims can not be smaller than number of dimensions of tensor