**Protein-Protein Comparison**


This notebook uses code from "Predicting Protein-Protein Interactions Using a Protein Language Model and Linear Sum Assignment" by Amelie Schreiber. One use of the notebook is to compare the amino acid sequences of potential binders and estimate which has the highest affinity for a particular protein. The ESM-2 model from Meta AI computes the masked language model (MLM) loss, which serves as the affinity metric: the binders are ranked from lowest to highest loss, and a lower loss is read as a higher predicted affinity.



**Import libraries**

- transformers: AutoModelForMaskedLM loads the pretrained model from Meta AI, and AutoTokenizer loads the corresponding tokenizer. The tokenizer converts the input sequences into the token IDs that the model expects.

- torch: provides the tensor operations used for masking and for running the model without gradient tracking.

In [None]:
# Tutorial to look at potential binders of interest
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

**Load Model**

Load the pretrained model and tokenizer with the imported transformers classes, then put the model in evaluation mode so that dropout is disabled during scoring.

In [None]:
# Load the base model and tokenizer
base_model_path = "facebook/esm2_t12_35M_UR50D"
model = AutoModelForMaskedLM.from_pretrained(base_model_path)
tokenizer = AutoTokenizer.from_pretrained(base_model_path)

# Ensure the model is in evaluation mode
model.eval()

**Set Protein and Binder Sequences**

Enter the sequence of the protein of interest as a string, and enter all of the potential binders to be evaluated as a list of strings.

In [None]:
# Define the protein of interest and its potential binders
protein_of_interest = "LEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEVVLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALAVLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDFQNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGCTGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYVVTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFKNCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAFENLEIIRGRTKQHGQFSLAVVSLNITSLGLRSLKEISDGDVIISGNKNLCYANTINWKKLFGTSGQKTKIISNRGENSCKATGQVCHALCSPEGCWGPEPRDCVSCRNVSRGRECVDKCNLLEGEPREFVENSECIQCHPECLPQAMNITCTGRGPDNCIQCAHYIDGPHCVKTCPAGVMGENNTLVWKYADAGHVCHLCHPNCTYGCTGPGLEGCPTNGPKIPS"
potential_binders = ["CPPLENIDISGVDGDSATISFEPCREPVDYVVLHYGRAGDPGDWKTYFLPPGDTSFTLTGLEPGGWYRVELWCWRPGRCCEPQTEYFEV",
                "SEEEEERERALKEIIEETRRELKAAKAKHGKVVVVLIMASSTLEPEFILELSKALIKEMKSLFPNVVLIIVVVGLAPASLLARIRDVSLELAKYAKSLGIKVIVIVGNENEAVFVPAFEALGVEVIVDRTIIEIAAEELGLSEEEVLARFAAAAELLDELFAADPSLRERYARLDVAGATELLLERLRELFGAKVERHERLITVEVERVLTPDERRRVTAILLTPEAAREVVERLVDLVVDLILEKIAEGHNVLVLVFTPTIALAREVAALFEERRPLLEEAGAAVIIRLVARDPDTFLI",
                "SAAAAAQARLDAALAALREWLAARAREAIERYRDAKERVVEEEAITRDFHGVLTLEAVRIEVTPTTVAISARLRHASGQTVYLSILAPHDPAALEAALAIAELATRLALEAGYDLFVAVAFEPPGPVTPERWEEFAAFLELVAEDLRALLADAAAKGRPLLVVIVIVVNDDLAAHLPLESHTDPEAAAAAVATYVAEVEAKTGRKLTLPAEIAAALAAGASVVLVVARREDIAGVPARVEAALRAALAAA",
                "LEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEVVLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALAVLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDFQNHLGSCQKCDPSCPNGSCW",
                "SNKTQLGSSG/ELEELLAKKEELLKKLYKELLKKGNVLVDTEYLKTLTEEELKEISKAYISEEEGMIILEFKGTYNGYLVIKHKDVETSEEVREEQKKLAEELKKKLEALGAEVREIEVKVKEEVKTEKEGNITKTTLTLEVEIDGEKVTLKLTEVEVEL",
                "AATAAALEHLEAAAAALKELAALVATEAADAAALKAKAEELAAKVREHLRAARAATGDTSLTDEDIDAFIQRILDAVDDAEAVKALYEELEAAIAAFRAAQEAAA",
                "GKLNIKVTFLSSGKEEKLAALKAHVDALVASIDTKASGAPPLKVEVKESESKETREIDGKTYEYGFTTVTYSFEGTNDILNQLANDIVTHISNTLKDLLIEIDIAATSDGDLNLTINITVNGVDTVILLNVSLTAGTNVNLTINITVTGATVTVHIIVSLTTTSAGSATVTINATAGAGATLNITLMGVFTNTAVKDVTVNVTTTATSGTVTVTLGPVTQASAAEMAAGVAAAREAAREEALREVARLTE"
                ]
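
Before scoring, it can help to check the inputs. The sketch below is an illustration rather than part of the original notebook: the helper names `find_invalid_residues` and `fits_in_context` are hypothetical. It assumes one token per character and that the tokenizer adds one start and one end token. Characters outside the 20 standard one-letter codes are likely to map to the tokenizer's unknown token, and the scoring function truncates inputs at 1024 tokens.

```python
# The 20 standard amino-acid one-letter codes.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def find_invalid_residues(sequence):
    """Return (position, character) pairs for characters that are not standard residues."""
    return [(i, ch) for i, ch in enumerate(sequence) if ch not in STANDARD_AMINO_ACIDS]


def fits_in_context(protein, binder, max_length=1024):
    """Check that protein + separator + binder plus two special tokens fit in max_length.

    Assumption: one token per character, plus one start and one end token.
    """
    return len(protein) + 1 + len(binder) + 2 <= max_length


# Toy sequences (hypothetical, for illustration only).
toy_binders = ["ACDE", "AC/DE", "acde"]
report = {binder: find_invalid_residues(binder) for binder in toy_binders}
print(report)
print(fits_in_context("A" * 600, "C" * 300))
```

Running `find_invalid_residues` over `potential_binders` before scoring would surface entries such as the fifth binder above, which contains a `/`.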

**Compute MLM Loss Function**

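The function below computes the MLM loss, the metric used to estimate the affinity of each binder for the given protein. It joins the two sequences with a ":" separator, masks a random 15% of the positions, and averages the model's loss over several masking iterations.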
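
For reference, the masked language modeling objective is commonly written as follows, where $M$ is the set of masked positions, $x_i$ is the true residue at position $i$, and $x_{\setminus M}$ is the concatenated sequence with the positions in $M$ replaced by the mask token. This is the textbook form, given as background rather than as a description of the exact quantity the code returns.

$$
\mathcal{L}_{\text{MLM}} = -\frac{1}{|M|} \sum_{i \in M} \log p_\theta\left(x_i \mid x_{\setminus M}\right)
$$

A lower value means the model finds the masked residues easier to predict from the surrounding context of both chains.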

In [None]:
def compute_mlm_loss(protein, binder, iterations=3, mask_rate=0.15):
    """Average the model's MLM loss over several random maskings of protein + ":" + binder."""
    # Concatenate the protein sequences with a separator
    concatenated_sequence = protein + ":" + binder
    total_loss = 0.0

    for _ in range(iterations):
        # Start from a fresh character list so masks from earlier passes do not accumulate
        tokens = list(concatenated_sequence)
        num_mask = int(len(tokens) * mask_rate)  # by default, mask 15% of the sequence

        # Exclude the separator from potential mask indices
        available_indices = [i for i, token in enumerate(tokens) if token != ":"]
        # Uniform weights: every eligible position is equally likely to be masked
        probs = torch.ones(len(available_indices))
        mask_indices = torch.multinomial(probs, num_mask, replacement=False)

        for idx in mask_indices.tolist():
            tokens[available_indices[idx]] = tokenizer.mask_token

        masked_sequence = "".join(tokens)
        inputs = tokenizer(masked_sequence, return_tensors="pt", truncation=True, max_length=1024, padding='max_length')

        # Compute the MLM loss. The labels are the masked input ids themselves,
        # so the loss is averaged over every position, including padding.
        with torch.no_grad():
            outputs = model(**inputs, labels=inputs["input_ids"])
            loss = outputs.loss

        total_loss += loss.item()

    # Return the average loss
    return total_loss / iterations
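
The function above passes the masked input ids as `labels`, so the returned loss covers every position rather than only the masked ones. A common alternative, sketched here in plain Python, keeps the original token id as the label at masked positions and sets every other label to -100, the value that PyTorch's cross-entropy loss ignores by default. The helper name `mask_with_labels`, the `protected_ids` parameter, and the toy ids are hypothetical and not part of the original notebook. The sketch also uses Python's `random` module in place of `torch.multinomial`.

```python
import random

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_with_labels(token_ids, mask_id, protected_ids=(), mask_rate=0.15, rng=None):
    """Return (masked_ids, labels) for one random masking pass.

    labels[i] is the original id at masked positions and IGNORE_INDEX elsewhere.
    Positions whose id is in protected_ids (special tokens, separator) are never masked.
    """
    rng = rng or random.Random()
    protected = set(protected_ids)
    candidates = [i for i, tok in enumerate(token_ids) if tok not in protected]
    # Unlike the notebook function, the count is based on eligible positions only.
    num_mask = int(len(candidates) * mask_rate)
    chosen = set(rng.sample(candidates, num_mask))
    masked_ids = [mask_id if i in chosen else tok for i, tok in enumerate(token_ids)]
    labels = [tok if i in chosen else IGNORE_INDEX for i, tok in enumerate(token_ids)]
    return masked_ids, labels


# Toy ids (hypothetical): 0 = start, 2 = end, 3 = separator, 32 = mask.
toy_ids = [0] + list(range(5, 15)) + [3] + list(range(15, 25)) + [2]
masked_ids, labels = mask_with_labels(
    toy_ids, mask_id=32, protected_ids={0, 2, 3}, rng=random.Random(0)
)
print(masked_ids)
print(labels)
```

Converting `masked_ids` and `labels` to tensors and passing them to the model as `input_ids` and `labels` would restrict the loss to the masked residues.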

**Determine MLM Loss**

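Loop through the potential binders and compute each one's average loss against the protein of interest. Because the masked positions are chosen at random, the values vary somewhat from run to run.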

In [None]:
# Compute MLM loss for each potential binder
mlm_losses = {}
for binder in potential_binders:
    loss = compute_mlm_loss(protein_of_interest, binder)
    mlm_losses[binder] = loss

**Determine which binders have the highest potential affinity**

Sort the binders by MLM loss, lowest first, and print the ranking to evaluate which have the greatest potential to bind to the given protein.

In [None]:
# Rank binders based on MLM loss
ranked_binders = sorted(mlm_losses, key=mlm_losses.get)

print("Ranking of Potential Binders:")
for idx, binder in enumerate(ranked_binders, 1):
    print(f"{idx}. {binder} - MLM Loss: {mlm_losses[binder]}")
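
Keying the results by the full sequence makes each printed line very long, and two identical binder sequences would share a single entry. The sketch below ranks by short labels instead. The function name `rank_by_loss`, the labels, and the loss values are hypothetical and purely illustrative.

```python
def rank_by_loss(losses):
    """Return (label, loss) pairs sorted from lowest to highest loss."""
    return sorted(losses.items(), key=lambda item: item[1])


# Toy losses keyed by short labels (hypothetical values, not real results).
toy_losses = {"binder_1": 2.0, "binder_2": 1.0, "binder_3": 3.0}

for rank, (label, loss) in enumerate(rank_by_loss(toy_losses), start=1):
    print(f"{rank}. {label} - MLM Loss: {loss:.3f}")
```

A separate dictionary from label to sequence keeps the full sequences available without printing them in every line.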



Ranking of Potential Binders:
1. SEEEEERERALKEIIEETRRELKAAKAKHGKVVVVLIMASSTLEPEFILELSKALIKEMKSLFPNVVLIIVVVGLAPASLLARIRDVSLELAKYAKSLGIKVIVIVGNENEAVFVPAFEALGVEVIVDRTIIEIAAEELGLSEEEVLARFAAAAELLDELFAADPSLRERYARLDVAGATELLLERLRELFGAKVERHERLITVEVERVLTPDERRRVTAILLTPEAAREVVERLVDLVVDLILEKIAEGHNVLVLVFTPTIALAREVAALFEERRPLLEEAGAAVIIRLVARDPDTFLI - MLM Loss: 5.946854591369629
2. GKLNIKVTFLSSGKEEKLAALKAHVDALVASIDTKASGAPPLKVEVKESESKETREIDGKTYEYGFTTVTYSFEGTNDILNQLANDIVTHISNTLKDLLIEIDIAATSDGDLNLTINITVNGVDTVILLNVSLTAGTNVNLTINITVTGATVTVHIIVSLTTTSAGSATVTINATAGAGATLNITLMGVFTNTAVKDVTVNVTTTATSGTVTVTLGPVTQASAAEMAAGVAAAREAAREEALREVARLTE - MLM Loss: 6.72139835357666
3. SAAAAAQARLDAALAALREWLAARAREAIERYRDAKERVVEEEAITRDFHGVLTLEAVRIEVTPTTVAISARLRHASGQTVYLSILAPHDPAALEAALAIAELATRLALEAGYDLFVAVAFEPPGPVTPERWEEFAAFLELVAEDLRALLADAAAKGRPLLVVIVIVVNDDLAAHLPLESHTDPEAAAAAVATYVAEVEAKTGRKLTLPAEIAAALAAGASVVLVVARREDIAGVPARVEAALRAALAAA - MLM Loss: 6.789663314819336
4. LEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEVVLGNLEITYVQRNYDLSFLKTIQEVAGYVL