# Inverse Folding

This notebook shows a basic example of monomer inverse folding / sequence design with `prtm`.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from proteome import models
from proteome import protein
from proteome import visual
from proteome.query import caching

Let's load the structure of a designed protein.

In [None]:
pdb_str = protein.get_structure_from_pdb("5L33")
structure = protein.Protein37.from_pdb_string(pdb_str)

Now we'll define a dict with all of the inverse folding pipelines we want to try for this structure. In addition, we'll load a folding model to test the generated sequences. `OmegaFold` is a good choice for de novo designed structures because it doesn't rely on MSAs.

In [None]:
inverse_folders = {
    "ProteinMPNN": models.ProteinMPNNForInverseFolding(),
    "ESMIF": models.ESMForInverseFolding(),
    "ProteinSolver": models.ProteinSolverForInverseFolding(),
    # Note that this model requires a pyrosetta installation.
    # Comment it out if pyrosetta is not available.
    "ProteinSeqDes": models.ProteinSeqDesForInverseFolding()
}
folder = models.OmegaFoldForFolding()

We use the term `inverse folding` for these pipelines to make the expected inputs and outputs clear, although the term `sequence design` is also commonly used. Aside from `ProteinSolver`, which is nearly deterministic, all of the defined `inverse_folders` use some sampling procedure to create diverse sequences. Diverse candidates are useful because they increase the odds of finding a sequence that actually folds into the designed structure in vitro.

In [None]:
designed_sequences = {}
aux_outputs = {}
for if_name, inverse_folder in inverse_folders.items():
    print(f"Running {if_name}...")
    designed_sequences[if_name] = []
    aux_outputs[if_name] = []
    # Generate 3 possible sequences with each inverse folder
    for _ in range(3):
        designed_sequence, aux_output = inverse_folder(structure)
        designed_sequences[if_name].append(designed_sequence)
        aux_outputs[if_name].append(aux_output)

`ProteinSeqDes` is notably slower than the other algorithms because it uses a learned potential function to run a traditional energy minimization procedure.  
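
To check how diverse the sampled sequences really are, one option is to compute the mean pairwise identity among the sequences from a single model. The sketch below is not part of `prtm`: it assumes each designed sequence is a plain string of one-letter residue codes and that all sequences have the same length (which should hold when they are designed for the same backbone). The helper names and the toy sequences are illustrative placeholders, not real model outputs.

```python
from itertools import combinations
from typing import Sequence


def sequence_identity(seq_a: str, seq_b: str) -> float:
    """Fraction of positions carrying the same residue in two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Sequences must have the same length")
    if not seq_a:
        raise ValueError("Sequences must be non-empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


def mean_pairwise_identity(sequences: Sequence[str]) -> float:
    """Average identity over all unordered pairs of sequences."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        raise ValueError("Need at least two sequences")
    return sum(sequence_identity(a, b) for a, b in pairs) / len(pairs)


# Toy placeholder sequences (hypothetical, not produced by any model)
toy_sequences = ["MKTAYIAK", "MKTAYLAK", "MRTAYIAR"]
toy_mean_identity = mean_pairwise_identity(toy_sequences)
print(f"Mean pairwise identity: {toy_mean_identity:.2f}")
```

With real outputs, something like `mean_pairwise_identity(designed_sequences["ProteinMPNN"])` would apply, provided the entries are strings. A value close to 1.0 means the sampler is returning nearly the same sequence each time.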

Looking at the `aux_outputs` first, we get a score for each designed sequence (higher is better) or, in the case of `ProteinSeqDes`, an estimated energy for the structure (lower is better).

In [None]:
aux_outputs
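
Because the direction of "better" differs between models, picking the best candidate per model needs an explicit flag. The sketch below assumes a single float has already been extracted from each `aux_output`; the exact layout of those outputs is not shown here, so the helper `best_index` and the numbers are hypothetical placeholders.

```python
from typing import Sequence


def best_index(values: Sequence[float], higher_is_better: bool) -> int:
    """Return the index of the best value given the direction of the metric."""
    if not values:
        raise ValueError("Need at least one value")
    chooser = max if higher_is_better else min
    return chooser(range(len(values)), key=lambda i: values[i])


# Hypothetical placeholder numbers, not real model outputs
toy_scores = [-1.32, -0.97, -1.10]    # score-like metric: higher is better
toy_energies = [398.5, 412.0, 405.2]  # energy-like metric: lower is better

best_by_score = best_index(toy_scores, higher_is_better=True)
best_by_energy = best_index(toy_energies, higher_is_better=False)
print(best_by_score, best_by_energy)
```

Note that scores and energies from different models are not on a common scale, so this only ranks candidates within one model.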

Let's fold the first sequence designed by each model and then compare the results to our desired structure.

In [None]:
predicted_structures = {}
folder_aux_outputs = {}
for if_name, sequences in designed_sequences.items():
    # Fold the first designed sequence from each
    print(f"Folding {if_name} sequence...")
    predicted_structure, folder_aux_output = folder(sequences[0])
    predicted_structures[if_name] = predicted_structure
    folder_aux_outputs[if_name] = folder_aux_output

We'd expect higher folding confidence to correlate with better sequences (or at least with sequences that are similar to those found in the `PDB`).

In [None]:
folder_aux_outputs

In [None]:
visual.view_aligned_structures_grid(
    [structure] + list(predicted_structures.values()), cmap="viridis", bfactor_is_confidence=True
)

In [None]:
visual.view_superimposed_structures(structure, predicted_structures["ProteinMPNN"])
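
The visual comparison can be complemented with a number. As a minimal sketch, the code below computes a distance-based RMSD (dRMSD), which compares the pairwise distances inside each structure and therefore needs no superposition step. This is a simpler, swapped-in metric rather than the superposition-based RMSD, and it is not a `prtm` function. It assumes the C-alpha coordinates of both structures have already been extracted as equal-length lists of `(x, y, z)` tuples; how to pull them out of the structure objects is not shown here, and the toy coordinates are placeholders.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def pairwise_distances(coords: Sequence[Coord]) -> List[float]:
    """Euclidean distances between all unordered pairs of points."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]


def drmsd(coords_a: Sequence[Coord], coords_b: Sequence[Coord]) -> float:
    """Root-mean-square difference between the two intra-structure distance sets."""
    if len(coords_a) != len(coords_b):
        raise ValueError("Structures must have the same number of points")
    if len(coords_a) < 2:
        raise ValueError("Need at least two points per structure")
    dists_a = pairwise_distances(coords_a)
    dists_b = pairwise_distances(coords_b)
    squared = sum((da - db) ** 2 for da, db in zip(dists_a, dists_b))
    return math.sqrt(squared / len(dists_a))


# Toy placeholder coordinates (hypothetical): a square with 3.8 A sides
toy_reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
# The same square rotated 90 degrees about z and translated by (10, 5, 1)
toy_moved = [(10.0, 5.0, 1.0), (10.0, 8.8, 1.0), (6.2, 8.8, 1.0), (6.2, 5.0, 1.0)]
# The square stretched along x, so its internal distances change
toy_stretched = [(0.0, 0.0, 0.0), (4.8, 0.0, 0.0), (4.8, 3.8, 0.0), (0.0, 3.8, 0.0)]

print(f"moved:     {drmsd(toy_reference, toy_moved):.3f}")
print(f"stretched: {drmsd(toy_reference, toy_stretched):.3f}")
```

Because it only looks at internal distances, dRMSD ignores rigid-body motion, but it also cannot tell a structure from its mirror image, so it is best read alongside the superimposed view rather than instead of it.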