# Folding

This notebook shows a basic example of monomer folding with `prtm`.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from prtm import models
from prtm import visual
from prtm.query import caching

To get started, let's define a simple protein sequence that we'd like to fold. It should be all uppercase; missing residues can be specified with `X`.

In [None]:
sequence = "MAAHKGAEHHHKAAEHHEQAAKHHHAAAEHHEKGEHEQAAHHADTAYAHHKHAEEHAAQAAKHDAEHHAPKPH"
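Before handing a sequence to a folding pipeline, it can help to check it against the constraints described above. The following is a minimal sketch, not part of `prtm`: the helper name `validate_sequence` and the allowed alphabet (the 20 standard amino acids plus `X`) are assumptions made for illustration, and `prtm` itself may accept a different set of characters.

```python
# Assumed alphabet: the 20 standard amino acids plus "X" for missing residues.
# This is an illustrative choice, not taken from prtm.
ALLOWED_RESIDUES = frozenset("ACDEFGHIKLMNPQRSTVWY" + "X")


def validate_sequence(seq: str) -> str:
    """Return the stripped sequence, or raise ValueError if it looks malformed."""
    seq = seq.strip()
    if not seq:
        raise ValueError("sequence is empty")
    # Collect every character outside the assumed alphabet (this also
    # catches lowercase letters, since the alphabet is uppercase only).
    invalid = sorted({ch for ch in seq if ch not in ALLOWED_RESIDUES})
    if invalid:
        raise ValueError(f"unexpected characters in sequence: {invalid}")
    return seq
```

Calling `validate_sequence(sequence)` on the sequence above should return it unchanged, while a lowercase or otherwise malformed input raises a `ValueError` that names the offending characters.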

## Folding model comparison

Now we'll define a dict with all of the folding pipelines we want to try for this sequence. We'll avoid instantiating the pipelines for now to save memory. Upon instantiation, model weights are downloaded to the default `torch.hub` directory (usually `~/.cache/torch/hub/checkpoints`) and the model is moved to the currently available GPU.

In [None]:
folders = {
    "OpenFold": models.OpenFoldForFolding,
    "OmegaFold": models.OmegaFoldForFolding,
    "RoseTTAFold": models.RoseTTAFoldForFolding,
    "ESMFold": models.ESMForFolding,
    "DMPFold": models.DMPFoldForFolding,
}

Three of the models we're using require multiple sequence alignments (MSAs) for inference (`OpenFold`, `RoseTTAFold`, `DMPFold`). `prtm` will perform MSA queries automatically using `jackhmmer` and the databases released with `AlphaFold`. Once queries are completed, the results are cached locally in a simple `sqlite` database. Any folding pipeline that requires MSAs will first check the cache before recomputing. By default the cache is stored in `~/.prtm/queries.db`. Cache entries depend on both the input sequence and the parameters used in the querying pipeline, so changing either one triggers a new query. To save time for this example, we'll use pre-computed MSAs by changing the default cache path.

In [None]:
caching.set_db_path("./cached_queries_for_testing.db")
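To make the caching behavior more concrete, here is a minimal sketch of a cache keyed on both the sequence and the query parameters. It is not `prtm`'s implementation: the `QueryCache` class, the `make_cache_key` helper, the table schema, and the `n_iter` parameter used in the usage note are all hypothetical and only illustrate the general idea with the standard library.

```python
import hashlib
import json
import sqlite3


def make_cache_key(sequence: str, params: dict) -> str:
    """Hash the sequence together with the query parameters.

    sort_keys=True makes the key independent of dict ordering.
    """
    payload = json.dumps({"sequence": sequence, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class QueryCache:
    """Toy sqlite-backed cache (hypothetical schema, not prtm's)."""

    def __init__(self, db_path: str = ":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS queries (key TEXT PRIMARY KEY, result TEXT)"
        )

    def get(self, sequence: str, params: dict):
        """Return the cached result, or None on a cache miss."""
        row = self.conn.execute(
            "SELECT result FROM queries WHERE key = ?",
            (make_cache_key(sequence, params),),
        ).fetchone()
        return None if row is None else row[0]

    def put(self, sequence: str, params: dict, result: str) -> None:
        """Store (or overwrite) the result for this sequence and parameters."""
        self.conn.execute(
            "INSERT OR REPLACE INTO queries (key, result) VALUES (?, ?)",
            (make_cache_key(sequence, params), result),
        )
        self.conn.commit()
```

With this sketch, `cache.put("MAAHK", {"n_iter": 1}, "...")` followed by `cache.get("MAAHK", {"n_iter": 1})` returns the stored value, while the same sequence with `{"n_iter": 2}` is a miss and would be recomputed.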

For simplicity we'll be using the default model weights for each folding pipeline; however, some have multiple options that can be tested. We can get a list to choose from with the `available_models` property.

In [None]:
# List models to choose from for models.OpenFoldForFolding
folders["OpenFold"].available_models

Let's do some folding! Every pipeline in `prtm` has at least two outputs. The last output is always a dictionary of `aux_outputs` that includes things like confidence scores, loss metrics, etc. The first output of folding models is a protein structure class that we'll discuss in detail in another notebook.

In [None]:
predicted_structures = {}
aux_outputs = {}
for folder_name, fold_pipeline in folders.items():
    print(f"Running {folder_name}...")
    # Initialize the folder model with defaults
    folder = fold_pipeline()
    # Run pipelines with the __call__ method
    pred_structure, aux = folder(sequence)
    predicted_structures[folder_name] = pred_structure
    aux_outputs[folder_name] = aux

Let's check the outputs.

In [None]:
aux_outputs

In [None]:
predicted_structures

As described, the `aux_outputs` just contain global measures of structure confidence. The predicted structures are a few different kinds of `protein` classes. The number after `Protein` gives the number of atoms represented per residue. In this case, `Protein37` and `Protein14` are two different ways of representing a protein structure with sidechains included, while `Protein5` and `Protein4` only include backbone atoms (`N`, `CA`, `C`, `O`, plus `CB` for `Protein5`). To view any structure with per-residue confidence predictions, we simply call `show` on the structure.

In [None]:
# A structure with sidechains
# We can color the structure with any matplotlib colormap
predicted_structures["OmegaFold"].show(cmap="jet")

In [None]:
# A structure without sidechains
predicted_structures["RoseTTAFold"].show(cmap="jet")

For this kind of comparison, we'd really like to see the structures together in a single figure. There are two options. First, we can superimpose any two structures:

In [None]:
# The first structure is shown with some opacity for ease of visualization
visual.view_superimposed_structures(
    predicted_structures["OpenFold"], predicted_structures["ESMFold"], color1="green"
)

Second, we can view all of the structures in a grid with locked views:

In [None]:
# When working with a mixture of structures that don't all have
# sidechains it's usually better to turn them off
visual.view_aligned_structures_grid(
    list(predicted_structures.values()), cmap="jet", show_sidechains=False
)

Finally, we can export any `protein` structure to `PDB`. Behind the scenes, `prtm` will ensure that structures with and without sidechains are written correctly, so there's no need to do any manual conversions.

In [None]:
for folder_name, pred_structure in predicted_structures.items():
    with open(f"{folder_name}_prediction.pdb", mode="w") as f:
        f.writelines(pred_structure.to_pdb())

## Conformation Sampling

All the folding models we've looked at so far are (nearly) deterministic. We can sample possible conformations by using `EigenFold` instead. `EigenFold` is built on top of `OmegaFold` but adds a sampling procedure during structure decoding.

In [None]:
fold_sampler = models.EigenFoldForFoldSampling(random_seed=0)

In [None]:
sampled_structures = []
sampled_aux_outputs = []
for _ in range(5):
    sampled_structure, sampled_aux = fold_sampler(sequence)
    sampled_structures.append(sampled_structure)
    sampled_aux_outputs.append(sampled_aux)

In [None]:
sampled_aux_outputs

In [None]:
sampled_structures

This time we get `elbo` values, which in this case are a measure of the likelihood of a structure. The structures returned by `EigenFold` are `CA` traces, which means that only a single backbone atom per residue is predicted. The visualization tools for these structures are a bit different, but we can still call the `show` method to view them.

In [None]:
# We can't specify pyplot colormaps any more; we can only use basic color names or pass HEX colors
sampled_structures[0].show(cmap="green")

Like before we can superimpose structures to see more easily where they differ.

In [None]:
visual.view_superimposed_ca_traces(sampled_structures)

Although this is a very simple structure, this comparison shows us which parts of the structure are likely less stable (the results overlap nicely with the confidence predictions of the other folding models).
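The visual comparison can also be summarized numerically. The sketch below scores each residue by how much its distances to all other residues vary across the sampled conformations, which avoids the need to superimpose the structures first. It is an illustrative measure, not something `prtm` or `EigenFold` provides: the function name `per_residue_variability` is hypothetical, and the input is assumed to be plain lists of `(x, y, z)` `CA` coordinates. How to extract those coordinates from the sampled structures is not shown here.

```python
import math
from statistics import mean, pstdev


def per_residue_variability(traces):
    """Score each residue by the spread of its CA-CA distances across samples.

    traces: list of conformations, each a list of (x, y, z) CA coordinates.
    Returns one score per residue; larger values mean more variable positions.
    """
    n_res = len(traces[0])
    if n_res < 2:
        raise ValueError("need at least two residues")
    if any(len(trace) != n_res for trace in traces):
        raise ValueError("all traces must have the same number of residues")

    scores = []
    for i in range(n_res):
        spreads = []
        for j in range(n_res):
            if i == j:
                continue
            # Distance between residues i and j in every sampled conformation.
            dists = [math.dist(trace[i], trace[j]) for trace in traces]
            # Population standard deviation of that distance across samples.
            spreads.append(pstdev(dists))
        scores.append(mean(spreads))
    return scores
```

Residues with the highest scores are the ones whose position relative to the rest of the chain changes most between samples, which is one rough way to flag the regions that look less stable in the superimposed view.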