# Folding

This notebook shows a basic example of monomer folding with `prtm`.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from proteome import models
from proteome import visual
from proteome.query import caching





To get started, let's define a simple protein sequence that we'd like to fold. It should be all uppercase; missing residues can be specified with `X`.

In [5]:
sequence = "MAAHKGAEHHHKAAEHHEQAAKHHHAAAEHHEKGEHEQAAHHADTAYAHHKHAEEHAAQAAKHDAEHHAPKPH"
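
Before sending a sequence to a folding pipeline, it can help to check it against the conventions above. The helper below is a minimal sketch and is not part of `prtm`: `check_sequence` and `VALID_RESIDUES` are hypothetical names, and the allowed alphabet (the 20 standard amino acids plus `X`) is an assumption about what the models accept.

```python
# Hypothetical helper for illustration; not part of prtm.
# Assumed alphabet: 20 standard amino acids plus X for missing residues.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY") | {"X"}


def check_sequence(seq: str) -> list[str]:
    """Return a list of problems; an empty list means the sequence looks fine."""
    problems = []
    if not seq:
        problems.append("sequence is empty")
    if seq != seq.upper():
        problems.append("sequence contains lowercase characters")
    # Compare against the alphabet case-insensitively so each problem is reported once
    unexpected = sorted(set(seq.upper()) - VALID_RESIDUES)
    if unexpected:
        problems.append(f"unexpected characters: {''.join(unexpected)}")
    return problems


example = "MAAHKGAEHHHKAAEHHEQAAKHHHAAAEHHEKGEHEQAAHHADTAYAHHKHAEEHAAQAAKHDAEHHAPKPH"
print(check_sequence(example))
print(check_sequence("maaXhk*"))
```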

## Folding model comparison

Now we'll define a dict with all of the folding pipelines we want to try for this sequence. We'll avoid instantiating the pipelines for now to save memory. On instantiation, the model weights are downloaded to the default `torch.hub` directory (usually `~/.cache/torch/hub/checkpoints`) and the model is moved to the available GPU.

In [7]:
folders = {
    "OpenFold": models.OpenFoldForFolding,
    "OmegaFold": models.OmegaFoldForFolding,
    "RoseTTAFold": models.RoseTTAFoldForFolding,
    "ESMFold": models.ESMForFolding,
    "DMPFold": models.DMPFoldForFolding,
}
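
The reason for storing classes rather than instances can be shown with a small sketch. The two classes below are stand-ins invented for illustration (they are not `prtm` pipelines); they only count how often they are constructed, which is where a real pipeline would load its weights.

```python
# Stand-in classes for illustration only; they are not prtm pipelines.
class DummyFolderA:
    loaded = 0  # number of times this class has been instantiated

    def __init__(self):
        # A real pipeline would download weights and move the model to the GPU here
        type(self).loaded += 1

    def __call__(self, sequence):
        return f"A:{len(sequence)}", {}


class DummyFolderB(DummyFolderA):
    loaded = 0

    def __call__(self, sequence):
        return f"B:{len(sequence)}", {}


# Storing the classes costs nothing: no constructor has run yet
registry = {"A": DummyFolderA, "B": DummyFolderB}
loaded_before = (DummyFolderA.loaded, DummyFolderB.loaded)

results = {}
for name, folder_cls in registry.items():
    folder = folder_cls()  # the expensive step happens only here
    results[name], _ = folder("MAAHK")
    del folder  # drop the reference so the model can be garbage collected

loaded_after = (DummyFolderA.loaded, DummyFolderB.loaded)
print(loaded_before, loaded_after, results)
```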

Three of the models we're using require MSAs for inference (`OpenFold`, `RoseTTAFold`, `DMPFold`). `prtm` will perform MSA queries automatically using `jackhmmer` and the databases released with `AlphaFold`. Once the queries are completed, the results are cached locally in a simple `sqlite` database. Any folding pipeline that requires MSAs will check the cache before recomputing. By default, the cache is stored in `~/.proteome/queries.db`. The cache takes both the input sequence and the parameters of the querying pipeline into account, so changing either one triggers a new query. To save time for this example, we'll use pre-computed MSAs by changing the default cache path.

In [8]:
caching.set_db_path("./cached_queries_for_testing.db")
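
To make the caching behaviour concrete, here is a minimal sketch of a query cache that depends on both the sequence and the query parameters. It is not how `prtm` implements its cache: the `QueryCache` class, the table layout, the hashing scheme, and the parameter names (`database`, `n_iter`) are all assumptions made for illustration.

```python
import hashlib
import json
import sqlite3


def make_key(sequence, params):
    """Hash the sequence together with the query parameters (hypothetical scheme)."""
    payload = json.dumps({"sequence": sequence, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class QueryCache:
    """Tiny sqlite-backed cache; illustrative only, not prtm's implementation."""

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS queries (key TEXT PRIMARY KEY, result TEXT)"
        )

    def get(self, sequence, params):
        row = self.conn.execute(
            "SELECT result FROM queries WHERE key = ?", (make_key(sequence, params),)
        ).fetchone()
        return None if row is None else json.loads(row[0])

    def put(self, sequence, params, result):
        self.conn.execute(
            "INSERT OR REPLACE INTO queries VALUES (?, ?)",
            (make_key(sequence, params), json.dumps(result)),
        )
        self.conn.commit()


cache = QueryCache()
params = {"database": "uniref90", "n_iter": 1}  # hypothetical parameter names
cache.put("MAAHK", params, ["MAAHK", "MAGHK"])

hit = cache.get("MAAHK", params)  # same sequence and parameters: cached
miss = cache.get("MAAHK", {"database": "uniref90", "n_iter": 3})  # parameters changed
print(hit, miss)
```

Because the parameters are part of the key, changing a query setting never returns a stale alignment computed under different settings.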

For simplicity we'll be using the default model weights for each folding pipeline; however, some have multiple options that can be tested. We can get a list to choose from with the `available_models` property.

In [10]:
# List models to choose from for models.OpenFoldForFolding
folders["OpenFold"].available_models

['finetuning-3',
 'finetuning-4',
 'finetuning-5',
 'finetuning_ptm-2',
 'finetuning_no_templ_ptm-1']

Let's do some folding! Every pipeline in `prtm` has at least two outputs. The last output is always a dictionary of `aux_outputs` that includes things like confidence scores and loss metrics. The first output of a folding model is a protein structure class that we'll discuss in detail in another notebook.
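
Since the auxiliary dictionary is always last, starred unpacking gives a pattern that works regardless of how many outputs a pipeline returns. The function below is a stand-in invented for this sketch, not a `prtm` pipeline.

```python
def dummy_pipeline(sequence):
    """Stand-in returning (primary_output, extra_output, aux_outputs)."""
    return sequence[::-1], len(sequence), {"confidence": 0.5}


# Everything before the last element lands in main_outputs, the dict in aux
*main_outputs, aux = dummy_pipeline("MAAHK")
print(main_outputs, aux)
```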

In [12]:
predicted_structures = {}
aux_outputs = {}
for folder_name, fold_pipeline in folders.items():
    print(f"Running {folder_name}...")
    # Initialize the folder model with defaults
    folder = fold_pipeline()
    # Run pipelines with the __call__ method
    pred_structure, aux = folder(sequence)
    predicted_structures[folder_name] = pred_structure
    aux_outputs[folder_name] = aux

Running OpenFold...
Running jackhmmer on uniref90 database...
Running jackhmmer on smallbfd database...
Running jackhmmer on mgnify database...
58 sequences found in uniref90
110 sequences found in smallbfd
9 sequences found in mgnify
Running OmegaFold...
Running RoseTTAFold...
Running jackhmmer on uniref90 database...
Running jackhmmer on smallbfd database...
Running jackhmmer on mgnify database...
58 sequences found in uniref90
110 sequences found in smallbfd
9 sequences found in mgnify




Running ESMFold...
Running DMPFold...
Running jackhmmer on uniref90 database...
Running jackhmmer on smallbfd database...
Running jackhmmer on mgnify database...
58 sequences found in uniref90
110 sequences found in smallbfd
9 sequences found in mgnify


Let's check the outputs.

In [15]:
aux_outputs

{'OpenFold': {'mean_plddt': 90.01612854003906},
 'OmegaFold': {'confidence': 0.9373629689216614},
 'RoseTTAFold': {'mean_plddt': 0.8220178484916687},
 'ESMFold': {'mean_plddt': 85.26000213623047},
 'DMPFold': {'confidence': 42.578731536865234}}
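
The metric names differ between pipelines, and even metrics with the same name appear to be on different scales here (compare the `mean_plddt` of `RoseTTAFold` with that of `OpenFold`), so the values are not directly comparable across models. The sketch below only regroups the dictionary shown above by metric name to make this easier to see; it does not rescale anything.

```python
from collections import defaultdict

# Values copied from the output above
aux_outputs = {
    "OpenFold": {"mean_plddt": 90.01612854003906},
    "OmegaFold": {"confidence": 0.9373629689216614},
    "RoseTTAFold": {"mean_plddt": 0.8220178484916687},
    "ESMFold": {"mean_plddt": 85.26000213623047},
    "DMPFold": {"confidence": 42.578731536865234},
}

# Regroup as metric name -> {model name: value}
by_metric = defaultdict(dict)
for model_name, metrics in aux_outputs.items():
    for metric_name, value in metrics.items():
        by_metric[metric_name][model_name] = value

for metric_name in sorted(by_metric):
    print(metric_name)
    for model_name, value in by_metric[metric_name].items():
        print(f"  {model_name:<12} {value:.3f}")
```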

In [16]:
predicted_structures

{'OpenFold': <proteome.protein.Protein37 at 0x7f6dcad83c40>,
 'OmegaFold': <proteome.protein.Protein14 at 0x7f6dd8ea5a80>,
 'RoseTTAFold': <proteome.protein.Protein4 at 0x7f6dcaf5ed70>,
 'ESMFold': <proteome.protein.Protein37 at 0x7f6dca96a9e0>,
 'DMPFold': <proteome.protein.Protein5 at 0x7f6dcae79ba0>}

As described, the `aux_outputs` contain only global measures of structure confidence. The predicted structures are instances of a few different `protein` classes. The number after `Protein` gives the number of atoms stored per residue. In this case, `Protein37` and `Protein14` are two different ways of representing a protein structure with sidechains included, while `Protein5` and `Protein4` are structures that only include atoms in the backbone (`N`, `CA`, `C`, `O`, `CB`). To view any structure with per-residue confidence predictions, we simply call `show` on the structure.
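
The naming scheme can be read off programmatically. The sketch below parses the numeric suffix of a class name and treats it as a per-residue atom count; the helper names are hypothetical, and the `(residues, atoms, 3)` coordinate layout is an assumption borrowed from common practice rather than something confirmed for `prtm`.

```python
import re

# Atom names listed in the text above for the backbone-only classes
BACKBONE_ATOMS = ("N", "CA", "C", "O", "CB")


def atoms_per_residue(class_name: str) -> int:
    """Parse the numeric suffix of a class name such as 'Protein37'."""
    match = re.fullmatch(r"Protein(\d+)", class_name)
    if match is None:
        raise ValueError(f"no atom count in class name: {class_name!r}")
    return int(match.group(1))


def has_sidechains(class_name: str) -> bool:
    """More atom slots than the backbone set implies sidechain atoms are stored."""
    return atoms_per_residue(class_name) > len(BACKBONE_ATOMS)


def coordinate_shape(class_name: str, n_residues: int) -> tuple:
    """Assumed layout: one xyz triple per atom slot per residue."""
    return (n_residues, atoms_per_residue(class_name), 3)


for name in ("Protein37", "Protein14", "Protein5", "Protein4"):
    print(name, has_sidechains(name), coordinate_shape(name, n_residues=10))
```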

In [20]:
# A structure with sidechains
# We can color the structure with any matplotlib colormap
predicted_structures["OmegaFold"].show(cmap="jet")

<py3Dmol.view at 0x7f6dcad1c760>

In [21]:
# A structure without sidechains
predicted_structures["RoseTTAFold"].show(cmap="jet")

<py3Dmol.view at 0x7f6f7ff72290>

For this kind of comparison, we'd really like to see the structures together in a single figure. There are two options. First, we can superimpose any two structures:

In [22]:
# The first structure is shown with some opacity for ease of visualization
visual.view_superimposed_structures(
    predicted_structures["OpenFold"], predicted_structures["ESMFold"], color1="green"
)

<py3Dmol.view at 0x7f6dcaec9660>

Second, we can view all of the structures in a grid with locked views:

In [25]:
# When working with a mixture of structures that don't all have
# sidechains it's usually better to turn them off
visual.view_aligned_structures_grid(
    list(predicted_structures.values()), cmap="jet", show_sidechains=False
)

<py3Dmol.view at 0x7f6f8496aef0>

Finally, we can export any `protein` structure to `PDB`. Behind the scenes, `prtm` ensures that structures with and without sidechains are written correctly, so there's no need to do any manual conversions.

In [None]:
for folder_name, pred_structure in predicted_structures.items():
    with open(f"{folder_name}_prediction.pdb", mode="w") as f:
        f.writelines(pred_structure.to_pdb())
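
A quick sanity check on an exported file is to count its `CA` atoms and compare the result with the sequence length. The sketch below relies only on the fixed-column layout of PDB `ATOM` records and uses a small hand-written sample rather than a real prediction; `count_ca_atoms` is a hypothetical helper, and it ignores complications such as multiple models or alternate locations.

```python
def count_ca_atoms(pdb_text: str) -> int:
    """Count alpha-carbon ATOM records in PDB-formatted text."""
    count = 0
    for line in pdb_text.splitlines():
        # ATOM records use fixed columns; the atom name occupies columns 13-16
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            count += 1
    return count


# Hand-written sample with two residues (coordinates are arbitrary)
sample_pdb = "\n".join(
    [
        "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N",
        "ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38           C",
        "ATOM      3  C   MET A   1      26.913  26.639   3.531  1.00  9.62           C",
        "ATOM      4  N   ALA A   2      26.335  27.770   3.258  1.00  9.62           N",
        "ATOM      5  CA  ALA A   2      26.850  29.021   3.898  1.00  9.62           C",
        "END",
    ]
)

n_residues = count_ca_atoms(sample_pdb)
print(n_residues)
```

For a real export, the same function could be applied to the contents of one of the `*_prediction.pdb` files and compared with `len(sequence)`.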

## Conformation Sampling

All the folding models we've looked at so far are (nearly) deterministic. We can sample possible conformations by using `EigenFold` instead. `EigenFold` is built on top of `OmegaFold` but adds a sampling procedure during structure decoding.

In [9]:
fold_sampler = models.EigenFoldForFoldSampling(random_seed=0)

In [10]:
sampled_structures = []
sampled_aux_outputs = []
for _ in range(5):
    sampled_structure, sampled_aux = fold_sampler(sequence)
    sampled_structures.append(sampled_structure)
    sampled_aux_outputs.append(sampled_aux)

100%|███████████████████████████████| 144/144 [00:02<00:00, 48.56it/s]
100%|███████████████████████████████| 144/144 [00:02<00:00, 57.15it/s]
100%|███████████████████████████████| 144/144 [00:02<00:00, 55.43it/s]
100%|███████████████████████████████| 144/144 [00:02<00:00, 56.04it/s]
100%|███████████████████████████████| 144/144 [00:02<00:00, 55.15it/s]
100%|███████████████████████████████| 144/144 [00:02<00:00, 55.44it/s]
100%|███████████████████████████████| 144/144 [00:02<00:00, 54.46it/s]
100%|███████████████████████████████| 144/144 [00:02<00:00, 55.34it/s]
100%|███████████████████████████████| 144/144 [00:02<00:00, 54.23it/s]
100%|███████████████████████████████| 144/144 [00:02<00:00, 55.30it/s]


In [11]:
sampled_aux_outputs

[{'elbo': 0.03906942158937454},
 {'elbo': 0.02422763779759407},
 {'elbo': 0.1434585452079773},
 {'elbo': 0.07470780611038208},
 {'elbo': 0.04018719866871834}]

In [13]:
sampled_structures

[<proteome.protein.ProteinCATrace at 0x7ff8e6a10a00>,
 <proteome.protein.ProteinCATrace at 0x7ff8e6a112d0>,
 <proteome.protein.ProteinCATrace at 0x7ff8e6a11240>,
 <proteome.protein.ProteinCATrace at 0x7ff8e6a10100>,
 <proteome.protein.ProteinCATrace at 0x7ff8e6a10130>]

This time we get `elbo` values, which in this case are a measure of the likelihood of a structure. The structures returned by `EigenFold` are `CA` traces, which means that only a single backbone atom per residue is predicted. The visualization tools for these structures are a bit different, but we can still call the `show` method to view them.
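
One simple use of these values is to rank the samples. The sketch below sorts the sample indices by `elbo`, assuming that a higher value indicates a more likely structure under the model (the ELBO is a lower bound on the log-likelihood); the numbers are copied from the output above.

```python
# Values copied from the output above
sampled_aux_outputs = [
    {"elbo": 0.03906942158937454},
    {"elbo": 0.02422763779759407},
    {"elbo": 0.1434585452079773},
    {"elbo": 0.07470780611038208},
    {"elbo": 0.04018719866871834},
]

# Sample indices ordered from highest to lowest ELBO
ranked = sorted(
    range(len(sampled_aux_outputs)),
    key=lambda i: sampled_aux_outputs[i]["elbo"],
    reverse=True,
)
best_index = ranked[0]
print(ranked, best_index)
```

With a list of structures in the same order, `sampled_structures[best_index]` would then be the top-ranked sample.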

In [18]:
# We can't specify pyplot colormaps anymore; we can only pass basic color names or HEX colors
sampled_structures[0].show(cmap="green")

NGLWidget()

As before, we can superimpose the structures to see more easily where they differ.

In [19]:
visual.view_superimposed_ca_traces(sampled_structures)

NGLWidget()

Although this is a very simple structure, this comparison shows us which parts of the structure are likely less stable (the results overlap nicely with the confidence predictions of the other folding models).
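
The visual comparison can be complemented with a number per residue. The sketch below superimposes a set of `CA` traces onto the first one with the Kabsch algorithm and then computes the root-mean-square fluctuation (RMSF) of each residue across samples. It runs on synthetic coordinates: how to extract coordinate arrays from the `prtm` structures is not shown here, and the `(n_residues, 3)` array layout is an assumption.

```python
import numpy as np


def kabsch_align(mobile, reference):
    """Rigidly superimpose `mobile` onto `reference`; both have shape (n_res, 3)."""
    ref_center = reference.mean(axis=0)
    P = mobile - mobile.mean(axis=0)
    Q = reference - ref_center
    U, _, Vt = np.linalg.svd(P.T @ Q)
    # Guard against an improper rotation (reflection)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return P @ R.T + ref_center


def per_residue_rmsf(traces):
    """RMSF of each residue around the mean structure after superposition."""
    aligned = np.stack([kabsch_align(t, traces[0]) for t in traces])
    mean_structure = aligned.mean(axis=0)
    return np.sqrt(((aligned - mean_structure) ** 2).sum(axis=-1).mean(axis=0))


# Synthetic data: five rotated copies of one trace, with noise on the last 5 residues
rng = np.random.default_rng(0)
base = rng.normal(scale=10.0, size=(30, 3))
traces = []
for k in range(5):
    angle = 0.3 * k
    rot = np.array(
        [
            [np.cos(angle), -np.sin(angle), 0.0],
            [np.sin(angle), np.cos(angle), 0.0],
            [0.0, 0.0, 1.0],
        ]
    )
    coords = base @ rot.T
    coords[25:] += rng.normal(scale=2.0, size=(5, 3))  # the "flexible" region
    traces.append(coords)

rmsf = per_residue_rmsf(traces)
print(np.round(rmsf, 2))
```

Residues with a high RMSF are the ones that move most between samples, which is the quantity the superimposed view conveys visually.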