# Unconditional Design

This notebook covers algorithms for the unconditional design of protein structures.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from proteome import models
from proteome import protein
from proteome import visual

In unconditional design, we generate structures without any constraints other than the number of residues. This is primarily useful for benchmarking different diffusion spaces, and it can also produce novel structures that may serve as training data for other prediction tasks. One notable distinction among the models below is that `RFDiffusion` is the only one finetuned from a pretrained folding model (`RoseTTAFold`); in general, this seems to result in more designable structures.

In [None]:
structure_designers = {
    "FoldingDiff": models.FoldingDiffForStructureDesign(),
    "Genie": models.GenieForStructureDesign(),
    "SE3Diffusion": models.SE3DiffusionForStructureDesign(),
    # Because RFDiffusion has multiple models for slightly different
    # sets of inputs, we set model_name to "auto". This won't download
    # or load any particular weights until inference.
    "RFDiffusion": models.RFDiffusionForStructureDesign(model_name="auto"),
}

The output of each of these pipelines is a protein structure without a corresponding amino acid sequence. To create a sequence and test each structure's designability, we'll add an inverse folding pipeline and a folding pipeline as well.

In [None]:
# Some algorithms output CA traces, so we load the CA-only inverse folding model
inverse_folder = models.ProteinMPNNForInverseFolding(model_name="ca_only_model-20")
folder = models.OmegaFoldForFolding()

## Structure Design

To compare the methods fairly, we'll generate structures with 128 residues. This requires passing an `InferenceConfig` to each pipeline. Relevant configs can be imported from `models` with the naming convention `{model_name}_config.InferenceConfig`. `RFDiffusion` is an exception to this pattern and is described in more detail in a standalone notebook.

In [None]:
inference_params = {
    # Maximum sequence length is 128 for FoldingDiff so we keep it
    # for all pipelines
    "FoldingDiff": models.foldingdiff_config.InferenceConfig(seq_len=128),
    # Depending on the model Genie can generate up to 256 residue structures
    "Genie": models.genie_config.InferenceConfig(seq_len=128),
    # SE3Diffusion was tested up to 500 residues
    "SE3Diffusion": models.se3_diffusion_config.InferenceConfig(length=128),
    # RFDiffusion has a very different inference config setup that we'll
    # discuss in a dedicated notebook.
    "RFDiffusion": models.rfdiffusion_config.UnconditionalSamplerConfig(
        contigmap_params=models.rfdiffusion_config.ContigMap(contigs=["128-128"]),
    ),
}
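
The `contigs=["128-128"]` entry above reads as a length range. The sketch below illustrates that reading, assuming the string follows the `min-max` convention used by RFdiffusion, where a length is drawn from the inclusive range and identical bounds pin it to a single value. The helper `sample_length` is hypothetical and is not part of `proteome`.

```python
import random


def sample_length(contig: str, rng: random.Random) -> int:
    """Sample a residue count from a "min-max" range string (hypothetical helper)."""
    low_text, high_text = contig.split("-")
    low, high = int(low_text), int(high_text)
    if low > high:
        raise ValueError(f"invalid length range: {contig!r}")
    # randint is inclusive on both ends, so "128-128" always yields 128.
    return rng.randint(low, high)


rng = random.Random(0)
fixed_length = sample_length("128-128", rng)
variable_length = sample_length("100-200", rng)
print(fixed_length, variable_length)
```

Pinning both bounds to 128 keeps the `RFDiffusion` output the same length as the other three pipelines.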

Generate structures with the given inference parameters.

In [None]:
designed_structures = {}
aux_outputs = {}
for sd_name, structure_designer in structure_designers.items():
    print(f"Running {sd_name}...")
    designed_structure, aux_output = structure_designer(inference_params[sd_name])
    
    designed_structures[sd_name] = designed_structure
    aux_outputs[sd_name] = aux_output

In this case there are no `aux_outputs` from these models.

In [None]:
aux_outputs

In [None]:
designed_structures

The designed structures are mostly `CA` traces, except for the one from `RFDiffusion`. Because these models don't design a sequence to pair with the structure, predicting sidechains would be meaningless: sidechains are residue specific. If we print the sequence of any of the designed structures, we'll get a string of glycines, because glycine is the only residue whose sidechain contains no non-hydrogen atoms.

In [None]:
print("Structure sequence:", designed_structures["Genie"].sequence())
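
To make the glycine placeholder concrete, here is a minimal sketch of one way a sequence-free `CA` trace could report a sequence. The `CATrace` class is hypothetical and only illustrates the idea; it is not how `proteome` represents structures.

```python
from dataclasses import dataclass


@dataclass
class CATrace:
    """Hypothetical container holding one (x, y, z) CA coordinate per residue."""

    ca_coords: list

    def sequence(self) -> str:
        # No residue identities are stored, so fall back to glycine ("G"),
        # whose sidechain is a single hydrogen atom.
        return "G" * len(self.ca_coords)


# Five residues spaced 3.8 angstroms apart along the x axis.
trace = CATrace(ca_coords=[(3.8 * i, 0.0, 0.0) for i in range(5)])
placeholder_sequence = trace.sequence()
print(placeholder_sequence)
```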

In [None]:
designed_structures["Genie"].show()

In [None]:
# For consistency convert and show this structure as a CA trace
designed_structures["RFDiffusion"].to_ca_trace().show()

Now let's design sequences for each of the generated structures.

In [None]:
designed_sequences = {}
sequence_aux_outputs = {}
for sd_name, designed_structure in designed_structures.items():
    designed_sequence, sequence_aux_out = inverse_folder(designed_structure)
    designed_sequences[sd_name] = designed_sequence
    sequence_aux_outputs[sd_name] = sequence_aux_out

In [None]:
sequence_aux_outputs

Finally, we'll fold the designed sequences with `OmegaFold` and compare each result to the corresponding unconditionally designed structure.

In [None]:
folded_structures = {}
folder_aux_outputs = {}
for sd_name, designed_sequence in designed_sequences.items():
    predicted_structure, folder_aux_out = folder(designed_sequence)
    folded_structures[sd_name] = predicted_structure
    folder_aux_outputs[sd_name] = folder_aux_out

In [None]:
sd_name = "RFDiffusion"
visual.view_superimposed_ca_traces(
    [designed_structures[sd_name].to_ca_trace(), folded_structures[sd_name].to_ca_trace()]
)
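
The visual overlay can be complemented with a number. The sketch below computes a distance RMSD (dRMSD) between two `CA` traces: the root-mean-square difference between their pairwise `CA`-`CA` distance lists. This is a swapped-in, superposition-free stand-in for the superposed RMSD or TM-score more commonly reported for designability, and it is not part of `proteome`. The coordinates are toy values, and how to extract coordinates from the structure objects above is left open. Note that dRMSD cannot tell a structure from its mirror image.

```python
import math
from itertools import combinations


def pairwise_distances(coords):
    """Return all CA-CA distances in a fixed (i < j) order."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]


def distance_rmsd(coords_a, coords_b):
    """RMS difference between the pairwise distance lists of two traces."""
    if len(coords_a) != len(coords_b):
        raise ValueError("traces must have the same number of residues")
    dists_a = pairwise_distances(coords_a)
    dists_b = pairwise_distances(coords_b)
    if not dists_a:
        raise ValueError("need at least two residues")
    squared = sum((x - y) ** 2 for x, y in zip(dists_a, dists_b))
    return math.sqrt(squared / len(dists_a))


# Toy 3-residue traces (angstroms).
designed = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
# The same trace rotated 90 degrees about z and translated by (1, 1, 1).
refolded_same = [(1.0, 1.0, 1.0), (1.0, 4.8, 1.0), (1.0, 8.6, 1.0)]
# A trace with a right-angle bend at the middle residue.
refolded_bent = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]

same_score = distance_rmsd(designed, refolded_same)
bent_score = distance_rmsd(designed, refolded_bent)
print(f"same fold: {same_score:.3f}, bent fold: {bent_score:.3f}")
```

Because only internal distances are compared, rigid rotations and translations leave the score at zero, so no alignment step is needed before scoring.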

## Joint sequence-structure design

`Protein Generator` is similar to `RFDiffusion` but designs an amino acid sequence jointly with the structure. This lets us skip the inverse folding step and directly compare the designed structure against the structure folded from the designed sequence.

In [None]:
joint_designer = models.ProteinGeneratorForJointDesign(model_name="auto")
joint_designer_params = models.protein_generator_config.InferenceConfig(
    contigmap_params=models.protein_generator_config.ContigMap(contigs=["128-128"]),
)

In [None]:
designed_structure, designed_sequence, aux_output = joint_designer(joint_designer_params)

In [None]:
predicted_structure, _ = folder(designed_sequence)

In [None]:
visual.view_superimposed_structures(designed_structure, predicted_structure)