# Protein

This notebook covers the usage of the `protein.ProteinBase` API and describes how structures with different numbers of atoms are handled.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from proteome import protein



Let's grab a somewhat complex structure from the `PDB` and load it into a protein structure. The main structure representation is `Protein37`, which stores up to 37 possible non-hydrogen atoms per residue.

In [3]:
pdb_str = protein.get_structure_from_pdb("6QNO")
structure = protein.Protein37.from_pdb_string(pdb_str)

Structure exists: '/home/conradry71/.proteome/pdb_downloads/pdb6qno.ent' 


## Some basics

Let's show the structure.

In [4]:
# Unlike predicted structures, the b-factors in PDB files are not
# confidence scores in the range 0-100
structure.show(
    bfactor_is_confidence=False,
)

<py3Dmol.view at 0x7ff2644a73a0>

The `Protein37` class is effectively a PDB file that's been parsed into numpy arrays with some extra sugar that makes it interactive in a notebook. The fields that define the structure can be seen from the `fields` property.

In [5]:
structure.fields

['atom_positions',
 'aatype',
 'atom_mask',
 'residue_index',
 'chain_index',
 'b_factors',
 'parents',
 'parents_chain_index',
 'hetatom_positions',
 'hetatom_names',
 'remark']

These fields were adopted from `AlphaFold` / `OpenFold` with some extras like the `hetatom` fields to support models like `RFDiffusion`. Briefly the fields represent:

- `atom_positions`: The XYZ coordinates of each atom
- `aatype`: The amino-acid type indices
- `atom_mask`: Mask with zeros where an atom is missing / doesn't exist
- `residue_index`: The index of each residue in `aatype`; usually but not always sequential
- `chain_index`: The chain indices corresponding to each residue
- `b_factors`: The B-factors read from the PDB file; for predicted structures these hold confidence scores in the range 0-100
- `hetatom_positions`: The XYZ coordinates of `hetatoms` (e.g., a bound ligand)
- `hetatom_names`: The names of the `hetatoms`
- `parents`, `parents_chain_index`, `remark`: Some header fields used by PDB

Most of these fields are arrays that aren't particularly human readable. We can get useful information about the structure from a few handy properties.

In [6]:
# Show the shape of the structure
print("Shape:", structure.shape)
# Show the chain names
print("Chain names:", structure.chains)
# Show the sequence for a particular chain
print("Chain A sequence", structure.sequence("A"))

Shape: ProteinShape(num_residues=1332, num_atoms=37, num_chains=6)
Chain names: ABGHLR
Chain A sequence TLSAEDKAAVERSKMIDRNLREDGEKAAREVKLLLLGAGESGKSTIVKQMGIVETHFTFKDLHFKMFDVGGQRSERKKWIHCFEGVTAIIFCVALSDYNRMHESMKLFDSICNNKWFTDTSIILFLNKKDLFEEKIKKSPLTICYPEYAGSNTYEEAAAYIQCQFEDLNKRKDTKEIYTHFTCATDTKNVQFVFDAVTDVIIKNNLKDCGLF


For structures with multiple chains like this one we can do handy things like easily extract a certain chain as a new structure.

In [7]:
structure_b = structure.get_chain("B")
structure_b.show(
    cmap="cool",
    bfactor_is_confidence=False,
    show_sidechains=False,
)

<py3Dmol.view at 0x7ff2644a4e50>

## Structure atom numbers

The `Protein37` structure is the main representation that we use for `PDB` files because it has the nice property that, for every residue, a particular index in the array always represents the same kind of atom. For example, to get the positions of all `CG2` atoms we can index into `atom_positions` with `structure.atom_positions[:, 7]`. The main disadvantage of the atom 37 representation is that no residue has all 37 possible atoms, so there is a lot of empty space in the array.
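To make the fixed layout concrete, here is a minimal sketch of the atom 37 ordering. The list is the ordering defined in AlphaFold's `residue_constants.atom_types`, which we assume `proteome` inherits unchanged; the names `ATOM_TYPES` and `ATOM_ORDER` are only used for this illustration.

```python
# Atom 37 ordering as defined in AlphaFold's residue_constants.atom_types.
# Assumption: proteome uses the same ordering for Protein37.
ATOM_TYPES = [
    "N", "CA", "C", "CB", "O", "CG", "CG1", "CG2", "OG", "OG1", "SG", "CD",
    "CD1", "CD2", "ND1", "ND2", "OD1", "OD2", "SD", "CE", "CE1", "CE2", "CE3",
    "NE", "NE1", "NE2", "OE1", "OE2", "CH2", "NH1", "NH2", "OH", "CZ", "CZ2",
    "CZ3", "NZ", "OXT",
]

# Map each atom name to its column in the atom 37 arrays.
ATOM_ORDER = {name: index for index, name in enumerate(ATOM_TYPES)}

print("Number of slots:", len(ATOM_TYPES))
print("Column of CG2:", ATOM_ORDER["CG2"])
```

With a lookup like this, `structure.atom_positions[:, ATOM_ORDER["CG2"]]` would select the same column as the hard-coded `7` above, without having to remember the index.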

In [8]:
all_atom_mask = structure.atom_mask.ravel()
filled = all_atom_mask.sum()
total = len(all_atom_mask)
print(f"Fraction of filled positions: {filled / total:.3f}")

Fraction of filled positions: 0.211


In this case only 21% of the positions in the array are actually filled! That's not usually an issue when working with individual structures, but it can be limiting when training models on GPUs with finite memory. In practice, most models use a denser 14 atom representation of the structure. The main trade-off is that the identity of an atom at a given index now depends on the residue in question. The mapping of atoms is specified by `proteome.constants.residue_constants.restype_atom14_to_atom37`. Let's see how much space using 14 atoms saves us.
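Before running the conversion, here is a small sketch of why atom identity depends on the residue in the 14 atom layout. The per-residue atom names follow AlphaFold's `restype_name_to_atom14_names` for five residue types, which we assume matches what `proteome` uses; `ATOM14_NAMES` and `fill_fractions` are illustrative names, and the toy sequence ignores missing atoms.

```python
# Heavy-atom names per residue in atom 14 order (a subset of AlphaFold's
# restype_name_to_atom14_names; assumed to match proteome's constants).
ATOM14_NAMES = {
    "G": ["N", "CA", "C", "O"],
    "A": ["N", "CA", "C", "O", "CB"],
    "V": ["N", "CA", "C", "O", "CB", "CG1", "CG2"],
    "T": ["N", "CA", "C", "O", "CB", "OG1", "CG2"],
    "W": ["N", "CA", "C", "O", "CB", "CG", "CD1", "CD2", "NE1", "CE2",
          "CE3", "CZ2", "CZ3", "CH2"],
}

# The same slot holds a different atom depending on the residue type.
for residue in "VTW":
    print(f"Slot 5 of {residue}: {ATOM14_NAMES[residue][5]}")


def fill_fractions(sequence):
    """Return the fraction of occupied slots in the 37 and 14 atom layouts."""
    num_atoms = sum(len(ATOM14_NAMES[residue]) for residue in sequence)
    return num_atoms / (37 * len(sequence)), num_atoms / (14 * len(sequence))


frac37, frac14 = fill_fractions("GAVTW")
print(f"atom 37: {frac37:.3f}, atom 14: {frac14:.3f}")
```

Only tryptophan fills all 14 slots, which is why the dense layout still carries some padding for every other residue type.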

In [9]:
structure14 = structure.to_protein14()
all_atom_mask14 = structure14.atom_mask.ravel()
filled = all_atom_mask14.sum()
total = len(all_atom_mask14)
print(f"Fraction of filled positions: {filled / total:.3f}")

Fraction of filled positions: 0.556


Nice! This structure has much less wasted space and is more than 2x smaller. As a quick side note, there's an extension of the 14 atom structure that has 27 atoms to accommodate hydrogens (`Protein27`). This is used by some models but generally isn't very useful because our `pdb_parser` ignores hydrogens. The remaining representations simplify structures further, down to just the backbone with 5, 4, 3, or 1 atom per residue.

- `Protein5`: The backbone atoms including `O` and `CB`
- `Protein4`: The backbone atoms including `O`
- `Protein3`: The backbone atoms (`N`, `CA`, `C`)
- `ProteinCATrace`: The `CA` atoms only

We can easily convert to any of these when needed.

In [10]:
structure5 = structure.to_protein5()
structure4 = structure.to_protein4()
structure3 = structure.to_protein3()
structure_ca = structure.to_ca_trace()

Importantly, these backbone-only representations are destructive. When they are created from `Protein37` or `Protein14`, the sidechain atoms are lost permanently. We can still convert back to those types, but all sidechain atoms will be masked.

In [11]:
# Show that now even more of the atoms are masked out
structure3_to_37 = structure3.to_protein37()
all_atom_mask3_to_37 = structure3_to_37.atom_mask.ravel()
filled = all_atom_mask3_to_37.sum()
total = len(all_atom_mask3_to_37)
print(f"Fraction of filled positions: {filled / total:.3f}")

Fraction of filled positions: 0.081
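
This value is what a back-of-the-envelope calculation predicts. The sketch below assumes every residue has all of its backbone atoms resolved, so for real structures it gives an upper bound (glycine, for instance, has no `CB`). `BACKBONE_ATOMS` and `NUM_ATOM37_SLOTS` are illustrative names, not part of the `proteome` API.

```python
NUM_ATOM37_SLOTS = 37

# Atoms kept by each backbone-only representation, as listed earlier.
BACKBONE_ATOMS = {
    "Protein5": ["N", "CA", "C", "O", "CB"],
    "Protein4": ["N", "CA", "C", "O"],
    "Protein3": ["N", "CA", "C"],
    "ProteinCATrace": ["CA"],
}

# After a round trip to atom 37, only these atoms remain unmasked.
for name, atoms in BACKBONE_ATOMS.items():
    fraction = len(atoms) / NUM_ATOM37_SLOTS
    print(f"{name}: at most {fraction:.3f} of atom 37 slots filled")
```

The `Protein3` entry works out to 3 / 37, which rounds to the 0.081 printed above.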


So why is it important to have all of these representations? They let us easily chain together models with different input and output shapes. For example, `RFDiffusion` creates structures with 14 atoms per residue, but an inverse folding model like `ProteinMPNN` only accepts a structure with backbone atoms. In this case we can connect the two by calling `structure.to_protein4()`. Internally, every pipeline has a particular representation that it expects, and we perform the conversion automatically.
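
As a rough illustration of what such an automatic conversion could look like, here is a sketch with stand-in objects. Everything except the `to_protein4` and `to_ca_trace` method names is hypothetical: `StubStructure`, `EXPECTED_CONVERSION`, `prepare_input`, and the `"ca_only_model"` entry are made up for this example and say nothing about how `proteome` actually implements its pipelines.

```python
class StubStructure:
    """Hypothetical stand-in that only records which representation it is."""

    def __init__(self, representation):
        self.representation = representation

    def to_protein4(self):
        return StubStructure("Protein4")

    def to_ca_trace(self):
        return StubStructure("ProteinCATrace")


# Hypothetical registry: pipeline name -> conversion method to call first.
EXPECTED_CONVERSION = {
    "ProteinMPNN": "to_protein4",
    "ca_only_model": "to_ca_trace",
}


def prepare_input(structure, pipeline_name):
    """Convert a structure to the representation a pipeline expects."""
    method_name = EXPECTED_CONVERSION[pipeline_name]
    return getattr(structure, method_name)()


# A 14 atom diffusion output is reduced to backbone atoms for ProteinMPNN.
diffusion_output = StubStructure("Protein14")
mpnn_input = prepare_input(diffusion_output, "ProteinMPNN")
print(diffusion_output.representation, "->", mpnn_input.representation)
```

Keeping the expected representation next to each pipeline means callers never have to remember which `to_*` method a given model needs.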