# Structures with Biopython


So far we have used `Biopython` for sequences; now we learn how to use it for 3D structures as well.

Bio.PDB is a Biopython module to easily access structure data.

For further information read the Biopython [Tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.pdf) or the [FAQs](http://biopython.org/DIST/docs/tutorial/biopdb_faq.pdf).

As an example we look at a dopamine transporter protein, [4XP1](https://www.rcsb.org/structure/4XP1).



## Step1: Downloading and reading in the structure file

Biopython is able to read `PDB` and `mmCIF` files. Both are available from the PDB database (e.g. [4XP1](https://www.rcsb.org/structure/4XP1)).

Download both manually, move them to your working directory and have a brief look at them via vim.

The PDB file format is well defined ([short description](https://www.cgl.ucsf.edu/chimera/docs/UsersGuide/tutorials/pdbintro.html); [detailed documentation](http://www.wwpdb.org/documentation/file-format)), and there is a table of [correspondences](http://mmcif.wwpdb.org/docs/pdb_to_pdbx_correspondences.html) to `mmCIF`. Still, getting information directly from the files is tedious.

The first lines (the header) contain information about the authors who published the file, the experiment, the protein, and all kinds of additional remarks. Atom positions, atom names and residues are given in lines starting with `ATOM`.
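To get a feeling for what "reading the file by hand" means, here is a minimal sketch that slices `ATOM` records by their fixed columns. The two sample lines are made up for illustration (they are not taken from 4XP1), and the helper name `parse_atom_line` is our own; only the column positions follow the PDB format description.

```python
# Two made-up ATOM records, laid out in the fixed PDB columns.
SAMPLE_LINES = [
    "ATOM      1  N   ALA A   1       0.000   0.000   0.000  1.00 20.00           N",
    "ATOM      2  CA  ALA A   1       3.000   4.000   0.000  1.00 20.00           C",
]


def parse_atom_line(line):
    """Slice one ATOM record into a dict (Python slices are 0-based)."""
    return {
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "resname": line[17:20].strip(),   # columns 18-20
        "chain": line[21],                # column 22
        "resseq": int(line[22:26]),       # columns 23-26
        "coord": (
            float(line[30:38]),           # x, columns 31-38
            float(line[38:46]),           # y, columns 39-46
            float(line[46:54]),           # z, columns 47-54
        ),
    }


sample_atoms = [parse_atom_line(l) for l in SAMPLE_LINES if l.startswith("ATOM")]
for sample_atom in sample_atoms:
    print(sample_atom["name"], sample_atom["resname"], sample_atom["chain"],
          sample_atom["resseq"], sample_atom["coord"])
```

This only covers a handful of fields and ignores hetero atoms, alternate locations and multiple models, which is exactly the kind of bookkeeping a parser library takes care of.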

Open the PDB file in a text editor. Can you find the authors and the experiment with which the structure was obtained? How would you calculate the distance between two individual atoms or groups of atoms?

Since we are too lazy to write that kind of code we use the Bio.PDB module.

First we have to import it.

In [None]:
import Bio
print("Biopython v" + Bio.__version__)

In [None]:
from Bio.PDB import *

We use the PDBParser class to read in the data so that it is usable in Python. A parser simply takes some input data (here a text file) and converts it into a data structure.


In [None]:
parser_pdb = PDBParser() # Creation of the parser object
# 1st argument is a user-defined name, the 2nd is the path to the file
structure_pdb = parser_pdb.get_structure("4XP1", "4xp1.pdb")

We can also use the MMCIFParser.

In [None]:
parser_cif = MMCIFParser()
structure_cif = parser_cif.get_structure("4XP1", "4xp1.cif")

For writing a structure file we use the PDBIO class.

In [None]:
io = PDBIO() # create io object
io.set_structure(structure_pdb) 
io.save("4xp1_new.pdb")

Compare your own structure `4xp1_new.pdb` with the file from the PDB `4xp1.pdb`. Are there any differences?

### Detour: Visualization

To view the structures in the notebook, we use a widget (for everyday use I recommend PyMOL or a similar viewer).

Run these commands in the terminal:


`conda config --add channels conda-forge`

`conda install nglview -c bioconda`

After the installation has finished, enable the nglview extension and then reopen the Jupyter notebook:

`jupyter-nbextension enable nglview --py --sys-prefix`

`jupyter-notebook biopython2.ipynb`




In [None]:
import nglview as nv
view = nv.show_biopython(structure_pdb)
view.clear_representations()
view.add_ball_and_stick() #view as ball and stick 
view

It might look nicer with the protein in ribbon representation.

In [None]:
view.clear_representations()
# add ribbons
view.add_cartoon('protein')
# add ball and stick for non-protein
view.add_ball_and_stick('not protein')
view

Zooming into the structure we realize that there are three different chains and some ligands.

## Step2: Access data

### Header information

We can now easily access the information from the header.

In [None]:
resolution = structure_pdb.header["resolution"]
print("The resolution is: ",resolution, "A")
keywords = structure_pdb.header["keywords"]
print("Keywords: " , keywords)

Other keys are name, head, deposition_date, release_date, structure_method, resolution, structure_reference, journal_reference, author and compound.

Use the appropriate keys to get the author names.
How was the structure obtained?

### Object hierarchy

The hierarchy of structure objects is the following:

A structure can consist of several models.

A model consists of chains.

A chain consists of residues.

A residue consists of atoms. 



![](http://biopython.org/wiki/Smcra.png)



But what can we do with our structure object?

In [None]:
print(dir(structure_pdb))

Apparently quite a lot.

Let's start by looking how many models we have.

In [None]:
for model in structure_pdb:
    print("Model ",model)

We have one model.

Next we access this model and check how many chains this model has.

In [None]:
model = structure_pdb[0]
for chain in model:
    print("chain object: ", chain," chain id: ", chain.id)

There are three chains in the model, with ids `A`, `L` and `H`.

The model is called the parent_entity of the chains `A`, `L` and `H` and those chains are called the child_entities of the model. In the same way the residues are parent_entities of atoms and atoms have no child_entities.

The general syntax to access a child_entity is:
        child_entity = parent_entity[child_id].
        
Print all residues and their names (residue.resname) in chain L.

We also can directly get the entire list of child_entities.

In [None]:
chain_L = model["L"]
L_list = chain_L.get_list()
print(L_list)

We can also get the parent_entity from a child.

In [None]:
model_from_child = chain_L.get_parent()
print(model_from_child)
print(model)

It is the same model from which we originally obtained the chain.
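The parent/child pattern can be pictured with a small toy class. This is only a sketch of the idea, assuming a dictionary of children per entity; `ToyEntity` and all `toy_*` names are made up and it is not how Biopython implements its entities.

```python
class ToyEntity:
    """Toy stand-in for a structure/model/chain node (not Biopython's Entity)."""

    def __init__(self, entity_id):
        self.id = entity_id
        self.parent = None
        self.children = {}  # child id -> child entity, in insertion order

    def add(self, child):
        child.parent = self
        self.children[child.id] = child
        return child

    def __getitem__(self, child_id):
        # parent_entity[child_id] -> child_entity
        return self.children[child_id]

    def __iter__(self):
        # "for child in parent" walks over the children
        return iter(self.children.values())

    def __len__(self):
        return len(self.children)

    def get_parent(self):
        return self.parent

    def get_full_id(self):
        """Collect the ids from the root down to this entity."""
        ids = []
        node = self
        while node is not None:
            ids.append(node.id)
            node = node.parent
        return tuple(reversed(ids))


toy_structure = ToyEntity("toy")
toy_model = toy_structure.add(ToyEntity(0))
for toy_chain_id in ("A", "L", "H"):
    toy_model.add(ToyEntity(toy_chain_id))

toy_chain_l = toy_structure[0]["L"]
print([chain.id for chain in toy_model])
print(toy_chain_l.get_parent() is toy_model)
print(toy_chain_l.get_full_id())
```

Indexing walks down the hierarchy, `get_parent()` walks up, and the full id is simply the path of ids from the top.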

Let's further look at the residues.

In [None]:
for residue in L_list:
    print(residue.id)

The residue_id has three elements:

- The first is the `hetero-field` (hetfield). It is blank for standard amino acids (or nucleic acids), 'W' for water molecules and 'H_' followed by the residue name for hetero residues.

- The second is the `sequence identifier` (resseq), which is an integer that describes the position in the chain.

- The third is the `insertion code` (icode). This string is mostly empty but can be useful in insertion mutants to keep the numbering scheme. (e.g.  wild type: ..., (' ', 35, ' '), (' ', 36, ' '), ... ; mutant: ..., (' ', 35, 'A'),(' ', 35, 'B'),(' ', 36, ' '), ...)

For blank `hetero-field` and `insertion code` the `sequence identifier` can be used to access residues.
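The integer shortcut can be thought of as a small translation step in front of a lookup by the full three-element id. The sketch below uses a plain dictionary with made-up residue names; `to_full_residue_id`, `toy_lookup` and `toy_chain` are hypothetical names for illustration only.

```python
def to_full_residue_id(key):
    """Expand an integer shortcut to a full (hetfield, resseq, icode) tuple."""
    if isinstance(key, int):
        return (" ", key, " ")
    return key


# Made-up chain content, keyed by full residue ids.
toy_chain = {
    (" ", 35, " "): "GLY",
    (" ", 35, "A"): "SER",      # inserted residue: same resseq, icode "A"
    (" ", 36, " "): "ALA",
    ("W", 501, " "): "HOH",     # water
    ("H_LDP", 708, " "): "LDP", # hetero residue
}


def toy_lookup(chain, key):
    return chain[to_full_residue_id(key)]


print(toy_lookup(toy_chain, 35))              # plain residue via the shortcut
print(toy_lookup(toy_chain, (" ", 35, "A")))  # inserted residue needs the full id
print(to_full_residue_id(708) in toy_chain)   # the shortcut does not reach hetero residues
```

Water and hetero residues therefore always need the full tuple, because the shortcut fills in a blank hetero-field.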

In [None]:
residue_33 = chain_L[33]
print(residue_33)
residue_33 = chain_L[(' ',33,' ')]
print(residue_33)

What are the following functions doing?

In [None]:
print(model.has_id("A"))
print(model.has_id("B"))
print(chain_L.has_id(33))

In [None]:
print(len(model))
print(len(residue_33))
print(len(chain_L))

In [None]:
print(chain_L.get_full_id())
print(residue_33.get_full_id())
print(residue_33["N"].get_full_id())

The first entry is the structure id `4XP1` you gave as a name when loading the structure.

Print all atoms in residue_33. What are the atom identifiers?

In [None]:
for atom in residue_33:
    print(atom, " ", atom.id)
print(residue_33["N"])

Knowing all the ids we can also directly access an atom.

In [None]:
atom = structure_pdb[0]["L"][33]["CA"]
print(atom.get_full_id())

Some more atom methods are get_name(), get_id(), get_coord(), get_vector(), get_bfactor() and get_occupancy().
Try them! What is the difference between get_coord() and get_vector()?

## Step3: Using the data

We want to find out where the dopamine binds to the protein.

First we find the dopamine (`LDP`) residue.



In [None]:
for residue in structure_pdb[0].get_residues():
    if residue.resname == "LDP":
        LDP = residue
        break
print(LDP)

Next we want to find all other residues whose $\alpha$-carbon lies within a certain distance of it.

We can do this via the coordinates.

In [None]:
res_56_CA = structure_pdb[0]['A'][56]['CA']
print(res_56_CA.coord)

Write a function that returns the distance between two atoms (e.g. res_56_CA and res_58_CA).

In [None]:
res_58_CA = structure_pdb[0]['A'][58]['CA']

For atom objects the minus operator is overloaded to return the distance. Check whether your function gives the same results.

In [None]:
print(res_56_CA-res_58_CA)
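How an overloaded minus can return a distance is easiest to see on a toy class. The following is a sketch with made-up coordinates; `ToyAtom` and `toy_distance` are our own names and not part of Biopython.

```python
import math


class ToyAtom:
    """Toy stand-in for an atom with coordinates (not Biopython's Atom class)."""

    def __init__(self, name, coord):
        self.name = name
        self.coord = tuple(coord)

    def __sub__(self, other):
        # atom_a - atom_b returns the Euclidean distance between the two atoms
        return math.dist(self.coord, other.coord)


def toy_distance(atom_a, atom_b):
    """Explicit version: square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(atom_a.coord, atom_b.coord)))


toy_a = ToyAtom("CA", (0.0, 0.0, 0.0))
toy_b = ToyAtom("CA", (3.0, 4.0, 0.0))
print(toy_a - toy_b)
print(toy_distance(toy_a, toy_b))
```

Both lines print the same number, because `__sub__` is just another way of calling the distance calculation.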

Now we only have to apply this for all residues.

Choose an appropriate cutoff (lengths are given in $\mathring{A}$).

In [None]:
cutoff = 10

binding_residues_pdb = []

for residue in structure_pdb[0].get_residues():
    #skip the LDP residue
    if residue == LDP:
        continue
    #skip hetero residues
    elif residue.id[0].startswith("H"):
        continue
    #skip water residues
    elif residue.id[0].startswith("W"):
        continue
    else:
        alpha_carbon = residue['CA']
        distances = []
        #make a list of all distances between the alpha carbon and the atoms in LDP
        for atom in LDP:
            distances.append(alpha_carbon - atom)
        #check whether the smallest distance is smaller than the cutoff
        if min(distances) < cutoff:
            binding_residues_pdb.append(residue)
            
print(binding_residues_pdb)

Let's view this.

In [None]:
#view = nv.demo()
view = nv.show_biopython(structure_pdb)

# use hex values for now.
residues = structure_pdb[0].get_residues()
#this is a bit of a hack to set the binding residues to red in the visualization
colors = ['0x0000FF' if r not in binding_residues_pdb else '0xFF0000' for r in residues]
view._set_color_by_residue(colors, component_index=0, repr_index=0)
view

### Detour: The direct way

For this protein there is also a direct way to access this information as it is already provided in the mmCIF file header.

We parse this kind of information into a dictionary.

In [None]:
from Bio.PDB.MMCIF2Dict import MMCIF2Dict  # explicit import of the dictionary class
cif_dict = MMCIF2Dict('4xp1.cif')
print(cif_dict.keys())

There is a lot of information, but fortunately there is [documentation](http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v50.dic/Index/) that explains all the keys.

Use the key "_citation.title" to learn the publication title.

In [None]:
print(cif_dict["_citation.title"])

We are interested in the binding sites.

In [None]:
print(cif_dict["_struct_site.details"][6],"has id", cif_dict["_struct_site.id"][6])

The binding site for the dopamine (LDP) has the id `AC7`.

Can you find this information directly in the cif file?


It should look like this.

```
loop_
_struct_site.id
_struct_site.pdbx_evidence_code
_struct_site.pdbx_auth_asym_id
_struct_site.pdbx_auth_comp_id
_struct_site.pdbx_auth_seq_id
_struct_site.pdbx_auth_ins_code
_struct_site.pdbx_num_residues
_struct_site.details
AC1 Software A NA  701 ? 5 'binding site for residue NA A 701'
AC2 Software A NA  702 ? 5 'binding site for residue NA A 702'
AC3 Software A CL  703 ? 4 'binding site for residue CL A 703'
AC4 Software A MAL 704 ? 4 'binding site for residue MAL A 704'
AC5 Software A MAL 705 ? 4 'binding site for residue MAL A 705'
AC6 Software A P4G 707 ? 1 'binding site for residue P4G A 707'
AC7 Software A LDP 708 ? 9 'binding site for residue LDP A 708'
AC8 Software A EDO 709 ? 2 'binding site for residue EDO A 709'
AC9 Software A Y01 710 ? 4 'binding site for residue Y01 A 710'
AD1 Software A CLR 711 ? 5 'binding site for residue CLR A 711'
AD2 Software L NA  301 ? 4 'binding site for residue NA L 301'
AD3 Software A NAG 706 ? 1 'binding site for Mono-Saccharide NAG A 706 bound to ASN A 141'
```
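To see why each key in the dictionary holds a list that we index by row, here is a rough sketch of turning such a `loop_` block into a dictionary of columns. It uses three rows and three columns from the excerpt above; `parse_simple_loop` is a hypothetical helper, and real mmCIF tokenization has more rules than this handles.

```python
import shlex

SAMPLE_LOOP = """\
loop_
_struct_site.id
_struct_site.pdbx_auth_comp_id
_struct_site.details
AC6 P4G 'binding site for residue P4G A 707'
AC7 LDP 'binding site for residue LDP A 708'
AC8 EDO 'binding site for residue EDO A 709'
"""


def parse_simple_loop(text):
    """Map each column name of a simple loop_ block to the list of its values."""
    keys, table = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "loop_":
            continue
        if line.startswith("_"):
            # a column name
            keys.append(line)
            table[line] = []
        else:
            # a data row; shlex keeps the quoted description together
            for key, value in zip(keys, shlex.split(line)):
                table[key].append(value)
    return table


toy_site = parse_simple_loop(SAMPLE_LOOP)
toy_row = toy_site["_struct_site.pdbx_auth_comp_id"].index("LDP")
print(toy_site["_struct_site.id"][toy_row], "->", toy_site["_struct_site.details"][toy_row])
```

Looking up the row by residue name instead of a hard-coded index makes the lookup independent of the order of the sites.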

Use these dictionary entries to get the residues of the binding site.

In [None]:
site_id = cif_dict['_struct_site_gen.site_id']
site_chain = cif_dict['_struct_site_gen.auth_asym_id']
site_resnum = cif_dict['_struct_site_gen.auth_seq_id']
site_resname = cif_dict['_struct_site_gen.label_comp_id']


cif_binding_residues = []
for bind_id, chain, res_num, name in zip(site_id, site_chain, site_resnum, site_resname):
    if bind_id == "AC7":
        print(bind_id, chain, res_num, name)
        try:
            cif_binding_residues.append(structure_cif[0][chain][int(res_num)])
        except (KeyError, ValueError):
            # the lookup failed for this entry: skip it
            continue
print([x.id for x in cif_binding_residues])

Why do we get an error when not using error handling (try/except)?

Does the binding site differ from the one you calculated earlier?

## Step4: More useful tools

### Vectors

Atomic coordinates also have a vector representation via atom.get_vector().
This can be used to calculate distances.

In [None]:
import numpy as np  # np.sqrt is used below

diff = res_56_CA.get_vector() - res_58_CA.get_vector()
print("Distance from vectors: ", np.sqrt(diff * diff))  # Vector * Vector is the dot product
print("Distance from overloaded minus: ", res_56_CA-res_58_CA)

The vector module is also useful to calculate angles and dihedrals.

In [None]:
res_100_CA = structure_pdb[0]['A'][100]['CA']
res_150_CA = structure_pdb[0]['A'][150]['CA']
vector1 = res_56_CA.get_vector()
vector2 = res_100_CA.get_vector()
vector3 = res_150_CA.get_vector()
angle = calc_angle(vector1, vector2, vector3)  # angle at vector2, in radians
print("The calculated angle is: ", angle, "rad")
vector4 = res_58_CA.get_vector()
dihedral = calc_dihedral(vector1, vector2, vector3, vector4)  # in radians
print("The calculated dihedral is: ", dihedral, "rad")
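For reference, this is what an angle and a dihedral between points mean geometrically, written out with plain tuples. It is a sketch using one common textbook formula; the function names are our own, and the sign convention of the dihedral may differ from the one Biopython uses.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _norm(a):
    return math.sqrt(_dot(a, a))


def angle_between(p1, p2, p3):
    """Angle at p2 between the directions p2->p1 and p2->p3, in radians."""
    u, v = _sub(p1, p2), _sub(p3, p2)
    cos_theta = _dot(u, v) / (_norm(u) * _norm(v))
    return math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp against rounding


def dihedral_between(p1, p2, p3, p4):
    """Torsion around the p2-p3 axis, in radians, in the range -pi to pi."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)  # normals of the two planes
    return math.atan2(_norm(b2) * _dot(b1, n2), _dot(n1, n2))


print(math.degrees(angle_between((1, 0, 0), (0, 0, 0), (0, 1, 0))))
print(math.degrees(dihedral_between((1, 0, 0), (0, 0, 0), (0, 1, 0), (1, 1, 0))))
```

`math.degrees` converts the results if degrees are easier to read than radians.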

We already used the dot product (`*`). Other operations are implemented as well, such as the cross product (`**`), matrix multiplication, the norm, and some functions to calculate rotation matrices.

In [None]:
print(vector1**vector2)
print(vector1.norm())

We can use this to estimate the position of a virtual $\beta$-carbon for a glycine residue.

In [None]:
import numpy as np  # needed for np.pi below

# get some glycine
for residue in structure_pdb[0].get_residues():
    if residue.resname == "GLY":
        gly = residue
        break
# get vectors of the coordinates for N, C and CA
n = gly["N"].get_vector()
c = gly["C"].get_vector()
ca = gly["CA"].get_vector()
# center N and C at the origin (CA becomes the origin)
n = n - ca
c = c - ca
# rotation matrix for -120 degrees; the second argument is the axis (CA-C)
rot = rotaxis(-np.pi * 120/180, c)
# apply the rotation to N and move the result back
cb_origin = n.left_multiply(rot)
cb = cb_origin + ca
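What "rotating a vector around an axis" does can be written out with Rodrigues' rotation formula. This is a plain-Python sketch with a hypothetical helper `rotate_about_axis`; it uses the right-hand rule for positive angles, which is an assumption here and should be checked against the `rotaxis` documentation before comparing signs.

```python
import math


def rotate_about_axis(v, axis, theta):
    """Rotate v by theta (radians) around axis using Rodrigues' formula."""
    length = math.sqrt(sum(a * a for a in axis))
    k = [a / length for a in axis]  # unit vector along the axis
    k_cross_v = (k[1] * v[2] - k[2] * v[1],
                 k[2] * v[0] - k[0] * v[2],
                 k[0] * v[1] - k[1] * v[0])
    k_dot_v = sum(a * b for a, b in zip(k, v))
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return tuple(
        v[i] * cos_t + k_cross_v[i] * sin_t + k[i] * k_dot_v * (1.0 - cos_t)
        for i in range(3)
    )


# rotate the x unit vector by 90 degrees around the z axis
toy_rotated = rotate_about_axis((1.0, 0.0, 0.0), (0.0, 0.0, 2.0), math.pi / 2)
print([round(x, 6) for x in toy_rotated])
```

The axis does not need to be normalized beforehand, and the length of the rotated vector stays the same.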

### Downloading directly from the PDB

In [None]:
pdblist = PDBList()
pdblist.retrieve_pdb_file('4XP1')

### Skipping hierarchy

We can directly iterate over all atoms/residues in a structure/model/chain via get_atoms() and get_residues(). 

In [None]:
residues = structure_pdb.get_residues()
for residue in residues:
    print(residue)

In [None]:
atoms = chain_L.get_atoms()
for atom in atoms:
    print(atom)

The Selection.unfold_entities function works similarly and returns lists of atoms/residues...

In [None]:
atom_list = Selection.unfold_entities(chain_L,"A")
print(atom_list)

Here "A" stands for atom, but "R" (residue), "C" (chain), "M" (model) and "S" (structure) are also possible.

This function also works up in hierarchy, which is useful to get a list of unique parent entities.


In [None]:
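res_list = Selection.unfold_entities(structure_pdb, "R")  # list of all residues
chain_list = Selection.unfold_entities(res_list, "C")  # unique parent chains
print(chain_list)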

### Sequence

We can also get the sequences of the chains. To do so, we first build polypeptide objects with the PPBuilder and then read their sequences.

In [None]:
from Bio.PDB.Polypeptide import *
ppb=PPBuilder()
for polypeptide in ppb.build_peptides(structure_pdb):
    print(polypeptide)
    print(polypeptide.get_sequence())

### Superimposing

A Superimposer object allows us to superimpose two lists of atoms by minimizing their `RMSD`. The two lists need to have the same number of atoms; the Superimposer can then calculate and apply the appropriate rotation and translation matrices:

 - Create a Superimposer object: sup = Superimposer()
 - Set the atoms that are fixed and those that are to be moved: sup.set_atoms(fixed, moving) (fixed and moving are lists of atoms)
 - Apply the rotation/translation: sup.apply(moving)
 - Access the matrix (sup.rotran) and the RMSD (sup.rms)
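The quantity that gets minimized, the RMSD, is simple to compute for two coordinate lists of equal length. The sketch below uses made-up coordinates and a hypothetical helper `toy_rmsd`; it only measures the deviation and does not perform any superposition.

```python
import math


def toy_rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long coordinate lists."""
    if len(coords_a) != len(coords_b):
        raise ValueError("both lists need the same number of atoms")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


toy_fixed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
# same shape, shifted by 2 along z
toy_moving = [(x, y, z + 2.0) for x, y, z in toy_fixed]
print(toy_rmsd(toy_fixed, toy_moving))
```

Here the two sets differ only by a translation, so an optimal superposition would bring the RMSD down to zero; the value printed is the deviation before any fitting.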


We will use this in the exercises.

### Writing a part of a structure

We have already seen that by default the `PDBIO` class writes the whole structure. We can change this behaviour:

In [None]:
class ChainLSelect(Select):
    def accept_chain(self, chain):
        # keep only chain L; all other chains are skipped
        return chain.get_id() == "L"
io = PDBIO()
io.set_structure(structure_pdb)
io.save("chain_L.pdb",ChainLSelect())

Here we write only chain L. We can change accept_model(model), accept_residue(residue) and accept_atom(atom) in the same way. Return `True` when output is desired and `False` otherwise.