In [25]:
from Bio import PDB

# Structural Similarity Demo
____
In the [sequence similarity demo](../seq_sim_demo/20200706_seq_sim.ipynb), we explored the differences in the primary sequence (amino acid sequence) of TMPRSS2 between human and other animal homologs. We identified some amino acid properties (amino acid volume, hydrophobicity, charge/acidity) and constructed an algorithm for determining if a mutation at a single site has a large, moderate, or minimal effect, based on changes in these properties.

It is natural to ask the question: **why do these amino acid properties matter, and how does changing these properties affect the structure and function of the protein?** This is the question we will be answering in this structural similarity demo. We will split this question into 2-3 separate parts:
1. Qualitatively explore how amino acid properties affect 2º and 3º structure of the protein, and how they might affect function.
2. Develop a semi-quantitative model that explains how single site amino acid mutations could affect the overall function of the TMPRSS2 protein, compared to the human homolog. We will do this by extending the sequence iterator algorithm that we developed in the [sequence similarity demo](../seq_sim_demo/20200706_seq_sim.ipynb).
3. **Optional**: compare our semi-quantitative model to other variant -> function models in the literature, such as [SIFT](http://www.sbg.bio.ic.ac.uk/~phyre2/html/page.cgi?id=index) and [PhyreRisk](http://phyrerisk.bc.ic.ac.uk/search?action=fresh-search&searchTerm=O15393).

# Part #1
____
## Intro to Bio.PDB and NGL
In order to view protein structures inside of Jupyter notebooks, we will use a molecular structure viewer called NGL. It has a handy `nglview` extension that supports Jupyter notebooks ([GitHub](https://github.com/arose/nglview)).

In [2]:
import nglview

Using `nglview` is simple: we can load a protein structure directly from the [RCSB](http://rcsb.org) by its PDB ID:

In [31]:
# load "3pqr" from RCSB PDB and display viewer
rcsb_view = nglview.show_pdbid("3pqr") 
rcsb_view

NGLWidget()

### NGL viewer key bindings
* **Rotate molecule**: left click and drag
* **Zoom**: scroll up/down
* **Translate molecule**: right click and drag

We can also load protein structures that are stored locally as `*.pdb` files. In this example, we use a PDB file that was generated using the [Swiss-Model homology model generator](https://swissmodel.expasy.org/repository/uniprot/O15393). It is the **estimated** structure of human TMPRSS2 isoform 1:

In [8]:
!ls ../../data/structures/

O15393_swiss_model.pdb


In [57]:
tmprss2_view = nglview.show_file(
    '../../data/structures/O15393_swiss_model.pdb')
tmprss2_view

NGLWidget()

Another cool feature of `nglview` is that it has Biopython support: [Bio.PDB](https://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ) structures can be loaded. Let's try loading our Swiss-Model structure this way:

In [34]:
# instantiate a BioPython PDBParser
parser = PDB.PDBParser()

# use the PDBParser to load our PDB file
# into a Bio.PDB.Structure.Structure instance
bio_pdb_structure = parser.get_structure(
    "protein", '../../data/structures/O15393_swiss_model.pdb')

# load the Bio.PDB.Structure.Structure into nglview
bio_pdb_view = nglview.show_biopython(bio_pdb_structure)
bio_pdb_view

NGLWidget()

## Looking at the catalytic triad
In the [sequence similarity demo](../seq_sim_demo/20200706_seq_sim.ipynb), we compiled a list of residues that the literature implicates as important in binding and cleaving the S protein. These residues include the catalytic triad (H296, D345 and S441) and important binding residues (D435, K223, and K224).

These residue numbers refer to isoform 1, which is shorter than isoform 2. In the [sequence similarity demo](../seq_sim_demo/20200706_seq_sim.ipynb), we were looking at the sequence of isoform 2, so we had to adjust these positions by 37 amino acids. Here, we are looking at the structure of isoform 1, so we do not need to make this adjustment. `nglview` also uses the standard 1-indexing for residue numbering, so we do not need to subtract 1 to get the index in a 0-indexed list, as we did in the [sequence similarity demo](../seq_sim_demo/20200706_seq_sim.ipynb).
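To make the two adjustments concrete, here is a minimal sketch of the conversion from a 1-indexed isoform 1 residue number to a 0-indexed position in the isoform 2 sequence. The helper name `isoform1_to_isoform2_index` is hypothetical, and the sketch assumes the 37 extra residues sit at the start of isoform 2, so the offset is added rather than subtracted:

```python
# Offset between the two isoforms, taken from the text above.
# Assumption: the extra residues precede the shared sequence,
# so isoform 2 positions are isoform 1 positions plus 37.
ISOFORM_OFFSET = 37


def isoform1_to_isoform2_index(position, offset=ISOFORM_OFFSET):
    """Convert a 1-indexed isoform 1 residue number into a
    0-indexed position in the isoform 2 sequence string."""
    return position + offset - 1


# catalytic triad residues, numbered as in isoform 1
catalytic_triad = {"H296": 296, "D345": 345, "S441": 441}

isoform2_indices = {
    name: isoform1_to_isoform2_index(position)
    for name, position in catalytic_triad.items()
}
print(isoform2_indices)
```

In this notebook neither step is needed: the residue numbers can be passed to `nglview` unchanged.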

Let's start by converting the list of important residues to a Python list:

In [11]:
resi_interest = [296, 345 , 441, 435, 223, 224]

We will first look at the H296 residue in the catalytic triad. We can zoom in and show this residue in the [sticks representation](https://github.com/arose/nglview#representations) in `nglview`:

In [32]:
# add licorice (AKA sticks) representation, only for residue 296
tmprss2_view.add_representation('licorice', selection='296')

# center the view on residue 296
tmprss2_view.center(selection='296')

# show the protein viewer
tmprss2_view

NGLWidget(n_components=1, picked={'bond': {'atomIndex1': 1169, 'atomIndex2': 1171, 'bondOrder': 1}, 'atom1': {…

We can easily change the `selection` parameter so that we zoom in and show sticks for all three residues in the catalytic triad:

In [63]:
# define the catalytic triad selection; the integer residue
# numbers need to be converted to strings first
resi_interest_as_str = [str(position) for position in resi_interest]
cata_triad_sele = ", ".join(resi_interest_as_str[:3])
cata_triad_sele

'296, 345, 441'

In [64]:
# add licorice (AKA sticks) representation, only for the catalytic triad
tmprss2_view.add_representation('licorice', selection=cata_triad_sele)

# center the view on the catalytic triad
tmprss2_view.center(selection=cata_triad_sele)

# show the protein viewer
tmprss2_view

NGLWidget(n_components=1)
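The same string-building step can be reused for the remaining residues of interest. The sketch below wraps it in a small helper; `make_selection` and `binding_sele` are hypothetical names, not part of `nglview`, and the comma-separated format simply mirrors the selection string built above:

```python
def make_selection(positions):
    """Join integer residue numbers into a comma-separated
    selection string, matching the format used above."""
    return ", ".join(str(position) for position in positions)


resi_interest = [296, 345, 441, 435, 223, 224]

# first three entries: catalytic triad; last three: binding residues
cata_triad_sele = make_selection(resi_interest[:3])
binding_sele = make_selection(resi_interest[3:])

print(cata_triad_sele)
print(binding_sele)
```

The resulting `binding_sele` string could then be passed as the `selection` argument to `tmprss2_view.add_representation` and `tmprss2_view.center`, in the same way as the catalytic triad selection.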