In [None]:
from Bio import PDB

# Structural Similarity Demo
____
In the [sequence similarity demo](../seq_sim_demo/20200706_seq_sim.ipynb), we explored the differences in the primary sequence (amino acid sequence) of TMPRSS2 between human and other animal homologs. We identified some amino acid properties (amino acid volume, hydrophobicity, charge/acidity) and constructed an algorithm for determining if a mutation at a single site has a large, moderate, or minimal effect, based on changes in these properties.

It is natural to ask the question: **why do these amino acid properties matter, and how does changing these properties affect the structure and function of the protein?** This is the question we will be answering in this structural similarity demo. We will split this question into 2-3 separate parts:
1. Qualitatively explore how amino acid properties affect 2º and 3º structure of the protein, and how they might affect function.
2. Develop a semi-quantitative model that explains how single site amino acid mutations could affect the overall function of the TMPRSS2 protein, compared to the human homolog. We will do this by extending the sequence iterator algorithm that we developed in the [sequence similarity demo](../seq_sim_demo/20200706_seq_sim.ipynb).
3. **Optional**: compare our semi-quantitative model to other variant -> function models in the literature, such as [SIFT](http://www.sbg.bio.ic.ac.uk/~phyre2/html/page.cgi?id=index) and [PhyreRisk](http://phyrerisk.bc.ic.ac.uk/search?action=fresh-search&searchTerm=O15393).

# Part #1
____
## Intro to Bio.PDB and NGL
In order to view protein structures inside of Jupyter notebooks, we will use a molecular structure viewer called NGL. It has a handy `nglview` extension that supports Jupyter notebooks ([GitHub](https://github.com/arose/nglview)).

In [None]:
import nglview

Using `nglview` is simple: we can load a protein structure directly from the [RCSB](http://rcsb.org) by its PDB ID:

In [None]:
# load "3pqr" from RCSB PDB and display viewer
rcsb_view = nglview.show_pdbid("3pqr") 
rcsb_view

### NGL viewer key bindings
* **Rotate molecule**: left click and drag
* **Zoom**: scroll up/down
* **Translate molecule**: right click and drag

We can also load protein structures that are stored locally as `*.pdb` files. In this example, we use a PDB file that was generated using the [Swiss-Model homology model generator](https://swissmodel.expasy.org/repository/uniprot/O15393). It is the **estimated** structure of human TMPRSS2 isoform 1:

In [None]:
!ls ../../data/structures/

In [None]:
tmprss2_view = nglview.show_file(
    '../../data/structures/O15393_swiss_model.pdb')
tmprss2_view

Another cool feature of `nglview` is that it has Biopython support: [Bio.PDB](https://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ) structures can be loaded directly. Let's try loading our Swiss-Model structure this way:

In [None]:
# instantiate a BioPython PDBParser
parser = PDB.PDBParser()

# use the PDBParser to load our PDB file
# into a Bio.PDB.Structure.Structure instance
bio_pdb_structure = parser.get_structure(
    "protein", '../../data/structures/O15393_swiss_model.pdb')

# load the Bio.PDB.Structure.Structure into nglview
bio_pdb_view = nglview.show_biopython(bio_pdb_structure)
bio_pdb_view

## Looking at the catalytic triad
In the [sequence similarity demo](../seq_sim_demo/20200706_seq_sim.ipynb), we compiled a list of residues that the literature implicates as important in binding and cleaving the S protein. These residues include the catalytic triad (H296, D345 and S441) and important binding residues (D435, K223, and K224).

The numbering of these amino acids is for isoform 1, which is shorter than isoform 2. In the [sequence similarity demo](../seq_sim_demo/20200706_seq_sim.ipynb), we were looking at the sequence of isoform 2, so we had to adjust these positions by 37 amino acids. Here, we are looking at the structure of isoform 1, so we do not need to make this adjustment. `nglview` also uses the standard 1-indexing for residue numbering, so we do not need to subtract 1 to get the index in a 0-indexed list, as we did in the [sequence similarity demo](../seq_sim_demo/20200706_seq_sim.ipynb).
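The two adjustments described above can be sketched as a pair of small helpers. This is only an illustration: the function names are hypothetical, and the sketch assumes the 37 extra residues of isoform 2 come before the first residue of isoform 1, so that the offset is added when going from isoform 1 to isoform 2.

```python
# Offset between isoform 1 and isoform 2 numbering (from the text above).
# Assumption: the extra residues precede isoform 1's first residue,
# so isoform 2 position = isoform 1 position + 37.
ISOFORM_2_OFFSET = 37


def isoform1_to_isoform2(position):
    """Map a 1-indexed isoform 1 position to a 1-indexed isoform 2 position."""
    return position + ISOFORM_2_OFFSET


def to_list_index(position):
    """Map a 1-indexed residue position to a 0-indexed Python index."""
    return position - 1


# catalytic triad in isoform 1 numbering (the numbering used in this demo)
catalytic_triad = [296, 345, 441]

for pos in catalytic_triad:
    iso2_pos = isoform1_to_isoform2(pos)
    # isoform 1 position, isoform 2 position, index into an isoform 2 sequence
    print(pos, iso2_pos, to_list_index(iso2_pos))
```

In this notebook neither helper is needed, because the structure uses isoform 1 numbering and `nglview` selections are 1-indexed.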

Let's start by converting the list of important residues to a Python list:

In [None]:
resi_interest = [296, 345, 441, 435, 223, 224]

We will first look at the H296 residue in the catalytic triad. We can zoom in and show this residue in the [sticks representation](https://github.com/arose/nglview#representations) in `nglview`:

In [None]:
# add licorice (AKA sticks) representation, only for residue 296
tmprss2_view.add_representation('licorice', selection='296')

# center the view on residue 296
tmprss2_view.center(selection='296')

# show the protein viewer
tmprss2_view

We can easily change the `selection` parameter so that we zoom in and show sticks for all three residues in the catalytic triad:

In [None]:
# define the catalytic triad selection; the integer
# residue numbers need to be converted to strings first
resi_interest_as_str = [str(position) for position in resi_interest]
cata_triad_sele = ", ".join(resi_interest_as_str[:3])
cata_triad_sele

## Exercise
* Turn the above cell into a function that takes a list and returns a string

In [None]:
# add licorice (AKA sticks) representation, only for the catalytic triad
tmprss2_view.add_representation('licorice', selection=cata_triad_sele)

# center the view on the catalytic triad
tmprss2_view.center(selection=cata_triad_sele)

# show the protein viewer
tmprss2_view

# What is the effect of changing an amino acid at a single site?
To start answering this question, let's look at a few residues that are important for TMPRSS2's enzymatic activity and its stability.

To do this, I used the [mutagenesis tool in PyMOL](https://pymolwiki.org/index.php/Mutagenesis) to make single-site amino acid substitutions at various positions. I then saved the mutated structure as a new PDB file, `O15393_swiss_model_mut1.pdb`.

In [None]:
# residues of interest around K224 and H225
lys_mut = [224, 220, 219, 225, 222]

In [None]:
# load the mutated structure
mut_view = nglview.show_file(
    '../../data/structures/O15393_swiss_model_mut1.pdb')

# load the wild type structure in the same view
mut_view.add_component(
    '../../data/structures/O15393_swiss_model.pdb')

# remove representations for the wild type structure
# we will add custom representations ourselves
mut_view.component_1.clear_representations()

# for the mutated structure (component_0), add licorice representation
# for the mutated residues and their neighbors
mut_view.component_0.add_licorice(selection='224, 220, 219, 225, 222')

# for the WT structure (component_1), show the original residues
# at the mutated positions (224 and 225) in cyan
mut_view.component_1.add_licorice(selection='224, 225', color='cyan')

# zoom to this area
mut_view.center(selection='224, 220, 219, 225, 222')

# show the view
mut_view

In [None]:
# show only the first (0) or second (1) structure
# mut_view.show_only(indices=[0])
# mut_view.show_only(indices=[1])

## Questions
1. How do you think the `K224L` mutation would affect stability or function of the protein?
2. What about the `H225L` mutation?

# What happens if you mutate a residue in the catalytic triad?

In [None]:
cata_triad_sele

In [None]:
cat_triad_mut_view = nglview.show_file(
    '../../data/structures/O15393_swiss_model_mut1.pdb')
cat_triad_mut_view.add_component(
    '../../data/structures/O15393_swiss_model.pdb')

cat_triad_mut_view.component_1.clear_representations()
cat_triad_mut_view.component_0.add_licorice(selection=cata_triad_sele)
cat_triad_mut_view.component_1.add_licorice(selection='441', color='cyan')
cat_triad_mut_view.center(selection=cata_triad_sele)

cat_triad_mut_view

## Questions
1. What is the effect of the `S441G` mutation on enzyme activity?

# Disrupting a disulfide

In [None]:
disulf_mut_view = nglview.show_file(
    '../../data/structures/O15393_swiss_model_mut1.pdb')
disulf_mut_view.add_component(
    '../../data/structures/O15393_swiss_model.pdb')

disulf_mut_view.component_1.clear_representations()
disulf_mut_view.component_0.add_licorice(selection='172, 231')
disulf_mut_view.component_1.add_licorice(selection='231', color='cyan')
disulf_mut_view.center(selection='172, 231')

disulf_mut_view
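One way to check numerically whether two cysteines are close enough to be bonded is to measure the distance between their sulfur (SG) atoms. The sketch below is a minimal illustration, not part of the original analysis: the function names, the 2.5 Å cutoff, and the coordinates are all made up for demonstration (a typical S-S bond is roughly 2.05 Å long). With Bio.PDB, real coordinates could be read from each atom's `coord` attribute.

```python
import math

# Assumed cutoff (in angstroms) for calling two sulfur atoms bonded.
# A typical S-S bond is roughly 2.05 angstroms; 2.5 leaves some slack
# for the uncertainty in a homology model.
DISULFIDE_CUTOFF = 2.5


def sg_distance(sg_a, sg_b):
    """Euclidean distance between two (x, y, z) sulfur positions."""
    return math.dist(sg_a, sg_b)


def looks_like_disulfide(sg_a, sg_b, cutoff=DISULFIDE_CUTOFF):
    """Return True if the two sulfur atoms are within the cutoff distance."""
    return sg_distance(sg_a, sg_b) <= cutoff


# Made-up coordinates for illustration only; these are NOT taken
# from the TMPRSS2 model.
sg_172 = (10.0, 12.0, 8.0)
sg_231 = (11.2, 13.1, 9.2)

print(round(sg_distance(sg_172, sg_231), 2))
print(looks_like_disulfide(sg_172, sg_231))
```

Running the same check on the mutated structure would fail at the lookup step, because an alanine side chain has no SG atom to measure from.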

## Questions
1. What is the predicted effect of the `C231A` mutation on protein stability?

## Exercise
1. Another important residue in this structure is `D435`, which [Meng et al.](https://www.biorxiv.org/content/10.1101/2020.02.08.926006v3.full) suggest is important for binding the S protein substrate. What is the structural context of `D435`? Show the context using `nglview` or PyMOL.
2. Qualitatively, what would be the expected effect on TMPRSS2 stability and function if we mutated `D435` to each of the following amino acids:
    1. Glycine
    2. Tryptophan
    3. Asparagine
    
Optionally, show a detailed comparison of each mutation using PyMOL's `mutagenesis` tool.