# Final Project

Elizabeth Sun, Bioinformatics, Professor Jamie Schiffer, 12/7/2021

### Scientific Question: To what extent are the RNA-dependent RNA polymerase domains from severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and Ebola virus structurally similar or different? 

RNA-dependent RNA polymerases (RdRps) are a class of versatile polymerases found in RNA viruses that are responsible for replicating the viral genome as well as for carrying out transcription. Despite the divergence in their sequences, most RdRps share the same conserved core structural features (Venkataraman et al., 2018).

Remdesivir (GS-5734) is a nucleotide analog prodrug that has been shown to be efficacious against both Ebola virus disease and COVID-19, the disease caused by SARS-CoV-2. A 2016 study using rhesus monkeys as a model organism found that once-daily intravenous administration of Remdesivir protected 100% of animals infected with Ebola virus (EBOV) against Ebola virus disease, ameliorating clinical disease signs and pathophysiological markers (Warren et al., 2016). A clinical study of human patients infected with SARS-CoV-2 found that patients who received Remdesivir had a shorter median recovery time, a higher likelihood of clinical improvement, and lower estimates of mortality than those who received the placebo (Beigel et al., 2020).

The triphosphate form of Remdesivir (RDV-TP) has been shown biochemically to inhibit the RdRps of many viruses, including the two discussed above. Upon binding, it inhibits RdRp activity as a nonobligate delayed chain terminator at position i+3 in the chain. RDV-TP is successful in this task because it is incorporated with uniquely high selectivity over its natural nucleotide counterpart, ATP (Gordon et al., 2020).

For this project, the NCBI Protein database was used to collect the protein sequences of EBOV RdRp and SARS-CoV-2 RdRp. The NCBI Protein database can be accessed through the following url: https://www.ncbi.nlm.nih.gov/protein. According to the website, it is "a collection of sequences from several sources, including translations from annotated coding regions in GenBank, RefSeq, and TPA, as well as records from SwissProt, PIR, PRF, and PDB."

In addition, the Protein Data Bank was used for two tasks. First, it was used to find the electron-microscopy-determined structure of SARS-CoV-2 RdRp. Second, it served as the database for the BLAST search for template structures to use in EBOV RdRp homology structure modeling. The Protein Data Bank can be accessed through the following url: https://www.rcsb.org/. According to the website, it is an archive of "the 3D shapes of proteins, nucleic acids, and complex assemblies."

SWISS-MODEL was another outside resource that was used to search for template structures and to build several homology models of EBOV RdRp based on various templates. SWISS-MODEL can be accessed through the following url: https://swissmodel.expasy.org/. Based on the website, it is a "fully automated protein structure homology-modelling server" that exists to "make protein modelling accessible to all life science researchers." 

### Hypothesis: If the nucleotide analog prodrug Remdesivir can successfully inhibit the spread of both SARS-CoV-2 and Ebola virus in a host organism and if it is known that Remdesivir works by targeting viral RNA-dependent RNA polymerases, then the RNA-dependent RNA polymerase domains of SARS-CoV-2 and Ebola virus must be similar in structure, particularly at the drug's binding site. 

Deriving and understanding the structures of the SARS-CoV-2 RdRp and EBOV RdRp is a necessary first step toward a thorough evaluation of the similarities and differences between the two protein structures. The electron-microscopy-derived structure of SARS-CoV-2 RdRp was found through a search of the Protein Data Bank. This structure was saved to this Jupyter Notebook as a PDB file and displayed using NGLviewer for qualitative analysis.

A search of the Protein Data Bank did not identify an experimentally determined structure of EBOV RdRp. As such, homology structure modeling of EBOV RdRp was done. First, a BLAST search was run in Jupyter Notebook to search for template structures. Several models of EBOV RdRp were then constructed on SWISS-MODEL using a variety of template structures selected from the BLAST search.* The better models were chosen, saved to this Jupyter Notebook as PDB files, and displayed using NGLviewer for qualitative analysis and comparison against the SARS-CoV-2 RdRp structure.

*Homology structure modeling was done using SWISS-MODEL, an outside resource, rather than directly in this Jupyter Notebook because many modeling libraries are incompatible with Jupyter Notebook. For example, the Protein class of ssbio, a Python package for structural systems biology, could not be fully imported into Jupyter Notebook because it relies on the IUPAC module from Bio.Alphabet, which has been removed from Biopython and can no longer be imported even on its own.

### Part 1: Load the Packages

Packages loaded include the following:

• Biopython: This is a freely available open source tool that arose from an international collaboration of volunteer developers. Biopython provides Python libraries for a wide range of bioinformatic problems and analyses. Biopython includes a whole host of modules for performing a variety of tasks relating to computational biology. Some of these tasks include reading and writing different sequence file formats, performing multiple sequence alignments, and dealing with 3D macromolecular structures. Biopython also has the capability of interacting with common tools such as BLAST, ClustalW, and EMBOSS. Additionally, it can access key online databases and provide numerical methods for statistical learning. For more information on Biopython, visit the Biopython tutorial and cookbook page found here: http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec3

• Pytraj: Pytraj is a Python front end to cpptraj that provides tools for interactive data analysis of molecular dynamics simulations. In other words, Pytraj is an extension of the cpptraj program intended to expand the flexibility of cpptraj, exposing cpptraj's functionality to the rest of the Python ecosystem, including numpy, pandas, and matplotlib. Pytraj can be used for basic tasks, including loading files to memory and computing RMSD. The main advantage of using Pytraj is that it can be used in conjunction with several other Python packages as outlined above. Additionally, it supports more than 80 types of data analyses, various file formats, and interactive analysis of large trajectory data that does not fit in memory, in addition to being very fast. For more information on Pytraj, visit the developer's page on Github found here: https://github.com/Amber-MD/pytraj

• NGLviewer: This is a Jupyter/IPython widget for interactively viewing molecular structures in addition to trajectories from molecular dynamics simulations. NGLviewer offers fast and scalable molecular graphics. It supports showing data from the file-system, online databases, and from objects of many popular analysis libraries including mdtraj, pytraj, mdanalysis, rdkit, and more. Unlike many standalone visualization packages like VMD, Chimera, and PyMol, NGLviewer integrates computation and visualization directly in Jupyter Notebook. It is also unique in that it offers the integration of many third-party libraries for computations on molecular data. For more information on NGLviewer, visit the NGLviewer manual: http://nglviewer.org/ngl/api/manual/
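As a rough illustration of the RMSD calculation mentioned in the Pytraj description, the sketch below computes the root-mean-square deviation between two equal-length lists of 3D coordinates in plain Python. It assumes the two structures are already superimposed and that atoms are paired by index. The function name `rmsd` and the coordinates are hypothetical, and this is not a substitute for Pytraj's own RMSD routine.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two paired lists of (x, y, z) tuples."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("Coordinate lists must be non-empty and of equal length")
    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))

# Made-up coordinates: the second set is the first shifted by 1 unit along z
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(reference, shifted))
```

A lower value indicates that the paired atoms sit closer together; a value of zero means the two coordinate sets are identical.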

In [1]:
#Biopython is published on PyPI under the name "biopython"; it is imported as Bio
pip install biopython

Note: you may need to restart the kernel to use updated packages.


In [2]:
#This is necessary for performing the BLAST search 
from Bio.Blast import NCBIWWW

In [3]:
pip install pytraj

Note: you may need to restart the kernel to use updated packages.


In [4]:
pip install nglview

Note: you may need to restart the kernel to use updated packages.


In [5]:
conda install nglview -c conda-forge

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


In [6]:
conda upgrade nglview --force



         See 'conda update --help' for details about the --force-reinstall
         and --clobber flags.


Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /Users/elizabethsun/opt/anaconda3

  added / updated specs:
    - nglview



Preparing transaction: done
Verifying transaction: failed

CondaVerificationError: The package for nglview located at /Users/elizabethsun/opt/anaconda3/pkgs/nglview-3.0.3-pyh8a188c0_0
appears to be corrupted. The path 'etc/jupyter/nbconfig/notebook.d/nglview-js-widgets.json'
specified in the package manifest cannot be found.

CondaVerificationError: The package for nglview located at /Users/elizabethsun/opt/anaconda3/pkgs/nglview-3.0.3-pyh8a188c0_0
appears to be corrupted. The path 'share/jupyter/labextensions/nglview-js-widgets/package.json'
specified in the package manifest cannot be found.

CondaVerificationError: The package for nglview located at /Users/elizabethsun/opt/a

In [17]:
import ipywidgets

In [18]:
import pytraj as pt
import nglview as nv

ModuleNotFoundError: No module named 'nglview.widget'

Explanation for the above ModuleNotFoundError: Although it worked the week before, the import of nglview now fails because Jupyter Notebook cannot find a module named nglview.widget, despite several attempts at installing nglview in the ways shown above (i.e. pip install, conda install, conda upgrade, import ipywidgets, and other methods not shown in this final Jupyter notebook). This error may have been introduced through attempts at correcting another persistent widget-display error.
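One way to keep a failed import from cascading into later cells is to guard it. The sketch below is a minimal illustration using only the standard library; the helper name `optional_import` is hypothetical and is not part of nglview or Jupyter.

```python
import importlib

def optional_import(module_name):
    """Return the imported module, or None if it cannot be imported."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:  # ModuleNotFoundError is a subclass of ImportError
        print(f"Could not import {module_name}: {err}")
        return None

# nv is either the nglview module or None, so later cells can test it explicitly
nv = optional_import("nglview")
if nv is None:
    print("Structure widgets are unavailable; skipping the display cells.")
```

With this pattern, a display cell can check `if nv is not None:` before calling into the viewer, so a broken installation produces one clear message instead of a NameError in every subsequent cell.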

### Part 2A: Load in the sequence data and perform a BLAST Search for template structures (first bioinformatics method)

Homology structure modeling refers to the construction of a model of a protein with a known sequence and an unknown structure, based on at least one related protein with both a known sequence and a known structure. The quality of a homology structure model depends on the degree of similarity between the target sequence and the template sequences. Because of this, a BLAST search needs to be done to find sequences with the highest similarities to the target sequence. BLAST stands for Basic Local Alignment Search Tool. The tool is designed to compare nucleotide or protein sequences to sequence databases in order to find regions of similarity between biological sequences and ultimately output a list of sequences ordered from highest to lowest similarity to a target sequence.
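To make "degree of similarity" concrete, the sketch below computes percent identity between two sequences that have already been aligned. It is an illustration under stated assumptions: the function name `percent_identity`, the gap character, and the two toy sequences are made up, and identity is counted over columns where neither sequence has a gap, which is only one of several conventions in use.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of gap-free aligned columns in which both residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("Aligned sequences must have the same length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # skip columns that contain a gap in either sequence
        compared += 1
        if res_a == res_b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

# Toy alignment (not real EBOV or template sequence data)
target = "MKT-LLVAG"
template = "MKSALLV-G"
print(round(percent_identity(target, template), 1))
```

Templates with higher identity to the target generally give more reliable models, which is why the BLAST hit list is a natural starting point for choosing them.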

A FASTA file is a text file used for representing either nucleotide sequences or amino acid (protein) sequences. In a FASTA file, each sequence begins with a single-line description, containing a > symbol in the first column and the sequence name. This is followed by lines of sequence data. 
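The format described above can be read with a few lines of plain Python. The sketch below is a minimal parser for illustration only; the function name `parse_fasta` and the example records are hypothetical. In practice, Biopython's `SeqIO.parse(handle, "fasta")` handles this format.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Made-up two-record example
example = ">demo_protein hypothetical example\nMKTLLV\nAGGS\n>second\nMMA\n"
records = parse_fasta(example)
for name, sequence in records:
    print(name, len(sequence))
```

Because the parser only starts a record at a line beginning with `>`, text in which no line starts with `>` returns an empty list. That makes it a cheap way to notice a file that is not plain-text FASTA, such as one saved through a rich-text editor.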

In the following code, protein sequence data from a FASTA file is read in. This data is displayed in an output. A BLAST search is conducted with the qblast method from NCBIWWW (imported above) and the output is stored in result_handle. The result_handle is then read and the results are written to a file titled "EBOVresults.xml," saved alongside the Jupyter Notebook.

In [7]:
#Read in the file containing EBOV RdRp protein sequence data taken from the NCBI Protein Database
#Note: the file was saved as Rich Text (.rtf), so the string also contains RTF markup around the FASTA record
#Define the global variable sequence_data and display contents in output
sequence_data = open("EBOVRdRp.fasta.rtf").read()
sequence_data

"{\\rtf1\\ansi\\ansicpg1252\\cocoartf2513\n\\cocoatextscaling0\\cocoaplatform0{\\fonttbl\\f0\\fmodern\\fcharset0 Courier;}\n{\\colortbl;\\red255\\green255\\blue255;\\red0\\green0\\blue0;\\red255\\green255\\blue255;}\n{\\*\\expandedcolortbl;;\\cssrgb\\c0\\c0\\c0;\\cssrgb\\c100000\\c100000\\c100000;}\n\\margl1440\\margr1440\\vieww10800\\viewh8400\\viewkind0\n\\deftab720\n\\pard\\pardeftab720\\partightenfactor0\n\n\\f0\\fs26 \\cf2 \\cb3 \\expnd0\\expndtw0\\kerning0\n\\outl0\\strokewidth0 \\strokec2 >sp|Q05318.2|L_EBOZM RecName: Full=RNA-directed RNA polymerase L; Short=Protein L; AltName: Full=Large structural protein; AltName: Full=Replicase; AltName: Full=Transcriptase; Includes: RecName: Full=RNA-directed RNA polymerase; Includes: RecName: Full=GTP phosphohydrolase; Includes: RecName: Full=GDP polyribonucleotidyltransferase; AltName: Full=PRNTase; Includes: RecName: Full=mRNA cap methyltransferase; AltName: Full=mRNA (guanine-N(7)-)-methyltransferase; Short=G-N7-MTase; AltName: Full=mR

In [8]:
#Perform Protein BLAST search on sequence_data using Protein Data Bank as the database that is sourced
#Define global variable result_handle to store results of BLAST
#Display result_handle in output
result_handle = NCBIWWW.qblast("blastp", "pdb", sequence_data)
result_handle

<_io.StringIO at 0x7fae3c3b2790>

In [9]:
#Read in result_handle and store in newly defined global variable blast_results
#Store contents of blast_results in file titled "EBOVresults.xml," saved into your Jupyter Notebook
with open('EBOVresults.xml', 'w') as save_file: 
    blast_results = result_handle.read()
    save_file.write(blast_results)
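Once the results are saved, the hit list can be summarized programmatically. The sketch below reads BLAST XML with the standard library's `xml.etree.ElementTree` and reports each hit's rank, accession, E-value, and percent identity from the first reported HSP. The function name `summarize_hits` and the embedded sample XML are hypothetical and heavily trimmed; they are not the notebook's actual results. Biopython's `Bio.Blast.NCBIXML` module offers a fuller parser for the same format.

```python
import xml.etree.ElementTree as ET

def summarize_hits(xml_text):
    """Return one summary dict per <Hit> element in BLAST XML output."""
    root = ET.fromstring(xml_text)
    hits = []
    for hit in root.iter("Hit"):
        hsp = hit.find("Hit_hsps/Hsp")  # first reported HSP for this hit
        if hsp is None:
            continue
        identities = int(hsp.findtext("Hsp_identity"))
        align_len = int(hsp.findtext("Hsp_align-len"))
        hits.append({
            "rank": int(hit.findtext("Hit_num")),
            "accession": hit.findtext("Hit_accession"),
            "definition": hit.findtext("Hit_def"),
            "evalue": float(hsp.findtext("Hsp_evalue")),
            "percent_identity": 100.0 * identities / align_len,
        })
    return hits

# Made-up, minimal sample in the BLAST XML layout (placeholder accessions)
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations><Iteration><Iteration_hits>
    <Hit>
      <Hit_num>1</Hit_num>
      <Hit_def>Hypothetical template A</Hit_def>
      <Hit_accession>XXXX_A</Hit_accession>
      <Hit_hsps><Hsp>
        <Hsp_evalue>1e-50</Hsp_evalue>
        <Hsp_identity>90</Hsp_identity>
        <Hsp_align-len>120</Hsp_align-len>
      </Hsp></Hit_hsps>
    </Hit>
    <Hit>
      <Hit_num>2</Hit_num>
      <Hit_def>Hypothetical template B</Hit_def>
      <Hit_accession>YYYY_B</Hit_accession>
      <Hit_hsps><Hsp>
        <Hsp_evalue>2e-10</Hsp_evalue>
        <Hsp_identity>50</Hsp_identity>
        <Hsp_align-len>200</Hsp_align-len>
      </Hsp></Hit_hsps>
    </Hit>
  </Iteration_hits></Iteration></BlastOutput_iterations>
</BlastOutput>"""

hits = summarize_hits(SAMPLE_XML)
for hit in hits:
    print(hit["rank"], hit["accession"], hit["evalue"], hit["percent_identity"])
```

To apply the same function to the saved file, the text of "EBOVresults.xml" could be read with `open("EBOVresults.xml").read()` and passed in, assuming the file is in the default XML format returned by qblast.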

### Part 2B: Create homology structure models based on the top hits from BLAST search (second bioinformatics method)

The top two sequence results from the above BLAST were used as template structures for the first two homology structure models constructed. The top two sequences were the RNA methyltransferase of Sudan Ebola virus and the RNA-directed RNA polymerase L from Parainfluenza virus 5 L-P complex. 

A PDB file consists of several lines of information about a protein in a text file. Each line of information contains different types of records, arranged in a specific order to describe a structure in words. The PDB files of the X-ray-diffraction-derived structure of the former protein and the electron-microscopy-derived structure of the latter protein were downloaded. These two PDB files in addition to the EBOV RdRp FASTA file were uploaded to SWISS-MODEL to build the two models. These two models are displayed below in Part 3 of the notebook. 
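The fixed-column layout of a PDB record can be illustrated with a short parser for ATOM lines. This is a sketch only: the function name `parse_atom_record` is hypothetical, the sample record uses made-up coordinates, and only a subset of the fields is extracted. Libraries such as Biopython's `Bio.PDB` handle the full format.

```python
def parse_atom_record(line):
    """Parse one fixed-width ATOM/HETATM line into a dict, or return None."""
    if not line.startswith(("ATOM", "HETATM")):
        return None
    return {
        "serial": int(line[6:11]),         # columns 7-11
        "atom_name": line[12:16].strip(),  # columns 13-16
        "res_name": line[17:20].strip(),   # columns 18-20
        "chain_id": line[21],              # column 22
        "res_seq": int(line[22:26]),       # columns 23-26
        "x": float(line[30:38]),           # columns 31-38
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
    }

# Illustrative record assembled field by field so the column widths are explicit
sample_line = (
    "ATOM  "            # record name, columns 1-6
    + f"{1:>5}"         # atom serial number
    + " " + " N  "      # blank column, then atom name
    + " " + "MET"       # alternate location (blank), residue name
    + " " + "A"         # blank column, chain identifier
    + f"{1:>4}"         # residue sequence number
    + " " * 4           # insertion code plus three unused columns
    + f"{27.340:8.3f}{24.430:8.3f}{2.614:8.3f}"  # x, y, z coordinates
)
record = parse_atom_record(sample_line)
print(record)
```

Each ATOM line describes one atom, so collecting these records for a whole file yields the coordinates that a viewer uses to draw the structure.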

### Part 2C: Create homology structure models based on the lower hits from BLAST search (second bioinformatics method)

Upon building the first two models, the SWISS-MODEL Homology Reports for the first two models were read and evaluated. The Global Model Quality Estimation (GMQE) values in particular were evaluated. GMQE is expressed as a number between 0 and 1 and predicts the expected accuracy of a model built with that alignment and that template as indicated by the coverage of the target sequence. Higher GMQEs indicate higher reliability. To learn more about GMQE and other numerical values listed in a typical SWISS-MODEL Homology Report, visit this tutorial: https://www.bonvinlab.org/education/molmod_online/modelling/ 

The GMQEs of the two models built on the RNA methyltransferase of Sudan Ebola virus and the RNA-directed RNA polymerase L from Parainfluenza virus 5 L-P complex were 0.08 and 0.33, respectively.

Further homology structure modeling was done in an attempt to obtain higher GMQE values. A number of models were constructed based on lower hits from the BLAST search, but two models in particular were chosen for this notebook because of their higher GMQE values. The first was the model based on the cryo-EM-determined Rabies SAD-B19 L-P complex. This structure was the 6th hit on BLAST and the model had a GMQE of 0.36. The second was the model based on the cryo-EM-determined Vesicular Stomatitis Virus L protein. This structure was the 7th hit on BLAST and the model had a GMQE of 0.36. The same protocol, as outlined above, was used to make these models on SWISS-MODEL. These two additional models are displayed below in Part 3 of the notebook.
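The four GMQE values reported above can be ranked in a few lines. The sketch below uses the values stated in this notebook; the cutoff of 0.30 is an illustrative assumption chosen only to separate the lowest-scoring model from the rest, not a threshold defined by SWISS-MODEL or used elsewhere in this project.

```python
# GMQE values as reported in this notebook, keyed by template
gmqe_by_template = {
    "Sudan Ebola virus RNA methyltransferase": 0.08,
    "Parainfluenza virus 5 L-P complex": 0.33,
    "Rabies SAD-B19 L-P complex": 0.36,
    "Vesicular Stomatitis Virus L protein": 0.36,
}

GMQE_CUTOFF = 0.30  # assumed illustrative threshold, not from the source

# Sort from highest to lowest GMQE; ties keep their original order
ranked = sorted(gmqe_by_template.items(), key=lambda item: item[1], reverse=True)
kept = [name for name, score in ranked if score >= GMQE_CUTOFF]

for name, score in ranked:
    status = "kept" if score >= GMQE_CUTOFF else "set aside"
    print(f"{score:.2f}  {name}  ({status})")
```

Ranking the models this way makes explicit which ones carry forward into the structural comparison.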

### Part 3: Display all of the protein structures using NGLviewer (plotting method)

The PDB file of the electron-microscopy-derived structure of SARS-CoV-2 RdRp was downloaded from the Protein Data Bank. In addition, the PDB files of all four homology models described above were downloaded from SWISS-MODEL. These files were then uploaded into Jupyter Notebook. The following code reads in each of these PDB files and displays them in an interactive widget. Snapshots of each of these models were taken from similar perspectives, one of which shows the active site, and then displayed in outputs.

#### SARS-CoV-2 Structure from PDB

In [19]:
#Define a global trajectory variable traj that loads the pdb file that holds the model
#Define a global variable view that shows traj in a widget
traj = pt.load('6m71.pdb')
view = nv.show_pytraj(traj)
view

NameError: name 'nv' is not defined

In [20]:
#render_image() outputs the above structure from the exact perspective selected using interactive widget
view.render_image()

NameError: name 'view' is not defined

In [21]:
#_display_image() prints the rendered image
view._display_image()

NameError: name 'view' is not defined

#### EBOV RdRp Model based on RNA methyltransferase of Sudan Ebola Virus

In [10]:
#Define a global trajectory variable traj that loads the pdb file that holds the model
#Define a global variable view that shows traj in a widget
traj = pt.load('modelmethyltransferase.pdb')
view = nv.show_pytraj(traj)
view

NameError: name 'nv' is not defined

#### EBOV RdRp Model based on RNA-directed RNA polymerase L from Parainfluenza virus 5 L-P complex

In [8]:
#Define a global trajectory variable traj that loads the pdb file that holds the model
#Define a global variable view that shows traj in a widget
traj = pt.load('modelparainfluenza.pdb')
view = nv.show_pytraj(traj)
view

NameError: name 'nv' is not defined

In [11]:
#render_image() outputs the above structure from the exact perspective selected using interactive widget
view.render_image()

NameError: name 'view' is not defined

In [12]:
#_display_image() prints the rendered image
view._display_image()

NameError: name 'view' is not defined

#### EBOV RdRp Model based on Rabies SAD-B19 L-P complex

In [9]:
#Define a global trajectory variable traj that loads the pdb file that holds the model
#Define a global variable view that shows traj in a widget
traj = pt.load('modelrabies.pdb')
view = nv.show_pytraj(traj)
view

NameError: name 'nv' is not defined

In [13]:
#render_image() outputs the above structure from the exact perspective selected using interactive widget
view.render_image()

NameError: name 'view' is not defined

In [14]:
#_display_image() prints the rendered image
view._display_image()

NameError: name 'view' is not defined

#### EBOV RdRp Model based on Vesicular Stomatitis Virus L Protein

In [10]:
#Define a global trajectory variable traj that loads the pdb file that holds the model
#Define a global variable view that shows traj in a widget
traj = pt.load('modelvesicular.pdb')
view = nv.show_pytraj(traj)
view

NameError: name 'nv' is not defined

In [15]:
#render_image() outputs the above structure from the exact perspective selected using interactive widget
view.render_image()

NameError: name 'view' is not defined

In [16]:
#_display_image() prints the rendered image
view._display_image()

NameError: name 'view' is not defined

Explanation for NameErrors: nv is not defined because nglview could not be imported as nv. See the error explanation under "Part 1: Load the Packages" above for more information. Because nv is undefined, view cannot be defined either.

Additional errors once nglviewer is imported: Although it worked the week before, nglviewer no longer displays interactive widgets, likely because of a bug introduced into the environment. Various efforts to fix this over the course of the last week have been unsuccessful.

### Part 4: Analysis of the results

The above protein structures were successfully displayed in a Jupyter notebook the previous week, before a series of unexplained and unresolved errors arose. The analysis below is based entirely on memory of those earlier structures, since the most recent code could not display them.

The active sites of SARS-CoV-2 RdRp and EBOV RdRp were located by rotating the interactive widgets in which these structures were displayed. The EBOV RdRp active site was identified using the latter three models, which had relatively higher GMQE values. The first model, built on the RNA methyltransferase of Sudan Ebola Virus, was excluded from the analysis because of its very low GMQE value of 0.08. This low value likely stems from the fact that its template was the only non-RdRp template among those used above.

Qualitative, visual analysis of the active sites of SARS-CoV-2 RdRp and the three EBOV RdRp models chosen showed that the active site pocket of the SARS-CoV-2 RdRp appeared noticeably larger than that of the predicted EBOV RdRp. This larger active site may suggest that SARS-CoV-2 RdRp is more conducive to binding not only by Remdesivir but also by a host of existing and investigational drugs (Vatansever et al., 2021). Apart from this difference, the two structures look fairly similar. Both the SARS-CoV-2 RdRp and the predicted EBOV RdRp share the conserved domains found in most other viral RdRps outside of the active site. These include the finger, thumb, and palm domains, which assist in binding RdRp to nucleotides as well as to the nucleotide analog Remdesivir.