# Bioinformatics Platform using JupyterLab

Author: Robert Bradford

This is a thesis project that uses JupyterLab to support the study of the BIOC-4010 course. This tutorial implements the following basic modules using Biopython and NGLView:
1. BLAST
2. Display a 3D structure

Requirements:

To run this notebook successfully, it is recommended to use Miniconda + JupyterLab and install the required packages and extensions. The notebook should also work on [Google Colab](https://colab.research.google.com/) or [Binder](https://jupyter.org/binder), but this has not been tested.

The following packages are required and can be installed using conda. It is recommended to create a new environment and install these packages into it. You can use the [nglview-jupyterlab.sh script](https://github.com/nglviewer/nglview/blob/master/devtools/nglview-jupyterlab.sh) to install the nglview-related packages.
* python 3.8+
* jupyterlab 2.1+
* biopython 1.75+ (the `Bio.Align.substitution_matrices` module used below requires 1.75 or later)
* ipywidgets 7.5+
* nodejs 12.0.0+, required for the jupyter-labextensions
* nglview 2.7+
    * if you do not use the nglview-jupyterlab.sh script, run the following two commands manually after you install jupyterlab
    * `jupyter-labextension install @jupyter-widgets/jupyterlab-manager`
    * `jupyter-labextension install nglview-js-widgets@$nglviewversion` where `$nglviewversion` is the version of the `nglview` installed package, which can be inspected with `conda list nglview`.
    

This notebook has been tested on:
1. miniconda3

In [1]:
!pip3 install jupyterlab
!pip3 install biopython
!pip3 install ipywidgets
!pip3 install nglview
!jupyter-labextension install @jupyter-widgets/jupyterlab-manager
!jupyter-labextension install nglview-js-widgets@$nglviewversion



You should consider upgrading via the '/Library/Frameworks/Python.framework/Versions/3.9/bin/python3.9 -m pip install --upgrade pip' command.
Building jupyterlab assets (production, minimized)
An error occurred.
RuntimeError: JupyterLab failed to build
See the log file for details:  /var/folders/xn/_8hvmw8d3tsfm77f_6kvmy180000gn/T/jupyterlab-debug-abc1rkxn.log
Building jupyterlab assets (production, minimized)
An error occurred.
RuntimeError: JupyterLab failed to build
See the log file for details:  /var/folders/xn/_8hvmw8d3tsfm77f_6kvmy180000gn/T/jupyterlab-debug-5zahd4wl.log


## Setup environment

In [2]:
import Bio
from Bio.Seq import Seq
import ipywidgets as widgets
from Bio.PDB import PDBList, MMCIFParser  # Import only the two classes this notebook uses.
import os
import sys
import nglview as nv
import platform
print("Python version",sys.version_info)
print("Biopython version", Bio.__version__)



Python version sys.version_info(major=3, minor=9, micro=6, releaselevel='final', serial=0)
Biopython version 1.79


## Chapter 1 Substitution Matrices
One of the fundamental operations in bioinformatics is comparing two sequences, either nucleic acids or proteins, and lining them up to achieve the maximal level of identity. This is called _**pairwise sequence alignment**_. Quantifying the similarity of the two sequences is important for establishing whether they are homologs. Two widely used methods have been developed to quantify this similarity.

- Method 1: align closely related homologs and count the frequencies of amino acid substitutions.
- Method 2: use a database of aligned sequences derived from protein domains that have a particular structure or function, and record the frequencies of amino acid substitutions.

These two methods gave rise to the *PAM* and *BLOSUM* series of amino acid substitution matrices, respectively. These substitution matrices are used in many sequence alignment tools.
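
To make the idea of counting substitution frequencies more concrete, here is a minimal sketch of how a log-odds score could be computed from observed pair counts. The counts are made-up toy numbers for a two-letter alphabet, and the helper name `log_odds_scores` is our own. The real PAM and BLOSUM procedures include additional steps that are not shown here.

```python
import math
from collections import Counter


def log_odds_scores(pair_counts, scale=2.0):
    """Turn observed aligned-pair counts into rounded log-odds scores.

    pair_counts maps an unordered residue pair such as ("A", "S") to the
    number of times the two residues were seen aligned to each other.
    scale=2.0 expresses the scores in half-bit units.
    """
    total = sum(pair_counts.values())
    # Observed frequency of each unordered pair.
    observed = {pair: n / total for pair, n in pair_counts.items()}
    # Background frequency of each residue, derived from the pairs.
    background = Counter()
    for (a, b), freq in observed.items():
        background[a] += freq / 2
        background[b] += freq / 2
    scores = {}
    for (a, b), freq in observed.items():
        # Frequency expected if residues were paired purely by chance.
        if a == b:
            expected = background[a] * background[b]
        else:
            expected = 2 * background[a] * background[b]
        scores[(a, b)] = round(scale * math.log2(freq / expected))
    return scores


# Toy counts (hypothetical data, not taken from any real alignment).
toy_counts = {("A", "A"): 60, ("S", "S"): 30, ("A", "S"): 10}
toy_scores = log_odds_scores(toy_counts)
print(toy_scores)
```

Pairs that occur more often than chance would predict receive positive scores, and pairs that occur less often receive negative scores.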



In [3]:
# This scriptlet displays the BLOSUM62 and PAM250 matrices.
from Bio.Align import substitution_matrices as smatrices
blosum62 = smatrices.load("BLOSUM62")
pam250 = smatrices.load("PAM250")
print(blosum62)
print("-"*80)
print(pam250)

#  Matrix made by matblas from blosum62.iij
#  * column uses minimum score
#  BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
#  Blocks Database = /data/blocks_5.0/blocks.dat
#  Cluster Percentage: >= 62
#  Entropy =   0.6979, Expected =  -0.5209
     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V    B    Z    X    *
A  4.0 -1.0 -2.0 -2.0  0.0 -1.0 -1.0  0.0 -2.0 -1.0 -1.0 -1.0 -1.0 -2.0 -1.0  1.0  0.0 -3.0 -2.0  0.0 -2.0 -1.0  0.0 -4.0
R -1.0  5.0  0.0 -2.0 -3.0  1.0  0.0 -2.0  0.0 -3.0 -2.0  2.0 -1.0 -3.0 -2.0 -1.0 -1.0 -3.0 -2.0 -3.0 -1.0  0.0 -1.0 -4.0
N -2.0  0.0  6.0  1.0 -3.0  0.0  0.0  0.0  1.0 -3.0 -3.0  0.0 -2.0 -3.0 -2.0  1.0  0.0 -4.0 -2.0 -3.0  3.0  0.0 -1.0 -4.0
D -2.0 -2.0  1.0  6.0 -3.0  0.0  2.0 -1.0 -1.0 -3.0 -4.0 -1.0 -3.0 -3.0 -1.0  0.0 -1.0 -4.0 -3.0 -3.0  4.0  1.0 -1.0 -4.0
C  0.0 -3.0 -3.0 -3.0  9.0 -3.0 -4.0 -3.0 -3.0 -1.0 -1.0 -3.0 -1.0 -2.0 -3.0 -1.0 -1.0 -2.0 -2.0 -1.0 -3.0 -3.0 -2.0 -4.0
Q -1.0  1.0  0.0  0.
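
As a minimal sketch of how such a matrix is used, the following scores an ungapped alignment by summing the matrix entry for each aligned pair. To stay self-contained, it copies only the A, R, N, D and C entries from the BLOSUM62 output above into a plain dictionary. The helper `score_ungapped` and the two short peptides are illustrative assumptions; a real workflow would look the values up in the loaded `blosum62` object instead.

```python
# Subset of BLOSUM62 copied from the printed matrix (residues A, R, N, D, C only).
residues = "ARNDC"
rows = [
    [4, -1, -2, -2, 0],   # A
    [-1, 5, 0, -2, -3],   # R
    [-2, 0, 6, 1, -3],    # N
    [-2, -2, 1, 6, -3],   # D
    [0, -3, -3, -3, 9],   # C
]
blosum62_subset = {
    (a, b): rows[i][j]
    for i, a in enumerate(residues)
    for j, b in enumerate(residues)
}


def score_ungapped(seq1, seq2, matrix):
    """Sum the substitution scores of two equal-length, already aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("Sequences must have the same length for an ungapped alignment.")
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))


# Toy peptides (hypothetical, chosen only to use the five residues above).
identical = score_ungapped("ACDN", "ACDN", blosum62_subset)
diverged = score_ungapped("ACDN", "ACNR", blosum62_subset)
print(identical, diverged)
```

The identical pair scores higher than the diverged pair because every mismatch replaces a large diagonal value with a smaller off-diagonal one.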

## Chapter 2 Molecular Graphics

In this section, we use NGLView to display a structure of interest.


In [4]:
#In this block, the user will be prompted to enter the name of the pdb code that they would like to look at.
text_box = input("Please enter the PDB code of the structure: ").strip()  # input() already returns a string

Please enter the PDB code of the structure: 101
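
The code entered above, `101`, has only three characters, and the download in the next cell fails for it. A small check before downloading can catch this early. The following is a sketch under the assumption that the classic PDB ID format applies (four characters: a digit followed by three letters or digits). The names `PDB_ID_PATTERN` and `is_valid_pdb_id` are our own, and the check only validates the shape of the code, not whether the entry exists.

```python
import re

# Assumption: classic PDB IDs are four characters long, a digit followed by
# three letters or digits (for example "1TUP"). Other ID formats are not handled.
PDB_ID_PATTERN = re.compile(r"[0-9][A-Za-z0-9]{3}")


def is_valid_pdb_id(code):
    """Return True if the code has the shape of a classic four-character PDB ID."""
    return PDB_ID_PATTERN.fullmatch(code.strip()) is not None


for candidate in ["101", "1TUP", "1tup ", "abcd"]:
    print(repr(candidate), is_valid_pdb_id(candidate))
```

In the notebook, the prompt could be repeated until `is_valid_pdb_id(text_box)` returns `True`.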


In [5]:
# We mimic the folder structure of the PDB database and save the mmCIF file in the corresponding folder.
first_pdb_file = PDBList()
name = first_pdb_file.retrieve_pdb_file(text_box)
print(name[-2])  # Debug output: second-to-last character of the returned path.

# os.path understands the separator of the current platform ('\\' on Windows, '/' elsewhere),
# so the file name can be extracted without walking the string backwards by hand.
protein_file = os.path.basename(name)

# The protein name is the file name without its extension, in upper case.
protein_name = os.path.splitext(protein_file)[0].upper()
print("The name of the protein is, " + protein_file)



Downloading PDB structure '101'...
Desired structure doesn't exists
i
The name of the protein is, 101.cif


In [6]:
protein_class = protein_file[1:3]  # The folder is named after the second and third characters of the code.

#We create an instance of the MMCIF Parser, to load the protein file.
parser = MMCIFParser()

path = os.path.join(protein_class, protein_file)
structure = parser.get_structure(protein_name, path)

FileNotFoundError: [Errno 2] No such file or directory: '01/101.cif'
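
The error above occurs because the parser is given a path to a file that was never downloaded. One way to fail more gracefully is to check for the file before parsing. The sketch below assumes the layout seen in the error message, where the file sits in a folder named after the second and third characters of the code. The helper names `expected_cif_path` and `find_structure_file` are hypothetical, not part of Biopython.

```python
import os


def expected_cif_path(pdb_code, base_dir="."):
    """Build the path <base_dir>/<second and third characters>/<code>.cif."""
    code = pdb_code.lower()
    return os.path.join(base_dir, code[1:3], code + ".cif")


def find_structure_file(pdb_code, base_dir="."):
    """Return the mmCIF path if the file exists, otherwise None."""
    path = expected_cif_path(pdb_code, base_dir)
    return path if os.path.isfile(path) else None


print(expected_cif_path("1TUP"))
print(find_structure_file("101"))
```

If `find_structure_file` returns `None`, the notebook can ask for a different code instead of calling `parser.get_structure` on a missing file.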

In [None]:

def clean_protein(obj):
    print(', '.join([a for a in dir(obj) if not a.startswith('_')]))
    
clean_protein(structure)

In [None]:
view_one = nv.show_biopython(structure)
view_one

In [None]:
view_one.render_image()

In [None]:
view_one._display_image()

In [None]:
view_two = nv.show_biopython(structure)
view_two.add_ball_and_stick()
view_two

In [None]:
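# Clean up the folder if necessary.
if os.path.isfile(path):
    os.remove(path)  # Removes the file after the user is done looking at the protein.
if os.path.isdir(protein_class) and not os.listdir(protein_class):
    os.rmdir(protein_class)  # Removes the folder only if it is now empty.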

### Notes for Faraz

https://docs.google.com/spreadsheets/d/14cIpDAfU4V1t9oN3_GGRJkDfyjBScvCRTUaSK74-1Hs/edit#gid=1005545640

### Notes for Ben
How to write in Jupyter Notebook
https://gtribello.github.io/mathNET/assets/notebook-writing.html

To-do list:
* Refer to the Google Sheets doc to guide Faraz on how to design the end-user portion; make it somewhat similar to the PyMOL model.
* Figure out what we need to write for Ch3-6
    * Find key points to implement that can be testable material
    * The Biopython package already has the PAM matrix and BLOSUM62
* NGLView has a widget we can add that allows you to zoom in/out; it is on the developer's webpage.

# Chapter 2 Databases and Other Resources

**After Dr. Tong's lectures, you are probably familiar with databases. To summarize: databases vary, but they share very similar traits that help you navigate nucleotide, protein, and other data to explore the properties of a given protein. For literature databases, especially Web of Science, please consult Leddy Library (mainly Roger Reka) on how to structure your search so that it is as accurate as possible.**


## Database Resources
* NCBI (National Center for Biotechnology Information), based in the USA
    * BLAST (Basic Local Alignment Search Tool): amino acid sequences of proteins, nucleotide sequences of DNA/RNA
    * GenBank (nucleotide and protein sequences)
    * Protein structure searches
    * NCBI Epigenomics database (epigenetics), ceased operation in 2016
    * Taxonomy
* UniProt (the Universal Protein Resource), a collaboration between the Swiss Institute of Bioinformatics (SIB, Switzerland), the Protein Information Resource (PIR), and the European Bioinformatics Institute
    * UniProtKB
    * UniRef
    * UniParc
* DDBJ (DNA Data Bank of Japan)
* International Nucleotide Sequence Database Collaboration (INSDC)

## Literature Resources
* PubMed (biomedical literature)
* Web of Science
    * Database that offers scientific literature with an excellent search query.
* Interesting miscellaneous resources (may or may not be relevant to this course):
    * Research Hub

If you ever have trouble accessing scientific literature, try looking it up through SciHub. Note: at times, the articles may not contain all of the figures/examples.

If you really need an article, please contact your instructor to request access from the author. The instructor has more leverage than a student when emailing another professor. 

For some key terms, please refer to the lists below:

## Redundant but still Important Terms:

**If you do not know these terms, please look them up, because they will be referenced throughout the rest of the notebook.**

* p53
* UTR (Untranslated region(s))
* Coding strand
* Template strand
* intron
* exon

## Topics that are new

**If you know these terms, great! You are already ahead but please still review before we continue to get into the workbook.**

* FASTA Format
* RefSeq
* Alternative splicing
* Protein Data Bank (more specifically, how its data are obtained so accurately)
* MDM2 (hint: it's a ligase)
* Li-Fraumeni Syndrome (very specific case but can be used in a UniProtKB exercise)
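
To illustrate the FASTA format from the list above: each record starts with a header line beginning with `>`, followed by one or more lines of sequence. The following is a minimal sketch of a parser using only the standard library. The records are made-up toy data and `parse_fasta` is our own helper name.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header to sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # Skip blank lines.
        if line.startswith(">"):
            header = line[1:]  # Drop the leading '>'.
            records[header] = ""
        elif header is not None:
            records[header] += line  # Sequence lines are concatenated.
    return records


# Toy records (hypothetical sequences, for illustration only).
example = """>seq1 toy peptide
MKVLAAGIVG
LLTQSAHA
>seq2 another toy record
ACDEFGHIK
"""
fasta_records = parse_fasta(example)
for header, sequence in fasta_records.items():
    print(header, len(sequence))
```

For real work, Biopython's `Bio.SeqIO.parse(handle, "fasta")` reads the same format and handles more edge cases.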

# Chapter 3 Molecular Evolution

Mutations occur over the course of time. To avoid repetition, the main concepts to keep in mind while working through this chapter are listed below; they will not be covered again, but you should know them by now.

## Some topics to jog your memory:
* p53
* conserved region(s)
* catalytic site

# Chapter 4 Protein Basics & Scoring Matrices

# Chapter 5 Sequence Alignment

# Chapter 6 BLAST

# Chapter 7 Protein Structure (basics)

# Chapter 8 Molecular Phylogeny

# Chapter 9 Genomics

# Chapter 10 Gene Expression Analysis & Proteomics
