# L1: Molecules, Protein Structure


## Overview

Our goal is to introduce the notebook style of this course and to build familiarity with protein structure and sequence files while gaining intuition into the utility of protein structures. In part 1, we will visualize protein structures and consider what they suggest about each protein's function. In part 2, we will search for the proteins by their amino acid sequences to check their actual function.

In this notebook we will look at several protein structures. These structures were predicted automatically by AlphaFold and were downloaded from the [AlphaFold database](https://alphafold.ebi.ac.uk/).

There are two types of files that we will work with here (.pdb and .fasta files). Both of these files are plain text files (you can open them in any text editor). Each file has a standard format for the information it represents.
* .pdb files: protein structure. This file specifies the coordinates of each atom in each amino acid of the protein.
* .fasta files: protein sequence. This file lists the amino acid sequence of the protein. (Note: .fasta files can contain either amino acid or nucleotide sequences.)
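
To make the two formats concrete, the sketch below parses one atom record and one FASTA record using only the Python standard library. The sample line and sequence are made up for illustration (they are not taken from the course files), and the helper names `parse_atom_line` and `parse_fasta` are hypothetical. The slice positions follow the fixed-column layout of PDB `ATOM` records.

```python
# Illustrative sketch: made-up sample data, hypothetical helper names.

# One PDB ATOM record, assembled field by field to show the fixed-width columns.
SAMPLE_ATOM_LINE = (
    "ATOM  "      # columns 1-6: record type
    "    1"       # columns 7-11: atom serial number
    " "
    " N  "        # columns 13-16: atom name
    " "
    "MET"         # columns 18-20: residue (amino acid) name
    " "
    "A"           # column 22: chain identifier
    "   1"        # columns 23-26: residue number
    "    "
    "  -9.123"    # columns 31-38: x coordinate (angstroms)
    "  12.456"    # columns 39-46: y coordinate
    "   3.789"    # columns 47-54: z coordinate
    "  1.00"      # columns 55-60: occupancy
    " 85.32"      # columns 61-66: B-factor column
)

SAMPLE_FASTA = """>example_protein made-up sequence for illustration
MKTAYIAKQR
QISFVKSHFS
"""


def parse_atom_line(line):
    """Extract a few fields from one fixed-width PDB ATOM record."""
    return {
        "atom_name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_number": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }


def parse_fasta(text):
    """Return {header: sequence} for every record in FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines are joined into one string per record
            records[header] += line
    return records


atom = parse_atom_line(SAMPLE_ATOM_LINE)
print(atom)
sequences = parse_fasta(SAMPLE_FASTA)
print(sequences)
```

For real files, a tested parser is the safer choice; this notebook uses `Bio.SeqIO` for the .fasta files in part 2.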

We will first load the .pdb files and visualize the protein structures. Next, we will load the .fasta files and search the NCBI database via BLAST to identify each protein.

In [1]:
# Imports
# part 1
import nglview as nv
# part 2
from Bio import SeqIO
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML



## Part 1: Visualize Structures

In this section we will load and visualize the protein structure files. Replace the text "ENTER FILE NAME HERE" with the name of each of the protein .pdb files and run the code block. Once the visualization is generated, you can click and drag to spin the protein around and scroll to zoom in and out. Copy the code block and repeat this for each of the protein files.

Here we use a simple approach to view the proteins with the nglview package. Many more sophisticated tools let you visualize proteins and run analyses on them, such as molecular docking. If you are interested in exploring protein structure analysis further, a popular and more powerful tool is [PyMOL](https://pymol.org/2/).

Note: in Google Chrome, the "hardware acceleration" option must be enabled for this viewer to work (Settings -> System -> Use hardware acceleration when available).

In [11]:
file = "P1_Structure.pdb"

view = nv.show_file(file)
view

NGLWidget()

In [12]:
file = "P2_Structure.pdb"

view = nv.show_file(file)
view

NGLWidget()

In [15]:
file = "P3_Structure.pdb"

view = nv.show_file(file)
view

NGLWidget()

In [13]:
file = "P4_Structure.pdb"

view = nv.show_file(file)
view

NGLWidget()

In [14]:
file = "P5_Structure.pdb"

view = nv.show_file(file)
view

NGLWidget()

### QUESTIONS: 

1. After viewing each protein, write down a hypothesis for what you think its function may be (hint: it's okay if you have no idea, just make up something that seems plausible).
2. Are there any features of the structures that stand out? Do you notice any consistent structures across proteins?

## Part 2: Search Sequences

Here we will use Biopython's interface to the NCBI BLAST web server (Bio.Blast.NCBIWWW) to search for our sequence in the NCBI database. BLAST (Basic Local Alignment Search Tool) is a bioinformatics tool that aligns sequences with each other to find matches.

For our sequences we should be able to find very good (if not exact) matches. We will load our sequence and use the "blastp" program and the "nr" database to align our sequence to the NCBI non-redundant database of protein sequences. We will limit the search to the 10 best matches. We will save our BLAST results as an .xml file, then load them and print out all of the matches below a specified E-value. The E-value is the number of matches with a score at least this good that would be expected by chance in a database of this size, so smaller values indicate more significant matches.
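
As a rough illustration of how an E-value behaves, the sketch below uses the standard relation between a bit score S' and the expected number of chance hits, E = m * n * 2^(-S'), where m is the query length and n is the total database length. The lengths and scores here are made-up numbers, `expected_chance_hits` is a hypothetical helper, and real BLAST applies corrections (such as effective lengths) that this sketch ignores.

```python
# Illustrative sketch: made-up lengths and scores; ignores BLAST's length corrections.

def expected_chance_hits(bit_score, query_length, database_length):
    """Approximate E-value: E = m * n * 2**(-bit_score)."""
    return query_length * database_length * 2.0 ** (-bit_score)


QUERY_LENGTH = 300            # assumed number of residues in the query
DATABASE_LENGTH = 10 ** 11    # assumed total residues in the database

for bit_score in (30, 60, 120):
    e_value = expected_chance_hits(bit_score, QUERY_LENGTH, DATABASE_LENGTH)
    print(f"bit score {bit_score:>3}: E-value ~ {e_value:.2e}")
```

Each additional bit of score halves the expected number of chance hits, which is why strong matches have E-values very close to zero.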

Run this part of the notebook and answer the questions for at least one of the proteins by setting the protein_name variable.

In [7]:
# Load Sequence File
protein_name = "P1"
protein_sequence = SeqIO.read(protein_name+"_Sequence.fasta",format="fasta")

In [8]:
# Run BLAST (this will take several minutes to run)
blast_result = NCBIWWW.qblast("blastp","nr",protein_sequence.seq,hitlist_size=10)

In [9]:
# Save BLAST results
with open(protein_name+"_blast.xml","w") as result_file:
    result_file.write(blast_result.read())
blast_result.close()
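
Because the search takes several minutes and its results are saved to disk, one option is to reuse the saved file on later runs. The sketch below shows that pattern with a hypothetical helper, `load_or_run`, and a stand-in search function so that it runs without network access. In the notebook, the callable would instead wrap the `NCBIWWW.qblast(...)` call and return the text from `blast_result.read()`.

```python
# Illustrative sketch: hypothetical helper and a stand-in for the real BLAST call.
import tempfile
from pathlib import Path


def load_or_run(result_path, run_search):
    """Return saved results if the file exists; otherwise run the search and save it."""
    path = Path(result_path)
    if path.exists():
        return path.read_text()
    results = run_search()
    path.write_text(results)
    return results


calls = []


def fake_search():
    """Stand-in for the slow web search; records each time it is called."""
    calls.append(1)
    return "<BlastOutput/>"


with tempfile.TemporaryDirectory() as tmp_dir:
    xml_path = Path(tmp_dir) / "P1_blast.xml"
    first = load_or_run(xml_path, fake_search)   # runs the search and saves the file
    second = load_or_run(xml_path, fake_search)  # reads the saved file instead

print("searches run:", len(calls))
```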

In [10]:
# Print out BLAST matches with E-value below a given threshold
E_VALUE_THRESH = 1e-20
with open(protein_name+"_blast.xml") as result_file:
    for record in NCBIXML.parse(result_file):
        for align in record.alignments:
            for hsp in align.hsps:
                # Keep only high-scoring pairs that pass the E-value cutoff
                if hsp.expect < E_VALUE_THRESH:
                    print("\n")
                    print("match: %s " % align.title)
                    print(hsp.expect)



match: ref|WP_000170339.1| glucokinase [Escherichia albertii] >gb|EFZ2303342.1| glucokinase [Shigella boydii] >gb|AHE61450.1| glucokinase [Escherichia albertii KF1] >gb|AUS66356.1| glucokinase [Escherichia albertii] >gb|EAB1452419.1| glucokinase [Escherichia albertii] >gb|EDS92476.1| glucokinase [Escherichia albertii TW07627] 
0.0


match: gb|MCZ8700358.1| glucokinase [Escherichia albertii] 
0.0


match: ref|WP_113623213.1| glucokinase [Escherichia albertii] >gb|EEW0112471.1| glucokinase [Escherichia albertii] >gb|EFB7458226.1| glucokinase [Escherichia albertii] >gb|EFO4719699.1| glucokinase [Escherichia albertii] >gb|MCE7716207.1| glucokinase [Escherichia albertii] >gb|MCZ8599115.1| glucokinase [Escherichia albertii] 
0.0


match: ref|WP_059218301.1| glucokinase [Escherichia albertii] >gb|WDB49009.1| glucokinase [Escherichia albertii] 
0.0


match: gb|EJJ6388920.1| glucokinase [Escherichia albertii] >gb|EJZ0948492.1| glucokinase [Escherichia albertii] >gb|EKB0154155.1| glucokinase [
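
Each `match:` line above packs several database entries into one title, separated by ` >`. The sketch below splits a title into its entries and pulls out the database, accession, protein name, and organism. The title is a shortened copy of the first match printed above, `split_hit_title` is a hypothetical helper, and the pattern assumes the `db|accession| name [organism]` layout seen in this output; titles with a different layout would be skipped.

```python
# Illustrative sketch: hypothetical helper; assumes the "db|accession| name [organism]" layout.
import re
from collections import Counter

ENTRY_PATTERN = re.compile(r"^(\w+)\|([^|]+)\|\s*(.+?)\s*\[([^\]]+)\]\s*$")

# Shortened copy of the first hit title printed above
SAMPLE_TITLE = (
    "ref|WP_000170339.1| glucokinase [Escherichia albertii] "
    ">gb|EFZ2303342.1| glucokinase [Shigella boydii] "
    ">gb|AHE61450.1| glucokinase [Escherichia albertii KF1]"
)


def split_hit_title(title):
    """Split one BLAST hit title into a list of per-entry dictionaries."""
    entries = []
    for part in title.split(" >"):
        match = ENTRY_PATTERN.match(part.strip())
        if match:
            database, accession, protein, organism = match.groups()
            entries.append({
                "database": database,
                "accession": accession,
                "protein": protein,
                "organism": organism,
            })
    return entries


entries = split_hit_title(SAMPLE_TITLE)
for entry in entries:
    print(entry)

# Tally how often each protein name and organism appears in the title
protein_counts = Counter(entry["protein"] for entry in entries)
organism_counts = Counter(entry["organism"] for entry in entries)
print(protein_counts)
print(organism_counts)
```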

### QUESTIONS:

3. From the top BLAST matches, identify the name of your protein and the organism it likely came from.
4. Now that you know the name and organism of your protein, look online and provide a brief description of the function of this protein. UniProt (https://www.uniprot.org/) is a good database of protein information to search.