# BF527: Applications in Bioinformatics

>**Note:** Please submit the Jupyter notebook through Blackboard. Your code should follow the guidelines laid out in class, including commenting. Partial credit will be given for nonfunctional code that is logical and well commented. This assignment must be completed on your own.

## Homework 8

### See [Blackboard](https://learn.bu.edu) for assignment and due dates

---

## Problem 8.1 (40%):

#### Go to the PDB website and open the page for the structure with PDB ID 3BMP.

* Use __Pfam__, UniProt, Google, or Wikipedia to find information about this protein. How long is the protein? Which superfamily does it belong to? What is the protein’s function, and what is the evolutionary history of the superfamily? What domains and enzymatic properties does the protein have?

#### Explore the 3D structure of “3BMP” using the "3D View" tab on the PDB website.

* Generate two informative pictures of this structure by manipulating the various style options (you can fine-tune these options through the right-click menu). Include screenshots with your homework submission and explain the biological meaning of the different styles.

#### Use the other information tabs to answer the remaining questions.

* There are some “dots” buried in the structure. What do these represent? __Hint: try hovering over them with your pointer.__

* Describe the secondary structure composition of this protein. Is there a prevalence of one type of secondary structure?

* Does the protein belong to a family recognized by SCOP, CATH, and/or Pfam?

* Is the protein similar to any other human proteins? To what degree?

__Hints__: You can download a FASTA record from the PDB website. You can restrict BLAST to search only the human database.

* How were the 3D structure and view of this protein generated?

__Hint__: The "Experiment" tab on the PDB website has some information that may help here.

---

## Problem 8.2 (60%):

__Your task is to write a Python script to parse a PDB file__. A typical PDB-format file contains atomic coordinates for proteins, as well as small molecules, ions, and water. Each atom is entered as a line of information that starts with the keyword ATOM or HETATM. By tradition, the ATOM keyword identifies protein or nucleic acid atoms, and the HETATM keyword identifies atoms in small molecules. Following this keyword is a list of information about the atom, including its name, its number in the file, the name and number of the residue it belongs to, one letter specifying the chain (in oligomeric proteins), and its x, y, and z coordinates. Download the raw data for 3BMP. (__Hint: under “Download files” select "PDB Format"__.) Your Python script should do the following things:

* Open the 3BMP.pdb file and parse it line by line. __Hint__: PDB files can be a little hard to read because the lines contain varying numbers of spaces so that the columns line up exactly in a flat file. If you open the file in a text editor, you’ll also see that it has a LOT of different information in it. You are only interested in rows that begin with “ATOM”. The best way to separate the individual components of a line is by slicing, e.g. to get just “ATOM” you could use line[0:4]. __Splitting on a delimiter (e.g. '```\t```') will not work__.
* Amino acids are made of Carbon (C), Nitrogen (N), Sulfur (S), and Oxygen (O). Count the number of C, N, S, and O atoms that occur in each type of amino acid in the protein, as well as the total number of C, N, S, and O atoms in the protein. Compute the frequency of each atom in each unique amino acid, i.e. the fraction of all atoms of that element that are found in that amino acid. __Remember__: the keyword for atoms in proteins (instead of small molecules) is ATOM; the HETATM keywords can be ignored. The atomic element is given as a one-letter code at the __end of the line__. The PDB file will display the x, y, z coordinates starting at amino acid #9 and continuing to amino acid #114. There will be one line per atom of the amino acid. The question you are trying to answer is: of all the C, N, S, and O atoms in the protein structure, how many are in Alanine, Arginine, etc.?
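The slicing hint can be illustrated with a minimal sketch. The column positions below follow the fixed-width layout of ATOM records in the PDB format (residue name in columns 18-20, element symbol in columns 77-78, and so on). The `sample` line is made up for illustration and is not taken from 3BMP, and `parse_atom_line` is a hypothetical helper name rather than part of any library.

```python
# Build an illustrative ATOM line piece by piece so every field sits in the
# right fixed-width columns. The values are invented, not taken from 3BMP.
sample = (
    "ATOM  "      # cols 1-6   record name
    "    1"       # cols 7-11  atom serial number
    " "           # col  12    (blank)
    " N  "        # cols 13-16 atom name
    " "           # col  17    alternate location indicator
    "ALA"         # cols 18-20 residue name
    " "           # col  21    (blank)
    "A"           # col  22    chain identifier
    "   9"        # cols 23-26 residue sequence number
    " "           # col  27    insertion code
    "   "         # cols 28-30 (blank)
    "  11.104"    # cols 31-38 x coordinate
    "   6.134"    # cols 39-46 y coordinate
    "  -6.504"    # cols 47-54 z coordinate
    "  1.00"      # cols 55-60 occupancy
    " 20.00"      # cols 61-66 temperature factor
    "          "  # cols 67-76 (blank)
    " N"          # cols 77-78 element symbol
)


def parse_atom_line(line):
    """Slice one ATOM line into named fields (hypothetical helper)."""
    return {
        "record": line[0:6].strip(),
        "atom_name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "element": line[76:78].strip(),
    }


if sample.startswith("ATOM"):
    atom = parse_atom_line(sample)
    print(atom["res_name"], atom["res_seq"], atom["element"])
```

Python slices are zero-based and exclude the end index, so columns 18-20 become `line[17:20]`. Slicing stays correct even when adjacent fields run together without a space, which is the case where splitting on whitespace can fail.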

Your output should look like:

```
amino acid  C     N     S     O
ARG         0.03  0.08  0.00  0.03
ASN         0.05  0.10  0.00  0.09
ASP         0.05  0.04  0.00  0.12
…etc
total:      531   142   9     156
```
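The numbers in the sample table are consistent with each entry being the count of an element in one amino acid type divided by that element's total over the whole protein. A minimal sketch of that calculation follows; the `atoms` list is invented toy data (not from 3BMP), and `element_table` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Hypothetical (residue name, element) pairs, one per ATOM record.
# Invented toy data for illustration only, not taken from 3BMP.
atoms = [
    ("ALA", "N"), ("ALA", "C"), ("ALA", "C"), ("ALA", "O"), ("ALA", "C"),
    ("CYS", "N"), ("CYS", "C"), ("CYS", "C"), ("CYS", "O"), ("CYS", "C"),
    ("CYS", "S"),
]

ELEMENTS = ("C", "N", "S", "O")


def element_table(atoms):
    """Return per-residue element fractions and per-element totals."""
    per_residue = defaultdict(Counter)
    totals = Counter()
    for res_name, element in atoms:
        per_residue[res_name][element] += 1
        totals[element] += 1
    # Fraction of each element's total that falls in each residue type;
    # an element that never occurs gets 0.0 instead of a division error.
    freqs = {
        res: {e: (counts[e] / totals[e] if totals[e] else 0.0) for e in ELEMENTS}
        for res, counts in per_residue.items()
    }
    return freqs, totals


freqs, totals = element_table(atoms)

# Column widths chosen to mirror the sample output layout.
print(f"{'amino acid':<12}" + "".join(f"{e:<6}" for e in ELEMENTS))
for res in sorted(freqs):
    print(f"{res:<12}" + "".join(f"{freqs[res][e]:<6.2f}" for e in ELEMENTS))
print(f"{'total:':<12}" + "".join(f"{totals[e]:<6}" for e in ELEMENTS))
```

Each element column sums to 1 over all amino acids, which gives a quick sanity check on the counts.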



```python
from collections import defaultdict

# Map each residue name (e.g. "ARG") to a string made of the element
# symbols of all atoms belonging to residues of that type.
d = defaultdict(str)

# This reads the mmCIF file, whose ATOM rows are whitespace-delimited, so
# split() works here; the fixed-width PDB-format file would need slicing.
with open("3bmp.cif") as f:
    for line in f:
        if line[0:4] == "ATOM":
            word = line.split()
            ele = word[2]  # element symbol of the atom
            p = word[5]    # three-letter residue name
            d[p] += ele

print(d)
print("")
print("")

# Total count of each element over all residues, computed from the data
# rather than hard-coded from the assignment question.
tc = sum(v.count("C") for v in d.values())
tn = sum(v.count("N") for v in d.values())
ts = sum(v.count("S") for v in d.values())
to = sum(v.count("O") for v in d.values())

# Print each residue's share of every element in the requested layout.
print(f"{'amino acid':<12}{'C':<6}{'N':<6}{'S':<6}O")
for key in sorted(d):
    c = d[key].count("C") / tc
    n = d[key].count("N") / tn
    s = d[key].count("S") / ts
    o = d[key].count("O") / to
    print(f"{key:<12}{c:<6.2f}{n:<6.2f}{s:<6.2f}{o:.2f}")
print(f"{'total:':<12}{tc:<6}{tn:<6}{ts:<6}{to}")
```

```
defaultdict(<class 'str'>, {'ARG': 'NCCOCCCNCNNNCCOCCCNCNNNCCOCCCNCNNO', 'LEU': 'NCCOCCCCNCCOCCCCNCCOCCCCNCCOCCCCNCCOCCCCNCCOCCCCNCCOCCCCNCCOCCCCNCCOCCCC', 'LYS': 'NCCOCCCCNNCCOCCCCNNCCOCCCCNNCCOCCCCNNCCOCCCCNNCCOCCCCN', 'SER': 'NCCOCONCCOCONCCOCONCCOCONCCOCONCCOCONCCOCONCCOCO', 'CYS': 'NCCOCSNCCOCSNCCOCSNCCOCSNCCOCSNCCOCSNCCOCS', 'HIS': 'NCCOCCNCCNNCCOCCNCCNNCCOCCNCCNNCCOCCNCCNNCCOCCNCCN', 'PRO': 'NCCOCCCNCCOCCCNCCOCCCNCCOCCCNCCOCCCNCCOCCCNCCOCCC', 'TYR': 'NCCOCCCCCCCONCCOCCCCCCCONCCOCCCCCCCONCCOCCCCCCCONCCOCCCCCCCO', 'VAL': 'NCCOCCCNCCOCCCNCCOCCCNCCOCCCNCCOCCCNCCOCCCNCCOCCCNCCOCCCNCCOCCCNCCOCCCNCCOCCC', 'ASP': 'NCCOCCOONCCOCCOONCCOCCOONCCOCCOONCCOCCOONCCOCCOO', 'PHE': 'NCCOCCCCCCCNCCOCCCCCCCNCCOCCCCCCC', 'GLY': 'NCCONCCONCCONCCONCCO', 'TRP': 'NCCOCCCCNCCCCCNCCOCCCCNCCCCC', 'ASN': 'NCCOCCONNCCOCCONNCCOCCONNCCOCCONNCCOCCONNCCOCCONNCCOCCON', 'ILE': 'NCCOCCCCNCCOCCCCNCCOCCCCNCCOCCCC', 'ALA': 'NCCOCNCCOCNCCOCNCCOCNCCOCNCCOC', 'GLU': 'NCCOCCCOONCCOCCCCCCOOOONCCOCCCOONCCOCCCOONCCOCCCOO', 'T
```

__What does the distribution of frequencies look like? Are there any atoms that are more prevalent in one amino acid or another?__