# BF527: Applications in Bioinformatics

>**Note:** Please submit the Jupyter notebook through Blackboard. Your code should follow the guidelines laid out in class, including commenting. Partial credit will be given for nonfunctional code that is logical and well commented. This assignment must be completed on your own.

## Homework 8

### See [Blackboard](https://learn.bu.edu) for assignment and due dates

---

## Problem 8.1 (40%):

#### Go to the PDB website and open the page for the structure with PDB ID 3BMP.

* Use __Pfam__, Uniprot, Google or Wikipedia to find some information about this protein. How long is the protein? Which superfamily does the protein belong to? What is the protein’s function, and the evolutionary history of the superfamily? What domains and enzymatic properties does the protein have?

![evo_history](evo_1.jpg)

**How long is the protein?**  
The length of the protein is 114 aa.  
**Which superfamily does the protein belong to?**  
It belongs to the TGF-β superfamily.  
**What is the protein’s function, and the evolutionary history of the superfamily?**  
It plays essential roles in many developmental processes, including cardiogenesis, neurogenesis, and osteogenesis, and it induces cartilage and bone formation.
The evolutionary history is shown in the figure above.  
TGFβ family receptors are grouped into three types: type I, type II, and type III. There are seven type I receptors, termed the activin-like receptors (ALK1–7), five type II receptors, and one type III receptor, for a total of 13 TGFβ superfamily receptors. In the transduction pathway, ligand-bound type II receptors activate type I receptors by phosphorylation; the type I receptors then autophosphorylate and bind SMAD. The type I receptors have a glycine-serine (GS, or TTSGSGSG) repeat motif of around 30 aa, which is a target of type II activity. At least three, and perhaps four to five, of the serines and threonines in the GS domain must be phosphorylated to fully activate TbetaR-1. (Sourced from Wikipedia)

**What domains and enzymatic properties does the protein have?**  
Some domain information for the protein is shown below. For example, SCOPe classifies this protein domain in the 'small proteins' class, with the 'cystine-knot cytokines' fold and superfamily. Within this superfamily, it is part of the 'Transforming growth factor-beta' family and is specifically identified as the Bone morphogenetic protein-2 domain.  
As for enzymatic properties, it is listed as an inhibitor in enzyme-catalyzed reactions: it down-regulates enzyme expression in osteoblasts, an effect that can be overcome by noggin.


![domain](domain.jpg)

#### Explore the 3D structure of “3BMP” using the "3D View" tab on the PDB website.

* Generate two informative pictures of this structure by manipulating the various style options (you can fine tune these options through the right-click menu). Include screen shots with your homework submission and explain the biological meaning of the different styles.

A coarse surface representation helps identify hydrophobic and hydrophilic regions, which are crucial for understanding protein-ligand interactions, protein-protein interactions, and the solubility of the protein. It can also reveal potential active sites or ligand-binding pockets, such as the binding site of a drug molecule or an antibody.

![coarsesurface](coarse_surface.jpg)

This view shows the protein as a polymer together with its ligands. Secondary structure elements such as alpha helices and beta sheets are visible, which helps in understanding the protein's folding and function. The presence and position of the ligands suggest where binding sites are located on the protein.

![ligand_polymer](ligand.jpg)

#### Use the other information tabs to answer the remaining questions.

* There are some “dots” buried in the structure—what do these represent? __Hint: try hovering over them with your pointer.__

* Describe the secondary structure composition of this protein. Is there a prevalence of one type of secondary structure?

* Does the protein belong to a family recognized by SCOP, CATH, and/or PFam?

Yes, it belongs to the cystine-knot cytokines family, which is recognized by SCOP and ECOD.

![domain](domain.jpg)
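
The secondary-structure question above can also be checked against the coordinate file itself. The following is a minimal sketch, assuming the downloaded file carries standard fixed-width `HELIX` and `SHEET` records; the function name `secondary_structure_counts` is hypothetical, and insertion codes are ignored for simplicity.

```python
def secondary_structure_counts(pdb_lines):
    """Count residues covered by HELIX and SHEET records in a PDB file.

    Residues are stored as (chain, residue number) pairs in sets, so a
    strand listed in more than one SHEET record is only counted once.
    """
    helix, strand = set(), set()
    for line in pdb_lines:
        if line.startswith("HELIX "):
            chain = line[19]            # PDB column 20: initial chain ID
            start = int(line[21:25])    # PDB columns 22-25: first residue
            end = int(line[33:37])      # PDB columns 34-37: last residue
            helix.update((chain, i) for i in range(start, end + 1))
        elif line.startswith("SHEET "):
            chain = line[21]            # PDB column 22: initial chain ID
            start = int(line[22:26])    # PDB columns 23-26: first residue
            end = int(line[33:37])      # PDB columns 34-37: last residue
            strand.update((chain, i) for i in range(start, end + 1))
    return {"helix": len(helix), "strand": len(strand)}
```

Calling it on the open file handle, for example `secondary_structure_counts(open('3bmp.pdb'))`, would give the number of residues annotated as helix and as strand, which can then be compared with the number of residues in the structure.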

* Is the protein similar to any other human proteins? To what degree?

__Hints__: You can download a fasta record from the PDB website. You can restrict blast to only look in the human database.
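
To make "to what degree" concrete, here is a small sketch of how percent identity can be computed from a pairwise alignment. It assumes identity is counted over all alignment columns, gaps included, which is the convention this sketch adopts rather than something stated in the assignment. The function names and the toy sequences are made up for illustration and are not taken from 3BMP or from any BLAST result.

```python
def read_fasta(text):
    """Parse FASTA text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns in which both sequences share a residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    # A column counts as a match only if the residues agree and neither is a gap
    matches = sum(a == b and a != "-" for a, b in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


# Toy example (hypothetical sequences): 7 of 9 columns match
toy = read_fasta(">query\nACDEF\nGHIK\n>hit\nACDQFG-IK\n")
print(round(percent_identity(toy["query"], toy["hit"]), 1))
```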

* How was the 3D structure and view of this protein generated?

__Hint__: The "Experiment" tab on the PDB website has some information that may help here

---

## Problem 8.2 (60%):

__Your task is to write a python script to parse a PDB file__. A typical PDB format file will contain atomic coordinates for proteins, as well as small molecules, ions and water. Each atom is entered as a line of information that starts with a keyword ATOM or HETATM. By tradition, the ATOM keyword is used to identify proteins or nucleic acid atoms, and keyword HETATM is used to identify atoms in small molecules. Following this keyword, there is a list of information about the atom, including its name, its number in the file, the name and number of the residue it belongs to, one letter to specify the chain (in oligomeric proteins), its x, y, and z coordinates. Download the raw data for 3BMP. (__Hint: under “Download files” select "PDB Format"__.) Your Python script should do the following things:

* Open the 3BMP.pdb file in order to parse it line by line. __Hint__: PDB files can be a little hard to read because the lines have varied numbers of spaces so that the columns line up exactly in a flat file. If you try opening the file in a text editor, you’ll also realize that it has a LOT of different information in it. You are only interested in rows that begin with “ATOM”. The best way to separate individual components of a line is by slicing, e.g. to get just “ATOM” you could use `line[0:4]`. __Splitting on a delimiter (e.g. `'\t'`) will not work__.
* Amino acids are made of Carbon (C), Nitrogen (N), Sulfur (S), and Oxygen (O). Count the number of C, N, S and O atoms that occur in each amino acid of the protein, including the total number of C, N, S and O atoms in the protein. Compute the frequencies (%) for each atom in each unique amino acid. __Remember__: the keyword for atoms in proteins (instead of small molecules) is ATOM; the HETATM keywords can be ignored. The atomic element is given a one-letter code at the __end of the line__. The PDB file will display the x,y,z coordinates starting at amino acid #9, and continuing to amino acid #114. There will be one line per atom of the amino acid. The question you are trying to answer is, of all the C, N, S and O atoms in the protein structure, how many are in Alanine, Arginine, etc.

Your output should look like:

```
amino acid  C     N     S     O
ARG         0.03  0.08  0.00  0.03
ASN         0.05  0.10  0.00  0.09
ASP         0.05  0.04  0.00  0.12
…etc
total:      531   142   9     156
```
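
The column slicing described in the hint can be sketched as follows. This is an illustration under the assumption that the file follows the standard fixed-width ATOM layout. The names `AtomRecord`, `parse_atom_line` and `format_atom_line` are hypothetical, and the sample record uses made-up values rather than data from 3BMP.

```python
from typing import NamedTuple


class AtomRecord(NamedTuple):
    serial: int
    atom_name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float
    element: str


def parse_atom_line(line):
    """Slice one fixed-width ATOM record (0-based slices, 1-based PDB columns)."""
    return AtomRecord(
        serial=int(line[6:11]),         # columns 7-11
        atom_name=line[12:16].strip(),  # columns 13-16
        res_name=line[17:20].strip(),   # columns 18-20
        chain_id=line[21],              # column 22
        res_seq=int(line[22:26]),       # columns 23-26
        x=float(line[30:38]),           # columns 31-38
        y=float(line[38:46]),           # columns 39-46
        z=float(line[46:54]),           # columns 47-54
        element=line[76:78].strip(),    # columns 77-78, right-justified
    )


def format_atom_line(serial, atom_name, res_name, chain_id, res_seq, x, y, z, element):
    """Hypothetical helper: build a made-up ATOM record with the layout above."""
    return (
        f"ATOM  {serial:>5} {atom_name:<4} {res_name:>3} {chain_id}{res_seq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}          {element:>2}"
    )


# Made-up values, for illustration only (not taken from 3BMP)
sample = format_atom_line(1, " N", "LYS", "A", 11, 10.0, 20.0, 30.0, "N")
record = parse_atom_line(sample)
print(record.res_name, record.element, record.x)
```

Slicing by position works where `split()` can fail because neighbouring fields in a PDB line are allowed to touch with no whitespace between them, for example when a coordinate fills its full eight-character column.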


In [45]:
# Write your script here
# Per-residue record of element symbols: maps a residue name (e.g. 'ARG')
# to a string holding one character for every atom seen in that residue type
amino_acid_total = {}
# Running count of each element over the entire protein
total = {'C': 0, 'N': 0, 'S': 0, 'O': 0}
# Open the PDB file and read it line by line
with open('3bmp.pdb', 'r') as file:
    for line in file:
        # Only ATOM records describe protein atoms; HETATM records are skipped
        if line.startswith("ATOM"):
            atom_type = line[76:78].strip()  # Element symbol (PDB columns 77-78, right-justified)
            amino_acid = line[17:20]         # Residue name (PDB columns 18-20)
            # Ignore any element other than C, N, S and O (e.g. hydrogens)
            if atom_type not in total:
                continue
            # Increment the protein-wide count for this element
            total[atom_type] += 1
            # Append this atom's element to the string for its residue type
            amino_acid_total[amino_acid] = amino_acid_total.get(amino_acid, '') + atom_type
# Display the protein-wide totals
total


{'C': 531, 'N': 142, 'S': 9, 'O': 156}

In [56]:
# Print the header row of the output table
print('amino acid', 'C', 'N', 'S', 'O', sep='\t')
# For each residue type, compute the fraction of the protein's C, N, S and O atoms
# that it contains (count in this residue type / total count of that element)
for name in amino_acid_total:
    aa_c = round(amino_acid_total[name].count('C') / total["C"], 2)
    aa_n = round(amino_acid_total[name].count('N') / total["N"], 2)
    aa_s = round(amino_acid_total[name].count('S') / total["S"], 2)
    aa_o = round(amino_acid_total[name].count('O') / total["O"], 2)
    # Print the residue name and its four fractions, tab-separated
    print(name, '', aa_c, aa_n, aa_s, aa_o, sep='\t')
# Print the total count of each element in the entire protein
print('total:', '', total["C"], total["N"], total["S"], total["O"], sep='\t')

amino acid	C	N	S	O
ARG		0.03	0.08	0.0	0.03
LEU		0.1	0.06	0.0	0.06
LYS		0.07	0.08	0.0	0.04
SER		0.05	0.06	0.0	0.1
CYS		0.04	0.05	0.78	0.04
HIS		0.06	0.11	0.0	0.03
PRO		0.07	0.05	0.0	0.04
TYR		0.08	0.04	0.0	0.06
VAL		0.1	0.08	0.0	0.07
ASP		0.05	0.04	0.0	0.12
PHE		0.05	0.02	0.0	0.02
GLY		0.02	0.04	0.0	0.03
TRP		0.04	0.03	0.0	0.01
ASN		0.05	0.1	0.0	0.09
ILE		0.05	0.03	0.0	0.03
ALA		0.03	0.04	0.0	0.04
GLU		0.05	0.04	0.0	0.11
THR		0.02	0.02	0.0	0.04
GLN		0.02	0.03	0.0	0.03
MET		0.02	0.01	0.22	0.01
total:		531	142	9	156


__What does the distribution of frequencies look like? Are there any atoms that are more prevalent in one amino acid or another?__
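
One way to approach this question is to look up, for each element, which residue type holds the largest share. The sketch below uses a subset of rows copied from the table printed above; the names `frequencies` and `top_residues` are hypothetical, and ties are reported together rather than broken arbitrarily.

```python
ELEMENTS = ("C", "N", "S", "O")

# Rows copied from the table printed above (a subset, to keep the sketch short);
# each tuple holds the C, N, S and O fractions for that residue type
frequencies = {
    "LEU": (0.10, 0.06, 0.00, 0.06),
    "CYS": (0.04, 0.05, 0.78, 0.04),
    "HIS": (0.06, 0.11, 0.00, 0.03),
    "VAL": (0.10, 0.08, 0.00, 0.07),
    "ASP": (0.05, 0.04, 0.00, 0.12),
    "MET": (0.02, 0.01, 0.22, 0.01),
}


def top_residues(freqs):
    """For each element, list the residue type(s) holding the largest share."""
    result = {}
    for i, element in enumerate(ELEMENTS):
        best = max(row[i] for row in freqs.values())
        result[element] = sorted(name for name, row in freqs.items() if row[i] == best)
    return result


for element, names in top_residues(frequencies).items():
    print(element, names)
```

In the printed table, sulfur is the only element confined to two residue types, CYS and MET, which are the only standard amino acids that contain sulfur; carbon, nitrogen and oxygen are spread across all of them.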