# Introduction to PDB Files and Biopython

#### In this project, we examine protein structures stored in PDB (Protein Data Bank) files using the Biopython library. We look at how a PDB file is organized, load PDB files with Biopython, extract information from them, and visualize the results. The structural data come from the Protein Data Bank website at https://www.rcsb.org/, which provides free access to protein crystal structures. The downloaded file has been moved into the same folder as this Jupyter notebook.

## PDB Format and Parsing

#### The PDB (Protein Data Bank) file format is widely used in structural biology to represent the three-dimensional structures of proteins, nucleic acids, and complex molecular assemblies. While many other bioinformatics file formats hold mainly sequence data, a PDB file provides a range of information: the amino acid sequence, helix and other secondary structure annotations used to generate visual representations of proteins, xyz coordinates of the atoms identified in the structure (hydrogens are typically not included), metadata about the protein, and more. PDB files are text files in which each line begins with a record label and the data fields are separated by strictly defined column positions within the line.
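
#### To make the fixed-column layout concrete, here is a minimal sketch that builds one ATOM record with the standard column widths and then reads the fields back by slicing. The record values are invented for illustration and are not taken from 6x8j.pdb; `parse_atom_line` is a hypothetical helper, not a Biopython function.

```python
# Build an invented ATOM record using the fixed column widths of the PDB format.
# The values are made up for illustration; they do not come from a real file.
sample_line = (
    f"{'ATOM':<6}{1:>5} {' N':<4}{'':1}{'TYR':>3} {'A':1}{58:>4}{'':1}   "
    f"{11.104:>8.3f}{13.207:>8.3f}{2.100:>8.3f}{1.00:>6.2f}{20.00:>6.2f}"
    f"{'':10}{'N':>2}"
)


def parse_atom_line(line):
    """Read the main fields of an ATOM record by column position."""
    return {
        "record": line[0:6].strip(),       # columns 1-6: record label
        "serial": int(line[6:11]),         # columns 7-11: atom serial number
        "atom_name": line[12:16].strip(),  # columns 13-16: atom name
        "res_name": line[17:20].strip(),   # columns 18-20: residue name
        "chain_id": line[21],              # column 22: chain identifier
        "res_seq": int(line[22:26]),       # columns 23-26: residue number
        "x": float(line[30:38]),           # columns 31-38: x coordinate
        "y": float(line[38:46]),           # columns 39-46: y coordinate
        "z": float(line[46:54]),           # columns 47-54: z coordinate
        "element": line[76:78].strip(),    # columns 77-78: element symbol
    }


print(sample_line)
print(parse_atom_line(sample_line))
```

#### Splitting on whitespace would be unreliable here, because neighboring fields can touch or be blank; slicing by position is what the format is designed for. In practice the Biopython parser used below handles this for us.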

#### Run the code below to import the libraries used in this notebook.

In [None]:
import warnings
from Bio import BiopythonWarning
warnings.simplefilter('ignore', BiopythonWarning)

import os

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import pandas as pd

import Bio.PDB

## Parsing PDB Files

#### In this project, we parse the PDB files, that is, we separate their data into its component parts. Biopython provides modules that can read PDB files. First, we import the PDB module of the Biopython library with the `import Bio.PDB` command, which provides functions for parsing PDB files.

#### We will use `Bio.PDB.PDBParser()` to parse the PDB files. The next step is to create a parser object using `PDBParser()` from the PDB module.

#### We also want to assign the parser object to a variable using the `=` symbol, the same way we assign a number to a variable. Run the following code in a code cell.

#### `parser = Bio.PDB.PDBParser()`

#### Now we use the `get_structure()` method of the parser object to parse a single PDB file. It returns a structure object that contains the information about the protein structure in the PDB file. `get_structure()` requires two arguments. The first is `'name'`, a string that names the structure created from the PDB file; the second is `'file_name.pdb'`, the name of the PDB file being parsed.

#### `parser.get_structure('name', 'file_name.pdb')`

In [None]:
parser = Bio.PDB.PDBParser()
structure = parser.get_structure('6x8j', '6x8j.pdb')

## Structure of the Data

#### The structural data extracted from the PDB file are organized in a hierarchy: structure &rarr; model &rarr; chain &rarr; residue &rarr; atom. The structure is the 3D structure of a protein and can contain one or more models; a chain is an individual peptide chain in the protein; a residue is an amino acid residue in a chain and is composed of several atoms; and an atom is a single atom within a given residue. Breaking the structure down this way makes it easier to understand the relationships between the parts of a protein structure and to perform more complex analyses of proteins and other biological macromolecules.

| Level       | Description                           |
| ----------- | ------------------------------------- |
| Structure   | Protein structure; may contain multiple models |
| Model       | Particular 3D model of the protein   |
| Chain       | Each peptide chain in the protein     |
| Residue     | Amino acid residue in a given chain   |
| Atom        | Atom in a particular residue          |
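
#### The hierarchy in the table above can be walked with nested `for` loops, because iterating over each level yields the objects one level down. The sketch below counts the entries at every level. `count_levels` is a hypothetical helper, and `toy_structure` is a nested list standing in for a parsed structure so the example runs on its own.

```python
def count_levels(structure):
    """Count models, chains, residues, and atoms by walking the hierarchy."""
    counts = {"models": 0, "chains": 0, "residues": 0, "atoms": 0}
    for model in structure:            # structure -> models
        counts["models"] += 1
        for chain in model:            # model -> chains
            counts["chains"] += 1
            for residue in chain:      # chain -> residues
                counts["residues"] += 1
                for atom in residue:   # residue -> atoms
                    counts["atoms"] += 1
    return counts


# Stand-in data (an assumption for illustration): one model with two chains,
# written as nested lists instead of Biopython objects.
toy_structure = [[[["N", "CA"], ["N", "CA", "C"]], [["N"]]]]
print(count_levels(toy_structure))
```

#### With a structure parsed by Biopython, `count_levels(structure)` would apply the same loops to the real data. The residue count then includes any water or ligand entries stored in the chains, not only amino acids.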


#### A PDB file can contain multiple models of a protein, but most contain only one. Our data has only one model, so we access the first (and only) model using indexing. To access an item, put its index in square brackets `[ ]` after the variable name. Python indexing starts at zero, so the first protein model is `structure[0]`. If there were a second, it would be `structure[1]`.

#### We use a `for` loop to run through the protein model and find out how many chains it contains. The code below systematically goes through every chain in `protein_model`, assigns it to the variable `chain`, and then prints (i.e., displays on the screen) the information.

#### `protein_model = structure[0]`

In [None]:
protein_model = structure[0]

for chain in protein_model:
    print(chain)

- Chain id=A
- Chain id=C
- Chain id=B
- Chain id=D
- Chain id=E
- Chain id=F
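
#### Beyond listing the chains, the same loop can report how many residues each chain holds. The sketch below relies on two things Biopython provides: each chain has an `id` attribute, and each residue has an `id` tuple of (hetero flag, residue number, insertion code) whose first element is a blank string for standard residues. `residues_per_chain` is a hypothetical helper name.

```python
def residues_per_chain(model):
    """Return {chain id: number of standard residues} for one model."""
    counts = {}
    for chain in model:
        # A blank hetero flag marks a standard residue; waters and ligands
        # carry a non-blank flag and are skipped here.
        counts[chain.id] = sum(1 for residue in chain if residue.id[0] == " ")
    return counts
```

#### In this notebook, `residues_per_chain(protein_model)` would return one count for each of the six chains listed above.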

    
#### To access a particular chain, use the chain's id. For example, to access chain A, type `protein_model['A']`. Select the first chain and assign it to an intuitive variable name of your choice. A residue within the chain is accessed the same way by its residue number, for example `chain_A[58]`.

In [None]:
chain_A = protein_model['A']
res = chain_A[58]
res

#### The output of the cell above is: `Residue TYR het=  resseq=58 icode=`

In [None]:
res.get_unpacked_list()

The list of output atoms is:

- Atom N
- Atom CA
- Atom C
- Atom O
- Atom CB
- Atom CG
- Atom CD1
- Atom CD2
- Atom CE1
- Atom CE2
- Atom CZ
- Atom OH
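
#### Each atom in the list also carries its coordinates. As a minimal sketch, the helper below collects the name and xyz position of every atom in a residue using the `get_name()` and `get_coord()` methods of Biopython atom objects. `atom_coordinates` is a hypothetical helper name.

```python
def atom_coordinates(residue):
    """Return (atom name, x, y, z) tuples for every atom in a residue."""
    rows = []
    for atom in residue:
        x, y, z = atom.get_coord()  # three coordinate values for this atom
        rows.append((atom.get_name(), float(x), float(y), float(z)))
    return rows
```

#### Calling `atom_coordinates(res)` on the tyrosine residue selected above would give one row per atom, in the same order as the list.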

## Examining Amino Acid Frequency

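#### Examining amino acid frequency with Biopython helps us analyze the residue composition of the protein. The function below collects the residue names from a PDB file so that they can be counted.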

In [4]:
def get_aa(file):
    '''Accepts a PDB file name (string) and returns a list of the residues
    that occur in its peptides.

    get_aa('1abc.pdb') -> ['GLY', 'ALA', 'LYS']
    '''

    amino_acids = []  # empty list to add the amino acids to

    # parse the file passed in, using its base name (without extension) as the id
    structure_id = os.path.splitext(os.path.basename(file))[0]
    parser = Bio.PDB.PDBParser()
    structure = parser.get_structure(structure_id, file)
    # build the polypeptides found in the first model
    peptides = Bio.PDB.PPBuilder().build_peptides(structure[0])

    # go through each peptide and residue and append the amino acid identity to the list
    for peptide in peptides:
        for res in peptide:
            res_name = res.get_resname()
            amino_acids.append(res_name)

    return amino_acids


In [None]:
amino_acids = get_aa('6x8j.pdb')
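
#### Once the residue names are in a list, their frequencies can be tallied with `collections.Counter` from the standard library. This is a minimal sketch: `aa_frequency` is a hypothetical helper, and `example` is a short stand-in list rather than the real contents of 6x8j.pdb.

```python
from collections import Counter


def aa_frequency(residue_names):
    """Return (residue name, count, percent) rows, most common first."""
    counts = Counter(residue_names)
    total = sum(counts.values())
    return [(name, n, 100 * n / total) for name, n in counts.most_common()]


# Stand-in data (an assumption for illustration); in the notebook this would
# be the amino_acids list returned by get_aa().
example = ['GLY', 'ALA', 'LYS', 'GLY']
for name, n, pct in aa_frequency(example):
    print(f"{name}: {n} ({pct:.1f}%)")
```

#### Passing the notebook's `amino_acids` list to `aa_frequency` would give the counts needed for a bar plot of amino acid frequency.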