# Table of Contents:
* [Li Yuan's second week work](#2)
* [Li Yuan's third week work](#3)
    * [Pandas convert](#31)
    * [MdTraj convert](#32)
* [Li Yuan's fourth week work](#4)
    * [get_all_atom() function](#41)
    * [get_atom_and_hetatm() function](#42)
    * [PDB Class](#43)
    * [An example of this class "4lza"](#44)

<a id='2'></a>
# Li Yuan's second week work 

This is a set of basic examples of the usage and outputs of the individual functions included in pypdb. There are generally three types of functions:

+ Functions that perform searches and return lists of PDB IDs
+ Functions that get information about specific PDB IDs
+ Other general-purpose lookup functions

The list of supported search types, as well as the different types of information that can be returned for a given PDB ID, is large (and growing) and is enumerated in the docstrings of pypdb.py. The PDB allows a very wide range of different types of queries, and so any option that is not currently available can likely be implemented based on the structure of the query types that have already been implemented. Please submit feedback and pull requests on GitHub.

### I didn't find any function in the pypdb package that extracts the SEQRES and ATOM records, so I use only its get_pdb_file() function to fetch the file and wrote my own functions to do the extraction.

### Preamble

We import the pypdb package and set up the notebook environment.

In [3]:
%pylab inline
from IPython.display import HTML

## Import from local directory
import sys
sys.path.insert(0, '../pypdb')
from pypdb import *

## Import from installed package
# from pypdb import *

import pprint

%load_ext autoreload
%autoreload 2

Populating the interactive namespace from numpy and matplotlib


## This function I wrote extracts only the SEQRES records as a list

In [None]:
def get_seqres(pdb_id):
    """ Return the seqres sequence of a pdb file
    
    >>> get_seqres('4Z0L')
    >>> get_seqres('4lza')
    """
    pdb_file = get_pdb_file(pdb_id, filetype='pdb', compression=False) 
    # using a get_pdb_file() function from pypdb package to return a file with format 'pdb'.
    file1 = pdb_file.splitlines()
    # split this long string into list by \n.
    list_se = []
    for line in file1:
        if line[:6] == "SEQRES":
            list_se.append(line)
    return(list_se)

In [None]:
get_seqres('4lza')[:10]

In [None]:
get_seqres('4Z0L')[:20]
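
The raw SEQRES lines can be grouped into one residue list per chain. The following is a minimal sketch, not part of pypdb: `seqres_to_chains` is a hypothetical helper, and it assumes the fields of each SEQRES line are separated by whitespace (record name, serial number, chain ID, residue count, then the residues). The two sample lines are rebuilt from the 4lza output shown later in this notebook.

```python
from collections import defaultdict


def seqres_to_chains(seqres_lines):
    """Group the residue names found in SEQRES lines by chain ID."""
    chains = defaultdict(list)
    for line in seqres_lines:
        fields = line.split()
        # fields: "SEQRES", serial number, chain ID, residue count, residues...
        chain_id = fields[2]
        chains[chain_id].extend(fields[4:])
    return dict(chains)


sample = [
    "SEQRES   1 A  195  MSE HIS HIS HIS HIS HIS HIS SER SER GLY VAL ASP LEU",
    "SEQRES   2 A  195  GLY THR GLU ASN LEU TYR PHE GLN SER MSE THR LEU GLU",
]
chains = seqres_to_chains(sample)
print(chains["A"][:5])
```

With a real entry, the list returned by `get_seqres()` could be passed in place of `sample`.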

### This function I wrote extracts only the ATOM records as a list

In [None]:
def get_atom(pdb_id):
    """ Return the atom sequence of a pdb file
    
    >>> get_atom('4Z0L')
    >>> get_atom('4lza')
    """
    pdb_file = get_pdb_file(pdb_id, filetype='pdb', compression=False)
    # using a get_pdb_file() function from pypdb package to return a file with format 'pdb'.
    file1 = pdb_file.splitlines()
    list_atom = []
    for line in file1:
        if line[:4] == "ATOM":
            list_atom.append(line)
    return(list_atom)

In [None]:
get_atom('4Z0L')[:10]

In [None]:
get_atom('4lza')[:10]

<a id='3'></a>
# Li Yuan's third week work

<a id='31'></a>
## We first used pandas to convert a list into a DataFrame

### First we used split() to split each string in the list returned by the get_atom() function

In [4]:
import pandas as pd

In [None]:
def get_atom(pdb_id):
    """ Return the atom sequence of a pdb file as a pandas dataframe
    
    >>> get_atom('4Z0L')
    >>> get_atom('4lza')
    """
    pdb_file = get_pdb_file(pdb_id, filetype='pdb', compression=False)
    # using a get_pdb_file() function from pypdb package to return a file with format 'pdb'.
    file1 = pdb_file.splitlines()
    list_atom = []
    for line in file1:
        if line[:4] == "ATOM":
            list_atom.append(line)
    list_s_atom = [s.split() for s in list_atom]
    # split each string in a list by white spaces
    df = pd.DataFrame(list_s_atom)
    # use DataFrame function to convert a list to dataframe
    df["id"] = pdb_id
    # add an id column to the existing dataframe
    return(df)

In [None]:
get_atom("4lza").head(11)
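
One caveat with `split()`: PDB records are fixed-width, so neighbouring fields can touch when the values are wide, and whitespace splitting then merges them into one token. The sketch below illustrates this with a synthetic ATOM line (built here for illustration, not taken from any real entry) whose coordinates each fill their full 8-character field. Slicing by column position recovers the three values.

```python
# Synthetic coordinates chosen so that each one fills its 8-character field
x, y, z = -127.785, -105.217, -121.426

# Build the record column by column (serial 1, atom N, residue THR, chain A, resSeq 0)
line = (
    f"ATOM  {1:5d}  N   THR A{0:4d}    "
    f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{50.53:6.2f}"
)

tokens = line.split()
# The three coordinates are fused into a single token
print(len(tokens), tokens[6])

# Fixed-width slices (columns 31-38, 39-46, 47-54) keep them apart
coords = tuple(float(line[a:b]) for a, b in [(30, 38), (38, 46), (46, 54)])
print(coords)
```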

## Second we want to put a couple of pdb entries into one dataframe.

In [None]:
def get_some_atom(L):
    """ Take a list of pdb ids and return their atom records combined into one dataframe
    
    >>> get_some_atom(["4lza", "4Z0L"])
    """
    frames = [get_atom(l) for l in L]
    return(pd.concat(frames))

## We test this function with a list ["4lza", "4Z0L"]

In [None]:
get_some_atom(["4lza", "4Z0L"])
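
By default `pd.concat` keeps each frame's own row labels, so the combined frame has repeated index values. A small sketch with two stand-in frames (hypothetical placeholders for the `get_atom()` results) shows the difference that `ignore_index=True` makes:

```python
import pandas as pd

# Two tiny stand-in frames used in place of real get_atom() results
first = pd.DataFrame({0: ["ATOM", "ATOM"], "id": ["4lza", "4lza"]})
second = pd.DataFrame({0: ["ATOM", "ATOM"], "id": ["4Z0L", "4Z0L"]})

stacked = pd.concat([first, second])                        # index: 0, 1, 0, 1
renumbered = pd.concat([first, second], ignore_index=True)  # index: 0, 1, 2, 3

print(list(stacked.index))
print(list(renumbered.index))
```

Whether to renumber is a design choice: the repeated labels preserve the row position within each entry, while a fresh index makes label-based lookups unambiguous.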

## Next we want to put all ids into one dataframe

### We used get_all() to list all pdb entries. 

In [None]:
len(get_all())

### We found there were 169681 entries in the current PDB database.

### We used the concat() function to merge the dataframes of the pdb entries into one dataframe; the cell below does this for the first two entries only.

In [None]:
frames = [get_atom(pdb_id) for pdb_id in get_all()[:2]]  # pdb_id avoids shadowing the built-in id()
pd.concat(frames)

<a id='32'></a>
## We used the mdtraj package to load a pdb file into memory from a URL

>MDTraj is a python library that allows users to manipulate molecular dynamics (MD) trajectories. Features include:
1. Wide MD format support, including pdb, xtc, trr, dcd, binpos, netcdf, mdcrd, prmtop, and more.
2. Extremely fast RMSD calculations (4x the speed of the original Theobald QCP).
3. Extensive analysis functions including those that compute bonds, angles, dihedrals, hydrogen bonds, secondary structure, and NMR observables.
4. Lightweight, Pythonic API.

In [None]:
import mdtraj as md # import this package

In [None]:
pdb = md.load_pdb("https://files.rcsb.org/view/4LZA.pdb")  # load data

In [None]:
print(pdb) # print to see how many frames and atoms, residues this file has 

## We extract the topology from this pdb file

In [None]:
topology = pdb.topology

In [None]:
table, bonds = topology.to_dataframe()

In [None]:
print(table.head(7))

In [None]:
topology.atom(10)

In [None]:
topology.atoms

In [None]:
[i for i in topology.atoms][:10]

In [None]:
print(table.head(10))

In [None]:
atom = pdb.atom_slice(range(2833))

In [None]:
print(atom)

In [None]:
atom.xyz

In [None]:
[i for i in topology.bonds][:10]

<a id='4'></a>
# Li Yuan's fourth week work

## This function returns the whole coordinate section, including ATOM, ANISOU, TER and HETATM records
<a id='41'></a>

In [60]:
def get_all_atom(pdb_id):
    """ Return the all atom sequence of a pdb file as a pandas dataframe
    
    >>> get_all_atom('4Z0L')
    >>> get_all_atom('4lza')
    """
    pdb_file = get_pdb_file(pdb_id, filetype='pdb', compression=False)
    # using a get_pdb_file() function from pypdb package to return a file with format 'pdb'.
    file1 = pdb_file.splitlines()
    list_atom = []
    
    # Find the index of the first line that starts with ATOM
    num = None
    for i, line in enumerate(file1):
        if line[:4] == "ATOM":
            num = i
            break
    
    # append every line of the atom part until we meet CONECT or the end of the file
    while num is not None and num < len(file1) and file1[num][:6] != "CONECT":
        list_atom.append(file1[num])
        num = num + 1
        
    list_s_atom = [s.split() for s in list_atom]
    # split each string in a list by white spaces
    
    df = pd.DataFrame(list_s_atom)
    # use DataFrame function to convert a list to dataframe
    df["id"] = pdb_id
    # add an id column to the existing dataframe
    return(df)

In [62]:
get_all_atom("4lza")

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,id
0,ATOM,1,N,THR,A,0,-27.785,5.217,-21.426,1.00,50.53,N,,4lza
1,ANISOU,1,N,THR,A,0,6212,6054,6933,75,-219,-115,N,4lza
2,ATOM,2,CA,THR,A,0,-27.459,5.049,-19.974,1.00,49.41,C,,4lza
3,ANISOU,2,CA,THR,A,0,6089,5870,6816,59,-181,-113,C,4lza
4,ATOM,3,C,THR,A,0,-25.949,5.130,-19.667,1.00,46.13,C,,4lza
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5529,HETATM,2850,O,HOH,B,286,-0.332,17.645,1.000,1.00,42.23,O,,4lza
5530,HETATM,2851,O,HOH,B,287,-5.698,23.679,-0.316,1.00,47.35,O,,4lza
5531,HETATM,2852,O,HOH,B,288,-6.332,-13.026,-3.481,1.00,51.79,O,,4lza
5532,HETATM,2853,O,HOH,B,289,-8.265,-14.563,-0.902,1.00,50.84,O,,4lza


## This function I wrote returns only the ATOM and HETATM records
<a id='42'></a>

In [5]:
def get_atom_and_hetatm(pdb_id):
    """ Return the ATOM and HETATM records of a pdb file as a pandas dataframe
    
    >>> get_atom_and_hetatm('4Z0L')
    >>> get_atom_and_hetatm('4lza')
    """
    pdb_file = get_pdb_file(pdb_id, filetype='pdb', compression=False)
    # using a get_pdb_file() function from pypdb package to return a file with format 'pdb'
    file1 = pdb_file.splitlines()
    list_atom = []
    
    # we only select line with atom and hetatm
    for line in file1:
        if line[:4] == "ATOM" or line[:6] == "HETATM":
            list_atom.append(line)
        
    list_s_atom = [s.split() for s in list_atom]
    # split each string in a list by white spaces
    df = pd.DataFrame(list_s_atom)
    # use DataFrame function to convert a list to dataframe
    df["id"] = pdb_id
    # add an id column to the existing dataframe
    return(df)

In [6]:
get_atom_and_hetatm("4lza")

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,id
0,ATOM,1,N,THR,A,0,-27.785,5.217,-21.426,1.00,50.53,N,4lza
1,ATOM,2,CA,THR,A,0,-27.459,5.049,-19.974,1.00,49.41,C,4lza
2,ATOM,3,C,THR,A,0,-25.949,5.130,-19.667,1.00,46.13,C,4lza
3,ATOM,4,O,THR,A,0,-25.572,5.789,-18.699,1.00,44.22,O,4lza
4,ATOM,5,CB,THR,A,0,-28.153,3.815,-19.346,1.00,51.85,C,4lza
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2847,HETATM,2850,O,HOH,B,286,-0.332,17.645,1.000,1.00,42.23,O,4lza
2848,HETATM,2851,O,HOH,B,287,-5.698,23.679,-0.316,1.00,47.35,O,4lza
2849,HETATM,2852,O,HOH,B,288,-6.332,-13.026,-3.481,1.00,51.79,O,4lza
2850,HETATM,2853,O,HOH,B,289,-8.265,-14.563,-0.902,1.00,50.84,O,4lza
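
The two functions above differ only in which record types they keep, so the selection could be written once with the record names as a parameter. This is a sketch under that assumption; `select_records` and `RECORDS` are hypothetical names, and the sample lines are shortened stand-ins for real file lines.

```python
RECORDS = ("ATOM", "HETATM")


def select_records(lines, records=RECORDS):
    """Keep only the lines whose record name is one of `records`."""
    # The record name occupies the first six columns, padded with spaces
    return [line for line in lines if line[:6].strip() in records]


sample = [
    "HEADER    TRANSFERASE",
    "ATOM      1  N   THR A   0",
    "ANISOU    1  N   THR A   0",
    "TER",
    "HETATM 2850  O   HOH B 286",
]
kept = select_records(sample)
with_anisou = select_records(sample, records=("ATOM", "ANISOU", "HETATM"))
print(len(kept), len(with_anisou))
```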


<a id='43'></a>
## I created a class that wraps one pdb_id, exposes the desired fields as attributes, and provides the methods we want

In [43]:
class pdb:
    """A class of each pdb_id file, including desired attributes and methods"""
    
    def __init__(self, pdb_id):
        """Create an object given a specific id"""
        
        self.id = pdb_id
        self.file = get_pdb_file(pdb_id, filetype='pdb', compression=False)
        self.list = self.file.splitlines()
        
    def get_enzyme_type(self):
        """get the enzyme type of this pdb"""
        
        for line in self.list:
            if line[:6] == "HEADER":
                return(line.split()[1])
            
    def get_name(self):
        """get molecule name of this pdb"""
        
        for line in self.list:
            if "MOLECULE:" in line:
                res = line[line.find("MOLECULE:")+10:]
                return(res[:res.find(";")])
            
    def get_organism_name(self):
        """get organism name of this pdb"""
        
        for line in self.list:
            if "ORGANISM_SCIENTIFIC:" in line:
                res = line[line.find("ORGANISM_SCIENTIFIC:")+21:]
                return(res[:res.find(";")])
            
    def get_organism_taxid(self):
        """get organism taxid of this pdb"""
        
        for line in self.list:
            if "ORGANISM_TAXID:" in line:
                res = line[line.find("ORGANISM_TAXID:")+16:]
                return(res[:res.find(";")])
            
    def get_chain_id(self):
        """get chain id of this pdb"""
        
        for line in self.list:
            if "CHAIN:" in line:
                res = line[line.find("CHAIN:")+7:]
                return(res[:res.find(";")])
            
    
    def get_ec_number(self):
        """get EC number of this pdb"""
        
        for line in self.list:
            if "EC:" in line:
                res = line[line.find("EC:")+4:]
                return(res[:res.find(";")])
            
    def get_strain(self):
        """get strain of this pdb"""
        
        for line in self.list:
            if "STRAIN:" in line:
                res = line[line.find("STRAIN:")+8:]
                return(res[:res.find(";")])
    
    def get_gene(self):
        """get gene of this pdb"""
        
        for line in self.list:
            if "GENE:" in line:
                res = line[line.find("GENE:")+6:]
                return(res[:res.find(";")])
            
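    def get_resolution(self):
        """get resolution of this pdb"""
        
        for line in self.list:
            if "RESOLUTION." in line:
                # text after the "RESOLUTION." label, without the padding spaces
                res = line[line.find("RESOLUTION.") + 11:].strip()
                end = res.find("ANGSTROMS.")
                # keep "<value> ANGSTROMS" without the trailing period;
                # if the unit is absent, return the remaining text unchanged
                return(res[:end + 9] if end != -1 else res)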
            
    def get_seqres(self):
        """get the seqres sequence of a pdb file"""
    
        list_se = []
        for line in self.list:
            if line[:6] == "SEQRES":
                list_se.append(line)
        list_se = [s.split() for s in list_se]
        df = pd.DataFrame(list_se)
        return(df)
    
    def get_atom_and_hetatm(self):
        """get the atom and hetatm sequence of a pdb file as a pandas dataframe"""
        
        list_atom = []
        # we only select line with atom and hetatm
        for line in self.list:
            if line[:4] == "ATOM" or line[:6] == "HETATM":
                list_atom.append(line)
        
        
        list_s_atom = []
        # according to the string position to make a list of each row
        for line in list_atom:
            l = []
            if line[:6].isspace():
                l.append(None)
            else:
                l.append(line[:6].split()[0])
            if line[6:11].isspace():
                l.append(None)
            else:
                l.append(line[6:11].split()[0])
            
            if line[12:16].isspace():
                l.append(None)
            else:
                l.append(line[12:16].split()[0])
                
            if line[16].isspace():
                l.append(None)
            else:
                l.append(line[16].split()[0])
            
            if line[17:20].isspace():
                l.append(None)
            else:
                l.append(line[17:20].split()[0])
                
            if line[21].isspace():
                l.append(None)
            else:
                l.append(line[21].split()[0])
                
            if line[22:26].isspace():
                l.append(None)
            else:
                l.append(line[22:26].split()[0])
                
            if line[26].isspace():
                l.append(None)
            else:
                l.append(line[26].split()[0])
                
            if line[30:38].isspace():
                l.append(None)
            else:
                l.append(line[30:38].split()[0])
                
            if line[38:46].isspace():
                l.append(None)
            else:
                l.append(line[38:46].split()[0])
                
            if line[46:54].isspace():
                l.append(None)
            else:
                l.append(line[46:54].split()[0])
                
            if line[54:60].isspace():
                l.append(None)
            else:
                l.append(line[54:60].split()[0])
                
            if line[60:66].isspace():
                l.append(None)
            else:
                l.append(line[60:66].split()[0])
                
            if line[76:78].isspace():
                l.append(None)
            else:
                l.append(line[76:78].split()[0])
                
            if line[78:80].isspace():
                l.append(None)
            else:
                l.append(line[78:80].split()[0])
            
            list_s_atom.append(l)
                
        df = pd.DataFrame(list_s_atom)
        # use DataFrame function to convert a list to dataframe
        df.columns = ["Record Name", "serial", "name", "altLoc", "resName", "chainID", "resSeq", 
                     "iCode", "x", "y", "z", "occupancy", "tempFactor", "element", "charge"]
        return(df)
    
    def get_missing_residue(self):
        """get the missing residue as a pandas data frame"""
        
        # start from None so that a file without this remark does not raise a NameError
        res = None
        for line in self.list:
            if "MISSING RESIDUES" in line:
                res = line
        
        if res is None:
            return("There is no missing residue in this pdb file.")
        
        inx = self.list.index(res)
        
        while "M RES C SSSEQI" not in self.list[inx]:
            inx = inx + 1
        
        inx += 1
        
        list_miss = []
        # according to the string position to make a list of each row
        while not self.list[inx][10:].isspace():
            l = []
            line = self.list[inx]
            if line[13].isspace():
                l.append(None)
            else:
                l.append(line[13])
            if line[15:18].isspace():
                l.append(None)
            else:
                l.append(line[15:18])
            if line[19].isspace():
                l.append(None)
            else:
                l.append(line[19])
            if line[21:26].isspace():
                l.append(None)
            else:
                l.append(line[21:26])
            if line[26].isspace():
                l.append(None)
            else:
                l.append(line[26])
            list_miss.append(l)
            inx = inx + 1
        
        df = pd.DataFrame(list_miss)
        df.columns = ["MODEL NUMBER", "RESIDUE NAME", "CHAIN IDENTIFIER", "SEQUENCE NUMBER", "INSERTION CODE"]
        return(df)
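
The long chain of `if`/`else` blocks in `get_atom_and_hetatm()` repeats one step per column, so it can also be expressed as a table of column slices. The sketch below reuses the same slice positions and column names as the method; `ATOM_FIELDS` and `parse_atom_line` are hypothetical names, and the sample line is rebuilt column by column from the first 4lza ATOM row shown earlier.

```python
# (column name, start, end) with the same slices used in get_atom_and_hetatm()
ATOM_FIELDS = [
    ("Record Name", 0, 6), ("serial", 6, 11), ("name", 12, 16),
    ("altLoc", 16, 17), ("resName", 17, 20), ("chainID", 21, 22),
    ("resSeq", 22, 26), ("iCode", 26, 27), ("x", 30, 38), ("y", 38, 46),
    ("z", 46, 54), ("occupancy", 54, 60), ("tempFactor", 60, 66),
    ("element", 76, 78), ("charge", 78, 80),
]


def parse_atom_line(line):
    """Slice one ATOM/HETATM line into a dict; blank fields become None."""
    row = {}
    for name, start, end in ATOM_FIELDS:
        text = line[start:end].strip()
        row[name] = text or None
    return row


# First ATOM record of 4lza, rebuilt with fixed-width formatting
line = (
    f"ATOM  {1:5d}  N   THR A{0:4d}    "
    f"{-27.785:8.3f}{5.217:8.3f}{-21.426:8.3f}{1.0:6.2f}{50.53:6.2f}"
    + " " * 10 + " N"
)
row = parse_atom_line(line)
print(row)
```

Because an out-of-range slice is simply an empty string, this version also returns `None` for fields missing from a line shorter than 80 columns.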

<a id='44'></a>
### An example: the 4lza object of this class

In [46]:
_4lza = pdb("4lza")
print(_4lza.id)

4lza


In [47]:
print(_4lza.list[:2])

['HEADER    TRANSFERASE                             31-JUL-13   4LZA              ', 'TITLE     CRYSTAL STRUCTURE OF ADENINE PHOSPHORIBOSYLTRANSFERASE FROM           ']


In [48]:
print(_4lza.get_enzyme_type())

TRANSFERASE


In [49]:
print(_4lza.get_name())

ADENINE PHOSPHORIBOSYLTRANSFERASE


In [50]:
print(_4lza.get_organism_name())

THERMOANAEROBACTER PSEUDETHANOLICUS


In [51]:
print(_4lza.get_organism_taxid())

340099


In [52]:
print(_4lza.get_chain_id())

A, B


In [53]:
print(_4lza.get_ec_number())

2.4.2.7


In [54]:
print(_4lza.get_strain())

ATCC 33223


In [55]:
print(_4lza.get_gene())

166856274, APT, TETH39_1027
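
The accessors used so far (`get_name()`, `get_organism_taxid()`, `get_chain_id()`, `get_ec_number()`, `get_strain()`, `get_gene()`) all read the text between a `KEY:` token and the next semicolon. A single helper could cover that pattern. This is a sketch only: `get_field` is a hypothetical name, and the sample lines are abbreviated reconstructions based on the outputs above, not lines copied from the file.

```python
def get_field(lines, key):
    """Return the text between 'KEY:' and the next ';' on the first matching line."""
    marker = key + ":"
    for line in lines:
        if marker in line:
            value = line.split(marker, 1)[1]
            # drop everything from the terminating semicolon on, then the padding
            return value.split(";", 1)[0].strip()
    return None


# Abbreviated sample lines (the line layout is an assumption)
sample = [
    "COMPND   2 MOLECULE: ADENINE PHOSPHORIBOSYLTRANSFERASE;          ",
    "COMPND   3 CHAIN: A, B;                                          ",
    "COMPND   4 EC: 2.4.2.7;                                          ",
    "SOURCE   4 STRAIN: ATCC 33223                                    ",
]
print(get_field(sample, "MOLECULE"))
print(get_field(sample, "STRAIN"))
```

Splitting on the semicolon instead of slicing with `find(";")` matters when a line has no semicolon: `find` returns -1 in that case, and slicing up to -1 would drop the last character of the line.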


In [56]:
print(_4lza.get_resolution())

1.84 ANGSTROMS


In [57]:
print(_4lza.get_seqres().head(6))

        0  1  2    3    4    5    6    7    8    9   10   11   12   13   14  \
0  SEQRES  1  A  195  MSE  HIS  HIS  HIS  HIS  HIS  HIS  SER  SER  GLY  VAL   
1  SEQRES  2  A  195  GLY  THR  GLU  ASN  LEU  TYR  PHE  GLN  SER  MSE  THR   
2  SEQRES  3  A  195  GLU  ILE  LYS  MSE  MSE  ILE  ARG  GLU  ILE  PRO  ASP   
3  SEQRES  4  A  195  LYS  LYS  GLY  ILE  LYS  PHE  LYS  ASP  ILE  THR  PRO   
4  SEQRES  5  A  195  LYS  ASP  ALA  LYS  ALA  PHE  ASN  TYR  SER  ILE  GLU   
5  SEQRES  6  A  195  ALA  LYS  ALA  LEU  GLU  GLY  ARG  LYS  PHE  ASP  LEU   

    15   16  
0  ASP  LEU  
1  LEU  GLU  
2  PHE  PRO  
3  VAL  LEU  
4  MSE  LEU  
5  ILE  ALA  


In [58]:
print(_4lza.get_atom_and_hetatm().head(10))

  Record Name serial name altLoc resName chainID resSeq iCode        x      y  \
0        ATOM      1    N   None     THR       A      0  None  -27.785  5.217   
1        ATOM      2   CA   None     THR       A      0  None  -27.459  5.049   
2        ATOM      3    C   None     THR       A      0  None  -25.949  5.130   
3        ATOM      4    O   None     THR       A      0  None  -25.572  5.789   
4        ATOM      5   CB   None     THR       A      0  None  -28.153  3.815   
5        ATOM      6  OG1   None     THR       A      0  None  -27.919  3.787   
6        ATOM      7  CG2   None     THR       A      0  None  -27.688  2.516   
7        ATOM      8    N   None     LEU       A      1  None  -25.087  4.511   
8        ATOM      9   CA   None     LEU       A      1  None  -23.681  4.942   
9        ATOM     10    C   None     LEU       A      1  None  -23.615  6.356   

         z occupancy tempFactor element charge  
0  -21.426      1.00      50.53       N   None  
1  -19.974
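
Every column returned by `get_atom_and_hetatm()` holds strings, because the values come from string slices. Before doing arithmetic on the coordinates, the numeric columns can be converted with `pd.to_numeric`. The frame below is a small stand-in with values copied from the first rows of the output above, not a call to the class.

```python
import pandas as pd

# Stand-in for the first rows of get_atom_and_hetatm(); all values are strings
df = pd.DataFrame({
    "serial": ["1", "2", "3"],
    "x": ["-27.785", "-27.459", "-25.949"],
    "y": ["5.217", "5.049", "5.130"],
})

numeric_cols = ["serial", "x", "y"]
# Convert each selected column from strings to numbers
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)

print(df.dtypes)
```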

In [59]:
print(_4lza.get_missing_residue())

   MODEL NUMBER RESIDUE NAME CHAIN IDENTIFIER SEQUENCE NUMBER INSERTION CODE
0          None          MSE                A             -23           None
1          None          HIS                A             -22           None
2          None          HIS                A             -21           None
3          None          HIS                A             -20           None
4          None          HIS                A             -19           None
5          None          HIS                A             -18           None
6          None          HIS                A             -17           None
7          None          SER                A             -16           None
8          None          SER                A             -15           None
9          None          GLY                A             -14           None
10         None          VAL                A             -13           None
11         None          ASP                A             -12           None
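
The sequence numbers in this table are still text. A possible variation, sketched below, strips each sliced field and converts the model and sequence numbers to integers. `parse_missing_residue` is a hypothetical helper that uses the same slice positions as `get_missing_residue()`, and the two sample rows are rebuilt from the output above.

```python
def parse_missing_residue(line):
    """Slice one REMARK 465 row into (model, residue, chain, number, insertion code)."""
    def clean(text):
        text = text.strip()
        return text or None

    model = clean(line[13:14])
    number = clean(line[21:26])
    return (
        int(model) if model else None,
        clean(line[15:18]),
        clean(line[19:20]),
        int(number) if number else None,
        clean(line[26:27]),
    )


rows = [
    "REMARK 465     MSE A   -23",
    "REMARK 465     HIS A   -22",
]
parsed = [parse_missing_residue(r) for r in rows]
print(parsed)
```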