# Table of Contents:
* [Li Yuan's second week work](#2)
* [Li Yuan's third week work](#3)
    * [Pandas convert](#31)
    * [MdTraj convert](#32)

<a id='2'></a>
# Li Yuan's second week work 

This is a set of basic examples of the usage and outputs of the individual functions included in pypdb. There are generally three types of functions:

+ Functions that perform searches and return lists of PDB IDs
+ Functions that get information about specific PDB IDs
+ Other general-purpose lookup functions

The list of supported search types, as well as the different types of information that can be returned for a given PDB ID, is large (and growing) and is enumerated in the docstrings of pypdb.py. The PDB allows a very wide range of different types of queries, and so any option that is not currently available can likely be implemented based on the structure of the query types that have already been implemented. Please submit feedback and pull requests on GitHub.

### I didn't find any function in the pypdb package that extracts the SEQRES and ATOM records, so I only use get_pdb_file() from that package to fetch the file and wrote my own functions to extract them.

### Preamble

We import the pypdb package and set up the notebook environment.

In [3]:
%pylab inline
from IPython.display import HTML

## Import from local directory
import sys
sys.path.insert(0, '../pypdb')
from pypdb import *

## Import from installed package
# from pypdb import *

import pprint

%load_ext autoreload
%autoreload 2

Populating the interactive namespace from numpy and matplotlib


### I wrote this function to extract only the SEQRES records as a list

In [4]:
def get_seqres(pdb_id):
    """Return the SEQRES records of a PDB entry as a list of lines.

    >>> get_seqres('4Z0L')
    >>> get_seqres('4lza')
    """
    # get_pdb_file() from pypdb returns the whole entry as one string in 'pdb' format.
    pdb_file = get_pdb_file(pdb_id, filetype='pdb', compression=False)
    # Split the string into lines and keep only those that start with "SEQRES".
    return [line for line in pdb_file.splitlines() if line.startswith("SEQRES")]

In [5]:
get_seqres('4lza')[:20]

['SEQRES   1 A  195  MSE HIS HIS HIS HIS HIS HIS SER SER GLY VAL ASP LEU          ',
 'SEQRES   2 A  195  GLY THR GLU ASN LEU TYR PHE GLN SER MSE THR LEU GLU          ',
 'SEQRES   3 A  195  GLU ILE LYS MSE MSE ILE ARG GLU ILE PRO ASP PHE PRO          ',
 'SEQRES   4 A  195  LYS LYS GLY ILE LYS PHE LYS ASP ILE THR PRO VAL LEU          ',
 'SEQRES   5 A  195  LYS ASP ALA LYS ALA PHE ASN TYR SER ILE GLU MSE LEU          ',
 'SEQRES   6 A  195  ALA LYS ALA LEU GLU GLY ARG LYS PHE ASP LEU ILE ALA          ',
 'SEQRES   7 A  195  ALA PRO GLU ALA ARG GLY PHE LEU PHE GLY ALA PRO LEU          ',
 'SEQRES   8 A  195  ALA TYR ARG LEU GLY VAL GLY PHE VAL PRO VAL ARG LYS          ',
 'SEQRES   9 A  195  PRO GLY LYS LEU PRO ALA GLU THR LEU SER TYR GLU TYR          ',
 'SEQRES  10 A  195  GLU LEU GLU TYR GLY THR ASP SER LEU GLU ILE HIS LYS          ',
 'SEQRES  11 A  195  ASP ALA VAL LEU GLU GLY GLN ARG VAL VAL ILE VAL ASP          ',
 'SEQRES  12 A  195  ASP LEU LEU ALA THR GLY GLY THR ILE TYR ALA 

In [6]:
get_seqres('4Z0L')[:20]

['SEQRES   1 A  587  ALA ASN PRO CYS CYS SER ASN PRO CYS GLN ASN ARG GLY          ',
 'SEQRES   2 A  587  GLU CYS MET SER THR GLY PHE ASP GLN TYR LYS CYS ASP          ',
 'SEQRES   3 A  587  CYS THR ARG THR GLY PHE TYR GLY GLU ASN CYS THR THR          ',
 'SEQRES   4 A  587  PRO GLU PHE LEU THR ARG ILE LYS LEU LEU LEU LYS PRO          ',
 'SEQRES   5 A  587  THR PRO ASN THR VAL HIS TYR ILE LEU THR HIS PHE LYS          ',
 'SEQRES   6 A  587  GLY VAL TRP ASN ILE VAL ASN ASN ILE PRO PHE LEU ARG          ',
 'SEQRES   7 A  587  SER LEU ILE MET LYS TYR VAL LEU THR SER ARG SER TYR          ',
 'SEQRES   8 A  587  LEU ILE ASP SER PRO PRO THR TYR ASN VAL HIS TYR GLY          ',
 'SEQRES   9 A  587  TYR LYS SER TRP GLU ALA PHE SER ASN LEU SER TYR TYR          ',
 'SEQRES  10 A  587  THR ARG ALA LEU PRO PRO VAL ALA ASP ASP CYS PRO THR          ',
 'SEQRES  11 A  587  PRO MET GLY VAL LYS GLY ASN LYS GLU LEU PRO ASP SER          ',
 'SEQRES  12 A  587  LYS GLU VAL LEU GLU LYS VAL LEU LEU ARG ARG 
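The SEQRES lines above still carry three-letter residue names, spread over many records. As a rough sketch of how they could be collapsed into one one-letter sequence per chain, the hypothetical helper `seqres_to_sequences()` below (not part of pypdb) reads the chain identifier from column 12 and the residue names from column 20 onward. Mapping MSE (selenomethionine) to `M` and unknown residues to `X` are assumptions made for illustration.

```python
from collections import defaultdict

# Standard three-letter to one-letter amino-acid codes.
# Mapping MSE (selenomethionine) to "M" is an assumption of this sketch.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
    "MSE": "M",
}


def seqres_to_sequences(seqres_lines):
    """Hypothetical helper: collapse SEQRES lines into one string per chain."""
    chains = defaultdict(list)
    for line in seqres_lines:
        chain_id = line[11]            # column 12 holds the chain identifier
        residues = line[19:].split()   # residue names start at column 20
        # Residues missing from the table are written as "X".
        chains[chain_id].extend(THREE_TO_ONE.get(r, "X") for r in residues)
    return {chain: "".join(codes) for chain, codes in chains.items()}


# First two SEQRES lines of 4lza, copied from the output above.
example = [
    'SEQRES   1 A  195  MSE HIS HIS HIS HIS HIS HIS SER SER GLY VAL ASP LEU          ',
    'SEQRES   2 A  195  GLY THR GLU ASN LEU TYR PHE GLN SER MSE THR LEU GLU          ',
]
sequences = seqres_to_sequences(example)
print(sequences)
```

In the notebook, the same helper could be called as `seqres_to_sequences(get_seqres('4lza'))`.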

### I wrote this function to extract only the ATOM records as a list

In [7]:
def get_atom(pdb_id):
    """Return the ATOM records of a PDB entry as a list of lines.

    >>> get_atom('4Z0L')
    >>> get_atom('4lza')
    """
    # get_pdb_file() from pypdb returns the whole entry as one string in 'pdb' format.
    pdb_file = get_pdb_file(pdb_id, filetype='pdb', compression=False)
    # Split the string into lines and keep only those that start with "ATOM".
    return [line for line in pdb_file.splitlines() if line.startswith("ATOM")]

In [8]:
get_atom('4Z0L')[:10]

['ATOM      1  N   ALA A  33     113.744  17.524  85.910  1.00 75.99           N  ',
 'ATOM      2  CA  ALA A  33     114.749  17.116  86.884  1.00 76.70           C  ',
 'ATOM      3  C   ALA A  33     115.677  18.275  87.231  1.00 73.52           C  ',
 'ATOM      4  O   ALA A  33     116.176  18.367  88.354  1.00 75.48           O  ',
 'ATOM      5  CB  ALA A  33     115.548  15.934  86.358  1.00 78.19           C  ',
 'ATOM      6  N   ASN A  34     115.906  19.154  86.261  1.00 67.98           N  ',
 'ATOM      7  CA  ASN A  34     116.747  20.327  86.469  1.00 63.43           C  ',
 'ATOM      8  C   ASN A  34     116.113  21.264  87.492  1.00 60.58           C  ',
 'ATOM      9  O   ASN A  34     115.006  21.756  87.287  1.00 61.30           O  ',
 'ATOM     10  CB  ASN A  34     116.983  21.058  85.144  1.00 63.09           C  ']

In [9]:
get_atom('4lza')[:10]

['ATOM      1  N   THR A   0     -27.785   5.217 -21.426  1.00 50.53           N  ',
 'ATOM      2  CA  THR A   0     -27.459   5.049 -19.974  1.00 49.41           C  ',
 'ATOM      3  C   THR A   0     -25.949   5.130 -19.667  1.00 46.13           C  ',
 'ATOM      4  O   THR A   0     -25.572   5.789 -18.699  1.00 44.22           O  ',
 'ATOM      5  CB  THR A   0     -28.153   3.815 -19.346  1.00 51.85           C  ',
 'ATOM      6  OG1 THR A   0     -27.919   3.787 -17.932  1.00 52.21           O  ',
 'ATOM      7  CG2 THR A   0     -27.688   2.516 -19.989  1.00 53.52           C  ',
 'ATOM      8  N   LEU A   1     -25.087   4.511 -20.480  1.00 43.20           N  ',
 'ATOM      9  CA  LEU A   1     -23.681   4.942 -20.481  1.00 42.39           C  ',
 'ATOM     10  C   LEU A   1     -23.615   6.356 -21.059  1.00 43.21           C  ']
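Note that get_atom() keeps only lines whose record name is ATOM. Modified residues, such as the MSE residues listed in the SEQRES output for 4lza, are commonly stored as HETATM records, so a filter on ATOM alone would likely skip them. The sketch below shows one way the record types could be made configurable. `select_records()` is a hypothetical helper, and the sample text is made up for illustration rather than taken from a real entry.

```python
def select_records(pdb_text, record_types=("ATOM", "HETATM")):
    """Hypothetical helper: keep lines whose record name is in record_types."""
    wanted = set(record_types)
    # The record name occupies the first six columns of every PDB line.
    return [line for line in pdb_text.splitlines()
            if line[:6].strip() in wanted]


# Tiny made-up snippet; the HETATM coordinates are placeholders.
sample_text = "\n".join([
    "HEADER    EXAMPLE",
    "ATOM      1  N   THR A   0     -27.785   5.217 -21.426  1.00 50.53           N  ",
    "HETATM    2 SE   MSE A   1       0.000   0.000   0.000  1.00  0.00          SE  ",
    "END",
])

atoms_only = select_records(sample_text, record_types=("ATOM",))
atoms_and_hetero = select_records(sample_text)
print(len(atoms_only), len(atoms_and_hetero))
```

With the notebook's own data, the input would be the string returned by `get_pdb_file()`.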

<a id='3'></a>
# Li Yuan's third week work

<a id='31'></a>
## We first used pandas to convert the list into a DataFrame

### First we used split() to split each string in the list returned by the get_atom() function

In [10]:
import pandas as pd

In [11]:
def get_atom(pdb_id):
    """Return the ATOM records of a PDB entry as a pandas DataFrame.

    >>> get_atom('4Z0L')
    >>> get_atom('4lza')
    """
    # get_pdb_file() from pypdb returns the whole entry as one string in 'pdb' format.
    pdb_file = get_pdb_file(pdb_id, filetype='pdb', compression=False)
    # Keep only the lines that start with "ATOM".
    list_atom = [line for line in pdb_file.splitlines() if line.startswith("ATOM")]
    # Split each record on whitespace; every field stays a string.
    list_s_atom = [s.split() for s in list_atom]
    # Convert the list of field lists into a DataFrame.
    df = pd.DataFrame(list_s_atom)
    # Add an id column holding the PDB ID to the existing DataFrame.
    df["id"] = pdb_id
    return df

In [12]:
get_atom("4lza").head(11)

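|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ATOM | 1 | N | THR | A | 0 | -27.785 | 5.217 | -21.426 | 1.00 | 50.53 | N | 4lza |
| 1 | ATOM | 2 | CA | THR | A | 0 | -27.459 | 5.049 | -19.974 | 1.00 | 49.41 | C | 4lza |
| 2 | ATOM | 3 | C | THR | A | 0 | -25.949 | 5.130 | -19.667 | 1.00 | 46.13 | C | 4lza |
| 3 | ATOM | 4 | O | THR | A | 0 | -25.572 | 5.789 | -18.699 | 1.00 | 44.22 | O | 4lza |
| 4 | ATOM | 5 | CB | THR | A | 0 | -28.153 | 3.815 | -19.346 | 1.00 | 51.85 | C | 4lza |
| 5 | ATOM | 6 | OG1 | THR | A | 0 | -27.919 | 3.787 | -17.932 | 1.00 | 52.21 | O | 4lza |
| 6 | ATOM | 7 | CG2 | THR | A | 0 | -27.688 | 2.516 | -19.989 | 1.00 | 53.52 | C | 4lza |
| 7 | ATOM | 8 | N | LEU | A | 1 | -25.087 | 4.511 | -20.480 | 1.00 | 43.20 | N | 4lza |
| 8 | ATOM | 9 | CA | LEU | A | 1 | -23.681 | 4.942 | -20.481 | 1.00 | 42.39 | C | 4lza |
| 9 | ATOM | 10 | C | LEU | A | 1 | -23.615 | 6.356 | -21.059 | 1.00 | 43.21 | C | 4lza |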
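Splitting on whitespace works for the rows shown here, but ATOM records are fixed-width, so adjacent fields can touch with no space between them (for example, large or negative coordinates) and an empty field shifts every later column. As a hedged alternative, the sketch below slices each record by the column positions defined in the PDB file format. `parse_atom_line()` is a hypothetical helper and the field names are chosen only for illustration.

```python
def parse_atom_line(line):
    """Hypothetical helper: slice one ATOM record by its fixed columns."""
    return {
        "record": line[0:6].strip(),         # columns 1-6
        "serial": int(line[6:11]),           # columns 7-11
        "name": line[12:16].strip(),         # columns 13-16
        "alt_loc": line[16].strip(),         # column 17
        "res_name": line[17:20].strip(),     # columns 18-20
        "chain_id": line[21],                # column 22
        "res_seq": int(line[22:26]),         # columns 23-26
        "x": float(line[30:38]),             # columns 31-38
        "y": float(line[38:46]),             # columns 39-46
        "z": float(line[46:54]),             # columns 47-54
        "occupancy": float(line[54:60]),     # columns 55-60
        "temp_factor": float(line[60:66]),   # columns 61-66
        "element": line[76:78].strip(),      # columns 77-78
    }


# First ATOM record of 4lza, as returned by the list version of get_atom().
first_line = 'ATOM      1  N   THR A   0     -27.785   5.217 -21.426  1.00 50.53           N  '
first_atom = parse_atom_line(first_line)
print(first_atom)
```

A list of such dictionaries can be passed to `pd.DataFrame()` directly, which would give named, typed columns instead of the numbered string columns above.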


## Second, we want to put several PDB entries into one DataFrame.

In [32]:
def get_some_atom(pdb_ids):
    """Return the ATOM records of several PDB entries in one DataFrame.

    >>> get_some_atom(["4lza", "4Z0L"])
    """
    frames = [get_atom(pdb_id) for pdb_id in pdb_ids]
    return pd.concat(frames)

## We test this function with the list ["4lza", "4Z0L"]

In [31]:
get_some_atom(["4lza", "4Z0L"])

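|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ATOM | 1 | N | THR | A | 0 | -27.785 | 5.217 | -21.426 | 1.00 | 50.53 | N | 4lza |
| 1 | ATOM | 2 | CA | THR | A | 0 | -27.459 | 5.049 | -19.974 | 1.00 | 49.41 | C | 4lza |
| 2 | ATOM | 3 | C | THR | A | 0 | -25.949 | 5.130 | -19.667 | 1.00 | 46.13 | C | 4lza |
| 3 | ATOM | 4 | O | THR | A | 0 | -25.572 | 5.789 | -18.699 | 1.00 | 44.22 | O | 4lza |
| 4 | ATOM | 5 | CB | THR | A | 0 | -28.153 | 3.815 | -19.346 | 1.00 | 51.85 | C | 4lza |
| 5 | ATOM | 6 | OG1 | THR | A | 0 | -27.919 | 3.787 | -17.932 | 1.00 | 52.21 | O | 4lza |
| 6 | ATOM | 7 | CG2 | THR | A | 0 | -27.688 | 2.516 | -19.989 | 1.00 | 53.52 | C | 4lza |
| 7 | ATOM | 8 | N | LEU | A | 1 | -25.087 | 4.511 | -20.480 | 1.00 | 43.20 | N | 4lza |
| 8 | ATOM | 9 | CA | LEU | A | 1 | -23.681 | 4.942 | -20.481 | 1.00 | 42.39 | C | 4lza |
| 9 | ATOM | 10 | C | LEU | A | 1 | -23.615 | 6.356 | -21.059 | 1.00 | 43.21 | C | 4lza |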


## Next, we want to put all PDB IDs into one DataFrame

### We used get_all() to list all PDB entries.

In [15]:
len(get_all())

169681

### We found there were 169681 entries in the current PDB database.

### We used the concat() function to merge the DataFrames of individual PDB entries into one large DataFrame. Here we only merge the first two entries as a demonstration.

In [16]:
frames = [get_atom(pdb_id) for pdb_id in get_all()[:2]]  # pdb_id avoids shadowing the built-in id()
pd.concat(frames)

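|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ATOM | 1 | O5' | C | A | 1 | -4.549 | 5.095 | 4.262 | 1.00 | 28.71 | O | 100D |
| 1 | ATOM | 2 | C5' | C | A | 1 | -4.176 | 6.323 | 3.646 | 1.00 | 27.35 | C | 100D |
| 2 | ATOM | 3 | C4' | C | A | 1 | -3.853 | 7.410 | 4.672 | 1.00 | 24.41 | C | 100D |
| 3 | ATOM | 4 | O4' | C | A | 1 | -4.992 | 7.650 | 5.512 | 1.00 | 22.53 | O | 100D |
| 4 | ATOM | 5 | C3' | C | A | 1 | -2.713 | 7.010 | 5.605 | 1.00 | 23.56 | C | 100D |
| 5 | ATOM | 6 | O3' | C | A | 1 | -1.379 | 7.127 | 5.060 | 1.00 | 21.02 | O | 100D |
| 6 | ATOM | 7 | C2' | C | A | 1 | -2.950 | 7.949 | 6.756 | 1.00 | 23.73 | C | 100D |
| 7 | ATOM | 8 | O2' | C | A | 1 | -2.407 | 9.267 | 6.554 | 1.00 | 23.93 | O | 100D |
| 8 | ATOM | 9 | C1' | C | A | 1 | -4.489 | 7.917 | 6.825 | 1.00 | 20.60 | C | 100D |
| 9 | ATOM | 10 | N1 | C | A | 1 | -4.931 | 6.902 | 7.826 | 1.00 | 19.25 | N | 100D |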
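With 169681 entries, downloading every file and holding all ATOM records in a single in-memory DataFrame is unlikely to be practical. One possible approach, sketched below under that assumption, is to walk the ID list in small batches so that each batch can be processed and saved before the next one is fetched. `batched_ids()` is a hypothetical helper, and the sample IDs other than 100D are placeholders standing in for the output of get_all().

```python
from itertools import islice


def batched_ids(pdb_ids, batch_size):
    """Hypothetical helper: yield successive lists of at most batch_size IDs."""
    iterator = iter(pdb_ids)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for get_all(); a real run would pass the full list of PDB IDs.
sample_ids = ["100D", "101D", "101M", "102D", "102L"]
batches = list(batched_ids(sample_ids, 2))
print(batches)
```

In the notebook, each batch could be handed to get_some_atom() and the resulting DataFrame written to disk before moving on.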


<a id='32'></a>
## We used the mdtraj package to load a PDB file into memory from a URL

> MDTraj is a Python library that allows users to manipulate molecular dynamics (MD) trajectories. Features include:
1. Wide MD format support, including pdb, xtc, trr, dcd, binpos, netcdf, mdcrd, prmtop, and more.
2. Extremely fast RMSD calculations (4x the speed of the original Theobald QCP).
3. Extensive analysis functions including those that compute bonds, angles, dihedrals, hydrogen bonds, secondary structure, and NMR observables.
4. Lightweight, Pythonic API.

In [17]:
import mdtraj as md # import this package

In [18]:
pdb = md.load_pdb("https://files.rcsb.org/view/4LZA.pdb")  # load data

In [19]:
print(pdb) # print to see how many frames and atoms, residues this file has 

<mdtraj.Trajectory with 1 frames, 2833 atoms, 512 residues, and unitcells>


## We extract the topology from the loaded PDB file and convert it into a DataFrame

In [20]:
topology = pdb.topology

In [21]:
table, bonds = topology.to_dataframe()

In [22]:
print(table.head(7))

   serial name element  resSeq resName  chainID segmentID
0       1    N       N       0     THR        0          
1       2   CA       C       0     THR        0          
2       3    C       C       0     THR        0          
3       4    O       O       0     THR        0          
4       5   CB       C       0     THR        0          
5       6  OG1       O       0     THR        0          
6       7  CG2       C       0     THR        0          


In [23]:
topology.atom(10)

LEU1-O

In [24]:
topology.atoms

<generator object Topology.atoms at 0x7fe9cd22c390>

In [25]:
[i for i in topology.atoms][:10]

[THR0-N,
 THR0-CA,
 THR0-C,
 THR0-O,
 THR0-CB,
 THR0-OG1,
 THR0-CG2,
 LEU1-N,
 LEU1-CA,
 LEU1-C]

In [26]:
print(table.head(10))

   serial name element  resSeq resName  chainID segmentID
0       1    N       N       0     THR        0          
1       2   CA       C       0     THR        0          
2       3    C       C       0     THR        0          
3       4    O       O       0     THR        0          
4       5   CB       C       0     THR        0          
5       6  OG1       O       0     THR        0          
6       7  CG2       C       0     THR        0          
7       8    N       N       1     LEU        0          
8       9   CA       C       1     LEU        0          
9      10    C       C       1     LEU        0          


In [27]:
atom = pdb.atom_slice(range(pdb.n_atoms))  # select every atom (2833 here) without hard-coding the count

In [28]:
print(atom)

<mdtraj.Trajectory with 1 frames, 2833 atoms, 512 residues, and unitcells>


In [29]:
atom.xyz

array([[[-2.7785,  0.5217, -2.1426],
        [-2.7459,  0.5049, -1.9974],
        [-2.5949,  0.513 , -1.9667],
        ...,
        [-0.6332, -1.3026, -0.3481],
        [-0.8265, -1.4563, -0.0902],
        [-2.8824,  1.244 , -0.1084]]], dtype=float32)
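The values in atom.xyz differ from the ATOM records by a factor of ten because MDTraj stores coordinates in nanometers, while PDB files use angstroms. The short sketch below converts the first row of the array back to angstroms so it can be compared with the first ATOM record of 4lza. `nm_to_angstrom()` is a hypothetical helper that works on plain Python lists, and rounding to three decimals is an assumption that matches the precision of the PDB file.

```python
NM_TO_ANGSTROM = 10.0  # 1 nanometer = 10 angstroms


def nm_to_angstrom(xyz_nm):
    """Hypothetical helper: convert one (x, y, z) triple from nm to angstroms."""
    return [round(value * NM_TO_ANGSTROM, 3) for value in xyz_nm]


# First row of atom.xyz, copied from the output above.
first_atom_nm = [-2.7785, 0.5217, -2.1426]
first_atom_angstrom = nm_to_angstrom(first_atom_nm)
print(first_atom_angstrom)
```

The converted triple can be checked against the x, y and z fields of the first ATOM line returned by get_atom('4lza'). On the full array, `atom.xyz * 10` would perform the same conversion.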

In [30]:
[i for i in topology.bonds][:10]

[Bond(THR0-CA, THR0-C),
 Bond(THR0-C, THR0-O),
 Bond(THR0-CA, THR0-CB),
 Bond(THR0-N, THR0-CA),
 Bond(THR0-CB, THR0-CG2),
 Bond(THR0-CB, THR0-OG1),
 Bond(THR0-C, LEU1-N),
 Bond(LEU1-CA, LEU1-C),
 Bond(LEU1-C, LEU1-O),
 Bond(LEU1-CA, LEU1-CB)]