# Using Biopython to subset a PDB file

Use this when you need a specific part, or parts, of a multi-chain structure file. In other words, when you need a more complex selection than the individual chains you could generate via the routes demonstrated in the ['split PDB files into chains using command line' demo notebook](cl_demo-binder%20split%20pdb%20files%20into%20chains.ipynb).

The examples below build up in complexity. The final one selects two complete chains and part of a third.

In [1]:
# get structure
!curl -OL https://files.rcsb.org/download/6AGB.pdb.gz
!gunzip 6AGB.pdb.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  491k  100  491k    0     0   886k      0 --:--:-- --:--:-- --:--:--  886k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  519k  100  519k    0     0  1175k      0 --:--:-- --:--:-- --:--:-- 1175k


In [2]:
# Basic example from the Biopython Bio.PDB documentation, under the
# 'Can I write PDB files?' section: limit the output to glycine residues.

# Use of Biopython's Bio.PDB based on
# https://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ

# The code as written there at the time didn't work (see
# https://github.com/biopython/biopython/blob/d7789a5041802405204666f7e033981dd03cf14c/Doc/Tutorial/chapter_pdb.tex ).
# Searching the error `'Residue' object has no attribute 'get_name' bio.pdb`
# led me to http://biopython.org/DIST/docs/api/Bio.PDB.Residue.Residue-class.html ,
# which made me realize I needed to change `residue.get_name()` to
# `residue.get_resname()`.
from Bio.PDB import PDBIO, PDBParser, Select
structure = PDBParser().get_structure('6AGB', '6AGB.pdb')


class GlySelect(Select):
    def accept_residue(self, residue):
        # Keep a residue only if it is a glycine
        return residue.get_resname() == 'GLY'

io = PDBIO()
io.set_structure(structure)
# save it
io.save('gly_only.pdb', GlySelect())

Verify that worked:

In [3]:
!tail -35 gly_only.pdb

ATOM    417  N   GLY J 265     207.436 158.894 133.296  1.00 63.70           N  
ATOM    418  CA  GLY J 265     206.800 158.694 132.009  1.00 63.70           C  
ATOM    419  C   GLY J 265     205.980 159.889 131.568  1.00 63.70           C  
ATOM    420  O   GLY J 265     205.452 159.911 130.452  1.00 63.70           O  
ATOM    421  N   GLY J 285     186.291 146.647 121.054  1.00 65.09           N  
ATOM    422  CA  GLY J 285     185.022 146.679 121.753  1.00 65.09           C  
ATOM    423  C   GLY J 285     185.169 146.611 123.259  1.00 65.09           C  
ATOM    424  O   GLY J 285     185.830 147.460 123.865  1.00 65.09           O  
TER     425      GLY J 285                                                       
ATOM    425  N   GLY K  18     182.279 210.510 116.554  1.00135.65           N  
ATOM    426  CA  GLY K  18     182.997 209.879 117.651  1.00135.65           C  
ATOM    427  C   GLY K  18     184.378 209.420 117.233  1.00135.65           C  
ATOM    428  O   GLY K  18 

### More complex example #1

In [4]:
# Keep only chains A, F, and G

# Use of Biopython's Bio.PDB based on
# https://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ

from Bio.PDB import PDBIO, PDBParser, Select
structure = PDBParser().get_structure('6AGB', '6AGB.pdb')


class MyLimit(Select):
    def accept_chain(self, chain):
        # Keep a chain only if its id is in the allowed list
        allowed_chains = ["A", "F", "G"]
        return str(chain.id) in allowed_chains


io = PDBIO()
io.set_structure(structure)
#print(structure)
# save it
io.save('POP6nPOP7nChainA.pdb', MyLimit())
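
Unlike the other examples, this one has no verification step. Here is a minimal sketch of one, using only the standard library. It tallies ATOM/HETATM records per chain by reading the chain identifier from column 22 of each record, which is where the fixed-column PDB layout places it. The helper name `count_atoms_by_chain` is made up for this illustration, and the two sample lines are copied from outputs elsewhere in this notebook rather than from 'POP6nPOP7nChainA.pdb'.

```python
from collections import Counter


def count_atoms_by_chain(lines):
    """Tally ATOM/HETATM records per chain id (column 22 of a PDB record)."""
    counts = Counter()
    for line in lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) > 21:
            counts[line[21]] += 1
    return counts


# Sample records copied from outputs elsewhere in this notebook
sample = [
    "ATOM      1  N   GLY F   4     184.999 149.414 228.507  1.00 69.09           N  ",
    "ATOM     45  N   GLY G 107     130.418 146.992 219.222  1.00106.36           N  ",
]
sample_counts = count_atoms_by_chain(sample)
print(sorted(sample_counts))

# With the file saved above (assumes it is in the working directory):
# with open('POP6nPOP7nChainA.pdb') as handle:
#     print(sorted(count_atoms_by_chain(handle)))
```

If the selection worked, the chain ids reported for the saved file should be limited to A, F, and G.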

### More complex example #2

In [22]:
# Limit to glycine residues in chains F and G

# Because at least one chain must be limited to specific residues,
# the selection is made in `accept_residue()`.
# Selecting in `accept_chain()` alone didn't seem to allow limiting further.

# The `residue.get_parent()` idea came from seeing it listed
# at http://biopython.org/DIST/docs/api/Bio.PDB.Chain.Chain-class.html
# and trying `print(residue.get_parent())` within `accept_residue()` in
# the `Select` subclass. The `.id` attribute is then used as in
# the example above.
from Bio.PDB import PDBIO, PDBParser, Select
structure = PDBParser().get_structure('6AGB', '6AGB.pdb')


class MyLimit(Select):
    def accept_residue(self, residue):
        allowed_chains = ["F", "G"]
        # The parent of a residue is its chain
        in_allowed_chain = str(residue.get_parent().id) in allowed_chains
        return in_allowed_chain and residue.get_resname() == 'GLY'


io = PDBIO()
io.set_structure(structure)
#print(structure)
# save it
io.save('POP6nPOP7nChainAGlys.pdb', MyLimit())

Verify that worked:

In [24]:
!head POP6nPOP7nChainAGlys.pdb
print(" ")
!tail POP6nPOP7nChainAGlys.pdb

ATOM      1  N   GLY F   4     184.999 149.414 228.507  1.00 69.09           N  
ATOM      2  CA  GLY F   4     185.232 150.460 229.482  1.00 69.09           C  
ATOM      3  C   GLY F   4     184.376 151.692 229.279  1.00 69.09           C  
ATOM      4  O   GLY F   4     183.962 151.995 228.155  1.00 69.09           O  
ATOM      5  N   GLY F  38     177.548 175.546 247.273  1.00118.22           N  
ATOM      6  CA  GLY F  38     178.156 176.109 248.468  1.00118.22           C  
ATOM      7  C   GLY F  38     178.026 177.618 248.560  1.00118.22           C  
ATOM      8  O   GLY F  38     177.754 178.160 249.635  1.00118.22           O  
ATOM      9  N   GLY F  48     167.256 160.329 240.620  1.00 71.96           N  
ATOM     10  CA  GLY F  48     167.485 159.634 239.368  1.00 71.96           C  
 
ATOM     45  N   GLY G 107     130.418 146.992 219.222  1.00106.36           N  
ATOM     46  CA  GLY G 107     129.051 147.403 219.484  1.00106.36           C  
ATOM     47  C   GLY G 107

### More complex example #2 (alternative version)

In [32]:
# Limit to glycine residues in chains F and G

# Because at least one chain must be limited to specific residues,
# the selection is made in `accept_residue()`.
# Selecting in `accept_chain()` alone didn't seem to allow limiting further.

# The `residue.get_full_id()` idea is based on
# http://biopython.org/DIST/docs/api/Bio.PDB.Entity.Entity-class.html#get_full_id
# which I was led to from
# http://biopython.org/DIST/docs/api/Bio.PDB.Chain.Chain-class.html .
# This is not as immediately transparent as the version above because you have
# to know that the item at index 2 is the chain id. Since the hierarchy of
# structure, models, chains, and residues is consistent, it makes sense once
# you are aware of it.

from Bio.PDB import PDBIO, PDBParser, Select
structure = PDBParser().get_structure('6AGB', '6AGB.pdb')


class MyLimit(Select):
    def accept_residue(self, residue):
        allowed_chains = ["F", "G"]
        # Index 2 of the full id is the chain id
        in_allowed_chain = str(residue.get_full_id()[2]) in allowed_chains
        return in_allowed_chain and residue.get_resname() == 'GLY'


io = PDBIO()
io.set_structure(structure)
#print(structure)
# save it
io.save('POP6nPOP7nChainAGlys_fid.pdb', MyLimit())

Verify that worked:

In [33]:
!head POP6nPOP7nChainAGlys_fid.pdb
print(" ")
!tail POP6nPOP7nChainAGlys_fid.pdb

ATOM      1  N   GLY F   4     184.999 149.414 228.507  1.00 69.09           N  
ATOM      2  CA  GLY F   4     185.232 150.460 229.482  1.00 69.09           C  
ATOM      3  C   GLY F   4     184.376 151.692 229.279  1.00 69.09           C  
ATOM      4  O   GLY F   4     183.962 151.995 228.155  1.00 69.09           O  
ATOM      5  N   GLY F  38     177.548 175.546 247.273  1.00118.22           N  
ATOM      6  CA  GLY F  38     178.156 176.109 248.468  1.00118.22           C  
ATOM      7  C   GLY F  38     178.026 177.618 248.560  1.00118.22           C  
ATOM      8  O   GLY F  38     177.754 178.160 249.635  1.00118.22           O  
ATOM      9  N   GLY F  48     167.256 160.329 240.620  1.00 71.96           N  
ATOM     10  CA  GLY F  48     167.485 159.634 239.368  1.00 71.96           C  
 
ATOM     45  N   GLY G 107     130.418 146.992 219.222  1.00106.36           N  
ATOM     46  CA  GLY G 107     129.051 147.403 219.484  1.00106.36           C  
ATOM     47  C   GLY G 107
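
The bare index `2` used in this version can be made easier to read by unpacking the tuple into named parts. This is a small sketch, assuming a residue's full id has the layout `(structure id, model id, chain id, residue id)`, with the residue id itself being the `(hetero flag, sequence number, insertion code)` tuple described in the 'What is a residue id?' section of the FAQ. The tuple below is hand-written for illustration (the model id `0` is an assumption) and is not read from '6AGB.pdb'. The helper name `chain_of` is hypothetical.

```python
def chain_of(full_id):
    """Return the chain id from a residue's full id tuple."""
    structure_id, model_id, chain_id, residue_id = full_id
    return chain_id


# Hand-written stand-in for what residue.get_full_id() is assumed to return
example_full_id = ("6AGB", 0, "F", (" ", 4, " "))

# Same value as example_full_id[2], but the name says what it is
print(chain_of(example_full_id))

# The last item is the residue id tuple, which unpacks the same way
hetero_flag, seq_number, insertion_code = example_full_id[3]
print(seq_number)
```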

### More complex example #3

In [45]:
# Keep chain F, chain G, and residues 32 - 85 of chain A

# Because at least one chain must be limited to specific residues,
# the selection is made in `accept_residue()`.

# Based on earlier examples with the addition of `residue.get_id()` to
# get the residue id tuple that is explained under the 'What is a residue id?'
# section of https://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ
from Bio.PDB import PDBIO, PDBParser, Select
structure = PDBParser().get_structure('6AGB', '6AGB.pdb')

class MyLimit(Select):
    def accept_residue(self, residue):
        allowed_full_chains = ["F", "G"]
        allowed_partial_chain = "A"
        res_in_chain_A_allowed = (32, 85)
        chain_id = str(residue.get_parent().id)
        if chain_id in allowed_full_chains:
            return True
        if chain_id == allowed_partial_chain:
            # Keep only residues within the allowed range (inclusive);
            # index 1 of the residue id tuple is the sequence number
            first, last = res_in_chain_A_allowed
            return first <= residue.get_id()[1] <= last
        # Explicitly reject every other chain
        return False

        
io = PDBIO()
io.set_structure(structure)
#print(structure)
# save it
io.save('POP6nPOP7nP3domain.pdb', MyLimit())

Verify that worked:

In [58]:
print("These chains and lengths are represented in the original '6AGB.pdb':")
structure = PDBParser().get_structure('6AGB', '6AGB.pdb')
for model in structure:
    for chain in model:
        print(chain)
        print(len(chain))
print(" ")
print("A selection of residues 32-85 spans {} residues.".format(85-32+1))
print("These chains and lengths are represented in the saved file 'POP6nPOP7nP3domain.pdb':")
structure = PDBParser().get_structure('P3domain', 'POP6nPOP7nP3domain.pdb')
for model in structure:
    for chain in model:
        print(chain)
        print(len(chain))

These chains and lengths are represented in the original '6AGB.pdb':
<Chain id=A>
369
<Chain id=B>
784
<Chain id=C>
175
<Chain id=D>
227
<Chain id=E>
146
<Chain id=F>
157
<Chain id=G>
121
<Chain id=H>
131
<Chain id=I>
242
<Chain id=J>
293
<Chain id=K>
129
 
A selection of residues 32-85 spans 54 residues.
These chains and lengths are represented in the saved file 'POP6nPOP7nP3domain.pdb':
<Chain id=A>
54
<Chain id=F>
157
<Chain id=G>
121
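
The decision made inside `accept_residue()` in example #3 can also be pulled out into a plain function driven by a small table, so the selection can be changed without editing the class. This is a sketch under stated assumptions: `RULES` and `keep_residue` are names invented here, `None` means keep the whole chain, and a `(first, last)` tuple means keep that inclusive range of residue sequence numbers.

```python
# Hypothetical selection table: chain id -> None (whole chain)
# or an inclusive (first, last) range of residue sequence numbers.
RULES = {"F": None, "G": None, "A": (32, 85)}


def keep_residue(chain_id, seq_number, rules=RULES):
    """Return True if a residue passes the selection table."""
    if chain_id not in rules:
        return False
    allowed_range = rules[chain_id]
    if allowed_range is None:
        return True
    first, last = allowed_range
    return first <= seq_number <= last


print(keep_residue("F", 4))    # chain listed as whole
print(keep_residue("A", 32))   # lower bound of the range
print(keep_residue("A", 86))   # just past the upper bound
print(keep_residue("B", 40))   # chain not listed
```

Inside a `Select` subclass, `accept_residue()` could then return `keep_residue(residue.get_parent().id, residue.get_id()[1])`, reusing the same two lookups as the example above.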


------