Bio.PDB build_peptides sometimes gives shorter peptide sequences than expected #992

lennax · 2016-11-12T20:32:37Z

Migrated from redmine

https://redmine.open-bio.org/issues/2910

Christian Schaefer wrote:
Parsing the one-letter sequence for a specific chain out of a given pdb file often seems to result in shorter sequences than expected.

The following code demonstrates this behavior for structure 1a2d chain A. Aminoacid 118 VAL after the HETATOM (117) block is missing in the result.

from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.Polypeptide import *

parser = PDBParser()
ppb = PPBuilder()
structure = parser.get_structure('tmp', '1a2d.pdb')
polypeptides = ppb.build_peptides(structure[0]['A'])
sequence = str(polypeptides[0].get_sequence())

print sequence

Another example is structure 13gs chain C and D. Both sequences are ECG, the code above however returns only CG.
So this behavior seems to be indepedent from a present HETATOM block.
This bug is also present in version 1.51.

The text was updated successfully, but these errors were encountered:

lennax · 2016-11-12T20:33:28Z

Peter Cock wrote:
Retitled as this appears to be a bug in the PPBuilder build_peptides method, not the PDB parser, see:
http://lists.open-bio.org/pipermail/biopython/2009-September/005532.html

Test script:

from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.Polypeptide import PPBuilder, to_one_letter_code
parser = PDBParser()
ppb = PPBuilder()
#structure = parser.get_structure('tmp', '1A2D.pdb')
structure = parser.get_structure('tmp', '13GS.pdb')
for model in structure :
    polypeptides = ppb.build_peptides(model)
    assert len(model) == len(polypeptides)
    for chain, pep in zip(model, polypeptides) :
        print
        print "Chain", chain.id
        print "Raw chain:"
        print "".join(to_one_letter_code.get(res.resname,"X") \
                      for res in chain if "CA" in res.child_dict)
        print "From peptide builder:"
        print pep.get_sequence()

Output for 1A2D,

PDBConstructionWarning: WARNING: Chain A is discontinuous at line 2426.
PDBConstructionWarning: WARNING: Chain B is discontinuous at line 2427.
PDBConstructionWarning: WARNING: Chain A is discontinuous at line 2428.
PDBConstructionWarning: WARNING: Chain B is discontinuous at line 2448.

Chain A
Raw chain:
CDAFVGTWKLVSSENFDDYMKEVGVGFATRKVAGMAKPNMIISVNGDLVTIRSESTFKNTEISFKLGVEFDEITADDRKVKSIITLDGGALVQVQKWDGKSTTIKRKRDGDKLVVEXVMKGVTSTRVYERA
From peptide builder:
CDAFVGTWKLVSSENFDDYMKEVGVGFATRKVAGMAKPNMIISVNGDLVTIRSESTFKNTEISFKLGVEFDEITADDRKVKSIITLDGGALVQVQKWDGKSTTIKRKRDGDKLVVEXMKGVTSTRVYERA

Chain B
Raw chain:
CDAFVGTWKLVSSENFDDYMKEVGVGFATRKVAGMAKPNMIISVNGDLVTIRSESTFKNTEISFKLGVEFDEITADDRKVKSIITLDGGALVQVQKWDGKSTTIKRKRDGDKLVVEXVMKGVTSTRVYERA
From peptide builder:
CDAFVGTWKLVSSENFDDYMKEVGVGFATRKVAGMAKPNMIISVNGDLVTIRSESTFKNTEISFKLGVEFDEITADDRKVKSIITLDGGALVQVQKWDGKSTTIKRKRDGDKLVVEXMKGVTSTRVYERA

Notice there are discontinuities in both chains A and B, and a missing residue in their peptides.

And the output from 13GS,

PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3760.
PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3812.
PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3852.
PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3948.
PDBConstructionWarning: WARNING: Chain C is discontinuous at line 4033.

Chain A
Raw chain:
MPPYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGDLTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKDDYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFPLLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ
From peptide builder:
MPPYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGDLTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKDDYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFPLLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ

Chain B
Raw chain:
PYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGDLTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKDDYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFPLLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ
From peptide builder:
PYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGDLTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKDDYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFPLLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ

Chain C
Raw chain:
ECG
From peptide builder:
CG

Chain D
Raw chain:
ECG
From peptide builder:
CG

Notice there are discontinuities in chains A, B and C, but missing residues in the peptide chains C and D. This suggests the discontinuities are required to trigger the problem. Also there are no HETATM residues for chains C and D.

lennax · 2016-11-12T20:34:24Z

Peter Cock wrote:
I've looked at PDB file 13GS in more detail, and this doesn't look like a bug in Biopython, but rather just another odd PDB file.

Chains C and D are only three residue peptides, e.g.

ATOM 3301 N GLU D 1 ATOM 3302 CA GLU D 1 ATOM 3303 C GLU D 1 ATOM 3304 O GLU D 1 ATOM 3305 CB GLU D 1 ATOM 3306 CG GLU D 1 ATOM 3307 CD GLU D 1 ATOM 3308 OE1 GLU D 1 ATOM 3309 OXT GLU D 1 ATOM 3310 N CYS D 2 ATOM 3311 CA CYS D 2 ATOM 3312 C CYS D 2 ATOM 3313 O CYS D 2 ATOM 3314 CB CYS D 2 ATOM 3315 SG CYS D 2 ATOM 3316 N GLY D 3 ATOM 3317 CA GLY D 3 ATOM 3318 C GLY D 3 ATOM 3319 O GLY D 3 ATOM 3320 OXT GLY D 3 TER 3321 GLY D 3 16.854 13.061 10.252 1.00 65.68 N
17.100 13.860 9.018 1.00 66.23 C
17.937 15.095 9.363 1.00 65.02 C
18.510 15.724 8.439 1.00 56.86 O
15.764 14.279 8.389 1.00 66.35 C
15.913 14.994 7.062 1.00 67.41 C
14.584 15.456 6.508 1.00 68.72 C
13.547 15.340 7.163 1.00 69.08 O
17.998 15.420 10.569 1.00 66.12 O
14.618 15.966 5.283 1.00 69.97 N
13.431 16.483 4.614 1.00 70.18 C
13.374 15.898 3.213 1.00 69.53 C
14.409 15.625 2.610 1.00 65.61 O
13.502 18.008 4.507 1.00 73.18 C
14.485 18.841 5.796 1.00 76.47 S
12.166 15.713 2.693 1.00 71.49 N
12.023 15.155 1.360 1.00 75.33 C
11.489 13.733 1.399 1.00 78.72 C
10.840 13.313 0.413 1.00 79.95 O
11.717 13.031 2.412 1.00 80.37 O

Look at the C-alpha distances, (17.100, 13.860, 9.018) to (13.431, 16.483, 4.614) to (12.023, 15.155, 1.360) giving distances of 6.3 and 3.8:

from math import sqrt
import numpy
a = numpy.array((17.100, 13.860, 9.018))
b = numpy.array((13.431, 16.483, 4.614))
c = numpy.array((12.023, 15.155, 1.360))
sqrt(sum((a-b)**2))
6.3037215991825049
sqrt(sum((b-c)**2))
3.7861014249488876

Clearly the first two residues in this "peptide" are very far apart, regardless of if you do a simple C-alpha distance (as here), or look at the backbone's N to C bonds.

The "problem" for 13GS goes away if you relax the default distance threshold, e.g. use PPBuilder(10.0) instead of PPBuilder().

However, whatever affects 1A2D seems to be a different issue...

Peter

lennax · 2016-11-12T20:35:20Z

Peter Cock wrote:
I think the problem with PDB file 1A2D is due to the atypical PYX residue,

from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.Polypeptide import is_aa
structure = PDBParser().get_structure('tmp', '1A2D.pdb')
for model in structure :
    for chain in model :
        for res in chain :
            if "CA" in res.child_dict and not is_aa(res) :
                print chain, res

The polypeptide code only looks at residues that pass the is_aa test, which means we can ignore things like water atoms associated with a chain. In this PDB file there are two residues which fail this test:

According to the SEQADV and MODRES lines, these are modified CYS residues.
Comparing this to the PDB provided FASTA file, a "C" is used (CYS). This
leads me to believe the fix is to add the PYX -> C mapping to Biopython.
[The dictionary used, to_one_letter_code, is actually defined in file
Bio/SCOP/RAF.py for some historical reason.]

Consulting the PDB documentation suggests that there are potentially
many more examples like this of unknown HETATM entries which are
modified amino acid residues... see:
ftp://ftp.wwpdb.org/pub/pdb/data/monomers/

Christian - did you find any other problem PDB files?

Peter

lennax · 2016-11-12T20:36:08Z

Christian Schaefer wrote:
Peter,

yes, indeed, I had a couple of problematic pdb ids. As soon as I find the time, I'll take a look at it and post them here. It's easy to do this. What I did is, I parsed the structures through the dssp structure assignment tool and compared the obtained sequence with that obtained from the Bio.PDB parser. Background: I wanted to map the sequence that dssp sees to atomic coordinates.

lennax · 2016-11-12T20:36:39Z

Peter Cock wrote:

If you can give us some more examples that would be very helpful, thank you.

I have committed a partial fix which means any known modified amino acids
(based on the presence of an alpha carbon) will be treated as an amino
acid for building the peptide (and given the default sequence letter of X).
This will also issue a warning. Any such previously unknown modified amino
acid (like PYX) needs to be added to our hard coded lookup table with the
appropriate single letter symbol as used by the PDF in their FASTA files
(in this case, PYX -> C for cysteine).

I suspect that some of your other problem PDB files still have (currently)
undefined modified amino acids in them...

Peter

lennax · 2016-11-12T20:39:50Z

@peterjc Your code (in first comment) no longer works because Bio.PDB.Polypeptide no longer contains to_one_letter_code so I'm not immediately able to tell whether this is resolved.

Absent new input from Christian, it is technically possible to run DSSP and Bio.PDB on the entire PDB to look for discrepancies, but I don't currently have time to set up such a run.

peterjc added the From Redmine label Nov 12, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bio.PDB build_peptides sometimes gives shorter peptide sequences than expected #992

Bio.PDB build_peptides sometimes gives shorter peptide sequences than expected #992

lennax commented Nov 12, 2016

lennax commented Nov 12, 2016

lennax commented Nov 12, 2016

lennax commented Nov 12, 2016

lennax commented Nov 12, 2016

lennax commented Nov 12, 2016

lennax commented Nov 12, 2016

Bio.PDB build_peptides sometimes gives shorter peptide sequences than expected #992

Bio.PDB build_peptides sometimes gives shorter peptide sequences than expected #992

Comments

lennax commented Nov 12, 2016

lennax commented Nov 12, 2016

lennax commented Nov 12, 2016

lennax commented Nov 12, 2016

lennax commented Nov 12, 2016

lennax commented Nov 12, 2016

lennax commented Nov 12, 2016