Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bio.PDB build_peptides sometimes gives shorter peptide sequences than expected #992

Open
lennax opened this issue Nov 12, 2016 · 6 comments

Comments

@lennax
Copy link
Contributor

lennax commented Nov 12, 2016

Migrated from redmine

https://redmine.open-bio.org/issues/2910

Christian Schaefer wrote:
Parsing the one-letter sequence for a specific chain out of a given pdb file often seems to result in shorter sequences than expected.

The following code demonstrates this behavior for structure 1a2d chain A. Aminoacid 118 VAL after the HETATOM (117) block is missing in the result.

from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.Polypeptide import *

parser = PDBParser()
ppb = PPBuilder()
structure = parser.get_structure('tmp', '1a2d.pdb')
polypeptides = ppb.build_peptides(structure[0]['A'])
sequence = str(polypeptides[0].get_sequence())

print sequence

Another example is structure 13gs chain C and D. Both sequences are ECG, the code above however returns only CG.
So this behavior seems to be indepedent from a present HETATOM block.
This bug is also present in version 1.51.

@lennax
Copy link
Contributor Author

lennax commented Nov 12, 2016

Peter Cock wrote:
Retitled as this appears to be a bug in the PPBuilder build_peptides method, not the PDB parser, see:
http://lists.open-bio.org/pipermail/biopython/2009-September/005532.html

Test script:

from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.Polypeptide import PPBuilder, to_one_letter_code
parser = PDBParser()
ppb = PPBuilder()
#structure = parser.get_structure('tmp', '1A2D.pdb')
structure = parser.get_structure('tmp', '13GS.pdb')
for model in structure :
    polypeptides = ppb.build_peptides(model)
    assert len(model) == len(polypeptides)
    for chain, pep in zip(model, polypeptides) :
        print
        print "Chain", chain.id
        print "Raw chain:"
        print "".join(to_one_letter_code.get(res.resname,"X") \
                      for res in chain if "CA" in res.child_dict)
        print "From peptide builder:"
        print pep.get_sequence()

Output for 1A2D,

PDBConstructionWarning: WARNING: Chain A is discontinuous at line 2426.
PDBConstructionWarning: WARNING: Chain B is discontinuous at line 2427.
PDBConstructionWarning: WARNING: Chain A is discontinuous at line 2428.
PDBConstructionWarning: WARNING: Chain B is discontinuous at line 2448.

Chain A
Raw chain:
CDAFVGTWKLVSSENFDDYMKEVGVGFATRKVAGMAKPNMIISVNGDLVTIRSESTFKNTEISFKLGVEFDEITADDRKVKSIITLDGGALVQVQKWDGKSTTIKRKRDGDKLVVEXVMKGVTSTRVYERA
From peptide builder:
CDAFVGTWKLVSSENFDDYMKEVGVGFATRKVAGMAKPNMIISVNGDLVTIRSESTFKNTEISFKLGVEFDEITADDRKVKSIITLDGGALVQVQKWDGKSTTIKRKRDGDKLVVEXMKGVTSTRVYERA

Chain B
Raw chain:
CDAFVGTWKLVSSENFDDYMKEVGVGFATRKVAGMAKPNMIISVNGDLVTIRSESTFKNTEISFKLGVEFDEITADDRKVKSIITLDGGALVQVQKWDGKSTTIKRKRDGDKLVVEXVMKGVTSTRVYERA
From peptide builder:
CDAFVGTWKLVSSENFDDYMKEVGVGFATRKVAGMAKPNMIISVNGDLVTIRSESTFKNTEISFKLGVEFDEITADDRKVKSIITLDGGALVQVQKWDGKSTTIKRKRDGDKLVVEXMKGVTSTRVYERA

Notice there are discontinuities in both chains A and B, and a missing residue in their peptides.

And the output from 13GS,

PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3760.
PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3812.
PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3852.
PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3948.
PDBConstructionWarning: WARNING: Chain C is discontinuous at line 4033.

Chain A
Raw chain:
MPPYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGDLTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKDDYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFPLLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ
From peptide builder:
MPPYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGDLTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKDDYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFPLLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ

Chain B
Raw chain:
PYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGDLTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKDDYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFPLLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ
From peptide builder:
PYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGDLTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKDDYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFPLLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ

Chain C
Raw chain:
ECG
From peptide builder:
CG

Chain D
Raw chain:
ECG
From peptide builder:
CG

Notice there are discontinuities in chains A, B and C, but missing residues in the peptide chains C and D. This suggests the discontinuities are required to trigger the problem. Also there are no HETATM residues for chains C and D.

@lennax
Copy link
Contributor Author

lennax commented Nov 12, 2016

Peter Cock wrote:
I've looked at PDB file 13GS in more detail, and this doesn't look like a bug in Biopython, but rather just another odd PDB file.

Chains C and D are only three residue peptides, e.g.

ATOM 3301 N GLU D 1 16.854 13.061 10.252 1.00 65.68 N
ATOM 3302 CA GLU D 1 17.100 13.860 9.018 1.00 66.23 C
ATOM 3303 C GLU D 1 17.937 15.095 9.363 1.00 65.02 C
ATOM 3304 O GLU D 1 18.510 15.724 8.439 1.00 56.86 O
ATOM 3305 CB GLU D 1 15.764 14.279 8.389 1.00 66.35 C
ATOM 3306 CG GLU D 1 15.913 14.994 7.062 1.00 67.41 C
ATOM 3307 CD GLU D 1 14.584 15.456 6.508 1.00 68.72 C
ATOM 3308 OE1 GLU D 1 13.547 15.340 7.163 1.00 69.08 O
ATOM 3309 OXT GLU D 1 17.998 15.420 10.569 1.00 66.12 O
ATOM 3310 N CYS D 2 14.618 15.966 5.283 1.00 69.97 N
ATOM 3311 CA CYS D 2 13.431 16.483 4.614 1.00 70.18 C
ATOM 3312 C CYS D 2 13.374 15.898 3.213 1.00 69.53 C
ATOM 3313 O CYS D 2 14.409 15.625 2.610 1.00 65.61 O
ATOM 3314 CB CYS D 2 13.502 18.008 4.507 1.00 73.18 C
ATOM 3315 SG CYS D 2 14.485 18.841 5.796 1.00 76.47 S
ATOM 3316 N GLY D 3 12.166 15.713 2.693 1.00 71.49 N
ATOM 3317 CA GLY D 3 12.023 15.155 1.360 1.00 75.33 C
ATOM 3318 C GLY D 3 11.489 13.733 1.399 1.00 78.72 C
ATOM 3319 O GLY D 3 10.840 13.313 0.413 1.00 79.95 O
ATOM 3320 OXT GLY D 3 11.717 13.031 2.412 1.00 80.37 O
TER 3321 GLY D 3

Look at the C-alpha distances, (17.100, 13.860, 9.018) to (13.431, 16.483, 4.614) to (12.023, 15.155, 1.360) giving distances of 6.3 and 3.8:

from math import sqrt
import numpy
a = numpy.array((17.100, 13.860, 9.018))
b = numpy.array((13.431, 16.483, 4.614))
c = numpy.array((12.023, 15.155, 1.360))
sqrt(sum((a-b)**2))
6.3037215991825049
sqrt(sum((b-c)**2))
3.7861014249488876

Clearly the first two residues in this "peptide" are very far apart, regardless of if you do a simple C-alpha distance (as here), or look at the backbone's N to C bonds.

The "problem" for 13GS goes away if you relax the default distance threshold, e.g. use PPBuilder(10.0) instead of PPBuilder().

However, whatever affects 1A2D seems to be a different issue...

Peter

@lennax
Copy link
Contributor Author

lennax commented Nov 12, 2016

Peter Cock wrote:
I think the problem with PDB file 1A2D is due to the atypical PYX residue,

from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.Polypeptide import is_aa
structure = PDBParser().get_structure('tmp', '1A2D.pdb')
for model in structure :
    for chain in model :
        for res in chain :
            if "CA" in res.child_dict and not is_aa(res) :
                print chain, res

The polypeptide code only looks at residues that pass the is_aa test, which means we can ignore things like water atoms associated with a chain. In this PDB file there are two residues which fail this test:

<Residue PYX het=H_PYX resseq=117 icode= >
<Residue PYX het=H_PYX resseq=117 icode= >

According to the SEQADV and MODRES lines, these are modified CYS residues.
Comparing this to the PDB provided FASTA file, a "C" is used (CYS). This
leads me to believe the fix is to add the PYX -> C mapping to Biopython.
[The dictionary used, to_one_letter_code, is actually defined in file
Bio/SCOP/RAF.py for some historical reason.]

Consulting the PDB documentation suggests that there are potentially
many more examples like this of unknown HETATM entries which are
modified amino acid residues... see:
ftp://ftp.wwpdb.org/pub/pdb/data/monomers/

Christian - did you find any other problem PDB files?

Peter

@lennax
Copy link
Contributor Author

lennax commented Nov 12, 2016

Christian Schaefer wrote:
Peter,

yes, indeed, I had a couple of problematic pdb ids. As soon as I find the time, I'll take a look at it and post them here. It's easy to do this. What I did is, I parsed the structures through the dssp structure assignment tool and compared the obtained sequence with that obtained from the Bio.PDB parser. Background: I wanted to map the sequence that dssp sees to atomic coordinates.

@lennax
Copy link
Contributor Author

lennax commented Nov 12, 2016

Peter Cock wrote:

If you can give us some more examples that would be very helpful, thank you.

I have committed a partial fix which means any known modified amino acids
(based on the presence of an alpha carbon) will be treated as an amino
acid for building the peptide (and given the default sequence letter of X).
This will also issue a warning. Any such previously unknown modified amino
acid (like PYX) needs to be added to our hard coded lookup table with the
appropriate single letter symbol as used by the PDF in their FASTA files
(in this case, PYX -> C for cysteine).

I suspect that some of your other problem PDB files still have (currently)
undefined modified amino acids in them...

Peter

@lennax
Copy link
Contributor Author

lennax commented Nov 12, 2016

@peterjc Your code (in first comment) no longer works because Bio.PDB.Polypeptide no longer contains to_one_letter_code so I'm not immediately able to tell whether this is resolved.

Absent new input from Christian, it is technically possible to run DSSP and Bio.PDB on the entire PDB to look for discrepancies, but I don't currently have time to set up such a run.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants