Bio.PDB build_peptides sometimes gives shorter peptide sequences than expected #992
Comments
Peter Cock wrote: Test script:
Output for 1A2D:
PDBConstructionWarning: WARNING: Chain A is discontinuous at line 2426.
Chain A
Chain B
Notice there are discontinuities in both chains A and B, and a missing residue in their peptides.
And the output from 13GS:
PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3760.
Chain A
Chain B
Chain C
Chain D
Notice there are discontinuities in chains A, B and C, but missing residues in the peptides for chains C and D. This suggests the discontinuities are not required to trigger the problem. Also there are no HETATM residues for chains C and D.
Peter Cock wrote: Chains C and D are only three-residue peptides, e.g.
ATOM 3301 N GLU D 1 16.854 13.061 10.252 1.00 65.68 N
Look at the C-alpha coordinates, (17.100, 13.860, 9.018) to (13.431, 16.483, 4.614) to (12.023, 15.155, 1.360), which give consecutive distances of 6.3 and 3.8 Å.
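A minimal sketch of that distance check, using only the three C-alpha coordinates quoted in the comment. The helper name `consecutive_distances` is made up for illustration and is not part of Bio.PDB.

```python
import math

# C-alpha coordinates quoted in the comment above, in residue order.
ca_coords = [
    (17.100, 13.860, 9.018),
    (13.431, 16.483, 4.614),
    (12.023, 15.155, 1.360),
]


def consecutive_distances(coords):
    """Return the Euclidean distance between each pair of neighbouring points."""
    return [math.dist(a, b) for a, b in zip(coords, coords[1:])]


distances = consecutive_distances(ca_coords)
print([round(d, 1) for d in distances])  # [6.3, 3.8]
```

`math.dist` needs Python 3.8 or later.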
Clearly the first two residues in this "peptide" are very far apart, regardless of whether you use a simple C-alpha distance (as here) or look at the backbone's N to C bonds. The "problem" for 13GS goes away if you relax the default distance threshold, e.g. use PPBuilder(10.0) instead of PPBuilder(). However, whatever affects 1A2D seems to be a different issue...
Peter
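To illustrate why a distance cutoff can turn ECG into CG, here is a simplified, C-alpha-only sketch. It is not Bio.PDB's implementation: the function `split_by_distance` is hypothetical, the 4.3 Å cutoff is only an illustrative strict value, and discarding single-residue fragments is an assumption chosen to match the reported result.

```python
import math


def split_by_distance(residues, cutoff):
    """Split (letter, coord) pairs into fragments wherever neighbours exceed cutoff.

    Assumption for this sketch: a fragment of one residue is discarded.
    """
    fragments, current = [], []
    for residue in residues:
        # Start a new fragment when the gap to the previous residue is too large.
        if current and math.dist(current[-1][1], residue[1]) > cutoff:
            fragments.append(current)
            current = []
        current.append(residue)
    if current:
        fragments.append(current)
    return ["".join(letter for letter, _ in frag)
            for frag in fragments if len(frag) > 1]


# The three residues (E, C, G) with the C-alpha coordinates quoted in the thread.
chain = [
    ("E", (17.100, 13.860, 9.018)),
    ("C", (13.431, 16.483, 4.614)),
    ("G", (12.023, 15.155, 1.360)),
]

print(split_by_distance(chain, 4.3))   # ['CG']
print(split_by_distance(chain, 10.0))  # ['ECG']
```

With the strict cutoff the 6.3 Å gap separates E from the rest, and the lone E is dropped. Relaxing the cutoff to 10.0 keeps all three residues together.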
Peter Cock wrote:
The polypeptide code only looks at residues that pass the is_aa test, which means we can ignore things like water atoms associated with a chain. In this PDB file there are two residues which fail this test:
<Residue PYX het=H_PYX resseq=117 icode= >
According to the SEQADV and MODRES lines, these are modified CYS residues. Consulting the PDB documentation suggests that there are potentially
Christian - did you find any other problem PDB files?
Peter
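As a rough sketch of how MODRES records could be used to map a modified residue back to its parent amino acid: the function `parse_modres` is hypothetical, the column positions follow the fixed-width MODRES layout of the PDB format, and the example line was built by hand from the facts stated above (PYX at residue 117, a modified CYS) rather than copied from the 1A2D file.

```python
def parse_modres(pdb_lines):
    """Map (chain_id, residue_number, het_name) to the standard residue name."""
    mapping = {}
    for line in pdb_lines:
        if not line.startswith("MODRES"):
            continue
        het_name = line[12:15].strip()   # columns 13-15: modified residue name
        chain_id = line[16]              # column 17: chain identifier
        res_num = int(line[18:22])       # columns 19-22: residue sequence number
        std_name = line[24:27].strip()   # columns 25-27: standard residue name
        mapping[(chain_id, res_num, het_name)] = std_name
    return mapping


# Hand-built illustrative record, not taken verbatim from a PDB file.
example_lines = ["MODRES 1A2D PYX A  117  CYS"]
print(parse_modres(example_lines))  # {('A', 117, 'PYX'): 'CYS'}
```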
Christian Schaefer wrote: Yes, indeed, I had a couple of problematic PDB IDs. As soon as I find the time, I'll take a look and post them here. They are easy to find: I ran the structures through the DSSP structure assignment tool and compared the resulting sequence with the one obtained from the Bio.PDB parser. Background: I wanted to map the sequence that DSSP sees to atomic coordinates.
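The comparison step could look something like the sketch below, which uses only the standard library. The function `missing_residues` is hypothetical, and the reference sequence is assumed to come from a tool such as DSSP. The strings used are the ECG/CG pair reported in this issue.

```python
from difflib import SequenceMatcher


def missing_residues(reference, observed):
    """List (0-based position, letter) for reference residues absent from observed."""
    matcher = SequenceMatcher(None, reference, observed, autojunk=False)
    missing = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # 'delete' and 'replace' mark reference residues with no match in observed.
        if tag in ("delete", "replace"):
            missing.extend((i, reference[i]) for i in range(i1, i2))
    return missing


print(missing_residues("ECG", "CG"))  # [(0, 'E')]
```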
Peter Cock wrote: If you can give us some more examples that would be very helpful, thank you. I have committed a partial fix which means any known modified amino acids
I suspect that some of your other problem PDB files still have (currently)
Peter
@peterjc Your code (in first comment) no longer works because Bio.PDB.Polypeptide no longer contains
Absent new input from Christian, it is technically possible to run DSSP and Bio.PDB on the entire PDB to look for discrepancies, but I don't currently have time to set up such a run.
Migrated from redmine
https://redmine.open-bio.org/issues/2910
Christian Schaefer wrote:
Parsing the one-letter sequence for a specific chain out of a given PDB file often seems to result in shorter sequences than expected.
The following code demonstrates this behavior for structure 1a2d chain A. Amino acid 118 (VAL), which follows the HETATM block (residue 117), is missing from the result.
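A minimal sketch of this kind of per-chain extraction with Bio.PDB, not necessarily the script originally posted. The function name `chain_peptide_sequences`, the structure id "example", and the file name "1a2d.pdb" are assumptions for illustration. The optional `radius` argument is passed to `PPBuilder` so that the relaxed threshold suggested in this thread, `PPBuilder(10.0)`, can be tried as well.

```python
def chain_peptide_sequences(pdb_path, chain_id, radius=None):
    """Return the one-letter sequence of each peptide built for one chain."""
    # Imported here so the function can be defined without Biopython installed.
    from Bio.PDB import PDBParser, PPBuilder

    # QUIET=True suppresses PDBConstructionWarning messages during parsing.
    structure = PDBParser(QUIET=True).get_structure("example", pdb_path)
    chain = structure[0][chain_id]  # first model only
    builder = PPBuilder() if radius is None else PPBuilder(radius)
    return [str(pp.get_sequence()) for pp in builder.build_peptides(chain)]


# Hypothetical usage; "1a2d.pdb" is an assumed local file name.
# for sequence in chain_peptide_sequences("1a2d.pdb", "A"):
#     print(sequence)
```

A chain that comes back as several short peptides, rather than one long one, points to where the builder decided consecutive residues were not connected.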
Another example is structure 13gs, chains C and D. Both sequences are ECG, but the code above returns only CG.
So this behavior seems to be independent of the presence of a HETATM block.
This bug is also present in version 1.51.