-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Internal coords #2346
Internal coords #2346
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2346 +/- ##
==========================================
- Coverage 84.95% 84.66% -0.29%
==========================================
Files 323 328 +5
Lines 52690 54449 +1759
==========================================
+ Hits 44764 46101 +1337
- Misses 7926 8348 +422
Continue to review full report at Codecov.
|
… bonds and h-bonds for OpenSCAD output
…dinates to SMCR classes
…aning and black reformatting
…ing transform; formatting; comments
…pic to .internal_coord
… accept_backbone to _mainchain
60d6fa4
to
af9e858
Compare
Hi @rob-miller Thanks for the contributions, that's a lot of work. Would it be possible to divide this PR in several smaller ones? I see changes to the documentation, manual, bug fixes, etc, which will be very hard to review in this format. |
Thanks João, |
This pull request offers infrastructure to work with internal coordinates of protein structures: dihedral angles, bond angles and bond lengths. The transform is bidirectional, in that PDB file coordinates for a complete structure can be regernerated from the computed internal coordinates.
The primary entry points are atom_to_internal_coordinates() and internal_to_atom_coordiates() methods added to Structure, Model and Chain Classes. IC_Chain and IC_Residue classes extend Chain and Residue classes, respectively. A gist to try out the various features is available here
Subsidiary functionality includes import/export of internal coordinates as a defined file format (.pic), export of internal coordinates as OpenSCAD data matries with supporting software to generate protein models for 3D printing, and various 'structure modification' pipeline options such as removing specific hydrogens, converting deuteriums to hydrogens, and generating Gly C-beta atoms from database values for Ala residues.
Refer to the docstrings in the individual files for further overview and details, in particular starting with comments at the beginning of Bio/PDB/internal_coords.py. Some specific design and implementation issues are discussed below:
Why not subclass and inherit from the existing (S, M,) C, R classes instead of modifying them? Upon discussion and reflection, I concluded that 'a residue HAS_A set of internal coordinates' is a more accurate paradigm than 'an internal-coordinate residue IS_A kind of residue'. In particular, with this work the current system may be described as 'a residue HAS_A set of atom coordinates,' and neither coordinate system requires the other.
It still seems like there is a lot of duplication around referencing atoms and their coordinates? Atom coordinates in this work are homogeneous
[4][1]
matrices as opposed to the[3]
arrays used in Biopython Atom classes, and this facilitates the application of the combined translation, rotation and occasional scaling matrices employed here. Different levels of the data hierarchy maintain intermediate results, with the intention of avoiding recalculation when an angle is modified, but there's been no optimisation around this. At the end of the atom assembly calculation, results are promoted to the Atom[3]
arrays as expected by the rest of Biopython. Backbone dihedral angles span residues, so I needed a richer system than the existing Atom IDs to capture position and disorder information (see AtomKeys).What about disorder, missing residues, HETATMs and all that other fun stuff? ALTLOC atoms are handled, generating angles and dihedrals for each path. The rebuild system will fail for residues with missing backbone N-Ca-C atoms, otherwise rebuild success depends on how the missing atoms interact with the pre-defined dihedrals for building the sidechains (just try it, should be fine). UNK and two non-standard amino acids I came across in testing with peptide backbones are accepted (backbone only), otherwise a chain break will occur. This list can be extended at IC_Residue.accept_resnames. Deuterium structures cannot be generated, primarily because it did not seem worth doubling the numerous Hydrogen name table entries (convert to H's as mentioned above). See the Unittests, and the inclusion of ic_rebuild.structure_rebuild_test() is a feature to test a structure for completeness.
What use cases are there for building structures from internal coordinates? My primary application is protein structure prediction, and fundamentally the rationale for exactly rebuilding PDB coordinate files is to prove code correctness (or at least that the bugs are consistent on both sides). @JoaoRodrigues suggested there may be an application in smoothing trajectories. It's also handy to be able to create a PDB chain by cut-and-pasting individual residues in a .pic file.
Are there default values for bond lengths and angles? No. This would be another large table (very large depending on the specificity desired) with values dependent on the database of selected structures. I'm happy to generate a reasonable set if there is appetite, but it seems as accurate to stitch together residues from a .pic file from a single high-resolution structure. (Note that PDB coordinates can only be regenerated by capturing every parameter; just setting all omega angles to 180.0 will probably generate bad contacts and collisions in any structure rebuild)
Why another file format? There's no public application or specification for .pic files other than this source code, however it is used here to extract and verify the 'minimum information set' for rebuilding a PDB file.
is there a publication? This is it now; please cite this pull request or my GitHub page if you wish. I developed the algorithms in C circa 1993 in graduate school and referenced their application in Miller94. I've re-writeen in various languages since then, and made a Lua implementation available on GitHub starting in early 2016. With the recent demise of the Torch neural nets library, I have chosen to move my development work to (Bio)Python.
Other code modifications included with this PR: I added some code that tries to populate header information from mmCIF files (unittest added), and fixed a bug that pre-pended a
space
to PDB TITLE entries (unittest fixed). I updated the Structure FAQ and tutorial, and the copyright dates for the API docs. Added homogeneous coordinate routines to Vectors.py.my name is already in CONTRIBUTING.rst, I wasn't sure how to modify NEWS.rst
[ X] I hereby agree to dual licence this and any previous contributions under both
the Biopython License Agreement AND the BSD 3-Clause License.
[ X] I have read the
CONTRIBUTING.rst
file, have runflake8
locally, andunderstand that AppVeyor and TravisCI will be used to confirm the Biopython unit
tests and style checks pass with these changes.
I have added my name to the alphabetical contributors listings in the files
NEWS.rst
andCONTRIB.rst
as part of this pull request, am listedalready, or do not wish to be listed. (This acknowledgement is optional.)