Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Internal coords #2346

Closed
wants to merge 46 commits into from
Closed

Conversation

rob-miller
Copy link
Contributor

@rob-miller rob-miller commented Nov 19, 2019

This pull request offers infrastructure to work with internal coordinates of protein structures: dihedral angles, bond angles and bond lengths. The transform is bidirectional, in that PDB file coordinates for a complete structure can be regernerated from the computed internal coordinates.

The primary entry points are atom_to_internal_coordinates() and internal_to_atom_coordiates() methods added to Structure, Model and Chain Classes. IC_Chain and IC_Residue classes extend Chain and Residue classes, respectively. A gist to try out the various features is available here

Subsidiary functionality includes import/export of internal coordinates as a defined file format (.pic), export of internal coordinates as OpenSCAD data matries with supporting software to generate protein models for 3D printing, and various 'structure modification' pipeline options such as removing specific hydrogens, converting deuteriums to hydrogens, and generating Gly C-beta atoms from database values for Ala residues.

Refer to the docstrings in the individual files for further overview and details, in particular starting with comments at the beginning of Bio/PDB/internal_coords.py. Some specific design and implementation issues are discussed below:

  • Why not subclass and inherit from the existing (S, M,) C, R classes instead of modifying them? Upon discussion and reflection, I concluded that 'a residue HAS_A set of internal coordinates' is a more accurate paradigm than 'an internal-coordinate residue IS_A kind of residue'. In particular, with this work the current system may be described as 'a residue HAS_A set of atom coordinates,' and neither coordinate system requires the other.

  • It still seems like there is a lot of duplication around referencing atoms and their coordinates? Atom coordinates in this work are homogeneous [4][1] matrices as opposed to the [3] arrays used in Biopython Atom classes, and this facilitates the application of the combined translation, rotation and occasional scaling matrices employed here. Different levels of the data hierarchy maintain intermediate results, with the intention of avoiding recalculation when an angle is modified, but there's been no optimisation around this. At the end of the atom assembly calculation, results are promoted to the Atom [3] arrays as expected by the rest of Biopython. Backbone dihedral angles span residues, so I needed a richer system than the existing Atom IDs to capture position and disorder information (see AtomKeys).

  • What about disorder, missing residues, HETATMs and all that other fun stuff? ALTLOC atoms are handled, generating angles and dihedrals for each path. The rebuild system will fail for residues with missing backbone N-Ca-C atoms, otherwise rebuild success depends on how the missing atoms interact with the pre-defined dihedrals for building the sidechains (just try it, should be fine). UNK and two non-standard amino acids I came across in testing with peptide backbones are accepted (backbone only), otherwise a chain break will occur. This list can be extended at IC_Residue.accept_resnames. Deuterium structures cannot be generated, primarily because it did not seem worth doubling the numerous Hydrogen name table entries (convert to H's as mentioned above). See the Unittests, and the inclusion of ic_rebuild.structure_rebuild_test() is a feature to test a structure for completeness.

  • What use cases are there for building structures from internal coordinates? My primary application is protein structure prediction, and fundamentally the rationale for exactly rebuilding PDB coordinate files is to prove code correctness (or at least that the bugs are consistent on both sides). @JoaoRodrigues suggested there may be an application in smoothing trajectories. It's also handy to be able to create a PDB chain by cut-and-pasting individual residues in a .pic file.

  • Are there default values for bond lengths and angles? No. This would be another large table (very large depending on the specificity desired) with values dependent on the database of selected structures. I'm happy to generate a reasonable set if there is appetite, but it seems as accurate to stitch together residues from a .pic file from a single high-resolution structure. (Note that PDB coordinates can only be regenerated by capturing every parameter; just setting all omega angles to 180.0 will probably generate bad contacts and collisions in any structure rebuild)

  • Why another file format? There's no public application or specification for .pic files other than this source code, however it is used here to extract and verify the 'minimum information set' for rebuilding a PDB file.

  • is there a publication? This is it now; please cite this pull request or my GitHub page if you wish. I developed the algorithms in C circa 1993 in graduate school and referenced their application in Miller94. I've re-writeen in various languages since then, and made a Lua implementation available on GitHub starting in early 2016. With the recent demise of the Torch neural nets library, I have chosen to move my development work to (Bio)Python.

  • Other code modifications included with this PR: I added some code that tries to populate header information from mmCIF files (unittest added), and fixed a bug that pre-pended a space to PDB TITLE entries (unittest fixed). I updated the Structure FAQ and tutorial, and the copyright dates for the API docs. Added homogeneous coordinate routines to Vectors.py.

  • my name is already in CONTRIBUTING.rst, I wasn't sure how to modify NEWS.rst

  • [ X] I hereby agree to dual licence this and any previous contributions under both
    the Biopython License Agreement AND the BSD 3-Clause License.

  • [ X] I have read the CONTRIBUTING.rst file, have run flake8 locally, and
    understand that AppVeyor and TravisCI will be used to confirm the Biopython unit
    tests and style checks pass with these changes.

  • I have added my name to the alphabetical contributors listings in the files
    NEWS.rst and CONTRIB.rst as part of this pull request, am listed
    already, or do not wish to be listed. (This acknowledgement is optional.)

@codecov
Copy link

codecov bot commented Nov 19, 2019

Codecov Report

Merging #2346 into master will decrease coverage by 0.28%.
The diff coverage is 75.79%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2346      +/-   ##
==========================================
- Coverage   84.95%   84.66%   -0.29%     
==========================================
  Files         323      328       +5     
  Lines       52690    54449    +1759     
==========================================
+ Hits        44764    46101    +1337     
- Misses       7926     8348     +422
Impacted Files Coverage Δ
Bio/PDB/Residue.py 81.25% <100%> (+0.23%) ⬆️
Bio/PDB/ic_data.py 100% <100%> (ø)
Bio/PDB/parse_pdb_header.py 96.92% <100%> (+0.51%) ⬆️
Bio/PDB/Structure.py 96.66% <100%> (+1.01%) ⬆️
Bio/PDB/MMCIFParser.py 91.05% <100%> (+0.77%) ⬆️
Bio/PDB/Model.py 82.75% <42.85%> (-12.7%) ⬇️
Bio/PDB/ic_rebuild.py 44.97% <44.97%> (ø)
Bio/PDB/SCADIO.py 68.96% <68.96%> (ø)
Bio/PDB/internal_coords.py 78.1% <78.1%> (ø)
Bio/PDB/PICIO.py 85.08% <85.08%> (ø)
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6bb7a24...62a02ac. Read the comment docs.

@JoaoRodrigues
Copy link
Member

Hi @rob-miller

Thanks for the contributions, that's a lot of work. Would it be possible to divide this PR in several smaller ones? I see changes to the documentation, manual, bug fixes, etc, which will be very hard to review in this format.

@rob-miller
Copy link
Contributor Author

Thanks João,
Will do. Thought it would be easier to have the docs together with the code and the other stuff is pretty small, but I know it is a lot to look at.

@rob-miller rob-miller closed this Nov 24, 2019
This was referenced Dec 14, 2019
@rob-miller
Copy link
Contributor Author

Rebased, cleaned up and re-submitted as the following pull requests:

#2362 generate header data for mmCIF files like PDB parser
#2364 remove initial space from PDB TITLE lines
#2399 internal coordinates infrastructure
#2400 internal coordinates documentation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants