Internal coords #2346

rob-miller · 2019-11-19T07:58:58Z

This pull request offers infrastructure to work with internal coordinates of protein structures: dihedral angles, bond angles and bond lengths. The transform is bidirectional, in that PDB file coordinates for a complete structure can be regenerated from the computed internal coordinates.
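To make the two directions of the transform concrete, here is a minimal pure-Python sketch. It is not the code in this PR: the function names (bond_length, bond_angle, dihedral, place_atom) and the coordinates are made up for illustration, and the sign convention for the dihedral is just one common choice. It computes a length, an angle and a dihedral from four points, then rebuilds the fourth point from those three values.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(a):
    n = math.sqrt(dot(a, a))
    return tuple(x / n for x in a)


def bond_length(a, b):
    d = sub(b, a)
    return math.sqrt(dot(d, d))


def bond_angle(a, b, c):
    """Angle a-b-c at the middle atom b, in degrees."""
    u, v = unit(sub(a, b)), unit(sub(c, b))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot(u, v)))))


def dihedral(a, b, c, d):
    """Signed torsion a-b-c-d about the b-c axis, in degrees."""
    b0 = sub(a, b)
    b1 = unit(sub(c, b))
    b2 = sub(d, c)
    # Project the outer bonds onto the plane perpendicular to b-c.
    v = sub(b0, tuple(dot(b0, b1) * x for x in b1))
    w = sub(b2, tuple(dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(dot(cross(b1, v), w), dot(v, w)))


def place_atom(a, b, c, length, angle, torsion):
    """Inverse step: position d from a, b, c and the three internal values."""
    theta, phi = math.radians(angle), math.radians(torsion)
    bc = unit(sub(c, b))
    n = unit(cross(sub(b, a), bc))  # normal to the a-b-c plane
    m = cross(n, bc)                # completes the local frame
    local = (-length * math.cos(theta),
             length * math.sin(theta) * math.cos(phi),
             length * math.sin(theta) * math.sin(phi))
    return tuple(c[i] + local[0] * bc[i] + local[1] * m[i] + local[2] * n[i]
                 for i in range(3))


# Made-up coordinates, not taken from any PDB entry.
a, b, c, d = (1.46, 0.0, 0.0), (2.0, 1.4, 0.0), (3.5, 1.5, 0.3), (4.1, 2.6, 1.1)

length = bond_length(c, d)
angle = bond_angle(b, c, d)
torsion = dihedral(a, b, c, d)
rebuilt = place_atom(a, b, c, length, angle, torsion)
max_error = max(abs(x - y) for x, y in zip(rebuilt, d))
```

In this sketch the rebuilt position agrees with the original to floating-point precision, which is the same kind of round-trip check the PR uses to argue correctness for whole structures.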

The primary entry points are the atom_to_internal_coordinates() and internal_to_atom_coordinates() methods added to the Structure, Model and Chain classes. IC_Chain and IC_Residue classes extend Chain and Residue classes, respectively. A gist to try out the various features is available here

Subsidiary functionality includes import/export of internal coordinates as a defined file format (.pic), export of internal coordinates as OpenSCAD data matrices with supporting software to generate protein models for 3D printing, and various 'structure modification' pipeline options such as removing specific hydrogens, converting deuteriums to hydrogens, and generating Gly C-beta atoms from database values for Ala residues.

Refer to the docstrings in the individual files for further overview and details, in particular starting with comments at the beginning of Bio/PDB/internal_coords.py. Some specific design and implementation issues are discussed below:

Why not subclass and inherit from the existing (S, M,) C, R classes instead of modifying them? Upon discussion and reflection, I concluded that 'a residue HAS_A set of internal coordinates' is a more accurate paradigm than 'an internal-coordinate residue IS_A kind of residue'. In particular, with this work the current system may be described as 'a residue HAS_A set of atom coordinates,' and neither coordinate system requires the other.
It still seems like there is a lot of duplication around referencing atoms and their coordinates? Atom coordinates in this work are homogeneous [4][1] matrices as opposed to the [3] arrays used in Biopython Atom classes, and this facilitates the application of the combined translation, rotation and occasional scaling matrices employed here. Different levels of the data hierarchy maintain intermediate results, with the intention of avoiding recalculation when an angle is modified, but there's been no optimisation around this. At the end of the atom assembly calculation, results are promoted to the Atom [3] arrays as expected by the rest of Biopython. Backbone dihedral angles span residues, so I needed a richer system than the existing Atom IDs to capture position and disorder information (see AtomKeys).
What about disorder, missing residues, HETATMs and all that other fun stuff? ALTLOC atoms are handled, generating angles and dihedrals for each path. The rebuild system will fail for residues with missing backbone N-Ca-C atoms; otherwise rebuild success depends on how the missing atoms interact with the pre-defined dihedrals for building the sidechains (just try it, should be fine). UNK and two non-standard amino acids with peptide backbones that I came across in testing are accepted (backbone only); any other residue causes a chain break. This list can be extended at IC_Residue.accept_resnames. Deuterium structures cannot be generated, primarily because it did not seem worth doubling the numerous hydrogen name table entries (convert to H's as mentioned above). See the unit tests; ic_rebuild.structure_rebuild_test() is included as a way to test a structure for completeness.
What use cases are there for building structures from internal coordinates? My primary application is protein structure prediction, and fundamentally the rationale for exactly rebuilding PDB coordinate files is to prove code correctness (or at least that the bugs are consistent on both sides). @JoaoRodrigues suggested there may be an application in smoothing trajectories. It's also handy to be able to create a PDB chain by cut-and-pasting individual residues in a .pic file.
Are there default values for bond lengths and angles? No. This would be another large table (very large depending on the specificity desired) with values dependent on the database of selected structures. I'm happy to generate a reasonable set if there is appetite, but it seems as accurate to stitch together residues from a .pic file from a single high-resolution structure. (Note that PDB coordinates can only be regenerated by capturing every parameter; just setting all omega angles to 180.0 will probably generate bad contacts and collisions in any structure rebuild.)
Why another file format? There's no public application or specification for .pic files other than this source code; however, the format is used here to extract and verify the 'minimum information set' for rebuilding a PDB file.
Is there a publication? This is it now; please cite this pull request or my GitHub page if you wish. I developed the algorithms in C circa 1993 in graduate school and referenced their application in Miller94. I've rewritten them in various languages since then, and made a Lua implementation available on GitHub starting in early 2016. With the recent demise of the Torch neural nets library, I have chosen to move my development work to (Bio)Python.
Other code modifications included with this PR: I added some code that tries to populate header information from mmCIF files (unittest added), and fixed a bug that pre-pended a space to PDB TITLE entries (unittest fixed). I updated the Structure FAQ and tutorial, and the copyright dates for the API docs. Added homogeneous coordinate routines to Vectors.py.
My name is already in CONTRIBUTING.rst; I wasn't sure how to modify NEWS.rst.
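The homogeneous [4][1] coordinates mentioned in the discussion of atom coordinates above can be illustrated with a small pure-Python sketch. This is only an assumption-laden illustration: the helper names (translation, rotation_z, to_homogeneous, from_homogeneous) are hypothetical and are not the routines this PR adds to Vectors.py. The point it shows is that a rotation and a translation combine into a single 4x4 matrix that is applied with one multiplication.

```python
import math


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def translation(tx, ty, tz):
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


def rotation_z(degrees):
    c, s = math.cos(math.radians(degrees)), math.sin(math.radians(degrees))
    return [[c, -s, 0.0, 0.0],
            [s, c, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]


def to_homogeneous(xyz):
    """[3] array -> [4][1] column with a trailing 1."""
    return [[xyz[0]], [xyz[1]], [xyz[2]], [1.0]]


def from_homogeneous(col):
    """[4][1] column -> [3] array, as the rest of Biopython expects."""
    return [col[0][0], col[1][0], col[2][0]]


# Rotate about z by 90 degrees first, then translate by (1, 2, 3).
combined = matmul(translation(1.0, 2.0, 3.0), rotation_z(90.0))
moved = from_homogeneous(matmul(combined, to_homogeneous((1.0, 0.0, 0.0))))
```

Here the point (1, 0, 0) rotates to (0, 1, 0) and is then shifted to approximately (1, 3, 3). Converting back to a [3] array at the end mirrors the promotion step described above.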

[ X] I hereby agree to dual licence this and any previous contributions under both
the Biopython License Agreement AND the BSD 3-Clause License.
[ X] I have read the CONTRIBUTING.rst file, have run flake8 locally, and
understand that AppVeyor and TravisCI will be used to confirm the Biopython unit
tests and style checks pass with these changes.
I have added my name to the alphabetical contributors listings in the files
NEWS.rst and CONTRIB.rst as part of this pull request, am listed
already, or do not wish to be listed. (This acknowledgement is optional.)

codecov · 2019-11-19T08:45:37Z

Codecov Report

Merging #2346 into master will decrease coverage by 0.28%.
The diff coverage is 75.79%.

@@            Coverage Diff             @@
##           master    #2346      +/-   ##
==========================================
- Coverage   84.95%   84.66%   -0.29%     
==========================================
  Files         323      328       +5     
  Lines       52690    54449    +1759     
==========================================
+ Hits        44764    46101    +1337     
- Misses       7926     8348     +422

Impacted Files                  Coverage Δ
Bio/PDB/Residue.py              81.25% <100%> (+0.23%) ⬆️
Bio/PDB/ic_data.py              100% <100%> (ø)
Bio/PDB/parse_pdb_header.py     96.92% <100%> (+0.51%) ⬆️
Bio/PDB/Structure.py            96.66% <100%> (+1.01%) ⬆️
Bio/PDB/MMCIFParser.py          91.05% <100%> (+0.77%) ⬆️
Bio/PDB/Model.py                82.75% <42.85%> (-12.7%) ⬇️
Bio/PDB/ic_rebuild.py           44.97% <44.97%> (ø)
Bio/PDB/SCADIO.py               68.96% <68.96%> (ø)
Bio/PDB/internal_coords.py      78.1% <78.1%> (ø)
Bio/PDB/PICIO.py                85.08% <85.08%> (ø)
... and 8 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6bb7a24...62a02ac.

JoaoRodrigues · 2019-11-23T00:49:50Z

Hi @rob-miller

Thanks for the contributions, that's a lot of work. Would it be possible to divide this PR into several smaller ones? I see changes to the documentation, manual, bug fixes, etc., which will be very hard to review in this format.

rob-miller · 2019-11-24T07:26:21Z

Thanks João,
Will do. Thought it would be easier to have the docs together with the code and the other stuff is pretty small, but I know it is a lot to look at.

rob-miller · 2019-12-14T09:24:54Z

Rebased, cleaned up and re-submitted as the following pull requests:

#2362 generate header data for mmCIF files like PDB parser
#2364 remove initial space from PDB TITLE lines
#2399 internal coordinates infrastructure
#2400 internal coordinates documentation

rob-miller requested a review from JoaoRodrigues as a code owner November 19, 2019 07:58

rob-miller added 28 commits November 19, 2019 14:05

internal coords version 2

84be4f3

do not create res.internal_coord if not-accepted hetatm

5193876

implement comment lines in .pic files

5463186

implement loading .pic data without initial NCaC coords; add flexible bonds and h-bonds for OpenSCAD output

27d4c2a

change parameter format for homogeneous matrix function calls

8a1a86b

comments

3c01f5e

add .internal_coord as permanent property, move atom_to_internal_coordinates to SMCR classes

abd619d

minor clode cleanup

5c34e0e

cleanand improve comments

4482b06

make MMCIFParser create minimal header info like PDB

9334c18

change name to internal_coords

4499289

add enumerate to generate atom serial numbers for cif files; code cleaning and black reformatting

aa79e04

fix typo

11b1c0d

rename to internal_coords, reformatting

651649d

fix extra space added at beginning of TITLE

e58348f

rename main file to internal_coords, toplevel commands file to ic_rebuild

106eeb0

fix extra space at start of TITLE

620e5be

formatting

ffb4c16

add internal_coords unittests

245e6a7

fix OpenSCAD output disordered residues not scaled, altloc atoms missing transform; formatting; comments

f7f15c1

black formatting

3970ccd

change pic names to ic/internal_coord except for PIC output file format

ec3bc6d

changes to peptide.scad for movable polypeptide development; change .pic to .internal_coord

ade0d13

re-work Gly C-beta code

2371fa4

initial docs for internal_coords

d149bf9

more tweaks to make gly c-beta work with scad output

fae435f

help with rotatable bonds

829198a

sort NCaCKey access to stabilize results for disordered atoms; change accept_backbone to _mainchain

1409df3

rob-miller added 16 commits November 19, 2019 14:05

reduce numerical stringency on write_SCAD test; not clear why

074346f

add Tutorial doc for internal_coords

bf47658

fix doclines

3993cf6

add test header populated

54dd497

improve doclines; fix tabs in OpenSCAD code to 4-space for Python style compatibility

21a0e88

improve docstrings

2fa9cd1

sort NCaCKeys to stabilise writeSCAD unittest; improve docstrings; black reformatting

ab2bc78

update test for sorted NCaCKey resolution to instability

b3687c2

docstring tweaks

c91db34

clean up NCaCKey sorts, docstring tweaks

a507cdd

more docstring on 3D printing a model

48ae2c0

resolve travis issues

99c0dd9

hack to deal with black vs. flake8 issue

e4155b5

fix python2 compatibility issues

bc4c048

loosen numerical stringency due to sorting issues early python versions

af9e858

rob-miller force-pushed the internal_coords branch from 60d6fa4 to af9e858 Compare November 19, 2019 11:05

rob-miller added 2 commits November 19, 2019 20:09

fixed missing atoms cause partial dihedron, so some hedra not updated

2593874

try different test subject for write_SCAD

62a02ac

rob-miller closed this Nov 24, 2019

This was referenced Dec 14, 2019

Internal coords submit #2399

Merged

Internal coords docs #2400

Merged
