Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

internal-coords vector assembly update #3774

Merged
merged 13 commits into from
Jan 30, 2022

Conversation

rob-miller
Copy link
Contributor

  • I hereby agree to dual licence this and any previous contributions under both
    the Biopython License Agreement AND the BSD 3-Clause License.

  • I have read the CONTRIBUTING.rst file, have run pre-commit locally,
    and understand that AppVeyor and TravisCI will be used to confirm the Biopython unit
    tests and style checks pass with these changes.

  • I have added my name to the alphabetical contributors listings in the files
    NEWS.rst and CONTRIB.rst as part of this pull request, am listed
    already, or do not wish to be listed. (This acknowledgement is optional.)

This is an update / upgrade to the Bio/PDB/internal_coords module, primarily around 'vectorizing' (NumPy-izing') the generation of atom coordinates from bond lengths, angles and dihedral angles.

Will come up with News entry when happy to merge; last time forgot to mention 3D printing aspect.

Primary improvements / additions:

  • assembly (internal_to_atom_coordinates()) is now ‘vectorized’, as in the steps are processed as NumPy commands on arrays of data as the coordinates become available. Assembly is still dependent on preceding regions of a chain, so for example both the side chain and backbone can be assembled in the same numpy steps once the shared initial coordinates are set. Chains over 100 residues assemble in about 1/2 the time as for the serial algorithm, while the speedup for shorter chains (e.g. 1crn, 46 residues) is insignificant. For more specific cases like changing all side chain chi1 angles on a fixed backbone, the speedup is much greater as all sidechains can be processed in parallel. The serial algorithm can still be optionally activated, and is triggered when start and end positions are specified for assembly. The reverse procedure, atom_to_internal_coordinates(), was vectorized in the initial release.

  • AtomArray: To support the vectorized assembly process, all atom coordinates are relocated into one numpy array and the coordinates in the Biopython Atom objects become views into this larger array. (This was discussed on the Biopython mailing list in Oct, 2019 as "Overhauling of Bio.PDB module" but the result here just came about as a result of vectorizing the assembly step.) The atomArray is created and made available when internal coordinates are calculated (atom_to_internal_cooordinates()). Subsets of this array can be accessed using AtomKey objects, and generating a 2D distance plot is a single (admittedly arcane) line of numpy code.

    • A distance_plot() routine is provided, see attached example with filter to generate a C-alpha distance plot.
    • The all-atom distance plot is another encoding of structure in internal coordinates (invariant to rotation or translation), however it is incomplete in that it loses chirality information (a mirror image structure has the same distance plot). For completeness, a distance_to_internal_coordinates() routine is provided, but it requires an array specifying the sign (+/-1) of each dihedral angle. See Tests/test_PDB_internal_coordinates.py:test_distplot_rebuild() for an example. Note this calculation only considers the six distances between the four atoms of each dihedral, so tertiary interactions away from the main diagonal of the distance plot are ignored.
  • Both the assembly and internal coordinate calculation algorithms are based on computing coordinate spaces for relevant triples (hedra) of atoms. The coordinate transform matrices to and from these spaces for every dihedral are available for use, for example to compare residue environments or pairwise interactions. The steps are essentially:

    chi1 = ric0.pick_angle("chi1") # chi1 space defined with CA at origin
    cst = np.transpose(chi1.cst) # transform TO chi1 space
    newAtomCoords = oldAomCoords.dot(cst)

See attached sample code presenting phe-phe pair interactions as PDB files for inspection (proof of concept only, not carefully selected interacting pairs). Note this should/will use PR #3676 if/when it is approved but for now there is a redundant pdb_residue_string() routine (and supporting data structures) supplied with the internal_coordinates module.

  • Default values are now available for all distance and angle measurements. For bond lengths, angles and a subset of dihedral angles which can be specified as a fixed rotation from another dihedral angle this works 'OK', but rebuilt structures degrade rapidly with any variation from original internal coordinates. For primary dihedral angles like psi and phi, the default value is the most common observed in a sample dataset, so the result is a ‘reasonable’ choice. The implementation impacts are:
    • When enabled (default False), any missing value in a .pic file will be defaulted. At the limit, this means the .pic file can just be the list of Biopython residue specification lines
    • Alternatively, missing internal coordinates in a .pic file can be reported.
    • added parameters to write_PIC() to experiment with effects of different levels of defaults
    • A read_seq_as PIC() routine is supplied to generate a Biopython Structure object (initially only internal coordinates) from a SeqIO seq_record object. Obviously the default values are only intended as a start for a modelling exercise. See test_PDB_internal_coordinates.py:test_seq_as_PIC() for an example.
  • a rotate_bond() routine is added. When applied to e.g. a psi backbone dihedral angle (N-Ca-C-N), this will rotate the linked N-Ca-C-O dihedral by the same amount. The existing set_dihedral() primitive applied to a psi angle does not do this, so the N-Ca-C-O rotation has to be explicitly accounted for.

bugfixes / smaller things:

  • no_altloc option not accessible before
  • Arg nh1, nh2 mainchain atoms were in error
  • Crosstalk test added (ensure changing an angle doesn't change other angles)
  • Fixed write scad issues
  • Quick structure rebuild test
  • primitives for difference, average of +/-180 angles

ca-plot-7rsa.py.txt
phe-pairs-2.py.txt

@codecov
Copy link

codecov bot commented Oct 29, 2021

Codecov Report

Merging #3774 (3301155) into master (3ba2b9b) will decrease coverage by 5.23%.
The diff coverage is 9.95%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3774      +/-   ##
==========================================
- Coverage   84.05%   78.82%   -5.24%     
==========================================
  Files         321      301      -20     
  Lines       54093    51749    -2344     
==========================================
- Hits        45470    40790    -4680     
- Misses       8623    10959    +2336     
Impacted Files Coverage Δ
Bio/PDB/Chain.py 57.50% <ø> (-6.25%) ⬇️
Bio/PDB/Structure.py 84.00% <ø> (-16.00%) ⬇️
Bio/PDB/internal_coords.py 10.69% <ø> (-70.11%) ⬇️
Bio/PDB/PICIO.py 5.00% <3.71%> (-70.55%) ⬇️
Bio/PDB/SCADIO.py 11.84% <3.84%> (-68.68%) ⬇️
Bio/PDB/ic_rebuild.py 10.77% <5.95%> (-43.12%) ⬇️
Bio/PDB/vectors.py 89.13% <95.65%> (+0.19%) ⬆️
Bio/PDB/ic_data.py 100.00% <100.00%> (ø)
Bio/Phylo/CDAOIO.py 4.46% <0.00%> (-80.36%) ⬇️
Bio/phenotype/pm_fitting.py 12.28% <0.00%> (-75.44%) ⬇️
... and 46 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3ba2b9b...3301155. Read the comment docs.

@rob-miller
Copy link
Contributor Author

New bpbp-gist.py sample code for exercising features in this PR, plus the two example scripts above, uploaded to rtm-biopython-scripts.

@rob-miller
Copy link
Contributor Author

Closes #3793

Bio/PDB/internal_coords.py Outdated Show resolved Hide resolved
@rob-miller
Copy link
Contributor Author

Working on improving test coverage and hoping to resolve #3802

Copy link
Member

@JoaoRodrigues JoaoRodrigues left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry this took so long @rob-miller - work+life got in the way. 👍 for me to merge. It would be really nice at some point to have a doc showcasing how useful this code can be for the users!

@JoaoRodrigues JoaoRodrigues merged commit c5b1520 into biopython:master Jan 30, 2022
JoaoRodrigues pushed a commit to JoaoRodrigues/biopython that referenced this pull request Mar 12, 2022
PR overhauls the vector assembly code and makes heavier use of numpy for fast vectorized operations.
@rob-miller rob-miller deleted the ic-vec-submit branch March 21, 2022 16:15
JoaoRodrigues pushed a commit to cbalbin-bio/biopython that referenced this pull request Aug 7, 2022
PR overhauls the vector assembly code and makes heavier use of numpy for fast vectorized operations.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants