-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Overhaul atom mapping #809
Conversation
…tes, and enable coordinate precision for geometry-derived mappings to be specified.
For the geometry-derived mappings, we should be using the largest connected subgraph of matched nodes that does not contain only a subset of any cycles. Cycles should be either contained in their entirety or have all nodes omitted. We can use an algorithm that first computes the largest connected subset of all matched nodes and then iterates over all cycles, removing any nodes for which the cycle only contains a partial (rather than complete or empty) set of matched nodes. Alternatively, we could try to match the whole cycle if they share common subgraphs, but we'd need to be more clever about how exactly we do this graph isomorphism "helped" by partial matches. |
@zhang-ivy : I'm seeing test failures on protein mutation tests---any idea what this might mean?
|
@dominicrufa : I'd like to add atom mapping validation and refinement steps to occur after the coordinate-derived mapping step. Are there methods already implemented in |
@mikemhenry: I think you re-worked the counterion tests.. did you see the above error that John mentioned before you merged? I've run the counterion code for my RBD:ACE2 D501Y simulations and it runs fine. It seems like this error only comes up in the tests. Something is going wrong when the mdtraj trajectory is created:
@mikemhenry could you help debug this further? I'm guessing something weird is happening when |
I'm tremendously confused as to why the # Create an atom mapping factory
atom_mapper = AtomMapper(atom_expr=atom_expr, bond_expr=bond_expr, map_strategy=map_strategy)
# Use the factory
for target_molecule in target_molecules:
atom_mapping = AtomMapper.create_map(source_molecule, target_molecule)
... I'm going to at least make some notes about how to refactor this class, but may end up doing some refactoring in this PR if it seems straightforward to fix. |
Looks like the tests are passing now, not sure why since the committed code -- or are you talking about some failing tests happing locally? |
@mikemhenry @zhang-ivy : This must be a stochastic test failure then. Can you add this to the list of stochastic test failures to revisit? |
…reaking/forming with class methods.
…hat generates AtomMapping objects
We're down to two test failures remaining!
Failure log for `test_RepartitionedHybridTopologyFactory_energies`
Energy breakdown
|
OK, the hard-coded map test is fixed, and this PR should be ready for review!
|
One way we might fix the Tagging @mikemhenry to review this on Monday if possible! |
…r notebook with atom mapping examples
I've now enabled |
@jchodera : Does this PR break biopolymer atom mapping? If so, what is the short term solution here for making sure that we can map protein mutations?
I can look into adding a minimization step to the tests today. |
No. All tests pass. So if the test coverage in If you have unit tests outside of
Let's tackle that in a separate PR! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I really like the rework of the atom mapping API
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great PR. Lots of improvements in the API. There are some comments to be discussed as usual but nothing blocking. This can be merged.
def test_render_image(self): | ||
import tempfile | ||
atom_mapping = AtomMapping(self.old_mol, self.old_mol, old_to_new_atom_map=self.old_to_new_atom_map) | ||
for suffix in ['.pdf', '.png', '.svg']: | ||
with tempfile.NamedTemporaryFile(suffix=suffix) as tmpfile: | ||
atom_mapping.render_image(tmpfile.name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be better to test for these formats/extensions individually? Such that if one of them fails, for whatever reasons it is easier to detect which one actually failed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test parameterization is probably the right way to go here, as you suggested! I'll leave that for further cleanup.
old_to_new_atom_map = { 0:0, 1:1, 2:2, 3:3, 5:4, 6:5 } | ||
new_to_old_atom_map = dict(map(reversed, self.old_to_new_atom_map.items())) | ||
atom_mapping = AtomMapping(old_mol, new_mol, old_to_new_atom_map=old_to_new_atom_map) | ||
assert atom_mapping.creates_or_breaks_rings() == True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hasn't this feature/method been tested in the previous test_ring_breaking_detection
already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extra test coverage doesn't help! There are a lot of combinations of options being tested here.
perses/tests/test_atom_mapping.py
Outdated
"""Test AtomMapping object.""" | ||
def setUp(self): | ||
"""Create useful common objects for testing.""" | ||
from openff.toolkit.topology import Molecule |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this import is needed here since there is already a global one happening at line 7.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
perses/tests/test_atom_mapping.py
Outdated
e.args += (f'Exception encountered for {dataset_name} use_positions={use_positions} allow_ring_breaking={allow_ring_breaking}: {old_index} {molecules[old_index]}-> {new_index} {molecules[new_index]}', ) | ||
raise e | ||
|
||
def test_map_strategy(self): | ||
""" | ||
Test the creation of atom maps between pairs of molecules from the JACS benchmark set. | ||
""" | ||
# Create and configure an AtomMapper | ||
from openeye import oechem | ||
atom_expr = oechem.OEExprOpts_IntType | ||
bond_expr = oechem.OEExprOpts_RingMember | ||
atom_mapper = AtomMapper(atom_expr=atom_expr, bond_expr=bond_expr) | ||
|
||
# Test mappings for JACS dataset ligands | ||
for dataset_name in ['Jnk1']: | ||
molecules = self.molecules[dataset_name] | ||
|
||
# Jnk1 ligands 0 and 2 have meta substituents that face opposite each other in the active site. | ||
# When ignoring position information, the mapper should align these groups, and put them both in the core. | ||
# When using position information, the mapper should see that the orientations differ and chose | ||
# to unmap (i.e. put both these groups in core) such as to get the geometry right at the expense of | ||
# mapping fewer atoms | ||
|
||
# Ignore positional information when scoring mappings | ||
atom_mapper.use_positions = False | ||
atom_mapping = atom_mapper.get_best_mapping(molecules[0], molecules[2]) | ||
#assert len(atom_mapping.new_to_old_atom_map) == 36, f'Expected meta groups methyl C to map onto ethyl O\n{atom_mapping}' # TODO | ||
|
||
# Use positional information to score mappings | ||
atom_mapper.use_positions = True | ||
atom_mapping = atom_mapper.get_best_mapping(molecules[0], molecules[2]) | ||
#assert len(atom_mapping.new_to_old_atom_map) == 35, f'Expected meta groups methyl C to NOT map onto ethyl O as they are distal in cartesian space\n{atom_mapping}' # TODO | ||
|
||
# Explicitly construct mapping from positional information alone | ||
atom_mapping = atom_mapper.generate_atom_mapping_from_positions(molecules[0], molecules[2]) | ||
#assert len(atom_mapping.new_to_old_atom_map) == 35, f'Expected meta groups methyl C to NOT map onto ethyl O as they are distal in cartesian space\n{atom_mapping}' # TODO | ||
|
||
def test_simple_heterocycle_mapping(self): | ||
""" | ||
Test the ability to map conjugated heterocycles (that preserves all rings). Will assert that the number of ring members in both molecules is the same. | ||
""" | ||
# TODO: generalize this to test for ring breakage and closure. | ||
|
||
iupac_pairs = [ | ||
('benzene', 'pyridine') | ||
] | ||
|
||
# Create and configure an AtomMapper | ||
atom_mapper = AtomMapper(allow_ring_breaking=False) | ||
|
||
|
||
for old_iupac, new_iupac in iupac_pairs: | ||
from openff.toolkit.topology import Molecule | ||
old_mol = Molecule.from_iupac(old_iupac) | ||
new_mol = Molecule.from_iupac(new_iupac) | ||
atom_mapping = atom_mapper.get_best_mapping(old_mol, new_mol) | ||
|
||
assert len(atom_mapping.old_to_new_atom_map) > 0 | ||
|
||
def test_mapping_strength_levels(self): | ||
"""Test the mapping strength defaults work as expected""" | ||
# SMILES pairs to test mappings | ||
tests = [ | ||
('c1ccccc1', 'C1CCCCC1', {'default': 0, 'weak' : 6, 'strong' : 0}), # benzene -> cyclohexane | ||
('CNC1CCCC1', 'CNC1CCCCC1', {'default': 6, 'weak' : 6, 'strong' : 6}), # https://github.com/choderalab/perses/issues/805#issue-913932127 | ||
('c1ccccc1CNC2CCC2', 'c1ccccc1CNCC2CCC2', {'default': 13, 'weak' : 13, 'strong' : 11}), # https://github.com/choderalab/perses/issues/805#issue-913932127 | ||
('Cc1ccccc1','c1ccc(cc1)N', {'default': 12, 'weak' : 12, 'strong' : 11}), | ||
('CC(c1ccccc1)','O=C(c1ccccc1)', {'default': 13, 'weak' : 14, 'strong' : 11}), | ||
('Oc1ccccc1','Sc1ccccc1', {'default': 12, 'weak' : 12, 'strong' : 11}), | ||
] | ||
|
||
DEBUG_MODE = True # If True, don't fail, but print results of tests for calibration | ||
|
||
for mol1_smiles, mol2_smiles, expected_results in tests: | ||
for map_strength, expected_n_mapped_atoms in expected_results.items(): | ||
# Create OpenFF Molecule objects | ||
from openff.toolkit.topology import Molecule | ||
mol1 = Molecule.from_smiles(mol1_smiles) | ||
mol2 = Molecule.from_smiles(mol2_smiles) | ||
|
||
# Initialize the atom mapper with the requested mapping strength | ||
atom_mapper = AtomMapper(map_strength=map_strength, allow_ring_breaking=False) | ||
|
||
# Create the atom mapping | ||
atom_mapping = atom_mapper.get_best_mapping(mol1, mol2) | ||
|
||
if DEBUG_MODE: | ||
if atom_mapping is not None: | ||
_logger.info(f'{mol1_smiles} -> {mol2_smiles} using map strength {map_strength} : {atom_mapping.n_mapped_atoms} atoms mapped : {atom_mapping.old_to_new_atom_map}') | ||
atom_mapping.render_image(f'test_mapping_strength_levels:{mol1_smiles}:{mol2_smiles}:{map_strength}.png') | ||
else: | ||
_logger.info(f'{mol1_smiles} -> {mol2_smiles} using map strength {map_strength} : {atom_mapping}') | ||
|
||
else: | ||
# Check that expected number of mapped atoms are provided | ||
n_mapped_atoms = 0 | ||
if atom_mapping is not None: | ||
n_mapped_atoms = atom_mapping.n_mapped_atoms | ||
|
||
assert n_mapped_atoms==expected_n_mapped_atoms, "Number of mapped atoms does not match hand-calibrated expectation" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great testing, a lot of cases involved. Maybe these tests could eventually use test parametrization and fixtures to make it a bit cleaner. We should try making a separate issue for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! Great idea---please open an issue!
# Fix atom name retrieval from OEMol objects | ||
# TODO: When https://github.com/openforcefield/openff-toolkit/issues/1026 is fixed, remove this | ||
if hasattr(old_mol, 'GetAtoms'): | ||
for index, atom in enumerate(old_mol.GetAtoms()): | ||
self.old_mol.atoms[index].name = atom.GetName() | ||
if hasattr(new_mol, 'GetAtoms'): | ||
for index, atom in enumerate(new_mol.GetAtoms()): | ||
self.new_mol.atoms[index].name = atom.GetName() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the issue seems to be solved upstream but a release of openff-toolkit
with the changes has yet to be made. I think the next release should have this already, we should keep an eye for it. Do you think we can then remove this code and make perses
depend on that version or newer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, this code doesn't do any harm, and is compatible with the openff-toolkit
current and future release.
At some point in the future, we can change the dependency to require the new version of openff-toolkit
or later, and then remove this code---hence the #TODO
!
old_offmol, old_oemol = copy_molecule(old_mol) | ||
new_offmol, new_oemol = copy_molecule(new_mol) | ||
|
||
if (old_offmol.n_atoms==0) or (new_offmol.n_atoms==0): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we just test this in the beginning with old_mol
and new_mol
directly? I know this shouldn't change the behavior but just for 'order'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would take more code, since we don't know whether they are OEMol
or Molecule
objects at the beginning.
elapsed_time = time.time() - initial_time | ||
_logger.debug(f'get_best_mapping took {elapsed_time:.3f} s') | ||
|
||
return atom_mappings[best_map_index] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get multiple mappings with the same max score? What happens when that's the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great point! Added a comment to address this ambiguity.
else: | ||
raise ValueError(f'Not sure how to copy {mol}; should be openff.toolkit.topology.Molecule or openeye.oechem.OEMol') | ||
|
||
return offmol, oemol |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how bad we are in memory usage or if we would ever need a better memory footprint. But this might be a candidate for improvement, maybe returning just copies of the same type as the input. That is, if input is an OEMol
output is just OEMol
as well (similar with offmol
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great point! The hope is to eliminate OEMol
objects entirely when possible, and just use Molecule
in the future. This is just a temporary hack.
|
||
""" | ||
import numpy as np | ||
atom_mappings = self.get_all_maps(old_mol, new_mol) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe get_all_maps
doesn't exist as a public/user-visible method. Is this supposed to be _get_all_maps
? Now that makes me wonder if this method is being tested at all (probably not).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I now realize as well that the _get_all_maps
static method has an undefined reference to self
if no atom or bond expressions are passed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am hoping to refactor the private _get_all_map()
method in a future PR---once I fully understand what it does!
This was just a partial refactor, cleaning up the public API, eliminating references to the private API when possible, and refactoring the bits of the code I could understand. We'll need @dominicrufa's help (after he returns from his internship) for the next phase of refactoring/cleanup.
|
||
""" | ||
from openff.toolkit.topology import Molecule | ||
distance_unit = unit.angstroms # distance unit for comparisons |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are units always expected to be angstroms? I guess this works for the most part but maybe we could try to get the actual units from the molecule conformers themselves. Something like old_mol.conformers[0].unit
(assuming they all have the same units as the first occurrence)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The distance_unit
here is arbitrary, but defined here to be consistent in how both the measured distances and absolute distance threshold are converted into unitless numbers. So the choice is totally arbitrary.
Changes should be complete. Just waiting on tests to pass now. |
This PR attempts to address some of the issues described in #805
SmallMoleculeSetProposalEngine
to move all user-modifiable attributes frompropose()
to be class attributes. These attributes should impact all proposals, so they make sense to have at the class level, rather than as optional arguments topropose()
.@dominicrufa : I've got a start at addressing #804, but could use some pointers on how to add new options to the YAML file to carry over into geometry-derived mappings!