Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overhaul atom mapping #809

Merged
merged 42 commits into from
Jul 30, 2021
Merged

Overhaul atom mapping #809

merged 42 commits into from
Jul 30, 2021

Conversation

jchodera
Copy link
Member

@jchodera jchodera commented Jun 19, 2021

This PR attempts to address some of the issues described in #805

  • Refactor SmallMoleculeSetProposalEngine to move all user-modifiable attributes from propose() to be class attributes. These attributes should impact all proposals, so they make sense to have at the class level, rather than as optional arguments to propose().
  • Expose option to modify cutoff used for geometry-derived mappings
  • Enable YAML to specify cutoff for geometry-derived mappings
  • Check for problematic mappings using scheme in Check factorization of partition function #792
  • Clean up problematic mappings (e.g. de-mapping partially mapped rings)
  • Add tests for compounds that led to problematic mappings
  • Fix Sporadic unspecified stereochemistry issue with generate_atom_mapping_from_positions #826

@dominicrufa : I've got a start at addressing #804, but could use some pointers on how to add new options to the YAML file to carry over into geometry-derived mappings!

@codecov
Copy link

codecov bot commented Jun 20, 2021

Codecov Report

Merging #809 (788055d) into master (0352f82) will increase coverage by 0.86%.
The diff coverage is 86.49%.

@jchodera
Copy link
Member Author

For the geometry-derived mappings, we should be using the largest connected subgraph of matched nodes that does not contain only a subset of any cycles. Cycles should be either contained in their entirety or have all nodes omitted.

We can use an algorithm that first computes the largest connected subset of all matched nodes and then iterates over all cycles, removing any nodes for which the cycle only contains a partial (rather than complete or empty) set of matched nodes.

Alternatively, we could try to match the whole cycle if they share common subgraphs, but we'd need to be more clever about how exactly we do this graph isomorphism "helped" by partial matches.

@jchodera
Copy link
Member Author

@zhang-ivy : I'm seeing test failures on protein mutation tests---any idea what this might mean?


    def test_protein_counterion_topology_fix_negitive():
        """
        mutate alanine dipeptide into ARG dipeptide and assert that the appropriate number of water indices are identified
        """
        from perses.rjmc.topology_proposal import PolymerProposalEngine
        new_res = 'ARG'
        charge_diff = -1
    
        # Make a vacuum system
        atp, system_generator = generate_atp(phase='vacuum')
    
        # Make a solvated system/topology/positions with modeller
        modeller = app.Modeller(atp.topology, atp.positions)
        modeller.addSolvent(system_generator.forcefield, model='tip3p', padding=9*unit.angstroms, ionicStrength=0.15*unit.molar)
        solvated_topology = modeller.getTopology()
        solvated_positions = modeller.getPositions()
    
        # Canonicalize the solvated positions: turn tuples into np.array
        atp.positions = unit.quantity.Quantity(value=np.array([list(atom_pos) for atom_pos in solvated_positions.value_in_unit_system(unit.md_unit_system)]), unit=unit.nanometers)
        atp.topology = solvated_topology
    
        atp.system = system_generator.create_system(atp.topology)
    
        # Make a topology proposal and generate new positions
        top_proposal, new_pos, _, _ = generate_dipeptide_top_pos_sys(topology = atp.topology,
                                       new_res = new_res,
                                       system = atp.system,
                                       positions = atp.positions,
                                       system_generator = system_generator,
                                       conduct_geometry_prop = True,
                                       conduct_htf_prop = False,
                                       validate_energy_bookkeeping=True,
                                       )
    
        # Get the charge difference
        charge_diff_test = PolymerProposalEngine._get_charge_difference(top_proposal._old_topology.residue_topology.name,
                                                                    top_proposal._new_topology.residue_topology.name)
        assert charge_diff_test == charge_diff
    
        # Get the array of water indices (w.r.t. new topology) to turn into ions
>       water_indices = PolymerProposalEngine.get_water_indices(charge_diff = charge_diff_test,
                                                 new_positions = new_pos,
                                                 new_topology = top_proposal._new_topology,
                                                 radius=0.8)

perses/tests/test_topology_proposal.py:868: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
perses/rjmc/topology_proposal.py:577: in get_water_indices
    traj = md.Trajectory(new_positions[np.newaxis, ...], md.Topology.from_openmm(new_topology))
../../../miniconda/envs/perses-dev/lib/python3.8/site-packages/mdtraj/core/trajectory.py:1169: in __init__
    self.xyz = xyz
../../../miniconda/envs/perses-dev/lib/python3.8/site-packages/mdtraj/core/trajectory.py:882: in xyz
    value = ensure_type(value, np.float32, 3, 'xyz', shape=shape,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

val = Quantity(value=array([[[ 2.00000100e-01,  1.00000000e-01, -1.30000000e-07],
        [ 2.00000100e-01,  2.09000000e-01,...-01,  5.23500000e-01,  9.63099985e-01],
        [ 3.55762500e-02,  6.86000000e-02,  1.78259999e+00]]]), unit=nanometer)
dtype = <class 'numpy.float32'>, ndim = 3, name = 'xyz', length = None, can_be_none = False, shape = (None, 1571, 3), warn_on_cast = False, add_newaxis_on_deficient_ndim = True

    def ensure_type(val, dtype, ndim, name, length=None, can_be_none=False, shape=None,
        warn_on_cast=True, add_newaxis_on_deficient_ndim=False):
        """Typecheck the size, shape and dtype of a numpy array, with optional
        casting.
    
        Parameters
        ----------
        val : {np.ndaraay, None}
            The array to check
        dtype : {nd.dtype, str}
            The dtype you'd like the array to have
        ndim : int
            The number of dimensions you'd like the array to have
        name : str
            name of the array. This is used when throwing exceptions, so that
            we can describe to the user which array is messed up.
        length : int, optional
            How long should the array be?
        can_be_none : bool
            Is ``val == None`` acceptable?
        shape : tuple, optional
            What should be shape of the array be? If the provided tuple has
            Nones in it, those will be semantically interpreted as matching
            any length in that dimension. So, for example, using the shape
            spec ``(None, None, 3)`` will ensure that the last dimension is of
            length three without constraining the first two dimensions
        warn_on_cast : bool, default=True
            Raise a warning when the dtypes don't match and a cast is done.
        add_newaxis_on_deficient_ndim : bool, default=True
            Add a new axis to the beginining of the array if the number of
            dimensions is deficient by one compared to your specification. For
            instance, if you're trying to get out an array of ``ndim == 3``,
            but the user provides an array of ``shape == (10, 10)``, a new axis will
            be created with length 1 in front, so that the return value is of
            shape ``(1, 10, 10)``.
    
        Notes
        -----
        The returned value will always be C-contiguous.
    
        Returns
        -------
        typechecked_val : np.ndarray, None
            If `val=None` and `can_be_none=True`, then this will return None.
            Otherwise, it will return val (or a copy of val). If the dtype wasn't right,
            it'll be casted to the right shape. If the array was not C-contiguous, it'll
            be copied as well.
    
        """
        if can_be_none and val is None:
            return None
    
        if not isinstance(val, np.ndarray):
            if isinstance(val, collections.abc.Iterable):
                # If they give us an iterator, let's try...
                if isinstance(val, collections.abc.Sequence):
                    # sequences are easy. these are like lists and stuff
                    val = np.array(val, dtype=dtype)
                else:
                    # this is a generator...
>                   val = np.array(list(val), dtype=dtype)
E                   ValueError: setting an array element with a sequence.

../../../miniconda/envs/perses-dev/lib/python3.8/site-packages/mdtraj/utils/validation.py:104: ValueError

@jchodera
Copy link
Member Author

@dominicrufa : I'd like to add atom mapping validation and refinement steps to occur after the coordinate-derived mapping step. Are there methods already implemented in AtomMapper that I should be using to check the validity of mappings or de-map ring systems that are only partially mapped? Or am I better off implementing new methods for this purpose?

@zhang-ivy
Copy link
Contributor

zhang-ivy commented Jun 23, 2021

@mikemhenry: I think you re-worked the counterion tests.. did you see the above error that John mentioned before you merged?

I've run the counterion code for my RBD:ACE2 D501Y simulations and it runs fine. It seems like this error only comes up in the tests. Something is going wrong when the mdtraj trajectory is created:

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
perses/rjmc/topology_proposal.py:577: in get_water_indices
    traj = md.Trajectory(new_positions[np.newaxis, ...], md.Topology.from_openmm(new_topology))
../../../miniconda/envs/perses-dev/lib/python3.8/site-packages/mdtraj/core/trajectory.py:1169: in __init__
    self.xyz = xyz
../../../miniconda/envs/perses-dev/lib/python3.8/site-packages/mdtraj/core/trajectory.py:882: in xyz
    value = ensure_type(value, np.float32, 3, 'xyz', shape=shape,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

@mikemhenry could you help debug this further? I'm guessing something weird is happening when new_positions[np.newaxis, ...] is fed into md.Trajectory

@jchodera
Copy link
Member Author

I'm tremendously confused as to why the AtomMapper class exists only to hold a bunch of private underscore-appended static methods, rather than use a simple factory idiom like

# Create an atom mapping factory
atom_mapper = AtomMapper(atom_expr=atom_expr, bond_expr=bond_expr, map_strategy=map_strategy)
# Use the factory
for target_molecule in target_molecules:
   atom_mapping = AtomMapper.create_map(source_molecule, target_molecule)
   ...

I'm going to at least make some notes about how to refactor this class, but may end up doing some refactoring in this PR if it seems straightforward to fix.

@mikemhenry
Copy link
Contributor

Looks like the tests are passing now, not sure why since the committed code -- or are you talking about some failing tests happing locally?

@jchodera
Copy link
Member Author

@mikemhenry @zhang-ivy : This must be a stochastic test failure then. Can you add this to the list of stochastic test failures to revisit?

@jchodera
Copy link
Member Author

jchodera commented Jul 25, 2021

We're down to two test failures remaining!

  • @dominicrufa @zhang-ivy : I'll need your help to sort out this issue with the repartitioned topology that likely appeared previously as a stochastic error since we were not exhaustively testing transformations here before:
Failure log for `test_RepartitionedHybridTopologyFactory_energies`

=================================== FAILURES ===================================
_______________ test_RepartitionedHybridTopologyFactory_energies _______________
[gw0] linux -- Python 3.9.6 /usr/share/miniconda/envs/test/bin/python

    def test_RepartitionedHybridTopologyFactory_energies():
        """
        Test whether the difference in the nonalchemical zero and alchemical zero states is the forward valence energy.  Also test for the one states.
        """
    
        from perses.tests.test_topology_proposal import generate_atp
        from openmmforcefields.generators import SystemGenerator
    
        # Test alanine dipeptide in vacuum
        atp, system_generator = generate_atp('vacuum')
        RepartitionedHybridTopologyFactory_energies(atp.topology, '1', atp.system, atp.positions, system_generator)
    
        # Test alanine dipeptide in solvent
        atp, system_generator = generate_atp('solvent')
>       RepartitionedHybridTopologyFactory_energies(atp.topology, '1', atp.system, atp.positions, system_generator)

perses/tests/test_relative.py:784: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
perses/tests/test_relative.py:665: in RepartitionedHybridTopologyFactory_energies
    RepartitionedHybridTopologyFactory_energy_mutant(topology, chain, system, positions, system_generator, mutant)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

topology = <Topology; 3 chains, 516 residues, 1557 atoms, 1043 bonds>
chain = '1'
system = <simtk.openmm.openmm.System; proxy of <Swig Object of type 'OpenMM::System *' at 0x7fddf8b4bea0> >
positions = Quantity(value=array([[ 2.00000100e-01,  1.00000000e-01, -1.30000000e-07],
       [ 2.00000100e-01,  2.09000000e-01,  ...0e-01, -6.70000000e-01,  1.08389999e+00],
       [ 1.29177625e+00,  1.14910000e+00,  1.20299985e-01]]), unit=nanometer)
system_generator = <openmmforcefields.generators.system_generators.SystemGenerator object at 0x7fde0c30b4f0>
mutant = 'GLU'

    def RepartitionedHybridTopologyFactory_energy_mutant(topology, chain, system, positions, system_generator, mutant):
        """
        Test whether the difference in the nonalchemical zero and alchemical zero states is the forward valence energy.  Also test for the one states.
        Note that two RepartitionedHybridTopologyFactorys need to be generated (one for each endstate) because the energies need to be validated separately for each endstate.
        Tests a single mutant.
        """
    
        from perses.rjmc.topology_proposal import PointMutationEngine
        from perses.annihilation.relative import RepartitionedHybridTopologyFactory
        from perses.tests.utils import validate_endstate_energies
    
        ENERGY_THRESHOLD = 1e-4
    
        for res in topology.residues():
            if res.id == '2':
                wt_res = res.name
    
        print(f'Making mutation {wt_res}->{mutant}')
    
        # Create point mutation engine to mutate residue at id 2 to random amino acid
        point_mutation_engine = PointMutationEngine(wildtype_topology=topology,
                                                    system_generator=system_generator,
                                                    chain_id=chain,
                                                    max_point_mutants=1,
                                                    residues_allowed_to_mutate=['2'],  # the residue ids allowed to mutate
                                                    allowed_mutations=[('2', mutant)],
                                                    aggregate=True)  # always allow aggregation
    
        # Create topology proposal
        topology_proposal = point_mutation_engine.propose(current_system=system, current_topology=topology)
    
        # Create geometry engine
        from perses.rjmc.geometry import FFAllAngleGeometryEngine
        geometry_engine = FFAllAngleGeometryEngine(metadata=None,
                                                   use_sterics=False,
                                                   n_bond_divisions=100,
                                                   n_angle_divisions=180,
                                                   n_torsion_divisions=360,
                                                   verbose=True,
                                                   storage=None,
                                                   bond_softening_constant=1.0,
                                                   angle_softening_constant=1.0,
                                                   neglect_angles=False,
                                                   use_14_nonbondeds=True)
    
        # Create geometry proposal
        new_positions, logp_proposal = geometry_engine.propose(topology_proposal, positions, beta,
                                                                       validate_energy_bookkeeping=True)
        logp_reverse = geometry_engine.logp_reverse(topology_proposal, new_positions, positions, beta,
                                                    validate_energy_bookkeeping=True)
    
        if not topology_proposal.unique_new_atoms:
            assert geometry_engine.forward_final_context_reduced_potential == None, f"There are no unique new atoms but the geometry_engine's final context reduced potential is not None (i.e. {self._geometry_engine.forward_final_context_reduced_potential})"
            assert geometry_engine.forward_atoms_with_positions_reduced_potential == None, f"There are no unique new atoms but the geometry_engine's forward atoms-with-positions-reduced-potential in not None (i.e. { self._geometry_engine.forward_atoms_with_positions_reduced_potential})"
            vacuum_added_valence_energy = 0.0
        else:
            added_valence_energy = geometry_engine.forward_final_context_reduced_potential - geometry_engine.forward_atoms_with_positions_reduced_potential
    
        if not topology_proposal.unique_old_atoms:
            assert geometry_engine.reverse_final_context_reduced_potential == None, f"There are no unique old atoms but the geometry_engine's final context reduced potential is not None (i.e. {self._geometry_engine.reverse_final_context_reduced_potential})"
            assert geometry_engine.reverse_atoms_with_positions_reduced_potential == None, f"There are no unique old atoms but the geometry_engine's atoms-with-positions-reduced-potential in not None (i.e. { self._geometry_engine.reverse_atoms_with_positions_reduced_potential})"
            subtracted_valence_energy = 0.0
        else:
            subtracted_valence_energy = geometry_engine.reverse_final_context_reduced_potential - geometry_engine.reverse_atoms_with_positions_reduced_potential
    
        # Generate repartitioned htf at lambda = 0
        zero_htf = RepartitionedHybridTopologyFactory(topology_proposal=topology_proposal,
                              current_positions=positions,
                              new_positions=new_positions,
                              endstate=0)
    
        # Compute error at lambda = 0 endstate
        zero_state_error, _ = validate_endstate_energies(zero_htf._topology_proposal,
                                                         zero_htf,
                                                         added_valence_energy,
                                                         subtracted_valence_energy,
                                                         ENERGY_THRESHOLD=ENERGY_THRESHOLD,
                                                         platform=openmm.Platform.getPlatformByName('Reference'),
                                                         repartitioned_endstate=0)
        # Generate repartitioned htf at lambda = 1
        one_htf = RepartitionedHybridTopologyFactory(topology_proposal=topology_proposal,
                                                      current_positions=positions,
                                                      new_positions=new_positions,
                                                      endstate=1)
    
        # Compute error at lambda = 1 endstate
        _, one_state_error = validate_endstate_energies(one_htf._topology_proposal,
                                                         one_htf,
                                                         added_valence_energy,
                                                         subtracted_valence_energy,
                                                         ENERGY_THRESHOLD=ENERGY_THRESHOLD,
                                                         platform=openmm.Platform.getPlatformByName('Reference'),
                                                         repartitioned_endstate=1)
    
        # Check that endstate errors are below threshold
        assert abs(zero_state_error) < ENERGY_THRESHOLD, f"The zero state alchemical and nonalchemical energy absolute difference {abs(zero_state_error)} is greater than the threshold of {ENERGY_THRESHOLD}."
>       assert abs(one_state_error) < ENERGY_THRESHOLD, f"The one state alchemical and nonalchemical energy absolute difference {abs(one_state_error)} is greater than the threshold of {ENERGY_THRESHOLD}."
E       AssertionError: The one state alchemical and nonalchemical energy absolute difference 0.2041063642262273 is greater than the threshold of 0.0001.
E       assert 0.2041063642262273 < 0.0001
E        +  where 0.2041063642262273 = abs(-0.2041063642262273)

perses/tests/test_relative.py:764: AssertionError
----------------------------- Captured stdout call -----------------------------
Energy breakdown
Making mutation ALA->GLU
Warning: Returning Reference platform instead of requested platform CUDA
conducting subsequent work with the following platform: Reference
conducting subsequent work with the following platform: CPU
conducting subsequent work with the following platform: CPU
conducting subsequent work with the following platform: CPU
conducting subsequent work with the following platform: CPU
Warning: Returning Reference platform instead of requested platform CUDA
conducting subsequent work with the following platform: Reference
conducting subsequent work with the following platform: CPU
added energy components: [('CustomBondForce', 1.580827770516917), ('CustomAngleForce', 4.3543592013601105), ('CustomTorsionForce', 7.050911762338636), ('CustomBondForce', 20.71734407448997)]
Warning: Returning Reference platform instead of requested platform CUDA
conducting subsequent work with the following platform: Reference
conducting subsequent work with the following platform: CPU
conducting subsequent work with the following platform: CPU
conducting subsequent work with the following platform: CPU
conducting subsequent work with the following platform: CPU
Warning: Returning Reference platform instead of requested platform CUDA
conducting subsequent work with the following platform: Reference
conducting subsequent work with the following platform: CPU
added energy components: [('CustomBondForce', 0.0), ('CustomAngleForce', 0.00011854972790545768), ('CustomTorsionForce', 0.0014456143524425994), ('CustomBondForce', 4.034319661495377)]
conducting subsequent work with the following platform: Reference
conducting subsequent work with the following platform: CPU
			HarmonicBondForce: 0.034551580662335844
			HarmonicAngleForce: 0.6072024919297015
			PeriodicTorsionForce: 16.176829149610796
			NonbondedForce: -796.2718391330847
			AndersenThermostat: 0.0
			MonteCarloBarostat: 0.0
conducting subsequent work with the following platform: CPU
			HarmonicBondForce: 1.615379351179253
			HarmonicAngleForce: 4.961561693289812
			PeriodicTorsionForce: 23.22774091194943
			NonbondedForce: -775.5554960693804
			AndersenThermostat: 0.0
			MonteCarloBarostat: 0.0
conducting subsequent work with the following platform: CPU
			HarmonicBondForce: 1.615379351179253
			HarmonicAngleForce: 4.9614431435619055
			PeriodicTorsionForce: 23.22629529759699
			NonbondedForce: 2162531545690.6074
			AndersenThermostat: 0.0
			MonteCarloBarostat: 0.0
conducting subsequent work with the following platform: Reference
conducting subsequent work with the following platform: CPU
			HarmonicBondForce: 0.034551580662335844
			HarmonicAngleForce: 0.6072024919297015
			PeriodicTorsionForce: 16.176829149610796
			NonbondedForce: -796.2718391330847
			AndersenThermostat: 0.0
			MonteCarloBarostat: 0.0
conducting subsequent work with the following platform: CPU
			HarmonicBondForce: 1.615379351179253
			HarmonicAngleForce: 4.961561693289811
			PeriodicTorsionForce: 23.22774091194943
			NonbondedForce: 2162531545748.2427
			AndersenThermostat: 0.0
			MonteCarloBarostat: 0.0
conducting subsequent work with the following platform: CPU
			HarmonicBondForce: 1.615379351179253
			HarmonicAngleForce: 4.9614431435619055
			PeriodicTorsionForce: 23.22629529759699
			NonbondedForce: 2162531545690.6074
			AndersenThermostat: 0.0
			MonteCarloBarostat: 0.0
------------------------------ Captured log call -------------------------------
  • Here's a failed test that I need to update that had simply hard-coded the "correct" mapping results:
[gw0] linux -- Python 3.9.6 /usr/share/miniconda/envs/test/bin/python

pairs_of_smiles = [('Cc1ccccc1', 'c1ccc(cc1)N'), ('CC(c1ccccc1)', 'O=C(c1ccccc1)'), ('Oc1ccccc1', 'Sc1ccccc1')]
test = True

    def test_mapping_strength_levels(pairs_of_smiles=[('Cc1ccccc1','c1ccc(cc1)N'),('CC(c1ccccc1)','O=C(c1ccccc1)'),('Oc1ccccc1','Sc1ccccc1')],test=True):
    
        correct_results = {0:{'default': (3,2), 'weak':(3,2), 'strong':(4,3)},
                           1:{'default': (7,3), 'weak':(6,2), 'strong':(7,3)},
                           2:{'default': (1,1), 'weak':(1,1), 'strong':(2,2)}}
    
        mapping = ['weak','default','strong']
    
        for example in mapping:
            for index, (lig_a, lig_b) in enumerate(pairs_of_smiles):
                print(f"conducting {example} mapping with ligands {lig_a}, {lig_b}")
                initial_molecule = smiles_to_oemol(lig_a)
                proposed_molecule = smiles_to_oemol(lig_b)
                molecules = [Molecule.from_openeye(mol) for mol in [initial_molecule, proposed_molecule]]
                system_generator = SystemGenerator(forcefields = forcefield_files, barostat=barostat, forcefield_kwargs=forcefield_kwargs,nonperiodic_forcefield_kwargs=nonperiodic_forcefield_kwargs,
                                                     small_molecule_forcefield = 'gaff-1.81', molecules=molecules, cache=None)
                proposal_engine = SmallMoleculeSetProposalEngine([initial_molecule, proposed_molecule], system_generator)
                proposal_engine.map_strength = example
                initial_system, initial_positions, initial_topology = OEMol_to_omm_ff(initial_molecule, system_generator)
                print(f"running now with map strength {example}")
                proposal = proposal_engine.propose(initial_system, initial_topology)
                print(lig_a, lig_b,'length OLD and NEW atoms',len(proposal.unique_old_atoms), len(proposal.unique_new_atoms))
                if test:
                    render_atom_mapping(f'{index}-{example}.png', initial_molecule, proposed_molecule, proposal._new_to_old_atom_map)
>                   assert ( (len(proposal.unique_old_atoms), len(proposal.unique_new_atoms)) == correct_results[index][example]), f"the mapping failed, correct results are {correct_results[index][example]}"
E                   AssertionError: the mapping failed, correct results are (7, 3)
E                   assert (6, 2) == (7, 3)
E                     At index 0 diff: 6 != 7
E                     Full diff:
E                     - (7, 3)
E                     + (6, 2)

perses/tests/test_topology_proposal.py:99: AssertionError
----------------------------- Captured stdout call -----------------------------

@jchodera
Copy link
Member Author

OK, the hard-coded map test is fixed, and this PR should be ready for review!
I'll incorporate input from @zhang-ivy @dominicrufa when they can provide it about the failing ALA->GLU mutation energy comparison test, but I'm guessing that test failure occurs because the nonbonded forces are just very large:

conducting subsequent work with the following platform: CPU
			HarmonicBondForce: 1.615379351179253
			HarmonicAngleForce: 4.961561693289811
			PeriodicTorsionForce: 23.22774091194943
			NonbondedForce: 2162531545748.2427
			AndersenThermostat: 0.0
			MonteCarloBarostat: 0.0
conducting subsequent work with the following platform: CPU
			HarmonicBondForce: 1.615379351179253
			HarmonicAngleForce: 4.9614431435619055
			PeriodicTorsionForce: 23.22629529759699
			NonbondedForce: 2162531545690.6074
			AndersenThermostat: 0.0
			MonteCarloBarostat: 0.0

@jchodera
Copy link
Member Author

One way we might fix the test_RepartitionedHybridTopologyFactory_energies issue is to either minimize the energy or generate multiple conformers until we get one with a non-crazy energy, and then compare the energies for that conformation. But I suggest we create a different issue for that.

Tagging @mikemhenry to review this on Monday if possible!

@jchodera
Copy link
Member Author

I've now enabled AtomMapping classes to render themselves inside Jupyter notebooks!
image
I've used this feature to overhaul the atom mapping example notebook so we can show off the new features at Monday's meeting:
https://github.com/choderalab/perses/blob/refactor-small-molecule-proposal/examples/atom-mapping/Atom-mapping.ipynb

@zhang-ivy
Copy link
Contributor

zhang-ivy commented Jul 26, 2021

Just noting this here for my future self: The major issue with why the current biopolymer atom mapping is failing is that the current scheme seems to make use of the fact that OEMol atom indices can be arbitrarily set, and this facility is used to keep track of which atoms in part of the residue are mapped. My atom mapping tool instead returns indices of the atoms within the OpenFF Molecule object, which does not support the concept of attached arbitrary atom indices.

@jchodera : Does this PR break biopolymer atom mapping? If so, what is the short term solution here for making sure that we can map protein mutations?

One way we might fix the test_RepartitionedHybridTopologyFactory_energies issue is to either minimize the energy or generate multiple conformers until we get one with a non-crazy energy, and then compare the energies for that conformation. But I suggest we create a different issue for that.

I can look into adding a minimization step to the tests today.

@jchodera
Copy link
Member Author

@jchodera : Does this PR break biopolymer atom mapping? If so, what is the short term solution here for making sure that we can map protein mutations?

No. All tests pass. So if the test coverage in master covered your use cases, we should not break them.

If you have unit tests outside of master that are important, it would be critical to share those now!

I can look into adding a minimization step to the tests today.

Let's tackle that in a separate PR!

Copy link
Contributor

@mikemhenry mikemhenry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I really like the rework of the atom mapping API

Copy link
Contributor

@ijpulidos ijpulidos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great PR. Lots of improvements in the API. There are some comments to be discussed as usual but nothing blocking. This can be merged.

Comment on lines +76 to +81
def test_render_image(self):
import tempfile
atom_mapping = AtomMapping(self.old_mol, self.old_mol, old_to_new_atom_map=self.old_to_new_atom_map)
for suffix in ['.pdf', '.png', '.svg']:
with tempfile.NamedTemporaryFile(suffix=suffix) as tmpfile:
atom_mapping.render_image(tmpfile.name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to test for these formats/extensions individually? Such that if one of them fails, for whatever reasons it is easier to detect which one actually failed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test parameterization is probably the right way to go here, as you suggested! I'll leave that for further cleanup.

old_to_new_atom_map = { 0:0, 1:1, 2:2, 3:3, 5:4, 6:5 }
new_to_old_atom_map = dict(map(reversed, self.old_to_new_atom_map.items()))
atom_mapping = AtomMapping(old_mol, new_mol, old_to_new_atom_map=old_to_new_atom_map)
assert atom_mapping.creates_or_breaks_rings() == True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hasn't this feature/method been tested in the previous test_ring_breaking_detection already?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra test coverage doesn't help! There are a lot of combinations of options being tested here.

"""Test AtomMapping object."""
def setUp(self):
"""Create useful common objects for testing."""
from openff.toolkit.topology import Molecule
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this import is needed here since there is already a global one happening at line 7.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

Comment on lines 134 to 268
e.args += (f'Exception encountered for {dataset_name} use_positions={use_positions} allow_ring_breaking={allow_ring_breaking}: {old_index} {molecules[old_index]}-> {new_index} {molecules[new_index]}', )
raise e

def test_map_strategy(self):
"""
Test the creation of atom maps between pairs of molecules from the JACS benchmark set.
"""
# Create and configure an AtomMapper
from openeye import oechem
atom_expr = oechem.OEExprOpts_IntType
bond_expr = oechem.OEExprOpts_RingMember
atom_mapper = AtomMapper(atom_expr=atom_expr, bond_expr=bond_expr)

# Test mappings for JACS dataset ligands
for dataset_name in ['Jnk1']:
molecules = self.molecules[dataset_name]

# Jnk1 ligands 0 and 2 have meta substituents that face opposite each other in the active site.
# When ignoring position information, the mapper should align these groups, and put them both in the core.
# When using position information, the mapper should see that the orientations differ and chose
# to unmap (i.e. put both these groups in core) such as to get the geometry right at the expense of
# mapping fewer atoms

# Ignore positional information when scoring mappings
atom_mapper.use_positions = False
atom_mapping = atom_mapper.get_best_mapping(molecules[0], molecules[2])
#assert len(atom_mapping.new_to_old_atom_map) == 36, f'Expected meta groups methyl C to map onto ethyl O\n{atom_mapping}' # TODO

# Use positional information to score mappings
atom_mapper.use_positions = True
atom_mapping = atom_mapper.get_best_mapping(molecules[0], molecules[2])
#assert len(atom_mapping.new_to_old_atom_map) == 35, f'Expected meta groups methyl C to NOT map onto ethyl O as they are distal in cartesian space\n{atom_mapping}' # TODO

# Explicitly construct mapping from positional information alone
atom_mapping = atom_mapper.generate_atom_mapping_from_positions(molecules[0], molecules[2])
#assert len(atom_mapping.new_to_old_atom_map) == 35, f'Expected meta groups methyl C to NOT map onto ethyl O as they are distal in cartesian space\n{atom_mapping}' # TODO

def test_simple_heterocycle_mapping(self):
"""
Test the ability to map conjugated heterocycles (that preserves all rings). Will assert that the number of ring members in both molecules is the same.
"""
# TODO: generalize this to test for ring breakage and closure.

iupac_pairs = [
('benzene', 'pyridine')
]

# Create and configure an AtomMapper
atom_mapper = AtomMapper(allow_ring_breaking=False)


for old_iupac, new_iupac in iupac_pairs:
from openff.toolkit.topology import Molecule
old_mol = Molecule.from_iupac(old_iupac)
new_mol = Molecule.from_iupac(new_iupac)
atom_mapping = atom_mapper.get_best_mapping(old_mol, new_mol)

assert len(atom_mapping.old_to_new_atom_map) > 0

def test_mapping_strength_levels(self):
"""Test the mapping strength defaults work as expected"""
# SMILES pairs to test mappings
tests = [
('c1ccccc1', 'C1CCCCC1', {'default': 0, 'weak' : 6, 'strong' : 0}), # benzene -> cyclohexane
('CNC1CCCC1', 'CNC1CCCCC1', {'default': 6, 'weak' : 6, 'strong' : 6}), # https://github.com/choderalab/perses/issues/805#issue-913932127
('c1ccccc1CNC2CCC2', 'c1ccccc1CNCC2CCC2', {'default': 13, 'weak' : 13, 'strong' : 11}), # https://github.com/choderalab/perses/issues/805#issue-913932127
('Cc1ccccc1','c1ccc(cc1)N', {'default': 12, 'weak' : 12, 'strong' : 11}),
('CC(c1ccccc1)','O=C(c1ccccc1)', {'default': 13, 'weak' : 14, 'strong' : 11}),
('Oc1ccccc1','Sc1ccccc1', {'default': 12, 'weak' : 12, 'strong' : 11}),
]

DEBUG_MODE = True # If True, don't fail, but print results of tests for calibration

for mol1_smiles, mol2_smiles, expected_results in tests:
for map_strength, expected_n_mapped_atoms in expected_results.items():
# Create OpenFF Molecule objects
from openff.toolkit.topology import Molecule
mol1 = Molecule.from_smiles(mol1_smiles)
mol2 = Molecule.from_smiles(mol2_smiles)

# Initialize the atom mapper with the requested mapping strength
atom_mapper = AtomMapper(map_strength=map_strength, allow_ring_breaking=False)

# Create the atom mapping
atom_mapping = atom_mapper.get_best_mapping(mol1, mol2)

if DEBUG_MODE:
if atom_mapping is not None:
_logger.info(f'{mol1_smiles} -> {mol2_smiles} using map strength {map_strength} : {atom_mapping.n_mapped_atoms} atoms mapped : {atom_mapping.old_to_new_atom_map}')
atom_mapping.render_image(f'test_mapping_strength_levels:{mol1_smiles}:{mol2_smiles}:{map_strength}.png')
else:
_logger.info(f'{mol1_smiles} -> {mol2_smiles} using map strength {map_strength} : {atom_mapping}')

else:
# Check that expected number of mapped atoms are provided
n_mapped_atoms = 0
if atom_mapping is not None:
n_mapped_atoms = atom_mapping.n_mapped_atoms

assert n_mapped_atoms==expected_n_mapped_atoms, "Number of mapped atoms does not match hand-calibrated expectation"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great testing, a lot of cases involved. Maybe these tests could eventually use test parametrization and fixtures to make it a bit cleaner. We should try making a separate issue for that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! Great idea---please open an issue!

Comment on lines +105 to +112
# Fix atom name retrieval from OEMol objects
# TODO: When https://github.com/openforcefield/openff-toolkit/issues/1026 is fixed, remove this
if hasattr(old_mol, 'GetAtoms'):
for index, atom in enumerate(old_mol.GetAtoms()):
self.old_mol.atoms[index].name = atom.GetName()
if hasattr(new_mol, 'GetAtoms'):
for index, atom in enumerate(new_mol.GetAtoms()):
self.new_mol.atoms[index].name = atom.GetName()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the issue seems to be solved upstream but a release of openff-toolkit with the changes has yet to be made. I think the next release should have this already, we should keep an eye for it. Do you think we can then remove this code and make perses depend on that version or newer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, this code doesn't do any harm, and is compatible with the openff-toolkit current and future release.
At some point in the future, we can change the dependency to require the new version of openff-toolkit or later, and then remove this code---hence the #TODO!

old_offmol, old_oemol = copy_molecule(old_mol)
new_offmol, new_oemol = copy_molecule(new_mol)

if (old_offmol.n_atoms==0) or (new_offmol.n_atoms==0):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we just test this in the beginning with old_mol and new_mol directly? I know this shouldn't change the behavior but just for 'order'.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would take more code, since we don't know whether they are OEMol or Molecule objects at the beginning.

elapsed_time = time.time() - initial_time
_logger.debug(f'get_best_mapping took {elapsed_time:.3f} s')

return atom_mappings[best_map_index]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get multiple mappings with the same max score? What happens when that's the case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point! Added a comment to address this ambiguity.

else:
raise ValueError(f'Not sure how to copy {mol}; should be openff.toolkit.topology.Molecule or openeye.oechem.OEMol')

return offmol, oemol
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how bad we are in memory usage or if we would ever need a better memory footprint. But this might be a candidate for improvement, maybe returning just copies of the same type as the input. That is, if input is an OEMol output is just OEMol as well (similar with offmol).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point! The hope is to eliminate OEMol objects entirely when possible, and just use Molecule in the future. This is just a temporary hack.


"""
import numpy as np
atom_mappings = self.get_all_maps(old_mol, new_mol)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe get_all_maps doesn't exist as a public/user-visible method. Is this supposed to be _get_all_maps? Now that makes me wonder if this method is being tested at all (probably not).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I now realize as well that the _get_all_maps static method has an undefined reference to self if no atom or bond expressions are passed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am hoping to refactor the private _get_all_map() method in a future PR---once I fully understand what it does!
This was just a partial refactor, cleaning up the public API, eliminating references to the private API when possible, and refactoring the bits of the code I could understand. We'll need @dominicrufa's help (after he returns from his internship) for the next phase of refactoring/cleanup.


"""
from openff.toolkit.topology import Molecule
distance_unit = unit.angstroms # distance unit for comparisons
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are units always expected to be angstroms? I guess this works for the most part but maybe we could try to get the actual units from the molecule conformers themselves. Something like old_mol.conformers[0].unit (assuming they all have the same units as the first occurrence)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The distance_unit here is arbitrary, but defined here to be consistent in how both the measured distances and absolute distance threshold are converted into unitless numbers. So the choice is totally arbitrary.

@jchodera
Copy link
Member Author

Changes should be complete. Just waiting on tests to pass now.

@jchodera jchodera merged commit 02df62f into master Jul 30, 2021
@jchodera jchodera deleted the refactor-small-molecule-proposal branch July 30, 2021 05:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sporadic unspecified stereochemistry issue with generate_atom_mapping_from_positions
4 participants