
Fix protein mutation repex consistency tests #1054

Merged · 7 commits · Jun 24, 2022

Conversation

@zhang-ivy (Contributor) commented Jun 24, 2022

Description

This PR introduces the following changes:

For speed up:

  • Decrease the number of steps per iteration from 250 to 50
  • Set reassign_velocities to be False

For improved convergence:

  • Set n_iterations = 3000 for both the neutral and the charge-changing mutations
  • Set n_replicas = 36 for the charge-changing test
  • Set minimisation_steps = 0, which means the minimization runs until the energy drops below a tolerance (instead of running for 100 steps, the default in HybridRepexSampler)
  • Change the convergence criterion to check that DDG < 6 * dDDG

Other:

  • Change the solvent padding to 1.7 nm (necessary for the nightly dev builds of OpenMM). Note this may make the tests slower if we are testing against OpenMM <= 7.7. EDIT: We are currently testing against OpenMM 7.7, so we should not bump up the solvent padding until we switch the GPU tests to OpenMM >= 7.8.

  • Clean up unfinished comment and reporter file name
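The convergence check described above amounts to comparing the estimated free-energy difference against a multiple of its uncertainty. A minimal sketch (the function name and signature are illustrative, not perses API):

```python
def is_converged(ddg, dddg, n_sigma=6.0):
    """Return True if the estimated free-energy difference DDG is within
    n_sigma standard errors (dDDG) of zero, the criterion the internal
    consistency tests use (forward + reverse legs of the same mutation
    should sum to approximately zero)."""
    return abs(ddg) < n_sigma * dddg
```

For example, `is_converged(0.5, 0.2)` passes (0.5 < 6 * 0.2), while `is_converged(3.0, 0.2)` fails.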

Motivation and context

This PR changes some of the parameters in the protein mutation repex consistency tests to ensure convergence (DDG < 6 * dDDG) and speed up the tests.

Resolves #1044

How has this been tested?

Tested on lilac.

Change log

Fix protein mutation repex internal consistency tests to ensure convergence and speed up the tests. So far only testing neutral transformations.

@zhang-ivy zhang-ivy requested a review from ijpulidos June 24, 2022 14:19
@zhang-ivy zhang-ivy added this to the 0.10.1 - Bugfix release milestone Jun 24, 2022
@zhang-ivy (Author) commented:

I think we'll want to skip the charge changing test for now -- @ijpulidos do you know how to do this?

@codecov bot commented Jun 24, 2022

Codecov Report

Attention: Patch coverage is 7.69231% with 12 lines in your changes missing coverage. Please review.

Project coverage is 52.37%. Comparing base (6af4bfd) to head (ed784d7).
Report is 59 commits behind head on main.


@ijpulidos (Contributor) left a comment:

Looks good, just some suggested changes to skip the charge change test and a question with the change of the padding.

perses/tests/test_repex.py (comment thread resolved)
@@ -39,6 +37,7 @@ def test_RESTCapableHybridTopologyFactory_repex_neutral_mutation():
"1",
"2",
mutant_name.upper(),
padding=1.7 * unit.nanometers,
@ijpulidos (Contributor):
Why do we need to increase the default padding here? I'm thinking that would make the test take more time to run.

@zhang-ivy (Author):

I decreased it back down and left a TODO to increase it once we move to testing against openmm >= 7.8

@ijpulidos ijpulidos self-requested a review June 24, 2022 18:07
@ijpulidos (Contributor) left a comment:

Looks good! We have to double-check how much time the tests will take on our GPU CI, and that the charge-change test is not getting run.

@ijpulidos ijpulidos enabled auto-merge (squash) June 24, 2022 18:09
@ijpulidos ijpulidos disabled auto-merge June 24, 2022 18:11
@ijpulidos ijpulidos enabled auto-merge (squash) June 24, 2022 18:16
@ijpulidos ijpulidos merged commit 9e61c22 into main Jun 24, 2022
@ijpulidos ijpulidos deleted the fix-repex-tests branch June 24, 2022 18:32
@ijpulidos (Contributor) commented Jun 27, 2022

Just wanted to say that, according to our latest GPU CI workflow, we are correctly testing just the neutral mutation. It is passing; the workflow takes ~5.5 hours in total, with the test itself using ~4.5 hours.

@jchodera (Member) commented Jul 4, 2022

This is odd. I was wondering why it took so long, so I ran the test on lilac on a GTX 1080 Ti, using

$ pytest -vvv --log-cli-level=DEBUG --log-cli-format="%(asctime)s %(levelname)s %(message)s" perses/tests/test_repex.py::test_RESTCapableHybridTopologyFactory_repex_neutral_mutation

to print a timestamp for each step.

It's been stuck at Generating unsampled endstates for a full hour!

00:11:03 WARNING Warning: The openmmtools.multistate API is experimental and may change in future releases
00:11:03 WARNING Warning: The openmmtools.multistate API is experimental and may change in future releases
00:11:05 INFO n_replicas not defined, setting to match n_states, 12
00:11:17 INFO Generating unsampled endstates.

@zhang-ivy : Is it possible this is actually taking an hour? Any idea why?

@jchodera (Member) commented Jul 4, 2022

Hm---I think this is a red herring. If I disable the endstates in the test, it goes silent after the previous line:

01:24:42 WARNING Warning: The openmmtools.multistate API is experimental and may change in future releases
01:24:42 WARNING Warning: The openmmtools.multistate API is experimental and may change in future releases
01:24:43 INFO n_replicas not defined, setting to match n_states, 12

Some of our code must be disabling or changing the logging, so I'm not getting any progress information even when requesting it via --log-cli-level=DEBUG. The hang would seem to be in hss.setup() or hss.extend():

    hss.setup(n_states=12, temperature=300 * unit.kelvin, t_max=300 * unit.kelvin,
              storage_file=reporter, minimisation_steps=0, endstates=True)
    hss.energy_context_cache = cache.ContextCache(capacity=None, time_to_live=None, platform=platform)
    hss.sampler_context_cache = cache.ContextCache(capacity=None, time_to_live=None, platform=platform)

    # Run simulation
    hss.extend(n_iterations)

Any ideas why @ijpulidos or @mikemhenry?

@zhang-ivy (Author) commented:

@jchodera : Yes, there is something weird going on with the logger. When I run the neutral mutation repex test, the per-iteration repex logger results don't show for me either. I had to adapt the script to set up a logger (with level DEBUG) to get the per-iteration results.
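A workaround for the missing per-iteration output is to configure a root handler at DEBUG level before running the sampler. A sketch (the specific logger names here are assumptions about which modules emit the repex output, not confirmed by this thread):

```python
import logging

# Send all log records to stderr with timestamps, at DEBUG level.
logging.basicConfig(
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
    level=logging.DEBUG,
)

# If a library has lowered its own logger's level, raise it back up.
# Module names below are assumed for illustration.
for name in ("openmmtools.multistate", "perses"):
    logging.getLogger(name).setLevel(logging.DEBUG)
```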

> I was wondering why it took so long

Note that the neutral mutation repex test runs both ALA->THR and THR->ALA for 3000 iterations each (with 50 steps per iteration). While the test takes ~4.5 hours total, this would mean each leg takes ~2 hours, which seems reasonable to me? How long would you expect each leg to take?
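For reference, the amount of simulated time per leg under these settings is small, which puts the ~2 h wall clock per leg in perspective (a back-of-the-envelope calculation using the 4 fs timestep mentioned later in this thread):

```python
n_iterations = 3000
steps_per_iteration = 50
timestep_fs = 4.0

# Total simulated time per replica for one leg.
total_fs = n_iterations * steps_per_iteration * timestep_fs
total_ns = total_fs * 1e-6  # ~0.6 ns per replica

# At ~2 hours of wall clock per leg, the effective per-replica throughput
# (not counting that 12 replicas propagate within each iteration):
ns_per_day = total_ns / 2.0 * 24.0  # ~7.2 ns/day per replica
```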

@jchodera (Member) commented Jul 5, 2022 via email

@zhang-ivy (Author) commented Jul 5, 2022

@jchodera : I posted the logs (for the last 2 iterations) of a charge changing mutation (ALA->ARG) as well as a neutral mutation (THR->ALA) here -- you should be able to see the breakdown there.

> For small molecules in solvent, I thought I had been seeing ~30 min execution times before.

Hmm, I don't think we've finished writing the tests for small molecule repex consistency yet, so I'm not sure where that would be from, but maybe @dominicrufa knows which data you're referring to?

@jchodera (Member) commented Jul 5, 2022 via email

@zhang-ivy (Author) commented Jul 6, 2022

@jchodera : For the neutral (THR->ALA) mutation, which uses n_replicas=12, propagating replicas takes ~2 seconds and computing the energy matrix takes < 1 second. However, running 5000 iterations still takes ~3.5 hours, which you mentioned is much longer than expected. Is 2 seconds still much longer than you'd expect for the replica propagation time? If not, then I'm not sure what else would be making these so slow.
All the scripts/output files for this are located here: /data/chodera/zhangi/perses_benchmark/repex/38/38/with-perses-paper-test/, although I just accidentally deleted the log file :(

For the charge-changing (ALA->ARG) mutation, which uses n_replicas=36, propagating replicas and computing the energy matrix take longer, as you mentioned above. I think the reason is that we are using so many more replicas.
All the scripts/output files are located here: /data/chodera/zhangi/perses_benchmark/repex/38/39/

Also, since the solvation procedure has changed in OpenMM, I just want to double-check that I'm using the appropriate amount of solvent. I'm using padding = 1.7 nm. Here is solvated capped ALA:

[Screenshot: solvated capped ALA]

Does this look reasonable to you? Or is there more solvent than necessary? Note that we had to bump up the solvent padding because otherwise we were getting errors like: openmm.OpenMMException: NonbondedForce: The cutoff distance cannot be greater than half the periodic box size
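The OpenMMException quoted above comes from the requirement that the nonbonded cutoff be less than half the periodic box edge. A rough sanity check (the solute extent and cutoff values here are illustrative assumptions, not measured from the test system):

```python
def cutoff_fits_box(solute_extent_nm, padding_nm, cutoff_nm):
    """The box edge is roughly solute extent + 2 * padding; OpenMM requires
    cutoff < box_edge / 2 for periodic nonbonded methods."""
    box_edge = solute_extent_nm + 2.0 * padding_nm
    return cutoff_nm < box_edge / 2.0

# Capped ALA is small (~1 nm extent, assumed); with 1.7 nm padding and an
# assumed 1.1 nm cutoff the constraint is satisfied with room to spare:
print(cutoff_fits_box(1.0, 1.7, 1.1))  # True: 1.1 < (1.0 + 3.4) / 2 = 2.2
```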

@zhang-ivy (Author) commented:

@jchodera : I've run the experiments that we talked about yesterday to investigate what's causing the slow repex tests for ala->thr dipeptide:

[Screenshot: timing results table for V1, V2, V3]

where the propagation time is computed for running 50 steps (with a 4 fs timestep) across 12 replicas, and V1, V2, V3 are defined as:

V1 (the current version, using LangevinSplittingDynamicsMove):

mcmc.LangevinSplittingDynamicsMove(timestep=4.0 * unit.femtoseconds,
                                   collision_rate=1.0 / unit.picosecond,
                                   n_steps=50,
                                   reassign_velocities=True,
                                   n_restart_attempts=20,
                                   splitting="V R R R O R R R V",
                                   constraint_tolerance=1e-06)

V2 (LangevinSplittingDynamicsMove with a simpler splitting and a higher constraint tolerance):

mcmc.LangevinSplittingDynamicsMove(timestep=4.0 * unit.femtoseconds,
                                   collision_rate=1.0 / unit.picosecond,
                                   n_steps=50,
                                   reassign_velocities=True,
                                   n_restart_attempts=20,
                                   splitting="V R O R V",
                                   constraint_tolerance=1e-05)

V3 (LangevinDynamicsMove):

mcmc.LangevinDynamicsMove(timestep=4.0 * unit.femtoseconds,
                          collision_rate=1.0 / unit.picosecond,
                          n_steps=50,
                          reassign_velocities=True,
                          n_restart_attempts=20)

Note that LangevinSplittingDynamicsMove uses the openmmtools BAOAB integrator, whereas LangevinDynamicsMove uses the OpenMM LangevinIntegrator.

Results:

  • Across all move versions, running vanilla MD on the old (unmodified) system is 2x faster than running vanilla MD on the hybrid system.
  • There is a small (10-30%) slowdown incurred when running repex vs. vanilla MD using the hybrid system.
  • With V3, we can get close to 800 ns/day.

Takeaways:

  • We should use V3 for the GPU repex consistency tests.
  • Since LangevinDynamicsMove does not use the BAOAB integrator, I think we should use V2 to generate data for the barnase:barstar paper.

Do you agree with my above takeaways?

Final thing to note: In the logs I posted before, I was seeing that the replica propagation time (when using V1) was closer to 2 seconds, whereas in the table above I report 0.75 s for the propagation time. This is because I've found that generating the htf in the same script that runs repex makes the repex times >2x slower. I wonder if there is some thread contention happening here.

@ijpulidos : Can you help me adapt the repex consistency tests (protein mutation and small molecule) given the above takeaways? And could you investigate why generating the htf and running repex in the same script yields 2x slower repex iteration times, and whether there is a workaround we can use to prevent the slowdown?
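The per-stage timings discussed in this thread could be collected with a small helper like the following (a generic sketch, not the benchmarking code actually used on lilac; the stage bodies are stand-ins):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Record the wall-clock duration of a code block into `results`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

# Example: time two stages independently, e.g. htf generation vs. repex
# propagation (trivial placeholder work shown here).
results = {}
with timed("generate_htf", results):
    sum(range(100_000))
with timed("propagate_replicas", results):
    sum(range(100_000))
```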

@jchodera (Member) commented Jul 9, 2022

I spoke with @dominicrufa about implementing a LangevinMiddleIntegrator variant of LangevinDynamicsMove, which would give us better configuration-space accuracy. This uses BAOAB, and should be very easy to put into the upcoming openmmtools release that @mikemhenry will need to cut.

In general, I think we want to set reassign_velocities=False once we have fixed the velocity resume issue. This will reduce correlation times.

The other slowdowns (generating the htf in the same script, as well as the difference between hybrid propagation and repex) are things we might want to look into later on.

@zhang-ivy (Author) commented Jul 9, 2022

@jchodera : A new LangevinDynamicsMove with LangevinMiddleIntegrator sounds great, and fixing the velocity resume issue would also be very helpful, but I'm wondering how essential it is to wait for these improvements before I can start generating data for the barnase:barstar paper. I'll set up a time next week to chat about this (and other software roadblocks I've been experiencing, including the repex cycling problem that has reappeared).

@jchodera (Member) commented Jul 10, 2022

We should use a BAOAB-based integrator, which means you cannot use LangevinDynamicsMove since it is based on OBABO.

It may be easier just to modify LangevinDynamicsMove to use openmm.LangevinMiddleIntegrator, since there's no reason to specifically use openmm.LangevinIntegrator. This is a one-line change that we should be able to get in right away.

Fixing the velocities resume bug would enable us to disable velocity reassignment every iteration, which is likely slowing down sampling significantly.

cc : choderalab/openmmtools#599 choderalab/openmmtools#600

Development

Successfully merging this pull request may close these issues.

Fix GPU tests for repex internal consistency (protein mutations)