
Fix protein mutation repex consistency tests #1054

Merged · 7 commits · Jun 24, 2022

Conversation

@zhang-ivy (Contributor) commented Jun 24, 2022

Description

This PR introduces the following changes:

For speed up:

  • Decrease the number of steps per iteration from 250 to 50
  • Set reassign_velocities to be False

For improved convergence:

  • Set n_iterations = 3000 for both the neutral and the charge-changing mutations
  • Set n_replicas = 36 for the charge-changing test
  • Set minimisation_steps = 0, which means the minimization runs until the energy drops below a tolerance (instead of running for 100 steps, the default in HybridRepexSampler)
  • Change the convergence criterion to check that DDG < 6 * dDDG

Other:

  • Change the solvent padding to 1.7 nm (necessary for the nightly dev builds of OpenMM). Note this may make the tests slower if we are testing against OpenMM <= 7.7. EDIT: We are currently testing against OpenMM 7.7, so we should not bump up the solvent padding until we switch the GPU tests to OpenMM >= 7.8.

  • Clean up unfinished comment and reporter file name
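The convergence check described above amounts to comparing the estimated free-energy difference against a multiple of its uncertainty. A minimal sketch (the function name and signature are illustrative, not perses API):

```python
def is_converged(ddg, dddg, n_sigma=6.0):
    """Return True if the estimated free-energy difference DDG is within
    n_sigma standard errors (dDDG) of zero, the criterion the internal
    consistency tests use (forward + reverse legs of the same mutation
    should sum to approximately zero)."""
    return abs(ddg) < n_sigma * dddg
```

For example, `is_converged(0.5, 0.2)` passes (0.5 < 6 * 0.2), while `is_converged(3.0, 0.2)` fails.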

Motivation and context

This PR changes some of the parameters in the protein mutation repex consistency tests to ensure convergence (DDG < 6 * dDDG) and speed up the tests.

Resolves #1044

How has this been tested?

Tested on lilac.

Change log

Fix protein mutation repex internal consistency tests to ensure convergence and speed up the tests. So far only testing neutral transformations.

@zhang-ivy zhang-ivy requested a review from ijpulidos June 24, 2022 14:19
@zhang-ivy zhang-ivy added this to the 0.10.1 - Bugfix release milestone Jun 24, 2022
@zhang-ivy (Author) commented:

I think we'll want to skip the charge changing test for now -- @ijpulidos do you know how to do this?

@codecov bot commented Jun 24, 2022

Codecov Report

Attention: Patch coverage is 7.69231% with 12 lines in your changes missing coverage. Please review.

Project coverage is 52.37%. Comparing base (6af4bfd) to head (ed784d7).
Report is 59 commits behind head on main.


@ijpulidos (Contributor) left a comment:

Looks good, just some suggested changes to skip the charge change test and a question with the change of the padding.

perses/tests/test_repex.py (comment thread resolved)
@@ -39,6 +37,7 @@ def test_RESTCapableHybridTopologyFactory_repex_neutral_mutation():
"1",
"2",
mutant_name.upper(),
padding=1.7 * unit.nanometers,
@ijpulidos (Contributor):
Why do we need to increase the default padding here? I'm thinking that would make the test take more time to run.

@zhang-ivy (Author):

I decreased it back down and left a TODO to increase it once we move to testing against openmm >= 7.8

@ijpulidos ijpulidos self-requested a review June 24, 2022 18:07
@ijpulidos (Contributor) left a comment:

Looks good! We have to double-check how much time the tests will take on our GPU CI, and that the charge-change test is not getting run.

@ijpulidos ijpulidos enabled auto-merge (squash) June 24, 2022 18:09
@ijpulidos ijpulidos disabled auto-merge June 24, 2022 18:11
@ijpulidos ijpulidos enabled auto-merge (squash) June 24, 2022 18:16
@ijpulidos ijpulidos merged commit 9e61c22 into main Jun 24, 2022
@ijpulidos ijpulidos deleted the fix-repex-tests branch June 24, 2022 18:32
@ijpulidos (Contributor) commented Jun 27, 2022

Just wanted to say that, according to our latest GPU CI workflow, we are correctly testing just the neutral mutation. It is passing; the workflow takes ~5.5 hours in total, with the test itself using ~4.5 hours.

@jchodera (Member) commented Jul 4, 2022

This is odd. I was wondering why it took so long, so I ran the test on lilac on a GTX 1080 Ti, using

$ pytest -vvv --log-cli-level=DEBUG --log-cli-format="%(asctime)s %(levelname)s %(message)s" perses/tests/test_repex.py::test_RESTCapableHybridTopologyFactory_repex_neutral_mutation

to print a timestamp for each step.

It's been stuck at Generating unsampled endstates for a full hour!

00:11:03 WARNING Warning: The openmmtools.multistate API is experimental and may change in future releases
00:11:03 WARNING Warning: The openmmtools.multistate API is experimental and may change in future releases
00:11:05 INFO n_replicas not defined, setting to match n_states, 12
00:11:17 INFO Generating unsampled endstates.

@zhang-ivy : Is it possible this is actually taking an hour? Any idea why?

@jchodera (Member) commented Jul 4, 2022

Hm---I think this is a red herring. If I disable the endstates in the test, it goes silent after the previous line:

01:24:42 WARNING Warning: The openmmtools.multistate API is experimental and may change in future releases
01:24:42 WARNING Warning: The openmmtools.multistate API is experimental and may change in future releases
01:24:43 INFO n_replicas not defined, setting to match n_states, 12

Some of our code must be disabling or changing the logging, so I'm not getting any progress information even when requesting it via --log-cli-level=DEBUG. The hang would seem to be in hss.setup() or hss.extend():

    hss.setup(n_states=12, temperature=300 * unit.kelvin, t_max=300 * unit.kelvin,
              storage_file=reporter, minimisation_steps=0, endstates=True)
    hss.energy_context_cache = cache.ContextCache(capacity=None, time_to_live=None, platform=platform)
    hss.sampler_context_cache = cache.ContextCache(capacity=None, time_to_live=None, platform=platform)

    # Run simulation
    hss.extend(n_iterations)

Any ideas why @ijpulidos or @mikemhenry?

@zhang-ivy (Author) commented:

@jchodera : Yes, there is something weird going on with the logger. When I run the neutral mutation repex test, the per-iteration repex logger results don't show for me either. I had to adapt the script to set up a logger (with level DEBUG) to get the per-iteration results.
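A workaround for the missing per-iteration output is to configure a root handler at DEBUG level before running the sampler. A sketch (the specific logger names here are assumptions about which modules emit the repex output, not confirmed by this thread):

```python
import logging

# Send all log records to stderr with timestamps, at DEBUG level.
logging.basicConfig(
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
    level=logging.DEBUG,
)

# If a library has lowered its own logger's level, raise it back up.
# Module names below are assumed for illustration.
for name in ("openmmtools.multistate", "perses"):
    logging.getLogger(name).setLevel(logging.DEBUG)
```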

> I was wondering why it took so long

Note that the neutral mutation repex test runs both ALA->THR and THR->ALA for 3000 iterations each (with 50 steps per iteration). While the test takes ~4.5 hours total, this would mean each leg takes ~2 hours, which seems reasonable to me? How long would you expect each leg to take?
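For reference, the amount of simulated time per leg under these settings is small, which puts the ~2 h wall clock per leg in perspective (a back-of-the-envelope calculation using the 4 fs timestep mentioned later in this thread):

```python
n_iterations = 3000
steps_per_iteration = 50
timestep_fs = 4.0

# Total simulated time per replica for one leg.
total_fs = n_iterations * steps_per_iteration * timestep_fs
total_ns = total_fs * 1e-6  # ~0.6 ns per replica

# At ~2 hours of wall clock per leg, the effective per-replica throughput
# (not counting that 12 replicas propagate within each iteration):
ns_per_day = total_ns / 2.0 * 24.0  # ~7.2 ns/day per replica
```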

@jchodera (Member) commented Jul 5, 2022 via email

@zhang-ivy (Author) commented Jul 5, 2022

@jchodera : I posted the logs (for the last 2 iterations) of a charge changing mutation (ALA->ARG) as well as a neutral mutation (THR->ALA) here -- you should be able to see the breakdown there.

> For small molecules in solvent, I thought I had been seeing ~30 min execution times before.

Hmm, I don't think we've finished writing the tests for small molecule repex consistency yet, so I'm not sure where that would be from, but maybe @dominicrufa knows which data you're referring to?

@jchodera (Member) commented Jul 5, 2022 via email

@zhang-ivy (Author) commented Jul 6, 2022

@jchodera : For the neutral (THR->ALA) mutation, which uses n_replicas=12, propagating replicas takes ~2 seconds and computing the energy matrix takes < 1 second. However, running 5000 iterations still takes ~3.5 hours, which you mentioned is much longer than expected. Is 2 seconds still much longer than you'd expect for the replica propagation time? If not, then I'm not sure what else would be making these so slow.
All the scripts/output files for this are located here: /data/chodera/zhangi/perses_benchmark/repex/38/38/with-perses-paper-test/, although I just accidentally deleted the log file :(

For the charge-changing (ALA->ARG) mutation, which uses n_replicas=36, propagating replicas and computing the energy matrix take longer, as you mentioned above. I think the reason is that we are using so many more replicas.
All the scripts/output files are located here: /data/chodera/zhangi/perses_benchmark/repex/38/39/

Also, since the solvation procedure has changed in OpenMM, I just want to double-check that I'm using the appropriate amount of solvent. I'm using padding = 1.7 nm. Here is solvated capped ALA:

[Screenshot: solvated capped ALA]

Does this look reasonable to you? Or is there more solvent than necessary? Note that we had to bump up the solvent padding because otherwise we were getting errors like: openmm.OpenMMException: NonbondedForce: The cutoff distance cannot be greater than half the periodic box size
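The OpenMMException quoted above comes from the requirement that the nonbonded cutoff be less than half the periodic box edge. A rough sanity check (the solute extent and cutoff values here are illustrative assumptions, not measured from the test system):

```python
def cutoff_fits_box(solute_extent_nm, padding_nm, cutoff_nm):
    """The box edge is roughly solute extent + 2 * padding; OpenMM requires
    cutoff < box_edge / 2 for periodic nonbonded methods."""
    box_edge = solute_extent_nm + 2.0 * padding_nm
    return cutoff_nm < box_edge / 2.0

# Capped ALA is small (~1 nm extent, assumed); with 1.7 nm padding and an
# assumed 1.1 nm cutoff the constraint is satisfied with room to spare:
print(cutoff_fits_box(1.0, 1.7, 1.1))  # True: 1.1 < (1.0 + 3.4) / 2 = 2.2
```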

@zhang-ivy (Author) commented:

@jchodera : I've run the experiments that we talked about yesterday to investigate what's causing the slow repex tests for ala->thr dipeptide:

[Screenshot: timing results table for V1, V2, V3]

where the propagation time is computed for running 50 steps (with a 4 fs timestep) across 12 replicas, and V1, V2, V3 are defined as:

V1 (the current version, using LangevinSplittingDynamicsMove):

mcmc.LangevinSplittingDynamicsMove(timestep=4.0 * unit.femtoseconds,
                                   collision_rate=1.0 / unit.picosecond,
                                   n_steps=50,
                                   reassign_velocities=True,
                                   n_restart_attempts=20,
                                   splitting="V R R R O R R R V",
                                   constraint_tolerance=1e-06)

V2 (LangevinSplittingDynamicsMove with a simpler splitting and a higher constraint tolerance):

mcmc.LangevinSplittingDynamicsMove(timestep=4.0 * unit.femtoseconds,
                                   collision_rate=1.0 / unit.picosecond,
                                   n_steps=50,
                                   reassign_velocities=True,
                                   n_restart_attempts=20,
                                   splitting="V R O R V",
                                   constraint_tolerance=1e-05)

V3 (LangevinDynamicsMove):

mcmc.LangevinDynamicsMove(timestep=4.0 * unit.femtoseconds,
                          collision_rate=1.0 / unit.picosecond,
                          n_steps=50,
                          reassign_velocities=True,
                          n_restart_attempts=20)

Note that LangevinSplittingDynamicsMove uses the openmmtools BAOAB integrator, whereas LangevinDynamicsMove uses the OpenMM LangevinIntegrator.

Results:

  • Across all move versions, running vanilla MD on the old (unmodified) system is 2x faster than running vanilla MD on the hybrid system.
  • There is a small (10-30%) slowdown incurred when running repex vs. vanilla MD using the hybrid system.
  • With V3, we can get close to 800 ns/day.

Takeaways:

  • We should use V3 for the GPU repex consistency tests.
  • Since LangevinDynamicsMove does not use the BAOAB integrator, I think we should use V2 to generate data for the barnase:barstar paper.

Do you agree with my above takeaways?

Final thing to note: In the logs I posted before, I was seeing that the replica propagation time (when using V1) was closer to 2 seconds, whereas in the table above I report 0.75 s for the propagation time. This is because I've found that generating the htf in the same script that runs repex makes the repex times >2x slower. I wonder if there is some thread contention happening here.

@ijpulidos : Can you help me adapt the repex consistency tests (protein mutation and small molecule) given the above takeaways? And could you investigate why generating the htf and running repex in the same script yields 2x slower repex iteration times, and whether there is a workaround we can use to prevent the slowdown?
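The per-stage timings discussed in this thread could be collected with a small helper like the following (a generic sketch, not the benchmarking code actually used on lilac; the stage bodies are stand-ins):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Record the wall-clock duration of a code block into `results`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

# Example: time two stages independently, e.g. htf generation vs. repex
# propagation (trivial placeholder work shown here).
results = {}
with timed("generate_htf", results):
    sum(range(100_000))
with timed("propagate_replicas", results):
    sum(range(100_000))
```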

@jchodera (Member) commented Jul 9, 2022

I spoke with @dominicrufa about implementing a LangevinMiddleIntegrator variant of LangevinDynamicsMove, which would give us better configuration-space accuracy. This uses BAOAB, and should be very easy to put into the upcoming openmmtools release that @mikemhenry will need to cut.

In general, I think we want to set reassign_velocities=False once we have fixed the velocity resume issue. This will reduce correlation times.

The other slowdowns (generating the htf in the same script, as well as the difference between hybrid propagation and repex) are things we might want to look into later on.

@zhang-ivy (Author) commented Jul 9, 2022

@jchodera : A new LangevinDynamicsMove with LangevinMiddleIntegrator sounds great, and fixing the velocity resume issue would also be very helpful, but I'm wondering how essential it is to wait for these improvements before I can start generating data for the barnase:barstar paper. I'll set up a time next week to chat about this (and other software roadblocks I've been experiencing, including the repex cycling problem that has reappeared).

@jchodera (Member) commented Jul 10, 2022

We should use a BAOAB-based integrator, which means you cannot use LangevinDynamicsMove since it is based on OBABO.

It may be easier just to modify LangevinDynamicsMove to use openmm.LangevinMiddleIntegrator, since there's no reason to specifically use openmm.LangevinIntegrator. This is a one-line change that we should be able to get in right away.

Fixing the velocities resume bug would enable us to disable velocity reassignment every iteration, which is likely slowing down sampling significantly.

cc : choderalab/openmmtools#599 choderalab/openmmtools#600

Development

Successfully merging this pull request may close these issues.

Fix GPU tests for repex internal consistency (protein mutations)