MPI bug when multiple GPUs are used per calculation #449
Comments
That sounds like the best thing to try now that you have a working test! You could use
Hi, some comments. First, looking at the environment file in the bug report here, the reporter was using YANK 0.23.7, so that would narrow the range to between 0.20.1 and 0.23.7. Secondly, I have seen this problem with various explicit-solvent YANK calculations that I have run myself over two GPUs. More interestingly, if I run the hydration-phenol example, I get the following for the solvent and vacuum phases (replica-mixing plots were attached here).
What is interesting here is that, for the vacuum phase, there is no difference between the odd and even replicas. I am now speculating: is this something specific to NPT simulations? Could there be an issue with box_vectors and MPI in ReplicaExchangeSampler or its parent class? Do you know if this is seen with the NVT cucurbit[7]uril example over multiple GPUs?
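One way to probe the box_vectors speculation is to read the stored sampler states back from the reporter and compare the box vectors recorded for each replica. This is a minimal sketch, assuming the MultiStateReporter API; the file name and iteration number are just examples, and positions/box vectors are only written at checkpoint iterations, so pick one of those:

```python
# Sketch: inspect per-replica box vectors stored by the reporter at a checkpoint iteration.
from openmmtools import multistate

reporter = multistate.MultiStateReporter('experiment.nc', open_mode='r')
sampler_states = reporter.read_sampler_states(iteration=50)  # a checkpoint iteration
for replica_index, sampler_state in enumerate(sampler_states):
    print(replica_index, sampler_state.box_vectors)
reporter.close()
```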
Has this been resolved? I am trying to set up a replica exchange sampler on a multi-GPU node, so I could use the command-line and other setup information needed to reproduce the bug discussed here with openmmtools (i.e., without YANK).
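For what it's worth, below is a minimal sketch of the kind of setup that exercises this code path without YANK; the test system, temperature ladder, and file names are my own assumptions rather than anything from this thread. Launched with one MPI process per GPU (e.g. `mpiexec -n 2 python run_repex.py`), the sampler should distribute replicas across the processes; GPU assignment per rank is typically handled separately, e.g. by restricting CUDA_VISIBLE_DEVICES for each rank.

```python
# Minimal t-repex sketch with openmmtools.multistate (assumes openmm >= 7.6 and a recent
# openmmtools). Implicit-solvent alanine dipeptide keeps the sketch small; swap in an
# explicit-solvent/NPT system to match the runs where the bug was reported.
from openmm import unit
from openmmtools import testsystems, states, mcmc, multistate

testsystem = testsystems.AlanineDipeptideImplicit()
n_replicas = 12
temperatures = [(300.0 + 2.0 * i) * unit.kelvin for i in range(n_replicas)]

thermodynamic_states = [
    states.ThermodynamicState(system=testsystem.system, temperature=temperature)
    for temperature in temperatures
]
move = mcmc.LangevinDynamicsMove(timestep=2.0 * unit.femtoseconds, n_steps=500)
sampler = multistate.ReplicaExchangeSampler(mcmc_moves=move,
                                            number_of_iterations=100,
                                            replica_mixing_scheme='swap-all')
reporter = multistate.MultiStateReporter('repex_test.nc', checkpoint_interval=10)
sampler.create(thermodynamic_states=thermodynamic_states,
               sampler_states=states.SamplerState(testsystem.positions),
               storage=reporter)
sampler.run()
```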
I ran some repex simulations with multiple GPUs recently and was not able to reproduce this bug. With ~80 iterations of h-repex (12 replicas, swap-all) on 2 GPUs:
With 6 GPUs (500 iterations):
Using these MPI-related packages:
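(Not the package list that was attached here, but for anyone reproducing this, a quick sanity check, sketched below, confirms that each MPI rank is actually pinned to a different GPU. It assumes mpi4py is installed and that GPU assignment is done via CUDA_VISIBLE_DEVICES per rank.)

```python
# Sketch: print the rank-to-GPU mapping for the current MPI launch.
import os
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
print(f"rank {rank}: CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')}")
```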
We should be able to close this issue unless someone else is still able to reproduce.
Thanks, @zhang-ivy! Please re-open if this issue reappears.
I'm re-opening this issue because I am seeing the same problematic behavior when running h-repex (36 replicas) on barnase:barstar with 2 GPUs:
vs. with 1 GPU:
I'm also not sure why I wasn't seeing the same problem in my previous experiments. Here is the code to generate the number of states visited per replica, given the output .nc file:

```python
import os
import openmmtools
from perses.analysis import utils  # not needed for this snippet

reporter = openmmtools.multistate.MultiStateReporter(os.path.join("0_complex.nc"), 'r')
states = reporter.read_replica_thermodynamic_states()
for i in range(reporter.n_states):
    print(i, len(set(states[:, i])))
```

Attached is the pickled barnase:barstar htf, bash script, and python script for running repex. cc: @jchodera @ijpulidos
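As a follow-up to the snippet above (my own sketch, using only the same `states` array), a per-replica occupancy histogram makes the pathology easier to see: with healthy mixing every row spreads over many states, whereas the broken multi-GPU runs leave replicas pinned to a handful of states.

```python
# Sketch: empirical state-occupancy counts per replica from the replica/state trajectory.
import numpy as np

states = np.asarray(states)                      # shape (n_iterations, n_replicas)
n_replicas = states.shape[1]
n_states = int(states.max()) + 1
occupancy = np.zeros((n_replicas, n_states), dtype=int)
for replica in range(n_replicas):
    visited, counts = np.unique(states[:, replica], return_counts=True)
    occupancy[replica, visited] = counts
print(occupancy)
```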
Thanks for providing a test system, @zhang-ivy! Which version of
No, I did not see that warning.
I'm attaching my environment here:
Note that when I run my
Thanks a lot for all the work and effort from @zhang-ivy, especially for pointing me to the differences between YANK versions; I made a PR with a probable solution. As far as I could tell, the
@zhang-ivy, if you can confirm that this solves it with all your examples and systems, it would be really nice as a validation. Just need to install the
@ijpulidos: That was the bug! Thanks for catching it after all this time!
@ijpulidos: Great work! Can confirm that your PR fixes the mixing problem for ala dipeptide in solvent t-repex and apo barstar h-repex.
I wanted to create an issue about this in openmmtools as well, since we completed the transfer of the multistate code from YANK. We're still experiencing the mixing problem with the latest version of MPI when multiple GPUs are available (see choderalab/yank#1130 and #407). I've added a test to test_sampling checking this in #407, but I still haven't figured out the reason for the bug. I'm sure this was working correctly in YANK during the SAMPLing challenge (i.e., YANK 0.20.1, right before adding the multistate module), so the next step would probably be a binary search on the YANK versions after that to identify where the problem was introduced.
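For context, the kind of check such a mixing test can make is simply that every replica leaves its starting thermodynamic state during a short multi-process run. This is a hypothetical sketch, not the actual test added in #407:

```python
# Hypothetical sketch of a mixing assertion over a finished (or in-progress) run.
import numpy as np
from openmmtools import multistate

def assert_replicas_mix(storage_path):
    """Fail if any replica never changed thermodynamic state."""
    reporter = multistate.MultiStateReporter(storage_path, open_mode='r')
    states = np.asarray(reporter.read_replica_thermodynamic_states())
    reporter.close()
    for replica in range(states.shape[1]):
        assert len(set(states[:, replica])) > 1, \
            f"replica {replica} never changed thermodynamic state"
```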