
MPI bug when multiple GPUs are used per calculation #449

Closed
andrrizzi opened this issue Nov 8, 2019 · 12 comments · Fixed by #562

Comments

@andrrizzi
Contributor

I wanted to open an issue about this in openmmtools as well, since we have completed transferring the multistate code from YANK.

We're still experiencing mixing problems with the latest version when MPI is used and multiple GPUs are available (see choderalab/yank#1130 and #407). I've added a test for this to test_sampling in #407, but I still haven't figured out the cause of the bug.

I'm sure this was working correctly in YANK during the SAMPLing challenge (i.e., YANK 0.20.1, right before the multistate module was added), so the next step would probably be a binary search over the YANK versions after that to identify where the problem was introduced.

@jchodera
Member

jchodera commented Nov 9, 2019

I'm sure this was working correctly in YANK during the SAMPLing challenge (i.e., YANK 0.20.1, right before the multistate module was added), so the next step would probably be a binary search over the YANK versions after that to identify where the problem was introduced.

That sounds like the best thing to try now that you have a working test! You could use git bisect run to automate this process. Since the samplers moved from YANK to openmmtools, it could be a simple matter of testing the last version of YANK that included the multistate samplers (which presumably still has the bug) and then bisecting between that and 0.20.1.

@mjw99
Collaborator

mjw99 commented Apr 13, 2020

Hi, some comments.

First, looking at the environment file in the bug report linked here, the reporter was using YANK 0.23.7, which narrows the range to 0.20.1 to 0.23.7.

Secondly, I have seen this problem myself with various explicit-solvent YANK calculations run over two GPUs. More interestingly, if I run the hydration-phenol example (with default_number_of_iterations: 20000) and use the count_states.py script from the above bug report (with a minor modification to use the openmmtools packages), I see:

Solvent1 (water)
# replica_number, total_number_states_visited
0   19 
1   15 
2   19 
3   16 
4   19 
5   15 
6   19 
7   16 
8   19 
9   16 
10   19 
11   17 
12   19 
13   17 
14   19 
15   15 
16   19 
17   6 
18   19 

and

Solvent2 (vacuum)
# replica_number, total_number_states_visited
0   19 
1   19 
2   19 
3   19 
4   19 
5   19 
6   19 
7   19 
8   19 
9   19 
10   19 
11   19 
12   19 
13   19 
14   19 
15   19 
16   19 
17   19 
18   19 

What is interesting here is that, for the vacuum phase, there is no odd/even difference between the replicas.

I am now speculating: is this something specific to NPT simulations? Could there be an issue with box_vectors and MPI in ReplicaExchangeSampler or its parent class? Do you know if this is seen with the NVT cucurbit[7]uril example over multiple GPUs?

@mpvenkatesh

Has this been resolved? I am trying to set up a replica exchange sampler on a multi-GPU node, so I could use the command line and other setup information needed to reproduce the bug discussed here with openmmtools alone (i.e., without YANK).
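
(For reference: a minimal openmmtools-only replica exchange setup looks roughly like the sketch below. This is illustrative rather than the exact scripts used in this thread; the test system, temperature ladder, iteration counts, and file name are assumptions, and on OpenMM older than 7.6 the unit import would be from simtk import unit instead.)

# Minimal replica exchange over a temperature ladder with openmmtools only (no YANK).
# All parameters here are illustrative.
from openmm import unit
from openmmtools import testsystems, states, mcmc, multistate

testsystem = testsystems.AlanineDipeptideVacuum()
n_replicas = 12
temperatures = [(300 + 10 * i) * unit.kelvin for i in range(n_replicas)]

# One thermodynamic state per replica.
thermodynamic_states = [
    states.ThermodynamicState(system=testsystem.system, temperature=T)
    for T in temperatures
]

# Propagate each replica with Langevin dynamics between swap attempts.
move = mcmc.LangevinDynamicsMove(timestep=2.0 * unit.femtoseconds, n_steps=500)
sampler = multistate.ReplicaExchangeSampler(mcmc_moves=move, number_of_iterations=100)

reporter = multistate.MultiStateReporter('repex.nc', checkpoint_interval=10)
sampler.create(thermodynamic_states=thermodynamic_states,
               sampler_states=states.SamplerState(testsystem.positions),
               storage=reporter)
sampler.run()

To spread replicas over multiple GPUs, the same script is typically launched with one MPI rank per GPU (for example via mpiexec or clusterutils' build_mpirun_configfile, as shown later in this thread); mpiplus then distributes the replica propagation across the ranks.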

@zhang-ivy
Contributor

I ran some repex simulations with multiple GPUs recently and was not able to reproduce this bug.

I ran ~80 iterations of h-repex (12 replicas, swap-all) on 2 GPUs and was not able to reproduce the problem:

# replica_number, total_number_states_visited
0 12
1 10
2 12
3 12
4 12
5 11
6 12
7 12
8 12
9 12
10 12
11 11

With 6 GPUs (500 iterations):

0 12
1 12
2 12
3 12
4 12
5 12
6 12
7 12
8 12
9 12
10 12
11 12

Using these MPI-related packages:

mpi                       1.0                       mpich    conda-forge
mpi4py                    3.1.1            py39h6438238_0    conda-forge
mpich                     3.4.2              h846660c_100    conda-forge
mpiplus                   v0.0.1          py39hde42818_1002    conda-forge

We should be able to close this issue unless someone else is still able to reproduce.

@jchodera
Member

jchodera commented Nov 9, 2021

Thanks, @zhang-ivy !

Please re-open if this issue reappears.

@jchodera jchodera closed this as completed Nov 9, 2021
@zhang-ivy
Contributor

zhang-ivy commented Mar 23, 2022

I'm re-opening this issue because I am seeing the same problematic behavior when running h-repex (36 replicas) on barnase:barstar with 2 GPUs:

# replica_number, total_number_states_visited
0 36
1 10
2 36
3 11
4 36
5 13
6 36
7 14
8 36
9 15
10 36
11 14
12 36
13 14
14 36
15 14
16 36
17 14
18 36
19 14
20 36
21 15
22 36
23 12
24 36
25 13
26 36
27 12
28 36
29 12
30 36
31 12
32 36
33 8
34 36
35 6

vs. with 1 GPU:

0 36
1 36
2 36
3 36
4 36
5 36
6 36
7 36
8 36
9 36
10 36
11 36
12 36
13 36
14 36
15 36
16 36
17 36
18 36
19 36
20 36
21 36
22 36
23 36
24 36
25 36
26 36
27 36
28 36
29 36
30 36
31 36
32 36
33 36
34 36
35 36

I tried running the same experiment on alanine dipeptide in vacuum and was not able to reproduce the problem.
Edit: I actually think the problem is there with alanine dipeptide in vacuum as well; it's just more subtle.

I'm also not sure why I wasn't seeing the same problem in my previous experiments.

Here is the code to generate the number of states visited per replica, given a .nc file:

import openmmtools

# Open the multistate reporter (NetCDF storage) in read-only mode.
reporter = openmmtools.multistate.MultiStateReporter("0_complex.nc", 'r')

# states[iteration, replica] is the thermodynamic state index occupied by that
# replica at that iteration.
states = reporter.read_replica_thermodynamic_states()

# For each replica, count how many distinct thermodynamic states it has visited.
for i in range(reporter.n_states):
    print(i, len(set(states[:, i])))

Attached is the pickled barnase:barstar htf, bash script, and python script for running repex.
replica_mixing_issue.zip

cc: @jchodera @ijpulidos

@jchodera jchodera reopened this Mar 24, 2022
@jchodera jchodera added this to the 0.21.3 milestone Mar 24, 2022
@jchodera
Member

Thanks for providing a test system, @zhang-ivy!

Which version of clusterutils are you using?
Also, when you ran it, did you see this?

(openmm-dev) [chodera@lilac:replica_mixing_issue]$ build_mpirun_configfile --configfilepath configfile_apo --hostfilepath hostfile_apo "python /home/zhangi/choderalab/perses_benchmark/perses_protein_mutations/code/31_rest_over_protocol/run_rest_over_protocol.py $outdir $phase $n_states $n_cycles $T_max"
Detected MPICH version 4! 
Your host and configfiles creation will still be attempted, but you may have problems as build_mpirun_configfile only builds MPICH3 compatible files.

@zhang-ivy
Contributor

No, I did not see that warning.

clusterutils              0.3.1              pyhd8ed1ab_1    conda-forge

I'm attaching my environment here:
openmm-dev.yml.zip

@zhang-ivy
Contributor

Note that when I run the 1-GPU version, everything in the bash script is the same except that I changed the number of GPUs to 1, i.e., I still use the clusterutils commands in the bash script for 1 GPU.

@ijpulidos
Contributor

Thanks a lot for all the work and effort from @zhang-ivy, especially for pointing me to the differences between YANK versions 0.20.1 and 0.21.0, which is when this issue first appeared. I think I managed to come up with a solution.

I made a PR with a probable solution. As far as I can tell, the _replica_thermodynamic_states attribute was not being broadcast to the other MPI processes. More details in the PR.

@zhang-ivy, if you can confirm that this solves it with all your examples and systems, that would be a really nice validation. You just need to install the fix-mpi-replica-mix branch with something like pip install "git+https://github.com/choderalab/openmmtools.git@fix-mpi-replica-mix"
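
(To make the failure mode concrete for readers following along, here is a rough sketch of the pattern involved. It is an illustration under assumptions, not the actual openmmtools code or the diff in the fix: the idea is that the replica mixing step runs only on MPI rank 0, so unless the updated replica-to-state assignment is broadcast back, the other ranks keep propagating replicas at stale thermodynamic states. mpiplus supports this pattern via on_single_node(..., broadcast_result=True); the function below is hypothetical.)

# Illustrative sketch only; not the actual openmmtools implementation.
import mpiplus

@mpiplus.on_single_node(0, broadcast_result=True)
def mix_replicas(replica_thermodynamic_states, energy_matrix):
    # Attempt replica <-> thermodynamic-state swaps on rank 0 only
    # (the swap logic itself is omitted in this sketch).
    return replica_thermodynamic_states

# Because broadcast_result=True, every MPI rank receives the same, freshly
# mixed assignment, so all processes agree on which state each replica
# should propagate next:
# replica_thermodynamic_states = mix_replicas(replica_thermodynamic_states, energy_matrix)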

@jchodera
Member

@ijpulidos : That was the bug! Thanks for catching it after all this time!

@zhang-ivy
Contributor

@ijpulidos : Great work! Can confirm that your PR fixes the mixing problem for ala dipeptide in solvent t-repex and apo barstar h-repex.
