
MPI bug when multiple GPUs are used per calculation #449

Closed
andrrizzi opened this issue Nov 8, 2019 · 12 comments · Fixed by #562

Comments

@andrrizzi
Contributor

I wanted to open an issue about this in openmmtools as well, since we have completed transferring the multistate code from YANK.

We're still experiencing mixing problems with the latest version when MPI is used and multiple GPUs are available (see choderalab/yank#1130 and #407). I've added a test for this to test_sampling in #407, but I still haven't figured out the cause of the bug.

I'm sure this was working correctly in YANK during the SAMPLing challenge (i.e., YANK 0.20.1, right before the multistate module was added), so the next step would probably be a binary search over the YANK versions after that to identify where the problem was introduced.

@jchodera
Member

jchodera commented Nov 9, 2019

I'm sure this was working correctly in YANK during the SAMPLing challenge (i.e., YANK 0.20.1, right before the multistate module was added), so the next step would probably be a binary search over the YANK versions after that to identify where the problem was introduced.

That sounds like the best thing to try now that you have a working test! You could use git bisect run to automate this process. Since the samplers moved from YANK to openmmtools, it could be a simple matter of testing the last version of YANK that included the multistate samplers (which presumably still has the bug) and then bisecting between that and 0.20.1.

@mjw99
Collaborator

mjw99 commented Apr 13, 2020

Hi, some comments.

First, looking at the environment file in the bug report linked here, the reporter was using YANK 0.23.7, which narrows the range to 0.20.1 to 0.23.7.

Secondly, I have seen this problem myself with various explicit-solvent YANK calculations run over two GPUs. More interestingly, if I run the hydration-phenol example (with default_number_of_iterations: 20000) and use the count_states.py script from the above bug report (with a minor modification to use the openmmtools packages), I see:

Solvent1 (water)
# replica_number, total_number_states_visited
0   19 
1   15 
2   19 
3   16 
4   19 
5   15 
6   19 
7   16 
8   19 
9   16 
10   19 
11   17 
12   19 
13   17 
14   19 
15   15 
16   19 
17   6 
18   19 

and

Solvent2 (vacuum)
# replica_number, total_number_states_visited
0   19 
1   19 
2   19 
3   19 
4   19 
5   19 
6   19 
7   19 
8   19 
9   19 
10   19 
11   19 
12   19 
13   19 
14   19 
15   19 
16   19 
17   19 
18   19 

What is interesting here is that, for the vacuum phase, there is no odd/even difference between the replicas.

I am now speculating: is this something specific to NPT simulations? Could there be an issue with box_vectors and MPI in ReplicaExchangeSampler or its parent class? Do you know if this is seen with the NVT cucurbit[7]uril example over multiple GPUs?

@mpvenkatesh

Has this been resolved? I am trying to set up a replica exchange sampler on a multi-GPU node, so I could use the command line and other setup information needed to reproduce the bug discussed here with openmmtools alone (i.e., without YANK).
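
(For reference: a minimal openmmtools-only replica exchange setup looks roughly like the sketch below. This is illustrative rather than the exact scripts used in this thread; the test system, temperature ladder, iteration counts, and file name are assumptions, and on OpenMM older than 7.6 the unit import would be from simtk import unit instead.)

# Minimal replica exchange over a temperature ladder with openmmtools only (no YANK).
# All parameters here are illustrative.
from openmm import unit
from openmmtools import testsystems, states, mcmc, multistate

testsystem = testsystems.AlanineDipeptideVacuum()
n_replicas = 12
temperatures = [(300 + 10 * i) * unit.kelvin for i in range(n_replicas)]

# One thermodynamic state per replica.
thermodynamic_states = [
    states.ThermodynamicState(system=testsystem.system, temperature=T)
    for T in temperatures
]

# Propagate each replica with Langevin dynamics between swap attempts.
move = mcmc.LangevinDynamicsMove(timestep=2.0 * unit.femtoseconds, n_steps=500)
sampler = multistate.ReplicaExchangeSampler(mcmc_moves=move, number_of_iterations=100)

reporter = multistate.MultiStateReporter('repex.nc', checkpoint_interval=10)
sampler.create(thermodynamic_states=thermodynamic_states,
               sampler_states=states.SamplerState(testsystem.positions),
               storage=reporter)
sampler.run()

To spread replicas over multiple GPUs, the same script is typically launched with one MPI rank per GPU (for example via mpiexec or clusterutils' build_mpirun_configfile, as shown later in this thread); mpiplus then distributes the replica propagation across the ranks.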

@zhang-ivy
Contributor

I ran some repex simulations with multiple GPUs recently and was not able to reproduce this bug.

I ran ~80 iterations of h-repex (12 replicas, swap-all) on 2 GPUs and was not able to reproduce the problem:

# replica_number, total_number_states_visited
0 12
1 10
2 12
3 12
4 12
5 11
6 12
7 12
8 12
9 12
10 12
11 11

With 6 GPUs (500 iterations):

0 12
1 12
2 12
3 12
4 12
5 12
6 12
7 12
8 12
9 12
10 12
11 12

Using these MPI-related packages:

mpi                       1.0                       mpich    conda-forge
mpi4py                    3.1.1            py39h6438238_0    conda-forge
mpich                     3.4.2              h846660c_100    conda-forge
mpiplus                   v0.0.1          py39hde42818_1002    conda-forge

We should be able to close this issue unless someone else is still able to reproduce.

@jchodera
Member

jchodera commented Nov 9, 2021

Thanks, @zhang-ivy !

Please re-open if this issue reappears.

@jchodera jchodera closed this as completed Nov 9, 2021
@zhang-ivy
Contributor

zhang-ivy commented Mar 23, 2022

I'm re-opening this issue because I am seeing the same problematic behavior when running h-repex (36 replicas) on barnase:barstar with 2 GPUs:

# replica_number, total_number_states_visited
0 36
1 10
2 36
3 11
4 36
5 13
6 36
7 14
8 36
9 15
10 36
11 14
12 36
13 14
14 36
15 14
16 36
17 14
18 36
19 14
20 36
21 15
22 36
23 12
24 36
25 13
26 36
27 12
28 36
29 12
30 36
31 12
32 36
33 8
34 36
35 6

vs. with 1 GPU:

0 36
1 36
2 36
3 36
4 36
5 36
6 36
7 36
8 36
9 36
10 36
11 36
12 36
13 36
14 36
15 36
16 36
17 36
18 36
19 36
20 36
21 36
22 36
23 36
24 36
25 36
26 36
27 36
28 36
29 36
30 36
31 36
32 36
33 36
34 36
35 36

I tried running the same experiment on alanine dipeptide in vacuum and was not able to reproduce the problem.
Edit: I actually think the problem is there with alanine dipeptide in vacuum as well; it's just more subtle.

I'm also not sure why I wasn't seeing the same problem in my previous experiments.

Here is the code to generate the number of states visited per replica, given a .nc file:

import openmmtools

# Open the multistate reporter (NetCDF storage) in read-only mode.
reporter = openmmtools.multistate.MultiStateReporter("0_complex.nc", 'r')

# states[iteration, replica] is the thermodynamic state index occupied by that
# replica at that iteration.
states = reporter.read_replica_thermodynamic_states()

# For each replica, count how many distinct thermodynamic states it has visited.
for i in range(reporter.n_states):
    print(i, len(set(states[:, i])))

Attached is the pickled barnase:barstar htf, bash script, and python script for running repex.
replica_mixing_issue.zip

cc: @jchodera @ijpulidos

@jchodera jchodera reopened this Mar 24, 2022
@jchodera jchodera added this to the 0.21.3 milestone Mar 24, 2022
@jchodera
Member

Thanks for providing a test system, @zhang-ivy!

Which version of clusterutils are you using?
Also, when you ran it, did you see this?

(openmm-dev) [chodera@lilac:replica_mixing_issue]$ build_mpirun_configfile --configfilepath configfile_apo --hostfilepath hostfile_apo "python /home/zhangi/choderalab/perses_benchmark/perses_protein_mutations/code/31_rest_over_protocol/run_rest_over_protocol.py $outdir $phase $n_states $n_cycles $T_max"
Detected MPICH version 4! 
Your host and configfiles creation will still be attempted, but you may have problems as build_mpirun_configfile only builds MPICH3 compatible files.

@zhang-ivy
Contributor

No, I did not see that warning.

clusterutils              0.3.1              pyhd8ed1ab_1    conda-forge

I'm attaching my environment here:
openmm-dev.yml.zip

@zhang-ivy
Contributor

Note that when I run the 1-GPU version, everything in the bash script is the same except that I changed the number of GPUs to 1, i.e., I still use the clusterutils commands in the bash script for 1 GPU.

@ijpulidos
Contributor

Thanks a lot for all the work and effort from @zhang-ivy, especially for pointing me to the differences between YANK versions 0.20.1 and 0.21.0, which is when this issue first appeared. I think I managed to come up with a solution.

I made a PR with a probable solution. As far as I can tell, the _replica_thermodynamic_states attribute was not being broadcast to the other MPI processes. More details in the PR.

@zhang-ivy, if you can confirm that this solves it with all your examples and systems, that would be a really nice validation. You just need to install the fix-mpi-replica-mix branch with something like pip install "git+https://github.com/choderalab/openmmtools.git@fix-mpi-replica-mix"
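
(To make the failure mode concrete for readers following along, here is a rough sketch of the pattern involved. It is an illustration under assumptions, not the actual openmmtools code or the diff in the fix: the idea is that the replica mixing step runs only on MPI rank 0, so unless the updated replica-to-state assignment is broadcast back, the other ranks keep propagating replicas at stale thermodynamic states. mpiplus supports this pattern via on_single_node(..., broadcast_result=True); the function below is hypothetical.)

# Illustrative sketch only; not the actual openmmtools implementation.
import mpiplus

@mpiplus.on_single_node(0, broadcast_result=True)
def mix_replicas(replica_thermodynamic_states, energy_matrix):
    # Attempt replica <-> thermodynamic-state swaps on rank 0 only
    # (the swap logic itself is omitted in this sketch).
    return replica_thermodynamic_states

# Because broadcast_result=True, every MPI rank receives the same, freshly
# mixed assignment, so all processes agree on which state each replica
# should propagate next:
# replica_thermodynamic_states = mix_replicas(replica_thermodynamic_states, energy_matrix)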

@jchodera
Member

@ijpulidos : That was the bug! Thanks for catching it after all this time!

@zhang-ivy
Contributor

@ijpulidos : Great work! Can confirm that your PR fixes the mixing problem for ala dipeptide in solvent t-repex and apo barstar h-repex.
