MPI bug when multiple GPUs are used per calculation #1130

Closed
HershGupta404 opened this issue Jan 7, 2019 · 8 comments

HershGupta404 commented Jan 7, 2019

When running a repex job for 2 systems on separate nodes with 2 GPUs per system, there appears to be uneven mixing of states depending on which GPU a replica is assigned to. All of the even-numbered replicas (GPU 0) appear to be mixing decently, but the odd-numbered replicas (GPU 1) do not appear to be mixing properly. The input files are below. The output files are at /data/chodera/hgupta/repex_flatbottom_nal_md2/no_restraint/experiments. I've also attached a script that counts the total number of states visited by each replica, along with a sample of its output for the data at /data/chodera/hgupta/repex_flatbottom_nal_md2/no_restraint/experiments/experiment-neg-MD2.

input_files.zip

count_states.zip

environment.yml.zip
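
For reference, a minimal sketch of this kind of state-counting analysis, assuming the run was written by the standard multistate NetCDF reporter with a `states` variable of shape (iteration, replica); the file path, variable name, and layout here are assumptions rather than details taken from the attached script:

```python
# count_states_sketch.py -- hypothetical sketch, not the attached script.
# Counts how many distinct thermodynamic states each replica visited,
# assuming a multistate NetCDF file with a 'states' variable
# of shape (iteration, replica).
import sys

import netCDF4
import numpy as np


def count_states_visited(nc_path):
    """Return an array with the number of distinct states visited per replica."""
    with netCDF4.Dataset(nc_path, "r") as nc:
        # states[iteration, replica] = thermodynamic state index of that replica
        states = np.asarray(nc.variables["states"][:]).astype(int)
    n_replicas = states.shape[1]
    return np.array([len(np.unique(states[:, r])) for r in range(n_replicas)])


if __name__ == "__main__":
    counts = count_states_visited(sys.argv[1])
    for replica, n_visited in enumerate(counts):
        print(f"replica {replica}: visited {n_visited} states")
```

A healthy replica-exchange run should show every replica visiting most states; a replica stuck near its starting state (as reported here for the odd-numbered replicas) shows up as a much lower count.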

HershGupta404 (Author) commented:

I'm going to do a few more runs this week to check reproducibility. I'll post the results below.

HershGupta404 (Author) commented Jan 13, 2019

I ran the same system as above on shorter runs (500 iterations instead of 1500) with either 1 GPU or 2 GPUs. It appears that the 2-GPU run is not mixing properly. See the data in the attached files.
state_count_sing.txt
state_count_mult.txt

andrrizzi added this to the yank 1.0 milestone Jan 18, 2019
andrrizzi added this to Bugs To Squish in YANK 1.0 Release Jan 18, 2019
andrrizzi (Contributor) commented:

I can confirm I can reproduce this. I'll dig into what could be happening as soon as I have time.

jchodera (Member) commented:

Great! This should be a high priority bug.

jchodera changed the title from "Different GPUs on the same job show different mixing between replicas." to "MPI bug when multiple GPUs are used per calculation" Feb 8, 2019
HershGupta404 (Author) commented:

I haven't had much time to work on this due to school, but one thing I've noticed is that the log files report approximately the same amount of mixing for a single-GPU run as for a multi-GPU run.
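
(Not from the thread, just a hedged sketch of one way to cross-check the log-reported mixing against the recorded state trajectories: build the empirical state-to-state transition matrix and look at its subdominant eigenvalue, which stays close to 1 when replicas mix slowly. The `states` variable name and file layout are the same assumptions as in the earlier sketch.)

```python
# mixing_check_sketch.py -- hypothetical sketch, assuming the same
# multistate NetCDF layout as above.
import netCDF4
import numpy as np


def empirical_transition_matrix(nc_path):
    """Row-normalized empirical state-to-state transition matrix."""
    with netCDF4.Dataset(nc_path, "r") as nc:
        states = np.asarray(nc.variables["states"][:]).astype(int)  # (iteration, replica)
    n_states = int(states.max()) + 1
    counts = np.zeros((n_states, n_states))
    # Count transitions along each replica's trajectory of state indices.
    for r in range(states.shape[1]):
        traj = states[:, r]
        for i, j in zip(traj[:-1], traj[1:]):
            counts[i, j] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)


def subdominant_eigenvalue(transition_matrix):
    """Second-largest eigenvalue magnitude; values near 1 indicate slow mixing."""
    eigenvalues = np.sort(np.abs(np.linalg.eigvals(transition_matrix)))[::-1]
    return eigenvalues[1] if len(eigenvalues) > 1 else float("nan")
```

If the log-reported statistics look healthy while this eigenvalue (computed from what the replicas actually visited) is near 1, that would be consistent with the reported mismatch between the logs and the observed state counts.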

andrrizzi (Contributor) commented:

Thanks! Apologies I haven't been able to look into this yet, but other projects are keeping me busy. This is my next YANK-related thing to look at as soon as I can spare some time.

andrrizzi (Contributor) commented:

Since we moved the multistate module to openmmtools, I've moved the discussion here: choderalab/openmmtools#449. I would still leave this open until we figure this out.

jchodera (Member) commented Nov 9, 2021

We believe this is addressed (see choderalab/openmmtools#449 (comment)), though we are unclear what resolved the issue.

jchodera closed this as completed Nov 9, 2021