-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes for restart bugs (issue #5) #45
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all looks great to me!
I have also taken some steps to confirm that this resolves issue #5 . In fact, the following steps confirm that both the configuration files and data dumps are completely identical both before and after restarting:
- Make the following modifications to line 26 of
input/checkpoint_ppm-8.in
:- Replacing
schedule { var = "cycle"; }
withschedule { var = "cycle"; list=[10,20];}
- Inserting the line
dir = ["checkpoint_ppm-8-data-%d","cycle"];
somewhere between lines 25 and 28
- Replacing
- Inserting the line
dir = ["checkpoint_ppm-1-data-%d","cycle"];
somewhere between lines 28 and 32 ininput/checkpoint_ppm-1.in
- Install
h5diff
- Download this script and save it as
test_checkpoint.sh
. - From the base directory of the repository separately execute
bash test_checkpoint.sh 1
andbash test_checkpoint.sh 8
I also confirmed that the outputs are identical before and after restarts if we use charmrun ++p 4
instead of charmrun ++local
.
@jobordner could you please fix the merge conflicts so that we can merge this? Thank you! |
I looked into this and the source of the merge conflict is really simple. The PR #53 simply moved This is trivial to resolve. @jobordner would you like me to take care of it? Edit: In fact, I've confirmed that |
@jobordner Out of curiosity, do you have any idea why the tests failed the first time they were run after you made the latest commit? I saw that they were timeout bugs - do you think that they were just an anomaly? |
@mabruzzo The adapt-L5-P1 test hung, which shouldn't be related to any of the changes in this PR. There were also error messages from circleci about violating concurrency limits, maybe it was somehow using more cores than requested at some point during the testing. Both single and double precision tests failed (I think I only checked where in one, however). When I reran it with ssh it still seemed to be slow. Not sure what happened, but it's something to keep an eye on. |
Thanks. I'll file an issue about it so that if that kind of issue pops up again, we'll have a record of it @bwoshea I think that this is ready to be merged |
2 approves, I'm going to merge1 |
This PR addresses two bugs associated with Issue #5: