Fixes for restart bugs (issue #5) #45

Merged: 4 commits merged on Feb 11, 2021

jobordner
Contributor

This PR addresses two bugs associated with Issue #5:

  1. corrupted strings written to parameters.libconfig files after a restart (bug fix by Matthew Abruzzo)
  2. EnzoBlock attribute values not getting written to HDF5 files after restart.

@mabruzzo mabruzzo self-assigned this Jun 9, 2020
@brittonsmith brittonsmith reopened this Jun 9, 2020
Contributor

@mabruzzo mabruzzo left a comment

This all looks great to me!

I have also taken some steps to confirm that this resolves issue #5. In fact, the following steps confirm that the configuration files and data dumps are identical before and after restarting:

  1. Make the following modifications to line 26 of input/checkpoint_ppm-8.in:
    • Replace schedule { var = "cycle"; } with schedule { var = "cycle"; list=[10,20];}
    • Insert the line dir = ["checkpoint_ppm-8-data-%d","cycle"]; somewhere between lines 25 and 28
  2. Insert the line dir = ["checkpoint_ppm-1-data-%d","cycle"]; somewhere between lines 28 and 32 of input/checkpoint_ppm-1.in
  3. Install h5diff
  4. Download this script and save it as test_checkpoint.sh
  5. From the base directory of the repository, separately execute bash test_checkpoint.sh 1 and bash test_checkpoint.sh 8

I also confirmed that the outputs are identical before and after restarts if we use charmrun ++p 4 instead of charmrun ++local.
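
For readers who want to repeat this kind of check by hand, here is a minimal sketch of the comparison idea in Python. It is not the linked test_checkpoint.sh, whose contents are not shown in this thread. The function name differing_files and the directory names in the usage note are hypothetical. The sketch compares two output trees byte for byte, which is stricter than h5diff: HDF5 files can differ in metadata while holding the same datasets and attributes, so any HDF5 file it reports should be rechecked with h5diff.

```python
import filecmp
from pathlib import Path


def differing_files(dir_a, dir_b):
    """Return sorted relative paths that exist in only one tree or differ in content.

    Hypothetical helper for illustration only; it is not part of the
    repository or of the test script mentioned in the comment above.
    """
    root_a, root_b = Path(dir_a), Path(dir_b)
    files_a = {p.relative_to(root_a) for p in root_a.rglob("*") if p.is_file()}
    files_b = {p.relative_to(root_b) for p in root_b.rglob("*") if p.is_file()}

    # Files present in only one of the two trees count as differences.
    mismatched = files_a ^ files_b

    for rel in files_a & files_b:
        # shallow=False forces a byte-by-byte comparison instead of
        # trusting file size and modification time.
        if not filecmp.cmp(root_a / rel, root_b / rel, shallow=False):
            mismatched.add(rel)

    return sorted(str(p) for p in mismatched)
```

Assuming the uninterrupted run wrote its output to a directory named run-reference and the restarted run wrote to run-restarted (both names are placeholders), differing_files("run-reference", "run-restarted") returns an empty list when every file matches.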

@mabruzzo mabruzzo mentioned this pull request Jun 30, 2020
@bwoshea
Contributor

bwoshea commented Jan 15, 2021

@jobordner could you please fix the merge conflicts so that we can merge this? Thank you!

@mabruzzo
Contributor

mabruzzo commented Jan 29, 2021

I looked into this and the source of the merge conflict is really simple. PR #53 moved input/checkpoint_ppm-1.in and input/checkpoint_ppm-8.in to input/Checkpoint/checkpoint_ppm-1.in and input/Checkpoint/checkpoint_ppm-8.in. The include paths were also tweaked accordingly.

This is trivial to resolve. @jobordner would you like me to take care of it?

Edit: In fact, I've confirmed that git merge master will automatically resolve this conflict.

@mabruzzo mabruzzo linked an issue Feb 8, 2021 that may be closed by this pull request
@mabruzzo
Contributor

@jobordner Out of curiosity, do you have any idea why the tests failed the first time they were run after you made the latest commit? I saw that they were timeout failures. Do you think they were just an anomaly?

@jobordner
Contributor Author

@mabruzzo The adapt-L5-P1 test hung, which shouldn't be related to any of the changes in this PR. There were also error messages from CircleCI about violating concurrency limits; maybe it was somehow using more cores than requested at some point during the testing. Both the single and double precision tests failed (though I think I only checked where the failure occurred in one of them). When I reran it with ssh it still seemed to be slow. Not sure what happened, but it's something to keep an eye on.

@mabruzzo
Contributor

mabruzzo commented Feb 11, 2021

Thanks. I'll file an issue about it so that if that kind of problem pops up again, we'll have a record of it.

@bwoshea I think that this is ready to be merged.

@bwoshea
Contributor

bwoshea commented Feb 11, 2021

2 approvals, I'm going to merge!

@bwoshea bwoshea merged commit 345d701 into enzo-project:master Feb 11, 2021
@jobordner jobordner deleted the restart-bugfixes branch July 2, 2022 00:53
Development

Successfully merging this pull request may close these issues.

Data corruption after restart
5 participants