-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generate random run ID, track restarts #2149
Conversation
Improve data provenance tracking, so that sets of runs can be linked to each other, and the run used to generate data identified. Each time the solver is run, generate a random number which is broadcast to all processors. This is stored as "run_id" in: - log files - BOUT.settings, though at the moment only if the run finishes - restart files - dump files When a simulation is restarted, the ID of the run it started from is also recorded as "run_restart_from". If restart files are used which don't have a run_id, then this will either cause the run to fail, or set run_id to 0. The `run_restart_from` ID therefore has two special values: - 1 means no restart, the run was started from scratch - 0 means restart from unknown run_id (missing -> run_id set to 0) There may well be a better way to handle missing run_id values in restart files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️ This would be a really nice feature to have!
@@ -468,8 +468,28 @@ int Solver::solve(int NOUT, BoutReal TIMESTEP) { | |||
throw BoutException(_("Failed to initialise solver-> Aborting\n")); | |||
} | |||
|
|||
// Set the run ID to a random number | |||
run_restart_from = run_id; // Restarting from the previous run ID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to have run_restart_from
be a std::vector<unsigned_int>
, so if you restart from a restart (etc.) you know the IDs of all the previous runs. I think this needs an extra method or two adding to DataFile
to handle std::vector
(or Array
?), but shouldn't be too complicated because DataFormat
already supports C-arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be nice! These might need to be vectors in dump files, as at the moment if a simulation restarts and appends to the .dmp files then the run_id and run_restart_from will be overwritten. In that case possibly run_id
should be a vector of the runs whose outputs are contained in the dump file, and run_restart_from
could be a vector of previous runs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having a go at adding std::vector<int>
to Datafile
s. Hopefully PR coming soon...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #2150, which allows a std::vector<int>
to be added to a Datafile
.
Not sure what to do about simulations that append. In #2150 I've tried to treat a std::vector<int>
generically - I named the dimension "vec<n>" so we could add any number of vectors of any size, but that doesn't make sense if we want to append to them.
It wouldn't be crazy to have a dimension that we append one entry per run to - I'd actually quite like to have that for saving the git hashes of BOUT++ and physics models to - but that would be a bigger change to the Datafile
interface. I hesitate to put more work into Datafile
as we're going to (?) replace it with OptionsNetCDF
in v5. One possible temporary work-around would be to just make run_id
a time-dependent int
variable. It's only a slight waste of disk space and would cure the appending problem. I should think we can put a special case into xbout
's create_restarts()
method to get rid of duplicate entries and turn the time-dependent run_id
into a run_restart_from
vector. Then I guess we skip writing run_restart_from
to dump files if we're appending, as the info would already be there in the original run_restart_from
(if there was one) plus time-dependent run_id
. Finally, open a feature request for whatever I/O system we end up with in v5 to tidy this up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've started work on replacing Datafile
with OptionsNetCDF
, although not very far in yet.
Having a "number of restarts-length" unlimited dimension is a good idea. In the OptionsNetCDF
approach, we also add everything in Options
to the output file. This includes all the input parameters. Given that these could change in a restarted simulation, we would have to make basically everything have this dimension.
An alternative approach might be to create a group structure:
/
|-- time evolving variables
| |-- n
| \-- T
|-- build info
| |-- run001
| | |-- git hash
| | |-- has_petsc
| | ...
| \-- run002
| |-- git hash
| |-- has_petsc
| ...
\-- input parameters
|-- run001
| |-- run_id
| |-- nx
| ...
\-- run002
|-- run_id
|-- nx
...
or perhaps
/
|-- time evolving variables
| |-- n
| \-- T
|-- run001
| |-- run_id
| |-- build info
| | |-- git hash
| | |-- has_petsc
| | ...
| \-- input parameters
| |-- nx
| ...
|-- run001
| |-- run_id
| |-- build info
| | |-- git hash
| | |-- has_petsc
| | ...
| \-- input parameters
| |-- nx
| ...
...
I know that xarray Dataset
doesn't understand groups, but if we know what groups we're expecting xBOUT
can handle it -- flattening the separate run groups for instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I quite like this group structure approach: If the evolving variables are in the root directory as now, and the metadata is in groups, then it should be backwards compatible. Then we're not having to add lots of dimensions to things that probably don't need them most of the time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds like a pretty nice idea. Would need an xbout update, but seems like it should be straightforward and would make things nicer. At the moment xbout puts all scalars into a 'metadata' attribute, because most of them are settings (like nx
, etc.), but there could also be scalar variables which it would be better to keep separate. xarray can load groups from a netCDF file (using the group
kwarg to open_dataset
), so we could instead create the metadata with something like
ds.attrs["metadata"] = open_dataset(\<first_file_paths\>, group="run001/input parameters")
or something a bit fancier to deal with multiple runs that might have the same input parameters (so we can use a single metadata
) or might have changed values (concatenate the Datasets along a 'run' dimension maybe?). Anyway seems like a nice thing to have.
While we're discussing - it might be nice to have the non-time-dependent variables (SAVE_ONCE()
ones) in the run001
, run002
, ..., groups in case they change. I'm not sure if we overwrite or keep the original value when appending at the moment...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All that said, I would love a backportable version of this PR (i.e. works with existing Datafile
), even if (for example) it doesn't support using append
nicely.
https://github.com/mariusbancila/stduuid is a header-only library for creating UUIDs, but requires C++20 features. This commit copies the header, but hacked to work with C++11. When BOUT++ requires C++20, we should replace this header with mariusbancila/stduuid in a submodule.
Keeps Solver::solve() cleaner.
Saving run_id and run_restart_from as time-evolving variables would make it easier to concatenate data from several consecutive runs. A bit hacky, but do not have a better way to create a per-run dimension at the moment.
Test that run_id is a valid UUID, and run_restart_from is equal to the original run_id.
sting_view is C++17, and not used in this version of uuid.h.
* next: (310 commits) Note removing support for HDF5 in changelog Better notation for indexing zShift Update note on zShift in grid-files section of manual Print error for HDF5 macros when running bout-v5-macro-upgrader.py Support macros being removed in bout-v5-macro-upgrader.py Remove HDF5 support Removed bout_runners from repo, and redirected to project site Allow test-options-netcdf on Fedora Add missing brackets in test-options-netcdf Skip test-io when using legacy netCDF interface clang-format FCIMap constructor Add braces around conditional body Pass dy by reference Guard OptionsNetCDF test against legacy netcdf Cache index offsets in FCI boundary loop Invert conditional in FCI boundary calculation Convert nested-loop to BOUT_FOR in FCI Make local variables const Pass dy down into FCIMap Remove some unused code from FCI ...
* next-run-id: Note removing support for HDF5 in changelog Better notation for indexing zShift Update note on zShift in grid-files section of manual Print error for HDF5 macros when running bout-v5-macro-upgrader.py Support macros being removed in bout-v5-macro-upgrader.py Remove HDF5 support
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
Fedora tests are currently failing, but it looks like @dschwoerer is already on it 🚀 : https://bugzilla.redhat.com/show_bug.cgi?id=1924231 |
The other test that's failing is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @bendudson @ZedThree, I think this looks good now! I'll try and have a look at a backport to v4.4.0-alpha sometime...
The only thing nagging me about this PR at the mo is that all this lives in That is for another PR though! I do want to make two more changes: make all the run ID bits |
Actually a getter would be good to have for anything that creates a separate set of output files (e.g. https://github.com/johnomotani/BoutFastOutput that writes some stuff with higher frequency) so the run-id can be added there. Eventually if/when we can have multiple time dimensions in the dump file the need might go away, but for the moment would be useful. |
The Fedora issue is fixed, and the fix was released Yesterday, so as soon as the mirrors catch up, it should work 👍 |
Likewise for run_restart_from Also make createRunID public
clang-tidy review says "All clean, LGTM! 👍" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. One tiny suggestion - defining default_run_id
was a good idea!
Co-authored-by: johnomotani <john.omotani@ukaea.uk>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
I'm not sure if we quite have the full tracking of previous restarts working yet, but getting the group structure of runs is a task for the datafile refactoring |
Yeah, I think full tracking of previous restarts is a job for another PR, after we move I/O to OptionsNetCDF. What we've got here is definitely useful, and back-portable (I'm working on a back-port to 4.4 now). I think backporting full tracking would be a pain. |
Improve data provenance tracking, so that sets of runs can be linked to each other, and the run used to generate data identified.
Each time the solver is run, generate a random number which is broadcast to all processors. This is stored as "run_id" in:
When a simulation is restarted, the ID of the run it started from is also recorded as "run_restart_from".
If restart files are used which don't have a run_id, then this will either cause the run to fail, or set run_id to 0. The
run_restart_from
ID therefore has two special values:There may well be a better way to handle missing run_id values in restart files.