Generate random run ID, track restarts #2149

bendudson · 2020-11-20T17:39:37Z

Improve data provenance tracking, so that sets of runs can be linked to each other, and the run used to generate data identified.

Each time the solver is run, generate a random number which is broadcast to all processors. This is stored as "run_id" in:

log files
BOUT.settings, though at the moment only if the run finishes
restart files
dump files

When a simulation is restarted, the ID of the run it started from is also recorded as "run_restart_from".

If restart files are used which don't have a run_id, then this will either cause the run to fail, or set run_id to 0. The
run_restart_from ID therefore has two special values:

1 means no restart, the run was started from scratch
0 means restart from unknown run_id (missing -> run_id set to 0)

There may well be a better way to handle missing run_id values in restart files.

Improve data provenance tracking, so that sets of runs can be linked to each other, and the run used to generate data identified. Each time the solver is run, generate a random number which is broadcast to all processors. This is stored as "run_id" in: - log files - BOUT.settings, though at the moment only if the run finishes - restart files - dump files When a simulation is restarted, the ID of the run it started from is also recorded as "run_restart_from". If restart files are used which don't have a run_id, then this will either cause the run to fail, or set run_id to 0. The `run_restart_from` ID therefore has two special values: - 1 means no restart, the run was started from scratch - 0 means restart from unknown run_id (missing -> run_id set to 0) There may well be a better way to handle missing run_id values in restart files.

johnomotani

❤️ This would be a really nice feature to have!

src/solver/solver.cxx

johnomotani · 2020-11-20T18:27:51Z

src/solver/solver.cxx

@@ -468,8 +468,28 @@ int Solver::solve(int NOUT, BoutReal TIMESTEP) {
    throw BoutException(_("Failed to initialise solver-> Aborting\n"));
  }

+  // Set the run ID to a random number
+  run_restart_from = run_id; // Restarting from the previous run ID


It would be nice to have run_restart_from be a std::vector<unsigned_int>, so if you restart from a restart (etc.) you know the IDs of all the previous runs. I think this needs an extra method or two adding to DataFile to handle std::vector (or Array?), but shouldn't be too complicated because DataFormat already supports C-arrays.

This would be nice! These might need to be vectors in dump files, as at the moment if a simulation restarts and appends to the .dmp files then the run_id and run_restart_from will be overwritten. In that case possibly run_id should be a vector of the runs whose outputs are contained in the dump file, and run_restart_from could be a vector of previous runs?

I'm having a go at adding std::vector<int> to Datafiles. Hopefully PR coming soon...

See #2150, which allows a std::vector<int> to be added to a Datafile.

Not sure what to do about simulations that append. In #2150 I've tried to treat a std::vector<int> generically - I named the dimension "vec<n>" so we could add any number of vectors of any size, but that doesn't make sense if we want to append to them.

It wouldn't be crazy to have a dimension that we append one entry per run to - I'd actually quite like to have that for saving the git hashes of BOUT++ and physics models to - but that would be a bigger change to the Datafile interface. I hesitate to put more work into Datafile as we're going to (?) replace it with OptionsNetCDF in v5. One possible temporary work-around would be to just make run_id a time-dependent int variable. It's only a slight waste of disk space and would cure the appending problem. I should think we can put a special case into xbout's create_restarts() method to get rid of duplicate entries and turn the time-dependent run_id into a run_restart_from vector. Then I guess we skip writing run_restart_from to dump files if we're appending, as the info would already be there in the original run_restart_from (if there was one) plus time-dependent run_id. Finally, open a feature request for whatever I/O system we end up with in v5 to tidy this up.

I've started work on replacing Datafile with OptionsNetCDF, although not very far in yet.

Having a "number of restarts-length" unlimited dimension is a good idea. In the OptionsNetCDF approach, we also add everything in Options to the output file. This includes all the input parameters. Given that these could change in a restarted simulation, we would have to make basically everything have this dimension.

An alternative approach might be to create a group structure:

/ |-- time evolving variables | |-- n | \-- T |-- build info | |-- run001 | | |-- git hash | | |-- has_petsc | | ... | \-- run002 | |-- git hash | |-- has_petsc | ... \-- input parameters |-- run001 | |-- run_id | |-- nx | ... \-- run002 |-- run_id |-- nx ...

or perhaps

/ |-- time evolving variables | |-- n | \-- T |-- run001 | |-- run_id | |-- build info | | |-- git hash | | |-- has_petsc | | ... | \-- input parameters | |-- nx | ... |-- run001 | |-- run_id | |-- build info | | |-- git hash | | |-- has_petsc | | ... | \-- input parameters | |-- nx | ... ...

I know that xarray Dataset doesn't understand groups, but if we know what groups we're expecting xBOUT can handle it -- flattening the separate run groups for instance.

I quite like this group structure approach: If the evolving variables are in the root directory as now, and the metadata is in groups, then it should be backwards compatible. Then we're not having to add lots of dimensions to things that probably don't need them most of the time.

This sounds like a pretty nice idea. Would need an xbout update, but seems like it should be straightforward and would make things nicer. At the moment xbout puts all scalars into a 'metadata' attribute, because most of them are settings (like nx, etc.), but there could also be scalar variables which it would be better to keep separate. xarray can load groups from a netCDF file (using the group kwarg to open_dataset), so we could instead create the metadata with something like

ds.attrs["metadata"] = open_dataset(\<first_file_paths\>, group="run001/input parameters")

or something a bit fancier to deal with multiple runs that might have the same input parameters (so we can use a single metadata) or might have changed values (concatenate the Datasets along a 'run' dimension maybe?). Anyway seems like a nice thing to have.

While we're discussing - it might be nice to have the non-time-dependent variables (SAVE_ONCE() ones) in the run001, run002, ..., groups in case they change. I'm not sure if we overwrite or keep the original value when appending at the moment...

All that said, I would love a backportable version of this PR (i.e. works with existing Datafile), even if (for example) it doesn't support using append nicely.

https://github.com/mariusbancila/stduuid is a header-only library for creating UUIDs, but requires C++20 features. This commit copies the header, but hacked to work with C++11. When BOUT++ requires C++20, we should replace this header with mariusbancila/stduuid in a submodule.

Keeps Solver::solve() cleaner.

Saving run_id and run_restart_from as time-evolving variables would make it easier to concatenate data from several consecutive runs. A bit hacky, but do not have a better way to create a per-run dimension at the moment.

Test that run_id is a valid UUID, and run_restart_from is equal to the original run_id.

sting_view is C++17, and not used in this version of uuid.h.

* next: (310 commits) Note removing support for HDF5 in changelog Better notation for indexing zShift Update note on zShift in grid-files section of manual Print error for HDF5 macros when running bout-v5-macro-upgrader.py Support macros being removed in bout-v5-macro-upgrader.py Remove HDF5 support Removed bout_runners from repo, and redirected to project site Allow test-options-netcdf on Fedora Add missing brackets in test-options-netcdf Skip test-io when using legacy netCDF interface clang-format FCIMap constructor Add braces around conditional body Pass dy by reference Guard OptionsNetCDF test against legacy netcdf Cache index offsets in FCI boundary loop Invert conditional in FCI boundary calculation Convert nested-loop to BOUT_FOR in FCI Make local variables const Pass dy down into FCIMap Remove some unused code from FCI ...

* next-run-id: Note removing support for HDF5 in changelog Better notation for indexing zShift Update note on zShift in grid-files section of manual Print error for HDF5 macros when running bout-v5-macro-upgrader.py Support macros being removed in bout-v5-macro-upgrader.py Remove HDF5 support

github-actions

clang-tidy made some suggestions

include/bout/solver.hxx

ZedThree · 2021-02-04T11:20:31Z

Fedora tests are currently failing, but it looks like @dschwoerer is already on it 🚀 : https://bugzilla.redhat.com/show_bug.cgi?id=1924231

ZedThree · 2021-02-04T11:25:48Z

The other test that's failing is test-restart-io, with Missing data for run_id in input. Set init_missing=true to set to zero.. I think that's expected, and the test just needs updating

Use UUID for run_id

johnomotani

Thanks @bendudson @ZedThree, I think this looks good now! I'll try and have a look at a backport to v4.4.0-alpha sometime...

ZedThree · 2021-02-05T09:37:04Z

The only thing nagging me about this PR at the mo is that all this lives in Solver, and it doesn't feel quite like it belongs there. I think it's there because Solver::solve actually does a bunch of the simulation setup/teardown, and maybe that would all be better off living somewhere else. I have been thinking about a BoutSim<PhysicsModel> class or something, that basically has BoutInitialise as its constructor and BoutFinalise as its destructor, and we move lots of the non-solver bits out of Solver into that.

That is for another PR though!

I do want to make two more changes: make all the run ID bits private in Solver, rather than protected (I don't think anything else actually needs these members? We can always add a getter if so), and install uuid-dev for clang-tidy-review. One of the Fedora builds passed, so hopefully that's fixed now too.

johnomotani · 2021-02-05T09:56:23Z

Actually a getter would be good to have for anything that creates a separate set of output files (e.g. https://github.com/johnomotani/BoutFastOutput that writes some stuff with higher frequency) so the run-id can be added there. Eventually if/when we can have multiple time dimensions in the dump file the need might go away, but for the moment would be useful.

dschwoerer · 2021-02-05T11:21:22Z

The Fedora issue is fixed, and the fix was released Yesterday, so as soon as the mirrors catch up, it should work 👍

Likewise for run_restart_from Also make createRunID public

github-actions · 2021-02-05T12:12:53Z

clang-tidy review says "All clean, LGTM! 👍"

include/bout/solver.hxx

github-actions

clang-tidy made some suggestions

src/solver/solver.cxx

tests/unit/solver/test_solver.cxx

johnomotani

Looks good to me. One tiny suggestion - defining default_run_id was a good idea!

src/solver/solver.cxx

Co-authored-by: johnomotani <john.omotani@ukaea.uk>

johnomotani

👍

ZedThree · 2021-02-09T15:55:56Z

I'm not sure if we quite have the full tracking of previous restarts working yet, but getting the group structure of runs is a task for the datafile refactoring

johnomotani · 2021-02-09T15:57:35Z

Yeah, I think full tracking of previous restarts is a job for another PR, after we move I/O to OptionsNetCDF. What we've got here is definitely useful, and back-portable (I'm working on a back-port to 4.4 now). I think backporting full tracking would be a pain.

bendudson added the work in progress Not ready for merging label Nov 20, 2020

johnomotani reviewed Nov 20, 2020

View reviewed changes

johnomotani added the backport candidiate Does not break backward compatibility, so can be back-ported to v4.4 label Nov 21, 2020

johnomotani mentioned this pull request Nov 21, 2020

I/O for std::vector<int> and std::vector<char> #2150

Closed

ZedThree mentioned this pull request Nov 24, 2020

DataFile rewrite design #1254

Closed

johnomotani added 6 commits January 2, 2021 23:42

Merge branch 'fix-string-read' into next-run-id

d46c7e3

Merge branch 'fix-descriptions' into next-run-id

b0744d9

Use a UUID for Solver::run_id

4ea48b7

Move UUID creation into a utility method Solver::createRunId()

c833d71

Keeps Solver::solve() cleaner.

johnomotani mentioned this pull request Jan 3, 2021

UUID for run_id #2192

Merged

johnomotani added 4 commits January 3, 2021 16:36

Create run_id in test-restart-io and test-restart-io_hdf5

afbcc3a

Test run_id, run_restart_from in test-restart-io, test-restart-io_hdf5

1ee6ab7

Test that run_id is a valid UUID, and run_restart_from is equal to the original run_id.

Remove #include <string_view>

fb24438

sting_view is C++17, and not used in this version of uuid.h.

Merge branch 'next' into uuid-run-id

4047181

This was referenced Jan 18, 2021

Add UUIDs to grid files boutproject/hypnotoad#66

Closed

Create a UUID for each grid file boutproject/hypnotoad#67

Merged

ZedThree added 2 commits February 4, 2021 10:53

github-actions bot reviewed Feb 4, 2021

View reviewed changes

include/bout/solver.hxx Outdated Show resolved Hide resolved

include/bout/solver.hxx Outdated Show resolved Hide resolved

include/bout/solver.hxx Outdated Show resolved Hide resolved

include/bout/solver.hxx Outdated Show resolved Hide resolved

ZedThree and others added 6 commits February 4, 2021 13:31

Add configure option to use system UUID generator

ba2c968

Use system UUID generator for run ID if available

dbe31cf

[clang-format-command] fixes

ecadc41

Fix build when system UUID generator isn't available

99389b4

Move UUID header include to solver implementation file

08f3f26

Ignore run IDs in test-squash

6426c09

ZedThree added 4 commits February 4, 2021 16:54

Preserve formatting in uuid header

f13cc4c

CMake: Fix fallback when libuuid not available

04f0274

Don't use const_cast to broadcast run_id

4b705e3

Merge pull request #2192 from boutproject/uuid-run-id

59ff2b5

Use UUID for run_id

johnomotani previously approved these changes Feb 4, 2021

View reviewed changes

ZedThree added 2 commits February 5, 2021 12:05

Make solver run_id private with public getter

2769748

Likewise for run_restart_from Also make createRunID public

Install uuid-dev for clang-tidy-review

d6f5143

ZedThree dismissed johnomotani’s stale review via d6f5143 February 5, 2021 12:06

johnomotani reviewed Feb 5, 2021

View reviewed changes

include/bout/solver.hxx Outdated Show resolved Hide resolved

ZedThree added 4 commits February 5, 2021 13:04

Make Solver::createRunID private

79baa87

Make run_id getter throw if it hasn't been set yet

43c2fa9

Fix uuids::uuid::is_valid_uuid for C++14

b13d0a5

Add unit tests for Solver::getRunID and getRunRestartFrom

59d14b7

github-actions bot reviewed Feb 5, 2021

View reviewed changes

johnomotani previously approved these changes Feb 5, 2021

View reviewed changes

src/solver/solver.cxx Outdated Show resolved Hide resolved

Fix check of run_restart_from being default value

e8069f4

Co-authored-by: johnomotani <john.omotani@ukaea.uk>

ZedThree dismissed johnomotani’s stale review via e8069f4 February 8, 2021 17:56

bendudson changed the title ~~WIP: Generate random run ID, track restarts~~ Generate random run ID, track restarts Feb 9, 2021

bendudson removed the work in progress Not ready for merging label Feb 9, 2021

johnomotani approved these changes Feb 9, 2021

View reviewed changes

ZedThree merged commit 807e501 into next Feb 9, 2021

ZedThree deleted the next-run-id branch February 9, 2021 15:54

This was referenced Feb 10, 2021

Only use save_repeat_run_id for dump files #2223

Merged

Generate random run ID, track restarts (4.4) #2224

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Generate random run ID, track restarts #2149

Generate random run ID, track restarts #2149

bendudson commented Nov 20, 2020

johnomotani left a comment

johnomotani Nov 20, 2020

bendudson Nov 21, 2020

johnomotani Nov 21, 2020

johnomotani Nov 21, 2020 •

edited

ZedThree Nov 23, 2020

bendudson Nov 23, 2020

johnomotani Nov 23, 2020

johnomotani Nov 23, 2020

github-actions bot left a comment

ZedThree commented Feb 4, 2021

ZedThree commented Feb 4, 2021

johnomotani left a comment

ZedThree commented Feb 5, 2021

johnomotani commented Feb 5, 2021

dschwoerer commented Feb 5, 2021

github-actions bot commented Feb 5, 2021

github-actions bot left a comment

johnomotani left a comment

johnomotani left a comment

ZedThree commented Feb 9, 2021

johnomotani commented Feb 9, 2021

Generate random run ID, track restarts #2149

Generate random run ID, track restarts #2149

Conversation

bendudson commented Nov 20, 2020

johnomotani left a comment

Choose a reason for hiding this comment

johnomotani Nov 20, 2020

Choose a reason for hiding this comment

bendudson Nov 21, 2020

Choose a reason for hiding this comment

johnomotani Nov 21, 2020

Choose a reason for hiding this comment

johnomotani Nov 21, 2020 • edited

Choose a reason for hiding this comment

ZedThree Nov 23, 2020

Choose a reason for hiding this comment

bendudson Nov 23, 2020

Choose a reason for hiding this comment

johnomotani Nov 23, 2020

Choose a reason for hiding this comment

johnomotani Nov 23, 2020

Choose a reason for hiding this comment

github-actions bot left a comment

Choose a reason for hiding this comment

ZedThree commented Feb 4, 2021

ZedThree commented Feb 4, 2021

johnomotani left a comment

Choose a reason for hiding this comment

ZedThree commented Feb 5, 2021

johnomotani commented Feb 5, 2021

dschwoerer commented Feb 5, 2021

github-actions bot commented Feb 5, 2021

github-actions bot left a comment

Choose a reason for hiding this comment

johnomotani left a comment

Choose a reason for hiding this comment

johnomotani left a comment

Choose a reason for hiding this comment

ZedThree commented Feb 9, 2021

johnomotani commented Feb 9, 2021

johnomotani Nov 21, 2020 •

edited