Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate random run ID, track restarts #2149

Merged
merged 30 commits into from
Feb 9, 2021
Merged

Generate random run ID, track restarts #2149

merged 30 commits into from
Feb 9, 2021

Conversation

bendudson
Copy link
Contributor

Improve data provenance tracking, so that sets of runs can be linked to each other, and the run used to generate data identified.

Each time the solver is run, generate a random number which is broadcast to all processors. This is stored as "run_id" in:

  • log files
  • BOUT.settings, though at the moment only if the run finishes
  • restart files
  • dump files

When a simulation is restarted, the ID of the run it started from is also recorded as "run_restart_from".

If restart files are used which don't have a run_id, then this will either cause the run to fail, or set run_id to 0. The
run_restart_from ID therefore has two special values:

  • 1 means no restart, the run was started from scratch
  • 0 means restart from unknown run_id (missing -> run_id set to 0)

There may well be a better way to handle missing run_id values in restart files.

Improve data provenance tracking, so that sets of runs can be linked
to each other, and the run used to generate data identified.

Each time the solver is run, generate a random number which is
broadcast to all processors. This is stored as "run_id" in:
- log files
- BOUT.settings, though at the moment only if the run finishes
- restart files
- dump files

When a simulation is restarted, the ID of the run it started from is
also recorded as "run_restart_from".

If restart files are used which don't have a run_id, then this will
either cause the run to fail, or set run_id to 0. The
`run_restart_from` ID therefore has two special values:
- 1 means no restart, the run was started from scratch
- 0 means restart from unknown run_id (missing -> run_id set to 0)

There may well be a better way to handle missing run_id values in
restart files.
@bendudson bendudson added the work in progress Not ready for merging label Nov 20, 2020
Copy link
Contributor

@johnomotani johnomotani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ This would be a really nice feature to have!

src/solver/solver.cxx Outdated Show resolved Hide resolved
@@ -468,8 +468,28 @@ int Solver::solve(int NOUT, BoutReal TIMESTEP) {
throw BoutException(_("Failed to initialise solver-> Aborting\n"));
}

// Set the run ID to a random number
run_restart_from = run_id; // Restarting from the previous run ID
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have run_restart_from be a std::vector<unsigned_int>, so if you restart from a restart (etc.) you know the IDs of all the previous runs. I think this needs an extra method or two adding to DataFile to handle std::vector (or Array?), but shouldn't be too complicated because DataFormat already supports C-arrays.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be nice! These might need to be vectors in dump files, as at the moment if a simulation restarts and appends to the .dmp files then the run_id and run_restart_from will be overwritten. In that case possibly run_id should be a vector of the runs whose outputs are contained in the dump file, and run_restart_from could be a vector of previous runs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having a go at adding std::vector<int> to Datafiles. Hopefully PR coming soon...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #2150, which allows a std::vector<int> to be added to a Datafile.

Not sure what to do about simulations that append. In #2150 I've tried to treat a std::vector<int> generically - I named the dimension "vec<n>" so we could add any number of vectors of any size, but that doesn't make sense if we want to append to them.

It wouldn't be crazy to have a dimension that we append one entry per run to - I'd actually quite like to have that for saving the git hashes of BOUT++ and physics models to - but that would be a bigger change to the Datafile interface. I hesitate to put more work into Datafile as we're going to (?) replace it with OptionsNetCDF in v5. One possible temporary work-around would be to just make run_id a time-dependent int variable. It's only a slight waste of disk space and would cure the appending problem. I should think we can put a special case into xbout's create_restarts() method to get rid of duplicate entries and turn the time-dependent run_id into a run_restart_from vector. Then I guess we skip writing run_restart_from to dump files if we're appending, as the info would already be there in the original run_restart_from (if there was one) plus time-dependent run_id. Finally, open a feature request for whatever I/O system we end up with in v5 to tidy this up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've started work on replacing Datafile with OptionsNetCDF, although not very far in yet.

Having a "number of restarts-length" unlimited dimension is a good idea. In the OptionsNetCDF approach, we also add everything in Options to the output file. This includes all the input parameters. Given that these could change in a restarted simulation, we would have to make basically everything have this dimension.

An alternative approach might be to create a group structure:

/
|-- time evolving variables
| |-- n
| \-- T
|-- build info
| |-- run001
| | |-- git hash
| | |-- has_petsc
| | ...
| \-- run002
|   |-- git hash
|   |-- has_petsc
|   ...
\-- input parameters
  |-- run001
  | |-- run_id
  | |-- nx
  | ...
  \-- run002
    |-- run_id
    |-- nx
    ...

or perhaps

/
|-- time evolving variables
| |-- n
| \-- T
|-- run001
| |-- run_id
| |-- build info
| | |-- git hash
| | |-- has_petsc
| | ...
| \-- input parameters
|   |-- nx
|   ...
|-- run001
| |-- run_id
| |-- build info
| | |-- git hash
| | |-- has_petsc
| | ...
| \-- input parameters
|   |-- nx
|   ...
...

I know that xarray Dataset doesn't understand groups, but if we know what groups we're expecting xBOUT can handle it -- flattening the separate run groups for instance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I quite like this group structure approach: If the evolving variables are in the root directory as now, and the metadata is in groups, then it should be backwards compatible. Then we're not having to add lots of dimensions to things that probably don't need them most of the time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like a pretty nice idea. Would need an xbout update, but seems like it should be straightforward and would make things nicer. At the moment xbout puts all scalars into a 'metadata' attribute, because most of them are settings (like nx, etc.), but there could also be scalar variables which it would be better to keep separate. xarray can load groups from a netCDF file (using the group kwarg to open_dataset), so we could instead create the metadata with something like

ds.attrs["metadata"] = open_dataset(\<first_file_paths\>, group="run001/input parameters")

or something a bit fancier to deal with multiple runs that might have the same input parameters (so we can use a single metadata) or might have changed values (concatenate the Datasets along a 'run' dimension maybe?). Anyway seems like a nice thing to have.

While we're discussing - it might be nice to have the non-time-dependent variables (SAVE_ONCE() ones) in the run001, run002, ..., groups in case they change. I'm not sure if we overwrite or keep the original value when appending at the moment...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All that said, I would love a backportable version of this PR (i.e. works with existing Datafile), even if (for example) it doesn't support using append nicely.

@johnomotani johnomotani added the backport candidiate Does not break backward compatibility, so can be back-ported to v4.4 label Nov 21, 2020
https://github.com/mariusbancila/stduuid is a header-only library for
creating UUIDs, but requires C++20 features. This commit copies the
header, but hacked to work with C++11. When BOUT++ requires C++20, we
should replace this header with mariusbancila/stduuid in a submodule.
Saving run_id and run_restart_from as time-evolving variables would make
it easier to concatenate data from several consecutive runs. A bit
hacky, but do not have a better way to create a per-run dimension at the
moment.
@johnomotani johnomotani mentioned this pull request Jan 3, 2021
Test that run_id is a valid UUID, and run_restart_from is equal to the
original run_id.
sting_view is C++17, and not used in this version of uuid.h.
* next: (310 commits)
  Note removing support for HDF5 in changelog
  Better notation for indexing zShift
  Update note on zShift in grid-files section of manual
  Print error for HDF5 macros when running bout-v5-macro-upgrader.py
  Support macros being removed in bout-v5-macro-upgrader.py
  Remove HDF5 support
  Removed bout_runners from repo, and redirected to project site
  Allow test-options-netcdf on Fedora
  Add missing brackets in test-options-netcdf
  Skip test-io when using legacy netCDF interface
  clang-format FCIMap constructor
  Add braces around conditional body
  Pass dy by reference
  Guard OptionsNetCDF test against legacy netcdf
  Cache index offsets in FCI boundary loop
  Invert conditional in FCI boundary calculation
  Convert nested-loop to BOUT_FOR in FCI
  Make local variables const
  Pass dy down into FCIMap
  Remove some unused code from FCI
  ...
* next-run-id:
  Note removing support for HDF5 in changelog
  Better notation for indexing zShift
  Update note on zShift in grid-files section of manual
  Print error for HDF5 macros when running bout-v5-macro-upgrader.py
  Support macros being removed in bout-v5-macro-upgrader.py
  Remove HDF5 support
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

include/bout/solver.hxx Outdated Show resolved Hide resolved
include/bout/solver.hxx Outdated Show resolved Hide resolved
include/bout/solver.hxx Outdated Show resolved Hide resolved
include/bout/solver.hxx Outdated Show resolved Hide resolved
@ZedThree
Copy link
Member

ZedThree commented Feb 4, 2021

Fedora tests are currently failing, but it looks like @dschwoerer is already on it 🚀 : https://bugzilla.redhat.com/show_bug.cgi?id=1924231

@ZedThree
Copy link
Member

ZedThree commented Feb 4, 2021

The other test that's failing is test-restart-io, with Missing data for run_id in input. Set init_missing=true to set to zero.. I think that's expected, and the test just needs updating

johnomotani
johnomotani previously approved these changes Feb 4, 2021
Copy link
Contributor

@johnomotani johnomotani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bendudson @ZedThree, I think this looks good now! I'll try and have a look at a backport to v4.4.0-alpha sometime...

@ZedThree
Copy link
Member

ZedThree commented Feb 5, 2021

The only thing nagging me about this PR at the mo is that all this lives in Solver, and it doesn't feel quite like it belongs there. I think it's there because Solver::solve actually does a bunch of the simulation setup/teardown, and maybe that would all be better off living somewhere else. I have been thinking about a BoutSim<PhysicsModel> class or something, that basically has BoutInitialise as its constructor and BoutFinalise as its destructor, and we move lots of the non-solver bits out of Solver into that.

That is for another PR though!

I do want to make two more changes: make all the run ID bits private in Solver, rather than protected (I don't think anything else actually needs these members? We can always add a getter if so), and install uuid-dev for clang-tidy-review. One of the Fedora builds passed, so hopefully that's fixed now too.

@johnomotani
Copy link
Contributor

Actually a getter would be good to have for anything that creates a separate set of output files (e.g. https://github.com/johnomotani/BoutFastOutput that writes some stuff with higher frequency) so the run-id can be added there. Eventually if/when we can have multiple time dimensions in the dump file the need might go away, but for the moment would be useful.

@dschwoerer
Copy link
Contributor

The Fedora issue is fixed, and the fix was released Yesterday, so as soon as the mirrors catch up, it should work 👍

@github-actions
Copy link
Contributor

github-actions bot commented Feb 5, 2021

clang-tidy review says "All clean, LGTM! 👍"

include/bout/solver.hxx Outdated Show resolved Hide resolved
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

src/solver/solver.cxx Show resolved Hide resolved
src/solver/solver.cxx Show resolved Hide resolved
tests/unit/solver/test_solver.cxx Show resolved Hide resolved
tests/unit/solver/test_solver.cxx Show resolved Hide resolved
tests/unit/solver/test_solver.cxx Show resolved Hide resolved
tests/unit/solver/test_solver.cxx Show resolved Hide resolved
tests/unit/solver/test_solver.cxx Show resolved Hide resolved
tests/unit/solver/test_solver.cxx Show resolved Hide resolved
tests/unit/solver/test_solver.cxx Show resolved Hide resolved
tests/unit/solver/test_solver.cxx Show resolved Hide resolved
johnomotani
johnomotani previously approved these changes Feb 5, 2021
Copy link
Contributor

@johnomotani johnomotani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. One tiny suggestion - defining default_run_id was a good idea!

src/solver/solver.cxx Outdated Show resolved Hide resolved
Co-authored-by: johnomotani <john.omotani@ukaea.uk>
@bendudson bendudson changed the title WIP: Generate random run ID, track restarts Generate random run ID, track restarts Feb 9, 2021
@bendudson bendudson removed the work in progress Not ready for merging label Feb 9, 2021
Copy link
Contributor

@johnomotani johnomotani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@ZedThree ZedThree merged commit 807e501 into next Feb 9, 2021
@ZedThree ZedThree deleted the next-run-id branch February 9, 2021 15:54
@ZedThree
Copy link
Member

ZedThree commented Feb 9, 2021

I'm not sure if we quite have the full tracking of previous restarts working yet, but getting the group structure of runs is a task for the datafile refactoring

@johnomotani
Copy link
Contributor

Yeah, I think full tracking of previous restarts is a job for another PR, after we move I/O to OptionsNetCDF. What we've got here is definitely useful, and back-portable (I'm working on a back-port to 4.4 now). I think backporting full tracking would be a pain.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport candidiate Does not break backward compatibility, so can be back-ported to v4.4
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants