Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correctly handle deletion of empty datasets #248

Merged
merged 1 commit into from
Jun 5, 2023

Conversation

peytondmurray
Copy link
Collaborator

@peytondmurray peytondmurray commented Jun 1, 2023

This PR modifies how empty datasets are deleted, closing #244.

Changes

  • As I was debugging, I broke out function definitions from inside delete_versions and _recreate_raw_data to help separate logic a little more. I don't think we gain anything by redefining _walk, _get_np_fillvalue, and _delete_dataset during each call, but I can change this back if it's just confusing.
  • Added a bunch of type annotations for the functions I touched, it really helped with debugging once I knew what types different objects were.
  • When the user deletes one or more versions, and those versions contain one or more datasets which don't exist in the versions being kept (and therefore the raw data needs to be deleted), we now skip recreating the raw data, recalculating hash tables, and recreating the virtual datasets associated with the dataset.
    This is done because empty datasets cannot be virtual; see 85507e9. In other words, empty datasets at _version_data/versions/<version>/<name> are real, and since they don't refer to data in _version_data/<name>/raw_data as they would if they held actual data, they can safely just be deleted. As far as I can tell, this is a limitation of h5py and hdf5 being unable to store empty virtual datasets.
  • Added a test to cover the reproducer from Error when combining delete_versions and h5repack on empty Dataset #244.

@peytondmurray peytondmurray force-pushed the fix-empty-dataset-issue branch 3 times, most recently from 9c166a5 to 8b8e9f6 Compare June 1, 2023 19:00
@peytondmurray peytondmurray changed the title [WIP] Correctly handle deletion of empty datasets Correctly handle deletion of empty datasets Jun 1, 2023
@peytondmurray peytondmurray marked this pull request as ready for review June 1, 2023 19:21
versioned_hdf5/replay.py Outdated Show resolved Hide resolved
versioned_hdf5/replay.py Outdated Show resolved Hide resolved
@asmeurer
Copy link
Collaborator

asmeurer commented Jun 2, 2023

It's been a while, but it seems correct. If you want to be extra sure you could try testing more extensive combinations of different versions including or not including the dataset with and without non-fill data.

I do rather wish I had made use of hypothesis when developing this library (#239), as its state machine strategies would be a perfect fit for extensively testing this sort of thing. It's something I would heavily suggest adding if we ever decide to add any significant new functionality.

@peytondmurray
Copy link
Collaborator Author

I really like the idea of making hypothesis do the heavy lifting for the tests. I'm going to try to spend a little more time doing some general upkeep on this library, so there's a good chance I'll do this in the future.

Copy link
Collaborator

@ArvidJB ArvidJB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I defer to @asmeurer, looks good to me, though!

Your observation that empty datasets are "special" and cannot be virtual seems to explain what's going on here. I think whenever we ran into this problem it did involve versions which were empty datasets.

@peytondmurray peytondmurray merged commit 0f75f42 into deshaw:master Jun 5, 2023
6 checks passed
@peytondmurray peytondmurray deleted the fix-empty-dataset-issue branch June 5, 2023 20:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants