
Add dataset integrity check using hash for internal datasets #151

Merged
merged 10 commits into internal_datasets from issue-138 on Nov 24, 2023

Conversation

ostreech1997
Collaborator

Before submitting (must do checklist)

  • Did you read the contribution guide?
  • Did you update the docs? We use NumPy format for all the methods and classes.
  • Did you write any new necessary tests?
  • Did you update the CHANGELOG?

Proposed Changes

Closing issues

Closes #138.


codecov bot commented Nov 22, 2023

Codecov Report

Attention: 6 lines in your changes are missing coverage. Please review.

Comparison is base (dd1ace5) 88.93% compared to head (8df44cf) 88.95%.
Report is 2 commits behind head on internal_datasets.

Files                                Patch %   Lines
etna/datasets/internal_datasets.py   84.61%    6 Missing ⚠️
Additional details and impacted files
@@                  Coverage Diff                  @@
##           internal_datasets     #151      +/-   ##
=====================================================
+ Coverage              88.93%   88.95%   +0.01%     
=====================================================
  Files                    205      205              
  Lines                  13026    13042      +16     
=====================================================
+ Hits                   11585    11601      +16     
  Misses                  1441     1441              

☔ View full report in Codecov by Sentry.


github-actions bot commented Nov 22, 2023

🚀 Deployed on https://deploy-preview-151--etna-docs.netlify.app

github-actions bot temporarily deployed to pull request November 22, 2023 07:30 (Inactive)
CHANGELOG.md (outdated, resolved)
etna/datasets/internal_datasets.py (outdated, resolved)
etna/datasets/internal_datasets.py (outdated, resolved)
)
data, dataset_hash = read_dataset(dataset_path=dataset_dir / f"{name}_{parts_[0]}.csv.gz")
if dataset_hash != datasets_dict[name]["hash"][parts_[0]]:
    warnings.warn(
Collaborator

As I can see, there are two possible reasons for a hash difference:

  • The user saved the dataset, then updated the library, and the saved version differs from the version the library now expects, for example because we added some additional preprocessing before saving. In that case the dataset should be reloaded to fix the issue.
  • The user saved a dataset that is newer than the one the installed library expects. In that case the library could be updated to fix the issue (if there is such an update).

I think the current explanation doesn't give enough info.

Collaborator Author

What do you think about this warning message?

warnings.warn(
    f"Local hash and expected hash are different for {name} dataset part {part}. "
    "This could be because the dataset is prepared differently in the new version of the library; "
    "in that case you can try setting the argument 'rebuild_dataset=True'. Another reason could be that the data "
    "in the source has changed and the current version of the library does not know about it; "
    "in that case you can try to update the library."
)

Collaborator

Suggestion:

Local hash and expected hash are different for {name} dataset part {part}.
The first possible reason is that the local copy of the dataset is out of date. In this case you can try setting rebuild_dataset=True to rebuild the dataset.
The second possible reason is that the local copy of the dataset reflects a more recent version of the data than your version of the library. In this case you can try updating the library version.
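
For readers following the thread, here is a minimal, self-contained sketch of the kind of check discussed above. It is an illustration only, not etna's actual implementation: the helper names (read_dataset, check_dataset_hash, EXPECTED_HASHES) and the choice of md5 over the raw file bytes are assumptions.

import hashlib
import warnings
from pathlib import Path

import pandas as pd

# Hypothetical registry of expected hashes per dataset part (illustrative values only).
EXPECTED_HASHES = {"example_dataset": {"full": "d41d8cd98f00b204e9800998ecf8427e"}}


def read_dataset(dataset_path: Path) -> tuple[pd.DataFrame, str]:
    """Read a saved dataset part and compute the hash of its raw bytes."""
    dataset_hash = hashlib.md5(dataset_path.read_bytes()).hexdigest()
    data = pd.read_csv(dataset_path, compression="gzip")
    return data, dataset_hash


def check_dataset_hash(name: str, part: str, dataset_dir: Path) -> pd.DataFrame:
    """Load a dataset part and warn if its hash differs from the expected one."""
    data, dataset_hash = read_dataset(dataset_path=dataset_dir / f"{name}_{part}.csv.gz")
    if dataset_hash != EXPECTED_HASHES[name][part]:
        warnings.warn(
            f"Local hash and expected hash are different for {name} dataset part {part}. "
            "The first possible reason is that the local copy of the dataset is out of date; "
            "in this case you can try setting rebuild_dataset=True to rebuild the dataset. "
            "The second possible reason is that the local copy reflects a more recent version "
            "of the data than your version of the library; in this case you can try updating "
            "the library version."
        )
    return data

Hashing the raw compressed bytes (rather than the parsed DataFrame) keeps the check cheap and independent of pandas parsing behavior; the trade-off is that any change to the compression settings also changes the hash.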

github-actions bot temporarily deployed to pull request November 22, 2023 10:51 (Inactive)
github-actions bot temporarily deployed to pull request November 22, 2023 22:48 (Inactive)
github-actions bot temporarily deployed to pull request November 23, 2023 06:51 (Inactive)
github-actions bot temporarily deployed to pull request November 24, 2023 01:43 (Inactive)
github-actions bot temporarily deployed to pull request November 24, 2023 06:19 (Inactive)
ostreech1997 merged commit 6fc62be into internal_datasets on Nov 24, 2023
15 checks passed
ostreech1997 deleted the issue-138 branch on November 24, 2023 09:07