Add dataset integrity check using hash for internal datasets #151
Conversation
Codecov Report

@@           Coverage Diff            @@
##  internal_datasets    #151   +/-  ##
=======================================
+ Coverage    88.93%   88.95%   +0.01%
=======================================
  Files          205      205
  Lines        13026    13042      +16
=======================================
+ Hits         11585    11601      +16
  Misses        1441     1441

View full report in Codecov by Sentry.
🚀 Deployed on https://deploy-preview-151--etna-docs.netlify.app
data, dataset_hash = read_dataset(dataset_path=dataset_dir / f"{name}_{parts_[0]}.csv.gz")
if dataset_hash != datasets_dict[name]["hash"][parts_[0]]:
    warnings.warn(
As I can see, there are two possible reasons for a hash difference:
- The user saved the dataset and then updated the library, and there is a difference between the saved version and the version the library expects. For example, because we added some additional preprocessing before saving. In that case the dataset should be reloaded to fix the issue.
- The user saved a dataset that reflects newer source data than the installed library knows about. In that case updating the library could fix the issue (if such an update exists).
I think the current explanation doesn't give enough info.
What do you think about this warning message?
warnings.warn(
f"Local hash and expected hash are different for {name} dataset part {part}. "
"This could be because the dataset is prepared differently in the new version of the library, "
"then you can try setting the argument 'rebuild_dataset=True'. Another reason could be that the data "
"in the source has changed and the current version of the library does not know about it, "
    "then you can try to update the library."
)
Suggestion:
Local hash and expected hash are different for {name} record part {part}.
The first possible reason is that the local copy of the dataset is out of date. In this case you can try setting rebuild_dataset=True to rebuild the dataset.
The second possible reason is that the local copy of the dataset reflects a more recent version of the data than your version of the library. In this case you can try updating the library version.
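To make the discussion above concrete, here is a minimal sketch of how such an integrity check could work: hash the raw bytes of the saved dataset file and compare the digest against the expected value, emitting a warning on mismatch. The `file_hash` and `check_dataset_hash` helpers and the MD5 choice are assumptions for illustration, not the library's actual implementation.

```python
import hashlib
import warnings
from pathlib import Path


def file_hash(path: Path) -> str:
    """Compute the MD5 digest of a file's raw bytes, reading in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            md5.update(chunk)
    return md5.hexdigest()


def check_dataset_hash(path: Path, expected_hash: str, name: str, part: str) -> bool:
    """Warn (and return False) if the local file's hash differs from the expected one."""
    local_hash = file_hash(path)
    if local_hash != expected_hash:
        warnings.warn(
            f"Local hash and expected hash are different for {name} dataset part {part}. "
            "The local copy may be out of date (try rebuilding the dataset), or it may "
            "reflect newer source data than this library version knows about (try updating "
            "the library)."
        )
        return False
    return True
```

Hashing the compressed file's raw bytes (rather than the parsed DataFrame) keeps the check cheap and independent of pandas parsing behavior, at the cost of flagging byte-level differences that may not change the parsed data.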
Before submitting (must do checklist)
Proposed Changes
Closing issues
Closes #138.