
Add dataset integrity check using hash for internal datasets #151

Merged
merged 10 commits into internal_datasets from issue-138 on Nov 24, 2023

Conversation

ostreech1997
Collaborator

Before submitting (must do checklist)

  • Did you read the contribution guide?
  • Did you update the docs? We use NumPy format for all the methods and classes.
  • Did you write any new necessary tests?
  • Did you update the CHANGELOG?

Proposed Changes

Closing issues

Closes #138.


codecov bot commented Nov 22, 2023

Codecov Report

Attention: 6 lines in your changes are missing coverage. Please review.

Comparison is base (dd1ace5) 88.93% compared to head (8df44cf) 88.95%.
Report is 2 commits behind head on internal_datasets.

Files                                Patch %   Lines
etna/datasets/internal_datasets.py   84.61%    6 Missing ⚠️
Additional details and impacted files
@@                  Coverage Diff                  @@
##           internal_datasets     #151      +/-   ##
=====================================================
+ Coverage              88.93%   88.95%   +0.01%     
=====================================================
  Files                    205      205              
  Lines                  13026    13042      +16     
=====================================================
+ Hits                   11585    11601      +16     
  Misses                  1441     1441              

☔ View full report in Codecov by Sentry.


github-actions bot commented Nov 22, 2023

🚀 Deployed on https://deploy-preview-151--etna-docs.netlify.app

github-actions bot temporarily deployed to pull request November 22, 2023 07:30 (Inactive)
CHANGELOG.md (outdated, resolved)
etna/datasets/internal_datasets.py (outdated, resolved)
etna/datasets/internal_datasets.py (outdated, resolved)
)
data, dataset_hash = read_dataset(dataset_path=dataset_dir / f"{name}_{parts_[0]}.csv.gz")
if dataset_hash != datasets_dict[name]["hash"][parts_[0]]:
    warnings.warn(
Collaborator

As I can see, there are two possible reasons for a hash difference:

  • The user saved the dataset, then updated the library, and the saved version differs from the version the library now expects, for example because we added some additional preprocessing before saving. In that case the dataset should be reloaded to fix the issue.
  • The user saved a dataset that is newer than the one the installed library expects. In that case the library could be updated to fix the issue (if there is such an update).

I think the current explanation doesn't give enough info.

Collaborator Author

What do you think about this warning message?

warnings.warn(
    f"Local hash and expected hash are different for {name} dataset part {part}. "
    "This could be because the dataset is prepared differently in the new version of the library; "
    "in that case you can try setting the argument 'rebuild_dataset=True'. Another reason could be that the data "
    "in the source has changed and the current version of the library does not know about it; "
    "in that case you can try to update the library."
)

Collaborator

Suggestion:

Local hash and expected hash are different for {name} dataset part {part}.
The first possible reason is that the local copy of the dataset is out of date. In this case you can try setting rebuild_dataset=True to rebuild the dataset.
The second possible reason is that the local copy of the dataset reflects a more recent version of the data than your version of the library. In this case you can try updating the library version.
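
For readers following the thread, here is a minimal, self-contained sketch of the kind of check discussed above. It is an illustration only, not etna's actual implementation: the helper names (read_dataset, check_dataset_hash, EXPECTED_HASHES) and the choice of md5 over the raw file bytes are assumptions.

import hashlib
import warnings
from pathlib import Path

import pandas as pd

# Hypothetical registry of expected hashes per dataset part (illustrative values only).
EXPECTED_HASHES = {"example_dataset": {"full": "d41d8cd98f00b204e9800998ecf8427e"}}


def read_dataset(dataset_path: Path) -> tuple[pd.DataFrame, str]:
    """Read a saved dataset part and compute the hash of its raw bytes."""
    dataset_hash = hashlib.md5(dataset_path.read_bytes()).hexdigest()
    data = pd.read_csv(dataset_path, compression="gzip")
    return data, dataset_hash


def check_dataset_hash(name: str, part: str, dataset_dir: Path) -> pd.DataFrame:
    """Load a dataset part and warn if its hash differs from the expected one."""
    data, dataset_hash = read_dataset(dataset_path=dataset_dir / f"{name}_{part}.csv.gz")
    if dataset_hash != EXPECTED_HASHES[name][part]:
        warnings.warn(
            f"Local hash and expected hash are different for {name} dataset part {part}. "
            "The first possible reason is that the local copy of the dataset is out of date; "
            "in this case you can try setting rebuild_dataset=True to rebuild the dataset. "
            "The second possible reason is that the local copy reflects a more recent version "
            "of the data than your version of the library; in this case you can try updating "
            "the library version."
        )
    return data

Hashing the raw compressed bytes (rather than the parsed DataFrame) keeps the check cheap and independent of pandas parsing behavior; the trade-off is that any change to the compression settings also changes the hash.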

github-actions bot temporarily deployed to pull request November 22, 2023 10:51 (Inactive)
github-actions bot temporarily deployed to pull request November 22, 2023 22:48 (Inactive)
github-actions bot temporarily deployed to pull request November 23, 2023 06:51 (Inactive)
github-actions bot temporarily deployed to pull request November 24, 2023 01:43 (Inactive)
github-actions bot temporarily deployed to pull request November 24, 2023 06:19 (Inactive)
ostreech1997 merged commit 6fc62be into internal_datasets on Nov 24, 2023
15 checks passed
ostreech1997 deleted the issue-138 branch on November 24, 2023 09:07