Compresses Diff format and fix diff bug in Transforms #1379

Merged: 9 commits merged on Dec 16, 2021

@AbhinavTuli (Contributor) commented Dec 8, 2021

🚀 🚀 Pull Request

Checklist:

  • My code follows the style guidelines of this project and the Contributing document
  • I have commented my code, particularly in hard-to-understand areas
  • I have kept the coverage-rate up
  • I have performed a self-review of my own code and resolved any problems
  • I have checked to ensure there aren't any other open Pull Requests for the same change
  • I have described and made corresponding changes to the relevant documentation
  • New and existing unit tests pass locally with my changes

Changes

Compresses data_added in the diff so that it takes up less storage: instead of the full set of added indexes, only the start and end of the added range are stored.
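As a rough illustration of the idea (not the code in this PR), the compression could look like the sketch below. The helper name `compress_added` is hypothetical, and the sketch assumes that added samples are always appended, so the added indexes are contiguous. It uses the end-exclusive convention from this PR's examples.

```python
from typing import List, Set


def compress_added(indexes: Set[int]) -> List[int]:
    """Collapse a contiguous set of added sample indexes into [start, end).

    Hypothetical helper for illustration only; not part of hub's API.
    """
    if not indexes:
        # Empty range. The start value chosen here is arbitrary.
        return [0, 0]
    start, end = min(indexes), max(indexes) + 1
    # Assumption: samples are appended, so the added indexes have no gaps.
    if end - start != len(indexes):
        raise ValueError("added indexes are not contiguous")
    return [start, end]


print(compress_added({3, 4, 5}))  # [3, 6]
```

The storage saving comes from keeping two integers per tensor regardless of how many samples were added.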

The format of the dicts returned by diff has also changed:

{
    "image": {"data_added": [3, 6], "data_updated": {0, 2}, "created": False, "info_updated": False, "data_transformed_in_place": False},
    "label": {"data_added": [0, 3], "data_updated": set(), "created": True, "info_updated": False, "data_transformed_in_place": False},
    "other/stuff": {"data_added": [3, 3], "data_updated": {1, 2}, "created": True, "info_updated": False, "data_transformed_in_place": False}
}

Here, data_added is a range of sample indexes that were added to the tensor. The end of the range is exclusive.
For example, [3, 6] means that samples 3, 4, and 5 were added.
As another example, [3, 3] means that no samples were added, since the range is empty.
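The range semantics above match Python's built-in `range`, so a consumer of the diff could expand `data_added` back into individual indexes roughly as follows. The function name `added_indexes` is made up for this sketch.

```python
from typing import List, Sequence


def added_indexes(data_added: Sequence[int]) -> List[int]:
    """Expand a [start, end) pair into the list of added sample indexes."""
    start, end = data_added
    return list(range(start, end))


print(added_indexes([3, 6]))  # [3, 4, 5]
print(added_indexes([3, 3]))  # []
```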

data_updated, on the other hand, is a set of sample indexes that were updated.
For example, {0, 2} means that samples 0 and 2 were updated.
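To show how the fields of the new format fit together, here is a small sketch that turns a diff dict of the shape shown above into one-line summaries per tensor. `summarize_diff` is a hypothetical helper, not something hub provides, and it assumes every entry carries all five keys.

```python
from typing import Any, Dict


def summarize_diff(diff: Dict[str, Dict[str, Any]]) -> Dict[str, str]:
    """Build a short human-readable summary for each tensor in a diff dict."""
    summary = {}
    for tensor, changes in diff.items():
        start, end = changes["data_added"]  # end-exclusive range
        parts = []
        if changes["created"]:
            parts.append("created")
        if end > start:
            parts.append(f"{end - start} added")
        if changes["data_updated"]:
            parts.append(f"updated {sorted(changes['data_updated'])}")
        if changes["info_updated"]:
            parts.append("info updated")
        if changes["data_transformed_in_place"]:
            parts.append("transformed in place")
        summary[tensor] = ", ".join(parts) or "no changes"
    return summary


example = {
    "image": {"data_added": [3, 6], "data_updated": {0, 2}, "created": False,
              "info_updated": False, "data_transformed_in_place": False},
    "label": {"data_added": [0, 3], "data_updated": set(), "created": True,
              "info_updated": False, "data_transformed_in_place": False},
}
for name, line in summarize_diff(example).items():
    print(f"{name}: {line}")
# image: 3 added, updated [0, 2]
# label: created, 3 added
```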

"data_transformed_in_place" informs whether an inplace transformation happened in between the commits that are being looked at.

Also fixes a bug in which the diff wasn't being taken into account in transforms.

codecov bot commented Dec 8, 2021

Codecov Report

Merging #1379 (52a1e86) into main (619be82) will decrease coverage by 0.52%.
The diff coverage is 97.76%.

@@            Coverage Diff             @@
##             main    #1379      +/-   ##
==========================================
- Coverage   92.91%   92.39%   -0.53%     
==========================================
  Files         175      175              
  Lines       13049    12945     -104     
==========================================
- Hits        12125    11961     -164     
- Misses        924      984      +60     
Flag        Coverage Δ
unittests   92.39% <97.76%> (-0.53%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files                                      Coverage Δ
hub/core/version_control/test_version_control.py    99.85% <ø> (ø)
hub/util/diff.py                                    96.40% <93.93%> (-1.98%) ⬇️
hub/core/version_control/commit_diff.py             97.14% <95.00%> (-2.86%) ⬇️
hub/core/chunk_engine.py                            97.52% <100.00%> (-0.03%) ⬇️
hub/core/dataset/dataset.py                         94.20% <100.00%> (-0.09%) ⬇️
hub/core/meta/encode/chunk_id.py                    95.55% <100.00%> (ø)
hub/core/transform/test_transform.py                97.12% <100.00%> (+0.16%) ⬆️
hub/core/transform/transform.py                     96.63% <100.00%> (+0.02%) ⬆️
hub/util/encoder.py                                 99.13% <100.00%> (+0.14%) ⬆️
hub/util/transform.py                               92.12% <100.00%> (+0.14%) ⬆️
... and 52 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 816ba4d...52a1e86.

self.data_updated: Set[int] = set()
self.info_updated = False
Contributor Author

For now, this will always be False. I've added it to the format so that we don't break backwards compatibility in the future.

@AbhinavTuli changed the title from "Compresses Diff format" to "Compresses Diff format and fix diff bug in Transforms" on Dec 13, 2021
@AbhinavTuli merged commit 2e23cf9 into main on Dec 16, 2021
@AbhinavTuli deleted the compress/diff branch on December 16, 2021 11:23