Adding back transforms for parallel dataset uploads #1086

Merged
verbose-void merged 47 commits into main from update/2.0/transforms on Jul 31, 2021

Contributor

@AbhinavTuli AbhinavTuli commented Jul 27, 2021

🚀 🚀 Pull Request

Checklist:

  • My code follows the style guidelines of this project and the Contributing document
  • I have commented my code, particularly in hard-to-understand areas
  • I have kept the coverage-rate up
  • I have performed a self-review of my own code and resolved any problems
  • I have checked to ensure there aren't any other open Pull Requests for the same change
  • I have described and made corresponding changes to the relevant documentation
  • New and existing unit tests pass locally with my changes

Changes

  • This PR restores support for transforms, which broke in a previous release. The transform API has also been tweaked.
  • Also fixes extraneous print statements on slicing datasets.

Currently does not support:

  • Corner chunk merging similar to the one we had in the previous version of transforms for v2.
  • In-place transforms

Some parts of the code can be further simplified once we get rid of cachable.

hub/core/chunk_engine.py (outdated, resolved)
hub/core/sample.py (outdated, resolved)
return sliced


class TransformDatasetShard:
Contributor Author

I'm using this class, whose interface is very similar to Dataset, to avoid chunk engine overhead while storing samples during transforms. The append, extend and numpy syntax should currently be very similar to Dataset, but advanced indexing might not work.
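As a rough illustration of the idea described above (not the code in this PR), a dataset-like buffer that skips the chunk engine might look like the sketch below. The names `BufferTensor` and `BufferDataset` are hypothetical, and `numpy()` returns plain Python lists here only to keep the sketch free of dependencies.

```python
class BufferTensor:
    """Hypothetical stand-in for a tensor: samples are kept raw in a list."""

    def __init__(self):
        self.items = []

    def append(self, sample):
        # No chunking or compression: the sample is stored as-is.
        self.items.append(sample)

    def extend(self, samples):
        for sample in samples:
            self.append(sample)

    def numpy(self):
        # A real implementation would presumably return NumPy arrays;
        # this sketch returns a shallow copy of the raw samples.
        return list(self.items)

    def __len__(self):
        return len(self.items)

    def __getitem__(self, index):
        # Only integer and slice indexing; advanced indexing is not supported.
        if not isinstance(index, (int, slice)):
            raise TypeError("only int and slice indices are supported in this sketch")
        return self.items[index]


class BufferDataset:
    """Hypothetical stand-in for a dataset: tensors are created on first access."""

    def __init__(self):
        self.tensors = {}

    def __getitem__(self, name):
        return self.tensors.setdefault(name, BufferTensor())

    def __getattr__(self, name):
        # Reached only when normal lookup fails; the guard avoids recursion
        # if `tensors` itself is missing (for example during copying).
        if name == "tensors" or name.startswith("__"):
            raise AttributeError(name)
        return self[name]


ds = BufferDataset()
ds.image.append([[0, 1], [2, 3]])  # one raw "image" sample
ds.label.extend([1, 0])            # two raw "label" samples
print(len(ds.image), ds.label.numpy())
```

Because nothing is chunked or compressed, such a buffer is cheap to fill inside a worker, but most methods and attributes of a full Dataset would be missing.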

Contributor Author

Need to add docstrings

Contributor

can we try to avoid Shard naming? it's kind of confusing

Contributor Author

any suggestions?

Contributor

none come to mind, but i feel like we use Shard and Chunk interchangeably

hub/util/transform.py (outdated, resolved)
hub/util/transform.py (outdated, resolved)
@@ -0,0 +1,59 @@
from hub.core.sample import Sample # type: ignore
Contributor

what are the differences between TransformDatasetShard and TransformDatasetTensor?

also, maybe we can drop the Dataset from both class names?

Contributor Author

they're literally basic versions of Dataset and Tensor respectively, used for the sample_out argument.
The underlying implementation does not involve any chunking/compression though; raw data is just stored.
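To make the role of `sample_out` concrete, here is a minimal sketch of how a transform could be applied to slices of an input in parallel, with each worker writing into its own raw buffer. This is an illustration under stated assumptions, not Hub's API: `RawOut`, `double`, and `run_transform` are hypothetical names, and standard-library threads stand in for whatever scheduler the real implementation uses.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class RawOut:
    """Hypothetical sample_out buffer: tensor name -> list of raw samples."""

    def __init__(self):
        self.data = defaultdict(list)

    def append(self, tensor, sample):
        # Raw storage only: no chunking, no compression.
        self.data[tensor].append(sample)


def double(sample_in, sample_out):
    # Hypothetical transform: writes one output sample per input sample.
    sample_out.append("value", sample_in * 2)


def run_transform(fn, samples, num_workers=2):
    """Apply fn to contiguous slices of samples in parallel, preserving order."""
    size = max(1, -(-len(samples) // num_workers))  # ceiling division
    slices = [samples[i:i + size] for i in range(0, len(samples), size)]

    def work(part):
        out = RawOut()  # each worker owns its buffer, so no locking is needed
        for sample in part:
            fn(sample, out)
        return out

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        outs = list(pool.map(work, slices))  # map yields results in input order

    # Merge the per-worker buffers back into one ordered result.
    merged = defaultdict(list)
    for out in outs:
        for tensor, items in out.data.items():
            merged[tensor].extend(items)
    return dict(merged)


print(run_transform(double, [1, 2, 3, 4, 5]))
```

In this sketch the merge step is the only point where the buffered samples are combined; a real implementation would presumably hand them to the chunk engine there.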

Contributor

in that case, is it possible to use Dataset & Tensor?

if i'm a user, a likely thing i will do is print out the variables, and if i see Dataset/Tensor i will know exactly what they are.

Contributor

also it makes more sense to me to call these "View"s even if it's not as technically descriptive as "Shard"

Contributor Author

yeah but this isn't exactly a Dataset/Tensor, it's just a dummy class representing it. So a lot of the methods/attributes of Dataset won't work.

Contributor Author

What about TransformDataset and TransformTensor?

Contributor

yep that works well

hub/util/transform.py (outdated, resolved)
hub/util/transform.py (outdated, resolved)
hub/util/transform.py (outdated, resolved)
Contributor

@verbose-void verbose-void left a comment

quite a few comments, but most logic looks pretty good. great code organization and docstrings!

hub/core/chunk_engine.py (outdated, resolved)
hub/core/sample.py (outdated, resolved)
hub/core/transform/transform_shard.py (outdated, resolved)
Contributor

@farizrahman4u farizrahman4u left a comment

Nice work. We will have to refactor some of the transform utils to the base encoder, but we can do that later.

Contributor

@verbose-void verbose-void left a comment

yessir

@verbose-void verbose-void merged commit 6c2033a into main Jul 31, 2021
@verbose-void verbose-void deleted the update/2.0/transforms branch July 31, 2021 21:19
3 participants