Adding back transforms for parallel dataset uploads #1086
Conversation
    return sliced

    class TransformDatasetShard:
I'm using this class with an interface very similar to Dataset's to avoid chunk engine overhead while storing samples during transforms. The append, extend, and numpy syntax should currently be very similar to Dataset's, but might not work for advanced indexing.
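To make the idea concrete, here's a rough sketch of what such a lightweight, chunk-free stand-in could look like (illustrative only — the class and attribute names below are assumptions, not the code in this PR):

```python
import numpy as np


class _TransformTensorSketch:
    """Buffers raw samples for one tensor; no chunking or compression."""

    def __init__(self):
        self.items = []

    def append(self, sample):
        # Store the sample as-is instead of routing it through the chunk engine.
        self.items.append(sample)

    def extend(self, samples):
        for sample in samples:
            self.append(sample)

    def numpy(self):
        # Return everything buffered so far as numpy arrays.
        return [np.asarray(item) for item in self.items]


class _TransformDatasetSketch:
    """Maps tensor names to buffers, mimicking `ds.tensor_name.append(...)`."""

    def __init__(self):
        self.tensors = {}

    def __getattr__(self, name):
        # Lazily create a buffer for any tensor name that is accessed.
        return self.tensors.setdefault(name, _TransformTensorSketch())
```

With something like this, user code can call `out.images.append(arr)` without touching the chunk engine, and the transform machinery can read the buffered data back via `numpy()` when writing to the real dataset.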
Need to add docstrings
can we try to avoid the `Shard` naming? it's kind of confusing
any suggestions?
none come to mind, but i feel like we use `Shard` and `Chunk` interchangeably
    from hub.core.sample import Sample  # type: ignore
what are the differences between `TransformDatasetShard` and `TransformDatasetTensor`?
also, maybe we can drop the `Dataset` from both class names?
they're literally basic versions of Dataset and Tensor respectively, used for the sample_out argument.
The underlying implementation doesn't involve any chunking/compression though; the raw data is just stored.
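For context, this is roughly how a sample_out object gets used from the user's side (a hedged sketch — the `@hub.compute` decorator, function name, and tensor names are assumptions based on Hub's transform API and may not match this PR exactly):

```python
import hub  # assumes hub's transform decorator is available as hub.compute


@hub.compute
def downsample(sample_in, sample_out):
    # sample_out is the stripped-down Dataset-like object discussed above:
    # tensors are addressed by name and support append/extend, but the data
    # is buffered raw (no chunking/compression) until the transform machinery
    # writes it into the target dataset.
    img = sample_in.images.numpy()
    sample_out.images.append(img[::2, ::2])  # naive 2x downsample, for illustration only
    sample_out.labels.append(sample_in.labels.numpy())
```

The point of the lightweight classes is that each `append` here just buffers the raw sample; nothing goes through the chunk engine until the worker flushes its results.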
in that case, is it possible to use Dataset & Tensor?
if i'm a user, a likely thing i will do is print out the variables, and if i see Dataset/Tensor i will know exactly what they are.
also it makes more sense to me to call these "View"s even if it's not as technically descriptive as "Shard"
yeah but this isn't exactly a Dataset/Tensor, it's just a dummy class representing it. So a lot of the methods/attributes of Dataset won't work.
What about TransformDataset and TransformTensor?
yep that works well
quite a few comments, but most logic looks pretty good. great code organization and docstrings!
Nice work, will have to refactor some of the transform utils to the base encoder... but we can do that later.
yessir
🚀 🚀 Pull Request
Checklist:
Changes
Currently does not support:-
Some parts of the code can be further simplified once we get rid of cachable.