Adding back transforms for parallel dataset uploads #1086

AbhinavTuli · 2021-07-27T11:04:15Z

🚀 🚀 Pull Request

Checklist:

My code follows the style guidelines of this project and the Contributing document
I have commented my code, particularly in hard-to-understand areas
I have kept the coverage-rate up
I have performed a self-review of my own code and resolved any problems
I have checked to ensure there aren't any other open Pull Requests for the same change
I have described and made corresponding changes to the relevant documentation
New and existing unit tests pass locally with my changes

Changes

This PR restores support for transforms, which was broken in a previous release, and tweaks the transform API.
It also removes the extraneous print statements that appeared when slicing datasets.

Currently does not support:

Corner chunk merging similar to what we had in the previous version of transforms for v2.
In-place transforms

Some parts of the code can be further simplified once we get rid of cachable.

hub/core/chunk_engine.py

hub/core/sample.py

hub/core/storage/cachable.py

hub/core/transform/transform.py

AbhinavTuli · 2021-07-27T11:11:44Z

hub/core/transform/transform_shard.py

+    return sliced
+
+
+class TransformDatasetShard:


I'm using this class, whose interface is very similar to Dataset's, to avoid chunk engine overhead while storing samples during transforms. The append, extend and numpy syntax should currently match Dataset closely, but advanced indexing might not work.

Need to add docstrings

can we try to avoid Shard naming? it's kind of confusing

any suggestions?

none come to mind, but i feel like we use Shard and Chunk interchangeably

hub/core/transform/transform_tensor.py

hub/core/chunk_engine.py

hub/util/transform.py

hub/core/transform/transform_tensor.py

verbose-void · 2021-07-29T23:22:09Z

hub/core/transform/transform_tensor.py

@@ -0,0 +1,59 @@
+from hub.core.sample import Sample  # type: ignore


what are the differences between TransformDatasetShard and TransformDatasetTensor?

also, maybe we can drop the Dataset from both class names?

they're literally basic versions of Dataset and Tensor respectively, used for the sample_out argument.
The underlying implementation does not involve any chunking/compression though; the raw data is just stored.

in that case, is it possible to use Dataset & Tensor?

if i'm a user, a likely thing i will do is print out the variables, and if i see Dataset/Tensor i will know exactly what they are.

also it makes more sense to me to call these "View"s even if it's not as technically descriptive as "Shard"

yeah but this isn't exactly a Dataset/Tensor, it's just a dummy class representing it. So a lot of the methods/attributes of Dataset won't work.

What about TransformDataset and TransformTensor?

yep that works well
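
For readers following this thread, here is a minimal sketch of what such stand-in classes could look like, assuming they only need to hold raw samples in memory with no chunking or compression. It is not the code from this PR: the class bodies, the `items` and `tensors` attributes, and the dict-style `ds["image"]` access are assumptions made for illustration, and the numpy accessor is left out to keep the sketch dependency-free.

```python
class TransformTensor:
    """Hypothetical stand-in for Tensor: keeps raw samples in a plain list."""

    def __init__(self):
        self.items = []  # raw samples, stored as-is (no chunking/compression)

    def append(self, sample):
        self.items.append(sample)

    def extend(self, samples):
        for sample in samples:
            self.append(sample)

    def __getitem__(self, index):
        # Only basic integer/slice indexing; no advanced indexing.
        return self.items[index]

    def __len__(self):
        return len(self.items)


class TransformDataset:
    """Hypothetical stand-in for Dataset: maps tensor names to TransformTensor."""

    def __init__(self):
        self.tensors = {}

    def __getitem__(self, name):
        # Create the tensor on first access so ds["image"].append(...) works.
        if name not in self.tensors:
            self.tensors[name] = TransformTensor()
        return self.tensors[name]

    def __len__(self):
        # A dataset is only as long as its shortest tensor.
        return min((len(t) for t in self.tensors.values()), default=0)


# Usage: collect the outputs of a transform for one input sample.
sample_out = TransformDataset()
sample_out["image"].append([1, 2])
sample_out["image"].extend([[3, 4], [5, 6]])
sample_out["label"].append(0)
```

Because nothing here touches the chunk engine, appends are cheap; the collected raw samples would still need to be written through the real Dataset afterwards.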

hub/util/exceptions.py

hub/util/transform.py

verbose-void

quite a few comments, but most logic looks pretty good. great code organization and docstrings!

hub/core/chunk_engine.py

hub/core/sample.py

hub/core/transform/transform_shard.py

…r->TransformTensor

farizrahman4u

Nice work, will have to refactor some of the transform utils to base encoder, but we can do that later.

verbose-void

yessir

AbhinavTuli added 17 commits July 13, 2021 01:50

restore old transform code

9fb5068

better checks on ds_out

c17a365

Merge remote-tracking branch 'origin/main' into update/2.0/transforms

f386183

Merge remote-tracking branch 'origin/main' into update/2.0/transforms

2373cf7

updated transform code

d8ae83e

Merge remote-tracking branch 'origin/main' into update/2.0/transforms

4e58950

threaded transforms work, processed fails

e670495

multiprocessing fixed

d6b36e4

removed logs on dataset slcing

d8c0c1f

moved chunk engine initialization into store_shard

2819456

cache TODO

3245098

added shardwise cache

8fb90d3

added TODO for complexity

9108e12

new transform api works, sample_out left

5b28118

removed commented out code

92b2e2c

update docstring

9d25165

samples_out syntax works

69d8f8c