
Added benchmarks for Zarr compared to Hub #512

Merged: 2 commits into activeloopai:master on Feb 1, 2021

Conversation

DebadityaPal (Contributor)

Added benchmarking files similar to the TileDB ones, except these are for Zarr.
For reproducibility, they have also been tested in a Google Colab notebook environment.
The benchmarks are as follows:

```
Dataset:  activeloop/mnist with Batch Size:  70000
Performance of Zarr
Batch 1 dt: 0.3488335609436035
Time: 0.3494699001312256s
Performance of Hub (Stored on the Cloud):
Batch 1 dt: 1.3345892429351807
Time: 1.3362071514129639s
Performance of Hub (Stored Locally):
Batch 1 dt: 0.06419062614440918
Time: 0.06442713737487793s

Dataset:  activeloop/mnist with Batch Size:  7000
Performance of Zarr
Batch 1 dt: 0.2592613697052002
Batch 2 dt: 0.25765037536621094
Batch 3 dt: 0.2523312568664551
Batch 4 dt: 0.26584601402282715
Batch 5 dt: 0.26589441299438477
Batch 6 dt: 0.2726469039916992
Batch 7 dt: 0.2609729766845703
Batch 8 dt: 0.27009081840515137
Batch 9 dt: 0.26024317741394043
Batch 10 dt: 0.26842737197875977
Time: 2.633971691131592s
Performance of Hub (Stored on the Cloud):
Batch 1 dt: 0.8596398830413818
Batch 2 dt: 0.024734973907470703
Batch 3 dt: 0.023304224014282227
Batch 4 dt: 0.2799103260040283
Batch 5 dt: 0.020690202713012695
Batch 6 dt: 0.020921945571899414
Batch 7 dt: 0.2269284725189209
Batch 8 dt: 0.021099328994750977
Batch 9 dt: 0.022265911102294922
Batch 10 dt: 0.10095000267028809
Time: 1.6010329723358154s
Performance of Hub (Stored Locally):
Batch 1 dt: 0.02331089973449707
Batch 2 dt: 0.02063584327697754
Batch 3 dt: 0.026700973510742188
Batch 4 dt: 0.03954935073852539
Batch 5 dt: 0.022269725799560547
Batch 6 dt: 0.0210113525390625
Batch 7 dt: 0.035251617431640625
Batch 8 dt: 0.02073359489440918
Batch 9 dt: 0.021008968353271484
Batch 10 dt: 0.02947402000427246
Time: 0.2601170539855957s
```
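
For context, a minimal sketch of the kind of batched-read timing loop that produces output like the above (the name time_zarr comes from the added file; the body and parameters here are assumptions, not the exact benchmark code):

```python
import time

import zarr


def time_zarr(path, batch_size, num_batches):
    # Open the Zarr array read-only and time sequential batch reads.
    arr = zarr.open(path, mode="r")
    start = time.time()
    for i in range(num_batches):
        t0 = time.time()
        _batch = arr[i * batch_size : (i + 1) * batch_size]  # materialize one batch
        print(f"Batch {i + 1} dt: {time.time() - t0}")
    print(f"Time: {time.time() - start}s")
```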

github-actions (bot)

Locust summary

Git references

Initial: 31943d2
Terminal: c813750

benchmarks/benchmark_zarr_hub.py
Changes:
  • Name: time_zarr
    Type: function
    Changed lines: 29
    Total lines: 29
  • Name: time_hub
    Type: function
    Changed lines: 17
    Total lines: 17

codecov (bot) commented Jan 30, 2021

Codecov Report

Merging #512 (bd96374) into master (31943d2) will not change coverage.
The diff coverage is n/a.


```
@@           Coverage Diff           @@
##           master     #512   +/-   ##
=======================================
  Coverage   88.46%   88.46%
=======================================
  Files          52       52
  Lines        3745     3745
=======================================
  Hits         3313     3313
  Misses        432      432
```


mynameisvinn added this to In Development in the Development Roadmap on Jan 30, 2021
haiyangdeperci (Contributor)

Awesome, thanks again for your sustained support. Would it be possible for you to include the places365_small dataset in your code? We want to see how hub performs on it. You don't need to run these benchmarks again, but it would be useful to have it in the code. The Hub signature for it is hydp/places365_small_train.

DebadityaPal (Contributor, Author)

I have added the dataset and modified the existing functions to support it. However, the process is really memory-inefficient: the dataset is stored three times, once as .tfrecord, once in the zarr format, and once in the hub format.

mynameisvinn (Contributor)

@DebadityaPal Where is places365 saved as a tfrecord? My understanding (and I might be wrong) is that from_tfds returns a hub.Dataset, not the underlying tfrecord.

DebadityaPal (Contributor, Author)

What is happening is that from_tfds returns a simple transform on the tfds dataset, which first gets downloaded to root/tensorflow-datasets/places365_small as .tfrecord files. Then, when we call dataset.store(), the transform is computed and the hub dataset is stored. So the data is first stored as .tfrecord files.
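
Roughly, the flow looks like this (a sketch; the exact from_tfds call signature and the output path are assumptions):

```python
import hub

# 1. tfds materializes the raw data under root/tensorflow-datasets/
#    as .tfrecord files (first copy on disk).
# 2. from_tfds wraps it in a lazy transform; no hub data is written yet.
ds = hub.Dataset.from_tfds("places365_small", split="train")

# 3. store() computes the transform and writes the hub copy, so the
#    data now exists both as .tfrecord and in the hub format.
ds.store("./places365_small_hub")
```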

mynameisvinn (Contributor)

@DebadityaPal Gotcha, thanks.

Can you show me where the tfrecord is transformed? I've been staring at store and don't see it.

DebadityaPal (Contributor, Author) commented Jan 31, 2021

In line 939 of dataset.py, in from_tfds(), we have

```python
ds = tfds.load(dataset, split=split)
```

I think the way tfds.load() is programmed, it looks for the data in the tensorflow_datasets root directory and, if the dataset doesn't exist, downloads and prepares it. It has a download argument, a boolean set to True by default; if it is set to False, execution throws an error when the dataset is not present in the root directory. So either way, the dataset needs to be downloaded by tfds before we can use it further.
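
To illustrate the two behaviours described (the dataset name is just an example):

```python
import tensorflow_datasets as tfds

# Default (download=True): look for the prepared dataset under the tfds
# root directory and download + prepare it if it is missing.
ds = tfds.load("places365_small", split="train")

# With download=False, tfds raises an error instead of downloading when
# the prepared data is absent from the root directory.
ds = tfds.load("places365_small", split="train", download=False)
```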

haiyangdeperci (Contributor)

Is it necessary that we use from_tfds in the process?

DebadityaPal (Contributor, Author)

It's not necessary, but it is faster.
Another option would be to use an identity transform to store the cloud-fetched hub dataset locally. That usually takes much longer.
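
For reference, a minimal sketch of that identity-transform option, assuming hub 1.x's @hub.transform decorator; the schema handling and the sample return convention here are assumptions, not the exact API:

```python
import hub

src = hub.Dataset("activeloop/mnist")  # cloud-hosted hub dataset

# An identity transform yields each sample unchanged, so storing it
# amounts to copying the cloud dataset into local storage.
@hub.transform(schema=src.schema)
def identity(sample):
    return sample

identity(src).store("./mnist_local")  # hypothetical local path
```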

haiyangdeperci (Contributor)

We've been discussing this issue. I'll try to run these benchmarks on our set-up. Afterwards, we'll investigate the problem. I think @AbhinavTuli wanted to explore the issue with the transforms.

haiyangdeperci (Contributor)

I'll merge it. I ran the benchmarks and the results seem alright, albeit I noted the memory issue. I am hoping to raise this problem separately before our next benchmark call.

haiyangdeperci merged commit 837dab9 into activeloopai:master on Feb 1, 2021