
Added benchmarks for Zarr compared to Hub #512

Merged: 2 commits into activeloopai:master on Feb 1, 2021

Conversation

DebadityaPal (Contributor)

Added benchmarking files similar to the TileDB ones, except these are for Zarr.
For reproducibility, they have also been tested in a Google Colab notebook environment.
The benchmarks are as follows:

```
Dataset:  activeloop/mnist with Batch Size:  70000
Performance of Zarr
Batch 1 dt: 0.3488335609436035
Time: 0.3494699001312256s
Performance of Hub (Stored on the Cloud):
Batch 1 dt: 1.3345892429351807
Time: 1.3362071514129639s
Performance of Hub (Stored Locally):
Batch 1 dt: 0.06419062614440918
Time: 0.06442713737487793s

Dataset:  activeloop/mnist with Batch Size:  7000
Performance of Zarr
Batch 1 dt: 0.2592613697052002
Batch 2 dt: 0.25765037536621094
Batch 3 dt: 0.2523312568664551
Batch 4 dt: 0.26584601402282715
Batch 5 dt: 0.26589441299438477
Batch 6 dt: 0.2726469039916992
Batch 7 dt: 0.2609729766845703
Batch 8 dt: 0.27009081840515137
Batch 9 dt: 0.26024317741394043
Batch 10 dt: 0.26842737197875977
Time: 2.633971691131592s
Performance of Hub (Stored on the Cloud):
Batch 1 dt: 0.8596398830413818
Batch 2 dt: 0.024734973907470703
Batch 3 dt: 0.023304224014282227
Batch 4 dt: 0.2799103260040283
Batch 5 dt: 0.020690202713012695
Batch 6 dt: 0.020921945571899414
Batch 7 dt: 0.2269284725189209
Batch 8 dt: 0.021099328994750977
Batch 9 dt: 0.022265911102294922
Batch 10 dt: 0.10095000267028809
Time: 1.6010329723358154s
Performance of Hub (Stored Locally):
Batch 1 dt: 0.02331089973449707
Batch 2 dt: 0.02063584327697754
Batch 3 dt: 0.026700973510742188
Batch 4 dt: 0.03954935073852539
Batch 5 dt: 0.022269725799560547
Batch 6 dt: 0.0210113525390625
Batch 7 dt: 0.035251617431640625
Batch 8 dt: 0.02073359489440918
Batch 9 dt: 0.021008968353271484
Batch 10 dt: 0.02947402000427246
Time: 0.2601170539855957s
```
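
For context, a minimal sketch of the kind of batched-read timing loop that produces output like the above (the name time_zarr comes from the added file; the body and parameters here are assumptions, not the exact benchmark code):

```python
import time

import zarr


def time_zarr(path, batch_size, num_batches):
    # Open the Zarr array read-only and time sequential batch reads.
    arr = zarr.open(path, mode="r")
    start = time.time()
    for i in range(num_batches):
        t0 = time.time()
        _batch = arr[i * batch_size : (i + 1) * batch_size]  # materialize one batch
        print(f"Batch {i + 1} dt: {time.time() - t0}")
    print(f"Time: {time.time() - start}s")
```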

github-actions (bot)

Locust summary

Git references

Initial: 31943d2
Terminal: c813750

benchmarks/benchmark_zarr_hub.py
Changes:
  • Name: time_zarr
    Type: function
    Changed lines: 29
    Total lines: 29
  • Name: time_hub
    Type: function
    Changed lines: 17
    Total lines: 17

codecov (bot) commented Jan 30, 2021

Codecov Report

Merging #512 (bd96374) into master (31943d2) will not change coverage.
The diff coverage is n/a.


```
@@           Coverage Diff           @@
##           master     #512   +/-   ##
=======================================
  Coverage   88.46%   88.46%
=======================================
  Files          52       52
  Lines        3745     3745
=======================================
  Hits         3313     3313
  Misses        432      432
```


mynameisvinn added this to In Development in the Development Roadmap on Jan 30, 2021
haiyangdeperci (Contributor)

Awesome, thanks again for your sustained support. Would it be possible for you to include the places365_small dataset in your code? We want to see how hub performs on it. You don't need to run these benchmarks again, but it would be useful to have it in the code. The Hub signature for it is hydp/places365_small_train.

DebadityaPal (Contributor, Author)

I have added the dataset and modified the existing functions to support it. However, the process is really memory-inefficient: the dataset is stored three times, once as .tfrecord, once in the zarr format, and once in the hub format.

mynameisvinn (Contributor)

@DebadityaPal Where is places365 saved as a tfrecord? My understanding (and I might be wrong) is that from_tfds returns a hub.Dataset, not the underlying tfrecord.

DebadityaPal (Contributor, Author)

What is happening is that from_tfds returns a simple transform on the tfds dataset, which first gets downloaded to root/tensorflow-datasets/places365_small as .tfrecord files. Then, when we call dataset.store(), the transform is computed and the hub dataset is stored. So the data is first stored as .tfrecord files.
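
Roughly, the flow looks like this (a sketch; the exact from_tfds call signature and the output path are assumptions):

```python
import hub

# 1. tfds materializes the raw data under root/tensorflow-datasets/
#    as .tfrecord files (first copy on disk).
# 2. from_tfds wraps it in a lazy transform; no hub data is written yet.
ds = hub.Dataset.from_tfds("places365_small", split="train")

# 3. store() computes the transform and writes the hub copy, so the
#    data now exists both as .tfrecord and in the hub format.
ds.store("./places365_small_hub")
```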

mynameisvinn (Contributor)

@DebadityaPal Gotcha, thanks.

Can you show me where the tfrecord is transformed? I've been staring at store and don't see it.

DebadityaPal (Contributor, Author) commented Jan 31, 2021

In line 939 of dataset.py, in from_tfds(), we have

```python
ds = tfds.load(dataset, split=split)
```

I think the way tfds.load() is programmed, it looks for the data in the tensorflow_datasets root directory and, if the dataset doesn't exist, downloads and prepares it. It has a download argument, a boolean set to True by default; if it is set to False, execution throws an error when the dataset is not present in the root directory. So either way, the dataset needs to be downloaded by tfds before we can use it further.
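
To illustrate the two behaviours described (the dataset name is just an example):

```python
import tensorflow_datasets as tfds

# Default (download=True): look for the prepared dataset under the tfds
# root directory and download + prepare it if it is missing.
ds = tfds.load("places365_small", split="train")

# With download=False, tfds raises an error instead of downloading when
# the prepared data is absent from the root directory.
ds = tfds.load("places365_small", split="train", download=False)
```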

haiyangdeperci (Contributor)

Is it necessary that we use from_tfds in the process?

DebadityaPal (Contributor, Author)

It's not necessary, but it is faster.
Another option would be to use an identity transform to store the cloud-fetched hub dataset locally. That usually takes much longer.
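
For reference, a minimal sketch of that identity-transform option, assuming hub 1.x's @hub.transform decorator; the schema handling and the sample return convention here are assumptions, not the exact API:

```python
import hub

src = hub.Dataset("activeloop/mnist")  # cloud-hosted hub dataset

# An identity transform yields each sample unchanged, so storing it
# amounts to copying the cloud dataset into local storage.
@hub.transform(schema=src.schema)
def identity(sample):
    return sample

identity(src).store("./mnist_local")  # hypothetical local path
```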

haiyangdeperci (Contributor)

We've been discussing this issue. I'll try to run these benchmarks on our set-up. Afterwards, we'll investigate the problem. I think @AbhinavTuli wanted to explore the issue with the transforms.

haiyangdeperci (Contributor)

I'll merge it. I ran the benchmarks and the results seem alright, albeit I noted the memory issue. I am hoping to raise this problem separately before our next benchmark call.

haiyangdeperci merged commit 837dab9 into activeloopai:master on Feb 1, 2021