Benchmark for TileDB and Hub added #508

Merged
merged 4 commits into activeloopai:master on Jan 29, 2021

DebadityaPal
Contributor

For reproducibility, I used a Google Colab environment with Hardware Acceleration set to None.

@github-actions

Locust summary

Git references

Initial: c2a5bd9
Terminal: da4e42f

benchmarks/benchmark_tiledb_hub.py
Changes:
  • Name: time_tiledb
    Type: function
    Changed lines: 30
    Total lines: 30
  • Name: time_hub
    Type: function
    Changed lines: 17
    Total lines: 17

      @codecov

      codecov bot commented Jan 28, 2021

      Codecov Report

      Merging #508 (9b60667) into master (c2a5bd9) will not change coverage.
      The diff coverage is n/a.

      @@           Coverage Diff           @@
      ##           master     #508   +/-   ##
      =======================================
        Coverage   88.46%   88.46%           
      =======================================
        Files          52       52           
        Lines        3745     3745           
      =======================================
        Hits         3313     3313           
        Misses        432      432           


      @haiyangdeperci
      Contributor

      @DebadityaPal Thanks a lot for posting it. I'll fetch the results 👍 and I'll see if everything is in order.

      @DebadityaPal
      Contributor Author

      I just realized I missed compute() on the Hub dataset calls. I have added that; the current results are:

      Dataset:  activeloop/mnist with Batch Size:  70000
      Performance of TileDB
      Batch 1 dt: 1.1255619525909424
      Time: 1.1257479190826416s
      Performance of Hub
      Batch 1 dt: 1.1036744117736816
      Time: 1.1043024063110352s
      Dataset:  activeloop/mnist with Batch Size:  7000
      Performance of TileDB
      Batch 1 dt: 0.9032697677612305
      Batch 2 dt: 1.0399649143218994
      Batch 3 dt: 1.0450794696807861
      Batch 4 dt: 1.0824565887451172
      Batch 5 dt: 1.0563287734985352
      Batch 6 dt: 1.0369079113006592
      Batch 7 dt: 1.0990698337554932
      Batch 8 dt: 1.0535051822662354
      Batch 9 dt: 1.1018755435943604
      Batch 10 dt: 1.0933120250701904
      Time: 10.51236867904663s
      Performance of Hub
      Batch 1 dt: 0.8278625011444092
      Batch 2 dt: 0.01743602752685547
      Batch 3 dt: 0.016743183135986328
      Batch 4 dt: 0.2736356258392334
      Batch 5 dt: 0.016682863235473633
      Batch 6 dt: 0.016436100006103516
      Batch 7 dt: 0.21753954887390137
      Batch 8 dt: 0.019435405731201172
      Batch 9 dt: 0.019253969192504883
      Batch 10 dt: 0.08768510818481445
      Time: 1.5131916999816895s
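For readers who want to reproduce a log in this shape, here is a minimal sketch of a sequential-read timing loop. It is not the PR's script: `time_batches` and `read_from_memory` are hypothetical names, the data is an in-memory NumPy array standing in for a TileDB array or a Hub dataset, and `time.perf_counter` is my choice of clock.

```python
import time

import numpy as np


def time_batches(read_batch, num_samples, batch_size):
    """Time sequential batch reads and print a per-batch and total log.

    read_batch(start, stop) is a hypothetical callable that must fully
    materialize samples [start, stop) before returning.
    """
    batch_times = []
    t_start = time.perf_counter()
    for batch_no, start in enumerate(range(0, num_samples, batch_size), start=1):
        stop = min(start + batch_size, num_samples)
        t0 = time.perf_counter()
        read_batch(start, stop)
        dt = time.perf_counter() - t0
        batch_times.append(dt)
        print(f"Batch {batch_no} dt: {dt}")
    total = time.perf_counter() - t_start
    print(f"Time: {total}s")
    return batch_times, total


# Stand-in data (assumption): 7000 MNIST-sized images held in memory.
data = np.zeros((7000, 28, 28), dtype=np.uint8)


def read_from_memory(start, stop):
    # .copy() forces the slice to be materialized instead of returning a view.
    return data[start:stop].copy()


batch_times, total = time_batches(read_from_memory, num_samples=7000, batch_size=700)
```

The important detail is that the loader finishes its work inside the timed region. With a lazy API, skipping the materialization step (the `compute()` call mentioned above) would time only the construction of the lazy object.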
      

      @haiyangdeperci
      Contributor

      @DebadityaPal Now it looks more realistic ;) Thanks!

      @haiyangdeperci
      Contributor

      @DebadityaPal Just a clarification question: you are testing the performance of sequential reads, right?

      @DebadityaPal
      Contributor Author

      Yes, these are sequential reads. Would you like me to submit a file with randomized reads as well?

      @haiyangdeperci
      Contributor

      @DebadityaPal Eventually, yes. At this moment, this is great and sufficient! It would be more useful for us if you focused on zarr for now, as we discussed :) Great job! 💯
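To make the sequential versus randomized distinction concrete, here is a small sketch of how the two access patterns could be generated as index batches. The function names and the fixed seed are assumptions for illustration; neither appears in the PR.

```python
import numpy as np


def sequential_batches(num_samples, batch_size):
    """Yield contiguous index ranges: 0..b-1, b..2b-1, and so on."""
    for start in range(0, num_samples, batch_size):
        yield np.arange(start, min(start + batch_size, num_samples))


def random_batches(num_samples, batch_size, seed=0):
    """Yield batches drawn from a random permutation of all indices.

    Every index is visited exactly once, but each batch touches scattered
    positions, so chunk-level caching helps far less than in a sequential scan.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_samples)
    for start in range(0, num_samples, batch_size):
        # Sort within the batch; many array stores expect increasing indices.
        yield np.sort(order[start:start + batch_size])


seq = list(sequential_batches(70000, 7000))
rnd = list(random_batches(70000, 7000))
print(len(seq), len(rnd))
```

Both generators produce the same number of batches over the same samples, so a benchmark can swap one for the other and keep the amount of data read constant.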

      @DebadityaPal
      Contributor Author

      @haiyangdeperci Do you think these benchmarks are a bit unfair towards Hub? The TileDB data is stored locally, whereas Hub fetches its data from the cloud, which adds overhead on every cache miss. I modified the code to store both datasets locally, and Hub performs significantly better in that case. In the second task (batch size 7000), instead of
      Time: 1.5131916999816895s
      it takes only
      Time: 0.233504056930542s
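As a quick worked comparison, the totals quoted in this thread for the batch-size-7000 task can be turned into speedup ratios. The timings below are copied from the comments above; the variable names and the `speedup` helper are mine, and each figure comes from a single run, so the ratios are indicative only.

```python
# Totals reported in this thread for activeloop/mnist, batch size 7000.
tiledb_local = 10.51236867904663   # TileDB, data stored locally
hub_remote = 1.5131916999816895    # Hub, data fetched from the cloud
hub_local = 0.233504056930542      # Hub, data stored locally


def speedup(baseline, candidate):
    """Return how many times faster `candidate` is than `baseline`."""
    return baseline / candidate


print(f"Hub local vs Hub remote:   {speedup(hub_remote, hub_local):.2f}x")
print(f"Hub remote vs TileDB local: {speedup(tiledb_local, hub_remote):.2f}x")
print(f"Hub local vs TileDB local:  {speedup(tiledb_local, hub_local):.2f}x")
```

On these numbers, storing the Hub dataset locally is roughly 6.5 times faster than fetching it remotely, and remote Hub is itself close to 7 times faster than local TileDB.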

      @haiyangdeperci
      Contributor

      @DebadityaPal I think it would be good for the benchmark to present both cases. We want to show that even when the data is fetched remotely, Hub still outperforms TileDB.

      @mynameisvinn added this to Committed in Development Roadmap on Jan 29, 2021
      @mynameisvinn moved this from Committed to In Development in Development Roadmap on Jan 29, 2021
      @DebadityaPal
      Contributor Author

      Another thing I noticed is that the benchmarks for the cloud-stored Hub dataset are a little inconsistent: the timings vary with Internet speed and latency.

      @haiyangdeperci
      Contributor

      @DebadityaPal That's a good observation. I'm running these benchmarks on a machine located closer to Hub's cloud, so it is less of an issue for me, but for the end user it may be significant.
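One common way to make network-dependent timings easier to compare is to repeat each measurement and report summary statistics rather than a single run. The sketch below assumes a hypothetical zero-argument `fn` that performs one full benchmark pass; `repeat_timing` is not part of the PR.

```python
import statistics
import time


def repeat_timing(fn, repeats=5):
    """Run fn() several times and summarize the wall-clock durations."""
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        # stdev needs at least two samples; report 0.0 otherwise.
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
    }


# Stand-in workload (assumption); replace with a real benchmark pass.
stats = repeat_timing(lambda: sum(range(100_000)), repeats=5)
print(stats)
```

The median is less sensitive than the mean to a single slow request, and the standard deviation gives a rough sense of how much latency variation affects a given environment. Note that with a caching backend, repeated runs after the first may be served from cache, so cold and warm runs should be reported separately.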

      @haiyangdeperci merged commit 31943d2 into activeloopai:master on Jan 29, 2021
      @haiyangdeperci
      Contributor

      @DebadityaPal Thanks again for the great work! I'll update the results.

      2 participants