Add benchmarking result for Bigquery 10gb datasets #707
Conversation
Codecov Report
@@ Coverage Diff @@
## main #707 +/- ##
=======================================
Coverage 93.45% 93.45%
=======================================
Files 42 42
Lines 1726 1726
Branches 213 213
=======================================
Hits 1613 1613
Misses 91 91
Partials 22 22
tests/benchmark/results.md
Outdated
| imdb/title_ratings_10mb.csv | 10MB | 19.40 |
| stackoverflow/stackoverflow_posts_1g.ndjson | 1GB | 30.26 |
| trimmed/pypi/* | 5GB | 59.90 |
| gs://astro-sdk/benchmark/trimmed/stackoverflow/10gb/ (10 files, 1GB each) | 10GB | 1.94 min |
Is this a new dataset? Maybe we should add it to the list in Dataset.md.
Why is this a new dataset?
Why didn't the previous dataset work? Should we still address the root cause?
Once we have more understanding of these two questions, we may want to re-run the benchmark for the other databases to use this new dataset, so we can have a fair comparison.
- Invalid rows with null values
- There was a type mismatch as well

The issue was mostly related to data cleaning, which is why I chose a different dataset that has worked for me in the past.
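For context, the kind of data cleaning described above can be sketched like this. This is a minimal, hypothetical example (the column names and sample rows are assumptions, not taken from the actual dataset): it drops rows with null values in required fields and coerces a column to the numeric type the target schema expects, the two failure modes mentioned in this thread.

```python
import io
import pandas as pd

# Hypothetical sample mimicking the problematic rows: nulls in required
# columns, and a numeric column that may fail to parse (type mismatch).
raw = io.StringIO(
    "id,title,score\n"
    "1,first post,10\n"
    "2,,\n"  # invalid row: null values
    "3,third post,7\n"
)

df = pd.read_csv(raw)

# Drop rows missing required fields.
df = df.dropna(subset=["title", "score"])

# Coerce the score column to the integer type the target schema expects;
# values that still fail to parse become NaN and are dropped.
df["score"] = pd.to_numeric(df["score"], errors="coerce")
df = df.dropna(subset=["score"]).astype({"score": "int64"})

print(df.to_dict("records"))
```

Cleaning before load avoids having the load job itself reject the file partway through, which is typically what surfaces as a failed 10GB benchmark run.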
Description
What is the current behavior?
When running benchmarks on 10GB files, we faced issues loading the data and running the benchmark.
closes: #702
What is the new behavior?
Added 10GB dataset benchmarking numbers. Added a new dataset that works with BigQuery.
Does this introduce a breaking change?
Nope
Checklist

- [X] Extended the README/documentation, if necessary

(cherry picked from commit 577095d)