Add benchmarking result for Bigquery 10gb datasets #707
Conversation
Codecov Report
@@ Coverage Diff @@
## main #707 +/- ##
=======================================
Coverage 93.45% 93.45%
=======================================
Files 42 42
Lines 1726 1726
Branches 213 213
=======================================
Hits 1613 1613
Misses 91 91
Partials 22 22
tests/benchmark/results.md
Outdated
| imdb/title_ratings_10mb.csv | 10MB | 19.40 |
| stackoverflow/stackoverflow_posts_1g.ndjson | 1GB | 30.26 |
| trimmed/pypi/* | 5GB | 59.90 |
| gs://astro-sdk/benchmark/trimmed/stackoverflow/10gb/ (10 files, 1GB each) | 10GB | 1.94 min |
Is this a new dataset? Maybe we should add it to the list in Dataset.md.
Why is this a new dataset?
Why didn't the previous dataset work? Should we still address the root cause?
Once we have more understanding of these two questions, we may want to re-run the benchmark for the other databases to use this new dataset, so we can have a fair comparison.
- Invalid rows with null values
- There was a type mismatch as well

The issue was mostly related to data cleaning, which is why I chose a different dataset that has worked for me in the past.
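For context, the kind of data cleaning described above can be sketched like this. This is a minimal, hypothetical example (the column names and sample rows are assumptions, not taken from the actual dataset): it drops rows with null values in required fields and coerces a column to the numeric type the target schema expects, the two failure modes mentioned in this thread.

```python
import io
import pandas as pd

# Hypothetical sample mimicking the problematic rows: nulls in required
# columns, and a numeric column that may fail to parse (type mismatch).
raw = io.StringIO(
    "id,title,score\n"
    "1,first post,10\n"
    "2,,\n"  # invalid row: null values
    "3,third post,7\n"
)

df = pd.read_csv(raw)

# Drop rows missing required fields.
df = df.dropna(subset=["title", "score"])

# Coerce the score column to the integer type the target schema expects;
# values that still fail to parse become NaN and are dropped.
df["score"] = pd.to_numeric(df["score"], errors="coerce")
df = df.dropna(subset=["score"]).astype({"score": "int64"})

print(df.to_dict("records"))
```

Cleaning before load avoids having the load job itself reject the file partway through, which is typically what surfaces as a failed 10GB benchmark run.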
Description
What is the current behavior?
When running benchmarks on 10GB files, we faced issues loading the data and running the benchmark.
closes: #702
What is the new behavior?
Added 10GB dataset benchmarking numbers. Added a new dataset that works with BigQuery.
Does this introduce a breaking change?
Nope
Checklist

- [X] Extended the README/documentation, if necessary

(cherry picked from commit 577095d)