Recommended Benchmarking Arrangement #13
Fact-to-dimension joins are of a very different degree of complexity than big-to-big joins, where the join column (even after distinct) does not fit into memory and the algorithm has to merge in batches and then reduce the batches to a single result. I am not sure the join task stresses that well enough, but that is the aim of q5.
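For readers unfamiliar with the batched approach described above, here is a minimal illustrative sketch (added for clarity, not from the thread) of a Grace-style hash join in plain Python: both inputs are partitioned on a hash of the join key into on-disk buckets, and only one pair of buckets is held in memory at a time. The function names, the single-column key, and the CSV format are assumptions for illustration only.

```python
import csv
import os
import tempfile
from collections import defaultdict

def partition_csv(path, key, n_buckets, out_dir, prefix):
    """Split a CSV into hash buckets on the join key so each bucket fits in memory."""
    handles, writers = {}, {}
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        fields = reader.fieldnames
        for row in reader:
            b = hash(row[key]) % n_buckets
            if b not in writers:
                handles[b] = open(os.path.join(out_dir, f"{prefix}_{b}.csv"), "w", newline="")
                writers[b] = csv.DictWriter(handles[b], fieldnames=fields)
                writers[b].writeheader()
            writers[b].writerow(row)
    for h in handles.values():
        h.close()
    return fields

def batched_join(left_path, right_path, key, out_path, n_buckets=64):
    """Inner-join two CSVs on `key`, processing one bucket pair at a time."""
    with tempfile.TemporaryDirectory() as tmp:
        left_fields = partition_csv(left_path, key, n_buckets, tmp, "l")
        right_fields = partition_csv(right_path, key, n_buckets, tmp, "r")
        out_fields = left_fields + [c for c in right_fields if c != key]
        with open(out_path, "w", newline="") as out:
            writer = csv.DictWriter(out, fieldnames=out_fields)
            writer.writeheader()
            for b in range(n_buckets):
                l_file = os.path.join(tmp, f"l_{b}.csv")
                r_file = os.path.join(tmp, f"r_{b}.csv")
                if not (os.path.exists(l_file) and os.path.exists(r_file)):
                    continue
                # Build an in-memory lookup for one right-side bucket only.
                lookup = defaultdict(list)
                with open(r_file, newline="") as rf:
                    for row in csv.DictReader(rf):
                        lookup[row[key]].append(row)
                # Stream the matching left-side bucket against it and reduce to output.
                with open(l_file, newline="") as lf:
                    for row in csv.DictReader(lf):
                        for match in lookup.get(row[key], []):
                            merged = dict(row)
                            merged.update({c: match[c] for c in right_fields if c != key})
                            writer.writerow(merged)
```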
For a batch query job, users will obviously configure it to output as many reports as possible; that is not a problem. Normally they will request a web interface rather than a Windows/Linux CLI, e.g. https://youtu.be/6hwbQmTXzMc. For example, a retail business group has a massive volume of invoices: the fact table contains item code, quantity and so on, but it does not have unit price or exchange rate. When an accountant needs to prepare an ad-hoc report based on a specific condition, e.g. selecting the transactions that represent 80% of sales revenue, the process must include joining the fact table with unit price and exchange rate in order to get the total amount for each invoice. I worked in accountancy for over two decades, so I understand how painful it is for accountants to deal with data: every working day involves routine and ad-hoc reports. Now I work as a programmer, and I have solved the performance issues of a full join, whether it runs on a <=32GB desktop PC or over billions of rows.
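As a concrete illustration of the ad-hoc report described above, here is a minimal sketch in recent Polars (added for clarity, not part of the thread). The file paths and column names (ItemCode, Currency, Quantity, UnitPrice, Rate) are hypothetical; the point is only the shape of the computation: join in price and rate, compute a base amount, then keep the largest transactions until they cover 80% of total revenue.

```python
import polars as pl

# Hypothetical inputs: the fact table has no unit price or exchange rate,
# so both must be joined in before invoice amounts can be computed.
fact = pl.scan_csv("Inbox/Invoices.csv")       # ItemCode, Currency, Quantity, ...
prices = pl.scan_csv("Inbox/UnitPrice.csv")    # ItemCode, UnitPrice
rates = pl.scan_csv("Inbox/ExchangeRate.csv")  # Currency, Rate

report = (
    fact.join(prices, on="ItemCode")
        .join(rates, on="Currency")
        .with_columns(
            (pl.col("Quantity") * pl.col("UnitPrice") * pl.col("Rate")).alias("BaseAmount")
        )
        .sort("BaseAmount", descending=True)
        # Running total of the largest transactions first.
        .with_columns(pl.col("BaseAmount").cum_sum().alias("Cumulative"))
        # Keep rows until they account for 80% of total revenue.
        .filter(pl.col("Cumulative") <= pl.col("BaseAmount").sum() * 0.8)
        .collect()
)
report.write_csv("Outbox/Top80PercentRevenue.csv")
```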
Hi @hkpeaks, thank you for the suggestions. If you have any questions about the scripts used to run the benchmark, feel free to ask them with reproducible steps.

As far as I know, the scripts in the repository are already tuned by the respective solution authors. As for every developer writing their own script to get the best performance: developers are encouraged to modify the scripts for their own solution and submit a PR.

For the Join Table test, yes, it is interesting, but as Jan said, big-to-big joins can be complex and stress the memory limits of a system, and it seems like every solution has trouble joining the 50GB files. If you would like to open a PR to add the Peaks solution to the repo, I would be happy to review it. You can use the work done in #12 and #11 as a template.
The H2O script is very complex for me and for users, so I prefer to write simple scripts for benchmarking. For the coming benchmark I plan to compare 3,000 files (300 million rows in total) across the software Polars, DuckDB, Peaks, and R data.table. But I am concerned whether the scripts I write for third-party software are optimized for performance, so could you help review whether the DuckDB script below is optimal?

***** Visitors of this page can recommend more software to be included in the benchmark, provided they supply the relevant scripts. *****

Test data: https://github.com/hkpeaks/peaks-consolidation/blob/main/Input/1.csv
******** DuckDB ********

```python
import duckdb
import time

s = time.time()
con = duckdb.connect()
con.execute("""copy (SELECT Ledger, Account, DC, Currency, SUM(Base_Amount) as Total_Base_Amount
    ...""")
e = time.time()
print("DuckDB FilterGroupBy CSV Time = {}".format(round(e-s,3)))
```

******** Polars ********

```python
import polars as pl

table1 = (
    ...
)
path = "Output/PolarsFilterGroupByCSV.csv"
e = time.time()
```

******** Data.Table ********

```r
library(data.table)

s <- Sys.time()
setDTthreads(10)
temp <- list.files("D:/Benchmark/Input/3000Files", full.names = TRUE, pattern = "\\.csv$")
e <- Sys.time()
```

******** Peaks Consolidation ********

```
Select{D:/Benchmark/Input/3000Files/*.csv | Ledger(L30..L70)}
```

In fact, I don't understand why so much open source software requires users to fill in excessive settings to do a simple task.
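The DuckDB and Polars snippets above were cut off when the comment was flattened. For reference, here is one plausible way the DuckDB script could look in full, assuming the same 3,000-file input folder, the Ledger(L30..L70) filter shown in the Peaks fragment, and an output path mirroring the Polars one; these details are reconstructions, not the author's original code.

```python
import duckdb
import time

s = time.time()
con = duckdb.connect()
# Read all 3,000 CSV files with a glob, filter on the assumed Ledger range,
# group by the four dimension columns, and write the result straight to CSV.
con.execute("""
    COPY (
        SELECT Ledger, Account, DC, Currency,
               SUM(Base_Amount) AS Total_Base_Amount
        FROM read_csv_auto('D:/Benchmark/Input/3000Files/*.csv')
        WHERE Ledger BETWEEN 'L30' AND 'L70'
        GROUP BY Ledger, Account, DC, Currency
    ) TO 'Output/DuckDBFilterGroupByCSV.csv' (HEADER, DELIMITER ',')
""")
e = time.time()
print("DuckDB FilterGroupBy CSV Time = {}".format(round(e - s, 3)))
```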
I'm not sure what you want me to do here. I can say that the DuckDB code looks fine, and I am happy to hear that it finishes in a timely manner. The scope of issues here is db-benchmark issues, feature requests, additional solutions, etc. It seems like you would like to add Peaks to the benchmark, and potentially add more benchmarking categories. I don't plan on writing the infrastructure to add Peaks to the benchmark, as the project is not yet popular enough for me to invest the time and effort. If there's anything else I can help you with that is an issue or feature request, let me know; otherwise I will close this issue as not planned.
I have tested the latest version of Polars, 0.19.1, which can process a 10 billion-row join table with a file size of 231GB (output 389GB). Please consider using the sink_csv function for your next benchmarking exercise.

```python
import polars as pl
from datetime import datetime

start_time = datetime.now()

master = pl.scan_csv("Inbox/Master.csv")
fact_table = pl.scan_csv("Inbox/10000M-Fact.csv")

result = fact_table.join(master, on=["Product", "Style"])
result.sink_csv("Outbox/PolarsJoinResult.csv")

elapsed = datetime.now() - start_time
```
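For context (not stated in the thread): sink_csv executes the lazy query with Polars' streaming engine and writes the output to disk in batches as they are produced, rather than collecting the full result in memory, which is what makes a 231GB join feasible on a machine with far less RAM.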
I would like to thank the DuckDB team for keeping this benchmark going! It can be expensive to run benchmark tests for many software packages. As a single developer working on my current dataframe project, I run into problems if I spend too much time learning how to configure other software for benchmarking.
However, I suggest that if each software developer writes their own script, the results will be fairer, as different script settings result in different performance. Software developers must be able to configure their scripts for optimal performance.
It is unbelievable that none of the software in the benchmark can complete the JoinTable test.
I use a 67GB CSV fact table of 1 billion rows x 14 columns joined to a master table of 99,696 rows, as a full join on a 3-column composite key, returning a 90GB CSV file. Peaks can complete this JoinTable test using only 32GB of memory. See https://github.com/hkpeaks/peaks-consolidation/tree/main/Polars-PeaksBenchmarking for more information.
You can use the Peaks pre-release to see if it can complete the JoinTable test on your much better machine: https://github.com/hkpeaks/peaks-consolidation/releases. I would like to know the results, as I do not have a Google Cloud account. I am exploring which cloud services accept PayPal for prepayment; using a credit card for online payment is too risky for me.
I suggest that your coming benchmark cover the following categories, selecting all records except in the filter test:
This way it can be more easily understood by business users.