Add benchmarks for testing row filtering #3769
Conversation
```rust
];

let filter_matrix = vec![
    // Selective-ish filter
```
well-defined test case and test data! 👍
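For reference, a filter matrix along these lines could be sketched with DataFusion's `Expr` builders. This is a hypothetical illustration, not the PR's actual predicate list; the column names (`request_method`, `response_status`) are taken from the generated logs schema shown later in this thread.

```rust
use datafusion::prelude::{col, lit};

// Hypothetical filter matrix mixing selectivities; the benchmark's real
// predicates may differ. Each entry is a logical expression evaluated
// against the generated access-log dataset.
let filter_matrix = vec![
    // Selective-ish filter
    col("request_method").eq(lit("GET")),
    // Non-selective filter
    col("request_method").not_eq(lit("GET")),
    // Selective filter on a numeric column
    col("response_status").eq(lit(400_u16)),
];
```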
```rust
path: PathBuf,

/// Batch size when reading Parquet files
#[structopt(short = "s", long = "batch-size", default_value = "8192")]
```
I think there are two short options `'s'`.
In fact, when you run the example in debug mode it asserts on exactly this problem:

```
alamb@aal-dev:~/arrow-datafusion$ cargo run --bin parquet_filter_pushdown -- --path ./data --scale-factor 1.0
...
     Running `target/debug/parquet_filter_pushdown --path ./data --scale-factor 1.0`
thread 'main' panicked at 'Argument short must be unique

	-s is already in use', /home/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/clap-2.34.0/src/app/parser.rs:190:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```

A separate compile error also appears in the captured output:

```
error[E0433]: failed to resolve: use of undeclared type `WriterProperties`
   --> benchmarks/src/bin/parquet_filter_pushdown.rs:235:17
    |
235 |     let props = WriterProperties::builder()
    |                 ^^^^^^^^^^^^^^^^ not found in this scope
```
Yeah, just removed one of them. I don't think batch size needs to be a CLI option in this benchmark.
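To make the collision concrete, here is a minimal sketch (the field names are assumptions for illustration, not the benchmark's exact definitions): clap asserts at startup, in debug builds, that every short flag is unique, so two fields both declaring `short = "s"` trigger the "Argument short must be unique" panic. Dropping the short form on one of them resolves it.

```rust
use structopt::StructOpt;

#[derive(Debug, StructOpt)]
struct Opt {
    /// Scale factor for generated data (hypothetical field owning `-s`)
    #[structopt(short = "s", long = "scale-factor", default_value = "1.0")]
    scale_factor: f32,

    /// Batch size when reading Parquet files.
    /// Before the fix this also declared `short = "s"`, which collides with
    /// the option above and panics clap at argument-parser construction time;
    /// here only the long form remains.
    #[structopt(long = "batch-size", default_value = "8192")]
    batch_size: usize,
}
```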
@thinkharderdev thanks for your great bench. There are at most two pages per column; I think if we adjust the setup to get more pages per column (e.g. by reducing the page size), it will show a greater improvement when filtering is enabled. FYI, I see Impala chooses a fixed row count per page when benchmarking in order to get good performance.
Thank you @thinkharderdev -- I plan to review this PR in detail later today
alamb left a comment
This looks great -- thank you @thinkharderdev
I also verified the parquet file that was created:

```
$ du -s -h /tmp/data/logs.parquet
988M	/tmp/data/logs.parquet
```

It looks good to me (using the neat pqrs tool from @manojkarthick):
```
alamb@aal-dev:~/2022-10-05-slow-query-high-cardinality$ pqrs cat --csv /tmp/data/logs.parquet | head
############################
File: /tmp/data/logs.parquet
############################
service,host,pod,container,image,time,client_addr,request_duration_ns,request_user_agent,request_method,request_host,request_bytes,response_bytes,response_status
backend,i-1ec3ca3151468928.ec2.internal,aqcathnxqsphdhgjtgvxsfyiwbmhlmg,backend_container_0,backend_container_0@sha256:30375999bf03beec2187843017b10c9e88d8b1a91615df4eb6350fb39472edd9,1970-01-01T00:00:00.000000000+00:00,127.216.178.64,-1261239112,rkxttrfiiietlsaygzphhwlqcgngnumuphliejmxfdznuurswhdcicrlprbnocibvsbukiohjjbjdygwbfhxqvurm,PUT,https://backend.mydomain.com,-312099516,1448834362,200
backend,i-1ec3ca3151468928.ec2.internal,aqcathnxqsphdhgjtgvxsfyiwbmhlmg,backend_container_0,backend_container_0@sha256:30375999bf03beec2187843017b10c9e88d8b1a91615df4eb6350fb39472edd9,1970-01-01T00:00:00.000001024+00:00,187.49.24.179,1374800428,sdxkctvmvuqxhwigrhjaouwdzvasqlqphymcgqvfmsbjswswnzgvanmalnmvsvruakcudmqvzateabhlya,PATCH,https://backend.mydomain.com,-1363067408,111176598,204
backend,i-1ec3ca3151468928.ec2.internal,aqcathnxqsphdhgjtgvxsfyiwbmhlmg,backend_container_0,backend_container_0@sha256:30375999bf03beec2187843017b10c9e88d8b1a91615df4eb6350fb39472edd9,1970-01-01T00:00:00.000002048+00:00,14.29.229.168,-1795280692,bhlvymbbtgcqrwzujukyotusnsoidygnklhx,GET,https://backend.mydomain.com,-1323615082,-705662117,400
backend,i-1ec3ca3151468928.ec2.internal,aqcathnxqsphdhgjtgvxsfyiwbmhlmg,backend_container_0,backend_container_0@sha256:30375999bf03beec2187843017b10c9e88d8b1a91615df4eb6350fb39472edd9,1970-01-01T00:00:00.000003072+00:00,180.188.29.17,-717290117,hjaynltdswdekcguqmrkucsepzqjhasklmimkibabijihitimmsglgettywifdzmraipvyvekczuwxettayslrffyz,HEAD,https://backend.mydomain.com,-1847395296,1206750179,200
backend,i-1ec3ca3151468928.ec2.internal,aqcathnxqsphdhgjtgvxsfyiwbmhlmg,backend_container_0,backend_container_0@sha256:30375999bf03beec2187843017b10c9e88d8b1a91615df4eb6350fb39472edd9,1970-01-01T00:00:00.000004096+00:00,68.92.115.208,759902764,yupopowlaqbwskdwvtlitugpzzxoajhvnmndhca,DELETE,https://backend.mydomain.com,-50170254,-415949533,403
backend,i-1ec3ca3151468928.ec2.internal,aqcathnxqsphdhgjtgvxsfyiwbmhlmg,backend_container_0,backend_container_0@sha256:30375999bf03beec2187843017b10c9e88d8b1a91615df4eb6350fb39472edd9,1970-01-01T00:00:00.000005120+00:00,230.160.203.201,-1271567754,pwbruedgdgtsavjuksxwkecxulbnjbsaltuvcjxcmblhnraawouvrunwwsmvjbq,GET,https://backend.mydomain.com,-1193079450,1281912293,204
backend,i-1ec3ca3151468928.ec2.internal,aqcathnxqsphdhgjtgvxsfyiwbmhlmg,backend_container_0,backend_container_0@sha256:30375999bf03beec2187843017b10c9e88d8b1a91615df4eb6350fb39472edd9,1970-01-01T00:00:00.000006144+00:00,249.254.50.191,-971196614,amtuqookzibtvrtqfnyzuyesikbrafhcfnjhoaoedvmlwpkypfsedtbbwlbnzigwgjpzcwdxtwhrykcibmhlxnkckynvgli,PATCH,https://backend.mydomain.com,-262774709,-1695212300,204
backend,i-1ec3ca3151468928.ec2.internal,aqcathnxqsphdhgjtgvxsfyiwbmhlmg,backend_container_0,backend_container_0@sha256:30375999bf03beec2187843017b10c9e88d8b1a91615df4eb6350fb39472edd9,1970-01-01T00:00:00.000007168+00:00,77.183.81.164,-547300163,ogkufdxssjqzjphxwvegwvofchpsgntbyslgarcyqcawokzfoppdftoctmtlwcvikazwrujlgrzrlqueaaceibxvdicfhp,HEAD,https://backend.mydomain.com,-1349820595,-327759246,204
backend,i-1ec3ca3151468928.ec2.internal,aqcathnxqsphdhgjtgvxsfyiwbmhlmg,backend_container_0,backend_container_0@sha256:30375999bf03beec2187843017b10c9e88d8b1a91615df4eb6350fb39472edd9,1970-01-01T00:00:00.000008192+00:00,63.17.88.115,-88404773,ogardohhoorttptpnkxmvyenqfzvvkjabcrfwapoywttjdunvmlgwgstmsjbefxqta,HEAD,https://backend.mydomain.com,1830978558,,200
Error: ArrowReadWriteError(CsvError("Broken pipe (os error 32)"))
```

(The trailing error is just `head` closing the pipe.)

```rust
let generator = Generator::new();

let file = File::create(&path).unwrap();
let mut writer = ArrowWriter::try_new(file, generator.schema.clone(), None).unwrap();
```
I wonder if we should make the writer properties used here explicit? For example, we could explicitly set what type of statistics are created, as well as potentially setting ZSTD compression:
https://docs.rs/parquet/24.0.0/parquet/file/properties/struct.WriterPropertiesBuilder.html
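A sketch of what explicit properties might look like, using the `WriterPropertiesBuilder` API linked above. The specific settings here are a suggestion for illustration, not what the PR actually configures:

```rust
use parquet::basic::Compression;
use parquet::file::properties::{EnabledStatistics, WriterProperties};

// Illustrative explicit writer configuration:
let props = WriterProperties::builder()
    // ZSTD keeps the generated file closer to production layouts
    .set_compression(Compression::ZSTD)
    // page-level statistics enable page pruning in the reader
    .set_statistics_enabled(EnabledStatistics::Page)
    .build();

// Then pass the properties to the writer instead of `None`:
// let mut writer = ArrowWriter::try_new(file, generator.schema.clone(), Some(props)).unwrap();
```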
Yeah, I'll need to revisit this again once apache/arrow-rs#2854 is released and pulled in, so we can generate the files with proper page sizes (which should make a significant difference).
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
I stole the gen code from @tustvold so you know it works :)
The CI check failure is unrelated: #3798
```rust
combine_filters(&[
    col("request_method").not_eq(lit("GET")),
    col("response_status").eq(lit(400_u16)),
    // TODO this fails in the FilterExec with Error: Internal("The type of Dictionary(Int32, Utf8) = Utf8 of binary physical should be same")
```
Coercion!
Benchmark runs are scheduled for baseline = ae5b23e and contender = fb39d5d. fb39d5d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #3457
Rationale for this change
We need a set of benchmarks for evaluating the performance implications of Parquet predicate pushdown. This PR sets up some very basic benchmarks that can be used for that purpose. Thanks to @tustvold for cooking up a script to generate synthetic datasets for this purpose.
What changes are included in this PR?
What changes are included in this PR?
Add a new benchmark script, `parquet_filter_pushdown`, which executes a series of `ParquetExec` plans with different filter predicates. For each predicate in the suite, the plan is executed with all three different `ParquetScanOptions` configurations.
Are there any user-facing changes?
No
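For context, the `ParquetScanOptions` matrix mentioned above could look roughly like the sketch below. This is a hypothetical reconstruction: the struct and field names (`pushdown_filters`, `reorder_filters`) are assumptions for illustration and may not match the PR's exact definitions.

```rust
// Hypothetical sketch of the benchmark's option matrix: a baseline with no
// pushdown, pushdown enabled, and pushdown with filter reordering.
#[derive(Debug, Clone, Copy)]
struct ParquetScanOptions {
    pushdown_filters: bool,
    reorder_filters: bool,
}

fn scan_options_matrix() -> [ParquetScanOptions; 3] {
    [
        // baseline: filters applied after the scan
        ParquetScanOptions { pushdown_filters: false, reorder_filters: false },
        // filters pushed down into the Parquet decoder
        ParquetScanOptions { pushdown_filters: true, reorder_filters: false },
        // pushdown plus cost-based reordering of the predicates
        ParquetScanOptions { pushdown_filters: true, reorder_filters: true },
    ]
}

fn main() {
    for opts in scan_options_matrix() {
        println!("{:?}", opts);
    }
}
```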