TEST: enable pushdown_filters and reorder_filters by default #18873
base: main
🤖: Benchmark completed Details
I am also testing with just
I am going to focus my efforts on profiling the queries that seem to have slowed down the most:
Here is the query:
set datafusion.execution.parquet.binary_as_string = true;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
Basically, my next steps are to profile these queries and see what is slower (and if it is related to filter representation, I will go focus on apache/arrow-rs#8902) |
Looks like we are very close! FYI, there are a couple more that are slower than query 24: |
I did some more analysis:
The idea is to isolate why filter pushdown is slowing down clickbench q24.
See more details here: #18873
This is after upgrading to arrow 57.1.0.

The only difference in the two binaries is if filter pushdown is on by default:
-rwxr-xr-x@ 1 andrewlamb staff 81331152 Nov 23 07:31 datafusion-cli-alamb_upgrade_arrow_57.1.0
-rwxr-xr-x@ 1 andrewlamb staff 81331152 Nov 22 07:57 datafusion-cli-almab_pushdown_no_reorder

Using hits partitioned dataset:
ln -s ~/Software/datafusion/benchmarks/data/hits_partitioned ./hits

Here is q24.sql:
set datafusion.execution.parquet.binary_as_string = true;
-- turn on pushdown (is hard coded)
-- set datafusion.execution.parquet.pushdown_filters = true;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;

You can see the pushdown is slightly slower:
./datafusion-cli-almab_pushdown_no_reorder -f q24.sql | grep Elapsed
Elapsed 0.000 seconds.
Elapsed 0.183 seconds.
Elapsed 0.154 seconds.
Elapsed 0.155 seconds.
Elapsed 0.153 seconds.
Elapsed 0.154 seconds.
Elapsed 0.154 seconds.
Elapsed 0.150 seconds.
Elapsed 0.154 seconds.
Elapsed 0.156 seconds.
Elapsed 0.152 seconds.

./datafusion-cli-alamb_upgrade_arrow_57.1.0 -f q24.sql | grep Elapsed
Elapsed 0.002 seconds.
Elapsed 0.164 seconds.
Elapsed 0.137 seconds.
Elapsed 0.137 seconds.
Elapsed 0.133 seconds.
Elapsed 0.132 seconds.
Elapsed 0.135 seconds.
Elapsed 0.131 seconds.
Elapsed 0.137 seconds.
Elapsed 0.137 seconds.
Elapsed 0.133 seconds.

So let's profile what the pushdown one is doing.
So more than 5% of the time is being spent converting filters back and forth. Thus, this gives me more motivation to keep working on |
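
As a rough summary of the `Elapsed` timings quoted above, the sketch below averages the nine warm runs for each binary and reports the relative slowdown. It assumes the first `Elapsed` line of each run belongs to the `set` statement and the second to the cold first query, so both are left out; the numbers are copied from the output above, and the helper name `mean` is just for this example.

```shell
#!/bin/sh
# Warm-run timings copied from the output above (seconds).
# Skipped on purpose: the `set` statement line and the first (cold) query.
pushdown="0.154 0.155 0.153 0.154 0.154 0.150 0.154 0.156 0.152"
baseline="0.137 0.137 0.133 0.132 0.135 0.131 0.137 0.137 0.133"

mean() {
    # Print the arithmetic mean of the whitespace-separated arguments.
    echo "$@" | awk '{ s = 0; for (i = 1; i <= NF; i++) s += $i; printf "%.4f\n", s / NF }'
}

pushdown_mean=$(mean $pushdown)
baseline_mean=$(mean $baseline)

# Relative slowdown of the pushdown binary versus the baseline, in percent.
slowdown=$(awk -v p="$pushdown_mean" -v b="$baseline_mean" \
    'BEGIN { printf "%.1f\n", (p - b) / b * 100 }')

echo "pushdown mean: ${pushdown_mean}s"
echo "baseline mean: ${baseline_mean}s"
echo "slowdown:      ${slowdown}%"
```

On these numbers the pushdown binary averages about 0.154s against about 0.135s, roughly 14% slower on warm runs.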

(I am using this PR to test; I don't intend to merge it yet.)
Which issue does this PR close?
filter_pushdown) by default #3463
Rationale for this change
We have made non-trivial progress in filter representation in Parquet. Let's see where performance is now.
What changes are included in this PR?
arrow, parquet 57.1.0 #18820
pushdown_filters and reorder_filters
Are these changes tested?
By CI tests
Are there any user-facing changes?
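
The two options named above can also be set per session. As a minimal sketch, assuming a `datafusion-cli` build where they are not already on by default, the script below writes a variant of q24.sql that enables both explicitly before repeating the query. The output file name `q24_pushdown.sql` is an arbitrary choice for this example.

```shell
#!/bin/sh
# Sketch: build a q24 script that turns both options on explicitly.
# The file name is hypothetical; pick whatever suits the local setup.
out=q24_pushdown.sql

cat > "$out" <<'EOF'
set datafusion.execution.parquet.binary_as_string = true;
set datafusion.execution.parquet.pushdown_filters = true;
set datafusion.execution.parquet.reorder_filters = true;
EOF

# Repeat the query 10 times so the cold first run can be told apart
# from the warm runs that follow.
i=0
while [ "$i" -lt 10 ]; do
    cat >> "$out" <<'EOF'
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
EOF
    i=$((i + 1))
done

echo "wrote $out; run it with: datafusion-cli -f $out | grep Elapsed"
```

Running the generated file against the same `./hits` symlink used earlier should give one `Elapsed` line per statement, which makes it directly comparable with the timings in the conversation.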