Why do you want this feature?
I want this feature, for the following scenario:
Imagine that we have a parquet file with dimension and fact columns, for example:
dimension columns: store_code, store_region, store_nation, ...
fact columns: net_sales_current_year, net_sales_previous_year, ...
Now we want to build a pretty UI where the user can run some small BI analyses on top of this parquet file.
For simple queries, the filter pushdown on the parquet file works fantastically.
But for a more complex scenario, like this query:
SELECT * FROM {parquet} WHERE store_code IN ('k1', 'k3', 'k999')
the query is unable to push down the filter: not only the "store_code" filter, but the entire filtering condition.
Because of this problem, the query takes almost 4 minutes instead of 2 seconds:
EXPLAIN ANALYZE without IN:
Same query, where I use the IN operator to get 2 store codes:
Looking at the EXPLAIN ANALYZE output, DuckDB is loading the entire dataset. Moreover, it is not even applying the file filter pushdown on the 2 columns used in the hive partitioning.
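For anyone who wants to reproduce the comparison, here is a minimal sketch that builds the two EXPLAIN ANALYZE statements (one with a plain equality filter, one with IN). The helper names `sql_literal` and `build_explain_queries` and the path `sales/*/*/*.parquet` are hypothetical placeholders, not taken from my real setup. It also assumes the data is read through `read_parquet(..., hive_partitioning = true)`.

```python
def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_explain_queries(parquet_path: str, column: str, values: list[str]) -> dict[str, str]:
    """Build two EXPLAIN ANALYZE statements: an equality filter and an IN filter.

    `column` is inserted as-is, so it must be a trusted identifier.
    """
    # Assumption: the dataset is hive-partitioned and read via read_parquet.
    source = f"read_parquet({sql_literal(parquet_path)}, hive_partitioning = true)"
    in_list = ", ".join(sql_literal(v) for v in values)
    return {
        # Baseline: single-value equality filter.
        "equality": (
            f"EXPLAIN ANALYZE SELECT * FROM {source} "
            f"WHERE {column} = {sql_literal(values[0])}"
        ),
        # Problem case: the same column filtered with an IN list.
        "in_list": (
            f"EXPLAIN ANALYZE SELECT * FROM {source} "
            f"WHERE {column} IN ({in_list})"
        ),
    }


# Hypothetical path; replace it with the real location of the parquet files.
queries = build_explain_queries("sales/*/*/*.parquet", "store_code", ["k1", "k3", "k999"])
for name, query in queries.items():
    print(f"-- {name}")
    print(query)
```

Running both printed statements in DuckDB and comparing the filters reported on the parquet scan operator should show whether the IN condition was pushed down or applied after a full scan.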