@Sevenannn Sevenannn commented Oct 26, 2024

Which issue does this PR close?

NA

Rationale for this change

In ParquetExec, when filter_pushdown is not enabled, predicates are simply ignored, causing incorrect results for queries with filters pushed down in TableScan.

For example, for the following query that's supposed to return empty results:

with tmp as (
  select ss_quantity, 's' sale_type from store_sales)
select * from tmp where sale_type = 'w';

The predicate=false in the physical plan simply gets ignored by the ParquetExec implementation, so the scan returns every row instead of none.

+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                                                                                                            |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | SubqueryAlias: tmp                                                                                                                                                                                                                                                                                                                                                                              |
|               |   Projection: store_sales.ss_quantity, Utf8("s") AS sale_type                                                                                                                                                                                                                                                                                                                                   |
|               |     BytesProcessedNode                                                                                                                                                                                                                                                                                                                                                                          |
|               |       TableScan: store_sales projection=[ss_quantity], full_filters=[Boolean(false) AS Utf8("s") = Utf8("w")]                                                                                                                                                                                                                                                                                   |
| physical_plan | ProjectionExec: expr=[ss_quantity@0 as ss_quantity, s as sale_type]                                                                                                                                                                                                                                                                                                                             |
|               |   BytesProcessedExec                                                                                                                                                                                                                                                                                                                                                                            |
|               |     ParquetExec: file_groups={10 groups: [[tpcds/store_sales/store_sales.parquet:0..15420768], [tpcds/store_sales/store_sales.parquet:15420768..30841536], [tpcds/store_sales/store_sales.parquet:30841536..46262304], [tpcds/store_sales/store_sales.parquet:46262304..61683072], [tpcds/store_sales/store_sales.parquet:61683072..77103840], ...]}, projection=[ss_quantity], predicate=false |
|               |                                                                                                                                                                                                                                                                                                                                                                                                 |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

What changes are included in this PR?

The changes in this PR ensure that predicates are not dropped when filter_pushdown is not enabled:

  • When filter_pushdown is not enabled, use the predicates to evaluate and filter each RecordBatch returned from Parquet.
  • Add a function that recursively updates the column indexes of the filter to match the schema of the RecordBatch.
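
To make the two points above concrete, here is a minimal, std-only sketch of the idea. It is not the code in this PR and does not use DataFusion or Arrow types: `Expr`, `Value`, `Batch`, `reindex`, and `filter_batch` are hypothetical stand-ins for a physical expression tree, a scalar value, a RecordBatch, the index-rewriting function, and the post-scan filter step. It assumes all columns hold i64 values and that columns are matched by name.

```rust
/// Toy predicate tree (hypothetical stand-in for a physical expression).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// Column reference: name plus the index it was planned against.
    Column { name: String, index: usize },
    /// Integer literal.
    Int(i64),
    /// Boolean literal, e.g. the `predicate=false` case.
    Bool(bool),
    /// Equality comparison.
    Eq(Box<Expr>, Box<Expr>),
    /// Logical conjunction.
    And(Box<Expr>, Box<Expr>),
}

/// Result of evaluating an expression for one row.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Value {
    Int(i64),
    Bool(bool),
}

/// Toy record batch: column names plus column-major i64 data.
struct Batch {
    names: Vec<String>,
    columns: Vec<Vec<i64>>,
}

/// Recursively rewrite column indexes so they match `schema`
/// (the column order of the batch actually returned by the scan).
fn reindex(expr: &Expr, schema: &[String]) -> Result<Expr, String> {
    match expr {
        Expr::Column { name, .. } => {
            let index = schema
                .iter()
                .position(|n| n == name)
                .ok_or_else(|| format!("column {} not found in batch schema", name))?;
            Ok(Expr::Column { name: name.clone(), index })
        }
        Expr::Eq(l, r) => Ok(Expr::Eq(
            Box::new(reindex(l, schema)?),
            Box::new(reindex(r, schema)?),
        )),
        Expr::And(l, r) => Ok(Expr::And(
            Box::new(reindex(l, schema)?),
            Box::new(reindex(r, schema)?),
        )),
        // Literals carry no column indexes, so they are copied unchanged.
        other => Ok(other.clone()),
    }
}

/// Evaluate an (already reindexed) expression for a single row.
fn eval(expr: &Expr, batch: &Batch, row: usize) -> Value {
    match expr {
        Expr::Column { index, .. } => Value::Int(batch.columns[*index][row]),
        Expr::Int(v) => Value::Int(*v),
        Expr::Bool(b) => Value::Bool(*b),
        Expr::Eq(l, r) => Value::Bool(eval(l, batch, row) == eval(r, batch, row)),
        Expr::And(l, r) => Value::Bool(
            eval(l, batch, row) == Value::Bool(true)
                && eval(r, batch, row) == Value::Bool(true),
        ),
    }
}

/// Apply the predicate after the scan: keep only rows where it is true.
fn filter_batch(batch: &Batch, predicate: &Expr) -> Result<Batch, String> {
    let predicate = reindex(predicate, &batch.names)?;
    let num_rows = batch.columns.first().map_or(0, |c| c.len());
    let keep: Vec<usize> = (0..num_rows)
        .filter(|&row| eval(&predicate, batch, row) == Value::Bool(true))
        .collect();
    let columns: Vec<Vec<i64>> = batch
        .columns
        .iter()
        .map(|col| keep.iter().map(|&r| col[r]).collect())
        .collect();
    Ok(Batch { names: batch.names.clone(), columns })
}
```

In this toy model, calling `filter_batch` with `Expr::Bool(false)` on a batch that contains only `ss_quantity` yields a batch with zero rows, which is the expected result for the example query above. A predicate that references `ss_quantity` at its table-schema index is first rewritten to index 0 of the projected batch and only then evaluated.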

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added the core Core DataFusion crate label Oct 26, 2024
@Sevenannn Sevenannn marked this pull request as ready for review October 26, 2024 01:51
@Sevenannn Sevenannn marked this pull request as draft October 26, 2024 02:00
@Sevenannn Sevenannn closed this Nov 26, 2024