
[SPARK-40169][SQL] Don't pushdown Parquet filters with no reference to data schema #37881

Closed
wants to merge 2 commits

Conversation

sunchao
Member

@sunchao sunchao commented Sep 14, 2022

What changes were proposed in this pull request?

Currently in the Parquet V1 read path, Spark pushes down data filters even if they have no reference in the Parquet read schema. This can cause correctness issues as described in SPARK-39833.

The root cause appears to be that in the V1 path, we first use `AttributeReference` equality to remove partition columns from the data columns, and then use `AttributeSet` equality to find the filters that reference only data columns.
The two steps are inconsistent when the case-sensitivity check is false.

Take the following scenario as an example:

  • data column: `[COL, a]`
  • partition column: `[col]`
  • filter: `col > 10`

With `AttributeReference` equality, `COL` is not considered equal to `col` (their names differ), so the filtered data column set is still `[COL, a]`. However, when calculating the filters that reference only data columns, `COL` is considered equal to `col`. Consequently, the filter `col > 10`, when checked against `[COL, a]`, is treated as referencing a data column, and is pushed down to Parquet as a data filter.

On the Parquet side, since `col` doesn't exist in the file schema (it only has `COL`), Parquet will return an incorrect number of rows when the column index is enabled. See PARQUET-2170 for more detail.

In general, when data columns overlap with partition columns and case sensitivity is false, partition filters are not filtered out before we calculate the filters that reference only data columns, which is incorrect.
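The inconsistency can be sketched with a simplified, hypothetical model — a plain case class standing in for Spark's attribute machinery, where case-insensitive resolution has already assigned `COL` and `col` the same expression ID (`Attr` and `PushdownSketch` are illustrative names, not Spark's classes):

```scala
// Hypothetical stand-in for a Spark attribute: a name plus an expression ID.
// Under case-insensitive analysis, COL and col resolve to the SAME exprId.
case class Attr(name: String, exprId: Long)

object PushdownSketch {
  val dataColumns      = Seq(Attr("COL", 1L), Attr("a", 2L))
  val partitionColumns = Seq(Attr("col", 1L)) // same exprId as COL

  // Step 1 (buggy): case-class equality also compares names, so COL != col
  // and the partition column is NOT removed from the data columns.
  val dataColumnsWithoutPartitionCols: Seq[Attr] =
    dataColumns.filterNot(partitionColumns.contains)

  // Step 2: reference checking goes by exprId only, so the filter on `col`
  // (exprId 1) looks like it references a data column.
  val filterRef = Attr("col", 1L) // the reference inside `col > 10`
  val treatedAsDataFilter: Boolean =
    dataColumnsWithoutPartitionCols.map(_.exprId).contains(filterRef.exprId)

  def main(args: Array[String]): Unit = {
    assert(dataColumnsWithoutPartitionCols.map(_.name) == Seq("COL", "a"))
    assert(treatedAsDataFilter) // `col > 10` is wrongly pushed down
  }
}
```

Running the sketch shows step 1 leaving `COL` in place and step 2 then matching the partition filter's exprId against it.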

Why are the changes needed?

This fixes the correctness bug described in SPARK-39833.

Does this PR introduce any user-facing change?

No

How was this patch tested?

There are existing test cases for this issue from SPARK-39833. This PR also modifies them to cover the scenarios where case sensitivity is on or off.

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM. (Pending CIs). Thank you, @sunchao .

cc @viirya and @huaxingao

@github-actions github-actions bot added the SQL label Sep 14, 2022
Member

@viirya viirya left a comment


Hmm, do we have any test case for this situation?

@sunchao
Member Author

sunchao commented Sep 14, 2022

@viirya please see test cases added in SPARK-39833.

cc @sadikovi @cloud-fan @HyukjinKwon too.

@@ -186,10 +186,10 @@ object FileSourceStrategy extends Strategy with PredicateHelper with Logging {

// Partition keys are not available in the statistics of the files.
// `dataColumns` might have partition columns, we need to filter them out.
-    val dataColumnsWithoutPartitionCols = dataColumns.filterNot(partitionColumns.contains)
+    val dataColumnsWithoutPartitionCols = AttributeSet(dataColumns) -- partitionColumns
Contributor


I am not sure this is correct. Can you elaborate?

Contributor

@sadikovi sadikovi Sep 15, 2022


I am not an expert in `AttributeSet`, so it would be good if you could explain how it makes this work so I can reference it in the future 😄.

Member Author


@sadikovi np. I explained a bit in the PR description. Let me add more details.

There are two steps when calculating data filters for the V1 file source:

  1. compute `dataColumnsWithoutPartitionCols`
  2. call `extractPredicatesWithinOutputSet` with `dataColumnsWithoutPartitionCols` on the filters, to obtain the data-only filters that are supposed to be pushed down to data sources

In the first step, the equality check is done via `AttributeReference.equals`, which compares the attribute name, among other things.

In the second step, however, equality is checked via `AttributeEquals.equals`, which only compares the expression ID.

This inconsistency poses an issue when the case-sensitivity check is false (the default behavior). For the example in the PR description:

  • data column: `[COL, a]`
  • partition column: `[col]`
  • filter: `col > 10`

The expression IDs for the data column `COL` and the partition column `col` are the same because of case insensitivity. In the first step above, however, `COL` and `col` are not considered equal, so `dataColumnsWithoutPartitionCols` is still `[COL, a]`. In the second step, the data filters are calculated using `AttributeEquals.equals`, and `COL` is treated as a data column. As a result, the filter `col > 10` is considered a data filter and pushed down.

In general, when data columns overlap with partition columns and case sensitivity is false, the first step does not filter out partition columns, so they are still used in the second step. This is incorrect.
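Under the fix, step 1 switches to the same exprId-based (`AttributeSet`-style) equality as step 2, so the shadowed column is removed. A minimal sketch with a hypothetical `Attr` case class (name plus expression ID — illustrative only, not Spark's actual classes):

```scala
// Hypothetical attribute: under case-insensitive analysis, data column COL
// and partition column col share exprId 1.
case class Attr(name: String, exprId: Long)

object FixSketch {
  val dataColumns      = Seq(Attr("COL", 1L), Attr("a", 2L))
  val partitionColumns = Seq(Attr("col", 1L))

  // The fix: make step 1 compare by exprId, consistent with step 2.
  val partitionExprIds: Set[Long] = partitionColumns.map(_.exprId).toSet
  val dataColumnsWithoutPartitionCols: Seq[Attr] =
    dataColumns.filterNot(a => partitionExprIds.contains(a.exprId))

  def main(args: Array[String]): Unit = {
    // COL is now removed along with col, so only `a` remains ...
    assert(dataColumnsWithoutPartitionCols.map(_.name) == Seq("a"))
    // ... and the filter col > 10 (exprId 1) no longer matches any data
    // column: it is treated as a partition filter, not pushed to Parquet.
    assert(!dataColumnsWithoutPartitionCols.map(_.exprId).contains(1L))
  }
}
```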

Contributor


Thanks!

@cloud-fan
Contributor

This seems like a corner case where data columns and partition columns overlap (assuming you didn't set the case-sensitivity flag to true).

When data columns and partition columns overlap, Spark reads the actual values from the partition columns and ignores the overlapping data columns; see `HadoopFsRelation.schema`. That said, in your example the filter `col > 10` should be a partition filter, not a data filter.

val dataColumnsWithoutPartitionCols = dataColumns.filterNot(partitionColumns.contains)
Contributor


I think the fix should be `dataColumns.filterNot(partitionSet.contains)`

Member Author


I think semantically both are the same. In

`AttributeSet(dataColumns) -- partitionColumns`

`partitionColumns` is first wrapped into an `AttributeSet` and then compared with `AttributeSet(dataColumns)`.

Your version does require one less line of change though :)
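Both spellings can be sketched with a toy exprId-keyed set standing in for `AttributeSet` (hypothetical `Attr` and `toIdSet` helpers, not Spark's implementation):

```scala
// Hypothetical attribute: COL (data) and col (partition) share exprId 1.
case class Attr(name: String, exprId: Long)

object EquivalenceSketch {
  val dataColumns      = Seq(Attr("COL", 1L), Attr("a", 2L))
  val partitionColumns = Seq(Attr("col", 1L))

  // Toy AttributeSet: membership and difference are decided by exprId only.
  def toIdSet(attrs: Seq[Attr]): Set[Long] = attrs.map(_.exprId).toSet

  // Spelling 1: AttributeSet(dataColumns) -- partitionColumns
  val viaSetDifference: Set[Long] =
    toIdSet(dataColumns) -- toIdSet(partitionColumns)

  // Spelling 2: dataColumns.filterNot(partitionSet.contains)
  val partitionSet: Set[Long] = toIdSet(partitionColumns)
  val viaFilterNot: Seq[Attr] =
    dataColumns.filterNot(a => partitionSet.contains(a.exprId))

  def main(args: Array[String]): Unit = {
    // Both leave only `a` (exprId 2): the exprId comparison is what matters.
    assert(viaSetDifference == Set(2L))
    assert(viaFilterNot.map(_.exprId).toSet == viaSetDifference)
  }
}
```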

Contributor

@sadikovi sadikovi left a comment


Thanks for fixing the issue!

Contributor

@LuciferYang LuciferYang left a comment


+1, LGTM

@sunchao sunchao closed this in 4e0fea2 Sep 16, 2022
sunchao added a commit that referenced this pull request Sep 16, 2022
…o data schema


Closes #37881 from sunchao/SPARK-40169.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Chao Sun <sunchao@apple.com>
@sunchao
Member Author

sunchao commented Sep 16, 2022

Thanks! Merged to master/branch-3.3/branch-3.2 (the test failure is unrelated).

sunchao added a commit that referenced this pull request Sep 16, 2022
…o data schema

@dongjoon-hyun
Member

Thank you, @sunchao and all!

LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request Sep 20, 2022
…o data schema

sunchao added a commit to sunchao/spark that referenced this pull request Jun 2, 2023
…o data schema
