
Conversation

@vanzin (Contributor) commented Jul 12, 2019

Parquet may call the filter with a null value to check whether nulls are
accepted. While it seems Spark avoids that path in Parquet with 1.10, in
1.11 that causes Spark unit tests to fail.

Tested with Parquet 1.11 (and new unit test).
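A minimal, self-contained sketch of the null-safe behavior this PR adds. This is not the actual Spark code (the real predicate lives in ParquetFilters.scala, extends Parquet's UserDefinedPredicate, and compares Binary values); the class and method names here are illustrative. The point it demonstrates: in Parquet 1.11 the evaluator may probe keep(null) to ask whether rows with a null column value pass the filter, so the filter must answer false instead of dereferencing the value.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative stand-in for Spark's "StartsWith" Parquet filter.
public class StartsWithFilter {
    private final byte[] prefix;

    public StartsWithFilter(String prefix) {
        this.prefix = prefix.getBytes(StandardCharsets.UTF_8);
    }

    // Mirrors the shape of UserDefinedPredicate#keep(T): in Parquet 1.11
    // the value may be null when Parquet checks how nulls are treated.
    public boolean keep(byte[] value) {
        if (value == null) {
            // A null cell can never start with the prefix; without this
            // check the old code threw a NullPointerException here.
            return false;
        }
        if (value.length < prefix.length) {
            return false;
        }
        // Compare only the leading prefix.length bytes.
        return Arrays.equals(Arrays.copyOfRange(value, 0, prefix.length), prefix);
    }

    public static void main(String[] args) {
        StartsWithFilter f = new StartsWithFilter("spark");
        System.out.println(f.keep("spark-sql".getBytes(StandardCharsets.UTF_8))); // true
        System.out.println(f.keep(null)); // false: the null probe no longer NPEs
    }
}
```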
@vanzin (Contributor, Author) commented Jul 12, 2019

BTW this fix was also part of a separate PR (which has a lot of other changes that are not needed for this): https://github.com/apache/spark/pull/23721/files#diff-67a76299606811fd795f69f8d53b6f2bR594

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-28371][sql] Make Parquet "StartsWith" filter null-safe. [SPARK-28371][SQL] Make Parquet "StartsWith" filter null-safe Jul 12, 2019
@dongjoon-hyun (Member)

cc @gatorsmile and @rdblue

@rdblue (Contributor) commented Jul 12, 2019

This sounds like a Parquet regression to me. Shouldn't this behavior be changed in Parquet instead?

@vanzin (Contributor, Author) commented Jul 12, 2019

I was told that this is expected behavior (and a Parquet dev pointed me at PARQUET-1489).

@vanzin (Contributor, Author) commented Jul 12, 2019

(But if this is really a Parquet issue, then great, just let me know.)

@vanzin (Contributor, Author) commented Jul 12, 2019

Also FYI the actual Parquet change that introduced the call that triggers the Spark unit test failure is PARQUET-1201.

@SparkQA commented Jul 12, 2019

Test build #107611 has finished for PR 25140 at commit 172ea92.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum (Member) commented Jul 12, 2019

I tried to fix this issue before; for more details, please see PARQUET-1488.

@rdblue (Contributor) commented Jul 12, 2019

@wangyum, thanks for the additional context. Looks like it was undefined whether a UserDefinedPredicate needed to handle null. The requirement was stated only after adding the call with null to check whether a column of all nulls should be kept.

I think this is a regression in Parquet. Parquet should catch exceptions from UserDefinedPredicate and read the row group. That would fix the regression, while allowing Parquet to handle columns of all null values.

At the same time, I think that Spark should update its predicates to handle nulls, so that the filter works correctly. So let's go ahead with this PR and I'll re-open PARQUET-1488.
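A hypothetical sketch of the defensive behavior described above for Parquet itself. keepOrTrue is an invented helper, not a real Parquet API: if a user-defined predicate throws on the null probe, fall back to keeping the row group, since reading a row group that could have been dropped is always correct, just less efficient.

```java
import java.util.function.Predicate;

public class SafeKeep {
    // Wraps a possibly null-hostile predicate so that a throwing keep()
    // means "keep the row group" rather than failing the whole read.
    public static <T> boolean keepOrTrue(Predicate<T> keep, T value) {
        try {
            return keep.test(value);
        } catch (RuntimeException e) {
            // Cannot prove the rows can be dropped, so keep them.
            return true;
        }
    }

    public static void main(String[] args) {
        // This predicate NPEs on null, like the pre-fix Spark filter.
        Predicate<String> startsWith = s -> s.startsWith("spark");
        System.out.println(keepOrTrue(startsWith, "spark-sql")); // true
        System.out.println(keepOrTrue(startsWith, null));        // true: NPE caught
    }
}
```

Spark fixing its predicate (this PR) and Parquet catching predicate exceptions (re-opened PARQUET-1488) are complementary: the former makes the filter answer correctly for nulls, the latter keeps a misbehaving predicate from failing the read.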

@rdblue (Contributor) commented Jul 12, 2019

The change looks correct to me. Is there a test suite for the StartsWith predicate? I'd like to see a test updated as well.

@vanzin (Contributor, Author) commented Jul 12, 2019

Is there a test suite for the StartsWith predicate?

Yes, in ParquetFilterSuite, but the test only hits the bad code path when you use Parquet 1.11. I don't really know how to make the Parquet code call the Spark filter with a null in 1.10 (I tried a few things, to no avail).

But let me see if I can call the Spark code more directly...

@rdblue (Contributor) commented Jul 12, 2019

+1, thanks for the explanation, @vanzin!

@vanzin (Contributor, Author) commented Jul 12, 2019

OK, the added test fails without the fix (even though in "normal" operation everything seems fine with 1.10).

@vanzin (Contributor, Author) commented Jul 12, 2019

If anyone is curious, here's the stack trace with 1.10 (the Parquet frames look like generated code):

[info]   Cause: java.lang.NullPointerException:
[info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anon$1.keep(ParquetFilters.scala:616)
[info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anon$1.keep(ParquetFilters.scala:595)
[info]   at org.apache.parquet.filter2.recordlevel.IncrementallyUpdatedFilterPredicateBuilder$50.updateNull(IncrementallyUpdatedFilterPredicateBuilder.java:937)
[info]   at org.apache.parquet.filter2.recordlevel.IncrementallyUpdatedFilterPredicateEvaluator.visit(IncrementallyUpdatedFilterPredicateEvaluator.java:49)
[info]   at org.apache.parquet.filter2.recordlevel.IncrementallyUpdatedFilterPredicate$ValueInspector.accept(IncrementallyUpdatedFilterPredicate.java:122)
[info]   at org.apache.parquet.filter2.recordlevel.IncrementallyUpdatedFilterPredicateEvaluator.evaluate(IncrementallyUpdatedFilterPredicateEvaluator.java:41)
[info]   at org.apache.parquet.filter2.recordlevel.FilteringRecordMaterializer.getCurrentRecord(FilteringRecordMaterializer.java:93)

@SparkQA commented Jul 13, 2019

Test build #107620 has finished for PR 25140 at commit 63c0aba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun pushed a commit that referenced this pull request Jul 13, 2019
Parquet may call the filter with a null value to check whether nulls are
accepted. While it seems Spark avoids that path in Parquet with 1.10, in
1.11 that causes Spark unit tests to fail.

Tested with Parquet 1.11 (and new unit test).

Closes #25140 from vanzin/SPARK-28371.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 7f9da2b)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun (Member)

Thank you so much, @vanzin , @rdblue , @wangyum , @HyukjinKwon .
Merged to master/2.4.

@vanzin vanzin deleted the SPARK-28371 branch July 15, 2019 17:58
vinodkc pushed a commit to vinodkc/spark that referenced this pull request Jul 18, 2019
rluta pushed a commit to rluta/spark that referenced this pull request Sep 17, 2019
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Sep 26, 2019
6 participants