[SPARK-28371][SQL] Make Parquet "StartsWith" filter null-safe #25140
Conversation
Parquet may call the filter with a null value to check whether nulls are accepted. While Spark seems to avoid that code path with Parquet 1.10, with 1.11 it causes Spark unit tests to fail. Tested with Parquet 1.11.
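To illustrate what a null-safe predicate looks like, here is a rough sketch of a user-defined "starts with" predicate with the null guard described above. The class name and method bodies are illustrative only; the actual change lives in Spark's ParquetFilters and may differ in detail.

```scala
import org.apache.parquet.filter2.predicate.{Statistics, UserDefinedPredicate}
import org.apache.parquet.io.api.Binary

// Illustrative null-safe "starts with" predicate; a sketch, not the exact
// code that Spark ships.
class StartsWithPredicate(prefix: Binary) extends UserDefinedPredicate[Binary] with Serializable {
  private val prefixBytes: Array[Byte] = prefix.getBytes

  // Parquet may probe the predicate with a null value to ask whether nulls
  // should be kept, so the null check has to come first.
  override def keep(value: Binary): Boolean = {
    value != null && value.getBytes.startsWith(prefixBytes)
  }

  // Conservative statistics handling for this sketch: never drop a row group
  // based on min/max statistics.
  override def canDrop(statistics: Statistics[Binary]): Boolean = false
  override def inverseCanDrop(statistics: Statistics[Binary]): Boolean = false
}
```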
BTW this fix was also part of a separate PR (which has a lot of other changes that are not needed for this): https://github.com/apache/spark/pull/23721/files#diff-67a76299606811fd795f69f8d53b6f2bR594
cc @gatorsmile and @rdblue
This sounds like a Parquet regression to me. Shouldn't this behavior be changed in Parquet instead?
I was told that this is expected behavior (and a Parquet dev pointed me at PARQUET-1489).
(But if this is really a Parquet issue then great, just let me know.)
Also, FYI, the actual Parquet change that introduced the call triggering the Spark unit test failure is PARQUET-1201.
Test build #107611 has finished for PR 25140 at commit
I tried to fix this issue before; for more details, please see PARQUET-1488.
@wangyum, thanks for the additional context. Looks like it was undefined whether a UserDefinedPredicate could be called with null values. I think this is a regression in Parquet: Parquet should catch exceptions from UserDefinedPredicate and read the row group. That would fix the regression while still allowing Parquet to handle columns of all null values. At the same time, I think that Spark should update its predicates to handle nulls so that the filter works correctly. So let's go ahead with this PR, and I'll re-open PARQUET-1488.
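As a purely hypothetical illustration of the reader-side fallback suggested here (not Parquet's actual internals): a failure while evaluating a user-defined predicate should result in reading the row group rather than dropping it.

```scala
// Hypothetical sketch only, not Parquet's real code: if predicate evaluation
// throws, err on the side of reading the row group.
def canDropRowGroup(tryDrop: () => Boolean): Boolean =
  try {
    tryDrop()                   // predicate claims the row group can be skipped
  } catch {
    case _: Exception => false  // on any failure, keep (read) the row group
  }
```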
The change looks correct to me. Is there a test suite for the StartsWith predicate? I'd like to see a test updated as well.
Yes, there is. But let me see if I can call the Spark code more directly...
+1, thanks for the explanation, @vanzin!
OK, the added test fails without the fix (even though in "normal" operation everything seems fine with 1.10).
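For reference, here is a minimal end-to-end sketch of the scenario the new test exercises, assuming a local SparkSession; the real test is written against Spark's Parquet filter test suite helpers and may differ.

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the scenario: a string column containing a null, plus a
// startsWith filter that Spark pushes down to Parquet.
val spark = SparkSession.builder().master("local[1]").appName("startswith-null").getOrCreate()
import spark.implicits._

val path = java.nio.file.Files.createTempDirectory("parquet-startswith").toString

// Write a Parquet file whose string column contains a null value.
Seq("spark", "parquet", null).toDF("s").write.mode("overwrite").parquet(path)

// With filter pushdown enabled (the default), the StartsWith predicate is
// pushed to Parquet, which may probe it with null; without the null guard
// this is the path that fails under Parquet 1.11.
val rows = spark.read.parquet(path).filter($"s".startsWith("spa")).collect()
assert(rows.map(_.getString(0)).toSeq == Seq("spark"))

spark.stop()
```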
If anyone is curious, here's the stack trace with 1.10 (which looks like generated code):
Test build #107620 has finished for PR 25140 at commit
Parquet may call the filter with a null value to check whether nulls are accepted. While it seems Spark avoids that path in Parquet with 1.10, in 1.11 that causes Spark unit tests to fail.

Tested with Parquet 1.11 (and new unit test).

Closes #25140 from vanzin/SPARK-28371.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 7f9da2b)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Thank you so much, @vanzin, @rdblue, @wangyum, @HyukjinKwon.