[SPARK-4493][SQL] Don't pushdown Eq, NotEq, Lt, LtEq, Gt and GtEq predicates with nulls for Parquet #3367

Closed
wants to merge 2 commits into from

liancheng
Contributor

Predicates like a = NULL and a < NULL can't be pushed down because the Parquet Lt, LtEq, Gt and GtEq filters don't accept null values. Note that Eq and NotEq can be used with null only to represent predicates like a IS NULL and a IS NOT NULL.

However, this issue doesn't normally cause an NPE, because comparing any value to NULL results in NULL, and Spark SQL automatically optimizes out NULL predicates in the SimplifyFilters rule. Only test code that intentionally disables the optimizer may trigger this issue. (That's why this issue is not marked as a blocker, and I do NOT think we need to backport this to branch-1.1.)
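To illustrate why such predicates are normally harmless, here is a minimal sketch of SQL-style three-valued comparison. It is a toy model, not Spark code: `NullComparison`, `lessThan` and `keepsRow` are hypothetical names, and `None` stands in for SQL NULL.

```scala
object NullComparison {
  // SQL-style comparison: None models NULL, so any NULL operand yields NULL.
  def lessThan(a: Option[Int], b: Option[Int]): Option[Boolean] =
    for (x <- a; y <- b) yield x < y

  // A WHERE clause keeps a row only when the predicate is definitely true,
  // so a predicate that always evaluates to NULL never keeps any row.
  def keepsRow(result: Option[Boolean]): Boolean = result.contains(true)

  def main(args: Array[String]): Unit = {
    println(lessThan(Some(1), None))            // comparison against NULL
    println(keepsRow(lessThan(Some(1), None)))  // row is filtered out
    println(keepsRow(lessThan(Some(1), Some(2))))
  }
}
```

Because a filter such as a < NULL can never keep a row, an optimizer is free to drop it before any Parquet filter is built, which is presumably why only tests that disable the optimizer see the problem.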

This PR restricts Lt, LtEq, Gt and GtEq to non-null values only, and uses Eq with a null value only to push down IsNull. It also adds support for the Parquet NotEq filter, for completeness and a (tiny) performance gain; NotEq with a null value is used to push down IsNotNull.
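The rule described above can be sketched roughly as follows. This is a self-contained toy model under stated assumptions, not the code in this PR: the `Predicate` and `ParquetFilter` types and `toParquetFilter` are hypothetical stand-ins for Catalyst expressions and Parquet filter predicates, and only a subset of the comparison operators is modelled.

```scala
object NullSafePushdown {
  // Hypothetical source predicates; `value` may be null (a SQL NULL literal).
  sealed trait Predicate
  final case class EqualTo(column: String, value: Any) extends Predicate
  final case class LessThan(column: String, value: Any) extends Predicate
  final case class GreaterThan(column: String, value: Any) extends Predicate
  final case class IsNull(column: String) extends Predicate
  final case class IsNotNull(column: String) extends Predicate

  // Hypothetical target filters, standing in for Parquet's Eq, NotEq, Lt, Gt.
  sealed trait ParquetFilter
  final case class Eq(column: String, value: Any) extends ParquetFilter
  final case class NotEq(column: String, value: Any) extends ParquetFilter
  final case class Lt(column: String, value: Any) extends ParquetFilter
  final case class Gt(column: String, value: Any) extends ParquetFilter

  // Returns None when the predicate must not be pushed down.
  def toParquetFilter(p: Predicate): Option[ParquetFilter] = p match {
    // Comparisons against a null literal are never pushed down.
    case EqualTo(_, null) | LessThan(_, null) | GreaterThan(_, null) => None
    case EqualTo(c, v)     => Some(Eq(c, v))
    case LessThan(c, v)    => Some(Lt(c, v))
    case GreaterThan(c, v) => Some(Gt(c, v))
    // Null checks are the only place a null value reaches Eq / NotEq.
    case IsNull(c)         => Some(Eq(c, null))
    case IsNotNull(c)      => Some(NotEq(c, null))
  }

  def main(args: Array[String]): Unit = {
    println(toParquetFilter(LessThan("a", null)))
    println(toParquetFilter(LessThan("a", 1)))
    println(toParquetFilter(IsNotNull("a")))
  }
}
```

Returning `Option` keeps the "cannot push down" case explicit, so a caller can fall back to evaluating the predicate after the scan.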

FilterApi.eq(binaryColumn(n), Binary.fromByteArray(v.asInstanceOf[Array[Byte]]))
(n: String, v: Any) => FilterApi.eq(
  binaryColumn(n),
  Option(v).map(b => Binary.fromByteArray(v.asInstanceOf[Array[Byte]])).orNull)
Contributor Author

Binary.fromString and Binary.fromByteArray don't accept null.
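The `Option(v).map(...).orNull` idiom in the snippet above can be shown in isolation. This is a small sketch with a hypothetical stand-in (`Wrapped`, `wrapStrict`) in place of Parquet's `Binary`, assuming only that the strict constructor rejects null input:

```scala
object NullSafeWrap {
  // Hypothetical stand-in for a value type whose factory rejects null.
  final case class Wrapped(bytes: Array[Byte])

  def wrapStrict(bytes: Array[Byte]): Wrapped = {
    require(bytes != null, "bytes must not be null")
    Wrapped(bytes)
  }

  // Option(null) is None, so the strict factory is skipped for null input
  // and orNull turns the empty Option back into a plain null reference.
  def wrapNullable(v: Any): Wrapped =
    Option(v).map(b => wrapStrict(b.asInstanceOf[Array[Byte]])).orNull

  def main(args: Array[String]): Unit = {
    println(wrapNullable(null) == null)
    println(wrapNullable(Array[Byte](1, 2)).bytes.length)
  }
}
```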

Contributor

Maybe add this as a comment.

@SparkQA

SparkQA commented Nov 19, 2014

Test build #23612 has started for PR 3367 at commit de7de28.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 19, 2014

Test build #23612 has finished for PR 3367 at commit de7de28.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23612/

@liancheng
Contributor Author

The build failure was due to a syncing issue between GitHub and the ASF Git repo.

@liancheng
Contributor Author

retest this please

@SparkQA

SparkQA commented Nov 20, 2014

Test build #23654 has started for PR 3367 at commit de7de28.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 20, 2014

Test build #23654 has finished for PR 3367 at commit de7de28.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23654/

@SparkQA

SparkQA commented Nov 20, 2014

Test build #530 has started for PR 3367 at commit de7de28.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 21, 2014

Test build #530 has finished for PR 3367 at commit de7de28.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LinearBinaryClassificationModel(LinearModel):
    • class LogisticRegressionModel(LinearBinaryClassificationModel):
    • class LogisticRegressionWithLBFGS(object):
    • class SVMModel(LinearBinaryClassificationModel):
    • class Rating(namedtuple("Rating", ["user", "product", "rating"])):
    • class RDDRangeSampler(RDDSamplerBase):
    • class SizeLimitedStream(object):
    • class CompressedStream(object):
    • class LargeObjectSerializer(Serializer):
    • class CompressedSerializer(Serializer):

@@ -85,6 +86,7 @@ class ParquetQuerySuite extends QueryTest with FunSuiteLike with BeforeAndAfterA
TestData // Load test data tables.

var testRDD: SchemaRDD = null
var originalParquetFilterPushdownEnabled = TestSQLContext.parquetFilterPushDown
Contributor

Why var?

@marmbrus
Contributor

marmbrus commented Dec 1, 2014

Minor comments, otherwise LGTM.

@liancheng
Contributor Author

Addressed all styling issues. Thanks!

@SparkQA

SparkQA commented Dec 2, 2014

Test build #24026 has started for PR 3367 at commit 12c9d1c.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 2, 2014

Test build #24027 has started for PR 3367 at commit cc41281.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 2, 2014

Test build #24026 has finished for PR 3367 at commit 12c9d1c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24026/

@SparkQA

SparkQA commented Dec 2, 2014

Test build #24027 has finished for PR 3367 at commit cc41281.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24027/

@marmbrus
Contributor

Thanks! Merged to master.

asfgit pushed a commit that referenced this pull request Dec 30, 2014
This is a follow-up of #3367 and #3644.

At the time #3644 was written, #3367 hadn't been merged yet, so `IsNull` and `IsNotNull` filters were not covered in the first version of `ParquetFilterSuite`. This PR adds the corresponding test cases.

Author: Cheng Lian <lian@databricks.com>

Closes #3748 from liancheng/test-null-filters and squashes the following commits:

1ab943f [Cheng Lian] IsNull and IsNotNull Parquet filter test case for boolean type
bcd616b [Cheng Lian] Adds Parquet filter pushedown tests for IsNull and IsNotNull