
[SPARK-9082][SQL] Filter using non-deterministic expressions should not be pushed down #7446

Closed
wants to merge 6 commits

Conversation

cloud-fan
Contributor

No description provided.

@cloud-fan
Contributor Author

cc @yhuai

@SparkQA

SparkQA commented Jul 16, 2015

Test build #37499 has finished for PR 7446 at commit 804754d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

test("nondeterministic: can't push down filter through project") {
  val originalQuery = testRelation
    .select(Rand(10).as('rand))
    .where('rand > 5)
Contributor

Maybe it is better to use a condition having both deterministic and non-deterministic expressions, e.g.

val originalQuery = testRelation
  .select(Rand(10).as('rand), 'a)
  .where('rand > 5 && 'a > 5)

@SparkQA

SparkQA commented Jul 16, 2015

Test build #37518 has finished for PR 7446 at commit b5b3c85.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Jul 16, 2015

Thanks for the fix! There is one thing I think we should double-check. Say I have a partitioned Parquet table, I select a few columns from it, and then I have predicates on top of that. These predicates include non-deterministic expressions, a predicate on the partition column, and some predicates that can be pushed down to the Parquet table scan (i.e., into Parquet's reader). With our fix, will the partitioning column still get correctly pruned, and will those predicates still get pushed down to Parquet's reader? ParquetFilterSuite.scala is the file for testing Parquet filter pushdown. We may need to add tests for partition pruning. @liancheng Where are our tests for partition pruning?

test("nondeterministic: can't push down filter through project") {
  val originalQuery = testRelation
    .select(Rand(10).as('rand), 'a)
    .where('rand > 5 && 'a > 5)
Contributor Author

Looks like we should optimize it into

testRelation
  .where('a > 5)
  .select(Rand(10).as('rand), 'a)
  .where('rand > 5)

Can we do this?

Contributor

I feel it may require many more changes. I am fine if we do that in a separate PR in the future.
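The optimization discussed above can be sketched in plain Scala (a simplified stand-in expression model, not Catalyst's real classes; in the real optimizer the condition's attributes would first be substituted through the project's alias map): split a conjunctive condition, push the deterministic conjuncts below the project, and keep the nondeterministic ones in a filter above it.

```scala
// Minimal expression model with a `deterministic` flag, mirroring Catalyst's contract.
sealed trait Expr { def deterministic: Boolean }
case class Attr(name: String) extends Expr { val deterministic = true }
case class Rand(seed: Int) extends Expr { val deterministic = false }
case class Gt(left: Expr, value: Int) extends Expr {
  def deterministic = left.deterministic
}
case class And(conjuncts: Seq[Expr]) extends Expr {
  def deterministic = conjuncts.forall(_.deterministic)
}

// Split `cond` into conjuncts that may be pushed below the project and
// conjuncts that must stay above it.
def splitForPushdown(cond: And): (Seq[Expr], Seq[Expr]) =
  cond.conjuncts.partition(_.deterministic)

val cond = And(Seq(Gt(Rand(10), 5), Gt(Attr("a"), 5)))
val (pushable, retained) = splitForPushdown(cond)
// pushable: Seq(Gt(Attr("a"), 5)) — safe to push below the project
// retained: Seq(Gt(Rand(10), 5))  — must stay above it
```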

@SparkQA

SparkQA commented Jul 16, 2015

Test build #37527 has finished for PR 7446 at commit 10bdd29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

@yhuai I believe expressions with UDF(s) are never pushed down in Parquet or any other data source. Only simple comparison and string predicates dealing with constants can be pushed down. I'm double-checking partition pruning.

@yhuai
Contributor

yhuai commented Jul 16, 2015

Yeah. But my concern is whether this fix will prevent any legitimate predicates from being pushed down.

// We only push down filter if their overlapped expressions are all
// deterministic.
val hasNondeterministic = condition.collect {
  case a: Attribute if aliasMap.contains(a) => aliasMap(a)
Contributor

We probably cannot use `contains` here, as we have to use `semanticEquals` to find the identical expression.

Contributor Author

But `aliasMap` is an `AttributeMap`, so I think it should be safe to call `contains` here?

Contributor

Oh yes, the original code seems to have a bug: it uses `.toMap`.
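The distinction in this thread can be illustrated with a simplified model (hypothetical `Attr`/`AttrMap` classes, not Catalyst's real ones): Catalyst attributes carry an `exprId`, and two references to the same attribute can differ in cosmetic fields such as qualifiers, so a plain `Map` keyed on the whole object may miss a lookup that an `exprId`-keyed `AttributeMap` would hit.

```scala
// Simplified stand-in for Catalyst's AttributeReference: identity is the exprId,
// while fields like `qualifier` can vary between references to the same column.
case class Attr(name: String, exprId: Long, qualifier: Option[String])

// Stand-in for Catalyst's AttributeMap: keys lookups on exprId only.
class AttrMap[A](entries: Seq[(Attr, A)]) {
  private val byId = entries.map { case (k, v) => k.exprId -> v }.toMap
  def contains(a: Attr): Boolean = byId.contains(a.exprId)
}

val defined  = Attr("rand", 1L, None)
val referred = Attr("rand", 1L, Some("t")) // same attribute, different qualifier

val plain = Map(defined -> "alias")
// plain Map: structural equality, so the qualified reference misses
// plain.contains(referred) == false

val attrMap = new AttrMap(Seq(defined -> "alias"))
// exprId-keyed map: the qualified reference still hits
// attrMap.contains(referred) == true
```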

val hasNondeterministic = projectList1.flatMap(_.collect {
  case a: Attribute if aliasMap.contains(a) => aliasMap(a).child
}).exists(_.find(!_.deterministic).isDefined)
val canCollapse = projectList1.find(hasNondeterministic(_, aliasMap)).isEmpty
Contributor Author

An unrelated but small change: we don't need to go through the whole project list; we can stop once we find a nondeterministic expression.
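A small plain-Scala illustration of why `exists` is preferable here: it short-circuits on the first match, whereas a `flatMap`/`collect` pipeline materializes the whole intermediate list before the check.

```scala
// Count how many elements the predicate actually inspects.
var visited = 0
val items = Seq(1, 2, 3, 4, 5)

// `exists` stops at the first hit (here: the first even number).
val found = items.exists { x => visited += 1; x % 2 == 0 }

// found == true, and only 1 and 2 were inspected; 3, 4, 5 were never touched.
assert(found)
assert(visited == 2)
```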

@SparkQA

SparkQA commented Jul 17, 2015

Test build #37618 has finished for PR 7446 at commit 0e5e2d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • logDebug("isMulticlass = " + metadata.isMulticlass)
    • * (i.e., if isMulticlass && isSpaceSufficientForAllCategoricalSplits),
    • logDebug("isMulticlass = " + metadata.isMulticlass)
    • abstract class UnsafeProjection extends Projection
    • case class FromUnsafeProjection(fields: Seq[DataType]) extends Projection
    • abstract class BaseProjection extends Projection
    • class SpecificProjection extends $
    • class SpecificProjection extends $

val hasNondeterministic = projectList1.flatMap(_.collect {
  case a: Attribute if aliasMap.contains(a) => aliasMap(a).child
}).exists(_.find(!_.deterministic).isDefined)
val noCollapse = projectList1.exists(hasNondeterministic(_, aliasMap))
Contributor Author

An unrelated but small change: we don't need to go through the whole project list; we can stop once we find a nondeterministic expression.

@SparkQA

SparkQA commented Jul 17, 2015

Test build #37630 has finished for PR 7446 at commit 33eb2d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sourceAliases: AttributeMap[Alias]) = {
  project.exists {
    case a: Attribute =>
      sourceAliases.get(a).map(_.child.exists(!_.deterministic)).getOrElse(false)
Contributor

        sourceAliases.get(a).exists(_.child.exists(!_.deterministic))

@SparkQA

SparkQA commented Jul 21, 2015

Test build #37901 has finished for PR 7446 at commit 330021e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait ExpectsInputTypes extends Expression
    • trait ImplicitCastInputTypes extends ExpectsInputTypes
    • trait Unevaluable extends Expression
    • trait Nondeterministic extends Expression
    • trait CodegenFallback extends Expression
    • case class Hex(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
    • case class Unhex(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
    • case class Ascii(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
    • case class Base64(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
    • case class UnBase64(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
    • case class FakeFileStatus(

@cloud-fan
Contributor Author

ping @yhuai

@liancheng
Contributor

@yhuai Sorry that I misunderstood your question at first. I think this PR is safe for normal filter push-down. Verified that partition pruning and Parquet filter push-down are both working properly.

@liancheng
Contributor

LGTM

@yhuai
Contributor

yhuai commented Jul 22, 2015

LGTM. I am merging it to master.

// Split the condition into small conditions by `And`, so that we can push down part of this
// condition without nondeterministic expressions.
val andConditions = splitConjunctivePredicates(condition)
val nondeterministicConditions = andConditions.filter(hasNondeterministic(_, aliasMap))
Contributor

Maybe using `partition` here would be better?
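The suggestion above, sketched in plain Scala with stand-in string conditions and a hypothetical `isNondeterministic` check: `partition` splits a sequence into matching and non-matching halves in a single pass, instead of the two complementary `filter`/`filterNot` traversals.

```scala
val conditions = Seq("rand > 5", "a > 5", "b < 3")
// Stand-in check; the real code would inspect the expression tree.
def isNondeterministic(c: String): Boolean = c.contains("rand")

// One pass: matching conjuncts first, the rest second.
val (nondeterministic, deterministic) = conditions.partition(isNondeterministic)

// Equivalent two-pass version this replaces:
// val nondeterministic = conditions.filter(isNondeterministic)
// val deterministic    = conditions.filterNot(isNondeterministic)

// nondeterministic == Seq("rand > 5")
// deterministic    == Seq("a > 5", "b < 3")
```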

@asfgit asfgit closed this in 7652095 Jul 22, 2015
asfgit pushed a commit that referenced this pull request Jul 23, 2015
…ghProject`

a follow up of #7446

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7607 from cloud-fan/tmp and squashes the following commits:

7106989 [Wenchen Fan] use `partition` in `PushPredicateThroughProject`
@cloud-fan cloud-fan deleted the filter branch August 27, 2015 13:01
6 participants