
[SPARK-13473][SQL] Don't push predicate through project with nondeterministic field(s) #11348

Conversation

@liancheng (Contributor)

What changes were proposed in this pull request?

Predicates shouldn't be pushed through project with nondeterministic field(s).

See graphframes/graphframes#23 and SPARK-13473 for more details.

This PR targets master, branch-1.6, and branch-1.5.
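To make the hazard concrete, here is a minimal pure-Scala sketch of why the rewrite is unsound (no Spark dependency; `PushdownDemo`, `randColumn`, and the plan names are illustrative, not Spark classes). A `rand`-like expression is stateful, so the set of rows it is evaluated over is an implicit input; filtering first changes which draws each surviving row receives:

```scala
import scala.util.Random

// Sketch: a nondeterministic, rand-like column is stateful, so its value for a
// given row depends on how many rows were evaluated before it.
object PushdownDemo {
  private def randColumn(seed: Long): () => Double = {
    val rng = new Random(seed)
    () => rng.nextDouble() // each call advances the shared PRNG state
  }

  val rows: Seq[Int] = Seq(1, 2, 3, 4, 5)

  // Original plan: Project(a, rand) then Filter(a > 2).
  // rand is evaluated for all 5 rows; rows 3..5 keep draws #3, #4, #5.
  val original: Seq[(Int, Double)] = {
    val rand = randColumn(42L)
    rows.map(a => (a, rand())).filter { case (a, _) => a > 2 }
  }

  // After the (invalid) pushdown: Filter(a > 2) then Project(a, rand).
  // rand is now evaluated for only 3 rows, which receive draws #1, #2, #3.
  val pushed: Seq[(Int, Double)] = {
    val rand = randColumn(42L)
    rows.filter(_ > 2).map(a => (a, rand()))
  }

  def main(args: Array[String]): Unit = {
    println(original) // deterministic column agrees with `pushed`...
    println(pushed)   // ...but the rand column differs
  }
}
```

The deterministic column is identical in both plans, but the nondeterministic column is not, which is exactly why the optimized plan must leave the filter above the project.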

How was this patch tested?

A test case is added in `FilterPushdownSuite`. It constructs a query plan where a filter sits on top of a project with a nondeterministic field. The optimized query plan shouldn't change in this case.

@liancheng (Contributor Author)

cc @mengxr

@@ -156,6 +156,17 @@ class FilterPushdownSuite extends PlanTest {
comparePlans(optimized, originalQuery)
}

test("nondeterministic: can't push down filter through project with nondeterministic field") {
val originalQuery = testRelation
.select(Rand(10).as('rand), 'a)

minor: indentation

@marmbrus (Contributor)

LGTM

@liancheng (Contributor Author)

We should also revise nondeterminism handling in PushPredicateThroughGenerate, PushPredicateThroughAggregate, and PushPredicateThroughJoin. But they can be added in follow-up PRs.

@liancheng force-pushed the spark-13473-no-ppd-through-nondeterministic-project-field branch from 0f3175a to 863c5ec on February 24, 2016
@liancheng (Contributor Author)

test this please

@yhuai (Contributor) commented Feb 24, 2016

test this please

@SparkQA commented Feb 24, 2016

Test build #51890 has finished for PR 11348 at commit 863c5ec.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 24, 2016

Test build #51888 has finished for PR 11348 at commit 863c5ec.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng (Contributor Author)

We can probably further simplify PushPredicateThroughProject. Before this PR, we didn't push a predicate if it referred to any nondeterministic field. However, as this PR shows, we shouldn't push a predicate through any project that has nondeterministic field(s), whether or not the predicate refers to them. Conversely, a predicate that contains nondeterministic expression(s) but doesn't refer to any nondeterministic field is still OK to push down. For example, it's OK to push down the following filter predicate:

// from:
sqlContext.range(3).select('id as 'a, 'id as 'b).filter(rand(42) > 0.5)

// to:
sqlContext.range(3).filter(rand(42) > 0.5).select('id as 'a, 'id as 'b)

This means that we can push down a filter predicate through a project if and only if all fields of the project are deterministic. That's why those two test cases are considered outdated and removed.

cc @cloud-fan

(To be safe, I won't make the above change in this PR since it also targets branch-1.6 and branch-1.5.)
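The refined condition can be modeled outside Spark like this (a hypothetical mini-model; `Plan`, `Field`, `PushRule` are illustrative names, not Catalyst's actual classes, and the real rule also rewrites the condition through the project's aliases, which is omitted here):

```scala
// Hypothetical mini-model: push a Filter through a Project iff every projected
// field is deterministic. Names are illustrative, not Spark's Catalyst API.
sealed trait Plan
final case class Relation(name: String) extends Plan
final case class Field(name: String, deterministic: Boolean)
final case class Project(fields: Seq[Field], child: Plan) extends Plan
final case class Filter(condition: String, child: Plan) extends Plan

object PushRule {
  def pushDown(plan: Plan): Plan = plan match {
    // Safe even if the condition itself is nondeterministic (e.g. rand(42) > 0.5),
    // as long as no projected field is nondeterministic.
    case Filter(cond, Project(fields, child)) if fields.forall(_.deterministic) =>
      Project(fields, Filter(cond, child))
    case other => other // otherwise leave the plan unchanged
  }
}
```

Applied to the two plans above, this rule pushes the filter when the project is all-deterministic (`'a, 'b`) and leaves the `Filter` on top when the project contains a `rand` field.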

@SparkQA commented Feb 25, 2016

Test build #51957 has finished for PR 11348 at commit 51ad500.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -804,7 +804,9 @@ object SimplifyFilters extends Rule[LogicalPlan] {
*/
object PushPredicateThroughProject extends Rule[LogicalPlan] with PredicateHelper {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
-    case filter @ Filter(condition, project @ Project(fields, grandChild)) =>
+    case filter @ Filter(condition, project @ Project(fields, grandChild))
+      if fields.forall(_.deterministic) =>

How about we add some comments to explain why we can't push a filter down through a project with non-deterministic fields? E.g., the number of input rows is also an implicit input for non-deterministic expressions, and pushing the filter down would change it.

@cloud-fan (Contributor)

LGTM except one comment

@liancheng (Contributor Author)

Thanks for the review, comment added.

@asfgit closed this in 3fa6491 on Feb 25, 2016
asfgit pushed a commit that referenced this pull request Feb 25, 2016
[SPARK-13473][SQL] Don't push predicate through project with nondeterministic field(s)

## What changes were proposed in this pull request?

Predicates shouldn't be pushed through project with nondeterministic field(s).

See graphframes/graphframes#23 and SPARK-13473 for more details.

This PR targets master, branch-1.6, and branch-1.5.

## How was this patch tested?

A test case is added in `FilterPushdownSuite`. It constructs a query plan where a filter is over a project with a nondeterministic field. Optimized query plan shouldn't change in this case.

Author: Cheng Lian <lian@databricks.com>

Closes #11348 from liancheng/spark-13473-no-ppd-through-nondeterministic-project-field.

(cherry picked from commit 3fa6491)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
asfgit pushed a commit that referenced this pull request Feb 25, 2016 (same change cherry-picked to a second branch)
@cloud-fan (Contributor)

The last commit only adds comments, so it's safe to merge once the style check passes.
Thanks, merged into master, branch-1.6, and branch-1.5!

@liancheng deleted the spark-13473-no-ppd-through-nondeterministic-project-field branch on February 25, 2016
@SparkQA commented Feb 25, 2016

Test build #51969 has finished for PR 11348 at commit 24b96bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Mar 22, 2016
## What changes were proposed in this pull request?

This is a follow-up of PR #11348.

After PR #11348, a predicate is never pushed through a project as long as the project contains any non-deterministic fields. Thus, it's impossible that the candidate filter condition can reference any non-deterministic projected fields, and related logic can be safely cleaned up.

To be more specific, the following optimization is allowed:

```scala
// From:
df.select('a, 'b).filter('c > rand(42))
// To:
df.filter('c > rand(42)).select('a, 'b)
```

while this isn't:

```scala
// From:
df.select('a, rand('b) as 'rb, 'c).filter('c > 'rb)
// To:
df.filter('c > rand('b)).select('a, rand('b) as 'rb, 'c)
```

## How was this patch tested?

Existing test cases should do the work.

Author: Cheng Lian <lian@databricks.com>

Closes #11864 from liancheng/spark-13473-cleanup.
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016 (same follow-up commit mirrored to a fork)
@wecharyu (Contributor) commented Jul 9, 2024

Hi @liancheng, @mengxr, @cloud-fan,

I'm trying to understand why pushing down filters with nondeterministic fields is considered a bug. How would the different nondeterministic results impact the query?

For instance, other engines like Hive do push down filters in these cases. Could this change lead to performance regressions in our queries?
