Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-17244] Catalyst should not pushdown non-deterministic join conditions #14815

Closed

Conversation

sameeragarwal
Copy link
Member

What changes were proposed in this pull request?

Given that non-deterministic expressions can be stateful, pushing them down the query plan during the optimization phase can cause incorrect behavior. This patch fixes that issue by explicitly disabling that.

How was this patch tested?

A new test in FilterPushdownSuite that checks catalyst behavior for both deterministic and non-deterministic join conditions.

@sameeragarwal
Copy link
Member Author

cc @hvanhovell @gatorsmile

@sameeragarwal
Copy link
Member Author

cc @brkyvz who found this bug

@@ -1386,15 +1386,17 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper {
object PushPredicateThroughJoin extends Rule[LogicalPlan] with PredicateHelper {
/**
* Splits join condition expressions into three categories based on the attributes required
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like this PR is targeting both join conditions and filter predicates. Could you update the code comment of this line and the title of the PR?

@SparkQA
Copy link

SparkQA commented Aug 25, 2016

Test build #64436 has finished for PR 14815 at commit 6728fc3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 25, 2016

Test build #64437 has finished for PR 14815 at commit d0b1009.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sameeragarwal
Copy link
Member Author

Thanks @gatorsmile, I address your comments.

@gatorsmile
Copy link
Member

Looks pretty good to me!

Could you also add one more test case for testing the following scenario:

only consider pushing down those expressions that precede the first non-deterministic expression in the condition.

I am just afraid the others might break it in the future. Thanks!

@sameeragarwal
Copy link
Member Author

That's a good idea. More generally, I think the best way to fix these class of problems is to unify all predicate pushdown logic at a common place.

@SparkQA
Copy link

SparkQA commented Aug 26, 2016

Test build #64496 has finished for PR 14815 at commit f86f5ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

In 2.0, we combined the PPD rules. Now, only two PPD rules left, PushPredicateThroughJoin and PushDownPredicate. IMO, the major reason why we did not combine them into the same one is because PPD for Join is a little bit special.

@gatorsmile
Copy link
Member

LGTM pending testing

@SparkQA
Copy link

SparkQA commented Aug 26, 2016

Test build #64500 has finished for PR 14815 at commit b326cd8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Copy link
Contributor

yhuai commented Aug 26, 2016

LGTM. Merging to master and branch 2.0.

asfgit pushed a commit that referenced this pull request Aug 26, 2016
…ditions

## What changes were proposed in this pull request?

Given that non-deterministic expressions can be stateful, pushing them down the query plan during the optimization phase can cause incorrect behavior. This patch fixes that issue by explicitly disabling that.

## How was this patch tested?

A new test in `FilterPushdownSuite` that checks catalyst behavior for both deterministic and non-deterministic join conditions.

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes #14815 from sameeragarwal/constraint-inputfile.

(cherry picked from commit 540e912)
Signed-off-by: Yin Huai <yhuai@databricks.com>
@asfgit asfgit closed this in 540e912 Aug 26, 2016
* to evaluate them.
* Splits join condition expressions or filter predicates (on a given join's output) into three
* categories based on the attributes required to evaluate them. Note that we explicitly exclude
* on-deterministic (i.e., stateful) condition expressions in canEvaluateInLeft or
Copy link
Contributor

@cloud-fan cloud-fan Aug 27, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: non-deterministic

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good eye! can you please fold this change into one of your open PRs :) ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, I will :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants