Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-19994][SQL] Wrong outputOrdering for right/full outer smj #17331

Closed
wants to merge 2 commits into from

Conversation

wzhfy
Copy link
Contributor

@wzhfy wzhfy commented Mar 17, 2017

What changes were proposed in this pull request?

For right outer join, values of the left key will be filled with nulls if it can't match the value of the right key, so nullOrdering of the left key can't be guaranteed. We should output right key order instead of left key order.

For full outer join, neither left key nor right key guarantees nullOrdering. We should not output any ordering.

In tests, besides adding three test cases for left/right/full outer sort merge join, this patch also reorganizes code in PlannerSuite by putting together tests for Sort, and also extracts common logic in Sort tests into a method.

How was this patch tested?

Corresponding test cases are added.

@wzhfy
Copy link
Contributor Author

wzhfy commented Mar 17, 2017

cc @cloud-fan @hvanhovell

@SparkQA
Copy link

SparkQA commented Mar 17, 2017

Test build #74731 has finished for PR 17331 at commit 8dc2c04.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}

test("EnsureRequirements for sort operator after left outer sort merge join") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we just moving test around?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added three test cases for left/right/full outer join. For other tests, I moved tests about sort together and extract common logic to a method and some private fields.

case RightOuter =>
// For right outer join, values of the left key will be filled with nulls if it can't
// match the value of the right key, so `nullOrdering` of the left key can't be guaranteed.
// We should output right key order here.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is misleading. The output ordering is mainly affected by how we implement SortMergeJoinExec.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// For left and right outer joins, the output is ordered by the streamed input's join keys.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I'll use those comments in the original PR.

@gatorsmile
Copy link
Member

The bug was introduced when we merge SortMergeJoin and SortMergerOuterJoin

https://github.com/apache/spark/pull/11743/files#diff-b669f8cf35f1d2d786582f4d8c49ed14

case FullOuter =>
// Neither left key nor right key guarantees `nullOrdering` after full outer join.
Nil
case _ =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If possible, please use the white list. Otherwise, we might forget to update this when adding new join types. Then, we should issue the exception for the default case.

@SparkQA
Copy link

SparkQA commented Mar 19, 2017

Test build #74809 has started for PR 17331 at commit e4c41dc.

@wzhfy
Copy link
Contributor Author

wzhfy commented Mar 19, 2017

retest this please

@SparkQA
Copy link

SparkQA commented Mar 19, 2017

Test build #74817 has finished for PR 17331 at commit e4c41dc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Copy link
Contributor Author

wzhfy commented Mar 20, 2017

retest this please

@SparkQA
Copy link

SparkQA commented Mar 20, 2017

Test build #74835 has finished for PR 17331 at commit e4c41dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Mar 20, 2017
## What changes were proposed in this pull request?

For right outer join, values of the left key will be filled with nulls if it can't match the value of the right key, so `nullOrdering` of the left key can't be guaranteed. We should output right key order instead of left key order.

For full outer join, neither left key nor right key guarantees `nullOrdering`. We should not output any ordering.

In tests, besides adding three test cases for left/right/full outer sort merge join, this patch also reorganizes code in `PlannerSuite` by putting together tests for `Sort`, and also extracts common logic in Sort tests into a method.

## How was this patch tested?

Corresponding test cases are added.

Author: wangzhenhua <wangzhenhua@huawei.com>
Author: Zhenhua Wang <wzh_zju@163.com>

Closes #17331 from wzhfy/wrongOrdering.

(cherry picked from commit 965a5ab)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan
Copy link
Contributor

thanks, merging to master/2.1/2.0!

asfgit pushed a commit that referenced this pull request Mar 20, 2017
## What changes were proposed in this pull request?

For right outer join, values of the left key will be filled with nulls if it can't match the value of the right key, so `nullOrdering` of the left key can't be guaranteed. We should output right key order instead of left key order.

For full outer join, neither left key nor right key guarantees `nullOrdering`. We should not output any ordering.

In tests, besides adding three test cases for left/right/full outer sort merge join, this patch also reorganizes code in `PlannerSuite` by putting together tests for `Sort`, and also extracts common logic in Sort tests into a method.

## How was this patch tested?

Corresponding test cases are added.

Author: wangzhenhua <wangzhenhua@huawei.com>
Author: Zhenhua Wang <wzh_zju@163.com>

Closes #17331 from wzhfy/wrongOrdering.

(cherry picked from commit 965a5ab)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@asfgit asfgit closed this in 965a5ab Mar 20, 2017
@HyukjinKwon
Copy link
Member

It seems this breaks the build in 2.0 https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.0-compile-maven-hadoop-2.7/lastBuild/console

[error] /home/jenkins/workspace/SparkPullRequestBuilder/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala:87: not found: type InnerLike
[error]     case _: InnerLike | LeftExistence(_) => requiredOrders(leftKeys)
[error]  

@wzhfy
Copy link
Contributor Author

wzhfy commented Mar 21, 2017

@HyukjinKwon branch2.0 doesn't have InnerLike, I'll fix this.

@HyukjinKwon
Copy link
Member

@wzhfy Thank you so much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants