Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-27485] EnsureRequirements.reorder should handle duplicate expressions gracefully #25167

Closed
wants to merge 1 commit into from

Conversation

hvanhovell
Copy link
Contributor

What changes were proposed in this pull request?

When reordering joins EnsureRequirements only checks if all the join keys are present in the partitioning expression seq. This is problematic when the joins keys and and partitioning expressions both contain duplicates but not the same number of duplicates for each expression, e.g. Seq(a, a, b) vs Seq(a, b, b). This fails with an index lookup failure in the reorder function.

This PR fixes this removing the equality checking logic from the reorderJoinKeys function, and by doing the multiset equality in the reorder function while building the reordered key sequences.

How was this patch tested?

Added a unit test to the PlannerSuite and added an integration test to JoinSuite

@hvanhovell
Copy link
Contributor Author

cc @mgaido91

return (leftKeys, rightKeys)
}

// Build a lookup between an expression and the positions its holds in the current key seq.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous implementation had the potential for quadratic behavior in quite a few places so I changed all that. I might have gotten carried away here, especially given the fact that number of keys is often quite low.

@SparkQA
Copy link

SparkQA commented Jul 16, 2019

Test build #107705 has finished for PR 25167 at commit 6f78684.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 421d9d5 Jul 16, 2019
@cloud-fan
Copy link
Contributor

cloud-fan commented Jul 16, 2019

also backported to 2.4, cc @dongjoon-hyun

cloud-fan pushed a commit that referenced this pull request Jul 16, 2019
…essions gracefully

## What changes were proposed in this pull request?
When reordering joins EnsureRequirements only checks if all the join keys are present in the partitioning expression seq. This is problematic when the joins keys and and partitioning expressions both contain duplicates but not the same number of duplicates for each expression, e.g. `Seq(a, a, b)` vs `Seq(a, b, b)`. This fails with an index lookup failure in the `reorder` function.

This PR fixes this removing the equality checking logic from the `reorderJoinKeys` function, and by doing the multiset equality in the `reorder` function while building the reordered key sequences.

## How was this patch tested?
Added a unit test to the `PlannerSuite` and added an integration test to `JoinSuite`

Closes #25167 from hvanhovell/SPARK-27485.

Authored-by: herman <herman@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@gatorsmile
Copy link
Member

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-compile-maven-hadoop-2.7/502/

Just reverted the commit from 2.4 because it breaks the build.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Jul 16, 2019

Thank you for pinging me, @cloud-fan . Thank you for recovering branch-2.4.
Hi, @hvanhovell . Could you make a backporting PR to branch-2.4, please?

@mgaido91
Copy link
Contributor

a late LGTM, sorry for the delay and thanks for the fix @hvanhovell !

hvanhovell added a commit to hvanhovell/spark that referenced this pull request Jul 16, 2019
…essions gracefully

## What changes were proposed in this pull request?
When reordering joins EnsureRequirements only checks if all the join keys are present in the partitioning expression seq. This is problematic when the joins keys and and partitioning expressions both contain duplicates but not the same number of duplicates for each expression, e.g. `Seq(a, a, b)` vs `Seq(a, b, b)`. This fails with an index lookup failure in the `reorder` function.

This PR fixes this removing the equality checking logic from the `reorderJoinKeys` function, and by doing the multiset equality in the `reorder` function while building the reordered key sequences.

## How was this patch tested?
Added a unit test to the `PlannerSuite` and added an integration test to `JoinSuite`

Closes apache#25167 from hvanhovell/SPARK-27485.

Authored-by: herman <herman@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@HyukjinKwon
Copy link
Member

Sorry for late response. Looks good to me too.

vinodkc pushed a commit to vinodkc/spark that referenced this pull request Jul 18, 2019
…essions gracefully

## What changes were proposed in this pull request?
When reordering joins EnsureRequirements only checks if all the join keys are present in the partitioning expression seq. This is problematic when the joins keys and and partitioning expressions both contain duplicates but not the same number of duplicates for each expression, e.g. `Seq(a, a, b)` vs `Seq(a, b, b)`. This fails with an index lookup failure in the `reorder` function.

This PR fixes this removing the equality checking logic from the `reorderJoinKeys` function, and by doing the multiset equality in the `reorder` function while building the reordered key sequences.

## How was this patch tested?
Added a unit test to the `PlannerSuite` and added an integration test to `JoinSuite`

Closes apache#25167 from hvanhovell/SPARK-27485.

Authored-by: herman <herman@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
rluta pushed a commit to rluta/spark that referenced this pull request Sep 17, 2019
…essions gracefully

## What changes were proposed in this pull request?
When reordering joins EnsureRequirements only checks if all the join keys are present in the partitioning expression seq. This is problematic when the joins keys and and partitioning expressions both contain duplicates but not the same number of duplicates for each expression, e.g. `Seq(a, a, b)` vs `Seq(a, b, b)`. This fails with an index lookup failure in the `reorder` function.

This PR fixes this removing the equality checking logic from the `reorderJoinKeys` function, and by doing the multiset equality in the `reorder` function while building the reordered key sequences.

## How was this patch tested?
Added a unit test to the `PlannerSuite` and added an integration test to `JoinSuite`

Closes apache#25167 from hvanhovell/SPARK-27485.

Authored-by: herman <herman@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Sep 26, 2019
…essions gracefully

## What changes were proposed in this pull request?
When reordering joins EnsureRequirements only checks if all the join keys are present in the partitioning expression seq. This is problematic when the joins keys and and partitioning expressions both contain duplicates but not the same number of duplicates for each expression, e.g. `Seq(a, a, b)` vs `Seq(a, b, b)`. This fails with an index lookup failure in the `reorder` function.

This PR fixes this removing the equality checking logic from the `reorderJoinKeys` function, and by doing the multiset equality in the `reorder` function while building the reordered key sequences.

## How was this patch tested?
Added a unit test to the `PlannerSuite` and added an integration test to `JoinSuite`

Closes apache#25167 from hvanhovell/SPARK-27485.

Authored-by: herman <herman@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants