Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-30524] [SQL] Disable OptimizeSkewedJoin rule when introducing additional shuffle #27226

Closed
wants to merge 6 commits into from

Conversation

JkSelf
Copy link
Contributor

@JkSelf JkSelf commented Jan 16, 2020

What changes were proposed in this pull request?

OptimizeSkewedJoin rule change the outputPartitioning after inserting PartialShuffleReaderExec or SkewedPartitionReaderExec. So it may need to introduce additional to ensure the right result. This PR disable OptimizeSkewedJoin rule when introducing additional shuffle.

Why are the changes needed?

bug fix

Does this PR introduce any user-facing change?

No

How was this patch tested?

Add new ut

@JkSelf JkSelf changed the title Disable OptimizeSkewedJoin rule when introducing additional shuffle [SPARK-30524] [SQL] Disable OptimizeSkewedJoin rule when introducing additional shuffle Jan 16, 2020
@JkSelf
Copy link
Contributor Author

JkSelf commented Jan 16, 2020

@cloud-fan @hvanhovell @maryannxue Please help review if you have available time. Thanks for your help.

@SparkQA
Copy link

SparkQA commented Jan 16, 2020

Test build #116807 has finished for PR 27226 at commit 4bed602.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private def containShuffleQueryStage(plan : SparkPlan): (Boolean, ShuffleQueryStageExec) =
plan match {
case stage: ShuffleQueryStageExec => (true, stage)
case sort: SortExec if (sort.child.isInstanceOf[ShuffleQueryStageExec]) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: case SortExec(_, _, s: ShuffleQueryStageExec, _)

private def reOptimizeChild(
skewedReader: SkewedPartitionReaderExec,
child: SparkPlan): SparkPlan = child match {
case sort: SortExec if (sort.child.isInstanceOf[ShuffleQueryStageExec]) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

|Try to optimize skewed join.
|Left side partition size: $leftSizeInfo
|Right side partition size: $rightSizeInfo
|Try to optimize skewed join.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the previous indentation seems corrected.

s1 @ SortExec(_, _, left: ShuffleQueryStageExec, _),
s2 @ SortExec(_, _, right: ShuffleQueryStageExec, _))
if supportedJoinTypes.contains(joinType) =>
private def containShuffleQueryStage(plan : SparkPlan): (Boolean, ShuffleQueryStageExec) =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just return Option[ShuffleQueryStageExec]? we can rename the method to getShuffleQueryStage

child: SparkPlan): SparkPlan = child match {
case sort @ SortExec(_, _, s: ShuffleQueryStageExec, _) =>
sort.copy(child = skewedReader)
case _ => child
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be: case _: ShuffleQueryStageExec => skewedReader?

}
}
}
logDebug(s"number of skewed partitions is ${skewedPartitions.size}")
if (skewedPartitions.nonEmpty) {
val visitedStages = HashSet.empty[Int]
val optimizedSmj = smj.transformDown {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about transformUp? Then we don't need the visitedStages

val optimizedSmj = smj.transformDown {
case sort @ SortExec(_, _, shuffleStage: ShuffleQueryStageExec, _) =>
sort.copy(child = PartialShuffleReaderExec(shuffleStage, skewedPartitions.toSet))
case shuffleStage: ShuffleQueryStageExec if !visitedStages.contains(shuffleStage.id) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be safe, we should do case s: ShuffleQueryStageExec if s.id == left.id || s.id == right.id

@@ -189,6 +230,21 @@ case class OptimizeSkewedJoin(conf: SQLConf) extends Rule[SparkPlan] {
}
}

def handleSkewJoin(plan: SparkPlan): SparkPlan = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not a long method, maybe just inline it in apply?

@@ -579,6 +579,33 @@ class AdaptiveQueryExecSuite
}
}

test("SPARK-30524: AQE should disable OptimizeSkewedJoin rule" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: SPARK-30524: Do not optimize skew join if introduce additional shuffle

@SparkQA
Copy link

SparkQA commented Jan 16, 2020

Test build #116810 has finished for PR 27226 at commit 903309f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 16, 2020

Test build #116816 has finished for PR 27226 at commit 81ad999.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 16, 2020

Test build #116819 has finished for PR 27226 at commit c37f397.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 16, 2020

Test build #116808 has finished for PR 27226 at commit c1c05d4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Jan 16, 2020

Test build #116835 has finished for PR 27226 at commit c37f397.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 6e5b4bf Jan 16, 2020
handleSkewJoin(plan)
// When multi table join, there will be too many complex combination to consider.
// Currently we only handle 2 table join like following two use cases.
// SMJ SMJ
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry that my previous comment was wrong. Once we have shuffle, there should always be a sort. So we don't need to match this.

@@ -95,8 +96,20 @@ case class SortMergeJoinExec(
s"${getClass.getSimpleName} should not take $x as the JoinType")
}

override def requiredChildDistribution: Seq[Distribution] =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably make this a flag to indicate it's a partial SMJ. This whole matching is too tightly coupled with the skew join rule itself.

@maryannxue
Copy link
Contributor

@JkSelf can you do a quick follow up for the comments above as well as this one:
https://github.com/apache/spark/pull/26434/files#r367506078 ?

cloud-fan pushed a commit that referenced this pull request Jan 19, 2020
### What changes were proposed in this pull request?
Resolve the remaining comments in [PR#27226](#27226).

### Why are the changes needed?
Resolve the comments.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing unit tests.

Closes #27253 from JkSelf/followup-skewjoinoptimization2.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants