New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-23229][SQL] Dataset.hint should use planWithBarrier logical plan #20405
[SPARK-23229][SQL] Dataset.hint should use planWithBarrier logical plan #20405
Conversation
/cc @cloud-fan |
@@ -1216,7 +1216,7 @@ class Dataset[T] private[sql]( | |||
*/ | |||
@scala.annotation.varargs | |||
def hint(name: String, parameters: Any*): Dataset[T] = withTypedPlan { | |||
UnresolvedHint(name, parameters, logicalPlan) | |||
UnresolvedHint(name, parameters, planWithBarrier) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ResolveBroadcastHints
rule will traverse recursively the children of logical plan. If we wrap it with a barrier, we can't be traverse down the tree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding however is that planWithBarrier
is already analyzed (and ResolveBroadcastHints
as the very first rule had its chance to do its work). That's the extra processing hint
does every time it's called. Using planWithBarrier
makes it less "painful".
Just use hint
twice and see the analyzed plan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @viirya has a valid concern. think about
val df1 = spark.table("t").select("id")
df1.hint("broadcast", "t")
We should transform down the plan of df1
, find the bottom table relation and apply the hint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought that that's what ResolveBroadcastHints
does --> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala#L93-L101, doesn't it? I'm going to write a test case for it to confirm (and that's what I was asking for in the email to dev@spark the other day).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jaceklaskowski Because the logical plan is wrapped in analysis barrier, ResolveBroadcastHints
can't traverse down it to reach the UnresolvedRelation
/SubqueryAlias
. at https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala#L60-L61.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a possible workaround is to explicitly go through the barrier in the hint resolution rules, so that we can still use barrier here and skip analysis in other analyzer rules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. makes sense to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we go with this workaround?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so. cc @jaceklaskowski
Test build #86702 has finished for PR 20405 at commit
|
Looks like the tests failed due to "java.io.IOException: Failed to delete: /home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-5a9b5811-306d-4ba0-8bfb-9e263ddf47b8" Is this because of the change or a "misnomer"? |
retest this please |
Test build #93078 has finished for PR 20405 at commit
|
@HyukjinKwon Does the existing AnalysisBarrier introduce a regression when users use the hint like |
When we use a hint an analyzed plan goes through analysis again. If I remember it correctly it is happened before analysis barrier is introduced. This PR wants to wrap an AnalysisBarrier on it so we won't re-analyze it. |
Test build #93559 has finished for PR 20405 at commit
|
I think this can be closed in favor of #21822 |
Hi, @jaceklaskowski . Could you close this PR? |
Gentle ping, @jaceklaskowski . |
kindly ping |
ping @jaceklaskowski |
@jaceklaskowski seems to ignore Git messages here. Should we ping him email since he is Spark dev mailing list? |
probably. |
What changes were proposed in this pull request?
Every time
Dataset.hint
is used it triggers execution of logical commands, their unions and hint resolution (among other things that analyzer does).hint
should useplanWithBarrier
instead.Fixes https://issues.apache.org/jira/browse/SPARK-23229
How was this patch tested?
Existing unit tests, local build + awaiting Jenkins