Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-23229][SQL] Dataset.hint should use planWithBarrier logical plan #20405

Conversation

jaceklaskowski
Copy link
Contributor

What changes were proposed in this pull request?

Every time Dataset.hint is used it triggers execution of logical commands, their unions and hint resolution (among other things that analyzer does).

hint should use planWithBarrier instead.

Fixes https://issues.apache.org/jira/browse/SPARK-23229

How was this patch tested?

Existing unit tests, local build + awaiting Jenkins

@jaceklaskowski
Copy link
Contributor Author

/cc @cloud-fan

@@ -1216,7 +1216,7 @@ class Dataset[T] private[sql](
*/
@scala.annotation.varargs
def hint(name: String, parameters: Any*): Dataset[T] = withTypedPlan {
UnresolvedHint(name, parameters, logicalPlan)
UnresolvedHint(name, parameters, planWithBarrier)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ResolveBroadcastHints rule will traverse recursively the children of logical plan. If we wrap it with a barrier, we can't be traverse down the tree.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding however is that planWithBarrier is already analyzed (and ResolveBroadcastHints as the very first rule had its chance to do its work). That's the extra processing hint does every time it's called. Using planWithBarrier makes it less "painful".

Just use hint twice and see the analyzed plan.

Copy link
Contributor

@cloud-fan cloud-fan Jan 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @viirya has a valid concern. think about

val df1 = spark.table("t").select("id")
df1.hint("broadcast", "t")

We should transform down the plan of df1, find the bottom table relation and apply the hint.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that that's what ResolveBroadcastHints does --> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala#L93-L101, doesn't it? I'm going to write a test case for it to confirm (and that's what I was asking for in the email to dev@spark the other day).

Copy link
Member

@viirya viirya Jan 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jaceklaskowski Because the logical plan is wrapped in analysis barrier, ResolveBroadcastHints can't traverse down it to reach the UnresolvedRelation/SubqueryAlias. at https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala#L60-L61.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a possible workaround is to explicitly go through the barrier in the hint resolution rules, so that we can still use barrier here and skip analysis in other analyzer rules.

Copy link
Member

@viirya viirya Jan 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. makes sense to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we go with this workaround?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. cc @jaceklaskowski

@SparkQA
Copy link

SparkQA commented Jan 26, 2018

Test build #86702 has finished for PR 20405 at commit 47bb245.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jaceklaskowski
Copy link
Contributor Author

Looks like the tests failed due to "java.io.IOException: Failed to delete: /home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-5a9b5811-306d-4ba0-8bfb-9e263ddf47b8" Is this because of the change or a "misnomer"?

@HyukjinKwon
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Jul 16, 2018

Test build #93078 has finished for PR 20405 at commit 47bb245.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

@HyukjinKwon Does the existing AnalysisBarrier introduce a regression when users use the hint like df1.hint("broadcast", "t")?

@viirya
Copy link
Member

viirya commented Jul 16, 2018

When we use a hint an analyzed plan goes through analysis again. If I remember it correctly it is happened before analysis barrier is introduced. This PR wants to wrap an AnalysisBarrier on it so we won't re-analyze it.

@SparkQA
Copy link

SparkQA commented Jul 25, 2018

Test build #93559 has finished for PR 20405 at commit 47bb245.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

cloud-fan commented Jul 27, 2018

I think this can be closed in favor of #21822

@dongjoon-hyun
Copy link
Member

Hi, @jaceklaskowski . Could you close this PR?

@dongjoon-hyun
Copy link
Member

Gentle ping, @jaceklaskowski .

@maropu
Copy link
Member

maropu commented Sep 18, 2018

kindly ping

@HyukjinKwon
Copy link
Member

ping @jaceklaskowski

@dongjoon-hyun
Copy link
Member

@jaceklaskowski seems to ignore Git messages here. Should we ping him email since he is Spark dev mailing list?

@felixcheung
Copy link
Member

probably.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
9 participants