[SPARK-22263][SQL]Refactor deterministic as lazy value #19478

Closed

Conversation

gengliangwang
Member

@gengliangwang gengliangwang commented Oct 12, 2017

What changes were proposed in this pull request?

The `deterministic` method is called frequently in the optimizer.
This PR refactors `deterministic` into a lazy value so that the result is computed once per expression and redundant computation is avoided.
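
For readers unfamiliar with the distinction, the following is a minimal sketch of why a `lazy val` helps here. It uses a toy `Node` class rather than Catalyst's real `Expression`; the names `Node`, `Leaf`, `Add`, `countDef`, `countLazy`, and the `evaluations` counter are hypothetical and exist only for this illustration.

```scala
object LazyDeterministicSketch {
  // Counts how many node-level determinism checks are actually evaluated.
  var evaluations = 0

  // Toy stand-in for an expression tree node (assumption: not Spark's class).
  abstract class Node(val children: Seq[Node]) {
    // `def`: re-walks the whole subtree on every call.
    def deterministicDef: Boolean = {
      evaluations += 1
      children.forall(_.deterministicDef)
    }

    // `lazy val`: walks the subtree on first access, then returns the cached result.
    lazy val deterministicLazy: Boolean = {
      evaluations += 1
      children.forall(_.deterministicLazy)
    }
  }

  class Leaf extends Node(Nil)
  class Add(l: Node, r: Node) extends Node(Seq(l, r))

  // A five-node tree: Add(Add(Leaf, Leaf), Leaf).
  private def sampleTree(): Node = new Add(new Add(new Leaf, new Leaf), new Leaf)

  def countDef(calls: Int): Int = {
    val tree = sampleTree()
    evaluations = 0
    (1 to calls).foreach(_ => tree.deterministicDef)
    evaluations
  }

  def countLazy(calls: Int): Int = {
    val tree = sampleTree()
    evaluations = 0
    (1 to calls).foreach(_ => tree.deterministicLazy)
    evaluations
  }

  def main(args: Array[String]): Unit = {
    println(s"def:      ${countDef(10)} evaluations")  // 50: five nodes, ten walks
    println(s"lazy val: ${countLazy(10)} evaluations") // 5: each node evaluated once
  }
}
```

The saving grows with tree depth and with the number of optimizer rules that query the flag, which is presumably why the effect shows up across a whole TPC-DS run rather than in any single call.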

How was this patch tested?

A simple benchmark over the TPC-DS queries, measuring the run time from query string to optimized plan (20 consecutive runs, averaging the last 5 results):
Before changes: 12601 ms
After changes: 11993 ms
This is a 4.8% performance improvement.

The unit tests were also run.
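
The measurement procedure described above can be sketched roughly as follows. This is not the benchmark harness used for the PR, which is not shown in the conversation; `averageOfLastRuns` and `improvementPercent` are hypothetical helper names, and the workload passed in is a placeholder for parsing and optimizing the TPC-DS queries.

```scala
object OptimizerTimingSketch {
  // Runs `body` `runs` times and averages the wall-clock time (in ms)
  // of the last `keep` runs, discarding the earlier runs as warm-up.
  def averageOfLastRuns(runs: Int, keep: Int)(body: => Unit): Double = {
    val timesMs = (1 to runs).map { _ =>
      val start = System.nanoTime()
      body
      (System.nanoTime() - start) / 1e6
    }
    timesMs.takeRight(keep).sum / keep
  }

  // Relative improvement of `afterMs` over `beforeMs`, in percent.
  def improvementPercent(beforeMs: Double, afterMs: Double): Double =
    (beforeMs - afterMs) / beforeMs * 100.0

  def main(args: Array[String]): Unit = {
    // Placeholder workload; the PR measured query string -> optimized plan.
    val avg = averageOfLastRuns(runs = 20, keep = 5) {
      (1 to 100000).map(_ * 2).sum
    }
    println(f"average of last 5 runs: $avg%.3f ms")

    // Figures reported in the PR description.
    println(f"improvement: ${improvementPercent(12601, 11993)}%.1f%%") // 4.8%
  }
}
```

Discarding the first 15 runs gives the JIT time to warm up, so the averaged figures reflect steady-state behaviour.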

@SparkQA

SparkQA commented Oct 12, 2017

Test build #82664 has finished for PR 19478 at commit bdeea55.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -79,7 +79,9 @@ abstract class Expression extends TreeNode[Expression] {
* An example would be `SparkPartitionID` that relies on the partition id returned by TaskContext.
* By default leaf expressions are deterministic as Nil.forall(_.deterministic) returns true.
*/
-  def deterministic: Boolean = children.forall(_.deterministic)
+  lazy val deterministic: Boolean = isDeterministic
Member

@viirya viirya Oct 12, 2017

I doubt this saves much time. But why not just:

lazy val deterministic: Boolean = children.forall(_.deterministic)

I think it is equivalent to this change.

Contributor

+1

@hvanhovell
Contributor

@gengliangwang do you have any benchmark that shows that this is a performance bottleneck?

@SparkQA

SparkQA commented Oct 12, 2017

Test build #82667 has finished for PR 19478 at commit 8143ee1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Could you compare the total optimization time for the TPC-DS queries?

@gengliangwang
Member Author

@viirya @hvanhovell @gatorsmile

Thanks, I have added the performance results to the description of this PR.
Overall, I don't see any downside to the code change.
Also, determining whether a UDF is deterministic may take considerably longer, and making `deterministic` a lazy value avoids repeating that work.
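
As a rough illustration of the UDF point, the sketch below shows a subclass with a costly determinism check that is evaluated only once when the member is a `lazy val`. The classes `Node`, `Literal`, and `SlowUdf` and the `expensiveChecks` counter are hypothetical and are not Spark's classes. One Scala detail worth noting: a `lazy val` in a parent class can only be overridden by another `lazy val`, so subclasses need `override lazy val`.

```scala
object OverrideLazySketch {
  // Counts how often the (simulated) expensive check body runs.
  var expensiveChecks = 0

  // Toy base class; the default is derived from the children, as in the PR.
  abstract class Node(val children: Seq[Node]) {
    lazy val deterministic: Boolean = children.forall(_.deterministic)
  }

  class Literal extends Node(Nil)

  // Hypothetical UDF-like node whose determinism check is assumed to be costly.
  class SlowUdf(child: Node) extends Node(Seq(child)) {
    // Must be `override lazy val`: a lazy val cannot be overridden by a def or a strict val.
    override lazy val deterministic: Boolean = {
      expensiveChecks += 1 // stands in for an expensive inspection of the UDF
      children.forall(_.deterministic)
    }
  }

  // Queries the flag `calls` times and reports how often the check body ran.
  def checksAfter(calls: Int): Int = {
    expensiveChecks = 0
    val udf = new SlowUdf(new Literal)
    (1 to calls).foreach(_ => udf.deterministic)
    expensiveChecks
  }

  def main(args: Array[String]): Unit = {
    println(s"check body ran ${checksAfter(100)} time(s) for 100 queries") // 1
  }
}
```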

Member

@gatorsmile gatorsmile left a comment

LGTM cc @JoshRosen

@SparkQA

SparkQA commented Oct 12, 2017

Test build #82698 has finished for PR 19478 at commit 03e88b0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Oct 13, 2017

Ok. The performance result looks good. LGTM.

@gatorsmile
Member

Thanks! Merged to master.

@asfgit asfgit closed this in 3ff766f Oct 13, 2017