[SPARK-16360][SQL] Speed up SQL query performance by removing redundant executePlan call #14044

dongjoon-hyun wants to merge 2 commits into apache:master from dongjoon-hyun:SPARK-16360

Conversation
…nt analysis in `Dataset`
```diff
  val qe = sparkSession.sessionState.executePlan(logicalPlan)
  qe.assertAnalyzed()
- new Dataset[Row](sparkSession, logicalPlan, RowEncoder(qe.analyzed.schema))
+ new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema), skipAnalysis = true)
```
Here are two optimizations:
- By using `qe`, `sparkSession.sessionState.executePlan(logicalPlan)` is not called again.
- By using `skipAnalysis = true`, `qe.assertAnalyzed()` is not called again.
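Put together, the two optimizations give roughly the following shape for `ofRows` (a sketch based on the diff above; the signature is abbreviated, and `skipAnalysis` is the constructor flag this commit introduces):

```scala
// Sketch only: imports and the real signature in Dataset.scala are abbreviated.
def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame = {
  val qe = sparkSession.sessionState.executePlan(logicalPlan)
  qe.assertAnalyzed()
  // Passing the already-built QueryExecution `qe` (instead of the raw logical plan)
  // means the Dataset constructor does not call executePlan(logicalPlan) a second
  // time, and skipAnalysis = true avoids repeating assertAnalyzed().
  new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema), skipAnalysis = true)
}
```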
How about we remove the `qe.assertAnalyzed()` call in `ofRows`? Then we don't need the `skipAnalysis` flag.
Thank you for review, @cloud-fan.
It is used due to `RowEncoder(qe.analyzed.schema)`, isn't it?
Can we test how much we can speed up by avoiding the duplicated check analysis? I think it's necessary to avoid duplicated analysis, but check analysis doesn't seem like a big deal. E.g., let's remove the `skipAnalysis` flag and run it again.
I think I wrote the result in the PR description. Is it not what you mean?
Oh, I misunderstood your point.
You mean 1) changing `logicalPlan`, but 2) `skipAnalysis = false`.
Okay. I'll report soon.
Hi, @liancheng and @rxin.
cc @cloud-fan, too.
Test build #61714 has finished for PR 14044 at commit
LGTM
Thank you for review, @naliazheli.
Any idea what causes the regression? 5 seconds seems way too long for analysis...
Thank you for review, @hvanhovell. BTW, it's over 12 seconds for one single analysis: Elapsed time: 25.787751452s --> Elapsed time: 12.364812255s. The reason why I executed
@dongjoon-hyun my point is that analysis should not be taking 12 seconds at all. You can see how much time is spent in each rule if you add the following lines of code to your example:

```scala
import org.apache.spark.sql.catalyst.rules.RuleExecutor
println(RuleExecutor.dumpTimeSpent)
```

This yields the following result (timing in ns):

I think we should take a look at
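For reference, a minimal way to use the timing dump suggested above (a sketch; it assumes a running `spark` session, and the rule names and timings in the output depend on the Spark build):

```scala
import org.apache.spark.sql.catalyst.rules.RuleExecutor

// Run some workload first, so the analyzer/optimizer rules accumulate timing metrics...
spark.range(100).selectExpr("id + 1 AS a", "id * 2 AS b").collect()
// ...then print the cumulative per-rule time spent, reported in nanoseconds.
println(RuleExecutor.dumpTimeSpent)
```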
Oh, I see. And thank you for the advice of
Interesting result. We definitely need to take a look at
Yep. I agree.
Agree with @hvanhovell. Analysis should never take so long for such a simple query. We should avoid duplicated analysis work, but fixing performance issue(s) within the analyzer seems more fruitful.
Thank you for review, @liancheng.
Hi, @cloud-fan, @hvanhovell, @liancheng. According to @cloud-fan's advice, after changing the following, it turns out that the difference is not noticeable. Exactly as you guys told, the second call of

I'll update the PR.
Now, I updated the title and description of the PR/JIRA.

```diff
- new Dataset[Row](sparkSession, logicalPlan, RowEncoder(qe.analyzed.schema))
+ new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema))
```

Thank you all for the fast review & advice. In the first commit, I thought it was important to remove all repeated logic, but now only the minimal meaningful code change remains.
LGTM pending Jenkins.
Test build #61744 has finished for PR 14044 at commit
LGTM
Merged to master. Thanks!
Thank you for review and merging, @liancheng, @cloud-fan, @hvanhovell, and @naliazheli!
What changes were proposed in this pull request?
Currently, there are a few reports about Spark 2.0 query performance regression for large queries.
This PR speeds up SQL query processing by removing a redundant consecutive `executePlan` call in the `Dataset.ofRows` function and in `Dataset` instantiation. Specifically, this PR aims to reduce the overhead of SQL query execution plan generation, not actual query execution, so the result cannot be seen in the Spark Web UI. Please use the following query script. The result is 25.78 sec -> 12.36 sec as expected.

Sample Query
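The original query script is not preserved in this page; the following is a hypothetical reconstruction of the kind of benchmark described (a loop of `Dataset` transformations, each triggering plan generation, timed on the driver — names and sizes here are illustrative, not taken from the PR):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("SPARK-16360-bench").getOrCreate()

// Simple wall-clock timer for driver-side plan-generation work.
def benchmark(name: String)(f: => Unit): Unit = {
  val start = System.nanoTime()
  f
  println(s"$name elapsed time: ${(System.nanoTime() - start) / 1e9}s")
}

benchmark("plan generation") {
  var df = spark.range(1000).toDF("id")
  // Each withColumn returns a new Dataset, so plan generation (and, before this
  // patch, a redundant executePlan/analysis pass) runs on a steadily growing plan.
  for (i <- 1 to 100) {
    df = df.withColumn(s"c$i", df("id") + i)
  }
}
```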
Before
After
How was this patch tested?
Manually, with the above script.