[SPARK-18107][SQL] Insert overwrite statement runs much slower in spark-sql than it does in hive-client #15667
Conversation
Test build #67686 has finished for PR 15667 at commit
cc @ericl
I checked the failed test. Special characters in the partition path cause the failure, e.g.,
How does the performance look before / after this patch?
@ericl I don't have a Hive environment to compare with. We need to wait for the issue reporter to verify that.
Sorry I'm late; preparing the environment for testing cost me a lot of time. I have tested the performance before and after the patch, but it only improves a little: 531 seconds before patching and 518 seconds after. So I think I need to do more testing to find the problem.
@snodawn Interesting... I will try to find it out too.
@snodawn Can you try it again? I've updated this.
Test build #67785 has finished for PR 15667 at commit
@viirya I have tested the new patch, which performs better than expected. Before patching it took about 500~600 seconds, but now the same statement takes only about 16 seconds. However, it is still slow when I run a statement like: insert overwrite table login4game partition(pt,dt) select distinct account_name, role_id, server, '1476979200' as recdate, 'mix' as platform, 'mix' as pid, 'mix' as dev, pt, dt from tbllog_login where pt='mix_en' and dt='2016-10-21'; This uses Hive's dynamic partitioning, where we don't need to specify the partition values when inserting. In Hive 2.0.1 it takes 47.822 seconds, but in Hive 1.2.1 it takes 574.33 seconds, about the same as in Spark (526.44 seconds).
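For readers unfamiliar with the distinction above: in a static insert the PARTITION clause pins every partition column to a value, while in a dynamic insert some or all values come from the query itself. A small hypothetical Python helper (made up for illustration, not part of Spark or Hive) can classify a spec string:

```python
def partition_spec_kind(spec: str) -> str:
    """Classify a Hive PARTITION(...) spec as 'static', 'dynamic', or 'mixed'.

    Hypothetical helper for illustration only. Columns with '=' carry a
    static value; bare column names are filled in dynamically by the query.
    """
    cols = [c.strip() for c in spec.split(",")]
    static = [c for c in cols if "=" in c]
    if len(static) == len(cols):
        return "static"
    if not static:
        return "dynamic"
    return "mixed"

print(partition_spec_kind("pt='mix_en', dt='2016-10-21'"))  # static
print(partition_spec_kind("pt, dt"))                        # dynamic
```

Note that Hive additionally requires static partition columns to appear before dynamic ones in a mixed spec.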
@snodawn I just updated this with a new commit. Did you use the new patch to test? Yeah, the current fix doesn't consider dynamic partitions. I would like to see if we can improve static partitions first. I will look into dynamic partitions later.
The execution logs in Spark show that it does the same thing as before I applied the patch, which may be why the dynamic insert overwrite statement still runs so slowly.
@snodawn The current fix does not do anything for dynamic partitions, so this is expected.
@snodawn You just said the new patch runs better. Did you use the latest patch I updated about an hour ago? Thanks.
Ok, I see. I haven't tested the newest code; I will try it later.
@snodawn Thanks. I expect it to perform as well as in your test. There were a few failed tests that I fixed in the newer patch.
Test build #67799 has finished for PR 15667 at commit
@viirya I have tested the newest patch. It performs well running the same sql as before.
@snodawn Thanks! I will address dynamic partitions in the next commit.
throw new RuntimeException(
  "Cannot remove partition directory '" + partitionPath.toString)
} else {
  fs.mkdirs(partitionPath, pathPermission)
Is the mkdir necessary?
I was thinking Hive would complain if the dir does not exist. But it looks like it won't. Let me remove this and see if all tests pass.
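The point about mkdirs can be simulated outside Hive. Below is a hypothetical Python sketch (plain local filesystem standing in for HDFS; `drop_partition_dir` and `load_partition` are made-up names, not the actual Spark or Hive code) showing that if the loader creates the target directory itself, recreating it after the delete is redundant:

```python
import os
import shutil
import tempfile

def drop_partition_dir(path):
    # Mirrors the reviewed change: remove the partition directory outright,
    # without recreating it afterwards.
    if os.path.exists(path):
        shutil.rmtree(path)

def load_partition(path, files):
    # Stand-in for Hive's load step: it creates the target directory itself
    # (parents included), which is why the extra mkdirs turned out to be
    # unnecessary.
    os.makedirs(path, exist_ok=True)
    for name, data in files.items():
        with open(os.path.join(path, name), "w") as f:
            f.write(data)

root = tempfile.mkdtemp()
part = os.path.join(root, "pt=mix_en", "dt=2016-10-21")
load_partition(part, {"old.txt": "stale"})
drop_partition_dir(part)                      # no mkdirs afterwards
load_partition(part, {"part-00000": "fresh"}) # loader recreates the dir
print(sorted(os.listdir(part)))  # ['part-00000']
```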
@@ -257,7 +258,31 @@ case class InsertIntoHiveTable(
    table.catalogTable.identifier.table,
    partitionSpec)
var doOverwrite = overwrite |
nit: doHiveOverwrite?
ok. updated.
lgtm if tests pass
@ericl Dynamic partitions would be more complicated. Should we do it in this PR or in a follow-up?
Let's do it in a follow-up.
Test build #67853 has finished for PR 15667 at commit
Merging in master. Thanks.
[SPARK-18107][SQL] Insert overwrite statement runs much slower in spark-sql than it does in hive-client
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes apache#15667 from viirya/improve-hive-insertoverwrite.
What changes were proposed in this pull request?
As reported on the jira, insert overwrite statement runs much slower in Spark, compared with hive-client.
It seems there is a patch HIVE-11940 which largely improves insert overwrite performance on Hive. HIVE-11940 is patched after Hive 2.0.0.
Because Spark SQL uses an older Hive library, we cannot benefit from that improvement.
The reporter verified that there is also a big performance gap between Hive 1.2.1 (520.037 secs) and Hive 2.0.1 (35.975 secs) on insert overwrite execution.
Instead of upgrading to Hive 2.0 in Spark SQL, which might not be a trivial task, this patch provides an approach to delete the partition before asking Hive to load data files into the partition.
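The intuition behind this approach can be sketched with a toy simulation. This hypothetical Python snippet (local filesystem only; the function names are made up and nothing here is the actual Spark or Hive code) contrasts per-file deletion, roughly what the old Hive load path does, with dropping the partition directory in a single call:

```python
import os
import shutil
import tempfile

def _write(d, files):
    for name, data in files.items():
        with open(os.path.join(d, name), "w") as f:
            f.write(data)

def overwrite_per_file(d, files):
    # Pre-HIVE-11940 behaviour, roughly: existing files are enumerated and
    # deleted one at a time. On HDFS each delete is a namenode round trip,
    # which is where the minutes go for partitions with many files.
    for name in os.listdir(d):
        os.remove(os.path.join(d, name))
    _write(d, files)

def overwrite_whole_dir(d, files):
    # This patch's idea: drop the whole partition directory in one call
    # before the load, then write the new files into a fresh directory.
    shutil.rmtree(d)
    os.makedirs(d)
    _write(d, files)

a = tempfile.mkdtemp()
b = tempfile.mkdtemp()
_write(a, {"old-%d" % i: "x" for i in range(3)})
_write(b, {"old-%d" % i: "x" for i in range(3)})
overwrite_per_file(a, {"new": "y"})
overwrite_whole_dir(b, {"new": "y"})
print(sorted(os.listdir(a)) == sorted(os.listdir(b)))  # True
```

Both paths end with identical partition contents; only the number of delete operations differs.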
Note: The case reported on the jira is insert overwrite to a partition. Since Hive.loadTable also uses the same function to replace files, insert overwrite to a table should have the same issue. We can take the same approach and delete the table first. I will update this patch to include that.
How was this patch tested?
Jenkins tests.
There are existing tests using the insert overwrite statement; those tests should pass. I also added a new test specifically for insert overwrite into a partition.
As for the performance issue, since I don't have a Hive 2.0 environment, the reporter needs to verify it. Please refer to the jira.