[SPARK-18107][SQL] Insert overwrite statement runs much slower in spark-sql than it does in hive-client #15667
Conversation
Test build #67686 has finished for PR 15667 at commit
cc @ericl
I checked the failed test. Special characters in the partition path cause the failure, e.g.,
How does the performance look before / after this patch?
@ericl I don't have a Hive environment to compare with. We need to wait for the issue reporter to verify that.
Sorry I'm late; preparing the environment for testing cost me a lot of time. I have tested the performance before and after the patch, but it only improves a little: 531 seconds before patching and 518 seconds after. So I think I need to do more testing to find the problem.
@snodawn Interesting... I will try to find it out too.
@snodawn Can you try it again? I've updated this.
Test build #67785 has finished for PR 15667 at commit
@viirya I have tested the new patch, which performs better than expected. Before patching it took about 500~600 seconds, but now the same statement takes only about 16 seconds. However, it is still slow when I run a statement like: insert overwrite table login4game partition(pt,dt) select distinct account_name, role_id, server, '1476979200' as recdate, 'mix' as platform, 'mix' as pid, 'mix' as dev, pt, dt from tbllog_login where pt='mix_en' and dt='2016-10-21'; This uses Hive's dynamic partitioning, where we don't need to specify the partition values when inserting. In Hive 2.0.1 it takes 47.822 seconds, but in Hive 1.2.1 it takes 574.33 seconds, about the same as in Spark (526.44 seconds).
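For readers unfamiliar with the distinction above: in a static insert the PARTITION clause pins every partition column to a value, while in a dynamic insert some or all values come from the query itself. A small hypothetical Python helper (made up for illustration, not part of Spark or Hive) can classify a spec string:

```python
def partition_spec_kind(spec: str) -> str:
    """Classify a Hive PARTITION(...) spec as 'static', 'dynamic', or 'mixed'.

    Hypothetical helper for illustration only. Columns with '=' carry a
    static value; bare column names are filled in dynamically by the query.
    """
    cols = [c.strip() for c in spec.split(",")]
    static = [c for c in cols if "=" in c]
    if len(static) == len(cols):
        return "static"
    if not static:
        return "dynamic"
    return "mixed"

print(partition_spec_kind("pt='mix_en', dt='2016-10-21'"))  # static
print(partition_spec_kind("pt, dt"))                        # dynamic
```

Note that Hive additionally requires static partition columns to appear before dynamic ones in a mixed spec.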
@snodawn I just updated this with a new commit. Did you use the new patch to test? Yeah, the current fix doesn't consider dynamic partitions. I would like to see if we can improve static partitions first. I will look into dynamic partitions later.
The execution logs in Spark show that it does the same thing as before I applied the patch, which may be why the dynamic insert overwrite statement still runs so slowly.
@snodawn The current fix does not do anything for dynamic partitions, so this is expected.
@snodawn You just said the new patch runs better. Did you use the latest patch I updated about an hour ago? Thanks.
Ok, I see. I haven't tested the newest code; I will try it later.
@snodawn Thanks. I expect it to perform as well as in your test. There were a few failed tests that I fixed in the newer patch.
Test build #67799 has finished for PR 15667 at commit
@viirya I have tested the newest patch. It performs well running the same sql as before.
@snodawn Thanks! I will address dynamic partitions in the next commit.
throw new RuntimeException(
  "Cannot remove partition directory '" + partitionPath.toString)
} else {
  fs.mkdirs(partitionPath, pathPermission)
Is the mkdir necessary?
I was thinking Hive would complain if the dir does not exist. But it looks like it won't. Let me remove this and see if all tests pass.
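The point about mkdirs can be simulated outside Hive. Below is a hypothetical Python sketch (plain local filesystem standing in for HDFS; `drop_partition_dir` and `load_partition` are made-up names, not the actual Spark or Hive code) showing that if the loader creates the target directory itself, recreating it after the delete is redundant:

```python
import os
import shutil
import tempfile

def drop_partition_dir(path):
    # Mirrors the reviewed change: remove the partition directory outright,
    # without recreating it afterwards.
    if os.path.exists(path):
        shutil.rmtree(path)

def load_partition(path, files):
    # Stand-in for Hive's load step: it creates the target directory itself
    # (parents included), which is why the extra mkdirs turned out to be
    # unnecessary.
    os.makedirs(path, exist_ok=True)
    for name, data in files.items():
        with open(os.path.join(path, name), "w") as f:
            f.write(data)

root = tempfile.mkdtemp()
part = os.path.join(root, "pt=mix_en", "dt=2016-10-21")
load_partition(part, {"old.txt": "stale"})
drop_partition_dir(part)                      # no mkdirs afterwards
load_partition(part, {"part-00000": "fresh"}) # loader recreates the dir
print(sorted(os.listdir(part)))  # ['part-00000']
```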
@@ -257,7 +258,31 @@ case class InsertIntoHiveTable(
    table.catalogTable.identifier.table,
    partitionSpec)
var doOverwrite = overwrite |
nit: doHiveOverwrite?
ok. updated.
lgtm if tests pass
@ericl Dynamic partitions would be more complicated. Should we do it in this PR or in a follow-up?
Let's do it in a follow-up.
Test build #67853 has finished for PR 15667 at commit
Merging in master. Thanks.
[SPARK-18107][SQL] Insert overwrite statement runs much slower in spark-sql than it does in hive-client
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes apache#15667 from viirya/improve-hive-insertoverwrite.
What changes were proposed in this pull request?
As reported on the jira, insert overwrite statement runs much slower in Spark, compared with hive-client.
It seems there is a patch HIVE-11940 which largely improves insert overwrite performance on Hive. HIVE-11940 is patched after Hive 2.0.0.
Because Spark SQL uses an older Hive library, we cannot benefit from that improvement.
The reporter verified that there is also a big performance gap between Hive 1.2.1 (520.037 secs) and Hive 2.0.1 (35.975 secs) on insert overwrite execution.
Instead of upgrading to Hive 2.0 in Spark SQL, which might not be a trivial task, this patch provides an approach to delete the partition before asking Hive to load data files into the partition.
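The intuition behind this approach can be sketched with a toy simulation. This hypothetical Python snippet (local filesystem only; the function names are made up and nothing here is the actual Spark or Hive code) contrasts per-file deletion, roughly what the old Hive load path does, with dropping the partition directory in a single call:

```python
import os
import shutil
import tempfile

def _write(d, files):
    for name, data in files.items():
        with open(os.path.join(d, name), "w") as f:
            f.write(data)

def overwrite_per_file(d, files):
    # Pre-HIVE-11940 behaviour, roughly: existing files are enumerated and
    # deleted one at a time. On HDFS each delete is a namenode round trip,
    # which is where the minutes go for partitions with many files.
    for name in os.listdir(d):
        os.remove(os.path.join(d, name))
    _write(d, files)

def overwrite_whole_dir(d, files):
    # This patch's idea: drop the whole partition directory in one call
    # before the load, then write the new files into a fresh directory.
    shutil.rmtree(d)
    os.makedirs(d)
    _write(d, files)

a = tempfile.mkdtemp()
b = tempfile.mkdtemp()
_write(a, {"old-%d" % i: "x" for i in range(3)})
_write(b, {"old-%d" % i: "x" for i in range(3)})
overwrite_per_file(a, {"new": "y"})
overwrite_whole_dir(b, {"new": "y"})
print(sorted(os.listdir(a)) == sorted(os.listdir(b)))  # True
```

Both paths end with identical partition contents; only the number of delete operations differs.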
Note: The case reported on the jira is insert overwrite to a partition. Since Hive.loadTable also uses the same function to replace files, insert overwrite to a table should have the same issue. We can take the same approach and delete the table first. I will update this patch to include that.
How was this patch tested?
Jenkins tests.
There are existing tests using the insert overwrite statement; those tests should pass. I also added a new test specifically for insert overwrite into a partition.
As for the performance issue, since I don't have a Hive 2.0 environment, the reporter needs to verify it. Please refer to the jira.