Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-18107][SQL] Insert overwrite statement runs much slower in spark-sql than it does in hive-client #15667

Closed
wants to merge 4 commits into from

Conversation

viirya
Copy link
Member

@viirya viirya commented Oct 28, 2016

What changes were proposed in this pull request?

As reported on the jira, insert overwrite statement runs much slower in Spark, compared with hive-client.

It seems there is a patch HIVE-11940 which largely improves insert overwrite performance on Hive. HIVE-11940 is patched after Hive 2.0.0.

Because Spark SQL uses older Hive library, we can not benefit from such improvement.

The reporter verified that there is also a big performance gap between Hive 1.2.1 (520.037 secs) and Hive 2.0.1 (35.975 secs) on insert overwrite execution.

Instead of upgrading to Hive 2.0 in Spark SQL, which might not be a trivial task, this patch provides an approach to delete the partition before asking Hive to load data files into the partition.

Note: The case reported on the jira is insert overwrite to partition. Since Hive.loadTable also uses the function to replace files, insert overwrite to table should has the same issue. We can take the same approach to delete the table first. I will upgrade this to include this.

How was this patch tested?

Jenkins tests.

There are existing tests using insert overwrite statement. Those tests should be passed. I added a new test to specially test insert overwrite into partition.

For performance issue, as I don't have Hive 2.0 environment, this needs the reporter to verify it. Please refer to the jira.

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.

@SparkQA
Copy link

SparkQA commented Oct 28, 2016

Test build #67686 has finished for PR 15667 at commit 81dbeb1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Oct 28, 2016

cc @ericl

@viirya
Copy link
Member Author

viirya commented Oct 28, 2016

I checked the failed test. Special characters in the partition path cause the failure, e..g.,

alter table ppr_test add partition (ds = '12:4');
alter table ppr_test add partition (ds = '12%4');
alter table ppr_test add partition (ds = '12*4');

@ericl
Copy link
Contributor

ericl commented Oct 28, 2016

How does the performance look like before / after this patch?

@viirya
Copy link
Member Author

viirya commented Oct 28, 2016

@ericl I don't have the Hive environment to compare. We need to wait for issue reporter to verify that.

@snodawn
Copy link

snodawn commented Oct 29, 2016

Sorry, I'm late, for preparing the environment for testing cost me a lot of time. I have tested the performance before and after the patch. But it seems to improve a few after patching, where it costs 531 seconds before patching, and costs 518 seconds after patching. So I think I need to do more testing to find out the problem.

@viirya
Copy link
Member Author

viirya commented Oct 29, 2016

@snodawn Interesting...I will try to find out it too.

@viirya
Copy link
Member Author

viirya commented Oct 30, 2016

@snodawn Can you try it again? I've updated this.

@SparkQA
Copy link

SparkQA commented Oct 30, 2016

Test build #67785 has finished for PR 15667 at commit 5257d7f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@snodawn
Copy link

snodawn commented Oct 31, 2016

@viirya I have tested the new patch, which performs better than expected. Before patching, it costs about 500~600 seconds, but now it just cost me about 16 seconds to run the same statement. But it still runs slow when I run such sql:

insert overwrite table login4game partition(pt,dt) select distinct account_name,role_id,server,'1476979200' as recdate, 'mix' as platform, 'mix' as pid, 'mix' as dev, pt, dt from tbllog_login where pt='mix_en' and dt='2016-10-21';

It's the dynamic partition in hive, where we needn't to specify the partition value when inserting. I test it in hive 2.0.1, it costs 47.822 seconds, but in hive 1.2.1, it costs 574.33 seconds, as the same with what it does in spark, which is 526.44 seconds.

@viirya
Copy link
Member Author

viirya commented Oct 31, 2016

@snodawn I just updated this with a new commit. Do you use this new patch to test?

Yeah, current fixing doesn't consider dynamic partition. I would like to see if we can improve static partition. I will look into dynamic partition later.

@snodawn
Copy link

snodawn commented Oct 31, 2016

The execution logs in spark show me that, it does the same thing as what it does before I add the patch, which may be the reason why it runs so slow when running dynamic insert overwrite statement.

@viirya
Copy link
Member Author

viirya commented Oct 31, 2016

@snodawn Current fixing does not do anything for dynamic partition. So we can expect this.

@viirya
Copy link
Member Author

viirya commented Oct 31, 2016

@snodawn You just said the new patch runs better. Do you use the latest patch I just updated in about 1 hr ago? Thanks.

@snodawn
Copy link

snodawn commented Oct 31, 2016

Ok, I see. I haven't tested the newest code, I would try it later.

@viirya
Copy link
Member Author

viirya commented Oct 31, 2016

@snodawn Thanks. I expect it should perform as good as you tested. But there are few tests failed I fix in newer patch.

@SparkQA
Copy link

SparkQA commented Oct 31, 2016

Test build #67799 has finished for PR 15667 at commit 74f0f3e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@snodawn
Copy link

snodawn commented Oct 31, 2016

@viirya I have tested the newest patch. It performs good in running the same sql as I ran before.

@viirya
Copy link
Member Author

viirya commented Oct 31, 2016

@snodawn Thanks! I will address dynamic partition in next commit.

throw new RuntimeException(
"Cannot remove partition directory '" + partitionPath.toString)
} else {
fs.mkdirs(partitionPath, pathPermission)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the mkdir necessary?

Copy link
Member Author

@viirya viirya Nov 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking Hive will complain if the dir is not existing. But looks like it won't. Let me remove this and see if all tests can be passed.

@@ -257,7 +258,31 @@ case class InsertIntoHiveTable(
table.catalogTable.identifier.table,
partitionSpec)

var doOverwrite = overwrite
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: doHiveOverwrite?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. updated.

@ericl
Copy link
Contributor

ericl commented Nov 1, 2016

lgtm if tests pass

@viirya
Copy link
Member Author

viirya commented Nov 1, 2016

@ericl Dynamic partition would be more complicated. Should we do it in this or in follow-up?

@ericl
Copy link
Contributor

ericl commented Nov 1, 2016

Let's do it in a follow-up.

@SparkQA
Copy link

SparkQA commented Nov 1, 2016

Test build #67853 has finished for PR 15667 at commit bd22150.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@snodawn
Copy link

snodawn commented Nov 1, 2016

@viirya @ericl Thanks a lot. The new patch is also passed in my test. And I sincerely hope dynamic partition could be completed in the future, thanks.

@rxin
Copy link
Contributor

rxin commented Nov 1, 2016

Merging in master. Thanks.

@asfgit asfgit closed this in dd85eb5 Nov 1, 2016
@viirya
Copy link
Member Author

viirya commented Nov 1, 2016

@rxin @ericl Thanks!

@snodawn I will address dynamic partition in a follow-up pr.

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…rk-sql than it does in hive-client

## What changes were proposed in this pull request?

As reported on the jira, insert overwrite statement runs much slower in Spark, compared with hive-client.

It seems there is a patch [HIVE-11940](apache/hive@ba21806) which largely improves insert overwrite performance on Hive. HIVE-11940 is patched after Hive 2.0.0.

Because Spark SQL uses older Hive library, we can not benefit from such improvement.

The reporter verified that there is also a big performance gap between Hive 1.2.1 (520.037 secs) and Hive 2.0.1 (35.975 secs) on insert overwrite execution.

Instead of upgrading to Hive 2.0 in Spark SQL, which might not be a trivial task, this patch provides an approach to delete the partition before asking Hive to load data files into the partition.

Note: The case reported on the jira is insert overwrite to partition. Since `Hive.loadTable` also uses the function to replace files, insert overwrite to table should has the same issue. We can take the same approach to delete the table first. I will upgrade this to include this.
## How was this patch tested?

Jenkins tests.

There are existing tests using insert overwrite statement. Those tests should be passed. I added a new test to specially test insert overwrite into partition.

For performance issue, as I don't have Hive 2.0 environment, this needs the reporter to verify it. Please refer to the jira.

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes apache#15667 from viirya/improve-hive-insertoverwrite.
@viirya viirya deleted the improve-hive-insertoverwrite branch December 27, 2023 18:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants