
[SPARK-20187][SQL] Replace loadTable with moveFile to speed up load table for many output files #17505

Closed
wants to merge 2 commits

Conversation

wangyum
Member

@wangyum wangyum commented Apr 1, 2017

What changes were proposed in this pull request?

HiveClientImpl.loadTable loads files one by one, so this step can take a long time when a job generates many output files. This PR replaces Hive.loadTable with Hive.moveFile to speed up this step for CREATE TABLE tableName AS SELECT ... and INSERT OVERWRITE TABLE tableName SELECT ....
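Spark's HiveShim layer resolves version-dependent Hive methods by name at runtime, which is why the diff looks up moveFile reflectively. A minimal sketch of the same pattern in plain Python, using getattr on a stdlib type as a stand-in for the Hive class (illustration only; find_method is a hypothetical helper, not Spark code):

```python
def find_method(klass, name):
    """Resolve a method by name at runtime, mirroring the shim's findMethod."""
    method = getattr(klass, name, None)
    if method is None:
        raise AttributeError(f"{klass.__name__} has no method {name!r}")
    return method

# Stand-in for resolving Hive.moveFile: look up str.replace dynamically and
# invoke it, just as the shim invokes the resolved Hive method.
replace = find_method(str, "replace")
print(replace("loadTable", "loadTable", "moveFile"))  # moveFile
```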

How was this patch tested?

Manual tests.

@SparkQA

SparkQA commented Apr 1, 2017

Test build #75445 has finished for PR 17505 at commit c608c4f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private lazy val moveFileMethod =
  findMethod(
    classOf[Hive],
    "moveFile",
Contributor

Does this exist in all the Hive versions Spark supports?

Member Author

moveFile has been available since Hive 1.1.0.

isSrcLocal)
val tbl = client.getTable(tableName)
val fs = tbl.getDataLocation.getFileSystem(conf)
if (replace) {
Member

@gatorsmile gatorsmile Apr 4, 2017

When replace is true, loadTable calls replaceFiles instead of Hive.copyFiles.

replaceFiles is itself built on calls to moveFile.

Feel free to let me know if anything is missing.

Member

It sounds like your PR is similar to what Hive did in HIVE-12908. Right?

Member Author

replaceFiles moves files one by one, so it takes a long time when loading many output files, as the following log shows:

17/04/01 10:50:04 INFO TaskSetManager: Finished task 216432.0 in stage 0.0 (TID 216866) in 3022 ms on jqhadoop-test28-52.int.yihaodian.com (executor 87) (216868/216869)
17/04/01 10:50:04 INFO TaskSetManager: Finished task 207165.0 in stage 0.0 (TID 216796) in 5952 ms on jqhadoop-test28-8.int.yihaodian.com (executor 54) (216869/216869)
17/04/01 10:50:04 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool 
17/04/01 10:50:04 INFO DAGScheduler: ResultStage 0 (processCmd at CliDriver.java:376) finished in 541.797 s
17/04/01 10:50:04 INFO DAGScheduler: Job 0 finished: processCmd at CliDriver.java:376, took 551.208919 s
17/04/01 10:50:04 INFO FileFormatWriter: Job null committed.
17/04/01 10:50:14 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-00000-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-00000-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
17/04/01 10:50:14 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-00001-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-00001-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
17/04/01 10:50:14 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-00002-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-00002-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
17/04/01 10:50:14 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-00003-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-00003-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
17/04/01 10:50:14 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-00004-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-00004-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
17/04/01 10:50:14 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-00005-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-00005-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
17/04/01 10:50:14 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-00006-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-00006-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
17/04/01 10:50:14 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-00007-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-00007-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
17/04/01 10:50:14 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-00008-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-00008-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true

...

17/04/01 11:16:11 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-99995-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-99995-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
17/04/01 11:16:11 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-99996-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-99996-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
17/04/01 11:16:11 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-99997-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-99997-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
17/04/01 11:16:11 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-99998-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-99998-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
17/04/01 11:16:11 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-99999-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-99999-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
17/04/01 11:16:18 INFO SparkSqlParser: Parsing command: `tmp`.`spark_load_slow`
17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
Time taken: 2178.736 seconds
17/04/01 11:16:18 INFO CliDriver: Time taken: 2178.736 seconds

moveFile moves the staging directory in a single operation.

The full log can be found in spark.loadTable.log.tar.gz, attached to SPARK-20187.
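The difference is easy to see outside Spark. A minimal Python sketch (not Spark code; paths and file counts are made up) contrasts the per-file renames that replaceFiles issues with the single directory rename that a moveFile-style approach can use:

```python
import os
import tempfile

def move_one_by_one(src, dst):
    """replaceFiles-style: one rename call per file (N metadata operations)."""
    os.makedirs(dst, exist_ok=True)
    names = os.listdir(src)
    for name in names:
        os.rename(os.path.join(src, name), os.path.join(dst, name))
    return len(names)  # number of rename calls issued

def move_whole_dir(src, dst):
    """moveFile-style: a single rename of the whole staging directory."""
    os.rename(src, dst)
    return 1

base = tempfile.mkdtemp()
src = os.path.join(base, "staging")
os.mkdir(src)
for i in range(100):
    open(os.path.join(src, f"part-{i:05d}"), "w").close()

calls = move_one_by_one(src, os.path.join(base, "t1"))
print(calls)  # 100 renames for 100 files

src2 = os.path.join(base, "staging2")
os.mkdir(src2)
open(os.path.join(src2, "part-00000"), "w").close()
print(move_whole_dir(src2, os.path.join(base, "t2")))  # 1 rename, regardless of file count
```

With 100,000 part files, the per-file variant issues 100,000 namenode renames serially, which is exactly the half-hour gap visible in the timestamps above.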

Member Author

This PR has nothing to do with dynamic partitioning.

Member

If you read the changes made in that PR, it is doing similar things.

@SparkQA

SparkQA commented Apr 5, 2017

Test build #75535 has finished for PR 17505 at commit 0f68ef5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Apr 6, 2017

@gatorsmile You are right. It is similar to HIVE-12908.

There are three bottlenecks when a job produces many output files:

  1. Set spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 to speed up HadoopMapReduceCommitProtocol.commitJob for many output files; see SPARK-20107 for details.
  2. Hive.loadTable loads files one by one; this can be fixed by HIVE-12908.
  3. ShimLoader.getHadoopShims().setFullFileStatus(conf, destStatus, fs, destf) is very slow for many files and seems difficult to optimize.
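For the first bottleneck, the committer algorithm can be selected per job. A hedged example (the application jar name is a placeholder):

```shell
# Algorithm version 2 commits task output directly to the final location,
# avoiding the serial per-file rename pass in commitJob.
spark-submit \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
  YourApp.jar
```

The spark.hadoop. prefix forwards the property into the Hadoop Configuration used by the output committer.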

@wangyum wangyum closed this Apr 6, 2017
@yuananf

yuananf commented Apr 6, 2017

@wangyum @rxin @gatorsmile Is it possible to upgrade the built-in hive-exec to resolve this problem?

I believe the built-in hive-exec is https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2.
Can we just upgrade it to release-2.0.1-spark2 or something similar?
