[SPARK-20187][SQL] Replace loadTable with moveFile to speed up load table for many output files #17505
Conversation
Test build #75445 has finished for PR 17505 at commit
private lazy val moveFileMethod =
  findMethod(
    classOf[Hive],
    "moveFile",
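For context, `findMethod` here looks up a `Hive` method reflectively so the shim layer can compile against multiple Hive versions. A minimal, self-contained sketch of that lookup-and-invoke pattern (using `String.substring` as a stand-in target, since the `Hive` class itself is not assumed to be on the classpath):

```scala
import java.lang.reflect.Method

object ReflectionSketch {
  // Look up a public method by name and parameter types, as the shim does;
  // throws NoSuchMethodException when the target version lacks the method.
  def findMethod(klass: Class[_], name: String, args: Class[_]*): Method =
    klass.getMethod(name, args: _*)

  def main(argv: Array[String]): Unit = {
    // Stand-in target: String.substring(int), invoked reflectively.
    val m = findMethod(classOf[String], "substring", classOf[Int])
    val out = m.invoke("moveFile", Integer.valueOf(4))
    assert(out == "File")
    println(out)
  }
}
```

Looking the method up once in a `lazy val`, as the PR does, avoids paying the reflection lookup cost on every call.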
Does this exist in all the Hive versions Spark supports?
`moveFile` is supported from Hive 1.1.0.
  isSrcLocal)
val tbl = client.getTable(tableName)
val fs = tbl.getDataLocation.getFileSystem(conf)
if (replace) {
`loadTable` calls `replaceFiles` when `replace` is true, instead of calling `Hive.copyFiles`. `replaceFiles` is built on calls to `moveFile`.
Feel free to let me know if anything is missing.
It sounds like your PR is similar to what Hive did in HIVE-12908. Right?
`replaceFiles` moves files one by one, so it takes a long time when loading many output files, as the following log shows:
17/04/01 10:50:04 INFO TaskSetManager: Finished task 216432.0 in stage 0.0 (TID 216866) in 3022 ms on jqhadoop-test28-52.int.yihaodian.com (executor 87) (216868/216869)
17/04/01 10:50:04 INFO TaskSetManager: Finished task 207165.0 in stage 0.0 (TID 216796) in 5952 ms on jqhadoop-test28-8.int.yihaodian.com (executor 54) (216869/216869)
17/04/01 10:50:04 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/04/01 10:50:04 INFO DAGScheduler: ResultStage 0 (processCmd at CliDriver.java:376) finished in 541.797 s
17/04/01 10:50:04 INFO DAGScheduler: Job 0 finished: processCmd at CliDriver.java:376, took 551.208919 s
17/04/01 10:50:04 INFO FileFormatWriter: Job null committed.
17/04/01 10:50:14 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-00000-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-00000-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
17/04/01 10:50:14 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-00001-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-00001-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
17/04/01 10:50:14 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-00002-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-00002-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
17/04/01 10:50:14 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-00003-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-00003-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
17/04/01 10:50:14 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-00004-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-00004-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
17/04/01 10:50:14 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-00005-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-00005-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
17/04/01 10:50:14 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-00006-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-00006-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
17/04/01 10:50:14 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-00007-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-00007-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
17/04/01 10:50:14 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-00008-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-00008-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
...
17/04/01 11:16:11 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-99995-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-99995-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
17/04/01 11:16:11 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-99996-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-99996-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
17/04/01 11:16:11 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-99997-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-99997-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
17/04/01 11:16:11 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-99998-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-99998-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
17/04/01 11:16:11 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-99999-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-99999-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
17/04/01 11:16:18 INFO SparkSqlParser: Parsing command: `tmp`.`spark_load_slow`
17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
Time taken: 2178.736 seconds
17/04/01 11:16:18 INFO CliDriver: Time taken: 2178.736 seconds
`moveFile`, by contrast, only moves once.
The full log can be found in spark.loadTable.log.tar.gz, attached to SPARK-20187.
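The cost gap above is essentially N renames versus one. A small local-filesystem sketch (using `java.nio`, not Hive's actual API; all paths and file names are illustrative) of moving a staging directory in a single rename rather than file by file:

```scala
import java.nio.file.{Files, Path}

object MoveOnceSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical staging layout: a temp directory with a few part files.
    val staging: Path = Files.createTempDirectory("hive-staging-demo")
    (0 until 5).foreach(i => Files.createFile(staging.resolve(f"part-$i%05d")))

    // Per-file loading (what replaceFiles effectively does) would issue one
    // rename per part file. Renaming the directory itself is a single call:
    val dest = staging.resolveSibling("spark_load_demo")
    Files.move(staging, dest)

    assert((0 until 5).forall(i => Files.exists(dest.resolve(f"part-$i%05d"))))
    println("moved 5 files with one rename")
  }
}
```

On a distributed filesystem the difference is larger still, since each rename is a round trip to the namenode.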
This PR has nothing to do with dynamic partitions.
If you read the changes made in that PR, it is doing similar things.
… warehouse/dbName.db/.hive-staging
Test build #75535 has finished for PR 17505 at commit
@gatorsmile You are right. It is similar to HIVE-12908. There are three bottlenecks when loading many output files:
@wangyum @rxin @gatorsmile Is it possible to upgrade the built-in hive-exec to resolve this problem? I believe the built-in hive-exec is this: https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
What changes were proposed in this pull request?

`HiveClientImpl.loadTable` loads files one by one, so this step takes a long time when a job generates many files. This PR replaces `Hive.loadTable` with `Hive.moveFile` to speed up this step for `create table tableName as select ...` and `insert overwrite table tableName select ...`.

How was this patch tested?

Manual tests.
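For illustration, a sketch of the statements this change targets, issued from a Spark application with Hive support (the table and source names are hypothetical, and this needs a Spark deployment with a Hive metastore to run):

```scala
import org.apache.spark.sql.SparkSession

object LoadTableSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("load-table-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Both statements first write part files to a .hive-staging directory,
    // then load them into the target table -- the step this PR speeds up.
    spark.sql("CREATE TABLE tmp.spark_load_demo AS SELECT * FROM tmp.src")
    spark.sql("INSERT OVERWRITE TABLE tmp.spark_load_demo SELECT * FROM tmp.src")

    spark.stop()
  }
}
```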