
[SPARK-3007][SQL]Add Dynamic Partition support to Spark Sql hive #2226

Closed
wants to merge 26 commits

Conversation

baishuo
Contributor

baishuo commented Sep 1, 2014

A new PR based on the new master branch. The changes are the same as #1919.

@AmplabJenkins

Can one of the admins verify this patch?

@liancheng
Contributor

ok to test

@baishuo
Contributor Author

baishuo commented Sep 3, 2014

Hi @marmbrus and @liancheng, the latest code has passed "dev/lint-scala" and "sbt/sbt catalyst/test sql/test hive/test" locally.

```scala
fileSinkConf,
jobConfSer,
sc.hiveconf.getBoolean("hive.exec.compress.output", false),
dynamicPartNum)
```
Contributor

Bad indentation :)

@liancheng
Contributor

Please also submit golden answer files for newly whitelisted test cases in HiveCompatibilitySuite.

@liancheng
Contributor

ok to test

@liancheng
Contributor

@baishuo Just added a note about Hive golden answer files to the Spark Wiki (https://cwiki.apache.org/confluence/display/SPARK/Spark+SQL+Internals), please refer to this page to generate and submit those files. Thanks!

@liancheng
Contributor

ok to test

@baishuo
Contributor Author

baishuo commented Sep 5, 2014

The golden answer files related to HiveCompatibilitySuite already exist in the master branch of Spark, so there is no need to add them.

@baishuo
Contributor Author

baishuo commented Sep 5, 2014

Can this PR be tested? :)

@liancheng
Contributor

test this please

@SparkQA

SparkQA commented Sep 5, 2014

Can one of the admins verify this patch?

@liancheng
Contributor

test this please

```scala
if (outputPath == null) {
  throw new IOException("Undefined job output-path")
}
val workPath = new Path(outputPath, dynamicPartPath.substring(1)) // remove "/"
```
Contributor

What about .stripPrefix("/")?
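For illustration (the values here are hypothetical, not from the PR), stripPrefix is safer than substring(1) because it only removes the slash when it is actually present:

```scala
"/part=1".substring(1)      // "part=1"
"part=1".substring(1)       // "art=1"  -- unconditionally drops a character
"/part=1".stripPrefix("/")  // "part=1"
"part=1".stripPrefix("/")   // "part=1" -- no-op when the prefix is absent
```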

@marmbrus
Contributor

marmbrus commented Sep 9, 2014

ok to test

```scala
writerMap += (record._2 -> tempWriter)
tempWriter
}
}
```
Contributor

I think the indentation here is off.

@SparkQA

SparkQA commented Sep 9, 2014

QA tests have started for PR 2226 at commit 15d877b.

  • This patch merges cleanly.

```scala
writer.commitJob()
/*
 * if rowVal is null or "",will return HiveConf.get(hive.exec.default.partition.name) with default
 * */
```
Contributor

Should be:

```scala
/**
 * Returns `rowVal` as a String. If `rowVal` is null or equal to "", returns the default partition name.
 */
```
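A minimal sketch of the helper this comment describes (the name and signature are hypothetical, not taken from the PR):

```scala
// Maps null and "" to the configured default partition name; everything else
// is rendered with toString. `defaultPartName` stands in for the value of
// hive.exec.default.partition.name.
def partitionValueToString(rowVal: Any, defaultPartName: String): String =
  rowVal match {
    case null | "" => defaultPartName
    case v         => v.toString
  }
```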

@marmbrus
Contributor

marmbrus commented Sep 9, 2014

Thanks again for working on this! This will be an awesome feature to have. :) I did a pretty detailed pass and made a few comments. A few high-level notes:

  • There was a lot of unnecessary mutable state, which we try to avoid in Spark SQL. In general we try to limit the use of vars to places where it is critical for performance.
  • If at all possible it would be great to separate the dynamic partition support from the rest of the code. Right now there are a lot of if (dynamicPartNum == 0) or if (record._2 == null) or if (dynamicPartNum > 0) checks interleaved with other code. I think this might be a little easier to follow if common code was broken out into functions and there was a single path for each type of insertion.
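A hypothetical sketch of that second suggestion (all names here are invented for illustration, not the PR's actual code):

```scala
trait RecordWriter { def write(row: Seq[Any]): Unit; def close(): Unit }

// Static insertion: exactly one writer, no partition dispatch in the loop.
def writeStatic(rows: Iterator[Seq[Any]], writer: RecordWriter): Unit = {
  rows.foreach(writer.write)
  writer.close()
}

// Dynamic insertion: one writer per dynamic partition path. Mutation is
// confined to a local map instead of vars shared across the class.
def writeDynamic(
    rows: Iterator[(Seq[Any], String)],
    newWriter: String => RecordWriter): Unit = {
  val writers = scala.collection.mutable.Map.empty[String, RecordWriter]
  for ((row, partition) <- rows) {
    writers.getOrElseUpdate(partition, newWriter(partition)).write(row)
  }
  writers.values.foreach(_.close())
}
```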

```scala
 * Create a `HiveRecordWriter`. A relative dynamic partition path can be used to create a writer
 * for writing data to a dynamic partition.
 */
def open() {
```
Contributor

Seems open is not a good name here. Maybe rename it?

Contributor

Maybe init()? Also, I forgot to update the comments.

Contributor

Is it always called after executorSideSetup? If so, can we rename it to something like setupWriter (or initWriter) and call it at the end of executorSideSetup instead of calling it in writeToFile?

Contributor

Yea, I also realized this. Renamed this to initWriters and merged it into executorSideSetup. Also merged the commit() call into close().
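In outline, the lifecycle this exchange converges on (method bodies elided; only the method names come from the discussion above):

```scala
class WriterContainer {
  // Writers are initialized as the last step of executor-side setup...
  def executorSideSetup(): Unit = {
    // ...task-side initialization...
    initWriters()
  }

  private def initWriters(): Unit = { /* create HiveRecordWriters */ }

  // ...and commit() is folded into close(), so callers see a single
  // setup/close lifecycle.
  def close(): Unit = {
    // ...flush and release writers...
    commit()
  }

  private def commit(): Unit = { /* commit task output */ }
}
```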

@liancheng
Contributor

Addressed @yhuai's comments except for adding more tests, will add them soon.

@SparkQA

SparkQA commented Sep 18, 2014

QA tests have started for PR 2226 at commit b20a3dc.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 18, 2014

QA tests have finished for PR 2226 at commit b20a3dc.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 23, 2014

QA tests have started for PR 2226 at commit e69ce88.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 23, 2014

QA tests have finished for PR 2226 at commit e69ce88.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 23, 2014

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20679/

@liancheng
Contributor

LGTM

@marmbrus This is finally good to go :)

@baishuo
Contributor Author

baishuo commented Sep 23, 2014

thanks a lot to @liancheng :)

@marmbrus
Contributor

Awesome, thanks guys! Can you remove the "s from the title... I think that is breaking my merge script.

baishuo changed the title from [SPARK-3007][SQL]Add "Dynamic Partition" support to Spark Sql hive to [SPARK-3007][SQL]Add Dynamic Partition support to Spark Sql hive on Sep 24, 2014
@baishuo
Contributor Author

baishuo commented Sep 24, 2014

had remove "s from title @marmbrus

@baishuo
Contributor Author

baishuo commented Sep 24, 2014

I think I should say thank you to @liancheng and @yhuai. I have learned a lot from communicating with you :)

@baishuo
Contributor Author

baishuo commented Sep 25, 2014

Hi @marmbrus, would you please run the merge script again? :)

asfgit closed this in 0bbe7fa Sep 29, 2014
@kayousterhout
Contributor

I've merged this into master. Sorry for the delay -- unicode characters in the commit author names were causing our merge script to crash!

@liancheng
Contributor

Haha, have we updated our merge script to handle unicode? I should note that half of Spark SQL contributors are Chinese :)

asfgit pushed a commit that referenced this pull request Oct 3, 2014
PR #2226 was reverted because it broke Jenkins builds for an unknown reason. This debugging PR aims to fix the Jenkins build.

This PR also fixes two bugs:

1. Compression configurations in `InsertIntoHiveTable` are disabled by mistake

   The `FileSinkDesc` object passed to the writer container doesn't carry any compression related configuration. That configuration is not taken care of until `saveAsHiveFile` is called. This PR moves the compression code forward, to right after the instantiation of the `FileSinkDesc` object (see the sketch after this list).

2. `PreInsertionCasts` doesn't take table partitions into account

   In `castChildOutput`, `table.attributes` only contains non-partition columns, thus for partitioned tables `childOutputDataTypes` never equals `tableOutputDataTypes`. This results in a funny analyzed plan like this:

   ```
   == Analyzed Logical Plan ==
   InsertIntoTable Map(partcol1 -> None, partcol2 -> None), false
    MetastoreRelation default, dynamic_part_table, None
    Project [c_0#1164,c_1#1165,c_2#1166]
     Project [c_0#1164,c_1#1165,c_2#1166]
      Project [c_0#1164,c_1#1165,c_2#1166]
       ... (repeats 99 times) ...
        Project [c_0#1164,c_1#1165,c_2#1166]
         Project [c_0#1164,c_1#1165,c_2#1166]
          Project [1 AS c_0#1164,1 AS c_1#1165,1 AS c_2#1166]
           Filter (key#1170 = 150)
            MetastoreRelation default, src, None
   ```

   Awful though this logical plan looks, it's harmless because all the Projects will be eliminated by the optimizer. I guess that's why this issue hasn't been caught before.
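For the first fix, the relocated compression setup looks roughly like this (a sketch reconstructed from the description above, assuming the Hive 0.12-era `FileSinkDesc(String, TableDesc, Boolean)` constructor; the helper name is invented):

```scala
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.plan.{FileSinkDesc, TableDesc}

def newFileSinkDesc(dir: String, tableDesc: TableDesc, hiveconf: HiveConf): FileSinkDesc = {
  val fileSinkConf = new FileSinkDesc(dir, tableDesc, false)
  // Compression is configured immediately after construction, instead of
  // waiting until saveAsHiveFile is called.
  if (hiveconf.getBoolean("hive.exec.compress.output", false)) {
    fileSinkConf.setCompressed(true)
    fileSinkConf.setCompressCodec(hiveconf.get("mapred.output.compression.codec"))
    fileSinkConf.setCompressType(hiveconf.get("mapred.output.compression.type"))
  }
  fileSinkConf
}
```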

Author: Cheng Lian <lian.cs.zju@gmail.com>
Author: baishuo(白硕) <vc_java@hotmail.com>
Author: baishuo <vc_java@hotmail.com>

Closes #2616 from liancheng/dp-fix and squashes the following commits:

21935b6 [Cheng Lian] Adds back deleted trailing space
f471c4b [Cheng Lian] PreInsertionCasts should take table partitions into account
a132c80 [Cheng Lian] Fixes output compression
9c6eb2d [Cheng Lian] Adds tests to verify dynamic partitioning folder layout
0eed349 [Cheng Lian] Addresses @yhuai's comments
26632c3 [Cheng Lian] Adds more tests
9227181 [Cheng Lian] Minor refactoring
c47470e [Cheng Lian] Refactors InsertIntoHiveTable to a Command
6fb16d7 [Cheng Lian] Fixes typo in test name, regenerated golden answer files
d53daa5 [Cheng Lian] Refactors dynamic partitioning support
b821611 [baishuo] pass check style
997c990 [baishuo] use HiveConf.DEFAULTPARTITIONNAME to replace hive.exec.default.partition.name
761ecf2 [baishuo] modify according micheal's advice
207c6ac [baishuo] modify for some bad indentation
caea6fb [baishuo] modify code to pass scala style checks
b660e74 [baishuo] delete a empty else branch
cd822f0 [baishuo] do a little modify
8e7268c [baishuo] update file after test
3f91665 [baishuo(白硕)] Update Cast.scala
8ad173c [baishuo(白硕)] Update InsertIntoHiveTable.scala
051ba91 [baishuo(白硕)] Update Cast.scala
d452eb3 [baishuo(白硕)] Update HiveQuerySuite.scala
37c603b [baishuo(白硕)] Update InsertIntoHiveTable.scala
98cfb1f [baishuo(白硕)] Update HiveCompatibilitySuite.scala
6af73f4 [baishuo(白硕)] Update InsertIntoHiveTable.scala
adf02f1 [baishuo(白硕)] Update InsertIntoHiveTable.scala
1867e23 [baishuo(白硕)] Update SparkHadoopWriter.scala
6bb5880 [baishuo(白硕)] Update HiveQl.scala
asfgit pushed a commit that referenced this pull request Oct 5, 2014
… versions

This is a follow-up to #2226 and #2616 to fix Jenkins master SBT build failures for lower Hadoop versions (1.0.x and 2.0.x).

The root cause is the semantics difference of `FileSystem.globStatus()` between different versions of Hadoop, as illustrated by the following test code:

```scala
object GlobExperiments extends App {
  val conf = new Configuration()
  val fs = FileSystem.getLocal(conf)
  fs.globStatus(new Path("/tmp/wh/*/*/*")).foreach { status =>
    println(status.getPath)
  }
}
```

Target directory structure:

```
/tmp/wh
├── dir0
│   ├── dir1
│   │   └── level2
│   └── level1
└── level0
```

Hadoop 2.4.1 result:

```
file:/tmp/wh/dir0/dir1/level2
```

Hadoop 1.0.4 result:

```
file:/tmp/wh/dir0/dir1/level2
file:/tmp/wh/dir0/level1
file:/tmp/wh/level0
```

In #2226 and #2616, we call `FileOutputCommitter.commitJob()` at the end of the job, and the `_SUCCESS` mark file is written. When working with lower Hadoop versions, due to the `globStatus()` semantics issue, `_SUCCESS` is included as a separate partition data file by `Hive.loadDynamicPartitions()`, and fails partition spec checking. The fix introduced in this PR is kind of a hack: when inserting data with dynamic partitioning, we intentionally avoid writing the `_SUCCESS` marker to work around this issue.
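One way to express that workaround (an assumption about the mechanism based on the description above, not a quote of the patch) is via the standard Hadoop flag that controls the marker file:

```scala
import org.apache.hadoop.mapred.JobConf

// FileOutputCommitter writes _SUCCESS only when this flag is true, so
// disabling it for dynamic-partition inserts keeps the marker out of the
// directory that Hive.loadDynamicPartitions() later scans.
def suppressSuccessMarker(jobConf: JobConf): Unit =
  jobConf.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false)
```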

Hive doesn't suffer from this issue because `FileSinkOperator` doesn't call `FileOutputCommitter.commitJob()`; instead, it calls `Utilities.mvFileToFinalPath()` to clean up the output directory and then loads it into the Hive warehouse with `loadDynamicPartitions()`/`loadPartition()`/`loadTable()`. This approach is better because it handles failed jobs and speculative tasks properly. We should add this step to `InsertIntoHiveTable` in another PR.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2663 from liancheng/dp-hadoop-1-fix and squashes the following commits:

0177dae [Cheng Lian] Fixes dynamic partitioning support for lower Hadoop versions