
[SPARK-3007][SQL] Adds dynamic partitioning support #2616

Closed
wants to merge 28 commits

Conversation

liancheng
Contributor

PR #2226 was reverted because it broke Jenkins builds for an unknown reason. This debugging PR aims to fix the Jenkins build.

This PR also fixes two bugs:

  1. Compression configurations in InsertIntoHiveTable were disabled by mistake

    The FileSinkDesc object passed to the writer container doesn't carry the compression-related configurations. These configurations are not taken care of until saveAsHiveFile is called. This PR moves the compression code forward, right after the instantiation of the FileSinkDesc object (see the sketch after this list).

  2. PreInsertionCasts doesn't take table partitions into account

    In castChildOutput, table.attributes only contains non-partition columns, so for a partitioned table childOutputDataTypes never equals tableOutputDataTypes. This results in funny analyzed plans like this one:

    == Analyzed Logical Plan ==
    InsertIntoTable Map(partcol1 -> None, partcol2 -> None), false
    MetastoreRelation default, dynamic_part_table, None
     Project [c_0#1164,c_1#1165,c_2#1166]
      Project [c_0#1164,c_1#1165,c_2#1166]
       Project [c_0#1164,c_1#1165,c_2#1166]
        ... (repeats 99 times) ...
         Project [c_0#1164,c_1#1165,c_2#1166]
          Project [c_0#1164,c_1#1165,c_2#1166]
           Project [1 AS c_0#1164,1 AS c_1#1165,1 AS c_2#1166]
            Filter (key#1170 = 150)
             MetastoreRelation default, src, None
    

    Awful as this logical plan looks, it's harmless because all of the Projects will be eliminated by the optimizer. That's probably why this issue hasn't been caught before.

@SparkQA

SparkQA commented Oct 1, 2014

QA tests have started for PR 2616 at commit a132c80.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 1, 2014

QA tests have started for PR 2616 at commit f471c4b.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 1, 2014

QA tests have finished for PR 2616 at commit a132c80.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

Reverted the accidental trailing space change. However, since this is really dangerous, I fixed it in #2619.

@SparkQA

SparkQA commented Oct 1, 2014

QA tests have finished for PR 2616 at commit 21935b6.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21131/

asfgit pushed a commit that referenced this pull request Oct 1, 2014
MD5 hashes of query strings in `createQueryTest` calls are used to generate golden files, so leaving trailing spaces there can be really dangerous. Got bitten by this while working on #2616: my "smart" IDE automatically removed a trailing space and made Jenkins fail.

(Really should add "no trailing space" to our coding style guidelines!)

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2619 from liancheng/kill-trailing-space and squashes the following commits:

034f119 [Cheng Lian] Kill dangerous trailing space in query string
@liancheng liancheng changed the title [SPARK-3007][SQL] WIP: adds dynamic partitioning support [SPARK-3007][SQL] Adds dynamic partitioning support Oct 3, 2014
@liancheng
Contributor Author

@marmbrus Let's try to merge this one to master and see whether Jenkins accepts it.

@marmbrus
Contributor

marmbrus commented Oct 3, 2014

Tried merging but it failed :(

@kayousterhout what did you end up doing to merge this the first time?

@kayousterhout
Contributor

Comment out the print statement in merge_pr that causes the failure.


@marmbrus
Contributor

marmbrus commented Oct 3, 2014

Hmmm, still failing with:

subprocess.CalledProcessError: Command '[u'git', u'fetch', u'apache', u'master:PR_TOOL_MERGE_PR_2616_MASTER']' returned non-zero exit status 128

@asfgit asfgit closed this in bec0d0e Oct 3, 2014
@scwf
Contributor

scwf commented Oct 4, 2014

Hi @liancheng, the master branch tests failed on my machine for all dynamic partition cases:
[info] - dynamic_partition *** FAILED ***
[info] - Dynamic partition folder layout *** FAILED ***
[info] - dynamic_partition_skip_default *** FAILED ***
[info] - load_dyn_part1 *** FAILED ***
[info] - load_dyn_part10 *** FAILED ***
[info] - load_dyn_part11 *** FAILED ***
[info] - load_dyn_part12 *** FAILED ***
[info] - load_dyn_part13 *** FAILED ***
[info] - load_dyn_part14 *** FAILED ***
[info] - load_dyn_part14_win *** FAILED ***
[info] - load_dyn_part2 *** FAILED ***
[info] - load_dyn_part3 *** FAILED ***
[info] - load_dyn_part4 *** FAILED ***
[info] - load_dyn_part5 *** FAILED ***
[info] - load_dyn_part6 *** FAILED ***
[info] - load_dyn_part8 *** FAILED ***
[info] - load_dyn_part9 *** FAILED ***
[info] *** 17 TESTS FAILED ***

Detailed log:
[info] - dynamic_partition *** FAILED ***
[info] Failed to execute query using catalyst:
[info] Error: get partition: Value for key partcol1 is null or empty
[info] org.apache.hadoop.hive.ql.metadata.HiveException: get partition: Value for key partcol1 is null or empty
[info] at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:1585)
[info] at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:1556)
[info] at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1189)

Am I missing something here? My test commands are as follows:
sbt/sbt -Phive assembly
sbt/sbt -Phive test

@liancheng
Contributor Author

@scwf Can you elaborate on the configuration you're using? Details like compilation flags, environment variables, and the build process would be helpful. I've been tracking this failure over the last few days but couldn't reproduce it either locally or on the Jenkins PR builder.

@liancheng
Contributor Author

@scwf Or could you please describe the steps to reproduce this failure from a newly checked out master branch? I guess once you can reproduce it, it happens deterministically.

@liancheng
Contributor Author

Ah, I just found out that I can reproduce it with -Phive. I had been using -Phive,hadoop-2.4 all the time, which is why I couldn't reproduce it. Thanks!

@scwf
Contributor

scwf commented Oct 4, 2014

Yes, I will use -Phive,hadoop-2.4 to see whether it has the problem.

@scwf
Contributor

scwf commented Oct 4, 2014

Using -Phive,hadoop-2.4 is also OK on my local machine.

@liancheng
Contributor Author

So this bug can be triggered by lower versions of Hadoop, e.g. 1.0.3. I haven't validated the exact version range yet.

In Hive.loadDynamicPartitions, Hive calls o.a.h.h.q.e.Utilities.getFileStatusRecurse to glob the temporary directory for data files. It seems that lower versions of Hadoop don't filter out files like _SUCCESS, which causes the problem.

Within Hive, loadDynamicPartitions is only used in operations like LOAD. At the end of a normal insertion into a dynamically partitioned table, FileSinkOperator calls Utilities.mvFileToFinalPath to move the entire temporary directory to the target location, and thus doesn't have this problem.

Utilities.mvFileToFinalPath is more efficient than Hive.loadDynamicPartitions since it doesn't parse and validate partition specs, but it requires some internal Hive data structures like DynamicPartitionCtx. I'll see whether I can mock these data structures and use mvFileToFinalPath instead.

@liancheng
Contributor Author

@scwf Thanks for all the information you provided offline :)

@liancheng
Contributor Author

According to previous failed Jenkins builds (1, 2, etc.), Hadoop 1.0.3 and 2.0 are affected, while 2.2 and above are OK. That explains why this PR, together with #2226, always passes Jenkins -- the PR builder uses Hadoop 2.3.

@scwf
Contributor

scwf commented Oct 4, 2014

Got it.

@liancheng
Contributor Author

The reason why _SUCCESS gets picked up is that the semantics of FileSystem.globStatus changed between Hadoop versions, and Utilities.getFileStatusRecurse relies on it to find all partition data files.

Test code:

```scala
object GlobExperiments extends App {
  val conf = new Configuration()
  val fs = FileSystem.getLocal(conf)
  fs.globStatus(new Path("/tmp/wh/*/*/*")).foreach { status =>
    println(status.getPath)
  }
}
```

Target directory structure:

```
/tmp/wh
├── dir0
│   ├── dir1
│   │   └── level2
│   └── level1
└── level0
```

Hadoop 2.4.1 result:

```
file:/tmp/wh/dir0/dir1/level2
```

Hadoop 1.0.4 result:

```
file:/tmp/wh/dir0/dir1/level2
file:/tmp/wh/dir0/level1
file:/tmp/wh/level0
```

asfgit pushed a commit that referenced this pull request Oct 5, 2014
… versions

This is a follow-up of #2226 and #2616 to fix Jenkins master SBT build failures for lower Hadoop versions (1.0.x and 2.0.x).

The root cause is the semantics difference of `FileSystem.globStatus()` between different versions of Hadoop, as illustrated by the following test code:

```scala
object GlobExperiments extends App {
  val conf = new Configuration()
  val fs = FileSystem.getLocal(conf)
  fs.globStatus(new Path("/tmp/wh/*/*/*")).foreach { status =>
    println(status.getPath)
  }
}
```

Target directory structure:

```
/tmp/wh
├── dir0
│   ├── dir1
│   │   └── level2
│   └── level1
└── level0
```

Hadoop 2.4.1 result:

```
file:/tmp/wh/dir0/dir1/level2
```

Hadoop 1.0.4 result:

```
file:/tmp/wh/dir0/dir1/level2
file:/tmp/wh/dir0/level1
file:/tmp/wh/level0
```

In #2226 and #2616, we call `FileOutputCommitter.commitJob()` at the end of the job, and the `_SUCCESS` marker file is written. When working with lower Hadoop versions, due to the `globStatus()` semantics issue, `_SUCCESS` is included as a separate partition data file by `Hive.loadDynamicPartitions()`, and fails partition spec checking. The fix introduced in this PR is kind of a hack: when inserting data with dynamic partitioning, we intentionally avoid writing the `_SUCCESS` marker to work around this issue.

Hive doesn't suffer from this issue because `FileSinkOperator` doesn't call `FileOutputCommitter.commitJob()`; instead, it calls `Utilities.mvFileToFinalPath()` to clean up the output directory and then loads it into the Hive warehouse with `loadDynamicPartitions()`/`loadPartition()`/`loadTable()`. This approach is better because it handles failed jobs and speculative tasks properly. We should add this step to `InsertIntoHiveTable` in another PR.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2663 from liancheng/dp-hadoop-1-fix and squashes the following commits:

0177dae [Cheng Lian] Fixes dynamic partitioning support for lower Hadoop versions
@liancheng liancheng deleted the dp-fix branch February 24, 2015 17:44