[SPARK-8406] [SQL] Adding UUID to output file name to avoid accidental overwriting #6864

Closed
liancheng wants to merge 11 commits into master from liancheng:spark-8406

Conversation

liancheng
Contributor

This PR fixes a Parquet output file name collision bug which may cause data loss. Changes made:

  1. Identify each write job issued by InsertIntoHadoopFsRelation with a UUID

     All concrete data sources that extend HadoopFsRelation (Parquet and ORC for now) must use this UUID to generate task output file paths to avoid name collisions.

  2. Make TestHive use a local mode SparkContext with 32 threads to increase parallelism

     The major reason for this is that the original parallelism of 2 is too low to reproduce the data loss issue. Also, higher concurrency may catch more concurrency bugs during the testing phase. (It did help us spot SPARK-8501.)

  3. OrcSourceSuite was updated to work around SPARK-8501, which we detected along the way.

NOTE: This PR turned out a little more complicated than expected because we hit two other bugs along the way and had to work around them. See SPARK-8501 and SPARK-8513.


Some background and a summary of offline discussion with @yhuai about this issue for better understanding:

In 1.4.0, we added HadoopFsRelation to abstract partition support of all data sources that are based on Hadoop FileSystem interface. Specifically, this makes partition discovery, partition pruning, and writing dynamic partitions for data sources much easier.

To support appending, the Parquet data source tries to find out the max part number of part-files in the destination directory (i.e., <id> in output file name part-r-<id>.gz.parquet) at the beginning of the write job. In 1.3.0, this step happens on the driver side before any files are written. However, in 1.4.0, this was moved to the task side. Unfortunately, tasks scheduled later may see a stale max part number because of files newly written by other tasks that finished earlier within the same job. This is a race condition. In most cases, it only causes nonconsecutive part numbers in output file names. But when the DataFrame contains thousands of RDD partitions, it's likely that two tasks choose the same part number, and the output file of one is then overwritten by the other.
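To make the race concrete, here is a minimal sketch of the task-side max part number scan described above (illustrative only, not the actual Parquet data source code; the helper name and regex are mine):

import org.apache.hadoop.fs.{FileSystem, Path}

// Each task lists the destination directory and derives "max part number + 1"
// right before writing its own file.
def nextPartNumber(fs: FileSystem, outputDir: Path): Int = {
  val partFile = "part-r-(\\d+).*".r
  val existing = fs.listStatus(outputDir).map(_.getPath.getName).collect {
    case partFile(id) => id.toInt
  }
  // Two tasks that both run this before either has written its file compute the
  // same number and later overwrite each other's output.
  if (existing.isEmpty) 0 else existing.max + 1
}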

Before HadoopFsRelation, Spark SQL already supported appending data to Hive tables. From a user's perspective, the two look similar, but they differ a lot internally. When data is inserted into Hive tables via Spark SQL, InsertIntoHiveTable simulates Hive's behavior:

  1. Write data to a temporary location
  2. Move data in the temporary location to the final destination location using
  • Hive.loadTable() for non-partitioned tables
  • Hive.loadPartition() for static partitions
  • Hive.loadDynamicPartitions() for dynamic partitions

The important part is that Hive.copyFiles() is invoked in step 2 to move the data to the destination directory (I find the name somewhat confusing since no "copying" occurs here; we are just moving and renaming files). If a file in the source directory and another file in the destination directory happen to have the same name, say part-r-00001.parquet, the former is moved to the destination directory and renamed with a _copy_N postfix (part-r-00001_copy_1.parquet). That's how Hive handles appending and avoids name collisions between different write jobs.
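A rough sketch of that renaming rule (a simplified stand-in for the behavior described above, not Hive's actual Hive.copyFiles() implementation):

// If the destination already contains a file with the same name, append an
// increasing _copy_N suffix until the name is unique, then move the file there.
def resolveNameCollision(existingNames: Set[String], name: String): String = {
  if (!existingNames.contains(name)) {
    name
  } else {
    val (base, ext) = name.span(_ != '.')  // "part-r-00001" and ".parquet"
    Iterator.from(1)
      .map(n => s"${base}_copy_$n$ext")
      .find(candidate => !existingNames.contains(candidate))
      .get
  }
}

// resolveNameCollision(Set("part-r-00001.parquet"), "part-r-00001.parquet")
// returns "part-r-00001_copy_1.parquet"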

Some alternative fixes considered for this issue:

  1. Use a similar approach as Hive

     This approach is not preferred in Spark 1.4.0, mainly because file metadata operations in S3 tend to be slow, especially for tables with lots of files and/or partitions. That's why InsertIntoHadoopFsRelation inserts into the destination directory directly, and is often used together with DirectParquetOutputCommitter to reduce latency when working with S3. This means we don't get a chance to rename files, and must avoid name collisions from the very beginning.

  2. Same as 1.3, just move max part number detection back to the driver side

     This isn't doable because, unlike 1.3, 1.4 also takes dynamic partitioning into account. When inserting into dynamic partitions, we don't know which partition directories will be touched on the driver side before issuing the write job. Checking all partition directories is simply too expensive for tables with thousands of partitions.

  3. Add an extra component to output file names to avoid name collisions

     This seems to be the only reasonable solution for now. To be more specific, we need a JOB level unique identifier to identify all write jobs issued by InsertIntoHadoopFsRelation. Notice that TASK level unique identifiers can NOT be used, because a speculative task would then write to a different output file than the original task; if both tasks succeed, duplicate output is left behind. Currently, the ORC data source adds System.currentTimeMillis to the output file name for uniqueness, which doesn't work for exactly the same reason.

That's why this PR adds a job level random UUID in BaseWriterContainer (which is used by InsertIntoHadoopFsRelation to issue write jobs). The drawback is that record order is not preserved any more (output files of a later job may be listed before those of an earlier job). However, we never promised to preserve record order when writing data, and Hive doesn't promise this either, because the _copy_N trick breaks the order as well.
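For illustration, a minimal sketch of the resulting naming scheme (identifier names and the file extension are placeholders of mine, not necessarily the exact ones used in BaseWriterContainer):

import java.util.UUID

// One UUID is generated per write job on the driver and shared by all of its tasks.
class WriteJobNaming {
  val uniqueWriteJobId: String = UUID.randomUUID().toString

  // Within a job, the task id keeps names unique; speculative attempts of the same
  // task produce the same name, so at most one copy survives. Across jobs, the
  // shared job UUID prevents collisions with files written by other write jobs.
  def outputFileName(taskId: Int): String =
    f"part-r-$taskId%05d-$uniqueWriteJobId.gz.parquet"
}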

@liancheng
Contributor Author

Background and alternative solutions for this issue can be a little bit complex. Will give a summary of offline discussion with @yhuai here later.

@SparkQA

SparkQA commented Jun 17, 2015

Test build #35066 has finished for PR 6864 at commit e5e92f3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// more cores, the issue can be reproduced steadily. Fortunately our Jenkins builder meets this
// requirement. We probably want to move this test case to spark-integration-tests or spark-perf
// later.
test("SPARK-8406") {
Contributor

Can you add a description in addition to the JIRA?

@liancheng
Contributor Author

retest this please

@liancheng
Contributor Author

The last build failure looks pretty weird: a large part of the Jenkins build log output is replaced by tens of thousands of lines of integer triples, and none of the 5 test failures can be reproduced locally.

@liancheng
Contributor Author

OK, found out that those integers are printed by the SQLQuerySuite test "script transform for stderr". See Josh's comment.

@SparkQA

SparkQA commented Jun 18, 2015

Test build #35077 has finished for PR 6864 at commit e5e92f3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

(Removed the original comment as it was just the background knowledge and discussion summary which had already been moved into the PR description.)

@chenghao-intel
Contributor

Thank you @liancheng for the summary; it's clear even to me, as someone who hadn't dived into this part before. One thing I have been thinking about while reviewing #6833: how do we remove the redundant files when the user turns on speculation while writing data via the data source interface?

@liancheng
Contributor Author

@chenghao-intel Thanks for the comment! Speculation is a great point that I didn't notice. Updated this PR to use a job level UUID instead of a task level one, because essentially what we want is to avoid name collisions between different write jobs (potentially issued by different Spark applications). Within a single write job, we can always avoid name collisions with the help of the task ID.

@SparkQA

SparkQA commented Jun 19, 2015

Test build #35258 has finished for PR 6864 at commit 6d946bd.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

@yhuai Updated the PR description with a revised version of the summary commented above. This is ready for review.

@SparkQA

SparkQA commented Jun 19, 2015

Test build #35259 has finished for PR 6864 at commit 14a47b9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -70,7 +71,7 @@ private[sql] case class InsertIntoHadoopFsRelation(
relation.paths.length == 1,
s"Cannot write to multiple destinations: ${relation.paths.mkString(",")}")

val hadoopConf = sqlContext.sparkContext.hadoopConfiguration
val hadoopConf = new Configuration(sqlContext.sparkContext.hadoopConfiguration)
Contributor

Do we need this? We already do val job = new Job(hadoopConf) below. BTW, we need to add a comment to explain that new Job will clone the conf.
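For context, a small standalone sketch of the two layers of copying being discussed (variable names are mine); Hadoop's Job is documented to copy the configuration it is given, which is why the explicit new Configuration(...) above may be redundant:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

val sharedConf = new Configuration()            // stand-in for sparkContext.hadoopConfiguration
val clonedConf = new Configuration(sharedConf)  // explicit defensive copy, as in the diff above
val job = Job.getInstance(clonedConf)           // Job also copies the configuration internally

// Mutations made through job.getConfiguration are therefore not visible in sharedConf.
job.getConfiguration.set("example.key", "example.value")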

@SparkQA

SparkQA commented Jun 19, 2015

Test build #936 has finished for PR 6864 at commit 14a47b9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

retest this please

@liancheng
Contributor Author

Retesting to gather more test failure logs for diagnosis.

@SparkQA

SparkQA commented Jun 19, 2015

Test build #35313 has finished for PR 6864 at commit 14a47b9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 20, 2015

Test build #35344 has finished for PR 6864 at commit d5698b2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lianhuiwang
Contributor

Thanks @liancheng. @chenghao-intel, there is another situation that needs to be considered when using the data source interface: when some tasks have finished but the job fails because other tasks failed, we need to remove all output files of that job.

@liancheng
Contributor Author

@lianhuiwang Yeah, thanks for the reminder. We are also working on this issue; it will be addressed in another PR. Appending jobs that use output committers like DirectParquetOutputCommitter can be tricky to handle, since they write directly to the target directory without using any temporary folder (this can be super useful for S3, because S3 file metadata operations and directory operations can be very slow). But with this PR, the job level UUID can be used to distinguish files written by different jobs.
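As a rough illustration of that last point (a hypothetical helper, not code from this PR or the follow-up): because the job UUID is embedded in every output file name, files left behind by a failed direct-write job can be identified and removed by matching on it.

import org.apache.hadoop.fs.{FileSystem, Path}

// Delete only the files whose names contain the UUID of the failed write job,
// leaving data written by earlier successful jobs in the same directory untouched.
def cleanupFailedJob(fs: FileSystem, outputDir: Path, failedJobUUID: String): Unit = {
  fs.listStatus(outputDir)
    .map(_.getPath)
    .filter(_.getName.contains(failedJobUUID))
    .foreach(path => fs.delete(path, false))
}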

@liancheng
Contributor Author

With help from @yhuai, I finally found the root cause of the OrcSourceSuite failures shown in previous Jenkins builds. SPARK-8501 was opened to track that issue.

The reason why it shows up in this PR and couldn't be reproduced locally on my laptop is that I changed the thread count of the local SparkContext used by TestHiveContext to *, which means 32 cores on Jenkins but only 8 on my laptop. Meanwhile, the test data used in OrcSourceSuite consists of only 10 rows, so the ORC table written on my laptop consists of 8 part-files that each contain some rows, while the one written on Jenkins consists of 32 part-files, some of which contain zero rows. It turned out that those empty ORC files messed things up. Please refer to SPARK-8501 for details.

For this reason, I made two more updates:

  1. Changed local[*] to local[32] for more determinism (see the sketch after this list). 32 was chosen because Jenkins has 32 cores, and that should be enough for detecting concurrency issues.
  2. Increased the row count of the test data used in OrcSourceSuite to 100 to temporarily work around the build failure. SPARK-8501 will be fixed in another PR.
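To see why the higher thread count surfaced the empty-file problem, here is a quick standalone sketch (not the actual OrcSourceSuite code): under local[32], a 10-row dataset parallelized with the default parallelism is split into 32 partitions, most of them empty, so most part-files end up empty as well.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setMaster("local[32]").setAppName("empty-partition-demo"))

// Default parallelism under local[32] is 32, so 10 rows are spread over 32 partitions.
val partitionSizes = sc.parallelize(1 to 10)
  .mapPartitions(rows => Iterator(rows.size))
  .collect()

println(s"empty partitions: ${partitionSizes.count(_ == 0)} of ${partitionSizes.length}")
sc.stop()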

@SparkQA

SparkQA commented Jun 20, 2015

Test build #35361 has finished for PR 6864 at commit d412de7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -49,7 +49,7 @@ import scala.collection.JavaConversions._
object TestHive
extends TestHiveContext(
new SparkContext(
System.getProperty("spark.sql.test.master", "local[2]"),
System.getProperty("spark.sql.test.master", "local[32]"),
Contributor

Maybe we should still use local[*]?

Contributor Author

I think we'd better use a fixed number here to improve determinism (if we had used 32 from the beginning, the ORC bug would have been much easier to reproduce).

@SparkQA

SparkQA commented Jun 22, 2015

Test build #35417 has finished for PR 6864 at commit 3207323.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

Increasing the thread count of the local SparkContext used by TestHive (and running the tests on a node with relatively more cores, such as our Jenkins builder) has proven pretty useful for detecting concurrency related bugs: SPARK-8501 and SPARK-8513 were both found this way.

@SparkQA

SparkQA commented Jun 22, 2015

Test build #35423 timed out for PR 6864 at commit 99a73ab after a configured wait of 175m.

@yhuai
Contributor

yhuai commented Jun 22, 2015

test this please

super.abortTask()
throw new RuntimeException("Failed to commit task", cause)
} catch { case cause: Throwable =>
throw new RuntimeException("Failed to commit task", cause)
Contributor

This exception will be caught in writeRows, right? If so, can we add a comment and also explain how we will handle this exception?

Contributor

Actually, I think we need to also add doc to InsertIntoHadoopFsRelation to explain the flow of this command and how we handle different kinds of failures/errors.

Contributor Author

Right, it's handled in writeRows. Agreed on adding more comments; I made multiple mistakes here myself...
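For reference, a hedged, self-contained sketch of the control flow discussed in this thread (simplified stand-ins, not the actual InsertIntoHadoopFsRelation internals): the task-side write loop catches any failure, including one thrown by commitTask(), aborts the task, and rethrows so the driver can abort the job.

// Simplified stand-in for the writer container used by each task.
trait TaskWriter {
  def write(record: String): Unit
  def commitTask(): Unit
  def abortTask(): Unit
}

def writeRows(records: Iterator[String], writer: TaskWriter): Unit = {
  try {
    records.foreach(writer.write)
    // A failure thrown here (e.g. by the commit) is caught below rather than
    // escaping unhandled, which is the behavior the review comments refer to.
    writer.commitTask()
  } catch {
    case cause: Throwable =>
      writer.abortTask()  // clean up this task's partial output exactly once
      throw new RuntimeException("Task failed while writing rows", cause)
  }
}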

@yhuai
Contributor

yhuai commented Jun 22, 2015

LGTM. Left two comments regarding adding comments/docs.

@SparkQA

SparkQA commented Jun 22, 2015

Test build #35429 has finished for PR 6864 at commit 99a73ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

#6932 was opened to backport this PR to branch-1.4.

@nemccarthy
Contributor

Any chance this can be merged today?

@SparkQA

SparkQA commented Jun 22, 2015

Test build #35437 has finished for PR 6864 at commit db7a46a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Jun 22, 2015

@nemccarthy Yeah. This one should be in today. I am taking a final check now.

@yhuai
Contributor

yhuai commented Jun 22, 2015

LGTM. I am merging it to master.

asfgit pushed a commit that referenced this pull request Jun 22, 2015
Author: Cheng Lian <lian@databricks.com>

Closes #6932 from liancheng/spark-8406-for-1.4 and squashes the following commits:

a0168fe [Cheng Lian] Backports SPARK-8406 and PR #6864 to branch-1.4
asfgit closed this in 0818fde Jun 22, 2015
liancheng deleted the spark-8406 branch June 22, 2015 18:28
animeshbaranawal pushed a commit to animeshbaranawal/spark that referenced this pull request Jun 25, 2015
[SPARK-8406] [SQL] Adding UUID to output file name to avoid accidental overwriting

Author: Cheng Lian <lian@databricks.com>

Closes apache#6864 from liancheng/spark-8406 and squashes the following commits:

db7a46a [Cheng Lian] More comments
f5c1133 [Cheng Lian] Addresses comments
85c478e [Cheng Lian] Workarounds SPARK-8513
088c76c [Cheng Lian] Adds comment about SPARK-8501
99a5e7e [Cheng Lian] Uses job level UUID in SimpleTextRelation and avoids double task abortion
4088226 [Cheng Lian] Works around SPARK-8501
1d7d206 [Cheng Lian] Adds more logs
8966bbb [Cheng Lian] Fixes Scala style issue
18b7003 [Cheng Lian] Uses job level UUID to take speculative tasks into account
3806190 [Cheng Lian] Lets TestHive use all cores by default
748dbd7 [Cheng Lian] Adding UUID to output file name to avoid accidental overwriting