
[SPARK-14078] Streaming Parquet Based FileSink #11897

Closed
wants to merge 4 commits

Conversation

marmbrus
Contributor

This PR adds a new Sink implementation that writes out Parquet files. In order to correctly handle partial failures while maintaining exactly-once semantics, the files for each batch are written out to a unique directory and then atomically appended to a metadata log. When a Parquet-based DataSource is initialized for reading, we first check for this log directory and use it instead of file listing when present.

Unit tests are added, as well as a stress test that checks the answer after non-deterministically injected failures.
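The commit protocol described above can be sketched as follows; this is a hedged, self-contained illustration with hypothetical names (`BatchLog`, `SketchFileSink`), not the PR's actual code:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical stand-in for the PR's metadata log: an atomic,
// append-only map from batch id to the files that batch produced.
trait BatchLog {
  def get(batchId: Long): Option[Seq[String]]
  def add(batchId: Long, files: Seq[String]): Boolean // atomic; the commit point
}

// Sketch of the write-then-commit pattern for exactly-once file output.
class SketchFileSink(path: String, log: BatchLog) {
  def addBatch(batchId: Long, data: DataFrame): Unit = {
    if (log.get(batchId).isEmpty) {
      // Each batch writes to its own directory, so a retry after a
      // partial failure never clobbers already-committed files.
      val batchDir = s"$path/batch-$batchId"
      data.write.parquet(batchDir)
      // Readers only trust files recorded in the log, so the batch
      // becomes visible exactly when this append succeeds.
      log.add(batchId, Seq(batchDir))
    }
    // else: the batch was already committed; replaying it is a no-op.
  }
}
```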

val message: String,
val cause: Throwable,
val startOffset: Option[Offset] = None,
val endOffset: Option[Offset] = None
- ) extends Exception(message, cause) {
+ ) extends Exception(message, cause) with Serializable {
Member

nit: Exception already extends Serializable. No need to add it again.
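For reference, `java.lang.Throwable` (and therefore `Exception`) already implements `java.io.Serializable`, which a quick check confirms:

```scala
// true: Exception is already serializable via Throwable.
val alreadySerializable =
  classOf[java.io.Serializable].isAssignableFrom(classOf[Exception])
```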

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53807 has finished for PR 11897 at commit 8e8e4c6.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class StreamFileCatalog(sqlContext: SQLContext, path: Path) extends FileCatalog with Logging
    • trait FileFormat

path: String,
fileFormat: FileFormat) extends Sink with Logging {

val basePath = new Path(path)
Member

nit: private
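The suggested change would look something like this (a sketch of the reviewer's nit, not the committed code):

```scala
// Keep the resolved base path internal to the sink.
private val basePath = new Path(path)
```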

.startStream(outputDir)

inputData.addData(1, 2, 3)
failAfter(streamingTimeout) { query.processAllAvailable() }
Member

There is a race condition here: noNewData may become true before processAllAvailable.

Member

Never mind. I just realized it will continue to set noNewData to true.

@zsxwing
Member

zsxwing commented Mar 22, 2016

Looks pretty good. Just some nits.

@SparkQA

SparkQA commented Mar 23, 2016

Test build #53841 has finished for PR 11897 at commit 7da7c7e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 23, 2016

Test build #53847 has finished for PR 11897 at commit e821f2f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

override def getOffset: Option[Offset] = Some(fetchMaxOffset()).filterNot(_.offset == -1)

override def toString: String = s"FileSink[$path]"
Contributor

This is the file source :D

Contributor Author

Haha, yes. I'll fix this in a follow-up.
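Presumably the follow-up just fixes the label, along these lines (a sketch, not the actual follow-up commit):

```scala
override def toString: String = s"FileSource[$path]"
```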

asfgit closed this in 6bc4be6 on Mar 23, 2016
kiku-jw pushed a commit to kiku-jw/spark that referenced this pull request Jun 26, 2019
…eaming Sink

## What changes were proposed in this pull request?

File source V1 supports reading the output of FileStreamSink as a batch (apache#11897). We should support this in file source V2 as well. When reading with paths, we first check whether there is a metadata log of FileStreamSink. If yes, we use `MetadataLogFileIndex` for listing files; otherwise, we use `InMemoryFileIndex`. A sketch of this dispatch follows.
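The check can be sketched as below; a hedged illustration (the helper name `useMetadataLog` is hypothetical, while `_spark_metadata` is the log directory FileStreamSink maintains):

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

// Decide how a path should be listed when read as a batch source.
def useMetadataLog(spark: SparkSession, path: Path): Boolean = {
  val hadoopConf = spark.sparkContext.hadoopConfiguration
  val metadataPath = new Path(path, "_spark_metadata")
  // true  -> output of FileStreamSink: list files via MetadataLogFileIndex
  // false -> plain directory: fall back to InMemoryFileIndex
  metadataPath.getFileSystem(hadoopConf).exists(metadataPath)
}
```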

## How was this patch tested?

Unit test

Closes apache#24900 from gengliangwang/FileStreamV2.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>