
[SPARK-14078] Streaming Parquet Based FileSink #11897

Closed
wants to merge 4 commits

Conversation

marmbrus
Contributor

This PR adds a new Sink implementation that writes out Parquet files. In order to correctly handle partial failures while maintaining exactly-once semantics, the files for each batch are written out to a unique directory and then atomically appended to a metadata log. When a Parquet-based DataSource is initialized for reading, we first check for this log directory and use it instead of file listing when present.

Unit tests are added, as well as a stress test that checks the answer after non-deterministically injected failures.
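The commit protocol described above can be sketched as follows; this is a hedged, self-contained illustration with hypothetical names (`BatchLog`, `SketchFileSink`), not the PR's actual code:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical stand-in for the PR's metadata log: an atomic,
// append-only map from batch id to the files that batch produced.
trait BatchLog {
  def get(batchId: Long): Option[Seq[String]]
  def add(batchId: Long, files: Seq[String]): Boolean // atomic; the commit point
}

// Sketch of the write-then-commit pattern for exactly-once file output.
class SketchFileSink(path: String, log: BatchLog) {
  def addBatch(batchId: Long, data: DataFrame): Unit = {
    if (log.get(batchId).isEmpty) {
      // Each batch writes to its own directory, so a retry after a
      // partial failure never clobbers already-committed files.
      val batchDir = s"$path/batch-$batchId"
      data.write.parquet(batchDir)
      // Readers only trust files recorded in the log, so the batch
      // becomes visible exactly when this append succeeds.
      log.add(batchId, Seq(batchDir))
    }
    // else: the batch was already committed; replaying it is a no-op.
  }
}
```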

val message: String,
val cause: Throwable,
val startOffset: Option[Offset] = None,
val endOffset: Option[Offset] = None
- ) extends Exception(message, cause) {
+ ) extends Exception(message, cause) with Serializable {
Member

nit: Exception already extends Serializable. No need to add it again.
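For reference, `java.lang.Throwable` (and therefore `Exception`) already implements `java.io.Serializable`, which a quick check confirms:

```scala
// true: Exception is already serializable via Throwable.
val alreadySerializable =
  classOf[java.io.Serializable].isAssignableFrom(classOf[Exception])
```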

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53807 has finished for PR 11897 at commit 8e8e4c6.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class StreamFileCatalog(sqlContext: SQLContext, path: Path) extends FileCatalog with Logging
    • trait FileFormat

path: String,
fileFormat: FileFormat) extends Sink with Logging {

val basePath = new Path(path)
Member

nit: private
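The suggested change would look something like this (a sketch of the reviewer's nit, not the committed code):

```scala
// Keep the resolved base path internal to the sink.
private val basePath = new Path(path)
```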

.startStream(outputDir)

inputData.addData(1, 2, 3)
failAfter(streamingTimeout) { query.processAllAvailable() }
Member

There is a race condition here: noNewData may become true before processAllAvailable.

Member

Never mind. I just realized it will continue to set noNewData to true.

@zsxwing
Member

zsxwing commented Mar 22, 2016

Looks pretty good. Just some nits.

@SparkQA

SparkQA commented Mar 23, 2016

Test build #53841 has finished for PR 11897 at commit 7da7c7e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 23, 2016

Test build #53847 has finished for PR 11897 at commit e821f2f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

override def getOffset: Option[Offset] = Some(fetchMaxOffset()).filterNot(_.offset == -1)

override def toString: String = s"FileSink[$path]"
Contributor

This is the file source :D

Contributor Author

Haha, yes. I'll fix this in a follow-up.
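Presumably the follow-up just fixes the label, along these lines (a sketch, not the actual follow-up commit):

```scala
override def toString: String = s"FileSource[$path]"
```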

asfgit closed this in 6bc4be6 on Mar 23, 2016
kiku-jw pushed a commit to kiku-jw/spark that referenced this pull request Jun 26, 2019
…eaming Sink

## What changes were proposed in this pull request?

File source V1 supports reading the output of FileStreamSink as a batch (apache#11897). We should support this in file source V2 as well. When reading with paths, we first check whether there is a metadata log of FileStreamSink. If yes, we use `MetadataLogFileIndex` for listing files; otherwise, we use `InMemoryFileIndex`. A sketch of this dispatch follows.
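The check can be sketched as below; a hedged illustration (the helper name `useMetadataLog` is hypothetical, while `_spark_metadata` is the log directory FileStreamSink maintains):

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

// Decide how a path should be listed when read as a batch source.
def useMetadataLog(spark: SparkSession, path: Path): Boolean = {
  val hadoopConf = spark.sparkContext.hadoopConfiguration
  val metadataPath = new Path(path, "_spark_metadata")
  // true  -> output of FileStreamSink: list files via MetadataLogFileIndex
  // false -> plain directory: fall back to InMemoryFileIndex
  metadataPath.getFileSystem(hadoopConf).exists(metadataPath)
}
```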

## How was this patch tested?

Unit test

Closes apache#24900 from gengliangwang/FileStreamV2.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>