
[SPARK-25331][SS] Make FileStreamSink ignore partitions of batches that have already been written to file system #22331

Closed
wants to merge 3 commits

Conversation

misutoth
Contributor

@misutoth misutoth commented Sep 4, 2018

What changes were proposed in this pull request?

Reproduce the file sink duplication that occurs in a driver failure scenario, to help understand the situation.
Propose a new StagingFileCommitProtocol that creates the target files in a staging subdirectory and, upon job commit, moves them all to the target directory. A task produces the same target file names in two different runs, which eliminates a potential source of duplication (the same content being written into two files with different names).
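A rough sketch of the idea, not the actual code in this patch: the class shape, the `_staging` directory layout, and the method bodies below are simplified assumptions based on Spark's `FileCommitProtocol` API.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.{JobContext, TaskAttemptContext}
import org.apache.spark.internal.io.FileCommitProtocol
import org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage

class StagingFileCommitProtocol(jobId: String, path: String)
  extends FileCommitProtocol with Serializable {

  // Tasks write into <output>/_staging/<jobId>; files only become visible in the
  // target directory once the whole job commits.
  private def stagingDir = new Path(path, s"_staging/$jobId")

  override def setupJob(jobContext: JobContext): Unit = {
    val fs = stagingDir.getFileSystem(jobContext.getConfiguration)
    fs.mkdirs(stagingDir)
  }

  override def setupTask(taskContext: TaskAttemptContext): Unit = ()

  override def newTaskTempFile(
      taskContext: TaskAttemptContext, dir: Option[String], ext: String): String = {
    // Deterministic name derived from the partition id, not a random UUID, so a
    // re-run of the same batch writes the same file name (partition subdirectories
    // omitted for brevity).
    val partitionId = taskContext.getTaskAttemptID.getTaskID.getId
    new Path(stagingDir, f"part-$partitionId%05d$ext").toString
  }

  override def newTaskTempFileAbsPath(
      taskContext: TaskAttemptContext, absoluteDir: String, ext: String): String =
    throw new UnsupportedOperationException("absolute paths are not supported")

  override def commitTask(taskContext: TaskAttemptContext): TaskCommitMessage =
    new TaskCommitMessage(Nil)

  override def abortTask(taskContext: TaskAttemptContext): Unit = ()

  override def commitJob(jobContext: JobContext, taskCommits: Seq[TaskCommitMessage]): Unit = {
    // On job commit, move every staged file to the final output directory,
    // overwriting anything left behind by a previous attempt of the same batch.
    val fs = stagingDir.getFileSystem(jobContext.getConfiguration)
    fs.listStatus(stagingDir).foreach { status =>
      val target = new Path(path, status.getPath.getName)
      if (fs.exists(target)) fs.delete(target, false)
      fs.rename(status.getPath, target)
    }
    fs.delete(stagingDir, true)
  }

  override def abortJob(jobContext: JobContext): Unit = {
    val fs = stagingDir.getFileSystem(jobContext.getConfiguration)
    fs.delete(stagingDir, true)
  }
}
```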

How was this patch tested?

Created a dedicated unit test for the new protocol: StagingFileCommitProtocolSuite.
Ran the test that reproduces the problem: FileStreamSinkUnitSuite.
Made FileStreamStressSuite stricter, so that it demands exactly-once delivery.
Tested on a 4-machine cluster, sending 30000 messages 20 times while killing the driver; each message was delivered exactly once.
Ran the sql tests with sbt.

@AmplabJenkins

Can one of the admins verify this patch?

@misutoth misutoth changed the title Tests for idempotency of FileStreamSink - Work in Progress [SPARK-25331][SS][WIP] Tests for exactly once guarantee of FileStreamSink Sep 4, 2018
@misutoth misutoth changed the title [SPARK-25331][SS][WIP] Tests for exactly once guarantee of FileStreamSink [SPARK-25331][SS] Make FileStreamSink ignore partitions of batches that have already been written to file system Sep 19, 2018
@misutoth
Contributor Author

@rxin could you please look into this change?

@misutoth
Contributor Author

@lw-lin, @marmbrus, in the meantime I found that you had been discussing deterministic file names in another PR. Could you please describe those cases?

I was also wondering whether it is a reasonable expectation, from a sink's point of view, to receive the same data partitioned the same way when it is actually the same batch.

@gaborgsomogyi, you may also be interested in this change.

@gaborgsomogyi
Contributor

I've taken a look, and I think the issue is already solved in the mentioned PR but not yet documented. If somebody wants to consume the output directory of a Spark application that uses a file sink (with exactly-once semantics), they must read the metadata first to get the list of valid files.

Considering this, the PR can be closed.
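For reference, a minimal sketch of two ways to pick up only the committed files. It assumes the sink wrote Parquet to outputDir; FileStreamSinkLog is an internal Spark API, so the second option is an illustration rather than documented guidance.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.streaming.FileStreamSinkLog

val spark = SparkSession.builder().appName("read-file-sink").getOrCreate()
val outputDir = "/tmp/file-sink-output"

// Option 1: let Spark do it. Reading the sink's root directory makes Spark pick up
// the _spark_metadata log and list only the files of committed batches.
val committed = spark.read.parquet(outputDir)

// Option 2: consult the metadata log directly, e.g. to feed a non-Spark reader.
val sinkLog = new FileStreamSinkLog(
  FileStreamSinkLog.VERSION, spark, s"$outputDir/_spark_metadata")
val validFiles = sinkLog.allFiles().map(_.path)
```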

@misutoth
Contributor Author

misutoth commented Dec 7, 2018

So I will consider that the recommended way to read a file sink's output. If there is a need to include the protocol from this PR as an alternative, we can still reopen it.

@misutoth misutoth closed this Dec 7, 2018