[FLINK-5332] [core] Synchronize FileSystem::initOutPathLocalFS() to prevent lost files/directories when called concurrently#2999
Closed
StephanEwen wants to merge 2 commits into
…revent lost files when called concurrently.
Contributor
Author
I think this should go into 1.2 - it is quite a bug for local testing.
Contributor
Changes look good to me. Great fix you've implemented there @StephanEwen. This will hopefully make some of the failing Travis test cases stable again :-) +1 for merging.
Contributor
Author
Thanks, will merge this...
Contributor
Author
Manually merged in 2f3ad58
This is mainly relevant to tests and Local Mini Cluster executions.
The FileOutputFormat and its subclasses rely on FileSystem::initOutPathLocalFS() to prepare the output directory. When multiple parallel output writers call that method, there is a slim chance that one parallel thread deletes another thread's directory. The checks in the method are not bulletproof.
I believe that this is the cause of many Travis test instabilities that we observed over time.
Simply synchronizing that method per process should do the trick. Since it is a rare initialization method, and only relevant in tests & local mini cluster executions, it should be a price that is okay to pay. I see no other way, as we do not have simple access to an atomic "check and delete and recreate" file operation.
The synchronization also makes many "re-try" code paths obsolete (there should be no re-tries needed on proper file systems).
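To illustrate the idea, here is a minimal sketch of a process-wide lock around the "check, delete, recreate" sequence. It is not the code from this PR: the class OutputPathInitializer, the method initOutputDirectory, and the use of java.nio.file instead of Flink's FileSystem are assumptions made for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Flink: shows the locking idea only.
final class OutputPathInitializer {

    // One lock per process (per class loader), shared by all writers.
    private static final Object INIT_LOCK = new Object();

    private OutputPathInitializer() {}

    /**
     * Ensures that dir exists as a directory. A regular file at that path
     * is replaced only when overwrite is true.
     *
     * @return true if the directory is ready, false if a file blocks it
     */
    static boolean initOutputDirectory(Path dir, boolean overwrite) throws IOException {
        synchronized (INIT_LOCK) {
            if (Files.isDirectory(dir)) {
                // Another writer already prepared the directory; keep it.
                return true;
            }
            if (Files.exists(dir)) {
                if (!overwrite) {
                    return false;
                }
                Files.delete(dir);
            }
            Files.createDirectories(dir);
            return true;
        }
    }
}
```

Because the check, the delete, and the create all run under the same lock, no writer in the same process can observe a state between them, so no retry loop is needed in this sketch.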
Tests
This is tricky to test. The test in InitOutputPathTest.java uses a series of latches to reproduce the problematic thread interleaving and so validate the problem. The properly fixed variant cannot use that interleaving (because it fixes the problem, duh), but pushes the thread interleaving best-effort towards the case where the problem would occur, were the method not properly synchronized. Sounds weird, I know.
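The best-effort approach could look roughly like the sketch below. It is not the content of InitOutputPathTest.java: the class ConcurrentInitSketch and its methods initDir and runWriters are hypothetical, and java.nio.file stands in for Flink's FileSystem. Latches release all writers at the same moment so that they hit the initialization as close together as possible; each writer then creates its own file, and the count of surviving files reveals whether any writer's output was lost.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Stream;

// Hypothetical sketch, not the test from this PR.
final class ConcurrentInitSketch {

    private static final Object INIT_LOCK = new Object();

    // Synchronized "check, delete, recreate" (assumed stand-in for the fixed method).
    static void initDir(Path dir) throws IOException {
        synchronized (INIT_LOCK) {
            if (Files.isDirectory(dir)) {
                return;
            }
            Files.deleteIfExists(dir);
            Files.createDirectories(dir);
        }
    }

    // Starts the writers together and returns how many files exist afterwards.
    static long runWriters(Path dir, int writers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(writers);
        CountDownLatch ready = new CountDownLatch(writers);
        CountDownLatch go = new CountDownLatch(1);
        try {
            List<Future<?>> results = new ArrayList<>();
            for (int i = 0; i < writers; i++) {
                final int id = i;
                results.add(pool.submit(() -> {
                    ready.countDown();   // signal that this writer is waiting
                    go.await();          // block until all writers are released
                    initDir(dir);
                    Files.createFile(dir.resolve("part-" + id));
                    return null;
                }));
            }
            ready.await();               // all writers are parked at the latch
            go.countDown();              // release them at once
            for (Future<?> f : results) {
                f.get();                 // rethrows any writer failure
            }
        } finally {
            pool.shutdownNow();
        }
        try (Stream<Path> files = Files.list(dir)) {
            return files.count();
        }
    }
}
```

With the lock in place, runWriters(dir, n) should always report n files. Removing the synchronized block makes a lost file or a failed writer possible, though a single run is not guaranteed to trigger it.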