Skip to content

[FLINK-5332] [core] Synchronize FileSystem::initOutPathLocalFS() to prevent lost files/directories when called concurrently#2999

Closed
StephanEwen wants to merge 2 commits into
apache:masterfrom
StephanEwen:fs_fix
Closed

[FLINK-5332] [core] Synchronize FileSystem::initOutPathLocalFS() to prevent lost files/directories when called concurrently#2999
StephanEwen wants to merge 2 commits into
apache:masterfrom
StephanEwen:fs_fix

Conversation

@StephanEwen
Copy link
Copy Markdown
Contributor

This is mainly relevant to tests and Local Mini Cluster executions.

The FileOutputFormat and its subclasses rely on FileSystem::initOutPathLocalFS() to prepare the output directory. When multiple parallel output writers call that method, there is a slim chance that one parallel threads deletes the others directory. The checks that the method has are not bullet proof.

I believe that this is the cause for many Travis test instabilities that we observed over time.

Simply synchronizing that method per process should do the trick. Since it is a rare initialization method, and only relevant in tests & local mini cluster executions, it should be a price that is okay to pay. I see no other way, as we do not have simple access to an atomic "check and delete and recreate" file operation.
The synchronization also makes many "re-try" code paths obsolete (there should be no re-tries needed on proper file systems).

Tests

This is tricky to test. The test in InitOutputPathTest.java uses a series of latch to re-produce the problematic thread execution interleaving to validate the problem. The properly fixed variant cannot use that interleaving (because it fixes the problem, duh), but pushes the thread interleaving best-effort towards the case where the problem would occur, were the method not properly synchronized. Sounds weird, I know.

@StephanEwen
Copy link
Copy Markdown
Contributor Author

I think this should go into 1.2 - it is quite a bug for local testing.

@tillrohrmann
Copy link
Copy Markdown
Contributor

Changes look good to me. Great fix you've implemented there @StephanEwen. This will hopefully make some of the failing travis test cases stable again :-) +1 for merging.

@StephanEwen
Copy link
Copy Markdown
Contributor Author

Thanks, will merge this...

@StephanEwen
Copy link
Copy Markdown
Contributor Author

Manually merged in 2f3ad58

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants