[SPARK-27194][core] Job failures when task attempts do not clean up spark-staging parquet files #24142
Conversation
Can one of the admins verify this patch?
Can you describe the problem first? AFAIK Spark will write files to a temp staging directory. If a task fails without cleaning up its files, the partially written file will still be there and be moved to the final directory at the end.
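For context, a minimal sketch of the two-phase pattern described above: the task writes to a staging path, and only the commit step moves the result into the final directory. The helper name and file contents here are illustrative, not the actual Spark internals.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical sketch only: a task writes to a staging path, and the commit
// step moves the result into the final directory. If the task dies before
// the move, a partial file is left behind in staging.
def writeThenCommit(conf: Configuration, stagingDir: Path, finalDir: Path, fileName: String): Unit = {
  val fs = FileSystem.get(conf)
  val stagingFile = new Path(stagingDir, fileName)

  // Phase 1: write the task output to the staging location.
  val out = fs.create(stagingFile)
  try {
    out.write("row data".getBytes("UTF-8"))
  } finally {
    out.close()
  }

  // Phase 2: on commit, move the staging file into the final directory.
  fs.rename(stagingFile, new Path(finalDir, fileName))
}
```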
@cloud-fan In
Is this a problem, or did you fix it as well?
The fix does not change the behavior of writing to a temp location and then moving to the final location. The fix deals only with the file name (even in the temp/final location, file names must not conflict).
I am just worried that this fix causes duplicates.
Without looking (again) into this code, I have the same question about this potentially causing duplicates in the case where an executor fails.
No, because the executor may die before calling that, and the code needs to be resilient to that. (Also, as far as I can tell, that's already being done in the call to
You seem to be saying that with your fix, it's possible that the partially written file will end up in the final output. If that's true, then that's worse than the current state: right now, your app will fail. With that partial fix, the app will generate wrong data.
I tried with Spark 2.3.3.
Steps:
and here is the full stack trace:
So as we can see from the stack trace, when we have
we directly try to write to the staging location via So for the retry task,
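To illustrate the collision (a simplified sketch, not the actual Spark code): if the staging file name is derived only from the job id and the task's partition id, a retried attempt computes exactly the same path and trips over the partial file left by the failed attempt.

```scala
// Simplified illustration (not the actual Spark code): a name built only
// from the job id and the task's partition id is identical across attempts.
def stagingFileName(jobId: String, partitionId: Int, ext: String): String =
  f"part-$partitionId%05d-$jobId$ext"

val first = stagingFileName("job-123", 7, ".snappy.parquet")  // part-00007-job-123.snappy.parquet
val retry = stagingFileName("job-123", 7, ".snappy.parquet")  // identical name
assert(first == retry)  // the retry collides with the leftover partial file
```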
ping @vanzin @dongjoon-hyun @cloud-fan @rezasafi
I don't think deleting the old file is a good idea. There might be edge cases where the failed task still has the file open, and deleting it from somewhere else may not work as expected. (The name conflict also means that speculation is probably completely broken when this option is used. And speculation working would mean you have two files with the output of the same task, and only one of them should end up in the final output.)
But wouldn't the lease on the remote file expire in this case, since the executor has exited, letting the retry task delete it successfully? (Behaviour may differ if the filesystem is not HDFS?)
Do you want to rely on specific behaviors of file systems and executors, or write code that doesn't depend on how they behave? And regardless of that, just deleting the file would still cause speculation to generate bad data. So it's not a fix.
@vanzin thanks for the clarification; I was just trying to make sure that deleting is the 'bad' idea. Okay, I see the point. So currently I see in org.apache.spark.internal.io.HadoopMapReduceCommitProtocol#commitJob, when
This is where we will get both the old and the new task file, causing duplicates. Would it be better if we could eliminate the old speculated tasks' files from this operation?
You seem to be trying to create a new commit protocol that a single task can perform without regard for what other tasks are doing. That is just not going to work.
No. I am trying to point out the difference when dynamicPartitionOverwrite is false vs true. When false, the file list is iterated over and moved; when true, the entire folder is just renamed. Task commits are still independent, but the job commit should consider task status. In the case of retried tasks, since outputs must not overlap, we could have had a separate file per retry and handled eliminating duplicate task outputs on job commit. I am just saying that, as per the current code, moving the entire staging-dir partition output to the final location regardless of its content (if any tasks were retried) is not a good approach; it only works because retry tasks also use the same output file (which avoids duplicates). Please suggest if you see a better approach to tackle the problem here. A condensed, illustrative sketch of the contrast is below.
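The following sketch is purely illustrative, not the real HadoopMapReduceCommitProtocol code: with dynamicPartitionOverwrite=false the committed file list is moved file by file, whereas with true the whole staging partition directory is renamed over the final one, regardless of which attempts produced its contents.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative contrast only; not the real HadoopMapReduceCommitProtocol code.
def commitJobSketch(
    fs: FileSystem,
    dynamicPartitionOverwrite: Boolean,
    committedFiles: Seq[Path],        // per-file list gathered from task commits
    stagingPartitionDirs: Seq[Path],  // per-partition staging directories
    finalDir: Path): Unit = {
  if (dynamicPartitionOverwrite) {
    // The whole staging partition directory is renamed over the final one,
    // regardless of which task attempts produced the files inside it.
    stagingPartitionDirs.foreach { dir =>
      val dest = new Path(finalDir, dir.getName)
      fs.delete(dest, true)
      fs.rename(dir, dest)
    }
  } else {
    // Only the explicitly committed files are moved, one by one.
    committedFiles.foreach { f =>
      fs.rename(f, new Path(finalDir, f.getName))
    }
  }
}
```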
If I had an approach for fixing this, I'd have opened a PR. I'm just trying to point out that your approach has issues. Any suggested approach needs to create the correct output regardless of task timing, task failures, or speculation. And none of the approaches you suggested so far fit that, as far as I can tell.
Okay, I will look for a better approach. Thanks for the input.
@ajithme I was just wondering whether you are still working on this? Thank you very much.
I'll close this for now. It can be reopened by just updating the branch.
@vanzin we face this problem too. After a quick investigation, I think it may be a Parquet bug. It hardcodes the
Hi, @vanzin, @ajithme and @cloud-fan, I am also interested in this problem.
What changes were proposed in this pull request?
Avoid task output file overlap when tasks are reattempted. Currently the file name considers only the taskId, which is the same across reattempts, so it causes a collision. Instead, the file name can contain the taskId along with the attempt id.
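A minimal sketch of the proposed naming (hypothetical format; the exact scheme in the patch may differ), appending the attempt id so reattempts write to distinct staging files:

```scala
// Hypothetical illustration of the proposal: include the attempt id in the
// file name so a reattempt no longer collides with the first attempt's file.
def proposedFileName(jobId: String, partitionId: Int, attemptId: Int, ext: String): String =
  f"part-$partitionId%05d-$attemptId-$jobId$ext"

// Attempt 0 vs attempt 1 of the same partition now yield distinct names.
assert(proposedFileName("job-123", 7, 0, ".parquet") != proposedFileName("job-123", 7, 1, ".parquet"))
```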
How was this patch tested?
Will update unit tests if the approach is OK.