[SPARK-39195][SQL] Spark OutputCommitCoordinator should abort stage when committed file not consistent with task status #36564
Conversation
Gentle ping @dongjoon-hyun @cloud-fan @HyukjinKwon @srowen. Could you take a look at this data correctness issue?
case Some(state) if attemptFailed(state, stageAttempt, partition, attemptNumber) =>
  throw new SparkException(s"Authorized committer (attemptNumber=$attemptNumber, " +
    s"stage=$stage, partition=$partition) failed; but task commit success, " +
    s"should fail the job")
It seems throwing an exception here won't stop the job. Any suggestions?
ping @cloud-fan Could you take a look?
Yea
Hi @AngersZhuuuu. I ran into problems with these changes after updating to Spark 3.4.
I'm surprised that Iceberg does not override https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/BatchWrite.java#L63
🤔 So the option useCommitCoordinator is for appending to the same destination, like inserting rows into a remote database, and we don't need it in Iceberg?
The new data lake formats all use transaction logs; I don't think the coordinator is needed anymore.
@cloud-fan Thanks for pinging me. It appears that Iceberg doesn't override this.
@huaxingao @cloud-fan, could you confirm only a single
@aokolnychyi In most cases, yes. However, we have
Thanks for confirming, @cloud-fan!
sc.foreach(_.dagScheduler.stageFailed(stage, s"Authorized committer " +
  s"(attemptNumber=$attemptNumber, stage=$stage, partition=$partition) failed; " +
  s"but task commit success, data duplication may happen."))
}
@cloud-fan I think the reason string here is not very clear or correct. `stageState.authorizedCommitters` records that a commit is allowed, not that it actually succeeded. So, as you said, the driver never knows whether the task commit succeeded. Maybe we should update this to reduce confusion.
+1. @AngersZhuuuu can you refine it?
With #38980, it seems we don't need this patch anymore.
oh, shall we create a PR to revert it then?
> oh, shall we create a PR to revert it then?

Double-checked with @boneanxs; we should revert this. Let me do it.
…of ParquetIOSuite

### What changes were proposed in this pull request?
A test from `ParquetIOSuite` is flaky: `SPARK-7837 Do not close output writer twice when commitTask() fails`. It turns out to be a race condition. The test injects an error into the task committing step, and the job may fail in two ways:
1. The task got the driver's permission to commit, but the commit failed and thus the task failed. This triggers a stage failure, as it means possible data duplication; see #36564.
2. The test disables task retry, so `TaskSetManager` will abort the stage.

Both failures are reported by sending an event to `DAGScheduler`, so the final job failure depends on which event gets processed first. This is not a big deal, but that test in `ParquetIOSuite` checks the error class. This PR fixes the flaky test by running the test case in a new test suite with output committer coordination disabled.

### Why are the changes needed?
Fix a flaky test.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA test + manual test on local.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46562 from gengliangwang/fixParquetIO.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
To @AngersZhuuuu, @viirya, @cloud-fan.
Yea, I think so.
…inator should abort stage when committed file not consistent with task status

### What changes were proposed in this pull request?
Revert #36564.

According to the discussion in #36564 (comment): when Spark commits a task, it commits to the committedTaskPath `${outputpath}/_temporary//${appAttempId}/${taskId}`.

So in #36564's case, since before #38980 each task attempt's job id used a different date, a task that wrote its data successfully but failed to send back the TaskSuccess RPC would be rerun and commit to a different committedTaskPath, causing duplicated data. After #38980, the TaskId is the same across a task's attempts, so a rerun commits to the same committedTaskPath, and the Hadoop CommitProtocol handles this case so data is not duplicated.

Note: the taskAttemptPath is not the same, since the path contains the taskAttemptId.

### Why are the changes needed?
No longer needed.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UT.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46696 from AngersZhuuuu/SPARK-48292.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
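The path behaviour described in the revert message can be sketched in a few lines. This is a simplified illustration with made-up task ids, not the real Hadoop commit logic (which lives in `FileOutputCommitter`); it only shows why a taskId that is stable across attempts makes a retried commit land on the same committed path.

```scala
// Simplified committed-path layout, per the commit message above.
// The real path construction is done by Hadoop's FileOutputCommitter;
// the task-id strings below are purely illustrative.
def committedTaskPath(output: String, appAttemptId: Int, taskId: String): String =
  s"$output/_temporary/$appAttemptId/$taskId"

// Before #38980: the date embedded in the job id could differ per attempt,
// so a retry committed to a second path and both copies of the data survived.
val before = Set(
  committedTaskPath("/out", 0, "task_20240101_0000_m_000000"),
  committedTaskPath("/out", 0, "task_20240102_0000_m_000000")  // retry, new date
)

// After #38980: the taskId is stable across attempts, so the retry commits
// onto the same path and the commit protocol deduplicates.
val after = Set(
  committedTaskPath("/out", 0, "task_20240101_0000_m_000000"),
  committedTaskPath("/out", 0, "task_20240101_0000_m_000000")  // retry, same path
)
```

With distinct paths the duplicate is invisible to the protocol; with a shared path the second commit replaces the first, which is exactly why the extra coordinator check became unnecessary.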
…Coordinator should abort stage when committed file not consistent with task status

This is a backport of #46696.

### What changes were proposed in this pull request?
Revert #36564.

According to the discussion in #36564 (comment): when Spark commits a task, it commits to the committedTaskPath `${outputpath}/_temporary//${appAttempId}/${taskId}`.

So in #36564's case, since before #38980 each task attempt's job id used a different date, a task that wrote its data successfully but failed to send back the TaskSuccess RPC would be rerun and commit to a different committedTaskPath, causing duplicated data. After #38980, the TaskId is the same across a task's attempts, so a rerun commits to the same committedTaskPath, and the Hadoop CommitProtocol handles this case so data is not duplicated.

Note: the taskAttemptPath is not the same, since the path contains the taskAttemptId.

### Why are the changes needed?
No longer needed.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UT.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47166 from dongjoon-hyun/SPARK-48292.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…Coordinator should abort stage when committed file not consistent with task status

This is a backport of #46696.

### What changes were proposed in this pull request?
Revert #36564.

According to the discussion in #36564 (comment): when Spark commits a task, it commits to the committedTaskPath `${outputpath}/_temporary//${appAttempId}/${taskId}`.

So in #36564's case, since before #38980 each task attempt's job id used a different date, a task that wrote its data successfully but failed to send back the TaskSuccess RPC would be rerun and commit to a different committedTaskPath, causing duplicated data. After #38980, the TaskId is the same across a task's attempts, so a rerun commits to the same committedTaskPath, and the Hadoop CommitProtocol handles this case so data is not duplicated.

Note: the taskAttemptPath is not the same, since the path contains the taskAttemptId.

### Why are the changes needed?
No longer needed.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UT.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47168 from dongjoon-hyun/SPARK-48292-3.4.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Task commits MUST be atomic, and a second attempt MUST be able to supersede the first. Hadoop FileOutputCommitter v1 (on HDFS, the local FS, and ABFS, but not GCS) uses atomic rename for this, deleting the task's completed data path first. A retry will delete that destination path and rename its own work directory to it. The S3A committers PUT a manifest to the task path, relying on PUT being atomic.

The application MUST NOT constrain which of two attempts told to commit succeeds, only that the second one MUST report success. Why so? Because if task attempt TA1 stops responding after being told to commit, then TA2 will be told to commit, and it reports success. But if TA1 is somehow suspended and performs its atomic commit after TA2, then it is TA1's manifest which is processed.

If you are encountering problems with jobs where task failures are unrecoverable, that means there is something wrong with the task commit algorithm. What were you seeing it with?
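The "second attempt must be able to supersede the first" rule above can be simulated in a toy model: each attempt commits by atomically "renaming" its work directory onto the shared committed path, and the last rename wins. The paths, data strings, and `commitTaskAttempt` helper below are illustrative, and a mutable `Map` stands in for the filesystem; this is not real committer code.

```scala
import scala.collection.mutable

// Toy "filesystem": path -> file contents. Two task attempts have each
// written their output to their own work directory.
val fs = mutable.Map(
  "/out/_tmp/attempt_1" -> "data-from-TA1",
  "/out/_tmp/attempt_2" -> "data-from-TA2"
)

// v1-committer style task commit: the work directory is atomically renamed
// onto the committed path, replacing whatever was there.
def commitTaskAttempt(workDir: String, committedPath: String): Unit = {
  val data = fs.remove(workDir).get // the rename source disappears...
  fs(committedPath) = data          // ...and atomically replaces the destination
}

// TA2 is told to commit after TA1 stops responding; the suspended TA1 then
// wakes up and performs its atomic commit last. Exactly one copy survives,
// and the job must be correct whichever attempt's output that is.
commitTaskAttempt("/out/_tmp/attempt_2", "/out/task_0")
commitTaskAttempt("/out/_tmp/attempt_1", "/out/task_0")
```

After both commits the committed path holds TA1's output, even though TA2 was the attempt that reported success to the driver, which is exactly the scenario the comment above describes.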
To @steveloughran: for better context, please see and post on #46696 instead of here. BTW, for my part, I don't see any issue with either SPARK-39195 or its revert (SPARK-48292) in an S3 environment.
Hi all, I am facing this issue after upgrading to Spark 3.5.1 and wonder if this change is the root cause. Can anyone here confirm? Many of our jobs are failing with "Authorized committer" errors, and we might have to revert our whole system back to Spark 3.3, which would be a lot of work. So I am trying to understand the root issue a bit more before I go back to reverting my whole Spark+Hadoop system. Thanks!
What changes were proposed in this pull request?
There is a case where the current code may cause a data correctness issue:
1. A task fails; the DAGScheduler handles the completed task in `handleCompletedTask` and then calls `outputCommitCoordinator.taskCompleted`, which removes the lock on the failed task's partition.
2. A retry of the task calls `outputCommitCoordinator.canCommit()`, which returns true since the lock on this partition has been removed; the retry then commits successfully and the task finally succeeds.

In this PR, when handling `taskCompleted`, if the task failed but its commit succeeded, data duplication can happen, so we fail the job.
Why are the changes needed?
Fix the data duplication issue.
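The lock-release race behind this fix can be sketched with a toy coordinator. This is NOT Spark's real `OutputCommitCoordinator` (the class names, map layout, and method shapes below are illustrative); it only shows how releasing the partition lock on task failure lets a retry be authorized even though the failed attempt may already have committed its file.

```scala
import scala.collection.mutable

// Toy model of the commit-authorization race described in this PR.
object ToyCoordinator {
  // partition -> attempt number currently holding the commit lock
  private val authorized = mutable.Map[Int, Int]()

  def canCommit(partition: Int, attempt: Int): Boolean =
    authorized.get(partition) match {
      case Some(holder) => holder == attempt // only the lock holder may commit
      case None =>
        authorized(partition) = attempt      // take the lock
        true
    }

  // Pre-fix behaviour: a task failure simply releases the lock, even though
  // that attempt may already have committed its output file.
  def taskCompleted(partition: Int, attempt: Int, failed: Boolean): Unit =
    if (failed && authorized.get(partition).contains(attempt))
      authorized.remove(partition)
}

// Attempt 0 is authorized and commits its file, but the driver only observes
// a task failure (e.g. the success RPC was lost), so the lock is freed and
// the retry is authorized to commit a second copy of the data.
val first = ToyCoordinator.canCommit(partition = 0, attempt = 0)
ToyCoordinator.taskCompleted(partition = 0, attempt = 0, failed = true)
val retry = ToyCoordinator.canCommit(partition = 0, attempt = 1)
```

Both `canCommit` calls return true, so both attempts may write committed output for the same partition; failing the job (or, later, relying on a stable committedTaskPath) is what closes this window.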
Does this PR introduce any user-facing change?
No
How was this patch tested?