[SPARK-48292][CORE][3.4] Revert [SPARK-39195][SQL] Spark OutputCommitCoordinator should abort stage when committed file not consistent with task status #47168

dongjoon-hyun · 2024-07-01T18:57:25Z

This is a backport of #46696

What changes were proposed in this pull request?

Revert #36564, following the discussion in #36564 (comment).

When Spark commits a task, it commits the output to committedTaskPath:
${outputpath}/_temporary//${appAttempId}/${taskId}
In the case described in #36564, before #38980, the date in each task's job ID was not the same across attempts. When a task wrote its data successfully but failed to send back the TaskSuccess RPC, the rerun task committed to a different committedTaskPath, which duplicated the data.

After #38980, different attempts of the same task share the same TaskId. A rerun task therefore commits to the same committedTaskPath, and the Hadoop commit protocol handles that case, so the data is not duplicated.

Note: The taskAttemptPath still differs between attempts, because the path contains the taskAttemptId.
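The reasoning above can be illustrated with a toy model. This is only a sketch, not Spark or Hadoop code: the helper names, the `task_<stamp>_<partition>` ID format, the example timestamps, and the in-memory `fs` dictionary are all hypothetical, and "committing to an existing path replaces its content" is a simplifying assumption standing in for whatever the Hadoop commit protocol actually does on a repeated commit.

```python
def committed_task_path(output_path: str, app_attempt_id: int, task_id: str) -> str:
    # Same shape as the template quoted in the description (written as quoted there).
    return f"{output_path}/_temporary//{app_attempt_id}/{task_id}"


def commit_task(fs: dict, path: str, rows: list) -> None:
    # Assumption: a second commit to the same path replaces the earlier content.
    fs[path] = list(rows)


def job_output(fs: dict) -> list:
    # Job commit in this toy model: gather rows from every committed task path.
    return [row for path in sorted(fs) for row in fs[path]]


def run(job_stamps: list) -> list:
    # Each stamp is one attempt of the same task (partition 0). The first attempt
    # commits its data but its TaskSuccess message is lost, so the task is rerun.
    fs = {}
    for stamp in job_stamps:
        task_id = f"task_{stamp}_0000"  # hypothetical ID format
        commit_task(fs, committed_task_path("/out", 0, task_id), ["row-a", "row-b"])
    return job_output(fs)


# Stamp differs per attempt: two committed paths, so the rows appear twice.
before = run(["20240701185725", "20240701185731"])
# Stamp shared by both attempts: one committed path, so the rows appear once.
after = run(["20240701185725", "20240701185725"])

print(len(before), len(after))
```

In this model the duplication comes only from the committed path changing between attempts, which is why a stable TaskId makes the extra stage abort unnecessary.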

Why are the changes needed?

The stage abort introduced in #36564 is no longer needed now that #38980 is in place.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

…inator should abort stage when committed file not consistent with task status

Revert apache#36564 According to discuss apache#36564 (comment)

When spark commit task will commit to committedTaskPath
`${outputpath}/_temporary//${appAttempId}/${taskId}`
So in apache#36564 's case, since before apache#38980, each task's job id's date is not the same, when the task writes data success but fails to send back TaskSuccess RPC, the task rerun will commit to a different committedTaskPath then causing data duplicated.

After apache#38980, for the same task's different attempts, the TaskId is the same now, when re-run task commit, will commit to the same committedTaskPath, and hadoop CommitProtocol will handle such case then data won't be duplicated.

Note: The taskAttemptPath is not same since in the path contains the taskAttemptId.

No need anymore

No

Existed UT

No

Closes apache#46696 from AngersZhuuuu/SPARK-48292.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

dongjoon-hyun · 2024-07-01T19:07:24Z

cc @AngersZhuuuu , @viirya , @cloud-fan , @huaxingao

huaxingao

LGTM. Thanks for the PR @dongjoon-hyun

dongjoon-hyun · 2024-07-01T21:19:26Z

Thank you, @huaxingao and @viirya .
Merged to branch-3.4 for Apache Spark 3.4.4.

…Coordinator should abort stage when committed file not consistent with task status

This is a backport of #46696

### What changes were proposed in this pull request?

Revert #36564 According to discuss #36564 (comment)

When spark commit task will commit to committedTaskPath
`${outputpath}/_temporary//${appAttempId}/${taskId}`
So in #36564 's case, since before #38980, each task's job id's date is not the same, when the task writes data success but fails to send back TaskSuccess RPC, the task rerun will commit to a different committedTaskPath then causing data duplicated.

After #38980, for the same task's different attempts, the TaskId is the same now, when re-run task commit, will commit to the same committedTaskPath, and hadoop CommitProtocol will handle such case then data won't be duplicated.

Note: The taskAttemptPath is not same since in the path contains the taskAttemptId.

### Why are the changes needed?

No need anymore

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existed UT

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47168 from dongjoon-hyun/SPARK-48292-3.4.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>


github-actions bot added the CORE label Jul 1, 2024

dongjoon-hyun mentioned this pull request Jul 1, 2024

[SPARK-48292][CORE] Revert [SPARK-39195][SQL] Spark OutputCommitCoordinator should abort stage when committed file not consistent with task status #46696

Closed

huaxingao approved these changes Jul 1, 2024


viirya approved these changes Jul 1, 2024


dongjoon-hyun closed this Jul 1, 2024

dongjoon-hyun deleted the SPARK-48292-3.4 branch July 1, 2024 21:20
