
[SPARK-39622][SQL][TESTS] Add additional error message matching for SPARK-7837 Do not close output writer twice when commitTask() fail #37245

Closed
wants to merge 4 commits

Conversation

@LuciferYang (Contributor) commented Jul 21, 2022

What changes were proposed in this pull request?

This PR adds additional error message assertions to fix the flaky test `SPARK-7837 Do not close output writer twice when commitTask() fail` in `ParquetIOSuite`.

Why are the changes needed?

Before SPARK-39195, the test `SPARK-7837 Do not close output writer twice when commitTask() fail` only needed to handle the `TaskSetFailed` event, because `maxTaskFailures` is 1 in local mode.
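As an aside (a hedged illustration, not part of this PR's change): plain `local`/`local[N]` masters never retry a failed task, while Spark's documented `local[N, maxFailures]` master form raises that limit:

```scala
// Hedged sketch, not part of this PR: in plain local mode maxTaskFailures is 1,
// so a failed task immediately fails the task set. The documented
// `local[N, maxFailures]` master form allows retries.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("max-task-failures-demo")  // hypothetical app name
  .master("local[2, 2]")              // 2 worker threads, up to 2 task failures
  .getOrCreate()
```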

After SPARK-39195, however, when `OutputCommitCoordinator#taskCompleted` finds `stageState.authorizedCommitters(partition) == taskId`, it posts a `StageFailed` event instead of only logging at debug level. The flaky test may therefore handle either the `TaskSetFailed` event or the `StageFailed` event, and since the execution order of the two events is nondeterministic, there may be the following two kinds of logs:

- Scenario 1 (Success)

18:47:51.592 ERROR org.apache.spark.sql.execution.datasources.FileFormatWriter: Aborting job 59e24bb4-d3e3-41f6-aa80-411ddb481362.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (localhost executor driver): org.apache.spark.SparkException: Task failed while writing rows.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:596)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:334)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$12(FileFormatWriter.scala:242)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
	at org.apache.spark.scheduler.Task.run(Task.scala:139)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1490)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.RuntimeException: Intentional exception for testing purposes
	at scala.sys.package$.error(package.scala:30)
	at org.apache.spark.sql.execution.datasources.parquet.TaskCommitFailureParquetOutputCommitter.commitTask(ParquetIOSuite.scala:1552)
	at org.apache.spark.mapred.SparkHadoopMapRedUtil$.$anonfun$commitTask$1(SparkHadoopMapRedUtil.scala:51)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:619)
	at org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:51)
	at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:78)
	at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitTask(HadoopMapReduceCommitProtocol.scala:279)
	at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.$anonfun$commit$1(FileFormatDataWriter.scala:107)
	at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:619)
	at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:107)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:318)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1524)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:324)
	... 9 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2706)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2642)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2641)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2641)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1189)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1189)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1189)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2897)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2836)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2825)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2222)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:231)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:189)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)

- Scenario 2 (Failure)

18:49:14.145 ERROR org.apache.spark.sql.execution.datasources.FileFormatWriter: Aborting job 82a0051d-7fe4-4632-8c9c-78dfbfaf5819.
org.apache.spark.SparkException: Job aborted due to stage failure: Authorized committer (attemptNumber=0, stage=1, partition=0) failed; but task commit success, data duplication may happen.
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2706)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2642)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2641)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2641)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleStageFailed$1(DAGScheduler.scala:1182)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleStageFailed$1$adapted(DAGScheduler.scala:1182)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleStageFailed(DAGScheduler.scala:1182)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2894)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2836)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2825)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2222)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:231)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:189)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)

If `maxTaskFailures` were changed to >= 2, Scenario 2 would occur consistently; but the flaky test runs in local mode, where `maxTaskFailures` is always 1, so this PR just adds additional assertions as a workaround.
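A minimal sketch of the OR-matching workaround (not the literal diff; `df` and `path` stand in for the suite's existing fixtures, and the committer is assumed to throw in `commitTask()`):

```scala
// Minimal sketch of the OR-matching assertion; `df` and `path` are placeholders
// for the suite's existing fixtures.
import org.apache.spark.SparkException
import org.scalatest.Assertions.{assert, intercept}

val e = intercept[SparkException] {
  df.write.parquet(path) // triggers the committer that throws in commitTask()
}
// Collect the top-level message and, if present, the cause's message.
val messages = Seq(e.getMessage) ++ Option(e.getCause).map(_.getMessage)
// Accept whichever event the DAGScheduler happened to process first.
assert(messages.exists(m =>
  m.contains("Intentional exception for testing purposes") || // Scenario 1
    m.contains("data duplication may happen")))               // Scenario 2
```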

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass GitHub Actions

@github-actions bot added the SQL label Jul 21, 2022
@LuciferYang changed the title: [SPARK-39622][CORE][TESTS] Fix flaky test SPARK-7837 Do not close output writer twice when commitTask() fail in ParquetIOSuite → [SPARK-39622][SQL][TESTS] Fix flaky test SPARK-7837 Do not close output writer twice when commitTask() fail in ParquetIOSuite (Jul 21, 2022)
@LuciferYang (Contributor, Author) commented:

cc @HyukjinKwon

@LuciferYang (Contributor, Author) commented:

also cc @srowen @dongjoon-hyun

@HyukjinKwon (Member) left a comment:

Awesome!

@LuciferYang (Contributor, Author) commented:

c97a71a merges SPARK-39831 to make GA pass.

@dongjoon-hyun (Member) left a comment:

+1, LGTM code-wise.

For the title, the current one is proper for a JIRA title, but a little insufficient for a PR title. Could you revise the PR title a little more toward the proposed solution (the newly handled error message)?

@LuciferYang changed the title: [SPARK-39622][SQL][TESTS] Fix flaky test SPARK-7837 Do not close output writer twice when commitTask() fail in ParquetIOSuite → [SPARK-39622][SQL][TESTS] Add additional error message matching for SPARK-7837 Do not close output writer twice when commitTask() fail (Jul 21, 2022)
@LuciferYang (Contributor, Author) commented:

Changed the title to

Add additional error message matching for SPARK-7837 Do not close output writer twice when commitTask() fail

@dongjoon-hyun Do you think this is OK?

@dongjoon-hyun (Member) commented:

Thank you. Looks much better!

@LuciferYang (Contributor, Author) commented:

The Run / Run TPC-DS queries with SF=1 job failed; I'll re-trigger it later.

@LuciferYang reopened this Jul 21, 2022
@dongjoon-hyun (Member) commented:

This test code change adds an OR condition to the test validation, which doesn't cause any additional failures.
I'll merge this PR.

@HeartSaVioR (Contributor) commented:

Thanks @LuciferYang for dealing with this!

@LuciferYang (Contributor, Author) commented:

Thanks all ~

HyukjinKwon pushed a commit that referenced this pull request Jul 27, 2022
…oot cause

### What changes were proposed in this pull request?
This PR follows #37245:
the StageFailed event should attach the root cause.

### Why are the changes needed?
**It may be a good way for users to know the reason for a failure.**

By carefully investigating the issue https://issues.apache.org/jira/browse/SPARK-39622,
I found the root cause of the test failure: the StageFailed event does not attach the failure reason from the executor.
When OutputCommitCoordinator executes taskCompleted, the reason is ignored (see the sketch after the two event flows below).

Scenario 1: receive TaskSetFailed (Success)
> InsertIntoHadoopFsRelationCommand
> FileFormatWriter.write
> _**handleTaskSetFailed**_ (**attach root cause**)
> abortStage
> failJobAndIndependentStages
> SparkListenerJobEnd

Scenario 2: receive StageFailed (Failure)
> InsertIntoHadoopFsRelationCommand
> FileFormatWriter.write
> _**handleStageFailed**_ (**don't attach root cause**)
> abortStage
> failJobAndIndependentStages
> SparkListenerJobEnd
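
A self-contained sketch of that idea (illustrative only; these are simplified stand-ins, not Spark's actual event classes or method signatures):

```scala
// Illustrative sketch only — simplified stand-ins, not Spark's real internals.
// The point of SPARK-39868: when the commit coordinator fails a stage, the
// posted event should carry the executor-side failure instead of dropping it.
final case class StageFailed(stageId: Int, reason: String, exception: Option[Throwable])

def failAuthorizedCommitter(
    stageId: Int,
    taskError: Throwable,                 // the reason previously ignored
    post: StageFailed => Unit): Unit = {
  // Attach the root cause so abortStage (and the user) can see it.
  post(StageFailed(
    stageId,
    s"Authorized committer failed: ${taskError.getMessage}",
    Some(taskError)))
}

// Usage sketch:
failAuthorizedCommitter(
  1,
  new RuntimeException("Intentional exception for testing purposes"),
  e => println(s"stage ${e.stageId} failed: ${e.reason}"))
```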

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually ran UT & passed GitHub Actions

Closes #37292 from panbingkun/SPARK-39868.

Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>