[SPARK-32857][CORE] Fix flaky o.a.s.s.BarrierTaskContextSuite.throw exception if the number of barrier() calls are not the same on every task #29732

Ngone51 · 2020-09-11T15:54:07Z

What changes were proposed in this pull request?

Fix the flaky test.

Why are the changes needed?

The test is flaky. It intermittently fails with: "Expected exception org.apache.spark.SparkException to be thrown, but no exception was thrown."

Check the full error stack here.

By analyzing the log below, I found that task 0 hadn't reached the second context.barrier() by the time the other three tasks had already hit sync timeout exceptions in the first context.barrier(). The timeout exceptions were caught by the try...catch. Then each task started another round of barrier sync from the second context.barrier(), and that round completed successfully, so no exception was thrown.

20/09/10 20:54:48.821 dispatcher-event-loop-10 INFO BarrierCoordinator: Current barrier epoch for Stage 0 (Attempt 0) is 0.
20/09/10 20:54:48.822 dispatcher-event-loop-10 INFO BarrierCoordinator: Barrier sync epoch 0 from Stage 0 (Attempt 0) received update from Task 2, current progress: 1/4.
20/09/10 20:54:48.826 dispatcher-BlockManagerMaster INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:38420 (size: 2.2 KiB, free: 546.3 MiB)
20/09/10 20:54:48.908 dispatcher-event-loop-12 INFO BarrierCoordinator: Current barrier epoch for Stage 0 (Attempt 0) is 0.
20/09/10 20:54:48.909 dispatcher-event-loop-12 INFO BarrierCoordinator: Barrier sync epoch 0 from Stage 0 (Attempt 0) received update from Task 1, current progress: 2/4.
20/09/10 20:54:48.959 dispatcher-event-loop-11 INFO BarrierCoordinator: Current barrier epoch for Stage 0 (Attempt 0) is 0.
20/09/10 20:54:48.960 dispatcher-event-loop-11 INFO BarrierCoordinator: Barrier sync epoch 0 from Stage 0 (Attempt 0) received update from Task 3, current progress: 3/4.
20/09/10 20:54:49.616 dispatcher-CoarseGrainedScheduler INFO TaskSchedulerImpl: Skip current round of resource offers for barrier stage 0 because the barrier taskSet requires 4 slots, while the total number of available slots is 0.
20/09/10 20:54:49.899 dispatcher-event-loop-15 INFO BarrierCoordinator: Current barrier epoch for Stage 0 (Attempt 0) is 0.
20/09/10 20:54:49.900 dispatcher-event-loop-15 INFO BarrierCoordinator: Barrier sync epoch 0 from Stage 0 (Attempt 0) received update from Task 1, current progress: 1/4.
20/09/10 20:54:49.965 dispatcher-event-loop-13 INFO BarrierCoordinator: Current barrier epoch for Stage 0 (Attempt 0) is 0.
20/09/10 20:54:49.966 dispatcher-event-loop-13 INFO BarrierCoordinator: Barrier sync epoch 0 from Stage 0 (Attempt 0) received update from Task 3, current progress: 2/4.
20/09/10 20:54:50.112 dispatcher-event-loop-16 INFO BarrierCoordinator: Current barrier epoch for Stage 0 (Attempt 0) is 0.
20/09/10 20:54:50.113 dispatcher-event-loop-16 INFO BarrierCoordinator: Barrier sync epoch 0 from Stage 0 (Attempt 0) received update from Task 0, current progress: 3/4.
20/09/10 20:54:50.609 dispatcher-CoarseGrainedScheduler INFO TaskSchedulerImpl: Skip current round of resource offers for barrier stage 0 because the barrier taskSet requires 4 slots, while the total number of available slots is 0.
20/09/10 20:54:50.826 dispatcher-event-loop-17 INFO BarrierCoordinator: Current barrier epoch for Stage 0 (Attempt 0) is 0.
20/09/10 20:54:50.827 dispatcher-event-loop-17 INFO BarrierCoordinator: Barrier sync epoch 0 from Stage 0 (Attempt 0) received update from Task 2, current progress: 4/4.
20/09/10 20:54:50.827 dispatcher-event-loop-17 INFO BarrierCoordinator: Barrier sync epoch 0 from Stage 0 (Attempt 0) received all updates from tasks, finished successfully.
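
To make the race in the log easier to follow, here is a minimal, deterministic sketch of it in plain Python. It is not Spark code and does not reproduce BarrierCoordinator. The function name `job_throws`, the per-task call counts, and the simplified coordinator rules (a round opens at its first request, succeeds when all tasks have joined, and fails every waiting task once the timeout elapses) are assumptions made for illustration. The arrival times are loosely taken from the timestamps above.

```python
import heapq


def job_throws(first_call_times, calls_per_task, timeout):
    """Return True if some task's final (uncaught) barrier() call times out.

    Hypothetical simplified model, not Spark's BarrierCoordinator:
    a sync round opens at its first request, succeeds once every task has
    joined, and fails all waiting tasks if `timeout` seconds pass first.
    A timeout in a non-final call is swallowed (the try...catch) and the
    task moves straight on to its next barrier() call.
    """
    n = len(first_call_times)
    events = [(t, task) for task, t in enumerate(first_call_times)]
    heapq.heapify(events)          # upcoming barrier() calls, ordered by time
    calls_made = [0] * n
    waiting = []                   # tasks blocked in the current round
    deadline = None                # time at which the current round times out

    while events or waiting:
        if deadline is not None and (not events or deadline <= events[0][0]):
            # The round timed out before all tasks joined.
            for task in waiting:
                if calls_made[task] == calls_per_task[task]:
                    return True    # uncaught timeout: the job fails
                heapq.heappush(events, (deadline, task))  # caught, call again
            waiting, deadline = [], None
            continue

        now, task = heapq.heappop(events)
        calls_made[task] += 1
        if not waiting:
            deadline = now + timeout   # first request starts the timer
        waiting.append(task)
        if len(waiting) == n:
            # Everyone joined: re-queue the tasks that still have calls to make.
            for t in waiting:
                if calls_made[t] < calls_per_task[t]:
                    heapq.heappush(events, (now, t))
            waiting, deadline = [], None

    return False                   # every barrier() call synced successfully


# Assumption: task 0 makes one barrier() call, the other tasks make two.
calls = [1, 2, 2, 2]
print(job_throws([0.0, 0.1, 0.2, 0.3], calls, timeout=1.0))     # task 0 on time
print(job_throws([1.29, 0.09, 0.0, 0.14], calls, timeout=1.0))  # task 0 late
print(job_throws([1.29, 0.09, 0.0, 0.14], calls, timeout=5.0))  # longer timeout
```

Under these assumptions the three calls print True, False, True. When task 0 arrives within the timeout, the mismatch in call counts surfaces as an uncaught timeout. When task 0 arrives after the first round has already timed out, the caught exceptions let all four tasks meet in a single later round and no exception is raised, which is the flaky outcome. A longer timeout closes that window in this model.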

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Updated the test and ran it a hundred times without failure (previously, several of those runs would fail).

Ngone51 · 2020-09-11T15:54:33Z

@jiangxb1987 Please take a look, thanks!

SparkQA · 2020-09-11T18:27:41Z

Test build #128576 has finished for PR 29732 at commit 604e061.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-09-14T00:45:25Z

retest this please

SparkQA · 2020-09-14T03:30:10Z

Test build #128612 has finished for PR 29732 at commit 604e061.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Ngone51 · 2020-09-17T02:54:59Z

cc @jiangxb1987

core/src/test/scala/org/apache/spark/scheduler/BarrierTaskContextSuite.scala

jiangxb1987 · 2020-09-17T05:31:57Z

Can we simply resolve the issue by setting a longer barrier sync timeout?

Ngone51 · 2020-09-18T03:36:39Z

> Can we simply resolve the issue by setting a longer barrier sync timeout?

Increased the timeout to 5s. I think that should be long enough.

SparkQA · 2020-09-18T07:05:03Z

Test build #128847 has finished for PR 29732 at commit 6e7044c.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

jiangxb1987 · 2020-09-21T17:59:19Z

retest this please

SparkQA · 2020-09-21T20:41:37Z

Test build #128947 has finished for PR 29732 at commit 6e7044c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Ngone51 · 2020-09-23T02:29:58Z

cc @jiangxb1987

jiangxb1987

LGTM

jiangxb1987 · 2020-09-28T20:15:33Z

retest this please

SparkQA · 2020-09-28T21:11:56Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33809/

SparkQA · 2020-09-28T21:30:12Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33809/

SparkQA · 2020-09-28T22:57:24Z

Test build #129194 has finished for PR 29732 at commit 6e7044c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun

+1, LGTM. Thank you, @Ngone51 and @jiangxb1987 .
I also agree with the above analysis. Merged to master for Apache Spark 3.1.0, which is planned for December 2020.

probot-autolabeler bot added the CORE label Sep 11, 2020

Ngone51 closed this Sep 16, 2020

Ngone51 reopened this Sep 16, 2020

jiangxb1987 reviewed Sep 17, 2020

core/src/test/scala/org/apache/spark/scheduler/BarrierTaskContextSuite.scala (outdated)

address comment

6e7044c

Ngone51 force-pushed the fix-flaky-throw-exception branch from 604e061 to 6e7044c Compare September 18, 2020 03:35

jiangxb1987 approved these changes Sep 28, 2020

dongjoon-hyun approved these changes Oct 6, 2020

dongjoon-hyun closed this in 0b326d5 Oct 6, 2020
