
[SPARK-32000][CORE][TESTS] Fix the flaky testcase for partially launched task in barrier-mode. #28839

Closed
wants to merge 2 commits

Conversation


@sarutak sarutak commented Jun 16, 2020

What changes were proposed in this pull request?

This PR fixes the flaky test case "barrier stage should fail if only partial tasks are launched" for SPARK-31485 by extending spark.locality.wait.process.

Why are the changes needed?

I noticed that the test case for SPARK-31485 sometimes fails; here is one instance.
You can also reproduce the failure easily by running the test case with spark.locality.wait.process set to 0s.

The cause is related to locality wait.
If the scheduler waits for a resource offer that satisfies a task's preferred location, but no such offer arrives before the process-local wait limit expires, the scheduler gives up on that locality level and the task can be assigned to a non-preferred location.
In the test case for SPARK-31485 there are two tasks, and only one of them is supposed to be assigned per scheduling round; in the situation described above both tasks can be assigned in the same round, so the test fails.
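
For context, a minimal reproduction sketch (illustrative only, not code from this PR's diff): shrinking the process-local locality wait to zero makes the scheduler give up the preferred executor immediately, so both barrier tasks can be launched in a single resource-offer round and the test's assumption breaks.

import org.apache.spark.SparkConf

// Hypothetical reproduction sketch, not part of this PR's diff.
// A 0s process-local wait means the scheduler never waits for the preferred
// executor, so both barrier tasks can be launched in one offer round.
val conf = new SparkConf().set("spark.locality.wait.process", "0s")
// Run the SPARK-31485 test case in BarrierTaskContextSuite with this conf
// to observe the failure.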

Does this PR introduce any user-facing change?

No.

How was this patch tested?

The modified test case.

@sarutak sarutak force-pushed the fix-barrier-partial-task-test branch from 2db6ca0 to 5dc2886 on June 16, 2020 08:51

SparkQA commented Jun 16, 2020

Test build #124114 has finished for PR 28839 at commit 5dc2886.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


sarutak commented Jun 16, 2020

retest this please.


SparkQA commented Jun 16, 2020

Test build #124116 has finished for PR 28839 at commit 5dc2886.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


sarutak commented Jun 16, 2020

It's strange that the dependency test fails here; it finishes successfully on my laptop.


sarutak commented Jun 16, 2020

retest this please.


SparkQA commented Jun 16, 2020

Test build #124118 has finished for PR 28839 at commit 5dc2886.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


sarutak commented Jun 16, 2020

retest this please.


SparkQA commented Jun 16, 2020

Test build #124124 has finished for PR 28839 at commit 5dc2886.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -275,7 +275,8 @@ class BarrierTaskContextSuite extends SparkFunSuite with LocalSparkContext with
}

test("SPARK-31485: barrier stage should fail if only partial tasks are launched") {
initLocalClusterSparkContext(2)
val conf = new SparkConf().set(LOCALITY_WAIT_PROCESS.key, Int.MaxValue + "s")
Member

Thank you, @sarutak . BTW, Int.MaxValue is inevitable for this test case's purpose?


@sarutak sarutak Jun 16, 2020

This test case requires that both tasks be assigned to the same preferred location, so Int.MaxValue means the task that is not assigned to the preferred location first keeps waiting until that location becomes available.

If we're worried this could hang the test, we can choose a smaller value that is still long enough for the task to be assigned to the preferred location.

O.K., let's set it to 10s for safety, just in case.

Member

Thanks!


SparkQA commented Jun 16, 2020

Test build #124134 has finished for PR 28839 at commit 7024593.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -275,7 +275,8 @@ class BarrierTaskContextSuite extends SparkFunSuite with LocalSparkContext with
}

test("SPARK-31485: barrier stage should fail if only partial tasks are launched") {
initLocalClusterSparkContext(2)
val conf = new SparkConf().set(LOCALITY_WAIT_PROCESS.key, "10s")
Contributor

Now that we wait until all the executors have launched before submitting any jobs, all executors should already be available the first time we offer resources to the pending tasks, so the locality preference should be satisfied regardless of the locality wait time. cc @Ngone51

Contributor

I can't remember how far we backported the fix from @Ngone51; is it in branch-2.4?

Contributor

The new delay scheduling update didn't go into 3.0 or earlier versions, so this is a separate issue.


Ngone51 commented Jun 17, 2020

Hi @sarutak, thanks for reporting and the fix.

First of all, I think it's very unlikely that we'll reach the locality wait timeout (default 3s), since even that is quite long for such a unit test.

After checking the log, I believe the real root cause is the following:

Two test cases from different test suites got submitted at the same time because of concurrent execution. In this particular case, the two test suites (DistributedSuite and BarrierTaskContextSuite) both launch a local-cluster, and the two applications are submitted at the SAME time, so they get the same applicationId (app-20200615210132-0000). As a result, when the cluster for BarrierTaskContextSuite launches its executors, it fails to create the directory for executor 0 because the path (/home/jenkins/workspace/work/app-app-20200615210132-0000/0) is already used by the cluster for DistributedSuite. It therefore launches executors 1 and 2 instead, so none of the tasks gets its preferred locality; they are scheduled together, and the test fails.

You can download the log and search for the appId "app-20200615210132-0000" to confirm the root cause.

The right fix, I think, is to use the dynamic executor id from the SparkContext instead of hardcoding 0. I'd like to open a separate PR for the fix if you don't mind.
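
As a rough illustration of that idea (a hypothetical sketch, not necessarily what #28849 does), the test could ask the SparkContext which executor ids actually registered and build the preferred location from one of them, instead of assuming executor 0 exists:

// Hypothetical sketch only; the actual fix is in PR #28849.
// getExecutorIds() is package-private to org.apache.spark, which is the
// package the suite lives in (an assumption of this sketch).
val execId = sc.getExecutorIds().head
// Build the task's preferred location from that id; TaskLocation's
// executor-cache format is "executor_<host>_<execId>". The host value
// below is illustrative.
val preferredLoc = s"executor_localhost_$execId"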


sarutak commented Jun 17, 2020

@Ngone51 Ah, I got it. I don't mind you opening another PR.


Ngone51 commented Jun 17, 2020

thanks a lot :) @sarutak

Opened the new PR: #28849

@sarutak sarutak closed this Jun 17, 2020