
[SPARK-32000][CORE][TESTS] Fix the flaky testcase for partially launched task in barrier-mode. #28839

Closed
wants to merge 2 commits

Conversation


@sarutak sarutak commented Jun 16, 2020

What changes were proposed in this pull request?

This PR fixes the flaky test case "barrier stage should fail if only partial tasks are launched" for SPARK-31485 by extending spark.locality.wait.process.

Why are the changes needed?

I noticed that the test case for SPARK-31485 sometimes fails; here is one instance.
You can also reproduce the failure easily by running the test case with spark.locality.wait.process set to 0s.

The cause is related to locality wait.
If the scheduler waits for a resource offer that satisfies a task's preferred location, but no such offer arrives before the process-local wait limit expires, the scheduler gives up on that locality level and the task can be assigned to a non-preferred location.
In the test case for SPARK-31485 there are two tasks, and only one of them is supposed to be assigned per scheduling round; in the situation described above both tasks can be assigned in the same round, so the test fails.
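
For context, a minimal reproduction sketch (illustrative only, not code from this PR's diff): shrinking the process-local locality wait to zero makes the scheduler give up the preferred executor immediately, so both barrier tasks can be launched in a single resource-offer round and the test's assumption breaks.

import org.apache.spark.SparkConf

// Hypothetical reproduction sketch, not part of this PR's diff.
// A 0s process-local wait means the scheduler never waits for the preferred
// executor, so both barrier tasks can be launched in one offer round.
val conf = new SparkConf().set("spark.locality.wait.process", "0s")
// Run the SPARK-31485 test case in BarrierTaskContextSuite with this conf
// to observe the failure.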

Does this PR introduce any user-facing change?

No.

How was this patch tested?

The modified test case.

@sarutak sarutak force-pushed the fix-barrier-partial-task-test branch from 2db6ca0 to 5dc2886 on June 16, 2020 08:51

SparkQA commented Jun 16, 2020

Test build #124114 has finished for PR 28839 at commit 5dc2886.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


sarutak commented Jun 16, 2020

retest this please.


SparkQA commented Jun 16, 2020

Test build #124116 has finished for PR 28839 at commit 5dc2886.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


sarutak commented Jun 16, 2020

It's strange that the dependency test fails here; it finishes successfully on my laptop.


sarutak commented Jun 16, 2020

retest this please.


SparkQA commented Jun 16, 2020

Test build #124118 has finished for PR 28839 at commit 5dc2886.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


sarutak commented Jun 16, 2020

retest this please.


SparkQA commented Jun 16, 2020

Test build #124124 has finished for PR 28839 at commit 5dc2886.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -275,7 +275,8 @@ class BarrierTaskContextSuite extends SparkFunSuite with LocalSparkContext with
}

test("SPARK-31485: barrier stage should fail if only partial tasks are launched") {
initLocalClusterSparkContext(2)
val conf = new SparkConf().set(LOCALITY_WAIT_PROCESS.key, Int.MaxValue + "s")
Member

Thank you, @sarutak . BTW, Int.MaxValue is inevitable for this test case's purpose?


@sarutak sarutak Jun 16, 2020

This test case requires that both tasks be assigned to the same preferred location, so Int.MaxValue means the task that is not assigned to the preferred location first keeps waiting until that location becomes available.

If we're worried this could hang the test, we can choose a smaller value that is still long enough for the task to be assigned to the preferred location.

O.K., let's set it to 10s for safety, just in case.

Member

Thanks!


SparkQA commented Jun 16, 2020

Test build #124134 has finished for PR 28839 at commit 7024593.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -275,7 +275,8 @@ class BarrierTaskContextSuite extends SparkFunSuite with LocalSparkContext with
}

test("SPARK-31485: barrier stage should fail if only partial tasks are launched") {
initLocalClusterSparkContext(2)
val conf = new SparkConf().set(LOCALITY_WAIT_PROCESS.key, "10s")
Contributor

Now that we wait until all the executors have launched before submitting any jobs, all executors should already be available the first time we offer resources to the pending tasks, so the locality preference should be satisfied regardless of the locality wait time. cc @Ngone51

Contributor

I can't remember how far we backported the fix from @Ngone51; is it in branch-2.4?

Contributor

The new delay scheduling update didn't go into 3.0 or earlier versions, so this is a separate issue.


Ngone51 commented Jun 17, 2020

Hi @sarutak, thanks for reporting and the fix.

First of all, I think it's very unlikely that we'll reach the locality wait timeout (default 3s), since even that is quite long for such a unit test.

After checking the log, I believe the real root cause is the following:

Two test cases from different test suites got submitted at the same time because of concurrent execution. In this particular case, the two test suites (DistributedSuite and BarrierTaskContextSuite) both launch a local-cluster, and the two applications are submitted at the SAME time, so they get the same applicationId (app-20200615210132-0000). As a result, when the cluster for BarrierTaskContextSuite launches its executors, it fails to create the directory for executor 0 because the path (/home/jenkins/workspace/work/app-app-20200615210132-0000/0) is already used by the cluster for DistributedSuite. It therefore launches executors 1 and 2 instead, so none of the tasks gets its preferred locality; they are scheduled together, and the test fails.

You can download the log and search for the appId "app-20200615210132-0000" to confirm the root cause.

The right fix, I think, is to use the dynamic executor id from the SparkContext instead of hardcoding 0. I'd like to open a separate PR for the fix if you don't mind.
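
As a rough illustration of that idea (a hypothetical sketch, not necessarily what #28849 does), the test could ask the SparkContext which executor ids actually registered and build the preferred location from one of them, instead of assuming executor 0 exists:

// Hypothetical sketch only; the actual fix is in PR #28849.
// getExecutorIds() is package-private to org.apache.spark, which is the
// package the suite lives in (an assumption of this sketch).
val execId = sc.getExecutorIds().head
// Build the task's preferred location from that id; TaskLocation's
// executor-cache format is "executor_<host>_<execId>". The host value
// below is illustrative.
val preferredLoc = s"executor_localhost_$execId"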


sarutak commented Jun 17, 2020

@Ngone51 Ah, I got it. I don't mind you opening another PR.


Ngone51 commented Jun 17, 2020

thanks a lot :) @sarutak

Opened the new PR: #28849

@sarutak sarutak closed this Jun 17, 2020