[SPARK-32000][CORE][TESTS] Fix the flaky testcase for partially launched task in barrier-mode. #28839
Conversation
Force-pushed from 2db6ca0 to 5dc2886
Test build #124114 has finished for PR 28839 at commit
retest this please.
Test build #124116 has finished for PR 28839 at commit
It's strange that the dependency test fails. It finishes successfully on my laptop.
retest this please.
Test build #124118 has finished for PR 28839 at commit
retest this please.
Test build #124124 has finished for PR 28839 at commit
@@ -275,7 +275,8 @@ class BarrierTaskContextSuite extends SparkFunSuite with LocalSparkContext with
  }

  test("SPARK-31485: barrier stage should fail if only partial tasks are launched") {
    initLocalClusterSparkContext(2)
    val conf = new SparkConf().set(LOCALITY_WAIT_PROCESS.key, Int.MaxValue + "s")
Thank you, @sarutak. BTW, is Int.MaxValue necessary for this test case's purpose?
This test case requires that both tasks be assigned to the same preferred location, so Int.MaxValue means the task that is not assigned to the preferred location first waits indefinitely until the preferred location becomes available.
If hanging the test is a concern, we can choose a smaller value that is still long enough for the task to be assigned to the preferred location.
OK, let's set it to 10s for safety, just in case.
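The setting agreed on here can be sketched as follows (illustrative only; `LOCALITY_WAIT_PROCESS` is Spark's internal constant for the `spark.locality.wait.process` key, and how the conf reaches the local cluster depends on the suite's helper):

```scala
// A finite but generous process-locality wait instead of Int.MaxValue:
// long enough for the second task to keep waiting for the preferred
// executor, but short enough that a misbehaving test times out rather
// than hanging the whole build.
val conf = new SparkConf()
  .set(LOCALITY_WAIT_PROCESS.key, "10s")
```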
Thanks!
cc @jiangxb1987, @WeichenXu123, @Ngone51, @cloud-fan
Test build #124134 has finished for PR 28839 at commit
@@ -275,7 +275,8 @@ class BarrierTaskContextSuite extends SparkFunSuite with LocalSparkContext with
  }

  test("SPARK-31485: barrier stage should fail if only partial tasks are launched") {
    initLocalClusterSparkContext(2)
    val conf = new SparkConf().set(LOCALITY_WAIT_PROCESS.key, "10s")
Now that we wait until all the executors have been launched before submitting any jobs, all the executors should already be available the first time we offer resources to the pending tasks, and the locality preference should be satisfied. So this shouldn't change with different locality wait times. cc @Ngone51
I can't remember how far we backported the fix from @Ngone51. Is it in branch 2.4?
The new delay scheduling update didn't go into 3.0 or earlier versions, so this is a separate issue.
Hi @sarutak, thanks for reporting this and for the fix. First of all, I think it's very unlikely that we'd reach the locality wait timeout (default 3s), since that is still quite long for such a unit test. After checking the log, I believe the real root cause is this: two test cases from different test suites were submitted at the same time because of concurrent execution. In this particular case, the two test cases (DistributedSuite and BarrierTaskContextSuite) both launch under local-cluster mode. The two applications were submitted at the SAME time, so they got the same applicationId (app-20200615210132-0000). Thus, when the cluster of BarrierTaskContextSuite was launching executors, it failed to create the directory for executor 0, because the path (/home/jenkins/workspace/work/app-app-20200615210132-0000/0) was already in use by the cluster of DistributedSuite. Therefore, it had to launch executors 1 and 2 instead, which meant none of the tasks could get their preferred locality, so they got scheduled together and the test failed. You can download the log and search for the appId "app-20200615210132-0000" to confirm the root cause. I think the right fix is to use the dynamic executor id from the SparkContext instead of hardcoding 0. I'd like to open a separate PR for the fix if you don't mind.
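The direction proposed here could be sketched as below. This is hedged: the eventual follow-up PR may differ in detail; the `MyRDD` helper, the `dep` dependency, and the `executor_<host>_<id>` preferred-location string are assumed from the existing suite, and `getExecutorIds()` is package-private to `org.apache.spark`, which is accessible from the test.

```scala
// Sketch: ask the running context which executor actually came up,
// rather than assuming executor "0" exists.
val execId = sc.getExecutorIds().head
// Both partitions prefer the same, dynamically discovered executor (with a
// single core), so only one of the two barrier tasks can be launched per
// resource-offer round, which is exactly the scenario the test needs.
val rdd = new MyRDD(sc, 2, List(dep),
  Seq(Seq(s"executor_h_$execId"), Seq(s"executor_h_$execId")))
```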
@Ngone51 Ah, I got it. I don't mind you opening another PR.
What changes were proposed in this pull request?
This PR fixes the flaky test case "barrier stage should fail if only partial tasks are launched" for SPARK-31485 by extending spark.locality.wait.process.

Why are the changes needed?
I noticed that the test case for SPARK-31485 sometimes fails. Here is one instance. You can also easily reproduce it by running the test case with spark.locality.wait.process set to 0s.

The reason is related to the locality wait.
If the scheduler waits for a resource offer that meets a task's preferred location until the process-local time limit expires, but no resource can be offered at that locality level, the scheduler gives up on the preferred location. In that case, the task can be assigned to a non-preferred location.
In the test case for SPARK-31485, there are two tasks and only one of them is supposed to be assigned in a single scheduling round, but in the situation described above both tasks can be assigned, so the test case fails.
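The failure mode described above boils down to a one-line config difference, sketched here for illustration (assuming the suite wires the SparkConf into the local cluster):

```scala
// Reproduce the flake: with a zero process-locality wait, the scheduler
// immediately relaxes locality, both barrier tasks launch in a single
// offer round, and the "only partial tasks launched" scenario never occurs.
val flakyConf = new SparkConf().set("spark.locality.wait.process", "0s")

// The fix in this PR: extend the wait so the scheduler keeps honoring the
// shared preferred location, leaving one task unlaunched as intended.
val fixedConf = new SparkConf().set("spark.locality.wait.process", "10s")
```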
Does this PR introduce any user-facing change?
No.
How was this patch tested?
The modified testcase.