
Conversation

@jinxing64

What changes were proposed in this pull request?

This is related to #16620.
When a fetch fails, the stage is resubmitted, so there can be running tasks from both the old and the new stage attempts. This PR adds a test checking that a successful task from the old stage attempt is taken as valid and that its partitionId is removed from the stage's pendingPartitions accordingly. Once pendingPartitions is empty, the downstream stage can be scheduled, even though tasks from the active (new) stage attempt are still running.
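The bookkeeping described above can be sketched with a minimal, self-contained model (plain Scala — `StageModel`, `TaskSuccess`, and the method names are illustrative stand-ins, not Spark's actual DAGScheduler classes): a success is credited by partition id, regardless of which stage attempt the task belonged to.

```scala
import scala.collection.mutable

// Illustrative stand-in for a completed-task event (names are made up).
case class TaskSuccess(stageId: Int, stageAttemptId: Int, partitionId: Int)

// Illustrative stand-in for a stage's pending-partition bookkeeping.
class StageModel(numPartitions: Int) {
  // Partitions that still need a successful task before downstream stages run.
  val pendingPartitions: mutable.Set[Int] = mutable.Set((0 until numPartitions): _*)

  // A success from ANY attempt of this stage clears its partition.
  def onTaskSuccess(ev: TaskSuccess): Unit = pendingPartitions -= ev.partitionId

  def canScheduleDownstream: Boolean = pendingPartitions.isEmpty
}

val stage = new StageModel(numPartitions = 2)
// Success from the OLD attempt (attempt 0), arriving after resubmission.
stage.onTaskSuccess(TaskSuccess(stageId = 1, stageAttemptId = 0, partitionId = 0))
// Success from the new attempt (attempt 1) for the other partition.
stage.onTaskSuccess(TaskSuccess(stageId = 1, stageAttemptId = 1, partitionId = 1))
// Both partitions are covered, so the downstream stage can be scheduled
// even if other tasks from the new attempt are still running.
assert(stage.canScheduleDownstream)
```

This is only a model of the invariant the test exercises; in Spark the actual check lives in the DAGScheduler's handling of task-completion events.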

@jinxing64
Author

@kayousterhout @squito @markhamstra
As mentioned in #16620, I think it might make sense to make this PR. Please take a look. If you think it is too trivial, I will close it.

}

test("After fetching failed, success of old attempt of stage should be taken as valid.") {
val rddA = new MyRDD(sc, 2, Nil)
Contributor

can you add a brief comment here with something like:

/// Create 3 RDDs with shuffle dependencies on each other: A <--- B <---- C

submit(rddC, Array(0, 1))
assert(taskSets(0).stageId === 0 && taskSets(0).stageAttemptId === 0)

complete(taskSets(0), Seq(
Contributor

add a comment saying something like "Complete both tasks in rddA"

"Fetch failure of task: stageId=1, stageAttempt=0, partitionId=0"), null),
(Success, makeMapStatus("hostB", 2))))

scheduler.resubmitFailedStages()
Contributor

add a comment here saying something like "Both original tasks in rddA should be marked as failed, because they ran on the failed hostA, so both should be resubmitted. Complete them successfully."

(Success, makeMapStatus("hostB", 2)),
(Success, makeMapStatus("hostB", 2))))

assert(taskSets(3).stageId === 1 && taskSets(2).stageAttemptId === 1)
Contributor

Should the second condition in the assert be checking taskSets(3) (not taskSets(2) again)?

runEvent(makeCompletionEvent(
taskSets(3).tasks(0), Success, makeMapStatus("hostB", 2)))

// Thanks to the success from old attempt of stage(stageId=1), there's no pending
Contributor

It looks like the success above is from the newer attempt of the stage (since you're taking the task from taskSets(3), not taskSets(1)), which is inconsistent with the comment. I think perhaps the intention here was to not finish one of the tasks from taskSets(1) the first time around (i.e., eliminate the Success on line 2185) and then move that success here (instead of completing the task from the more recent task set)?

Author

Yes, the success should be moved. Sorry for this; I'll rectify it.

@jinxing64 force-pushed the SPARK-19565 branch 2 times, most recently from da6b4d3 to 26e8ab4, on February 14, 2017 02:43
@jinxing64
Author

@kayousterhout
I've refined it accordingly. Sorry for the stupid mistake I made. Please take another look at this : )

Contributor

@squito left a comment

lgtm

I left a comment about another issue, but that is unrelated to just adding this test. I was surprised there wasn't already a good test for this, but I don't see one that really addresses it. Thanks for suggesting this addition.

}
}

test("After fetching failed, success of old attempt of stage should be taken as valid.") {
Contributor

Can you rename this to "After a fetch failure, success ..."? Really minor, but I had to read it twice.

// though there's still a running task(stageId=1, stageAttempt=1, partitionId=1)
// in the active stage attempt.
assert(taskSets.size === 5 && taskSets(4).tasks(0).isInstanceOf[ResultTask[_, _]])
complete(taskSets(4), Seq(
Contributor

I was going to suggest adding a check here to make sure that all prior TaskSetManagers are marked as zombies. But (a) you can't check that, since the DAGScheduler doesn't have a handle on the TaskSetManagers, and (b) more importantly, the prior TSMs actually are not marked as zombies. So they may continue to submit tasks even though they're not necessary.

I will file a separate bug for that -- it's a performance issue, not a correctness issue, so not critical. But this isn't the same as the old "kill running tasks when marking a TSM as a zombie" issue -- in this case, the problem is that the TSM may continue to launch new tasks.
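The performance concern above can be illustrated with a tiny model (plain Scala — `TsmModel`, `resourceOffer`, and `isZombie` here are illustrative stand-ins, not Spark's actual TaskSetManager API): until a TSM is marked as a zombie, it keeps handing out tasks even when its stage attempt is obsolete.

```scala
import scala.collection.mutable

// Illustrative model of the zombie flag (not Spark's actual TaskSetManager).
class TsmModel(taskIds: Seq[Int]) {
  var isZombie: Boolean = false
  private val notLaunched = mutable.Queue(taskIds: _*)

  // A zombie TSM stops launching new tasks; already-running ones may still finish.
  def resourceOffer(): Option[Int] =
    if (isZombie || notLaunched.isEmpty) None
    else Some(notLaunched.dequeue())
}

// An obsolete attempt that was never marked a zombie keeps launching tasks...
val obsolete = new TsmModel(Seq(0, 1))
assert(obsolete.resourceOffer() == Some(0))
assert(obsolete.resourceOffer() == Some(1))

// ...whereas marking it a zombie stops further launches.
val marked = new TsmModel(Seq(0, 1))
marked.isZombie = true
assert(marked.resourceOffer() == None)
```

This is only a sketch of the behavior being discussed, not a claim about how the real TaskScheduler code is structured.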


@jinxing64
Author

@squito
Thanks a lot for your comments. I've refined the comment.

@kayousterhout
Contributor

After looking at my other test cleanup PR I realized the "map stage submission with executor failure late map task completions" test already tests this functionality, only for map stages that are submitted in isolation (without a following reduce stage). @squito do you think we need this one, for the same reason? If so, do you think it's worth it to do some cleanup to (a) move this test next to that one and (b) make it consistent with that one?

@kayousterhout
Contributor

Jenkins this is OK to test

@jinxing64
Author

@kayousterhout
I'll close since this functionality is already tested.

@jinxing64 jinxing64 closed this Feb 20, 2017
@squito
Contributor

squito commented Feb 21, 2017

Sorry for responding to this late, but your analysis sounds fine.
