
[SPARK-14649][CORE] DagScheduler should not run duplicate tasks on fe… #17297

Closed
wants to merge 11 commits

Conversation


@sitalkedia sitalkedia commented Mar 14, 2017

What changes were proposed in this pull request?

When a fetch failure occurs, the DAGScheduler re-launches the previous stage (to re-generate the output that was missing), and then re-launches all tasks in the stage that haven't completed by the time the stage gets resubmitted (the DAGScheduler re-launches all of the tasks whose output data is not available -- which is equivalent to the set of tasks that hadn't yet completed). This sometimes leads to wasteful duplicate task runs for jobs with long-running tasks.

To address this issue, the following changes have been made.

The DAGScheduler maintains a pending task list: the tasks that have been submitted to the lower-level scheduler and that should not be resubmitted when the stage is rerun.

  1. When a fetch failure happens, the TaskSetManager informs the DAGScheduler which of the stage's non-running tasks have been aborted, and those tasks are removed from the pending task list so they can be rerun. The running tasks in the task set are not killed.
  2. When the stage is resubmitted, the DAGScheduler only resubmits the tasks that are not on the pending task list (see the sketch just below this list).
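
As a rough illustration of the bookkeeping described above (a minimal sketch with simplified, hypothetical names; the real change lives in the DAGScheduler and TaskSetManager):

    import scala.collection.mutable

    // Sketch: per-stage state on the scheduler side. Partitions whose tasks have
    // already been handed to the lower-level scheduler are "pending" and are skipped
    // when the stage is resubmitted; partitions of aborted tasks are dropped from the
    // pending set so that the next attempt recomputes them.
    class StageState(numPartitions: Int) {
      val pendingPartitions = mutable.HashSet[Int]()    // submitted / still running
      val completedPartitions = mutable.HashSet[Int]()  // output already available

      def findMissingPartitions(): Seq[Int] =
        (0 until numPartitions).filterNot(completedPartitions)

      // On resubmission, compute only partitions that are neither complete nor pending.
      def partitionsToCompute(): Seq[Int] =
        findMissingPartitions().filterNot(pendingPartitions)

      // Called when the TaskSetManager reports aborted (non-running) tasks.
      def onTasksAborted(abortedPartitions: Seq[Int]): Unit =
        pendingPartitions --= abortedPartitions
    }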

How was this patch tested?

Added new tests.

@SparkQA

SparkQA commented Mar 14, 2017

Test build #74558 has finished for PR 17297 at commit e5429d3.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class TasksAborted(stageId: Int, tasks: Seq[Task[_]]) extends DAGSchedulerEvent

@sitalkedia
Author

cc @kayousterhout - Addressed your earlier comment about #12436 ignoring fetch failures from stale map output. I have addressed this by recording an epoch for each registered map output; that way, if the task's epoch is smaller than the epoch of the map output, we can ignore the fetch failure as stale. This also takes care of the epoch changes triggered by executor loss, for a shuffle task whose shuffle map task's executor is gone, as pointed out by @mridulm.

Let me know what you think of the approach.
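
To make the epoch check concrete, here is a minimal sketch of the intended logic (a standalone helper written for illustration; the actual check in this PR lives in the DAGScheduler's fetch-failure handling, shown in one of the review diffs below):

    // Sketch: decide whether a reported fetch failure should be acted upon.
    // epochForMapOutput is the epoch at which the failed map output was registered
    // (None if it is no longer registered); taskEpoch is the epoch the reduce task
    // was launched with.
    def shouldHandleFetchFailure(taskEpoch: Long, epochForMapOutput: Option[Long]): Boolean =
      // If the map output was (re)registered at an epoch newer than the task's epoch,
      // the failure refers to stale output and is ignored; otherwise it is handled by
      // unregistering the output and resubmitting the map stage.
      epochForMapOutput.exists(_ <= taskEpoch)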

@SparkQA

SparkQA commented Mar 14, 2017

Test build #74560 has finished for PR 17297 at commit 279b09a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class TasksAborted(stageId: Int, tasks: Seq[Task[_]]) extends DAGSchedulerEvent

@SparkQA

SparkQA commented Mar 15, 2017

Test build #74562 has finished for PR 17297 at commit f127150.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class TasksAborted(stageId: Int, tasks: Seq[Task[_]]) extends DAGSchedulerEvent

@SparkQA

SparkQA commented Mar 15, 2017

Test build #74566 has finished for PR 17297 at commit 0bcc69a.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class TasksAborted(stageId: Int, tasks: Seq[Task[_]]) extends DAGSchedulerEvent

@kayousterhout
Contributor

@sitalkedia I won't have time to review this in detail for at least a few weeks, just so you know (although others may have time to review / merge it).

At a very high level, I'm concerned about the amount of complexity that this adds to the scheduler code. We've recently had to deal with a number of subtle bugs with jobs hanging or Spark crashing as a result of trying to handle map output from old tasks. As a result, I'm hesitant to add more complexity -- and the associated risk of bugs that cause job failures + expense of maintaining the code -- to improve performance.

At this point I'd lean towards cancelling outstanding map tasks when a fetch failure occurs (there's currently a TODO in the code to do this) to simplify these issues. This would improve performance in some ways, by freeing up slots that could be used for something else, at the expense of wasted work if the tasks have already made significant progress. But it would significantly simplify the scheduler code, which, given the debugging + reviewer time that has gone into fixing subtle issues with this code path, I think is worthwhile.

Curious what other folks think here.

@sitalkedia
Author

@kayousterhout - I understand your concern and I agree that canceling the running tasks is definitely a simpler approach, but it is very inefficient for large jobs where tasks can run for hours. In our environment, where fetch failures are common, this change will not only improve job performance in the case of a fetch failure, it also helps reliability. If we cancel all running reducers, we might end up in a state where jobs make no progress at all under frequent fetch failures, because they just flip-flop between two stages.

Comparing this approach to how Hadoop handles fetch failures: Hadoop does not fail any reducer when it detects that a map output is missing. The reducers just continue processing output from other mappers while the missing output is recomputed concurrently. This approach gives Hadoop a big edge over Spark for long-running jobs with multiple fetch failures. This change is one step towards making Spark robust against fetch failures; we would eventually want the Hadoop model, where we do not fail any task when a map output goes missing.

Regarding the approach, please let me know if you can think of some way to reduce the complexity of this change.

cc @markhamstra, @rxin, @sameeragarwal

@@ -193,13 +193,6 @@ private[spark] class TaskSchedulerImpl private[scheduler](
    val stageTaskSets =
      taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
    stageTaskSets(taskSet.stageAttemptId) = manager
    val conflictingTaskSet = stageTaskSets.exists { case (_, ts) =>
Author

Please note that this check is not needed anymore, because the DAGScheduler already keeps track of running tasks and does not submit duplicate tasks anymore.

Contributor

Actually, that is not really the point of this check. It's just checking whether one stage has two tasksets (aka stage attempts) where both are in the "non-zombie" state. It doesn't do any checks at all on what tasks are actually in those tasksets.

This is just checking an invariant which we believe to always be true, but we figure it's better to fail fast if we hit this condition, rather than proceed with some inconsistent state. This check was added because behavior gets really confusing when the invariant is violated, and though we think it should always be true, we've still hit cases where it happens.

Author

@squito - That's correct, this is checking that we should not have more than one non-zombie attempt of a stage running. But in the scenario (d) you described below, we will end up having more than two non-zombie attempts.

However, my point is that there is no reason we should not allow multiple concurrent attempts of a stage to run; the only thing we should guarantee is that we are running mutually exclusive sets of tasks in those attempts. With this change, since the DAGScheduler already keeps track of submitted/running tasks, it can guarantee that it will not resubmit duplicate tasks for a stage.

@SparkQA

SparkQA commented Mar 16, 2017

Test build #74631 has finished for PR 17297 at commit 901c9bf.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito
Contributor

squito commented Mar 18, 2017

I'm a bit confused by the description:

  1. When a fetch failure happens, the task set manager asks the dag scheduler to abort all the non-running tasks. However, the running tasks in the task set are not killed.

This is already true: when there is a fetch failure, the TaskSetManager is marked as a zombie, and the DAGScheduler resubmits stages, but nothing actively kills running tasks.

re-launches all tasks in the stage with the fetch failure that hadn't completed when the fetch failure occurred (the DAGScheduler re-launches all of the tasks whose output data is not available -- which is equivalent to the set of tasks that hadn't yet completed).

I don't think it's true that it relaunches all tasks that hadn't completed when the fetch failure occurred. It relaunches all the tasks that haven't completed by the time the stage gets resubmitted. More tasks can complete between the time of the first failure and the time the stage is resubmitted.

But there are several other potential issues you may be trying to address.

Say there is stage 0 and stage 1, each with 10 tasks. Stage 0 completes fine on the first attempt, then stage 1 starts. Tasks 0 & 1 in stage 1 complete, but then there is a fetch failure in task 2. Let's also say we have an abundance of cluster resources, so tasks 3 - 9 from stage 1, attempt 0 are still running.

Stage 0 gets resubmitted as attempt 1, just to regenerate the map output for whatever executor had the data for the fetch failure -- perhaps it's just one task from stage 0 that needs to be resubmitted. Now, lots of different scenarios are possible:

(a) Tasks 3 - 9 from stage 1 attempt 0 all finish successfully while stage 0 attempt 1 is running. So when stage 0 attempt 1 finishes, stage 1 attempt 1 is submitted with just task 2. If it completes successfully, we're done (no wasted work).

(b) Stage 0 attempt 1 finishes before tasks 3 - 9 from stage 1 attempt 0 have finished. So stage 1 gets submitted again as stage 1 attempt 1, with tasks 2 - 9. There are now two copies running for tasks 3 - 9. Maybe all the tasks from attempt 0 actually finish shortly after attempt 1 starts. In that case, the stage is complete as soon as there is one completed attempt for each task. But even after the stage completes successfully, all the other tasks keep running anyway. (Plenty of wasted work.)

(c) Like (b), but shortly after stage 1 attempt 1 is submitted, we get another fetch failure in one of the old "zombie" tasks from stage 1 attempt 0. But the DAGScheduler realizes it already has a more recent attempt for this stage, so it ignores the fetch failure. All the other tasks keep running as usual. If there aren't any other issues, the stage completes when there is one completed attempt for each task. (Same amount of wasted work as (b).)

(d) While stage 0 attempt 1 is running, we get another fetch failure from stage 1 attempt 0, say in task 3, which reports a failure from a different executor. Maybe it's from a completely different host (just by chance, or there may be cluster maintenance where multiple hosts are serviced at once); or maybe it's from another executor on the same host (at least until we do something about your other PR on unregistering all shuffle files on a host). To be honest, I don't understand how things work in this scenario. We mark stage 0 as failed, we unregister some shuffle output, and we resubmit stage 0. But stage 0 attempt 1 is still running, so I would have expected us to end up with conflicting task sets. Whatever the real behavior is here, it seems we're at risk of even more duplicated work for yet another attempt of stage 1.

etc.

So I think in (b) and (c), you are trying to avoid resubmitting tasks 3 - 9 in stage 1 attempt 1. The thing is, there is a strong reason to believe that the original versions of those tasks will fail: most likely, those tasks need map output from the same executor that caused the first fetch failure. So Kay is suggesting that we take the opposite approach and instead actively kill the tasks from stage 1 attempt 0. OTOH, it's possible that (i) the issue was transient or (ii) the tasks already finished fetching that data before the error occurred. We really have no idea.

@sitalkedia
Author

Thanks a lot @squito for taking a look at it and for your feedback.

This is already true: when there is a fetch failure, the TaskSetManager is marked as a zombie, and the DAGScheduler resubmits stages, but nothing actively kills running tasks.

That is true, but currently the DAGScheduler has no idea which tasks are running and which are being aborted. With this change, the TaskSetManager informs the DAGScheduler about currently running/aborted tasks so that the DAGScheduler can avoid resubmitting duplicates.

I don't think it's true that it relaunches all tasks that hadn't completed when the fetch failure occurred. It relaunches all the tasks that haven't completed by the time the stage gets resubmitted. More tasks can complete between the time of the first failure and the time the stage is resubmitted.

Yes, that's true. I will update the PR description.

So I think in (b) and (c), you are trying to avoid resubmitting tasks 3 - 9 in stage 1 attempt 1. The thing is, there is a strong reason to believe that the original versions of those tasks will fail: most likely, those tasks need map output from the same executor that caused the first fetch failure. So Kay is suggesting that we take the opposite approach and instead actively kill the tasks from stage 1 attempt 0. OTOH, it's possible that (i) the issue was transient or (ii) the tasks already finished fetching that data before the error occurred. We really have no idea.

In our case, we are observing that any transient issue on the shuffle service can cause a few tasks to fail, while other reducers might not see the fetch failure because they either already fetched the data from that shuffle service or are yet to fetch it. Killing all the reducers in those cases wastes a lot of work, and, as I mentioned above, we might end up in a state where jobs make no progress at all under frequent fetch failures, because they just flip-flop between two stages.

@sitalkedia
Author

I don't think it's true that it relaunches all tasks that hadn't completed when the fetch failure occurred. It relaunches all the tasks that haven't completed by the time the stage gets resubmitted. More tasks can complete between the time of the first failure and the time the stage is resubmitted.

Actually, I realized that it's not true. If you look at the code (https://github.com/sitalkedia/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1419), when the stage fails because of a fetch failure, we remove the stage from the output committer. So any task that completes between the time of the first fetch failure and the time the stage is resubmitted will be denied permission to commit its output, and so the scheduler does re-launch all tasks in the stage with the fetch failure that hadn't completed when the fetch failure occurred.
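
For reference, a simplified model of the commit-denial behavior described above (an illustration of the idea only, not the actual OutputCommitCoordinator code in Spark):

    import scala.collection.mutable

    // Sketch: once a stage is removed on fetch failure, late commit requests from its
    // old tasks are denied, so those partitions get recomputed by the next attempt.
    class SimpleCommitCoordinator {
      private val authorizedByStage = mutable.Map[Int, mutable.Map[Int, Int]]()

      def stageStart(stageId: Int): Unit =
        authorizedByStage(stageId) = mutable.Map.empty[Int, Int]

      // Called when the stage finishes or fails (e.g. because of a fetch failure).
      def stageEnd(stageId: Int): Unit =
        authorizedByStage -= stageId

      // A task attempt may commit only if its stage is still registered and no other
      // attempt has already been authorized to commit the same partition.
      def canCommit(stageId: Int, partition: Int, attemptNumber: Int): Boolean =
        authorizedByStage.get(stageId) match {
          case None => false  // stage no longer active: deny the commit
          case Some(authorized) =>
            authorized.get(partition) match {
              case Some(winner) => winner == attemptNumber
              case None =>
                authorized(partition) = attemptNumber
                true
            }
        }
    }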

@squito
Contributor

squito commented Mar 20, 2017

when the stage fails because of a fetch failure, we remove the stage from the output committer. So any task that completes between the time of the first fetch failure and the time the stage is resubmitted will be denied permission to commit its output

Oh, that is a great point. I was mostly thinking of another shuffle map stage, where that wouldn't matter, but if it's a result stage which needs to commit its output, you are right.

// It is possible that the map output was regenerated by rerun of the stage and the
// fetch failure is being reported for stale map output. In that case, we should just
// ignore the fetch failure and relaunch the task with latest map output info.
if (epochForMapOutput.nonEmpty && epochForMapOutput.get <= task.epoch) {
Contributor

I'd be inclined to do this without the extra binding and get:

        for(epochForMapOutput <- mapOutputTracker.getEpochForMapOutput(shuffleId, mapId) if
            epochForMapOutput <= task.epoch) {
          // Mark the map whose fetch failed as broken in the map stage
          if (mapId != -1) {
            mapStage.removeOutputLoc(mapId, bmAddress)
            mapOutputTracker.unregisterMapOutput(shuffleId, mapId, bmAddress)
          }

          // TODO: mark the executor as failed only if there were lots of fetch failures on it
          if (bmAddress != null) {
            handleExecutorLost(bmAddress.executorId, filesLost = true, Some(task.epoch))
          }
        }

    if (changeEpoch) {
      incrementEpoch()
    }
    mapStatuses.put(shuffleId, statuses.clone())
Contributor

What was the point of moving this?

@@ -378,15 +382,17 @@ private[spark] class MapOutputTrackerMaster(conf: SparkConf,
    val array = mapStatuses(shuffleId)
    array.synchronized {
      array(mapId) = status
      val epochs = epochForMapStatus.get(shuffleId).get
Contributor

val epochs = epochForMapStatus(shuffleId)

return Some(epochForMapStatus.get(shuffleId).get(mapId))
}
None
}
Contributor

First, arrayOpt.get != null isn't necessary since we don't put null values into mapStatuses. Second, epochForMapStatus.get(shuffleId).get is the same as epochForMapStatus(shuffleId). Third, I don't like all the explicit gets, null checks, and the unnecessary non-local return. To my mind, this is better:

  def getEpochForMapOutput(shuffleId: Int, mapId: Int): Option[Long] = {
    for {
      mapStatus <- mapStatuses.get(shuffleId).flatMap { mapStatusArray =>
        Option(mapStatusArray(mapId))
      }
    } yield epochForMapStatus(shuffleId)(mapId)
  }

    for (task <- tasks) {
      stage.pendingPartitions -= task.partitionId
    }
  }
Contributor

    for {
      stage <- stageIdToStage.get(stageId)
      task <- tasks
    } stage.pendingPartitions -= task.partitionId

    val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
    val missingPartitions = stage.findMissingPartitions()
    val partitionsToCompute =
      missingPartitions.filter(id => !stage.pendingPartitions.contains(id))
Contributor

missingPartitions.filterNot(stage.pendingPartitions)

@SparkQA

SparkQA commented Mar 22, 2017

Test build #75029 has finished for PR 17297 at commit 99b4069.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sitalkedia
Author

Thanks @markhamstra for the review comments; addressed. I also found an issue with my previous implementation, where we did not allow task commits from old stage attempts; I fixed that as well.

@SparkQA

SparkQA commented Mar 22, 2017

Test build #75030 has started for PR 17297 at commit 40a3742.

@SparkQA

SparkQA commented Mar 23, 2017

Test build #75126 has finished for PR 17297 at commit 05770b9.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 24, 2017

Test build #75124 has finished for PR 17297 at commit c0bdca6.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 24, 2017

Test build #75127 has finished for PR 17297 at commit 1aab715.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kayousterhout
Contributor

To recap the issue that Imran and I discussed here, I think it can be summarized as follows:

  • A Fetch Failure happens at some time t and indicates that the map output on machine M has been lost
  • Consider some running task that's read x map outputs and still needs to process y map outputs
  • Scenario A: (PRO of this PR) If the output from M was in the x outputs that were already read, we should keep running the task (as this PR does), because the task already successfully fetched the output from the failed machine. We don't do this currently, meaning we're throwing away work that was already done.
  • Scenario B: (CON of this PR) If the output from M was in the y outputs that have not yet been read, then we should cancel the task, because the task won't learn about the new location of the re-generated output of M (IIUC, there's no functionality to do this now) and so is going to fail later on. The current code will re-run the task, which is what we should do. This code will try to re-use the old task, which means the job will take longer to run, because the task will fail later on and need to be re-started.

If my description above is correct, then this PR is assuming that scenario A is more likely than scenario B, but it seems to me that these two scenarios are equally likely (in which case this PR provides no net benefit). @sitalkedia what are your thoughts here / did I miss something in my description above?

@sitalkedia
Author

sitalkedia commented Mar 28, 2017

@squito - I am not able to reproduce this issue locally.

The tests fail with some other issue:

java.util.NoSuchElementException: None.get
	at scala.None$.get(Option.scala:347)
	at scala.None$.get(Option.scala:345)
	at org.apache.spark.InternalAccumulatorSuite$$anonfun$1.apply$mcV$sp(InternalAccumulatorSuite.scala:43)
	at org.apache.spark.InternalAccumulatorSuite$$anonfun$1.apply(InternalAccumulatorSuite.scala:39)
	at org.apache.spark.InternalAccumulatorSuite$$anonfun$1.apply(InternalAccumulatorSuite.scala:39)
	at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
	at org.scalatest.Transformer.apply(Transformer.scala:22)
	at org.scalatest.Transformer.apply(Transformer.scala:20)
	at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
	at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
	at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
	at org.apache.spark.InternalAccumulatorSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(InternalAccumulatorSuite.scala:28)
	at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
	at org.apache.spark.InternalAccumulatorSuite.runTest(InternalAccumulatorSuite.scala:28)
	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
	at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
	at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
	at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
	at org.scalatest.Suite$class.run(Suite.scala:1424)
	at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
	at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
	at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
	at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
	at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
	at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
	at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
	at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:31)
	at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:55)
	at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2563)
	at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2557)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:2557)
	at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1044)
	at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1043)
	at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:2722)
	at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:1043)
	at org.scalatest.tools.Runner$.run(Runner.scala:883)
	at org.scalatest.tools.Runner.run(Runner.scala)
	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2(ScalaTestRunner.java:138)
	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:28)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)

Please note that all `InternalAccumulatorSuite` tests fail on my laptop. 
In the Jenkins log, do you see any other test cases hitting the `java.lang.ArrayIndexOutOfBoundsException` from `MapOutputTrackerMaster`?

@squito
Contributor

squito commented Mar 28, 2017

@sitalkedia how are you trying to run the test? It works fine for me on my laptop on master. Note that the test references a var which is only defined if "spark.testing" is a system property: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala#L199

which it is in the sbt and maven builds. (Maybe it doesn't work inside an IDE? I'd strongly suggest just using ~testOnly with sbt for faster dev iterations, if you're not already.)

@SparkQA

SparkQA commented Mar 28, 2017

Test build #75287 has finished for PR 17297 at commit 1e6e88a.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sitalkedia
Author

@squito - I am able to reproduce the issue by running ./build/sbt "test-only org.apache.spark.InternalAccumulatorSuite"; however, the test case logs are not being printed on the console. Do you know where I can find the test case logs on my laptop?

Also, one weird thing is that after adding the spark.testing system property to my IntelliJ configuration, all test cases succeed without getting stuck :/

@kayousterhout
Contributor

@sitalkedia they're in core/target/unit-tests.log

Sometimes it's easier to move the logs to the tests (so they show up in-line), which you can do by changing core/src/test/resources/log4j.properties to log to the console instead of to a file.

@sitalkedia
Author

@kayousterhout - Both scenarios A and B you described above are likely (it totally depends on the nature of the job and the available cluster resources), and you are right that in scenario B this PR will not provide any benefit.

I am planning a follow-up PR to improve the fetch failure handling logic by not failing a task at all. In that case, the reducers can just inform the scheduler of the lost map output and continue processing other available map outputs while the scheduler concurrently recomputes the lost output. But that will be a bigger change in the scheduler.

@squito
Contributor

squito commented Mar 28, 2017

btw I filed https://issues.apache.org/jira/browse/SPARK-20128 for the test timeout -- fwiw I don't think it's a problem with the test but a potential real issue with the metrics system, though I don't really understand how it can happen.

@squito
Contributor

squito commented Mar 28, 2017

@sitalkedia This change is pretty contentious; there are a lot of questions about whether or not this is a good change. I don't think discussing this here in GitHub comments on a PR is the best format. I think of PR comments as being more about code details -- clarity, tests, whether the implementation is correct, etc. But here we're discussing whether the behavior is even desirable, as well as trying to discuss this in relation to other changes. I think a better format would be for you to open a JIRA and submit a design document (maybe a shared Google doc at first), where we can focus more on the desired behavior and consider all the changes together, even if the PRs are smaller to make them easier to review.

I'm explicitly not making a judgement on whether or not this is a good change. Also, I do appreciate you having the code changes ready as a POC, as that can help folks consider the complexity of the change. But it seems clear to me that first we need to come to a decision about the end goal.

Also, assuming we do decide this is desirable behavior, there is also a question about how we can get changes like this in without risking breaking things -- I have started a thread on dev@ related to that topic in general, but we should figure that out for these changes in particular as well.

@kayousterhout @tgravescs @markhamstra makes sense?

@tgravescs
Contributor

Sounds good to me.

@kayousterhout
Contributor

Agree sounds good!

@sitalkedia
Author

@squito - Sounds good to me; let me compile the list of pain points related to fetch failures that we are seeing, along with a design doc for better handling of these issues.

@markhamstra
Contributor

markhamstra commented Mar 29, 2017 via email

@SparkQA

SparkQA commented Mar 29, 2017

Test build #75332 has finished for PR 17297 at commit bdaff12.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 29, 2017

Test build #75339 has finished for PR 17297 at commit ace8464.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sitalkedia
Author

@kayousterhout, @squito - Since we need more discussion on this change over a design doc, I have put out a temporary change (#17485) to kill the running tasks in case of a fetch failure. Although this is not ideal, it would be better than the current situation.

@jiangxb1987
Contributor

Should we temporarily close the PR and wait for the design doc to be finalized? @sitalkedia

@sitalkedia
Author

okay, closing the PR.

@sitalkedia sitalkedia closed this May 24, 2017