[FLINK-3050] [runtime] Add UnrecoverableException to suppress job restarts #1461

Closed
wants to merge 1 commit into from

uce
Contributor

@uce uce commented Dec 16, 2015

I need this to address a comment in #1434.

Adds UnrecoverableException, which suppresses job restarts if it is the failure cause. It is just a wrapper around the real cause and can only be instantiated with a cause.
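For readers who want a concrete picture, here is a minimal, self-contained sketch of the idea. It is not the code from this PR: the superclass, the null check, and the `shouldRestart` helper with its `remainingRestarts` parameter are assumptions made for illustration. Only the exception name and its message are taken from the stack trace in this description.

```java
import java.util.Objects;

public class SuppressRestartsSketch {

    /**
     * Wrapper marking a failure as unrecoverable. The only constructor takes a
     * cause, so the real failure is always attached.
     * Assumption: extends RuntimeException; the actual class may differ.
     */
    static final class UnrecoverableException extends RuntimeException {
        UnrecoverableException(Throwable cause) {
            super("Unrecoverable failure. This suppresses job restarts. "
                    + "Please check the stack trace for the root cause.",
                    Objects.requireNonNull(cause, "cause")); // null check is an assumption
        }
    }

    /**
     * Hypothetical restart decision: never restart if the failure cause is an
     * UnrecoverableException, otherwise restart while attempts remain.
     */
    static boolean shouldRestart(Throwable failureCause, int remainingRestarts) {
        if (failureCause instanceof UnrecoverableException) {
            return false;
        }
        return remainingRestarts > 0;
    }

    public static void main(String[] args) {
        Throwable root = new IllegalArgumentException("Invalid path 'unknown path'.");
        // Wrapped failure: restarts are suppressed even though attempts remain.
        System.out.println(shouldRestart(new UnrecoverableException(root), 3));
        // Plain failure: the usual restart logic applies.
        System.out.println(shouldRestart(root, 3));
    }
}
```

Requiring a cause in the constructor keeps the wrapper from ever hiding the real failure; the wrapper only carries the "do not restart" signal.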

A stack trace looks like this:

org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply$mcV$sp(JobManager.scala:649)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:595)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:595)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.runtime.execution.UnrecoverableException: Unrecoverable failure. This suppresses job restarts. Please check the stack trace for the root cause.
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:1067)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:1052)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:1052)
    ... 9 more
Caused by: java.lang.IllegalArgumentException: Invalid path 'unknown path'.
    at org.apache.flink.runtime.checkpoint.HeapStateStore.getState(HeapStateStore.java:57)
    at org.apache.flink.runtime.checkpoint.SavepointStore.getState(SavepointStore.java:54)
    at org.apache.flink.runtime.checkpoint.SavepointStore.getState(SavepointStore.java:24)
    at org.apache.flink.runtime.checkpoint.SavepointCoordinator.restoreSavepoint(SavepointCoordinator.java:189)
    at org.apache.flink.runtime.executiongraph.ExecutionGraph.restoreSavepoint(ExecutionGraph.java:874)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:1064)
    ... 11 more
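The trace above nests three exceptions, and the wrapper's message asks the reader to look for the root cause. As a rough illustration, the chain can be walked programmatically with `Throwable.getCause()`. The class and method names below are hypothetical and not part of this PR, and plain `Exception`/`RuntimeException` instances stand in for the Flink exception types.

```java
public class RootCauseSketch {

    /** Follows getCause() until the innermost exception is reached. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Rebuild a chain shaped like the one in the trace above.
        Throwable root = new IllegalArgumentException("Invalid path 'unknown path'.");
        Throwable wrapper = new RuntimeException("Unrecoverable failure.", root);
        Throwable top = new Exception("Job execution failed.", wrapper);

        System.out.println(rootCause(top).getMessage());
    }
}
```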

@gyfora
Contributor

gyfora commented Dec 17, 2015

I think this is very useful, for instance also when there are not enough task slots.

@gyfora
Contributor

gyfora commented Dec 17, 2015

Looks good +1

@uce
Contributor Author

uce commented Dec 17, 2015

Thanks for the review. With task slots we have to be careful though, because it is possible that slots become available after some time.

@tillrohrmann
Contributor

And in the foreseeable future it might be possible to scale the job according to the available slots.

@uce
Contributor Author

uce commented Jan 11, 2016

Any objections against merging this?

@asfgit asfgit closed this in ebbc85d Jan 11, 2016
@uce uce deleted the 3050-suppress_restarts branch January 21, 2016 09:35
4 participants