Clarify PAPI Error Code 10 Message 14 #3855

Open
ruchim opened this issue Jul 3, 2018 · 3 comments

Comments

@ruchim
Contributor

ruchim commented Jul 3, 2018

Cromwell treats Error Code 10, Message 14 as a preemption error. When a preemptible machine fails with Error Code 10, Message 14, the user usually doesn't see it, because Cromwell retries the preempted job. However, we've observed that this error can also occur on a non-preemptible machine, where it isn't retried and causes the workflow to fail.

The problem here is that the message gives no indication that this is a transient failure and that the best response is to retry the workflow. Adjust the error message to include more information about the nature of this error and the actions one can take to mitigate this failure mode.
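
To make the request concrete, here is a minimal sketch of what a more informative message could look like. It is written in Python purely for illustration and is not Cromwell code: the function name, the constants, and the exact wording are all assumptions, and only the code/message pair (10 and 14) and the retry behaviour come from this issue.

```python
from typing import Optional

# Hypothetical constants: the code/message pair discussed in this issue.
PAPI_ERROR_CODE = 10
VM_STOPPED_MESSAGE_CODE = 14


def describe_papi_failure(error_code: int,
                          message_code: Optional[int],
                          preemptible: bool) -> str:
    """Build a user-facing explanation for a PAPI failure.

    Illustrative wording only; this is not the text Cromwell emits.
    """
    base = f"PAPI error code {error_code}"
    if message_code is not None:
        base += f", message {message_code}"

    if error_code == PAPI_ERROR_CODE and message_code == VM_STOPPED_MESSAGE_CODE:
        if preemptible:
            # Preemptible VMs: the failure is handled as a preemption.
            return base + ": the preemptible VM was stopped; treated as a preemption and retried."
        # Non-preemptible VMs: not retried automatically, so tell the user what to do.
        return (base + ": the VM stopped unexpectedly. This is a transient failure "
                "rather than a problem with the task itself; retry the workflow.")

    return base + ": no additional guidance is available for this error."
```

For example, `describe_papi_failure(10, 14, preemptible=False)` yields a message that names the failure as transient and suggests retrying, which is the information the current message lacks.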

@juhawilppu

I also saw this problem. The VM is not preemptible and I'm using Cromwell v32. A lot of shards spend 10 minutes in "Waiting for quota" when this problem happens. The instance that gives PAPI Error Code 10 was able to get a virtual machine, though. Maybe there is a timeout for "Waiting for quota" that causes all the other shards to fail with Error Code 10, even though there was nothing wrong with this particular shard?

Task to_bam_workflow.BaseRecalibrator:3:1 failed. The job was stopped before the command finished. PAPI error code 10.  Message: 14: VM ggp-8822042418103915125 stopped unexpectedly.
java.lang.Exception: Task to_bam_workflow.BaseRecalibrator:3:1 failed. The job was stopped before the command finished. PAPI error code 10.  Message: 14: VM ggp-8822042418103915125 stopped unexpectedly.
	at cromwell.backend.google.pipelines.common.PipelinesApiAsyncBackendJobExecutionActor$.StandardException(PipelinesApiAsyncBackendJobExecutionActor.scala:73)
	at cromwell.backend.google.pipelines.common.PipelinesApiAsyncBackendJobExecutionActor.handleFailedRunStatus$1(PipelinesApiAsyncBackendJobExecutionActor.scala:520)
	at cromwell.backend.google.pipelines.common.PipelinesApiAsyncBackendJobExecutionActor.handleExecutionFailure(PipelinesApiAsyncBackendJobExecutionActor.scala:527)
	at cromwell.backend.google.pipelines.common.PipelinesApiAsyncBackendJobExecutionActor.handleExecutionFailure(PipelinesApiAsyncBackendJobExecutionActor.scala:77)
	at cromwell.backend.standard.StandardAsyncExecutionActor$$anonfun$handleExecutionResult$5.applyOrElse(StandardAsyncExecutionActor.scala:1019)
	at cromwell.backend.standard.StandardAsyncExecutionActor$$anonfun$handleExecutionResult$5.applyOrElse(StandardAsyncExecutionActor.scala:1015)
	at scala.concurrent.Future.$anonfun$recoverWith$1(Future.scala:413)
	at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:37)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
	at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
	at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:91)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
	at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81)
	at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:91)
	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:43)
	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
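
Until this is retried automatically, one way to spot the failure from the outside is to parse the message shown in the log above. The following is a rough sketch under stated assumptions: the regular expression is derived only from the single log line quoted in this comment, and `PapiError`, `parse_papi_error`, and `looks_transient` are hypothetical names, not part of Cromwell.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Pattern inferred from the one log line quoted above; other Cromwell
# versions may format the message differently.
_PAPI_ERROR = re.compile(r"PAPI error code (\d+)\.\s+Message: (\d+): (.+)")


@dataclass(frozen=True)
class PapiError:
    error_code: int
    message_code: int
    detail: str

    @property
    def looks_transient(self) -> bool:
        # Code 10 with message 14 is the pair this issue describes as transient.
        return self.error_code == 10 and self.message_code == 14


def parse_papi_error(line: str) -> Optional[PapiError]:
    """Extract the PAPI error code, message code, and detail text from a log line."""
    match = _PAPI_ERROR.search(line)
    if match is None:
        return None
    return PapiError(int(match.group(1)), int(match.group(2)), match.group(3).strip())
```

A caller could run `parse_papi_error` over the failure message and resubmit the workflow when `looks_transient` is true; whether resubmitting is appropriate for a given pipeline is a judgment call left to the user.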

@ruchim
Contributor Author

ruchim commented Aug 30, 2018

@juhawilppu We hope to have this automatically retried soon, on the order of months.
