GCP batch backend preempt handling issue #7407

zhilizheng · 2024-04-16T18:26:07Z

The GCP Batch backend does not seem to handle preemption correctly. When a job is preempted, it very often ends in an error. The typical error is: time=“…” level=error msg=“error waiting for container:”. Judging from the Cromwell logs, Cromwell treats the preemption event as a job failure. However, the Google Batch console clearly shows "preemption notice has received and will be processed".

Backend: GCP Batch

The same workflow runs without problems on the GCP Life Sciences backend.

dspeck1 · 2024-04-17T14:38:03Z

Hi @zhilizheng - Please post the output or error logs. We will review.

zhilizheng · 2024-04-17T14:50:20Z

Hi @dspeck1 ,

Thanks for your reply. Here is the Cromwell output from when the failure happened:

=======================log start============

status_events {
  description: "Job state is set from QUEUED to SCHEDULED for job projects/A_JOB_ID."
  event_time {
    seconds: 1713287682
    nanos: 566509009
  }
  type: "STATUS_CHANGED"
}
status_events {
  description: "Job state is set from SCHEDULED to RUNNING for job projects/A_JOB_ID."
  event_time {
    seconds: 1713287919
    nanos: 96623968
  }
  type: "STATUS_CHANGED"
}
status_events {
  description: "Job state is set from RUNNING to FAILED for job projects/A_JOB_ID. Job failed due to task failures. For example, task with index 0 failed, failed task event description is Task state is updated from RUNNING to FAILED on zones/A_INSTANCE_ID due to Spot VM preemption with exit code 50001."
  event_time {
    seconds: 1713288624
    nanos: 767597866
  }
  type: "STATUS_CHANGED"
}

task_groups {
  key: "group0"
  value {
    counts {
      key: "FAILED"
      value: 1
    }
    instances {
      machine_type: "e2-standard-2"
      provisioning_model: SPOT
      task_pack: 1
      boot_disk {
        type: "pd-balanced"
        size_gb: 30
        image: "projects/batch-custom-image/global/images/batch-cos-stable-official-20240320-01-p00"
      }
    }
  }
}
run_duration {
  seconds: 705
  nanos: 670973898
}

2024-04-16 17:30:25 cromwell-system-akka.dispatchers.backend-dispatcher-2485 INFO  - GcpBatchAsyncBackendJobExecutionActor [UUID(0c7363b7)Test.mergeTest:NA:1]: Status change from Running to Failed
2024-04-16 17:30:25 cromwell-system-akka.dispatchers.backend-dispatcher-2485 INFO  - isTerminal match terminal run status with Failed
2024-04-16 17:30:25 cromwell-system-akka.dispatchers.backend-dispatcher-2485 INFO  - GCP batch job unsuccessful matched isDone
2024-04-16 17:30:25 cromwell-system-akka.dispatchers.engine-dispatcher-2358 INFO  - WorkflowManagerActor: Workflow 0c7363b7-6b8f-48cf-8f38-f66d127b305f failed (during ExecutingWorkflowState): java.lang.RuntimeException: Task Test.mergeTest:NA:1 failed for unknown reason: Failed

        at cromwell.backend.standard.StandardAsyncExecutionActor.handleExecutionFailure(StandardAsyncExecutionActor.scala:1170)
        at cromwell.backend.standard.StandardAsyncExecutionActor.handleExecutionFailure$(StandardAsyncExecutionActor.scala:1169)
        at cromwell.backend.google.batch.actors.GcpBatchAsyncBackendJobExecutionActor.handleExecutionFailure(GcpBatchAsyncBackendJobExecutionActor.scala:123)
        at cromwell.backend.standard.StandardAsyncExecutionActor$$anonfun$handleExecutionResult$11.applyOrElse(StandardAsyncExecutionActor.scala:1442)
        at cromwell.backend.standard.StandardAsyncExecutionActor$$anonfun$handleExecutionResult$11.applyOrElse(StandardAsyncExecutionActor.scala:1439)
        at scala.concurrent.impl.Promise$Transformation.run(Promise.scala:490)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:49)
        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2024-04-16 17:30:28 cromwell-system-akka.dispatchers.engine-dispatcher-2309 INFO  - WorkflowManagerActor: Workflow actor for 0c7363b7-6b8f-48cf-8f38-f66d127b305f completed with status 'Failed'. The workflow will be removed from the workflow store.

=======================log end============

Thanks. If I run the WDL again, it works without any problem. The jobs that fail are always ones that were preempted.
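
To make the expected behavior concrete, here is a minimal sketch in Python (not Cromwell's actual code) of how a failed job could be told apart from a preempted one using the status event description in the log above. It assumes that Batch reports Spot VM preemption with exit code 50001, as that log shows; the function names are hypothetical and only for illustration.

```python
import re

# Assumption: Batch reports Spot VM preemption with exit code 50001,
# as seen in the status event description in the log above.
PREEMPTION_EXIT_CODE = 50001

_EXIT_CODE_RE = re.compile(r"exit code (\d+)")


def exit_code_from_description(description):
    """Return the exit code mentioned in a status event description, or None."""
    match = _EXIT_CODE_RE.search(description)
    return int(match.group(1)) if match else None


def is_preemption(description):
    """True if the description reports the assumed preemption exit code."""
    return exit_code_from_description(description) == PREEMPTION_EXIT_CODE


def classify(description):
    """Hypothetical classification: 'preempted' jobs are candidates for retry."""
    return "preempted" if is_preemption(description) else "failed"


event = (
    "Job state is set from RUNNING to FAILED for job projects/A_JOB_ID. "
    "Task state is updated from RUNNING to FAILED on zones/A_INSTANCE_ID "
    "due to Spot VM preemption with exit code 50001."
)
print(classify(event))
```

With a check like this, the event above would be classified as a preemption and retried, rather than reported as "failed for unknown reason".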

Regards,
Zhili

aednichols added the GCP Batch label May 9, 2024

AlexITC added a commit that referenced this issue May 21, 2024

Try fixing preemption errors from GCP

3c9d020

In theory, this solves #7407
