GCP batch backend preempt handling issue #7407

Open
zhilizheng opened this issue Apr 16, 2024 · 2 comments

Comments

@zhilizheng

The GCP Batch backend's preemption handling seems to have an issue. When a preemption happens, the job is very likely to end in an error. The typical error is: time=“…” level=error msg=“error waiting for container:”. Judging from the Cromwell logs, Cromwell treats the preemption event as an error. However, the Google Batch console clearly shows "preemption notice has received and will be processed".

GCP batch

The same workflow works perfectly on the GCP Life Sciences backend.

@dspeck1
Collaborator

dspeck1 commented Apr 17, 2024

Hi @zhilizheng - Please post the output or error logs. We will review.

@zhilizheng
Author

zhilizheng commented Apr 17, 2024

Hi @dspeck1,

Thanks for your reply. These are the Cromwell outputs from when it happened:

=======================log start============

status_events {
  description: "Job state is set from QUEUED to SCHEDULED for job projects/A_JOB_ID."
  event_time {
    seconds: 1713287682
    nanos: 566509009
  }
  type: "STATUS_CHANGED"
}
status_events {
  description: "Job state is set from SCHEDULED to RUNNING for job projects/A_JOB_ID."
  event_time {
    seconds: 1713287919
    nanos: 96623968
  }
  type: "STATUS_CHANGED"
}
status_events {
  description: "Job state is set from RUNNING to FAILED for job projects/A_JOB_ID. Job failed due to task failures. For example, task with index 0 failed, failed task event description is Task state is updated from RUNNING to FAILED on zones/A_INSTANCE_ID due to Spot VM preemption with exit code 50001."
  event_time {
    seconds: 1713288624
    nanos: 767597866
  }
  type: "STATUS_CHANGED"
}

task_groups {
  key: "group0"
  value {
    counts {
      key: "FAILED"
      value: 1
    }
    instances {
      machine_type: "e2-standard-2"
      provisioning_model: SPOT
      task_pack: 1
      boot_disk {
        type: "pd-balanced"
        size_gb: 30
        image: "projects/batch-custom-image/global/images/batch-cos-stable-official-20240320-01-p00"
      }
    }
  }
}
run_duration {
  seconds: 705
  nanos: 670973898
}

2024-04-16 17:30:25 cromwell-system-akka.dispatchers.backend-dispatcher-2485 INFO  - GcpBatchAsyncBackendJobExecutionActor [UUID(0c7363b7)Test.mergeTest:NA:1]: Status change from Running to Failed
2024-04-16 17:30:25 cromwell-system-akka.dispatchers.backend-dispatcher-2485 INFO  - isTerminal match terminal run status with Failed
2024-04-16 17:30:25 cromwell-system-akka.dispatchers.backend-dispatcher-2485 INFO  - GCP batch job unsuccessful matched isDone
2024-04-16 17:30:25 cromwell-system-akka.dispatchers.engine-dispatcher-2358 INFO  - WorkflowManagerActor: Workflow 0c7363b7-6b8f-48cf-8f38-f66d127b305f failed (during ExecutingWorkflowState): java.lang.RuntimeException: Task Test.mergeTest:NA:1 failed for unknown reason: Failed

        at cromwell.backend.standard.StandardAsyncExecutionActor.handleExecutionFailure(StandardAsyncExecutionActor.scala:1170)
        at cromwell.backend.standard.StandardAsyncExecutionActor.handleExecutionFailure$(StandardAsyncExecutionActor.scala:1169)
        at cromwell.backend.google.batch.actors.GcpBatchAsyncBackendJobExecutionActor.handleExecutionFailure(GcpBatchAsyncBackendJobExecutionActor.scala:123)
        at cromwell.backend.standard.StandardAsyncExecutionActor$$anonfun$handleExecutionResult$11.applyOrElse(StandardAsyncExecutionActor.scala:1442)
        at cromwell.backend.standard.StandardAsyncExecutionActor$$anonfun$handleExecutionResult$11.applyOrElse(StandardAsyncExecutionActor.scala:1439)
        at scala.concurrent.impl.Promise$Transformation.run(Promise.scala:490)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:49)
        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2024-04-16 17:30:28 cromwell-system-akka.dispatchers.engine-dispatcher-2309 INFO  - WorkflowManagerActor: Workflow actor for 0c7363b7-6b8f-48cf-8f38-f66d127b305f completed with status 'Failed'. The workflow will be removed from the workflow store.

=======================log end============
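
As a cross-check on the status events in the log above, the event timestamps can be decoded with a short Python sketch. Python is used only for illustration, the constants are copied from the log, and the helper names are made up for this example.

```python
from datetime import datetime, timezone

NANOS_PER_SECOND = 1_000_000_000

# (seconds, nanos) pairs copied from the status_events in the log above.
RUNNING_AT = (1713287919, 96623968)   # SCHEDULED -> RUNNING
FAILED_AT = (1713288624, 767597866)   # RUNNING -> FAILED


def to_nanos(seconds, nanos):
    """Collapse a (seconds, nanos) pair into integer nanoseconds."""
    return seconds * NANOS_PER_SECOND + nanos


def elapsed(start, end):
    """Return the interval between two timestamps as (seconds, nanos)."""
    return divmod(to_nanos(*end) - to_nanos(*start), NANOS_PER_SECOND)


def to_utc(seconds, nanos):
    """Convert a (seconds, nanos) pair to a timezone-aware UTC datetime."""
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.replace(microsecond=nanos // 1000)


if __name__ == "__main__":
    print(elapsed(RUNNING_AT, FAILED_AT))
    print(to_utc(*FAILED_AT).strftime("%Y-%m-%d %H:%M:%S"))
```

With these values the RUNNING to FAILED interval is 705 s and 670973898 ns, which equals the reported run_duration, and the FAILED event decodes to 2024-04-16 17:30:24 UTC. That is one second before the 17:30:25 Cromwell log lines, assuming the Cromwell log clock is in UTC.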

Thanks. If I run the WDL again, it works without any problem. The jobs that fail are always the ones that were preempted.

Regards,
Zhili
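
The status event in the log above carries exit code 50001 together with the words "Spot VM preemption", while Cromwell reports the task as failed "for unknown reason". A minimal Python sketch of how such an event could be classified as a preemption and retried is given here for illustration only. It is not Cromwell's code (Cromwell is written in Scala), the function names and the `max_preemptible_attempts` parameter are hypothetical, and treating 50001 as the preemption code is an assumption based on the event text in this issue.

```python
import re

# Assumption: 50001 marks a Spot VM preemption, as the status event in this
# issue describes it. Other exit codes are treated as ordinary failures.
PREEMPTION_EXIT_CODE = 50001

_EXIT_CODE_RE = re.compile(r"exit code (\d+)")

# Event description copied from the log in this issue (joined onto one line).
EXAMPLE_DESCRIPTION = (
    "Job state is set from RUNNING to FAILED for job projects/A_JOB_ID. "
    "Job failed due to task failures. For example, task with index 0 failed, "
    "failed task event description is Task state is updated from RUNNING to "
    "FAILED on zones/A_INSTANCE_ID due to Spot VM preemption with exit code 50001."
)


def exit_code_from_description(description):
    """Return the exit code mentioned in an event description, or None."""
    match = _EXIT_CODE_RE.search(description)
    return int(match.group(1)) if match else None


def is_preemption(description):
    """True if the event description reports the preemption exit code."""
    return exit_code_from_description(description) == PREEMPTION_EXIT_CODE


def next_action(description, attempts_used, max_preemptible_attempts):
    """Decide whether a failed task should be retried or marked as failed.

    Both counters are hypothetical bookkeeping values for this sketch.
    """
    if is_preemption(description) and attempts_used < max_preemptible_attempts:
        return "retry"
    return "fail"


if __name__ == "__main__":
    print(next_action(EXAMPLE_DESCRIPTION, attempts_used=0, max_preemptible_attempts=3))
```

Parsing the free-text description is fragile, so a real fix would more likely read a structured status or exit-code field if the API exposes one; the sketch only shows that the information needed to tell a preemption from a genuine failure is present in the event.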

AlexITC added a commit that referenced this issue May 21, 2024

3 participants