Actor stuck in STATUS_SUSPENDING #50

@thockin

Expected Behavior

The actor is suspended and reaches STATUS_SUSPENDED.

Actual Behavior

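The actor stays stuck in STATUS_SUSPENDING, its worker stays ASSIGNED, and every retry of the suspend command fails.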

Steps to Reproduce the Problem

  1. Run the counter demo
  2. Suspend an actor, for example with `kubectl ate suspend actor ctr1`

Version: 18c86

$ kubectl ate suspend actor ctr1
Error: failed to suspend actor: rpc error: code = Unknown desc = while calling ateom.CheckpointWorkload: rpc error: code = Unknown desc = while deleting pause container: while running `runsc delete`: exit status 128

$ kubectl ate suspend actor ctr1
Error: failed to suspend actor: rpc error: code = Unknown desc = while calling ateom.CheckpointWorkload: rpc error: code = Unknown desc = while checkpointing pause: while running `runsc checkpoint`: exit status 128

$ kubectl ate get workers
NAMESPACE          POOL      POD                                   STATUS     ASSIGNED ACTOR
ate-demo-counter   counter   counter-deployment-585c7fc5cd-x26b5   FREE       <none>
ate-demo-counter   counter   counter-deployment-585c7fc5cd-rrtqm   FREE       <none>
ate-demo-counter   counter   counter-deployment-585c7fc5cd-rgrb9   FREE       <none>
ate-demo-counter   counter   counter-deployment-585c7fc5cd-zt776   FREE       <none>
ate-demo-counter   counter   counter-deployment-585c7fc5cd-4r9p9   ASSIGNED   ate-demo-counter/counter/ctr1

$ kubectl ate get actors
NAMESPACE          TEMPLATE   ID                                     STATUS              ATEOM POD                                              ATEOM IP      VERSION
ate-demo-counter   counter    ctr6                                   STATUS_SUSPENDED    <none>                                                               1
ate-demo-counter   counter    ctr2                                   STATUS_SUSPENDED    <none>                                                               5
ate-demo-counter   counter    ctr4                                   STATUS_SUSPENDED    <none>                                                               5
ate-demo-counter   counter    ctr3                                   STATUS_SUSPENDED    <none>                                                               5
ate-demo-counter   counter    ctr5                                   STATUS_SUSPENDED    <none>                                                               5
ate-demo-counter   counter    ctr1                                   STATUS_SUSPENDING   ate-demo-counter/counter-deployment-585c7fc5cd-4r9p9   10.244.0.29   4
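
To spot actors in this state without reading the table by eye, the `kubectl ate get actors` output above can be filtered with a short script. This is only a sketch: it assumes the column order shown above (NAMESPACE, TEMPLATE, ID, STATUS, ...) and that columns are separated by two or more spaces. The function name `stuck_actors` and the trimmed `SAMPLE` text are made up for illustration and are not part of the tool.

```python
import re


def stuck_actors(table_text, stuck_status="STATUS_SUSPENDING"):
    """Return the IDs of actors whose STATUS column equals stuck_status."""
    stuck = []
    for row in table_text.splitlines():
        fields = re.split(r"\s{2,}", row.strip())
        # Assumed column order: NAMESPACE, TEMPLATE, ID, STATUS, ...
        if len(fields) < 4 or fields[0] == "NAMESPACE":
            continue  # skip blank lines and the header row
        if fields[3] == stuck_status:
            stuck.append(fields[2])
    return stuck


# Trimmed copy of the table above (illustrative only).
SAMPLE = """\
NAMESPACE          TEMPLATE   ID     STATUS              ATEOM POD   ATEOM IP      VERSION
ate-demo-counter   counter    ctr6   STATUS_SUSPENDED    <none>                    1
ate-demo-counter   counter    ctr1   STATUS_SUSPENDING   ate-demo-counter/counter-deployment-585c7fc5cd-4r9p9   10.244.0.29   4
"""

print(stuck_actors(SAMPLE))
```

Only the ID and STATUS columns are read, so the empty ATEOM IP cell on suspended rows does not affect the result.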

$ kubectl ate suspend actor ctr1
Error: failed to suspend actor: rpc error: code = Unknown desc = while calling ateom.CheckpointWorkload: rpc error: code = Unknown desc = while checkpointing pause: while running `runsc checkpoint`: exit status 128

Worker Logs:

{"time":"2026-05-21T21:46:15.90675658Z","level":"INFO","msg":"Actor checkpointing","labels":{"ate.dev/actor_id":"ctr1","ate.dev/actor_template":"counter","ate.dev/actor_namespace":"ate-demo-counter"}}
{"time":"2026-05-21T21:46:15.906784733Z","level":"INFO","msg":"About to run runsc checkpoint","container":"pause"}
I0521 21:46:15.940484     140 cli.go:271] **************** gVisor ****************
I0521 21:46:15.940520     140 cli.go:272] Version release-20260511.0-42-ga7924c4ef10d-dirty, go1.25.5, amd64, 12 CPUs, linux, PID 140, PPID 1, UID 0, GID 0
I0521 21:46:15.940530     140 cli.go:274] Args: [/run/ateom-gvisor/static-files/runsc-a397be1abc2420d26bce6c70e6e2ff96c73aaaab929756c56f5e2089ea842b63 -log-format json --alsologtostderr -root /run/ateom-gvisor/actors/ate-demo-counter:counter:ctr1/runsc-state checkpoint -image-path /run/ateom-gvisor/actors/ate-demo-counter:counter:ctr1/checkpoint pause]
I0521 21:46:15.940544     140 config.go:487] Platform: systrap
I0521 21:46:15.940555     140 config.go:488] RootDir: /run/ateom-gvisor/actors/ate-demo-counter:counter:ctr1/runsc-state
I0521 21:46:15.940560     140 config.go:489] FileAccess: exclusive / Directfs: true / Overlay: root:self
I0521 21:46:15.940568     140 config.go:490] Network: sandbox
I0521 21:46:15.940574     140 config.go:491] UseCPUNums: false
W0521 21:46:15.940579     140 config.go:496] --allow-suid is disabled, SUID/SGID bits on executables will be ignored.
I0521 21:46:15.940584     140 cli.go:283] **************** gVisor ****************
I0521 21:46:15.951081     140 cli.go:310] Exiting with status: 0
W0521 21:46:15.980427     145 maincli.go:38] Cannot find if container pause exists, checking if sandbox pause is running, err: getting container state (CID: "pause"): connecting to control server at PID 34: connection refused
I0521 21:46:15.980466     145 cli.go:271] **************** gVisor ****************
I0521 21:46:15.980490     145 cli.go:272] Version release-20260511.0-42-ga7924c4ef10d-dirty, go1.25.5, amd64, 12 CPUs, linux, PID 145, PPID 1, UID 0, GID 0
I0521 21:46:15.980499     145 cli.go:274] Args: [/run/ateom-gvisor/static-files/runsc-a397be1abc2420d26bce6c70e6e2ff96c73aaaab929756c56f5e2089ea842b63 -log-format json --alsologtostderr -root /run/ateom-gvisor/actors/ate-demo-counter:counter:ctr1/runsc-state state pause]
I0521 21:46:15.980511     145 config.go:487] Platform: systrap
I0521 21:46:15.980523     145 config.go:488] RootDir: /run/ateom-gvisor/actors/ate-demo-counter:counter:ctr1/runsc-state
I0521 21:46:15.980528     145 config.go:489] FileAccess: exclusive / Directfs: true / Overlay: root:self
I0521 21:46:15.980534     145 config.go:490] Network: sandbox
I0521 21:46:15.980540     145 config.go:491] UseCPUNums: false
W0521 21:46:15.980545     145 config.go:496] --allow-suid is disabled, SUID/SGID bits on executables will be ignored.
{
  "ociVersion": "1.2.1",
  "id": "pause",
  "status": "running",
  "pid": 34,
  "bundle": "/run/ateom-gvisor/actors/ate-demo-counter:counter:ctr1/bundles/pause",
  "annotations": {
    "io.kubernetes.cri.container-name": "pause",
    "io.kubernetes.cri.container-type": "sandbox"
  }
}
I0521 21:46:15.980550     145 cli.go:283] **************** gVisor ****************
I0521 21:46:15.980576     145 cli.go:310] Exiting with status: 0
W0521 21:46:16.032647     150 maincli.go:38] Cannot find if container counter exists, checking if sandbox pause is running, err: getting container state (CID: "counter"): connecting to control server at PID 34: connection refused
W0521 21:46:16.032650     150 maincli.go:38] Sandbox isn't running anymore, marking container counter as stopped:
I0521 21:46:16.032696     150 cli.go:271] **************** gVisor ****************
I0521 21:46:16.032722     150 cli.go:272] Version release-20260511.0-42-ga7924c4ef10d-dirty, go1.25.5, amd64, 12 CPUs, linux, PID 150, PPID 1, UID 0, GID 0
I0521 21:46:16.032730     150 cli.go:274] Args: [/run/ateom-gvisor/static-files/runsc-a397be1abc2420d26bce6c70e6e2ff96c73aaaab929756c56f5e2089ea842b63 -log-format json --alsologtostderr -root /run/ateom-gvisor/actors/ate-demo-counter:counter:ctr1/runsc-state state counter]
I0521 21:46:16.032742     150 config.go:487] Platform: systrap
I0521 21:46:16.032754     150 config.go:488] RootDir: /run/ateom-gvisor/actors/ate-demo-counter:counter:ctr1/runsc-state
I0521 21:46:16.032759     150 config.go:489] FileAccess: exclusive / Directfs: true / Overlay: root:self
I0521 21:46:16.032766     150 config.go:490] Network: sandbox
I0521 21:46:16.032772     150 config.go:491] UseCPUNums: false
W0521 21:46:16.032777     150 config.go:496] --allow-suid is disabled, SUID/SGID bits on executables will be ignored.
I0521 21:46:16.032782     150 cli.go:283] **************** gVisor ****************
I0521 21:46:16.032809     150 cli.go:310] Exiting with status: 0
{
  "ociVersion": "1.2.1",
  "id": "counter",
  "status": "stopped",
  "pid": -1,
  "bundle": "/run/ateom-gvisor/actors/ate-demo-counter:counter:ctr1/bundles/counter",
  "annotations": {
    "io.kubernetes.cri.container-name": "counter",
    "io.kubernetes.cri.container-type": "container",
    "io.kubernetes.cri.sandbox-id": "pause"
  }
}
W0521 21:46:16.080113     155 maincli.go:38] Cannot find if container counter exists, checking if sandbox pause is running, err: getting container state (CID: "counter"): connecting to control server at PID 34: connection refused
W0521 21:46:16.080116     155 maincli.go:38] Sandbox isn't running anymore, marking container counter as stopped:
I0521 21:46:16.080156     155 cli.go:271] **************** gVisor ****************
I0521 21:46:16.080179     155 cli.go:272] Version release-20260511.0-42-ga7924c4ef10d-dirty, go1.25.5, amd64, 12 CPUs, linux, PID 155, PPID 1, UID 0, GID 0
I0521 21:46:16.080188     155 cli.go:274] Args: [/run/ateom-gvisor/static-files/runsc-a397be1abc2420d26bce6c70e6e2ff96c73aaaab929756c56f5e2089ea842b63 -log-format json --alsologtostderr -root /run/ateom-gvisor/actors/ate-demo-counter:counter:ctr1/runsc-state delete -force counter]
I0521 21:46:16.080199     155 config.go:487] Platform: systrap
I0521 21:46:16.080215     155 config.go:488] RootDir: /run/ateom-gvisor/actors/ate-demo-counter:counter:ctr1/runsc-state
I0521 21:46:16.080220     155 config.go:489] FileAccess: exclusive / Directfs: true / Overlay: root:self
I0521 21:46:16.080226     155 config.go:490] Network: sandbox
I0521 21:46:16.080233     155 config.go:491] UseCPUNums: false
W0521 21:46:16.080238     155 config.go:496] --allow-suid is disabled, SUID/SGID bits on executables will be ignored.
I0521 21:46:16.080244     155 cli.go:283] **************** gVisor ****************
W0521 21:46:16.080491     155 container.go:1821] Process (34) not found setting oom_score_adj
I0521 21:46:16.080522     155 cli.go:310] Exiting with status: 0
W0521 21:46:16.143706     160 maincli.go:38] Cannot find if container pause exists, checking if sandbox pause is running, err: getting container state (CID: "pause"): connecting to control server at PID 34: connection refused
W0521 21:46:16.143708     160 maincli.go:38] Sandbox isn't running anymore, marking container pause as stopped:
I0521 21:46:16.143741     160 cli.go:271] **************** gVisor ****************
I0521 21:46:16.143761     160 cli.go:272] Version release-20260511.0-42-ga7924c4ef10d-dirty, go1.25.5, amd64, 12 CPUs, linux, PID 160, PPID 1, UID 0, GID 0
I0521 21:46:16.143771     160 cli.go:274] Args: [/run/ateom-gvisor/static-files/runsc-a397be1abc2420d26bce6c70e6e2ff96c73aaaab929756c56f5e2089ea842b63 -log-format json --alsologtostderr -root /run/ateom-gvisor/actors/ate-demo-counter:counter:ctr1/runsc-state delete -force pause]
I0521 21:46:16.143783     160 config.go:487] Platform: systrap
I0521 21:46:16.143795     160 config.go:488] RootDir: /run/ateom-gvisor/actors/ate-demo-counter:counter:ctr1/runsc-state
I0521 21:46:16.143799     160 config.go:489] FileAccess: exclusive / Directfs: true / Overlay: root:self
I0521 21:46:16.143806     160 config.go:490] Network: sandbox
I0521 21:46:16.143811     160 config.go:491] UseCPUNums: false
W0521 21:46:16.143816     160 config.go:496] --allow-suid is disabled, SUID/SGID bits on executables will be ignored.
I0521 21:46:16.143821     160 cli.go:283] **************** gVisor ****************
W0521 21:46:21.053287     160 container.go:930] stopping container: removing cgroup path "/sys/fs/cgroup/pause": device or resource busy
W0521 21:46:21.053432     160 util.go:107] FATAL ERROR: destroying container: stopping container: removing cgroup path "/sys/fs/cgroup/pause": device or resource busy
destroying container: stopping container: removing cgroup path "/sys/fs/cgroup/pause": device or resource busy
{"time":"2026-05-21T21:46:21.054283909Z","level":"INFO","msg":"Handle RPC","method":"/ateom.Ateom/CheckpointWorkload","req":{"actor_template_namespace":"ate-demo-counter","actor_template_name":"counter","actor_id":"ctr1","runsc_path":"/run/ateom-gvisor/static-files/runsc-a397be1abc2420d26bce6c70e6e2ff96c73aaaab929756c56f5e2089ea842b63","spec":{"containers":[{"name":"counter"}]}},"resp":null,"err":"while deleting pause container: while running `runsc delete`: exit status 128","elapsed-time":"5.14752792s"}
{"time":"2026-05-21T21:46:30.220620132Z","level":"INFO","msg":"Actor checkpointing","labels":{"ate.dev/actor_id":"ctr1","ate.dev/actor_template":"counter","ate.dev/actor_namespace":"ate-demo-counter"}}
{"time":"2026-05-21T21:46:30.220643698Z","level":"INFO","msg":"About to run runsc checkpoint","container":"pause"}
FetchSpec failed: loading container: file does not exist
{"time":"2026-05-21T21:46:30.272075347Z","level":"INFO","msg":"Handle RPC","method":"/ateom.Ateom/CheckpointWorkload","req":{"actor_template_namespace":"ate-demo-counter","actor_template_name":"counter","actor_id":"ctr1","runsc_path":"/run/ateom-gvisor/static-files/runsc-a397be1abc2420d26bce6c70e6e2ff96c73aaaab929756c56f5e2089ea842b63","spec":{"containers":[{"name":"counter"}]}},"resp":null,"err":"while checkpointing pause: while running `runsc checkpoint`: exit status 128","elapsed-time":"51.455806ms"}
{"time":"2026-05-21T21:47:13.504365359Z","level":"INFO","msg":"Actor checkpointing","labels":{"ate.dev/actor_id":"ctr1","ate.dev/actor_template":"counter","ate.dev/actor_namespace":"ate-demo-counter"}}
{"time":"2026-05-21T21:47:13.50439249Z","level":"INFO","msg":"About to run runsc checkpoint","container":"pause"}
FetchSpec failed: loading container: file does not exist
{"time":"2026-05-21T21:47:13.552016869Z","level":"INFO","msg":"Handle RPC","method":"/ateom.Ateom/CheckpointWorkload","req":{"actor_template_namespace":"ate-demo-counter","actor_template_name":"counter","actor_id":"ctr1","runsc_path":"/run/ateom-gvisor/static-files/runsc-a397be1abc2420d26bce6c70e6e2ff96c73aaaab929756c56f5e2089ea842b63","spec":{"containers":[{"name":"counter"}]}},"resp":null,"err":"while checkpointing pause: while running `runsc checkpoint`: exit status 128","elapsed-time":"47.651721ms"}
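
Reading the logs above: on the first attempt `runsc checkpoint` exits with status 0, and the failure comes later, when `runsc delete -force pause` cannot remove the cgroup path ("device or resource busy"). The two later attempts fail earlier, at `runsc checkpoint`, with "FetchSpec failed: loading container: file does not exist". Because the worker log mixes JSON lines, glog-style runsc lines, and pretty-printed `runsc state` output, a small filter can help pull out just those failure lines. The sketch below assumes the line formats seen in this log. The function name `failures` and the shortened `SAMPLE` lines are hypothetical.

```python
import json
import re

# glog-style prefix as it appears in the runsc output above:
# severity letter, MMDD, time, PID, file:line] message
GLOG = re.compile(
    r"^([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d+)\s+(\d+) (\S+:\d+)\] (.*)$"
)


def failures(lines):
    """Return (source, message) pairs for lines that report a failure."""
    found = []
    for line in lines:
        line = line.strip()
        match = GLOG.match(line)
        if match:
            if "FATAL ERROR" in match.group(6):
                found.append(("runsc", match.group(6)))
            continue
        if line.startswith("{"):
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # e.g. the first line of a pretty-printed state object
            if isinstance(entry, dict) and entry.get("err"):
                found.append((entry.get("method", "?"), entry["err"]))
        elif line.startswith("FetchSpec failed"):
            found.append(("runsc", line))
    return found


# Shortened lines from the log above (JSON entry trimmed to the relevant keys).
SAMPLE = [
    'W0521 21:46:21.053432     160 util.go:107] FATAL ERROR: destroying container: '
    'stopping container: removing cgroup path "/sys/fs/cgroup/pause": device or resource busy',
    '{"level":"INFO","msg":"Handle RPC","method":"/ateom.Ateom/CheckpointWorkload",'
    '"err":"while deleting pause container: while running `runsc delete`: exit status 128"}',
    "FetchSpec failed: loading container: file does not exist",
]

for source, message in failures(SAMPLE):
    print(source, "|", message)
```

Informational lines, such as "Exiting with status: 0" and the "Actor checkpointing" entries, are skipped, so the output is limited to the cgroup removal error, the failed RPC, and the FetchSpec error.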
