"no running task found" issue after upgrade to 7.7.0 #8172

Open
igor-nikiforov opened this issue Mar 16, 2022 · 28 comments

@igor-nikiforov

igor-nikiforov commented Mar 16, 2022

Summary

After upgrading from 7.6.0 to 7.7.0, we started seeing many errors in resources like the following:

run check: start process: backend error: Exit status: 500, message: {"Type":"","Message":"task retrieval: no running task found: task d7139440-721d-4258-79bb-25ad9760bab9 not found: not found","Handle":"","ProcessID":"","Binary":""}

We are using the containerd runtime and running Concourse on Kubernetes. After reverting back to 7.6.0, all errors disappeared.

Steps to reproduce

Upgrade from 7.6.0 to 7.7.0

Expected results

No errors :)

Actual results

See above.

Triaging info

  • Concourse version: 7.7.0
  • containerd: 1.5.8
  • Kubernetes: 1.21.7
@xtremerui
Contributor

Hi, where is this error happening, e.g. resource check, task step, or put step?

@igor-nikiforov
Author

@xtremerui resource check.

@xtremerui xtremerui self-assigned this Mar 17, 2022
@dhml

dhml commented Mar 19, 2022

FWIW, pausing the pipeline then un-pausing it after something longer than resource.check_every appears to fix this issue.
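For anyone who wants to script that workaround, a rough sketch with the fly CLI (the target, pipeline name, and wait time are placeholders):

# Hypothetical sketch of the pause/un-pause workaround; adjust the sleep so it exceeds the resource's check_every.
fly -t ci pause-pipeline -p my-pipeline
sleep 120
fly -t ci unpause-pipeline -p my-pipeline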

@nekrondev

The workaround did help fix the issue I experienced with regular S3 storage checks for new deployment artifacts that ran into the "task not found" error.

I got this issue by upgrading from 7.7.0 to 7.7.1: shutting down the worker node, updating the worker node, starting the worker node, then shutting down the web node, upgrading the web node, and starting the web node (this was done manually on a development system).

On the production system running the same pipelines, I did the upgrade dance on the web node first and on the worker node afterwards, resulting in no "no running task found" issues, but this might be random.

@clarafu clarafu added this to the v7.8.0 milestone Apr 14, 2022
@clarafu
Contributor

clarafu commented Apr 20, 2022

Is there any other information that you can give us? Does this happen occasionally and keep flaking with this error, or did it only happen at the beginning when you upgraded and never again? Was everyone who saw this error using containerd?

@0x450x6c

Same issue.

It happened after a Docker restart, with no version change. I'm using version 7.7.1 with containerd.

@navdeep-pama navdeep-pama removed this from the v7.8.0 milestone May 31, 2022
@aeijdenberg
Contributor

I'm seeing this issue on 7.7.1 using containerd with the Helm chart (when using runc on k3s I see errors related to cgroups). I had Concourse deployed on k3s, and for various reasons I wanted to shut everything down and then bring it back up again. I ran:

sudo service k3s stop
/usr/local/bin/k3s-killall.sh
sudo service k3s start

and now I'm in that state for most (all?) of my checks. Hoping some kind of TTL will kick in... It looks like renaming a resource in a pipeline is enough to unstick it, but that is not a terribly satisfying workaround.

@aeijdenberg
Contributor

FWIW, pausing the pipeline then un-pausing it after something longer than resource.check_every appears to fix this issue.

I wanted to try this, but I'd cunningly set some of these resources to check_every: never (as I manually trigger the build), so waiting longer than never is not an option. :)
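For reference, with check_every: never the check only runs when triggered manually, roughly like this (target and resource names are placeholders):

# Hypothetical sketch: force a one-off check of a resource that has check_every: never.
fly -t ci check-resource -r my-pipeline/my-resource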

@aeijdenberg
Contributor

Problem cleared itself up overnight with no intervention.

@nekrondev

nekrondev commented Jun 7, 2022

FYI: just upgraded from 7.7.1 to 7.8.0 without any issues (non-clustered deployment, just plain docker-compose).
The update order was the web node first (stop, upgrade, start), then the worker node (stop, upgrade, start). Previously we did it the reverse way, starting with the worker node and doing the web node last, which caused the issue.
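A rough sketch of that order for a docker-compose setup (the service names "web" and "worker" are assumptions about this deployment, and it assumes the image tag in docker-compose.yml has already been bumped):

# Hypothetical sketch: upgrade the web node first, then the worker node.
docker-compose stop web
docker-compose pull web        # fetch the new Concourse image for the web service
docker-compose up -d web
docker-compose stop worker
docker-compose pull worker
docker-compose up -d worker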

@holgerstolzenberg

holgerstolzenberg commented Jun 10, 2022

We are also seeing these issues; we are on version 7.7.1 at the moment.

[Screenshot attached: failing resource checks, 2022-06-10 at 10:55:54]

As for the circumstances:

  • we did not upgrade recently
  • the issue appears suddenly and randomly throughout all pipelines, but mostly affects resources whose checks run on the same worker
  • the issue goes away when Concourse schedules the resource check on another worker
  • the worker causing the resource checks to fail is operational and can execute jobs/tasks
  • not sure, but the check seems to work again on the problematic worker after it has run on another worker in the meantime
  • OR sometimes it seems like the issue just goes away after some timeout

@holgerstolzenberg

holgerstolzenberg commented Jun 27, 2022

Just an update on this one.

We updated to Concourse 7.8.1 last week and the issue is still there; I'd even say it has become worse. Now not only resource checks but also tasks are affected,
and the issue pops up throughout the whole day.

I tried a lot of things to analyze and overcome this, but nothing has helped so far.

Facts:

  • it is absolutely random and hits harder under load
  • it occurs in every pipeline and Concourse lights up like a Christmas tree
  • it self-heals after some time, but that is no solution for mission-critical stuff; the team really felt the pain last week
  • logging into the worker container, you can see that the volume the task wants to use is really gone, it is just gone

I've extracted the following log entries so far:

{"timestamp":"2022-06-23T11:53:49.986269148Z","level":"info","source":"worker","message":"worker.garden.garden-server.create.created","data":{"request":{"Handle":"387c694c-3556-4d43-495c-834610e7f343","GraceTime":0,"RootFSPath":"raw:///worker-state/volumes/live/a4a75891-4d12-43f0-45bd-56c0c3a9fe65/volume","BindMounts":[{"src_path":"/worker-state/volumes/live/09de2f61-be17-45b7-7023-4e572565d203/volume","dst_path":"scratch","mode":1}],"Network":"","Privileged":false,"Limits":{"bandwidth_limits":{},"cpu_limits":{},"disk_limits":{},"memory_limits":{},"pid_limits":{}}},"session":"1.2.24457"}}
{"timestamp":"2022-06-23T11:53:50.019615220Z","level":"info","source":"worker","message":"worker.garden.garden-server.get-properties.got-properties","data":{"handle":"387c694c-3556-4d43-495c-834610e7f343","session":"1.2.24460"}}
{"timestamp":"2022-06-23T11:53:50.179611267Z","level":"info","source":"worker","message":"worker.garden.garden-server.run.spawned","data":{"handle":"387c694c-3556-4d43-495c-834610e7f343","id":"8c64794d-51d5-4bad-52d2-387b4375c6bb","session":"1.2.24461","spec":{"Path":"/opt/resource/check","Dir":"","User":"","Limits":{},"TTY":null}}}
{"timestamp":"2022-06-23T11:53:58.440124752Z","level":"info","source":"worker","message":"worker.garden.garden-server.run.exited","data":{"handle":"387c694c-3556-4d43-495c-834610e7f343","id":"8c64794d-51d5-4bad-52d2-387b4375c6bb","session":"1.2.24461","status":0}}
{"timestamp":"2022-06-23T11:57:05.835467090Z","level":"info","source":"worker","message":"worker.garden.garden-server.get-properties.got-properties","data":{"handle":"387c694c-3556-4d43-495c-834610e7f343","session":"1.2.24979"}}
{"timestamp":"2022-06-23T11:57:06.067347438Z","level":"info","source":"worker","message":"worker.garden.garden-server.run.spawned","data":{"handle":"387c694c-3556-4d43-495c-834610e7f343","id":"49be3c07-64e5-4f62-6715-54427d31a925","session":"1.2.24980","spec":{"Path":"/opt/resource/check","Dir":"","User":"","Limits":{},"TTY":null}}}
{"timestamp":"2022-06-23T11:58:54.559256144Z","level":"error","source":"worker","message":"worker.garden.garden-server.attach.failed","data":{"error":"task: no running task found: task 387c694c-3556-4d43-495c-834610e7f343 not found: not found","handle":"387c694c-3556-4d43-495c-834610e7f343","session":"1.2.13"}}
{"timestamp":"2022-06-23T12:05:32.874380615Z","level":"info","source":"worker","message":"worker.garden.garden-server.get-properties.got-properties","data":{"handle":"387c694c-3556-4d43-495c-834610e7f343","session":"1.2.927"}}
{"timestamp":"2022-06-23T12:05:32.879314608Z","level":"error","source":"worker","message":"worker.garden.garden-server.run.failed","data":{"error":"task retrieval: no running task found: task 387c694c-3556-4d43-495c-834610e7f343 not found: not found","handle":"387c694c-3556-4d43-495c-834610e7f343","session":"1.2.928"}}

A colleague of mine stumbled upon this issue and we suspected this to be a possible reason:
containerd/containerd#2202

So we updated to the latest focal kernel and deactivated the hugepages memory feature, with no success.

Then I tried to tweak some of the resource check/GC/reaping time settings, as I suspected some kind of race condition, also with no success.

As a last resort, we decided to stop using the containerd runtime to make sure the issue is not related to it - and bang - the pipelines are okay now.
I'll definitely monitor this further, but it looks like using the guardian runtime works around this issue.
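For anyone else wanting to try the same thing, a rough sketch; to the best of my knowledge the worker runtime is selected via the --runtime flag or the CONCOURSE_RUNTIME environment variable, but double-check the exact names against your Concourse version:

# Hypothetical sketch: select the worker runtime (containerd, guardian, or houdini).
export CONCOURSE_RUNTIME=guardian   # set in the worker's environment before it starts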

Still, I think this needs investigation. The volumes just being gone with containerd is not good.

@xtremerui
Contributor

@holgerstolzenberg thanks for the updates. I appreciate the time you spent on it.

So far there is no good way to reproduce this issue, and in our own CI we haven't noticed such an error. We'd like to see if more similar cases are reported, and we're hoping for the next containerd update.

@holgerstolzenberg

We tried to run the Concourse Helm chart within OKD. Here we are forced to use the containerd runtime, as guardian does not work on Fedora (cgroups v2).

We migrated a test pipeline and see the exact same problem on the vanilla OKD cluster. This time it is a real show stopper, as we cannot switch to guardian here.

@vixus0
Contributor

vixus0 commented Jul 20, 2022

Also seeing the following for resource checks and task executions:

wait for process completion: backend error: Exit status: 500, message: {"Type":"","Message":"task: no running task found: task 5fd6f69a-480b-4390-69a9-1066d61e565d not found: not found","Handle":"","ProcessID":"","Binary":""}

It seems to only affect deployments with lots of activity and becomes worse under high load.

Deployed versions:

  • Concourse: v7.8.0 (Helm chart: 17.0.0)
  • Kubernetes: v1.22.6-eks-7d68063
  • Containerd: v1.6.5

@igor-nikiforov
Author

This issue still exists on Concourse v7.8.1 and containerd v1.6.6.

Is there any ETA for when we could expect a fix for this issue?

Thanks.

@nekrondev

nekrondev commented Aug 3, 2022

@xtremerui I would suggest looking into the abandoned PR #7042, as it explains the issues many of us are having.

@xtremerui
Contributor

xtremerui commented Aug 3, 2022

@nekrondev thanks for the info, that's helpful. However, I still don't understand why things become worse after upgrading from 7.6.

In #7042, the error shows up after a worker restart, because the task is gone while the container is persisted (there we know exactly why the task was gone).
In this issue, however, the error can show up under heavy load.

Combining the comments above (and also containerd/containerd#2202), a possible explanation is that containerd needs more resources (or worker memory consumption becomes higher) to run a task (a namespaced process) in Concourse > v7.6, while containerd itself is sensitive to memory-heavy tasks, which might get reaped without the ATC's awareness.

Note that #7042 shows attempts to either better GC those persisted containers when the task is dead, or to recreate the task from the persisted container. But neither of them would solve the underlying problem, i.e. preventing the task from being reaped.

At the moment I can only suggest trying to increase the worker memory to see if it helps the situation.
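For Helm-based deployments, that would look roughly like this (release name, chart name, and sizes are placeholders, and the worker.resources value path is an assumption worth checking against your chart version):

# Hypothetical sketch: raise the worker memory request/limit on a Helm deployment.
helm upgrade concourse concourse/concourse --reuse-values \
  --set worker.resources.requests.memory=8Gi \
  --set worker.resources.limits.memory=8Gi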

Also, after going through the diffs between 7.6 and 7.7, I think #8048 might play a part in this issue, in that it might spawn many check containers for a resource type and overload the worker. So please keep an eye on the resource type a resource is using when you see the error happen in its check. We'd like to know if the check errors happen to resources that share the same resource type (or several resource types that are very commonly used).

@evanchaoli

@IanTaylor-catapult

Hit this same issue while using the containerd runtime. Set up a secondary worker and restarted the primary/web nodes to get it working again.

From what I saw, if you only have one worker and it gets into this state, it will never recover.

@kardashov

kardashov commented Oct 11, 2022

Facing the same error on Concourse 7.5.0 deployed to a GKE K8s cluster using Helm. We run 5 worker nodes.
In my case this occurs only for tasks with heavy CPU load, after 5-10 minutes of task execution.
It was possible to fix it by applying CPU limits to the CPU-heavy tasks (see the sketch after the logs below).

backend error: Exit status: 500, message: {"Type":"","Message":"task: no running task found: task fbffd41f-73d0-3552-4aaf-965a3b006aef not found: not found","Handle":"","ProcessID":"","Binary":""}"

At the same time, the following log entries can be observed in one of the concourse-worker pods:

{
"textPayload": "time="2022-10-05T12:43:20.314245070Z" level=info msg="starting signal loop" namespace=concourse path=/run/containerd/io.containerd.runtime.v2.task/concourse/fbffd41f-73d0-4435-4aaf-965a3b006aef pid=1717658\n",
}
++++++++++++++++++++++++++++++++++++++++++++++++++++
{
"jsonPayload": {
"timestamp": "2022-10-05T12:43:20.541373463Z",
"level": "info",
"message": "worker.garden.garden-server.create.created",
"data": {
"session": "1.2.84461",
"request": {
"Limits": {
"cpu_limits": {},
"bandwidth_limits": {},
"pid_limits": {},
"memory_limits": {},
"disk_limits": {}
},
"GraceTime": 0,
"BindMounts": [
REDACTED
],
"RootFSPath": "raw:///concourse-work-dir/volumes/live/4f7f9829-6399-45e4-4a6d-65119f58a48f/volume/rootfs",
"Privileged": false,
"Network": "",
"Handle": "fbffd41f-73d0-4435-4aaf-965a3b006aef"
}
},
"source": "worker"
}
+++++++++++++++++++++++++++++++++++++++++++++++++++
{
"jsonPayload": {
"data": {
"handle": "fbffd41f-73d0-4435-4aaf-965a3b006aef",
"session": "1.2.84462"
},
"message": "worker.garden.garden-server.get-properties.got-properties",
"level": "info",
"source": "worker",
"timestamp": "2022-10-05T12:43:20.577993390Z"
}
+++++++++++++++++++++++++++++++++++++++++++++++++++++++
{
"jsonPayload": {
"source": "worker",
"message": "worker.garden.garden-server.attach.failed",
"data": {
"handle": "fbffd41f-73d0-4435-4aaf-965a3b006aef",
"session": "1.2.84463",
"error": "load proc: no running process found: process does not exist task: not found"
},
"level": "error",
"timestamp": "2022-10-05T12:43:20.580878222Z"
}
}
++++++++++++++++++++++++++++++++++++++++++++++++++++
{
"jsonPayload": {
"level": "info",
"data": {
"session": "1.2.84464",
"handle": "fbffd41f-73d0-4435-4aaf-965a3b006aef"
},
"source": "worker",
"message": "worker.garden.garden-server.get-properties.got-properties",
"timestamp": "2022-10-05T12:43:20.582814559Z"
}
}
++++++++++++++++++++++++++++++++++++++++++++++
{
"jsonPayload": {
"source": "worker",
"level": "info",
"message": "worker.garden.garden-server.run.spawned",
"timestamp": "2022-10-05T12:43:20.638253661Z",
"data": {
"session": "1.2.84465",
"spec": {
"User": "",
"Path": "my-repo/my-script.sh",
"TTY": {
"window_size": {
"columns": 500,
"rows": 500
}
},
"Dir": "/tmp/build/712d8b9f",
"Limits": {}
},
"handle": "fbffd41f-73d0-4435-4aaf-965a3b006aef",
"id": "task"
}
}
}
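The CPU-limit fix mentioned above, as a rough sketch: Concourse task configs accept container_limits, so a heavy task can be capped and test-run with fly execute (the file name, image, limit value, and target are placeholders):

# Hypothetical sketch: cap a CPU-heavy task via container_limits and run it with fly execute.
cat > heavy-task.yml <<'EOF'
platform: linux
image_resource:
  type: registry-image
  source: {repository: busybox}
container_limits:
  cpu: 512              # CPU shares for the task container
run:
  path: sh
  args: ["-c", "echo simulating a heavy workload"]
EOF
fly -t ci execute -c heavy-task.yml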

@4x0v7

4x0v7 commented Nov 13, 2022

Got this issue with v7.8.3 after my laptop blue-screened and rebooted. I'm running with docker-compose locally: a web+worker container and a postgres container.
Not running anything too strenuous (one job) and everything had been working fine.
Pausing and then un-pausing the pipeline fixed it for me.

@peterhaochen47

Seeing this issue in v7.9.0 (hush-house) consistently in more CPU-intensive tasks, whereas less CPU-intensive tasks are working fine.

@readmodifywrite

Seeing this error as well, on both 7.6.0 and 7.9.0. It happens very quickly on relatively simple pipelines (almost instantly, sometimes). It is just a resource check for a small git repo. I cannot imagine this is a CPU or memory loading issue. It has happened with the same pipelines running on 2 different machines so far.

Single worker, containerd, running from a carbon copy of the quickstart docker-compose.

It does occasionally seem to clear itself up, but it might sit there broken for a while. That's not really a remedy.

@RCM7

RCM7 commented Feb 13, 2023

We're also getting this error at 7.9.0

@akobir-mp

Also experiencing this in 7.9.0. @xtremerui, can we have another look at this one? It's very prevalent.

@tobyhersey-sky

Also seeing this issue in v7.9.1

@bheemann

Also seeing this issue in v7.9.1

@saj

saj commented Dec 13, 2023

We hit this with an in-place worker restart. No Concourse upgrade or downgrade. Concourse v7.10.0.

The dashboard was full of orange triangles.
They appeared to dissipate over time with no action on my part.

We run our workers on over-provisioned bare metal. The environment is usually very stable, and is nowhere near capacity. I only restarted the workers today while trying to debug an unrelated problem. We also use the containerd runtime. (I did not land the workers prior to restart. No worker state was lost between restarts, so I did not think it would be necessary to land them.)

Edit: yes, this is reliably reproducible on an in-place worker restart.
My original post included a 'workaround' that was later found not to work reliably.

Essentially, it seems one must wait for Concourse to GC the errant containers (visible via fly containers); then you can ask for a fly check-resource without error (or leave it to the scheduler to do the latter at some point in the future).
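A rough sketch of that sequence (target, pipeline, and resource names are placeholders):

# Hypothetical sketch: watch for the errant check containers to disappear, then re-trigger the check.
fly -t ci containers | grep check      # the stuck check containers show up here until they are GC'd
fly -t ci check-resource -r my-pipeline/my-resource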
