"no running task found" issue after upgrade to 7.7.0 #8172

Open
igor-nikiforov opened this issue Mar 16, 2022 · 28 comments

@igor-nikiforov

igor-nikiforov commented Mar 16, 2022

Summary

After upgrading from 7.6.0 to 7.7.0, we started seeing many errors in resources like the following:

run check: start process: backend error: Exit status: 500, message: {"Type":"","Message":"task retrieval: no running task found: task d7139440-721d-4258-79bb-25ad9760bab9 not found: not found","Handle":"","ProcessID":"","Binary":""}

We are using the containerd runtime and running Concourse on Kubernetes. After reverting back to 7.6.0, all errors disappeared.

Steps to reproduce

Upgrade from 7.6.0 to 7.7.0

Expected results

No errors :)

Actual results

See above.

Triaging info

  • Concourse version: 7.7.0
  • containerd: 1.5.8
  • Kubernetes: 1.21.7
@xtremerui
Contributor

Hi, where is this error happening, e.g. resource check, task step, or put step?

@igor-nikiforov
Author

@xtremerui resource check.

@xtremerui xtremerui self-assigned this Mar 17, 2022
@dhml

dhml commented Mar 19, 2022

FWIW, pausing the pipeline then un-pausing it after something longer than resource.check_every appears to fix this issue.
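For anyone who wants to script that workaround, a rough sketch with the fly CLI (the target, pipeline name, and wait time are placeholders):

# Hypothetical sketch of the pause/un-pause workaround; adjust the sleep so it exceeds the resource's check_every.
fly -t ci pause-pipeline -p my-pipeline
sleep 120
fly -t ci unpause-pipeline -p my-pipeline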

@nekrondev

The workaround did help fix the issue I experienced with regular S3 storage checks for new deployment artifacts that ran into the "task not found" error.

I got this issue by upgrading from 7.7.0 to 7.7.1: shutting down the worker node, updating the worker node, starting the worker node, then shutting down the web node, upgrading the web node, and starting the web node (this was done manually on a development system).

On the production system running the same pipelines, I did the upgrade dance on the web node first and on the worker node afterwards, resulting in no "no running task found" issues, but this might be random.

@clarafu clarafu added this to the v7.8.0 milestone Apr 14, 2022
@clarafu
Contributor

clarafu commented Apr 20, 2022

Is there any other information that you can give us? Does this happen occasionally and keep flaking with this error, or did it only happen at the beginning when you upgraded and never again? Was everyone who saw this error using containerd?

@0x450x6c

Same issue.

It happened after a Docker restart, with no version change. I'm using version 7.7.1 with containerd.

@navdeep-pama navdeep-pama removed this from the v7.8.0 milestone May 31, 2022
@aeijdenberg
Contributor

I'm seeing this issue on 7.7.1 using containerd with the Helm chart (when using runc on k3s I see errors related to cgroups). I had Concourse deployed on k3s, and for various reasons I wanted to shut everything down and then bring it back up again. I ran:

sudo service k3s stop
/usr/local/bin/k3s-killall.sh
sudo service k3s start

and now I'm in that state for most (all?) of my checks. Hoping some kind of TTL will kick in... It looks like renaming a resource in a pipeline is enough to unstick it, but that is not a terribly satisfying workaround.

@aeijdenberg
Contributor

FWIW, pausing the pipeline then un-pausing it after something longer than resource.check_every appears to fix this issue.

I wanted to try this, but I'd cunningly set some of these resources to check_every: never (as I manually trigger the build), so waiting longer than never is not an option. :)
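For reference, with check_every: never the check only runs when triggered manually, roughly like this (target and resource names are placeholders):

# Hypothetical sketch: force a one-off check of a resource that has check_every: never.
fly -t ci check-resource -r my-pipeline/my-resource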

@aeijdenberg
Contributor

Problem cleared itself up overnight with no intervention.

@nekrondev

nekrondev commented Jun 7, 2022

FYI: just upgraded from 7.7.1 to 7.8.0 without any issues (non-clustered deployment, just plain docker-compose).
The update order was the web node first (stop, upgrade, start), then the worker node (stop, upgrade, start). Previously we did it the reverse way, starting with the worker node and doing the web node last, which caused the issue.
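A rough sketch of that order for a docker-compose setup (the service names "web" and "worker" are assumptions about this deployment, and it assumes the image tag in docker-compose.yml has already been bumped):

# Hypothetical sketch: upgrade the web node first, then the worker node.
docker-compose stop web
docker-compose pull web        # fetch the new Concourse image for the web service
docker-compose up -d web
docker-compose stop worker
docker-compose pull worker
docker-compose up -d worker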

@holgerstolzenberg

holgerstolzenberg commented Jun 10, 2022

We are also seeing these issues; we are on version 7.7.1 at the moment.

[Screenshot attached: failing resource checks, 2022-06-10 at 10:55:54]

As for the circumstances:

  • we did not upgrade recently
  • the issue appears suddenly and randomly throughout all pipelines, but mostly affects resources whose checks run on the same worker
  • the issue goes away when Concourse schedules the resource check on another worker
  • the worker causing the resource checks to fail is operational and can execute jobs/tasks
  • not sure, but the check seems to work again on the problematic worker after it has run on another worker in the meantime
  • OR sometimes it seems like the issue just goes away after some timeout

@holgerstolzenberg

holgerstolzenberg commented Jun 27, 2022

Just an update on this one.

We updated to Concourse 7.8.1 last week and the issue is still there; I'd even say it has become worse. Now not only resource checks but also tasks are affected,
and the issue pops up throughout the whole day.

I tried a lot of things to analyze and overcome this, but nothing has helped so far.

Facts:

  • it is absolutely random and hits harder under load
  • it occurs in every pipeline and Concourse lights up like a Christmas tree
  • it self-heals after some time, but that is no solution for mission-critical stuff; the team really felt the pain last week
  • logging into the worker container, you can see that the volume the task wants to use is really gone, it is just gone

I've extracted the following log entries so far:

{"timestamp":"2022-06-23T11:53:49.986269148Z","level":"info","source":"worker","message":"worker.garden.garden-server.create.created","data":{"request":{"Handle":"387c694c-3556-4d43-495c-834610e7f343","GraceTime":0,"RootFSPath":"raw:///worker-state/volumes/live/a4a75891-4d12-43f0-45bd-56c0c3a9fe65/volume","BindMounts":[{"src_path":"/worker-state/volumes/live/09de2f61-be17-45b7-7023-4e572565d203/volume","dst_path":"scratch","mode":1}],"Network":"","Privileged":false,"Limits":{"bandwidth_limits":{},"cpu_limits":{},"disk_limits":{},"memory_limits":{},"pid_limits":{}}},"session":"1.2.24457"}}
{"timestamp":"2022-06-23T11:53:50.019615220Z","level":"info","source":"worker","message":"worker.garden.garden-server.get-properties.got-properties","data":{"handle":"387c694c-3556-4d43-495c-834610e7f343","session":"1.2.24460"}}
{"timestamp":"2022-06-23T11:53:50.179611267Z","level":"info","source":"worker","message":"worker.garden.garden-server.run.spawned","data":{"handle":"387c694c-3556-4d43-495c-834610e7f343","id":"8c64794d-51d5-4bad-52d2-387b4375c6bb","session":"1.2.24461","spec":{"Path":"/opt/resource/check","Dir":"","User":"","Limits":{},"TTY":null}}}
{"timestamp":"2022-06-23T11:53:58.440124752Z","level":"info","source":"worker","message":"worker.garden.garden-server.run.exited","data":{"handle":"387c694c-3556-4d43-495c-834610e7f343","id":"8c64794d-51d5-4bad-52d2-387b4375c6bb","session":"1.2.24461","status":0}}
{"timestamp":"2022-06-23T11:57:05.835467090Z","level":"info","source":"worker","message":"worker.garden.garden-server.get-properties.got-properties","data":{"handle":"387c694c-3556-4d43-495c-834610e7f343","session":"1.2.24979"}}
{"timestamp":"2022-06-23T11:57:06.067347438Z","level":"info","source":"worker","message":"worker.garden.garden-server.run.spawned","data":{"handle":"387c694c-3556-4d43-495c-834610e7f343","id":"49be3c07-64e5-4f62-6715-54427d31a925","session":"1.2.24980","spec":{"Path":"/opt/resource/check","Dir":"","User":"","Limits":{},"TTY":null}}}
{"timestamp":"2022-06-23T11:58:54.559256144Z","level":"error","source":"worker","message":"worker.garden.garden-server.attach.failed","data":{"error":"task: no running task found: task 387c694c-3556-4d43-495c-834610e7f343 not found: not found","handle":"387c694c-3556-4d43-495c-834610e7f343","session":"1.2.13"}}
{"timestamp":"2022-06-23T12:05:32.874380615Z","level":"info","source":"worker","message":"worker.garden.garden-server.get-properties.got-properties","data":{"handle":"387c694c-3556-4d43-495c-834610e7f343","session":"1.2.927"}}
{"timestamp":"2022-06-23T12:05:32.879314608Z","level":"error","source":"worker","message":"worker.garden.garden-server.run.failed","data":{"error":"task retrieval: no running task found: task 387c694c-3556-4d43-495c-834610e7f343 not found: not found","handle":"387c694c-3556-4d43-495c-834610e7f343","session":"1.2.928"}}

A colleague of mine stumbled upon this issue and we suspected this to be a possible reason:
containerd/containerd#2202

So we updated to the latest focal kernel and deactivated the hugepages memory feature, with no success.

Then I tried to tweak some of the resource check/GC/reaping time settings, as I suspected some kind of race condition, also with no success.

As a last resort, we decided to stop using the containerd runtime to make sure the issue is not related to it - and bang - the pipelines are okay now.
I'll definitely monitor this further, but it looks like using the guardian runtime works around this issue.
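For anyone else wanting to try the same thing, a rough sketch; to the best of my knowledge the worker runtime is selected via the --runtime flag or the CONCOURSE_RUNTIME environment variable, but double-check the exact names against your Concourse version:

# Hypothetical sketch: select the worker runtime (containerd, guardian, or houdini).
export CONCOURSE_RUNTIME=guardian   # set in the worker's environment before it starts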

Still, I think this needs investigation. The volumes just being gone with containerd is not good.

@xtremerui
Contributor

@holgerstolzenberg thanks for the updates. I appreciate the time you spent on it.

So far there is no good way to reproduce this issue, and in our own CI we haven't noticed such an error. We'd like to see if more similar cases are reported, and we're hoping for the next containerd update.

@holgerstolzenberg

We tried to run the Concourse Helm chart within OKD. Here we are forced to use the containerd runtime, as guardian does not work on Fedora (cgroups v2).

We migrated a test pipeline and see the exact same problem on the vanilla OKD cluster. This time it is a real show stopper, as we cannot switch to guardian here.

@vixus0
Contributor

vixus0 commented Jul 20, 2022

Also seeing the following for resource checks and task executions:

wait for process completion: backend error: Exit status: 500, message: {"Type":"","Message":"task: no running task found: task 5fd6f69a-480b-4390-69a9-1066d61e565d not found: not found","Handle":"","ProcessID":"","Binary":""}

It seems to only affect deployments with lots of activity and becomes worse under high load.

Deployed versions:

  • Concourse: v7.8.0 (Helm chart: 17.0.0)
  • Kubernetes: v1.22.6-eks-7d68063
  • Containerd: v1.6.5

@igor-nikiforov
Author

This issue still exists on Concourse v7.8.1 and containerd v1.6.6.

Is there any ETA for when we could expect a fix for this issue?

Thanks.

@nekrondev

nekrondev commented Aug 3, 2022

@xtremerui I would suggest looking into the abandoned PR #7042, as it explains the issues many of us are having.

@xtremerui
Contributor

xtremerui commented Aug 3, 2022

@nekrondev thanks for the info, that's helpful. However, I still don't understand why things become worse after upgrading from 7.6.

In #7042, the error shows up after a worker restart, because the task is gone while the container is persisted (there we know exactly why the task was gone).
In this issue, however, the error can show up under heavy load.

Combining the comments above (and also containerd/containerd#2202), a possible explanation is that containerd needs more resources (or worker memory consumption becomes higher) to run a task (a namespaced process) in Concourse > v7.6, while containerd itself is sensitive to memory-heavy tasks, which might get reaped without the ATC's awareness.

Note that #7042 shows attempts to either better GC those persisted containers when the task is dead, or to recreate the task from the persisted container. But neither of them would solve the underlying problem, i.e. preventing the task from being reaped.

At the moment I can only suggest trying to increase the worker memory to see if it helps the situation.
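For Helm-based deployments, that would look roughly like this (release name, chart name, and sizes are placeholders, and the worker.resources value path is an assumption worth checking against your chart version):

# Hypothetical sketch: raise the worker memory request/limit on a Helm deployment.
helm upgrade concourse concourse/concourse --reuse-values \
  --set worker.resources.requests.memory=8Gi \
  --set worker.resources.limits.memory=8Gi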

Also, after going through the diffs between 7.6 and 7.7, I think #8048 might play a part in this issue, in that it might spawn many check containers for a resource type and overload the worker. So please keep an eye on the resource type a resource is using when you see the error happen in its check. We'd like to know if the check errors happen to resources that share the same resource type (or several resource types that are very commonly used).

@evanchaoli

@IanTaylor-catapult

Hit this same issue while using the containerd runtime. Set up a secondary worker and restarted the primary/web nodes to get it working again.

From what I saw, if you only have one worker and it gets into this state, it will never recover.

@kardashov

kardashov commented Oct 11, 2022

Facing the same error on Concourse 7.5.0 deployed to a GKE K8s cluster using Helm. We run 5 worker nodes.
In my case this occurs only for tasks with heavy CPU load, after 5-10 minutes of task execution.
It was possible to fix it by applying CPU limits to the CPU-heavy tasks (see the sketch after the logs below).

backend error: Exit status: 500, message: {"Type":"","Message":"task: no running task found: task fbffd41f-73d0-3552-4aaf-965a3b006aef not found: not found","Handle":"","ProcessID":"","Binary":""}"

At the same time, the following log entries can be observed in one of the concourse-worker pods:

{
"textPayload": "time="2022-10-05T12:43:20.314245070Z" level=info msg="starting signal loop" namespace=concourse path=/run/containerd/io.containerd.runtime.v2.task/concourse/fbffd41f-73d0-4435-4aaf-965a3b006aef pid=1717658\n",
}
++++++++++++++++++++++++++++++++++++++++++++++++++++
{
"jsonPayload": {
"timestamp": "2022-10-05T12:43:20.541373463Z",
"level": "info",
"message": "worker.garden.garden-server.create.created",
"data": {
"session": "1.2.84461",
"request": {
"Limits": {
"cpu_limits": {},
"bandwidth_limits": {},
"pid_limits": {},
"memory_limits": {},
"disk_limits": {}
},
"GraceTime": 0,
"BindMounts": [
REDACTED
],
"RootFSPath": "raw:///concourse-work-dir/volumes/live/4f7f9829-6399-45e4-4a6d-65119f58a48f/volume/rootfs",
"Privileged": false,
"Network": "",
"Handle": "fbffd41f-73d0-4435-4aaf-965a3b006aef"
}
},
"source": "worker"
}
+++++++++++++++++++++++++++++++++++++++++++++++++++
{
"jsonPayload": {
"data": {
"handle": "fbffd41f-73d0-4435-4aaf-965a3b006aef",
"session": "1.2.84462"
},
"message": "worker.garden.garden-server.get-properties.got-properties",
"level": "info",
"source": "worker",
"timestamp": "2022-10-05T12:43:20.577993390Z"
}
+++++++++++++++++++++++++++++++++++++++++++++++++++++++
{
"jsonPayload": {
"source": "worker",
"message": "worker.garden.garden-server.attach.failed",
"data": {
"handle": "fbffd41f-73d0-4435-4aaf-965a3b006aef",
"session": "1.2.84463",
"error": "load proc: no running process found: process does not exist task: not found"
},
"level": "error",
"timestamp": "2022-10-05T12:43:20.580878222Z"
}
}
++++++++++++++++++++++++++++++++++++++++++++++++++++
{
"jsonPayload": {
"level": "info",
"data": {
"session": "1.2.84464",
"handle": "fbffd41f-73d0-4435-4aaf-965a3b006aef"
},
"source": "worker",
"message": "worker.garden.garden-server.get-properties.got-properties",
"timestamp": "2022-10-05T12:43:20.582814559Z"
}
}
++++++++++++++++++++++++++++++++++++++++++++++
{
"jsonPayload": {
"source": "worker",
"level": "info",
"message": "worker.garden.garden-server.run.spawned",
"timestamp": "2022-10-05T12:43:20.638253661Z",
"data": {
"session": "1.2.84465",
"spec": {
"User": "",
"Path": "my-repo/my-script.sh",
"TTY": {
"window_size": {
"columns": 500,
"rows": 500
}
},
"Dir": "/tmp/build/712d8b9f",
"Limits": {}
},
"handle": "fbffd41f-73d0-4435-4aaf-965a3b006aef",
"id": "task"
}
}
}
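The CPU-limit fix mentioned above, as a rough sketch: Concourse task configs accept container_limits, so a heavy task can be capped and test-run with fly execute (the file name, image, limit value, and target are placeholders):

# Hypothetical sketch: cap a CPU-heavy task via container_limits and run it with fly execute.
cat > heavy-task.yml <<'EOF'
platform: linux
image_resource:
  type: registry-image
  source: {repository: busybox}
container_limits:
  cpu: 512              # CPU shares for the task container
run:
  path: sh
  args: ["-c", "echo simulating a heavy workload"]
EOF
fly -t ci execute -c heavy-task.yml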

@4x0v7

4x0v7 commented Nov 13, 2022

Got this issue with v7.8.3 after my laptop blue-screened and rebooted. I'm running with docker-compose locally: a web+worker container and a postgres container.
Not running anything too strenuous (one job) and everything had been working fine.
Pausing and then un-pausing the pipeline fixed it for me.

@peterhaochen47

Seeing this issue in v7.9.0 (hush-house) consistently in more CPU-intensive tasks, whereas less CPU-intensive tasks are working fine.

@readmodifywrite

Seeing this error as well, on both 7.6.0 and 7.9.0. It happens very quickly on relatively simple pipelines (almost instantly, sometimes). It is just a resource check for a small git repo. I cannot imagine this is a CPU or memory loading issue. It has happened with the same pipelines running on 2 different machines so far.

Single worker, containerd, running from a carbon copy of the quickstart docker-compose.

It does occasionally seem to clear itself up, but it might sit there broken for a while. That's not really a remedy.

@RCM7

RCM7 commented Feb 13, 2023

We're also getting this error at 7.9.0

@akobir-mp

Also experiencing this in 7.9.0. @xtremerui, can we have another look at this one? It's very prevalent.

@tobyhersey-sky

Also seeing this issue in v7.9.1

@bheemann

Also seeing this issue in v7.9.1

@saj

saj commented Dec 13, 2023

We hit this with an in-place worker restart. No Concourse upgrade or downgrade. Concourse v7.10.0.

The dashboard was full of orange triangles.
They appeared to dissipate over time with no action on my part.

We run our workers on over-provisioned bare metal. The environment is usually very stable, and is nowhere near capacity. I only restarted the workers today while trying to debug an unrelated problem. We also use the containerd runtime. (I did not land the workers prior to restart. No worker state was lost between restarts, so I did not think it would be necessary to land them.)

Edit: yes, this is reliably reproducible on an in-place worker restart.
My original post included a 'workaround' that was later found not to work reliably.

Essentially, it seems one must wait for Concourse to GC the errant containers (visible via fly containers); then you can ask for a fly check-resource without error (or leave it to the scheduler to do the latter at some point in the future).
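A rough sketch of that sequence (target, pipeline, and resource names are placeholders):

# Hypothetical sketch: watch for the errant check containers to disappear, then re-trigger the check.
fly -t ci containers | grep check      # the stuck check containers show up here until they are GC'd
fly -t ci check-resource -r my-pipeline/my-resource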
