Concourse containers hang towards end of lifecycle #54
@vito @schubert @jingyimei @tom-meyer @dsharp-pivotal for visibility
Hi there! We use Pivotal Tracker to provide visibility into what our team is working on. A story for this issue has been automatically created. The current status is as follows: …

This comment, as well as the labels on the issue, will be automatically updated as the status in Tracker changes.
Hi, can you please clarify what you mean by: "our 'hanging' containers only get to line 29"? Also, the auplink messages are unrelated.
Ok, good to know that auplink is unrelated. I messed up the markdown initially -- my apologies. Fixed in the original issue -- https://gist.github.com/cjcjameson/e76e2f532077382f76650a65aab15569#file-normal_sequence_of_guardian_messages-L29 shows the full container lifecycle vs. where hung containers only get to line 29.
Our debugging today was primarily based on looking at the process tree on Worker VMs in our Concourse instance. We were able to determine which workers had which containers stuck between … We observed in …

We scanned our pipelines and codebases, and identified some jobs that were running with … We haven't seen behavior that bad so far this afternoon. For example, the team that runs …

We do have a team that's running tests which involve crashing various segments of a Greenplum database and seeing how it responds. Those tests take a long time, and result in …

Here's the rub: cancelling that Concourse job, which seems to be going slowly and having …

The next time we checked …
Anything still lingering after the build finishes is probably a background process that the build left behind. This will make things hang, since Garden (and thus Concourse) will still be waiting for those processes to exit, as they may have further output to collect for the process output/build logs. I'm not sure why they ended up as zombies, but it sounds like your task isn't properly killing them and waiting for them to exit. Maybe it's a race, and if …
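That waiting-on-output behavior is easy to reproduce outside Garden. Here is a minimal Go sketch of the mechanism (an illustration only, not Garden's actual code; it assumes a POSIX `sh` on `$PATH`): the shell exits immediately, but the backgrounded `sleep` inherits the write end of the stdout pipe, so the reader blocks until the pipe reaches EOF.

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

func main() {
	start := time.Now()

	// sh prints "done" and exits right away, but the backgrounded sleep
	// keeps the inherited stdout pipe open for another five seconds.
	out, err := exec.Command("sh", "-c", "sleep 5 & echo done").Output()
	if err != nil {
		panic(err)
	}

	// Prints ~5s, not ~0s: we were blocked on pipe EOF, not on sh itself.
	fmt.Printf("got %q after %s\n", out, time.Since(start).Round(time.Second))
}
```

The usual fix inside a task script is to detach the background process's stdio (e.g. `mydaemon >/dev/null 2>&1 &`) or to kill it and wait for it before the script exits.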
@vito, we also sometimes see multiple processes in the container with ppid = 0 (within the pid namespace), including the task's run command. If that's the case, then it won't help if the task's run command is an "init"-like process with a zombie reaper (like dumb-init).

Processes that lose their parents get reparented. Under pid namespaces, they get reparented to pid 1 in the namespace, but it looks to me like this only happens when pid 1 is an ancestor of the orphaned process. The task command seems to be put in the pid namespace using setns, and is therefore not a descendant of pid 1 (in the ns). ... Unless the task process is set to be a "subreaper" (prctl's PR_SET_CHILD_SUBREAPER), in which case orphaned descendants get reparented to it instead.

So, it makes sense to me to run tasks that spawn background processes under an "init"-like process, but I'm still wondering about how the task gets launched in the container. I would rather the task process be pid 1, if only so that it looks like the proper "init" that it should be. Instead, our pid 1 is the runC init (/proc/self/exe init). (btw, why does runC call …
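For reference, here is a minimal Linux-only Go sketch of the subreaper mechanism described above (the child commands are illustrative, and this is not how Garden launches tasks): after `prctl(PR_SET_CHILD_SUBREAPER, 1)`, orphaned descendants are reparented to this process rather than past it, so a dumb-init-style wait loop can reap them.

```go
package main

import (
	"fmt"
	"os/exec"

	"golang.org/x/sys/unix"
)

func main() {
	// Become a subreaper: orphaned descendants are reparented to us
	// instead of to pid 1 of the pid namespace.
	if err := unix.Prctl(unix.PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0); err != nil {
		panic(err)
	}

	// sh backgrounds a grandchild and exits, orphaning it.
	if err := exec.Command("sh", "-c", "sleep 1 & exit 0").Run(); err != nil {
		panic(err)
	}

	// Reap everything reparented to us, dumb-init style, until ECHILD.
	for {
		var ws unix.WaitStatus
		pid, err := unix.Wait4(-1, &ws, 0, nil)
		if err != nil {
			break // ECHILD: nothing left to reap
		}
		fmt.Printf("reaped pid %d (exit status %d)\n", pid, ws.ExitStatus())
	}
}
```

Run as-is, it reaps the orphaned `sleep`; comment out the `Prctl` call and `Wait4` fails with ECHILD immediately, because the orphan is reparented to pid 1 instead of to us.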
This morning's discussion and debugging helped us understand more deeply what Suraci mentioned above: that Concourse and Garden are waiting on … We see now that zombie processes & daemons may not be directly related to this 'hanging' behavior. Leaving the issue open to await final resolution based on more information from our particular test runs and containers.
We have a container live right now exhibiting this issue: it has nothing running in it other than some zombies, and the runc init process (so nothing to even hold stdout open). Here's a bunch of info...

Inside the container, via fly hijack: …

The rest is outside the container, via bosh ssh, as root, but not in "the secret garden" container. Selected ps -efH output: …

Looking for init outside the container, by lining up pid namespaces: …

init: …

dadoo: …

What has the stdout named-pipe open? …

... Just guardian. What about other things dadoo has open? …

... Seems like just itself.
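For anyone repeating this kind of spelunking, here is a hedged Go sketch of the two /proc tricks used above -- matching processes by pid-namespace identity and listing a process's open fds (the same information `ps` and `lsof` were used for here). The target pid is a placeholder; run it as root on the worker.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// pidNS returns a process's pid-namespace identity, e.g. "pid:[4026532345]".
func pidNS(pid int) (string, error) {
	return os.Readlink(fmt.Sprintf("/proc/%d/ns/pid", pid))
}

func main() {
	target := 12345 // placeholder: e.g. the container's runc init pid, from ps -efH
	want, err := pidNS(target)
	if err != nil {
		panic(err)
	}

	// Print every host process sharing the target's pid namespace.
	procs, _ := filepath.Glob("/proc/[0-9]*")
	for _, p := range procs {
		var pid int
		if _, err := fmt.Sscanf(filepath.Base(p), "%d", &pid); err != nil {
			continue
		}
		if ns, err := pidNS(pid); err == nil && ns == want {
			fmt.Println("same pid namespace:", pid)
		}
	}

	// List the target's open fds -- the same information `lsof -p` shows.
	fds, _ := os.ReadDir(fmt.Sprintf("/proc/%d/fd", target))
	for _, fd := range fds {
		dest, _ := os.Readlink(fmt.Sprintf("/proc/%d/fd/%s", target, fd.Name()))
		fmt.Printf("fd %s -> %s\n", fd.Name(), dest)
	}
}
```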
I currently have a deployment of Concourse 3.2.1, with web/atc deployed as a Docker container on an AWS ECS cluster with an ELB in front, and 2 workers using the binary on plain EC2 machines, each with an additional volume using …

Concourse is in a state where I have 4 jobs hung. Currently, they always hang on me when … Some info: …

Contrary to @cjcjameson, my containers do not have …

Checking on the status, each of these reports this: …

When trying to intercept the hanging …

The … Desperately trying to find a solution.
Hi @ringods - I'm not sure the multiple … I'm pretty sure this issue is quite different from the original one in the thread, so could you please open a new issue with as much information (full …
@julz While it might have been another problem, we switched our workers away from …
I get this issue too (Concourse 3.6.0): some of the PUT tasks just hang forever, and if I try to intercept the container with fly intercept, I get "bad handshake".
Description
Concourse jobs "hang" -- a get OR a task step will have finished, and the next step doesn't proceed (or if it's the last step, the job doesn't resolve).
Logging and/or test output
`ps aux` when hijacked into a 'hung' container: …

Typical garden logs for a particular container handle -- our 'hanging' containers only get to line 29: …
Steps to reproduce
We have not been able to reliably reproduce the issue. We had been seeing various Concourse issues (including this one -- although we didn't dig deep enough into the garden logs to be sure it was the same, it manifested the same way), and ended up dropping the database entirely.
After spinning up a new EBS volume with `bosh deploy` and getting some initial pipelines to go green for the first ~12 hours, after 12-24 hours we were seeing this 'hang' somewhat often again. But we can't make it happen on demand.

On a job that is 'hung', if we trigger another build (cancelling or not cancelling the 'hung' one), it hangs on "waiting for a suitable set of input versions" in the Concourse UI.

Concourse tracker story: …
Container images
We are running some of these tasks on a custom-built CentOS container image. But the `get` steps that hang are just whatever Concourse provides, and we also have `task` steps that are hanging on the base Dockerhub ruby image.

`garden.stderr.log` -- Possibly unrelated?

Our `garden.stderr.log` files each have 10-20 lines of the following in them:

time="2016-11-01T09:50:07Z" level=error msg="Couldn't run auplink before unmount: exec: \"auplink\": executable file not found in $PATH"
They're coming in ~1-4 alerts per hour. Is this related?