Concourse containers hang towards end of lifecycle #54

Closed
cjcjameson opened this issue Nov 2, 2016 · 13 comments

cjcjameson commented Nov 2, 2016

Todos

Please try the following before submitting the issue:

  • Upgrade Concourse if you are observing a failure in a test (yes -- on v2.4.0 with Garden runC 1.0.0)
  • Use the latest BOSH stemcell if the problem occurred in a BOSH VM (yes -- bosh-aws-xen-hvm-ubuntu-trusty-go_agent/3263.8)

Description

Concourse jobs "hang" -- a get OR a task step will have finished, and the next step doesn't proceed (or if it's the last step, the job doesn't resolve).

Logging and/or test output

ps aux when hijacked into a 'hung' container:

USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root          1  0.0  0.0 116124  6820 ?        S<sl 12:30   0:00 /proc/self/exe init
root         18  0.7  0.0  20252  3260 ?        S<s  20:36   0:00 bash
root         24  0.0  0.0  17500  2188 ?        R<+  20:36   0:00 ps aux

Typical garden logs for a particular container handle (gist linked in a comment below) -- our 'hanging' containers only get to line 29 of that sequence.

Steps to reproduce

We have not been able to reproduce the issue reliably. We had been seeing various Concourse issues (including this one -- although we didn't dig deep enough into the garden logs to be certain it was the same, it manifested the same way), and ended up dropping the database entirely.

After spinning up a new EBS volume with bosh deploy, our initial pipelines went green for the first ~12 hours. After 12-24 hours we were seeing this 'hang' somewhat often again, but we can't make it happen on demand.

On a job that is 'hung', if we trigger another build (whether or not we cancel the 'hung' one), it hangs on "waiting for a suitable set of input versions" in the Concourse UI. Concourse tracker story

Container images

We run some of these tasks on a custom-built CentOS container image, but the get steps that hang use whatever image Concourse provides, and we also have task steps hanging on the base Docker Hub ruby image.

garden.stderr.log -- Possibly unrelated?

Our garden.stderr.log files each have 10-20 lines of the following in them:

time="2016-11-01T09:50:07Z" level=error msg="Couldn't run auplink before unmount: exec: \"auplink\": executable file not found in $PATH"

They come in at roughly 1-4 per hour. Is this related?

@cjcjameson

@vito @schubert @jingyimei @tom-meyer @dsharp-pivotal for visibility

garden-gnome commented Nov 2, 2016

Hi there!

We use Pivotal Tracker to provide visibility into what our team is working on. A story for this issue has been automatically created.

The current status is as follows:

  • #133566103 Concourse containers hang towards end of lifecycle

This comment, as well as the labels on the issue, will be automatically updated as the status in Tracker changes.

@williammartin

Hi, can you please clarify what you mean by: "our 'hanging' containers only get to line 29"?

Also, the auplink messages are unrelated.

@cjcjameson

Ok, good to know that auplink is unrelated.

I messed up the markdown initially -- my apologies. It's fixed in the original issue. https://gist.github.com/cjcjameson/e76e2f532077382f76650a65aab15569#file-normal_sequence_of_guardian_messages-L29 shows the full container lifecycle; our hung containers only get to line 29.


cjcjameson commented Nov 3, 2016

Our debugging today was primarily based on looking at the process tree on the worker VMs in our Concourse instance. We were able to determine which workers had containers stuck between guardian.api.garden-server.run.spawned and guardian.api.garden-server.run.exited by grepping the garden logs.
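
A rough sketch of that matching, for reference. The log path and the UUID-based handle extraction are assumptions about a typical BOSH garden job (and the pattern also picks up process GUIDs), so treat it as a coarse filter for candidate containers rather than an exact query:

#!/bin/bash
# Sketch: list UUIDs that appear on run.spawned lines but never on run.exited lines.
LOG=/var/vcap/sys/log/garden/garden.stdout.log

uuids_with() {
  grep "$1" "$LOG" | grep -oE '[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' | sort -u
}

comm -23 \
  <(uuids_with 'guardian.api.garden-server.run.spawned') \
  <(uuids_with 'guardian.api.garden-server.run.exited')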

We observed in ps -efH on a worker that there were a number of "zombie" processes, and also lots of sshd processes:

  • sshd processes still running in jobs with <defunct> processes
  • /proc/self/exe init processes still running with many child sshd processes, all <defunct>
  • A few stray sshd processes not within a particular container process

We scanned our pipelines and codebases and identified some jobs running with privileged: true. We removed those pipelines or re-set them without privileged: true, and recreated the worker VMs to make sure the zombie processes were killed ✝️
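
The scan itself was nothing fancy -- roughly the following, with the path a placeholder for wherever your pipeline and task YAML lives:

grep -rn --include='*.yml' 'privileged:[[:space:]]*true' ci/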

We haven't seen the bad behavior so far this afternoon. For example, the team that runs sshd processes hasn't hit the issue yet after switching off privileged: true.

We do have a team running tests that involve crashing various segments of a Greenplum database and seeing how it responds. Those tests take a long time and leave <defunct> processes that are clearly children of the core test processes, but it seems well encapsulated. We can and should look into this more as a team if we think our own testing procedures are giving us trouble. Not your issue, but...

Here's the rub: cancelling that Concourse job -- the one that seems to be going slowly and leaving <defunct> processes, even if its issues were by design and we should have just let it run longer -- leads to the following in our process tree (reminder: GPDB is a fork of Postgres, so postgres processes are GPDB processes in this readout):

4294967+  60091      1  0 Nov02 ?        00:00:00   /proc/self/exe init
500       58877  60091  0 Nov02 ?        00:00:00     [postgres] <defunct>
500       58971  60091  0 Nov02 ?        00:00:00     [postgres] <defunct>
500       58972  60091  0 Nov02 ?        00:00:01     [postgres] <defunct>
⋮
500      113237  60091  0 Nov02 ?        00:00:01     [postgres] <defunct>
500      113635  60091  0 Nov02 ?        00:00:01     [postgres] <defunct>

The next time we checked ps -efH, that init with the defunct postgres processes was gone. What did we do in the meantime that may have fixed it? We addressed vmware-archive/atc#130 manually by resetting the resource_checking field in the Concourse database. That unblocked a subsequent run of the same job, and the subsequent build was in flight when we next checked ps -efH.


vito commented Nov 3, 2016

/proc/self/exe init is the container itself, so that entry is normal.

Anything else lingering after the build finishes is probably a background process the build left running. That will make things hang, since Garden (and thus Concourse) will still be waiting for those processes to exit, as they may have further output to collect for the process output/build logs.

I'm not sure why they ended up as zombies, but it sounds like your task isn't properly killing them and waiting for them to exit. Maybe it's a race: if sshd doesn't exit in time, the script exits without waiting on it and leaves the zombie around?
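
A minimal sketch of the kind of cleanup I mean, assuming a bash task script that starts sshd in the background (the paths and flags are illustrative, not taken from your pipeline):

#!/bin/bash
set -euo pipefail

/usr/sbin/sshd -D &    # hypothetical background daemon the task needs
sshd_pid=$!

# Kill and reap it before the script exits, so Garden isn't left waiting on a
# process that still holds the build's output streams.
cleanup() {
  kill "$sshd_pid" 2>/dev/null || true
  wait "$sshd_pid" 2>/dev/null || true
}
trap cleanup EXIT

# ... run the actual tests here ...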

@dsharp-pivotal

@vito, we also sometimes see multiple processes in the container with ppid = 0 (within the pid namespace), including the task's run command. If that's the case, then it won't help if the task's run command is an "init"-like process with a zombie reaper (like dumb-init). Processes that lose their parents get reparented. Under pid namespaces they get reparented to pid 1 in the namespace, but it looks to me like this only happens when pid 1 is an ancestor of the orphaned process. The task command seems to be put into the pid namespace using setns, and is therefore not a descendant of pid 1 (in the namespace).

... Unless the task process is set to be a "subreaper" (tini can set itself up to do that, but dumb-init doesn't), which is a feature available since Linux 3.4.

So it makes sense to me to run tasks that spawn background processes under an "init"-like process, but I'm still wondering how the task gets launched in the container. I would rather the task process be pid 1, if only so that it looks like the proper "init" it should be. Instead, our pid 1 is the runC init (/proc/self/exe init). (By the way, why does runC exec /proc/self/exe, and couldn't it set argv[0] to something more informative?)
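
For concreteness, the task-side workaround I have in mind looks roughly like this, assuming the image ships tini (the script path is made up). tini's -s flag registers it as a child subreaper (PR_SET_CHILD_SUBREAPER), so orphaned descendants get reparented to it and reaped even though it isn't pid 1:

#!/bin/bash
# Sketch: run the real task under tini with subreaper enabled, so orphaned
# background processes are reparented to tini rather than left for runC's init.
exec /usr/bin/tini -s -- ./ci/run-tests.sh "$@"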


cjcjameson commented Nov 4, 2016

This morning's discussion and debugging helped us understand more deeply what Suraci (@vito) mentioned above: Concourse and Garden wait on stdout being closed before output/log collection is considered done. We are looking into the particulars of our jobs, because understanding that better will help us reproduce any remaining issues.
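
A minimal way to see that mechanism outside Concourse (plain bash, nothing Concourse-specific): the reader of a pipe doesn't get EOF until every writer closes it, so a background child that inherits stdout keeps the "build output" open long after the script itself has exited.

# `cat` keeps waiting even though the inner script exited immediately, because
# the backgrounded sleep still holds the write end of the pipe:
bash -c 'sleep 600 & echo "script done"' | cat

# Redirecting the background process's output releases the pipe right away:
bash -c 'sleep 600 >/dev/null 2>&1 & echo "script done"' | cat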

We see now that zombie processes and daemons may not be directly related to this 'hanging' behavior.

Leaving the issue open to await final resolution, pending more information from our particular test runs and containers.

@dsharp-pivotal

We have a container live right now that exhibits this issue: it has nothing running in it other than some zombies and the runc init process (so there is nothing to even hold stdout open).

Here's a bunch of info....

Inside the container, via fly hijack:

[root@d63938bf-7b31-43b1-6e7f-7a09c8d5608b b25614d2]# ps -efH
UID        PID  PPID  C STIME TTY          TIME CMD
root     12797     0  0 17:36 ?        00:00:00 bash
root     12821 12797  0 18:17 ?        00:00:00   ps -efH
gpadmin   4603     0  0 Dec08 ?        00:00:00 [postgres] <defunct>
...
gpadmin   8197     0  0 Dec08 ?        00:00:00 [postgres] <defunct>
root        86     0  0 Dec08 ?        00:00:00 [sshd] <defunct>
root         1     0  0 Dec08 ?        00:00:00 /proc/self/exe init
[root@d63938bf-7b31-43b1-6e7f-7a09c8d5608b b25614d2]# ls -l /proc/self/ns/
total 0
lrwxrwxrwx 1 root root 0 Dec  9 18:18 cgroup -> cgroup:[4026531835]
lrwxrwxrwx 1 root root 0 Dec  9 18:18 ipc -> ipc:[4026535108]
lrwxrwxrwx 1 root root 0 Dec  9 18:18 mnt -> mnt:[4026534718]
lrwxrwxrwx 1 root root 0 Dec  9 18:18 net -> net:[4026535111]
lrwxrwxrwx 1 root root 0 Dec  9 18:18 pid -> pid:[4026535109]
lrwxrwxrwx 1 root root 0 Dec  9 18:18 user -> user:[4026534620]
lrwxrwxrwx 1 root root 0 Dec  9 18:18 uts -> uts:[4026535107]

The rest is outside the container, via bosh ssh, as root, but not in "the secret garden" container...

selected ps -efH output:

root     32392  5579  0 Dec08 ?        00:00:15         runc events d63938bf-7b31-43b1-6e7f-7a09c8d5608b
root     32706  5579  0 Dec08 ?        00:00:00         /var/vcap/packages/guardian/bin/dadoo -tty -rows 0 -cols 0 -uid 4294967294 -gid 4294967294 exec runc /var/vcap/data/garden/depot/d63938bf-7b31-43b1-6e7f-7a09c8d5608b/processes/2484a062-cf71-49d7-568d-3c54e39504c2 d63938bf-7b31-43b1-6e7f-7a09c8d5608b
4294967+  9668 32706  0 Dec08 ?        00:00:00           [sshd] <defunct>
500      14839 32706  0 Dec08 ?        00:00:00           [postgres] <defunct>
...
500       2355 32706  0 Dec08 ?        00:00:00           [postgres] <defunct>
root      3401  5579  0 17:36 ?        00:00:00         /var/vcap/packages/guardian/bin/dadoo -tty -rows 75 -cols 156 -uid 4294967294 -gid 4294967294 exec runc /var/vcap/data/garden/depot/d63938bf-7b31-43b1-6e7f-7a09c8d5608b/processes/5c24d731-53a0-42f9-6248-3f553fa4473c d63938bf-7b31-43b1-6e7f-7a09c8d5608b
4294967+  3416  3401  0 17:36 pts/1    00:00:00           bash

Looking for init outside the container, by lining up pid namespaces:

GPDB-worker/d31cb48d-fcd6-44be-baf7-f9585f75fb4c:/# ps -efH | grep '/proc/self/exe init' | awk '{print $2}' | while read pid; do ls -l /proc/$pid/ns/* 2>/dev/null ; done | grep -F 'pid:[4026535109]'
lrwxrwxrwx 1 root root 0 Dec  9 18:40 /proc/32300/ns/pid -> pid:[4026535109]

init:

GPDB-worker/d31cb48d-fcd6-44be-baf7-f9585f75fb4c:/# cat /proc/32300/status
Name:	exe
State:	S (sleeping)
Tgid:	32300
Ngid:	0
Pid:	32300
PPid:	1
TracerPid:	0
Uid:	-2	-2	-2	-2
Gid:	-2	-2	-2	-2
FDSize:	64
Groups:	1 2 3 4 6 10 -2
NStgid:	32300	1
NSpid:	32300	1
NSpgid:	32300	1
NSsid:	32300	1
VmPeak:	  163700 kB
VmSize:	  116120 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	    8688 kB
VmRSS:	    5012 kB
VmData:	  110100 kB
VmStk:	     136 kB
VmExe:	    5672 kB
VmLib:	       0 kB
VmPTE:	      64 kB
VmPMD:	      20 kB
VmSwap:	    3676 kB
HugetlbPages:	       0 kB
Threads:	6
SigQ:	2/128601
SigPnd:	0000000000000000
ShdPnd:	0000000000000000
SigBlk:	0000000000000000
SigIgn:	0000000000000000
SigCgt:	ffffffffffc1feff
CapInh:	00000000a80425fb
CapPrm:	00000000a80425fb
CapEff:	00000000a80425fb
CapBnd:	00000000a80425fb
CapAmb:	0000000000000000
Seccomp:	2
Cpus_allowed:	00ff
Cpus_allowed_list:	0-7
Mems_allowed:	00000000,00000001
Mems_allowed_list:	0
voluntary_ctxt_switches:	44
nonvoluntary_ctxt_switches:	0
GPDB-worker/d31cb48d-fcd6-44be-baf7-f9585f75fb4c:/# ls -l /proc/32300/fd
total 0
lrwx------ 1 root root 64 Dec  9 18:23 0 -> /dev/null
lrwx------ 1 root root 64 Dec  9 18:23 1 -> /dev/null
lrwx------ 1 root root 64 Dec  9 18:23 2 -> /dev/null
lr-x------ 1 root root 64 Dec  9 18:23 4 -> /run/runc/d63938bf-7b31-43b1-6e7f-7a09c8d5608b
lr-x------ 1 root root 64 Dec  9 18:23 5 -> pipe:[92574074]
l-wx------ 1 root root 64 Dec  9 18:23 6 -> /null

dadoo:

GPDB-worker/d31cb48d-fcd6-44be-baf7-f9585f75fb4c:/# cat /proc/32706/status
Name:	dadoo
State:	S (sleeping)
Tgid:	32706
Ngid:	0
Pid:	32706
PPid:	5579
TracerPid:	0
Uid:	0	0	0	0
Gid:	0	0	0	0
FDSize:	128
Groups:	0
NStgid:	32706
NSpid:	32706
NSpgid:	32706
NSsid:	5288
VmPeak:	  276004 kB
VmSize:	  276004 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	    8180 kB
VmRSS:	    6608 kB
VmData:	  265540 kB
VmStk:	     136 kB
VmExe:	    4032 kB
VmLib:	    2008 kB
VmPTE:	      96 kB
VmPMD:	      20 kB
VmSwap:	    1508 kB
HugetlbPages:	       0 kB
Threads:	9
SigQ:	0/128601
SigPnd:	0000000000000000
ShdPnd:	0000000000000000
SigBlk:	fffffffe7bfa7a27
SigIgn:	0000000000000000
SigCgt:	ffffffffffc1feff
CapInh:	0000000000000000
CapPrm:	0000003fffffffff
CapEff:	0000003fffffffff
CapBnd:	0000003fffffffff
CapAmb:	0000000000000000
Seccomp:	0
Cpus_allowed:	7fff
Cpus_allowed_list:	0-14
Mems_allowed:	00000000,00000001
Mems_allowed_list:	0
voluntary_ctxt_switches:	186
nonvoluntary_ctxt_switches:	1
GPDB-worker/d31cb48d-fcd6-44be-baf7-f9585f75fb4c:/# ls -l /proc/32706/fd
total 0
lr-x------ 1 root root 64 Dec  9 18:34 0 -> pipe:[92611099]
l-wx------ 1 root root 64 Dec  9 18:34 1 -> /dev/null
lrwx------ 1 root root 64 Dec  9 18:34 10 -> /dev/ptmx
lrwx------ 1 root root 64 Dec  9 18:34 11 -> /dev/pts/4
l-wx------ 1 root root 64 Dec  9 18:34 2 -> /dev/null
l-wx------ 1 root root 64 Dec  9 18:34 3 -> pipe:[92611097]
lr-x------ 1 root root 64 Dec  9 18:34 5 -> /var/vcap/data/garden/depot/d63938bf-7b31-43b1-6e7f-7a09c8d5608b/processes/2484a062-cf71-49d7-568d-3c54e39504c2/stdin
l-wx------ 1 root root 64 Dec  9 18:34 6 -> /var/vcap/data/garden/depot/d63938bf-7b31-43b1-6e7f-7a09c8d5608b/processes/2484a062-cf71-49d7-568d-3c54e39504c2/stdout
l-wx------ 1 root root 64 Dec  9 18:34 7 -> /var/vcap/data/garden/depot/d63938bf-7b31-43b1-6e7f-7a09c8d5608b/processes/2484a062-cf71-49d7-568d-3c54e39504c2/stderr
lrwx------ 1 root root 64 Dec  9 18:34 8 -> /var/vcap/data/garden/depot/d63938bf-7b31-43b1-6e7f-7a09c8d5608b/processes/2484a062-cf71-49d7-568d-3c54e39504c2/winsz
lrwx------ 1 root root 64 Dec  9 18:34 9 -> /var/vcap/data/garden/depot/d63938bf-7b31-43b1-6e7f-7a09c8d5608b/processes/2484a062-cf71-49d7-568d-3c54e39504c2/exit

What has the stdout named-pipe open?

GPDB-worker/d31cb48d-fcd6-44be-baf7-f9585f75fb4c:/# lsof /var/vcap/data/garden/depot/d63938bf-7b31-43b1-6e7f-7a09c8d5608b/processes/2484a062-cf71-49d7-568d-3c54e39504c2/stdout
COMMAND    PID USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
guardian  5579 root   66r  FIFO 202,18      0t0 20185297 /var/vcap/data/garden/depot/d63938bf-7b31-43b1-6e7f-7a09c8d5608b/processes/2484a062-cf71-49d7-568d-3c54e39504c2/stdout
dadoo    32706 root    6w  FIFO 202,18      0t0 20185297 /var/vcap/data/garden/depot/d63938bf-7b31-43b1-6e7f-7a09c8d5608b/processes/2484a062-cf71-49d7-568d-3c54e39504c2/stdout

... Just guardian.

What about other things dadoo has open?

GPDB-worker/d31cb48d-fcd6-44be-baf7-f9585f75fb4c:/# ps -efH | awk '{print $2}' | while read pid; do ls -l /proc/$pid/fd/* 2>/dev/null ; done | grep -F 'pipe:[92611099]'
lr-x------ 1 root root 64 Dec  9 18:34 /proc/32706/fd/0 -> pipe:[92611099]
GPDB-worker/d31cb48d-fcd6-44be-baf7-f9585f75fb4c:/# ps -efH | awk '{print $2}' | while read pid; do ls -l /proc/$pid/fd/* 2>/dev/null ; done | grep -F 'pipe:[92611097]'
l-wx------ 1 root root 64 Dec  9 18:34 /proc/32706/fd/3 -> pipe:[92611097]
GPDB-worker/d31cb48d-fcd6-44be-baf7-f9585f75fb4c:/# lsof /dev/pts/4
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
dadoo   32706 root   11u   CHR  136,4      0t0    7 /dev/pts/4

... seems like just itself.

ringods commented Jun 27, 2017

I currently have a deployment of Concourse 3.2.1, with web/ATC running as a Docker container on an AWS ECS cluster behind an ELB, and 2 workers using the binary on plain EC2 machines, each with an additional volume formatted with btrfs.

Concourse is in a state where 4 jobs are hung. At the moment, they always hang on me when putting a docker-image resource.

Some info:

name             containers  platform  tags  team  state    version
ip-10-32-21-76   192         linux     none  none  running  1.1
ip-10-32-22-205  28          linux     none  none  running  1.1

Contrary to @cjcjameson, my containers do not have <defunct> processes. I just have a long list of these:

4294967+ 17525  0.0  0.1 101112  3544 ?        Ssl  Jun26   0:00 /proc/self/exe init
4294967+ 17544  0.0  0.0 101112  1504 ?        Ssl  Jun26   0:00 /proc/self/exe init
4294967+ 19453  0.0  0.1  36632  3912 ?        Ssl  Jun26   0:00 /proc/self/exe init
root     22439  0.0  0.0 100604   620 ?        Ssl  Jun26   0:00 /proc/self/exe init
4294967+ 22849  0.0  0.0 101112  1500 ?        Ssl  Jun26   0:00 /proc/self/exe init
4294967+ 28215  0.0  0.0 102168  1508 ?        Ssl  Jun26   0:00 /proc/self/exe init
...

Checking the status, each of these reports the following:

 cat /proc/17525/status
Name:	runc:[2:INIT]
State:	S (sleeping)
Tgid:	17525
Ngid:	0
Pid:	17525
PPid:	1
TracerPid:	0
Uid:	-2	-2	-2	-2
Gid:	-2	-2	-2	-2
FDSize:	64
Groups:
NStgid:	17525	1
NSpid:	17525	1
NSpgid:	17525	1
NSsid:	17525	1
VmPeak:	  157944 kB
VmSize:	  101112 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	    8684 kB
VmRSS:	    3544 kB
VmData:	   92648 kB
VmStk:	    2180 kB
VmExe:	    6060 kB
VmLib:	       0 kB
VmPTE:	      68 kB
VmPMD:	      20 kB
VmSwap:	       0 kB
HugetlbPages:	       0 kB
Threads:	4
SigQ:	0/7935
SigPnd:	0000000000000000
ShdPnd:	0000000000000000
SigBlk:	0000000000000000
SigIgn:	0000000000000000
SigCgt:	ffffffffffc1feff
CapInh:	00000000a80425fb
CapPrm:	00000000a80425fb
CapEff:	00000000a80425fb
CapBnd:	00000000a80425fb
CapAmb:	00000000a80425fb
Seccomp:	2
Cpus_allowed:	0001
Cpus_allowed_list:	0
Mems_allowed:	00000000,00000001
Mems_allowed_list:	0
voluntary_ctxt_switches:	39
nonvoluntary_ctxt_switches:	59

When trying to intercept the hanging put of the docker-image resource, I can't get into the container:

$ fly-3.2.1 -t tl intercept -j marketplace/build-scraper
1: build #35, step: build-scraper, type: task
2: build #35, step: docker-marketplace-scraper, type: put
3: build #35, step: git-marketplace-scraper, type: get
choose a container: 2
_

The _ character just means that the cursor ends up there and nothing more happens.

Desperately trying to find a solution.


julz commented Jul 11, 2017

Hi @ringods - I'm not sure the multiple /proc/self/exe init processes are particularly interesting here: since you have 192 containers running, I'd expect to see 192 /proc/self/exe processes, and none of them look to be in a bad state judging from the output above. The only thing that seems odd is that intercepting into the container hangs, but there isn't much of the kind of debug output collected earlier in this thread that would help us determine why. I have had experience in the past of btrfs causing system hangs and pauses under load, so that would be my initial suspicion; when this happens we often see btrfs processes using lots of CPU in the ps list.
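
If you want a quick check, something along these lines will surface btrfs kernel workers that are busy (the 10% threshold is arbitrary):

ps aux | awk '$3 > 10 && $11 ~ /btrfs/'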

I'm pretty sure this issue is quite different from the original one in this thread, so could you please open a new issue with as much information as you can share (full ps output, garden logs with any passwords removed, deployment details, etc.)? We'll follow up there and ask for anything else we need. Thanks!

julz closed this as completed Jul 11, 2017

ringods commented Jul 11, 2017

@julz While it might have been a different problem, we also switched our workers from btrfs to ext4/overlayfs. My lockups are gone, so I no longer have any details to put in a new issue.

francoisruty commented Nov 24, 2017

I'm hitting this issue too (Concourse 3.6.0): some of the put steps just hang forever, and if I try to intercept the container with fly intercept, I get "bad handshake". It's not all put steps, though -- it happens on one particular pipeline, and I don't know why. I tried this shim (concourse/concourse#1028) without success.
