Skip to content
This repository has been archived by the owner on Nov 30, 2021. It is now read-only.

Deis builder fails (quite often) with mkdir /sys/system.slice: operation not permitted #5067

Closed
nathansamson opened this issue Jul 19, 2016 · 18 comments

Comments

@nathansamson
Copy link
Contributor

Since I've setup a new cluster on DigitalOcean with deis V1.13.2 and CoreOS 1068.6.0 the deis builder is quite unstable.

It errors out with "System error: mkdir /sys/system.slice: operation not permitted" when deploying an app

Restarting the builder generally helps but this removes the deploy cache which in our case is quite necessary (building from scratch including compiling assets and installing gems costs 15-20 minutes)

I basically have to restart the builder after every build

The node on the builder has more than enough RAM (16GB total, 1GB free with buffers/cache even 4GB available), restarting the builder does not increase this amount significantly.

A very similar setup cluster from 2 weeks ago did not expose this problem. The only difference I can imagine is the CoreOS version (but sadly with DO the stable when you generate is what you get)

This older cluster has since been removed, so I can't double check the CoreOS version but it was very similar.

The only differences I can think of are

  • Different CoreOS version
  • This cluster has been rebuilt from a stateless instance (database and registry) (the previously mentioned destroyed cluster). This should have no effect to the builder as far as I know
  • Using multiple users (something I did not do before), but I had the same problem with an app the other does not have permission for another app I had the problem with
[nathan@zeta beepleapp]$ git push deis-production
Counting objects: 60321, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (16025/16025), done.
Writing objects: 100% (60321/60321), 53.46 MiB | 2.68 MiB/s, done.
Total 60321 (delta 45676), reused 57004 (delta 42623)
Error response from daemon: Cannot start container f3f0a508c6dba947df42fcc7c70324a9b9d6efaf0ee105b0c466a23a99dd3036: [8] System error: mkdir /sys/system.slice: operation not permitted
To ssh://git@deis.eu1-cloud.beeple.eu:2222/bplpp-production.git
 ! [remote rejected] master -> master (pre-receive hook declined)
error: failed to push some refs to 'ssh://git@deis.eu1-cloud.beeple.eu:2222/bplpp-production.git'
[nathan@zeta beepleapp]$ 
@bacongobbler
Copy link
Member

This looks similar to kubernetes/kubernetes#18202. What version of kubernetes are you running?

@nathansamson
Copy link
Contributor Author

This is on Deis V1 so I am not using a version of kubernetes :)

My CoreOS version is 1068.6.0

@nathansamson
Copy link
Contributor Author

Just more complete logs in case this is needed

Jul 19 16:03:48 eu1-cloud-worker-03 sh[26456]: [info] Deploy complete.
Jul 19 16:04:08 eu1-cloud-worker-03 sh[26456]: [info] Checking closer.
Jul 19 16:04:08 eu1-cloud-worker-03 sh[26456]: [info] Accepted connection.
Jul 19 16:04:08 eu1-cloud-worker-03 sh[26456]: [info] Adding key 'command="/usr/local/bin/gitreceive run Kristof 2a:68:a4:2b:c8:34:22:54:8d:21:5d:0e:e6:a2:65:e6",no-agent-forwarding,no-pty,no-user-rc,no-X11-forwarding,no-port-forwarding ssh-rsa AAA...
Jul 19 16:04:08 eu1-cloud-worker-03 sh[26456]: [info] Adding key 'command="/usr/local/bin/gitreceive run nathan 87:ad:87:10:89:2e:64:d9:43:11:2f:60:3b:93:fe:7f",no-agent-forwarding,no-pty,no-user-rc,no-X11-forwarding,no-port-forwarding ssh-rsa AAA...'
Jul 19 16:04:08 eu1-cloud-worker-03 sh[26456]: [info] Adding key 'command="/usr/local/bin/gitreceive run nathan ff:5b:e0:6e:f6:b3:f0:21:19:57:4b:5d:eb:43:e0:ac",no-agent-forwarding,no-pty,no-user-rc,no-X11-forwarding,no-port-forwarding ssh-rsa AAA....
Jul 19 16:04:08 eu1-cloud-worker-03 sh[26456]: [info] Key accepted for user git.
Jul 19 16:04:08 eu1-cloud-worker-03 sh[26456]: [info] Channel type: session
Jul 19 16:04:08 eu1-cloud-worker-03 sh[26456]: [debug] Key='LC_PAPER', Value='nl_BE.UTF-8'
Jul 19 16:04:08 eu1-cloud-worker-03 sh[26456]: [debug] Key='LC_MONETARY', Value='nl_BE.UTF-8'
Jul 19 16:04:08 eu1-cloud-worker-03 sh[26456]: [debug] Key='LC_NUMERIC', Value='nl_BE.UTF-8'
Jul 19 16:04:08 eu1-cloud-worker-03 sh[26456]: [debug] Key='XMODIFIERS', Value='@im=ibus'
Jul 19 16:04:08 eu1-cloud-worker-03 sh[26456]: [debug] Key='LANG', Value='nl_BE.UTF-8'
Jul 19 16:04:08 eu1-cloud-worker-03 sh[26456]: [debug] Key='LC_MEASUREMENT', Value='nl_BE.UTF-8'
Jul 19 16:04:08 eu1-cloud-worker-03 sh[26456]: [debug] Key='LC_TIME', Value='nl_BE.UTF-8'
Jul 19 16:04:08 eu1-cloud-worker-03 sh[26456]: [info] Checking user /deis/builder/users/Kristof
Jul 19 16:04:08 eu1-cloud-worker-03 sh[26456]: [info] Checking user /deis/builder/users/nathan
Jul 19 16:04:08 eu1-cloud-worker-03 sh[26456]: [info] Found user nathan for fingerprint 87:ad:87:10:89:2e:64:d9:43:11:2f:60:3b:93:fe:7f
Jul 19 16:04:08 eu1-cloud-worker-03 sh[26456]: [debug] Directory '/home/git/bplpp-production-flexpoint.git' already exists.
Jul 19 16:04:08 eu1-cloud-worker-03 sh[26456]: [info] git-shell -c git-receive-pack 'bplpp-production-flexpoint.git'
Jul 19 16:04:08 eu1-cloud-worker-03 sh[26456]: Waiting for git-receive to run.
Jul 19 16:04:08 eu1-cloud-worker-03 sh[26456]: Waiting for deploy.
Jul 19 16:04:32 eu1-cloud-worker-03 sh[26456]: [info] Deploy complete.

@nathansamson
Copy link
Contributor Author

Docker version in CorOS

docker -v
Docker version 1.10.3, build 3cd164c

Docker version in deis-builder
bash-4.3# docker -v
Docker version 1.8.3, build f4bf5c7

@bacongobbler
Copy link
Member

I'm not quite sure what is causing this error, however it looks related to Docker so that would be the first place I'd start to look. Can you check the logs from the docker daemon on the host?

@nathansamson
Copy link
Contributor Author

nathansamson commented Jul 19, 2016

Jul 19 15:32:43 eu1-cloud-worker-03 dockerd[1978]: time="2016-07-19T15:32:43.663706148Z" level=error msg="Handler for POST /v1.22/containers/44c8f32b6c28bf06d7984be405eb3265bebc5ba766683654834ef485b9a9bf36/kill returned error: Cannot kill container 44c8f32b6c28bf06d7984be405eb3265bebc5ba766683654834ef485b9a9bf36: Container 44c8f32b6c28bf06d7984be405eb3265bebc5ba766683654834ef485b9a9bf36 is not running"
Jul 19 15:32:43 eu1-cloud-worker-03 dockerd[1978]: time="2016-07-19T15:32:43.665918162Z" level=error msg="Handler for POST /v1.22/containers/44c8f32b6c28bf06d7984be405eb3265bebc5ba766683654834ef485b9a9bf36/kill returned error: Cannot kill container 44c8f32b6c28bf06d7984be405eb3265bebc5ba766683654834ef485b9a9bf36: Container 44c8f32b6c28bf06d7984be405eb3265bebc5ba766683654834ef485b9a9bf36 is not running"
Jul 19 15:32:47 eu1-cloud-worker-03 dockerd[1978]: time="2016-07-19T15:32:47.270141042Z" level=error msg="Handler for GET /v1.22/containers/deis-builder/json returned error: No such container: deis-builder"
Jul 19 15:32:47 eu1-cloud-worker-03 dockerd[1978]: time="2016-07-19T15:32:47.271206791Z" level=error msg="Handler for GET /v1.22/images/deis-builder/json returned error: No such image: deis-builder"
Jul 19 15:56:44 eu1-cloud-worker-03 dockerd[1978]: time="2016-07-19T15:56:44.668206187Z" level=info msg="Container 8dcb84f89ef9a66848dc8f96c6a01146717aeea25ff2553406ece61f1a9c801c failed to exit within 10 seconds of SIGTERM - using the force"
Jul 19 15:56:44 eu1-cloud-worker-03 dockerd[1978]: time="2016-07-19T15:56:44.707906138Z" level=info msg="Container a558ecba094ed407b8328e0eec35c163b3475626dd99935cc750a964447a6055 failed to exit within 10 seconds of SIGTERM - using the force"
Jul 19 15:56:45 eu1-cloud-worker-03 dockerd[1978]: time="2016-07-19T15:56:45.085191730Z" level=error msg="Handler for POST /v1.22/containers/8dcb84f89ef9a66848dc8f96c6a01146717aeea25ff2553406ece61f1a9c801c/kill returned error: Cannot kill container 8dcb84f89ef9a66848dc8f96c6a01146717aeea25ff2553406ece61f1a9c801c: Container 8dcb84f89ef9a66848dc8f96c6a01146717aeea25ff2553406ece61f1a9c801c is not running"
Jul 19 15:56:45 eu1-cloud-worker-03 dockerd[1978]: time="2016-07-19T15:56:45.086714090Z" level=error msg="Handler for POST /v1.22/containers/8dcb84f89ef9a66848dc8f96c6a01146717aeea25ff2553406ece61f1a9c801c/kill returned error: Cannot kill container 8dcb84f89ef9a66848dc8f96c6a01146717aeea25ff2553406ece61f1a9c801c: Container 8dcb84f89ef9a66848dc8f96c6a01146717aeea25ff2553406ece61f1a9c801c is not running"
Jul 19 15:56:45 eu1-cloud-worker-03 dockerd[1978]: time="2016-07-19T15:56:45.109998742Z" level=error msg="Handler for POST /v1.22/containers/a558ecba094ed407b8328e0eec35c163b3475626dd99935cc750a964447a6055/kill returned error: Cannot kill container a558ecba094ed407b8328e0eec35c163b3475626dd99935cc750a964447a6055: Container a558ecba094ed407b8328e0eec35c163b3475626dd99935cc750a964447a6055 is not running"
Jul 19 15:56:45 eu1-cloud-worker-03 dockerd[1978]: time="2016-07-19T15:56:45.110475577Z" level=error msg="Handler for POST /v1.22/containers/a558ecba094ed407b8328e0eec35c163b3475626dd99935cc750a964447a6055/kill returned error: Cannot kill container a558ecba094ed407b8328e0eec35c163b3475626dd99935cc750a964447a6055: Container a558ecba094ed407b8328e0eec35c163b3475626dd99935cc750a964447a6055 is not running"

Combined with (from the builder itself)

Jul 19 15:33:54 eu1-cloud-worker-03 sh[26456]: Waiting for deploy.
Jul 19 15:56:38 eu1-cloud-worker-03 sh[26456]: [info] Deploy complete.

I suspect the containers it tried to kill where the older versions of the app (although I fail to see how as I don't see those in a docker ps -a on the host)

Other logs from around the successfull deploy

Jul 19 15:56:34 eu1-cloud-worker-03 fleetd[1402]: INFO manager.go:138: Triggered systemd unit bplpp-production_v15.web.1.service stop: job=146048
Jul 19 15:56:34 eu1-cloud-worker-03 fleetd[1402]: INFO manager.go:259: Removing systemd unit bplpp-production_v15.web.1.service
Jul 19 15:56:34 eu1-cloud-worker-03 systemd[1]: Stopping bplpp-production_v15.web.1...
Jul 19 15:56:34 eu1-cloud-worker-03 fleetd[1402]: INFO manager.go:138: Triggered systemd unit bplpp-production_v15.web.3.service stop: job=146049
Jul 19 15:56:34 eu1-cloud-worker-03 fleetd[1402]: INFO manager.go:259: Removing systemd unit bplpp-production_v15.web.3.service
Jul 19 15:56:34 eu1-cloud-worker-03 systemd[1]: Stopping bplpp-production_v15.web.3...
Jul 19 15:56:34 eu1-cloud-worker-03 fleetd[1402]: INFO manager.go:182: Instructing systemd to reload units
Jul 19 15:56:34 eu1-cloud-worker-03 systemd[1]: Reloading.
Jul 19 15:56:34 eu1-cloud-worker-03 fleetd[1402]: INFO reconcile.go:330: AgentReconciler completed task: type=UnloadUnit job=bplpp-production_v15.web.1.service reason="unit loaded but not sch
Jul 19 15:56:34 eu1-cloud-worker-03 fleetd[1402]: INFO reconcile.go:330: AgentReconciler completed task: type=UnloadUnit job=bplpp-production_v15.web.3.service reason="unit loaded but not sch
Jul 19 15:56:34 eu1-cloud-worker-03 fleetd[1402]: INFO reconcile.go:330: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"
Jul 19 15:56:34 eu1-cloud-worker-03 systemd[1]: Starting Garbage Collection for rkt...
Jul 19 15:56:34 eu1-cloud-worker-03 systemd[1]: Started Garbage Collection for rkt.

Although that looks not very suspicious to me

Also nothing that stands out in all the logs around the first failure of a new build

@nathansamson
Copy link
Contributor Author

would it help to modify the deis files so deis builder' docker starts up debug mode?

(If you can give me guidance on this that would also be helpfull)

@bacongobbler
Copy link
Member

bacongobbler commented Jul 19, 2016

Ah I forgot that builder in v1 starts docker-in-docker rather than mounting the host socket. Try grabbing the docker logs from within builder :)

would it help to modify the deis files so deis builder' docker starts up debug mode?

That might be useful. Docker is booted in builder here.

@nathansamson
Copy link
Contributor Author

bash-4.3# docker ps -a
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES
1e264d97e840        deis/slugbuilder    "/tmp/builder/build.s"   42 minutes ago      Created                                 desperate_curie
840c8afa4602        deis/slugbuilder    "/tmp/builder/build.s"   About an hour ago   Created                                 distracted_cori
ea489de51b6f        deis/slugbuilder    "/tmp/builder/build.s"   About an hour ago   Created                                 stoic_swartz
f3f0a508c6db        deis/slugbuilder    "/tmp/builder/build.s"   About an hour ago   Created                                 adoring_bhabha
bash-4.3# docker logs 1e264d97e840
bash-4.3# ls         

What you see are the failed containers which could be started for the builds, but the containers itself did nog log anything (as they could not be started)

bash-4.3# docker -l=debug -D start 1e264d97e840
Error response from daemon: Cannot start container 1e264d97e840: [8] System error: mkdir /sys/system.slice: operation not permitted
Error: failed to start containers: [1e264d97e840]
bash-4.3# 

Which does not help much I guess (it is just the same error we already know is happening)

Can yu explain what you mean with "Try grabbing the docker logs from within builder :)"

I might try to setup a test image for deis builder with debug enabled and deploy that one.
I assume after building and publishing I can just do

deisctl config builder set image=myaccount/myimage:latest
deisctl restart deis-builder # I am not sure if this step is needed

@nathansamson
Copy link
Contributor Author

I made a debugging image but that did not seem to help much (not actually much got logged; I might have been looking on the wrong places though)

I have found out one intreseting thing though

At first I could not reproduce the issue and I was able to do multiple deploys in a row of the same app

This was a relatively small rails app which can be built in minutes (instead of nearly 20 minutes for my bigger app)

After deploying my bigger app successive deploys started to fail; also the ones for the smaller app.

Extra info though. I am pretty sure the old (apparantly working fine) cluster was " 1010.6.0", major updates to 1068.6.0 are the rkt, kernel and systemd.

@nathansamson
Copy link
Contributor Author

I've seen this problem as well on #5068
any ideas for a solution? (as updating th docker in docker script did not seem to help)

I've seen similar issues with like eg. kubernetes/kubernetes#28192 that mention the cgroupdriver, might this be something to look into?

@bacongobbler
Copy link
Member

This is not my area of expertise, so I do not know how to move forward with this. Someone at Docker may know more though. :)

@nathansamson
Copy link
Contributor Author

I am looking into using the dind docker image as a base instead of manually copying in the dind script (which seems to be deprecated anyway)

@mboersma
Copy link
Member

Closed by #5074. Thanks again @nathansamson!

@leroix
Copy link

leroix commented Oct 10, 2016

Even after updating to 1.13.3, I'm still hitting this issue.

core@deis2213712 ~ $ cat /etc/os-release 
NAME=CoreOS
ID=coreos
VERSION=1068.8.0
VERSION_ID=1068.8.0
BUILD_ID=2016-07-18-0616
PRETTY_NAME="CoreOS 1068.8.0 (MoreOS)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"
core@deis2213712 ~ $ deisctl --version
1.13.3
core@deis2213712 ~ $ docker --version
Docker version 1.10.3, build 3cd164c
core@deis2213712 ~ $ docker ps | grep builder
51b62e4de616        rafflecopter/deis-builder:latest           "/bin/entry /bin/boot"   2 days ago          Up 2 days           0.0.0.0:2223->2223/tcp, 2375/tcp                                                           deis-builder
core@deis2213712 ~ $ docker exec -it 51b62e4de616 /bin/bash  
bash-4.3# docker -v
Docker version 1.8.3, build f4bf5c7

@nathansamson
Copy link
Contributor Author

Note that our is not enough to just upgrade deis but also to reprovision
the cluster as critical info has been changed in the user data file.

Fyi I have had not a single issue since this fix

Nathan

On 10 Oct 2016 7:08 pm, "leroix" notifications@github.com wrote:

Even after updating to 1.13.3, I'm still hitting this issue.

core@deis2213712 ~ $ cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=1068.8.0
VERSION_ID=1068.8.0
BUILD_ID=2016-07-18-0616
PRETTY_NAME="CoreOS 1068.8.0 (MoreOS)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"
core@deis2213712 ~ $ deisctl --version
1.13.3
core@deis2213712 ~ $ docker --version
Docker version 1.10.3, build 3cd164c
core@deis2213712 ~ $ docker ps | grep builder
51b62e4de616 rafflecopter/deis-builder:latest "/bin/entry /bin/boot" 2 days ago Up 2 days 0.0.0.0:2223->2223/tcp, 2375/tcp deis-builder
core@deis2213712 ~ $ docker exec -it 51b62e4de616 /bin/bash
bash-4.3# docker -v
Docker version 1.8.3, build f4bf5c7


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#5067 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AAFuJ4duyHLko1AoQGdHOedyiOoM8kuPks5qynDvgaJpZM4JP4LN
.

@leroix
Copy link

leroix commented Oct 10, 2016

Any idea if there's a way to incorporate those changes without reprovisioning?

@bacongobbler
Copy link
Member

You can take a look at http://docs.deis.io/en/latest/managing_deis/upgrading-deis/#upgrading-coreos to upgrade your hosts.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants