Deis builder fails (quite often) with mkdir /sys/system.slice: operation not permitted #5067
Comments
This looks similar to kubernetes/kubernetes#18202. What version of kubernetes are you running?
This is on Deis V1 so I am not using a version of kubernetes :) My CoreOS version is 1068.6.0
Just more complete logs in case this is needed
Docker version in CoreOS (docker -v) and Docker version in deis-builder
I'm not quite sure what is causing this error; however, it looks related to Docker, so that would be the first place I'd start to look. Can you check the logs from the docker daemon on the host?
Combined with (from the builder itself)
I suspect the containers it tried to kill were the older versions of the app (although I fail to see how, as I don't see those in a docker ps -a on the host). Other logs from around the successful deploy
Although that does not look very suspicious to me. Also, nothing stands out in the logs around the first failure of a new build.
Would it help to modify the deis files so that deis builder's docker starts up in debug mode? (If you can give me guidance on this, that would also be helpful.)
Ah I forgot that builder in v1 starts docker-in-docker rather than mounting the host socket. Try grabbing the docker logs from within builder :)
That might be useful. Docker is booted in builder here.
What you see are the failed containers which were supposed to be started for the builds, but the containers themselves did not log anything (as they could not be started)
Which does not help much, I guess (it is just the same error we already know is happening). Can you explain what you mean by "Try grabbing the docker logs from within builder :)"? I might try to set up a test image for deis builder with debug enabled and deploy that one.
I made a debugging image, but that did not seem to help much (not much actually got logged; I might have been looking in the wrong places though).
I have found out one interesting thing though. At first I could not reproduce the issue, and I was able to do multiple deploys in a row of the same app. This was a relatively small Rails app which can be built in minutes (instead of nearly 20 minutes for my bigger app). After deploying my bigger app, successive deploys started to fail, including the ones for the smaller app.
Extra info though: I am pretty sure the old (apparently working fine) cluster was on "1010.6.0"; the major updates in 1068.6.0 are rkt, the kernel and systemd.
I've seen this problem as well in #5068. I've seen similar issues, e.g. kubernetes/kubernetes#28192, that mention the cgroup driver. Might this be something to look into?
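To compare the cgroup driver between the host Docker and the Docker inside the builder, something like the following could help. This is only a sketch: it assumes a Docker version whose docker info output contains a "Cgroup Driver:" line (older releases may not print it, in which case the function returns None), and the function names are made up for illustration.

```python
import re
import subprocess
from typing import Optional


def parse_cgroup_driver(docker_info_text: str) -> Optional[str]:
    """Return the value of the 'Cgroup Driver:' line, or None if it is absent."""
    match = re.search(r"^\s*Cgroup Driver:\s*(\S+)", docker_info_text, re.MULTILINE)
    return match.group(1) if match else None


def current_cgroup_driver() -> Optional[str]:
    """Run `docker info` against whichever daemon the local client talks to."""
    output = subprocess.run(
        ["docker", "info"], capture_output=True, text=True, check=True
    ).stdout
    return parse_cgroup_driver(output)
```

Running current_cgroup_driver() once on the CoreOS host and once inside the builder container would show whether the two daemons disagree.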
This is not my area of expertise, so I do not know how to move forward with this. Someone at Docker may know more though. :)
I am looking into using the dind docker image as a base instead of manually copying in the dind script (which seems to be deprecated anyway)
Closed by #5074. Thanks again @nathansamson!
Even after updating to 1.13.3, I'm still hitting this issue.
Note that it is not enough to just upgrade Deis; you also need to reprovision. FYI, I have not had a single issue since this fix. Nathan
On 10 Oct 2016 7:08 pm, "leroix" notifications@github.com wrote:
Any idea if there's a way to incorporate those changes without reprovisioning?
You can take a look at http://docs.deis.io/en/latest/managing_deis/upgrading-deis/#upgrading-coreos to upgrade your hosts.
Since I set up a new cluster on DigitalOcean with Deis V1.13.2 and CoreOS 1068.6.0, the deis builder has been quite unstable.
It errors out with "System error: mkdir /sys/system.slice: operation not permitted" when deploying an app.
Restarting the builder generally helps, but this removes the deploy cache, which in our case is quite necessary (building from scratch, including compiling assets and installing gems, takes 15-20 minutes).
I basically have to restart the builder after every build.
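For anyone stuck with the same workaround, here is a rough sketch of how the restart could be limited to builds that actually hit this error, so the deploy cache survives the builds that succeed. It assumes you capture the deploy output yourself and that deisctl is on the PATH; the function names are hypothetical, and the only string taken from this issue is the error message itself.

```python
import subprocess

# Error text as quoted in this issue; nothing else about the log format is assumed.
CGROUP_MKDIR_ERROR = "mkdir /sys/system.slice: operation not permitted"


def failed_with_cgroup_error(deploy_output: str) -> bool:
    """True if the captured deploy output contains the error from this issue."""
    return CGROUP_MKDIR_ERROR in deploy_output


def restart_builder_if_needed(deploy_output: str) -> bool:
    """Restart the builder only when the known error shows up; returns True if restarted."""
    if not failed_with_cgroup_error(deploy_output):
        return False
    # Assumption: `deisctl restart builder` is how the builder gets restarted on this cluster.
    subprocess.run(["deisctl", "restart", "builder"], check=True)
    return True
```

This does not fix the underlying problem; it only avoids throwing away the cache when a build went through fine.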
The node running the builder has more than enough RAM (16GB total, 1GB free, or even 4GB available when counting buffers/cache); restarting the builder does not increase this amount significantly.
A very similar cluster set up 2 weeks ago did not exhibit this problem. The only difference I can imagine is the CoreOS version (but sadly, with DO you get whatever the stable release is when you create the cluster).
This older cluster has since been removed, so I can't double-check the CoreOS version, but it was very similar.
The only differences I can think of are