Performance regression with 'overlay' driver and privileged containers #1404
We are noticing that various tasks hang for 1-10 minutes before executing. (The UI doesn't show the 'loading' icon spinning; it just hangs.)
Things we have tried:
This was also seen by @krishicks in Slack.
I've seen this as well. I was just on Slack and was directed here.
Concourse configuration: binary deployment to AWS clusters. 2 ATCs, 2 Linux workers, and 2 Windows workers. Internal LB for the workers to talk to the ATCs, external LB for hitting the UI. CodeDeploy over regular AWS AMIs.
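For reference, this is roughly how the pieces are wired together. The commands below are only a sketch of a binary deployment like ours; hostnames, credentials, and key paths are placeholders, and the exact flag names may differ between Concourse versions:

```bash
# Web/ATC nodes, fronted by the external LB (placeholder values throughout):
concourse web \
  --external-url https://ci.example.com \
  --postgres-data-source postgres://concourse:password@db.internal:5432/atc \
  --session-signing-key /etc/concourse/session_signing_key \
  --tsa-host-key /etc/concourse/tsa_host_key \
  --tsa-authorized-keys /etc/concourse/authorized_worker_keys

# Worker nodes, registering with the ATCs via the internal LB:
concourse worker \
  --work-dir /opt/concourse/worker \
  --tsa-host atc-internal.example.com \
  --tsa-public-key /etc/concourse/tsa_host_key.pub \
  --tsa-worker-private-key /etc/concourse/worker_key
```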
Slack discussion begins here: https://concourseci.slack.com/archives/C07RY25QF/p1500654086732765
As per the Slack discussion, I tend to see this leading up to problems with the Linux workers. The number of volumes grows until the workers just start giving up, and then I end up having to rebuild them.
Happy to try and get any other information that may be required when it's available.
I was using the following and DID NOT see this issue:
I recently upgraded to Concourse 3.3.4, and I DO see this issue.
I upgraded to Concourse 3.1.1 in between 3.0.1 and 3.3.4 and I don't believe I saw the issue, but I could be mistaken.
Note: we are using the default baggageclaim driver (overlayfs) in 3.3.4.
It seems like certain jobs are affected by it more than others.
I have one job that builds a Docker image (see below).
All the tasks run quickly, with no lag... but when it gets to the docker-image-resource "put" step for "app-image", it takes 6+ minutes for it to start spinning...
During that 6+ minutes, it visually just looks like it has not started.
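One thing that might help narrow down whether anything is actually running during that window is to hijack into the put step's container while it looks hung. The target, pipeline, job, and step names below are just placeholders for my setup:

```bash
# Hijack into the container backing the apparently-hung put step
# (target/pipeline/job/step names are placeholders):
fly -t my-target intercept -j my-pipeline/build-app-image -s app-image

# Once inside, check whether anything is running yet:
ps aux
```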
I also see that there is a container listed for that task when I do a fly containers.
During this 6+ minutes, I did a "watch" on fly volumes for the container handle that is associated with that put task. I see a few entries in fly containers for that handle... then after about 6 minutes, I do see a new volume entry show up in the fly volumes output.
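For completeness, this is roughly what I was watching; the target name and handle are placeholders:

```bash
# Find the container handle for the put step:
fly -t my-target containers | grep app-image

# Watch volumes for that handle; a new volume entry only showed up
# after roughly 6 minutes:
watch -n 5 "fly -t my-target volumes | grep <container-handle>"
```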
It would be great to get this fixed. Also, if there is anything I could look into (logs or anything else to help track this down), that would be appreciated.
We got a lot of gains with the new overlay driver, but it comes with a cost for privileged containers.
With the switch to overlay, the work that has to happen before a privileged step (like the docker-image resource's put) can start is much more expensive.
Unfortunately, with container tech as it is today, you either get instability (btrfs) or slowness (overlay with privileged containers).
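Roughly why the privileged case hurts on overlay (the exact mechanism in baggageclaim may differ from this simplified picture): a volume handed to a privileged container has to have its ownership rewritten for that container's uid mapping, and on overlayfs changing ownership of a lower-layer file forces a copy-up of the whole file, so a multi-GB image is effectively copied file by file before the step starts. A rough illustration, run as root with throwaway paths:

```bash
# Demonstrate overlayfs copy-up cost: a recursive chown duplicates every
# lower-layer file into the upper layer (paths are throwaway examples).
mkdir -p /tmp/ovl/{lower,upper,work,merged}
dd if=/dev/zero of=/tmp/ovl/lower/big-layer bs=1M count=1024   # a 1 GiB "image layer"
mount -t overlay overlay \
  -o lowerdir=/tmp/ovl/lower,upperdir=/tmp/ovl/upper,workdir=/tmp/ovl/work \
  /tmp/ovl/merged

du -sh /tmp/ovl/upper                        # ~0: nothing copied up yet
time chown -R 100000:100000 /tmp/ovl/merged  # metadata change triggers full copy-up
du -sh /tmp/ovl/upper                        # now ~1 GiB: the file data was duplicated
```

On btrfs the same chown only rewrites metadata on a copy-on-write snapshot, which is why it is fast there, at the cost of btrfs's stability problems.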
Here are a few paths forward:
I'd prefer the instability over the slowness.
This slowness means we end up waiting 3-5 minutes regularly for tasks to start.
When you add to that the fact that volumes in 3.0.1+ often get reaped sooner than they should, and we have to download 7GB files again from the Internet, the time it takes jobs to run becomes absurd.
Then you can configure the driver back to btrfs. Is there an issue for the volume reaping bug? This is the first I've heard of it.
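For anyone who wants to try that on a binary deployment: the driver is a worker flag, with a matching environment variable. Double-check the exact names against concourse worker --help for your version; the values below are my best recollection, and switching drivers generally means recreating the worker's work dir.

```bash
# Start the worker with the btrfs baggageclaim driver instead of overlay
# (paths and hostnames are placeholders; clear the work dir first so no
#  volumes from the previous driver are left behind):
concourse worker \
  --work-dir /opt/concourse/worker \
  --baggageclaim-driver btrfs \
  --tsa-host atc-internal.example.com \
  --tsa-public-key /etc/concourse/tsa_host_key.pub \
  --tsa-worker-private-key /etc/concourse/worker_key

# Equivalent environment variable:
#   CONCOURSE_BAGGAGECLAIM_DRIVER=btrfs
```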
Title changed to "Performance regression with 'overlay' driver and privileged containers" (Oct 2, 2017)
We did see improvement when switching back. I had forgotten about this issue and realized it was affecting me on a new Concourse deploy. Thanks for reminding me to switch drivers!
On Thu, Dec 28, 2017 at 09:56, Timothy R. Chavez wrote: "Did the folk that switched back to btrfs notice a performance improvement? Have you noticed increased instability? cc: @krishicks, @jadekler. FWIW the slowness *and* the lack of feedback drive dev-folk here bonkers."
For anyone not following along in #1966, we found and fixed the source of a lot of the slowness.
Is it possible that, for some reason, your worker falls into the