Bug: Volume creation issues for Concourse 3.3.3 #1419
We run a large deployment, with an average of 1000 active containers. This is a bosh deployment on AWS.
Our deployment had been running v2.7.3 successfully for a good length of time. The issue we want to report occurred when migrating the deployment to v3.3.3.
After performing the upgrade successfully, the deployment was stable for a few hours, but the following error started to occur in various jobs, tasks, and pipelines:
The rest of the information I can provide is the many errors found in the logs of the various components in the deployment. I have not tried to correlate specific timestamps across the different components (baggageclaim, garden, atc), because there is just a giant block of thousands of these errors that started immediately after upgrading to 3.3.3 (there is no log history above these errors to help).
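If correlating timestamps across the components ever becomes worthwhile, a minimal sketch is below. It assumes the logs are in lager's JSON format (one object per line with `timestamp`, `source`, and `message` fields); the sample entries are made up for illustration and are not taken from the attached logs.

```python
import json
from datetime import datetime, timezone

# Hypothetical sample entries; real baggageclaim/garden/atc logs may
# carry additional fields, and these message names are invented.
sample_lines = [
    '{"timestamp":"1500000012.500000000","source":"atc","message":"atc.create-container.failed","log_level":2}',
    '{"timestamp":"1500000000.000000000","source":"baggageclaim","message":"baggageclaim.create-volume.failed","log_level":2}',
]

def correlate(lines):
    """Merge log entries from multiple components, ordered by epoch timestamp."""
    entries = [json.loads(line) for line in lines]
    entries.sort(key=lambda e: float(e["timestamp"]))
    return [
        (datetime.fromtimestamp(float(e["timestamp"]), tz=timezone.utc).isoformat(),
         e["source"],
         e["message"])
        for e in entries
    ]

for ts, source, message in correlate(sample_lines):
    print(ts, source, message)
```

In practice you would feed it the concatenated lines of all three components' logs and scan the merged stream for the first error.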
Because of this problem, we were forced to downgrade to v2.7.3.
Baggageclaim on a worker logged many thousands of this error, for many volume IDs:
as well as many thousands of this error, for many volume IDs:
From the garden logs, many thousands of this error for many handles (what are handles in this case?):
From the ATC, many thousands of this error for many container IDs:
Web Worker Logs
I've attached the complete logs for a single worker and for the web worker holding the ATC. I can provide more if desired. To extract, do this:
P.S. JTArchie, thank you for your help on this so far, we very much appreciate it. I was asked to put this in a separate issue.
@kmacoskey was this an in-place upgrade to 3.3.3, or did you use a clean database and reset your pipelines? I have encountered the same thing going from 3.3.1 to 3.3.3 and have moved back to 3.3.1. I have not tried the upgrade on a new DB with reset pipelines at this stage.
I am also running into this problem. In my case I am not upgrading versions; it's a fresh installation (version 3.4.0). And I am running this in a k8s cluster.
I am reading through this issue, and I am a bit confused about the symptoms you are seeing. In the original bug report:
but you are reporting the following errors:
Are you also seeing the original
@doty-pivotal yes, I am seeing the original
We are experiencing the same/similar issue. I don't think it is upgrade related (it appeared weeks after the 3.2.1 -> 3.3.1 upgrade).
The logs show these errors:
We get this on only one job; the other ~dozen do not show similar symptoms.
Seeing the same error on 3.6.0. I have multiple tasks (9 in total) running
Concourse version: 3.6.0
On manually triggering the job after getting these messages, it completed without issue (although I've had it fail with the same error on manual triggers before). I.e., it's an error that sometimes happens and sometimes does not (on the same pipeline). Also, looking at the server logs, I can see that they are basically being spammed with the following errors as well:
By spammed I mean that several messages (9-14) like the one above are logged at 30-second intervals. It seems like it might be related, so I figured it was worth including.
Some additional info:
Each task (9 in total) in my
name            containers  platform  tags  team  state    version
ip-10-9-15-172  30          linux     none  none  running  1.2
ip-10-9-19-225  15          linux     none  none  running  1.2
ip-10-9-21-224  8           linux     none  none  running  1.2
And after the job finally succeeded:
name            containers  platform  tags  team  state    version
ip-10-9-15-172  16          linux     none  none  running  1.2
ip-10-9-19-225  8           linux     none  none  running  1.2
ip-10-9-21-224  2           linux     none  none  running  1.2
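For anyone comparing the two listings above, a small sketch to tally the `containers` column; the whitespace-separated layout is assumed from the output shown here, not from any documented format.

```python
# The "after" listing from above, pasted verbatim as a sample.
after_listing = """\
name            containers  platform  tags  team  state    version
ip-10-9-15-172  16          linux     none  none  running  1.2
ip-10-9-19-225  8           linux     none  none  running  1.2
ip-10-9-21-224  2           linux     none  none  running  1.2
"""

def total_containers(listing: str) -> int:
    """Sum the containers column across all workers in a listing."""
    rows = listing.strip().splitlines()[1:]  # skip the header row
    return sum(int(row.split()[1]) for row in rows)

print(total_containers(after_listing))  # 16 + 8 + 2 = 26
```

So even after the job succeeded, the cluster still reports 26 containers in total.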
In the last listing of workers, I'm a bit surprised to see so many containers listed, considering that nothing is actually running. When I ssh into e.g.
Worth noting that I am using caches for the tasks (in a folder called caches, where I move whatever I want cached). I have two different groups in the pipeline, one which tests pull requests and caches the .terraform folder (by moving it to
Not sure what's going on here or how to proceed, but Concourse is basically unusable until this is fixed (I can't trust that jobs will actually run). Happy to provide additional info and/or help debug this any way I can.
In case it is related
And then look up the volume in
In total I have 9 volumes like this: unknown type and n/a as the identifier.
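A sketch of how I'd pick those suspect volumes out of a volume listing programmatically. The column names (handle, worker, type, identifier) and the sample handles below are assumptions for illustration, not copied from my actual output.

```python
# Hypothetical volume listing; only the second row matches the
# "unknown type, n/a identifier" pattern described above.
listing = """\
handle                                worker          type     identifier
aaaaaaaa-1111-2222-3333-444444444444  ip-10-9-15-172  cow      ref:abcdef
bbbbbbbb-5555-6666-7777-888888888888  ip-10-9-15-172  unknown  n/a
"""

def suspect_volumes(text: str):
    """Return handles of volumes with unknown type and n/a identifier."""
    rows = [line.split(None, 3) for line in text.strip().splitlines()[1:]]
    return [handle for handle, _worker, vtype, ident in rows
            if vtype == "unknown" and ident == "n/a"]

print(suspect_volumes(listing))
```

Running that against my real listing returns the 9 handles mentioned above.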
Faced this problem on 4.2.1 as well (installed from the Helm chart with a managed Postgres DB).