
concourse worker making excessive writes to disk #2906

Open

natalieparellano opened this issue Dec 3, 2018 · 3 comments

natalieparellano commented Dec 3, 2018

Bug Report

We have a Linux worker running in vsphere that seems to be making excessive writes to disk - averaging 50,000 Kbps. Redeploying the worker fixed the issue for perhaps 24 hours (bringing disk writes down to ~7,000 Kbps) before the rate spiked again.

Steps to Reproduce

We are using the following ops files from concourse-bosh-deployment:

cluster/external-worker.yml
cluster/operations/worker-ephemeral-disk.yml
cluster/operations/windows-worker.yml
cluster/operations/windows-worker-ephemeral-disk.yml

At some point we removed our Windows worker to see whether it was contributing, but the high write rate persisted.

When we SSH onto the worker and run iotop, the offending process appears to be writing to the loop0 device:

# iotop -o -d 5 -a
[worker-iops screenshot: iotop output showing accumulated writes attributed to the loop0 device]

Also relevant, loop0 is being used by baggageclaim:

$ mount | grep loop0
/dev/loop0 on /var/vcap/data/baggageclaim/volumes type btrfs (rw,relatime,space_cache,subvolid=5,subvol=/)
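As a cross-check, the loop0 write rate can also be estimated straight from /proc/diskstats, which works even when iotop isn't installed (a rough sketch; it assumes the standard diskstats layout, where field 10 is cumulative 512-byte sectors written):

$ before=$(awk '$3 == "loop0" {print $10}' /proc/diskstats)
$ sleep 10
$ after=$(awk '$3 == "loop0" {print $10}' /proc/diskstats)
$ echo "loop0 wrote $(( (after - before) * 512 / 1024 / 10 )) KiB/s over the last 10s"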

Expected Results

50,000 Kbps seems very high; we were expecting much lower write activity. This worker is only running a cf push every 5 minutes and a curl every minute. It's hard to imagine what it could all be writing to disk!

Actual Results

Version Info

  • Concourse version: 4.2.1
  • Deployment type (BOSH/Docker/binary): bosh
  • Infrastructure/IaaS: vsphere
  • Browser (if applicable):
  • Did this used to work? We don't know :( Actually, it did work at first - the worker was up for 30+ days before we started seeing the issue (judging from performance charts in vCenter). However, when we did a fresh redeploy, the issue recurred within ~24 hours.
natalieparellano (Author) commented:

Some more things we learned:

  • We see a bunch of log lines with baggageclaim.repository.destroy-volume.failed-to-destroy and a bunch of dead volumes in /var/vcap/data/baggageclaim/volumes/dead - maybe this is related to our issue? Unfortunately, we don't have logs for a time period where we also know exactly when the spike in disk writes started (see the inspection sketch below).
  • We don't see an excessive number of lines printed in baggageclaim.stdout.log even while the spike in disk writes is occurring (the time span covered by each log file roughly matches what we see when there is no spike)

Can you suggest other data we should be collecting, or other things to look for? We are starting to impact other teams in the data center with our excessive disk usage so we would like to get to the bottom of this.
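For reference, a rough sketch of the checks that can be run on the worker (the log directory below is an assumption based on a standard BOSH layout; adjust it to wherever baggageclaim.stdout.log lives on your stemcell):

# ls /var/vcap/data/baggageclaim/volumes/dead | wc -l          # dead volumes still on disk
# btrfs subvolume list /var/vcap/data/baggageclaim/volumes     # subvolumes backing the store on loop0
# grep -c failed-to-destroy /var/vcap/sys/log/worker/baggageclaim.stdout.log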

vito added the triage label Jan 9, 2019
samgurtman-zz commented:

We are having somewhat similar issues on 4.2.2. I'm not sure if it's the same problem, but we also get various failed-to-destroy and other volume-management errors, which cause the worker to become stalled.

vito (Member) commented Jan 15, 2019

Could you try switching the volume driver to overlay?
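In case it's useful, a minimal ops-file sketch for making that switch with concourse-bosh-deployment (the baggageclaim/driver property path and the worker instance-group name here are assumptions - check the worker job spec for your Concourse release version):

$ cat > set-overlay-driver.yml <<'EOF'
- type: replace
  path: /instance_groups/name=worker/jobs/name=worker/properties/baggageclaim/driver?
  value: overlay
EOF

Pass it to bosh deploy with -o alongside the existing ops files; the workers will likely need to be recreated so the volume store is rebuilt with the new driver.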

Just a quick note: this sounds more like a support request than a discrete bug report - I would suggest using the forums or our Discord channel for cases like this where more investigation is required and the behavior doesn't seem particularly related to Concourse code. This just makes our lives easier and keeps the issues from being an ever-growing backlog. 🙂 Thanks for doing the investigation you've done so far, though!

vito removed the triage label Dec 9, 2019