
concourse worker making excessive writes to disk #2906

Open

natalieparellano opened this issue Dec 3, 2018 · 3 comments

natalieparellano commented Dec 3, 2018

Bug Report

We have a Linux worker running in vsphere that seems to be making excessive writes to disk - averaging 50,000 Kbps. Redeploying the worker fixed the issue for perhaps 24 hours (bringing disk writes down to ~7,000 Kbps) before the rate spiked again.

Steps to Reproduce

We are using the following ops files from concourse-bosh-deployment:

cluster/external-worker.yml
cluster/operations/worker-ephemeral-disk.yml
cluster/operations/windows-worker.yml
cluster/operations/windows-worker-ephemeral-disk.yml

At some point we removed our Windows worker to see whether it was contributing, but the high write rate persisted.

When we SSH onto the worker and run iotop, the offending process appears to be writing to the loop0 device:

# iotop -o -d 5 -a
[worker-iops screenshot: iotop output showing accumulated writes attributed to the loop0 device]

Also relevant, loop0 is being used by baggageclaim:

$ mount | grep loop0
/dev/loop0 on /var/vcap/data/baggageclaim/volumes type btrfs (rw,relatime,space_cache,subvolid=5,subvol=/)
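As a cross-check, the loop0 write rate can also be estimated straight from /proc/diskstats, which works even when iotop isn't installed (a rough sketch; it assumes the standard diskstats layout, where field 10 is cumulative 512-byte sectors written):

$ before=$(awk '$3 == "loop0" {print $10}' /proc/diskstats)
$ sleep 10
$ after=$(awk '$3 == "loop0" {print $10}' /proc/diskstats)
$ echo "loop0 wrote $(( (after - before) * 512 / 1024 / 10 )) KiB/s over the last 10s"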

Expected Results

50,000 Kbps seems very high; we were expecting much lower write activity. This worker is only running a cf push every 5 minutes and a curl every minute. It's hard to imagine what it could all be writing to disk!

Actual Results

Version Info

  • Concourse version: 4.2.1
  • Deployment type (BOSH/Docker/binary): bosh
  • Infrastructure/IaaS: vsphere
  • Browser (if applicable):
  • Did this used to work? We don't know :( Actually, it did work at first - the worker was up for 30+ days before we started seeing the issue (judging from performance charts in vCenter). However, when we did a fresh redeploy, the issue recurred within ~24 hours.
natalieparellano (Author) commented:

Some more things we learned:

  • We see a bunch of log lines with baggageclaim.repository.destroy-volume.failed-to-destroy and a bunch of dead volumes in /var/vcap/data/baggageclaim/volumes/dead - maybe this is related to our issue? Unfortunately, we don't have logs for a time period where we also know exactly when the spike in disk writes started (see the inspection sketch below).
  • We don't see an excessive number of lines printed in baggageclaim.stdout.log even while the spike in disk writes is occurring (the time span covered by each log file roughly matches what we see when there is no spike)

Can you suggest other data we should be collecting, or other things to look for? We are starting to impact other teams in the data center with our excessive disk usage so we would like to get to the bottom of this.
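For reference, a rough sketch of the checks that can be run on the worker (the log directory below is an assumption based on a standard BOSH layout; adjust it to wherever baggageclaim.stdout.log lives on your stemcell):

# ls /var/vcap/data/baggageclaim/volumes/dead | wc -l          # dead volumes still on disk
# btrfs subvolume list /var/vcap/data/baggageclaim/volumes     # subvolumes backing the store on loop0
# grep -c failed-to-destroy /var/vcap/sys/log/worker/baggageclaim.stdout.log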

vito added the triage label Jan 9, 2019
samgurtman-zz commented:

We are having somewhat similar issues on 4.2.2. I'm not sure if it's the same problem, but we also get various failed-to-destroy and other volume-management errors, which cause the worker to become stalled.

vito (Member) commented Jan 15, 2019

Could you try switching the volume driver to overlay?
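In case it's useful, a minimal ops-file sketch for making that switch with concourse-bosh-deployment (the baggageclaim/driver property path and the worker instance-group name here are assumptions - check the worker job spec for your Concourse release version):

$ cat > set-overlay-driver.yml <<'EOF'
- type: replace
  path: /instance_groups/name=worker/jobs/name=worker/properties/baggageclaim/driver?
  value: overlay
EOF

Pass it to bosh deploy with -o alongside the existing ops files; the workers will likely need to be recreated so the volume store is rebuilt with the new driver.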

Just a quick note: this sounds more like a support request than a discrete bug report - I would suggest using the forums or our Discord channel for cases like this where more investigation is required and the behavior doesn't seem particularly related to Concourse code. This just makes our lives easier and keeps the issues from being an ever-growing backlog. 🙂 Thanks for doing the investigation you've done so far, though!

vito removed the triage label Dec 9, 2019