
Concourse database hits 100% CPU usage #3948

Closed
rudolfv opened this issue May 30, 2019 · 2 comments

@rudolfv

commented May 30, 2019

Bug Report

We have some badly behaved custom resources that generate a new UUID-based version each time a resource check is done. Obviously we need to fix this at some point, but so far we haven't come up with a clean solution.

Once the number of resource versions hits a certain threshold the database CPU goes up to 100%.

In previous versions of Concourse we could get around this in a fairly hacky way by deleting rows for resources of type x from the versioned_resources table with IDs older than a certain value. This worked very well.
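For reference, the cleanup looked roughly like this (a sketch from memory; it assumes the pre-5.x versioned_resources schema with a type column and monotonically increasing IDs, and both the type value and the cutoff are placeholders):

DELETE FROM versioned_resources
WHERE type = 'x'
  AND id < 1000000;  -- placeholder cutoff, chosen per installation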

After the version 5.1.0 upgrade we are experiencing similar problems, but caused by a different set of queries related to the retrieval of resource versions.

My first question is whether there is a way to safely delete resource versions in the new database structure, the same way we did previously.
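An untested guess at the 5.x equivalent, based only on the tables and columns visible in the queries further down (the resource name and check_order cutoff are placeholders, and I don't know whether Concourse tolerates rows disappearing out from under it):

DELETE FROM resource_config_versions v
USING resources r
WHERE r.resource_config_scope_id = v.resource_config_scope_id
  AND r.name = 'noisy-resource'  -- placeholder; assumes resources has a name column
  AND v.check_order < 4000000;   -- placeholder cutoff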

My second question is whether it would be possible to optimize these queries to handle the resource version volumes we are throwing at them. I can foresee that someone else might have a legitimate use case that hits these volumes too.

We also tried adding additional indices to turn the sequential scans the PostgreSQL planner was choosing into (parallel) index scans, but that did not alleviate the problem.
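The indices were along these lines (illustrative only; the names and column choices are mine, matching the join and filter columns of the queries below):

CREATE INDEX CONCURRENTLY rcv_scope_and_check_order_idx
  ON resource_config_versions (resource_config_scope_id, check_order);

CREATE INDEX CONCURRENTLY brcvi_version_md5_idx
  ON build_resource_config_version_inputs (version_md5);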

We also tried VACUUM, ANALYZE, and REINDEX to ensure that the optimal query plan was chosen. Running ANALYZE would fix the issue, but only up to a certain row-count threshold.
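Roughly, per large table:

VACUUM ANALYZE resource_config_versions;
VACUUM ANALYZE build_resource_config_version_inputs;
REINDEX TABLE resource_config_versions;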

The queries that cause the CPU spikes in Concourse 5.1.0 are the following:

SELECT v.id, v.check_order, r.id, i.build_id, i.name, b.job_id, b.status = 'succeeded'
FROM build_resource_config_version_inputs i
JOIN builds b ON b.id = i.build_id
JOIN resource_config_versions v ON v.version_md5 = i.version_md5
JOIN resources r ON r.id = i.resource_id
WHERE r.resource_config_scope_id = v.resource_config_scope_id
  AND (r.id, v.version_md5) NOT IN (SELECT resource_id, version_md5 FROM resource_disabled_versions)
  AND v.check_order <> $1
  AND r.pipeline_id = $2;

SELECT v.id, v.check_order, r.id
FROM resource_config_versions v
JOIN resources r ON r.resource_config_scope_id = v.resource_config_scope_id
LEFT JOIN resource_disabled_versions d ON d.resource_id = r.id AND d.version_md5 = v.version_md5
WHERE v.check_order <> 0
  AND d.resource_id IS NULL
  AND d.version_md5 IS NULL
  AND r.pipeline_id = $1;
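We compared the plans before and after ANALYZE with EXPLAIN, substituting a literal for the bind parameter, e.g.:

EXPLAIN (ANALYZE, BUFFERS)
SELECT v.id, v.check_order, r.id
FROM resource_config_versions v
JOIN resources r ON r.resource_config_scope_id = v.resource_config_scope_id
LEFT JOIN resource_disabled_versions d ON d.resource_id = r.id AND d.version_md5 = v.version_md5
WHERE v.check_order <> 0
  AND d.resource_id IS NULL
  AND d.version_md5 IS NULL
  AND r.pipeline_id = 42;  -- placeholder pipeline id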

The row counts in the relevant tables are as follows:

  • resource_config_versions: 4609744
  • build_resource_config_version_inputs: 2287481
  • resources: 8062
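(Obtained with plain counts, e.g. SELECT count(*) FROM resource_config_versions;)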

Steps to Reproduce

Write a custom resource that generates a new UUID-based version with every check, and run those resource checks lots of times across many pipelines.

Expected Results

For CPU usage to stay within normal bounds.

Actual Results

CPU usage on the PostgreSQL database hovers around 100% and everything slows to a crawl.

Version Info

  • Concourse version: 5.1.0
  • Deployment type (BOSH/Docker/binary): binary
  • Infrastructure/IaaS: AWS - RDS for the PostgreSQL DB
  • Browser (if applicable): N/A
  • Did this used to work?
    Only with our hacky workaround.

@rudolfv rudolfv added the bug label May 30, 2019

@rudolfv rudolfv changed the title 100% CPU usage for the concourse database Concourse database hits 100% CPU usage May 30, 2019

@marco-m

Collaborator

commented May 31, 2019

My feeling is that it will take less time, and be more robust, to fix the behavior of your custom resources than to find a way to keep poking at the db with Concourse 5.x :-)

@rudolfv rudolfv closed this May 31, 2019

@rudolfv

Author

commented May 31, 2019

I see that we can clear all the resource versions by destroying and re-creating the pipelines. Closing.
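(Concretely, that's fly -t <target> destroy-pipeline -p <pipeline> followed by re-creating the pipeline with fly set-pipeline.)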
