Limit the number of concurrent Google Cloud jobs #4857

Closed
Tracked by #3096
teor2345 opened this issue Aug 1, 2022 · 4 comments · Fixed by #4981
Labels
  • A-devops (Area: Pipelines, CI/CD and Dockerfiles)
  • C-bug (Category: This is a bug)
  • I-integration-fail (Continuous integration fails, including build and test failures)

Comments

teor2345 commented Aug 1, 2022

Motivation

When we're running a lot of GitHub workflows at the same time, we can hit CPU limits or other quotas in Google Cloud.

For example, one of the limits is 500 CPUs.

Suggested Fix

  • Set the concurrency limit on jobs that use Google Cloud

I think the limit calculation is roughly:
500 quota / 16 CPUs per machine - 4 CD instances = 27 job concurrency limit
500 quota / 16 CPUs per machine / 4 jobs launched simultaneously per workflow - 1 CD "workflow" = 6 workflow concurrency limit

But the limits might only be 12 jobs or 3 workflows, if each core counts as 2 quota CPUs.
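The estimate above can be reproduced with a few lines of Python. This is only a sketch: the function name and parameter names are made up for illustration, and the input values (500 quota CPUs, 16 CPUs per machine, 4 CD instances, 4 jobs per workflow, 1 CD workflow) are the rough figures from this comment, not measured values. Results are rounded down so the limit never exceeds the quota.

```python
import math


def concurrency_limits(
    quota_cpus=500,         # regional CPU quota (figure from this issue)
    cores_per_machine=16,   # CPUs per test machine (figure from this issue)
    quota_per_core=1,       # assumption: set to 2 if each core counts as 2 quota CPUs
    cd_instances=4,         # machines reserved for CD
    jobs_per_workflow=4,    # jobs a workflow launches simultaneously
    cd_workflows=1,         # CD counted as one "workflow"
):
    """Return (job_limit, workflow_limit), both rounded down."""
    machines = quota_cpus / (cores_per_machine * quota_per_core)
    job_limit = math.floor(machines - cd_instances)
    workflow_limit = math.floor(machines / jobs_per_workflow - cd_workflows)
    return job_limit, workflow_limit


print(concurrency_limits())                  # (27, 6)
print(concurrency_limits(quota_per_core=2))  # (11, 2)
```

Rounding to the nearest integer instead of rounding down would give 12 jobs and 3 workflows in the second case. Rounding down is the more conservative choice when the goal is to stay under a hard quota.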

teor2345 added the C-bug, A-devops, S-needs-triage, P-Medium ⚡, and I-integration-fail labels Aug 1, 2022

teor2345 commented

This caused critical PR #4918 to fail, so it is now a high-priority fix:

ERROR: (gcloud.compute.ssh) There was a problem refreshing your current auth tokens: ('Unable to retrieve Identity Pool subject token', '{ "message": "GitHub Actions is temporarily unavailable. Please visit https://www.githubstatus.com/ for the status of our services.", "ref": "Ref A: 95D36FBCF03245B9A2393A568103FC6C Ref B: BN3EDGE0708 Ref C: 2022-08-23T00:37:38Z" }')

https://github.com/ZcashFoundation/zebra/runs/7963813194?check_suite_focus=true#step:6:78

teor2345 commented

@gustavovalverde just letting you know that I increased the priority of this DevOps fix, because it's causing CI failures on critical Zebra PRs.

teor2345 commented

This has caused failures on multiple critical-priority PRs, so it is now a critical-priority fix:

ERROR: (gcloud.compute.instances.create-with-container) Could not fetch resource:

  • Quota 'C2D_CPUS' exceeded. Limit: 500.0 in region us-central1.

https://github.com/ZcashFoundation/zebra/runs/8061348032?check_suite_focus=true#step:7:82

gustavovalverde added a commit that referenced this issue Aug 29, 2022
Previous behavior:
Multiple Mainnet full syncs were able to run on the main branch at the
same time, and pushing multiple commits to the same branch would run
multiple CI workflows, when only the run from last commit was relevant

Expected behavior:
Ensure that only a single CI workflow runs at the same time in PRs.
The latest commit should cancel any previous running workflows from the
same PR.

Solution:
Use the GitHub Actions concurrency feature: https://docs.github.com/en/actions/using-jobs/using-concurrency

Fixes #4977
Fixes #4857

teor2345 commented

PR #4891 partially fixes this issue by:

  • cancelling outdated jobs on a PR when a new commit is pushed
  • limiting main full syncs to one at a time
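The two behaviours listed above correspond to the two modes of a GitHub Actions concurrency group. The toy model below is a sketch of those documented semantics in Python, not the workflow configuration from the PR: the `ConcurrencyGroup` class and the run names are hypothetical. A group holds at most one running and one pending run. With cancel-in-progress enabled, a new run cancels the running one. With it disabled, the new run waits and replaces any run that was already pending.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ConcurrencyGroup:
    """Toy model of a single concurrency group (hypothetical helper)."""
    cancel_in_progress: bool
    running: Optional[str] = None
    pending: Optional[str] = None
    cancelled: List[str] = field(default_factory=list)

    def start(self, run_id: str) -> None:
        if self.running is None:
            # Nothing is running in this group, so the run starts immediately.
            self.running = run_id
        elif self.cancel_in_progress:
            # PR-style group: the outdated run is cancelled by the new one.
            self.cancelled.append(self.running)
            self.running = run_id
        else:
            # Full-sync-style group: the running job is left alone and the
            # new run waits, replacing any run that was already pending.
            if self.pending is not None:
                self.cancelled.append(self.pending)
            self.pending = run_id

    def finish(self) -> None:
        # The running job completes and the pending run (if any) takes over.
        self.running, self.pending = self.pending, None


# Pushing a second commit to a PR cancels the run for the first commit.
pr = ConcurrencyGroup(cancel_in_progress=True)
pr.start("commit-1")
pr.start("commit-2")
print(pr.running, pr.cancelled)      # commit-2 ['commit-1']

# A second full sync on main waits instead of cancelling the first.
sync = ConcurrencyGroup(cancel_in_progress=False)
sync.start("sync-1")
sync.start("sync-2")
print(sync.running, sync.pending)    # sync-1 sync-2
```

Either mode keeps each group to one running job at a time, which is what bounds the number of cloud machines in use.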

mergify bot closed this as completed in #4981 Aug 30, 2022
mergify bot pushed a commit that referenced this issue Aug 30, 2022
* ci(concurrency)!: run a single CI workflow as required

Previous behavior:
Multiple Mainnet full syncs were able to run on the main branch at the
same time, and pushing multiple commits to the same branch would run
multiple CI workflows, when only the run from last commit was relevant

Expected behavior:
Ensure that only a single CI workflow runs at the same time in PRs.
The latest commit should cancel any previous running workflows from the
same PR.

Solution:
Use the GitHub Actions concurrency feature: https://docs.github.com/en/actions/using-jobs/using-concurrency

Fixes #4977
Fixes #4857

* docs: typo

* ci(concurrency): do not cancel running full syncs

Co-authored-by: teor <teor@riseup.net>

* fix(concurrency): explain the behavior better & add new ones

Co-authored-by: teor <teor@riseup.net>
mpguerra removed the S-needs-triage (Status: A bug report needs triage) label Sep 28, 2022