Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

engine/engine_impl.go: Add a mutex around pod image updates for step progress. #26

Conversation

reltuk
Copy link
Contributor

@reltuk reltuk commented Apr 12, 2020

When a pipeline fans out a DAG, concurrent calls are made to (*Kubernetes).Run
to start each step in parallel. But starting a step requires a
Read/Modify/Write loop to the Kubernetes API, and so updates are highly likely
to fail with a concurrent modification error. A user encountered this behavior
here[1], and it's easy to reproduce by creating a pipeline with high fan out.

This commit adds a mutex around the Read/Modify/Write loop in
(*Kubernetes).start. We know that concurrent modifications from the single
process will cause failures, so there is no reason to dispatch them
concurrently.

This commit also changes the current backoff parameters (5 max tries, .1
jitter factor) to be a little more robust. In local testing, even with the
mutex, concurrent modification errors were still possible but were much rarer.

1: https://discourse.drone.io/t/kube-runner-limitations/6853

…progress.

When a pipeline fans out a DAG, concurrent calls are made to (*Kubernetes).Run
to start each step in parallel. But starting a step requires a
Read/Modify/Write loop to the Kubernetes API, and so updates are highly likely
to fail with a concurrent modification error. A user encountered this behavior
here[^1], and it's easy to reproduce by creating a pipeline with high fan out.

This commit adds a mutex around the Read/Modify/Write loop in
(*Kubernetes).start. We know that concurrent modifications from the single
process will cause failures, so there is no reason to dispatch them
concurrently.

This commit also changes the current backoff parameters (5 max tries, .1
jitter factor) to be a little more robust. In local testing, even with the
mutex, concurrent modification errors were still possible but were much rarer.

[^1]: https://discourse.drone.io/t/kube-runner-limitations/6853
@CLAassistant
Copy link

CLAassistant commented Apr 12, 2020

CLA assistant check
All committers have signed the CLA.

@reltuk
Copy link
Contributor Author

reltuk commented Apr 12, 2020

A different approach is to use PATCH to update the fields we care about without risk of concurrent modification error. That approach is outlined here: reltuk#1. Objectively, it might be better — higher inherent parallelism and potentially less load on the API server if we do see concurrent modification errors. But it has downsides, not least of which is the RBAC requirements change mentioned in that PR. It's also a bigger change semantically compared to the approach taken here

engine/engine_impl.go Outdated Show resolved Hide resolved
engine/engine_impl.go Outdated Show resolved Hide resolved
engine/engine_impl.go Outdated Show resolved Hide resolved
engine/engine_impl.go Outdated Show resolved Hide resolved
@bradrydzewski bradrydzewski merged commit 9a9d077 into drone-runners:master Apr 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants