Concurrent tasks modifying shared volumes produce non-deterministic results #3058
Steps to Reproduce
Suppose I have the following Job that concurrently runs two deployments that each consist of
- name: "deploy"
  max_in_flight: 1
  plan:
  - aggregate:
    - do:
      - task: process manifest a # 1.1)
        file: repo/ci/util/manifest.yml
        params:
          MANIFEST: ci/x/manifests/a.yml
      - put: cf # 1.2)
        params:
          manifest: processed-manifests/a.yml
    - do:
      - task: process manifest b # 2.1)
        file: repo/ci/util/manifest.yml
        params:
          MANIFEST: ci/x/manifests/b.yml
      - put: cf # 2.2)
        params:
          manifest: processed-manifests/b.yml
(some details cut for brevity).
We noticed that sometimes the second step (the put) in each aggregate/do pair would fail to find the manifest produced by the first step (the task). What I think is happening here is that both first steps concurrently access the volume
I think we hit this issue so often because we use the random-worker scheduling strategy to achieve equal task distribution in our cluster.
I feel either one of these two things should happen:
I'm leaning towards the second option, as I think it's possible to come up with a case where the first option can't be defined in a sound way.
The task outputs on the shared volume clobber each other in a non-deterministic way (depending on task execution order and worker volumes), leading to pipeline failures. I think this is similar to #483 but also related to #1799.
Note that it's possible to work around this issue by providing the tasks with
Another question that has popped up: should task inputs be read-only so they can't be modified (or are they already)? If they are writable, they can potentially exhibit the same behavior.
I'm pretty sure this is behaving as expected.
I don't think copy-on-write semantics are at fault here, and I'm pretty sure it'd happen regardless of how many workers there are; it's a fundamental race with shared in-memory state (the name -> artifact mapping). Concourse already ensures that tasks get copy-on-write volumes and don't share data, but the in-memory state for tracking named artifacts is subject to cases like this, where two tasks concurrently create the same named artifact.
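To make the race on the name -> artifact mapping concrete, here is a minimal sketch in Python. It is not Concourse's implementation: `ArtifactRegistry` and `run_build` are hypothetical names, and the sketch assumes both tasks declare an output called `processed-manifests` (inferred from the `put` params in the pipeline above). Each task writes to its own volume, but whichever task registers last decides what the name resolves to.

```python
class ArtifactRegistry:
    """Hypothetical stand-in for a build's in-memory name -> artifact mapping."""

    def __init__(self):
        self._by_name = {}

    def register(self, name, volume):
        # The last registration under a given name silently replaces the earlier one.
        self._by_name[name] = volume

    def lookup(self, name):
        return self._by_name[name]


def run_build(finish_order):
    """Register each task's output volume in the order the tasks finish."""
    registry = ArtifactRegistry()
    # Each task has its own private volume, so the files themselves never mix.
    volumes = {
        "process manifest a": {"a.yml"},
        "process manifest b": {"b.yml"},
    }
    for task in finish_order:
        # Assumption: both tasks use the same output name.
        registry.register("processed-manifests", volumes[task])
    return registry.lookup("processed-manifests")


a_then_b = run_build(["process manifest a", "process manifest b"])
b_then_a = run_build(["process manifest b", "process manifest a"])

print(sorted(a_then_b))  # ['b.yml']
print(sorted(b_then_a))  # ['a.yml']
```

In either ordering only one manifest is reachable under `processed-manifests`, so one of the two `put` steps cannot find its file. Which one fails depends on which task finishes last.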
Thanks for the fast response @vito. This issue is clearly a corner case. However, because it's a bit of a heisenbug for users with pipelines like this, I'd very much appreciate it if Concourse could print a warning when it detects that the task graph has this condition (aggregate tasks producing the same output folders).
In hindsight it's pretty obvious what happens, but it took us quite a while to track down. A warning in the build log, like the one we already have for unspecified params (credentials), would probably be the ideal solution.
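The check requested above could look roughly like the following Python sketch. It is an illustration only, not an existing Concourse feature: `find_output_collisions` is a hypothetical helper, and it assumes the output names of each task in an aggregate have already been resolved from the task configs into a plain mapping.

```python
from collections import defaultdict


def find_output_collisions(branch_outputs):
    """Return output names declared by more than one concurrently running task.

    branch_outputs maps a task name to the list of output names it declares
    (a hypothetical, already-resolved view of the aggregate's task configs).
    """
    producers = defaultdict(list)
    for task, outputs in branch_outputs.items():
        for name in outputs:
            producers[name].append(task)
    return {
        name: sorted(tasks)
        for name, tasks in producers.items()
        if len(tasks) > 1
    }


# Assumption: both tasks from the pipeline above declare the same output name.
aggregate = {
    "process manifest a": ["processed-manifests"],
    "process manifest b": ["processed-manifests"],
}

collisions = find_output_collisions(aggregate)
for name, tasks in collisions.items():
    print(f"warning: output '{name}' is produced by concurrent tasks: {', '.join(tasks)}")
```

A check like this only needs the declared output names, so it could run before any task starts instead of after a build has already failed.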
A simple addition to the