
version: every will skip versions if a parallel upstream job's latest build finishes before an older one #736

Open
jpalermo opened this issue Oct 30, 2016 · 21 comments


@jpalermo

Bug Report

Follow up to #666 and #563

We're running 2.4.0.

What I saw this morning: a job started multiple builds of itself; A, B, C, and D, in that order.

The builds finished in A, C, D, B order. The next job that depends on that resource (which uses version: every) did not run with resource version B.

@joonas

joonas commented Nov 5, 2016

I'm also seeing this behavior on my end on the pullrequest-resource, currently running Concourse 2.4.0

@vito
Member

vito commented Nov 7, 2016

So this is a tricky one. The semantics for version: every coming from a job whose builds run in parallel are complicated by how the passed constraints work with version: every. See #358 (comment)

tl;dr: passed constraints allow it to skip versions, otherwise when a bad version enters the mix, all downstream jobs can no longer run.

The scheduling for the second job would have to know that a build is in-flight for the ideal next version, and know to wait for it to finish before determining the inputs.

@oppegard
Contributor

Our team also makes heavy use of the pullrequest-resource, and reliably running all passed versions of a PR in downstream jobs is important. As I currently understand it, we need to set serial: true or max_in_flight: 1 to ensure all PRs make it through a pipeline?

@vito
Member

vito commented Nov 14, 2016

@oppegard That's right, at least until we figure out whether there's something better we could do. I'm a little wary of the changes a fix like that implies; having scheduling so dependent on ephemeral job state feels wrong.
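
For reference, a minimal sketch of that interim mitigation (the job and resource names here are illustrative, not taken from this thread):

jobs:
- name: test-pr          # illustrative job name
  serial: true           # or: max_in_flight: 1
  plan:
  - get: my-pr           # illustrative PR resource
    version: every
    trigger: true
  - task: run-tests
    # ...

With serial: true (or max_in_flight: 1), only one build of the job runs at a time, so builds can't finish out of order and the downstream version: every input has no version to skip.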

@vito
Member

vito commented Dec 2, 2016

Possible workaround:

Have each job do a put to an otherwise-unused time resource. Then thread that through the subsequent jobs with trigger: true and a passed constraint. Do this for each job. This would make it so there's always a newer version once the build completes.

e.g.:

resources:
- name: my-pr
  type: pr
  source: {...}

- name: dummy-time
  type: time
  source: {interval: 24h} # doesn't matter, but something has to be there

jobs:
- name: job-1
  plan:
  - get: my-pr
    version: every
    trigger: true
  - task: test-pr
    # ...
  - put: dummy-time

- name: job-2
  plan:
  - get: my-pr
    trigger: true
    passed: [job-1]
  - get: dummy-time
    version: every
    trigger: true
    passed: [job-1]
  - # ...
  - put: dummy-time

- name: job-3
  plan:
  - get: my-pr
    trigger: true
    passed: [job-2]
  - get: dummy-time
    version: every
    trigger: true
    passed: [job-2]
  - # ...
  - put: dummy-time

Note that version: every is configured on the dummy-time resource in subsequent jobs.

@thewoolleyman

thewoolleyman commented Jan 19, 2017

Hi @vito,

Is this still an issue (and is the above workaround still required) when using the pullrequest-resource, even though the mentioned problems with using passed and every together have been (I believe?) resolved by #358 (comment)?

If so, could you please explain the details, and why the problem seems to be specific to the pullrequest-resource?

My guess is that it is because the pullrequest-resource conflates two independent "versions" (PR numbers and SHAs) into a single "version", which breaks some of the Concourse assumptions about how resources and versioning work. It would be great to have you confirm or deny that, and if so, possibly explain why from your perspective.

Thanks!
-- Chad

@vito
Member

vito commented Jan 19, 2017

The problem is not specific to the PR resource. This issue is still open. The linked comment explains why this issue exists; it's a fundamental issue with passed and every together, where it's either broken this way or broken in the way that comment mentioned, and I think this way is the less broken of the two (as the alternative is a deadlocked pipeline).

@thewoolleyman

thewoolleyman commented Jan 19, 2017

@vito OK, thanks.

EDIT
I thought I understood, but actually I'm still confused. I assumed that when you said "the second job" above, you meant a downstream job, but now I realize you probably meant a concurrently executing build of the same job. So I'm still trying to understand the crux of the problem, and why it only occurs with the combination of version: every, passed:, and non-serial or max_in_flight > 1 (others on my team are also having trouble understanding it). It would be great if you could expand on and clarify exactly what problem this causes, and why:

> tl;dr: passed constraints allow it to skip versions, otherwise when a bad version enters the mix, all downstream jobs can no longer run.

> The scheduling for the second job would have to know that a build is in-flight for the ideal next version, and know to wait for it to finish before determining the inputs.

END EDIT

But, with respect to the pullrequest-resource specifically:

  • The only reason we need to use "every" is that the pullrequest-resource conflates two independent "versions" (PR# and SHA) into one emitted "version"; we want the latest (not every) SHA to be built for every PR#.
  • If we instead used the standard Concourse git resource, which emits only one "version" (a SHA), in conjunction with the pipeline-per-PR# (i.e. pipeline-per-branch) approach you discuss here, we would not need "every": each per-PR# pipeline (with its own git resource defined for that PR's branch) would ensure the latest SHA for that branch is always run (not every SHA, which is fine, because we only care about running the latest commit on the PR's branch).
  • Thus, since we wouldn't need to use "every", we would not be exposed to this issue.

Correct?

If not, then please explain in more detail, because I'm still confused :)

Thanks,
-- Chad

@thewoolleyman

@vito Also, I tried out the workaround above.

It doesn't appear to work if you ever have more than one entry in a passed: array.

E.g., if you try this, job-3 hangs and never starts, with the message "waiting for a suitable set of input versions\n dummy-time - no versions satisfy passed constraints":

resources:
- name: my-pr
  type: pr
  source: {...}

- name: dummy-time
  type: time
  source: {interval: 24h} # doesn't matter, but something has to be there

jobs:
- name: job-1
  plan:
  - get: my-pr
    version: every
    trigger: true
  - task: test-pr
    # ...
  - put: dummy-time

- name: job-2a
  plan:
  - get: my-pr
    trigger: true
    passed: [job-1]
  - get: dummy-time
    version: every
    trigger: true
    passed: [job-1]
  - # ...
  - put: dummy-time

- name: job-2b
  plan:
  - get: my-pr
    trigger: true
    passed: [job-1]
  - get: dummy-time
    version: every
    trigger: true
    passed: [job-1]
  - # ...
  - put: dummy-time

- name: job-3
  plan:
  - get: my-pr
    trigger: true
    passed: [job-2a, job-2b]
  - get: dummy-time
    version: every
    trigger: true
    passed: [job-2a, job-2b]
  - # ...
  - put: dummy-time

Is this expected (and thus the workaround doesn't work in this situation), or do I have something wrong?

Thanks,
-- Chad

vito changed the title from "Still seeing resources with version: every getting skipped" to "version: every will skip versions if a parallel upstream job's latest build finishes before an older one" Oct 2, 2017
nat-henderson added a commit to GoogleCloudPlatform/magic-modules that referenced this issue Jul 11, 2018
The problem is in concourse/concourse#736.
This is the workaround advised by the Concourse folks. :)
@nazrhom

nazrhom commented Apr 1, 2019

Hey @vito, we have the same setup as @thewoolleyman and, while the workaround seems to accomplish what it is meant for, we also see the job-2x jobs being triggered multiple times (with the same my-pr version but different dummy-time versions).

Is this meant to work when there are multiple entries in the passed array downstream?

@vito
Member

vito commented Apr 3, 2019

@nazrhom (and, well, everyone) To be honest, we're starting to see version: every + passed: as a bit of a mis-feature. Saying "run with every version" but also "oh wait, but only the ones that made it through this job" is kind of an oxymoron, and from that perspective the (mis)behavior described in this issue is somewhat unsurprising. It's also not super surprising that the workaround has led to other things that need working around. 🙂 I don't see why having multiple entries in passed would break it any more or any less than with a single entry, fwiw.

It seems like what y'all are really trying to do is have a full pipeline run for every version. I think that's actually a challenge that can be met by a concept similar to spaces (#1707). We've got a new plan laid out in concourse/rfcs#1 (comment) that should make this kind of thing possible. At that point we might deprecate configuring both version: every and passed: at the same time and encourage users to adopt the new pattern.

@nazrhom

nazrhom commented Apr 4, 2019

@vito thanks, will look into the rfcs. I did some more digging and the issue I am experiencing does not seem to be related to fan-in/out as I can reproduce the issue even with a setup like:

jobs:
- name: job-1
  plan:
  - get: my-pr
    version: every
    trigger: true
  - task: test-pr
    # ...
  - put: dummy-time

- name: job-2
  plan:
  - get: my-pr
    trigger: true
    passed: [job-1]
  - get: dummy-time
    version: every
    trigger: true
    passed: [job-1]
  - # ...
  - put: dummy-time

- name: job-3
  plan:
  - get: my-pr
    trigger: true
    passed: [job-2]
  - get: dummy-time
    version: every
    trigger: true
    passed: [job-2]
  - # ...
  - put: dummy-time

I have a question on the (current) expected semantics here: job-2 will get a certain version of dummy-time that was put by job-1; after it does its thing, it will put a new version of dummy-time. Doesn't this create two versions of dummy-time that have passed job-2 (the one job-2 gets and the one it puts), both of which can potentially trigger job-3?

@vito
Member

vito commented Apr 10, 2019

@nazrhom Yeah, that may be what's happening.

@stale

stale bot commented Jul 16, 2019

Beep boop! This issue has been idle for long enough that it's time to check in and see if it's still important.

If it is, what is blocking it? Would anyone be interested in submitting a PR or continuing the discussion to help move things forward?

If no activity is observed within the next week, this issue will be closed, in accordance with our stale issue process.

@phillbaker
Contributor

I would ask that this issue and/or #1298 remain open, as it's a common problem and the proposed fix (spaces) is a large and uncertain feature (concourse/rfcs#24).

stale bot removed the wontfix label Jul 17, 2019
@vito
Member

vito commented Jul 30, 2019

I'll leave this issue open as long as the problem still stands. The resolution will probably be deprecating/disallowing version: every + passed: in favor of spatial pipelines using the across step (concourse/rfcs#29). The problem with version: every + passed: is pretty fundamental as noted in #736 (comment), and supporting it adds a lot of complexity in our scheduling algorithm.

I'll keep the stale bot at bay by placing this in our Spatial Resources epic as a long-term goal to close out once the epic is complete.

@YenTheFirst

We've been hitting this same issue. So, I've been trying to understand how the scheduling algorithm is currently implemented. I'd like to check my understanding, and see if it inspires any reasonable fixes.

My understanding is,

  1. A per-pipeline atc/scheduler/Runner runs, and on a regular basis, tick()s. (https://github.com/concourse/concourse/blob/release/5.6.x/atc/scheduler/runner.go#L67)
  2. tick() loads, among other things, a (possibly-cached) copy of the pipeline config, and a VersionDB containing the set of all resource versions, build inputs & outputs related to a pipeline.
  3. This is passed to a atc/scheduler/Scheduler.Schedule()
  4. for each job in the pipeline, this attempts to ensurePendingBuildExists. The purpose of ensurePendingBuildExists seems to be to find-or-create the single next build that should start. (https://github.com/concourse/concourse/blob/release/5.6.x/atc/scheduler/scheduler.go#L62)
  5. ensurePendingBuildExists calls atc/scheduler/InputMapper.SaveNextInputMapping().
  6. SaveNextInputMapping calls, per-resource, atc/db/algorithm/InputConfigs.Resolve() (https://github.com/concourse/concourse/blob/release/5.6.x/atc/scheduler/inputmapper/inputmapper.go#L66) [there's an additional 'final resolve' afterward, which isn't relevant to the current thread]
  7. Resolve() fetches version candidates. In particular, it fetches exactly one of "all candidates", "the latest candidate", "a specific pinned candidate", or "the candidates which passed the configured jobs". (https://github.com/concourse/concourse/blob/release/5.6.x/atc/db/algorithm/input_configs.go#L19-L54)
  8. As an implementation detail, these candidates are returned in descending CheckOrder order, due to atc/db/algorithm/VersionDB.* calling atc/db/algorithm/Versions.With(), which finds-or-adds versions in sorted order. (https://github.com/concourse/concourse/blob/release/5.6.x/atc/db/algorithm/version.go#L36)
  9. The collected candidates are then passed to atc/db/algorithm/InputCandidates.Reduce().
  10. Reduce() experimentally pins each version of a resource, in order, until it finds a version that "IsNext()" https://github.com/concourse/concourse/blob/release/5.6.x/atc/db/algorithm/input_candidates.go#L99
  11. IsNext() returns true if a build already exists for the given version of a resource (https://github.com/concourse/concourse/blob/release/5.6.x/atc/db/algorithm/input_candidates.go#L30)
  12. Once we've found a version for this resource that IsNext(), we return it as a mapping, we collect all those mappings, ensure that we have a mapping for each input config, do a final group resolve to check compatibility, save the input mapping to the DB, and then determine whether we should trigger the build.

If my understanding of this process is correct, the cause of this bug is pretty straightforward. When
A) multiple versions of a resource exist, and
B) the newest version already has a build,
then no other version of that resource is ever considered as an input to a potential job, regardless of version: every.

Given this, it seems like there might also be edge cases in which inputs that were version: every and didn't have a passed could also skip versions. I'd have to dig in to confirm or reproduce that, though.

A naive suggestion might be, in cases where an input_config is version: every, to consider candidates in ascending order of check_order, skipping candidates that already have a build. I wouldn't be surprised if this is actually how it was implemented before, and that it caused other subtle problems of the "job gets stuck and never schedules any new builds" variety. When I get some more time, I'll do some history spelunking.
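
To make that suggestion concrete, here is a small standalone sketch (illustration only; the types, names, and helper below are hypothetical, not the actual atc code):

package main

import "fmt"

// Hypothetical stand-in for the real atc version record.
type version struct {
	id         int
	checkOrder int
}

// nextEveryVersion walks the candidates in ascending check_order and returns
// the first one that does not already have a build for this job, instead of
// pinning the newest candidate for which a build exists.
func nextEveryVersion(candidates []version, hasBuild func(version) bool) (version, bool) {
	for _, v := range candidates { // assumed sorted by checkOrder ascending
		if !hasBuild(v) {
			return v, true
		}
	}
	return version{}, false
}

func main() {
	// Versions from upstream builds A..D in check order; the downstream job
	// already has builds for the versions from A, C, and D, but not B.
	candidates := []version{{1, 1}, {2, 2}, {3, 3}, {4, 4}}
	built := map[int]bool{1: true, 3: true, 4: true}

	next, ok := nextEveryVersion(candidates, func(v version) bool { return built[v.id] })
	fmt.Println(next, ok) // {2 2} true -- the version from build B is not skipped
}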

@vito
Member

vito commented Nov 3, 2019

@YenTheFirst Impressive sleuthing! We're actually near the end of a complete re-design and re-implementation of the scheduling algorithm, currently living on the algorithm-v3 and release/6.0.x branches. If you'd like to read a bit about it, I had a section on it in the v10 roadmap post. We're aiming to push it out as a 6.0 beta release sometime soon.

For inputs with version: every the new algorithm actually does exactly what you proposed: walk through versions in ascending order from whichever version was previously used, falling back on descending order if none of those are valid.
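
Purely as an illustration of that description (this is not the actual algorithm-v3 / 6.0 code, and the function is made up):

package sketch

// pickEveryVersion illustrates the described ordering for a version: every
// input: versions are in ascending check order, lastUsedIdx is the index of
// the version this job previously ran with (-1 if it has never run), and
// valid reports whether a version satisfies the configured constraints.
func pickEveryVersion(versions []string, lastUsedIdx int, valid func(string) bool) (string, bool) {
	// walk forward from the version after the previously-used one
	for i := lastUsedIdx + 1; i < len(versions); i++ {
		if valid(versions[i]) {
			return versions[i], true
		}
	}
	// fall back to descending order from the newest version
	for i := len(versions) - 1; i >= 0; i-- {
		if valid(versions[i]) {
			return versions[i], true
		}
	}
	return "", false
}

At least in this reading, if the job last ran with the newest version, the forward walk finds nothing and the fallback lands back on the newest valid version rather than reaching back to a skipped older one, which is essentially the concern raised in the reply below.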

However, at least at some level, the problem with passed and version: every with parallel upstream jobs is still pretty fundamental. Both attributes demand two things that are naturally in conflict: "use every version" and "use only versions that passed through the job". "Correcting" the behavior with parallel upstream builds would likely mean changing the semantics of version: every to either a) wait for in-flight builds to complete before deciding which version to use or b) support walking "backwards" to execute older versions.

Neither of these options really feel compelling to me. Even version: every without passed constraints is actually really awkward to support, and I really want to replace it with something else in the future as it has been the root cause of a ton of complexity/confusion in both the old implementation and the new one.

@YenTheFirst

I don't think the concepts of passed and version: every are fundamentally at odds, at least from a user's point of view. It's typically either "I want to use every version of a resource, as long as it passed an upstream job", or "I want to use every version of a resource, but wait until a sure-to-succeed upstream job has produced an accompanying artifact", or "I want to use every output of an upstream job (even if that's not every version of the resource)". It may make sense for those use cases to be fulfilled by something other than version: every, but I do think they are use cases that should be supported.

In our particular case, the pipeline is something like:

  • For every commit on every pr-branch, build a docker image
  • for every successfully built docker image, run unit tests
  • for every successfully built docker image, run integration tests
  • for every commit that was successfully built, run the linter

I suspect we could work around this by removing the initial "build a docker image" step and instead having unit-tests and integration-tests do their own building, but that duplicates work and code.

[In this particular case, yes, we're using version: every as a hack around the fact that Concourse does not have spaces yet, but...Concourse doesn't have spaces yet. I suspect that there's reasonable variations on this pipeline where version: every isn't merely a hack]


I'll take a closer look at those algorithms-v3 and release/6.0.x branches. Where would be a reasonable place to leave feedback related to those branches, if applicable?

Based only on your description,

> walk through versions in ascending order from whichever version was previously used

At first glance, this sounds like it would have the exact same problem. If a downstream build for D completes before an upstream build for C, the downstream build for C will never start, since "D" will be the version that was previously used.

> b) support walking "backwards" to execute older versions.

I expect this is basically what's needed.

@vito
Member

vito commented Nov 5, 2019

To be clear, I don't doubt the validity of your use case at all. I just don't see us improving this specific approach any time soon, because it will take a nontrivial amount of work, and it will just delay the real fix which we've recently gained a pretty clear picture of.

The problem is that, while version: every + passed may not seem to be at odds from a user-level, it's technically impossible to satisfy without significant changes to how Concourse works, which would mean other user-level changes. I just don't think these changes are the right thing to do.

> b) support walking "backwards" to execute older versions.
> I expect this is basically what's needed.

This is why I think it's a fundamental issue. Concourse pipelines go forwards, not backwards. Changing this would have a ton of implications, and it isn't a clean fix anyway, because now there's a question of when version: every stops going backwards. For example, if I apply version: every to a regular Git repo without passed constraints, it's not likely that I would want it to literally go all the way back to the initial commit (resources will eventually collect all versions, not just those from initial configuration time).

We already know this feature is a pain in the butt to support, and doubling down on it to get it working correctly would require re-designing significant portions of the product around it, both internally and in the UI. So at this point I'm not on board with supporting version: every + passed, and I would like to even phase out version: every in the long run. It's a crutch that makes things "work" just enough for us to not find better solutions.

Your use case of building every commit of every PR is totally fine. I think a model that better fits how Concourse works would be to configure a pipeline for each commit. We just need to improve the automation and navigation of pipelines. This is all broken down into many small features, laid out in the v10 roadmap post. Implementing most of these features will be easier or at least more beneficial in the long-term than doubling down on version: every.

> I'll take a closer look at those algorithms-v3 and release/6.0.x branches. Where would be a reasonable place to leave feedback related to those branches, if applicable?

Hmm good point, not yet. We should just open a PR for it. 🤔 I'll link to it here once we do (or just keep an eye out if I forget!)

edit: PR is open #4721

vito added this to Related issues in v10 roadmap Feb 24, 2020
jimmykarily pushed a commit to cloudfoundry-incubator/kubecf that referenced this issue Mar 6, 2020
we introduced max-in-flight hoping to limit the number of clusters we
run at the same time. It turns out it doesn't work that way. As soon as
a deploy job is done the next one will be queued although no cluster has
been deleted. This will allow for as many clusters as we have time to
deploy until the first cluster is deleted.

We also add the "version: every" setting to make sure we don't skip
versions if jobs complete in random order. See also here:

concourse/concourse#736

Signed-off-by: Dimitris Karakasilis <DKarakasilis@suse.com>
bikramnehra pushed a commit to cloudfoundry-incubator/kubecf that referenced this issue Apr 21, 2020
we introduced max-in-flight hoping to limit the number of clusters we
run at the same time. It turns out it doesn't work that way. As soon as
a deploy job is done the next one will be queued although no cluster has
been deleted. This will allow for as many clusters as we have time to
deploy until the first cluster is deleted.

We also add the "version: every" setting to make sure we don't skip
versions if jobs complete in random order. See also here:

concourse/concourse#736

Signed-off-by: Dimitris Karakasilis <DKarakasilis@suse.com>
jimmykarily added a commit to cloudfoundry-incubator/kubecf that referenced this issue Jun 4, 2020
because this together with `passed:` is confusing to Concourse and
doesn't let the step trigger when it should.

See also: concourse/concourse#736 (comment)
jimmykarily pushed a commit to cloudfoundry-incubator/kubecf that referenced this issue Aug 21, 2020
because that blocks automatic triggering of jobs

concourse/concourse#736

This combination of settings was introduced when we tried to force
Concourse to run later jobs on every version that passed the previous
one. E.g.
ver1 passed job 1, ver2 passed job 1
job2 should trigger both for ver1 and ver2, otherwise ver1 would never be tested on job2. The new scheduling algorithm may have solved this:

https://concourse-ci.org/scheduler.html#scheduling-behavior

Given our pipeline needs us to manually trigger jobs all the time, this is worth a shot.