Triggering a concourse build no longer forces a check on resources #3845

Open · StevenArmstrong opened this issue May 11, 2019 · 14 comments · 6 participants

@StevenArmstrong (Author) commented May 11, 2019

Bug Report

Prior to Concourse 5.1, manually triggering a build would automatically force all of its resources to re-check and then kick off. This no longer happens after the change to scheduling in Concourse 5.1.
Per the 5.1 release notes, a manual trigger should short-circuit the check interval for all resources so there is not a long wait.


Steps to Reproduce

Set a high check_every value for a job with multiple git resources (a minimal sketch is below).
Trigger the job manually: it will not start executing until the very last resource's check interval has elapsed and its countdown has gone back to 0.
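
A minimal pipeline sketch of that setup, assuming hypothetical repository URLs; any job with several git resources on a long check_every should show the wait:

resources:
- name: repo-a
  type: git
  source: {uri: "https://example.com/repo-a.git"}  # hypothetical URL
  check_every: 24h
- name: repo-b
  type: git
  source: {uri: "https://example.com/repo-b.git"}  # hypothetical URL
  check_every: 24h

jobs:
- name: deploy
  plan:
  # manually triggering this job should short-circuit both 24h intervals;
  # on 5.1 it instead waits for the intervals to expire
  - get: repo-a
  - get: repo-b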

Expected Results

Per the 5.1 release notes, a manual trigger should force an instant re-check of all resources as opposed to waiting.

Actual Results

The short circuit doesn't appear to be happening, and the build waits out the normal check interval.
This makes manual triggers feel slow to users: a manual trigger used to force an instant re-check and kick off after a few seconds, whereas now, on jobs with higher check_every settings, users have to wait for the interval to expire before the build starts.

Additional Context

This is frustrating for our users, who rely on manual triggers to deploy to production during release windows. It is resulting in teams saying Concourse is slow and complaining.

Version Info

  • Concourse version: 5.1
  • Deployment type (BOSH/Docker/binary): binary
  • Infrastructure/IaaS: Iaas
  • Browser (if applicable): All
  • Did this used to work? Yes
@vito (Member) commented May 13, 2019

I'll prioritise this highly, but more concrete steps to reproduce will help speed things up, e.g. an example pipeline snippet and any web node logs that demonstrate the issue.

@mmb (Contributor) commented May 14, 2019

We have been experiencing this too since 5.1.0.

Manually triggered jobs get stuck at "preparing build" and do not start, even though all the checklist items are checked.

I don't see any errors in the web node logs. If I pin one of the resources to its most recent version, the waiting job starts immediately, so I assume it must be waiting on that resource somehow. We have been using this as a workaround.
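
For anyone needing the same workaround, pinning can be done from the CLI; a sketch with hypothetical target, pipeline, and version values:

# pin the resource to its most recent version so the waiting job starts
fly -t ci pin-resource -r my-pipeline/tasks-repo -v ref:0123abc

# release the pin afterwards so normal checking resumes
fly -t ci unpin-resource -r my-pipeline/tasks-repo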

This pinned resource is a "tasks" git repo that is used by almost every pipeline so maybe there is some contention with resource locking?

@freelock (Contributor) commented May 16, 2019

The referenced case, #3759, matches our experience with this exact bug. In our scenario we use a templated pipeline with 19 different resources, used across a bunch of different jobs. One of these is a "nightly check" job that uses 3 of them. We have over 50 of these pipelines, and each night each pipeline is unpaused, the check job is triggered, and when it has completed the pipeline is paused and the next one is unpaused. We run 2 of these jobs at a time, all triggered (roughly the rotation sketched below).
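
For illustration, that nightly rotation amounts to something like the following; target, pipeline, and job names are hypothetical, and it is shown sequentially rather than two at a time:

# unpause each templated pipeline, run its check job, then re-pause it
for p in client-site-1 client-site-2; do   # ~50 pipelines in practice
  fly -t ci unpause-pipeline -p "$p"
  fly -t ci trigger-job -j "$p/nightly-check" --watch
  fly -t ci pause-pipeline -p "$p"
done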

We had to roll back to v5.0.1, because v5.1.0 is entirely unusable for our scenario: v5.0.1 completes the entire run of all pipelines in 3.5 hours, whereas v5.1.0 takes over 18 hours, leaving little room for the actual deployment jobs we need to run.

Triggering a job in v5.0.1 after a pipeline has been unpaused (and many of the resources already cached) takes 30 seconds to 3-4 minutes. In v5.1.0, every one of these triggered jobs was taking 18-35 minutes to start, even when that pipeline had run a different job less than an hour before.

What I think is happening, based on @pivotal-jwinters' comments in the other issue, is that triggering a job causes a re-check of every single resource in the pipeline before allowing the job to start, instead of just re-checking the resources used by the job (and letting the other resources check on their normal interval).

Has there been any change that would affect this in 5.2.0? We'd love to make use of the new features, but unless this scheduling for triggered jobs gets addressed, we're stuck on 5.0.1...

@freelock (Contributor) commented May 16, 2019

We have also used the --enable-global-resources flag. Like @mmb, many of our resources are git repositories used across all of our pipelines, and these would not even need to be checked beyond the normal refresh interval before triggering a job that uses them; when we actually need the latest version of one of these, we already run fly check-resource.
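
For reference, the explicit re-check mentioned there looks like this; the target and resource names are hypothetical:

# force an immediate check instead of waiting for the resource's interval
fly -t ci check-resource -r my-pipeline/tasks-repo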

@pivotal-jwinters (Contributor) commented May 17, 2019

I think this is working as intended.

This is the pipeline I used:

resources:
- name: test-resource-a
  type: git
  source: {uri: "https://github.com/pivotal-jwinters/test-a"}

- name: test-resource-b
  type: git
  source: {uri: "https://github.com/pivotal-jwinters/test-b"}

- name: test-resource-c
  type: git
  source: {uri: "https://github.com/pivotal-jwinters/test-c"}

jobs:
- name: test-job
  plan:
  - get: test-resource-a
    trigger: true

  - get: test-resource-b
    trigger: true

  - task: test-task
    config:
      platform: linux

      image_resource:
        type: registry-image
        source:
          repository: alpine

      inputs:
      - name: test-resource-a
      - name: test-resource-b

      run:
        path: echo
        args: ["hello-job"]

- name: other-test-job
  plan:
  - get: test-resource-c
    trigger: true

  - task: other-test-task
    config:
      platform: linux

      image_resource:
        type: registry-image
        source:
          repository: alpine

      inputs:
      - name: test-resource-c

      run:
        path: echo
        args: ["hello-other-job"]

When I set CONCOURSE_RESOURCE_CHECKING_INTERVAL: 10m, I can verify on each resource's page in the UI that (after the initial check) they don't check for 10 minutes.

However, if I manually trigger test-job, test-resource-a and test-resource-b are both short-circuited, while test-resource-c is not.
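
For anyone reproducing this, that interval is a web node setting; a minimal sketch, with all other required web-node flags omitted:

# lengthen the default check interval so the short circuit is observable
export CONCOURSE_RESOURCE_CHECKING_INTERVAL=10m
concourse web   # plus the usual web-node flags for your deployment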

@pivotal-jwinters (Contributor) commented May 17, 2019

It might be worth noting that the short-circuit behaviour relies on LISTEN/NOTIFY in Postgres.
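
For context, LISTEN/NOTIFY is Postgres's built-in notification mechanism: one session subscribes to a named channel and other sessions can wake it immediately, without polling. Roughly (the channel name below is hypothetical, not Concourse's actual one):

-- session 1 (e.g. a web node) subscribes and waits:
LISTEN resource_checks;   -- hypothetical channel name

-- session 2 wakes every listener on that channel at once:
NOTIFY resource_checks;

If that is the mechanism, one implication is that anything between the web node and Postgres that drops notifications (some connection poolers do) would make the short circuit silently fall back to the normal interval.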

@StevenArmstrong (Author) commented May 17, 2019

@pivotal-jwinters (Contributor) commented May 17, 2019

@StevenArmstrong can you modify my example to show what you're talking about?

@StevenArmstrong (Author) commented May 17, 2019

If you add multiple instances of the same resource to the pipeline on other jobs, that seems to cause the issue to rear its head: the effective check_every for the reused resource goes to a higher value than it should, the short circuit does not work, and you get extremely long wait times. I would add that to your test pipeline to reproduce it; adding the resource to 10 jobs should be enough to see the behaviour (see the sketch below).
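
A sketch of that modification against the example pipeline above (job names and the count of 10 are illustrative):

# appended to the jobs: list of the example pipeline above
- name: reuse-job-1
  plan:
  - get: test-resource-a
    trigger: true
- name: reuse-job-2
  plan:
  - get: test-resource-a
    trigger: true
# ... and so on, up to reuse-job-10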

@freelock (Contributor) commented May 17, 2019

Oh that could certainly explain why 5.1.0 made this all much worse... in our case, a couple resources are on nearly every single job...

@clarafu (Contributor) commented Jun 24, 2019

@StevenArmstrong Do you see this happening only on certain resources, and does it usually happen after a deploy of Concourse or after workers have failed? If so, it might be related to resource checks hanging, causing builds to never trigger because they are waiting for the resource to finish checking.

Or, if it doesn't have anything to do with hanging checks: do you usually keep your pipelines paused, and is your process to unpause a pipeline and immediately try to manually trigger a build?

@StevenArmstrong (Author) commented Jun 24, 2019

@mmb (Contributor) commented Jun 24, 2019

As another data point, we have tried enabling global resources as a workaround.

With that enabled, manually triggered builds start right away.
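
For reference, that workaround is also a web node setting; a sketch, with deployment-specific flags omitted:

# opt in to global resources so equivalent resource definitions share checks
export CONCOURSE_ENABLE_GLOBAL_RESOURCES=true
concourse web   # plus the usual web-node flags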

@StevenArmstrong (Author) commented Jun 25, 2019

@clarafu moved this from Backlog to In Flight in Core on Jul 16, 2019
