Upgrade from v5.8.0 to v6.0 makes scheduling very slow #5378
-
Hello colleagues, we are using the Helm/k8s based Concourse setup. Recently we upgraded our test environment to evaluate v6.0. The deployment itself went absolutely fine, but after the upgrade our job scheduling became very, very slow. On v5.8.2 the same job took 35s (pipeline.yml attached).
Our resource consumption:
CPU/MEM requests set:
-
By "scheduling" do you mean build duration? |
-
@vito Yes, right. Please find the Stackdriver traces.
-
What does the build look like it's doing during this time? That image is quite large (4.23GB!). Are you sure it's not just being downloaded for the first time after the deploy and having to be streamed to the task? How many times has the job been run? For me, on a brand new local…
-
@vito It seems that if the job runs on one specific worker it is as fast as before, but the other two workers take up to 3-4 minutes. Any suggestions?
-
@vito @cirocosta Still facing the issue. Any suggestions on how to fix it? Out of 4-5 builds, one build works as expected in the usual time (35s); all the other builds now take 3-6 minutes.
-
Sorry, but without being able to observe this directly there's not a whole lot to go on here. If you can replicate the scenario in a controlled environment like Docker Compose, that would help narrow things down. Right now all I can do is guess, so here are some shots in the dark.
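For the Docker Compose reproduction, a minimal setup along the lines of the official quickstart compose file would be a reasonable starting point; the image tag, credentials, and port below are only illustrative, and reproducing the worker-to-worker streaming behaviour would still require adding extra worker services registered against the web node:

```yaml
version: '3'

services:
  concourse-db:
    image: postgres
    environment:
      POSTGRES_DB: concourse
      POSTGRES_USER: concourse_user
      POSTGRES_PASSWORD: concourse_pass
      PGDATA: /database

  concourse:
    # 6.0.0 tag chosen to match the version being evaluated
    image: concourse/concourse:6.0.0
    command: quickstart   # runs web plus a single worker in one container
    privileged: true
    depends_on: [concourse-db]
    ports: ["8080:8080"]
    environment:
      CONCOURSE_POSTGRES_HOST: concourse-db
      CONCOURSE_POSTGRES_USER: concourse_user
      CONCOURSE_POSTGRES_PASSWORD: concourse_pass
      CONCOURSE_POSTGRES_DATABASE: concourse
      CONCOURSE_EXTERNAL_URL: http://localhost:8080
      CONCOURSE_ADD_LOCAL_USER: test:test
      CONCOURSE_MAIN_TEAM_LOCAL_USER: test
```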
-
@vito [1] https://discordapp.com/channels/219899946617274369/413770960089382922/689179388431826982
pipeline1.yaml: the image_resource was moved out of the task config, and the image is now fetched as a resource in the build plan and passed to the task via `image:`.
pipeline2.yaml: the image_resource stays in the task config rather than using an image from the build plan. With this pipeline definition it always finishes in a few seconds, as expected.
Hopefully this information helps to narrow down the problem?
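For illustration, the two variants look roughly like the sketches below; the resource name, repository, and task script are placeholders, not the actual pipeline.

pipeline1.yaml, with the image fetched in the build plan (the slow case):

```yaml
resources:
- name: heavy-image                      # placeholder name
  type: registry-image
  source:
    repository: registry.example.com/team/heavy-image   # placeholder

jobs:
- name: run-tests
  plan:
  - get: heavy-image
  - task: unit
    image: heavy-image                   # image comes from the get step above
    config:
      platform: linux
      run:
        path: sh
        args: ["-c", "echo running tests"]
```

pipeline2.yaml, with `image_resource` kept inside the task config (the fast case):

```yaml
jobs:
- name: run-tests
  plan:
  - task: unit
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: registry.example.com/team/heavy-image   # placeholder
      run:
        path: sh
        args: ["-c", "echo running tests"]
```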
-
First off, totally cool that tracing exists in v6 to show this kind of stuff. Kudos to @cirocosta. But perhaps tracing could be improved: even though it's clearly the task step whose behavior changes so drastically, we don't seem to have a breakdown of volume streaming/"container setup" vs script execution. Maybe we should have separate spans for each of those procedures.

It seems like the cached bits for that resource version live on one worker (call it A) and only on A. When the task step lands on A it's probably fast, because the resource cache is already colocated, but for the other two workers the bits are always being streamed from A before the task runs.

Is this expected behaviour for resource caches? Should there be redundant caches on multiple workers? If the above theory is correct, why was performance better in v5.8.2? Is this "cluster-unique resource cache per version" behaviour new in v6? If not, did volume streaming get slower in v6?
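One way to sanity-check the colocation theory would be to compare the output of `fly -t <target> workers` and `fly -t <target> volumes` before and after a slow build; that should show whether the resource cache volume for that image exists only on worker A while the task containers land on the other workers (the target name here is whatever your fly target is called).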
-
@gowrisankar22 I see the exact same behaviour with (admittedly) slow connections between workers and similarly sized images.
-
@vito any update on this??
-
@gowrisankar22 If there were any updates they would be posted here. Please slow down with the bumping and direct messaging (on Discord too). I can't address everything immediately, and it's only been a day since the last activity. 😅 This is a performance regression, but things still work, right? @xtreme-sameer-vohra Do you know if we ended up changing the…
-
@vito Sorry for that. We were planning to bump to v6 on our prod system next week, and we have quite a heavy load on that system, running over 500 pipelines; that was the reason for the continuous asking. Of course everything still works, but the performance regression is a very big issue, and deployments now run for long hours, which is not good.
-
@vito any update on this issue??
-
Hi @gowrisankar22, my first action is to do a code review of the v6 runtime logic to determine if there is any change in behaviour that explains the issue you are experiencing. We did a fairly substantial refactor of the runtime steps logic in v6; however, the expectation was that there would be no behavioural changes.
-
@xtreme-sameer-vohra Sure :) You can easily reproduce the problem if you have more than one worker.
-
@xtreme-sameer-vohra Today I updated my prod system, and the issue is reproducible there as well.
-
@clarafu and I were able to reproduce the issue on v6. I'll be opening an issue for this.