Upgrade from v5.8.0 to v6.0 makes scheduling very slow #5378
-
Hello colleagues, we are using the Helm/k8s based Concourse setup. Recently we upgraded our test environment to evaluate v6.0. The deployment itself went absolutely fine, but after the upgrade our job scheduling became very, very slow. On v5.8.2 the same job took 35s (pipeline.yml attached).
Our resource consumption:
CPU/MEM requests set:
-
By "scheduling" do you mean build duration? |
-
@vito Yes, right. Please find the Stackdriver traces.
-
What does the build look like it's doing during this time? That image is quite large (4.23GB!). Are you sure it's not just being downloaded for the first time after the deploy and having to be streamed to the task? How many times has the job been run? For me, on a brand new local…
-
@vito It seems that if the job runs on one specific worker it is as fast as before, but the other two workers take up to 3-4 minutes. Any suggestions?
-
@vito @cirocosta Still facing the issue. Any suggestions on how to fix it? Out of 4-5 builds, one build works as expected in the usual time (35s); all the other builds now take 3-6 minutes.
-
Sorry, but without being able to observe this directly there's not a whole lot to go on here. If you can replicate the scenario in a controlled environment like Docker Compose, that would help narrow things down. Right now all I can do is guess, so here are some shots in the dark.
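For the Docker Compose reproduction, a minimal setup along the lines of the official quickstart compose file would be a reasonable starting point; the image tag, credentials, and port below are only illustrative, and reproducing the worker-to-worker streaming behaviour would still require adding extra worker services registered against the web node:

```yaml
version: '3'

services:
  concourse-db:
    image: postgres
    environment:
      POSTGRES_DB: concourse
      POSTGRES_USER: concourse_user
      POSTGRES_PASSWORD: concourse_pass
      PGDATA: /database

  concourse:
    # 6.0.0 tag chosen to match the version being evaluated
    image: concourse/concourse:6.0.0
    command: quickstart   # runs web plus a single worker in one container
    privileged: true
    depends_on: [concourse-db]
    ports: ["8080:8080"]
    environment:
      CONCOURSE_POSTGRES_HOST: concourse-db
      CONCOURSE_POSTGRES_USER: concourse_user
      CONCOURSE_POSTGRES_PASSWORD: concourse_pass
      CONCOURSE_POSTGRES_DATABASE: concourse
      CONCOURSE_EXTERNAL_URL: http://localhost:8080
      CONCOURSE_ADD_LOCAL_USER: test:test
      CONCOURSE_MAIN_TEAM_LOCAL_USER: test
```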
-
@vito [1] https://discordapp.com/channels/219899946617274369/413770960089382922/689179388431826982
pipeline1.yaml: the image_resource was moved out of the task config, and the image is now fetched as a resource in the build plan and passed to the task via `image:`.
pipeline2.yaml: the image_resource stays in the task config rather than using an image from the build plan. With this pipeline definition it always finishes in a few seconds, as expected.
Hopefully this information helps to narrow down the problem?
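For illustration, the two variants look roughly like the sketches below; the resource name, repository, and task script are placeholders, not the actual pipeline.

pipeline1.yaml, with the image fetched in the build plan (the slow case):

```yaml
resources:
- name: heavy-image                      # placeholder name
  type: registry-image
  source:
    repository: registry.example.com/team/heavy-image   # placeholder

jobs:
- name: run-tests
  plan:
  - get: heavy-image
  - task: unit
    image: heavy-image                   # image comes from the get step above
    config:
      platform: linux
      run:
        path: sh
        args: ["-c", "echo running tests"]
```

pipeline2.yaml, with `image_resource` kept inside the task config (the fast case):

```yaml
jobs:
- name: run-tests
  plan:
  - task: unit
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: registry.example.com/team/heavy-image   # placeholder
      run:
        path: sh
        args: ["-c", "echo running tests"]
```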
-
First off, totally cool that tracing exists in v6 to show this kind of stuff. Kudos to @cirocosta. But perhaps tracing could be improved: even though it's clearly the task step whose behavior changes so drastically, we don't seem to have a breakdown of volume streaming/"container setup" vs script execution. Maybe we should have separate spans for each of those procedures.

It seems like the cached bits for that resource version live on one worker (call it A) and only on A. When the task step lands on A it's probably fast, because the resource cache is already colocated, but for the other two workers the bits are always being streamed from A before the task runs.

Is this expected behaviour for resource caches? Should there be redundant caches on multiple workers? If the above theory is correct, why was performance better in v5.8.2? Is this "cluster-unique resource cache per version" behaviour new in v6? If not, did volume streaming get slower in v6?
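One way to sanity-check the colocation theory would be to compare the output of `fly -t <target> workers` and `fly -t <target> volumes` before and after a slow build; that should show whether the resource cache volume for that image exists only on worker A while the task containers land on the other workers (the target name here is whatever your fly target is called).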
-
@gowrisankar22 I see the exact same behaviour with (admittedly) slow connections between workers and similarly sized images.
-
@vito any update on this??
-
@gowrisankar22 If there were any updates they would be posted here. Please slow down with the bumping and direct messaging (on Discord too). I can't address everything immediately, and it's only been a day since the last activity. 😅 This is a performance regression, but things still work, right? @xtreme-sameer-vohra Do you know if we ended up changing the…
-
@vito Sorry for that. We were planning to bump to v6 on our prod system next week, and we have quite a heavy load on that system, running over 500 pipelines; that was the reason for the continuous asking. Of course everything still works, but the performance regression is a very big issue, and deployments now run for long hours, which is not good.
-
@vito any update on this issue??
-
Hi @gowrisankar22, my first action is to do a code review of the v6 runtime logic to determine if there is any change in behaviour that explains the issue you are experiencing. We did a fairly substantial refactor of the runtime steps logic in v6; however, the expectation was that there would be no behavioural changes.
-
@xtreme-sameer-vohra Sure :) You can easily reproduce the problem if you have more than one worker.
-
@xtreme-sameer-vohra Today I updated my prod system, and the issue is reproducible there as well.
-
@clarafu and I were able to reproduce the issue on v6. I'll be opening an issue for this.