
Performance regression on triggered jobs #3759

Open
freelock opened this issue Apr 19, 2019 · 32 comments

@freelock (Contributor) commented Apr 19, 2019

Bug Report

The release notes for the scheduler changes in 5.1.0 call out that triggered jobs must now run a version check on every resource before the job starts. This makes Concourse practically unusable for us: we trigger almost every run using a bot, and this change has increased the startup time for each pipeline job from a few minutes to nearly 20 minutes. Across more than 50 pipelines, a nightly triggered run now takes 18 hours instead of 3.5 -- meaning there's not much time left before the nightly job runs again.

Steps to Reproduce

  1. Our typical pipeline has ~19 resources. The vast majority of these are shared. We have started the binary Concourse Web with --enable-global-resources.
  2. We run over 50 of these, generally 2 at a time (if there are more than 2 pipelines unpaused, our check worker historically has reached ridiculous load levels and basically locks up).
  3. Unpause a pipeline, trigger one of the jobs that uses 4 resources -- one of them is unique to this pipeline, the other 3 shared across all other pipelines.
  4. Wait 15 - 20 minutes for the job to actually start -- even when the worker already has all of the containers/volumes because the pipeline was previously active.

Expected Results

Our job should start within 5 minutes, if none of its resources are on the worker, or a minute or two if the other global resources are already present and have not exceeded some "freshness" age.

Actual Results

I think that after the job is triggered, all 19 resources in the pipeline are being created and checked. Even though most of them are shared across other pipelines, that sharing does not seem to improve the startup time...

Additional Context

I think the ATC should go back to using existing resources in triggered jobs if they have been checked recently (e.g. using the value in the resource's "check_every" parameter): if the current resource on the worker is younger than that interval, use the cached version, and only fetch a fresher resource if the cached one is older than that or "fly check-resource" has been called.
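
As a rough sketch of the freshness test I mean (in Go, with a hypothetical helper name -- not actual ATC code):

    package sketch

    import "time"

    // resourceIsFresh is a hypothetical helper: a resource counts as fresh
    // if its last successful check still falls within its check_every window.
    func resourceIsFresh(lastCheckEnd time.Time, checkEvery time.Duration, now time.Time) bool {
        return now.Before(lastCheckEnd.Add(checkEvery))
    }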

If the current behavior is preferred, could there at least be a startup option for the web node to restore the old behavior?

Version Info

  • Concourse version: 5.1.0
  • Deployment type (BOSH/Docker/binary): Binary
  • Infrastructure/IaaS: GCE -- n1.standard web, n1.standard worker with local SSD disk for worker
  • Browser (if applicable): n/a
  • Did this used to work? Much better than now!

freelock added the bug label Apr 19, 2019

@pivotal-jwinters (Contributor) commented Apr 23, 2019

@freelock manually triggered builds have actually always run a version check for every resource. The change in 5.1.0 only means that those checks don't block other builds from running anymore. Let's see if I can explain this with some pseudo-code.

Pre 5.1.0

for job in jobs
  if job.manuallyTriggered
    // this blocks other jobs from running
    for resource in job.resources
      resource.scan
    end
  end
  schedule job
end

5.1.0

for job in jobs
  if job.manuallyTriggered
    if job.createTime > job.resources.lastScan
      // skip this job; we'll start it during a later scheduler interval
      next
    end
  end
  schedule job
end

The delay you are noticing might be due to the fact that your build needs to wait for a scheduler tick after all your resources have been checked. Have you modified your scheduler interval?

@freelock (Contributor, Author) commented Apr 23, 2019

Hi,

Thanks for your response! I haven't done any customizations; I'm not sure how to modify the scheduler interval...

What I think is happening is that resource checks are not happening because the pipeline is paused -- with over 50 pipelines that are idle most of the time, instead of leaving them active all the time constantly checking resources, we unpause a pipeline and manually trigger a job (actually this is automated for us by our bot). So when the job is triggered, none of the resources have been checked, and with the scheduler change in 5.1.0, it takes at least 10 minutes and sometimes up to 40 minutes before the job actually starts (turns yellow in the UI).

We rolled back to 5.0.1, and this is back to taking somewhere between 30 seconds and up to 4 or 5 minutes at most.

We do have "global resources" enabled, which I think does make a big difference -- out of some 19 defined resources in our pipelines, more than half of them are identical across our pipelines -- generic deployment images, scripts, etc.

@freelock (Contributor, Author) commented Apr 23, 2019

Rereading the pseudo-code: it looks like pre-5.1.0 Concourse actually triggers a scan of the job's resources, and that's what led to a comparatively snappy startup for the way we use Concourse.

In 5.1.0, how do the resources get scanned after a pipeline gets re-enabled? Given that most of our jobs only use 5 - 6 of the ~19 resources defined in the pipeline, it seems that in my case it's not scanning the resources needed by the job first -- and it's not scanning all the resources in the pipeline at the start (or if it is, it's possibly causing some other resource issue, but I'm not seeing that).

So it seems like the scheduler may only be scanning a few resources at a time, and then waiting until those have fully completed before scanning more? Perhaps the trouble is sheer size here -- many of our resources are sizable fractions of a GB (3 different branches of git repositories that are generally > 200MB, Docker images that have not necessarily been optimized...). It wouldn't surprise me if getting ALL the resources for a pipeline means transferring 2 - 5 GB of data, while a single job only depends on more like 100MB - 500MB...

@freelock (Contributor, Author) commented Apr 23, 2019

... and just to provide more info about how we have this configured: I just counted 14 different jobs that get manually triggered. Many of these automatically trigger other jobs as part of a normal Concourse pipeline -- but this illustrates that our pipelines have many resources defined that might never get used in a given run.
[screenshot: Selection_633]

@pivotal-jwinters (Contributor) commented Apr 24, 2019

@freelock in 5.1.0, when a manually triggered build is kicked off, it still triggers a scan of all its resources; it just happens out-of-band. All the resource scans actually happen in parallel now, whereas they used to happen serially, so in theory that should speed things up. I wonder if it's slow due to a resource contention issue with everything now happening at the same time.

@freelock (Contributor, Author) commented Apr 24, 2019

That could well be the issue -- but it's really biting us hard.

It strikes me that for our use case, serial fetching of resources works a lot better, especially if it gets those resources needed for the job being triggered first.

Two scenarios to illustrate:

  1. Many of our resources are different branches of the same git repository -- "project-master", "project-release", "project-release-version", and "project-develop". That's 4 copies of a repository that's generally at least 200MB, and at most 2 are used by any given job. They are pulled from a gitolite server which has been brought down by Concourse in the past (when we have more than 5 or 6 pipelines active, the check jobs alone crush our gitolite server, which is what led us to pausing pipelines when not in use). I often see dozens of "git" processes running on the workers, and I'm sure the sheer volume of I/O going on is slowing everything down.

  2. Many of our other resources are Docker images, generally pretty heavy ones, but most should share a lot of layers -- so if these were fetched serially, there would probably be a lot less to fetch.

And pretty much all our Docker images and several other git repositories are shared across nearly every pipeline we run -- if those could skip being checked while they are fresh, I'm guessing that alone would make a huge difference. (I think it made a noticeable difference in 5.0.0 with --enable-global-resources, but perhaps this algorithm has now defeated that?)

Would it be possible to have a startup switch for concourse web to use a serial scheduler instead of parallel?

@freelock (Contributor, Author) commented Apr 24, 2019

In general, for our setup, Concourse has always been too heavily parallel -- we've had to refactor all sorts of things to make it run serially anyway: heavy use of pool resources, pausing pipelines when not in use, etc. And even with that, the load on the workers goes to levels I didn't know were possible -- last week we were hitting loads over 800! Without those measures, it crushes the rest of our infrastructure: our git servers, our dev servers when tests are run, etc.

For us at least, spreading jobs out to run one or two at a time greatly speeds things up.

@freelock (Contributor, Author) commented Apr 24, 2019

Ha -- just thinking about it, resource contention and too much happening in parallel is one of the reasons we manually trigger so many jobs in the first place. Otherwise Concourse crushes the workers and our infrastructure!

@pivotal-jwinters (Contributor) commented Apr 24, 2019

@freelock instead of relying on manual checks for everything, you can tune Concourse to be less aggressive with its automatic resource checking by setting the flag --resource-checking-interval. By default it checks every 30s, but you can back this off considerably.
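
For a binary deployment that would look something like this (the 1h value is just an example):

    concourse web --resource-checking-interval 1h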

@freelock (Contributor, Author) commented Apr 24, 2019

But will that help for manually triggered jobs? Now that we are set up to work this way, it work(ed) extremely well for us until this scheduling change -- and I'm not sure how having an automatic scheduler for these jobs would be an improvement -- our bot has become so convenient for triggering what we need done, and it now connects several other systems too, becoming an "uber scheduler" for our company...

Please make manually triggered jobs useful again!

@vito (Member) commented May 13, 2019

Possibly related: #3845

@freelock (Contributor, Author) commented May 17, 2019

Responding to @ddadlani from #3695 ...

> Looking through your comment history on #3759, however, it looks like you have modified your workflow to rely on manually triggering builds because if not, your Concourse workers are under very high load. I'm more interested in what exactly was happening in the automatic build trigger scenario. Do you have statistics on how many containers and volumes were on each worker? How did the workers fail?

Well, our "normal" pipeline template has 19 resources, 25 individual jobs, and > 1400 lines... Our bot manually triggers about 15 of those jobs, though only about 4 are used as part of the everyday process. We have 2 other templates, and around 60 pipelines total instantiated. Many of the jobs are strung together to trigger automatically once the first job is kicked off (and many of the jobs have multiple tasks).

We were never able to rely on automatic scheduling because, as I said before, our workers got crushed when we turned on more than 4 or 5 pipelines. The workers start registering load in the several-hundred range (I didn't know you could get anything out of a computer showing a load of 800!), and the most common bottleneck seems to be I/O, with dozens of git processes all vying for attention. All of that happens without running any jobs whatsoever, which is why we came up with the unpause/pause strategy -- dollars for more hardware don't make sense for us, as the pipelines sit idle most of the time!

Even with no more than 2 pipelines enabled, running no more than 1 job each, on a single worker in our system (GCP n1-standard-1, 3.75GB RAM, local 375GB SSD scratch disk for the worker), the load fluctuates from 3 to 27 (the node_load1 stat from a Prometheus node_exporter). With more active, we just don't see progress on jobs for hours, until eventually something times out, gets killed, or runs out of subnet addresses... So we don't get that many errors; it just becomes too slow to be worthwhile.

@freelock (Contributor, Author) commented May 17, 2019

Oh, and often the worker just gets too loaded for the ATC to see it anymore... at some point the ATC kills the job and it errors out.

@freelock (Contributor, Author) commented Jun 23, 2019

OK, I was starting to feel a bit behind the times back on v5.0.1, so I tried upgrading to v5.3.0 yesterday. Today our jobs were back to being really, really slow -- the nightly job was only about 1/3 finished after 6 hours, instead of finishing entirely in < 4 hours.

So I decided to experiment with the scheduler to see if I could restore quickness on a modern build. After figuring out how to develop on Concourse, and getting a custom binary onto our production server, I found that commenting out this check in atc/scheduler/buildstarter.go, tryStartNextPendingBuild():

    if resource.LastCheckEndTime().Before(nextPendingBuild.CreateTime()) {
        return false, nil
    }

... this made triggered jobs start MUCH MUCH faster.

I'm thinking this checks, for each resource associated with a job, whether the last check completed after the job was triggered. But in a pipeline with 19 resources, many of them associated with multiple jobs, it can take half an hour for the scheduler to get around to checking the inputs for a particular job -- and if a resource has a long "check_every" interval, it might not get checked for a long time!

And when a resource rarely changes, this introduces a lot of delay into the system.

So I think instead of checking whether the resource has been checked after the job is triggered, it should check whether the resource is still valid at the time the job is triggered -- e.g. is it still within its check_every period? If so, let the job start!

For now, removing this block entirely seems to have fixed the performance regression for us -- resources are checked frequently enough as it is, and we can manually call fly check-resource if we actually do update a resource...

@freelock (Contributor, Author) commented Jun 23, 2019

Just to quantify the performance issues: on plain 5.0.1, Saturday morning's run completed 53 jobs in 4 hours (~13 jobs per hour). This morning (on 5.3.0), after 7 hours, 26 jobs were complete (~3.5 jobs per hour). After removing that code block, it completed 7 jobs in 30 minutes (back to ~13 jobs per hour).

These jobs are all the same exact job in a different pipeline, with 1 resource that is unique to the particular pipeline.

Is that block even needed? We'll run this configuration for a while to see if we have any side effects...

@freelock (Contributor, Author) commented Jun 24, 2019

> Is that block even needed? We'll run this configuration for a while to see if we have any side effects...

Yes, it is needed... without it, my jobs are not picking up new commits in various repository resources.

Trying this replacement block:

    // If the resource is currently being checked, try again later.
    if resource.LastCheckEndTime().Before(resource.LastCheckStartTime()) {
        return false, nil
    }

    // If the last completed check is older than the resource's check_every
    // window at the time the pending build was created, reload the resource
    // instead of starting the build. (The error from ParseDuration is
    // ignored; an empty or invalid check_every yields a zero duration.)
    expire, _ := time.ParseDuration(resource.CheckEvery())
    if resource.LastCheckEndTime().Add(expire).Before(nextPendingBuild.CreateTime()) {
        return resource.Reload()
    }

... so far this appears to be working. Jobs are not starting as fast as earlier, but we will see what the results are in our overnight run.

I think this will still hold triggered jobs back on resource checking, but hopefully not as much -- it ignores resources that have not expired, or that are already being checked when the triggered job is evaluated...

@marco-m (Contributor) commented Jun 24, 2019

@freelock

> After figuring out how to develop on Concourse, and getting a custom binary put on our production server,

Just wanted to say that you are doing a very good job! 💪 Hope your investigation gives the Concourse team enough hints to spot this subtle bug (I say subtle because this is the first time I've seen something like this reported).

@marco-m (Contributor) commented Jun 24, 2019

I think what could help (since you are already warmed up on the Concourse code :-) is to create a repro with the extensive test suites of Concourse.

@clarafu (Contributor) commented Jun 24, 2019

@freelock That block should be the part that ensures that if you manually trigger a build, we only start it after we have successfully run a check for every resource that is an input to the job. If a resource has a long check_every interval, we should disregard that, because we use a notification to trigger a resource scan for all the inputs when you manually trigger a build. I think the problem we are seeing (as a theory) is that the notification does not work as intended: sometimes it gets ignored, and that causes manually triggered builds to not run a resource scan immediately as intended.

@xtremerui (Contributor) commented Jun 24, 2019

That block only applies to manually triggered builds. In your case, nightly builds that are triggered by the scheduler should not be affected. Is it possible that something else is causing the slowness of resource checks?

@xtremerui (Contributor) commented Jun 24, 2019

> So I think instead of checking whether the resource has been checked after the job is triggered, it should check whether the resource is still valid at the time the job is triggered -- e.g. is it still within its check_every period? If so, let the job start!

@vito this is an interesting case that might be worth thinking about. Though I believe the original use case for manually triggering a build is that we want to force a check on every resource, always.

@freelock (Contributor, Author) commented Jun 24, 2019

@xtremerui we are using an external process to trigger these nightly builds (after unpausing the pipeline in question).

@clarafu could be... I wonder if having the pipeline paused beforehand exacerbates this. In our system, leaving pipelines active consumes way too many resources -- we have over 50 pipelines, and if we have more than 3 - 4 active, it starts crushing our git servers and more. We already have 3 servers dedicated to Concourse; we don't want to have to scale all our systems! And turning pipelines on only when we needed them worked fine -- until 5.1.0 and this scheduler change.

In general, because our pipelines are paused most of the time, they are not getting checked often... but most of the resources are shared, and may be fresh because they were just used in other pipelines. I don't see a reason to check them if they've been scanned within that "check_every" timeout -- as long as we can manually trigger a check-resource to freshen them if needed...

@clarafu (Contributor) commented Jun 24, 2019

@xtremerui yup, I think the block is necessary to make sure that we always check after we manually trigger a job to ensure we have the latest version.

@freelock (Contributor, Author) commented Jun 24, 2019

So I also added a log message right before the resource.Reload() to see how often it got hit. Our nightly run took 4.5 hours this morning, and this reload got called 4 times in all, over 53 runs.

I'm thinking I'll add another logger in the original condition -- when the resource check time is before the job trigger time -- and see which resources were blocking the triggered job from starting in a timely way...

@marco-m regarding setting up a repro case, I'm not really sure how to do this effectively -- it's a performance regression that I think is really exacerbated by our rather low-end production environment, which we're pushing quite hard. Our pipeline template is over 1000 lines long, with 19 resources, and the nightly job is near the end of the yaml file...

Another issue linked earlier in this thread mentioned the possibility that resources for triggered jobs weren't getting checked just once per run, but perhaps once every time any job used one as an input. That would certainly cause a big lag in our pipeline, if some 50 - 60 resource checks were happening before the nightly job could start!

@clarafu do you think what I changed it to meets the needs? if resource.LastCheckEndTime().Add(expire).Before(nextPendingBuild.CreateTime()) {

@freelock (Contributor, Author) commented Jun 24, 2019

oh whoops, I lied... that logger logged 16 times on the nightly run, not 4... New logger in place, will see tomorrow how many resource checks this code is now skipping...

@freelock (Contributor, Author) commented Jun 24, 2019

@clarafu

> If a resource has a long check-every interval, we should disregard that because we use a notification to trigger off a resource scan for all the inputs when you manually trigger a build.

... that's how it used to work, but I believe that code got stripped out here: dd9a7c9#diff-ec5d156bedf6973365a4b99c7af3d828L101

... and that's why we have a performance regression. Now, instead of kicking off a scan of a job's resources, it just sits there waiting for the check_every timeout to expire; only once the resource finally gets checked, after the manual trigger, does the job get to start. And it doesn't prioritize inputs to the triggered job -- it checks all resources in whatever order they happen to come up, while the manually triggered job just sits there waiting for everything to be ready.

At least that's my theory...

@clarafu (Contributor) commented Jun 24, 2019

@freelock Just to make sure I'm interpreting it right: that change makes it so that if the resource has been checked within its configured check_every interval as of the time the pending build was created, the manually triggered build will still go ahead.

In my opinion, that change wouldn't satisfy the current purpose of the manual build trigger, where users can trust that when they manually trigger a build, it will always run a check for every input to find the latest version to use within the build. If we want to make the change you made, we would need to reevaluate what a manual build trigger means.

We don't manually run a scan for a job's resources anymore, but we do kick off a notification to radar to run a scan for each resource: https://github.com/concourse/concourse/blob/master/atc/api/jobserver/create_build.go#L52 That's why I'm thinking there is a bug where notifications are not triggering resource scans. 🤔 We are going to be looking into this bug this week; I will keep you updated on the findings.
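
A rough sketch of that flow, with illustrative names only (the real code is at the link above):

    package sketch

    import "log"

    // Illustrative stand-ins for the real types in atc/api/jobserver and atc/radar.
    type input struct{ resourceName string }

    type scanNotifier interface {
        NotifyCheck(resourceName string) error // ask radar to check the resource
    }

    // After creating the pending build for a manual trigger, request a scan of
    // every job input so the scheduler's "checked since the build was created"
    // gate can pass. If a notification is dropped (e.g. the pipeline's radar
    // isn't running yet), the build just waits for the next periodic check --
    // which would look exactly like the delay reported in this issue.
    func requestInputChecks(inputs []input, radar scanNotifier) {
        for _, in := range inputs {
            if err := radar.NotifyCheck(in.resourceName); err != nil {
                log.Printf("failed to request check for %s: %v", in.resourceName, err)
            }
        }
    }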

I've previously tried testing this by configuring a pipeline with a resource with a long check_every interval and manually triggering a build, and the notification seemed to work properly: the resource was checked and a build was triggered really fast. So I think this bug might be related to the high stress your Concourse deployment puts on the (possibly) web nodes?

@freelock (Contributor, Author) commented Jun 24, 2019

@clarafu does that line happen before or after the job starts?

Yes, the change does make it skip checking a resource if the resource is within its "check_every" period. I think at some point way back (it might've been a bug), manually triggering a job did not always get the latest version of every resource, so that was not an assumption I held -- and in our scenario it's not a requirement, either. We would rather have the job start quicker than check a bunch of resources that rarely change.

In our scenario, we are quite interested in global resources (as most of ours are the same across pipelines). Ideally if we do change one of these resources in a critical way, we could just manually call "fly check-resource" on one instance and have it noticed across pipelines -- otherwise we wouldn't need it checked for days at a time -- many of these go for weeks without a change.

I know the project discourages knobs, but how about a parameter on fly trigger-job to check all resources before starting/skip checks on resources that have not expired?

@freelock (Contributor, Author) commented Jun 24, 2019

Regarding high stress on nodes: our web node has no trouble. It's the worker that hosts the check containers where we see loads go sky high, and sometimes an I/O bottleneck on our git server. On both sides we see dozens of git processes just waiting for I/O or CPU...

@clarafu (Contributor) commented Jun 24, 2019

@freelock It happens independently of the job starting, because the component that starts the job is the scheduler tick.

Hm, ya, that might've been a bug; a manual trigger should've always run a check and fetched the latest version.

I'm not sure about adding a separate feature for this; if we fixed the manual-trigger notification to resource scans, would that not be enough for your use case? I think you might be wanting to remove the checks from a manual trigger because the way you are using the manual trigger differs from how it is intended to be used. In the common case, it should just be used to ensure you have the latest version, but in your case you are using it to trigger jobs because you normally keep the pipelines paused. Maybe instead, we can dive into your problem of resources constantly checking and overloading your web/worker nodes? For example, maybe more intelligent checking patterns (#3975) would help?

vito added this to Backlog in Core Jun 24, 2019

@clarafu (Contributor) commented Jun 24, 2019

@freelock I might have a theory as to why the manually triggered builds are not working for you. When you unpause a pipeline, it takes a little while for the pipeline syncer to start up a radar for your pipeline. If you manually trigger a build before the radar process for that pipeline has started, the notification gets lost. Can you try unpausing your pipeline, waiting around 30 seconds to a minute, and then triggering a build?

@freelock (Contributor, Author) commented Jun 24, 2019

Hi,

Thanks for your attention on this!

> I'm not sure about adding a separate feature for this; if we fixed the manual-trigger notification to resource scans, would that not be enough for your use case?

Yes, that should be fine, because that was working for us up until 5.1.0.

> Maybe instead, we can dive into your problem of resources constantly checking and overloading your web/worker nodes?

In theory, this would help -- but we have the unpause/pause approach working very well for us right now, along with a queuing system. At the moment we have a single web node and two workers, one tagged and running behind a firewall. I'm pretty convinced at this point that we can scale this meager hardware much further by pausing everything while not in use (which is most of the time!). As it is, we think we can go from ~50 pipelines to over 200 without having to really scale our infrastructure much -- and at that point we can spin up another instance.

I'm concerned that if we turn these pipelines on all the time, that means having to scale up our hardware much sooner across the board, all to solve a different problem than we are trying to solve.

> When you unpause a pipeline, it takes a little while for the pipeline syncer to start up a radar for your pipeline. If you manually trigger a build before the radar process for that pipeline has started, the notification gets lost. Can you try unpausing your pipeline, waiting around 30 seconds to a minute, and then triggering a build?

... we can try this, certainly, to collect more information -- but I would hate to introduce a minute of delay to every job we run... it sounds like a likely theory! The job does start eventually under vanilla 5.1.0 - 5.3.0 -- but it was taking 18 - 35 minutes to do so (instead of 1 - 5 minutes).

Cheers,
John

clarafu added a commit that referenced this issue Jul 4, 2019

atc: add condition notifier to interval runner
If the interval runner is not started up yet and we send a notification
to check a resource, the notification gets lost. This commit adds a
safeguard: a condition notifier that checks a check-requested time value
on the resources table to confirm whether or not we need to run a check.

[#3759]

Signed-off-by: Clara Fu <cfu@pivotal.io>
Co-authored-by: Alex Suraci <suraci.alex@gmail.com>
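
A minimal sketch of the safeguard this commit describes, with illustrative names rather than the actual Concourse types:

    package sketch

    import (
        "context"
        "time"
    )

    // Illustrative stand-ins; the real interfaces live in atc/radar and atc/db.
    type checkable interface {
        Name() string
        CheckRequestedTime() time.Time // persisted on the resources table
        LastCheckEndTime() time.Time
    }

    type scanner interface {
        Scan(resourceName string) error
    }

    type intervalRunner struct {
        resource      checkable
        scanner       scanner
        interval      time.Duration
        notifications <-chan struct{} // "check now" requests
    }

    // Run wakes on either a notification or a timer tick, then consults the
    // persisted check-requested time, so a request sent before the runner
    // started (the lost-notification case) still results in a check.
    func (r *intervalRunner) Run(ctx context.Context) error {
        ticker := time.NewTicker(r.interval)
        defer ticker.Stop()
        for {
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-r.notifications:
            case <-ticker.C:
            }
            // Scan if a check was requested after the last completed check,
            // even if the original notification itself was missed.
            if r.resource.CheckRequestedTime().After(r.resource.LastCheckEndTime()) {
                if err := r.scanner.Scan(r.resource.Name()); err != nil {
                    return err
                }
            }
        }
    }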

clarafu added a commit that referenced this issue Jul 15, 2019

atc: add condition notifier to interval runner

[#3759]

clarafu moved this from Backlog to In Flight in Core Jul 16, 2019
