
Change Lidar scanner to not use go-routines. #8056

Open
wants to merge 1 commit into base: master

Conversation

evanchaoli
Contributor

@evanchaoli evanchaoli commented Feb 10, 2022

Changes proposed by this PR

Changed the Lidar scanner to not use go-routines. With in-memory checks, the Lidar scanner only creates build objects and pushes them to a channel; since there are no DB or IO operations, it doesn't need to scan each resource in its own go-routine. This saves a lot of go-routines.

Added --lidar-scanner-batch-size to limit the maximum number of resources scanned in a round. To emphasize: the batch size only counts created checks. Say there are resources 1-10 and the batch size is 3. In the first round, resources 1-3 are scanned. In the second round, since 1-3 have not yet reached their next scan interval, 4-6 are scanned, and so on. This helps distribute resource scans across ATCs. The reason I must add --lidar-scanner-batch-size is that our cluster now has more than 80K resources; without a limit, I'm afraid an ATC might be exhausted when it takes its turn to run the Lidar scanner. With this batch-size setting, we have a weapon in hand.
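
For illustration, a minimal sketch of the loop described above; the names (Resource, tryCreateCheck, scanRound) are hypothetical, not the actual Concourse identifiers:

type Resource struct{ Name string }

func tryCreateCheck(r Resource) bool { return true } // stand-in for the real check creation

func scanRound(resources []Resource, batchSize int) {
    created := 0
    for _, r := range resources {
        if batchSize > 0 && created >= batchSize {
            break // leave the rest for the next round, possibly on another ATC
        }
        // Only checks that are actually created count against the batch;
        // resources still inside their check interval pass through freely.
        if tryCreateCheck(r) {
            created++
        }
    }
}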

  • done

Notes to reviewer

My only concern is that, in go-routine mode, if a resource causes a crash, the go-routine is recovered so that the entire ATC process won't crash. But I searched our ATC logs for "panic-in-scanner-run" and didn't find any match.

There is a rate limit in checkDelegate.WaitToRun(), but I think it kicks in too late: by the time that rate limit is reached, too many CPU cycles and IOs have already been wasted.

Metrics show reduced go-routines after applying this PR:

[image: go-routine metrics (8056)]

Release Note

  • Optimized ATC performance by removing go-routines from the Lidar scanner.
  • Added --lidar-scanner-batch-size. When a deployment has too many resources, this option can limit how many resources are scanned in each round.

…atch-size.

Signed-off-by: Evan <chaol@vmware.com>
@evanchaoli evanchaoli added this to the v7.7.0 milestone Feb 15, 2022
@xtremerui
Contributor

I'd like to see the check-related metrics to verify the change isn't dropping check rates, e.g. checks_started before/after the PR.

@evanchaoli
Contributor Author

I'd like to see the check-related metrics to verify the change isn't dropping check rates, e.g. checks_started before/after the PR.

Check metrics don't show much difference before and after applying this PR, which is expected.

[image: check metrics (8056-checks)]

I think the key of this PR is adding a limit that helps distribute in-memory checks across ATCs. Removing the go-routines supports that change, because we can easily test check()'s result and decide whether or not to do more checks.

@xtremerui
Contributor

First of all, I think go-routines are probably the best feature that Go has, and we should not be afraid of them at all. Even though Lidar now only creates build objects, it still does quite a lot of work in the TryCreateCheck method, including finding base resource types, (nested) deserialization, and eventually creating the build. I think it still falls into the use case for go-routines, especially as each check is independent of the others.

I am not against doing a serialized scan check, but the fact that it needs batching worries me. With this new env var, we will probably get lots of questions about what the right value is for X number of checks in a given deployment. And when someone has a performance issue, we get one more value to check, yet getting the right value for this env var is non-trivial. If a deployment grows, its operators also need to consider bumping the value, which requires a redeployment.

In fact, from what I can tell, the metrics for web CPU and memory are flatter (better distributed between nodes), but the average usage is higher. For the above reasons, and given the performance improvement shown in the attached metrics, IMO it is not worth removing go-routines from Lidar.

@evanchaoli
Contributor Author

@xtremerui I think I stated in #8056 (comment) that performance improvement is not the key point of this PR. Instead, I want an option to restrict the maximum number of resources scanned in a batch, but that has become your main concern.

Our cluster has 27K resources to check. That means each Lidar scan starts 27K go-routines, and the first batch may generate 27K checks, so 54K go-routines in total. Is that a problem? I don't know. I currently don't have a test cluster with the same capacity as our prod cluster, and I'm not able to simulate the same pipelines as prod in a test environment, so I don't have strong evidence to convince you. If you hesitate on this PR, that is fine; how about moving #8061 into 7.7 instead?

When we upgrade to 7.7/8 next time, if we hit the too-many-go-routines-on-a-single-ATC problem, I can apply this PR to a private build and try it on our prod cluster. If it solves the problem, that will be nice evidence.

@xtremerui xtremerui removed this from the v7.7.0 milestone Feb 17, 2022
@clarafu
Contributor

clarafu commented Feb 23, 2022

I'm wondering what the batch scanning is trying to achieve. Are you seeing problems with web CPU/memory every time the Lidar scanner runs?

I like the idea of spreading the scans out in batches, but my concern is that we already have the rate limiter, whose purpose is exactly what this is also trying to achieve. Each time the rate limiter runs, it should help distribute the check times of every resource within the check interval. That means the "last_checked" times should also be spread out more evenly, and with last-checked times spread more evenly, the scanning should end up spread out as a result. Maybe instead of adding a new knob, we can consolidate them by having the rate limiter do its job properly and actually distribute the check load. It seems like the real problem here is that the rate limiter is not doing its job properly; maybe we can dig into that a bit further to see if there is anything we can do to improve it. When you say:

By the time that rate limit is reached, too many CPU cycles and IOs have already been wasted.

Are you referring to the work that has to be done within the check_step before we rate limit? Like this code:

source, err := creds.NewSource(state, step.plan.Source).Evaluate()
if err != nil {
    return false, fmt.Errorf("resource config creds evaluation: %w", err)
}

var imageSpec runtime.ImageSpec
var imageResourceCache db.ResourceCache
if step.plan.TypeImage.GetPlan != nil {
    var err error
    imageSpec, imageResourceCache, err = delegate.FetchImage(ctx, *step.plan.TypeImage.GetPlan, step.plan.TypeImage.CheckPlan, step.plan.TypeImage.Privileged)
    if err != nil {
        return false, err
    }
} else {
    imageSpec.ResourceType = step.plan.TypeImage.BaseType
}

resourceConfig, err := step.resourceConfigFactory.FindOrCreateResourceConfig(step.plan.Type, source, imageResourceCache)
if err != nil {
    return false, fmt.Errorf("create resource config: %w", err)
}

// XXX(check-refactor): we should remove scopes as soon as it's safe to do
// so, i.e. global resources is on by default. i think this can be done when
// time resource becomes time var source (resolving thundering herd problem)
// and IAM is handled via var source prototypes (resolving unintentionally
// shared history problem)
scope, err := delegate.FindOrCreateScope(resourceConfig)
if err != nil {
    return false, fmt.Errorf("create resource config scope: %w", err)
}

// Point scope to resource before check runs. Because a resource's check build
// summary is associated with scope, only after pointing to scope, check status
// can be fetched.
if step.plan.Resource != "" {
    err = delegate.PointToCheckedConfig(scope)
    if err != nil {
        return false, fmt.Errorf("update resource config scope: %w", err)
    }
}

If it is that code, then one thing we could do is push the rate limiter part further up so that it blocks earlier.
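
For instance, a sketch of that reordering with stand-in names (not the real check_step code), assuming a golang.org/x/time/rate limiter:

import (
    "context"

    "golang.org/x/time/rate"
)

func runCheck(ctx context.Context, limiter *rate.Limiter) error {
    // Block here, before any setup, so a check that has to wait has not yet
    // paid for creds evaluation, image fetching, or scope creation.
    if err := limiter.Wait(ctx); err != nil {
        return err
    }
    if err := expensiveSetup(ctx); err != nil {
        return err
    }
    return performCheck(ctx)
}

// Stand-ins for the real check_step logic.
func expensiveSetup(ctx context.Context) error { return nil }
func performCheck(ctx context.Context) error   { return nil }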

@evanchaoli
Contributor Author

evanchaoli commented Feb 28, 2022

@clarafu IMO, the current rate limiter has two problems:

  • It kicks in too late. It's placed after the scope is calculated, but the rate limiter doesn't use the scope at all.
  • When the limit is hit, it just delays for a while; the go-routine is not released, and after the sleep it continues to run without checking the limit again (see the sketch below).
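
To illustrate that second problem with hypothetical code: a one-shot delay lets the go-routine proceed regardless of the limiter's state, whereas a blocking wait (e.g. rate.Limiter.Wait from golang.org/x/time/rate) parks it until a token is actually granted:

import (
    "context"
    "time"

    "golang.org/x/time/rate"
)

// One-shot delay: after the sleep the limit is never re-checked, so the
// configured rate can still be exceeded.
func checkWithOneShotDelay(limiter *rate.Limiter, run func()) {
    if !limiter.Allow() {
        time.Sleep(time.Second)
    }
    run()
}

// Blocking wait: run() starts only once the limiter actually grants a token.
func checkWithBlockingWait(ctx context.Context, limiter *rate.Limiter, run func()) error {
    if err := limiter.Wait(ctx); err != nil {
        return err
    }
    run()
    return nil
}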

My batch idea works like this:

Say there are 10 resources r1-1, r2-1, r3-1, r4-1, r1-2, r5-1, r6-1, r2-2, r7-1, r8-1, where r1-1 denotes the first resource of scope-1. Say the batch size is 3, the Lidar scan interval is 10 seconds, and the resource check interval is 1 minute.

First round, Lidar runs on ATC1: r1-1, r2-1, and r3-1 are checked, and they all finish within 10 seconds.

After 10 seconds, the second round runs on ATC2: r1-1, r2-1, and r3-1 are skipped because they were done in round 1. r4-1 is checked. r1-2 is skipped because scope-1 was already checked in round 1. Then r5-1 and r6-1 are checked. All checks finish within 10 seconds.

After another 10 seconds, the third round runs on ATC3: r1-1, r2-1, r3-1, r4-1, r1-2, r5-1, and r6-1 are skipped because they were done in rounds 1 and 2. r2-2 is also skipped, because scope-2 has already been checked. Then r7-1 and r8-1 are checked.

This way, checks are spread across 3 ATCs, and each Lidar round invokes only 3 checks.
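
In code, one round of this scheme could look like the following sketch (illustrative names only; a real implementation would track checked scopes and last-check times in the database rather than in memory):

type ScopedResource struct {
    Name    string
    ScopeID int
}

func createCheck(r ScopedResource) bool { return true } // stand-in

func runRound(resources []ScopedResource, scopeChecked map[int]bool, batchSize int) {
    created := 0
    for _, r := range resources {
        if created >= batchSize {
            break // e.g. round 1 stops after r1-1, r2-1, r3-1
        }
        if scopeChecked[r.ScopeID] {
            continue // e.g. r1-2 is skipped because scope-1 ran in round 1
        }
        if createCheck(r) {
            scopeChecked[r.ScopeID] = true
            created++
        }
    }
}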

But one problem with batching is that, in round 2, say r2-1 has been updated and no longer belongs to scope-2. Since Lidar doesn't recalculate scopes, r2-1 is skipped, and it gets no chance to have its scope recalculated. That's the reason I added a resource update time in my dirty PR: https://github.com/concourse/concourse/pull/7991/files#diff-d2d09dceaae2073ebee6362308076ed28da0652690d5a2743a4af48d61e4d195R72

I should add the resource update time to this PR as well.

Our prod cluster has 27K resources to check, and it keeps growing. With in-memory checks, all resources in a Lidar round run on a single ATC; how many go-routines might that one process have? In my load tests, I observed that when an ATC has too many go-routines, its performance drops significantly. That's why I wanted to add batching to Lidar.

@evanchaoli
Contributor Author

@marco-m If you also operate a busy cluster and are encountering similar issues, what do you think about this PR?

@marco-m
Contributor

marco-m commented Mar 29, 2022

@evanchaoli I am not familiar with this code, so I can only give my general opinion at the risk of sounding like a fortune cookie :-)

As @xtremerui points out, a high number of goroutines is not a problem per se. The Go runtime can handle that order of magnitude. To me, it is more a question of whether the design choice of mapping one goroutine to one resource is appropriate for the problem at hand. Before enjoying Go, I was greatly influenced by Erlang, so having tens of thousands of goroutines can be the "natural" approach. If, on the other hand, we can see a clear simplification from not employing goroutines, then we should consider the simplification.

You make the point that the current design is able to recover from a goroutine crash, while this PR removes this safety. Would it be possible to push it down one level? I mean: could the logic in this PR be wrapped in one or a few goroutines? Then it would also be possible to recover those goroutines in case of panic.
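
A wrapper like that could be as small as the following sketch (stdlib logging used here just for illustration):

import "log"

// Run the whole serial scan inside a single go-routine that converts a panic
// into a logged error instead of crashing the ATC process.
func scanRecoverably(scan func()) {
    go func() {
        defer func() {
            if r := recover(); r != nil {
                log.Printf("panic in scanner: %v", r)
            }
        }()
        scan()
    }()
}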

Even more high-level, from a design point of view, is what @clarafu mentions regarding two aspects:

  1. the rate limiter, which is meant to already address the problem we are discussing. If it is not sufficient, is that for a fundamental reason, or can it be made to work with a redesign or change?
  2. the idea of striving as much as possible not to add knobs. Instead, the system should be able to auto-tune in the majority of cases. Quoting John Ousterhout (of Tcl and Raft fame) in his book "A Philosophy of Software Design" (highly recommended! See his talk as an intro):

Avoid configuration parameters
You should avoid configuration parameters as much as possible. Before exporting a configuration parameter, ask yourself: “will users (or higher-level modules) be able to determine a better value than we can determine here?” When you do create configuration parameters, see if you can compute reasonable defaults automatically, so users will only need to provide values under exceptional conditions. Ideally, each module should solve a problem completely; configuration parameters result in an incomplete solution, which adds to system complexity.
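
Applied to this PR, that advice might mean deriving a default batch size from values the system already knows instead of asking operators for one. A purely illustrative sketch:

import "time"

// Spread one full pass over the resources across the check interval: each
// Lidar round (every scanInterval) then needs roughly this many new checks.
// An explicit --lidar-scanner-batch-size would still override the default.
func defaultBatchSize(numResources int, scanInterval, checkInterval time.Duration) int {
    size := int(float64(numResources) * scanInterval.Seconds() / checkInterval.Seconds())
    if size < 1 {
        size = 1
    }
    return size
}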

@evanchaoli
Contributor Author

@marco-m May I ask how many resources your cluster has? In our cluster, there are ~50K resources.

@marco-m
Contributor

marco-m commented Mar 31, 2022

Considering only Linux, we have 40-70 workers (autoscaling), and the metric concourse_workers_containers ranges between 1k and 2k, but I am not sure that is what you meant by "how many resources". How do you count the resources?

@clarafu clarafu removed their assignment May 12, 2022