Change Lidar scanner to not use go-routines. #8056
base: master
Conversation
I'd like to see the check-related metrics to verify that the change doesn't drop check rates, e.g. checks_started before/after the PR.
Check metrics don't show much difference before and after applying this PR, which is expected. I think the key of this PR is to add a limit which helps distribute in-memory checks across ATCs. Removing the goroutines supports the change, because we can easily test
First of all I think goroutines are probably the best feature that Golang has, which we should not be afraid of at all. Even now Lidar only I am not against serializing the scan checks, but the fact that it needs batching worries me. With this new env var, we probably will get lots of questions about In fact, from what I can tell the metrics for web CPU and memory are more flat (better distributed between nodes) but the average usage is higher. With the above reasons and the performance improvement shown in the metrics attached, IMO it is not worth removing goroutines in Lidar.
@xtremerui I think I stated in #8056 (comment) that performance improvement is not the key of this PR. Instead, I want to have an option to restrict the max number of resources to scan in a batch, which became your main concern. Our cluster has 27K resources to check. Meaning that, in each Lidar scan, it will start 27K goroutines, and the first batch may generate 27K checks, so 54K goroutines in total. Will it be a problem? I don't know. Currently I don't have a test cluster with the same capacity as our prod cluster, and I'm not able to simulate the same pipelines as prod in the test environment, so I don't have strong evidence to convince you. If you hesitate about this PR, that is fine; then how about switching #8061 into 7.7? When we upgrade to 7.7/8 next time, if we encounter a too-many-goroutines-on-a-single-ATC problem, I can apply this PR to a private build and try that build on our prod cluster. If it solves the problem, then that will be nice evidence.
I'm wondering what the batch scanning is trying to achieve? Are you seeing any problems with web CPU/memory every time the Lidar scanner runs? I like the idea of spreading out the scans in batches, but my concern is that we already have the rate limiter, whose purpose is exactly what this is also trying to achieve. Each time the rate limiter runs, it should help distribute the check times of every resource within the check interval. That means that the "last_checked" time should also be spread out more evenly, and with the last-checked times spread more evenly, the scanning should end up spreading out as a result. Maybe instead of adding a new knob we can try to consolidate them by having the rate limiter do its job properly, by actually being able to distribute the check load? It seems like the real problem here is that the rate limiter is not doing its job properly; maybe we can dig into that a bit further to see if there is anything we can do to improve it? When you say:
Are you referring to the work that has to be done within `concourse/atc/exec/check_step.go` (lines 106 to 146 in 7533e1d)?
@clarafu IMO, the current rate limiter has two problems:
My batch idea works like this. Say there are 10 resources, r1-1, r2-1, r3-1, r4-1, r1-2, r5-1, r6-1, r2-2, r7-1, r8-1, where r1-1 means the first resource of scope-1. Then say the batch size is 3, the Lidar scan interval is 10 seconds, and the resource check interval is 1 minute.

The first Lidar round runs on ATC1: r1-1, r2-1, r3-1 are checked, and they all finish within 10 seconds. After 10 seconds, the second round runs on ATC2: r1-1, r2-1, r3-1 are ignored as they were done in round 1; r4-1 is checked; r1-2 is ignored because scope-1 was already checked in round 1; then r5-1 and r6-1 are checked. All checks finish within 10 seconds. After another 10 seconds, the third round runs on ATC3: r1-1, r2-1, r3-1, r4-1, r1-2, r5-1, r6-1 are ignored as they were done in rounds 1 and 2; r2-2 is also ignored, because scope-2 was already checked; then r7-1 and r8-1 are checked. This way, you can see, the checks are spread across 3 ATCs, and each Lidar round only invokes 3 checks.

A problem with batching is that, in round 2, r2-1 may have been updated and may no longer belong to scope-2. But since Lidar doesn't recalculate scopes, r2-1 is ignored and its scope never gets recalculated. That's the reason why I added a resource update time in my dirty PR: https://github.com/concourse/concourse/pull/7991/files#diff-d2d09dceaae2073ebee6362308076ed28da0652690d5a2743a4af48d61e4d195R72 I should add the resource update time to this PR as well.

Our prod cluster has 27K resources to check and keeps growing. With in-memory checks, all resources in a Lidar round run on a single ATC; how many goroutines may one process end up with? In my load test, I observed that when an ATC has too many goroutines, its performance drops significantly. That's the reason why I wanted to add batching to Lidar.
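The round-by-round walkthrough above can be sketched as a small batch-selection loop (a hedged illustration with hypothetical types, not the actual Lidar code): each round skips resources whose scope was already checked within the interval and stops once the batch is full.

```go
package main

import "fmt"

// resource is a hypothetical stand-in for a checkable resource and its scope.
type resource struct {
	name  string
	scope string
}

// selectBatch walks resources in order, skips any whose scope was already
// checked within the current check interval, and stops once batchSize new
// checks have been selected. Skipped resources do not count toward the batch.
func selectBatch(resources []resource, checkedScopes map[string]bool, batchSize int) []resource {
	var batch []resource
	for _, r := range resources {
		if len(batch) == batchSize {
			break
		}
		if checkedScopes[r.scope] {
			continue // scope already checked this interval; skip it
		}
		checkedScopes[r.scope] = true
		batch = append(batch, r)
	}
	return batch
}

func main() {
	resources := []resource{
		{"r1-1", "scope-1"}, {"r2-1", "scope-2"}, {"r3-1", "scope-3"},
		{"r4-1", "scope-4"}, {"r1-2", "scope-1"}, {"r5-1", "scope-5"},
		{"r6-1", "scope-6"}, {"r2-2", "scope-2"}, {"r7-1", "scope-7"},
		{"r8-1", "scope-8"},
	}
	checked := map[string]bool{} // persists across rounds within one check interval
	for round := 1; round <= 3; round++ {
		batch := selectBatch(resources, checked, 3)
		fmt.Printf("round %d:", round)
		for _, r := range batch {
			fmt.Printf(" %s", r.name)
		}
		fmt.Println()
	}
}
```

Running this with the ten example resources reproduces the walkthrough: round 1 selects r1-1, r2-1, r3-1; round 2 selects r4-1, r5-1, r6-1 (skipping r1-2 because scope-1 is done); round 3 selects r7-1, r8-1 (skipping r2-2 because scope-2 is done).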
@marco-m Since you also operate a busy cluster and have encountered similar issues, what do you think about this PR?
@evanchaoli I am not familiar with this code, so I can only give my general opinion, at the risk of sounding like a fortune cookie :-) As @xtremerui points out, a high number of goroutines is not a problem per se; the Go runtime can handle that order of magnitude. To me, it is more a question of whether the design choice of mapping one goroutine to one resource is appropriate for the problem at hand or not. Before enjoying Go, I was greatly influenced by Erlang, so having tens of thousands of goroutines can be the "natural" approach. If on the other hand we can see a clear simplification by not employing goroutines, then we should consider the simplification. You make the point that the current design is able to recover from a goroutine crash, while this PR removes this safety. Would it be possible to push it down one level? I mean: could the logic in this PR be wrapped in one or a few goroutines? Then it would also be possible to recover these goroutines in case of panic. Somewhat even more high-level from a design point of view is what @clarafu mentions regarding two aspects:
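The panic-recovery suggestion above could look roughly like this (a minimal sketch with hypothetical names, not the actual Lidar API): keep the scan loop serial, but recover per resource, so one bad resource cannot crash the ATC process and the crash isolation that per-resource goroutines provided is preserved.

```go
package main

import "fmt"

// scanAll runs scanScope for each scope serially, recovering from any panic
// so that one misbehaving resource cannot take down the whole process.
// It returns how many scopes were scanned successfully.
func scanAll(scopes []string, scanScope func(string)) (scanned int) {
	for _, s := range scopes {
		func() {
			defer func() {
				if err := recover(); err != nil {
					// the real scanner logs this as "panic-in-scanner-run";
					// here we just print it
					fmt.Printf("recovered while scanning %s: %v\n", s, err)
				}
			}()
			scanScope(s)
			scanned++
		}()
	}
	return scanned
}

func main() {
	n := scanAll([]string{"scope-1", "scope-2", "scope-3"}, func(s string) {
		if s == "scope-2" {
			panic("bad resource config")
		}
	})
	fmt.Println("successfully scanned:", n) // scope-2 panicked but was contained
}
```

Note that `recover` only needs a deferred function, not a goroutine, so this works inside the serialized loop; wrapping the whole loop in one goroutine with a single recover would also protect the process, but would abort the remainder of the round on the first panic.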
@marco-m May I ask how many resources your cluster has? In our cluster, there are ~50K resources.
Considering only Linux, we have 40--70 workers (autoscaling) and the metric |
Changes proposed by this PR
Changed the Lidar scanner to not use goroutines. With in-memory checks, the Lidar scanner only creates build objects and pushes them to a channel; since there are no DB operations and no I/O operations, it doesn't have to scan each resource in its own goroutine. This saves a lot of goroutines.
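As a hedged illustration of this change (the types and names are mine, not the actual atc code), the serialized scanner reduces to a plain loop that pushes cheap build objects onto a channel; the expensive work happens on the consumer side, so no per-resource goroutine is needed:

```go
package main

import "fmt"

// build is a hypothetical stand-in for the in-memory check build object.
type build struct{ resourceID int }

// scan creates a build per resource and pushes it onto the channel serially.
// This is cheap: no DB or I/O work happens here, only object creation.
func scan(resourceIDs []int, checks chan<- build) {
	for _, id := range resourceIDs {
		checks <- build{resourceID: id}
	}
}

func main() {
	checks := make(chan build, 10)
	scan([]int{1, 2, 3}, checks)
	close(checks)
	for b := range checks {
		// a consumer (the checker) would run the actual check here
		fmt.Println("queued check for resource", b.resourceID)
	}
}
```

The design trade-off is that the producer is now strictly sequential, which is why the PR argues it is also easier to test than one goroutine per resource.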
Added `--lidar-scanner-batch-size` to limit the max number of resources to scan in a round. To emphasize, the batch size only counts created checks. Say there are resources 1-10 and the batch size is 3: in the first round, resources 1-3 are scanned; in the second round, since 1-3 have not reached their next scan interval, 4-6 are scanned, and so on. This helps distribute resource scans across ATCs. The reason why I must add `--lidar-scanner-batch-size` is that our cluster now has more than 80K resources; without a limit, I'm afraid an ATC might be exhausted when it takes its turn to run the Lidar scanner. With this batch size setting, we will have a weapon in hand.

Notes to reviewer
My only fear is that, in goroutine mode, if a resource causes a panic, the goroutine recovers it, so the entire ATC process won't crash. But I searched our ATC logs for "panic-in-scanner-run" and didn't find any match.
Though there is a rate limit in `checkDelegate.WaitToRun()`, I think that is too late: by the time that rate limit kicks in, too many CPU cycles and I/Os have already been wasted.

Metrics show reduced goroutines after applying this PR:
Release Note
`--lidar-scanner-batch-size`: when a deployment has too many resources, this option can limit how many resources are scanned in a round.