
Enable cookbook version response caching #2955

Merged: 9 commits from mp/cbv-cache into main on Dec 6, 2021

Conversation

@marcparadise (Member) commented Nov 18, 2021

Most of our time post-depsolve in the cookbook_versions/ endpoint is spent assembling the results for the client to consume. On a heavily loaded server running OTP 20 or later, a single
call can take 60+ seconds to complete and significantly degrade performance of the entire system. Prior to that, the same call could still take 3-6 seconds, but without systemic performance degradation.

This change introduces an ets-based cache that:

  • allows callers to cache the results
  • has overload protection, allowing the caller to force client retry (via 503 response) instead of waiting for the cache to become available
  • has synchronization capabilities, so that one caller can determine that another process is already calculating results for a given key and retry the request, instead of having multiple callers perform the same calculation.

Some numbers, using sample org data that tends toward generating large
responses from the cookbook_versions endpoint because of the total number of
cookbook files:

7k clients / 15 min interval, cache disabled: CPU 1800-2000%, LA 25, cookbook_versions rough average 2-5 s
7k clients / 15 min interval, cache enabled: CPU 150-200%, LA 5-6, cookbook_versions rough average 0.2-1.1 s
40k clients / 15 min interval, cache enabled: CPU 1000-1200%, LA 19, cookbook_versions 0.2-3 s, with most under 0.5 s
40k clients / 15 min interval, cache disabled: server fails within a minute

For more details, see the commit messages in this PR.

Signed-off-by: Marc A. Paradise marc.paradise@gmail.com

@netlify bot commented Nov 18, 2021

👷 Deploy Preview for chef-server processing.

🔨 Explore the source changes: fe968e0

🔍 Inspect the deploy log: https://app.netlify.com/sites/chef-server/deploys/61aa4d237db5600007ba65d6

@marcparadise marcparadise force-pushed the mp/cbv-cache branch 5 times, most recently from 2d664dd to 312f90d Compare November 24, 2021 01:10
@marcparadise marcparadise marked this pull request as ready for review November 24, 2021 01:17
@marcparadise marcparadise requested review from a team as code owners November 24, 2021 01:17
@lbakerchef (Contributor) left a comment

Looks good. Just some general questions and thoughts:

  1. Do you have some kind of conceptual diagram or diagrams showing the design? I don't know if this even applies (thus, why I'd find such a thing useful :) ), but what I'm imagining is a square or block of squares representing ETS cells, a circle representing a gen_server interacting with both callers and the ETS cell(s), callers calling the gen_server, and (if this even happens) callers interacting with the ETS cell(s) directly, bypassing the gen_server.

  2. Is the gen_server necessary so that we aren't duplicating or repeating work all over the place? Is it more write coordination, read coordination, or both? Or neither?

  3. Do callers have to go through the gen_server to read the cache, or do they read it directly (I noticed you set read_concurrency to true)? From my experience years ago optimizing erlang concurrency, I found that the less a gen_server is a bottleneck to anything, the better. Having said that, whatever the case is here, this should hopefully be a tremendous improvement to the current state of affairs.

Good job.

Resolved review comments on:
- src/oc_erchef/apps/chef_objects/src/chef_cbv_cache.erl (2)
- src/oc_erchef/apps/oc_chef_wm/src/chef_wm_depsolver.erl (2)
@marcparadise (Member, Author) commented Nov 24, 2021

> Do you have some kind of conceptual diagram or diagrams showing the design? I don't know if this even applies (thus, why I'd find such a thing useful :) ), but what I'm imagining is a square or block of squares representing ETS cells, a circle representing a gen_server interacting with both callers and the ETS cell(s), callers calling the gen_server, and (if this even happens) callers interacting with the ETS cell(s) directly, bypassing the gen_server.

I have some things I sketched out by hand as I was thinking it through. It does seem worthwhile to update that and make it into a shareable doc, will see what I can do there.

> Is the gen_server necessary so that we aren't duplicating or repeating work all over the place? Is it more write coordination, read coordination, or both? Or neither?

Yes, we're using it to coordinate writes, and to allow callers to retry reads when a write is pending for the solution/key they're waiting for. Though that might be worth looking closer at. The flow could look something like:

  1. caller attempts direct lookup in ets, if it fails..
  2. try to 'claim' the solution in chef_cbv_cache; if it succeeds it calculates the solution and stores it, as it does now. If it fails, it does the same retry behavior it has now.

This would still prevent multiple callers from assembling the same results on a cache miss, and further reduces the possibility that the gen_server gets overloaded. But (thinking this through as I type...) this approach does open a small gap where a caller (A) might miss on ETS, and another caller (B) completes its own cache put. When (A) attempts to claim intent on the key, the gen_server will allow it because (B) already finished, causing (A) to recalculate the response that (B) just populated. If that happens for multiple "A"s at the same time, things get messy.
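For illustration, a minimal sketch of that two-step flow, assuming a named ETS table (`cbv_cache`), a hypothetical `BuildFun`, and an `ok` success return from `claim/1` (none of which are confirmed by the merged code):

```erlang
%% Hypothetical direct-read variant: callers hit ETS first and only talk to
%% the gen_server to claim a miss. Table name, return shapes, and BuildFun
%% are illustrative assumptions.
fetch_or_build(Key, BuildFun) ->
    case ets:lookup(cbv_cache, Key) of
        [{Key, Value}] ->
            {ok, Value};                         % cache hit, no gen_server round trip
        [] ->
            case chef_cbv_cache:claim(Key) of
                ok ->                            % we own the key: build and store it
                    Value = BuildFun(),
                    ok = chef_cbv_cache:put(Key, Value),
                    {ok, Value};
                retry ->
                    retry;                       % another caller is building it; re-read later
                busy ->
                    busy                         % breaker tripped; caller surfaces a 503
            end
    end.
```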

> Do callers have to go through the gen_server to read the cache, or do they read it directly (I noticed you set read_concurrency to true)?

They do go through the gen_server for reads. I set read_concurrency to true because this table is very read-heavy, and I was initially considering a pool of workers which would enable concurrent access. The flag is left over from that; I'll remove it.

> From my experience years ago optimizing erlang concurrency, I found that the less a gen_server is a bottleneck to anything, the better. Having said that, whatever the case is here, this should hopefully be a tremendous improvement to the current state of affairs.

That was my main concern as well - and it actually did happen in early testing under load, which is what led to adding the 'circuit breaker' behavior[1]. It works well both in our load testing and for a customer seeing this issue. But I've since added the claim/intent behavior to prevent dup work. Testing with that change has shown that even under heaviest load the breaker was rarely 'tripped'. I think this is for a couple of reasons:

a) multiple callers were no longer building the same solution, which reduces load by a surprising amount, and leaves the cbv_cache gen server more cycles to continue processing messages
b) multiple callers were no longer sending the same /large/ put message of {Key, ReallyBigSolutionTerm} to the cbv_cache as they finished their work.

The breaker still does get tripped under burst load when the server is already under load, but I've only seen it a handful of times, for very short periods. More often I would see 503s from callers giving up on being able to claim the key, but even those were rare: about 3-5% of cookbook_versions endpoint calls responded with 503 when the server was under the max sustainable load of 40k runs/15 min [about 520 total API requests per second]. I don't have exact numbers, but I estimate over 90% of those 503s were from giving up on claiming the key.
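For context, a minimal sketch of what a message-queue-based breaker check could look like; the threshold, process name, and return convention are assumptions, not the merged implementation:

```erlang
%% Trip the breaker when the cache process's mailbox grows past a threshold.
%% The threshold value and registered process name are illustrative assumptions.
-define(MAX_QUEUE_LEN, 100).

breaker_tripped() ->
    case whereis(chef_cbv_cache) of
        undefined ->
            true;                                % cache process not running: treat as unavailable
        Pid ->
            case erlang:process_info(Pid, message_queue_len) of
                {message_queue_len, Len} -> Len > ?MAX_QUEUE_LEN;
                undefined -> true                % process exited between whereis/1 and here
            end
    end.
```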

@marcparadise (Member, Author) commented Nov 24, 2021

I tried a load run without the circuit breaker logic; it handled the load well up to 30k@15 but became a bottleneck at 40k -- although it still affected only this endpoint and did not otherwise overload the system.

Will leave that in place.

@marcparadise marcparadise requested a review from a team as a code owner November 25, 2021 09:55
@lbakerchef (Contributor) commented

> Do callers have to go through the gen_server to read the cache, or do they read it directly (I noticed you set read_concurrency to true)?
>
> They do go through the gen_server for reads. I set read_concurrency true because this table is very read-heavy; and I was initially considering a pool of workers which would enable concurrent access. The flag is left over from that, I'll remove it.

Actually, I wasn't suggesting it be removed (it's generally good to have, but use your own judgement in this case). I was merely hoping it was the case that processes didn't have to go through the gen_server for reads. But if they do, no worries, and either way this is a substantial improvement to what we have, and it solves a customer issue nicely.

@marcparadise (Member, Author) commented Dec 2, 2021

Will rebase this, run a final round of perf/load test, then get this merged today

When a POST to `cookbook_versions` results in a large list of cookbook files,
the cumulative cost of generating the formatted result can take a long time - upwards
of 5 seconds in some cases.  Due to scheduling changes in or around OTP 21, the impact of this on the
rest of the system has been greatly increased.

This change adds a `chef_cbv_cache` for caching these result sets.  This
cache has several protections built in to both reduce the likelihood of it
becoming a bottleneck, and to help prevent multiple callers from
attempting to assemble the same results concurrently when it is not in
the cache yet:

- built-in circuit breaker, so that if the process message queue for the
  cache gets overwhelmed, callers are notified and can tell the client to
  retry (via a 503 response) instead of waiting on the cache
- a 'claim' system where it is necessary to 'claim' the key you want to
  set before setting it with `put`.  This allows the caller to be
  notified with a `retry` response to `get` or `claim` that indicates
  another process is working to populate the value for that key, and
  that the caller should try to retrieve the value again after a delay.
- keys are set to expire with a TTL defined in `chef_objects` config as
  `chef_cbv_cache_item_ttl`. In order to prevent all keys set around
  the same time from expiring together and causing a rush of requests to
  populate the key, the actual TTL varies and is calculated as:
    (50% of configured TTL) + (0-50% of configured TTL)
  Note that this may no longer be necessary, because the final
  implementation allows callers to synchronize and avoid multiple
  callers doing the same work.
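For illustration, a minimal sketch of that jitter calculation, assuming the configured TTL is an integer; the helper name is made up and not part of the merged code:

```erlang
%% Spread expirations so keys written at the same time don't all expire together.
%% The result falls roughly in the range (50%..100%] of the configured TTL.
effective_ttl(ConfiguredTTL) ->
    Base = ConfiguredTTL div 2,
    Base + rand:uniform(max(1, ConfiguredTTL - Base)).
```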

Usage:

- check for a key: `chef_cbv_cache:get(Key)`
  - if the process is too busy to handle the request, this returns
    `busy`
  - if another process has claimed this key and is in the process of
    populating it, this returns `retry` indicating client should try to
    `get` again.

- claim the key, so that the current process will be the only one
  working on it:
    `chef_cbv_cache:claim(Key)`
  - if the process is too busy to handle the request, this returns
    `busy`
  - if another process has claimed this key and is in the process of
    populating it, this returns `retry`, indicating client should try to
    `get` again instead of continuing to attempt to populate the key.
- to set a key value after claiming it:
    `chef_cbv_cache:put(Key, Value)`
  - Attempting to set a value without first claiming that key will result in
    an error.

Signed-off-by: Marc A. Paradise <marc.paradise@gmail.com>
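A minimal caller-side sketch of the `get`/`claim`/`put` flow described in this commit message; the retry count, sleep interval, the `{ok, Value}` hit and `undefined` miss return shapes, and `BuildFun` are assumptions made for illustration:

```erlang
%% Caller flow: read through the cache, claim on a miss, build and store,
%% and back off when another process is already doing the work.
fetch(Key, BuildFun) ->
    fetch(Key, BuildFun, 5).

fetch(_Key, _BuildFun, 0) ->
    busy;                                        % give up; the endpoint can map this to a 503
fetch(Key, BuildFun, Retries) ->
    case chef_cbv_cache:get(Key) of
        {ok, Value} ->
            {ok, Value};                         % cache hit
        busy ->
            busy;                                % breaker tripped: ask the client to retry
        retry ->
            timer:sleep(100),                    % someone else is populating this key
            fetch(Key, BuildFun, Retries - 1);
        undefined ->                             % miss: try to claim the key ourselves
            case chef_cbv_cache:claim(Key) of
                ok ->
                    Value = BuildFun(),
                    ok = chef_cbv_cache:put(Key, Value),  % assumes put returns ok on success
                    {ok, Value};
                retry ->                         % lost the race; wait for the other caller's put
                    timer:sleep(100),
                    fetch(Key, BuildFun, Retries - 1);
                busy ->
                    busy
            end
    end.
```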
This makes use of the new `chef_cbv_cache` to minimize the number of
times overall that a given formatted cookbook response is re-created,
because this operation is very expensive when the list of cookbook files
is long.  Since roughly OTP 20, the impact of these CPU-intensive responses
on the overall health and responsiveness of the system under load has
increased significantly.

By significantly reducing how often we build a response, we reduce the
overall VM load, and the relatively infrequent rebuilding of a
response has much less impact on the VM.

Some numbers, using sample org data that tends toward generating large
responses from the cookbook_versions endpoint because of the total number of
cookbook files:

7k clients / 15 min interval, cache disabled: CPU 1800-2000%, LA 25, cookbook_versions rough average 2-5 s
7k clients / 15 min interval, cache enabled: CPU 150-200%, LA 5-6, cookbook_versions rough average 0.2-1.1 s
40k clients / 15 min interval, cache enabled: CPU 1000-1200%, LA 19, cookbook_versions 0.2-3 s, with most under 0.5 s
40k clients / 15 min interval, cache disabled: server fails within a minute

Signed-off-by: Marc A. Paradise <marc.paradise@gmail.com>
marcparadise and others added 4 commits December 2, 2021 11:31
Signed-off-by: Marc A. Paradise <marc.paradise@gmail.com>
Signed-off-by: Marc A. Paradise <marc.paradise@gmail.com>
Signed-off-by: jan shahid shaik <jashaik@progress.com>
Based on code review feedback, this adds an up-front call to envy:get to
determine if the cache is enabled before doing the breaker check or sending
the message to the cache.  The environment lookup is lighter-weight than
round-tripping two messages per call to a disabled service.

oc_chef_wm_sup will also no longer start chef_cbv_cache if it is disabled.

Signed-off-by: Marc A. Paradise <marc.paradise@gmail.com>
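A rough sketch of the up-front enabled check described in this commit; the commit uses envy:get, approximated here with `application:get_env/3`, and the config app/key names and return values are assumptions:

```erlang
%% Short-circuit before any breaker check or gen_server message when the
%% cache is disabled. Config names and the breaker helper are illustrative.
get(Key) ->
    case cache_enabled() of
        false ->
            undefined;                           % disabled: behave like a permanent miss
        true ->
            case breaker_tripped() of
                true  -> busy;
                false -> gen_server:call(?MODULE, {get, Key})
            end
    end.

cache_enabled() ->
    application:get_env(chef_objects, cbv_cache_enabled, false).
```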
@marcparadise (Member, Author) commented Dec 2, 2021

@lbakerchef For a future refinement, I've just about talked myself into allowing callers to read the ets directly without going through the gen_server. I think we'd need a minor modification to handle the race scenario I talked about above:

> this approach does open a small gap where a caller (A) might miss on ETS, and another caller (B) completes its own cache put. When (A) attempts to claim intent on the key, the gen_server will allow it because (B) already finished, causing (A) to recalculate the response that (B) just populated. If that happens for multiple "A"s at the same time, things get messy.

Modifying the cache to check whether the key already exists on claim would be a small change; responding retry if it is present *should* still ensure that only one process does the work of assembling the solution in that situation too.
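A hypothetical `handle_call` clause for that refinement; the state record, claims map, and table handle are assumptions made for the sketch, not the merged code:

```erlang
%% Refuse a claim when the value is already cached, so a caller that raced a
%% completed put is told to re-read instead of rebuilding the solution.
handle_call({claim, Key}, {FromPid, _Tag}, #state{tid = Tid, claims = Claims} = State) ->
    case ets:member(Tid, Key) of
        true ->
            {reply, retry, State};               % value already populated: go read it
        false ->
            case maps:is_key(Key, Claims) of
                true ->
                    {reply, retry, State};       % another process holds the claim
                false ->
                    {reply, ok, State#state{claims = Claims#{Key => FromPid}}}
            end
    end;
```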

Signed-off-by: Marc A. Paradise <marc.paradise@gmail.com>
@IanMadd (Contributor) left a comment

Docs suggestions

Co-authored-by: Ian Maddaus <IanMadd@users.noreply.github.com>
Signed-off-by: Marc A. Paradise <marc.paradise@gmail.com>
@lbakerchef (Contributor) commented Dec 3, 2021

> @lbakerchef For a future refinement, I've just about talked myself into allowing callers to read the ets directly without going through the gen_server. I think we'd need a minor modification to handle the race scenario I talked about above

@marcparadise In general, if something is good as-is (or in this case, great as-is), I'm all for going with what we've got right now and reserving any refinements for the future, if and when we get around to them, deem them practicable, find that the reward outweighs the risk, see a need, etc. In this case, maybe a couple-line comment somewhere in the source (assuming it's not already there), briefly describing at a high level the optimization that could be looked at and the race condition to be solved, would be more than adequate :)

Signed-off-by: Marc A. Paradise <marc.paradise@gmail.com>
@marcparadise marcparadise merged commit 8013e79 into main Dec 6, 2021
@marcparadise marcparadise deleted the mp/cbv-cache branch December 6, 2021 15:56
@lbakerchef (Contributor) commented

@marcparadise Notes for potential future improvements look great!
