
Enable cookbook version response caching #2955

Merged: 9 commits from mp/cbv-cache into main on Dec 6, 2021

Conversation

@marcparadise (Member) commented Nov 18, 2021

Most of our time post-depsolve in the cookbook_versions/ endpoint is spent assembling the results for the client to consume. On a heavily loaded server running OTP 20 or later, a single
call can take 60+ seconds to complete and significantly degrade performance of the entire system. Prior to that, the same call could still take 3-6 seconds, but without systemic performance degradation.

This change introduces an ets-based cache that:

  • allows callers to cache the results
  • has overload protection, allowing the caller to force client retry (via 503 response) instead of waiting for the cache to become available
  • has synchronization capabilities, so that one caller can determine that another process is already calculating results for a given key and retry the request, instead of having multiple callers perform the same calculation.

Some numbers, using sample org data that tends toward generating large
responses from the cookbook_versions endpoint because of the total number of
cookbook files:

7k clients / 15 min interval, cache disabled: CPU 1800-2000%, LA 25, cookbook_versions rough average 2-5 s
7k clients / 15 min interval, cache enabled: CPU 150-200%, LA 5-6, cookbook_versions rough average 0.2-1.1 s
40k clients / 15 min interval, cache enabled: CPU 1000-1200%, LA 19, cookbook_versions 0.2-3 s, with most under 0.5 s
40k clients / 15 min interval, cache disabled: server fails within a minute

For more details, see the commit messages in this PR.

Signed-off-by: Marc A. Paradise marc.paradise@gmail.com

@netlify bot commented Nov 18, 2021

👷 Deploy Preview for chef-server processing.

🔨 Explore the source changes: fe968e0

🔍 Inspect the deploy log: https://app.netlify.com/sites/chef-server/deploys/61aa4d237db5600007ba65d6

@marcparadise marcparadise force-pushed the mp/cbv-cache branch 5 times, most recently from 2d664dd to 312f90d Compare November 24, 2021 01:10
@marcparadise marcparadise marked this pull request as ready for review November 24, 2021 01:17
@marcparadise marcparadise requested review from a team as code owners November 24, 2021 01:17
@lbakerchef (Contributor) left a comment

Looks good. Just some general questions and thoughts:

  1. Do you have some kind of conceptual diagram or diagrams showing the design? I don't know if this even applies (thus, why I'd find such a thing useful :) ), but what I'm imagining is a square or block of squares representing ETS cells, a circle representing a gen_server interacting with both callers and the ETS cell(s), callers calling the gen_server, and (if this even happens) callers interacting with the ETS cell(s) directly, bypassing the gen_server.

  2. Is the gen_server necessary so that we aren't duplicating or repeating work all over the place? Is it more write coordination, read coordination, or both? Or neither?

  3. Do callers have to go through the gen_server to read the cache, or do they read it directly (I noticed you set read_concurrency to true)? From my experience years ago optimizing erlang concurrency, I found that the less a gen_server is a bottleneck to anything, the better. Having said that, whatever the case is here, this should hopefully be a tremendous improvement to the current state of affairs.

Good job.

Resolved review comments on:
- src/oc_erchef/apps/chef_objects/src/chef_cbv_cache.erl (2)
- src/oc_erchef/apps/oc_chef_wm/src/chef_wm_depsolver.erl (2)
@marcparadise (Member, Author) commented Nov 24, 2021

> Do you have some kind of conceptual diagram or diagrams showing the design? I don't know if this even applies (thus, why I'd find such a thing useful :) ), but what I'm imagining is a square or block of squares representing ETS cells, a circle representing a gen_server interacting with both callers and the ETS cell(s), callers calling the gen_server, and (if this even happens) callers interacting with the ETS cell(s) directly, bypassing the gen_server.

I have some things I sketched out by hand as I was thinking it through. It does seem worthwhile to update that and make it into a shareable doc, will see what I can do there.

> Is the gen_server necessary so that we aren't duplicating or repeating work all over the place? Is it more write coordination, read coordination, or both? Or neither?

Yes, we're using it to coordinate writes, and to allow callers to retry reads when a write is pending for the solution/key they're waiting for. Though that might be worth looking closer at. The flow could look something like:

  1. caller attempts direct lookup in ets, if it fails..
  2. try to 'claim' the solution in chef_cbv_cache; if it succeeds it calculates the solution and stores it, as it does now. If it fails, it does the same retry behavior it has now.

This would still prevent multiple callers from assembling the same results on a cache miss, and further reduces the possibility that the gen_server gets overloaded. But (thinking this through as I type...) this approach does open a small gap where a caller (A) might miss on ETS, and another caller (B) completes its own cache put. When (A) attempts to claim intent on the key, the gen_server will allow it because (B) already finished, causing (A) to recalculate the response that (B) just populated. If that happens for multiple "A"s at the same time, things get messy.
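For illustration, a minimal sketch of that two-step flow, assuming a named ETS table (`cbv_cache`), a hypothetical `BuildFun`, and an `ok` success return from `claim/1` (none of which are confirmed by the merged code):

```erlang
%% Hypothetical direct-read variant: callers hit ETS first and only talk to
%% the gen_server to claim a miss. Table name, return shapes, and BuildFun
%% are illustrative assumptions.
fetch_or_build(Key, BuildFun) ->
    case ets:lookup(cbv_cache, Key) of
        [{Key, Value}] ->
            {ok, Value};                         % cache hit, no gen_server round trip
        [] ->
            case chef_cbv_cache:claim(Key) of
                ok ->                            % we own the key: build and store it
                    Value = BuildFun(),
                    ok = chef_cbv_cache:put(Key, Value),
                    {ok, Value};
                retry ->
                    retry;                       % another caller is building it; re-read later
                busy ->
                    busy                         % breaker tripped; caller surfaces a 503
            end
    end.
```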

> Do callers have to go through the gen_server to read the cache, or do they read it directly (I noticed you set read_concurrency to true)?

They do go through the gen_server for reads. I set read_concurrency to true because this table is very read-heavy, and I was initially considering a pool of workers which would enable concurrent access. The flag is left over from that; I'll remove it.

> From my experience years ago optimizing erlang concurrency, I found that the less a gen_server is a bottleneck to anything, the better. Having said that, whatever the case is here, this should hopefully be a tremendous improvement to the current state of affairs.

That was my main concern as well - and it actually did happen in early testing under load, which is what led to adding the 'circuit breaker' behavior[1]. It works well both in our load testing and for a customer seeing this issue. But I've since added the claim/intent behavior to prevent dup work. Testing with that change has shown that even under heaviest load the breaker was rarely 'tripped'. I think this is for a couple of reasons:

a) multiple callers were no longer building the same solution, which reduces load by a surprising amount, and leaves the cbv_cache gen server more cycles to continue processing messages
b) multiple callers were no longer sending the same /large/ put message of {Key, ReallyBigSolutionTerm} to the cbv_cache as they finished their work.

The breaker still does get tripped under burst load when the server is already under load, but I've only seen it a handful of times, for very short periods. More often I would see 503s from callers giving up on being able to claim the key, but even those were rare: about 3-5% of cookbook_versions endpoint calls responded with 503 when the server was under the max sustainable load of 40k runs/15 min [about 520 total API requests per second]. I don't have exact numbers, but I estimate over 90% of those 503s were from giving up on claiming the key.
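For context, a minimal sketch of what a message-queue-based breaker check could look like; the threshold, process name, and return convention are assumptions, not the merged implementation:

```erlang
%% Trip the breaker when the cache process's mailbox grows past a threshold.
%% The threshold value and registered process name are illustrative assumptions.
-define(MAX_QUEUE_LEN, 100).

breaker_tripped() ->
    case whereis(chef_cbv_cache) of
        undefined ->
            true;                                % cache process not running: treat as unavailable
        Pid ->
            case erlang:process_info(Pid, message_queue_len) of
                {message_queue_len, Len} -> Len > ?MAX_QUEUE_LEN;
                undefined -> true                % process exited between whereis/1 and here
            end
    end.
```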

@marcparadise (Member, Author) commented Nov 24, 2021

I tried a load run without the circuit breaker logic; it handled the load well up to 30k@15 but became a bottleneck at 40k -- although it still affected only this endpoint and did not otherwise overload the system.

Will leave that in place.

@marcparadise marcparadise requested a review from a team as a code owner November 25, 2021 09:55
@lbakerchef (Contributor) commented

> Do callers have to go through the gen_server to read the cache, or do they read it directly (I noticed you set read_concurrency to true)?
>
> They do go through the gen_server for reads. I set read_concurrency true because this table is very read-heavy; and I was initially considering a pool of workers which would enable concurrent access. The flag is left over from that, I'll remove it.

Actually, I wasn't suggesting it be removed (it's generally good to have, but use your own judgement in this case). I was merely hoping it was the case that processes didn't have to go through the gen_server for reads. But if they do, no worries, and either way this is a substantial improvement to what we have, and it solves a customer issue nicely.

@marcparadise (Member, Author) commented Dec 2, 2021

Will rebase this, run a final round of perf/load test, then get this merged today

When a POST to `cookbook_versions` results in a large list of cookbook files,
the cumulative cost of generating the formatted result can take a long time - upwards
of 5 seconds in some cases.  Due to scheduling changes in or around OTP 21, the impact of this on the
rest of the system has been greatly increased.

This change adds a `chef_cbv_cache` for caching these result sets.  This
cache has several protections built in to both reduce the likelihood of it
becoming a bottleneck, and to help prevent multiple callers from
attempting to assemble the same results concurrently when it is not in
the cache yet:

- built-in circuit breaker, so that if the process message queue for the
  cache gets overwhelmed, callers are notified and can tell the client to
  retry (via a 503 response) instead of waiting on the cache
- a 'claim' system where it is necessary to 'claim' the key you want to
  set before setting it with `put`.  This allows the caller to be
  notified with a `retry` response to `get` or `claim` that indicates
  another process is working to populate the value for that key, and
  that the caller should try to retrieve the value again after a delay.
- keys are set to expire with a TTL defined in `chef_objects` config as
  `chef_cbv_cache_item_ttl`. In order to prevent all keys set around
  the same time from expiring together and causing a rush of requests to
  populate the key, the actual TTL varies and is calculated as:
    (50% of configured TTL) + (0-50% of configured TTL)
  Note that this may no longer be necessary, because the final
  implementation allows callers to synchronize and avoid multiple
  callers doing the same work.
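For illustration, a minimal sketch of that jitter calculation, assuming the configured TTL is an integer; the helper name is made up and not part of the merged code:

```erlang
%% Spread expirations so keys written at the same time don't all expire together.
%% The result falls roughly in the range (50%..100%] of the configured TTL.
effective_ttl(ConfiguredTTL) ->
    Base = ConfiguredTTL div 2,
    Base + rand:uniform(max(1, ConfiguredTTL - Base)).
```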

Usage:

- check for a key: `chef_cbv_cache:get(Key)`
  - if the process is too busy to handle the request, this returns
    `busy`
  - if another process has claimed this key and is in the process of
    populating it, this returns `retry` indicating client should try to
    `get` again.

- claim the key, so that the current process will be the only one
  working on it:
    `chef_cbv_cache:claim(Key)`
  - if the process is too busy to handle the request, this returns
    `busy`
  - if another process has claimed this key and is in the process of
    populating it, this returns `retry`, indicating client should try to
    `get` again instead of continuing to attempt to populate the key.
- to set a key value after claiming it:
    `chef_cbv_cache:put(Key, Value)`
  - Attempting to set a value without first claiming that key will result in
    an error.

Signed-off-by: Marc A. Paradise <marc.paradise@gmail.com>
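A minimal caller-side sketch of the `get`/`claim`/`put` flow described in this commit message; the retry count, sleep interval, the `{ok, Value}` hit and `undefined` miss return shapes, and `BuildFun` are assumptions made for illustration:

```erlang
%% Caller flow: read through the cache, claim on a miss, build and store,
%% and back off when another process is already doing the work.
fetch(Key, BuildFun) ->
    fetch(Key, BuildFun, 5).

fetch(_Key, _BuildFun, 0) ->
    busy;                                        % give up; the endpoint can map this to a 503
fetch(Key, BuildFun, Retries) ->
    case chef_cbv_cache:get(Key) of
        {ok, Value} ->
            {ok, Value};                         % cache hit
        busy ->
            busy;                                % breaker tripped: ask the client to retry
        retry ->
            timer:sleep(100),                    % someone else is populating this key
            fetch(Key, BuildFun, Retries - 1);
        undefined ->                             % miss: try to claim the key ourselves
            case chef_cbv_cache:claim(Key) of
                ok ->
                    Value = BuildFun(),
                    ok = chef_cbv_cache:put(Key, Value),  % assumes put returns ok on success
                    {ok, Value};
                retry ->                         % lost the race; wait for the other caller's put
                    timer:sleep(100),
                    fetch(Key, BuildFun, Retries - 1);
                busy ->
                    busy
            end
    end.
```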
This makes use of the new `chef_cbv_cache` to minimize the number of
times overall that a given formatted cookbook response is re-created,
because this operation is very expensive when the list of cookbook files
is long.  Since roughly OTP 20, the impact of these CPU-intensive responses
on the overall health and responsiveness of the system under load has
increased significantly.

By significantly reducing how often we build a response, we reduce the
overall VM load, and the relatively infrequent rebuilding of a
response has much less impact on the VM.

Some numbers, using sample org data that tends toward generating large
responses from the cookbook_versions endpoint because of the total number of
cookbook files:

7k clients / 15 min interval, cache disabled: CPU 1800-2000%, LA 25, cookbook_versions rough average 2-5 s
7k clients / 15 min interval, cache enabled: CPU 150-200%, LA 5-6, cookbook_versions rough average 0.2-1.1 s
40k clients / 15 min interval, cache enabled: CPU 1000-1200%, LA 19, cookbook_versions 0.2-3 s, with most under 0.5 s
40k clients / 15 min interval, cache disabled: server fails within a minute

Signed-off-by: Marc A. Paradise <marc.paradise@gmail.com>
marcparadise and others added 4 commits December 2, 2021 11:31
Signed-off-by: Marc A. Paradise <marc.paradise@gmail.com>
Signed-off-by: Marc A. Paradise <marc.paradise@gmail.com>
Signed-off-by: jan shahid shaik <jashaik@progress.com>
Based on code review feedback, this adds an up-front call to envy:get to
determine if the cache is enabled before doing the breaker check or sending
the message to the cache.  The environment lookup is lighter-weight than
round-tripping two messages per call to a disabled service.

oc_chef_wm_sup will also no longer start chef_cbv_cache if it is disabled.

Signed-off-by: Marc A. Paradise <marc.paradise@gmail.com>
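A rough sketch of the up-front enabled check described in this commit; the commit uses envy:get, approximated here with `application:get_env/3`, and the config app/key names and return values are assumptions:

```erlang
%% Short-circuit before any breaker check or gen_server message when the
%% cache is disabled. Config names and the breaker helper are illustrative.
get(Key) ->
    case cache_enabled() of
        false ->
            undefined;                           % disabled: behave like a permanent miss
        true ->
            case breaker_tripped() of
                true  -> busy;
                false -> gen_server:call(?MODULE, {get, Key})
            end
    end.

cache_enabled() ->
    application:get_env(chef_objects, cbv_cache_enabled, false).
```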
@marcparadise (Member, Author) commented Dec 2, 2021

@lbakerchef For a future refinement, I've just about talked myself into allowing callers to read the ets directly without going through the gen_server. I think we'd need a minor modification to handle the race scenario I talked about above:

> this approach does open a small gap where a caller (A) might miss on ETS, and another caller (B) completes its own cache put. When (A) attempts to claim intent on the key, the gen_server will allow it because (B) already finished, causing (A) to recalculate the response that (B) just populated. If that happens for multiple "A"s at the same time, things get messy.

Modifying the cache to check whether the key already exists on claim would be a small change; responding retry if it is present *should* still ensure that only one process does the work of assembling the solution in that situation too.
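A hypothetical `handle_call` clause for that refinement; the state record, claims map, and table handle are assumptions made for the sketch, not the merged code:

```erlang
%% Refuse a claim when the value is already cached, so a caller that raced a
%% completed put is told to re-read instead of rebuilding the solution.
handle_call({claim, Key}, {FromPid, _Tag}, #state{tid = Tid, claims = Claims} = State) ->
    case ets:member(Tid, Key) of
        true ->
            {reply, retry, State};               % value already populated: go read it
        false ->
            case maps:is_key(Key, Claims) of
                true ->
                    {reply, retry, State};       % another process holds the claim
                false ->
                    {reply, ok, State#state{claims = Claims#{Key => FromPid}}}
            end
    end;
```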

Signed-off-by: Marc A. Paradise <marc.paradise@gmail.com>
@IanMadd (Contributor) left a comment

Docs suggestions

Co-authored-by: Ian Maddaus <IanMadd@users.noreply.github.com>
Signed-off-by: Marc A. Paradise <marc.paradise@gmail.com>
@lbakerchef (Contributor) commented Dec 3, 2021

> @lbakerchef For a future refinement, I've just about talked myself into allowing callers to read the ets directly without going through the gen_server. I think we'd need a minor modification to handle the race scenario I talked about above

@marcparadise In general, if something is good as-is (or in this case, great as-is), I'm all for going with what we've got right now and reserving any refinements for the future, if and when we get around to them, deem them practicable, find that the reward outweighs the risk, see a need, etc. In this case, maybe a couple-line comment somewhere in the source (assuming it's not already there), briefly describing at a high level the optimization that could be looked at and the race condition to be solved, would be more than adequate :)

Signed-off-by: Marc A. Paradise <marc.paradise@gmail.com>
@marcparadise marcparadise merged commit 8013e79 into main Dec 6, 2021
@marcparadise marcparadise deleted the mp/cbv-cache branch December 6, 2021 15:56
@lbakerchef (Contributor) commented

@marcparadise Notes for potential future improvements look great!
