Enable cookbook version response caching #2955
Conversation
Force-pushed from 2d664dd to 312f90d
Looks good. Just some general questions and thoughts:
- Do you have some kind of conceptual diagram or diagrams showing the design? I don't know if this even applies (thus, why I'd find such a thing useful :) ), but what I'm imagining is a square or block of squares representing ETS cells, a circle representing a gen_server interacting with both callers and the ETS cell(s), callers calling the gen_server, and (if this even happens) callers interacting with the ETS cell(s) directly, bypassing the gen_server.
- Is the gen_server necessary so that we aren't duplicating or repeating work all over the place? Is it more write coordination, read coordination, or both? Or neither?
- Do callers have to go through the gen_server to read the cache, or do they read it directly (I noticed you set `read_concurrency` to `true`)? From my experience years ago optimizing Erlang concurrency, I found that the less a gen_server is a bottleneck to anything, the better. Having said that, whatever the case is here, this should hopefully be a tremendous improvement to the current state of affairs.
Good job.
I have some things I sketched out by hand as I was thinking it through. It does seem worthwhile to update that and make it into a shareable doc, will see what I can do there.
Yes, we're using it to coordinate writes, and to allow callers to retry reads when a write is pending for the solution/key they're waiting for. Though that might be worth looking closer at. The flow could look something like:
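Sketched in Erlang (everything here other than the `chef_cbv_cache` module name is a placeholder: `build_response/1`, the table name, the retry delay, and the return values are assumptions, not the actual implementation), that alternate flow might look like:

```erlang
%% Hypothetical caller-side flow: read the ETS table directly, and only
%% involve the gen_server to coordinate writes after a miss.
fetch(Key) ->
    case ets:lookup(chef_cbv_cache, Key) of
        [{Key, Value}] ->
            {ok, Value};                        % direct ETS read, no gen_server
        [] ->
            case chef_cbv_cache:claim(Key) of
                ok ->
                    Value = build_response(Key),    % expensive assembly
                    chef_cbv_cache:put(Key, Value),
                    {ok, Value};
                retry ->
                    timer:sleep(50),                % another caller is populating
                    fetch(Key);
                busy ->
                    {ok, build_response(Key)}       % breaker tripped: bypass cache
            end
    end.
```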
This would still prevent multiple callers from assembling the same results on a cache miss, and it further reduces the possibility that the gen_server gets overloaded. But (thinking this through as I type...) this approach does open a small gap: a caller (A) might miss on ETS while another caller (B) completes its own cache put. When (A) then attempts to claim intent on the key, the gen_server will allow it because (B) already finished, causing (A) to recalculate the response that (B) just populated. If that happens for multiple "A"s at the same time, things get messy.
They do go through the gen_server for reads. I set `read_concurrency` to `true` because this table is very read-heavy, and I was initially considering a pool of workers which would enable concurrent access. The flag is left over from that; I'll remove it.
That was my main concern as well, and it actually did happen in early testing under load, which is what led to adding the 'circuit breaker' behavior[1]. It works well both in our load testing and for a customer seeing this issue. I've since added the claim/intent behavior to prevent duplicate work, and testing with that change has shown that even under the heaviest load the breaker was rarely tripped. I think this is mainly because multiple callers were no longer building the same solution, which reduces load by a surprising amount and leaves the cbv_cache gen_server more cycles to continue processing messages.

The breaker still does get tripped under burst load when the server is already under load, but I've only seen it a handful of times, for very short periods. More often I would see 503s because of callers giving up on being able to claim the key, but even those were rare: about 3-5% of cookbook_versions endpoint calls would respond with 503 when the server was under a maximum sustainable load of 40k runs/15min (about 520 total API requests per second). I don't have exact numbers, but I estimate over 90% of those 503s were from giving up on being able to claim the key.
I tried a load run without the circuit breaker logic, and it handled the load well up to 30k@15; but it became a bottleneck at 40k, although it still affected only this endpoint and did not otherwise overload the system. Will leave that in place.
Force-pushed from f5d7b65 to b90c4a9
Actually, I wasn't suggesting it be removed (it's generally good to have, but use your own judgment in this case). I was merely hoping it was the case that processes didn't have to go through the gen_server for reads. But if they do, no worries; either way this is a substantial improvement over what we have, and it solves a customer issue nicely.
Will rebase this, run a final round of perf/load testing, then get this merged today.
When a POST to `cookbook_versions` results in a large list of cookbook files, the cumulative cost of generating the formatted result can take a long time - upwards of 5 seconds in some cases. Due to scheduling changes in or around OTP 21, the impact of this on the rest of the system has been greatly increased.

This change adds a `chef_cbv_cache` for caching these result sets. The cache has several protections built in, both to reduce the likelihood of it becoming a bottleneck and to help prevent multiple callers from attempting to assemble the same results concurrently when they are not yet cached:

- A built-in circuit breaker: if the process message queue for the cache gets overwhelmed, callers are notified (with `busy`) and can proceed without the cache.
- A 'claim' system, where it is necessary to 'claim' the key you want to set before setting it with `put`. This allows a caller to be notified with a `retry` response to `get` or `claim`, indicating that another process is working to populate the value for that key and that the caller should try to retrieve the value again after a delay.
- Keys are set to expire with a TTL defined in `chef_objects` config as `chef_cbv_cache_item_ttl`. To prevent all keys set around the same time from expiring together and causing a rush of requests to repopulate them, the actual TTL varies and is calculated as: (50% of configured TTL) + (0-50% of configured TTL). Note that this may no longer be necessary, because the final implementation allows callers to synchronize and avoid multiple callers doing the same work.

Usage:

- Check for a key: `chef_cbv_cache:get(Key)`
  - If the process is too busy to handle the request, this returns `busy`.
  - If another process has claimed this key and is in the process of populating it, this returns `retry`, indicating the client should try to `get` again.
- Claim a key, so that the current process will be the only one working on it: `chef_cbv_cache:claim(Key)`
  - If the process is too busy to handle the request, this returns `busy`.
  - If another process has claimed this key and is in the process of populating it, this returns `retry`, indicating the client should try to `get` again instead of continuing to attempt to populate the key.
- Set a key's value after claiming it: `chef_cbv_cache:put(Key, Value)`
  - Attempting to set a value without first claiming the key will result in an error.

Signed-off-by: Marc A. Paradise <marc.paradise@gmail.com>
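A caller could combine the usage rules above roughly as follows. This is a sketch only: `assemble_response/1`, the retry delay, and the exact return shape of `get/1` on a hit or miss (`{ok, Value}` / `undefined`) are assumptions, since they are not spelled out here.

```erlang
%% Sketch of the get/claim/put protocol described above.
cached_response(Key) ->
    case chef_cbv_cache:get(Key) of
        {ok, Value} ->
            Value;
        busy ->
            assemble_response(Key);        % breaker open: skip the cache entirely
        retry ->
            timer:sleep(100),              % another process is populating Key
            cached_response(Key);
        undefined ->
            case chef_cbv_cache:claim(Key) of
                ok ->
                    Value = assemble_response(Key),
                    chef_cbv_cache:put(Key, Value),
                    Value;
                retry ->
                    timer:sleep(100),
                    cached_response(Key);
                busy ->
                    assemble_response(Key)
            end
    end.
```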
This makes use of the new `chef_cbv_cache` to minimize the overall number of times a given formatted cookbook response is re-created, because this operation is very expensive when the list of cookbook files is long. Since roughly OTP 20, these CPU-intensive responses have had a significant impact on the overall health and responsiveness of the system under load. By significantly reducing how often we build a response, we reduce the overall VM load, and the relatively infrequent occasions when a new response has to be built have much less impact on the VM.

Some numbers, using sample org data that tends toward generating large responses to the cookbook_versions endpoint because of the total number of cookbook files:

- 7k clients/15min interval, cache disabled: CPU 1800-2000%, LA 25. cookbook_versions rough average time range 2-5s.
- 7k clients/15min interval, cache enabled: CPU 150-200%, LA 5-6. cookbook_versions rough average time range 0.2-1.1s.
- 40k clients/15min interval, cache enabled: CPU 1000-1200%, LA 19. CBV time range 0.2-3s, with most at the < 0.5s end.
- 40k clients/15min interval, cache disabled: server fails within a minute.

Signed-off-by: Marc A. Paradise <marc.paradise@gmail.com>
Signed-off-by: Marc A. Paradise <marc.paradise@gmail.com>
Signed-off-by: Marc A. Paradise <marc.paradise@gmail.com>
Signed-off-by: jan shahid shaik <jashaik@progress.com>
Based on code review feedback, this adds an up-front call to envy:get to determine whether the cache is enabled before doing the breaker check or sending the message to the cache. The environment lookup calls are lighter weight than round-tripping two messages per call to a disabled service. oc_chef_wm_sup will also no longer start chef_cbv_cache if it is disabled. Signed-off-by: Marc A. Paradise <marc.paradise@gmail.com>
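As a sketch of that gating (the config key name and the `envy:get` arity used here are assumptions, not taken from the actual change):

```erlang
%% Check the app environment before the breaker check or any message to
%% the cache process; a disabled cache short-circuits to a miss.
maybe_get(Key) ->
    case envy:get(chef_objects, cbv_cache_enabled, false, boolean) of
        true  -> chef_cbv_cache:get(Key);
        false -> undefined
    end.
```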
Force-pushed from f789fba to 5ba47dc
@lbakerchef For a future refinement, I've just about talked myself into allowing callers to read the ets directly without going through the gen_server. I think we'd need a minor modification to handle the race scenario I talked about above:
Modifying the cache to check whether the key already exists at the time of the claim would close that gap: if the value is already there, the claim is denied and the caller can just use the existing value.
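A sketch of that refinement on the gen_server side (the record, table name, and claim-tracking shape are all hypothetical; the real module may track claims differently):

```erlang
-define(CACHE, chef_cbv_cache).
-record(state, {claims = #{} :: #{term() => pid()}}).

%% On claim, first check whether the key was populated after the caller's
%% direct ETS read missed; if so, deny the claim and hand back the value.
handle_call({claim, Key}, {FromPid, _Tag}, #state{claims = Claims} = State) ->
    case ets:lookup(?CACHE, Key) of
        [{Key, Value}] ->
            {reply, {ok, Value}, State};    % already populated, no rework needed
        [] ->
            {reply, ok, State#state{claims = maps:put(Key, FromPid, Claims)}}
    end.
```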
Signed-off-by: Marc A. Paradise <marc.paradise@gmail.com>
Docs suggestions
docs-chef-io/content/server/v14/config_rb_server_optional_settings.md
Co-authored-by: Ian Maddaus <IanMadd@users.noreply.github.com> Signed-off-by: Marc A. Paradise <marc.paradise@gmail.com>
Co-authored-by: Ian Maddaus <IanMadd@users.noreply.github.com> Signed-off-by: Marc A. Paradise <marc.paradise@gmail.com>
Force-pushed from f20fe66 to ed66572
@marcparadise In general, if something is good as-is (or in this case, great as-is), I'm all for going with what we've got right now, and reserving any refinements for the future if and when we get around to it, deem it practicable to do, the reward outweighs the risk, there is a need, etc. In this case, a couple-line comment somewhere in the source (assuming it's not already there), briefly describing at a high level the optimization that could be looked at and the race condition to be solved, would be more than adequate :)
Signed-off-by: Marc A. Paradise <marc.paradise@gmail.com>
@marcparadise Notes for potential future improvements look great!
Most of our time post-depsolve in the `cookbook_versions/` endpoint is spent assembling the results for the client to consume. On a heavily loaded server, since OTP 20 the same call can take 60+ seconds to complete and significantly degrade performance of the entire system. Prior to that, it could still take 3-6 seconds, though without systemic performance degradation.
This change introduces an ETS-based cache that:

- caches the formatted cookbook_versions response by key
- uses a built-in circuit breaker, so callers bypass the cache rather than pile up when its message queue is overwhelmed
- uses a claim/retry protocol, so only one caller assembles a given response on a cache miss
- expires entries with a jittered TTL, so keys set around the same time do not all expire together
Some numbers, using sample org data that tends toward generating large responses to the cookbook_versions endpoint because of the total number of cookbook files:

- 7k clients/15min interval, cache disabled: CPU 1800-2000%, LA 25. cookbook_versions rough average time range 2-5s.
- 7k clients/15min interval, cache enabled: CPU 150-200%, LA 5-6. cookbook_versions rough average time range 0.2-1.1s.
- 40k clients/15min interval, cache enabled: CPU 1000-1200%, LA 19. CBV time range 0.2-3s, with most at the < 0.5s end.
- 40k clients/15min interval, cache disabled: server fails within a minute.
For more details, see the commit messages in this PR.
Signed-off-by: Marc A. Paradise <marc.paradise@gmail.com>