
Policy monitor load issues #3254

Open
michel-laterman opened this issue Feb 5, 2024 · 3 comments
Labels
Team:Fleet Label for the Fleet team

Comments

@michel-laterman
Contributor

Overview

The fleet-server performance tests have been failing for a while.
The specific test case that fails is multiple policy changes (submitted before they are fully acked).
This is reproducible on an 8GB ESS instance with 5k agents.

Currently we have no reports from customers or the community of this being an issue.

There is some discussion and investigation of this issue in the policy monitor debounce PR, but I believe we should move that discussion here.

Our current proposal for a workaround is to increase the default policy-monitor debounce time and extend or remove the policy dispatch timeout (currently 5s).

Issues

I believe that the issue with this scale test is created by the following components:

  • policy monitor limiter (previously the throttle)
  • policy monitor has a single mutex
  • no early cancelations

The limiter introduces a delay (default 5ms) between dispatches to subscriptions. This is done to avoid a policy change that affects many agents causing all checkin responses to be written at once (and allocating expensive gzip writers).

However, this delay increases the time the dispatchPending method takes, and since there is a single mutex, other operations on the policy monitor are blocked in the meantime, including processing new subscriptions, unsubscribes, and policy updates.

There is also no way to cancel a dispatch early. Specifically, if we are dispatching 10k subscriptions on rev:2 and rev:3 is detected, we should stop dispatching rev:2, update the policy reference to rev:3, then dispatch the newest revision to the remaining subscriptions.
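The interaction between the limiter delay and the single mutex can be illustrated with a small sketch. This is a toy stand-in, not the real fleet-server types: the names (`monitor`, `dispatchPending`, `subscribe`) and shapes are simplified assumptions, but it shows how holding one mutex across a rate-limited dispatch loop blocks every other monitor operation for roughly N*delay.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// monitor is a toy stand-in for the policy monitor: one mutex guards
// both dispatching and subscription management.
type monitor struct {
	mu   sync.Mutex
	subs []chan int
}

// dispatchPending notifies every subscription, sleeping `delay` between
// sends to mimic the limiter. The mutex is held for the whole loop, so
// with N subscribers the monitor is locked for roughly N*delay.
func (m *monitor) dispatchPending(rev int, delay time.Duration) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for _, ch := range m.subs {
		ch <- rev
		time.Sleep(delay)
	}
}

// subscribe blocks until dispatchPending releases the mutex.
func (m *monitor) subscribe() chan int {
	m.mu.Lock()
	defer m.mu.Unlock()
	ch := make(chan int, 1)
	m.subs = append(m.subs, ch)
	return ch
}

func main() {
	m := &monitor{}
	for i := 0; i < 100; i++ {
		m.subscribe()
	}
	start := time.Now()
	go m.dispatchPending(2, 5*time.Millisecond)
	time.Sleep(time.Millisecond) // let the dispatch loop grab the lock
	m.subscribe()                // blocked behind the entire dispatch loop
	fmt.Printf("subscribe waited ~%v behind the dispatch loop\n", time.Since(start))
}
```

With 100 subscribers at 5ms each, the late subscribe waits on the order of half a second; at 10k subscribers the same pattern stretches to nearly a minute.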

Proposed solutions

I think we can use a combination of solutions to fix the policyMonitor issues under load.

Unify action+policy limits in checkinT

The action limit (introduced in #2994) and the policy limit are both used to stop the checkin endpoint from writing every response to awaiting agents at once (so that expensive gzip writer allocations can be avoided).

I think that we should remove the limits from the actionDispatcher and policyMonitor and use a unified limiter in checkinT (to control access to the gzip pool in writeResponse).
This would allow the policyMonitor to dispatch pending changes much faster.

This is the easiest to implement and likely has the most impact.
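A minimal sketch of the idea, under stated assumptions: the variable names, the semaphore size, and the `writeResponse` shape here are illustrative, not the actual checkin handler. The point is that a single semaphore at the point of gzip writer allocation replaces the two separate limits, so dispatch loops never pause for compression.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"sync"
)

var (
	// Pool of reusable gzip writers; allocating these per-response is
	// the expensive operation the limits exist to throttle.
	gzPool = sync.Pool{New: func() any { return gzip.NewWriter(nil) }}
	// Unified limit: at most 8 responses compress concurrently.
	// Everything else queues here instead of in the dispatchers.
	gzSem = make(chan struct{}, 8)
)

func writeResponse(body []byte) []byte {
	gzSem <- struct{}{}        // acquire the unified limit
	defer func() { <-gzSem }() // release

	zw := gzPool.Get().(*gzip.Writer)
	defer gzPool.Put(zw)

	var buf bytes.Buffer
	zw.Reset(&buf)
	zw.Write(body)
	zw.Close()
	return buf.Bytes()
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			writeResponse([]byte("policy revision payload"))
		}()
	}
	wg.Wait()
	fmt.Println("100 responses compressed under a single limiter")
}
```

The design choice is that backpressure moves to the response-writing path, where the cost actually is, instead of being applied twice upstream in the action and policy dispatch loops.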

Interruptible dispatches

If we refactor the policy monitor so that dispatchPending can be canceled early when the underlying monitor receives an updated policy, we can avoid dispatching an outdated policy to agents.

This solution requires a lot more work in terms of implementation/testing and can be done along with any other solution.

Change subscription tracking

Currently, policy subscriptions are tracked as a linked list associated with a policy ID, as well as in the policy monitor's pendingQ (another linked list) that tracks subscriptions with pending changes.
When a new change is detected, the policyMonitor updates the policy's subscriptions by unlinking the list one item at a time and adding each subscription to the pendingQ, then dispatches the queue linearly.

Changing how subscriptions are tracked may have to be done as part of implementing interruptible dispatches.
If we change how subscriptions are transferred to the pendingQ, we should be able to pass the entire list at once (O(1) instead of O(n)).
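The O(1) transfer amounts to splicing one list onto another by relinking head/tail pointers instead of unlinking nodes one at a time. A minimal sketch with a toy intrusive list (the `sub` and `list` types are illustrative; the real fleet-server structures differ):

```go
package main

import "fmt"

// sub is one policy subscription in a toy intrusive singly-linked list.
type sub struct {
	id   int
	next *sub
}

type list struct {
	head, tail *sub
	n          int
}

func (l *list) push(s *sub) {
	if l.tail == nil {
		l.head = s
	} else {
		l.tail.next = s
	}
	l.tail = s
	l.n++
}

// spliceTo moves every subscription onto the pending queue in O(1) by
// relinking head/tail pointers, instead of unlinking items one by one.
func (l *list) spliceTo(pending *list) {
	if l.head == nil {
		return
	}
	if pending.tail == nil {
		pending.head = l.head
	} else {
		pending.tail.next = l.head
	}
	pending.tail = l.tail
	pending.n += l.n
	l.head, l.tail, l.n = nil, nil, 0
}

func main() {
	var policySubs, pendingQ list
	for i := 1; i <= 5; i++ {
		policySubs.push(&sub{id: i})
	}
	policySubs.spliceTo(&pendingQ) // one pointer swap, not five unlinks
	fmt.Printf("pendingQ holds %d subs, policy list holds %d\n", pendingQ.n, policySubs.n)
}
```

The splice itself is constant-time regardless of subscription count, which matters when a single policy change moves 10k subscriptions onto the pendingQ.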

@michel-laterman michel-laterman added bug Something isn't working Team:Fleet Label for the Fleet team labels Feb 5, 2024
@michel-laterman michel-laterman self-assigned this Feb 5, 2024
@jen-huang jen-huang removed the bug Something isn't working label Feb 6, 2024
@michel-laterman
Contributor Author

The unified throttle seems to solve the 10k perf test for the main branch.
However, it does not succeed on the "multiple" branch, where the 10s delay between policy updates (for the multiple-policy test) has been removed.

The underlying (policy) index monitor is successfully detecting all the updated docs in the ES index.
However, when dispatching to the policy monitor's subscriptions, it drops notifications due to a timeout.

We can try to solve this with interruptible dispatches in the policy monitor, as outlined in the issue description.

Alternatively, we can alter how the underlying index monitors function so that notify does not time out but may be canceled if it is called again.
However, this will take a significant amount of effort to implement and test, and may break some assumptions we have regarding document delivery.

I don't think it's worth addressing for the time being, as errors like the perf-test failure on the "multiple" branch have never been reported.
Also, the workaround would be very simple: issue a new policy revision if agents are stuck on an old revision for an extended period of time.

@michel-laterman
Contributor Author

Looks like the unified limiter has actually caused some pods to experience OOM crashes in the 10k serverless test.
I'm reverting the PR (#3274); we can then rebase #3273 as an attempt to fix it, however it may need more work.

@michel-laterman
Contributor Author

We're going to push this issue out to a later time, when we have more capacity, as it does not affect users and the workaround is pretty simple.

@michel-laterman michel-laterman removed their assignment Feb 13, 2024