Fix actions acks delay #2406
Conversation
This pull request does not have a backport label. Could you fix it @aleksmaus? 🙏
NOTE: added the backport label for 8.7 in case it can make it into 8.7.0 or 8.7.1
🌐 Coverage report
@aleksmaus we are pretty late in our 8.7 testing cycle so I would not recommend backporting it to 8.7.0.
We should backport this to 8.7 so it goes into 8.7.1. We likely should not merge the 8.7 backport until the 8.7.0 release has completed, in case a last-second build candidate needs to be created.
Can you add a test to ensure that two requests can be made concurrently?
Most of the necessary HTTP setup looks like it is already done in https://github.com/cmacknz/elastic-agent/blob/7ab92c8a6d7350502a87e69b3cfef60c4d92ecbe/internal/pkg/remote/client_test.go#L76.
It should be possible to write a test where two goroutines try to make a request concurrently and then assert that two instances of the configured HTTP handler were invoked concurrently or something similar.
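A sketch of such a test, using only the standard library's httptest (the helper names here are illustrative, not the actual elastic-agent test code): the handler blocks until all requests have arrived, and the test asserts that the peak number of overlapping handler invocations equals the number of requests fired.

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"sync"
	"sync/atomic"
)

// maxConcurrent fires n requests in parallel at a test server whose
// handler blocks until all n have arrived, then reports the peak number
// of overlapping handler invocations. If the client under test
// serialized requests, the handlers could never all arrive at once,
// so a real test would also guard this with a timeout.
func maxConcurrent(n int) int32 {
	var inFlight, peak int32
	var arrived sync.WaitGroup
	arrived.Add(n)
	release := make(chan struct{})

	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		cur := atomic.AddInt32(&inFlight, 1)
		for { // record the highest overlap seen so far
			p := atomic.LoadInt32(&peak)
			if cur <= p || atomic.CompareAndSwapInt32(&peak, p, cur) {
				break
			}
		}
		arrived.Done()
		<-release // hold the handler open so requests pile up
		atomic.AddInt32(&inFlight, -1)
	}))
	defer srv.Close()

	go func() {
		arrived.Wait() // all n handlers are now running concurrently
		close(release)
	}()

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			resp, err := http.Get(srv.URL)
			if err == nil {
				resp.Body.Close()
			}
		}()
	}
	wg.Wait()
	return atomic.LoadInt32(&peak)
}

func main() {
	if got := maxConcurrent(2); got != 2 {
		panic("requests were serialized")
	}
}
```

In the real test, http.Get would be replaced with two goroutines calling the remote client's Send.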
// Using the same lock that was used for sorting above
c.clientLock.Lock()
requester.SetLastError(err)
I'm not convinced this is completely correct if Send() is called concurrently for two different API endpoints on the same host.
This assumes that errors from one HTTP route have equal weight for other routes on the same host, which isn't actually true. It might be true if the error is a 5xx but isn't necessarily true for a 4xx.
At the same time I'm not sure that fixing this makes any functional difference since this just alters the order in which hosts are tried.
This is probably worth a comment though in case the logic here changes in the future. Can you add some comments making it obvious that this function is intended to be used concurrently for multiple APIs on the same host?
The only obvious fixes for this are having the errors be per API, having the agent use two clients, etc. Those seem like they add complexity versus just accepting that under some circumstances the priority of the hosts won't be exactly optimal when we sort them.
Not debating the existing code logic here. The logic of the errors "per client" is still the same, and I'm not exactly sure why it was implemented the way it is; I'm just making the existing code safe for concurrent use.
Thanks, the new test looks great. I confirmed it fails without this fix applied. The only thing missing is a changelog fragment.
* Fix actions acks delay
* Make linter happy
* Add unit test replicating the issue and confirming that it works after the change
* Add changelog fragment

(cherry picked from commit c7abda1)
What does this PR do?
Fixes the issue with acks being delayed by the agent.
The cause of the issue is the shared fleet server "client" being locked here:
elastic-agent/internal/pkg/remote/client.go
Line 164 in 018bc0b
preventing any fleet server requests until the existing request is finished.
The problem is exacerbated by the fact that the long poll request doesn't return until there is a new action or a timeout, which as far as I remember was 5 minutes or possibly longer.
In such cases the acks request will be waiting to acquire that lock, which can only happen when the user either changes the policy or executes an action.
Depending on the timing of the checkin and actions dispatching loops, they compete with each other for the single shared client.
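The contention described above can be sketched in a few lines (a hypothetical simplification, not the actual client.go code): holding one mutex for the full duration of every request means a long-poll checkin starves the acks request, whereas narrowing the lock to just the shared-state update lets the two requests overlap.

```go
package main

import (
	"sync"
	"time"
)

// client sketches the pre-fix behavior: one mutex guards every request.
type client struct{ mu sync.Mutex }

// sendSerialized holds the lock for the whole request, as the old code
// effectively did; a minutes-long long poll blocks every other request.
func (c *client) sendSerialized(d time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	time.Sleep(d) // stands in for the HTTP round trip
}

// sendConcurrent only takes the lock for the brief shared-state update
// (sorting hosts, recording the last error), so requests can overlap.
func (c *client) sendConcurrent(d time.Duration) {
	c.mu.Lock()
	// update shared host ordering / last error here
	c.mu.Unlock()
	time.Sleep(d)
}

// elapsedWith runs two "requests" in parallel and measures wall time.
func elapsedWith(send func(time.Duration)) time.Duration {
	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); send(50 * time.Millisecond) }()
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	c := &client{}
	serial := elapsedWith(c.sendSerialized)   // requests queue up
	parallel := elapsedWith(c.sendConcurrent) // requests overlap
	if parallel >= serial {
		panic("expected concurrent sends to overlap")
	}
}
```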
Here is a little bit more details/comments on the investigation:
#2410
As far as I understand, this problem was introduced in 8.6 as a result of the separation of the checkin and actions dispatching loops, which led to concurrent access to the http "client" and competition for the "clients lock" in the current implementation.
This is the "minimal viable" fix for this problem.
Why is it important?
Fixes a huge issue with the actions execution/acking.
This affects any action acking, but especially osquery, which relies on .fleet-action-results being delivered in a timely manner.
Checklist
- Added an entry in ./changelog/fragments using the changelog tool

How to test this PR locally
Run the agent with osquery, execute a live pack of 30-40 queries.
Related issues
Screenshots
The live packs work as expected with the fix.