Fix actions acks delay #2406
Conversation
This pull request does not have a backport label. Could you fix it @aleksmaus? 🙏
NOTE: added the backport label for 8.7 in case it can make it into 8.7.0 or 8.7.1
🌐 Coverage report
@aleksmaus we are pretty late in our 8.7 testing cycle so I would not recommend backporting it to 8.7.0.
We should backport this to 8.7 so it goes into 8.7.1. We likely should not merge the 8.7 backport until the 8.7.0 release has completed, in case a last-second build candidate needs to be created.
Can you add a test to ensure that two requests can be made concurrently?
Most of the necessary HTTP setup looks like it is already done in https://github.com/cmacknz/elastic-agent/blob/7ab92c8a6d7350502a87e69b3cfef60c4d92ecbe/internal/pkg/remote/client_test.go#L76.
It should be possible to write a test where two goroutines try to make a request concurrently and then assert that two instances of the configured HTTP handler were invoked concurrently or something similar.
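A sketch of such a test, using only the standard library's httptest (the helper names here are illustrative, not the actual elastic-agent test code): the handler blocks until all requests have arrived, and the test asserts that the peak number of overlapping handler invocations equals the number of requests fired.

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"sync"
	"sync/atomic"
)

// maxConcurrent fires n requests in parallel at a test server whose
// handler blocks until all n have arrived, then reports the peak number
// of overlapping handler invocations. If the client under test
// serialized requests, the handlers could never all arrive at once,
// so a real test would also guard this with a timeout.
func maxConcurrent(n int) int32 {
	var inFlight, peak int32
	var arrived sync.WaitGroup
	arrived.Add(n)
	release := make(chan struct{})

	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		cur := atomic.AddInt32(&inFlight, 1)
		for { // record the highest overlap seen so far
			p := atomic.LoadInt32(&peak)
			if cur <= p || atomic.CompareAndSwapInt32(&peak, p, cur) {
				break
			}
		}
		arrived.Done()
		<-release // hold the handler open so requests pile up
		atomic.AddInt32(&inFlight, -1)
	}))
	defer srv.Close()

	go func() {
		arrived.Wait() // all n handlers are now running concurrently
		close(release)
	}()

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			resp, err := http.Get(srv.URL)
			if err == nil {
				resp.Body.Close()
			}
		}()
	}
	wg.Wait()
	return atomic.LoadInt32(&peak)
}

func main() {
	if got := maxConcurrent(2); got != 2 {
		panic("requests were serialized")
	}
}
```

In the real test, http.Get would be replaced with two goroutines calling the remote client's Send.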
// Using the same lock that was used for sorting above
c.clientLock.Lock()
requester.SetLastError(err)
I'm not convinced this is completely correct if Send() is called concurrently for two different API endpoints on the same host.
This assumes that errors from one HTTP route have equal weight for other routes on the same host, which isn't actually true. It might be true if the error is a 5xx but isn't necessarily true for a 4xx.
At the same time I'm not sure that fixing this makes any functional difference since this just alters the order in which hosts are tried.
This is probably worth a comment though in case the logic here changes in the future. Can you add some comments making it obvious that this function is intended to be used concurrently for multiple APIs on the same host?
The only obvious fixes for this are having the errors be per API, having the agent use two clients, etc. Those seem like they add complexity versus just accepting that under some circumstances the priority of the hosts won't be exactly optimal when we sort them.
Not debating the existing code logic here. The logic of the errors "per client" is still the same, and I'm not exactly sure why it was implemented the way it is; I'm just making the existing code safe for concurrent use.
Thanks, the new test looks great. I confirmed it fails without this fix applied. The only thing missing is a changelog fragment.
* Fix actions acks delay
* Make linter happy
* Add unit test replicating the issue and confirming that it works after the change
* Add changelog fragment

(cherry picked from commit c7abda1)
What does this PR do?
Fixes the issue with acks being delayed by the agent.
The cause of the issue is the shared fleet server "client" being locked here:
elastic-agent/internal/pkg/remote/client.go
Line 164 in 018bc0b
preventing any fleet server requests until the existing request is finished.
The problem is exacerbated by the fact that the long poll request doesn't return until there is a new action or a timeout, which as far as I remember was 5 minutes or possibly longer.
In such cases the acks request will be waiting to acquire that lock, which can only happen when the user either changes the policy or executes an action.
Depending on the timing of the checkin and actions dispatching loops, they compete with each other for the single shared client.
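The contention described above can be sketched in a few lines (a hypothetical simplification, not the actual client.go code): holding one mutex for the full duration of every request means a long-poll checkin starves the acks request, whereas narrowing the lock to just the shared-state update lets the two requests overlap.

```go
package main

import (
	"sync"
	"time"
)

// client sketches the pre-fix behavior: one mutex guards every request.
type client struct{ mu sync.Mutex }

// sendSerialized holds the lock for the whole request, as the old code
// effectively did; a minutes-long long poll blocks every other request.
func (c *client) sendSerialized(d time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	time.Sleep(d) // stands in for the HTTP round trip
}

// sendConcurrent only takes the lock for the brief shared-state update
// (sorting hosts, recording the last error), so requests can overlap.
func (c *client) sendConcurrent(d time.Duration) {
	c.mu.Lock()
	// update shared host ordering / last error here
	c.mu.Unlock()
	time.Sleep(d)
}

// elapsedWith runs two "requests" in parallel and measures wall time.
func elapsedWith(send func(time.Duration)) time.Duration {
	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); send(50 * time.Millisecond) }()
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	c := &client{}
	serial := elapsedWith(c.sendSerialized)   // requests queue up
	parallel := elapsedWith(c.sendConcurrent) // requests overlap
	if parallel >= serial {
		panic("expected concurrent sends to overlap")
	}
}
```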
Here is a little bit more details/comments on the investigation:
#2410
As far as I understand, this problem was introduced in 8.6 as a result of the separation of the checkin and actions dispatching loops, which led to concurrent access to the http "client" and competition for the "clients lock" in the current implementation.
This is the "minimal viable" fix for this problem.
Why is it important?
Fixes a huge issue with the actions execution/acking.
This affects any action acking, but especially osquery, which relies on .fleet-action-results being delivered in a timely manner.
Checklist
- Added an entry in ./changelog/fragments using the changelog tool

How to test this PR locally
Run the agent with osquery, execute a live pack of 30-40 queries.
Related issues
Screenshots
The live packs work as expected with the fix.