Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix actions acks delay #2406

Merged
merged 4 commits into from
Mar 29, 2023
Merged

Fix actions acks delay #2406

merged 4 commits into from
Mar 29, 2023

Conversation

aleksmaus
Copy link
Member

What does this PR do?

Fixes the issue with acks being delayed by the agent.

The cause of the issue is the shared fleet server "client" being locked here

defer c.clientLock.Unlock()

preventing any fleet server requests until the existing requests is finished.

The problem is exacerbated by the fact that there the long poll request doesn't return until there is any new action or timeout which as far as I remember was 5 minutes or possibly longer.
In such cases the acks request will be awaiting to acquire that which can only happens when the user either changes the policy or executes the action.

Depending on the timing the checkins and the actions dispatching loops, they competed with each other for one single shared client.

Here is a little bit more details/comments on the investigation:
#2410
#2410

As far as I understand this problem was introduced in 8.6 as a result of separation of the checkin and the actions dispatching loops which lead to the concurrent access to the http "client" and competition for the "clients lock" in the current implementation.

This is the "minimal viable" fix for this problem.

Why is it important?

Fixes a huge issue with the actions execution/acking.
This affects any action acking, but most of all osquery that relies on .fleet-action-results to be delivered in timely manner.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

How to test this PR locally

Run the agent with osquery, execute live pack of 30-40 queries.

Related issues

Screenshots

The live packs work as expected with the fix

Screenshot 2023-03-27 at 6 03 01 PM

@aleksmaus aleksmaus added the bug Something isn't working label Mar 27, 2023
@aleksmaus aleksmaus requested a review from cmacknz March 27, 2023 22:34
@aleksmaus aleksmaus requested a review from a team as a code owner March 27, 2023 22:34
@aleksmaus aleksmaus requested review from blakerouse and removed request for a team March 27, 2023 22:34
@mergify
Copy link
Contributor

mergify bot commented Mar 27, 2023

This pull request does not have a backport label. Could you fix it @aleksmaus? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@aleksmaus aleksmaus added 8.8-candidate backport-v8.7.0 Automated backport with mergify labels Mar 27, 2023
@mergify mergify bot removed the backport-skip label Mar 27, 2023
@aleksmaus
Copy link
Member Author

added the backport label for 8.7 in case if it can make it into 8.7.0 or 8.7.1

@elasticmachine
Copy link
Collaborator

elasticmachine commented Mar 27, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-03-29T14:58:15.836+0000

  • Duration: 18 min 24 sec

Test stats 🧪

Test Results
Failed 0
Passed 5251
Skipped 23
Total 5274

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages.

  • run integration tests : Run the Elastic Agent Integration tests.

  • run end-to-end tests : Generate the packages and run the E2E Tests.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Copy link
Collaborator

elasticmachine commented Mar 27, 2023

🌐 Coverage report

Name Metrics % (covered/total) Diff
Packages 98.438% (63/64) 👍
Files 69.863% (153/219) 👍
Classes 68.523% (283/413) 👍
Methods 53.989% (873/1617) 👍 0.028
Lines 39.288% (9740/24791) 👍 0.054
Conditionals 100.0% (0/0) 💚

@jlind23
Copy link
Contributor

jlind23 commented Mar 28, 2023

@aleksmaus we are pretty late in our 8.7 testing cycle so I would not recommend backporting it to 8.7.0.
Cc @cmacknz

@cmacknz
Copy link
Member

cmacknz commented Mar 28, 2023

We should backport this to 8.7.0 so it goes into 8.7.1. We likely should not merge the 8.7 backport until the 8.7.0 release has completed in case a last second build candidate needs to be created.

Copy link
Member

@cmacknz cmacknz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test to ensure that two requests can be made concurrently?

Most of the necessary HTTP setup looks like it is already done in https://github.com/cmacknz/elastic-agent/blob/7ab92c8a6d7350502a87e69b3cfef60c4d92ecbe/internal/pkg/remote/client_test.go#L76.

It should be possible to write a test where two goroutines try to make a request concurrently and then assert that two instances of the configured HTTP handler were invoked concurrently or something similar.


// Using the same lock that was used for sorting above
c.clientLock.Lock()
requester.SetLastError(err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced this is completely correct if Send() is called concurrently for two different API endpoints on the same host.

This assumes that errors from one HTTP route have equal weight for other routes on the same host which isn't actually true. It might be true if the error is a 5xx error but isn't necessarily true for 4xx.

At the same time I'm not sure that fixing this makes any functional difference since this just alters the order in which hosts are tried.

This is probably worth a comment though in case the logic here changes in the future. Can you add some comments making it obvious that this function is intended to be used concurrently for multiple APIs on the same host?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only obvious fixes for this are having the errors be per API, having the agent use two clients, etc. Those seems like they add complexity vs just accepting that under some circumstances the priority of the hosts won't be exactly optimal when we are sorting them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not debating the existing code logic here. The logic of the errors "per client" is still the same, and I'm not exactly sure why this was implemented the way it is, just making the existing code safe for using concurrently.

@aleksmaus
Copy link
Member Author

aleksmaus commented Mar 28, 2023

Added the test that shows the issue, the test timed out while the second request is being blocked by the first request "in-flight" before the fix:
Screenshot 2023-03-28 at 3 48 50 PM

and it passes after the fix.

@aleksmaus aleksmaus requested a review from cmacknz March 28, 2023 20:08
@cmacknz
Copy link
Member

cmacknz commented Mar 29, 2023

Thanks, new test looks great. I confirmed it fails without this fix applied.

The only thing missing is a changelog fragment.

@aleksmaus aleksmaus merged commit c7abda1 into elastic:main Mar 29, 2023
mergify bot pushed a commit that referenced this pull request Mar 29, 2023
* Fix actions acks delay

* Make linter happy

* Add unit test replicating the issue and confirming that it works after the change

* Add changelog fragment

(cherry picked from commit c7abda1)
cmacknz pushed a commit that referenced this pull request Mar 30, 2023
* Fix actions acks delay

* Make linter happy

* Add unit test replicating the issue and confirming that it works after the change

* Add changelog fragment

(cherry picked from commit c7abda1)

Co-authored-by: Aleksandr Maus <aleksandr.maus@elastic.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
8.8-candidate backport-v8.7.0 Automated backport with mergify bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

V2 Agent actions acks/results are delayed by 5 minutes
4 participants