Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Auditbeat] Fix up socket dataset runaway CPU usage #19764

Merged
merged 3 commits into from
Jul 9, 2020

Conversation

andrewstucki
Copy link
Contributor

@andrewstucki andrewstucki commented Jul 9, 2020

What does this PR do?

Fix for auditbeat runaway CPU usage: #19141

So, here's the explanation, basically everything was pretty much as described in the previous PR (#19033), the only additional things that I found were that:

  1. When a *socket is terminated by another socket with a different kernel tid it's moved to the closing LRU list.
  2. The new *socket is added to the state socks map with the ptr reference pointing to it
  3. The reaper comes along and hits the following code path:
	for item := s.closing.peek(); item != nil && item.Timestamp().Before(deadline); {
		if sock, ok := item.(*socket); ok {
			s.onSockTerminated(sock)
		} else {
			s.closing.get()
		}
		item = s.closing.peek()
	}
  1. The old "terminated" socket is now in a "closing" state, so onSockTerminated is called again
  2. In onSockTerminated the socket is pruned again from the socks map with the call to delete(s.socks, sock.sock)
  3. The problem is that the socks map now refers to the new *socket rather than the old one
  4. Eventually if the new *socket times out onSockDestroyed is called on it with the code that's doing the peek on the socketLRU in the reaper code
  5. That was taking a reference to the socket pointer that had been deleted from the socks map in step 5
  6. onSockDestroyed was running the following code:
	sock, found = s.socks[ptr]
	if !found {
		return nil
	}
  1. found was returning false and the function was returning
  2. Because of the call to s.socketLRU.peek() the same socket was getting returned over and over, resulting in the reaper routine getting wedged in a tight for loop (hence the high CPU usage).

The fix

Basically we pass a reference to the *socket object in the reaper's onSockDestroyed call, that way we don't have to look up the socket in s.socks and, instead handle the socket closure directly.

Related issues

@andrewstucki andrewstucki requested a review from a team as a code owner July 9, 2020 02:33
@andrewstucki andrewstucki self-assigned this Jul 9, 2020
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Jul 9, 2020
@andrewstucki andrewstucki added Auditbeat bug needs_backport PR is waiting to be backported to other branches. Team:SIEM labels Jul 9, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/siem (Team:SIEM)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jul 9, 2020
@elasticmachine
Copy link
Collaborator

elasticmachine commented Jul 9, 2020

💚 Build Succeeded

Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Build Cause: [Pull request #19764 updated]

  • Start Time: 2020-07-09T11:17:41.326+0000

  • Duration: 31 min 29 sec

Test stats 🧪

Test Results
Failed 0
Passed 230
Skipped 49
Total 279

@adriansr
Copy link
Contributor

adriansr commented Jul 9, 2020

Great investigation @andrewstucki, thanks for fixing this!

It explains the profiles shared in the discuss thread:

image

With onSockDestroyed() taking most of the CPU and only doing a map lookup.

Copy link
Contributor

@adriansr adriansr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, needs a changelog entry

@andrewstucki
Copy link
Contributor Author

Just added on to your previous changelog entry and rebased from master to hopefully fix the weird CI mage issues

@adriansr
Copy link
Contributor

adriansr commented Jul 9, 2020

I've shared a custom 7.8.1 build with this fix on the discuss thread.

@andrewstucki andrewstucki merged commit cb4cedc into elastic:master Jul 9, 2020
@andrewstucki andrewstucki added v7.9.0 and removed needs_backport PR is waiting to be backported to other branches. labels Jul 9, 2020
andrewstucki added a commit to andrewstucki/beats that referenced this pull request Jul 9, 2020
* Fix up socket dataset
* Add Changelog entry

(cherry picked from commit cb4cedc)
andrewstucki added a commit to andrewstucki/beats that referenced this pull request Jul 9, 2020
* Fix up socket dataset
* Add Changelog entry

(cherry picked from commit cb4cedc)
andrewstucki added a commit to andrewstucki/beats that referenced this pull request Jul 9, 2020
* Fix up socket dataset
* Add Changelog entry

(cherry picked from commit cb4cedc)
andrewstucki added a commit that referenced this pull request Jul 9, 2020
* Fix up socket dataset
* Add Changelog entry

(cherry picked from commit cb4cedc)
andrewstucki added a commit that referenced this pull request Jul 9, 2020
* Fix up socket dataset
* Add Changelog entry

(cherry picked from commit cb4cedc)
v1v added a commit to v1v/beats that referenced this pull request Jul 9, 2020
* upstream/master:
  Add `docker logs` support to the Elastic Log Driver (elastic#19531)
  [Elastic Agent] Fix saving of agent configuration on Windows to have proper ACLs (elastic#19793)
  Send the config revision down to the endpoint application. (elastic#19759)
  [Elastic Agent] Add support for multiple hosts in connection to kibana (elastic#19628)
  Remove the downloadConfig and retryConfig from plugin/process.Application and plugin/service.Application. (elastic#19603)
  Update go version to 1.14.4 (elastic#19753)
  ci: set builds as skipped when they do not match the trigger (elastic#19750)
  [Auditbeat] Fix up socket dataset runaway CPU usage (elastic#19764)
  Convert cloudfoundry input to v2 (elastic#19717)
andrewstucki added a commit that referenced this pull request Jul 9, 2020
…CPU usage (#19783)

* [Auditbeat] Fix up socket dataset runaway CPU usage (#19764)

* Fix up socket dataset
* Add Changelog entry

(cherry picked from commit cb4cedc)

* fix up changelog

* Fix changelog
melchiormoulin pushed a commit to melchiormoulin/beats that referenced this pull request Oct 14, 2020
* Fix up socket dataset
* Add Changelog entry
leweafan pushed a commit to leweafan/beats that referenced this pull request Apr 28, 2023
…unaway CPU usage (elastic#19783)

* [Auditbeat] Fix up socket dataset runaway CPU usage (elastic#19764)

* Fix up socket dataset
* Add Changelog entry

(cherry picked from commit f1ef970)

* fix up changelog

* Fix changelog
leweafan pushed a commit to leweafan/beats that referenced this pull request Apr 28, 2023
…lastic#19781)

* Fix up socket dataset
* Add Changelog entry

(cherry picked from commit f1ef970)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Auditbeat 7.7.x Poor Performance: 100%+ CPU Usage with System Module Socket Dataset Enabled
3 participants