
[shipper] Make the memory queue accept opaque pointers #31356

Merged: 46 commits merged into elastic:main from memqueue-cleanup-2 on Apr 29, 2022

Conversation

@faec (Contributor) commented on Apr 19, 2022:

What does this PR do?

Refactors the memory queue internal data structures to accept opaque pointers (interface{}) for its events rather than an explicit publisher.Event. This is needed for the queue to store the event representations anticipated in https://github.com/elastic/elastic-agent-shipper.

This doesn't fully resolve #31307 because it doesn't yet expose a type-agnostic public interface. This PR is already pretty big and I don't want it to eat into ON week, so I'm deferring those questions until I can give them full attention.

This change should, in a perfect world, be a functional no-op: it changes internal handling but the exposed API is unchanged. The main changes are:

  • Merging of events and clients into queueEntry (see the sketch after this list). The memory queue previously stored events as publisher.Event and their metadata in clientState. These were kept in separate arrays with shared indices and propagated in various ways. The new code introduces queueEntry as the underlying buffer type, containing the event (an interface{} which in beats has underlying type *publisher.Event) and its metadata. This change had to be propagated through a number of internal helpers like memqueue.ringBuffer.
  • Removal of unused fields and helpers. During the conversion I came across various fields that are initialized / propagated but unused, e.g. the ackState in memqueue.batch. There were also some fields that were duplicates of others -- in eventloop.go the event loops had pointers to their associated broker and their own unaltered copies of several of its fields. I removed these when I could.
  • Simplifying / localizing event loop state. Event loop API endpoints were previously enabled / disabled via state variables in the structs (channels accepting pushRequest, getRequest etc) which were selectively nulled-out on appropriate state changes (e.g. if the queue is full after a pushRequest then the push channel is set to nil to block additional requests). This got quite hard to follow during the changes, since the fields were mutated throughout the code and their semantics were undocumented. I moved the channels into local variables in the run loop, initializing them immediately before their use in select. This keeps the logic in one place, and it's clearer now what specific circumstances can enable / disable each channel.
  • Renaming / documenting many fields and objects. A lot of the structures had API control channels and various auxiliary data but sketchy or absent documentation of their semantics. I tried to give things names that are more explicit about their function, and to describe when and how they're used.
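To make the first bullet concrete, here is a rough sketch of what the merged buffer entry could look like. Only the event field and its *publisher.Event underlying type come from the description above; the remaining names are illustrative guesses, not the actual type definitions.

// Illustrative sketch only: names other than queueEntry and event are hypothetical.
type producerState struct {
	// per-producer ACK bookkeeping, formerly tracked via clientState
}

type queueEntry struct {
	// The event itself. In beats the underlying type is *publisher.Event;
	// the shipper can store its own event representation instead.
	event interface{}

	// Metadata that previously lived in a parallel array sharing indices
	// with the event buffer, so ACKs can be routed back to the producer.
	producer   *producerState
	producerID uint64
}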

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Performance tests

I ran some extra benchmarks using libbeat/publisher/pipeline/stress. The main configurations I tested were buffered vs direct event loop. The tests were 1-minute samples sending a continuous stream of events through the queue. They were run with:

go test -v -run TestPipeline/gen=wait_ack/pipeline=default_mem/out=default -duration 1m -memprofile memprofile-wait_ack-default.old.pprof

go test -v -run TestPipeline/gen=wait_ack/pipeline=direct_mem/out=default -duration 1m -memprofile memprofile-wait_ack-direct.old.pprof

and similarly after switching to the PR branch. The top-level results were:

Event throughput (per minute):

  • buffered old: 69424423
  • buffered new: 68942490 (-0.7%)
  • direct old: 30746313
  • direct new: 32339700 (+5.2%)

Total allocations:

  • buffered old: 33652 MB
  • buffered new: 40042 MB
  • direct old: 13266 MB
  • direct new: 18815 MB

In-use allocations:

  • buffered old: 5125 kB, 10165 objects
  • buffered new: 8231 kB, 29488 objects
  • direct old: 3593 kB, 5500 objects
  • direct new: 4101 kB, 37238 objects

As expected given the nature of the change, total allocations are noticeably higher, since much of the complexity of the publisher.Event handling existed to avoid allocating temporary values. However, throughput is fine, and while in-use memory is up it is still reasonable (8 MB to send 69 million events).

I also tested these configurations using the blocking output test (out=blocking in the test name), which adds a min_wait to the configuration. Remarkably, the old and new queues had exactly the same throughput (though on the order of 30K events rather than 70M). Total allocations were up in the new version, but in-use memory was slightly down.

Overall these results look to me like we are paying slightly for this simplification, but nothing that seems worrying. I'm expecting to do more pipeline performance work soon and this cleanup gives a good baseline for tracking down our real bottlenecks.
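For reference, the heap profiles written by the commands above can be diffed with pprof's -base flag. The .new.pprof filename here is hypothetical, since only the .old.pprof names appear in the commands above:

go tool pprof -top -base memprofile-wait_ack-default.old.pprof memprofile-wait_ack-default.new.pprof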

Related issues

  • #31307: Make the memory queue work with types other than publisher.Event

@faec added the labels enhancement and Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team) on Apr 19, 2022
@faec requested a review from a team as a code owner on April 19, 2022 17:34
@faec self-assigned this on Apr 19, 2022
@faec requested reviews from cmacknz and kvch and removed the request for a team on April 19, 2022 17:34
@elasticmachine (Collaborator) commented:

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

// there might also be free space before region A. In that
// case new events must be inserted in region B, but the
// queue isn't at capacity.
avail = len(b.entries) - b.regA.index - b.regA.size
A reviewer (Member) commented on the code above:

What is the follow up from your comment about this possibly not being right? Is it too much work to fix, or not worth fixing?

@faec (author) replied:

If I understand the intention correctly then it's an easy fix, just removing b.regA.index from the right side. I've been leaving it for last because I don't want to intentionally change the functional logic until everything is at full parity with the old version (which right now is just pending on the stress tests).

@faec (author) replied:

Hah -- this turned out to be the cause of the test failure 😅 It's the same computation as the old version, but before, it was only made on a specific state transition; now that it's checked on every loop iteration, it ended up blocking the queue. Switching to the correct calculation here makes the tests pass locally, so fingers crossed on the CI now.
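Based on this exchange, the fix amounts to dropping b.regA.index from the free-space calculation. A sketch of the corrected line (not the exact committed diff):

// With only region A occupied, all remaining capacity is free space,
// whether it sits after the end of region A or before its start
// (where region B grows).
avail = len(b.entries) - b.regA.size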


for {
var pushChan chan pushRequest
A reviewer (Member) commented on the code above:

Is this duplicated entirely between here and newDirectEventLoop? It is hard for me to spot if there is some subtle difference between the two just scrolling up and down.

@faec (author) replied on Apr 28, 2022:

It's not quite duplicated -- it's the same logical sequence, but because the containing struct is different the conditions don't match (e.g. here we check whether the queue is full by comparing eventCount to maxEvents, but in the version above, directEventLoop has no field analogous to eventCount, so it uses a different test).

I suspect that having these two almost-identical objects with such completely divergent implementations just for a special case doesn't help performance enough to justify the complexity, and if I get a chance I'd like to merge these into a single helper, but that seemed out of scope for now :-)
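For readers unfamiliar with the pattern under discussion, here is a minimal, self-contained sketch of the "recompute the channels each iteration" approach described in the PR summary. pushRequest, getRequest, eventCount, and maxEvents come from the discussion above; everything else (the package, struct layout, and helper logic) is purely illustrative and not the actual queue code.

package sketch

type pushRequest struct{ event interface{} }
type getRequest struct{ entryCount int }

// eventLoop is a stand-in for the buffered event loop; the real broker
// carries much more state (ACK handling, batching, flush timers, etc.).
type eventLoop struct {
	maxEvents  int
	eventCount int
	pushChan   chan pushRequest
	getChan    chan getRequest
	pending    []interface{}
}

func (l *eventLoop) run(done <-chan struct{}) {
	for {
		// Recompute the active channels at the top of each iteration.
		// A nil channel never fires in a select, so leaving pushChan nil
		// while the queue is full blocks producers until space frees up.
		var pushChan chan pushRequest
		if l.eventCount < l.maxEvents {
			pushChan = l.pushChan
		}

		// Only serve consumers when there is something to hand out.
		var getChan chan getRequest
		if l.eventCount > 0 {
			getChan = l.getChan
		}

		select {
		case req := <-pushChan:
			l.pending = append(l.pending, req.event)
			l.eventCount++
		case req := <-getChan:
			n := req.entryCount
			if n > l.eventCount {
				n = l.eventCount
			}
			l.pending = l.pending[n:]
			l.eventCount -= n
		case <-done:
			return
		}
	}
}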

@@ -99,80 +83,73 @@ func (l *directEventLoop) run() {
)

for {
var pushChan chan pushRequest
A reviewer (Member) commented on the code above:

This is significantly more obvious than what was going on before. Nice!

@cmacknz (Member) commented on Apr 28, 2022:

Just looking at the diff I can't spot any major issues. I'll try to check this out and build more of an understanding of what this is doing later (after your refactoring, which is much easier to follow).

Also I had never seen the libbeat-stress-tests pipeline stage triggered before. I'll have to look at what it does.

@faec (author) commented on Apr 28, 2022:

> Also I had never seen the libbeat-stress-tests pipeline stage triggered before. I'll have to look at what it does.

Yeah, I've never triggered that one before either, but it's related to sending load through the pipeline, so it's almost certainly a real failure. Right now I'm debugging it, expecting that I missed a race condition somewhere.

@cmacknz (Member) commented on Apr 28, 2022:

LGTM, give the rest of the team some time to look at it before merging though.

@kvch (Contributor) left a comment:

Thank you, awesome as always!

@faec merged commit d6aeef5 into elastic:main on Apr 29, 2022
@faec deleted the memqueue-cleanup-2 branch on April 29, 2022 21:48
kush-elastic pushed a commit to kush-elastic/beats that referenced this pull request May 2, 2022
Labels
enhancement, Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make the memory queue work with types other than publisher.Event
4 participants