
Improve StreamBuf append #35928

Merged 4 commits into elastic:main on Jul 12, 2023
Conversation

@jeniawhite (Contributor) commented Jun 27, 2023

What does this PR do?

Looking at the code, I saw that streambuf has a method called doAppend. This method takes bytes and appends them to the bytes already held by the streambuf instance. I noticed that the appending was done with the built-in append function.
To gain performance I initially wanted to replace the whole bytes slice held by streambuf with a bytes.Buffer, but since we manipulate the cursor I decided to skip that and changed only the append logic to use bytes.Buffer. A minimal sketch of the idea is shown below.
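
For illustration only, here is a self-contained sketch of the two approaches (this is not the actual streambuf code; the type and method names are made up, and the real streambuf also tracks a read cursor and error state):

package main

import (
	"bytes"
	"fmt"
)

// streamBuf is a stand-in for libbeat's streambuf; only the backing slice is modeled here.
type streamBuf struct {
	data []byte
}

// appendWithBuiltin mirrors the old doAppend: grow the slice with the built-in append.
func (b *streamBuf) appendWithBuiltin(p []byte) {
	b.data = append(b.data, p...)
}

// appendWithBuffer mirrors the new doAppend: wrap the existing bytes in a
// bytes.Buffer, let Buffer.Write handle the growth, and keep the resulting slice.
func (b *streamBuf) appendWithBuffer(p []byte) {
	buf := bytes.NewBuffer(b.data)
	buf.Write(p)
	b.data = buf.Bytes()
}

func main() {
	oldStyle, newStyle := &streamBuf{}, &streamBuf{}
	for _, chunk := range []string{"hello ", "world"} {
		oldStyle.appendWithBuiltin([]byte(chunk))
		newStyle.appendWithBuffer([]byte(chunk))
	}
	fmt.Println(string(oldStyle.data), "|", string(newStyle.data)) // hello world | hello world
}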

Running the pre-existing benchmarks, I saw a performance improvement for long lines (from my understanding this is a common use case for clients of filebeat).

This is the old result:

Evgbs-MacBook-Pro:readfile evgb$ go test -benchmem -bench=.
goos: darwin
goarch: arm64
pkg: github.com/elastic/beats/v7/libbeat/reader/readfile
BenchmarkEncoderReader/long_lines-10         	    1230	    878272 ns/op	   1000100 processed_bytes	       100.0 processed_lines	 7849648 B/op	    1289 allocs/op
PASS
ok  	github.com/elastic/beats/v7/libbeat/reader/readfile	1.876s

This is the new result with the changes:

Evgbs-MacBook-Pro:readfile evgb$ go test -benchmem -bench=.
goos: darwin
goarch: arm64
pkg: github.com/elastic/beats/v7/libbeat/reader/readfile
BenchmarkEncoderReader/long_lines-10         	    1675	    680225 ns/op	   1000100 processed_bytes	       100.0 processed_lines	 5095600 B/op	     770 allocs/op
PASS
ok  	github.com/elastic/beats/v7/libbeat/reader/readfile	1.865s

Looking at the old profiling: (profiling screenshot attached to the PR)

Looking at the new profiling: (profiling screenshot attached to the PR)

Why is it important?

This is a component at the heart of libbeat that manages buffers and is used by multiple products.
It typically sits on the hottest paths (every line in filebeat is processed by this code via the line reader).

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@jeniawhite added the enhancement and backport-skip (skip notification from the automated backport with mergify) labels on Jun 27, 2023
@jeniawhite requested a review from a team as a code owner on June 27, 2023 15:24
@botelastic (bot) added the needs_team (indicates that the issue/PR needs a Team:* label) label on Jun 27, 2023
@elasticmachine (Collaborator) commented Jun 27, 2023

💚 Build Succeeded



Build stats

  • Start Time: 2023-07-12T20:11:20.044+0000

  • Duration: 69 min 24 sec

Test stats 🧪

  • Failed: 0
  • Passed: 27511
  • Skipped: 2029
  • Total: 29540

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@pierrehilbert added the Team:Elastic-Agent (label for the Agent team) label on Jul 10, 2023
@elasticmachine (Collaborator) commented:

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@botelastic (bot) removed the needs_team label on Jul 10, 2023
@fearful-symmetry (Contributor) commented:

Interestingly, the benchmark results across all the tests seem to vary.

Main:

goos: linux
goarch: amd64
pkg: github.com/elastic/beats/v7/libbeat/reader/readfile
cpu: Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
BenchmarkEncoderReader/buffer-sized_lines-12                6590            187860 ns/op            102100 processed_bytes             100.0 processed_lines      232970 B/op        317 allocs/op
BenchmarkEncoderReader/short_lines-12                      23444             57771 ns/op              1100 processed_bytes             100.0 processed_lines        9496 B/op        219 allocs/op
BenchmarkEncoderReader/long_lines-12                         217           5524922 ns/op           1000100 processed_bytes             100.0 processed_lines     7849637 B/op       1289 allocs/op
BenchmarkEncoderReader/skip_lines-12                        3865            450602 ns/op                 0 processed_bytes               0 processed_lines        220280 B/op        417 allocs/op
PASS
ok      github.com/elastic/beats/v7/libbeat/reader/readfile     6.898s

This PR:

goos: linux
goarch: amd64
pkg: github.com/elastic/beats/v7/libbeat/reader/readfile
cpu: Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
BenchmarkEncoderReader/buffer-sized_lines-12                5491            254946 ns/op            102100 processed_bytes             100.0 processed_lines      232969 B/op        317 allocs/op
BenchmarkEncoderReader/short_lines-12                      13902             76058 ns/op              1100 processed_bytes             100.0 processed_lines        9496 B/op        219 allocs/op
BenchmarkEncoderReader/long_lines-12                         366           3990415 ns/op           1000100 processed_bytes             100.0 processed_lines     5095585 B/op        770 allocs/op
BenchmarkEncoderReader/skip_lines-12                        3891            309291 ns/op                 0 processed_bytes               0 processed_lines        219257 B/op        417 allocs/op
PASS
ok      github.com/elastic/beats/v7/libbeat/reader/readfile     9.397s

I tested on another system and got the same results, with short_lines and buffer-sized_lines coming out slightly slower. pprof seems to show us spending slightly more time in bytes.growSlice as a percentage.

Also did some basic instrumenting, and the long_lines test results in significantly more calls to bytes.NewBuffer() with much larger initial buffer sizes, which is interesting. I guess this is a case of the bytes buffer only showing a performance advantage for larger buffer sizes?

I suppose this comes down to what use cases we hit most often in filebeat. Going to ask around.

@andrewkroh (Member) commented:

To help compare:

$ go run golang.org/x/perf/cmd/benchstat@latest main-5.txt 35928-5.txt 
goos: darwin
goarch: arm64
pkg: github.com/elastic/beats/v7/libbeat/reader/readfile
                                    │  main-5.txt  │             35928-5.txt             │
                                    │    sec/op    │    sec/op     vs base               │
EncoderReader/buffer-sized_lines-10   51.20µ ± ∞ ¹   53.35µ ± ∞ ¹   +4.20% (p=0.008 n=5)
EncoderReader/short_lines-10          13.99µ ± ∞ ¹   13.82µ ± ∞ ¹   -1.22% (p=0.008 n=5)
EncoderReader/long_lines-10           894.1µ ± ∞ ¹   683.3µ ± ∞ ¹  -23.58% (p=0.008 n=5)
EncoderReader/skip_lines-10           79.05µ ± ∞ ¹   82.70µ ± ∞ ¹   +4.63% (p=0.008 n=5)
geomean                               84.34µ         80.33µ         -4.75%
¹ need >= 6 samples for confidence interval at level 0.95

                                    │  main-5.txt   │              35928-5.txt               │
                                    │     B/op      │     B/op       vs base                 │
EncoderReader/buffer-sized_lines-10   227.5Ki ± ∞ ¹   227.5Ki ± ∞ ¹        ~ (p=1.000 n=5) ²
EncoderReader/short_lines-10          9.273Ki ± ∞ ¹   9.273Ki ± ∞ ¹        ~ (p=1.000 n=5) ²
EncoderReader/long_lines-10           7.486Mi ± ∞ ¹   4.860Mi ± ∞ ¹  -35.09% (p=0.008 n=5)
EncoderReader/skip_lines-10           215.1Ki ± ∞ ¹   214.1Ki ± ∞ ¹   -0.46% (p=0.008 n=5)
geomean                               242.9Ki         217.7Ki        -10.34%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

                                    │  main-5.txt  │             35928-5.txt              │
                                    │  allocs/op   │  allocs/op   vs base                 │
EncoderReader/buffer-sized_lines-10    317.0 ± ∞ ¹   317.0 ± ∞ ¹        ~ (p=1.000 n=5) ²
EncoderReader/short_lines-10           219.0 ± ∞ ¹   219.0 ± ∞ ¹        ~ (p=1.000 n=5) ²
EncoderReader/long_lines-10           1289.0 ± ∞ ¹   770.0 ± ∞ ¹  -40.26% (p=0.008 n=5)
EncoderReader/skip_lines-10            417.0 ± ∞ ¹   417.0 ± ∞ ¹        ~ (p=1.000 n=5) ²
geomean                                439.5         386.4        -12.09%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

@fearful-symmetry (Contributor) commented:

@andrewkroh's comparisons kinda confirm what I was seeing just squinting at the different benchmarks: the delta in long-lines performance is pretty large compared to the sporadic slower results.

@cmacknz (Member) commented Jul 11, 2023

This is my attempt to explain what is happening and why bytes.Buffer is faster.

that the delta in long-lines performance is pretty large compared to the sporadic slower results.

I think the reason for this is that the slice here started with 0 allocated bytes, and append starts from that size and can allocate each time it is called. append uses the basic algorithm of doubling the allocation each time there isn't enough space. So if you start at 0 bytes, you allocate the slice again at 1, 2, 4, 8, 16, and so on. For large slices this is a lot of pointless allocations if you know how large you need to be in the end.
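
A quick way to see that growth pattern (a standalone sketch, not code from the PR):

package main

import "fmt"

func main() {
	var s []byte
	lastCap := -1
	// Append one byte at a time and print each time the backing array is
	// reallocated; capacities go 1, 2, 4, 8, ... and then grow more slowly.
	for i := 0; i < 100000; i++ {
		s = append(s, byte(i))
		if cap(s) != lastCap {
			fmt.Printf("len=%d cap=%d\n", len(s), cap(s))
			lastCap = cap(s)
		}
	}
}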

Here is the growSlice implementation used by append in Go 1.19.10 for reference; the key part is doublecap := newcap + newcap. There is some more complexity to this (for large enough slices the growth rate slows to 1.25x, for example), but this is the basic idea.

https://github.com/golang/go/blob/7fe60b5df764f5a16a2c40e4412b5ed60f709192/src/runtime/slice.go#L200-L223

	newcap := old.cap
	doublecap := newcap + newcap
	if cap > doublecap {
		newcap = cap
	} else {
		// ... (the rest of growslice, including the slower 1.25x growth for
		// large slices, is elided; see the linked source)
	}

To contrast, here is the bytes.Buffer implementation:

NewBuffer does essentially nothing: https://github.com/golang/go/blob/6a063b01b0b0effa787c5aed90b585b409fe4688/src/bytes/buffer.go#L472

func NewBuffer(buf []byte) *Buffer { return &Buffer{buf: buf} }

The Write implementation is where the allocation happens, and is essentially just a more optimized allocation paired with a call to copy: https://github.com/golang/go/blob/6a063b01b0b0effa787c5aed90b585b409fe4688/src/bytes/buffer.go#L175-L182

func (b *Buffer) Write(p []byte) (n int, err error) {
	b.lastRead = opInvalid
	m, ok := b.tryGrowByReslice(len(p))
	if !ok {
		m = b.grow(len(p))
	}
	return copy(b.buf[m:], p), nil
}

The grow implementation in https://github.com/golang/go/blob/6a063b01b0b0effa787c5aed90b585b409fe4688/src/bytes/buffer.go#L126 has a few optimizations.

For small buffers, bytes.Buffer improves slightly because of this part of grow:

https://github.com/golang/go/blob/6a063b01b0b0effa787c5aed90b585b409fe4688/src/bytes/buffer.go#L136-L138

// smallBufferSize is an initial allocation minimal capacity.
const smallBufferSize = 64

//...
	if b.buf == nil && n <= smallBufferSize {
		b.buf = make([]byte, n, smallBufferSize)
		return 0
	}

This trades space for speed by over-allocating, assuming we'll either grow or rewrite the buffer enough times that it will be worth it. It skips all of the intermediate allocations from 2 to 64 bytes by doing a single allocation up front.
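
A small demonstration of that effect (assuming the current smallBufferSize of 64 bytes):

package main

import (
	"bytes"
	"fmt"
)

func main() {
	var b bytes.Buffer
	b.Write([]byte("hi"))         // first write asks for only 2 bytes...
	fmt.Println(b.Len(), b.Cap()) // ...but capacity jumps straight to 64
	b.Write([]byte(" there"))     // fits within the existing allocation
	fmt.Println(b.Len(), b.Cap()) // 8 64: no further allocation
}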

We eventually end up in growSlice, which does a trick with an append to a nil slice. This is actually very similar to just calling append, which I assume is why it shows up as the hot path in the new profiles as well. I have no idea what the detailed performance comparison of this method vs. pure append would be.

https://github.com/golang/go/blob/6a063b01b0b0effa787c5aed90b585b409fe4688/src/bytes/buffer.go#L229-L252

	// TODO(http://golang.org/issue/51462): We should rely on the append-make
	// pattern so that the compiler can call runtime.growslice. For example:
	//	return append(b, make([]byte, n)...)
	// This avoids unnecessary zero-ing of the first len(b) bytes of the
	// allocated slice, but this pattern causes b to escape onto the heap.
	//
	// Instead use the append-make pattern with a nil slice to ensure that
	// we allocate buffers rounded up to the closest size class.
	c := len(b) + n // ensure enough space for n elements
	if c < 2*cap(b) {
		// The growth rate has historically always been 2x. In the future,
		// we could rely purely on append to determine the growth rate.
		c = 2 * cap(b)
	}
	b2 := append([]byte(nil), make([]byte, c)...)
	copy(b2, b)
	return b2[:len(b)]

The net result of all this is that bytes.NewBuffer goes to much greater lengths to avoid allocating if it doesn't have to.

I almost wonder if the initial 64 byte allocation is doing most of the work for us here, but there are a lot of places in bytes.Buffer that try really hard not to allocate if they don't have to, and when they do allocate they try to allocate in an optimal way.
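
One rough way to poke at that outside the beats benchmarks (a sketch only, not the PR's benchmark; the chunk size and iteration counts are arbitrary) is to compare allocation counts for building a large buffer both ways:

package main

import (
	"bytes"
	"fmt"
	"testing"
)

func main() {
	chunk := make([]byte, 1024)

	// Grow a nil slice to ~1 MB, 1 KB at a time, with the built-in append.
	appendAllocs := testing.AllocsPerRun(100, func() {
		var s []byte
		for i := 0; i < 1024; i++ {
			s = append(s, chunk...)
		}
	})

	// Do the same through bytes.Buffer.
	bufferAllocs := testing.AllocsPerRun(100, func() {
		var b bytes.Buffer
		for i := 0; i < 1024; i++ {
			b.Write(chunk)
		}
	})

	fmt.Printf("append: %.0f allocs/op, bytes.Buffer: %.0f allocs/op\n", appendAllocs, bufferAllocs)
}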

@andrewkroh (Member) commented:

Great analysis, @cmacknz.

It appeared to me that the LineReader implementation could be built with bytes.Buffer alone instead of StreamBuf. Out of curiosity I took a quick pass at swapping it over. Here's the patch: beats-readfile-stdlib.patch

There appears to be a lot of opportunity for optimization in the LineReader if someone wanted to dive in; a generic sketch of the idea follows after the benchstat comparison below.

% benchstat main-5.txt in-out-bytes-buffer.txt                                      
goos: darwin
goarch: arm64
pkg: github.com/elastic/beats/v7/libbeat/reader/readfile
                                    │  main-5.txt  │       in-out-bytes-buffer.txt       │
                                    │    sec/op    │    sec/op     vs base               │
EncoderReader/buffer-sized_lines-10   51.20µ ± ∞ ¹   36.66µ ± ∞ ¹  -28.40% (p=0.008 n=5)
EncoderReader/short_lines-10          13.99µ ± ∞ ¹   13.34µ ± ∞ ¹   -4.63% (p=0.008 n=5)
EncoderReader/long_lines-10           894.1µ ± ∞ ¹   321.0µ ± ∞ ¹  -64.10% (p=0.008 n=5)
EncoderReader/skip_lines-10           79.05µ ± ∞ ¹   43.12µ ± ∞ ¹  -45.45% (p=0.008 n=5)
geomean                               84.34µ         51.00µ        -39.53%
¹ need >= 6 samples for confidence interval at level 0.95

                                    │   main-5.txt    │       in-out-bytes-buffer.txt        │
                                    │      B/op       │     B/op       vs base               │
EncoderReader/buffer-sized_lines-10     227.5Ki ± ∞ ¹   115.9Ki ± ∞ ¹  -49.04% (p=0.008 n=5)
EncoderReader/short_lines-10            9.273Ki ± ∞ ¹   9.195Ki ± ∞ ¹   -0.84% (p=0.008 n=5)
EncoderReader/long_lines-10             7.486Mi ± ∞ ¹   1.044Mi ± ∞ ¹  -86.06% (p=0.008 n=5)
EncoderReader/skip_lines-10           215.118Ki ± ∞ ¹   6.453Ki ± ∞ ¹  -97.00% (p=0.008 n=5)
geomean                                 242.9Ki         52.07Ki        -78.56%
¹ need >= 6 samples for confidence interval at level 0.95

                                    │  main-5.txt  │       in-out-bytes-buffer.txt        │
                                    │  allocs/op   │  allocs/op   vs base                 │
EncoderReader/buffer-sized_lines-10    317.0 ± ∞ ¹   222.0 ± ∞ ¹  -29.97% (p=0.008 n=5)
EncoderReader/short_lines-10           219.0 ± ∞ ¹   219.0 ± ∞ ¹        ~ (p=1.000 n=5) ²
EncoderReader/long_lines-10           1289.0 ± ∞ ¹   227.0 ± ∞ ¹  -82.39% (p=0.008 n=5)
EncoderReader/skip_lines-10            417.0 ± ∞ ¹   219.0 ± ∞ ¹  -47.48% (p=0.008 n=5)
geomean                                439.5         221.7        -49.55%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
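
For context, here is a generic, illustrative sketch of the "line reader on bytes.Buffer alone" direction (this is not the attached patch; the type and method names are made up, and it ignores encodings and a trailing partial line):

package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// lineReader accumulates raw input in a bytes.Buffer and cuts complete,
// newline-terminated lines out of it.
type lineReader struct {
	in  io.Reader
	buf bytes.Buffer
	tmp []byte
}

// Next returns the next newline-terminated line, reading more input as needed.
func (r *lineReader) Next() ([]byte, error) {
	for {
		// If a full line is already buffered, consume and return it.
		if i := bytes.IndexByte(r.buf.Bytes(), '\n'); i >= 0 {
			line := make([]byte, i+1)
			r.buf.Read(line) // consumes the line, including the '\n'
			return line, nil
		}
		// Otherwise pull more bytes from the underlying reader.
		if r.tmp == nil {
			r.tmp = make([]byte, 4096)
		}
		n, err := r.in.Read(r.tmp)
		r.buf.Write(r.tmp[:n])
		if err != nil {
			return nil, err // a trailing line without '\n' is dropped in this sketch
		}
	}
}

func main() {
	r := &lineReader{in: strings.NewReader("first line\nsecond line\n")}
	for {
		line, err := r.Next()
		if err != nil {
			break
		}
		fmt.Printf("%q\n", line)
	}
}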

andrewkroh added a commit to andrewkroh/beats that referenced this pull request Jul 11, 2023
@fearful-symmetry (Contributor) commented:

Yah, never touched this chunk of code, but I'm not surprised there's room for optimization here. Unless we want to prioritize this right now, I feel like merging this so we at least get some performance delta is the way to go?

@jlind23 (Collaborator) commented Jul 12, 2023

Yah, never touched this chunk of code, but I'm not surprised there's room for optimization here. Unless we want to prioritize this right now, I feel like merging this so we at least get some performance delta is the way to go?

I agree with @fearful-symmetry, let's merge it and follow up in another PR afterwards. @fearful-symmetry could you please review/approve this PR then?

@cmacknz (Member) left a review comment:

Yes, this appears to be safe and the microbenchmarks show a very obvious improvement. Let's get it merged.

I am still curious about exactly which piece of the bytes.Buffer implementation is helping us the most here, but I can experiment with that later. From my previous comment there are several paths that could be helping, so this isn't a fluke improvement; there is concrete evidence in the code itself that this approach is better (even if it isn't obvious without looking much deeper).

@cmacknz (Member) commented Jul 12, 2023

@jeniawhite please add a changelog entry to https://github.com/elastic/beats/blob/main/CHANGELOG.next.asciidoc before merging this.

@mergify (bot) commented Jul 12, 2023

This pull request now has conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b evgb-StreamBufAllocs upstream/evgb-StreamBufAllocs
git merge upstream/main
git push upstream evgb-StreamBufAllocs

Review thread on CHANGELOG.next.asciidoc (resolved)
Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>
@cmacknz enabled auto-merge (squash) on July 12, 2023 20:14
@fearful-symmetry (Contributor) commented:

Do we want to deal with the linter errors while we're here?

@cmacknz merged commit 91bbccf into elastic:main on Jul 12, 2023
84 of 86 checks passed
Scholar-Li pushed a commit to Scholar-Li/beats that referenced this pull request Feb 5, 2024
* Improve streambuf append

* Adding changelog comment

* Update CHANGELOG.next.asciidoc

Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>

---------

Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>