
Improve StreamBuf append #35928

Merged 4 commits into elastic:main on Jul 12, 2023
Conversation

@jeniawhite (Contributor) commented Jun 27, 2023

What does this PR do?

Looking at the code, I saw that streambuf has a method called doAppend. This method takes bytes and appends them to the bytes already held by the streambuf instance. I noticed that the appending was done with the built-in append function.
To gain performance I initially wanted to replace the whole bytes slice held by streambuf with a bytes.Buffer, but since we manipulate the cursor I decided to skip that and changed only the append logic to use bytes.Buffer. A minimal sketch of the idea is shown below.
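
For illustration only, here is a self-contained sketch of the two approaches (this is not the actual streambuf code; the type and method names are made up, and the real streambuf also tracks a read cursor and error state):

package main

import (
	"bytes"
	"fmt"
)

// streamBuf is a stand-in for libbeat's streambuf; only the backing slice is modeled here.
type streamBuf struct {
	data []byte
}

// appendWithBuiltin mirrors the old doAppend: grow the slice with the built-in append.
func (b *streamBuf) appendWithBuiltin(p []byte) {
	b.data = append(b.data, p...)
}

// appendWithBuffer mirrors the new doAppend: wrap the existing bytes in a
// bytes.Buffer, let Buffer.Write handle the growth, and keep the resulting slice.
func (b *streamBuf) appendWithBuffer(p []byte) {
	buf := bytes.NewBuffer(b.data)
	buf.Write(p)
	b.data = buf.Bytes()
}

func main() {
	oldStyle, newStyle := &streamBuf{}, &streamBuf{}
	for _, chunk := range []string{"hello ", "world"} {
		oldStyle.appendWithBuiltin([]byte(chunk))
		newStyle.appendWithBuffer([]byte(chunk))
	}
	fmt.Println(string(oldStyle.data), "|", string(newStyle.data)) // hello world | hello world
}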

Running the pre-existing benchmarks, I saw a performance improvement for long lines (from my understanding this is a common use case for clients of filebeat).

This is the old result:

Evgbs-MacBook-Pro:readfile evgb$ go test -benchmem -bench=.
goos: darwin
goarch: arm64
pkg: github.com/elastic/beats/v7/libbeat/reader/readfile
BenchmarkEncoderReader/long_lines-10         	    1230	    878272 ns/op	   1000100 processed_bytes	       100.0 processed_lines	 7849648 B/op	    1289 allocs/op
PASS
ok  	github.com/elastic/beats/v7/libbeat/reader/readfile	1.876s

This is the new result with the changes:

Evgbs-MacBook-Pro:readfile evgb$ go test -benchmem -bench=.
goos: darwin
goarch: arm64
pkg: github.com/elastic/beats/v7/libbeat/reader/readfile
BenchmarkEncoderReader/long_lines-10         	    1675	    680225 ns/op	   1000100 processed_bytes	       100.0 processed_lines	 5095600 B/op	     770 allocs/op
PASS
ok  	github.com/elastic/beats/v7/libbeat/reader/readfile	1.865s

Looking at the old profiling: (profiling screenshot attached to the PR)

Looking at the new profiling: (profiling screenshot attached to the PR)

Why is it important?

This is a component at the heart of libbeat that manages buffers and is used by multiple products.
It typically sits on the hottest paths (every line in filebeat is processed by this code via the line reader).

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@jeniawhite added the enhancement and backport-skip (skip notification from the automated backport with mergify) labels on Jun 27, 2023
@jeniawhite requested a review from a team as a code owner on June 27, 2023 15:24
@botelastic (bot) added the needs_team (indicates that the issue/PR needs a Team:* label) label on Jun 27, 2023
@elasticmachine (Collaborator) commented Jun 27, 2023

💚 Build Succeeded



Build stats

  • Start Time: 2023-07-12T20:11:20.044+0000

  • Duration: 69 min 24 sec

Test stats 🧪

  • Failed: 0
  • Passed: 27511
  • Skipped: 2029
  • Total: 29540

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@pierrehilbert added the Team:Elastic-Agent (label for the Agent team) label on Jul 10, 2023
@elasticmachine (Collaborator) commented:

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@botelastic (bot) removed the needs_team label on Jul 10, 2023
@fearful-symmetry (Contributor) commented:

Interestingly, the benchmark results across all the tests seem to vary.

Main:

goos: linux
goarch: amd64
pkg: github.com/elastic/beats/v7/libbeat/reader/readfile
cpu: Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
BenchmarkEncoderReader/buffer-sized_lines-12                6590            187860 ns/op            102100 processed_bytes             100.0 processed_lines      232970 B/op        317 allocs/op
BenchmarkEncoderReader/short_lines-12                      23444             57771 ns/op              1100 processed_bytes             100.0 processed_lines        9496 B/op        219 allocs/op
BenchmarkEncoderReader/long_lines-12                         217           5524922 ns/op           1000100 processed_bytes             100.0 processed_lines     7849637 B/op       1289 allocs/op
BenchmarkEncoderReader/skip_lines-12                        3865            450602 ns/op                 0 processed_bytes               0 processed_lines        220280 B/op        417 allocs/op
PASS
ok      github.com/elastic/beats/v7/libbeat/reader/readfile     6.898s

This PR:

goos: linux
goarch: amd64
pkg: github.com/elastic/beats/v7/libbeat/reader/readfile
cpu: Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
BenchmarkEncoderReader/buffer-sized_lines-12                5491            254946 ns/op            102100 processed_bytes             100.0 processed_lines      232969 B/op        317 allocs/op
BenchmarkEncoderReader/short_lines-12                      13902             76058 ns/op              1100 processed_bytes             100.0 processed_lines        9496 B/op        219 allocs/op
BenchmarkEncoderReader/long_lines-12                         366           3990415 ns/op           1000100 processed_bytes             100.0 processed_lines     5095585 B/op        770 allocs/op
BenchmarkEncoderReader/skip_lines-12                        3891            309291 ns/op                 0 processed_bytes               0 processed_lines        219257 B/op        417 allocs/op
PASS
ok      github.com/elastic/beats/v7/libbeat/reader/readfile     9.397s

I tested on another system and got the same results, with short_lines and buffer-sized_lines coming out slightly slower. pprof seems to show us spending slightly more time in bytes.growSlice as a percentage.

Also did some basic instrumenting, and the long_lines test results in significantly more calls to bytes.NewBuffer() with much larger initial buffer sizes, which is interesting. I guess this is a case of the bytes buffer only showing a performance advantage for larger buffer sizes?

I suppose this comes down to what use cases we hit most often in filebeat. Going to ask around.

@andrewkroh (Member) commented:

To help compare:

$ go run golang.org/x/perf/cmd/benchstat@latest main-5.txt 35928-5.txt 
goos: darwin
goarch: arm64
pkg: github.com/elastic/beats/v7/libbeat/reader/readfile
                                    │  main-5.txt  │             35928-5.txt             │
                                    │    sec/op    │    sec/op     vs base               │
EncoderReader/buffer-sized_lines-10   51.20µ ± ∞ ¹   53.35µ ± ∞ ¹   +4.20% (p=0.008 n=5)
EncoderReader/short_lines-10          13.99µ ± ∞ ¹   13.82µ ± ∞ ¹   -1.22% (p=0.008 n=5)
EncoderReader/long_lines-10           894.1µ ± ∞ ¹   683.3µ ± ∞ ¹  -23.58% (p=0.008 n=5)
EncoderReader/skip_lines-10           79.05µ ± ∞ ¹   82.70µ ± ∞ ¹   +4.63% (p=0.008 n=5)
geomean                               84.34µ         80.33µ         -4.75%
¹ need >= 6 samples for confidence interval at level 0.95

                                    │  main-5.txt   │              35928-5.txt               │
                                    │     B/op      │     B/op       vs base                 │
EncoderReader/buffer-sized_lines-10   227.5Ki ± ∞ ¹   227.5Ki ± ∞ ¹        ~ (p=1.000 n=5) ²
EncoderReader/short_lines-10          9.273Ki ± ∞ ¹   9.273Ki ± ∞ ¹        ~ (p=1.000 n=5) ²
EncoderReader/long_lines-10           7.486Mi ± ∞ ¹   4.860Mi ± ∞ ¹  -35.09% (p=0.008 n=5)
EncoderReader/skip_lines-10           215.1Ki ± ∞ ¹   214.1Ki ± ∞ ¹   -0.46% (p=0.008 n=5)
geomean                               242.9Ki         217.7Ki        -10.34%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

                                    │  main-5.txt  │             35928-5.txt              │
                                    │  allocs/op   │  allocs/op   vs base                 │
EncoderReader/buffer-sized_lines-10    317.0 ± ∞ ¹   317.0 ± ∞ ¹        ~ (p=1.000 n=5) ²
EncoderReader/short_lines-10           219.0 ± ∞ ¹   219.0 ± ∞ ¹        ~ (p=1.000 n=5) ²
EncoderReader/long_lines-10           1289.0 ± ∞ ¹   770.0 ± ∞ ¹  -40.26% (p=0.008 n=5)
EncoderReader/skip_lines-10            417.0 ± ∞ ¹   417.0 ± ∞ ¹        ~ (p=1.000 n=5) ²
geomean                                439.5         386.4        -12.09%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

@fearful-symmetry (Contributor) commented:

@andrewkroh's comparisons kinda confirm what I was seeing just squinting at the different benchmarks: the delta in long-lines performance is pretty large compared to the sporadic slower results.

@cmacknz (Member) commented Jul 11, 2023

This is my attempt to explain what is happening and why bytes.Buffer is faster.

that the delta in long-lines performance is pretty large compared to the sporadic slower results.

I think the reason for this is that the slice here started with 0 allocated bytes, and append starts from that size and can allocate each time it is called. append uses the basic algorithm of doubling the allocation each time there isn't enough space. So if you start at 0 bytes, you allocate the slice again at 1, 2, 4, 8, 16, and so on. For large slices this is a lot of pointless allocations if you know how large you need to be in the end.
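
A quick way to see that growth pattern (a standalone sketch, not code from the PR):

package main

import "fmt"

func main() {
	var s []byte
	lastCap := -1
	// Append one byte at a time and print each time the backing array is
	// reallocated; capacities go 1, 2, 4, 8, ... and then grow more slowly.
	for i := 0; i < 100000; i++ {
		s = append(s, byte(i))
		if cap(s) != lastCap {
			fmt.Printf("len=%d cap=%d\n", len(s), cap(s))
			lastCap = cap(s)
		}
	}
}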

Here is the growSlice implementation used by append in Go 1.19.10 for reference; the key part is doublecap := newcap + newcap. There is some more complexity to this (for large enough slices the growth rate slows to 1.25x, for example), but this is the basic idea.

https://github.com/golang/go/blob/7fe60b5df764f5a16a2c40e4412b5ed60f709192/src/runtime/slice.go#L200-L223

	newcap := old.cap
	doublecap := newcap + newcap
	if cap > doublecap {
		newcap = cap
	} else {
		// ... (the rest of growslice, including the slower 1.25x growth for
		// large slices, is elided; see the linked source)
	}

To contrast, here is the bytes.Buffer implementation:

NewBuffer does essentially nothing: https://github.com/golang/go/blob/6a063b01b0b0effa787c5aed90b585b409fe4688/src/bytes/buffer.go#L472

func NewBuffer(buf []byte) *Buffer { return &Buffer{buf: buf} }

The Write implementation is where the allocation happens, and is essentially just a more optimized allocation paired with a call to copy: https://github.com/golang/go/blob/6a063b01b0b0effa787c5aed90b585b409fe4688/src/bytes/buffer.go#L175-L182

func (b *Buffer) Write(p []byte) (n int, err error) {
	b.lastRead = opInvalid
	m, ok := b.tryGrowByReslice(len(p))
	if !ok {
		m = b.grow(len(p))
	}
	return copy(b.buf[m:], p), nil
}

The grow implementation in https://github.com/golang/go/blob/6a063b01b0b0effa787c5aed90b585b409fe4688/src/bytes/buffer.go#L126 has a few optimizations.

For small buffers, bytes.Buffer improves slightly because of this part of grow:

https://github.com/golang/go/blob/6a063b01b0b0effa787c5aed90b585b409fe4688/src/bytes/buffer.go#L136-L138

// smallBufferSize is an initial allocation minimal capacity.
const smallBufferSize = 64

//...
	if b.buf == nil && n <= smallBufferSize {
		b.buf = make([]byte, n, smallBufferSize)
		return 0
	}

This trades space for speed by over-allocating, assuming we'll either grow or rewrite the buffer enough times that it will be worth it. It skips all of the intermediate allocations from 2 to 64 bytes by doing a single allocation up front.
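
A small demonstration of that effect (assuming the current smallBufferSize of 64 bytes):

package main

import (
	"bytes"
	"fmt"
)

func main() {
	var b bytes.Buffer
	b.Write([]byte("hi"))         // first write asks for only 2 bytes...
	fmt.Println(b.Len(), b.Cap()) // ...but capacity jumps straight to 64
	b.Write([]byte(" there"))     // fits within the existing allocation
	fmt.Println(b.Len(), b.Cap()) // 8 64: no further allocation
}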

We eventually end up in growSlice, which does a trick with an append to a nil slice. This is actually very similar to just calling append, which I assume is why it shows up as the hot path in the new profiles as well. I have no idea what the detailed performance comparison of this method vs. pure append would be.

https://github.com/golang/go/blob/6a063b01b0b0effa787c5aed90b585b409fe4688/src/bytes/buffer.go#L229-L252

	// TODO(http://golang.org/issue/51462): We should rely on the append-make
	// pattern so that the compiler can call runtime.growslice. For example:
	//	return append(b, make([]byte, n)...)
	// This avoids unnecessary zero-ing of the first len(b) bytes of the
	// allocated slice, but this pattern causes b to escape onto the heap.
	//
	// Instead use the append-make pattern with a nil slice to ensure that
	// we allocate buffers rounded up to the closest size class.
	c := len(b) + n // ensure enough space for n elements
	if c < 2*cap(b) {
		// The growth rate has historically always been 2x. In the future,
		// we could rely purely on append to determine the growth rate.
		c = 2 * cap(b)
	}
	b2 := append([]byte(nil), make([]byte, c)...)
	copy(b2, b)
	return b2[:len(b)]

The net result of all this is that bytes.NewBuffer goes to much greater lengths to avoid allocating if it doesn't have to.

I almost wonder if the initial 64 byte allocation is doing most of the work for us here, but there are a lot of places in bytes.Buffer that try really hard not to allocate if they don't have to, and when they do allocate they try to allocate in an optimal way.
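
One rough way to poke at that outside the beats benchmarks (a sketch only, not the PR's benchmark; the chunk size and iteration counts are arbitrary) is to compare allocation counts for building a large buffer both ways:

package main

import (
	"bytes"
	"fmt"
	"testing"
)

func main() {
	chunk := make([]byte, 1024)

	// Grow a nil slice to ~1 MB, 1 KB at a time, with the built-in append.
	appendAllocs := testing.AllocsPerRun(100, func() {
		var s []byte
		for i := 0; i < 1024; i++ {
			s = append(s, chunk...)
		}
	})

	// Do the same through bytes.Buffer.
	bufferAllocs := testing.AllocsPerRun(100, func() {
		var b bytes.Buffer
		for i := 0; i < 1024; i++ {
			b.Write(chunk)
		}
	})

	fmt.Printf("append: %.0f allocs/op, bytes.Buffer: %.0f allocs/op\n", appendAllocs, bufferAllocs)
}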

@andrewkroh (Member) commented:

Great analysis, @cmacknz.

It appeared to me that the LineReader implementation could be built with bytes.Buffer alone instead of StreamBuf. Out of curiosity I took a quick pass at swapping it over. Here's the patch: beats-readfile-stdlib.patch

There appears to be a lot of opportunity for optimization in the LineReader if someone wanted to dive in; a generic sketch of the idea follows after the benchstat comparison below.

% benchstat main-5.txt in-out-bytes-buffer.txt                                      
goos: darwin
goarch: arm64
pkg: github.com/elastic/beats/v7/libbeat/reader/readfile
                                    │  main-5.txt  │       in-out-bytes-buffer.txt       │
                                    │    sec/op    │    sec/op     vs base               │
EncoderReader/buffer-sized_lines-10   51.20µ ± ∞ ¹   36.66µ ± ∞ ¹  -28.40% (p=0.008 n=5)
EncoderReader/short_lines-10          13.99µ ± ∞ ¹   13.34µ ± ∞ ¹   -4.63% (p=0.008 n=5)
EncoderReader/long_lines-10           894.1µ ± ∞ ¹   321.0µ ± ∞ ¹  -64.10% (p=0.008 n=5)
EncoderReader/skip_lines-10           79.05µ ± ∞ ¹   43.12µ ± ∞ ¹  -45.45% (p=0.008 n=5)
geomean                               84.34µ         51.00µ        -39.53%
¹ need >= 6 samples for confidence interval at level 0.95

                                    │   main-5.txt    │       in-out-bytes-buffer.txt        │
                                    │      B/op       │     B/op       vs base               │
EncoderReader/buffer-sized_lines-10     227.5Ki ± ∞ ¹   115.9Ki ± ∞ ¹  -49.04% (p=0.008 n=5)
EncoderReader/short_lines-10            9.273Ki ± ∞ ¹   9.195Ki ± ∞ ¹   -0.84% (p=0.008 n=5)
EncoderReader/long_lines-10             7.486Mi ± ∞ ¹   1.044Mi ± ∞ ¹  -86.06% (p=0.008 n=5)
EncoderReader/skip_lines-10           215.118Ki ± ∞ ¹   6.453Ki ± ∞ ¹  -97.00% (p=0.008 n=5)
geomean                                 242.9Ki         52.07Ki        -78.56%
¹ need >= 6 samples for confidence interval at level 0.95

                                    │  main-5.txt  │       in-out-bytes-buffer.txt        │
                                    │  allocs/op   │  allocs/op   vs base                 │
EncoderReader/buffer-sized_lines-10    317.0 ± ∞ ¹   222.0 ± ∞ ¹  -29.97% (p=0.008 n=5)
EncoderReader/short_lines-10           219.0 ± ∞ ¹   219.0 ± ∞ ¹        ~ (p=1.000 n=5) ²
EncoderReader/long_lines-10           1289.0 ± ∞ ¹   227.0 ± ∞ ¹  -82.39% (p=0.008 n=5)
EncoderReader/skip_lines-10            417.0 ± ∞ ¹   219.0 ± ∞ ¹  -47.48% (p=0.008 n=5)
geomean                                439.5         221.7        -49.55%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
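
For context, here is a generic, illustrative sketch of the "line reader on bytes.Buffer alone" direction (this is not the attached patch; the type and method names are made up, and it ignores encodings and a trailing partial line):

package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// lineReader accumulates raw input in a bytes.Buffer and cuts complete,
// newline-terminated lines out of it.
type lineReader struct {
	in  io.Reader
	buf bytes.Buffer
	tmp []byte
}

// Next returns the next newline-terminated line, reading more input as needed.
func (r *lineReader) Next() ([]byte, error) {
	for {
		// If a full line is already buffered, consume and return it.
		if i := bytes.IndexByte(r.buf.Bytes(), '\n'); i >= 0 {
			line := make([]byte, i+1)
			r.buf.Read(line) // consumes the line, including the '\n'
			return line, nil
		}
		// Otherwise pull more bytes from the underlying reader.
		if r.tmp == nil {
			r.tmp = make([]byte, 4096)
		}
		n, err := r.in.Read(r.tmp)
		r.buf.Write(r.tmp[:n])
		if err != nil {
			return nil, err // a trailing line without '\n' is dropped in this sketch
		}
	}
}

func main() {
	r := &lineReader{in: strings.NewReader("first line\nsecond line\n")}
	for {
		line, err := r.Next()
		if err != nil {
			break
		}
		fmt.Printf("%q\n", line)
	}
}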

andrewkroh added a commit to andrewkroh/beats that referenced this pull request Jul 11, 2023
@fearful-symmetry (Contributor) commented:

Yah, never touched this chunk of code, but I'm not surprised there's room for optimization here. Unless we want to prioritize this right now, I feel like merging this so we at least get some performance delta is the way to go?

@jlind23 (Collaborator) commented Jul 12, 2023

Yah, never touched this chunk of code, but I'm not surprised there's room for optimization here. Unless we want to prioritize this right now, I feel like merging this so we at least get some performance delta is the way to go?

I agree with @fearful-symmetry, let's merge it and follow up in another PR afterwards. @fearful-symmetry could you please review/approve this PR then?

@cmacknz (Member) left a review comment:

Yes, this appears to be safe and the microbenchmarks show a very obvious improvement. Let's get it merged.

I am still curious about exactly which piece of the bytes.Buffer implementation is helping us the most here, but I can experiment with that later. From my previous comment there are several paths that could be helping, so this isn't a fluke improvement; there is concrete evidence in the code itself that this approach is better (even if it isn't obvious without looking much deeper).

@cmacknz (Member) commented Jul 12, 2023

@jeniawhite please add a changelog entry to https://github.com/elastic/beats/blob/main/CHANGELOG.next.asciidoc before merging this.

@mergify (bot) commented Jul 12, 2023

This pull request now has conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b evgb-StreamBufAllocs upstream/evgb-StreamBufAllocs
git merge upstream/main
git push upstream evgb-StreamBufAllocs

Review thread on CHANGELOG.next.asciidoc (resolved)
Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>
@cmacknz enabled auto-merge (squash) on July 12, 2023 20:14
@fearful-symmetry (Contributor) commented:

Do we want to deal with the linter errors while we're here?

@cmacknz merged commit 91bbccf into elastic:main on Jul 12, 2023
84 of 86 checks passed
Scholar-Li pushed a commit to Scholar-Li/beats that referenced this pull request Feb 5, 2024
* Improve streambuf append

* Adding changelog comment

* Update CHANGELOG.next.asciidoc

Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>

---------

Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>