
perf: reduce allocs when creating $_SERVER #540

Merged (22 commits) on Mar 12, 2024

Conversation

dunglas
Owner

@dunglas dunglas commented Feb 1, 2024

The main idea is to allocate the memory needed to populate the $_SERVER environment variables only once, on the Go side.

All the strings.Clone() calls are a workaround for golang/go#65286 (comment); they will no longer be necessary in Go 1.22.

On my machine, this implementation is around 5% faster than the current one (BenchmarkServerSuperGlobal, introduced in #539).

Please note that when running Go benchmarks, the alloc counter doesn't take C allocations into account.

@dunglas dunglas changed the title refactor: prevent C allocs when populating $_SERVER perf: reduce allocs when creating $_SERVER Feb 1, 2024
@dunglas
Owner Author

dunglas commented Feb 1, 2024

I did some more benchmarks using the k6 script we provide, and they are inconclusive (the gap is small and the winner varies between runs...). @withinboredom, would you mind checking whether you notice any improvement or deterioration using your benchmarks?

In the meantime, let's mark this patch as a draft.

@dunglas dunglas marked this pull request as draft February 1, 2024 22:50
@withinboredom
Collaborator

I also looked into optimizing this back when I was digging into the memory leak. Looking at flame graphs, most of our overhead these days is in Go -> C -> Go stack switches, so optimizing anything else will be negligible. Anything that reduces those switches would speed FrankenPHP up by quite a bit.

I was using TinyGo for a while, which can compile the entire stack via LLVM, and it sped things up quite a bit. However, we're now using some cgo features it doesn't support, so it won't compile.

@dunglas
Owner Author

dunglas commented Feb 12, 2024

That's weird, because according to recent benchmarks, cgo now has negligible overhead as long as we batch calls (and we do).

@dunglas dunglas force-pushed the refactor/env-var-creation branch 2 times, most recently from f6aea4d to 9b05728 on February 12, 2024 10:46
@withinboredom
Collaborator

Oh yeah, that's to say the stack switching is still ridiculously fast, not that it's slow. But back then that was literally the slowest part (which is a good thing).

@dunglas dunglas force-pushed the refactor/env-var-creation branch 2 times, most recently from cfc4cb9 to 3dfc3e2 on February 12, 2024 21:55
@dunglas
Owner Author

dunglas commented Feb 12, 2024

With the latest changes, the gains are more significant: 32% fewer allocations and slightly less memory used in Go (probably much less in C but it's hard to measure).

Before:

goos: darwin
goarch: arm64
pkg: github.com/dunglas/frankenphp
BenchmarkServerSuperGlobal
BenchmarkServerSuperGlobal-10    	    6571	    178113 ns/op	   19977 B/op	      93 allocs/op
PASS
ok  	github.com/dunglas/frankenphp	2.435s

After:

goos: darwin
goarch: arm64
pkg: github.com/dunglas/frankenphp
BenchmarkServerSuperGlobal
BenchmarkServerSuperGlobal-10    	    6859	    173809 ns/op	   19955 B/op	      63 allocs/op
PASS
ok  	github.com/dunglas/frankenphp	2.325s

Resolved review threads: cgi.go, frankenphp.c, frankenphp.go, worker.go.
@dunglas dunglas force-pushed the refactor/env-var-creation branch 2 times, most recently from 45a1d3b to aeac7b1 on February 13, 2024 23:32
@dunglas
Owner Author

dunglas commented Feb 15, 2024

K6 benchmark on a Macbook Pro (M1 Pro):

Before:

     execution: local
        script: load-test.js
        output: -

     scenarios: (100.00%) 1 scenario, 100 max VUs, 1m0s max duration (incl. graceful stop):
              * default: 100 looping VUs for 30s (gracefulStop: 30s)


     ✓ is status 200
     ✓ is echoed

     checks.........................: 100.00% ✓ 435614      ✗ 0     
     data_received..................: 910 MB  30 MB/s
     data_sent......................: 1.1 GB  36 MB/s
     http_req_blocked...............: avg=2.1µs    min=0s     med=0s      max=4.59ms  p(90)=1µs     p(95)=1µs    
     http_req_connecting............: avg=1.2µs    min=0s     med=0s      max=3.06ms  p(90)=0s      p(95)=0s     
     http_req_duration..............: avg=13.72ms  min=1.25ms med=13.43ms max=67.75ms p(90)=15.46ms p(95)=16.43ms
       { expected_response:true }...: avg=13.72ms  min=1.25ms med=13.43ms max=67.75ms p(90)=15.46ms p(95)=16.43ms
     http_req_failed................: 0.00%   ✓ 0           ✗ 217807
     http_req_receiving.............: avg=881.68µs min=8µs    med=527µs   max=40.65ms p(90)=1.68ms  p(95)=2.38ms 
     http_req_sending...............: avg=14.61µs  min=8µs    med=12µs    max=3.28ms  p(90)=19µs    p(95)=23µs   
     http_req_tls_handshaking.......: avg=0s       min=0s     med=0s      max=0s      p(90)=0s      p(95)=0s     
     http_req_waiting...............: avg=12.83ms  min=444µs  med=12.67ms max=63.73ms p(90)=14.21ms p(95)=14.82ms
     http_reqs......................: 217807  7256.993086/s
     iteration_duration.............: avg=13.77ms  min=1.36ms med=13.48ms max=67.8ms  p(90)=15.51ms p(95)=16.48ms
     iterations.....................: 217807  7256.993086/s
     vus............................: 100     min=100       max=100 
     vus_max........................: 100     min=100       max=100 


running (0m30.0s), 000/100 VUs, 217807 complete and 0 interrupted iterations
default ✓ [======================================] 100 VUs  30s

After:

     execution: local
        script: load-test.js
        output: -

     scenarios: (100.00%) 1 scenario, 100 max VUs, 1m0s max duration (incl. graceful stop):
              * default: 100 looping VUs for 30s (gracefulStop: 30s)


     ✓ is status 200
     ✓ is echoed

     checks.........................: 100.00% ✓ 437732      ✗ 0     
     data_received..................: 915 MB  31 MB/s
     data_sent......................: 1.1 GB  36 MB/s
     http_req_blocked...............: avg=2.2µs    min=0s     med=0s      max=4.43ms  p(90)=1µs     p(95)=1µs    
     http_req_connecting............: avg=1.21µs   min=0s     med=0s      max=2.97ms  p(90)=0s      p(95)=0s     
     http_req_duration..............: avg=13.65ms  min=1.33ms med=13.38ms max=60.16ms p(90)=15.36ms p(95)=16.25ms
       { expected_response:true }...: avg=13.65ms  min=1.33ms med=13.38ms max=60.16ms p(90)=15.36ms p(95)=16.25ms
     http_req_failed................: 0.00%   ✓ 0           ✗ 218866
     http_req_receiving.............: avg=878.69µs min=8µs    med=522µs   max=23.27ms p(90)=1.63ms  p(95)=2.32ms 
     http_req_sending...............: avg=14.71µs  min=7µs    med=12µs    max=3.78ms  p(90)=19µs    p(95)=23µs   
     http_req_tls_handshaking.......: avg=0s       min=0s     med=0s      max=0s      p(90)=0s      p(95)=0s     
     http_req_waiting...............: avg=12.76ms  min=305µs  med=12.63ms max=57.49ms p(90)=14.17ms p(95)=14.71ms
     http_reqs......................: 218866  7292.430161/s
     iteration_duration.............: avg=13.7ms   min=1.37ms med=13.43ms max=60.21ms p(90)=15.41ms p(95)=16.31ms
     iterations.....................: 218866  7292.430161/s
     vus............................: 100     min=100       max=100 
     vus_max........................: 100     min=100       max=100 


running (0m30.0s), 000/100 VUs, 218866 complete and 0 interrupted iterations
default ✓ [======================================] 100 VUs  30s

(a ~0.5% improvement). Memory usage seems improved too.

It would be nice if someone could try the benchmark on Linux.

I will also try to use sync.Pool to prevent memory allocations.

@ChrisRiddell

> (a ~0.5% improvement). Memory usage seems improved too.
>
> It would be nice if someone could try the benchmark on Linux.

Where is the k6 file located, so I can run the same benchmark on Linux?

@withinboredom
Collaborator

The one in test-data/load-test.js is usually the one I use. I won't be able to test it until after I get back from vacation at the end of the week.

@dunglas
Owner Author

dunglas commented Mar 5, 2024

@maypok86 sorry to bother you again (and I hope that it's not for nothing) but it looks like the latest failure is also related to Otter. It may be this known Go bug: https://pkg.go.dev/sync/atomic#pkg-note-BUG

@maypok86

maypok86 commented Mar 5, 2024

> @maypok86 sorry to bother you again (and I hope that it's not for nothing) but it looks like the latest failure is also related to Otter. It may be this known Go bug: https://pkg.go.dev/sync/atomic#pkg-note-BUG

@dunglas Damn, yeah, I completely forgot about that when refactoring, and there were no tests on 32-bit archs.

Try the dev version (go get -u github.com/maypok86/otter@dev); it seems to pass the tests on 32-bit architectures.

@dunglas
Owner Author

dunglas commented Mar 5, 2024

@maypok86 thanks for this swift fix! I bumped Otter; let's see if the tests are green.

@maypok86

maypok86 commented Mar 5, 2024

The tests seem to have passed, so I'll create a release with the bug fix now.

@maypok86

maypok86 commented Mar 5, 2024

Done.

@dunglas dunglas marked this pull request as ready for review March 11, 2024 15:42
@dunglas dunglas merged commit 07a74e5 into main Mar 12, 2024
41 checks passed
@dunglas dunglas deleted the refactor/env-var-creation branch March 12, 2024 17:31