
Customization of Netty "initialBufferSize" #1750 #1881

Merged: 1 commit into eclipse-vertx:master on Apr 27, 2017

Conversation

leogomes
Contributor

Motivation:
Under load, and with HTTP requests or responses carrying multiple headers, a
lot of CPU time can be spent in "AppendableCharSequence.append", which
repeatedly has to grow its backing buffer and therefore copy char arrays.

Modifications:
Changed HttpClientOptions and HttpServerOptions to allow the user to pass an
initial buffer size for Netty's HTTP decoder that fits their needs when the
default initial buffer size (128 bytes) is not suitable. Also fixed some
typos in the documentation.

Result:
Initial buffer size for HTTP decoder is customisable.
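
For illustration, a minimal usage sketch, assuming the option is exposed as a setDecoderInitialBufferSize setter on both options classes as discussed in this PR (server and handler details are placeholders, not part of the change itself):

```java
import io.vertx.core.Vertx;
import io.vertx.core.http.HttpClientOptions;
import io.vertx.core.http.HttpServerOptions;

public class DecoderBufferSizeExample {
  public static void main(String[] args) {
    Vertx vertx = Vertx.vertx();

    // Grow the HTTP decoder's initial buffer from Netty's 128-byte default to 4 KiB,
    // so large header blocks avoid repeated AppendableCharSequence resizes.
    HttpServerOptions serverOptions = new HttpServerOptions()
        .setDecoderInitialBufferSize(4096); // setter name assumed from this PR

    HttpClientOptions clientOptions = new HttpClientOptions()
        .setDecoderInitialBufferSize(4096);

    vertx.createHttpServer(serverOptions)
        .requestHandler(req -> req.response().end("ok"))
        .listen(8080);

    vertx.createHttpClient(clientOptions);
  }
}
```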

@leogomes
Contributor Author

This PR adds the new option suggested here: #1750

@tsegismont mentioned in the issue that performance tests would be a plus, but I haven't put them in place yet. I was thinking of modifying https://github.com/vert-x3/vertx-perf to add a JMH test there to validate performance gains brought by this PR. Is it a good strategy?

@tsegismont
Contributor

tsegismont commented Mar 20, 2017 via email

@leogomes
Contributor Author

leogomes commented Mar 20, 2017

@tsegismont OK, I will put that in place. Should it be on a specific existing project or should I create a project on my side to demonstrate that and provide you guys a way to execute it?

Just for the note, when I mentioned JMH I was thinking of something along the lines of what's done here: https://github.com/netty/netty/tree/4.1/microbench/src/main/java/io/netty/handler/codec/http2
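
For context, here is the rough shape such a microbenchmark could take. This is only a sketch (parameter values and header sizes are illustrative); it exercises Netty 4.1's HttpRequestDecoder directly with different initial buffer sizes:

```java
import io.netty.buffer.Unpooled;
import io.netty.channel.embedded.EmbeddedChannel;
import io.netty.handler.codec.http.HttpRequestDecoder;
import io.netty.util.CharsetUtil;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class InitialBufferSizeBenchmark {

  @Param({"128", "4096"})
  public int initialBufferSize;

  private byte[] request;

  @Setup
  public void setup() {
    // Build a request with 15 headers of ~500 chars each.
    StringBuilder sb = new StringBuilder("GET / HTTP/1.1\r\nHost: localhost\r\n");
    String value = new String(new char[500]).replace('\0', 'x');
    for (int i = 0; i < 15; i++) {
      sb.append("X-Header-").append(i).append(": ").append(value).append("\r\n");
    }
    sb.append("\r\n");
    request = sb.toString().getBytes(CharsetUtil.US_ASCII);
  }

  @Benchmark
  public boolean decode() {
    // The last constructor argument is the initial buffer size; channel setup
    // cost is included in this rough sketch.
    EmbeddedChannel ch = new EmbeddedChannel(
        new HttpRequestDecoder(4096, 8192, 8192, true, initialBufferSize));
    ch.writeInbound(Unpooled.wrappedBuffer(request));
    return ch.finishAndReleaseAll();
  }
}
```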

@tsegismont
Contributor

tsegismont commented Mar 20, 2017 via email

@leogomes
Contributor Author

leogomes commented Mar 24, 2017

I have done some tests with Gatling. Results are available here: https://github.com/leogomes/vertx-benchmark/tree/master/results. I also profiled the app with JFR and generated flamegraphs, which you can find in the same folder, for the different initial buffer sizes.

While I couldn't notice any significant difference in response time (or throughput) in my particular tests, the flamegraphs show that HttpObjectDecoder$HeaderParser.parse() goes down from 69.49% to 51.22% of total CPU time when the initial buffer is changed from 128 to 4096 bytes.

If you guys have any suggestions for a better test case, please let me know. I've written a very basic app and a Gatling scenario that sends a request with a couple of quite big headers:
https://github.com/leogomes/vertx-benchmark/blob/master/src/gatling/vertx/VertxInitialBufferSimulation.scala
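
For anyone reproducing this outside Gatling, a rough Java equivalent of that request using the Vert.x 3.x HttpClient (header names and sizes are illustrative, not the actual scenario):

```java
import io.vertx.core.Vertx;
import io.vertx.core.http.HttpClient;
import io.vertx.core.http.HttpClientOptions;
import io.vertx.core.http.HttpClientRequest;

public class BigHeaderRequest {
  public static void main(String[] args) {
    Vertx vertx = Vertx.vertx();
    HttpClient client = vertx.createHttpClient(new HttpClientOptions()
        .setDefaultHost("localhost")
        .setDefaultPort(8080));

    // Two ~500-char header values, roughly what the Gatling scenario sends.
    String bigValue = new String(new char[500]).replace('\0', 'x');

    HttpClientRequest req = client.get("/", resp ->
        System.out.println("status: " + resp.statusCode()));
    req.putHeader("X-Big-Header-1", bigValue);
    req.putHeader("X-Big-Header-2", bigValue);
    req.end();
  }
}
```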

@gmagniez

Seems really great to me.
I think you didn't notice a big difference because you only tested the HttpServer side.
My use case was an HTTP proxy handling thousands of requests with 15+ headers, some quite large (at least 500 chars), in both directions: Client <-> (HttpServer) Vert.x (HttpClient) <-> Backend server,
ending up with nearly 2K of headers.

@tsegismont
Contributor

tsegismont commented Mar 24, 2017 via email

@gmagniez

I have already patched Vert.x to implement this behavior and, as leogomes pointed out, I also noticed lower CPU consumption from request parsing, on an even bigger scale since my proxy uses both HttpServer and HttpClient.
After that change, from my sampling, AppendableCharSequence.append is no longer the top consumer; it is now exceeded by the char-by-char writing of headers to the socket. I have not yet found a way to keep headers as Netty AsciiString instances: they are always converted to String at some point during parsing. (I would prefer to handle only AsciiStrings in my proxy use case, since my app is "just" copying headers in both directions.) But that is another "problem".

@leogomes
Contributor Author

In a couple of weeks, I should have time to improve my current test scenario (write a proxy-like app with both server and client parts, and add 15+ headers as described by @gmagniez) and run a better performance test. I will keep you posted.

@leogomes
Contributor Author

leogomes commented Apr 8, 2017

So, my new scenario has:

  • A proxy-like application that forwards HTTP requests to an Echo Server (see the sketch after this list). Both run on the same machine, while the client (Gatling) sits on another one;
  • My Gatling scenario now has 15 HTTP headers with ~500 bytes each;
  • I now trigger 1000 users (i.e. 1000 channels between Gatling and the Proxy);
  • I initialized my Proxy with 128 bytes for the initial buffer (i.e. default case) and 10240 bytes (so there should be no need to resize within the HttpDecoder) and compared both.
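
A bare-bones Java sketch of that proxy/echo setup, using the Vert.x 3.x API (ports, handlers and the setDecoderInitialBufferSize setter name are assumptions for illustration, not the benchmark's actual code):

```java
import io.vertx.core.Vertx;
import io.vertx.core.http.HttpClient;
import io.vertx.core.http.HttpClientOptions;
import io.vertx.core.http.HttpClientRequest;
import io.vertx.core.http.HttpServerOptions;

public class ProxyEchoSketch {
  public static void main(String[] args) {
    Vertx vertx = Vertx.vertx();

    // Echo server: replies with the request body.
    vertx.createHttpServer()
        .requestHandler(req -> req.bodyHandler(body -> req.response().end(body)))
        .listen(8081);

    // Proxy: forwards every request (headers included) to the echo server.
    // The decoder initial buffer size is what gets varied between runs (128 vs 10240).
    HttpClient client = vertx.createHttpClient(new HttpClientOptions()
        .setDefaultHost("localhost")
        .setDefaultPort(8081)
        .setDecoderInitialBufferSize(10240)); // setter name assumed from this PR

    vertx.createHttpServer(new HttpServerOptions().setDecoderInitialBufferSize(10240))
        .requestHandler(req -> req.bodyHandler(body -> {
          HttpClientRequest fwd = client.request(req.method(), req.uri(), resp ->
              resp.bodyHandler(respBody -> req.response().end(respBody)));
          fwd.headers().addAll(req.headers());
          fwd.end(body);
        }))
        .listen(8080);
  }
}
```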

With that scenario, the difference in performance starts to be more visible:

  • HeaderParser.parse accounts for ~52% of CPU time for the default case, according to the flamegraph:
    (flamegraph image)

  • It goes down to ~21% when the initial buffer is set to 10240 bytes:
    (flamegraph image)

Throughput and response time are better as well (as one would expect):

128 bytes:
(Gatling results image)

10240 bytes:
(Gatling results image)

Source code and results are available here: https://github.com/leogomes/vertx-benchmark/tree/master/results

Gatling scenario: https://github.com/leogomes/vertx-benchmark/blob/master/src/gatling/vertx/VertxInitialBufferSimulation.scala

Cheers,
Leo.

@vietj
Member

vietj commented Apr 8, 2017

@leogomes the results are interesting: the latency improvement at the 99th percentile plus the throughput improvement look good. Can you give some details of the actual workload for this benchmark (machines, network)?

@leogomes
Contributor Author

@vietj Here are the CPU details for the two machines:
Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K

They're running 64-bit Linux, kernel 3.0.101-63-default.

They're on the same rack and have a 1 Gbps network adapter.

The request that I'm sending is the one you can find on the Gatling scenario.

The measurement itself is pretty short (< 3min), because I haven't had much time to dedicate to it and also because here we're not actually modifying a default value, but only exposing the underlying Netty setting, so that people with this sort of workload (heavy HTTP headers) can customise the HttpDecoder. So, I was kind of just trying to build an example that could easily reproduce the scenario. But for sure, a more serious performance test would need to be run for much longer. With that setting exposed, people out there may be able to run it with different values and their own use-cases to find their sweet spot.

Finally, the previous results I posted were taken while profiling with JFR in order to generate the flamegraphs. I have now run a new test without any profiling and was able to reproduce the same results.

@vietj
Member

vietj commented Apr 10, 2017

Looks good and well elaborated to me, @leogomes.

@leogomes
Contributor Author

@tsegismont WDYT?

@tsegismont
Contributor

@leogomes LGTM

@vietj merged commit 0b13b42 into eclipse-vertx:master on Apr 27, 2017
@vietj removed the "to review" label on Apr 27, 2017