
Customization of Netty "initialBufferSize" #1750 #1881

Merged: 1 commit into eclipse-vertx:master on Apr 27, 2017

Conversation

leogomes
Contributor

Motivation:
Under load, and with HTTP requests or responses carrying multiple headers, a
lot of CPU time can be spent in "AppendableCharSequence.append", which
repeatedly has to grow its backing buffer and therefore copy char arrays.

Modifications:
Changed HttpClientOptions and HttpServerOptions to allow the user to pass an
initial buffer size for Netty's HTTP decoder that fits their needs when the
default initial buffer size (128 bytes) is not suitable. Also fixed some
typos in the documentation.

Result:
Initial buffer size for HTTP decoder is customisable.
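
For illustration, a minimal usage sketch, assuming the option is exposed as a setDecoderInitialBufferSize setter on both options classes as discussed in this PR (server and handler details are placeholders, not part of the change itself):

```java
import io.vertx.core.Vertx;
import io.vertx.core.http.HttpClientOptions;
import io.vertx.core.http.HttpServerOptions;

public class DecoderBufferSizeExample {
  public static void main(String[] args) {
    Vertx vertx = Vertx.vertx();

    // Grow the HTTP decoder's initial buffer from Netty's 128-byte default to 4 KiB,
    // so large header blocks avoid repeated AppendableCharSequence resizes.
    HttpServerOptions serverOptions = new HttpServerOptions()
        .setDecoderInitialBufferSize(4096); // setter name assumed from this PR

    HttpClientOptions clientOptions = new HttpClientOptions()
        .setDecoderInitialBufferSize(4096);

    vertx.createHttpServer(serverOptions)
        .requestHandler(req -> req.response().end("ok"))
        .listen(8080);

    vertx.createHttpClient(clientOptions);
  }
}
```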

@leogomes
Contributor Author

This PR adds the new option suggested here: #1750

@tsegismont mentioned in the issue that performance tests would be a plus, but I haven't put them in place yet. I was thinking of modifying https://github.com/vert-x3/vertx-perf to add a JMH test there to validate performance gains brought by this PR. Is it a good strategy?

@tsegismont
Contributor

tsegismont commented Mar 20, 2017 via email

@leogomes
Contributor Author

leogomes commented Mar 20, 2017

@tsegismont OK, I will put that in place. Should it be on a specific existing project or should I create a project on my side to demonstrate that and provide you guys a way to execute it?

Just for the note, when I mentioned JMH I was thinking of something along the lines of what's done here: https://github.com/netty/netty/tree/4.1/microbench/src/main/java/io/netty/handler/codec/http2
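
For context, here is the rough shape such a microbenchmark could take. This is only a sketch (parameter values and header sizes are illustrative); it exercises Netty 4.1's HttpRequestDecoder directly with different initial buffer sizes:

```java
import io.netty.buffer.Unpooled;
import io.netty.channel.embedded.EmbeddedChannel;
import io.netty.handler.codec.http.HttpRequestDecoder;
import io.netty.util.CharsetUtil;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class InitialBufferSizeBenchmark {

  @Param({"128", "4096"})
  public int initialBufferSize;

  private byte[] request;

  @Setup
  public void setup() {
    // Build a request with 15 headers of ~500 chars each.
    StringBuilder sb = new StringBuilder("GET / HTTP/1.1\r\nHost: localhost\r\n");
    String value = new String(new char[500]).replace('\0', 'x');
    for (int i = 0; i < 15; i++) {
      sb.append("X-Header-").append(i).append(": ").append(value).append("\r\n");
    }
    sb.append("\r\n");
    request = sb.toString().getBytes(CharsetUtil.US_ASCII);
  }

  @Benchmark
  public boolean decode() {
    // The last constructor argument is the initial buffer size; channel setup
    // cost is included in this rough sketch.
    EmbeddedChannel ch = new EmbeddedChannel(
        new HttpRequestDecoder(4096, 8192, 8192, true, initialBufferSize));
    ch.writeInbound(Unpooled.wrappedBuffer(request));
    return ch.finishAndReleaseAll();
  }
}
```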

@tsegismont
Contributor

tsegismont commented Mar 20, 2017 via email

@leogomes
Contributor Author

leogomes commented Mar 24, 2017

I have done some tests with Gatling. Results are available here: https://github.com/leogomes/vertx-benchmark/tree/master/results. I also profiled the app with JFR and generated flamegraphs, which you can find in the same folder, for the different initial buffer sizes.

While I couldn't notice any significant difference in response time (or throughput) in my particular tests, the flamegraphs show that HttpObjectDecoder$HeaderParser.parse() goes down from 69.49% to 51.22% of total CPU time when the initial buffer is changed from 128 to 4096 bytes.

If you guys have any suggestions for a better test case, please let me know. I've written a very basic app and a Gatling scenario that sends a request with a couple of quite big headers:
https://github.com/leogomes/vertx-benchmark/blob/master/src/gatling/vertx/VertxInitialBufferSimulation.scala
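
For anyone reproducing this outside Gatling, a rough Java equivalent of that request using the Vert.x 3.x HttpClient (header names and sizes are illustrative, not the actual scenario):

```java
import io.vertx.core.Vertx;
import io.vertx.core.http.HttpClient;
import io.vertx.core.http.HttpClientOptions;
import io.vertx.core.http.HttpClientRequest;

public class BigHeaderRequest {
  public static void main(String[] args) {
    Vertx vertx = Vertx.vertx();
    HttpClient client = vertx.createHttpClient(new HttpClientOptions()
        .setDefaultHost("localhost")
        .setDefaultPort(8080));

    // Two ~500-char header values, roughly what the Gatling scenario sends.
    String bigValue = new String(new char[500]).replace('\0', 'x');

    HttpClientRequest req = client.get("/", resp ->
        System.out.println("status: " + resp.statusCode()));
    req.putHeader("X-Big-Header-1", bigValue);
    req.putHeader("X-Big-Header-2", bigValue);
    req.end();
  }
}
```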

@gmagniez

Seems really great to me.
I think you didn't notice a big difference because you only tested the HttpServer side.
My use case was an HTTP proxy handling thousands of requests with 15+ headers, some quite large (at least 500 chars), in both directions: Client <-> (HttpServer) Vert.x (HttpClient) <-> Backend server,
ending up with nearly 2K of headers.

@tsegismont
Contributor

tsegismont commented Mar 24, 2017 via email

@gmagniez

I have already patched Vert.x to implement this behavior and, as leogomes pointed out, I also noticed lower CPU consumption from request parsing, on an even bigger scale since my proxy uses both HttpServer and HttpClient.
After that change, from my sampling, AppendableCharSequence.append is no longer the top consumer; it is now exceeded by the char-by-char writing of headers to the socket. I have not yet found a way to keep headers as Netty AsciiString instances: they are always converted to String at some point during parsing. (I would prefer to handle only AsciiStrings in my proxy use case, since my app is "just" copying headers in both directions.) But that is another "problem".

@leogomes
Contributor Author

In a couple of weeks, I should have time to improve my current test scenario (write a proxy-like app with both server and client parts, and add 15+ headers as described by @gmagniez) and run a better performance test. I will keep you posted.

@leogomes
Contributor Author

leogomes commented Apr 8, 2017

So, my new scenario has:

  • A proxy-like application that forwards HTTP requests to an Echo Server (see the sketch after this list). Both run on the same machine, while the client (Gatling) sits on another one;
  • My Gatling scenario now has 15 HTTP headers with ~500 bytes each;
  • I now trigger 1000 users (i.e. 1000 channels between Gatling and the Proxy);
  • I initialized my Proxy with 128 bytes for the initial buffer (i.e. default case) and 10240 bytes (so there should be no need to resize within the HttpDecoder) and compared both.
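
A bare-bones Java sketch of that proxy/echo setup, using the Vert.x 3.x API (ports, handlers and the setDecoderInitialBufferSize setter name are assumptions for illustration, not the benchmark's actual code):

```java
import io.vertx.core.Vertx;
import io.vertx.core.http.HttpClient;
import io.vertx.core.http.HttpClientOptions;
import io.vertx.core.http.HttpClientRequest;
import io.vertx.core.http.HttpServerOptions;

public class ProxyEchoSketch {
  public static void main(String[] args) {
    Vertx vertx = Vertx.vertx();

    // Echo server: replies with the request body.
    vertx.createHttpServer()
        .requestHandler(req -> req.bodyHandler(body -> req.response().end(body)))
        .listen(8081);

    // Proxy: forwards every request (headers included) to the echo server.
    // The decoder initial buffer size is what gets varied between runs (128 vs 10240).
    HttpClient client = vertx.createHttpClient(new HttpClientOptions()
        .setDefaultHost("localhost")
        .setDefaultPort(8081)
        .setDecoderInitialBufferSize(10240)); // setter name assumed from this PR

    vertx.createHttpServer(new HttpServerOptions().setDecoderInitialBufferSize(10240))
        .requestHandler(req -> req.bodyHandler(body -> {
          HttpClientRequest fwd = client.request(req.method(), req.uri(), resp ->
              resp.bodyHandler(respBody -> req.response().end(respBody)));
          fwd.headers().addAll(req.headers());
          fwd.end(body);
        }))
        .listen(8080);
  }
}
```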

With that scenario, the difference in performance starts to be more visible:

  • HeaderParser.parse accounts for ~52% of CPU time for the default case, according to the flamegraph:
    (flamegraph image)

  • It goes down to ~21% when the initial buffer is set to 10240 bytes:
    (flamegraph image)

Throughput and response time are better as well (as one would expect):

128 bytes:
(Gatling results image)

10240 bytes:
(Gatling results image)

Source code and results are available here: https://github.com/leogomes/vertx-benchmark/tree/master/results

Gatling scenario: https://github.com/leogomes/vertx-benchmark/blob/master/src/gatling/vertx/VertxInitialBufferSimulation.scala

Cheers,
Leo.

@vietj
Member

vietj commented Apr 8, 2017

@leogomes the results are interesting: the latency improvement at the 99th percentile plus the throughput improvement look good. Can you give some details of the actual workload for this benchmark (machines, network)?

@leogomes
Contributor Author

@vietj Here are the CPU details for the two machines:
Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K

They're running 64-bit Linux, kernel 3.0.101-63-default.

They're on the same rack and have a 1 Gbps network adapter.

The request that I'm sending is the one you can find on the Gatling scenario.

The measurement itself is pretty short (< 3min), because I haven't had much time to dedicate to it and also because here we're not actually modifying a default value, but only exposing the underlying Netty setting, so that people with this sort of workload (heavy HTTP headers) can customise the HttpDecoder. So, I was kind of just trying to build an example that could easily reproduce the scenario. But for sure, a more serious performance test would need to be run for much longer. With that setting exposed, people out there may be able to run it with different values and their own use-cases to find their sweet spot.

Finally, the previous results I posted were taken while profiling with JFR in order to generate the flamegraphs. I have now run a new test without any profiling and was able to reproduce the same results.

@vietj
Member

vietj commented Apr 10, 2017

Looks good and well elaborated to me, @leogomes.

@leogomes
Contributor Author

@tsegismont WDYT?

@tsegismont
Contributor

@leogomes LGTM

@vietj merged commit 0b13b42 into eclipse-vertx:master on Apr 27, 2017
@vietj removed the "to review" label on Apr 27, 2017