
possible HTTP load problem #17395

Closed
rkuhn opened this issue May 6, 2015 · 60 comments

@rkuhn
Contributor

rkuhn commented May 6, 2015

reported on twitter by Niko Will (@win1imb), reproducer is at https://github.com/win1imb/akka-poc

@rkuhn rkuhn added this to the http-1.0 milestone May 6, 2015
@rkuhn rkuhn added 1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted t:http labels May 6, 2015
@n1ko-w1ll
Contributor

Hi and thanks for opening this issue. To reproduce the problem it should be enough to run the SearchService class and the corresponding Gatling test. On my machine (Windows 7, 64 Bit, 16 GB RAM) around 40% of the requests fail with an error message saying

java.net.ConnectException: Connection refused: no further information: localhost/127.0.0.1:8080

@rkuhn
Contributor Author

rkuhn commented May 6, 2015

Would you mind using the akka.http.javadsl.Http extension directly and specifying a larger value for the backlog parameter? This one defaults to 100 and if connections cannot be accepted quickly enough this will give you the observed result.
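For reference, the backlog in question is the standard TCP accept-queue length, the same parameter that plain JDK sockets expose. A minimal sketch with java.net.ServerSocket (plain JDK, not akka-http; the backlog value of 1024 is merely illustrative):

```java
import java.net.ServerSocket;
import java.net.Socket;

public class BacklogDemo {
    public static void main(String[] args) throws Exception {
        // The third constructor argument is the accept-queue backlog; the OS
        // may silently cap it (e.g. at net.core.somaxconn on Linux).
        ServerSocket server = new ServerSocket(0, 1024);
        int port = server.getLocalPort();

        // A client connect succeeds even before accept() is called,
        // as long as the backlog queue is not yet full.
        Socket client = new Socket("127.0.0.1", port);
        Socket accepted = server.accept();
        System.out.println(accepted.isConnected() && client.isConnected());

        client.close();
        accepted.close();
        server.close();
    }
}
```

Once that queue is full, further connection attempts are refused or dropped by the kernel before the application ever sees them, which matches the "Connection refused" symptom above.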

@jrudolph
Member

jrudolph commented May 6, 2015

@rkuhn that's also a good reminder to add the missing Java overloads for specifying settings.

@win1imb The result of such a load test will very much depend on benchmarking tools and setup.

"10000 concurrent users" can mean a lot of different things. E.g. if you set up wrk to use 10000 connections, it will try to create all those (persistent HTTP) connections, but at the same time it already starts hammering the server with requests. Normally wrk will be able to completely max out the server CPU over just 8 connections (or at least as many as there are CPUs), if handling the request is completely CPU-bound, which is the usual benchmarking case.

Increasing the number of connections from this point will only mean that connection establishment and request handling will battle for resources which means that timeouts are likely.

So, in summary, try to separate testing for performance (How many RPS are possible?) and scalability over connections (What is the performance impact of running requests over more connections?). From what I've seen scalability is ok while performance still needs to be improved.
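As an illustration, the two measurements can be separated using wrk's standard flags (the URL and path are assumptions based on the reproducer's SearchService, not taken from the project):

```shell
# Raw performance: just enough connections to saturate the CPUs
wrk -t8 -c8 -d30s http://localhost:8080/search

# Connection scalability: the same workload spread over many connections
wrk -t8 -c10000 -d30s http://localhost:8080/search
```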

/cc @sirthias

@n1ko-w1ll
Contributor

@rkuhn I had a quick look but I'm not familiar with the internals of akka. If I've understood it correctly, I have to use the Http.get(system).bind(....) method, right? The one where I can specify the backlog size as a parameter has a lot of other parameters, and I have no clue what to pass for them.

@jrudolph you're right, that's why I provided the stress test setup with Gatling in my example project, too. As you said, increasing the number of connections will end up in a battle for resources. But if you have to choose between two technologies, one that can cope with the load and answer the requests, even if not very fast, and another that refuses connections because it cannot manage the battle for resources, which one would you use? In a world with SLAs and zero downtime, given the choice between answering a request in 150 ms (which was the mean for Spring Boot in my comparison) and 40% connection refused, I would choose the first option ;)

Don't get me wrong, I like akka and actor-based programming, especially how it handles concurrency and resilience in a very concise way. As I said, possibly the error sits in front of the computer. But at the end of the day, we have to be confident in the technology we use.

@n1ko-w1ll
Contributor

BTW, had to change my username, so the code examples are now at https://github.com/n1ko-w1ll/akka-poc

@rkuhn
Contributor Author

rkuhn commented May 7, 2015

My plan was to look into this today, but might not be able to, depending on how the traveling goes. I’ll definitely try out your project to see what’s wrong.

@jrudolph
Member

jrudolph commented May 8, 2015

@n1ko-w1ll

But if you have to choose between two technologies, one that can cope with the load and answer the requests, even if not very fast, and another that refuses connections because it cannot manage the battle for resources, which one would you use?

This assumes that the problem is related to latency (i.e. the processing pipeline is too long but the server still has some CPU capacity left). If you assume the problem is throughput-related (i.e. the server just cannot keep up with the work) then the result is not slow responses but a DOS scenario where response times increase and eventually the server won't answer at all. If the problem is throughput-related then, in the spirit of reactivity, you would actually choose the server that backpressures new connections (e.g. by letting connection attempts time out and relying on clients reconnecting with a backoff strategy) instead of a server which lets itself get overwhelmed with connections/requests.
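The reconnect-with-backoff strategy mentioned here can be sketched as a capped exponential delay (plain Java; the base and cap constants are illustrative, not anything akka-http prescribes):

```java
public class Backoff {
    // Capped exponential backoff: baseMillis * 2^attempt, limited to capMillis.
    static long delayMillis(int attempt, long baseMillis, long capMillis) {
        long factor = 1L << Math.min(attempt, 30); // clamp shift to avoid overflow
        return Math.min(baseMillis * factor, capMillis);
    }

    public static void main(String[] args) {
        // A client whose connection attempt was refused or timed out would
        // wait delayMillis(n) before its n-th retry.
        for (int attempt = 0; attempt < 8; attempt++) {
            System.out.println(delayMillis(attempt, 100, 5000));
        }
    }
}
```

Adding random jitter to each delay is a common refinement, so that many rejected clients do not all reconnect at the same instant.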

What I meant before with "From what I've seen scalability is ok while performance still needs to be improved." is that akka-http may currently be just too slow to run your example but it doesn't really depend on the number of connections. So, I suspect that all that you are seeing is a consequence of akka-http being too slow.

That said, I agree with you that akka-http should be at least in the same ballpark performance-wise as spring. :)

Btw. thanks for including the gatling script and sorry that I overlooked it before.

@n1ko-w1ll
Contributor

@jrudolph okay, I agree with that. But I don't think akka-http is too slow. If I reduce the number of concurrent users to 1000 in the given test, akka-http is even a bit faster than spring.

@drewhk
Member

drewhk commented May 8, 2015

@n1ko-w1ll well, it is slow :) We first want to make it stable with the minimal set of features to be usable, but we will need to optimize after. I hope we get there soon!

@n1ko-w1ll
Contributor

@drewhk but still faster than spring ;)

@jrudolph
Member

jrudolph commented May 8, 2015

@n1ko-w1ll As @rkuhn suggested, increasing the backlog will probably help, but unfortunately it is next to impossible to do from Java code right now. If you can create a Scala class in your project, you could create your own version of HttpApp where you override the handleConnectionsWithRoute method with a copy that sets the backlog parameter to a higher value:

https://github.com/akka/akka/blob/release-2.3-dev/akka-http-java/src/main/scala/akka/http/javadsl/server/HttpService.scala#L34

@n1ko-w1ll
Contributor

@jrudolph that could be an option. Have to see when I can find time to test it.

@He-Pin
Member

He-Pin commented May 8, 2015

@jrudolph is the IO layer for akka-http pluggable?

@jrudolph
Member

jrudolph commented May 8, 2015

@hepin1989 There is a certain kind of pluggability that allows you to use the HTTP layers without TCP, if that's what you mean. This mechanism is (or will be) used to support HTTPS. You can even use a transport layer that is implemented with other implementations of reactive-streams. The implementation itself depends on akka-stream. What kind of application do you have in mind?

@He-Pin
Member

He-Pin commented May 8, 2015

@jrudolph that's exactly what I have in mind. I am thinking about using https://github.com/netty/netty-tcnative for SSL and Netty/AIO for TCP, but with akka-http for the programming model.

So all that's left to do is a reactive-streams wrapper around this, and then everything could be connected together, right? That's really great!

@jrudolph
Member

jrudolph commented May 8, 2015

@hepin1989 in theory it could work, and I'd be interested in the results. From our experience, "just a reactive-streams wrapper around this" will still be lots of work, and optimizing performance while mixing several stacks together will probably be even harder than optimizing just the single stack...

@He-Pin
Member

He-Pin commented May 8, 2015

@jrudolph yes, that's it. Currently we are building our game server plus gm around akka and spray, and will move to akka-http later. The first thing is to make it up and running, and then optimize. :) Thanks for your work.

@rkuhn
Contributor Author

rkuhn commented May 8, 2015

Currently it is extremely ugly to set the backlog parameter from Java, but I have hacked it together here. Running this will not succeed unless you set a few kernel parameters:

sudo sysctl -w kern.ipc.somaxconn=12000
sudo sysctl -w kern.maxfilesperproc=1048576
sudo sysctl -w kern.maxfiles=1148576

When doing so, the test result is quite different:

================================================================================
---- Global Information --------------------------------------------------------
> request count                                     298274 (OK=292862 KO=5412  )
> min response time                                      0 (OK=0      KO=60002 )
> max response time                                  60032 (OK=59997  KO=60032 )
> mean response time                                  1518 (OK=437    KO=60005 )
> std deviation                                       8843 (OK=3908   KO=2     )
> response time 50th percentile                         20 (OK=18     KO=60005 )
> response time 75th percentile                         38 (OK=36     KO=60007 )
> mean requests/sec                                3723.584 (OK=3656.022 KO=67.562)
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                        287834 ( 96%)
> 800 ms < t < 1200 ms                                 506 (  0%)
> t > 1200 ms                                         4522 (  2%)
> failed                                              5412 (  2%)
---- Errors --------------------------------------------------------------------
> java.util.concurrent.TimeoutException: Request timed out to localhost/127.0.0.1:8080 of 60000 ms   5412 (100,0%)
================================================================================

What we see here is that some requests time out because the server is being overloaded, and we possibly don’t yet handle all non-nominal processing conditions with perfect grace, but it is quite clear that Akka HTTP can already handle quite some load, and we will definitely keep improving. Johannes, it might be interesting to look into why only some of the requests get “stuck”: there is a clear latency gap between almost everything going nice and fast and a few requests that seem to starve.

I don’t have the Spring comparison data, so it would be great if Niko could comment on these findings.

@He-Pin
Member

He-Pin commented May 8, 2015

@rkuhn maybe we could add some docs about the server-side environment setup, like maxfiles.
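For what it's worth, the macOS sysctls shown earlier have Linux counterparts; a sketch of such a setup note (these are the standard Linux knob names, the values merely mirror the ones above):

```shell
# Accept-queue cap: listen() backlogs larger than this are silently truncated
sudo sysctl -w net.core.somaxconn=12000

# System-wide open-file limit
sudo sysctl -w fs.file-max=1148576

# Per-process file-descriptor limit for the current shell (and its children)
ulimit -n 1048576
```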

@n1ko-w1ll
Contributor

Hm, that does not look very impressive yet. I mean, I don't know your hardware, and mine is quite powerful for a laptop (i7-4800MQ @ 2.70 GHz with 16 GB RAM). The results for Spring on my machine are quite similar, and my Spring application already implements a query parser with parboiled, translates the query to MongoDB criteria and retrieves the results from MongoDB.

================================================================================
---- Global Information --------------------------------------------------------
> request count                                     289354 (OK=289354 KO=0     )
> min response time                                      0 (OK=0      KO=-     )
> max response time                                   1877 (OK=1877   KO=-     )
> mean response time                                    22 (OK=22     KO=-     )
> std deviation                                         12 (OK=12     KO=-     )
> response time 50th percentile                         21 (OK=21     KO=-     )
> response time 75th percentile                         28 (OK=28     KO=-     )
> mean requests/sec                                3534.526 (OK=3534.526 KO=-     )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                        289353 (100%)
> 800 ms < t < 1200 ms                                   0 (  0%)
> t > 1200 ms                                            1 (  0%)
> failed                                                 0 (  0%)
================================================================================

@n1ko-w1ll
Contributor

By the way... the application will be deployed on cloudfoundry. I'm not sure if we can edit these environment variables there somehow :(

@rkuhn
Contributor Author

rkuhn commented May 8, 2015

@n1ko-w1ll That was what @drewhk was trying to say all along and what we have been very vocal about: you are testing a pre-release version of the first iteration of a new HTTP stack that has not yet been optimized at all. Of course this will not be representative of the performance when it reaches production quality.

Concerning the parameters: you cannot possibly have tested successfully with 10000 clients with the MacOS X default limit of 256 open file descriptors per process and 12000 file descriptors in the whole system, independent of which HTTP stack is used, and to my knowledge none of the popular operating system kernels come preconfigured with limits that are suitable for your specific test case. Doing load tests and optimization will always require dedicated configuration of all aspects of the deployment.

Another note on my measurement: the point was to show that the refused connections are indeed due to the limited queue of incoming connection requests, I have not tuned the JVM settings at all—in fact I don’t even know how much memory it had available. Finding out how to do these things with Maven goes beyond what time I can spend on this issue today, perhaps you can repeat the measurement with the patched version to get an even comparison.

@n1ko-w1ll
Contributor

I'm not sure if it makes any difference, but as mentioned in my initial post, I'm running Windows 7, not MacOS X ;)

I know that all of this is not released yet. I just want to point out that it works out of the box with Spring Boot, without any adjustments at the operating-system level or any special configuration, and I wondered why this is not possible with akka when I initially tried it.

It's the same for me, I don't think I have the time today to test this again. Maybe next week.

@He-Pin
Member

He-Pin commented May 8, 2015

Maybe we could make an online chart which shows the benchmark results across release iterations, because akka has WebSocket and HTTP now, which is great for microservices and REST facades.

@rkuhn
Contributor Author

rkuhn commented May 8, 2015

From this I conclude that on Windows these kernel settings are not needed, and all that was required was switching to the higher backlog parameter value in your test case.

@n1ko-w1ll
Contributor

@rkuhn I tested your code on Windows and still received the "Connection refused" errors :(

@rkuhn
Contributor Author

rkuhn commented May 8, 2015

We’ll have to revisit this when we are in a position to actually perform real benchmarks and performance analyses, thanks for providing the test case.

@johanandren johanandren removed this from the 2.4.7 milestone Jun 3, 2016
@ktoso ktoso modified the milestones: 2.4.8, 2.4.9 Jul 8, 2016
@ktoso ktoso modified the milestones: 2.4.9-RC1, 2.4.9 Aug 2, 2016
@ktoso ktoso removed the 2 - pick next Used to mark issues which are next up in the queue to be worked on. The tag is non-binding label Aug 2, 2016
@ktoso ktoso modified the milestones: 2.4.9-RC2, 2.4.9 Aug 5, 2016
@johanandren johanandren modified the milestones: 2.4.9, 2.4.10 Aug 19, 2016
@2m 2m modified the milestones: 2.4.10, 2.4.11 Sep 7, 2016
@ktoso ktoso removed 1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted labels Sep 8, 2016
@ktoso
Member

ktoso commented Sep 12, 2016

Closing as obsolete, we did a lot of work in the area recently, reopen if needed please

@ktoso ktoso closed this as completed Sep 12, 2016
@swachter
Contributor

Just for the record (ticket can stay closed): I repeated the test with the current Akka version (2.4.10) on the same Linux box I used on March 25th. Again, the performance increased significantly (by about 30%):

================================================================================
---- Global Information --------------------------------------------------------
> request count                                      44114 (OK=44114  KO=0     )
> min response time                                   2009 (OK=2009   KO=-     )
> max response time                                   2223 (OK=2223   KO=-     )
> mean response time                                  2018 (OK=2018   KO=-     )
> std deviation                                         14 (OK=14     KO=-     )
> response time 50th percentile                       2011 (OK=2012   KO=-     )
> response time 75th percentile                       2021 (OK=2021   KO=-     )
> mean requests/sec                                1379.856 (OK=1379.856 KO=-     )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                             0 (  0%)
> 800 ms < t < 1200 ms                                   0 (  0%)
> t > 1200 ms                                        44114 (100%)
> failed                                                 0 (  0%)
================================================================================

This result is even better than the result using the Spring version:

================================================================================
---- Global Information --------------------------------------------------------
> request count                                      41625 (OK=41625  KO=0     )
> min response time                                   2000 (OK=2000   KO=-     )
> max response time                                   3694 (OK=3694   KO=-     )
> mean response time                                  2173 (OK=2173   KO=-     )
> std deviation                                        271 (OK=271    KO=-     )
> response time 50th percentile                       2059 (OK=2059   KO=-     )
> response time 75th percentile                       2221 (OK=2221   KO=-     )
> mean requests/sec                                1304.573 (OK=1304.573 KO=-     )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                             0 (  0%)
> 800 ms < t < 1200 ms                                   0 (  0%)
> t > 1200 ms                                        41625 (100%)
> failed                                                 0 (  0%)
================================================================================

TODO repeat the runs on a Windows box.
