
Server connection accepting is not performant, drops connections. #429

Closed
ssoroka opened this issue Mar 17, 2015 · 30 comments
Comments

@ssoroka

ssoroka commented Mar 17, 2015

Under moderate load, the lumberjack server drops connections. I've seen it struggle to sustain more than 100 clients trying to connect at the same time. Clients get i/o timeouts and failed SSL handshake errors.

The root cause is that the connection-accepting code uses a single thread to both accept TCP connections and negotiate the SSL handshake. A slow handshake therefore blocks the server from accepting further connections for an unreasonably long time.
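
For illustration only (not the actual plugin code; tcp_server, ssl_context, and handle_connection are placeholders), the two patterns look roughly like this in Ruby:

  require "socket"
  require "openssl"

  # Pattern described above: accept and TLS handshake share one thread,
  # so a single slow or stalled handshake stalls every other pending client.
  def accept_loop_single_threaded(tcp_server, ssl_context)
    loop do
      client = tcp_server.accept                      # blocks until a client connects
      ssl = OpenSSL::SSL::SSLSocket.new(client, ssl_context)
      ssl.accept                                      # TLS handshake, still on the accept thread
      Thread.new(ssl) { |conn| handle_connection(conn) }
    end
  end

  # Pattern the patch moves toward: hand the raw socket to a thread right
  # away and do the handshake there, so the accept loop stays responsive.
  def accept_loop_threaded_handshake(tcp_server, ssl_context)
    loop do
      client = tcp_server.accept
      Thread.new(client) do |sock|
        ssl = OpenSSL::SSL::SSLSocket.new(sock, ssl_context)
        ssl.accept                                    # a slow handshake now only stalls its own thread
        handle_connection(ssl)
      end
    end
  end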

I believe the following unresolved bug reports may actually be instances of this issue: #360, #91

@ssoroka
Author

ssoroka commented Mar 17, 2015

I've opened a pull request for this issue. With the patch I get at least 7x better performance accepting connections, no more i/o timeouts, and no more failed SSL handshakes. The improvement may be better than 7x, but I ran out of clients to connect with.

@jordansissel
Contributor

@ssoroka what kind of hardware are you experiencing this problem on? What size SSL certificate are you using (2048-bit? 4096-bit)?

@ssoroka
Author

ssoroka commented Mar 17, 2015

Quad-core with hyperthreading, 16 GB RAM, bare-metal box. 2048-bit cert.

@jordansissel
Contributor

@ssoroka what's the output of openssl speed rsa on that box?

@ssoroka
Author

ssoroka commented Mar 17, 2015

I'm happy to share that privately. Needless to say, doing SSL negotiation on a single thread while the server is blocked from accepting further connections is a design flaw.

@jordansissel
Contributor

If I gave the impression that this was a debate about the correctness of your assertion, I apologize. I am not debating nor combating your proposal that there is a problem with single-threaded handshaking.

My belief is that if you are having trouble performing 100 tls handshakes in some short time period, then this indicates a performance bug regardless of the handshake work distribution.

My laptop reports 309.3 for the sign/s column of 2048 bit RSA. I'd use this number to roughly estimate how many TLS handshakes per second I can do with a 2048-bit certificate using OpenSSL. Logstash uses JRuby (javax.net.ssl, etc), so the performance may vary, although I haven't confirmed anything.
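
As a rough sanity check of the Ruby/JRuby side against that number, something like this could be run under JRuby (a sketch, not a rigorous benchmark; it measures raw RSA signing, not full handshakes):

  require "openssl"

  # Rough RSA-2048 signs-per-second measurement, to compare the Ruby/JRuby
  # crypto stack against the sign/s column of openssl speed rsa.
  key    = OpenSSL::PKey::RSA.new(2048)
  digest = OpenSSL::Digest::SHA256.new
  data   = "x" * 64

  count    = 0
  deadline = Time.now + 3
  while Time.now < deadline
    key.sign(digest, data)
    count += 1
  end
  puts "~#{(count / 3.0).round(1)} RSA-2048 signs/sec"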

If you cannot provide any details on the peak handshakes/sec you are able to accomplish with the lumberjack input, and the 2048-bit RSA signs/sec rate given by openssl speed rsa, then we'll probably just end up waiting until I make time to test handshake throughput myself, which may take some time.

I will respect your decision not to share such information, but you should know that this will slow resolution of your concerns, because I believe there to be a bug specific to handshake performance. Throwing more threads at the problem is a workaround for this bug, not a solution, even if I agree that we should do handshaking within the connection thread.

@jordansissel
Contributor

To summarize:

  • Yes, we should probably be doing TLS handshaking in a non-single-threaded way
  • I believe your observed performance (no data given, though) for TLS handshaking on a single thread to be drastically below what I expect for your hardware. More research is required to determine where this bug is.

@ssoroka
Author

ssoroka commented Mar 17, 2015

Great, I can add more details with respect to those points:

The server in question reports 874.9 sign/s for RSA 2048, while under various other load (i.e. non-ideal benchmarking conditions).

I think it's more a matter of how long the SSL negotiation takes in general, since network latency affects it. Completing the handshake requires multiple request-response round trips. Counting network communication time only, 9 SSL negotiation steps × a low network latency of 10ms = 90ms per handshake. That caps a single accept thread at roughly 11 connections per second before the server starts dropping requests. On a medium-latency network of 100ms, that's more like 1 per second.
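
The same arithmetic written out, using the round-trip count and latencies estimated above:

  # With a single accept thread, round trips x latency bounds the accept rate.
  round_trips = 9                       # negotiation steps estimated above
  [0.010, 0.100].each do |latency|      # 10 ms and 100 ms networks
    per_handshake = round_trips * latency
    puts format("latency %3.0f ms -> %5.1f ms/handshake -> ~%.1f accepts/sec",
                latency * 1000, per_handshake * 1000, 1.0 / per_handshake)
  end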

@jordansissel
Contributor

@ssoroka I am familiar with the networking aspects of SSL: http://www.semicomplete.com/blog/geekery/ssl-latency.html :)

before the server starts dropping requests

This assumes the server accept backlog is low. We can tune this higher.
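
For reference, requesting a larger backlog in Ruby looks roughly like this (port and value are illustrative; the OS still caps the backlog, e.g. via net.core.somaxconn on Linux, and whether the lumberjack input exposes this as a setting is a separate question):

  require "socket"

  # A larger listen backlog lets the kernel queue more pending TCP connections
  # while the application is busy handshaking, instead of refusing them.
  server = TCPServer.new("0.0.0.0", 5043)
  server.listen(512)                    # request a larger backlog (the OS may cap this)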

Given that logstash-forwarder will detect timeouts and retry, I'm not entirely sure what you observe the negative impact to be. Error messages about network instability should not be permanent, and I would expect any observation that some connections are rejected due to load to wash out gradually as the forwarders back off and reconnect after handshake/connection failures. Do you experience permanent failure due to this thundering herd problem? Can you provide more details about the longevity of the connectivity problems?

@ssoroka
Author

ssoroka commented Mar 17, 2015

I do see continuous and permanent failures. I would typically not see more than 100 clients connected at a time, with hundreds more unable to connect. Given more time to wait, they still do not connect. Sometimes specific clients will manage to get connected, only to have others drop off and maintain the 100-client ceiling. They never sort out their issues and connect.

With the patch I have no problem connecting over 700, even if they're all started at once. This makes a lot of sense, since most of the ssl handshake is waiting on network i/o.

@jordansissel
Contributor

Given more time to wait, they do not connect

This sounds like a bug in the forwarder, as well, if this never resolves.

@jordansissel
Contributor

This sounds like a bug in the forwarder

Clarifying:

If 100 clients do a thundering herd and hit a single lumberjack server, each handshake takes 100ms, and the handshaking is single-threaded, I would expect, best case, 100 clients * 0.100 seconds == 10 seconds for all clients to connect. This ignores some round-trip and retry delays, but the point is that I expect all 100 clients to back off and retry within a short period after experiencing network connectivity problems. If they are not doing this, then there is something wrong beyond simply starving the single-threaded handshake mechanism.

@ssoroka
Author

ssoroka commented Mar 17, 2015

With logging on in the client, I get lines like the following:

{timestamp} Failure connecting to {ip} dial tcp {ip}:{port}: i/o timeout
{timestamp} Failed to tls handshake with {ip} read tcp {ip}:{port}: connection reset by peer

It'll cycle between those two endlessly, though not in that specific order.

@ssoroka
Author

ssoroka commented Mar 17, 2015

I agree there's likely some other bug at play; otherwise you would think it would eventually stabilize, regardless of how long it took to accept the connections.

@driskell
Contributor

I should add here that when I hit this problem some time ago, I had a rogue server somewhere that prevented the handshake from completing. The accept thread would be permanently blocked until I killed it, as the handshake was perpetually waiting for data that never arrived. I was unable to ascertain the root cause in the forwarder, but the single-threaded accept was the exacerbating factor. I'm unsure whether OpenSSL has handshake timeout settings by default.

Hopefully this information proves useful. If I can find any old notes I'll let you know.

@jordansissel
Contributor

@ssoroka for context, when I ran logstash-forwarder in the past, I had ~8000 lsf instances talking to 7 logstash servers (a ratio of roughly 1000:1 clients:servers), and I never experienced any persistent failures like this. The retry and random-server-selection features in lsf are there to combat the thundering-herd problems I experienced myself at that 1000:1 ratio, and they solved my problems. You having trouble at a 100:1 client:server ratio is pretty worrying to me, which is why I believe there are other bugs at play beyond just the "roflscale" (your words) concern you have.

@ssoroka
Author

ssoroka commented Mar 17, 2015

Well, I hope you're not taking too much offense at the jab. Obviously we're experiencing very different realities; 1000:1 is reasonable, and I would have loved to experience that.

Regardless of the existence of another bug, it shouldn't really hold up this pull request, which I think we should both be able to agree is a good thing. It's entirely possible that a few slow clients could hold the connection open for an unreasonable amount of time trying to negotiate the SSL handshake, which could be what's causing my problem, and this patch resolves that whole class of network-related issues.

@driskell
Contributor

@jordansissel All it takes is ONE bad client to trigger this bug. I've seen connections in my AWS platforms (100s, not 1000s) just drop and then suffer packet loss, leaving a dead connection to Logstash. Admittedly, these AWS machines are legacy old m1s running on old AWS hardware (I get reboot notifications for Xen upgrades, ha!). A single dead connection that doesn't time out destroys the Logstash lumberjack input and blocks it completely.

Reproduction is quite easy:

Telnet to an external lumberjack port, then turn off your Wifi. Then close your telnet terminal and a few seconds later connect back to Wifi. This will leave a connection on the remote side which never shuts down.

Now restart any logstash-forwarder - it will never connect back in. The lumberjack input is dead.

This patch will mitigate the issue, but having looked further into this, the worker thread will never shut down, so any bad connectivity would, over time, increase the number of dead threads. It seems you have to wrap the handshake in your own Timeout::timeout (http://jira.codehaus.org/browse/JRUBY-6975).
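
A sketch of that workaround, assuming the per-connection thread introduced by the patch (client_socket, ssl_context, and handle_connection are placeholders; per the JRuby ticket above, whether Timeout reliably interrupts a blocking handshake on JRuby would need testing):

  require "openssl"
  require "timeout"

  Thread.new(client_socket) do |sock|
    ssl = OpenSSL::SSL::SSLSocket.new(sock, ssl_context)
    begin
      Timeout.timeout(15) { ssl.accept }     # 15s handshake budget, illustrative
      handle_connection(ssl)
    rescue Timeout::Error, OpenSSL::SSL::SSLError, IOError
      sock.close rescue nil                  # drop the stuck or broken client
    end
  end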

@jordansissel
Contributor

@driskell I think I understand how this bug is triggered, but I'm not convinced the implementation proposed actually solves @ssoroka's "roflscale" concerns. One-connection-per-thread is still used after the patch, and this has bad performance concerns outside of handshaking, so I'm trying to explore with y'all what the actual problems observed are, and what specifically needs to be fixed. The attitude in the "roflscale" tweet and the problem reported are in conflict, and I'm trying to find out more information.

@jordansissel
Contributor

Of course, if you want me to ignore the tweet in the context of this issue, I am happy to focus on the incremental improvement provided by @ssoroka's patch.

@ssoroka
Author

ssoroka commented Mar 17, 2015

Let's solve the one issue #429 was opened for. Feel free to open new issues for other concerns. This request is about a specific problem I'm experiencing based on empirical evidence of connection acceptance contention. The patch addresses it with measured improvement, and I'd like it to be evaluated on its own merit, exclusive of other possible bugs.

@driskell
Contributor

Hi @ssoroka

Just be aware that you may have merely offloaded your problem to a thread so it doesn't block acceptance, not actually fixed it. So over time you may find dead threads accumulating and a crash after a month or so. @jordansissel will be wanting to fix it fully.

I'm all for incremental improvement, and this should be done. Adding timeouts around the handshake will, I think, help further and potentially be the fix that's needed, though it will need testing. Looking at it further, I think the same issue may also apply to established connections if they drop while idle due to firewall state-table expiry.

@jordansissel
Contributor

@driskell yeah, a timeout on the handshake process on the server side could help with resource exhaustion. There are some accept_nonblock SSL bugs in JRuby that may make this a challenge, but maybe that's cause to fix them :)
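
Roughly what that could look like, assuming accept_nonblock behaves on the Ruby in question (which, per the JRuby bugs mentioned, it may not):

  require "openssl"

  # Drive the TLS handshake with non-blocking IO and a hard deadline, so a
  # silent client cannot hold the handshake open forever.
  def handshake_with_deadline(ssl, seconds)
    deadline = Time.now + seconds
    begin
      ssl.accept_nonblock
    rescue IO::WaitReadable, IO::WaitWritable => e
      remaining = deadline - Time.now
      raise "TLS handshake timed out" if remaining <= 0
      readers = e.is_a?(IO::WaitReadable) ? [ssl.to_io] : nil
      writers = e.is_a?(IO::WaitWritable) ? [ssl.to_io] : nil
      IO.select(readers, writers, nil, remaining)
      retry
    end
    ssl
  end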

@jordansissel
Contributor

The patch addresses it with measured improvement, and I'd like it to be evaluated on its own merit, exclusive of other possible bugs.

I think I know what you mean, but my point is this patch could both fix your problem and cause another. I'm trying to think through things.

@driskell
Contributor

This patch should improve the situation at least until a full fix is formed. If we ignore everything outside of Logstash and aim to make Logstash unaffected by any issues outside of it, then we already know what to do - it will just take some time. The worst case now is that Logstash will crash by running out of threads rather than block. I would see this as better, and something one can monitor via thread count. Regular restarts will help too. Blocking is hard to work around.

The timeline for a full fix, I guess, is:

  • This patch parallelizes the handshake, mitigating the issue by moving it off the accept thread so it no longer blocks, at the cost of leaking a thread per stuck connection (better, IMO)
  • Then we need a timeout of some sort on the handshake (and potentially on data reads) to prevent leaked threads
  • To further improve overall scalability (beyond current), I would guess the handshake (and potentially data reads) should move to non-blocking IO select loops feeding a processing thread pool; see the sketch after this list
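
A very rough shape of that last bullet, just to make the idea concrete (names, pool size, and error handling are placeholders, not a proposed implementation):

  require "socket"
  require "openssl"
  require "thread"

  jobs = Queue.new                            # completed handshakes waiting for a worker
  4.times do
    Thread.new do
      while (conn = jobs.pop)
        handle_connection(conn)               # placeholder per-connection processing
      end
    end
  end

  pending = {}                                # raw socket => SSL socket mid-handshake
  loop do
    readable, = IO.select([tcp_server] + pending.keys)
    readable.each do |io|
      if io == tcp_server
        sock = tcp_server.accept_nonblock
        pending[sock] = OpenSSL::SSL::SSLSocket.new(sock, ssl_context)
      else
        begin
          pending[io].accept_nonblock         # resume the handshake as bytes arrive
          jobs << pending.delete(io)          # done: hand it to the worker pool
        rescue IO::WaitReadable, IO::WaitWritable
          # handshake not finished yet; keep waiting on this socket
        end
      end
    end
  end

A real version would still need the per-handshake deadline discussed above, plus handling for handshakes that block on writes.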

@richardgoetz

I think I am having the same problem. I am on Elasticsearch 1.3.2 and Logstash 1.5rc; it was working until I finally added a cluster member. Now all the logstash-forwarder clients connect to logstash, send an initial set of events, and then stop. nohup.out shows it just waiting, and when you shut down Logstash 1.5rc2 it hangs and won't shut down; a manual kill -9 PID is always needed once it stops accepting connections.

@knalli

knalli commented Apr 14, 2015

Is this issue really about Read error looking for ack: read tcp $host:$port: i/o timeout? I do not want to hijack it or create duplicates.

@pjaffe

pjaffe commented Jun 19, 2015

I apologize for potentially adding more confusion to the intended focus of this issue, but the comment by @driskell on Mar 17 seems to clearly point out an easily reproducible problem that is almost definitely periodically affecting my environment.

I followed the steps he suggested - I telnet'd to my logstash server, abruptly terminated my network connection, and my logstash-forwarders subsequently fail with the following error after restarting them:

Connecting to 10.199.10.248:5043 (<host name>)
Failed to tls handshake with 10.199.10.248 read tcp 10.199.10.248:5043: i/o timeout

I'm struggling with the following:

  1. Is there an issue tracking that problem (whether it is this issue or another)?
  2. Is it entirely an issue with logstash, or is it the combination of logstash and logstash-forwarder? I assume it isn't purely a logstash-forwarder problem.
  3. Has it been addressed in some subsequent version of logstash?

I reviewed #438 that @knalli mentioned in the previous comment, and it seemed to imply that upgrading to logstash 1.5.0 (GA) might address it, but I couldn't convince myself of that by reading the comments. The fact that I'm currently running an ancient logstash v1.1.12 and Elasticsearch v0.90.10 means that I would need to upgrade a few things in order to test this myself, which adds a barrier to just trying to verify it empirically.

Sorry again if I'm abusing this discussion thread. I understand that @ssoroka was very reasonably trying to focus this on a candidate performance improvement.

@jordansissel
Contributor

Reproduction is quite easy:
Telnet to an external lumberjack port, then turn off your Wifi. Then close your telnet terminal and a few seconds later connect back to Wifi. This will leave a connection on the remote side which never shuts down.

@driskell ahh, so I think the bug is on the server (lumberjack input) side of things where a misbehaving client can cause major (or permanent) stalls accepting new connections. I agree this should probably have a timeout on it.

I filed a ticket on the ruby server-side of things (elastic/ruby-lumberjack#10) so we can discuss it further and work towards a fix.

Sorry for the delays in attending to this ticket.

@jordansissel
Contributor

Let's continue the discussion here: elastic/ruby-lumberjack#10
