Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

openssl ssl_accept sometimes does not return, causes server to hang permanently #8108

Closed
rdp opened this issue Aug 22, 2019 · 5 comments · Fixed by #9177
Closed

openssl ssl_accept sometimes does not return, causes server to hang permanently #8108

rdp opened this issue Aug 22, 2019 · 5 comments · Fixed by #9177

Comments

@rdp
Copy link
Contributor

rdp commented Aug 22, 2019

Requisite code sample:

require "socket"
require "openssl"

tcp_server = TCPServer.new(8081)

ssl_context = OpenSSL::SSL::Context::Server.new
ssl_context.certificate_chain = "cert.pem"
ssl_context.private_key = "_key.pem"
ssl_server = OpenSSL::SSL::Server.new(tcp_server, ssl_context)

puts "SSL Server listening on #{tcp_server.local_address}"
while true
    begin
      connection = ssl_server.accept
      connection.close rescue nil
    rescue ex
       print "unable to accept", ex
    end
end

I discovered that my server "in production", with similar code underneath, will sometimes hang and stop accepting connections seemingly randomly (incoming connections enter SYN_RECV [TCP backlog]) but no new connections are accepted by the crystal app.

My hypothesis from adding some puts calls is that it enters the call to SSL_Accept and never returns. Turns out there there are several reads and writes within SSL_Accept so if you have a "rogue connection" (or malicious?) that does the TLS handshake "half way" (or even, just connects and does nothing), SSL_Accept won't exit (it's waiting on reads that never come). since the servers are single threaded through the "wrapped accept" portion, this effectively hangs the server. This affects the HTTP server if it uses SSL, etc. HTTP works fine.

Seemingly related discussion here: https://openssl-users.openssl.narkive.com/5QNG9OWX/ssl-accept-hang

AFAICT you can repro it by running the snippet above and connecting and sending nothing, ex: this:

require "socket"
TCPSocket.open("localhost", 8081) { |socket|
  puts "connected #{socket}, sleeping"
  sleep
}

After running both then all future connections to the server "aren't accepted", ex: openssl s_client -connect localhost:8081 works before you run that, and afterward hangs on connect.

crystal 0.30.1

Ubuntu 19,04

@rdp
Copy link
Contributor Author

rdp commented Aug 22, 2019

current workaround, in case interesting:

diff --git a/src/openssl/ssl/server.cr b/src/openssl/ssl/server.cr
index b31e53090..1280bc786 100644
--- a/src/openssl/ssl/server.cr
+++ b/src/openssl/ssl/server.cr
@@ -60,7 +60,10 @@ class OpenSSL::SSL::Server
   #
   # This method calls `@wrapped.accept` and wraps the resulting IO in a SSL socket (`OpenSSL::SSL::Socket::Server`) with `context` configu
ration.
   def accept : OpenSSL::SSL::Socket::Server
-    OpenSSL::SSL::Socket::Server.new(@wrapped.accept, @context, sync_close: @sync_close)
+    socket = @wrapped.accept
+    socket.as(TCPSocket).read_timeout = 60.seconds
+    socket.as(TCPSocket).write_timeout = 60.seconds
+    OpenSSL::SSL::Socket::Server.new(socket, @context, sync_close: @sync_close)
   end
 
   # Implements `::Socket::Server#accept?`.
@@ -68,6 +71,9 @@ class OpenSSL::SSL::Server
   # This method calls `@wrapped.accept?` and wraps the resulting IO in a SSL socket (`OpenSSL::SSL::Socket::Server`) with `context` configuration.
   def accept? : OpenSSL::SSL::Socket::Server?
     if socket = @wrapped.accept?
+      socket.as(TCPSocket).read_timeout = 60.seconds
+      socket.as(TCPSocket).write_timeout = 60.seconds
+
       OpenSSL::SSL::Socket::Server.new(socket, @context, sync_close: @sync_close)
     end
   end

There are some other questions at play, like the fact that it can only accept one client "at a time" so if one client has a slow TLS handshake, the others have to wait for that handshake to fully complete before being accepted (sometimes takes seconds)?

@RX14
Copy link
Contributor

RX14 commented Sep 5, 2019

This is actually a case of over-abstraction harming us

SSL_Accept goes through a BIO which uses normal crystal IO. So if the SSL::*::Socket.new is called in a fiber per-request, then I'm sure that a single slow client could not block the server, since other fibers could continue working - just like normal crystal IO.

The problem is that the ssl_server.accept is called from the server fiber, not the per-request fiber. Since we have to accept the socket before having a connection to spawn the fiber.

So the only correct way to have a SSL::Server is to accept a TCP socket, spawn a fiber for the connection, then handle TLS.

There's no way to do that with a SSL::Server API which is compatible with TCPServer.

@jhass
Copy link
Member

jhass commented Sep 5, 2019

Otoh I guess we don't want to accept an infinite amount of hanging clients anyway, so we don't exhaust our FDs or something else. So the solution seems two fold, first we want to make sure to not hang indefinitely by making sure the IO timeout is set and works. Second, can we just spawn a pool of fibers to accept and send the socket back through a channel?

clients = Channel(IO).new(MAX_PENDING_CONNECTIONS)
MAX_PENDING_CONNECTIONS.times do
  spawn do
    loop do
      channel.send accept
    end
  end
end

loop do
  spawn handle_client(clients.receive)
end

Idk, seems too easy, I'm probably missing something.

@RX14
Copy link
Contributor

RX14 commented Sep 16, 2019

@jhass the number of pending clients can be controlled through read_timeout and write_timeout, if these are set before passing the socket to SSL::*::Socket.new.

I don't like the idea of adding all these fibers and complexity just to maintain the API style. To me, it seems much simpler to embrace the fact that SSL::*::Socket.new is a blocking call and remove SSL::Server.

@rdp
Copy link
Contributor Author

rdp commented Feb 21, 2020

unrelated: Crystal maybe should set so_keepalive by default so that people who "forget" to set a read timeout (and the connection is lost somehow) don't accidentally orphan fibers...forever...just an idea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants