Nginx not closing persistent WebSocket connections of unavailable/unhealthy upstream server #1063
Comments
I don't have experience with nginx configured for this setup. Have you tried HAProxy?
Excuse me for the late response. I didn't get a chance to try HAProxy. I didn't find any documentation that states HAProxy closes persistent connections once a node fails its health check. What I learned from my research is that new traffic/connections will not be made to an unhealthy node.

At this moment, I am trying to handle this at the application level. Our application knows when we make it unavailable for traffic; in a way, it makes itself fail the health check so that the load balancer will mark it as unhealthy. We were thinking of triggering a code path to drain the connections and make the clients retry, so that the load balancer will route the new requests to healthy nodes. We tried the following approaches; approach 2 is working the way I want, but I am not sure whether it is the best approach to drain CometD sessions on a node.
Thanks
If you fail the health check from the application, meaning that there is no real problem (e.g. OS crash or network down), then naturally the WebSocket connections will stay open. This means that if you fail the health check from the application, you should disconnect the clients using an application message. What I would do is the following:
- When you decide to take the node out of rotation, fail the health check and wait until the load balancer has marked the node unhealthy.
- Broadcast an application message on a channel the clients subscribe to, telling them to disconnect.
- On receiving that message, the clients disconnect and handshake again; the load balancer routes the new connection to a healthy node.
That should be enough. Note that I think this solution is simpler than your solution 2 above; it's much faster, as the disconnect + handshake are almost immediate (no need to wait 30+10 seconds), and it's fully under your control because you use an application message on your own channel.
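For reference, a minimal sketch of what such a broadcast could look like with the CometD 3.x server API (the version mentioned later in this thread); the channel name `/app/shutdown`, the payload, and the class name are made up for illustration, not something CometD defines:

```java
import java.util.HashMap;
import java.util.Map;

import org.cometd.bayeux.MarkedReference;
import org.cometd.bayeux.server.BayeuxServer;
import org.cometd.bayeux.server.ServerChannel;

public class ShutdownBroadcaster {
    private final BayeuxServer bayeuxServer;

    public ShutdownBroadcaster(BayeuxServer bayeuxServer) {
        this.bayeuxServer = bayeuxServer;
    }

    /**
     * Publishes an application message telling clients to drop their
     * current connection and handshake again, so that the load balancer
     * routes the new connection to a healthy node.
     */
    public void broadcastShutdown() {
        // "/app/shutdown" is a hypothetical broadcast channel that the
        // clients are assumed to subscribe to.
        MarkedReference<ServerChannel> ref =
                bayeuxServer.createChannelIfAbsent("/app/shutdown");
        Map<String, Object> data = new HashMap<>();
        data.put("action", "rehandshake");
        // CometD 3.x API: ServerChannel.publish(Session from, Object data)
        // delivers the message to all subscribers of the channel.
        ref.getReference().publish(null, data);
    }
}
```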
My understanding is that whenever a client stops sending heartbeats, the server will clean up the session associated with that client; I am assuming the WebSocket connection will be closed as well. With my approach I am simulating a long network failure to remove the session from the server: the client won't receive a meta/connect response from the server, and it will wait for 30 seconds plus the 10-second timeout before making a new request to the server. Frankly, the server holding on to WebSocket connections of clients that are not connected any more scares me.

I like your approach and I did think about it. In our environment, even though the application node knows when it should start failing health checks, the load balancer is configured to consider the server unavailable/unhealthy only after it fails three health checks. Each health check is configured with an interval of 8 seconds, which means 24 seconds are needed for the load balancer to determine that an application node is unhealthy. Even if we make the server send a shutdown message over a channel immediately after the node becomes unavailable for traffic, the load balancer may still route traffic to the same application node (sticky sessions).

I made some more optimizations to my code by only silencing connections that are on the WebSocket transport. For long polling the connections will move to a healthy node, as they are not persistent connections with respect to the load balancer. As I am going through the code, I see ServerSessionImpl.cancelSchedule, which seems to do the same: it cancels the scheduler so that the client will not get any response for meta/connect. By the way, we are using CometD version 3.1; cancelSchedule is used in the destroy method of CometDServlet.

Thanks
That's correct. However, my suggestion is that you should not piggyback failover from one server to another on a failure-and-cleanup mechanism.
The server does not do that; I don't know where you got that impression.
Well then my solution is trivial to implement: wait for 3 failures, then send the application shutdown message. Immediate failover and you are done.
I really hope this is not true, and that you do have persistent connections between the load balancer and the backend server. I also recommend that you don't go too deep into the CometD code to hook into internal methods, because you will have a really hard time when you need to upgrade CometD. I think it is possible for you to stick with the official, standard CometD APIs to implement failover in a fully controlled and fast way.
Time to upgrade to CometD 5.0.x, see the versions table in this section of the documentation: https://docs.cometd.org/current5/reference/#_preface.
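A hedged sketch of the "wait for 3 failures, then send the application shutdown message" sequencing, assuming the application itself flips its health check to failing and then waits out the load balancer's detection window (three failed checks at 8-second intervals, i.e. 24 seconds, per the numbers above) before broadcasting; the class name and the extra margin are invented, and the `Runnable` would be something like the broadcast from the earlier sketch:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DrainCoordinator {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final Runnable broadcastShutdown;
    private volatile boolean draining;

    public DrainCoordinator(Runnable broadcastShutdown) {
        this.broadcastShutdown = broadcastShutdown;
    }

    /** The health check endpoint should return a failure while this is true. */
    public boolean isDraining() {
        return draining;
    }

    /**
     * Takes the node out of service: start failing the health check, then,
     * once the load balancer has had time to mark the node unhealthy
     * (3 failed checks x 8-second interval = 24 seconds, plus a small
     * margin), tell the connected clients to re-handshake elsewhere.
     */
    public void startDraining() {
        draining = true;
        scheduler.schedule(broadcastShutdown, 30, TimeUnit.SECONDS);
    }
}
```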
Maybe I misunderstood your previous response. If I make the server not respond to meta/connect, the client will wait for 30 seconds (heartbeat) plus the 10-second timeout, and then it will send a meta/connect heartbeat message on a new connection. In our environment we clean up a client session 35 seconds after the client last sent a meta/connect message to the server. In a way the server will think that the client is not active, even though the server itself made the client inactive by not responding to meta/connect.

**_If you fail the health check from the application, meaning that there is no real problem (e.g. OS crash or network down), then naturally the WebSocket connections will stay open. This means that if you fail the health check from the application, you should disconnect the clients using an application message._**
For HTTP traffic, Nginx is configured to maintain a pool of persistent connections between the LB and the backend server. These connections are shared by the incoming requests (multiplexing). But for WebSocket there is a persistent connection from client to server; for HTTP the connection from LB to backend servers is persistent, but the one from client to LB is not. As long polling is over HTTP, the client makes a request and, once there is a response, makes a new request to the server through the LB. Because there is a sequence of requests and responses over HTTP, the load balancer gets an opportunity to route to a healthy node. With WebSockets, since there is a persistent connection, Nginx is not closing it when an application node fails the health check; only new connections/requests get routed to a healthy node. So this problem does not apply to HTTP traffic (long polling).

We made a lot of customizations on top of CometD. Upgrading to a newer version will come with some cost, and at this moment it is not prioritized. I have to make a case to my management about the cost vs. benefits. Is there a better way to figure out what new features were added in 5 vs. 3 or 4? This will be a separate ask for sure.

Ideally I want to do what you have proposed, and I will try a quick spike. I am considering the cost of development, as it needs some client-side code changes because the client may not understand the new message (shutdown). One more challenge is that the app node has no knowledge of whether the LB considers it to have failed a health check, as that information is not available to the app node. I will give it a shot. Thank you once again for the support.
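On the client-side cost mentioned above: a rough sketch of the handling with the CometD Java client (the JavaScript client would be analogous); the endpoint URL and channel name are placeholders, the long-polling transport is used only to keep the sketch self-contained, and the re-handshake logic is the same over WebSocket:

```java
import java.util.HashMap;

import org.cometd.client.BayeuxClient;
import org.cometd.client.transport.LongPollingTransport;
import org.eclipse.jetty.client.HttpClient;

public class RehandshakeOnShutdown {
    public static void main(String[] args) throws Exception {
        HttpClient httpClient = new HttpClient();
        httpClient.start();

        // Placeholder URL for the CometD endpoint behind the load balancer.
        BayeuxClient client = new BayeuxClient("http://lb.example.com/cometd",
                new LongPollingTransport(new HashMap<>(), httpClient));

        client.handshake();
        client.waitFor(5_000, BayeuxClient.State.CONNECTED);

        // Subscribe to the hypothetical shutdown channel used by the server.
        client.getChannel("/app/shutdown").subscribe((channel, message) -> {
            // Re-handshake on a separate thread so the message-processing
            // thread is not blocked while disconnecting.
            new Thread(() -> {
                client.disconnect();
                client.waitFor(5_000, BayeuxClient.State.DISCONNECTED);
                // The load balancer routes this new handshake to a healthy
                // node; subscriptions (including this one) would need to be
                // re-established afterwards, which is omitted here.
                client.handshake();
            }).start();
        });
    }
}
```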
@rameshmurthy did you get any solution from nginx support?
Hello CometD Community,
This is not a problem with CometD.
I am reaching out for help with a problem we are currently facing with Nginx (used as a load balancer and configured to support WebSockets).
Problem statement: we are looking for a configuration or setup that makes Nginx sever persistent WebSocket connections to a server that has failed its health check.
In our infrastructure Nginx is used as a load balancer. We have configured Nginx to support WebSockets.
We use Nginx to perform a health check on the servers, and if any server fails the health check, Nginx will not route traffic to the unhealthy server. This is working well for HTTP traffic: new HTTP requests are routed to a healthy server, and similarly new WebSocket connections are routed to a healthy server. But Nginx is not closing the persistent connections it has already established for WebSocket traffic, which leaves the existing WebSocket connections sending traffic to the unhealthy node. We reached out to Nginx for help and I am waiting to hear from them.
I want to ask the CometD community whether you have faced any issues with load balancers and WebSockets with respect to a server in the pool being marked as unhealthy, even though the server is up, running, and responding to CometD heartbeat messages (meta/connect).
Thanks in advance.