Connections in CLOSE_WAIT are not closed and hang around seemingly forever #2169

Closed
janbartel opened this issue Jan 30, 2018 · 37 comments

@janbartel
Contributor

Alex Uhl recently commented on the old Jetty issue https://bugs.eclipse.org/bugs/show_bug.cgi?id=482117, which was closed in 2015 as WONT_FIX - or really a "can't fix in Java". Here are his comments, copied from Bugzilla; readers should follow the link above to the original issue for the full context:

We've been facing a very similar issue. For reasons too long to explain here, we're still running 9.2.2 in our production environment. An Amazon Application Load Balancer (ALB), more specifically one of its target groups, is doing health checks on our Jetty instances. Under heavy load, we recently "lost" two instances to the CLOSE_WAIT issue, with ~30k sockets in that state, running into the file descriptor limits and stalling the instances.

Maybe worth noting: the lsof output lists those sockets as all IPv6:

java 22664 xxxx 570u IPv6 57747443 0t0 TCP localhost:ddi-tcp-1->localhost:60836 (CLOSE_WAIT)

When we put an Apache reverse proxy between the ALB and the instance and have the ALB run its health check through the reverse proxy then the problem doesn't seem to occur. The Apache server will use IPv4 to connect to the Jetty instance.

It seems weird that Apache can handle this properly while Jetty lets it choke an entire server. I would really hope there are ways to get this fixed and would be happy if this bug could be reopened. In our case it poses a severe risk to our production environment if anyone sets up health checks straight against the Jetty port.

@axeluhl

axeluhl commented Jan 30, 2018

According to http://www.serverframework.com/asynchronousevents/2011/01/time-wait-and-its-design-implications-for-protocols-and-scalable-servers.html I suppose the problem is that the AWS Application Load Balancer target group's health check has sent the FIN and the Jetty server may have sent the ACK; however, nothing ever gets the Jetty socket out of CLOSE_WAIT. If I understand things correctly, it would be up to the server (Jetty in this case) to call close() on the socket after having received the FIN and sent the ACK. Only then would it send its own FIN to the health checker, and the socket could reach the CLOSED state.

Under these assumptions this really sounds like a Jetty issue to me. How come Jetty wouldn't close() the CLOSE_WAIT sockets after having received the FIN from the health checker?
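
For readers following the half-close reasoning above, here is a minimal blocking-socket sketch (illustrative only, not Jetty code; the port number is an assumption borrowed from the reports further down) of what the server side is expected to do once the peer's FIN arrives:

```java
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;

// Sketch of the close sequence described above: read() returning -1
// corresponds to the peer's FIN; only the subsequent close() sends our
// own FIN and lets the socket leave CLOSE_WAIT.
public class CloseOnFin {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(8888)) {
            while (true) {
                try (Socket socket = server.accept()) {
                    InputStream in = socket.getInputStream();
                    byte[] buffer = new byte[1024];
                    while (in.read(buffer) != -1) {
                        // handle request bytes; -1 means the client half-closed (sent FIN)
                    }
                } // try-with-resources closes the socket here, sending our FIN
            }
        }
    }
}
```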

@joakime
Member

joakime commented Jan 30, 2018

@axeluhl perform a server dump (see JMX) when the server is in this state.
That will hopefully give us a report of the connections Jetty knows about and the associated selector states.

Are you using SSL?
If so, are you up to date with your JVM? (An up-to-date, non-expired JVM is a requirement.)
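
For embedded setups, a hedged sketch of how the requested server dump can be obtained programmatically; the JMX dump operation used later in this thread produces the same report, and the port here is a placeholder, not taken from this issue:

```java
import org.eclipse.jetty.server.Server;

// Hedged sketch: Server is Dumpable, so the component tree (connectors,
// selector managers, endpoints, thread pool, handlers) can be printed
// directly instead of going through JMX.
public class ServerDumpExample {
    public static void main(String[] args) throws Exception {
        Server server = new Server(8080);   // placeholder port
        server.setDumpAfterStart(true);     // dump the tree once the server has started
        server.start();
        System.err.println(server.dump());  // same report as the JMX "dump" operation
        server.join();
    }
}
```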

@gregw
Contributor

gregw commented Jan 30, 2018

@axeluhl,

I agree it sounds like a jetty issue, but we've never been able to reproduce it. Typically we'd not look at a 9.2.2 issue without commercial support, but I have also seen reports of this on later versions, so I would really like to get to the bottom of it.

We need a reproduction so we can verify what is going on... or at least a bit more info. Is there any chance that you can get a tcpdump/wireshark trace of the health check conversations that end up in CLOSE_WAIT? I know it sounds like it should be obvious - i.e. AWS sends a FIN, we send an ACK, but then we never send a FIN - but it is key to understand exactly where in the conversation this happens. Jetty should read -1 when the FIN is received and close the connection, thus sending its own FIN, but obviously we are not doing this for these particular requests. A trace would let us try playing back a near-identical conversation and see if we can reproduce it.

Edit: also the server dump that @joakime asked for would be good... although I think it has less info in 9.2.2 than now.
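
To make the requested reproduction more concrete, here is one possible probe. It is a sketch only: it assumes the health checker's behaviour is "send the request, then half-close", and it borrows the request details (path, Connection: close, ELB-HealthChecker user agent) from the pcap decode that appears later in this thread; host and port are placeholders.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Send a health-check-style request, immediately half-close the sending side
// (this emits the client FIN), then wait to see whether the server ever
// answers and sends its own FIN (read() returning -1).
public class HealthCheckProbe {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("localhost", 8888)) {
            OutputStream out = socket.getOutputStream();
            out.write(("GET /index.html HTTP/1.1\r\n"
                     + "Host: localhost\r\n"
                     + "Connection: close\r\n"
                     + "User-Agent: ELB-HealthChecker/2.0\r\n\r\n")
                     .getBytes(StandardCharsets.US_ASCII));
            out.flush();
            socket.shutdownOutput();                 // client-side FIN
            InputStream in = socket.getInputStream();
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) { // -1 == server FIN received
                System.out.write(buffer, 0, read);
            }
            System.out.flush();
        }                                            // if the loop never ends, the server never sent FIN
    }
}
```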

@gregw gregw self-assigned this Jan 30, 2018
@axeluhl

axeluhl commented Jan 30, 2018

We're using the following VM (in reply to @joakime 's question):
java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 8.1.032)
SAP Java Server VM (build 8.1.032 25.51-b02, Jul 27 2017 16:40:00 - 81_REL - optU - linux amd64 - 6 - bas2:292506 (mixed mode))

I also managed to connect to the JVM using JConsole, navigate to the org.eclipse.jetty.jmx/mbeancontainer/0/Operations/dump node and produce a dump. What should I look for in the output? I cannot find anything about "selector" or "Selector" in the dump.

@axeluhl

axeluhl commented Jan 30, 2018

And no, our Jetty installation doesn't use SSL. It runs in an Equinox OSGi environment, together with several web bundles that it serves. The load balancer's target group was configured to fetch a static HTML resource from one of these web bundles as its health check, over plain HTTP against our Jetty port (8888 in this case).

@axeluhl

axeluhl commented Jan 30, 2018

Ah, now I think I see. org.eclipse.jetty.server/serverconnector/HTTP/1.1@1e35587a/0/Operations has the dump() method you were probably looking for. I've attached the output.
jetty_jmx_serverconnector_dump.txt

Here is what the netstat -anp output looks like when grepped for port 8888:

tcp 0 0 :::8888 :::* LISTEN 22664/java
tcp 144 0 ::ffff:172.31.24.254:8888 ::ffff:172.31.30.64:57477 CLOSE_WAIT 22664/java
tcp 144 0 ::ffff:172.31.24.254:8888 ::ffff:172.31.14.28:10562 CLOSE_WAIT 22664/java
tcp 144 0 ::ffff:172.31.24.254:8888 ::ffff:172.31.30.64:5980 CLOSE_WAIT 22664/java
tcp 144 0 ::ffff:172.31.24.254:8888 ::ffff:172.31.28.186:58519 CLOSE_WAIT 22664/java
tcp 144 0 ::ffff:172.31.24.254:8888 ::ffff:172.31.12.130:3295 CLOSE_WAIT 22664/java
tcp 144 0 ::ffff:172.31.24.254:8888 ::ffff:172.31.28.186:59858 CLOSE_WAIT 22664/java
tcp 144 0 ::ffff:172.31.24.254:8888 ::ffff:172.31.14.28:47281 CLOSE_WAIT 22664/java
tcp 144 0 ::ffff:172.31.24.254:8888 ::ffff:172.31.14.28:31901 CLOSE_WAIT 22664/java
tcp 144 0 ::ffff:172.31.24.254:8888 ::ffff:172.31.28.186:22707 CLOSE_WAIT 22664/java
tcp 144 0 ::ffff:172.31.24.254:8888 ::ffff:172.31.12.130:9390 CLOSE_WAIT 22664/java
tcp 144 0 ::ffff:172.31.24.254:8888 ::ffff:172.31.14.28:17100 CLOSE_WAIT 22664/java
tcp 144 0 ::ffff:172.31.24.254:8888 ::ffff:172.31.12.130:53936 CLOSE_WAIT 22664/java
tcp 144 0 ::ffff:172.31.24.254:8888 ::ffff:172.31.30.64:18720 CLOSE_WAIT 22664/java
tcp 144 0 ::ffff:172.31.24.254:8888 ::ffff:172.31.14.28:53830 CLOSE_WAIT 22664/java
tcp 144 0 ::ffff:172.31.24.254:8888 ::ffff:172.31.14.28:7410 CLOSE_WAIT 22664/java
tcp 144 0 ::ffff:172.31.24.254:8888 ::ffff:172.31.30.64:1888 CLOSE_WAIT 22664/java
tcp 144 0 ::ffff:172.31.24.254:8888 ::ffff:172.31.28.186:52451 CLOSE_WAIT 22664/java
tcp 144 0 ::ffff:172.31.24.254:8888 ::ffff:172.31.28.186:5643 CLOSE_WAIT 22664/java
tcp 144 0 ::ffff:172.31.24.254:8888 ::ffff:172.31.30.64:43964 CLOSE_WAIT 22664/java
tcp 144 0 ::ffff:172.31.24.254:8888 ::ffff:172.31.28.186:48242 CLOSE_WAIT 22664/java
tcp 144 0 ::ffff:172.31.24.254:8888 ::ffff:172.31.12.130:37038 CLOSE_WAIT 22664/java
tcp 144 0 ::ffff:172.31.24.254:8888 ::ffff:172.31.30.64:52054 CLOSE_WAIT 22664/java
tcp 144 0 ::ffff:172.31.24.254:8888 ::ffff:172.31.30.64:39976 CLOSE_WAIT 22664/java
tcp 144 0 ::ffff:172.31.24.254:8888 ::ffff:172.31.14.28:6930 CLOSE_WAIT 22664/java
tcp 144 0 ::ffff:172.31.24.254:8888 ::ffff:172.31.14.28:3658 CLOSE_WAIT 22664/java
tcp 144 0 ::ffff:172.31.24.254:8888 ::ffff:172.31.14.28:4697 CLOSE_WAIT 22664/java
tcp 144 0 ::ffff:172.31.24.254:8888 ::ffff:172.31.12.130:49455 CLOSE_WAIT 22664/java
tcp 144 0 ::ffff:172.31.24.254:8888 ::ffff:172.31.28.186:51287 CLOSE_WAIT 22664/java
.... (24000 more entries...)

The output of lsof -p 22664 grepped for ddi-tcp-1 gives:

java 22664 sailing 79u IPv6 57717081 0t0 TCP localhost:ddi-tcp-1->localhost:56842 (CLOSE_WAIT)
java 22664 sailing 148u IPv6 57777407 0t0 TCP localhost:ddi-tcp-1->localhost:59172 (CLOSE_WAIT)
java 22664 sailing 164u IPv6 57734908 0t0 TCP localhost:ddi-tcp-1->localhost:43452 (CLOSE_WAIT)
java 22664 sailing 172u IPv6 682108910 0t0 TCP ip-172-31-24-254.eu-west-1.compute.internal:ddi-tcp-1->ip-172-31-28-186.eu-west-1.compute.internal:61774 (CLOSE_WAIT)
java 22664 sailing 179u IPv6 57761230 0t0 TCP localhost:ddi-tcp-1->localhost:52958 (CLOSE_WAIT)
java 22664 sailing 181u IPv6 57771350 0t0 TCP localhost:ddi-tcp-1->localhost:59126 (CLOSE_WAIT)
java 22664 sailing 182u IPv6 57747370 0t0 TCP localhost:ddi-tcp-1->localhost:59558 (CLOSE_WAIT)
java 22664 sailing 183u IPv6 57699074 0t0 TCP localhost:ddi-tcp-1->localhost:56334 (CLOSE_WAIT)
java 22664 sailing 184u IPv6 57756826 0t0 TCP localhost:ddi-tcp-1->localhost:50854 (CLOSE_WAIT)
java 22664 sailing 185u IPv6 57728970 0t0 TCP localhost:ddi-tcp-1->localhost:46646 (CLOSE_WAIT)
java 22664 sailing 186u IPv6 57741100 0t0 TCP localhost:ddi-tcp-1->localhost:46478 (CLOSE_WAIT)
java 22664 sailing 189u IPv6 57693207 0t0 TCP localhost:ddi-tcp-1->localhost:43752 (CLOSE_WAIT)
java 22664 sailing 190u IPv6 57700058 0t0 TCP localhost:ddi-tcp-1->localhost:55922 (CLOSE_WAIT)
java 22664 sailing 193u IPv6 520895331 0t0 TCP ip-172-31-24-254.eu-west-1.compute.internal:ddi-tcp-1->ip-172-31-30-64.eu-west-1.compute.internal:32114 (CLOSE_WAIT)
java 22664 sailing 196u IPv6 57756684 0t0 TCP localhost:ddi-tcp-1->localhost:47362 (CLOSE_WAIT)
java 22664 sailing 197u IPv6 57737757 0t0 TCP localhost:ddi-tcp-1->localhost:47524 (CLOSE_WAIT)
java 22664 sailing 198u IPv6 56462726 0t0 TCP localhost:ddi-tcp-1->localhost:55548 (CLOSE_WAIT)
java 22664 sailing 199u IPv6 57705241 0t0 TCP localhost:ddi-tcp-1->localhost:57876 (CLOSE_WAIT)
java 22664 sailing 200u IPv6 57737973 0t0 TCP localhost:ddi-tcp-1->localhost:53606 (CLOSE_WAIT)
java 22664 sailing 201u IPv6 57747372 0t0 TCP localhost:ddi-tcp-1->localhost:59568 (CLOSE_WAIT)
java 22664 sailing 202u IPv6 57772224 0t0 TCP localhost:ddi-tcp-1->localhost:58284 (CLOSE_WAIT)
java 22664 sailing 205u IPv6 57747371 0t0 TCP localhost:ddi-tcp-1->localhost:59562 (CLOSE_WAIT)
java 22664 sailing 206u IPv6 57752398 0t0 TCP localhost:ddi-tcp-1->localhost:59214 (CLOSE_WAIT)
java 22664 sailing 207u IPv6 57714357 0t0 TCP localhost:ddi-tcp-1->localhost:59654 (CLOSE_WAIT)
java 22664 sailing 208u IPv6 57756733 0t0 TCP localhost:ddi-tcp-1->localhost:48590 (CLOSE_WAIT)
... (and around 24k more)

@gregw
Contributor

gregw commented Jan 31, 2018

@axeluhl thanks for that - it does confirm that this is a Jetty bug. You can see the state of the endpoints is io=0,kio=0,kro=1, i.e. we want to be interested in no I/O events and the SelectionKey is registered for no I/O events, yet there is a read event (the -1 close) ready to be read.

I'll do you a deal! If you can get me a tcpdump or wireshark capture that gives me enough information to reproduce on 9.2.2 (so I can then test if 9.4.x is vulnerable), I'll produce a patch for 9.2.2 (even though it is end-of-life) for you... even if 9.4.x is not vulnerable to this.
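
As an illustration of the endpoint state described above (interested in no I/O events while a read, the -1 close, is pending), here is a minimal NIO sketch. It is not Jetty's actual SelectorManager code; it only reproduces the shape of the stuck state: a connection registered with no read interest never wakes the selector for the peer's FIN, so close() is never called and the socket sits in CLOSE_WAIT.

```java
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class StuckKeyDemo {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(8888));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);
        while (true) {
            selector.select();
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isAcceptable()) {
                    SocketChannel client = ((ServerSocketChannel) key.channel()).accept();
                    client.configureBlocking(false);
                    // The bug being illustrated: no OP_READ interest is ever set,
                    // so when the peer sends FIN nothing reads the -1 and nothing
                    // closes the channel -- netstat shows CLOSE_WAIT forever.
                    client.register(selector, 0);
                }
            }
            selector.selectedKeys().clear();
        }
    }
}
```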

@axeluhl

axeluhl commented Jan 31, 2018

@gregw, cool stuff. Are you saying 9.4.x wouldn't be susceptible to this type of issue anymore? That would be reason enough for us to upgrade.
As for reproduction: probably not too easy. I guess I would somehow have to simulate the "high load" scenario. The easiest approach I can think of is putting the VM into debug mode and pausing it where Jetty is trying to respond to the health checker's request. Would you happen to have a good hint for where to place such a breakpoint? I would then hope that the ALB times out and closes the connection, and then we could see if the issue reproduces this way. If it does, I can work on capturing it with Wireshark.
However, I have to admit, if 9.4.x solves the problem I'd rather work on the upgrade than invest the time in reliably reproducing the issue on 9.2.2...

@axeluhl

axeluhl commented Jan 31, 2018

Here's a tcpdump of what the load balancer tries when health-checking the broken instance:
[root@ip-172-31-24-254 ~]# tcpdump 'tcp port 8888'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
10:10:43.564650 IP ip-172-31-28-241.eu-west-1.compute.internal.34954 > ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1: Flags [S], seq 3353627164, win 26883, options [mss 8961,sackOK,TS val 44130671 ecr 0,nop,wscale 8], length 0
10:10:43.564744 IP ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1 > ip-172-31-28-241.eu-west-1.compute.internal.34954: Flags [S.], seq 2489394674, ack 3353627165, win 26847, options [mss 8961,sackOK,TS val 168740666 ecr 44130671,nop,wscale 7], length 0
10:10:43.564921 IP ip-172-31-28-241.eu-west-1.compute.internal.34954 > ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1: Flags [.], ack 1, win 106, options [nop,nop,TS val 44130671 ecr 168740666], length 0
10:10:43.565030 IP ip-172-31-28-241.eu-west-1.compute.internal.34954 > ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1: Flags [P.], seq 1:144, ack 1, win 106, options [nop,nop,TS val 44130671 ecr 168740666], length 143
10:10:43.565039 IP ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1 > ip-172-31-28-241.eu-west-1.compute.internal.34954: Flags [.], ack 144, win 219, options [nop,nop,TS val 168740666 ecr 44130671], length 0
10:10:43.565167 IP ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1 > ip-172-31-28-241.eu-west-1.compute.internal.34954: Flags [R.], seq 1, ack 144, win 219, options [nop,nop,TS val 168740666 ecr 44130671], length 0
10:10:43.574008 IP ip-172-31-15-197.eu-west-1.compute.internal.9794 > ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1: Flags [S], seq 1989954654, win 26883, options [mss 8961,sackOK,TS val 95979199 ecr 0,nop,wscale 8], length 0
10:10:43.574031 IP ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1 > ip-172-31-15-197.eu-west-1.compute.internal.9794: Flags [S.], seq 3652515805, ack 1989954655, win 26847, options [mss 8961,sackOK,TS val 168740668 ecr 95979199,nop,wscale 7], length 0
10:10:43.576822 IP ip-172-31-15-197.eu-west-1.compute.internal.9794 > ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1: Flags [.], ack 1, win 106, options [nop,nop,TS val 95979200 ecr 168740668], length 0
10:10:43.577103 IP ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1 > ip-172-31-15-197.eu-west-1.compute.internal.9794: Flags [F.], seq 1, ack 1, win 210, options [nop,nop,TS val 168740669 ecr 95979200], length 0
10:10:43.578185 IP ip-172-31-15-197.eu-west-1.compute.internal.9794 > ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1: Flags [P.], seq 1:144, ack 1, win 106, options [nop,nop,TS val 95979200 ecr 168740668], length 143
10:10:43.578207 IP ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1 > ip-172-31-15-197.eu-west-1.compute.internal.9794: Flags [R], seq 3652515806, win 0, length 0
10:10:43.580445 IP ip-172-31-15-197.eu-west-1.compute.internal.9794 > ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1: Flags [.], ack 2, win 106, options [nop,nop,TS val 95979201 ecr 168740669], length 0
10:10:43.580460 IP ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1 > ip-172-31-15-197.eu-west-1.compute.internal.9794: Flags [R], seq 3652515807, win 0, length 0
10:10:43.581347 IP ip-172-31-15-197.eu-west-1.compute.internal.9794 > ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1: Flags [F.], seq 144, ack 2, win 106, options [nop,nop,TS val 95979201 ecr 168740669], length 0
10:10:43.581354 IP ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1 > ip-172-31-15-197.eu-west-1.compute.internal.9794: Flags [R], seq 3652515807, win 0, length 0
10:10:48.542141 IP ip-172-31-28-241.eu-west-1.compute.internal.35244 > ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1: Flags [S], seq 117847519, win 26883, options [mss 8961,sackOK,TS val 44131915 ecr 0,nop,wscale 8], length 0
10:10:48.542176 IP ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1 > ip-172-31-28-241.eu-west-1.compute.internal.35244: Flags [S.], seq 3621088146, ack 117847520, win 26847, options [mss 8961,sackOK,TS val 168741910 ecr 44131915,nop,wscale 7], length 0
10:10:48.542346 IP ip-172-31-28-241.eu-west-1.compute.internal.35244 > ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1: Flags [.], ack 1, win 106, options [nop,nop,TS val 44131915 ecr 168741910], length 0
10:10:48.542374 IP ip-172-31-28-241.eu-west-1.compute.internal.35244 > ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1: Flags [P.], seq 1:144, ack 1, win 106, options [nop,nop,TS val 44131915 ecr 168741910], length 143
10:10:48.542381 IP ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1 > ip-172-31-28-241.eu-west-1.compute.internal.35244: Flags [.], ack 144, win 219, options [nop,nop,TS val 168741910 ecr 44131915], length 0
10:10:48.542627 IP ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1 > ip-172-31-28-241.eu-west-1.compute.internal.35244: Flags [R.], seq 1, ack 144, win 219, options [nop,nop,TS val 168741910 ecr 44131915], length 0
10:10:48.545986 IP ip-172-31-15-197.eu-west-1.compute.internal.emcrmirccd > ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1: Flags [S], seq 208667035, win 26883, options [mss 8961,sackOK,TS val 95980443 ecr 0,nop,wscale 8], length 0
10:10:48.545997 IP ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1 > ip-172-31-15-197.eu-west-1.compute.internal.emcrmirccd: Flags [S.], seq 1120497861, ack 208667036, win 26847, options [mss 8961,sackOK,TS val 168741911 ecr 95980443,nop,wscale 7], length 0
10:10:48.546523 IP ip-172-31-15-197.eu-west-1.compute.internal.emcrmirccd > ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1: Flags [.], ack 1, win 106, options [nop,nop,TS val 95980443 ecr 168741911], length 0
10:10:48.546676 IP ip-172-31-15-197.eu-west-1.compute.internal.emcrmirccd > ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1: Flags [P.], seq 1:144, ack 1, win 106, options [nop,nop,TS val 95980443 ecr 168741911], length 143
10:10:48.546690 IP ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1 > ip-172-31-15-197.eu-west-1.compute.internal.emcrmirccd: Flags [.], ack 144, win 219, options [nop,nop,TS val 168741911 ecr 95980443], length 0
10:10:48.546761 IP ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1 > ip-172-31-15-197.eu-west-1.compute.internal.emcrmirccd: Flags [R.], seq 1, ack 144, win 219, options [nop,nop,TS val 168741911 ecr 95980443], length 0
10:10:53.540376 IP ip-172-31-28-241.eu-west-1.compute.internal.35510 > ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1: Flags [S], seq 2999687365, win 26883, options [mss 8961,sackOK,TS val 44133165 ecr 0,nop,wscale 8], length 0
10:10:53.540428 IP ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1 > ip-172-31-28-241.eu-west-1.compute.internal.35510: Flags [S.], seq 3347643045, ack 2999687366, win 26847, options [mss 8961,sackOK,TS val 168743160 ecr 44133165,nop,wscale 7], length 0
10:10:53.540617 IP ip-172-31-28-241.eu-west-1.compute.internal.35510 > ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1: Flags [.], ack 1, win 106, options [nop,nop,TS val 44133165 ecr 168743160], length 0
10:10:53.540654 IP ip-172-31-28-241.eu-west-1.compute.internal.35510 > ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1: Flags [P.], seq 1:144, ack 1, win 106, options [nop,nop,TS val 44133165 ecr 168743160], length 143
10:10:53.540666 IP ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1 > ip-172-31-28-241.eu-west-1.compute.internal.35510: Flags [.], ack 144, win 219, options [nop,nop,TS val 168743160 ecr 44133165], length 0
10:10:53.540921 IP ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1 > ip-172-31-28-241.eu-west-1.compute.internal.35510: Flags [R.], seq 1, ack 144, win 219, options [nop,nop,TS val 168743160 ecr 44133165], length 0
10:10:53.543544 IP ip-172-31-15-197.eu-west-1.compute.internal.10192 > ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1: Flags [S], seq 1627859977, win 26883, options [mss 8961,sackOK,TS val 95981692 ecr 0,nop,wscale 8], length 0
10:10:53.543572 IP ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1 > ip-172-31-15-197.eu-west-1.compute.internal.10192: Flags [S.], seq 2779160040, ack 1627859978, win 26847, options [mss 8961,sackOK,TS val 168743160 ecr 95981692,nop,wscale 7], length 0
10:10:53.544171 IP ip-172-31-15-197.eu-west-1.compute.internal.10192 > ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1: Flags [.], ack 1, win 106, options [nop,nop,TS val 95981692 ecr 168743160], length 0
10:10:53.544339 IP ip-172-31-15-197.eu-west-1.compute.internal.10192 > ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1: Flags [P.], seq 1:144, ack 1, win 106, options [nop,nop,TS val 95981692 ecr 168743160], length 143
10:10:53.544341 IP ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1 > ip-172-31-15-197.eu-west-1.compute.internal.10192: Flags [F.], seq 1, ack 1, win 210, options [nop,nop,TS val 168743161 ecr 95981692], length 0
10:10:53.544361 IP ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1 > ip-172-31-15-197.eu-west-1.compute.internal.10192: Flags [R], seq 2779160041, win 0, length 0
10:10:53.545035 IP ip-172-31-15-197.eu-west-1.compute.internal.10192 > ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1: Flags [.], ack 2, win 106, options [nop,nop,TS val 95981692 ecr 168743161], length 0
10:10:53.545052 IP ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1 > ip-172-31-15-197.eu-west-1.compute.internal.10192: Flags [R], seq 2779160042, win 0, length 0
10:10:53.545057 IP ip-172-31-15-197.eu-west-1.compute.internal.10192 > ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1: Flags [F.], seq 144, ack 2, win 106, options [nop,nop,TS val 95981692 ecr 168743161], length 0
10:10:53.545061 IP ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1 > ip-172-31-15-197.eu-west-1.compute.internal.10192: Flags [R], seq 2779160042, win 0, length 0
10:10:58.544661 IP ip-172-31-15-197.eu-west-1.compute.internal.10386 > ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1: Flags [S], seq 2248858369, win 26883, options [mss 8961,sackOK,TS val 95982942 ecr 0,nop,wscale 8], length 0
10:10:58.544692 IP ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1 > ip-172-31-15-197.eu-west-1.compute.internal.10386: Flags [S.], seq 346325407, ack 2248858370, win 26847, options [mss 8961,sackOK,TS val 168744411 ecr 95982942,nop,wscale 7], length 0
10:10:58.545241 IP ip-172-31-28-241.eu-west-1.compute.internal.35796 > ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1: Flags [S], seq 295675558, win 26883, options [mss 8961,sackOK,TS val 44134416 ecr 0,nop,wscale 8], length 0
10:10:58.545252 IP ip-172-31-15-197.eu-west-1.compute.internal.10386 > ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1: Flags [.], ack 1, win 106, options [nop,nop,TS val 95982942 ecr 168744411], length 0
10:10:58.545257 IP ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1 > ip-172-31-28-241.eu-west-1.compute.internal.35796: Flags [S.], seq 2804354528, ack 295675559, win 26847, options [mss 8961,sackOK,TS val 168744411 ecr 44134416,nop,wscale 7], length 0
10:10:58.545411 IP ip-172-31-28-241.eu-west-1.compute.internal.35796 > ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1: Flags [.], ack 1, win 106, options [nop,nop,TS val 44134416 ecr 168744411], length 0
10:10:58.545422 IP ip-172-31-15-197.eu-west-1.compute.internal.10386 > ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1: Flags [P.], seq 1:144, ack 1, win 106, options [nop,nop,TS val 95982942 ecr 168744411], length 143
10:10:58.545431 IP ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1 > ip-172-31-15-197.eu-west-1.compute.internal.10386: Flags [.], ack 144, win 219, options [nop,nop,TS val 168744411 ecr 95982942], length 0
10:10:58.545449 IP ip-172-31-28-241.eu-west-1.compute.internal.35796 > ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1: Flags [P.], seq 1:144, ack 1, win 106, options [nop,nop,TS val 44134416 ecr 168744411], length 143
10:10:58.545456 IP ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1 > ip-172-31-28-241.eu-west-1.compute.internal.35796: Flags [.], ack 144, win 219, options [nop,nop,TS val 168744411 ecr 44134416], length 0
10:10:58.545466 IP ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1 > ip-172-31-15-197.eu-west-1.compute.internal.10386: Flags [R.], seq 1, ack 144, win 219, options [nop,nop,TS val 168744411 ecr 95982942], length 0
10:10:58.545575 IP ip-172-31-24-254.eu-west-1.compute.internal.ddi-tcp-1 > ip-172-31-28-241.eu-west-1.compute.internal.35796: Flags [R.], seq 1, ack 144, win 219, options [nop,nop,TS val 168744411 ecr 44134416], length 0
^C
56 packets captured
56 packets received by filter
0 packets dropped by kernel

@gregw
Contributor

gregw commented Jan 31, 2018

@axeluhl I would like to think that 9.4.x is not vulnerable to this, but I cannot say for sure as I can't reproduce in 9.2.2, so I can't test the same scenario in 9.4.x. However, a lot of the work we have done in 9.4 has been on the whole close conversation, so there is a good chance it is fixed. 9.4.x is also going to be maintained for several years to come, while we are trying to drop 9.2.x as best we can.

So if you can upgrade, do upgrade... for many reasons not least that it may fix this bug.

Thanks for the trace dump, but I'd like to get it as a pcap file. Can you use the -w option to write the capture to a file and then attach it here?

thanks

@axeluhl

axeluhl commented Jan 31, 2018

There you go. HTH
tcpdump.zip

@gregw
Contributor

gregw commented Feb 1, 2018

@axeluhl thanks for that... but a small complication is that I go on vacation for 2 weeks from tomorrow, so I won't get a chance to look at it in detail until mid-February. Sorry for the delay.

@mohsinbeg

mohsinbeg commented Feb 7, 2018

Looks like I'm also running into this issue on Jetty 9.3.14, which is packaged into Solr 6.6.2.

solr [ /opt/solr ]$ netstat -ptan | awk '{print $6 " " $7 }' | sort | uniq -c 
   8425 CLOSE_WAIT - 
     92 ESTABLISHED - 
      1 FIN_WAIT2 - 
      1 Foreign Address 
      6 LISTEN - 
    712 TIME_WAIT - 
      1 established) 

solr [ /opt/solr ]$ echo "run -b org.eclipse.jetty.server:context=HTTP/1.1@63e2203c,id=0,type=serverconnector dump" | java -jar jmxterm-1.0.0-uber.jar -l localhost:18983 -v silent -n > jettyJmx.out

solr [ /opt/solr ]$ netstat -anp | grep CLOSE_WAIT | head -n1 
tcp        1      0 10.xxx.x.xx:8983        10.xxx.x.xxx:53873     CLOSE_WAIT  1/java 

solr [ /opt/solr ]$ grep "10.xxx.x.xxx:53873" jettyJmx.out 
 |   |   +- SelectionKey@5ee4ef9f{i=0}->SelectChannelEndPoint@69feb476{/10.xxx.x.xxx:53873<->8983,Open,in,out,-,-,1712/5000,HttpConnection@c93d7fa}{io=0/0,kio=0,kro=1}

solr [ /opt/solr ]$ cat jettyJmx.out | grep 8983 | grep SelectChannelEndPoint | grep "{io=0/0,kio=0,kro=1}" | wc -l 
8220 

@sbordet
Contributor

sbordet commented Feb 7, 2018

@axeluhl I'm looking at this issue. From the pcap I can see the following pattern repeated over and over:

c: SYN
s: SYN+ACK
c: ACK
c: GET /index.html HTTP/1.1
   Host: x.y.w.z
   Connection: close
   User-Agent: ELB-HealthChecker/2.0
   Accept-Encoding: gzip, compressed
s: ACK
s: RST

Jetty typically does not reply this abruptly with a RST to a good request like that.

Can you enable the server DEBUG logs?
Do you have any evidence that this request reaches Jetty at all?
Do you have a stack trace on the server that is printed when receiving such requests?
Do you have SO_LINGER configured differently from the default in the ServerConnector?
Do you have any lower-level TCP handling on that server that may send the RST back (e.g. iptables, ufw, etc.)?
What are the limits of your server with respect to the number of open files?

What we need to understand here is whether it's Jetty that sends that RST. We know from the pcap that it comes from the server machine; now we need to know whether it's really Jetty or something on that machine between the network and Jetty (the OS or other programs) - the Jetty DEBUG logs will tell us that.
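
On the DEBUG-logs question: for a stock Jetty 9 server, or an embedded one using Jetty's default StdErrLog (i.e. no SLF4J binding on the classpath), a hedged sketch of one way to switch them on; the port is a placeholder, not taken from this issue:

```java
import org.eclipse.jetty.server.Server;

// Hedged sketch: with the default StdErrLog, DEBUG logging for the whole
// org.eclipse.jetty tree can be enabled via a system property set before any
// Jetty class is loaded. The command-line equivalent is
//   java -Dorg.eclipse.jetty.LEVEL=DEBUG ...
public class DebugLoggingExample {
    public static void main(String[] args) throws Exception {
        System.setProperty("org.eclipse.jetty.LEVEL", "DEBUG");
        Server server = new Server(8080);   // placeholder port
        server.start();
        server.join();
    }
}
```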

@mohsinbeg

@sbordet - is there something I can provide, since this issue is impacting our production systems?

@sbordet
Contributor

sbordet commented Feb 7, 2018

@mohsinbeg we need to verify that your case is exactly the same as the one reported by @axeluhl.
Do you also have Jetty behind an ELB?
Do you have the same pcap behavior?

If so, what are your answers to my questions above?

@mohsinbeg

mohsinbeg commented Feb 8, 2018

Do you have any evidence that this request reaches Jetty at all ?

Yes.

Do you have a stack trace on the server that is printed when receiving such requests ?

Yes, Solr is logging all POST requests in its log.

Do you have SO_LINGER configured differently from the default in the ServerConnector ?

solr [ /opt/solr/server/etc ]$ grep soLingerTime jetty-http.xml
  <Set name="soLingerTime"><Property name="solr.jetty.http.soLingerTime" default="-1"/></Set>

Do you have any lower level TCP handling on that server that may send the RST back (e.g. iptables, ufw, etc.) ?

It's running inside a Kubernetes cluster, in Docker, on an AWS EC2 instance.

What is the limits of you server with respect to number of open files ?

solr [ /opt/solr ]$ sudo sysctl -a | grep net.core.somaxconn
net.core.somaxconn = 4096

solr [ /opt/solr ]$ sudo sysctl -a | grep fs.file-max
fs.file-max = 262144

solr [ /opt/solr ]$ sudo sysctl -a | grep vm.max_map_count
vm.max_map_count = 262144

Do you also have Jetty behind ELB ?

No

Do you have the same pcap behavior ?

tcpdump is attached.
mycap.zip

netstat before and after tcpdump capture

solr [ /opt/solr ]$ netstat -ptan | awk '{print $6 " " $7 }' | sort | uniq -c; date
      1 CLOSE_WAIT -
   5095 CLOSE_WAIT 1/java
      2 ESTABLISHED -
    174 ESTABLISHED 1/java
      1 FIN_WAIT2 1/java
      1 Foreign Address
      2 LISTEN -
      4 LISTEN 1/java
    497 TIME_WAIT -
      1 established)
Thu Feb  8 01:22:11 UTC 2018

solr [ /opt/solr ]$ sudo tcpdump 'tcp port 8983' -w mycap.pcap
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
^C1787 packets captured
1957 packets received by filter
55 packets dropped by kernel

solr [ /opt/solr ]$ netstat -ptan | awk '{print $6 " " $7 }' | sort | uniq -c; date
      1 CLOSE_WAIT -
   5108 CLOSE_WAIT 1/java
      2 ESTABLISHED -
    270 ESTABLISHED 1/java
      1 FIN_WAIT2 -
      1 Foreign Address
      2 LISTEN -
      4 LISTEN 1/java
    611 TIME_WAIT -
      1 established)
Thu Feb  8 01:22:18 UTC 2018

@Hi-Fi

Hi-Fi commented Feb 13, 2018

At least in our case, we saw a lot of EOFExceptions when traffic was going through NetScalers (NS). From the NS end it looked like the NS closed idling connections and sent an RST to the socket client, but that connection kept being reused by Jetty, as there is no default idle connection timeout set.

When we changed the client-side idle connection timeout to a value smaller than the NS idle connection timeout, the issue was gone.

This happened with jetty-client 9.4.7.v20170914.
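
A hedged sketch of the client-side mitigation described above: give the Jetty HttpClient an idle timeout shorter than the intermediary's, so pooled connections are closed by the client before the middlebox resets them. The 30-second value is illustrative only, not from this report.

```java
import org.eclipse.jetty.client.HttpClient;

// Idle connections in the client pool are closed once the timeout expires,
// instead of being reused after the intermediary has silently dropped them.
public class ClientIdleTimeoutExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = new HttpClient();
        client.setIdleTimeout(30_000);   // ms; must be below the intermediary's idle timeout
        client.start();
        // ... use client.newRequest(...) / client.GET(...) as usual ...
        client.stop();
    }
}
```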

@sbordet
Contributor

sbordet commented Feb 22, 2018

@mohsinbeg the TCP dump (well, it's missing 55 frames) shows that there are 3 FINs sent by the client, and they are immediately acknowledged by the server at around 2:22:15.8959.
So the reason your CLOSE_WAIT count increases during the period the TCP dump was taken does not seem to be related to the traffic on port 8983.
I don't know what else to tell you, but it's not the same problem as @axeluhl's, and from the data you provided it is not a problem in Jetty/Solr either.

@mohsinbeg

@sbordet - this issue was related to Solr v6.6.2 indexing settings that were impacting Jetty. Once Solr's solrconfig.xml was tuned for the ingestion load, the Jetty issues went away.

The details are in these two threads:
"9000 CLOSE_WAIT connections in solr v6.2.2 causing it to 'die'": http://lucene.472066.n3.nabble.com/9000-CLOSE-WAIT-connections-in-solr-v6-2-2-causing-it-to-quot-die-quot-td4373765.html
"Hard commits blocked | non-solrcloud v6.6.2": http://lucene.472066.n3.nabble.com/Hard-commits-blocked-non-solrcloud-v6-6-2-td4374386.html

@sbordet
Contributor

sbordet commented Mar 9, 2018

@mohsinbeg thanks for the feedback on this issue.

@axeluhl do you have more information about this issue?

@axeluhl

axeluhl commented Mar 9, 2018

We've now upgraded our Jetty installation to 9.4.8. I'll report back here how it goes the next time our server instances come under a lot of pressure. Thanks for now!

@anfabetagama

@axeluhl we have the same issue; could you please report how yours behaved after you upgraded? Thank you.

@axeluhl

axeluhl commented May 17, 2018

The issue hasn't occurred again since we upgraded. However, we've also been careful to route the health checks through the Apache reverse proxy. Since we've never run into the issue when the health check went through Apache, I cannot really say whether the Jetty upgrade fixed it.

@hagrrwal

hagrrwal commented Jul 5, 2018

I am facing the above issue on Jetty 9.3.14. Has this issue been resolved, and if so, in which version of Jetty? If not, is there any workaround? It's been a blocker in my production environment.

@sbordet
Contributor

sbordet commented Jul 5, 2018

@hagrrwal so far this issue has been proven to not be a Jetty problem. Please update to the latest Jetty version (9.4.11 at this time) and report back if you still have the issue.

@hagrrwal

hagrrwal commented Jul 6, 2018

@sbordet I have updated Jetty to 9.4.11 and deployed it; however, the issue still persists. TCP connections go into CLOSE_WAIT forever and never get closed.

@sbordet
Contributor

sbordet commented Jul 6, 2018

@hagrrwal how do you know it's Jetty that generates those CLOSE_WAIT?

I ask because in all other cases, it turned out to be some other library used by the application deployed in Jetty.

@hagrrwal

hagrrwal commented Jul 6, 2018

@sbordet Embedded Jetty is used with Glassfish for container & dependency injection.
An Nginx load balancer is also forwarding requests to Jetty, acting as the client. On the Nginx server, the TCP connections to Jetty are going into the FIN_WAIT2 state.

@joakime joakime changed the title Jetty-9.2 connections in CLOSE_WAIT are not closed Connections in CLOSE_WAIT are not closed and hang around seemingly forever Jul 16, 2018
@joakime
Member

joakime commented Jul 16, 2018

@hagrrwal in that dependency tree, you have 2 servers and 1 client.
Of the connections in the CLOSE_WAIT state, which server do they belong to (Jetty or Glassfish/Jersey)? And which client (Jersey or the nginx frontend)?
What version of nginx, btw?
Also, what is your nginx configuration?
Also, what OS are you running this on? (If Linux, what kernel version?)

@joakime
Member

joakime commented Jul 16, 2018

Make sure your nginx configuration does not use any sort of connection keep-alive configuration (that's for HTTP/1.0, not HTTP/1.1, and nothing really uses HTTP/1.0 anymore).

@joakime
Member

joakime commented Jul 16, 2018

OP has filed 2 Stack Overflow questions about this:

  1. https://stackoverflow.com/questions/51179169/jetty-connection-going-in-close-wait
  2. https://stackoverflow.com/questions/51346101/jetty-connections-in-close-wait-are-not-closed

@hagrrwal

hagrrwal commented Jul 16, 2018

@sbordet and @joakime Thank you for looking into it; I have found the cause of my issue. By analysing a thread dump, I figured out that application threads were going into a waiting state (java.lang.Thread.State: WAITING (parking)), causing a total lockup of the entire application and of Jetty as well. Basically, another library used by the application deployed in Jetty caused the threads to wait.

The server is Jetty and the client is nginx.
What version of nginx btw? - nginx/1.10.3 (Ubuntu)
Also, what is your nginx configuration?
upstream web_backend {
    server 10.10.10.10:8443;
}
server {
    listen 80;
    location / {
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass http://web_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
Also, what OS are you running this on? (If Linux, what kernel version?) - Ubuntu 16.04.4 LTS

@sbordet
Contributor

sbordet commented Jul 17, 2018

Closing the issue, as the 3 reports turned out not to be Jetty issues.

@sbordet sbordet closed this as completed Jul 17, 2018
@mrossi975

hi,
we are having the same problem here and I am pretty sure it's not due to an external library.
This is bringing down our production system for the second time in a week :(

I am running the latest Jetty 9.4.14.v20181114 on Java 11.
Right now I see 2600 connections in CLOSE_WAIT using lsof.
They are all from the external https port.
Not sure what I could do to help diagnose the problem properly.

It's possible that something has gone wrong in my application code, leading to requests never returning, but even if that were the case, Jetty could probably do something better about it.

Could you have a sweeper or something that kills off connections after a certain timeout, logging the stack trace that led to the connection being stuck? (Not sure if that's possible.)

Unfortunately, right now I can't even get a thread dump out of the process as all of its file descriptors are exhausted.

thanks!
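
Regarding the "sweeper" idea above: Jetty's connectors already carry an idle timeout that closes connections with no I/O activity for longer than the configured period, although whether it would have caught these particular CLOSE_WAIT sockets is not established in this thread. A minimal embedded sketch, with placeholder port and timeout values (the same setter applies to SSL connectors):

```java
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.ServerConnector;

// Connections idle for longer than the connector's idle timeout are closed
// by the server; 30s here is illustrative only.
public class ConnectorIdleTimeoutExample {
    public static void main(String[] args) throws Exception {
        Server server = new Server();
        ServerConnector connector = new ServerConnector(server);
        connector.setPort(8080);            // placeholder port
        connector.setIdleTimeout(30_000);   // milliseconds
        server.addConnector(connector);
        server.start();
        server.join();
    }
}
```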

@joakime
Member

joakime commented Jan 8, 2019

@mrossi975 please capture a Jetty Server dump and report back.
See https://www.eclipse.org/jetty/documentation/9.4.x/jetty-dump-tool.html (and the comments above in this issue)

We'll be happy to reopen this issue once the dump has been provided.

@beatofandhra

@mrossi975 Did you fix your issue? We are having the same problem in production. Any help is appreciated.
