
Tight exception loop in AbstractConnector when TCP/IP is reset #283

Closed
jmcc0nn3ll opened this issue Feb 16, 2016 · 8 comments

@jmcc0nn3ll
Contributor

migrated from Bugzilla #485974
status ASSIGNED severity normal in component server for 9.3.x
Reported in version unspecified on platform Other
Assigned to: Project Inbox

On 2016-01-15 16:04:32 -0500, Bob Bennett wrote:

We are running dropwizard with embedded jetty in a z/OS environment. When the TCPIP process is recycled, we see a very tight loop of the following messages in our log (the exception message is produced around 4000 times per second):

WARN [2016-01-13 20:05:26,322] org.eclipse.jetty.server.ServerConnector:
! com.ibm.net.NetworkRecycledException: Network Recycled while accepting connection
! at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:306) ~[na:1.7.0]
! at org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:377) ~[daas-engine-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
! at org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:500) ~[daas-engine-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
! at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635) [daas-engine-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
! at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555) [daas-engine-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
! at java.lang.Thread.run(Thread.java:798) [na:1.7.0]

I'm looking at the code in AbstractConnector.java, and in Acceptor.run() there is this loop that apparently keeps calling accept() without caring that it is not working and is throwing the same exception repeatedly:

        try
        {
            while (isAccepting())
            {
                try
                {
                    accept(_id);
                }
                catch (Throwable e)
                {
                    if (isAccepting())
                        LOG.warn(e);
                    else
                        LOG.ignore(e);
                }
            }
        }
        finally
        {
            thread.setName(name);
            if (_acceptorPriorityDelta!=0)
                thread.setPriority(priority);

            synchronized (AbstractConnector.this)
            {
                _acceptors[_id] = null;
            }
            CountDownLatch stopping=_stopping;
            if (stopping!=null)
                stopping.countDown();
        }

This situation does not occur very often, but when it does the result is severe, as the server log can generate a few million lines of output by the time we realize the situation is happening and manually stop/restart the server.

On 2016-01-15 17:03:40 -0500, Greg Wilkins wrote:

I think for most deployments, it is desirable for the connector to try to continue and accept further connections - it is hard for the connector to distinguish between an exception for a particular accept vs some kind of systematic failure.

We could perhaps add an optional accept error count (or rate?) which, if exceeded, aborts - but then should we just stop this connector, or stop the server? Or stop the entire JVM?

It is difficult for such low level components to make correct system level decisions when confronted with persistent failures.
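
For illustration, a minimal sketch of what such an optional accept-error threshold could look like; the class name AcceptFailureTracker, the threshold constant, and the method names are hypothetical, not part of any Jetty API:

    // Illustrative only: a consecutive-failure counter that a connector could
    // consult to decide when repeated accept() failures should abort accepting.
    import java.util.concurrent.atomic.AtomicInteger;

    class AcceptFailureTracker
    {
        private static final int MAX_CONSECUTIVE_ACCEPT_FAILURES = 100;
        private final AtomicInteger consecutiveFailures = new AtomicInteger();

        // Reset the counter after a successful accept().
        void onAcceptSuccess()
        {
            consecutiveFailures.set(0);
        }

        // Record a failed accept(); returns true when the threshold is exceeded
        // and the caller should stop accepting (or stop the connector/server).
        boolean onAcceptFailure(Throwable failure)
        {
            return consecutiveFailures.incrementAndGet() >= MAX_CONSECUTIVE_ACCEPT_FAILURES;
        }
    }

Whether exceeding the threshold should stop the connector, the server, or the whole JVM is exactly the open question above.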

Is there a way to make whatever is "recycling the process" first stop the dropwizard process?

On 2016-01-15 18:14:09 -0500, Bob Bennett wrote:

(In reply to Greg Wilkins from comment #1)

> I think for most deployments, it is desirable for the connector to try to continue and accept further connections - it is hard for the connector to distinguish between an exception for a particular accept vs some kind of systematic failure.
>
> We could perhaps add an optional accept error count (or rate?) which, if exceeded, aborts - but then should we just stop this connector, or stop the server? Or stop the entire JVM?

I think maybe an acceptable error limit, or a way to name specific exceptions that should cause a restart of the server, would be appropriate. In this case there are two exceptions that we are observing - java.io.IOException when TCPIP is down, and com.ibm.net.NetworkRecycledException when TCPIP comes back up. An appropriate reaction to the NetworkRecycledException would be to restart the server, which would close the socket connection and reestablish it.

> It is difficult for such low level components to make correct system level decisions when confronted with persistent failures.

Yes, but the current behavior is like the classical definition of insanity - repeating the same thing over and over and expecting different results. There's got to come a time when you make a determination that the network is just not coming back.

> Is there a way to make whatever is "recycling the process" first stop the dropwizard process?

Normally, in a z/OS environment, the operator who is cycling the network process is not aware of what other servers are using the network. It might be helpful if there were some callback method that we could set up to look at the exception and have the option of resetting the server ourselves.

@jmcc0nn3ll jmcc0nn3ll added the Bug For general bugs on Jetty side label Feb 16, 2016
@gregw gregw added Enhancement Help Wanted and removed Bug For general bugs on Jetty side labels Feb 17, 2016
@gregw
Contributor

gregw commented Feb 17, 2016

Changing this to an enhancement request, as I don't think it is a bug. Still not really sure what is the best thing to do here.

@dave-griffiths

Hi, I think this is a bug, not an enhancement. We are seeing it, and if you google "jetty NetworkRecycledException" you will see others have hit it also. It is impossible for us to work around, as the loop is in the Jetty code. I suggest catching the NetworkRecycledException in AbstractConnector$Acceptor.run and reopening the socket (if possible).

It's a bit unfortunate because the only way you know it is happening is when you see your log file has become huge! Our customer has seen this even after recycling TCPIP and then starting the server.

@sbordet
Contributor

sbordet commented May 26, 2016

I think this is a duplicate of #354, which has been solved by allowing applications to override AbstractConnector.handleAcceptFailure() and decide what to do in case of an exception during accept().

@sbordet
Contributor

sbordet commented May 26, 2016

Obviously we cannot catch NetworkRecycledException, because that is an IBM-specific exception.

Unfortunately accept() is under-specified in case of exceptions, so there is no easy way for Jetty to tell, in a portable way, whether it should abort accepting, whether it should just retry immediately, whether it should retry after a while, etc. Hence the introduction of handleAcceptFailure() that can be overridden.
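
For illustration only, a minimal sketch of such an override, assuming the 9.3.x hook is a protected handleAcceptFailure(Throwable) method on AbstractConnector (inherited by ServerConnector); the class name, the string match on the IBM exception, and the decision to stop the server are assumptions for the example, not part of the Jetty API or a recommended policy:

    import org.eclipse.jetty.server.Server;
    import org.eclipse.jetty.server.ServerConnector;

    public class RecycleAwareConnector extends ServerConnector
    {
        public RecycleAwareConnector(Server server)
        {
            super(server);
        }

        @Override
        protected void handleAcceptFailure(Throwable failure)
        {
            // Match the IBM-specific exception by class name so this compiles on any JDK.
            if ("com.ibm.net.NetworkRecycledException".equals(failure.getClass().getName()))
            {
                // TCP/IP was recycled: stop the server from a separate thread so the
                // acceptor thread is free to exit while the connector shuts down.
                new Thread(() ->
                {
                    try
                    {
                        getServer().stop();
                    }
                    catch (Exception x)
                    {
                        x.printStackTrace();
                    }
                }, "stop-on-network-recycle").start();
            }
            else
            {
                // Keep the default behavior (log and continue accepting).
                super.handleAcceptFailure(failure);
            }
        }
    }

The custom connector would then be registered in place of the default one, e.g. server.addConnector(new RecycleAwareConnector(server)) plus the usual port configuration.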

@dave-griffiths

Ok thanks for that. It looks like we use org.mortbay.jetty.nio.SelectChannelConnector so we would need to subclass that and override handleAcceptFailure?

@sbordet
Contributor

sbordet commented May 26, 2016

@hugograffiti what Jetty version are you on? The fix is in 9.3.x.

@dave-griffiths

Er, 6.1.25 :-) If we fix it, we will just have to do our own thing. It doesn't look like the override approach would work, so we would need to modify AbstractConnector.

@sbordet
Contributor

sbordet commented May 26, 2016

All right, then closing this issue. Thanks!

@sbordet sbordet closed this as completed May 26, 2016