
Tight exception loop in AbstractConnector when TCP/IP is reset #283

Closed
jmcc0nn3ll opened this issue Feb 16, 2016 · 8 comments

@jmcc0nn3ll
Contributor

migrated from Bugzilla #485974
status ASSIGNED severity normal in component server for 9.3.x
Reported in version unspecified on platform Other
Assigned to: Project Inbox

On 2016-01-15 16:04:32 -0500, Bob Bennett wrote:

We are running dropwizard with embedded jetty in a z/OS environment. When the TCPIP process is recycled, we see a very tight loop of the following messages in our log (the exception message is produced around 4000 times per second):

WARN [2016-01-13 20:05:26,322] org.eclipse.jetty.server.ServerConnector:
! com.ibm.net.NetworkRecycledException: Network Recycled while accepting connection
! at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:306) ~[na:1.7.0]
! at org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:377) ~[daas-engine-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
! at org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:500) ~[daas-engine-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
! at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635) [daas-engine-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
! at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555) [daas-engine-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
! at java.lang.Thread.run(Thread.java:798) [na:1.7.0]

I'm looking at the code in AbstractConnector.java, and in Acceptor.run() there is this loop that apparently keeps calling accept() without caring that it is not working and is throwing the same exception repeatedly:

        try
        {
            while (isAccepting())
            {
                try
                {
                    accept(_id);
                }
                catch (Throwable e)
                {
                    if (isAccepting())
                        LOG.warn(e);
                    else
                        LOG.ignore(e);
                }
            }
        }
        finally
        {
            thread.setName(name);
            if (_acceptorPriorityDelta!=0)
                thread.setPriority(priority);

            synchronized (AbstractConnector.this)
            {
                _acceptors[_id] = null;
            }
            CountDownLatch stopping=_stopping;
            if (stopping!=null)
                stopping.countDown();
        }

This situation does not occur very often, but when it does the result is severe, as the server log can generate a few million lines of output by the time we realize the situation is happening and manually stop/restart the server.

On 2016-01-15 17:03:40 -0500, Greg Wilkins wrote:

I think for most deployments, it is desirable for the connector to try to continue and accept further connections - it is hard for the connector to distinguish between an exception for a particular accept vs some kind of systematic failure.

We could perhaps add an optional accept error count (or rate?) which, if exceeded, aborts - but then should we just stop this connector, or stop the server? Or stop the entire JVM?

It is difficult for such low level components to make correct system level decisions when confronted with persistent failures.
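
For illustration, a minimal sketch of what such an optional accept-error threshold could look like; the class name AcceptFailureTracker, the threshold constant, and the method names are hypothetical, not part of any Jetty API:

    // Illustrative only: a consecutive-failure counter that a connector could
    // consult to decide when repeated accept() failures should abort accepting.
    import java.util.concurrent.atomic.AtomicInteger;

    class AcceptFailureTracker
    {
        private static final int MAX_CONSECUTIVE_ACCEPT_FAILURES = 100;
        private final AtomicInteger consecutiveFailures = new AtomicInteger();

        // Reset the counter after a successful accept().
        void onAcceptSuccess()
        {
            consecutiveFailures.set(0);
        }

        // Record a failed accept(); returns true when the threshold is exceeded
        // and the caller should stop accepting (or stop the connector/server).
        boolean onAcceptFailure(Throwable failure)
        {
            return consecutiveFailures.incrementAndGet() >= MAX_CONSECUTIVE_ACCEPT_FAILURES;
        }
    }

Whether exceeding the threshold should stop the connector, the server, or the whole JVM is exactly the open question above.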

Is there a way to make whatever is "recycling the process" first stop the dropwizard process?

On 2016-01-15 18:14:09 -0500, Bob Bennett wrote:

(In reply to Greg Wilkins from comment #1)

> I think for most deployments, it is desirable for the connector to try to continue and accept further connections - it is hard for the connector to distinguish between an exception for a particular accept vs some kind of systematic failure.
>
> We could perhaps add an optional accept error count (or rate?) which, if exceeded, aborts - but then should we just stop this connector, or stop the server? Or stop the entire JVM?

I think maybe an acceptable error limit, or a way to name specific exceptions that should cause a restart of the server, would be appropriate. In this case there are two exceptions that we are observing - java.io.IOException when TCPIP is down, and com.ibm.net.NetworkRecycledException when TCPIP comes back up. An appropriate reaction to the NetworkRecycledException would be to restart the server, which would close the socket connection and reestablish it.

> It is difficult for such low level components to make correct system level decisions when confronted with persistent failures.

Yes, but the current behavior is like the classical definition of insanity - repeating the same thing over and over and expecting different results. There's got to come a time when you make a determination that the network is just not coming back.

> Is there a way to make whatever is "recycling the process" first stop the dropwizard process?

Normally, in a z/OS environment, the operator who is cycling the network process is not aware of what other servers are using the network. It might be helpful if there were some callback method that we could set up to look at the exception and have the option of resetting the server ourselves.

@jmcc0nn3ll jmcc0nn3ll added the Bug For general bugs on Jetty side label Feb 16, 2016
@gregw gregw added Enhancement Help Wanted and removed Bug For general bugs on Jetty side labels Feb 17, 2016
@gregw
Contributor

gregw commented Feb 17, 2016

Changing this to an enhancement request, as I don't think it is a bug. Still not really sure what is the best thing to do here.

@dave-griffiths

Hi, I think this is a bug, not an enhancement. We are seeing it, and if you google "jetty NetworkRecycledException" you will see others have hit it also. It is impossible for us to work around, as the loop is in the Jetty code. I suggest catching the NetworkRecycledException in AbstractConnector$Acceptor.run and reopening the socket (if possible).

It's a bit unfortunate because the only way you know it is happening is when you see your log file has become huge! Our customer has seen this even after recycling TCPIP and then starting the server.

@sbordet
Contributor

sbordet commented May 26, 2016

I think this is a duplicate of #354, which has been solved by allowing applications to override AbstractConnector.handleAcceptFailure() and decide what to do in case of an exception during accept().

@sbordet
Contributor

sbordet commented May 26, 2016

Obviously we cannot catch NetworkRecycledException, because that is an IBM-specific exception.

Unfortunately accept() is under-specified in case of exceptions, so there is no easy way for Jetty to tell, in a portable way, whether it should abort accepting, whether it should just retry immediately, whether it should retry after a while, etc. Hence the introduction of handleAcceptFailure() that can be overridden.
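
For illustration only, a minimal sketch of such an override, assuming the 9.3.x hook is a protected handleAcceptFailure(Throwable) method on AbstractConnector (inherited by ServerConnector); the class name, the string match on the IBM exception, and the decision to stop the server are assumptions for the example, not part of the Jetty API or a recommended policy:

    import org.eclipse.jetty.server.Server;
    import org.eclipse.jetty.server.ServerConnector;

    public class RecycleAwareConnector extends ServerConnector
    {
        public RecycleAwareConnector(Server server)
        {
            super(server);
        }

        @Override
        protected void handleAcceptFailure(Throwable failure)
        {
            // Match the IBM-specific exception by class name so this compiles on any JDK.
            if ("com.ibm.net.NetworkRecycledException".equals(failure.getClass().getName()))
            {
                // TCP/IP was recycled: stop the server from a separate thread so the
                // acceptor thread is free to exit while the connector shuts down.
                new Thread(() ->
                {
                    try
                    {
                        getServer().stop();
                    }
                    catch (Exception x)
                    {
                        x.printStackTrace();
                    }
                }, "stop-on-network-recycle").start();
            }
            else
            {
                // Keep the default behavior (log and continue accepting).
                super.handleAcceptFailure(failure);
            }
        }
    }

The custom connector would then be registered in place of the default one, e.g. server.addConnector(new RecycleAwareConnector(server)) plus the usual port configuration.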

@dave-griffiths

Ok thanks for that. It looks like we use org.mortbay.jetty.nio.SelectChannelConnector so we would need to subclass that and override handleAcceptFailure?

@sbordet
Contributor

sbordet commented May 26, 2016

@hugograffiti what Jetty version are you on? The fix is in 9.3.x.

@dave-griffiths

Er, 6.1.25 :-) If we fix it, we will just have to do our own thing. It doesn't look like the override approach would work, so we would need to modify AbstractConnector.

@sbordet
Contributor

sbordet commented May 26, 2016

All right, then closing this issue. Thanks!

@sbordet sbordet closed this as completed May 26, 2016