Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Messages timeout when log level is changed #820

Closed
boris-petrov opened this issue Jan 17, 2019 · 48 comments
Closed

Messages timeout when log level is changed #820

boris-petrov opened this issue Jan 17, 2019 · 48 comments

Comments

@boris-petrov
Copy link

Not sure how to explain this. It's very weird. Here goes. Using CometD 4.0.2.

Having the following logback.xml:

<?xml version="1.0" encoding="UTF-8"?>
<configuration
    xmlns="http://ch.qos.logback/xml/ns/logback"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://ch.qos.logback/xml/ns/logback https://raw.githubusercontent.com/enricopulatzo/logback-XSD/master/src/main/xsd/logback.xsd">

    <appender name="consoleAppender" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>%green(%d{HH:mm:ss.SSS}) [%yellow(%thread)] %highlight(%-5level) %cyan(%logger{36}) - %msg %n</pattern>
        </encoder>
    </appender>

    <root level="INFO">
        <appender-ref ref="consoleAppender"/>
    </root>
</configuration>

Causes published messages from the frontend to timeout. These messages reach neither my @Listener handlers, nor the SecurityPolicy's can{Publish/whatever} methods on the Java backend.

Changing the level in the logback.xml file to DEBUG fixes that issue and all messages are sent and received.

Both the timeouts and the "fix" happen every single time I try. I reproduce them consistently by just changing the log level. I've done this at least 10 times each.

No idea what is going on. I've been banging my head on this for a few hours now.

How do I proceed with debugging this? That's what I wanted to do by changing the log level... but the issue disappeared. :D The only thing I can imagine right now as a possible source of the issue is some race condition in CometD which causes messages to be "dropped" but when there's a bunch of logging, everything is slower and that's why it works. Any other ideas/suggestions? How do I approach this? Creating a reproduction will probably be an impossible task so I prefer to aid with debugging in any way I can.

@sbordet
Copy link
Member

sbordet commented Jan 17, 2019

Looks very strange. Can you pack a simple reproducible case and attach it here?

@boris-petrov
Copy link
Author

Well, as I said, creating a reproduction is an impossible task, so we'll have to approach this in another way.

After a little bit of trying, I found that downgrading only the Java version of CometD to 4.0.0 makes the problem go away. 4.0.1 and 4.0.2 don't work. So something in Java changed between 4.0.0 and 4.0.1 which broke it.

I'm willing to try patches/new versions on my setup to help you debug.

@sbordet
Copy link
Member

sbordet commented Jan 18, 2019

@boris-petrov I have changed the logging of the CometD Demo to Logback, used your logback.xml and everything works fine for me.
Can you do the same and check if it works for you?

I don't clearly understand what you mean by "front-end". Are you using BayeuxClient to publish from a front-end server machine to a back-end server machine?
Where do you have your LogBack configuration?

Or you mean publishing from the browser using cometd.js?

You can search and replace in the whole CometD source isDebugEnabled() with isInfoEnabled() and .debug(...) with .info(...) - if it's a logging problem at least we will have some "debug" logging.

What if you use Log4J instead of LogBack?

@boris-petrov
Copy link
Author

Sorry, I should have been more specific. I'm using cometd.js version 4.0.2 which talks with a Java backend on which there is CometD 4.0.2.

I don't think it has anything to do with Logback. I think it is a timing issue which the logging just hides as it takes a lot of time to print a million things to the console. I will try what you suggest if all else fails as debugging.

Can we start with what I mentioned - that something broke between 4.0.0 and 4.0.1 in the Java code. What changed there? Can you point me to the code that was modified so we can start digging from there?

@sbordet
Copy link
Member

sbordet commented Jan 18, 2019

@boris-petrov Here is the list of issues fixed: https://github.com/cometd/cometd/milestone/14?closed=1.

git diff 4.0.0..4.0.1 gives you the diff for all the files.

I don't see how what you experience would not have been caught by our continuous integration... it may be a very rare case, but you seem to be able to reproduce at will.

You seem to have a very fundamental problem (you said you can't even get the SecurityPolicy invoked), a simple example (JS file with just handshake and publish + hello world server) should reproduce the issue.
We have tons of those in our test suite and I don't see any problem.

Can you grab a network trace with Wireshark?
What browser/OS/JDK?

@boris-petrov
Copy link
Author

Yes, I reproduce it consistently.

Browser is latest Chromium 71.0.3578.98, Arch Linux, OpenJDK 11.0.1.

How I do it is I run our integration tests, put a big sleep somewhere, then when it stops there I open Chromium, navigate to the page, it opens fine and the websocket works fine, but then when I click somewhere and something over the websocket is supposed to happen, nothing does and the messages timeout.

I think the traffic from the browser is fine as I see in the Network tab in Chromium in the Frames section that a correct message is sent. It's just that I don't think it is handled properly in the backend.

Perhaps I can put println statements or something? Where is the code that gets the request? Tell me where to put logging and what exactly and also how to build the project and I'll start debugging. I have no other idea what to do or how more to help.

@sbordet
Copy link
Member

sbordet commented Jan 18, 2019

Show me the code.

I put a big sleep somewhere

That is too generic - I cannot help you if I don't know the details.

Big sleeps in WebSocket will halt all CometD processing, reading from the network, etc.
See https://docs.cometd.org/current/reference/#_concepts_threading.
If you have a "big sleep" it's normal that messages timeout.

To build CometD: https://docs.cometd.org/current/reference/#_build.

I don't understand how your case will work with log DEBUG but not with log INFO.
If you have "big sleeps" the log level should not matter.

@boris-petrov
Copy link
Author

The big sleep is just to pause the test. We are using Capybara and in the middle of one of the tests I just put a sleep 1000000 (in the JRuby code) so the test stops and I can play with the running system. This doesn't block the execution of the system in any way - it's still up and running. I just mentioned that to illustrate how I reproduce it.

I also don't understand how any of this is possible but the good part is that I reproduce it consistently and this means easier debugging. That's why I need your help to know where to start from and help you figure it out. Thanks for the link for building CometD, please tell me where to put some initial logging where we are expecting this message to get to CometD in the beginning.

@sbordet
Copy link
Member

sbordet commented Jan 18, 2019

Assuming you are using the javax.websocket (i.e. JSR 356) implementation of WebSocket, the messages first land here: https://github.com/cometd/cometd/blob/4.0.2/cometd-java/cometd-java-websocket/cometd-java-websocket-javax-server/src/main/java/org/cometd/websocket/server/WebSocketEndPoint.java#L53.
From there, there is parsing of messages, and processing of messages, which eventually end up here: https://github.com/cometd/cometd/blob/4.0.2/cometd-java/cometd-java-server/src/main/java/org/cometd/server/BayeuxServerImpl.java#L614.
Start adding logging (or modifying the existing logging to be at INFO level) for every message you receive, and let's see where it stops.

@boris-petrov
Copy link
Author

OK, I added these loggings. I'm seeing:

[{"id":"1","version":"1.0","minimumVersion":"1.0","channel":"/meta/handshake","supportedConnectionTypes":["websocket","long-polling","callback-polling"],"advice":{"timeout":60000,"interval":0},"ext":{"ack":true}}]

As the first message, then the second one contains a bunch of subscriptions and messages which are application specific (they are batched), the third message resubscribes for the subscriptions (as now I see that the app logic does that) and then only:

[{"id":"21","channel":"/meta/connect","connectionType":"websocket","advice":{"timeout":0},"clientId":"719ai9t29azxo6ux5c18g8t1w","ext":{"ack":0}}]

In the frontend, the handshake happens successfully but then after a while I start receiving on /meta/connect message timeouts. I do not receive even one successful one - from the very first one they are timeouts. I guess that's the reason why nothing else works after that. Any ideas why that happens? Sending messages after this doesn't reach the println's I've put at all.

@sbordet
Copy link
Member

sbordet commented Jan 18, 2019

@boris-petrov the /meta/connect message is expected, has advice:{timeout=0}, also expected, and the server must reply to this message immediately.
So: either the server does not reply to this message correctly, or the client receives the reply, but then has problems and cannot send the next /meta/connect.

Can you enable DEBUG logging on the client (https://docs.cometd.org/current/reference/#_troubleshooting_logging_javascript)?

Can you try to use the HTTP transport, rather than WebSocket?
On the JavaScript client you can do cometd.websocketEnabled=false to disable WebSocket and therefore use HTTP.
Do you have the same problems?

Can you take a network trace via Wireshark to know exactly what passes through the network?

@boris-petrov
Copy link
Author

OK, debug logging shows something like:

11:09:40.955 Function scheduled in 0 ms, interval = 0 backoff = 0 ƒ (){!function(){if(!V()){var t={id:Z(),channel:"/meta/connect",connectionType:o.getType()};q||(t.advice={timeout:0}),$("connecting"),h._debug("Connect sent",t),nt([t],!0,"connect"),$("connected")}}()}
11:09:41.093 Invoking timed function ƒ (){!function(){if(!V()){var t={id:Z(),channel:"/meta/connect",connectionType:o.getType()};q||(t.advice={timeout:0}),$("connecting"),h._debug("Connect sent",t),nt([t],!0,"connect"),$("connected")}}()}
11:09:41.093 Connect sent {id: "21", channel: "/meta/connect", connectionType: "websocket", advice: {…}}
11:09:41.095 AckExtension: client sending batch 0
11:09:41.095 Send {url: "http://192.168.88.172:8081/integration-tests/streaming", sync: false, messages: Array(1), onSuccess: ƒ, onFailure: ƒ}
11:09:41.095 Transport websocket sending {url: "http://192.168.88.172:8081/integration-tests/streaming", sync: false, messages: Array(1), onSuccess: ƒ, onFailure: ƒ} /meta/connect = true
11:09:41.096 Transport websocket stored envelope, envelopes {21: Array(2), 2,3,4,5,6,7,8,9,10,11,12,13,14,15,16: Array(2), 17,18,19,20: Array(2)}
11:09:41.097 Transport websocket sent {url: "http://192.168.88.172:8081/integration-tests/streaming", sync: false, messages: Array(1), onSuccess: ƒ, onFailure: ƒ} /meta/connect = true
11:09:41.098 Transport websocket waiting at most 60000 ms for messages ["21"] maxNetworkDelay 30000 , timeouts: {2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 22}
11:09:41.100 Status connecting -> connected
11:10:41.098 Invoking timed function ƒ (){t._debug("Transport",s.getType(),"timing out message",r.id,"after",a,"on",e),p.call(s,e,{code:1e3,reason:"Message Timeout"})}
11:10:41.100 Transport websocket timing out message 21 after 60000 on {envelopes: {…}, timeouts: {…}, webSocket: WebSocket}
11:10:41.103 Transport websocket closed {envelopes: {…}, timeouts: {…}, webSocket: WebSocket} {code: 1000, reason: "Message Timeout"}
11:10:41.107 handleFailure {envelopes: {…}, timeouts: {…}, webSocket: WebSocket} [{…}] {websocketCode: 1000, reason: "Message Timeout", connectionType: "websocket"}
11:10:41.112 Transport failure {cause: "failure", action: "retry", transport: null} for {id: "21", successful: false, channel: "/meta/connect", failure: {…}}
11:10:41.113 Transport websocket accept, supported: true
11:10:41.114 Transport failure handling {cause: "failure", action: "retry", transport: null, delay: 1000}
11:10:41.114 Status connected -> connecting

This is for the 21st message, as you can see. The first 20 are sent and received fine. They also get responses from the server. The 21st, which is the /meta/connect call fails.

Using long polling seems to fix the problem, although it does work very slowly. Like, it takes 10 seconds for some responses to come through but I guess that's normal. Nevertheless it does work in the end.

I can't seem to figure out how to work with Wireshark. I capture all traffic and set a display filter for websocket as is written here but it shows nothing... any ideas?

@sbordet
Copy link
Member

sbordet commented Jan 21, 2019

Using long polling seems to fix the problem, although it does work very slowly.

I don't understand this sentence, what do you mean by "slowly"? Taking 10 seconds for some responses is normal if you have a timeout=10000 configured on server (long poll).
The same goes for WebSocket.

Can you please try without minifying the JavaScript?

Do you see message 21 arriving on server?

As for Wireshark, you should capture traffic on port 8081 which is where you have configured CometD, so tcp.port==8081 as the filter in the Wireshark UI.

@boris-petrov
Copy link
Author

This is what I have as a server configuration:

@WebListener
public class BayeuxInitializer implements ServletContextAttributeListener {

	@Override
	public void attributeAdded(ServletContextAttributeEvent event) {
		if (BayeuxServer.ATTRIBUTE.equals(event.getName())) {
			BayeuxServer bayeux = (BayeuxServer) event.getValue();
			bayeux.setSecurityPolicy(new StreamingSecurityPolicy());

			// https://docs.cometd.org/current/reference/#_java_server_configuration
			bayeux.setOption("ws.messagesPerFrame", 5);
			bayeux.setOption("ws.maxMessageSize", 655_200);
			// "ws.enableExtension.permessage-deflate" is true by default so compression is enabled for WebSockets

			bayeux.addListener(new StreamingService.SessionListenerImp());
			bayeux.addListener(new StreamingService.SubscriptionListenerImp());

			// https://docs.cometd.org/current/reference/#_extensions_acknowledge
			bayeux.addExtension(new AcknowledgedMessagesExtension());
		}
	}

web.xml:

    <servlet>
        <servlet-name>cometd</servlet-name>
        <servlet-class>org.cometd.annotation.AnnotationCometDServlet</servlet-class>
        <init-param>
            <param-name>services</param-name>
            <param-value>com.company.StreamingService</param-value>
        </init-param>
        <init-param>
            <param-name>ws.cometdURLMapping</param-name>
            <param-value>/streaming/*</param-value>
        </init-param>
        <init-param>
            <param-name>jsonContext</param-name>
            <param-value>com.company.JsonSerializer</param-value>
        </init-param>
        <load-on-startup>1</load-on-startup>
        <async-supported>true</async-supported>
    </servlet>

On the frontend:

    this.comet.configure({
      appendMessageTypeToURL: false, // https://groups.google.com/d/msg/cometd-users/DMTZ-Y2bOSg/EmxFJJQICwAJ
      autoBatch: true,
      logLevel,
      maxNetworkDelay: 30000,
      url,
    });

I understand the timeouts you mention on the longpolling but what do timeouts have to do with WebSockets?

What I showed was without minifying JS. All the tests I do with unminified JS.

I see message 21 on the server, yes:

-------------------- Message received - [{"id":"21","channel":"/meta/connect","connectionType":"websocket","advice":{"timeout":0},"clientId":"5rvzy5p3zmsk0sn16w9e3vuwv","ext":{"ack":0}}]
========== message got to here

The first line is from the WebSocketEndPoint.java file onMessage method (I've printed the data argument), the second log is from BayeuxServerImpl.java handle method beginning. So the message gets there just no response from the server I guess.

In Wireshark I see 4 messages sent from the client to the server - as much as I see in the server logs. I see the server responses for the first 3 (which I explained before are an initial handshake message, then a bunch of subscribes + publishes, then a couple of resubscriptions (due to stupidity of our code) and then the /meta/connect message) but for the last one I see no message reaching the client. Wireshark is running on the client, by the way, the server is a remote machine right now. Then, a minute later, I see the client sending a "message timeout" message, closing the connection and then reconnecting.

What next?

@sbordet
Copy link
Member

sbordet commented Jan 21, 2019

bayeux.setOption("ws.maxMessageSize", 655_200);

Whoa, this is big. AFAIR browsers will choke with messages larger than 8 KiB.

      this.comet.configure({
      appendMessageTypeToURL: false, // https://groups.google.com/d/msg/cometd-users/DMTZ-Y2bOSg/EmxFJJQICwAJ
      autoBatch: true,
      logLevel,
      maxNetworkDelay: 30000,
      url,
    });

logLevel and url are wrong: they lack the value.

The server side timeout is a heartbeat, and it is run with any transport.

Seems like something is blocking the processing of the server-side /meta/connect messages.
Double check that you don't have a /meta/connect listener or extension sleeping or otherwise blocking for long times.
Other than that, when you see that log line on the server, take a server thread dump using ctrl+\ or jstack.
The thread dump should show where the processing of the /meta/connect has halted.

@boris-petrov
Copy link
Author

OK, I'll leave the default value for maxMessageSize. Not sure where I read that it could be as big as I've put it. Maybe in Google Groups?

logLevel and url are not wrong - it's just ES6 syntax for defining a hash - they are defined as variables above this code. They're not relevant so I omitted them from the example, sorry.

As for the thread dump - you're suggesting to get a thread dump when the server receives the messages before it timeouts? We're expecting there to be a thread that is blocked?

@boris-petrov
Copy link
Author

In the thread dump I searched for comet and this is the only thread that had cometd inside it:

"my thread name" #199 daemon prio=5 os_prio=0 cpu=244.99ms elapsed=167.23s tid=0x00007fd17e1d8000 nid=0x43c3 waiting on condition  [0x00007fd12f2be000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@11.0.2/Native Method)
        - parking to wait for  <0x000000073ad44578> (a java.util.concurrent.CompletableFuture$Signaller)
        at java.util.concurrent.locks.LockSupport.park(java.base@11.0.2/LockSupport.java:194)
        at java.util.concurrent.CompletableFuture$Signaller.block(java.base@11.0.2/CompletableFuture.java:1796)
        at java.util.concurrent.ForkJoinPool.managedBlock(java.base@11.0.2/ForkJoinPool.java:3128)
        at java.util.concurrent.CompletableFuture.waitingGet(java.base@11.0.2/CompletableFuture.java:1823)
        at java.util.concurrent.CompletableFuture.get(java.base@11.0.2/CompletableFuture.java:1998)
        at org.cometd.websocket.server.WebSocketEndPoint.onMessage(WebSocketEndPoint.java:65)
        at org.cometd.websocket.server.WebSocketEndPoint.onMessage(WebSocketEndPoint.java:37)
        at org.apache.tomcat.websocket.WsFrameBase.sendMessageText(WsFrameBase.java:395)
        at org.apache.tomcat.websocket.server.WsFrameServer.sendMessageText(WsFrameServer.java:119)
        at org.apache.tomcat.websocket.WsFrameBase.processDataText(WsFrameBase.java:495)
        at org.apache.tomcat.websocket.WsFrameBase.processData(WsFrameBase.java:294)
        at org.apache.tomcat.websocket.WsFrameBase.processInputBuffer(WsFrameBase.java:133)
        at org.apache.tomcat.websocket.server.WsFrameServer.onDataAvailable(WsFrameServer.java:82)
        at org.apache.tomcat.websocket.server.WsFrameServer.doOnDataAvailable(WsFrameServer.java:171)
        at org.apache.tomcat.websocket.server.WsFrameServer.notifyDataAvailable(WsFrameServer.java:151)
        at org.apache.tomcat.websocket.server.WsHttpUpgradeHandler.upgradeDispatch(WsHttpUpgradeHandler.java:148)
        at org.apache.coyote.http11.upgrade.UpgradeProcessorInternal.dispatch(UpgradeProcessorInternal.java:54)
        at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:53)
        at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:834)
        at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1417)
        at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
        - locked <0x0000000726bf91b0> (a org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.2/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.2/ThreadPoolExecutor.java:628)
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
        at java.lang.Thread.run(java.base@11.0.2/Thread.java:834)

A few minutes later there are a few of those with the same stacktrace due to the frontend reconnecting a bunch of times.

Any ideas?

@sbordet
Copy link
Member

sbordet commented Jan 21, 2019

So you are using Tomcat, not Jetty.

I need all the stack traces, not just one.

Also, try Jetty instead of Tomcat. Do you have the same issue?

@boris-petrov
Copy link
Author

Using Jetty instead of Tomcat is going to be a bit tricky as we use a couple of Tomcat-specific things. Let's leave it as a last resort for now.

Please give me an email to send you the thread dump.

@boris-petrov
Copy link
Author

Sorry, I saw it in your profile. Sending now.

@sbordet
Copy link
Member

sbordet commented Jan 21, 2019

Zip and attach the thread dump here for reference, and remember to obfuscate things you don't want everybody to see (passwords, IP addresses, etc.)

@boris-petrov
Copy link
Author

Here is the dump. threads.zip

Ping me when you're done analyzing and/or need more information.

@sbordet
Copy link
Member

sbordet commented Jan 21, 2019

A WebSocket message is received, and passed along for processing. From the logging you've got, seems to be a /meta/connect message.
Then the thread is waiting on the CompletableFuture to be completed, but it's not.
This is only possible if the processing of the /meta/connect is asynchronous, because I don't see any other thread processing that message.

Seems to be a case of the application not completing a promise.
Do you have listeners for /meta/connect messages? What do they do?

Try to run the CometD Demo in Tomcat (just deploy the CometD Demo war in Tomcat).
If it works, it's your application.
If it does not work, it's Tomcat.

@boris-petrov
Copy link
Author

I don't have any listeners on /meta/connect. I have one SessionListenerImpl implements BayeuxServer.SessionListener and one SubscriptionListenerImpl implements BayeuxServer.SubscriptionListener neither of which are called when the message timeouts.

I debugged a bit more and I'm seeing that this call:

        boolean queued = flusher.queue(new Entry(context, messages, Promise.from(y -> {
            promise.succeed(null);
            writeComplete(context, messages);
        }, promise::fail)));

Never executes the promise. This is inside the flush method in AbstractWebSocketEndPoint. flusher.iterate(); is called and finishes but nevertheless the queued task never completes.

Check out this repo. I've added a bunch of println's there. Here is the output when I refresh the webpage. Please check it out and compare with the loggings that I've put. I don't understand the code too much so I need some help. In the output after line 139 you can see that it only processes messages and never gets to a point where this Flusher is IDLE in order to poll _entries and that's why the promise never completes. I don't understand the code in case MESSAGES at all but I guess the problem happens somewhere there.

Is that enough information for you to go on with debugging? I can put more printing anywhere you want and help you figure it out. Just tell me what to print and where so we can get to the bottom of it.

@sbordet
Copy link
Member

sbordet commented Jan 22, 2019

@boris-petrov I tried the CometD demo with both Jetty and Tomcat 9.0.14 and they work fine for me.
I checked out your repo, and works fine for me in Jetty and Tomcat.

You need to detail exactly what you are doing, what is the problem, what do you expect instead, what is your client and server configuration.
I need to understand what you are doing and how I can reproduce it exactly.
So far this issue started with a logging issue, went to Tomcat (of which we don't know what version), now we have your repo that works fine for me, so I have no idea of what you're doing.
What browser are you using?
What Tomcat version?

Is the CometD demo working for you in Tomcat?

@boris-petrov
Copy link
Author

In this previous message of mine I mentioned the Chromium version that I use (71.0.3578.98) and the OpenJDK version (I've since updated to 11.0.2). Tomcat is the latest 9.0.14.

I never said that the repo that I gave doesn't work in some way. It is the original CometD 4.0.2 code plus some println statements. Nothing more. What I said in my previous post is that with this repo and the output.txt file that I've attached (which contain the execution of my program) you can probably begin to see that there is something wrong. And perhaps deduce where the issue is.

I already pasted the server and client configuration above. I also mentioned a few times that I cannot prepare you a reproduction easily. I've already spent many hours on this and I cannot spend full days on it to create a reproduction.

I also think I explained multiple times what the problem is. A message timeouts as the server doesn't send a response. The logging issue, as I mentioned, is just a side effect - if there is some race-condition somewhere, the logging will just mask it and that's why it works. When logging is disabled, the code runs fast and the deadlock/whatever happens. This, of course, is just a guess.

I never said there is an issue in Tomcat, neither that this repo that I created doesn't work. I built CometD from that repo with the println statements and run our code with that version of CometD. And saved the output in the output.txt file from my previous post. I was hoping you could see the output, compare it against what you would expect the println statements in the repo to print and try to figure out the issue or tell me to put more logging somewhere.

I'm trying hard to give you all the needed information to find the problem (besides me giving you access to our codebase). Not sure how more I can help.

@sbordet
Copy link
Member

sbordet commented Jan 22, 2019

Is the CometD demo working for you in Tomcat?

@sbordet
Copy link
Member

sbordet commented Jan 22, 2019

Can you attach a network trace with Wireshark?

@sbordet
Copy link
Member

sbordet commented Jan 22, 2019

@boris-petrov can you please put another log line in WebSocketEndPoint.Delegate.send(...), inside the SendHandler lambda?

From your output.txt you seem to send 23 messages, and you have configured 5 messages per frame, so these should be sent as 5-5-5-5-3.
However, it appears that the first write never returns.

Can you please check what is the default max message size for Tomcat?

It could be that Tomcat has problem writing the messages, but it does not notify the SendHandler with a failure, so CometD is stuck.

Another simple try you want to do is to remove the configuration bayeux.setOption("ws.messagesPerFrame", 5), leaving it at the default of 1 and see if it works.

@boris-petrov
Copy link
Author

Thank you for the support.

The CometD demo works fine as well as our application - we have hundreds of tests that run many times a day and all works fine. It's just this corner case that is a problem.

I'll send a Wireshark trace later if needed as it does contain app-specific data which I'm not sure I can mangle.

I put a log line as you requested. Here is the output. The last few lines are:

WebSocketEndPoint.Delegate.send 1
WebSocketEndPoint.Delegate.send 2 - null
>>>>>>> process 1
>>>>>>> process messages
WebSocketEndPoint.Delegate.send 1

So your guess is correct - javax.websocket.RemoteEndpoint.Async.sendTextdoesn't complete. Not sure whose code is this? Do you have an idea why this might happen?

I'm not sure what "default max message size for Tomcat" means. I haven't changed any Tomcat configuration. What should I look for?

I already tried removing the ws.messagesPerFrame setting and it didn't help.

By the way, please keep in mind what I said a few days ago - CometD 4.0.0 works and this problem doesn't happen. 4.0.1 and 4.0.2 "break". Not sure what changed there, just to note it if you have any idea.

@sbordet
Copy link
Member

sbordet commented Jan 23, 2019

So your guess is correct - javax.websocket.RemoteEndpoint.Async.sendTextdoesn't complete. Not sure whose code is this? Do you have an idea why this might happen?

This is the Tomcat implementation of the javax.websocket classes.

My theory was that Tomcat was choking and forgetting to notify the callback because the message was too large.
However, this problem happens with both messagesPerFrame=1 and your initial setting of maxMessageSize=655200, so it probably is something else.

A number of things have changed between CometD 4.0.0 and 4.0.1, but I don't think they are causing this issue, which seems related to the SendHandler not being completed by Tomcat.

Can you enable the Tomcat DEBUG logging and see if there are exceptions or errors when the messages are being sent?

We are now into debugging Tomcat, where I can give limited help.
If you could reproduce the issue for me to try, I could do some Tomcat debugging.

Alternatively, if you could switch to Jetty and verify that this issue does not happen, then we have a confirmation that is a Tomcat bug.

@sbordet
Copy link
Member

sbordet commented Jan 24, 2019

@boris-petrov news on this issue?

@boris-petrov
Copy link
Author

Yes, sorry, I had other things to do today. I'll look at this again either tomorrow, or first thing next week. I'm also trying to figure out how to setup the embedded Tomcat's logging preferences - if you have any ideas, you might save me some time. :)

Another thing I noticed yesterday - the same issue happened in production. I was getting message timeouts, I refreshed the page 3 times, the fourth time did a clear-all-caches-hard-refresh and it started working. But on the backend side if I do a thread dump I can see that there are 4 "deadlocked" threads with the same stacktrace as above. So this issue does happen in production too. Just mentioning that, I'm not saying anything else.

So, again, I'll write as soon as I can. Thanks!

@sbordet
Copy link
Member

sbordet commented Jan 24, 2019

@boris-petrov for Tomcat Logging you want to read this: https://tomcat.apache.org/tomcat-9.0-doc/logging.html.

@boris-petrov
Copy link
Author

boris-petrov commented Jan 25, 2019

OK, so I managed to enable logging in Tomcat and enabled it for the org.apache.tomcat.websocket package but it logs only when receiving data, not when sending. Going through the code shows that there is actually no logging in it. So not sure how to proceed here. I could try debugging Tomcat but I would leave that as a last resort. Do you have any other suggestions?

Also, I reverted to 4.0.0 in one of our installations because of this issue and it started working. Perhaps there is some bug in Tomcat which you didn't hit before 4.0.1 but then it started happening...

P.S. Actually no, it happens also on 4.0.0 but I guess less often... or something.

@sbordet
Copy link
Member

sbordet commented Jan 25, 2019

@boris-petrov if I have a way to reproduce, I would do the debug.

Otherwise you need to follow the Tomcat calls in the debugger and see why it does not complete the callback, or perhaps do what you have done with CometD, add your own logging to Tomcat code.

What I see from the Tomcat implementation is that in many, many places it throws exceptions without completing the SendHandler.
For example, WsRemoteEndpointImplBase.sendStringByCompletion() calls stateMachine.textStart() which may throw without completing the SendHandler; then calls TextMessageSendHandler.write() which also may throw without completing the SendHandler; and so on down the rabbit hole 😞

@boris-petrov
Copy link
Author

I opened a bug in Tomcat about that. Let's see what happens there. So yes, it might be the case that an error happens and the handler is not called. I'm accessing the website that I'm testing with via VPN access from the other side of the world with generally not very good internet so it is possible that a timeout happens or something and that's why it fails but the error never reaches CometD...

@boris-petrov
Copy link
Author

@sbordet - actually I put a try ... catch (Throwable t) { println ... } around the _wsSession.getAsyncRemote().sendText call but it doesn't get there... so the theory that an exception happens is not what happens. I'm not sure how javax.websocket.SendHandler works - who is supposed to call TextMessageSendHandler's onResult method?

@sbordet
Copy link
Member

sbordet commented Jan 28, 2019

The Tomcat WebSocket implementation is supposed to call the SendHandler.

While exceptions do not happen in your case, if they happen, the SendHandler won't be notified and it should not be the burden of the application to try/catch IllegalStateException from the implementation because moving to another implementation may not throw the same exceptions (or at all).

Something else is going on in Tomcat that causes the SendHandler to not be called.

@boris-petrov
Copy link
Author

@sbordet - they answered on the Tomcat bug. They're saying that throwing an exception (at least in some cases) is actually according to the specs. I'm no expert on this but probably you can take a look - and in that case I guess CometD has to handle it.

Thanks for the support. I'll continue playing with this. I'll try different Tomcat versions and see if I can find one that works.

@boris-petrov
Copy link
Author

Also, could you please try the CometD demo and any other app that you may have (running it on Tomcat 9.0.14) and throttling the network speed of the browser (Chrome can do it in the Network tab in the dev tools)? I think this helps triggering the issue. Maybe this way you will be able to reproduce it.

@boris-petrov
Copy link
Author

boris-petrov commented Feb 20, 2019

@sbordet - I wanted to open an issue in Tomcat about this and I noticed this thread there - it says that RemoteEndpoint.Async#sendText(String, SendHandler) is not thread-safe. And they've closed it saying that the spec says nothing about thread-safety. Is it possible that CometD calls it concurrently and that's why it fails? I want to know before I open an issue there.

P.S. Oh, I just saw that actually you've opened the issue. Sorry. So this is taken into consideration in CometD?

@sbordet
Copy link
Member

sbordet commented Feb 20, 2019

@boris-petrov after that issue we have modified CometD to not call sendText() concurrently, so yes we are using the WebSocket APIs properly.

@boris-petrov
Copy link
Author

@sbordet - I downgraded to CometD 3.1.8 and the problem goes away. CometD 4 introduced async methods for authorization and some other things which I guess use the async IO implementation of Tomcat. What I think happens is that Tomcat 9 (all versions) have a deadlock in their code. CometD 3 uses synchronous IO which works correctly in Tomcat and that's why it works fine. Does that make sense?

I've opened an issue in Tomcat.

You're free to close this issue here if you feel so as it is not a problem in CometD. If you can help in any way finding and debugging this in Tomcat, that would be great.

Thank you for the support!

@boris-petrov
Copy link
Author

@sbordet - there is a discussion in the Tomcat issue. They need more information about the execution path of CometD that leads to the call to sendText that never calls the callback. Could you please step in there to help a bit? Perhaps the two of you (or the three of us including me) working together will have a better chance of finding the bug than just me and Mark from Tomcat. Thanks!

@sbordet
Copy link
Member

sbordet commented Mar 15, 2019

@boris-petrov I read the Tomcat issue and we are not moving forward. Both here and there we need a reproducer. It's obvious you have one, can you strip it down of sensitive information so that we can run it ourselves?

@boris-petrov
Copy link
Author

Well, that's going to take some effort, but OK, I will try to do it in the next days/weeks. I'll let you know when I have some reproduction.

@boris-petrov
Copy link
Author

This issue seems fixed with the latest Tomcat. In any case it looks like an issue there so I'll close this issue.

Thank you for the support!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants