Messages timeout when log level is changed #820
Looks very strange. Can you pack a simple reproducible case and attach it here? |
Well, as I said, creating a reproduction is an impossible task, so we'll have to approach this in another way. After a little bit of trying, I found that downgrading only the Java version of CometD to 4.0.0 makes the problem go away. 4.0.1 and 4.0.2 don't work. So something in Java changed between 4.0.0 and 4.0.1 which broke it. I'm willing to try patches/new versions on my setup to help you debug. |
@boris-petrov I have changed the logging of the CometD Demo to Logback, used your I don't clearly understand what you mean by "front-end". Are you using Or you mean publishing from the browser using You can search and replace in the whole CometD source What if you use Log4J instead of LogBack? |
Sorry, I should have been more specific. I'm using I don't think it has anything to do with Logback. I think it is a timing issue which the logging just hides as it takes a lot of time to print a million things to the console. I will try what you suggest if all else fails as debugging. Can we start with what I mentioned - that something broke between 4.0.0 and 4.0.1 in the Java code. What changed there? Can you point me to the code that was modified so we can start digging from there? |
@boris-petrov Here is the list of issues fixed: https://github.com/cometd/cometd/milestone/14?closed=1.
I don't see how what you experience would not have been caught by our continuous integration... it may be a very rare case, but you seem to be able to reproduce at will. You seem to have a very fundamental problem (you said you can't even get the Can you grab a network trace with Wireshark? |
Yes, I reproduce it consistently. Browser is the latest Chromium 71.0.3578.98, on Arch Linux with OpenJDK 11.0.1. I run our integration tests with a big sleep somewhere; when the test stops there, I open Chromium and navigate to the page. It opens fine and the websocket works fine, but when I click somewhere and something is supposed to happen over the websocket, nothing does and the messages time out. I think the traffic from the browser is fine, as the Frames section of the Network tab in Chromium shows that a correct message is sent. I just don't think it is handled properly in the backend. Perhaps I can put println statements or something? Where is the code that gets the request? Tell me where to put logging and what exactly, and also how to build the project, and I'll start debugging. I have no other idea what to do or how else to help.
Show me the code.
That is too generic - I cannot help you if I don't know the details. Big sleeps in WebSocket will halt all CometD processing, reading from the network, etc. To build CometD: https://docs.cometd.org/current/reference/#_build. I don't understand how your case will work with log DEBUG but not with log INFO. |
The big sleep is just to pause the test. We are using Capybara and in the middle of one of the tests I just put a I also don't understand how any of this is possible but the good part is that I reproduce it consistently and this means easier debugging. That's why I need your help to know where to start from and help you figure it out. Thanks for the link for building CometD, please tell me where to put some initial logging where we are expecting this message to get to CometD in the beginning. |
Assuming you are using the |
OK, I added these loggings. I'm seeing:
As the first message, then the second one contains a bunch of subscriptions and messages which are application specific (they are batched), the third message resubscribes for the subscriptions (as now I see that the app logic does that) and then only:
In the frontend, the handshake happens successfully but then after a while I start receiving on |
@boris-petrov the Can you enable DEBUG logging on the client (https://docs.cometd.org/current/reference/#_troubleshooting_logging_javascript)? Can you try to use the HTTP transport, rather than WebSocket? Can you take a network trace via Wireshark to know exactly what passes through the network? |
OK, debug logging shows something like:
This is for the 21st message, as you can see. The first 20 are sent and received fine. They also get responses from the server. The 21st, which is the Using long polling seems to fix the problem, although it does work very slowly. Like, it takes 10 seconds for some responses to come through but I guess that's normal. Nevertheless it does work in the end. I can't seem to figure out how to work with Wireshark. I capture all traffic and set a display filter for |
I don't understand this sentence, what do you mean by "slowly"? Taking 10 seconds for some responses is normal if you have a Can you please try without minifying the JavaScript? Do you see message 21 arriving on server? As for Wireshark, you should capture traffic on port |
This is what I have as a server configuration:
@WebListener
public class BayeuxInitializer implements ServletContextAttributeListener {
    @Override
    public void attributeAdded(ServletContextAttributeEvent event) {
        if (BayeuxServer.ATTRIBUTE.equals(event.getName())) {
            BayeuxServer bayeux = (BayeuxServer) event.getValue();
            bayeux.setSecurityPolicy(new StreamingSecurityPolicy());
            // https://docs.cometd.org/current/reference/#_java_server_configuration
            bayeux.setOption("ws.messagesPerFrame", 5);
            bayeux.setOption("ws.maxMessageSize", 655_200);
            // "ws.enableExtension.permessage-deflate" is true by default so compression is enabled for WebSockets
            bayeux.addListener(new StreamingService.SessionListenerImp());
            bayeux.addListener(new StreamingService.SubscriptionListenerImp());
            // https://docs.cometd.org/current/reference/#_extensions_acknowledge
            bayeux.addExtension(new AcknowledgedMessagesExtension());
        }
    }
}
<servlet>
    <servlet-name>cometd</servlet-name>
    <servlet-class>org.cometd.annotation.AnnotationCometDServlet</servlet-class>
    <init-param>
        <param-name>services</param-name>
        <param-value>com.company.StreamingService</param-value>
    </init-param>
    <init-param>
        <param-name>ws.cometdURLMapping</param-name>
        <param-value>/streaming/*</param-value>
    </init-param>
    <init-param>
        <param-name>jsonContext</param-name>
        <param-value>com.company.JsonSerializer</param-value>
    </init-param>
    <load-on-startup>1</load-on-startup>
    <async-supported>true</async-supported>
</servlet>
On the frontend:
this.comet.configure({
    appendMessageTypeToURL: false, // https://groups.google.com/d/msg/cometd-users/DMTZ-Y2bOSg/EmxFJJQICwAJ
    autoBatch: true,
    logLevel,
    maxNetworkDelay: 30000,
    url,
});
I understand the timeouts you mention on the longpolling but what do timeouts have to do with WebSockets? What I showed was without minifying JS. All the tests I do with unminified JS. I see message 21 on the server, yes:
The first line is from the In Wireshark I see 4 messages sent from the client to the server - as many as I see in the server logs. I see the server responses for the first 3 (which I explained before are an initial handshake message, then a bunch of subscribes + publishes, then a couple of resubscriptions (due to stupidity of our code) and then the What next?
Whoa, this is big. AFAIR browsers will choke with messages larger than 8 KiB.
The server side Seems like something is blocking the processing of the server-side |
OK, I'll leave the default value for
As for the thread dump - are you suggesting I take a thread dump after the server receives the messages but before they time out? Are we expecting there to be a thread that is blocked?
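A thread dump is usually taken from outside the process, for example with the JDK's jstack tool. For reference, the same information is available from inside the JVM through the standard java.lang.management API. The following is only a minimal sketch, assuming it runs inside the server JVM (for instance from a temporary debug hook); the class name ThreadDumpSearch and the "org.cometd" prefix are illustrative choices, not something taken from this thread.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.ArrayList;
import java.util.List;

/** Lists live threads whose stack contains a frame from a given class-name prefix. */
final class ThreadDumpSearch {

    /** Returns one line per thread that has at least one frame whose class name starts with the prefix. */
    static List<String> threadsIn(String classNamePrefix) {
        List<String> matches = new ArrayList<>();
        // dumpAllThreads(false, false): skip monitor/synchronizer details, keep the stack traces.
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : infos) {
            for (StackTraceElement frame : info.getStackTrace()) {
                if (frame.getClassName().startsWith(classNamePrefix)) {
                    matches.add(info.getThreadName() + " [" + info.getThreadState() + "] at " + frame);
                    break; // one line per thread is enough
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // "org.cometd" is only an example prefix; pass whatever package you are hunting for.
        String prefix = args.length > 0 ? args[0] : "org.cometd";
        threadsIn(prefix).forEach(System.out::println);
    }
}
```

Threads that show up in the same place across several dumps taken a few seconds apart are the ones worth looking at first.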
In the thread dump I searched for
A few minutes later there are a few of those with the same stacktrace due to the frontend reconnecting a bunch of times. Any ideas? |
So you are using Tomcat, not Jetty. I need all the stack traces, not just one. Also, try Jetty instead of Tomcat. Do you have the same issue? |
Using Jetty instead of Tomcat is going to be a bit tricky as we use a couple of Tomcat-specific things. Let's leave it as a last resort for now. Please give me an email to send you the thread dump. |
Sorry, I saw it in your profile. Sending now. |
Zip and attach the thread dump here for reference, and remember to obfuscate things you don't want everybody to see (passwords, IP addresses, etc.) |
Here is the dump. threads.zip Ping me when you're done analyzing and/or need more information. |
A WebSocket message is received, and passed along for processing. From the logging you've got, seems to be a Seems to be a case of the application not completing a promise. Try to run the CometD Demo in Tomcat (just deploy the CometD Demo war in Tomcat). |
I don't have any listeners on I debugged a bit more and I'm seeing that this call:
    boolean queued = flusher.queue(new Entry(context, messages, Promise.from(y -> {
        promise.succeed(null);
        writeComplete(context, messages);
    }, promise::fail)));
never executes the promise. This is inside the Check out this repo. I've added a bunch of println's there. Here is the output when I refresh the webpage. Please check it out and compare with the loggings that I've put. I don't understand the code too much so I need some help. In the output after line 139 you can see that it only processes messages and never gets to a point where this Flusher is Is that enough information for you to go on with debugging? I can put more printing anywhere you want and help you figure it out. Just tell me what to print and where so we can get to the bottom of it.
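One way to make a callback that is never invoked visible, instead of inferring it from missing println output, is to put a deadline on the completion. The sketch below uses only java.util.concurrent (CompletableFuture.orTimeout needs Java 9 or later). It does not use CometD's Promise type: the Consumer-based callback shape and the class name CompletionWatchdog are assumptions made for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Consumer;

/** Turns "the callback was never called" into an explicit timeout failure. */
final class CompletionWatchdog {

    /**
     * Starts an operation that is expected to call the supplied callback with null on success
     * or with the failure on error, and fails the returned future if neither happens in time.
     */
    static CompletableFuture<Void> watch(Consumer<Consumer<Throwable>> operation, long timeoutMillis) {
        CompletableFuture<Void> done = new CompletableFuture<>();
        operation.accept(failure -> {
            if (failure == null) {
                done.complete(null);
            } else {
                done.completeExceptionally(failure);
            }
        });
        // Completes the future with a TimeoutException if nobody completed it first.
        return done.orTimeout(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    /** Returns "completed", "timed out" or "failed: ..." for the given operation. */
    static String outcome(Consumer<Consumer<Throwable>> operation, long timeoutMillis) {
        try {
            watch(operation, timeoutMillis).get();
            return "completed";
        } catch (ExecutionException e) {
            return e.getCause() instanceof TimeoutException ? "timed out" : "failed: " + e.getCause();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        System.out.println(outcome(callback -> callback.accept(null), 200)); // prints "completed"
        System.out.println(outcome(callback -> { /* never calls back */ }, 200)); // prints "timed out"
    }
}
```

A watchdog like this only reports the symptom; it does not say which layer dropped the callback, so it complements the thread dump rather than replacing it.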
@boris-petrov I tried the CometD demo with both Jetty and Tomcat 9.0.14 and they work fine for me. You need to detail exactly what you are doing, what is the problem, what do you expect instead, what is your client and server configuration. Is the CometD demo working for you in Tomcat? |
In this previous message of mine I mentioned the Chromium version that I use (71.0.3578.98) and the OpenJDK version (I've since updated to 11.0.2). Tomcat is the latest 9.0.14. I never said that the repo that I gave doesn't work in some way. It is the original CometD 4.0.2 code plus some I already pasted the server and client configuration above. I also mentioned a few times that I cannot prepare you a reproduction easily. I've already spent many hours on this and I cannot spend full days on it to create a reproduction. I also think I explained multiple times what the problem is. A message times out as the server doesn't send a response. The logging issue, as I mentioned, is just a side effect - if there is a race condition somewhere, the logging will just mask it and that's why it works. When logging is disabled, the code runs fast and the deadlock/whatever happens. This, of course, is just a guess. I never said there is an issue in Tomcat, nor that this repo that I created doesn't work. I built CometD from that repo with the I'm trying hard to give you all the needed information to find the problem (short of giving you access to our codebase). Not sure how else I can help.
Is the CometD demo working for you in Tomcat? |
Can you attach a network trace with Wireshark? |
@boris-petrov can you please put another log line in From your Can you please check what is the default max message size for Tomcat? It could be that Tomcat has a problem writing the messages, but it does not notify the Another simple thing to try is to remove the configuration
Thank you for the support. The CometD demo works fine as well as our application - we have hundreds of tests that run many times a day and all works fine. It's just this corner case that is a problem. I'll send a Wireshark trace later if needed as it does contain app-specific data which I'm not sure I can mangle. I put a log line as you requested. Here is the output. The last few lines are:
So your guess is correct - I'm not sure what "default max message size for Tomcat" means. I haven't changed any Tomcat configuration. What should I look for? I already tried removing the By the way, please keep in mind what I said a few days ago - CometD 4.0.0 works and this problem doesn't happen. 4.0.1 and 4.0.2 "break". Not sure what changed there, just to note it if you have any idea. |
This is the Tomcat implementation of the My theory was that Tomcat was choking and forgetting to notify the callback because the message was too large. A number of things have changed between CometD 4.0.0 and 4.0.1, but I don't think they are causing this issue, which seems related to the Can you enable the Tomcat DEBUG logging and see if there are exceptions or errors when the messages are being sent? We are now into debugging Tomcat, where I can give limited help. Alternatively, if you could switch to Jetty and verify that this issue does not happen, then we have confirmation that it is a Tomcat bug.
@boris-petrov news on this issue? |
Yes, sorry, I had other things to do today. I'll look at this again either tomorrow, or first thing next week. I'm also trying to figure out how to set up the embedded Tomcat's logging preferences - if you have any ideas, you might save me some time. :) Another thing I noticed yesterday - the same issue happened in production. I was getting message timeouts, I refreshed the page 3 times, the fourth time did a clear-all-caches-hard-refresh and it started working. But on the backend side if I do a thread dump I can see that there are 4 "deadlocked" threads with the same stacktrace as above. So this issue does happen in production too. Just mentioning that, I'm not saying anything else. So, again, I'll write as soon as I can. Thanks!
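Threads that sit in the same stack trace forever are not necessarily deadlocked in the JVM's sense of the word; they may simply be waiting for a notification that never arrives. The JVM can report true lock cycles itself through ThreadMXBean.findDeadlockedThreads(). A minimal sketch, assuming it runs inside the affected JVM (the class name DeadlockCheck is made up for illustration):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

/** Asks the JVM whether any threads are in a genuine lock cycle. */
final class DeadlockCheck {

    /** Returns the names of threads the JVM considers deadlocked; empty if there are none. */
    static List<String> deadlockedThreadNames() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads(); // null when no deadlock is detected
        List<String> names = new ArrayList<>();
        if (ids != null) {
            for (ThreadInfo info : threads.getThreadInfo(ids)) {
                if (info != null) { // a thread may have terminated in the meantime
                    names.add(info.getThreadName());
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> names = deadlockedThreadNames();
        System.out.println(names.isEmpty() ? "no JVM-level deadlock detected" : "deadlocked: " + names);
    }
}
```

An empty result with threads still stuck points towards a missed wake-up or a callback that was never called, rather than two threads holding each other's locks.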
@boris-petrov for Tomcat Logging you want to read this: https://tomcat.apache.org/tomcat-9.0-doc/logging.html. |
OK, so I managed to enable logging in Tomcat and enabled it for the Also, I reverted to 4.0.0 in one of our installations because of this issue and it started working. Perhaps there is some bug in Tomcat which you didn't hit before 4.0.1 but then it started happening... P.S. Actually no, it happens also on 4.0.0 but I guess less often... or something. |
@boris-petrov if I have a way to reproduce, I would do the debug. Otherwise you need to follow the Tomcat calls in the debugger and see why it does not complete the callback, or perhaps do what you have done with CometD, add your own logging to Tomcat code. What I see from the Tomcat implementation is that in many, many places it throws exceptions without completing the |
I opened a bug in Tomcat about that. Let's see what happens there. So yes, it might be the case that an error happens and the handler is not called. I'm accessing the website that I'm testing with via VPN access from the other side of the world with generally not very good internet so it is possible that a timeout happens or something and that's why it fails but the error never reaches CometD... |
@sbordet - actually I put a |
The Tomcat WebSocket implementation is supposed to call the While exceptions do not happen in your case, if they happen, the Something else is going on in Tomcat that causes the |
@sbordet - they answered on the Tomcat bug. They're saying that throwing an exception (at least in some cases) is actually according to the specs. I'm no expert on this but probably you can take a look - and in that case I guess CometD has to handle it. Thanks for the support. I'll continue playing with this. I'll try different Tomcat versions and see if I can find one that works. |
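If an asynchronous send may report failure either by invoking its handler or by throwing from the call itself, the caller can funnel both paths into the same handler and make sure it fires at most once. The sketch below illustrates that pattern with plain Java. AsyncSender is a hypothetical stand-in, not the javax.websocket or CometD interface, and this is not a description of how CometD actually handles the case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

/** Routes synchronous exceptions and asynchronous results to one handler, invoked at most once. */
final class SafeSend {

    /** Hypothetical async send API: may call the handler (null means success) or throw. */
    interface AsyncSender {
        void send(String text, Consumer<Throwable> handler);
    }

    static void send(AsyncSender sender, String text, Consumer<Throwable> handler) {
        AtomicBoolean notified = new AtomicBoolean();
        // Guard so the real handler sees exactly one outcome even if both paths fire.
        Consumer<Throwable> once = failure -> {
            if (notified.compareAndSet(false, true)) {
                handler.accept(failure);
            }
        };
        try {
            sender.send(text, once);
        } catch (RuntimeException x) {
            // The send threw instead of calling back: report it through the same handler.
            once.accept(x);
        }
    }

    /** Runs send() and records what the handler saw: "ok" or the failure message. */
    static List<String> record(AsyncSender sender) {
        List<String> seen = new ArrayList<>();
        send(sender, "hello", failure -> seen.add(failure == null ? "ok" : failure.getMessage()));
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(record((text, handler) -> handler.accept(null))); // prints [ok]
        System.out.println(record((text, handler) -> { throw new IllegalStateException("closed"); })); // prints [closed]
    }
}
```

This covers the case where an exception is thrown; it does nothing for a send that neither throws nor calls back, which still needs a timeout.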
Also, could you please try the CometD demo and any other app that you may have (running it on Tomcat 9.0.14) and throttling the network speed of the browser (Chrome can do it in the Network tab in the dev tools)? I think this helps triggering the issue. Maybe this way you will be able to reproduce it. |
@sbordet - I wanted to open an issue in Tomcat about this and I noticed this thread there - it says that P.S. Oh, I just saw that actually you've opened the issue. Sorry. So this is taken into consideration in CometD? |
@boris-petrov after that issue we have modified CometD to not call |
@sbordet - I downgraded to CometD 3.1.8 and the problem goes away. CometD 4 introduced async methods for authorization and some other things which I guess use the async IO implementation of Tomcat. What I think happens is that Tomcat 9 (all versions) has a deadlock in its code. CometD 3 uses synchronous IO which works correctly in Tomcat and that's why it works fine. Does that make sense? I've opened an issue in Tomcat. You're free to close this issue here if you feel so as it is not a problem in CometD. If you can help in any way finding and debugging this in Tomcat, that would be great. Thank you for the support!
@sbordet - there is a discussion in the Tomcat issue. They need more information about the execution path of CometD that leads to the call to |
@boris-petrov I read the Tomcat issue and we are not moving forward. Both here and there we need a reproducer. It's obvious you have one, can you strip it down of sensitive information so that we can run it ourselves? |
Well, that's going to take some effort, but OK, I will try to do it in the next days/weeks. I'll let you know when I have some reproduction. |
This issue seems fixed with the latest Tomcat. In any case it looks like an issue there so I'll close this issue. Thank you for the support! |
Not sure how to explain this. It's very weird. Here goes. Using CometD 4.0.2.
Having the following logback.xml:
causes published messages from the frontend to time out. These messages reach neither my @Listener handlers, nor the SecurityPolicy's can{Publish/whatever} methods on the Java backend.
Changing the level in the logback.xml file to DEBUG fixes that issue and all messages are sent and received.
Both the timeouts and the "fix" happen every single time I try. I reproduce them consistently by just changing the log level. I've done this at least 10 times each.
No idea what is going on. I've been banging my head on this for a few hours now.
How do I proceed with debugging this? That's what I wanted to do by changing the log level... but the issue disappeared. :D The only thing I can imagine right now as a possible source of the issue is some race condition in CometD which causes messages to be "dropped" but when there's a bunch of logging, everything is slower and that's why it works. Any other ideas/suggestions? How do I approach this? Creating a reproduction will probably be an impossible task so I prefer to aid with debugging in any way I can.