CometD Client Not Receiving Messages While still being Subscribed & Long Polling #1142
There could be a number of reasons, from bugs to application mistakes, but without evidence it could be anything. Did the server provide proof that the messages were sent to the correct connection?
Thanks for the response. We have proof that the server sent it to the right client id. Is that different from writing to the right connection? We have proof the message did not arrive at the client. So, we enabled debug logs on the Bayeux client, which did not print any indication that the message was received.
Is there a load balancer in between that may have lost the message? If you have proof the server sent it over the network, and you have proof the message did not arrive, then it has been lost.
There is a load balancer in between; however, the server is responding to the request initiated by the client (in this long-polling scenario), so the client should get the response back. If the server were initiating the request, the load balancer might be the culprit. Please correct me if this understanding is wrong. You asked if the message was sent to the right connection. The proof from the server is that the message was delivered to the right client id. As I understand it, there is a connection pool used by the client to long poll; I have no guarantee the message was delivered to the right connection. Our CometD server is on the Salesforce side and, unfortunately, Salesforce does not support the acknowledgment extension.
Is the load balancer also controlled by Salesforce? If you have such evidence, you may show Salesforce the problem and ask them to verify the load balancer, and/or ask them to enable the acknowledgment extension.
The load balancer I mentioned above is on the client side. Where I was going with that is: the client-side load balancer will not be the culprit here. Now, if there is a load balancer on the Salesforce side, can you please explain the scenario in which the client would lose the message? Can you also explain what "the server writing to a wrong connection" means? Sure, I will ask them to enable the acknowledgment extension, just to troubleshoot this problem. Thanks for your advice and time on this, I really appreciate it. I have been battling this issue for a while now, and being on the client side, there is very little I can do about it.
Any component in the network path (load balancers, reverse proxies, firewalls) typically has timeouts to close connections (in order to save resources). If these timeouts are not set up correctly, it is possible that the server is just about to send a message when a downstream (i.e. towards the client) component closes the connection, resulting in the message being lost. This is why we have developed the acknowledgment extension, which guarantees server-to-client delivery. About the "wrong connection" case: with misconfigured load balancers that are not sticky, it may happen that the server writes the message to a connection that no longer belongs to your client. Do you have evidence that the message did not arrive at your client-side load balancer?
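For reference, enabling the acknowledgment extension on the client is a one-line change. The sketch below assumes the CometD 7 client API; the URL is a placeholder, and the extension only provides its delivery guarantee if the server enables its side too (which, per this thread, Salesforce does not).

```java
import org.cometd.client.BayeuxClient;
import org.cometd.client.ext.AckExtension;
import org.cometd.client.http.jetty.JettyHttpClientTransport;
import org.eclipse.jetty.client.HttpClient;

public class AckClientSketch {
    public static void main(String[] args) throws Exception {
        HttpClient httpClient = new HttpClient();
        httpClient.start();

        // Placeholder URL for illustration only.
        BayeuxClient client = new BayeuxClient(
                "https://example.com/cometd",
                new JettyHttpClientTransport(null, httpClient));

        // Negotiates the "ack" field during handshake; the server then
        // retains /meta/connect-delivered messages until acknowledged.
        client.addExtension(new AckExtension());
        client.handshake();
    }
}
```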
Thanks again for your reply. We added a LoggingExtension to see the heartbeat between the client and the server. In the scenarios where we miss a message, we do get a reply back from the server with an EOF exception, indicating there was no message on its side to send us. Example logs:

2022-02-15 16:53:20,674 INFO [pool-550-thread-1] c.e.g.c.s.LoggingExtension Send in meta-channel : message {clientId=**493w, channel=/meta/connect, id=5507, connectionType=long-polling} session : BayeuxClient@8dacdb4[**493w][SessionState@34a2a2a8[CONNECTED]]

A message was produced at 2022-02-15 23:54:18 that the client did not receive. However, we got the below a couple of seconds later:

2022-02-15 16:54:20,055 INFO [HttpClient@4b90d79f-108] c.e.g.c.s.LoggingExtension Receive in meta-channel : message {failure={exception=java.io.EOFException: HttpConnectionOverHTTP@11e6615e::DecryptedEndPoint@4de5017{l=/1,r=**,OPEN,fill=-,flush=-,to=59380/0}, message={clientId=**493w, channel=/meta/connect, id=5507, connectionType=long-polling}, connectionType=long-polling}, channel=/meta/connect, id=5507, successful=false} session : BayeuxClient@8dacdb4[**493w][SessionState@34a2a2a8[CONNECTED]]

So, if a network component were involved, we should have seen something other than the above. This further increases the chance that the load balancer on the server side is not sticky. I will check with them on this. Thanks again, appreciate it!
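A logging extension like the one producing the logs above can be sketched roughly as follows, assuming the CometD 7 client API (ClientSession.Extension with default methods); the class is illustrative, not the poster's actual code.

```java
import org.cometd.bayeux.Message;
import org.cometd.bayeux.client.ClientSession;

// Illustrative sketch: logs meta-channel traffic (/meta/handshake,
// /meta/connect, ...) in both directions.
public class LoggingExtension implements ClientSession.Extension {
    @Override
    public boolean sendMeta(ClientSession session, Message.Mutable message) {
        System.out.printf("Send in meta-channel : message %s session : %s%n", message, session);
        return true; // true = let the message continue to the server
    }

    @Override
    public boolean rcvMeta(ClientSession session, Message.Mutable message) {
        System.out.printf("Receive in meta-channel : message %s session : %s%n", message, session);
        return true; // true = let the message continue through processing
    }
}
// Installed with: bayeuxClient.addExtension(new LoggingExtension());
```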
The fact that you got an EOFException is exactly what you would expect if a network component between the client and the server closed the connection towards your client.
This is quite unlikely: if the load balancer were not sticky, nothing would work, not only for you but for everybody else, and I doubt Salesforce deployed such a broken system. It could be just some network component in between. If you know so many details about which messages got lost and at exactly what times, then with some collaboration from Salesforce you should be able to figure out this issue -- just follow all the hops until you find where the message was lost.
Thanks Simone for the advice. Makes sense. Will look into this further and get back. In our scenario, we are not missing all the events; we are missing the events that are produced exactly at the timeout, or very near it (2-3 seconds). Probably it is the network. Let's see.
Wanted to check with you: what is the recommended timeout at the network layer for long polling? It looks like we have 60 seconds as the timeout at the network layer, and that is when the EOF exception happens.
You want to have the CometD long-poll timeout (plus the time the client allows for the response to travel back, maxNetworkDelay) smaller than the network-layer timeout, so the long poll completes before any network component closes the connection. However, check all the timeouts in network components, because sometimes there are many with different meanings that may trip the CometD long polling (e.g. read timeouts, write timeouts, idle timeouts, read rates, write rates, etc.).
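The timeout relationship can be sanity-checked mechanically. The method and values below are illustrative assumptions, using CometD's commonly cited defaults of a 30-second long-poll timeout and a 10-second client maxNetworkDelay against the 60-second network timeout mentioned in the thread:

```java
// Sketch: check that the network idle timeout outlasts a full long-poll cycle.
public class TimeoutCheck {
    // The server holds /meta/connect for up to `timeoutMs`, then the client
    // waits up to `maxNetworkDelayMs` more for the response to arrive, so a
    // network component must not close the connection before that total.
    static boolean networkTimeoutIsSafe(long timeoutMs, long maxNetworkDelayMs,
                                        long networkIdleTimeoutMs) {
        return networkIdleTimeoutMs > timeoutMs + maxNetworkDelayMs;
    }

    public static void main(String[] args) {
        // 30s + 10s < 60s: the long poll completes before the network timeout.
        System.out.println(networkTimeoutIsSafe(30_000, 10_000, 60_000));
        // 55s + 10s > 60s: the idle timeout trips mid-poll and drops the response.
        System.out.println(networkTimeoutIsSafe(55_000, 10_000, 60_000));
    }
}
```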
Hi @sbordet, do we expect to lose messages whenever we hit this EOFException? Even if there is a tiny chance, then we should stop the Bayeux client on EOFException and re-subscribe from the beginning. I understand that CometD will continue to send '/meta/connect' even after hitting the EOFException, but it will not replay the messages we may have lost. If we take control of the EOFException and restart the subscription from the beginning, where we can specify the last successfully received message id (supported by Salesforce via replayId), then we will not lose any events. Please share your thoughts on this.
CometD already has the acknowledgment extension, which provides a server-to-client delivery guarantee even in the case of short network failures. In the case of longer network failures, you will lose the messages that arrived in the meantime, because the server does not know that your client is connected. Implementing a persistent store of messages for clients that may or may not reconnect (and, if they don't reconnect, will blow up the storage) is outside the scope of CometD and inside application scope, as you may have different application logic for handling that case.
Thanks for the reply, and yes, I agree that implementing persistent storage is outside the scope of CometD; we can develop that ourselves while building our client using CometD. I am not sure if the backend here (Salesforce) implemented the server-to-client delivery extension. Let's assume that they do not have the server-to-client extension implemented, but we need to be sure that we do not lose any messages. In that case, do you think we should take over, handle the EOFException, and start re-subscribing? (Not CometD's default behavior; we could disconnect and connect back from our code.) Below is the stack trace for the exception:

2023-01-05 12:48:54 IST [CLS.0103.0049E] (tid=926) CometD client error for subscriber 'MySalesForce53_listener:SF_53_Enterprise_CDC_Listener': java.io.EOFException: HttpConnectionOverHTTP@4cafbfea::DecryptedEndPoint@60864c65[{l=/10.60.97.119:54362,r=/85.222.152.60:443,OPEN,fill=-,flush=-,to=50120/0}]; Message details: [{clientId=309fzrs, channel=/meta/connect, id=18, connectionType=long-polling}]
java.io.EOFException: HttpConnectionOverHTTP@4cafbfea::DecryptedEndPoint@60864c65[{l=/10.60.97.119:54362,r=***/85.222.152.60:443,OPEN,fill=-,flush=-,to=50120/0}]
The stack trace shows that the server closed the connection before sending the response.
At this point you have to ask Salesforce. If you get that error, you can disconnect the client and re-handshake, but you may have lost messages in between. |
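One way to bound that loss with Salesforce's replay extension is to track the last replayId received per channel, and use it when re-subscribing after the disconnect/re-handshake. The class below is a hypothetical sketch, not part of CometD; Salesforce's convention is that -1 means "new events only" and -2 means "all retained events".

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: remember the last replayId seen per channel so a
// re-subscription after an EOFException can resume from it instead of
// losing the events delivered in between.
public class ReplayTracker {
    private final Map<String, Long> lastReplayId = new ConcurrentHashMap<>();

    // Call from the message listener after successfully processing an event.
    public void record(String channel, long replayId) {
        lastReplayId.merge(channel, replayId, Math::max);
    }

    // Value to pass to the replay extension when re-subscribing;
    // -1 (Salesforce convention) means "new events only" if we have no state.
    public long replayFrom(String channel) {
        return lastReplayId.getOrDefault(channel, -1L);
    }
}
```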
* CAMEL-20918: set MAX_NETWORK_DELAY_OPTION to 120 sec as per cometd/cometd#1142 (comment) and https://developer.salesforce.com/docs/atlas.en-us.api_streaming.meta/api_streaming/using_streaming_api_timeouts.htm
* CAMEL-20918: fix exception handling for the Salesforce Streaming API:
  - execute subscriptionListener in a separate thread, similarly to handshakeListener and connectListener
  - fix SubscriptionHelper#listenerMap remaining empty when an exception occurs while subscribing to a channel
  - remove SubscriptionHelper#handshake, as there is always a thread hanging in this method, per https://issues.apache.org/jira/browse/CAMEL-20388
  - remove the call to closeChannel for topic listeners, as they will be removed by SubscriptionHelper#unsubscribe
* CAMEL-20918: fix a failing build
* CAMEL-20918: set MAX_NETWORK_DELAY_OPTION first to allow overwriting
* CAMEL-20918: update replayId per channel when explicitly subscribed and not after rehandshake
* CAMEL-20918: fix the replay extension not to overwrite replayId per channel if already set
* CAMEL-20918: fix formatting
* CAMEL-20918: remove unnecessary log
CometD version(s)
7.0.5
Java version & vendor (use: java -version)
11.0.7
Description
The CometD client is failing to consume events while still being subscribed and long polling the Salesforce server. We found two scenarios in which this happened.
Scenario 1 - The event was produced at exactly the instant the long poll ended. When the client connected back for the next long poll, it did not receive the event.
Scenario 2 - At times, even when the event was produced while the client was in the middle of a long poll, the server did not send the event to the client; the client returned with an EOF exception from the server.
We worked with Salesforce on this, and they confirmed that the events were delivered to the clientIds that we claimed missed those events.
Any insights on why the client is missing these events?
How to reproduce?
No specific steps to reproduce.