
CometD Client Not Receiving Messages While Still Being Subscribed & Long Polling #1142

Open
techy-spongebob opened this issue Feb 17, 2022 · 16 comments


@techy-spongebob

CometD version(s)
7.0.5

Java version & vendor (use: java -version)
11.0.7

Description
The CometD client fails to consume some events while still being subscribed and long polling the Salesforce server. We found two scenarios in which this happened.

Scenario 1 - The event was produced at exactly the same instant the long poll ended. When the client connected back for the next long poll, it did not receive the event.
Scenario 2 - At times, even when the event was produced while the client was in the middle of a long poll, the server did not send any event to the client. The client's request terminated with an EOFException from the server.
We worked with Salesforce on this and they confirmed that the events were delivered to the clientIds that we reported as having missed those events.

Any insights on why the client is missing these events?

How to reproduce?
No specific steps to reproduce.

@sbordet
Member

sbordet commented Feb 18, 2022

Any insights on why the client is missing these events?

There could be a number of reasons, from bugs to application mistakes, but without evidence it could be anything.

Did the server provide proof that the messages were sent to the correct connection?
Do you have proof that the messages arrived at the client?
Do you have client-side DEBUG logs that can be analyzed to see whether the message was, for some reason, discarded?

@techy-spongebob
Author

Thanks for the response.

We have proof that the server sent it to the right client id. Is that different from writing to the right connection?

We have proof the message did not arrive at the client: we enabled DEBUG logs on the BayeuxClient, and it did not log the message as received.

@sbordet
Member

sbordet commented Feb 18, 2022

Is there a load balancer in between that may have lost the message?

If you have proof the server sent it over the network, and you have proof the message did not arrive, then it's been lost.
Do you have the acknowledgment extension enabled (which would guarantee server-to-client message delivery)?
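
For reference, the acknowledgment extension is enabled on the client by adding it to the BayeuxClient before handshaking. A minimal sketch, assuming the CometD 7 Jetty HTTP client transport class (the URL is a placeholder, and the server must enable its own side of the extension for the delivery guarantee to apply):

```java
import java.util.HashMap;

import org.cometd.client.BayeuxClient;
import org.cometd.client.ext.AckExtension;
import org.cometd.client.http.jetty.JettyHttpClientTransport;
import org.eclipse.jetty.client.HttpClient;

public class AckEnabledClient {
    public static void main(String[] args) throws Exception {
        HttpClient httpClient = new HttpClient();
        httpClient.start();

        // Placeholder endpoint; replace with the real Bayeux URL.
        String url = "https://example.com/cometd";
        BayeuxClient client = new BayeuxClient(url,
                new JettyHttpClientTransport(new HashMap<>(), httpClient));

        // The ack extension only guarantees server-to-client delivery if the
        // server has its counterpart enabled as well.
        client.addExtension(new AckExtension());

        client.handshake();
        client.waitFor(5000, BayeuxClient.State.CONNECTED);
    }
}
```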

@techy-spongebob
Author

There is a load balancer in between; however, the server is responding to a request initiated by the client (in this long polling scenario), so the client should get the response back.

If the server were initiating the request, then the load balancer might be the culprit here. Please correct me if this understanding is wrong.

You asked whether the message was sent to the right connection. The proof from the server is that the message was delivered to the right client id. As I understand it, there is a connection pool used by the client to long poll; I have no guarantee that the message was delivered to the right connection.

Our CometD server is on the Salesforce side and, unfortunately, Salesforce does not support the acknowledgment extension.

@sbordet
Member

sbordet commented Feb 18, 2022

Is the load balancer also controlled by Salesforce?

If you have such evidence, then you may talk to Salesforce, show them the problem, and ask them to verify the load balancer and/or to enable the acknowledgment extension.

@techy-spongebob
Author

techy-spongebob commented Feb 19, 2022

The load balancer I mentioned above is on the client side. Where I was going with that is that the client-side load balancer should not be the culprit here.

Now, if there is a load balancer on the Salesforce side, can you please explain in what scenario the client would lose a message?

Can you also explain the case of the server writing to a wrong connection? What does that mean?

Sure, I will ask them to enable the acknowledgement extension just to troubleshoot this problem.

Thanks for your advice and time on this, I really appreciate it. I have been battling this issue for a while now; being on the client side, there was very little I could do about it.

@sbordet
Member

sbordet commented Feb 21, 2022

Any component in the network path (load balancers, reverse proxies, firewalls) typically has timeouts to close connections (in order to save resources).

If these timeouts are not set up correctly, it is possible that the server is just about to send a message when a downstream (i.e. towards the client) component closes the connection, resulting in the message being lost.
Sure, the client will be notified that the connection to the server was broken, but in the absence of the acknowledgment extension the message is lost, because the server has already sent it and no longer holds a reference to it.

This is why we have developed the acknowledgement extension, which guarantees server-to-client delivery.

About the "wrong connection" case: with misconfigured load balancers that are not sticky, it may happen that /meta/connect requests end up in one server, but other messages sent on other connections end up in a different server, causing all sorts of problems.

Do you have evidence that the message did not arrive at your client-side load balancer?
If so, it's either another network component, or the server.

@techy-spongebob
Author

Thanks again for your reply.

So, we added a logging extension to see the heartbeat between the client and the server.

In the scenarios where we miss the message, we do get a reply back from the server with an EOFException, indicating there were no messages on its side to send to us.

Example logs:

2022-02-15 16:53:20,674 INFO [pool-550-thread-1] c.e.g.c.s.*LoggingExtension Send in meta-channel : message {clientId=**493w, channel=/meta/connect, id=5507, connectionType=long-polling} session : BayeuxClient@8dacdb4[**493w][SessionState@34a2a2a8[CONNECTED]]

There was a message produced at 2022-02-15 23:54:18 that the client did not receive. However, the client got the below a couple of seconds later.

2022-02-15 16:54:20,055 INFO [HttpClient@4b90d79f-108] c.e.g.c.s.LoggingExtension Receive in meta-channel : message {failure={exception=java.io.EOFException: HttpConnectionOverHTTP@11e6615e::DecryptedEndPoint@4de5017{l=/1,r=**,OPEN,fill=-,flush=-,to=59380/0}, message={clientId=**493w, channel=/meta/connect, id=5507, connectionType=long-polling}, connectionType=long-polling}, channel=/meta/connect, id=5507, successful=false} session : BayeuxClient@8dacdb4[**493w][SessionState@34a2a2a8[CONNECTED]]
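
For context, the LoggingExtension in these logs is application code, not part of CometD; a client-side extension of this kind can be built on CometD's ClientSession.Extension callbacks. A minimal sketch (the logger setup is illustrative):

```java
import org.cometd.bayeux.Message;
import org.cometd.bayeux.client.ClientSession;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Logs every meta message sent and received by the client session.
public class LoggingExtension implements ClientSession.Extension {
    private static final Logger LOG = LoggerFactory.getLogger(LoggingExtension.class);

    @Override
    public boolean sendMeta(ClientSession session, Message.Mutable message) {
        LOG.info("Send in meta-channel : message {} session : {}", message, session);
        return true; // returning true lets the message continue through the extension chain
    }

    @Override
    public boolean rcvMeta(ClientSession session, Message.Mutable message) {
        LOG.info("Receive in meta-channel : message {} session : {}", message, session);
        return true;
    }
}

// Registered with: bayeuxClient.addExtension(new LoggingExtension());
```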

So, if a network component were the problem, we should have seen something else and not the EOF response shown above.

This further increases the chance that the load balancer is not sticky on the server side. I will check with them on this.

Thanks again, appreciate it!

@sbordet
Member

sbordet commented Feb 21, 2022

The fact that you got an EOFException means that there was a network failure, which likely caused the loss of the message.

This is exactly what you would expect if a network component between the client and the server closed the connection towards your client.

This further increases the chance that the load balancer is not sticky on the server side

This is quite unlikely: nothing would work, not only for you but for everybody else, and I doubt Salesforce deployed such a broken system.

It could be just some network component in between. Since you know which messages got lost and at exactly what times, with some collaboration from Salesforce you should be able to figure out this issue -- just follow all the hops until you find where the message was lost.

@techy-spongebob
Author

techy-spongebob commented Feb 21, 2022

Thanks Simone for the advice. Makes sense. Will look into this further and get back.

In our scenario, we are not missing all the events. We are missing the events that are produced exactly at the timeout or very near it (within 2-3 seconds). Probably it is the network. Let's see.

@techy-spongebob
Author

Wanted to check with you: what is the recommended timeout at the network layer for long polling?

It looks like we have a 60 second timeout at the network layer, and that is when the EOFException happens.

@sbordet
Member

sbordet commented Feb 22, 2022

You want the CometD timeout parameter, which is the long-polling period, to be less than the network idle timeouts by some noticeable margin.
For a 60 second network idle timeout, I'd say anything less than or equal to 55 seconds.
The default value of 30 seconds is good enough.

However, check all the timeouts in the network components, because sometimes there are many with different meanings that may trip up the CometD long-polling (e.g. read timeouts, write timeouts, idle timeouts, read rates, write rates, etc.).
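
The timeout parameter above is a server-side setting, so with Salesforce hosting the server the client-side knob that interacts with it is the maxNetworkDelay transport option (how long the client waits beyond the long-poll period before failing the /meta/connect request). A minimal sketch of tuning it, assuming the CometD 7 Jetty client transport and illustrative values:

```java
import java.util.HashMap;
import java.util.Map;

import org.cometd.client.BayeuxClient;
import org.cometd.client.http.jetty.JettyHttpClientTransport;
import org.cometd.client.transport.ClientTransport;
import org.eclipse.jetty.client.HttpClient;

public class TimeoutTunedClient {
    public static void main(String[] args) throws Exception {
        HttpClient httpClient = new HttpClient();
        // Keep the HTTP idle timeout above the long-poll period plus maxNetworkDelay.
        httpClient.setIdleTimeout(120_000);
        httpClient.start();

        Map<String, Object> options = new HashMap<>();
        // How long the client waits, beyond the server's long-poll timeout,
        // before considering the /meta/connect request failed.
        options.put(ClientTransport.MAX_NETWORK_DELAY_OPTION, 15_000L);

        BayeuxClient client = new BayeuxClient("https://example.com/cometd",
                new JettyHttpClientTransport(options, httpClient));
        client.handshake();
    }
}
```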

@dyutiman

dyutiman commented Jan 9, 2023

Hi @sbordet,

Do we expect to lose messages whenever we hit this EOFException? Even if there is a tiny chance, then we should stop the BayeuxClient on EOFException and re-subscribe from the beginning.

I understand that CometD will continue to send '/meta/connect' even after hitting the EOFException, but it will not replay the messages we may have lost.

But if we take control of the EOFException and start the subscription from the beginning, where we could specify the last successfully received message id (supported by Salesforce via replayId), then we will not lose any events.

Please share your thoughts on this.
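
For context, the replayId mechanism is usually implemented as a ClientSession.Extension that records the last replayId seen per channel and injects it into the ext field of /meta/subscribe messages. A rough sketch, modeled on the publicly documented Salesforce replay format (the payload field names and the -1 default are assumptions taken from that format, not part of CometD itself):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.cometd.bayeux.Channel;
import org.cometd.bayeux.Message;
import org.cometd.bayeux.client.ClientSession;

// Tracks the last replayId per channel and asks the server to replay from it on subscribe.
public class ReplayExtension implements ClientSession.Extension {
    private final Map<String, Long> replayIds = new ConcurrentHashMap<>();

    @Override
    public boolean rcv(ClientSession session, Message.Mutable message) {
        // Salesforce events typically carry the replayId inside data.event.replayId (assumed shape).
        Map<String, Object> data = message.getDataAsMap();
        if (data != null && data.get("event") instanceof Map) {
            Object replayId = ((Map<?, ?>) data.get("event")).get("replayId");
            if (replayId instanceof Number) {
                replayIds.put(message.getChannel(), ((Number) replayId).longValue());
            }
        }
        return true;
    }

    @Override
    public boolean sendMeta(ClientSession session, Message.Mutable message) {
        if (Channel.META_SUBSCRIBE.equals(message.getChannel())) {
            String channel = (String) message.get(Message.SUBSCRIPTION_FIELD);
            if (channel != null) {
                // -1 means "only new events"; a stored replayId resumes from that point.
                long replayId = replayIds.getOrDefault(channel, -1L);
                message.getExt(true).put("replay", Map.of(channel, replayId));
            }
        }
        return true;
    }
}
```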


@sbordet
Member

sbordet commented Jan 9, 2023

CometD already has the acknowledgement extension, which provides a server-to-client delivery guarantee even in case of short network failures.

In case of longer network failures, you will lose the messages that arrived in the meantime, because the server does not know whether your client is still connected.

Implementing a persistent message store for clients that may or may not reconnect (and that, if they don't reconnect, will blow up the storage) is outside the scope of CometD and within application scope, as you may have different application logic for handling that case.

@dyutiman

dyutiman commented Jan 9, 2023

Thanks for the reply, and yes, I agree that the implementation of persistent storage is outside the scope of CometD. We can develop that ourselves while building our client using CometD.

And I am not sure whether the backend here (Salesforce) implemented the server-to-client delivery extension. Let's assume that they do not have it implemented, but we need to be sure that we do not lose any messages.

In that case, do you think we should take over, handle the EOFException, and start re-subscribing? (Not the CometD default implementation; we could disconnect and connect back from our own code.)

Below is the stack trace for the exception:

2023-01-05 12:48:54 IST [CLS.0103.0049E] (tid=926) CometD client error for subscriber 'MySalesForce53_listener:SF_53_Enterprise_CDC_Listener': java.io.EOFException: HttpConnectionOverHTTP@4cafbfea::DecryptedEndPoint@60864c65[{l=/10.60.97.119:54362,r=/85.222.152.60:443,OPEN,fill=-,flush=-,to=50120/0}]; Message details: [{clientId=309fzrs, channel=/meta/connect, id=18, connectionType=long-polling}] java.io.EOFException: HttpConnectionOverHTTP@4cafbfea::DecryptedEndPoint@60864c65[{l=/10.60.97.119:54362,r=***/85.222.152.60:443,OPEN,fill=-,flush=-,to=50120/0}]
at org.eclipse.jetty.client.http.HttpReceiverOverHTTP.earlyEOF(HttpReceiverOverHTTP.java:395) ~[?:?]
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:1605) ~[?:?]
at org.eclipse.jetty.client.http.HttpReceiverOverHTTP.shutdown(HttpReceiverOverHTTP.java:281) ~[?:?]
at org.eclipse.jetty.client.http.HttpReceiverOverHTTP.process(HttpReceiverOverHTTP.java:197) ~[?:?]
at org.eclipse.jetty.client.http.HttpReceiverOverHTTP.receive(HttpReceiverOverHTTP.java:91) ~[?:?]
at org.eclipse.jetty.client.http.HttpChannelOverHTTP.receive(HttpChannelOverHTTP.java:91) ~[?:?]
at org.eclipse.jetty.client.http.HttpConnectionOverHTTP.onFillable(HttpConnectionOverHTTP.java:194) ~[?:?]
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:319) ~[?:?]
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100) ~[?:?]
at org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.onFillable(SslConnection.java:530) ~[?:?]
at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:379) ~[?:?]
at org.eclipse.jetty.io.ssl.SslConnection$2.succeeded(SslConnection.java:146) ~[?:?]
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100) ~[?:?]
at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53) ~[?:?]
at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:412) ~[?:?]
at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.consumeTask(AdaptiveExecutionStrategy.java:381) ~[?:?]
at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:268) ~[?:?]
at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.lambda$new$0(AdaptiveExecutionStrategy.java:138) ~[?:?]
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:407) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]

@sbordet
Member

sbordet commented Jan 9, 2023

The stack trace shows that the server closed the connection before sending the response.

And I am not sure whether the backend here (Salesforce) implemented the server-to-client delivery extension. Let's assume that they do not have it implemented, but we need to be sure that we do not lose any messages.

At this point you have to ask Salesforce.
The CometD project supports what you want in terms of features, but if the server does not use them, there is nothing we can do.

If you get that error, you can disconnect the client and re-handshake, but you may have lost messages in between.
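
A minimal sketch of that approach, watching /meta/connect for failures and forcing a new handshake (re-subscription and replayId bookkeeping are left to the application; whether to do this at all is a judgment call, since BayeuxClient already retries /meta/connect on its own):

```java
import org.cometd.bayeux.Channel;
import org.cometd.bayeux.client.ClientSessionChannel;
import org.cometd.client.BayeuxClient;

public class ReconnectingListener {
    // Registers a listener that re-handshakes the client when /meta/connect fails.
    public static void install(BayeuxClient client) {
        client.getChannel(Channel.META_CONNECT).addListener(
                (ClientSessionChannel.MessageListener) (channel, message) -> {
                    if (!message.isSuccessful()) {
                        // In production, hand this off to another thread rather than
                        // doing it on the message-delivery callback.
                        client.disconnect();
                        client.handshake();
                        // The application should re-subscribe (with its stored
                        // replayIds) once the new handshake succeeds.
                    }
                });
    }
}
```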

oscerd pushed a commit to apache/camel that referenced this issue Jul 22, 2024

* CAMEL-20918: set MAX_NETWORK_DELAY_OPTION to be 120 sec as per cometd/cometd#1142 (comment) and https://developer.salesforce.com/docs/atlas.en-us.api_streaming.meta/api_streaming/using_streaming_api_timeouts.htm

* CAMEL-20918: fix exception handling for Salesforce Streaming API:
- execute subscriptionListener in a separate thread similarly to handshakeListener and connectListener,
- fix SubscriptionHelper#listenerMap remaining empty when an exception occurs while subscribing to a channel,
- remove SubscriptionHelper#handshake as there is always a thread hanging in this method as per https://issues.apache.org/jira/browse/CAMEL-20388,
- remove the call to closeChannel for topic listeners as they will be removed on SubscriptionHelper#unsubscribe

* CAMEL-20918: fix a failing build

* CAMEL-20918: set MAX_NETWORK_DELAY_OPTION as first to allow overwriting

* CAMEL-20918: update replayId per channel when explicitly subscribed and not after rehandshake

* CAMEL-20918: fix replay extension not to overwrite ReplayId per channel if already set

* CAMEL-20918: fix formatting

* CAMEL-20918: remove unnecessary log