Clarify how to handle all fatal errors #780

Open
Diagoras opened this issue Jul 6, 2018 · 7 comments


@Diagoras
Contributor

Diagoras commented Jul 6, 2018

Hey,

I've been working on an internal library that tries to wrap the CometD Java client into a Reactor Flux. Part of this involves propagating all possible fatal events (terminal failures, server sent disconnects) as errors.

I've registered the necessary callbacks on handshake and subscribe as suggested in the manual - and thanks for documenting that!

The problem is that this wasn't quite enough. I've also had to override the BayeuxClient's onTransportFailure(String, String, Throwable) method in order to handle fatal internal errors when there are no more transports to fall back on (e.g. a handshake failure where the advice is 'none', or exceeding the max backoff time). I've also had to define an extension to handle server-initiated disconnects, which, while not errors, are fatal to the BayeuxClient.

If anything above seems obviously wrong or if there's an easier way to capture all errors, please let me know. If not, I have some recommendations:

  • For now, documenting the need to override onTransportFailure to handle internal issues would be awesome. The documentation already covers using callback methods to handle handshake and subscription failures, so this would fit right in. Same goes for clarifying that server-sent disconnects need to be handled, and maybe even having an example of an extension that would serve that purpose.
  • In future versions, it might make sense to take a page from Reactor and define callback registering methods directly on the BayeuxClient. It'd be more consistent than overriding (since we already use callbacks to handle handshake/subscribe issues) and allow us to define handlers like onFatalError, onTerminate, and onServerDisconnect to make it easier to ensure you have all your bases covered. I think onTerminate is particularly powerful, since "do this thing if the BayeuxClient is dead and not coming back" strikes me as likely to be a very common use case.
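To make the second suggestion concrete, here is a minimal sketch of what such callback registration could look like. Everything in it is hypothetical: `TerminationCallbacks` and its methods do not exist in CometD, and the rule that `onTerminate` fires after either a fatal error or a server disconnect is only one possible reading of the proposal.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

/** Hypothetical sketch only; none of these methods exist on BayeuxClient. */
final class TerminationCallbacks {
    private final List<Consumer<Throwable>> fatalErrorHandlers = new CopyOnWriteArrayList<>();
    private final List<Runnable> serverDisconnectHandlers = new CopyOnWriteArrayList<>();
    private final List<Runnable> terminateHandlers = new CopyOnWriteArrayList<>();
    // Guards against reporting termination more than once.
    private final AtomicBoolean terminated = new AtomicBoolean();

    TerminationCallbacks onFatalError(Consumer<Throwable> handler) {
        fatalErrorHandlers.add(handler);
        return this;
    }

    TerminationCallbacks onServerDisconnect(Runnable handler) {
        serverDisconnectHandlers.add(handler);
        return this;
    }

    TerminationCallbacks onTerminate(Runnable handler) {
        terminateHandlers.add(handler);
        return this;
    }

    /** Would be called by the client when it gives up for good. */
    void fatalError(Throwable cause) {
        if (terminated.compareAndSet(false, true)) {
            fatalErrorHandlers.forEach(h -> h.accept(cause));
            terminateHandlers.forEach(Runnable::run);
        }
    }

    /** Would be called when the server sends an unsolicited disconnect. */
    void serverDisconnect() {
        if (terminated.compareAndSet(false, true)) {
            serverDisconnectHandlers.forEach(Runnable::run);
            terminateHandlers.forEach(Runnable::run);
        }
    }

    boolean isTerminated() {
        return terminated.get();
    }
}
```

In this sketch only the first terminal event is reported, so a handler registered with `onTerminate` runs at most once no matter how the client died.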

Thanks, and let me know what you think.

@sbordet
Member

sbordet commented Jul 21, 2018

> I've also had to define an extension to handle server initiated disconnects, which while not being errors are fatal to the BayeuxClient.

It should have been enough to register a /meta/disconnect listener, I think?

> allow us to define handlers like onFatalError, onTerminate

What would be the difference between the two above?
I agree that an onTerminate callback would be useful, although it is definitely not the common case (the common case is that everything works and BayeuxClient can communicate with the server).

We'll probably do some similar work (ReactiveStream-ing BayeuxClient) for CometD 4, so your input/contributions will be great.

Thanks!

@Diagoras
Contributor Author

One note - I ended up having to override the onTransportFailure(Message, ClientTransport.FailureInfo, ClientTransport.FailureHandler) overload to capture all possible fatal exceptions, not the other method with the same name. For the benefit of any future developers Googling "BayeuxClient silently disconnects":

  @Override
  protected void onTransportFailure(final Message message, final ClientTransport.FailureInfo failureInfo, final ClientTransport.FailureHandler handler) {
    // Call super first, since some code paths modify the `failureInfo` parameter in ways we care about,
    // i.e. they update the 'action' field to 'none'.
    super.onTransportFailure(message, failureInfo, handler);
    if (Message.RECONNECT_NONE_VALUE.equals(failureInfo.action)) { // Only fatal if the action is 'none', otherwise let the BayeuxClient retry.
      LOGGER.debug("Received a fatal error from server: {}", message);
      sink.error(new RuntimeException("Fatal error received on channel '"+ message.getChannel()
                                           + "', message was: " + message.getJSON()));
    }
  }

On top of that, you have to register a normal error handler on the subscribe call, since fatal subscription errors are not forwarded to onTransportFailure.

> It should have been enough to register a /meta/disconnect listener, I think?

Yeah, that'd have the same effect. It might be nice to call out in documentation though, since there are some tricky details you need to handle (e.g. ensuring the disconnect is server-initiated by checking that the ID is missing). That's also why I suggested having some kind of onServerDisconnect callback, much like how you currently provide easy-to-use handshake and subscribe callbacks.

> What would be the difference between the two above?

Uh...I could've sworn I had some subtle distinction in mind when I wrote that, but it totally escapes me now.

> I agree that an onTerminate callback would be useful, although it is definitely not the common case (the common case is that everything works and BayeuxClient can communicate with the server).

Absolutely. And avoiding these kinds of silent errors is always really tricky when doing async programming, since exceptions don't just neatly propagate up the stack to the caller.

> We'll probably do some similar work (ReactiveStream-ing BayeuxClient) for CometD 4, so your input/contributions will be great.

That sounds fantastic! If you're interested in a PR, let me know. I've already integrated BayeuxClient 3 with the Reactor library, so doing it for a generic Publisher with version 4 shouldn't be too dissimilar I hope.

@sbordet
Member

sbordet commented Sep 5, 2018

> I've already integrated BayeuxClient 3 with the Reactor library, so doing it for a generic Publisher with version 4 shouldn't be too dissimilar I hope.

Is this public so I can take a look?

@Diagoras
Contributor Author

Diagoras commented Sep 5, 2018

@sbordet It is not, but I can see about open sourcing the work so you can see our approach. It's definitely less than ideal (for example, right now it opens a new Bayeux session for each subscription) but might give you some ideas.

It might be quicker to get permission to show you the codebase than to put it on GitHub for everyone to see (since the latter requires clearing any credentials out of our Git history, security review, rewriting ITs that currently hit Salesforce, etc.). If you think that would be helpful, let me know and I can go about getting that for you.

Oh, and thanks!

@sbordet
Member

sbordet commented Sep 5, 2018

@Diagoras I would not mind taking a look at it, even privately.

I'm thinking that it should not have been necessary to override onTransportFailure(); you should have been able to do everything via listeners.

If that is not the case, then I'd like to know the exact details of what listener is not called, and look into fixing that.

@Diagoras
Contributor Author

Diagoras commented Sep 6, 2018

@sbordet Awesome. I should be able to get back to you next week on that, once I'm done convincing my company that having you sign an NDA for code we want you to use is beyond pointless. ;)

As to onTransportFailure(), I can describe some scenarios. Failed subscriptions will have their relevant listener invoked, so they're fine.

However, unsuccessful connect attempts that receive advice like {reconnect=none} will not trigger any of the listener callbacks and will also not get propagated as failed messages - instead, the Bayeux client will just terminate.
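For reference, a minimal sketch of how that advice can be recognized once you do have the message in hand (for example inside the onTransportFailure override posted earlier in this thread). It treats the message as a plain `Map<String, Object>`, which CometD's `Message` interface extends, and uses the Bayeux field names "advice" and "reconnect". The `AdviceChecks` class and its method name are made up for this example.

```java
import java.util.Map;

/** Hypothetical helper for illustration; not part of CometD. */
final class AdviceChecks {
    private AdviceChecks() {}

    /**
     * Returns true if the message carries advice of the form {reconnect=none},
     * meaning the server is telling the client not to retry.
     */
    static boolean isReconnectNone(Map<String, Object> message) {
        if (message == null) {
            return false;
        }
        Object advice = message.get("advice");
        if (!(advice instanceof Map)) {
            return false; // no advice attached to this message
        }
        Object reconnect = ((Map<?, ?>) advice).get("reconnect");
        return "none".equals(reconnect);
    }
}
```

This only inspects advice carried on the message itself. It does not cover the case where the client changes the failure action internally, which is why the override checks `failureInfo.action` instead.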

Also, consider handshake failures that are initially meant to rehandshake but that BayeuxClient#onTransportFailure modifies to have an advice value of RECONNECT_NONE_VALUE instead (e.g. after all transports are exhausted). The callback receives the original, unmodified failure info, since it's invoked before BayeuxClient#onTransportFailure. This is a problem if you want to specifically listen for terminal handshake events (in order to only react when CometD's own automatic retry logic has given up), but is fine if you only care about handshake failures in general or if the server sends {reconnect=none} with the handshake failure.

...I'm less certain about that handshake one I just described, so please do check my logic. I might be getting something wrong there.

The other scenario (though not one handled by overriding onTransportFailure) involves server-sent disconnect messages, which will result in the CometD client terminating silently. I ended up writing an extension that watches for inbound disconnect messages with no ID field (since that means they aren't a response to a client disconnect), but as you noted you could also use a /meta/disconnect listener.
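The check itself is small. Here is a rough sketch of it, written against a plain `Map<String, Object>` (which CometD's `Message` extends) so it does not depend on any particular listener or extension hook. `DisconnectChecks` is a name invented for this example, and the missing-ID rule is the heuristic described above, not something confirmed by the CometD documentation.

```java
import java.util.Map;

/** Hypothetical helper for illustration; not part of CometD. */
final class DisconnectChecks {
    private DisconnectChecks() {}

    /**
     * Returns true for a /meta/disconnect message that has no "id" field.
     * Assumption from this thread: a reply to a client-sent disconnect echoes
     * the request's id, so a missing id suggests the server initiated it.
     */
    static boolean isServerInitiatedDisconnect(Map<String, Object> message) {
        if (message == null) {
            return false;
        }
        boolean onDisconnectChannel = "/meta/disconnect".equals(message.get("channel"));
        return onDisconnectChannel && message.get("id") == null;
    }
}
```

The same predicate should work from either place: call it with the inbound message in an extension, or with the message passed to a /meta/disconnect listener.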

Sorry about the essay, but I wanted to describe each scenario I've seen in a complete manner. If I've left anything out or you need clarification, let me know and I can try to follow up.

@Diagoras
Contributor Author

@sbordet Unfortunately, there's still a holdup on showing you our integration with Reactor. I am cleared to work on this project in my free time, though, so hopefully I can contribute to making things reactive in CometD 4.
