Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

THRIFT-3060: clear the offline queue when written #1777

Merged
merged 1 commit into from
May 14, 2019

Conversation

razvanz
Copy link

@razvanz razvanz commented Apr 5, 2019

This PR addresses an issue in the NodeJS module where data would be sent multiple times when the socket reconnects multiple times because the offline_queue would never be emptied.

This fix clears the offline_queue when the socket reconnects and the queued data is written.

Fixes THRIFT-3060.

Copy link
Contributor

@jeking3 jeking3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CI issues have been resolved so I would suggest rebasing on master to get a clean run. The logic looks good to me in terms of solving the described issue. The whole offline queue gives me some concerns:

  1. What happens if a disconnect happens during the base class write? Does the message get put into the offline queue? I don't see how it would, but calling write() has a chance of detecting the disconnect, wouldn't one want that message to be placed into the offline queue in that situation?

  2. Just because something successfully called write() does not mean it was delivered. A disconnect following write but before the OS can push it out the door will drop anything queued for outbound send. So if folks are really relying on this mechanism to guarantee delivery of a queued request, they are not getting it. This is one of the tough parts about asynchronous messaging that are usually overlooked.

  3. Not sure this logic of tolerating disconnects and queueing up outbound requests has any limits. What happens if the client queues up 5000 queries for the server and on connect slams it? (Not a concern for your PR, but something to think about). I would consider moving this logic to another transport layer that wraps the connection, like a disconnect_tolerant_connection that has a queue size limit).

@razvanz
Copy link
Author

razvanz commented May 14, 2019

@jeking3 Having looked into the details, all your 3 points seem to be valid issues. Without using the callback interface, when calling write, the message will be put into some internal queues. This means that if a disconnect happens there will be no transparency over what was delivered and what was left pending into internal queues.

The solution is to use the callback interface for the Socket.write method, thus ensuring transparency over what get's delivered and what not. The offline_queue managed by this client adds yet another layer of queuing, which doesn't bring many benefits for the complexity it introduces so it should be avoided. This is however another discussion, which will require a bigger change.

This PR is meant to address a smaller issue in regards to the current implementation, which I think could be merged until a bigger redesign would be ready. Let me know what you think 😃.

@jeking3
Copy link
Contributor

jeking3 commented May 14, 2019

I agree with you in that this is an improvement. My comments were really just about the whole offline queue in general, and how it really isn't guaranteeing anything. I'm not sure something like that belongs in the thrift transport implementation at all. Given clients are asynchronous already, they should implement their own request timeouts and re-submit requests, and design their systems to have idempotent requests.

@jeking3 jeking3 merged commit c035eca into apache:master May 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants