Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a heartbeat to detect hung connections #1101

Merged
merged 3 commits into from
Dec 14, 2019
Merged

Conversation

FKSRipple
Copy link
Collaborator

@FKSRipple FKSRipple commented Nov 25, 2019

ripple-lib now keeps a constant heartbeat to detect broken/hung connections. Every minute, a "ping" command is sent by the RippleAPI connection instance. If that fails, the connection is disconnected so that the consumer can reconnect.

I was able to test this by dropping the timeout down to below the mocked "ping" response, to simulate a hung connection receiving no responses from the server. It's not perfect, but it's a fair enough representation, especially considering how tricky it is to test server disconnect/reconnect logic.

Questions

  • Is "disconnect" the proper action? With a little more complexity, we could also reconnect automatically (silently so that the user never knows, or with emitted "disconnect" and "reconnect" events).
    • A: Talked with @intelliot, and we'll make "reconnect" the default behavior.
  • I'm assuming that there's no concern here for added load on the network (1 ping request, every minute, for everyone who uses the official Ripple SDK). If there is a concern there, let me know and we can bump up the interval or make it opt-in.
    • A: This is fine.

Alternative Designs / Considerations

For a while I worked to do this in some clever way using the ledgerClosed stream that we already connect to (so no extra ping requests needed). But ultimately it felt more maintainable/reliable to just add a true heartbeat that wouldn't have to rely on the behavior of that stream's implementation on both the server and the client.

@keefertaylor
Copy link
Contributor

This seems reasonable to me, though I defer to you and Elliot about the answers to your questions. In general, folks should be running their own node so if this overloads ours I think there's a clear migration path for affected users (not that we wouldn't roll back and fix the issues on our end).

Can you get your unit tests passing? I think you probably just need to increase your timeout because of how slow Travis is.

@FKSRipple FKSRipple force-pushed the heartbeat-ping branch 2 times, most recently from 8bb0203 to 16ea1cc Compare November 27, 2019 00:56
@FKSRipple
Copy link
Collaborator Author

@keefertaylor thanks for taking a look. I spoke with Elliot and we agreed to move forward with an automatic reconnect. I'll look into adding some sort of backoff strategy as well so that a user doesn't overwhelm their own node.

Not sure why that test is failing, it seems to be more serious than just Travis being slow. Will look into that as well

@FKSRipple
Copy link
Collaborator Author

FKSRipple commented Dec 13, 2019

@intelliot @keefertaylor I changed the "disconnect" behavior to "reconnect" automatically, and fixed that Travis bug. PTAL

Related: By hammering on our overall xrp-api resiliency and testing the reconnect flow in a few different real-world examples, I have some concerns about the state of our reconnect logic/story overall that I'd like to fix over the next few weeks. But as long as this is the desired behavior, I think that it makes sense to just keep using the logic that we have for now and then tackle those improvements/rewrites separately.

@intelliot
Copy link
Collaborator

I have some concerns about the state of our reconnect logic/story overall that I'd like to fix over the next few weeks. But as long as this is the desired behavior, I think that it makes sense to just keep using the logic that we have for now and then tackle those improvements/rewrites separately.

Agreed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants