
fix(peer): reconnection ping-pongs #2841

Merged: 1 commit merged into fedimint:master from the timeouts-dkg branch on Jul 25, 2023
Conversation

@dpc (Contributor) commented Jul 25, 2023

The root of all evil is state, especially mutable and shared state. It seems to me that peer communication uses a connection where both sides think they are in control. With higher latencies it is possible to get into a sort of reconnect ping-pong: one side reconnects and starts re-sending all its messages, and during that time the other side reconnects and starts sending messages of its own, only to receive yet another new connection from the first side...

Typically the way I'd write this is with two connections (one for each direction), with the sending side responsible for (re-)connecting. A bit wasteful, but it makes the code easier.

Here, to avoid refactoring too much, I just make the peer with the lower PeerId responsible for reconnections.

Fix #2800
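
A minimal sketch of the tie-breaking rule described above (not the actual fedimint-server code; `PeerId` here is a stand-in for fedimint's peer id type, and all names are illustrative):

```rust
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct PeerId(u16);

/// For every pair of peers, exactly one side is the "dialer": the one with
/// the lower id. The other side only ever accepts incoming connections, so
/// after a drop the connection is re-established from a single direction
/// and the two peers can no longer race each other into a reconnect loop.
fn is_dialer(our_id: PeerId, peer_id: PeerId) -> bool {
    our_id < peer_id
}

fn main() {
    let (alice, bob) = (PeerId(0), PeerId(1));
    assert!(is_dialer(alice, bob)); // alice (lower id) is responsible for reconnecting
    assert!(!is_dialer(bob, alice)); // bob only waits for incoming connections
}
```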

@dpc requested a review from a team as a code owner, July 25, 2023 21:37
codecov bot commented Jul 25, 2023

Codecov Report

Patch coverage: 93.33% and project coverage change: -0.07% ⚠️

Comparison is base (0403bdb) 63.26% compared to head (474d288) 63.20%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2841      +/-   ##
==========================================
- Coverage   63.26%   63.20%   -0.07%     
==========================================
  Files         211      211              
  Lines       42214    42219       +5     
==========================================
- Hits        26708    26684      -24     
- Misses      15506    15535      +29     
Files Changed                      Coverage Δ
fedimint-server/src/net/peers.rs   91.26% <93.33%> (-2.11%) ⬇️

... and 11 files with indirect coverage changes


@douglaz (Contributor) left a comment


Tested and it works! Not sure it solves 100% of the problems, but it should at least fix 99%.

@dpc added this pull request to the merge queue Jul 25, 2023
Merged via the queue into fedimint:master with commit 4919189 Jul 25, 2023
18 checks passed
@dpc deleted the timeouts-dkg branch July 25, 2023 23:45
@elsirion (Contributor) commented

Interesting! I was aware of that problem but expected exponential backoff + jitter to fix it after a few iterations; apparently it doesn't, or some of the variables were chosen poorly :/ Anyway, your solution is quite elegant.
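
For context, this is roughly what exponential backoff with "full jitter" looks like (a generic sketch under assumed parameters, using the external `rand` crate; not fedimint's actual reconnect code):

```rust
use std::time::Duration;

use rand::Rng; // external `rand` crate

/// The retry window grows as base * 2^attempt up to a cap, and the actual
/// delay is a uniformly random point inside that window, so two peers
/// retrying in lockstep should quickly desynchronize.
fn backoff_with_jitter(attempt: u32, base: Duration, max: Duration) -> Duration {
    let window = base.saturating_mul(1u32 << attempt.min(16)).min(max);
    window.mul_f64(rand::thread_rng().gen_range(0.0..=1.0))
}

fn main() {
    for attempt in 0..5 {
        let delay = backoff_with_jitter(attempt, Duration::from_millis(100), Duration::from_secs(10));
        println!("attempt {attempt}: sleeping {delay:?}");
    }
}
```

As the exchange below suggests, jitter only helps if the retry window is wide relative to the link latency.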

@justinmoon (Contributor) commented

Fantastic!

@dpc (Contributor, Author) commented Jul 26, 2023

> I was aware of that problem but expected exponential backoff+jitter to fix this after a few iterations,

Right. But I think we tuned those values down at some point to improve test times or something, which becomes a problem on higher-latency links.
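
(For illustration, with hypothetical numbers: if the backoff cap were tuned down to something like 100 ms so that tests reconnect quickly, but the link's round trip is several hundred milliseconds, one side's reconnect attempt is still in flight when the other side's timer fires, so even jittered retries keep colliding.)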

Development

Successfully merging this pull request may close these issues.

Weird communication errors while setting up federation
4 participants