fix(connlib): reconnect in case we lose all relays #7164
In these tests, we don't add any relays, so they can't succeed. The only thing they really test is timeouts, which I couldn't be bothered to make work 😅
conectado left a comment:

LGTM
During normal operation, we should never lose connectivity to the set of assigned relays in a client or gateway. In the presence of odd network conditions and partitions, however, it is possible that we disconnect from a relay that is in fact only temporarily unavailable. Without an explicit mechanism for retrieving new relays, both clients and gateways can end up with no relays at all. For clients, this can be fixed by either roaming or signing out and in again. For gateways, this can only be fixed by a restart!
Without connected relays, no connections can be established. With #7163, we will at least still be able to establish direct connections. Yet that isn't good enough, and we need a mechanism for restoring full connectivity in such a case.
When creating a new connection, we already sample one of our relays and assign it to that particular connection. This ensures that we don't create an excessive number of candidates for each individual connection. Currently, this selection is allowed to fail silently. With this PR, we make it a hard error and bubble the error all the way up to the client's and gateway's event loop. There, we initiate a reconnect to the portal as a compensating action. Reconnecting to the portal means we will receive another `init` message that allows us to reconnect to the relays.

Due to the nature of this implementation, this fix may only apply with a certain delay from when we actually lost connectivity to the last relay. However, this design has the advantage that we don't have to introduce additional state within `snownet`: connections now simply fail to establish, and the next one soon after should succeed again because we will have received a new `init` message.

Resolves: #7162.
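The fail-hard-then-reconnect flow described above can be sketched as follows. This is a minimal, self-contained illustration under stated assumptions: the names (`sample_relay`, `reconnect_to_portal`, `NoRelays`, etc.) are hypothetical stand-ins, not connlib's actual API, and the portal round-trip is simulated synchronously.

```rust
use std::collections::HashSet;

/// Hypothetical error for "we have no relays left"; in this sketch it is
/// a hard error rather than a silently ignored condition.
#[derive(Debug, PartialEq)]
struct NoRelays;

struct Node {
    relays: HashSet<String>,
}

impl Node {
    /// Sample a relay for a new connection. Previously this kind of
    /// selection could fail silently; here it returns an error instead.
    fn sample_relay(&self) -> Result<String, NoRelays> {
        self.relays.iter().next().cloned().ok_or(NoRelays)
    }
}

struct EventLoop {
    node: Node,
    reconnects: u32,
}

impl EventLoop {
    /// Try to set up a connection. If relay sampling fails, bubble the
    /// error up and trigger a portal reconnect as a compensating action.
    fn new_connection(&mut self) -> Result<String, NoRelays> {
        match self.node.sample_relay() {
            Ok(relay) => Ok(relay),
            Err(e) => {
                self.reconnect_to_portal();
                Err(e)
            }
        }
    }

    /// Simulate reconnecting to the portal: the fresh `init` message
    /// would carry a new set of relays.
    fn reconnect_to_portal(&mut self) {
        self.reconnects += 1;
        self.node.relays.insert("turn.example.com:3478".to_string());
    }
}

fn main() {
    let mut event_loop = EventLoop {
        node: Node { relays: HashSet::new() },
        reconnects: 0,
    };

    // The first connection fails hard because we lost all relays ...
    assert!(event_loop.new_connection().is_err());
    assert_eq!(event_loop.reconnects, 1);

    // ... but the reconnect restored the relay set, so the next attempt
    // shortly after succeeds again.
    assert!(event_loop.new_connection().is_ok());
}
```

Note that, as in the PR, the failing connection itself is not rescued: it simply errors out, and recovery happens implicitly for subsequent connections once the new relay set arrives.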