
feat: add agent acks to in-memory coordinator #12786

Merged: 14 commits into main, Apr 10, 2024

Conversation

coadler
Member

@coadler coadler commented Mar 27, 2024

When an agent receives a node, it responds with an ACK which is relayed
to the client. After the client receives the ACK, it's allowed to begin
pinging.
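The described flow (agent receives a node, replies with an ack, coordinator relays it, client then starts pinging) can be sketched minimally in Go. The types and channel plumbing below are illustrative only; the real PR uses protobuf messages from tailnet/proto.

```go
package main

import "fmt"

// Hypothetical message type: the real PR uses protobuf messages from
// tailnet/proto; names here are illustrative only.
type peerUpdate struct {
	kind string // "NODE" or "READY_FOR_HANDSHAKE"
	peer string
}

// agentHandle models the agent side: on receiving a node update, it
// responds with an ack addressed back to the originating peer.
func agentHandle(in peerUpdate) peerUpdate {
	if in.kind == "NODE" {
		return peerUpdate{kind: "READY_FOR_HANDSHAKE", peer: in.peer}
	}
	return peerUpdate{}
}

// clientReady models the client side: it blocks until the coordinator
// relays the agent's ack, and only then is pinging allowed to begin.
func clientReady(acks <-chan peerUpdate) bool {
	u := <-acks
	return u.kind == "READY_FOR_HANDSHAKE"
}

func main() {
	acks := make(chan peerUpdate, 1)
	// The coordinator relays the agent's ack to the client.
	acks <- agentHandle(peerUpdate{kind: "NODE", peer: "client-1"})
	fmt.Println("client may ping:", clientReady(acks))
}
```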

Member Author

coadler commented Mar 27, 2024

This stack of pull requests is managed by Graphite.

@coadler force-pushed the colin/feat_add_agent_acks_to_in-memory_coordinator branch 2 times, most recently from 1f1e8f4 to c066f24, on March 27, 2024 at 22:51
@coadler coadler marked this pull request as ready for review March 27, 2024 22:55
Resolved review threads: tailnet/proto/tailnet.proto (×3), tailnet/coordinator.go (×3)
// could potentially be smarter to only send an ACK once per client,
// but there's nothing currently stopping clients from reusing
// IDs.
for _, peer := range resp.GetPeerUpdates() {
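The fragment above acks each peer update as it arrives. A self-contained sketch of that per-update ack loop, using an illustrative stand-in type rather than the PR's protobuf API, might look like this:

```go
package main

import "fmt"

// Illustrative stand-in for the protobuf peer-update type; the real
// code iterates resp.GetPeerUpdates() from tailnet/proto.
type update struct{ id string }

// sendAcks acks every update it sees. As the comment in the PR notes,
// acking once per client would be smarter, but nothing currently stops
// clients from reusing IDs, so every update is acked.
func sendAcks(updates []update, ack func(id string)) {
	for _, peer := range updates {
		ack(peer.id)
	}
}

func main() {
	var acked []string
	sendAcks([]update{{"a"}, {"b"}, {"a"}}, func(id string) {
		acked = append(acked, id)
	})
	fmt.Println(acked) // duplicate IDs are acked again, by design
}
```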
Contributor
I worry this is too superficial --- here we are only acknowledging the fact that we received a peer update, not that it was programmed into wireguard, which is what is actually needed for the handshake to complete.

I guess this is an OK start in that it removes the Coordinator's propagation delay from the race condition, but it still leaves the race there. I realize you have yet to add support to the PGCoordinator, which is where we suspect the real problems are, so we will need to test this out and confirm that missed handshakes are substantially reduced. We can embed the ack deeper into tailnet in a later PR if we are still missing handshakes.

Member Author

Yeah, completely eliminating the race would require digging down into the configmaps, which I wasn't keen to do unless necessary. In my testing with the in-memory coordinator I wasn't able to hit the 5s backoff anymore. I suspect pgcoord will actually fare better, considering its extra round-trip latency compared to the in-memory coordinator.

Resolved review threads: tailnet/coordinator.go, codersdk/workspacesdk/workspacesdk.go, tailnet/configmaps.go (×5)
@coadler coadler requested a review from spikecurtis April 3, 2024 05:27
Resolved review threads: tailnet/configmaps.go (×2)
@@ -387,33 +408,78 @@ func (c *configMaps) updatePeerLocked(update *proto.CoordinateResponse_PeerUpdat
 	// SSH connections don't send packets while idle, so we use keep alives
 	// to avoid random hangs while we set up the connection again after
 	// inactivity.
-	node.KeepAlive = ok && peerStatus.Active
+	node.KeepAlive = (statusOk && peerStatus.Active) || (peerOk && lc.node != nil && lc.node.KeepAlive)
Contributor
Do we still need this status-based keep-alive business? My understanding was that was a bit of a hack to avoid building what you are now building. That is, we didn't want to turn on keep-alives at the source here because that would cause it to start the handshake too early. But now, at the source we can wait until the destination sends us READY_FOR_HANDSHAKE.

IIRC, we never really needed keep-alive turned on at the destination, but until now the Conn was unaware of the distinction.

Putting this calculation here made sense when it was based only on the status, but now some of the case statements below also contain logic that sets KeepAlive.

I think we need a single function that computes the KeepAlive value that we can call after all the changes are made, and consistently use it to compute the value.
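The reviewer's suggestion of a single KeepAlive-computing function could be sketched like this. The inputs mirror the diff above, but the helper itself is hypothetical, not code from the PR:

```go
package main

import "fmt"

// keepAlive sketches the reviewer's suggestion: one helper that computes
// the KeepAlive value from all relevant inputs, called consistently after
// every state change. This function is illustrative, not code from the PR.
func keepAlive(statusOk, active, peerOk, prevKeepAlive bool) bool {
	// The status reports the peer as active: keep-alives on.
	if statusOk && active {
		return true
	}
	// Otherwise, carry forward a previously enabled keep-alive for a
	// known peer so it isn't silently dropped.
	return peerOk && prevKeepAlive
}

func main() {
	fmt.Println(keepAlive(true, true, false, false))  // active per status
	fmt.Println(keepAlive(false, false, true, true))  // carried forward
	fmt.Println(keepAlive(false, false, true, false)) // stays off
}
```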

Member Author

I'd like to leave this be, mostly since I don't want to change the behavior of how tunnel destinations handle keep alives for now. Once the RFH stuff is all in and working properly, I'll revisit.

Contributor
Yeah, I guess we should leave the status-based keep alive processing until everything is working including PGCoordinator.

Resolved review threads: tailnet/configmaps.go (×4), tailnet/conn.go, codersdk/workspacesdk/connector.go (×2)
@coadler coadler requested a review from spikecurtis April 3, 2024 23:19
@spikecurtis (Contributor) left a comment
Some stylistic suggestions inline, but I don't need to review again. Looks good overall!

Resolved review threads: tailnet/configmaps.go (×3)
@coadler force-pushed the colin/feat_add_agent_acks_to_in-memory_coordinator branch from dfdef50 to cae734d on April 10, 2024 at 22:00
@coadler coadler merged commit e801e87 into main Apr 10, 2024
29 of 45 checks passed
@coadler coadler deleted the colin/feat_add_agent_acks_to_in-memory_coordinator branch April 10, 2024 22:15
@github-actions github-actions bot locked and limited conversation to collaborators Apr 10, 2024