Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

relay: rewrite design of Server to follow a better SANS-IO design #4095

Closed
thomaseizinger opened this issue Mar 12, 2024 · 0 comments · Fixed by #4455
Closed

relay: rewrite design of Server to follow a better SANS-IO design #4095

thomaseizinger opened this issue Mar 12, 2024 · 0 comments · Fixed by #4455
Labels
area/relay Issues involving the Firezone Relay kind/refactor Code refactoring

Comments

@thomaseizinger
Copy link
Member

thomaseizinger commented Mar 12, 2024

The relay was the first SANS-IO component written for firezone. With the experience on snownet, I am no longer convinced that it was written in a very good way. In particular, bugs like #4094 are a result of some of its design choices.

To alleviate future bugs and make maintenance easier, we should rewrite relay::Server to an API similar to snownet:

  • Introduce a dedicated poll_transmit for sending messages
  • Introduce a dedicated poll_timeout for when we want to be woken next
  • Introduce a dedicated handle_timeout for performing time-based actions

In the current design, the above are combined into Commands. This can lead to subtle bugs where something needs to be performed at a certain time but forgetting to explicitly register a Command for that results in that never happening. Similarly, emitting a Command::Wake with the wrong timestamp may delay certain time-based actions.

@thomaseizinger thomaseizinger added kind/refactor Code refactoring area/relay Issues involving the Firezone Relay labels Mar 12, 2024
ReactorScram pushed a commit that referenced this issue Apr 2, 2024
Previously, we would allocate each message twice:

1. When receiving the original packet.
2. When forming the resulting channel-data message.

We can optimise this to only one allocation each by:

1. Carrying around the original `ChannelData` message for traffic from
clients to peers.
2. Pre-allocating enough space for the channel-data header for traffic
from peers to clients.

Local flamegraphing still shows most of user-space activity as
allocations. I did occasionally see a throughput of ~10GBps with these
patches. I'd like to still work towards #4095 to ensure we handle
anything time-sensitive better.
github-merge-queue bot pushed a commit that referenced this issue Apr 2, 2024
This is much more robust than the previous implementation because we now
go through all allocations and channels every time we get a
`handle_timeout` and clean up everything that is expired.

Resolves: #4095.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/relay Issues involving the Firezone Relay kind/refactor Code refactoring
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant