feat(snownet): minimize delay when roaming #4246

Merged: 3 commits merged into main from feat/connlib/faster-reconnect on Mar 22, 2024

thomaseizinger
Member

Currently, we need to wait for the timeout of the current candidate pair during reconnect before we nominate a new one. To speed this up, we can preemptively invalidate candidates we have previously discovered via our Allocations, i.e. relay candidates and srflx candidates.
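
A minimal sketch of the idea, assuming hypothetical `Candidate` and `CandidateKind` types (these are illustrative names, not the str0m or snownet API): on reconnect, every candidate that was learned through an allocation (relayed or server-reflexive) is treated as stale right away, while host candidates are kept.

```rust
use std::net::SocketAddr;

/// How a candidate was discovered (names follow ICE terminology).
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CandidateKind {
    /// An address bound on a local interface.
    Host,
    /// "srflx": our public address as observed by a STUN/TURN server.
    ServerReflexive,
    /// An address allocated for us on a TURN relay.
    Relayed,
}

/// Hypothetical candidate record for illustration only.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Candidate {
    kind: CandidateKind,
    addr: SocketAddr,
}

/// Splits `candidates` into `(kept, invalidated)`.
///
/// Everything that came from an allocation (srflx and relayed candidates)
/// ends up in `invalidated`, so the caller can drop those candidates
/// immediately instead of waiting for the current pair to time out.
fn invalidate_allocation_candidates(
    candidates: Vec<Candidate>,
) -> (Vec<Candidate>, Vec<Candidate>) {
    candidates
        .into_iter()
        .partition(|c| c.kind == CandidateKind::Host)
}
```

Once new allocations have been made on the new network, fresh srflx and relay candidates would be added again; that part is not shown here.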

@thomaseizinger
Member Author

This is still a draft because there is an edge-case in which we immediately go into Disconnected as a result of invalidating the candidates. Relevant discussion is here: algesten/str0m#486 (comment)

github-actions bot commented Mar 21, 2024

Terraform Cloud Plan Output

Plan: 9 to add, 8 to change, 0 to destroy.

github-actions bot commented Mar 21, 2024

Performance Test Results

TCP

| Test Name | Received/s | Sent/s | Retransmits |
|---|---|---|---|
| direct-tcp-client2server | 222.2 MiB (+0%) | 223.7 MiB (+0%) | 219 (+59%) |
| direct-tcp-server2client | 225.4 MiB (-0%) | 226.8 MiB (-1%) | 174 (-75%) |
| relayed-tcp-client2server | 143.0 MiB (-5%) | 143.7 MiB (-5%) | 153 (+3%) |
| relayed-tcp-server2client | 155.5 MiB (-1%) | 155.8 MiB (-1%) | 177 (+2%) |

UDP

| Test Name | Total/s | Jitter | Lost |
|---|---|---|---|
| direct-udp-client2server | 50.0 MiB (+0%) | 0.19ms (+583%) | 0.00% (NaN%) |
| direct-udp-server2client | 50.0 MiB (+0%) | 0.01ms (-3%) | 0.00% (NaN%) |
| relayed-udp-client2server | 50.0 MiB (+0%) | 0.18ms (+150%) | 0.00% (NaN%) |
| relayed-udp-server2client | 50.0 MiB (+0%) | 0.05ms (-9%) | 0.00% (NaN%) |

@thomaseizinger
Member Author

@firezone/engineering I am getting pretty good results for the downtime when switching networks with this PR. I'd appreciate some testing on other platforms. My test setup was:

  • Laptop is connected to hotspot from phone
  • Phone has cellular & WiFi
  • Continuously ping a resource
  • Toggle WiFi on and off (I also learned today that Android can hotspot AND use the wifi it is connected to for Internet!)
  • Send SIGHUP to firezone-linux-client after toggling the connection

I am getting a downtime of about 5 seconds before the pings resume:

64 bytes from 10.0.32.101: icmp_seq=7 ttl=62 time=239 ms
64 bytes from 10.0.32.101: icmp_seq=8 ttl=62 time=238 ms
64 bytes from 10.0.32.101: icmp_seq=9 ttl=62 time=240 ms
64 bytes from 10.0.32.101: icmp_seq=14 ttl=62 time=252 ms
64 bytes from 10.0.32.101: icmp_seq=15 ttl=62 time=256 ms
64 bytes from 10.0.32.101: icmp_seq=16 ttl=62 time=277 ms

Note that I didn't send the SIGHUP signal instantly; there was probably a delay of a second or a bit more. Measured from the moment the app receives the signal until it has a working connection again, the downtime might be only about 4 seconds.
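
The roughly 5 second figure can be read off the jump in `icmp_seq` from 9 to 14: four replies are missing, and Linux `ping` sends one request per second by default. A small sketch that extracts this from ping output, assuming the Linux output format shown above (the function names are made up for illustration):

```rust
/// Extracts the `icmp_seq` number from one line of Linux `ping` output, if present.
fn icmp_seq(line: &str) -> Option<u32> {
    let rest = line.split("icmp_seq=").nth(1)?;
    let digits: String = rest.chars().take_while(|c| c.is_ascii_digit()).collect();
    digits.parse().ok()
}

/// Returns the largest number of consecutive replies missing between two
/// received replies. Lines without an `icmp_seq=` field are ignored.
fn longest_gap(ping_output: &str) -> u32 {
    let seqs: Vec<u32> = ping_output.lines().filter_map(icmp_seq).collect();
    seqs.windows(2)
        .map(|w| w[1].saturating_sub(w[0]).saturating_sub(1))
        .max()
        .unwrap_or(0)
}
```

For the log above, `longest_gap` yields 4 missing replies, i.e. about 5 seconds between the last reply on the old network and the first reply on the new one at the default one-second interval.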

@jamilbk
Member

jamilbk commented Mar 21, 2024

Nice!

@thomaseizinger I can test this on Apple once #4133 is nearly finished, so blocked on that atm.

@ReactorScram
Collaborator

On Windows (6905491):

  • Had the laptop on my home Wi-Fi
  • Pinged ifconfig.net a few times
  • Switched to the iPhone's hotspot
  • Couldn't ping ifconfig.net for a very long time, was getting "Error with the DNS fallback lookup"
  • It started pinging again eventually
  • Turned off iPhone's Wi-Fi - No change. Maybe it was using cellular for the hotspot
  • Turned iPhone's Wi-Fi back on - Got a warning about "This will disconnect hotspot users"
  • The laptop eventually reconnects to the iPhone

Ping logs during that reconnection:

Reply from 172.67.199.190: bytes=32 time=76ms TTL=53
Request timed out.
Request timed out.
General failure.
Request timed out.
Request timed out.
Reply from 172.67.199.190: bytes=32 time=38ms TTL=53

Not sure if I tested the right thing.

@thomaseizinger
Member Author

@ReactorScram That seems like a plausible test and something users might do.

Just for science reasons, can you try other ways of changing your observed public address?

Moving your laptop from one WiFi with working Internet to another should also do the trick for example. Does connlib disconnect its connection at some point? i.e. do you ever see "ICE timeout" in the logs?

@thomaseizinger
Member Author

Also, is "reconnect" triggered correctly as a result? You can tell by looking for "Connected to the portal" and "Allocation mismatch" in the logs.
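
A rough sketch of such a log check, assuming the three phrases mentioned in this thread appear verbatim somewhere in the log text (the exact log format is not shown here, and `ReconnectSigns` and `scan_log` are hypothetical names):

```rust
/// Which of the log markers mentioned in this thread were seen.
/// Hypothetical helper type for illustration only.
#[derive(Debug, Default, PartialEq, Eq)]
struct ReconnectSigns {
    /// "Connected to the portal": the portal connection was re-established.
    connected_to_portal: bool,
    /// "Allocation mismatch": expected after a reconnect.
    allocation_mismatch: bool,
    /// "ICE timeout": connlib gave up on the connection.
    ice_timeout: bool,
}

/// Scans log text line by line for the three marker phrases.
/// Matching is a plain, case-sensitive substring search.
fn scan_log(log: &str) -> ReconnectSigns {
    let mut signs = ReconnectSigns::default();
    for line in log.lines() {
        signs.connected_to_portal |= line.contains("Connected to the portal");
        signs.allocation_mismatch |= line.contains("Allocation mismatch");
        signs.ice_timeout |= line.contains("ICE timeout");
    }
    signs
}
```

A healthy roam would show the first two markers without `ice_timeout`.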

@thomaseizinger
Member Author

I've now patched our str0m fork to do what I want here. We may have to come back for a different solution eventually if we can't get this upstreamed. For now, this will do, and it makes reconnect more stable because we no longer trigger a connection failure, which would clear all sorts of state such as cached DNS queries.

@thomaseizinger
Member Author

For reference, this is the patch that is now included: algesten/str0m#489

@conectado
Collaborator

Just tested this PR on Android

Switching networks with no noticeable downtime now 🚀 🎉

The way I tested it: load ifconfig.net, switch networks, then load it again. It loads immediately, no matter how many times we switch networks.

What I'd like to test next is some long-lived connection but I assume that it will work a-ok, I will make some setup to test this on android

@thomaseizinger
Member Author

> What I'd like to test next is some long-lived connection but I assume that it will work a-ok, I will make some setup to test this on android

Can we deploy some server to download a large file?

thomaseizinger marked this pull request as ready for review on March 22, 2024, 01:16

@conectado
Collaborator

conectado commented Mar 22, 2024

> What I'd like to test next is some long-lived connection but I assume that it will work a-ok, I will make some setup to test this on android
>
> Can we deploy some server to download a large file?

@bmanifold might have an idea on how to do that

@jamilbk
Member

jamilbk commented Mar 22, 2024

> What I'd like to test next is some long-lived connection but I assume that it will work a-ok, I will make some setup to test this on android
>
> Can we deploy some server to download a large file?

How big of a file? We already have GitHub -- we can add a repo with LFS support which supports up to 5 GB.

@conectado
Collaborator

conectado left a comment

Even if we depend on a fork I think it's good to merge this ASAP since it improves UX a lot

@jamilbk
Member

jamilbk commented Mar 22, 2024

@thomaseizinger You can also use ionice to make downloads last longer:

/usr/bin/ionice -c2 -n7 rsync \
--bwlimit=1000 /path/to/source /path/to/dest/

@thomaseizinger
Member Author

> What I'd like to test next is some long-lived connection but I assume that it will work a-ok, I will make some setup to test this on android
>
> Can we deploy some server to download a large file?
>
> How big of a file? We already have GitHub -- we can add a repo with LFS support which supports up to 5 GB.

I guess we could also test using a speedtest?

@thomaseizinger
Member Author

> Even if we depend on a fork I think it's good to merge this ASAP since it improves UX a lot

We already did depend on a fork :)

@conectado
Collaborator

> What I'd like to test next is some long-lived connection but I assume that it will work a-ok, I will make some setup to test this on android
>
> Can we deploy some server to download a large file?
>
> How big of a file? We already have GitHub -- we can add a repo with LFS support which supports up to 5 GB.
>
> I guess we could also test using a speedtest?

Just tested with speed.cloudflare.com and it seems to resume connection reliably

But I think something that we deploy ourselves would be better, otherwise I can't be sure we're connected through firezone

@AndrewDryga
Collaborator

We can upload a file to Google Cloud Storage, but you can also find a large docker image and pull it - free large binary hosting :).

@bmanifold
Collaborator

> What I'd like to test next is some long-lived connection but I assume that it will work a-ok, I will make some setup to test this on android
>
> Can we deploy some server to download a large file?
>
> @bmanifold might have an idea on how to do that

Multiple good ideas have been thrown out, but as another suggestion, if we don't care about what the file actually is, I just threw together a quick PoC in golang that creates an HTTP server that will generate an arbitrarily large random file and stream the result so that it doesn't take up any disk space or much memory on the server. I just used my personal laptop to make a curl request to download a 1GB file from my work laptop and it worked fine.

All we'd need to do from there is create a docker container and then either use it in CI or we could also run it in AWS/GCP if needed. That said, we may want to be careful of downloading large files from AWS/GCP VMs outside of the AWS/GCP network as we could run up our network traffic bill.
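
The Go PoC itself is not shown in this thread. As a rough illustration of the same technique, here is a sketch in Rust using only the standard library: it streams a response body of arbitrary size in fixed-size chunks, so neither disk space nor memory grows with the file size. All names (`XorShift64`, `stream_random`, `write_response`) are hypothetical, and the generator is a simple xorshift, which is fine as filler data but not cryptographically secure.

```rust
use std::io::{self, Write};

/// Tiny xorshift64 generator. NOT cryptographically secure; it only
/// produces filler bytes for download tests.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Writes `total` pseudo-random bytes to `out` in chunks of at most 64 KiB,
/// so memory use stays constant regardless of `total`.
fn stream_random<W: Write>(out: &mut W, total: u64, seed: u64) -> io::Result<()> {
    const CHUNK: usize = 64 * 1024;
    let mut rng = XorShift64(seed.max(1)); // xorshift state must not be zero
    let mut buf = vec![0u8; CHUNK];
    let mut remaining = total;
    while remaining > 0 {
        let n = remaining.min(CHUNK as u64) as usize;
        // Refill the buffer eight bytes at a time.
        for block in buf[..n].chunks_mut(8) {
            let bytes = rng.next_u64().to_le_bytes();
            let len = block.len();
            block.copy_from_slice(&bytes[..len]);
        }
        out.write_all(&buf[..n])?;
        remaining -= n as u64;
    }
    Ok(())
}

/// Writes a minimal HTTP/1.1 response whose body is `total` random bytes.
fn write_response<W: Write>(out: &mut W, total: u64) -> io::Result<()> {
    write!(
        out,
        "HTTP/1.1 200 OK\r\nContent-Type: application/octet-stream\r\nContent-Length: {}\r\nConnection: close\r\n\r\n",
        total
    )?;
    stream_random(out, total, 0x5EED)
}
```

Because `write_response` accepts any `Write`, it could be pointed at a `std::net::TcpStream` accepted from a `TcpListener`; request parsing and error handling are left out of this sketch.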

@AndrewDryga
Collaborator

AndrewDryga commented Mar 22, 2024

@bmanifold I have a small coding challenge for you. Use Elixir's Plug to stream endless random bytes :). You can see an example in our CSV export.

@thomaseizinger
Member Author

thomaseizinger commented Mar 22, 2024

> What I'd like to test next is some long-lived connection but I assume that it will work a-ok, I will make some setup to test this on android
>
> Can we deploy some server to download a large file?
>
> @bmanifold might have an idea on how to do that
>
> Multiple good ideas have been thrown out, but as another suggestion, if we don't care about what the file actually is, I just threw together a quick PoC in golang that creates an HTTP server that will generate an arbitrarily large random file and stream the result so that it doesn't take up any disk space or much memory on the server. I just used my personal laptop to make a curl request to download a 1GB file from my work laptop and it worked fine.

I built something similar for the integration test I want to write 😁

We could actually deploy that to staging too, hadn't thought of that!

@thomaseizinger
Member Author

Let's hold off on this for now; I want to see if my integration test passes with all the fixes we are putting in. Once that is done, we can think about deploying such a dummy server to staging to do more testing :)

@thomaseizinger
Member Author

thomaseizinger commented Mar 22, 2024

This has been validated on multiple systems, so I am going ahead and merging it.

thomaseizinger added this pull request to the merge queue on Mar 22, 2024
Merged via the queue into main with commit 3fe8f6d on Mar 22, 2024
138 checks passed
thomaseizinger deleted the feat/connlib/faster-reconnect branch on March 22, 2024, 06:10

@bmanifold
Collaborator

> @bmanifold I have a small coding challenge for you. Use Elixirs Plug to stream endless random bytes :). You can see an example in our CSV export

😄 I like it. I might try that this weekend.

@jamilbk
Member

jamilbk commented Mar 22, 2024

Building a new Android client with this.

6 participants