
feat(android): detect network and dns changes and send them to connlib #4163

Merged

1 commit merged into main from feat/android-react-to-network on Mar 21, 2024

Conversation

@conectado (Collaborator) commented Mar 15, 2024

This completely removes get_system_default_resolvers for Android.


@jamilbk jamilbk changed the title from feat(kotlin): detect network and dns changes to feat(android): detect network and dns changes on Mar 15, 2024
@conectado conectado force-pushed the feat/android-react-to-network branch from b95e428 to 021c116 on March 18, 2024 21:19
github-actions bot commented Mar 18, 2024

Terraform Cloud Plan Output

Plan: 9 to add, 8 to change, 9 to destroy.

Terraform Cloud Plan

github-actions bot commented Mar 19, 2024

Performance Test Results

TCP

| Test Name | Received/s | Sent/s | Retransmits |
| --- | --- | --- | --- |
| direct-tcp-client2server | 221.8 MiB (+1%) | 224.0 MiB (+1%) | 246 (+64%) |
| direct-tcp-server2client | 230.9 MiB (+2%) | 232.3 MiB (+2%) | 117 (-75%) |
| relayed-tcp-client2server | 148.2 MiB (+1%) | 149.0 MiB (+1%) | 188 (+23%) |
| relayed-tcp-server2client | 152.3 MiB (+2%) | 152.8 MiB (+2%) | 217 (+10%) |

UDP

| Test Name | Total/s | Jitter | Lost |
| --- | --- | --- | --- |
| direct-udp-client2server | 50.0 MiB (+0%) | 0.29ms (+717%) | 0.00% (NaN%) |
| direct-udp-server2client | 50.0 MiB (+0%) | 0.01ms (-30%) | 0.00% (NaN%) |
| relayed-udp-client2server | 50.0 MiB (+0%) | 0.17ms (+42%) | 0.00% (NaN%) |
| relayed-udp-server2client | 50.0 MiB (+0%) | 0.07ms (+30%) | 0.00% (NaN%) |

@@ -69,10 +77,14 @@ impl Session {
///
/// In case of destructive network state changes, i.e. the user switched from wifi to cellular,
/// reconnect allows connlib to re-establish connections faster because we don't have to wait for timeouts first.
-    pub fn reconnect(&mut self) {
+    pub fn reconnect(&self) {
conectado (Collaborator Author):

If this took &mut self, we could break aliasing rules.
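A minimal sketch of the aliasing concern, assuming the Command channel shown later in this PR: the Java side can invoke FFI entrypoints from several threads against the same raw Session pointer, so materializing a &mut Session in two of them at once would be UB. Taking &self and funnelling mutation through a channel keeps it sound.

```rust
use std::net::IpAddr;

enum Command {
    Reconnect,
    SetDns(Vec<IpAddr>),
}

pub struct Session {
    // Mutation goes through the event loop instead of `&mut self`, so
    // concurrent FFI calls only ever create shared references.
    channel: tokio::sync::mpsc::Sender<Command>,
}

impl Session {
    pub fn reconnect(&self) {
        // `try_send` needs only `&self`; the event loop applies the change.
        let _ = self.channel.try_send(Command::Reconnect);
    }
}
```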

@jamilbk (Member) left a comment:

Have some code style suggestions for Kotlin; I can't really comment on the Rust changes, so I'll let @thomaseizinger and @ReactorScram approve those.

Comment on lines 206 to 208
val connectivityManager =
getSystemService(ConnectivityManager::class.java) as ConnectivityManager
connectivityManager.unregisterNetworkCallback(networkCallback!!)
Member:

Suggested change
val connectivityManager =
getSystemService(ConnectivityManager::class.java) as ConnectivityManager
connectivityManager.unregisterNetworkCallback(networkCallback!!)
networkCallback?.let {
    val connectivityManager =
        getSystemService(ConnectivityManager::class.java) as ConnectivityManager
    // Use `it`: `networkCallback` is a mutable nullable property, so it
    // can't be smart-cast to non-null inside the lambda.
    connectivityManager.unregisterNetworkCallback(it)
    networkCallback = null
}

I think I assumed that shutdown() would be safe to call idempotently.

Member:

Or better yet, move this to a helper function, stopNetworkMonitoring(connlibSessionPtr), like the Apple client.

conectado (Collaborator Author):

Why would we want to call shutdown multiple times in a row?

conectado (Collaborator Author):

I feel like shutdown should only be called when the tunnel is no longer used.

Member:

We don't, but there's an edge case if onDisconnect fires and the user disconnects at the same time.

The system can shut us down too, but I don't think that calls disconnect.

Just trying to avoid crashing in our only cleanup handler is all.

conectado (Collaborator Author):

Ah, I see. Then we probably want to change disconnect to not use connlibSessionPtr!!; I can do that in this PR.

conectado (Collaborator Author) commented Mar 20, 2024:

Changed it here: 0ce223b. Let me know if I should roll back the change to disconnect 😃

rust/connlib/clients/shared/src/eventloop.rs (outdated, resolved)
@conectado conectado requested a review from jamilbk March 20, 2024 00:03
conectado (Collaborator Author):

Having trouble reproducing the CI errors locally.

if let Some(mut dns_update) = self.dns_update.take() {
match dns_update.poll_unpin(cx) {
Poll::Ready(dns) => {
self.tunnel.set_dns(dns);
Member:

I'd rather have Tunnel::set_dns debounce internally than out here. We can just capture the timestamp of now and delay acting on it in the ClientState.

Member:

That will allow us to unit-test the debouncing, and it doesn't require messing around with futures. All we need is an Option with an Instant, and to act on it in handle_timeout.
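A minimal sketch of this proposal, with hypothetical names (pending_dns, DNS_DEBOUNCE, apply_dns) since the final shape isn't settled in this thread:

```rust
use std::net::IpAddr;
use std::time::{Duration, Instant};

// Hypothetical debounce window; the actual value would be tuned.
const DNS_DEBOUNCE: Duration = Duration::from_secs(1);

struct ClientState {
    // Latest DNS update, together with the time it arrived.
    pending_dns: Option<(Instant, Vec<IpAddr>)>,
}

impl ClientState {
    fn update_system_resolvers(&mut self, now: Instant, dns: Vec<IpAddr>) {
        // Overwrite any pending update; only the most recent one matters.
        self.pending_dns = Some((now, dns));
    }

    fn handle_timeout(&mut self, now: Instant) {
        match self.pending_dns.take() {
            // Debounce window elapsed: act on the update.
            Some((received_at, dns)) if now >= received_at + DNS_DEBOUNCE => {
                self.apply_dns(dns)
            }
            // Not due yet (or nothing pending): keep waiting.
            other => self.pending_dns = other,
        }
    }

    fn apply_dns(&mut self, _dns: Vec<IpAddr>) {
        // Would queue an event for the Tunnel in the real design.
    }
}
```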

conectado (Collaborator Author):

The downside of this approach is that handle_timeout would need to call into Tunnel to call update_interface, so we would need to return some event from handle_timeout. If that's acceptable, I almost have an implementation for this.

Member:

handle_timeout shouldn't return anything; instead, we should queue an event of some kind that is then consumed by the Tunnel so it can update its IO state.
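A sketch of that event-queue shape, again with hypothetical names (buffered_events, poll_next_event):

```rust
use std::collections::VecDeque;
use std::net::IpAddr;

// Hypothetical event type; DnsChanged is the case relevant here.
enum Event {
    DnsChanged(Vec<IpAddr>),
}

struct ClientState {
    buffered_events: VecDeque<Event>,
}

impl ClientState {
    // Called from `handle_timeout`: queue the side-effect instead of
    // returning it ...
    fn queue_dns_changed(&mut self, dns: Vec<IpAddr>) {
        self.buffered_events.push_back(Event::DnsChanged(dns));
    }

    // ... and let the Tunnel drain events and update its IO state.
    fn poll_next_event(&mut self) -> Option<Event> {
        self.buffered_events.pop_front()
    }
}
```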

Member:

Why is this part of the Android PR?

conectado (Collaborator Author):

Because we're moving to set_dns; otherwise, this would break all other platforms.

@jamilbk (Member) commented Mar 20, 2024:

Yeah maybe it would be better to land all of these into a staging branch? Otherwise main is gonna go wonky for a bit while we test manually.

Just an idea

conectado (Collaborator Author):

I think it is better to merge it like this so we have everything on main. This PR should work a-ok like this for all platforms.

conectado (Collaborator Author):

I think we could extract the set_dns change into its own PR, but that is quite a lot of work at this point.

Comment on lines 31 to 32
channel: tokio::sync::mpsc::Sender<Command>,
dns_updater: tokio::sync::mpsc::UnboundedSender<Vec<IpAddr>>,
Member:

Why have two channels? Can we just make the command one unbounded?

conectado (Collaborator Author):

I just didn't want to make the channel for the other events unbounded

Member:

Why not? It is not like the app will memory DoS connlib with a million commands that we can't process.

Collaborator:

If it won't, then set the limit to a million and panic if the impossible happens?

Collaborator:

Also it could be a watch or Notify+ArcSwap or something, right? No question of memory leaks when out-of-date DNS server vecs don't even matter.

conectado (Collaborator Author):

Bounded channels normally preemptively allocate enough to handle the worst-case scenario, so I don't think it'd make sense to set it to a million.

Also, I think it'd look a bit weird and could throw someone off when debugging.

conectado (Collaborator Author):

> Also it could be a watch or Notify+ArcSwap or something, right? No question of memory leaks when out-of-date DNS server vecs don't even matter.

I think this is what we discussed about having dumb plumbing; it's more logic that we can move inside the tunnel.

We could go with watch without the notify-if-updated logic, but that would also look a bit weird, with the notify/debounce logic split between the session and the tunnel.

Member:

We are also only shipping around pointers or ZSTs here, so the memory footprint should be tiny.

Let's set the bound to 100 and panic on error. It should never be full in normal operation, and if it is, that is a bug.

It will not be disconnected unless the app sends commands after on_disconnect, which is also a bug.

conectado (Collaborator Author):

> It will not be disconnected unless the app sends commands after on_disconnect, which is also a bug.

This can also happen if the user disconnects manually at the same time on_disconnect is called; super unlikely, but possible.
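A sketch of the bound-of-100-plus-panic approach from the comment above; Command mirrors the snippet below, everything else is illustrative:

```rust
use std::net::IpAddr;

enum Command {
    Reconnect,
    SetDns(Vec<IpAddr>),
}

fn make_channel() -> (
    tokio::sync::mpsc::Sender<Command>,
    tokio::sync::mpsc::Receiver<Command>,
) {
    // 100 slots is far more than normal operation needs, so a full (or
    // disconnected) channel indicates a bug and is worth a loud panic.
    tokio::sync::mpsc::channel(100)
}

fn send_command(channel: &tokio::sync::mpsc::Sender<Command>, cmd: Command) {
    channel
        .try_send(cmd)
        .expect("command channel should never be full or disconnected");
}
```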

let _ = self.channel.try_send(Command::Reconnect);
}

pub fn set_dns(&self, new_dns: Vec<IpAddr>) {
self.dns_updater.send(new_dns).expect("Developer error: As long as the session is up the dns update receiver must not be dropped");
Member:

This can be hit if the app calls connlib after it emitted on_disconnect. I don't think we should panic here but just drop the message.

Member:

You can make it a debug_assert if you want, but there is no need to crash a production app, I'd say.
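A sketch of the non-panicking variant being suggested, reusing the dns_updater field from the snippet above; the debug_assert only fires in debug builds:

```rust
use std::net::IpAddr;

pub struct Session {
    dns_updater: tokio::sync::mpsc::UnboundedSender<Vec<IpAddr>>,
}

impl Session {
    pub fn set_dns(&self, new_dns: Vec<IpAddr>) {
        // A closed receiver means the app called us after `on_disconnect`.
        // That is a bug worth surfacing in debug builds, but not worth
        // crashing a production app over: just drop the update.
        let result = self.dns_updater.send(new_dns);
        debug_assert!(result.is_ok(), "set_dns called after on_disconnect");
    }
}
```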

== HashSet::<&IpAddr>::from_iter(new_dns.iter())
})
{
return;
Member:

Add a debug log here.
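A sketch of the early return with the requested debug log, reconstructed around the HashSet comparison shown above (the tracing macro is an assumption):

```rust
use std::collections::HashSet;
use std::net::IpAddr;

fn set_dns(current_dns: &[IpAddr], new_dns: Vec<IpAddr>) {
    // If the new servers equal the current ones (ignoring order), do nothing.
    if HashSet::<&IpAddr>::from_iter(current_dns.iter())
        == HashSet::<&IpAddr>::from_iter(new_dns.iter())
    {
        tracing::debug!("DNS servers unchanged, skipping update");
        return;
    }
    // ... otherwise rebuild the DNS mapping and update the interface.
}
```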

Comment on lines 558 to 561
callback_handler
.get_system_default_resolvers()
.expect("We expect to get the system's DNS")
.unwrap_or_default(),
Member:

Same here: can we avoid calling our own callbacks and instead initialize it with the data that the callbacks return?

Comment on lines 197 to 202
let Some(system_resolvers) = self.role_state.system_resolvers.as_ref().cloned() else {
return Ok(());
};

let effective_dns_servers =
effective_dns_servers(config.upstream_dns.clone(), system_resolvers);
Member:

Isn't this duplicated state? If our effective DNS servers contain the system resolvers, why can't we just check our current DNS mapping instead of tracking the system resolvers separately?

We are already tracking so much state; we should avoid redundancy as much as we can.

conectado (Collaborator Author):

Because if we have some upstream DNS configured, the effective DNS doesn't contain the system's resolvers.
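A sketch of that precedence rule, assuming a simplified signature for effective_dns_servers (the real function likely does more, e.g. mapping to sentinel addresses):

```rust
use std::net::IpAddr;

// Upstream DNS servers, when configured, replace the system resolvers
// entirely, so the system resolvers can't be recovered from the output.
fn effective_dns_servers(
    upstream_dns: Vec<IpAddr>,
    system_resolvers: Vec<IpAddr>,
) -> Vec<IpAddr> {
    if !upstream_dns.is_empty() {
        return upstream_dns;
    }

    system_resolvers
}
```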

Member:

Can we unit test this somehow? Seems fragile to me :)

Member:

We need to stop reaching into fields of ClientState from the Tunnel and make proper APIs that manipulate ClientState.

conectado (Collaborator Author) commented Mar 20, 2024:

> Can we unit test this somehow? Seems fragile to me :)

I can add some unit tests for effective_dns_servers, but is that what you want to unit-test?

> We need to stop reaching into fields of ClientState from the Tunnel and make proper APIs that manipulate ClientState.

We can add a ClientState::get_dns that returns this effective_dns_servers, since both inputs of this function are stored there. What do you think about that?

Member:

What I'd like to test, for example, is that setting the same DNS servers is a no-op.

For example, we could make a function that, given a set of DNS servers, outputs an Option of dns_mapping, and only update Io if that is Some. Maybe there is more stuff we can do; we just need to sit down, de-couple the side-effects from the state management, and translate it into values that we pass around and can then write tests against.
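A sketch of that testable shape, with hypothetical names (update_dns_mapping, DnsMapping):

```rust
use std::collections::HashSet;
use std::net::IpAddr;

type DnsMapping = Vec<IpAddr>; // stand-in for the real mapping type

// Pure function: returns Some(new mapping) only when the servers changed,
// so "same servers is a no-op" becomes a one-line unit test.
fn update_dns_mapping(current: &[IpAddr], new: Vec<IpAddr>) -> Option<DnsMapping> {
    let same = HashSet::<&IpAddr>::from_iter(current.iter())
        == HashSet::<&IpAddr>::from_iter(new.iter());

    (!same).then_some(new)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn same_servers_are_a_noop() {
        let dns: Vec<IpAddr> = vec!["1.1.1.1".parse().unwrap()];

        assert!(update_dns_mapping(&dns, dns.clone()).is_none());
    }
}
```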

@thomaseizinger (Member) left a comment:

Accidentally made separate comments rather than a full review. I have some concerns about how some things are implemented, especially regarding the redundant state in ClientState and the duplicated channel.

github-merge-queue bot pushed a commit that referenced this pull request Mar 20, 2024
@conectado conectado force-pushed the feat/android-react-to-network branch from 904c265 to 9279de2 on March 20, 2024 19:23
@thomaseizinger (Member):

I'll try to review this today in-between sessions.

@conectado conectado changed the title from feat(android): detect network and dns changes to feat(connlib): detect network changes on android and add set_dns for tunnel on Mar 20, 2024
@conectado conectado marked this pull request as draft March 21, 2024 00:23
@conectado conectado force-pushed the feat/android-react-to-network branch from 14cc68a to 3326eda on March 21, 2024 01:08
github-merge-queue bot pushed a commit that referenced this pull request Mar 21, 2024
@conectado conectado changed the title from feat(connlib): detect network changes on android and add set_dns for tunnel to feat(android): detect network changes on Mar 21, 2024
@conectado conectado changed the base branch from main to refactor/use-set-dns-instead-of-callback March 21, 2024 01:09
@conectado conectado force-pushed the refactor/use-set-dns-instead-of-callback branch from 309375e to 56d7ded on March 21, 2024 01:12
@conectado conectado force-pushed the feat/android-react-to-network branch from 3326eda to cb1fb29 on March 21, 2024 01:14
github-merge-queue bot pushed a commit that referenced this pull request Mar 21, 2024
Base automatically changed from refactor/use-set-dns-instead-of-callback to main March 21, 2024 01:36
@conectado conectado force-pushed the feat/android-react-to-network branch from cb1fb29 to bf66d65 on March 21, 2024 01:47
@conectado conectado marked this pull request as ready for review March 21, 2024 01:47
@conectado conectado changed the title from feat(android): detect network changes to feat(android): detect network and dns changes and send them to connlib on Mar 21, 2024
@thomaseizinger (Member) left a comment:

I don't really understand the Android part but the FFI looks clean :)

_: JClass,
session: *const SessionWrapper,
) {
(*session).inner.reconnect();
Member:

Out of curiosity, what happens if we are passed a null-ptr? Will we segfault?

Member:

Can we somehow signal this via JNIEnv? Like throw an exception in Android?

conectado (Collaborator Author):

> Out of curiosity, what happens if we are passed a null-ptr? Will we segfault?

It's UB, but normally we will segfault.

> Can we somehow signal this via JNIEnv? Like throw an exception in Android?

I'm not sure; I'd need to research it.
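A sketch of what that could look like with the jni crate, whose JNIEnv::throw_new raises a Java exception on the calling thread; the exported symbol name and the Session/SessionWrapper stubs here are hypothetical:

```rust
use jni::objects::JClass;
use jni::JNIEnv;

// Minimal stand-ins for the types in the snippet above.
struct Session;
impl Session {
    fn reconnect(&self) {}
}
struct SessionWrapper {
    inner: Session,
}

#[no_mangle]
pub unsafe extern "system" fn Java_Session_reconnect(
    mut env: JNIEnv,
    _: JClass,
    session: *const SessionWrapper,
) {
    // Guard against a null pointer instead of dereferencing it (UB).
    if session.is_null() {
        let _ = env.throw_new(
            "java/lang/NullPointerException",
            "connlib session pointer is null",
        );
        return;
    }

    (*session).inner.reconnect();
}
```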

@conectado conectado added this pull request to the merge queue Mar 21, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 21, 2024
@conectado conectado added this pull request to the merge queue Mar 21, 2024
Merged via the queue into main with commit db62e7b Mar 21, 2024
138 checks passed
@conectado conectado deleted the feat/android-react-to-network branch March 21, 2024 02:26