Skip to content

fix(headless-client/windows): improve Client startup times on Windows#5375

Merged
thomaseizinger merged 93 commits intomainfrom
fix/windows-slow-startup
Jul 16, 2024
Merged

fix(headless-client/windows): improve Client startup times on Windows#5375
thomaseizinger merged 93 commits intomainfrom
fix/windows-slow-startup

Conversation

@ReactorScram
Copy link
Contributor

@ReactorScram ReactorScram commented Jun 14, 2024

Closes #5026
Closes #5879

On the resource-constrained Windows Server 2022 test VM, the median sign-in time dropped from 5.0 seconds to 2.2 seconds.

Changes

  • Measure end-to-end connection time in the GUI process
  • Use ipconfig instead of Powershell to flush DNS faster
  • Activate DNS control by manipulating the Windows Registry directly instead of calling Powershell
  • Remove deactivate step when changing DNS servers (seals a DNS leak when roaming networks)
  • Remove completely redundant Set-DnsClientServerAddress step from activating DNS control
  • Remove Remove-NetRoute powershell cmdlet that seems to do nothing

Benchmark 7

  • Optimized release builds
  • x86-64 constrained VM (1 CPU thread, 2 GB RAM)

Main with measurement added, c1c99197e from #5864

  • 6.0 s
  • 5.5 s
  • 4.1 s
  • 5.0 s
  • 4.1 s
  • (Median = 5.0 s)

Main with speedups added, 2128329f9 from #5375, this PR

  • 3.7 s
  • 2.2 s
  • 1.9 s
  • 2.3 s
  • 2.0 s
  • (Median = 2.2 s)
### Next steps
- [x] Benchmark on the resource-constrained VM
- [x] Move raw benchmark data to a comment and summarize in the description
- [x] Clean up tasks that don't need to be in the commit
- [x] Merge

Hypothetical further optimizations

  • Ditch the netsh subprocess in set_ips

e.g. `#[tracing::instrument]` will now print the time a function spent
busy and idle without having to add a timer crate or manually capture
`Instant`s
Copy link
Member

@thomaseizinger thomaseizinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff! I left some comments, nothing blocking.

Regarding the PR description: I appreciate the thorough benchmarking and write-up. Can we move that to a comment on the PR instead? With our squash-merging policy, the PR description turns into a commit message on main. For bisecting for example, it is nice to be able to read the commit message as short prose of why we ended up making certain changes and have it not intermediate results. It is a balance acts obviously :)

dns_control::deactivate()?;
impl<'a> Handler<'a> {
async fn new(server: &mut IpcServer, dns_controller: &'a mut DnsController) -> Result<Self> {
dns_controller.deactivate()?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this, is the default of may_be_active: true redundant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, maybe active is the wrong word for it. It's more like a dirty flag in a cache, it means we need to deactivate our control if we should not be in control.

If I split up the I/O benchmarking from the deterministic parts as I wrote in the other thread, I might be able to remove this completely.

There's 3 reasons to deactivate DNS control:

  1. The IPC service or Headless Client just started, and we need to recover from a possible Firezone crash or system crash
  2. Connlib initiated a disconnect and we need to bail out of this iteration of the IPC service's main loop
  3. The GUI initiated a sign-out and we need to stop controlling DNS

Number 1 should always happen. Numbers 2 and 3 can afford to be slow since Firezone is shutting down anyway. So I think this optimization made sense a few days ago when it removed the defensive "deactivate during sign-in" but now it may be unnecessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The IPC service or Headless Client just started, and we need to recover from a possible Firezone crash or system crash

Can we move this concern to the main of the respective binary?

@ReactorScram
Copy link
Contributor Author

(Old benchmarks were removed)

All benchmarks rounded to the nearest 0.1 seconds

Benchmark 2

-43f3fd2e from this PR

  • Unoptimized, just comparing across OSes
  • Host is the aarch64 Macbook

Ubuntu 20.04 UTM VM:

  • 30.6 s (Outlier, for some reason the VM had DNS issues)
  • 0.4 s
  • 0.4 s
  • 0.4 s
  • 1.3 s
  • 0.4 s

Windows 11 Parallels VM:

  • 5.2 s
  • 4.3 s
  • 4.3 s
  • 3.8 s
  • 4.0 s

Benchmark 3

Windows 11 Parallels VM:

  • 3.3 s
  • 3.0 s
  • 2.7 s
  • 2.8 s
  • 2.8 s
  • Median: 2.8 s

Why is main faster than this PR that's supposed to improve speed? 🤔

Benchmark 4

  • b23ed1033 from this PR
  • Unoptimized, compare with Windows in Benchmark 3

Windows 11 Parallels VM:

  • 2.6 s
  • 2.2 s
  • 2.3 s
  • 2.1 s
  • 2.3 s

Now this PR is faster again. I must have made some human error in the testing.

Benchmark 5

  • 55be4274f from this PR, which skips a redundant DNS deactivate during every sign-in
  • Unoptimized, compare with Windows in Benchmarks 3 and 4

Windows 11 Parallels VM:

  • 3.8 s (outlier, hopefully, since I was signing in)
  • (restarted the IPC service so that the token would be on disk but the service would be freshly-started)
  • 2.2 s
  • 2.0 s
  • 1.8 s
  • 1.8 s
  • 1.9 s
  • Median: 1.9 s

Benchmark 6

  • 0355db043 from this PR, which skips an entire Powershell step that doesn't seem to do anything
  • Unoptimized, compare with Windows in Benchmarks 3, 4, and 5

Windows 11 Parallels VM:

  • 2.0 s
  • 1.3 s
  • 1.4 s
  • 1.4 s
  • 1.4 s
  • Median: 1.4 s (down from 2.8 s on main)

Benchmark 7

  • Optimized release builds
  • x86-64 constrained VM (1 CPU thread, 2 GB RAM)

Main with measurement added, c1c99197e from #5864

  • 6.0 s
  • 5.5 s
  • 4.1 s
  • 5.0 s
  • 4.1 s
  • (Median = 5.0 s)

Main with speedups added, 2128329f9 from #5375, this PR

  • 3.7 s
  • 2.2 s
  • 1.9 s
  • 2.3 s
  • 2.0 s
  • (Median = 2.2 s)

It's about twice as fast on median.

Base automatically changed from chore/gui-measure-startup-time to main July 16, 2024 15:00
@ReactorScram ReactorScram enabled auto-merge July 16, 2024 21:19
@ReactorScram ReactorScram added this pull request to the merge queue Jul 16, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jul 16, 2024
@thomaseizinger thomaseizinger added this pull request to the merge queue Jul 16, 2024
Merged via the queue into main with commit 6362334 Jul 16, 2024
@thomaseizinger thomaseizinger deleted the fix/windows-slow-startup branch July 16, 2024 22:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(gui-client/windows): DNS leaks when roaming Initial sign-in takes 30 seconds on Windows

3 participants