Connector heartbeat interval (lease_duration / 2) can reach ~10 min, gating tunnel usability #159

@drewr

This issue is part of a four-issue tunnel creation story.

A newly created tunnel takes up to ~14 minutes before it reliably routes traffic. There are two distinct delays, each with an operator-side and a client-side component:

  • Delay 1 (~3-4 min): creation → toggle turns green
  • Delay 2 (~0-10 min): toggle green → traffic flows
  • UX consequence
    • app#160 — green toggle shown before tunnel is usable

Summary

This issue and app#160 describe the same degraded experience: a tunnel takes minutes to become usable, and in the meantime the app shows it as active even though it is not yet routing traffic.

The connector heartbeat interval is computed as lease_duration_seconds / 2 + jitter (app/lib/src/heartbeat.rs:566-573). For the observed connector datum-connect-jttwh, the Lease has lease_duration_seconds ≈ 1200, yielding a ~10 minute heartbeat interval (600s base plus up to 120s of jitter).

Since the network-services-operator re-reconciles downstream tunnel routes on each connector heartbeat (see network-services-operator#167), this interval directly controls how long a newly created tunnel remains non-functional after the toggle turns green. A user who creates a tunnel immediately after a heartbeat fires will wait the full ~10 minutes before routing is confirmed live.
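The worst case described above can be worked through with a small sketch. It assumes the operator reconciles as soon as a heartbeat arrives and ignores jitter and processing time; `wait_for_next_heartbeat` is a hypothetical helper, not a function from the codebase.

```rust
use std::time::Duration;

/// Hypothetical helper: how long a newly created tunnel waits for the next
/// heartbeat-driven reconcile, given the heartbeat interval and the time
/// elapsed since the last heartbeat fired.
fn wait_for_next_heartbeat(interval: Duration, since_last: Duration) -> Duration {
    // saturating_sub returns zero if the heartbeat is already overdue.
    interval.saturating_sub(since_last)
}
```

With a 600s interval, a tunnel created 5s after a heartbeat waits 595s, while one created 590s after a heartbeat waits only 10s.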

Relevant code

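// app/lib/src/heartbeat.rs
use std::time::Duration;
use rand::Rng; // brings random_range into scope

const DEFAULT_LEASE_DURATION_SECS: i32 = 30;

// Renew at half the lease duration, plus up to 20% random jitter.
fn renewal_interval(lease_duration_seconds: i32) -> Duration {
    // Duration::from_secs takes a u64, so cast after clamping to at least 1s.
    let base = Duration::from_secs((lease_duration_seconds / 2).max(1) as u64);
    let jitter_max = (base.as_secs() / 5).max(1);
    let mut rng = rand::rng();
    let jitter = rng.random_range(0..=jitter_max);
    base + Duration::from_secs(jitter)
}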

DEFAULT_LEASE_DURATION_SECS is 30s, but the actual Lease resource in the cluster has a much longer duration (~1200s), overriding this default.
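A minimal sketch of how the cluster value could take precedence over the default. It assumes the duration is read from an optional field on the Lease and falls back to the constant when the field is absent; `effective_lease_duration` is a hypothetical name, and the real lookup in heartbeat.rs may differ.

```rust
const DEFAULT_LEASE_DURATION_SECS: i32 = 30;

/// Hypothetical helper: choose the lease duration that drives the heartbeat.
/// `from_lease` stands in for the optional duration on the Lease resource.
fn effective_lease_duration(from_lease: Option<i32>) -> i32 {
    // A value set on the Lease wins; the default applies only when unset.
    from_lease.unwrap_or(DEFAULT_LEASE_DURATION_SECS)
}
```

Under this assumption, a Lease carrying 1200 yields 1200, and the 30s default is used only when the Lease has no duration set.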

Expected

Either:

  1. (Preferred) Decouple tunnel routing confirmation from the heartbeat entirely: the fix in network-services-operator#167 has the NSO push addressing proactively, which makes the heartbeat interval irrelevant for Delay 2.
  2. (Short-term) Reduce the Lease duration in the cluster to something appropriate for interactive use (e.g. 60s, which gives a ~30s heartbeat interval), so Delay 2 is bounded to <60s even without the NSO fix.
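To compare candidate lease durations, the bounds of the `renewal_interval` formula can be computed without the random draw. This is an illustrative sketch; `renewal_interval_bounds` is a hypothetical helper that mirrors the arithmetic quoted from heartbeat.rs.

```rust
use std::time::Duration;

/// Hypothetical helper: smallest and largest interval the renewal formula
/// can produce for a lease duration (base = lease / 2, jitter in 0..=base / 5).
fn renewal_interval_bounds(lease_duration_seconds: i32) -> (Duration, Duration) {
    // Clamp to at least 1s, as the quoted code does, then widen to u64.
    let base_secs = (lease_duration_seconds / 2).max(1) as u64;
    let jitter_max = (base_secs / 5).max(1);
    (
        Duration::from_secs(base_secs),
        Duration::from_secs(base_secs + jitter_max),
    )
}
```

`renewal_interval_bounds(60)` gives 30s to 36s, within the <60s target, while `renewal_interval_bounds(1200)` gives 600s to 720s.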

Impact

There is a window of up to ~10 minutes in which the tunnel toggle is green but the tunnel does not route traffic. The wait is longest when the tunnel is created right after a heartbeat. See also network-services-operator#166 for Delay 1 (creation → toggle green).

Labels: bug
