Skip to content

bug: topology manager uses filtered connection count instead of actual count #1962

@sanity

Description

@sanity

Summary

The topology manager's adjust_topology() function uses neighbor_locations.len() to check against min_connections, but neighbor_locations is filtered to exclude connections younger than 5 seconds. This causes incorrect topology decisions during network startup.

Root Cause

In crates/core/src/topology/mod.rs, the adjust_topology() function receives neighbor_locations which has already been filtered by connection age:

// In ring/mod.rs - caller filters connections
let neighbor_locations: BTreeMap<_, _> = self
    .connection_manager
    .get_connections()
    .iter()
    .filter(|conn| conn.age() > Duration::from_secs(5))  // FILTERED
    .filter_map(|conn| {
        conn.location()
            .location
            .map(|location| (location, vec![conn.clone()]))
    })
    .collect();

// Then passed to adjust_topology
let adjustment = self
    .connection_manager
    .topology_manager
    .write()
    .adjust_topology(
        &neighbor_locations,  // Uses filtered count
        // ...
    );

In adjust_topology():

pub(crate) fn adjust_topology(
    &mut self,
    neighbor_locations: &BTreeMap<Location, Vec<Connection>>,
    // ...
) -> TopologyAdjustment {
    // BUG: This uses the filtered count
    if neighbor_locations.len() < self.limits.min_connections {
        // Tries to add more connections
    }
}

During network startup, all connections are < 5 seconds old, so neighbor_locations.len() = 0 even when the node has connections. This causes the manager to constantly try adding more connections.

Impact

Why the bug is currently hidden:

The bug accidentally helps tests pass! With default min_connections=25, small test networks (4 nodes) would normally experience connection churn as the topology manager tries to reach 25 connections (impossible with 4 nodes). The bug bypasses this by reporting 0 connections during startup, avoiding the min_connections check.

What happens when fixed:

Fixing this bug exposes configuration issues in the test suite:

  • Operations tests (crates/core/tests/operations.rs) use default min_connections=25 with 4-node networks
  • When topology manager correctly sees actual connection counts, it tries to add 24 more connections
  • This causes connection churn that breaks PUT operations

Attempted Fix (Reverted)

An initial fix was attempted in PR #1937 (commit 02e63ef):

  • Added current_connections parameter to adjust_topology()
  • Used actual connection count instead of filtered count
  • Added logic to skip resource-based removal in networks < 5 connections

This fix was reverted because:

  1. It caused 5 operations tests to fail (test_put_contract, test_update_contract, etc.)
  2. Even after adding .with_min_connections(1) to operations tests, tests still failed with timeouts
  3. The scope of required changes was too large to include in PR feat: proximity-based update forwarding (clean implementation) #1937

Proper Fix Requirements

A comprehensive fix would require:

  1. Update adjust_topology() signature to accept actual connection count:

    pub(crate) fn adjust_topology(
        &mut self,
        neighbor_locations: &BTreeMap<Location, Vec<Connection>>,
        my_location: &Option<Location>,
        at_time: Instant,
        current_connections: usize,  // NEW
    ) -> TopologyAdjustment
  2. Use actual count for min_connections logic:

    if current_connections < self.limits.min_connections {
        // Now uses correct count
    }
  3. Skip resource-based removal in small networks:

    if current_connections < 5 {
        // Don't remove connections in very small networks
        return TopologyAdjustment::NoChange;
    }
  4. Update ALL test configurations to use appropriate min_connections:

    • operations.rs: 21 instances need .with_min_connections(1) for 4-node networks
    • connectivity tests: Review and update as needed
    • proximity_forwarding tests: Review and update as needed
  5. Investigate PUT operation timeouts: Even with topology churn fixed, operations tests still failed, suggesting another underlying issue.

Related

Next Steps

This should be addressed in a dedicated PR that:

  1. Implements the topology manager fix
  2. Systematically updates all test configurations
  3. Investigates and resolves any remaining PUT operation issues
  4. Ensures all tests pass with the corrected topology manager

[AI-assisted debugging and comment]

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions