-
-
Notifications
You must be signed in to change notification settings - Fork 106
Description
Summary
The topology manager's adjust_topology()
function uses neighbor_locations.len()
to check against min_connections
, but neighbor_locations
is filtered to exclude connections younger than 5 seconds. This causes incorrect topology decisions during network startup.
Root Cause
In crates/core/src/topology/mod.rs
, the adjust_topology()
function receives neighbor_locations
which has already been filtered by connection age:
// In ring/mod.rs - caller filters connections
let neighbor_locations: BTreeMap<_, _> = self
.connection_manager
.get_connections()
.iter()
.filter(|conn| conn.age() > Duration::from_secs(5)) // FILTERED
.filter_map(|conn| {
conn.location()
.location
.map(|location| (location, vec![conn.clone()]))
})
.collect();
// Then passed to adjust_topology
let adjustment = self
.connection_manager
.topology_manager
.write()
.adjust_topology(
&neighbor_locations, // Uses filtered count
// ...
);
In adjust_topology()
:
pub(crate) fn adjust_topology(
&mut self,
neighbor_locations: &BTreeMap<Location, Vec<Connection>>,
// ...
) -> TopologyAdjustment {
// BUG: This uses the filtered count
if neighbor_locations.len() < self.limits.min_connections {
// Tries to add more connections
}
}
During network startup, all connections are < 5 seconds old, so neighbor_locations.len() = 0
even when the node has connections. This causes the manager to constantly try adding more connections.
Impact
Why the bug is currently hidden:
The bug accidentally helps tests pass! With default min_connections=25
, small test networks (4 nodes) would normally experience connection churn as the topology manager tries to reach 25 connections (impossible with 4 nodes). The bug bypasses this by reporting 0 connections during startup, avoiding the min_connections check.
What happens when fixed:
Fixing this bug exposes configuration issues in the test suite:
- Operations tests (crates/core/tests/operations.rs) use default
min_connections=25
with 4-node networks - When topology manager correctly sees actual connection counts, it tries to add 24 more connections
- This causes connection churn that breaks PUT operations
Attempted Fix (Reverted)
An initial fix was attempted in PR #1937 (commit 02e63ef):
- Added
current_connections
parameter toadjust_topology()
- Used actual connection count instead of filtered count
- Added logic to skip resource-based removal in networks < 5 connections
This fix was reverted because:
- It caused 5 operations tests to fail (test_put_contract, test_update_contract, etc.)
- Even after adding
.with_min_connections(1)
to operations tests, tests still failed with timeouts - The scope of required changes was too large to include in PR feat: proximity-based update forwarding (clean implementation) #1937
Proper Fix Requirements
A comprehensive fix would require:
-
Update
adjust_topology()
signature to accept actual connection count:pub(crate) fn adjust_topology( &mut self, neighbor_locations: &BTreeMap<Location, Vec<Connection>>, my_location: &Option<Location>, at_time: Instant, current_connections: usize, // NEW ) -> TopologyAdjustment
-
Use actual count for min_connections logic:
if current_connections < self.limits.min_connections { // Now uses correct count }
-
Skip resource-based removal in small networks:
if current_connections < 5 { // Don't remove connections in very small networks return TopologyAdjustment::NoChange; }
-
Update ALL test configurations to use appropriate
min_connections
:- operations.rs: 21 instances need
.with_min_connections(1)
for 4-node networks - connectivity tests: Review and update as needed
- proximity_forwarding tests: Review and update as needed
- operations.rs: 21 instances need
-
Investigate PUT operation timeouts: Even with topology churn fixed, operations tests still failed, suggesting another underlying issue.
Related
- PR feat: proximity-based update forwarding (clean implementation) #1937 - Where the fix was attempted and reverted
- Issue PUT response delivery fails in 2-node networks with proximity cache broadcasts #1960 - Related PUT operation issues
- Commit 02e63ef - The reverted fix
Next Steps
This should be addressed in a dedicated PR that:
- Implements the topology manager fix
- Systematically updates all test configurations
- Investigates and resolves any remaining PUT operation issues
- Ensures all tests pass with the corrected topology manager
[AI-assisted debugging and comment]
Metadata
Metadata
Assignees
Labels
Type
Projects
Status