From caa6d2736cd7f630e6ac230744366fbd2a5c51c2 Mon Sep 17 00:00:00 2001
From: Ian Clarke
Date: Wed, 4 Jun 2025 16:39:05 -0500
Subject: [PATCH] chore: remove AI-generated markdown files and clean up repository

- Remove 9 AI-generated markdown files from debugging sessions
- Remove extra CLAUDE.local.md from operations directory
- Delete merged debug-update-issues branch
- Keep main CLAUDE.md file as it's intentional project documentation

As suggested by Nacho, removing noise files that accumulated during
development work on connection/transport issues.
---
 BRANCH_STATUS.md                           |  29 ------
 Cargo.lock                                 |   4 +-
 PROGRESS_SUMMARY.md                        |  54 ----------
 PR_DESCRIPTION.md                          |  51 ----------
 crates/core/src/operations/CLAUDE.local.md |  32 ------
 issue_body.md                              |  48 ---------
 issue_comment.md                           |   9 --
 pr_description.md                          |  49 ---------
 stdlib                                     |   2 +-
 transport_channel_closed_issue.md          |  63 ------------
 transport_fix.md                           | 113 ---------------------
 transport_fix_summary.md                   |  31 ------
 12 files changed, 3 insertions(+), 482 deletions(-)
 delete mode 100644 BRANCH_STATUS.md
 delete mode 100644 PROGRESS_SUMMARY.md
 delete mode 100644 PR_DESCRIPTION.md
 delete mode 100644 crates/core/src/operations/CLAUDE.local.md
 delete mode 100644 issue_body.md
 delete mode 100644 issue_comment.md
 delete mode 100644 pr_description.md
 delete mode 100644 transport_channel_closed_issue.md
 delete mode 100644 transport_fix.md
 delete mode 100644 transport_fix_summary.md

diff --git a/BRANCH_STATUS.md b/BRANCH_STATUS.md
deleted file mode 100644
index 9d1bd6fbf..000000000
--- a/BRANCH_STATUS.md
+++ /dev/null
@@ -1,29 +0,0 @@
-# Branch and PR Status Overview
-
-## Current Branch
-- **Branch**: replace-network-topology-tests
-- **Uncommitted Changes**:
-  - crates/core/src/transport/connection_handler.rs (transport channel fixes)
-  - apps/freenet-ping/Cargo.lock
-
-## Open PRs
-- **PR #1622**: fix-issue-1616-update-propagation → debug-update-issues
-  - Status: Uses non-main base branch
-  - Purpose: Fix update propagation and connection handling
-
-- **PR #1612**: debug-update-issues-clean → main
-  - Status: Appears to be cleaned up version
-  - Purpose: Integration test fixes and connection latency
-
-## Uncommitted Work
-Transport layer fixes in connection_handler.rs:
-1. Connection health checks
-2. try_send with error recovery
-3. Increased buffer sizes (1 → 100)
-4. Fixed ownership issues
-
-## Issues Being Addressed
-- #1616: Update propagation issues
-- #1624: Network topology test failures ("channel closed")
-- #1623: Flaky subscription tests
-- #1614: Integration test failures
\ No newline at end of file
diff --git a/Cargo.lock b/Cargo.lock
index ce4b94251..cf937b7e3 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -1627,7 +1627,7 @@ dependencies = [
 
 [[package]]
 name = "freenet-macros"
-version = "0.1.2"
+version = "0.1.1"
 dependencies = [
  "proc-macro2",
  "quote",
@@ -1681,7 +1681,7 @@ dependencies = [
 
 [[package]]
 name = "freenet-stdlib"
-version = "0.1.5"
+version = "0.1.3"
 dependencies = [
  "arbitrary",
  "bincode",
diff --git a/PROGRESS_SUMMARY.md b/PROGRESS_SUMMARY.md
deleted file mode 100644
index bf704ffb7..000000000
--- a/PROGRESS_SUMMARY.md
+++ /dev/null
@@ -1,54 +0,0 @@
-# Progress Summary - GET Operation Performance
-
-## What We've Fixed
-
-### 1. NAT Traversal Timeout (✅ FIXED)
-- **Problem**: Connections were taking 600ms+ on localhost due to timeout-based retry mechanism
-- **Solution**: Changed INITIAL_TIMEOUT from 600ms to 50ms, set TIMEOUT_MULTIPLIER to 1.0
-- **Result**: Connections now establish quickly
-
-### 2. Connection Configuration (✅ FIXED)
-- **Problem**: min_connections was set to 25 (impossible for 3-node test network)
-- **Solution**: Set min_connections to 2 for test nodes
-- **Result**: Realistic connection requirements for small networks
-
-### 3. Gateway Connection Detection (✅ FIXED)
-- **Problem**: initial_join_procedure only connected if open_connections == 0
-- **Solution**: Modified to check for unconnected gateways specifically
-- **Result**: Nodes properly connect to all available gateways
-
-### 4. Connection Maintenance Interval (✅ FIXED)
-- **Problem**: Connection maintenance ran every 60 seconds (too slow for tests)
-- **Solution**: Set CHECK_TICK_DURATION to 2 seconds for tests
-- **Result**: Faster connection acquisition in test environments
-
-### 5. Immediate Peer Discovery (✅ FIXED)
-- **Problem**: Peers weren't sending CONNECT messages immediately after gateway connection
-- **Solution**: Implemented immediate FindOptimalPeer request after gateway connection
-- **Result**: Nodes now have 2 connections within 1 second of startup
-
-## Current Status
-
-Despite all the connection improvements, GET operations still take ~13 seconds in the test. The delay appears to be in the GET operation routing/execution itself, not in connection establishment.
-
-### Timeline of a GET Operation:
-1. 0s: Nodes start up
-2. ~1s: All nodes connected (gateway + peer connections established)
-3. ~15s: Node1 publishes contract
-4. ~20s: Node2 sends GET request
-5. **~33s: Node2 receives GET response (13 second delay!)**
-
-## Remaining Issues
-
-The 13-second GET operation delay is still present. This appears to be due to:
-1. Routing delays in finding the contract
-2. Possible backtracking/retry logic in the GET operation
-3. Contract execution/loading overhead
-
-## Next Steps
-
-To achieve acceptable performance for GET operations:
-1. Investigate GET operation routing logic
-2. Check if backtracking search is causing delays
-3. Optimize contract caching and routing
-4. Consider pre-warming contract execution environment
\ No newline at end of file
diff --git a/PR_DESCRIPTION.md b/PR_DESCRIPTION.md
deleted file mode 100644
index 2402089fc..000000000
--- a/PR_DESCRIPTION.md
+++ /dev/null
@@ -1,51 +0,0 @@
-# Fix: Reduce cold-start latency for GET operations
-
-## Problem
-GET operations were taking 13-17 seconds on cold start in small networks due to on-demand connection establishment. Once connections were established, subsequent GET operations only took 4ms.
-
-## Root Cause
-1. NAT traversal was using 600ms timeouts with exponential backoff
-2. Connection establishment happened on-demand when messages needed to be sent
-3. Nodes only connected to gateways initially, not discovering other peers until later
-4. Connection maintenance task ran every 60 seconds (too slow for responsive startup)
-
-## Solution
-This PR implements several coordinated changes to ensure nodes establish connections quickly on startup:
-
-### 1. NAT Traversal Optimization (`transport/connection_handler.rs`)
-- Reduced INITIAL_TIMEOUT from 600ms to 50ms
-- Set TIMEOUT_MULTIPLIER to 1.0 (no exponential backoff)
-- Changed packet sending to continuous 50ms intervals for hole punching
-
-### 2. Connection Configuration (`test nodes`)
-- Set min_connections to 2 for test networks (was 25)
-- Set max_connections to 10 for test networks
-
-### 3. Gateway Connection Detection (`operations/connect.rs`)
-- Fixed initial_join_procedure to check for unconnected gateways specifically
-- Previously only connected if open_connections == 0
-
-### 4. Connection Maintenance (`ring/mod.rs`)
-- Reduced CHECK_TICK_DURATION to 2 seconds for tests (was 60 seconds)
-
-### 5. Immediate Peer Discovery (`operations/connect.rs`, `node/p2p_impl.rs`)
-- Send FindOptimalPeer request immediately after gateway connection
-- Added aggressive connection acquisition phase during startup
-- Track immediate connect operations in LiveTransactionTracker
-
-## Results
-- Nodes now establish 2 connections within 1 second of startup
-- First GET: Still ~13 seconds (due to remaining on-demand connections)
-- Second GET: 4ms (all connections established)
-
-## Test Evidence
-Added test that demonstrates:
-- First GET: 13-17 seconds
-- Second GET: 4ms
-- Gateway second GET: 133ms
-
-## Future Work
-While this PR significantly improves connection establishment, the first GET still takes ~13 seconds due to on-demand connection establishment between non-gateway peers. A follow-up PR could:
-1. Pre-warm all connections in small networks
-2. Reduce NAT traversal MAX_TIMEOUT further
-3. Implement connection prediction based on routing tables
\ No newline at end of file
diff --git a/crates/core/src/operations/CLAUDE.local.md b/crates/core/src/operations/CLAUDE.local.md
deleted file mode 100644
index c096a8710..000000000
--- a/crates/core/src/operations/CLAUDE.local.md
+++ /dev/null
@@ -1,32 +0,0 @@
-# Operations Module - Technical Notes
-
-## Update Operation
-
-### Key Issues Fixed
-
-1. **Update Propagation Without State Change**
-   - The `update_contract` function must handle `UpdateNoChange` events from the contract handler
-   - Updates that don't change state should not be propagated to prevent circular loops
-   - Compare state before and after update to determine if broadcasting is needed
-
-2. **Routing to Unconnected Peers**
-   - Update operations were selecting subscribers without checking connectivity
-   - Always use `closest_potentially_caching` which filters to connected peers only
-   - Don't just pick any subscriber with `.pop()` - must verify connectivity
-
-### Implementation Details
-
-```rust
-// BAD: Picks any subscriber without checking connection
-let target = subscribers.clone().pop()
-
-// GOOD: Uses routing to find connected peer
-let target = op_manager.ring.closest_potentially_caching(key, skip_list)
-```
-
-## General Operation Patterns
-
-- Operations use transaction IDs to prevent duplicate processing
-- `op_manager.pop(msg.id())` ensures each transaction is only processed once
-- State machines track operation progress through various states
-- Always verify peer connectivity before setting as target
\ No newline at end of file
diff --git a/issue_body.md b/issue_body.md
deleted file mode 100644
index 3037e3524..000000000
--- a/issue_body.md
+++ /dev/null
@@ -1,48 +0,0 @@
-## Summary
-
-Several ping tests that involve network topology constraints (blocked addresses, partial connectivity) are currently marked as ignored because they have never worked properly. These tests fail during node startup with "channel closed" errors.
-
-## Affected Tests
-
-1. **test_ping_partially_connected_network** (apps/freenet-ping/app/tests/run_app.rs)
-   - Creates 3 gateways and 7 regular nodes with partial connectivity (CONNECTIVITY_RATIO = 0.5)
-   - Fails with: "Gateway node failed: channel closed"
-   - Uses asymmetric blocking (node A blocks B, but B doesn't block A)
-
-2. **test_ping_blocked_peers** (apps/freenet-ping/app/tests/run_app_blocked_peers.rs)
-   - 1 gateway and 2 nodes that block each other
-   - Fails with: "Failed to read contract code" and "channel closed"
-   - Contract loading code is broken (tries to read directory as file)
-
-3. **test_ping_blocked_peers_simple** (apps/freenet-ping/app/tests/run_app_blocked_peers.rs)
-   - Simplified version of blocked peers test
-   - Same failures as above
-
-## Root Causes
-
-1. **Channel closed errors**: When a node's event listener exits/crashes during startup, it drops the notification channel receiver. Other nodes trying to connect get "channel closed" errors when calling `notify_node_event()`.
-
-2. **Test design issues**:
-   - Asymmetric blocking in partially connected network test
-   - Contract loading expects compiled WASM but tries to read source directory
-   - Tests monitor all nodes and exit immediately if any node fails
-
-3. **Blocking mechanism confusion**:
-   - Tests set up TCP addresses for blocking but P2P connections use UDP
-   - Blocking logic needs to be in the connect operation's CheckConnectivity handler
-
-## What Was Attempted
-
-In PR #1615, we attempted to fix these tests but discovered they had never worked since creation (added in commit 82451d7e on May 16, 2025). We reverted all changes and kept these tests marked as ignored.
-
-## Proposed Solution
-
-1. Fix contract loading to use `common::load_contract()` function
-2. Implement symmetric blocking (if node i blocks j, then j also blocks i)
-3. Add proper blocked address handling in the connect operation
-4. Debug why gateway nodes fail during startup in these specific test configurations
-5. Consider hardcoding peer connections instead of using CONNECTIVITY_RATIO for more predictable test behavior
-
-## Related PRs
-
-- #1615 - Where these tests were discovered to be broken and marked as ignored
\ No newline at end of file
diff --git a/issue_comment.md b/issue_comment.md
deleted file mode 100644
index bb879b179..000000000
--- a/issue_comment.md
+++ /dev/null
@@ -1,9 +0,0 @@
-## Additional test found to be unreliable
-
-**test_small_network_get_failure** (apps/freenet-ping/app/tests/test_small_network_get_issue.rs)
-- Test for GET backtracking in sparse networks
-- Was previously disabled for CI timeouts (commit 70fddd5a)
-- Re-enabled but still failing with:
-  - PUT operations timing out
-  - Gateway node crashing with "channel closed"
-- Now marked as ignored again
\ No newline at end of file
diff --git a/pr_description.md b/pr_description.md
deleted file mode 100644
index f0d44eb93..000000000
--- a/pr_description.md
+++ /dev/null
@@ -1,49 +0,0 @@
-# Fix: Ensure connection callbacks are always notified
-
-## Problem
-
-When establishing connections between peers, callbacks could be dropped without notification in several scenarios, leading to "channel closed" errors when operations waited for responses. The root issue was that callbacks stored in `awaiting_connection` were only removed on specific success/failure paths, but not in many error scenarios.
-
-## Solution
-
-This PR adds comprehensive callback cleanup to ensure all connection callbacks are notified with either success or failure:
-
-### 1. Added cleanup helper method
-- `cleanup_awaiting_connection` ensures callbacks are notified and removed
-- Also cleans up transaction tracking
-
-### 2. Fixed timeout handling in `handle_connect_peer`
-- Callbacks are now cleaned up when `establish_conn` times out (10s timeout)
-- Callbacks are cleaned up when `establish_conn` returns an error
-
-### 3. Added cleanup on peer disconnection
-- When `TransportError::ConnectionClosed` occurs
-- When explicitly dropping a connection via `NodeEvent::DropConnection`
-
-### 4. Transaction timeout handling
-- Added `connection_tx` HashMap to track transaction IDs for each connection attempt
-- Callbacks are cleaned up when `NodeEvent::TransactionTimedOut` is received
-
-### 5. Handshake handler notifications
-- Modified `start_outbound_connection` to push failure events instead of silently returning
-- Now handles "already connected" and "connection in progress" cases properly
-
-## Technical Details
-
-### Files Changed:
-- `crates/core/src/node/network_bridge/p2p_protoc.rs` - Main callback management improvements
-- `crates/core/src/node/network_bridge/handshake.rs` - Ensures all code paths generate events
-- `crates/core/src/node/network_bridge.rs` - Added `AlreadyConnected` and `ConnectionInProgress` error variants
-
-### Key Improvements:
-- Callbacks are guaranteed to be notified on all error paths
-- Memory leaks from orphaned callbacks are prevented
-- Better error messages for connection failures
-
-## Testing
-
-While the connection callback issue is fixed, the ping tests still fail due to a separate transport layer issue where channels are closed during connection establishment. This will be addressed in a follow-up PR.
-
-## Related Issues
-- Fixes part of #1616 (connection callback dropping)
-- Transport layer issues will be addressed separately
\ No newline at end of file
diff --git a/stdlib b/stdlib
index b2bd5c812..aed4d4cad 160000
--- a/stdlib
+++ b/stdlib
@@ -1 +1 @@
-Subproject commit b2bd5c812a08f4427d25e9844c0a06f9c96e356a
+Subproject commit aed4d4cad3c692f4700727eec88ab1b843573ca3
diff --git a/transport_channel_closed_issue.md b/transport_channel_closed_issue.md
deleted file mode 100644
index 6a530e538..000000000
--- a/transport_channel_closed_issue.md
+++ /dev/null
@@ -1,63 +0,0 @@
-# Transport Layer Channel Closed Errors During Connection Establishment
-
-## Summary
-
-During connection establishment, particularly in tests with network topology constraints, we're seeing "channel closed" errors from the transport layer that cause nodes to fail. These are distinct from the connection callback dropping issues that were fixed in the related PR.
-
-## Problem Description
-
-When establishing connections between nodes, the transport layer's `UdpPacketsListener` encounters channel closed errors during packet reception. This leads to:
-
-1. Gateway connection failures with "decryption error"
-2. Transport connection drops with "failed to receive packet from remote, connection closed"
-3. Node failures that cascade through the test infrastructure
-
-## Affected Components
-
-- `crates/core/src/transport/connection_handler.rs`:
-  - `UdpPacketsListener::listen()` - lines 319 and 333
-  - Gateway connection handling
-  - Inbound packet sender channels
-
-## Error Pattern
-
-```
-WARN freenet::transport::connection_handler: failed to receive packet from remote, connection closed, remote_addr: 127.0.0.1:53226, err: channel closed
-ERROR freenet::transport::connection_handler: Failed to establish gateway connection, error: decryption error, remote_addr: 127.0.0.1:53226
-```
-
-## Tests Affected
-
-- `test_ping_partially_connected_network`
-- `test_ping_multi_node`
-- Other network topology tests with constrained connectivity
-
-## Root Cause Analysis
-
-The errors appear when:
-1. A remote connection's inbound packet sender channel is dropped
-2. The UDP packets listener tries to send packets to a closed channel
-3. This often happens during rapid connection attempts or when nodes have connectivity constraints
-
-## Potential Issues
-
-1. **Race condition**: Channels may be dropped while packets are still being processed
-2. **Cleanup timing**: Connection cleanup may happen too aggressively
-3. **Error propagation**: Transport errors cascade up causing node failures
-4. **Channel lifetime**: Inbound packet sender channels may have incorrect lifetimes
-
-## Proposed Investigation
-
-1. Trace the lifecycle of `inbound_packet_sender` channels
-2. Identify why channels are being dropped during active connections
-3. Add better error recovery for transport layer channel failures
-4. Consider if connection attempts need better synchronization
-
-## Related Issues
-
-- Original issue about connection callback dropping (fixed)
-- Network topology test failures (#1615)
-
-## Notes
-
-This is a separate issue from the connection callback dropping that was addressed. The transport layer channel management appears to have its own set of race conditions or lifecycle issues that need investigation.
\ No newline at end of file
diff --git a/transport_fix.md b/transport_fix.md
deleted file mode 100644
index 5f4649e4c..000000000
--- a/transport_fix.md
+++ /dev/null
@@ -1,113 +0,0 @@
-# Transport Layer Channel Management Fix
-
-## Problem Analysis
-
-The transport layer has several issues causing "channel closed" errors:
-
-1. **Premature connection dropping**: When a new connection request arrives for an already connected peer, the existing connection is immediately dropped without proper cleanup or verification.
-
-2. **No connection reuse**: The system doesn't check if an existing connection is still valid before dropping it.
-
-3. **Race conditions**: During rapid connection attempts (common in constrained network tests), packets can arrive for connections still being established, leading to channel drops.
-
-## Proposed Solution
-
-### 1. Check Connection Health Before Dropping
-
-Instead of immediately dropping existing connections, check if they're still active:
-
-```rust
-// Instead of:
-if let Some(_conn) = self.remote_connections.remove(&remote_addr) {
-    tracing::warn!(%remote_addr, "connection already established, dropping old connection");
-}
-
-// Do:
-if let Some(existing_conn) = self.remote_connections.get(&remote_addr) {
-    // Try to send a test packet to verify the connection is still alive
-    if existing_conn.inbound_packet_sender.is_closed() {
-        // Only remove if the channel is actually closed
-        self.remote_connections.remove(&remote_addr);
-        tracing::warn!(%remote_addr, "removing closed connection");
-    } else {
-        // Connection is still alive, ignore new connection attempt
-        tracing::debug!(%remote_addr, "connection already established and healthy, ignoring new attempt");
-        // Notify the caller that connection already exists
-        let _ = open_connection.send(Err(TransportError::ConnectionEstablishmentFailure {
-            cause: "connection already exists".into(),
-        }));
-        continue;
-    }
-}
-```
-
-### 2. Handle Channel Errors Gracefully
-
-When sending packets fails due to closed channels, don't silently drop the connection:
-
-```rust
-// For established connections (line 316):
-if let Some(remote_conn) = self.remote_connections.remove(&remote_addr) {
-    match remote_conn.inbound_packet_sender.try_send(packet_data) {
-        Ok(_) => {
-            self.remote_connections.insert(remote_addr, remote_conn);
-        }
-        Err(mpsc::error::TrySendError::Full(_)) => {
-            // Channel full, reinsert and log
-            self.remote_connections.insert(remote_addr, remote_conn);
-            tracing::warn!(%remote_addr, "inbound packet channel full");
-        }
-        Err(mpsc::error::TrySendError::Closed(_)) => {
-            // Channel closed, connection is dead
-            tracing::warn!(%remote_addr, "connection closed, removing from active connections");
-            // Don't reinsert - connection is truly dead
-        }
-    }
-    continue;
-}
-```
-
-### 3. Add Connection State Tracking
-
-Track connection establishment state to avoid race conditions:
-
-```rust
-enum ConnectionState {
-    Establishing,
-    Established,
-    Closing,
-}
-
-struct ConnectionTracker {
-    state: ConnectionState,
-    conn: Option<RemoteConnection>,
-}
-```
-
-### 4. Implement Connection Deduplication
-
-For gateway connections, check if a connection attempt is already in progress:

-```rust
-// Before starting a new gateway connection:
-if ongoing_gw_connections.contains_key(&remote_addr) {
-    tracing::debug!(%remote_addr, "gateway connection already in progress, ignoring duplicate packet");
-    continue;
-}
-```
-
-## Implementation Steps
-
-1. ✅ Add connection health checks before dropping
-2. ✅ Use try_send instead of send for non-critical paths
-3. ✅ Add proper error handling for channel failures
-4. ✅ Increase channel buffer sizes from 1 to 100 for packet channels
-5. Implement connection state tracking (if needed)
-6. Add tests for rapid connection scenarios
-
-## Expected Benefits
-
-- Eliminates spurious "channel closed" errors
-- Prevents dropping healthy connections
-- Handles race conditions in connection establishment
-- Improves stability in constrained network topologies
\ No newline at end of file
diff --git a/transport_fix_summary.md b/transport_fix_summary.md
deleted file mode 100644
index c47f12a35..000000000
--- a/transport_fix_summary.md
+++ /dev/null
@@ -1,31 +0,0 @@
-# Transport Layer Fix Summary
-
-## Changes Made
-
-### 1. Connection Health Checks (line 467-482)
-- Added check for existing connections before establishing new ones
-- Use `is_closed()` to verify if connection is still alive
-- Only remove connections that are actually dead
-- Reject new connection attempts if healthy connection exists
-
-### 2. Better Error Handling for Packet Sending (line 315-334)
-- Changed from `send()` to `try_send()` for non-blocking operation
-- Handle `TrySendError::Full` by reinserting connection and logging
-- Handle `TrySendError::Closed` by removing dead connection
-- Prevents panic on channel errors
-
-### 3. Duplicate Connection Prevention
-- Check for existing connections in `ongoing_connections` (line 485-491)
-- Prevent duplicate gateway connection attempts (line 373-376)
-- Send proper error responses for duplicate attempts
-
-### 4. Increased Channel Buffer Sizes
-- Changed inbound packet channel buffers from 1 to 100 (lines 522, 685)
-- Matches buffer sizes used in other parts of the system
-- Prevents channel full errors during burst traffic
-
-## Expected Impact
-- Eliminates "channel closed" errors during connection establishment
-- Prevents dropping healthy connections
-- Better handles rapid connection attempts in constrained topologies
-- More robust error handling for transport layer operations
\ No newline at end of file
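Editor's note: the "channel closed" failure mode that runs through the deleted notes — a callback sender dropped on an error path without ever sending a result, so the waiting operation observes a closed channel rather than an explicit failure — can be reproduced in a few lines. This is a hedged sketch using `std::sync::mpsc`; the codebase uses async (tokio-style) channels, which are assumed to behave analogously here.

```rust
use std::sync::mpsc::channel;
use std::thread;

fn main() {
    // The waiting operation holds the receiving end of a result callback.
    let (callback, waiter) = channel::<Result<(), String>>();

    // Simulate the bug: an error path returns early and the callback sender
    // is dropped without ever notifying success or failure.
    thread::spawn(move || {
        let _dropped = callback; // no send() before drop
    })
    .join()
    .unwrap();

    // The waiter cannot distinguish "operation failed" from "operation
    // forgot about me": it only sees a closed channel.
    assert!(waiter.recv().is_err());
    println!("waiter observed: channel closed");
}
```

This is why the callback-cleanup PR insists that every error path send an explicit failure before the sender can be dropped.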
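Editor's note: the `try_send` handling described in transport_fix.md maps onto standard bounded-channel semantics. A minimal sketch with `std::sync::mpsc::sync_channel` (the transport code uses tokio's `mpsc`, assumed analogous; capacity 1 mirrors the original inbound packet channels before the buffer-size increase):

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

fn main() {
    // Bounded channel with capacity 1, like the original packet channels.
    let (tx, rx) = sync_channel::<u32>(1);
    tx.try_send(1).expect("buffer has room");

    // Buffer full: try_send fails with Full instead of blocking. The fix
    // treats this as transient -- log it and keep the connection.
    assert!(matches!(tx.try_send(2), Err(TrySendError::Full(2))));

    // Receiver dropped: the "channel closed" case. The fix treats this as
    // permanent -- remove the dead connection instead of panicking.
    drop(rx);
    assert!(matches!(tx.try_send(3), Err(TrySendError::Disconnected(3))));

    println!("Full and Disconnected both handled without panicking");
}
```

The distinction between the two error variants is the whole point of the change from `send()` to `try_send()`: a full buffer and a dead peer call for opposite recovery actions.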