
Conversation

@acooks (Owner) commented Dec 28, 2025

No description provided.

acooks and others added 22 commits December 28, 2025 10:35
tcp_window_get_info was returning window info for the wrong direction.
When querying flow A→B (data from A to B), we want to know B's
advertised receive window, which determines how much A can send.

The TCP window field in packets from B→A tells us B's receive buffer
space. So for flow A→B, we should return the REVERSE direction's
window info (B→A packets), not the same direction.

Changed tcp_window_get_info to return:
  is_forward ? &entry->rev : &entry->fwd
instead of:
  is_forward ? &entry->fwd : &entry->rev

This fixes zero-window event markers not appearing in the TCP Window
chart when a receiver advertises window=0.

Updated unit tests to match the corrected semantics.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Events (zero-window, dup-ack, retransmit, ECE, CWR) were being queried
with timestamp lookback at publication time, causing markers to appear
at wrong positions or not at all. Now events are accumulated per-interval
during packet processing, matching how bytes/packets work.

Changes:
- tcp_window_process_packet() returns detected events as uint64_t bitmask
- propagate_events_to_intervals() ORs events into all interval tables
- Edge-triggered zero-window detection with a 5% hysteresis threshold (sketched below)
- Frontend simplified to use backend-provided per-interval events
- Screenshot controller updated to capture RTT/Window chart containers
  with titles and legends
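
Illustrative sketch of the edge-triggered detection with hysteresis; the
state struct, event bit and function name here are assumptions, not the
actual identifiers:

  #include <stdbool.h>
  #include <stdint.h>

  #define EV_ZERO_WINDOW (1ULL << 0)        /* illustrative event bit */

  struct win_state {
          bool in_zero_window;              /* current detector state */
          uint32_t max_window_seen;         /* basis for the 5% threshold */
  };

  /* Returns an event bitmask. The zero-window bit is set only on the
   * transition into the condition (edge-triggered); the state is not
   * cleared until the window recovers above 5% of the largest window
   * seen (hysteresis), so a flapping window fires once per episode. */
  static uint64_t detect_zero_window(struct win_state *s, uint32_t scaled_window)
  {
          uint64_t events = 0;

          if (s->max_window_seen < scaled_window)
                  s->max_window_seen = scaled_window;

          if (!s->in_zero_window && scaled_window == 0) {
                  s->in_zero_window = true;
                  events |= EV_ZERO_WINDOW;
          } else if (s->in_zero_window &&
                     scaled_window > s->max_window_seen / 20) {
                  s->in_zero_window = false;   /* recovered past the 5% band */
          }
          return events;
  }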

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace per-packet hash lookups with per-interval accumulation to fix
performance degradation that was breaking real-time throughput/toptalk
charts at high traffic rates.

Problem:
The previous approach called tcp_window_get_sender_value() and
propagate_window_to_intervals() for every TCP packet, adding millions
of extra hash lookups per second at 10 Gbps. This broke real-time
deadlines and caused gaps in charts.

Solution:
Track window statistics the same way bytes/packets are tracked:
1. Accumulate cheaply per-packet by piggybacking on existing hash lookups
2. Aggregate at interval boundaries when tables rotate
3. No extra hash lookups - zero additional per-packet overhead

Changes:
- flow.h: Add window accumulation fields (window_sum, window_samples,
  window_min, window_max) and condition flags to flow_window_info.
  Add tcp_scaled_window to flow_pkt for passing window through pipeline.
- tcp_window.h/c: Add scaled_window_out parameter to tcp_window_process_packet()
  to return computed scaled window without extra lookups.
- intervals.c: Accumulate window stats in add_flow_to_interval(),
  compute conditions (low, zero, starving, recovered) at interval boundary,
  expose stats in fill_short_int_flows().
- test_tcp_window.c: Update tests for new function signature.

Performance: eliminates the ~8M extra hash lookups/sec incurred at
10 Gbps; window tracking now adds zero per-packet overhead.
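
Illustrative sketch of the accumulate-then-aggregate scheme, using the
flow_window_info fields named above; the condition-flag names and
LOW_WINDOW_THRESHOLD are assumptions:

  /* Per-packet: piggyback on the hash lookup that already happened. */
  static inline void window_accumulate(struct flow_window_info *w,
                                       uint32_t scaled_window)
  {
          if (w->window_samples == 0 || scaled_window < w->window_min)
                  w->window_min = scaled_window;
          if (scaled_window > w->window_max)
                  w->window_max = scaled_window;
          w->window_sum += scaled_window;
          w->window_samples++;
  }

  /* Interval boundary: derive the conditions once, then reset. */
  static void window_finalize_interval(struct flow_window_info *w)
  {
          if (w->window_samples) {
                  uint32_t avg = w->window_sum / w->window_samples;
                  w->zero = (w->window_min == 0);
                  w->low  = (avg < LOW_WINDOW_THRESHOLD);
                  /* starving/recovered would be derived similarly */
          }
          w->window_sum = 0;
          w->window_samples = 0;
          w->window_min = w->window_max = 0;
  }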

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
TCP window/congestion events (zero-window, dup-ack, retransmit, ECE, CWR)
detected in packets FROM host A should appear on the flow TO host A,
since that's where host A's advertised window is displayed.

Previously, events were stored on the packet's flow direction, causing
markers to appear on the wrong flow line in the TCP Window chart (e.g.,
zero-window marker at 64KB instead of near 0 bytes).

Also fixes interval 0's recent_events being overwritten with cumulative
events from tcp_window_get_info() instead of per-interval events.

Added PHASE comments to clarify the three-phase data population in
fill_short_int_flows() and updated tcp_window.h comments to clarify
that recent_events is cumulative (never cleared).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extract the flow reversal logic from intervals.c into a reusable
static inline function in flow.h. This improves code readability
while maintaining performance (function is inlined by compiler).

The flow_reverse() function creates a reversed copy of a flow by
swapping src/dst addresses and ports. Used for TCP window event
propagation where events detected in packets FROM host A need to
be stored on the flow TO host A.
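
Illustrative sketch of one way such a helper can look; the field names
are simplified to scalars here (the real struct flow also carries IPv6
addresses and the ethertype, which are preserved by the struct copy):

  /* Return a copy of *f with source and destination swapped, so events
   * seen in packets FROM host A can be attributed to the flow TO host A. */
  static inline struct flow flow_reverse(const struct flow *f)
  {
          struct flow r = *f;          /* keep ethertype, protocol, etc. */

          r.src_ip = f->dst_ip;
          r.dst_ip = f->src_ip;
          r.sport  = f->dport;
          r.dport  = f->sport;
          return r;
  }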

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add 5 tests for the flow_reverse() utility function:
- IPv4 flow reversal (addresses and ports swap correctly)
- IPv6 flow reversal (16-byte addresses swap correctly)
- Double reversal returns original flow
- Original flow unchanged (input not modified)
- flow_cmp symmetry with reverse

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add Test 8 to document and verify that TCP window/congestion events
are associated with the receiver's flow direction:
- Events from server->client packets appear on client->server flow
- This is where the server's receive window is displayed in the UI

This test documents the receiver-direction semantics that were
implemented to fix duplicate event markers.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Eliminates 40-line code duplication between tcp_rtt.c and tcp_window.c.
The canonical key function normalizes bidirectional flows so A→B and B→A
map to the same hash table entry (lower IP/port first).

New tcp_flow_key.h provides:
- struct tcp_flow_key: canonical key for bidirectional lookup
- make_canonical_key(): inline function shared by RTT and window tracking
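
Illustrative sketch of the canonical-ordering idea, reduced to IPv4 and
assumed field names (the real header also covers IPv6 and preserves the
ethertype):

  #include <stdbool.h>
  #include <stdint.h>

  struct tcp_flow_key {                 /* illustrative IPv4-only version */
          uint32_t ip_lo, ip_hi;
          uint16_t port_lo, port_hi;
  };

  /* Both directions of a connection hash to the same key: the lower
   * (ip, port) pair always goes first. *is_forward reports whether the
   * packet matched the canonical order or the reverse. */
  static inline void make_canonical_key(struct tcp_flow_key *k,
                                        uint32_t sip, uint16_t sport,
                                        uint32_t dip, uint16_t dport,
                                        bool *is_forward)
  {
          *is_forward = (sip < dip) || (sip == dip && sport <= dport);
          if (*is_forward) {
                  k->ip_lo = sip; k->port_lo = sport;
                  k->ip_hi = dip; k->port_hi = dport;
          } else {
                  k->ip_lo = dip; k->port_lo = dport;
                  k->ip_hi = sip; k->port_hi = sport;
          }
  }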

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tests the canonical flow key generation used for bidirectional TCP
connection tracking. Verifies:
- IPv4/IPv6 canonical ordering (lower IP first)
- Port-based ordering when IPs equal
- is_forward flag correctness
- Round-trip: key from A→B equals key from B→A
- Ethertype preservation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Prevents potential NULL pointer dereferences on memory allocation
failure. Gracefully handles OOM by:
- Skipping entries during table rotation (clear_table)
- Dropping packets when packet list entry allocation fails
- Dropping packets when flow hash entry allocation fails

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extracts the hash table clearing logic into a free_flow_table() helper
function that takes a pointer-to-pointer, making it explicit that the
head pointer is being modified. This eliminates the confusing pattern
where HASH_DELETE modified a local variable while a global was set
to NULL separately.
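
Illustrative sketch of the pointer-to-pointer pattern with the usual
uthash macros (the entry type name is assumed):

  #include <stdlib.h>
  #include "uthash.h"

  /* Delete every entry and leave *head == NULL, making it explicit that
   * the caller's head pointer is modified. */
  static void free_flow_table(struct flow_table_entry **head)
  {
          struct flow_table_entry *e, *tmp;

          HASH_ITER(hh, *head, e, tmp) {
                  HASH_DEL(*head, e);
                  free(e);
          }
          /* HASH_DEL sets *head to NULL once the last entry is removed. */
  }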

Also improves documentation:
- Explains the 3-step rotation process (free old, copy, free source)
- Documents why we copy entries rather than moving pointers
- Removes obsolete TODO comment

No behavior change, just clearer code structure.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add l4_offset field to struct flow_pkt that stores the offset to the
TCP/UDP header computed during initial packet decode. This eliminates
the need to call find_tcp_header() for TCP packets, avoiding redundant
header parsing.

Changes:
- Add l4_offset and has_l4_offset fields to struct flow_pkt in flow.h
- Create internal decode_ip4/ip6 versions that track packet start
- Store L4 offset during TCP/UDP decode
- Handle VLAN offset adjustment for recursive decoding
- Use stored offset in handle_packet() with fallback to find_tcp_header

The optimization reduces per-packet processing overhead by eliminating
duplicate header traversal for TCP packets.
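
Illustrative sketch of the consumer side in handle_packet(); the helper
name and exact signatures are assumptions:

  #include <netinet/tcp.h>

  /* Prefer the offset recorded at decode time; fall back to re-parsing
   * the headers when it is absent (e.g. unexpected encapsulation). */
  static const struct tcphdr *get_tcp_header(const struct flow_pkt *pkt,
                                             const uint8_t *frame, size_t len)
  {
          if (pkt->has_l4_offset)
                  return (const struct tcphdr *)(frame + pkt->l4_offset);
          return find_tcp_header(frame, len);   /* existing slow path */
  }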

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace the malloc/free per packet in the sliding window packet list
with a fixed-size ring buffer. This eliminates allocation overhead on
the hot path.

Changes:
- Add PKT_RING_SIZE constant (default 262144, configurable at compile time)
- Static ring buffer with head/tail indices
- Replace DL_APPEND/DL_DELETE with simple index manipulation
- Compile-time assertion that ring size is power of 2
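
Illustrative sketch of the index arithmetic (variable and helper names
are assumptions; the entry type at this point in the series is the full
struct flow_pkt):

  #define PKT_RING_SIZE 262144                  /* must be a power of 2 */
  _Static_assert((PKT_RING_SIZE & (PKT_RING_SIZE - 1)) == 0,
                 "PKT_RING_SIZE must be a power of 2");

  static struct flow_pkt pkt_ring[PKT_RING_SIZE];
  static uint64_t ring_head, ring_tail;  /* head: next write, tail: oldest */

  static inline struct flow_pkt *ring_push(void)
  {
          if (ring_head - ring_tail == PKT_RING_SIZE)
                  ring_tail++;                  /* overwrite the oldest entry */
          return &pkt_ring[ring_head++ & (PKT_RING_SIZE - 1)];
  }

  static inline struct flow_pkt *ring_oldest(void)
  {
          return (ring_head == ring_tail) ? NULL
               : &pkt_ring[ring_tail & (PKT_RING_SIZE - 1)];
  }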

The ring buffer supports ~87K packets/second with the default 3-second
window. At higher rates, oldest packets will be overwritten (graceful
degradation vs. allocation failure).

Benchmark showed 72% reduction in allocation overhead:
  malloc/free: 45 cycles
  ring buffer: 13 cycles

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace copy-based interval table rotation with O(1) pointer swap.
Instead of allocating new entries and copying data from incomplete
to complete table, simply reassign the pointer.

Previous approach: O(n) - malloc + memcpy for each flow entry
New approach: O(1) - single pointer assignment

The only remaining O(n) work is freeing the old complete table,
which is unavoidable.
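
Illustrative sketch of the swap, with assumed table pointer names:

  /* Before: allocate and copy every entry from incomplete to complete, O(n).
   * After: the finished table simply becomes the complete table, O(1);
   * only freeing the previous complete table remains O(n). */
  static void rotate_interval_table(void)
  {
          free_flow_table(&complete_table);      /* O(n), unavoidable */
          complete_table   = incomplete_table;   /* O(1) pointer swap */
          incomplete_table = NULL;               /* start a fresh table */
  }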

Benchmark showed 37% reduction in rotation overhead:
  copy rotation: 28,554 cycles
  swap rotation: 18,060 cycles

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace HASH_SRT (O(n log n) full sort) with a simple top-N selection
algorithm (O(n)) for finding the highest-traffic flows. Since we only
need MAX_FLOW_COUNT results, sorting all flows is wasteful.

The new find_top_n_flows() iterates through all flows once, maintaining
a small sorted array of the top N flows seen so far. Inserting into that
array is O(N) per candidate in the worst case, but N is small
(MAX_FLOW_COUNT is typically 5-20), so this is faster in practice.
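
Illustrative sketch of the selection loop, with the comparison reduced to
byte counts (entry field names are assumptions; a deterministic
tie-breaker was added in a later commit):

  /* Keep the N largest flows seen so far in top[], sorted descending. */
  static int find_top_n_flows(struct flow_entry *table,
                              struct flow_entry *top[], int n)
  {
          int count = 0;

          for (struct flow_entry *e = table; e; e = e->hh.next) {
                  /* find insertion position among the current leaders */
                  int pos = count;
                  while (pos > 0 && e->bytes > top[pos - 1]->bytes)
                          pos--;
                  if (pos >= n)
                          continue;              /* not in the top N */
                  if (count < n)
                          count++;
                  /* shift smaller entries down and insert */
                  for (int i = count - 1; i > pos; i--)
                          top[i] = top[i - 1];
                  top[pos] = e;
          }
          return count;
  }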

Benchmark showed 86% reduction in sorting overhead:
  HASH_SRT:     10,460 cycles per sort
  Top-N select:  1,480 cycles per sort

At 1000 sorts/sec (1ms tick rate):
  HASH_SRT:     6.5ms/sec overhead
  Top-N select: 0.8ms/sec overhead

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When the ring buffer is full and we're about to add a new packet,
we must first expire the oldest entry. Without this, the old entry's
bytes/packets are never subtracted from the flow table, causing
accounting corruption.

With few unique flows and millions of packets (ring wraps many times),
the accumulated error eventually exceeds actual counts, causing
negative packet counts and assertion failures.

Bug: assertion failure "fte->f.packets >= 0" in delete_pkt_from_ref_table
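
Illustrative sketch of the fix in the push path; names follow the ring
sketch earlier in this series, and the delete_pkt_from_ref_table()
signature shown here is assumed:

  static inline struct flow_pkt *ring_push(void)
  {
          if (ring_head - ring_tail == PKT_RING_SIZE) {
                  /* The slot about to be reused still holds a live packet:
                   * remove its bytes/packets from the flow table first, or
                   * the counts drift negative after many wraps. */
                  delete_pkt_from_ref_table(
                          &pkt_ring[ring_tail & (PKT_RING_SIZE - 1)]);
                  ring_tail++;
          }
          return &pkt_ring[ring_head++ & (PKT_RING_SIZE - 1)];
  }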

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace full flow_pkt storage with compact pkt_ring_entry containing
only the fields needed for expiration: flow key, bytes, and timestamp.

Key changes:
- Add pkt_ring_entry struct (68 bytes vs 736 bytes for flow_pkt)
- Use struct timeval for timestamps to avoid time domain mismatch
  between packet timestamps (Unix epoch) and deadline (CLOCK_MONOTONIC)
- Copy flow keys in find_top_n_flows() instead of storing pointers,
  ensuring keys remain valid after expire_old_packets() frees entries
- Add return value to fill_short_int_flows() to handle expired flows

Memory reduction: ~10x less memory per ring buffer entry.
Ring buffer at 262K entries: 192MB -> 18MB.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The stats path was incorrectly passing the CLOCK_MONOTONIC deadline to
expire_old_packets(), which compares against packet timestamps (Unix
epoch). These time domains are incompatible - monotonic time is uptime
since boot (~days) while epoch time is seconds since 1970 (~1.7 billion).

The comparison "worked" by accident because epoch >> monotonic, so
packets were never considered aged in the stats path. Expiration only
happened correctly in the packet receive path.

Fix: Track last_pkt_time and use it for expiration in the stats path.
This ensures all expiration comparisons stay within the packet time
domain.

Document the two time domains:
- PACKET TIME: Unix epoch from pcap, used for sliding window expiration
- MONOTONIC TIME: Uptime, used for 1ms tick scheduling and interval tables
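
Illustrative sketch of keeping expiration inside the packet time domain
(the timestamp field name and the expire_old_packets() signature shown
here are assumptions):

  static struct timeval last_pkt_time;    /* PACKET TIME domain only */

  static void on_packet(const struct flow_pkt *pkt)
  {
          last_pkt_time = pkt->timestamp;
          expire_old_packets(&last_pkt_time);   /* packet vs packet: OK */
  }

  static void on_stats_tick(void)
  {
          /* Do NOT pass the CLOCK_MONOTONIC deadline here; compare packet
           * timestamps only against other packet timestamps. */
          expire_old_packets(&last_pkt_time);
  }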

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Change bytes field from int64_t (8 bytes) to uint32_t (4 bytes).
The uint32_t fits in the padding gap after struct flow, eliminating
wasted space and aligning the struct to exactly 64 bytes.

Layout:
- struct flow: 44 bytes (hash key)
- uint32_t bytes: 4 bytes (fits in padding gap)
- struct timeval: 16 bytes (8-byte aligned at offset 48)
- Total: 64 bytes

Max packet size of 4GB is more than sufficient (MTU ~1500, jumbo ~9KB).

Ring buffer memory at 262K entries: 16.8 MB (was 18.9 MB with 72 bytes).
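
Illustrative sketch of the resulting entry, with a compile-time check
that could pin the 64-byte target (field names are assumptions; the
16-byte struct timeval assumes a 64-bit platform):

  struct pkt_ring_entry {
          struct flow    flow;         /* 44-byte hash key                */
          uint32_t       bytes;        /* fits in the 4-byte padding gap  */
          struct timeval timestamp;    /* 16 bytes, 8-byte aligned at 48  */
  };                                   /* total: 64 bytes, one cache line */

  _Static_assert(sizeof(struct pkt_ring_entry) == 64,
                 "pkt_ring_entry must stay cache-line sized");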

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Two major performance optimizations for high-rate packet capture:

1. Deferred interval table updates:
   - Eliminate INTERVAL_COUNT (8) hash lookups per packet
   - Accumulate deltas in ref table entry, commit at interval boundaries
   - ~800 cycles/packet saved in hot path

2. Runtime-configurable ring buffer with hugepage support:
   - tt_set_ring_size() / tt_get_ring_size() API for runtime config
   - Hugepage allocation for buffers >= 2MB (better TLB efficiency)
   - Supports up to 32M entries (~2GB) for 10M pps with 3s window

The deferred commit approach trades memory (interval_delta array per flow)
for CPU cycles (single hash lookup instead of 9 per packet). Events are
still tracked accurately and committed at interval boundaries.
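
Illustrative sketch of the deferred commit; INTERVAL_COUNT, the
interval_delta array and commit_interval_deltas() come from the commit
above, while the ref-table names and add_delta_to_interval() are
assumptions:

  /* Hot path: the single ref-table lookup already happened; record the
   * delta per interval slot instead of doing INTERVAL_COUNT hash lookups. */
  static inline void account_packet(struct flow_ref_entry *ref, uint32_t bytes)
  {
          for (int i = 0; i < INTERVAL_COUNT; i++) {
                  ref->interval_delta[i].bytes   += bytes;
                  ref->interval_delta[i].packets += 1;
          }
  }

  /* Boundary of interval i: walk the ref table once, fold the accumulated
   * deltas for that interval into its table, then clear them. */
  static void commit_interval_deltas(int i)
  {
          struct flow_ref_entry *ref, *tmp;

          HASH_ITER(hh, flow_ref_table, ref, tmp) {
                  add_delta_to_interval(i, &ref->f, &ref->interval_delta[i]);
                  ref->interval_delta[i].bytes   = 0;
                  ref->interval_delta[i].packets = 0;
          }
  }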

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 2: Pre-allocated flow pool
- Add struct flow_pool with 64K entries for ref table, 128K for intervals
- O(1) alloc/free via free list stack
- Falls back to malloc() if pool exhausted
- Eliminates malloc stalls on hot path

Phase 4: CRC32C hardware-accelerated hash
- Use SSE4.2 _mm_crc32_u64/u32/u8 intrinsics when available
- Process 8 bytes at a time for maximum throughput
- Override uthash default Jenkins hash (~50 -> ~20 cycles)
- Add -msse4.2 to CFLAGS
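
Illustrative sketch of a CRC32C-based replacement for uthash's default
hash, assuming SSE4.2 is available (uthash allows overriding the hash via
HASH_FUNCTION defined before including uthash.h):

  #include <nmmintrin.h>   /* SSE4.2 CRC32 intrinsics; build with -msse4.2 */
  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  static inline unsigned crc32c_hash(const void *key, size_t len)
  {
          const unsigned char *p = key;
          uint64_t crc = ~0ULL;

          while (len >= 8) {              /* 8 bytes per instruction */
                  uint64_t v;
                  memcpy(&v, p, 8);
                  crc = _mm_crc32_u64(crc, v);
                  p += 8; len -= 8;
          }
          while (len--)                   /* tail bytes */
                  crc = _mm_crc32_u8((unsigned)crc, *p++);
          return (unsigned)crc;
  }

  /* Must appear before #include "uthash.h" in the translation unit. */
  #define HASH_FUNCTION(keyptr, keylen, hashv) \
          ((hashv) = crc32c_hash(keyptr, keylen))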

Phase 5: Prefetch hints
- Prefetch flow key before hash lookup in update_stats_tables()
- Prefetch next ring entry during expiration loop
- Prefetch next hash entry during commit_interval_deltas()

Benchmark results (vs baseline):
- decode: 21.5 -> 20.1 cycles (7% faster)
- ring buffer: 58.4 -> 30.7 cycles (47% faster)
- top-N: 4970 -> 1875 cycles (62% faster)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The top-N selection algorithm introduced in 909882a lacked a stable
tie-breaker for flows with equal byte counts. This caused flow ordering
to depend on non-deterministic hash table iteration order, resulting in
UI jitter where flows with similar traffic rates would swap positions
between updates.

Backend (intervals.c):
- Add top_flow_cmp() function that compares by bytes first, then uses
  memcmp on flow key as deterministic tie-breaker
- Update find_top_n_flows() to use the new comparator

Frontend (jittertrap-core.js):
- Add localeCompare() tie-breaker to flowRank sorting
- Ensures consistent ordering even when totals.tbytes are equal

This ensures deterministic flow ordering regardless of hash table
iteration order, preventing position/color instability in the UI.
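
Illustrative sketch of such a comparator (entry field names are
assumptions; the flow key is compared bytewise purely to obtain a stable,
arbitrary order):

  #include <string.h>

  /* Sort by byte count descending; on ties, fall back to a bytewise
   * compare of the flow key so equal-traffic flows always keep the same
   * relative order, independent of hash-table iteration order. */
  static int top_flow_cmp(const struct flow_entry *a, const struct flow_entry *b)
  {
          if (a->bytes != b->bytes)
                  return (a->bytes > b->bytes) ? -1 : 1;
          return memcmp(&a->flow, &b->flow, sizeof(a->flow));
  }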

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@acooks merged commit d1b94a8 into master on Dec 28, 2025
3 checks passed