
Conversation

@acooks (Owner) commented Dec 28, 2025

No description provided.

acooks and others added 22 commits December 28, 2025 10:35
tcp_window_get_info was returning window info for the wrong direction.
When querying flow A→B (data from A to B), we want to know B's
advertised receive window, which determines how much A can send.

The TCP window field in packets from B→A tells us B's receive buffer
space. So for flow A→B, we should return the REVERSE direction's
window info (B→A packets), not the same direction.

Changed tcp_window_get_info to return:
  is_forward ? &entry->rev : &entry->fwd
instead of:
  is_forward ? &entry->fwd : &entry->rev

This fixes zero-window event markers not appearing in the TCP Window
chart when a receiver advertises window=0.

Updated unit tests to match the corrected semantics.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Events (zero-window, dup-ack, retransmit, ECE, CWR) were being queried
with timestamp lookback at publication time, causing markers to appear
at wrong positions or not at all. Now events are accumulated per-interval
during packet processing, matching how bytes/packets work.

Changes:
- tcp_window_process_packet() returns detected events as uint64_t bitmask
- propagate_events_to_intervals() ORs events into all interval tables
- Edge-triggered zero-window detection with a 5% hysteresis threshold (sketched below)
- Frontend simplified to use backend-provided per-interval events
- Screenshot controller updated to capture RTT/Window chart containers
  with titles and legends
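
Illustrative sketch of the edge-triggered detection with hysteresis; the
state struct, event bit and function name here are assumptions, not the
actual identifiers:

  #include <stdbool.h>
  #include <stdint.h>

  #define EV_ZERO_WINDOW (1ULL << 0)        /* illustrative event bit */

  struct win_state {
          bool in_zero_window;              /* current detector state */
          uint32_t max_window_seen;         /* basis for the 5% threshold */
  };

  /* Returns an event bitmask. The zero-window bit is set only on the
   * transition into the condition (edge-triggered); the state is not
   * cleared until the window recovers above 5% of the largest window
   * seen (hysteresis), so a flapping window fires once per episode. */
  static uint64_t detect_zero_window(struct win_state *s, uint32_t scaled_window)
  {
          uint64_t events = 0;

          if (s->max_window_seen < scaled_window)
                  s->max_window_seen = scaled_window;

          if (!s->in_zero_window && scaled_window == 0) {
                  s->in_zero_window = true;
                  events |= EV_ZERO_WINDOW;
          } else if (s->in_zero_window &&
                     scaled_window > s->max_window_seen / 20) {
                  s->in_zero_window = false;   /* recovered past the 5% band */
          }
          return events;
  }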

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace per-packet hash lookups with per-interval accumulation to fix
performance degradation that was breaking real-time throughput/toptalk
charts at high traffic rates.

Problem:
The previous approach called tcp_window_get_sender_value() and
propagate_window_to_intervals() for every TCP packet, adding millions
of extra hash lookups per second at 10 Gbps. This broke real-time
deadlines and caused gaps in charts.

Solution:
Track window statistics the same way bytes/packets are tracked:
1. Accumulate cheaply per-packet by piggybacking on existing hash lookups
2. Aggregate at interval boundaries when tables rotate
3. No extra hash lookups - zero additional per-packet overhead

Changes:
- flow.h: Add window accumulation fields (window_sum, window_samples,
  window_min, window_max) and condition flags to flow_window_info.
  Add tcp_scaled_window to flow_pkt for passing window through pipeline.
- tcp_window.h/c: Add scaled_window_out parameter to tcp_window_process_packet()
  to return computed scaled window without extra lookups.
- intervals.c: Accumulate window stats in add_flow_to_interval(),
  compute conditions (low, zero, starving, recovered) at interval boundary,
  expose stats in fill_short_int_flows().
- test_tcp_window.c: Update tests for new function signature.

Performance: eliminates the ~8M extra hash lookups/sec incurred at
10 Gbps; window tracking now adds zero per-packet overhead.
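
Illustrative sketch of the accumulate-then-aggregate scheme, using the
flow_window_info fields named above; the condition-flag names and
LOW_WINDOW_THRESHOLD are assumptions:

  /* Per-packet: piggyback on the hash lookup that already happened. */
  static inline void window_accumulate(struct flow_window_info *w,
                                       uint32_t scaled_window)
  {
          if (w->window_samples == 0 || scaled_window < w->window_min)
                  w->window_min = scaled_window;
          if (scaled_window > w->window_max)
                  w->window_max = scaled_window;
          w->window_sum += scaled_window;
          w->window_samples++;
  }

  /* Interval boundary: derive the conditions once, then reset. */
  static void window_finalize_interval(struct flow_window_info *w)
  {
          if (w->window_samples) {
                  uint32_t avg = w->window_sum / w->window_samples;
                  w->zero = (w->window_min == 0);
                  w->low  = (avg < LOW_WINDOW_THRESHOLD);
                  /* starving/recovered would be derived similarly */
          }
          w->window_sum = 0;
          w->window_samples = 0;
          w->window_min = w->window_max = 0;
  }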

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
TCP window/congestion events (zero-window, dup-ack, retransmit, ECE, CWR)
detected in packets FROM host A should appear on the flow TO host A,
since that's where host A's advertised window is displayed.

Previously, events were stored on the packet's flow direction, causing
markers to appear on the wrong flow line in the TCP Window chart (e.g.,
zero-window marker at 64KB instead of near 0 bytes).

Also fixes interval 0's recent_events being overwritten with cumulative
events from tcp_window_get_info() instead of per-interval events.

Added PHASE comments to clarify the three-phase data population in
fill_short_int_flows() and updated tcp_window.h comments to clarify
that recent_events is cumulative (never cleared).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extract the flow reversal logic from intervals.c into a reusable
static inline function in flow.h. This improves code readability
while maintaining performance (function is inlined by compiler).

The flow_reverse() function creates a reversed copy of a flow by
swapping src/dst addresses and ports. Used for TCP window event
propagation where events detected in packets FROM host A need to
be stored on the flow TO host A.
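
Illustrative sketch of one way such a helper can look; the field names
are simplified to scalars here (the real struct flow also carries IPv6
addresses and the ethertype, which are preserved by the struct copy):

  /* Return a copy of *f with source and destination swapped, so events
   * seen in packets FROM host A can be attributed to the flow TO host A. */
  static inline struct flow flow_reverse(const struct flow *f)
  {
          struct flow r = *f;          /* keep ethertype, protocol, etc. */

          r.src_ip = f->dst_ip;
          r.dst_ip = f->src_ip;
          r.sport  = f->dport;
          r.dport  = f->sport;
          return r;
  }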

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add 5 tests for the flow_reverse() utility function:
- IPv4 flow reversal (addresses and ports swap correctly)
- IPv6 flow reversal (16-byte addresses swap correctly)
- Double reversal returns original flow
- Original flow unchanged (input not modified)
- flow_cmp symmetry with reverse

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add Test 8 to document and verify that TCP window/congestion events
are associated with the receiver's flow direction:
- Events from server->client packets appear on client->server flow
- This is where the server's receive window is displayed in the UI

This test documents the receiver-direction semantics that were
implemented to fix duplicate event markers.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Eliminates 40-line code duplication between tcp_rtt.c and tcp_window.c.
The canonical key function normalizes bidirectional flows so A→B and B→A
map to the same hash table entry (lower IP/port first).

New tcp_flow_key.h provides:
- struct tcp_flow_key: canonical key for bidirectional lookup
- make_canonical_key(): inline function shared by RTT and window tracking
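
Illustrative sketch of the canonical-ordering idea, reduced to IPv4 and
assumed field names (the real header also covers IPv6 and preserves the
ethertype):

  #include <stdbool.h>
  #include <stdint.h>

  struct tcp_flow_key {                 /* illustrative IPv4-only version */
          uint32_t ip_lo, ip_hi;
          uint16_t port_lo, port_hi;
  };

  /* Both directions of a connection hash to the same key: the lower
   * (ip, port) pair always goes first. *is_forward reports whether the
   * packet matched the canonical order or the reverse. */
  static inline void make_canonical_key(struct tcp_flow_key *k,
                                        uint32_t sip, uint16_t sport,
                                        uint32_t dip, uint16_t dport,
                                        bool *is_forward)
  {
          *is_forward = (sip < dip) || (sip == dip && sport <= dport);
          if (*is_forward) {
                  k->ip_lo = sip; k->port_lo = sport;
                  k->ip_hi = dip; k->port_hi = dport;
          } else {
                  k->ip_lo = dip; k->port_lo = dport;
                  k->ip_hi = sip; k->port_hi = sport;
          }
  }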

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tests the canonical flow key generation used for bidirectional TCP
connection tracking. Verifies:
- IPv4/IPv6 canonical ordering (lower IP first)
- Port-based ordering when IPs equal
- is_forward flag correctness
- Round-trip: key from A→B equals key from B→A
- Ethertype preservation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Prevents potential NULL pointer dereferences on memory allocation
failure. Gracefully handles OOM by:
- Skipping entries during table rotation (clear_table)
- Dropping packets when packet list entry allocation fails
- Dropping packets when flow hash entry allocation fails

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extracts the hash table clearing logic into a free_flow_table() helper
function that takes a pointer-to-pointer, making it explicit that the
head pointer is being modified. This eliminates the confusing pattern
where HASH_DELETE modified a local variable while a global was set
to NULL separately.
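
Illustrative sketch of the pointer-to-pointer pattern with the usual
uthash macros (the entry type name is assumed):

  #include <stdlib.h>
  #include "uthash.h"

  /* Delete every entry and leave *head == NULL, making it explicit that
   * the caller's head pointer is modified. */
  static void free_flow_table(struct flow_table_entry **head)
  {
          struct flow_table_entry *e, *tmp;

          HASH_ITER(hh, *head, e, tmp) {
                  HASH_DEL(*head, e);
                  free(e);
          }
          /* HASH_DEL sets *head to NULL once the last entry is removed. */
  }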

Also improves documentation:
- Explains the 3-step rotation process (free old, copy, free source)
- Documents why we copy entries rather than moving pointers
- Removes obsolete TODO comment

No behavior change, just clearer code structure.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add l4_offset field to struct flow_pkt that stores the offset to the
TCP/UDP header computed during initial packet decode. This eliminates
the need to call find_tcp_header() for TCP packets, avoiding redundant
header parsing.

Changes:
- Add l4_offset and has_l4_offset fields to struct flow_pkt in flow.h
- Create internal decode_ip4/ip6 versions that track packet start
- Store L4 offset during TCP/UDP decode
- Handle VLAN offset adjustment for recursive decoding
- Use stored offset in handle_packet() with fallback to find_tcp_header

The optimization reduces per-packet processing overhead by eliminating
duplicate header traversal for TCP packets.
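
Illustrative sketch of the consumer side in handle_packet(); the helper
name and exact signatures are assumptions:

  #include <netinet/tcp.h>

  /* Prefer the offset recorded at decode time; fall back to re-parsing
   * the headers when it is absent (e.g. unexpected encapsulation). */
  static const struct tcphdr *get_tcp_header(const struct flow_pkt *pkt,
                                             const uint8_t *frame, size_t len)
  {
          if (pkt->has_l4_offset)
                  return (const struct tcphdr *)(frame + pkt->l4_offset);
          return find_tcp_header(frame, len);   /* existing slow path */
  }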

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace the malloc/free per packet in the sliding window packet list
with a fixed-size ring buffer. This eliminates allocation overhead on
the hot path.

Changes:
- Add PKT_RING_SIZE constant (default 262144, configurable at compile time)
- Static ring buffer with head/tail indices
- Replace DL_APPEND/DL_DELETE with simple index manipulation
- Compile-time assertion that ring size is power of 2
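
Illustrative sketch of the index arithmetic (variable and helper names
are assumptions; the entry type at this point in the series is the full
struct flow_pkt):

  #define PKT_RING_SIZE 262144                  /* must be a power of 2 */
  _Static_assert((PKT_RING_SIZE & (PKT_RING_SIZE - 1)) == 0,
                 "PKT_RING_SIZE must be a power of 2");

  static struct flow_pkt pkt_ring[PKT_RING_SIZE];
  static uint64_t ring_head, ring_tail;  /* head: next write, tail: oldest */

  static inline struct flow_pkt *ring_push(void)
  {
          if (ring_head - ring_tail == PKT_RING_SIZE)
                  ring_tail++;                  /* overwrite the oldest entry */
          return &pkt_ring[ring_head++ & (PKT_RING_SIZE - 1)];
  }

  static inline struct flow_pkt *ring_oldest(void)
  {
          return (ring_head == ring_tail) ? NULL
               : &pkt_ring[ring_tail & (PKT_RING_SIZE - 1)];
  }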

The ring buffer supports ~87K packets/second with the default 3-second
window. At higher rates, oldest packets will be overwritten (graceful
degradation vs. allocation failure).

Benchmark showed 72% reduction in allocation overhead:
  malloc/free: 45 cycles
  ring buffer: 13 cycles

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace copy-based interval table rotation with O(1) pointer swap.
Instead of allocating new entries and copying data from incomplete
to complete table, simply reassign the pointer.

Previous approach: O(n) - malloc + memcpy for each flow entry
New approach: O(1) - single pointer assignment

The only remaining O(n) work is freeing the old complete table,
which is unavoidable.
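
Illustrative sketch of the swap, with assumed table pointer names:

  /* Before: allocate and copy every entry from incomplete to complete, O(n).
   * After: the finished table simply becomes the complete table, O(1);
   * only freeing the previous complete table remains O(n). */
  static void rotate_interval_table(void)
  {
          free_flow_table(&complete_table);      /* O(n), unavoidable */
          complete_table   = incomplete_table;   /* O(1) pointer swap */
          incomplete_table = NULL;               /* start a fresh table */
  }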

Benchmark showed 37% reduction in rotation overhead:
  copy rotation: 28,554 cycles
  swap rotation: 18,060 cycles

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace HASH_SRT (O(n log n) full sort) with a simple top-N selection
algorithm (O(n)) for finding the highest-traffic flows. Since we only
need MAX_FLOW_COUNT results, sorting all flows is wasteful.

The new find_top_n_flows() iterates through all flows once, maintaining
a small sorted array of the top N flows seen so far. Inserting into that
array is O(N) per candidate in the worst case, but N is small
(MAX_FLOW_COUNT is typically 5-20), so this is faster in practice.
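
Illustrative sketch of the selection loop, with the comparison reduced to
byte counts (entry field names are assumptions; a deterministic
tie-breaker was added in a later commit):

  /* Keep the N largest flows seen so far in top[], sorted descending. */
  static int find_top_n_flows(struct flow_entry *table,
                              struct flow_entry *top[], int n)
  {
          int count = 0;

          for (struct flow_entry *e = table; e; e = e->hh.next) {
                  /* find insertion position among the current leaders */
                  int pos = count;
                  while (pos > 0 && e->bytes > top[pos - 1]->bytes)
                          pos--;
                  if (pos >= n)
                          continue;              /* not in the top N */
                  if (count < n)
                          count++;
                  /* shift smaller entries down and insert */
                  for (int i = count - 1; i > pos; i--)
                          top[i] = top[i - 1];
                  top[pos] = e;
          }
          return count;
  }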

Benchmark showed 86% reduction in sorting overhead:
  HASH_SRT:     10,460 cycles per sort
  Top-N select:  1,480 cycles per sort

At 1000 sorts/sec (1ms tick rate):
  HASH_SRT:     6.5ms/sec overhead
  Top-N select: 0.8ms/sec overhead

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When the ring buffer is full and we're about to add a new packet,
we must first expire the oldest entry. Without this, the old entry's
bytes/packets are never subtracted from the flow table, causing
accounting corruption.

With few unique flows and millions of packets (ring wraps many times),
the accumulated error eventually exceeds actual counts, causing
negative packet counts and assertion failures.

Bug: assertion failure "fte->f.packets >= 0" in delete_pkt_from_ref_table
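
Illustrative sketch of the fix in the push path; names follow the ring
sketch earlier in this series, and the delete_pkt_from_ref_table()
signature shown here is assumed:

  static inline struct flow_pkt *ring_push(void)
  {
          if (ring_head - ring_tail == PKT_RING_SIZE) {
                  /* The slot about to be reused still holds a live packet:
                   * remove its bytes/packets from the flow table first, or
                   * the counts drift negative after many wraps. */
                  delete_pkt_from_ref_table(
                          &pkt_ring[ring_tail & (PKT_RING_SIZE - 1)]);
                  ring_tail++;
          }
          return &pkt_ring[ring_head++ & (PKT_RING_SIZE - 1)];
  }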

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace full flow_pkt storage with compact pkt_ring_entry containing
only the fields needed for expiration: flow key, bytes, and timestamp.

Key changes:
- Add pkt_ring_entry struct (68 bytes vs 736 bytes for flow_pkt)
- Use struct timeval for timestamps to avoid time domain mismatch
  between packet timestamps (Unix epoch) and deadline (CLOCK_MONOTONIC)
- Copy flow keys in find_top_n_flows() instead of storing pointers,
  ensuring keys remain valid after expire_old_packets() frees entries
- Add return value to fill_short_int_flows() to handle expired flows

Memory reduction: ~10x less memory per ring buffer entry.
Ring buffer at 262K entries: 192MB -> 18MB.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The stats path was incorrectly passing the CLOCK_MONOTONIC deadline to
expire_old_packets(), which compares against packet timestamps (Unix
epoch). These time domains are incompatible - monotonic time is uptime
since boot (~days) while epoch time is seconds since 1970 (~1.7 billion).

The comparison "worked" by accident because epoch >> monotonic, so
packets were never considered aged in the stats path. Expiration only
happened correctly in the packet receive path.

Fix: Track last_pkt_time and use it for expiration in the stats path.
This ensures all expiration comparisons stay within the packet time
domain.

Document the two time domains:
- PACKET TIME: Unix epoch from pcap, used for sliding window expiration
- MONOTONIC TIME: Uptime, used for 1ms tick scheduling and interval tables
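
Illustrative sketch of keeping expiration inside the packet time domain
(the timestamp field name and the expire_old_packets() signature shown
here are assumptions):

  static struct timeval last_pkt_time;    /* PACKET TIME domain only */

  static void on_packet(const struct flow_pkt *pkt)
  {
          last_pkt_time = pkt->timestamp;
          expire_old_packets(&last_pkt_time);   /* packet vs packet: OK */
  }

  static void on_stats_tick(void)
  {
          /* Do NOT pass the CLOCK_MONOTONIC deadline here; compare packet
           * timestamps only against other packet timestamps. */
          expire_old_packets(&last_pkt_time);
  }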

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Change bytes field from int64_t (8 bytes) to uint32_t (4 bytes).
The uint32_t fits in the padding gap after struct flow, eliminating
wasted space and aligning the struct to exactly 64 bytes.

Layout:
- struct flow: 44 bytes (hash key)
- uint32_t bytes: 4 bytes (fits in padding gap)
- struct timeval: 16 bytes (8-byte aligned at offset 48)
- Total: 64 bytes

Max packet size of 4GB is more than sufficient (MTU ~1500, jumbo ~9KB).

Ring buffer memory at 262K entries: 16.8 MB (was 18.9 MB with 72 bytes).
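
Illustrative sketch of the resulting entry, with a compile-time check
that could pin the 64-byte target (field names are assumptions; the
16-byte struct timeval assumes a 64-bit platform):

  struct pkt_ring_entry {
          struct flow    flow;         /* 44-byte hash key                */
          uint32_t       bytes;        /* fits in the 4-byte padding gap  */
          struct timeval timestamp;    /* 16 bytes, 8-byte aligned at 48  */
  };                                   /* total: 64 bytes, one cache line */

  _Static_assert(sizeof(struct pkt_ring_entry) == 64,
                 "pkt_ring_entry must stay cache-line sized");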

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Two major performance optimizations for high-rate packet capture:

1. Deferred interval table updates:
   - Eliminate INTERVAL_COUNT (8) hash lookups per packet
   - Accumulate deltas in ref table entry, commit at interval boundaries
   - ~800 cycles/packet saved in hot path

2. Runtime-configurable ring buffer with hugepage support:
   - tt_set_ring_size() / tt_get_ring_size() API for runtime config
   - Hugepage allocation for buffers >= 2MB (better TLB efficiency)
   - Supports up to 32M entries (~2GB) for 10M pps with 3s window

The deferred commit approach trades memory (interval_delta array per flow)
for CPU cycles (single hash lookup instead of 9 per packet). Events are
still tracked accurately and committed at interval boundaries.
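
Illustrative sketch of the deferred commit; INTERVAL_COUNT, the
interval_delta array and commit_interval_deltas() come from the commit
above, while the ref-table names and add_delta_to_interval() are
assumptions:

  /* Hot path: the single ref-table lookup already happened; record the
   * delta per interval slot instead of doing INTERVAL_COUNT hash lookups. */
  static inline void account_packet(struct flow_ref_entry *ref, uint32_t bytes)
  {
          for (int i = 0; i < INTERVAL_COUNT; i++) {
                  ref->interval_delta[i].bytes   += bytes;
                  ref->interval_delta[i].packets += 1;
          }
  }

  /* Boundary of interval i: walk the ref table once, fold the accumulated
   * deltas for that interval into its table, then clear them. */
  static void commit_interval_deltas(int i)
  {
          struct flow_ref_entry *ref, *tmp;

          HASH_ITER(hh, flow_ref_table, ref, tmp) {
                  add_delta_to_interval(i, &ref->f, &ref->interval_delta[i]);
                  ref->interval_delta[i].bytes   = 0;
                  ref->interval_delta[i].packets = 0;
          }
  }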

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 2: Pre-allocated flow pool
- Add struct flow_pool with 64K entries for ref table, 128K for intervals
- O(1) alloc/free via free list stack
- Falls back to malloc() if pool exhausted
- Eliminates malloc stalls on hot path

Phase 4: CRC32C hardware-accelerated hash
- Use SSE4.2 _mm_crc32_u64/u32/u8 intrinsics when available
- Process 8 bytes at a time for maximum throughput
- Override uthash default Jenkins hash (~50 -> ~20 cycles)
- Add -msse4.2 to CFLAGS
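
Illustrative sketch of a CRC32C-based replacement for uthash's default
hash, assuming SSE4.2 is available (uthash allows overriding the hash via
HASH_FUNCTION defined before including uthash.h):

  #include <nmmintrin.h>   /* SSE4.2 CRC32 intrinsics; build with -msse4.2 */
  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  static inline unsigned crc32c_hash(const void *key, size_t len)
  {
          const unsigned char *p = key;
          uint64_t crc = ~0ULL;

          while (len >= 8) {              /* 8 bytes per instruction */
                  uint64_t v;
                  memcpy(&v, p, 8);
                  crc = _mm_crc32_u64(crc, v);
                  p += 8; len -= 8;
          }
          while (len--)                   /* tail bytes */
                  crc = _mm_crc32_u8((unsigned)crc, *p++);
          return (unsigned)crc;
  }

  /* Must appear before #include "uthash.h" in the translation unit. */
  #define HASH_FUNCTION(keyptr, keylen, hashv) \
          ((hashv) = crc32c_hash(keyptr, keylen))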

Phase 5: Prefetch hints
- Prefetch flow key before hash lookup in update_stats_tables()
- Prefetch next ring entry during expiration loop
- Prefetch next hash entry during commit_interval_deltas()

Benchmark results (vs baseline):
- decode: 21.5 -> 20.1 cycles (7% faster)
- ring buffer: 58.4 -> 30.7 cycles (47% faster)
- top-N: 4970 -> 1875 cycles (62% faster)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The top-N selection algorithm introduced in 909882a lacked a stable
tie-breaker for flows with equal byte counts. This caused flow ordering
to depend on non-deterministic hash table iteration order, resulting in
UI jitter where flows with similar traffic rates would swap positions
between updates.

Backend (intervals.c):
- Add top_flow_cmp() function that compares by bytes first, then uses
  memcmp on flow key as deterministic tie-breaker
- Update find_top_n_flows() to use the new comparator

Frontend (jittertrap-core.js):
- Add localeCompare() tie-breaker to flowRank sorting
- Ensures consistent ordering even when totals.tbytes are equal

This ensures deterministic flow ordering regardless of hash table
iteration order, preventing position/color instability in the UI.
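
Illustrative sketch of such a comparator (entry field names are
assumptions; the flow key is compared bytewise purely to obtain a stable,
arbitrary order):

  #include <string.h>

  /* Sort by byte count descending; on ties, fall back to a bytewise
   * compare of the flow key so equal-traffic flows always keep the same
   * relative order, independent of hash-table iteration order. */
  static int top_flow_cmp(const struct flow_entry *a, const struct flow_entry *b)
  {
          if (a->bytes != b->bytes)
                  return (a->bytes > b->bytes) ? -1 : 1;
          return memcmp(&a->flow, &b->flow, sizeof(a->flow));
  }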

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@acooks merged commit d1b94a8 into master on Dec 28, 2025
3 checks passed