Add HotHashDetector for per-request hot key detection #642

Open
charles-typ wants to merge 11 commits into facebookresearch:v2-beta from charles-typ:export-D99338674-to-v2-beta

@charles-typ
Contributor

Summary:
Adds a thread-local HotHashDetector that mirrors production's TLHotKeyTracker.
Production maintains two detectors per IO thread (QPS hotness and egress
hotness) and calls bumpHash() on every request and response. Each bumpHash()
call performs an L1 counter increment, a conditional L2 probe, and periodic
maintenance (counter decay, threshold adjustment). This adds ~2-3% CPU
overhead, in line with production ucache.

Differential Revision: D99338674
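
As a rough illustration of the bumpHash() flow described in the summary (L1 counter increment, conditional L2 probe, periodic decay and threshold adjustment), here is a minimal Python sketch. It is not the code in this PR and not production's TLHotKeyTracker: the table sizes, thresholds, maintenance interval, and the adjustment policy are all assumptions made for the example.

```python
class HotHashDetector:
    """Illustrative two-level hot-hash detector (all parameters are hypothetical)."""

    def __init__(self, l1_size=1024, l2_capacity=64, l1_threshold=8,
                 hot_threshold=4, maintenance_interval=4096):
        self.l1 = [0] * l1_size          # cheap, collision-prone counters
        self.l2 = {}                     # exact counts for L1 candidates only
        self.l2_capacity = l2_capacity
        self.l1_threshold = l1_threshold
        self.hot_threshold = hot_threshold
        self.maintenance_interval = maintenance_interval
        self.bumps = 0

    def bump_hash(self, h):
        """Record one occurrence of hash h; return True if h now looks hot."""
        self.bumps += 1
        if self.bumps % self.maintenance_interval == 0:
            self._maintain()
        slot = h % len(self.l1)
        self.l1[slot] += 1               # L1 counter increment
        if self.l1[slot] < self.l1_threshold:
            return False                 # common path: no L2 probe at all
        count = self.l2.get(h, 0)        # conditional L2 probe
        if count == 0 and len(self.l2) >= self.l2_capacity:
            return False                 # L2 is full; do not admit new hashes
        self.l2[h] = count + 1
        return self.l2[h] >= self.hot_threshold

    def _maintain(self):
        """Decay all counters, then adjust the L1 threshold (assumed policy)."""
        self.l1 = [c >> 1 for c in self.l1]
        self.l2 = {k: v >> 1 for k, v in self.l2.items() if v >> 1 > 0}
        if len(self.l2) > self.l2_capacity // 2:
            self.l1_threshold *= 2       # too many candidates: be stricter
        elif not self.l2 and self.l1_threshold > 2:
            self.l1_threshold //= 2      # nothing tracked: be more lenient
```

To follow the per-thread arrangement in the summary, each IO thread would own two independent instances (one for QPS hotness, one for egress hotness) and call `bump_hash` on each request and response; with thread-local instances no locking is needed.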

@meta-cla meta-cla Bot added the CLA Signed label May 29, 2026
@meta-codesync

meta-codesync Bot commented May 29, 2026

@charles-typ has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99338674.

)

Summary:

The cachelib_num_shards parameter was parsed from gflags and stored in
UcacheBenchConfig but never actually applied to the CacheAllocator::Config.
This meant the config value was silently ignored and CacheLib used its
default of 8192 shards.

Now call setNumShards() when cachelib_num_shards > 0, allowing the
benchmark to match production shard counts for more accurate CPU
utilization profiling.

Differential Revision: D96087814

Summary:
Add support for configuring ThriftServer's socketMaxReadsPerEvent via CLI
flag. This controls how many reads a single connection can perform per
event loop iteration, which affects multi-client scalability.

Changes:
- Add rpc_socket_max_reads_per_event gflag to UcacheBenchRpcServer.cpp
- Apply flag value to thriftServer_->setSocketMaxReadsPerEvent()
- Add parameter to benchmark configs (debug/large/medium/small) with
  default value of 1 matching production ucache
- Add --rpc-socket-max-reads-per-event CLI arg in jobs_internal.yml
- Add parameter to ALLOWED_PARAMS in ucache_bench_benchmark.py

Reviewed By: excelle08

Differential Revision: D96763733

Summary:
Add support for fiber-based request processing and verbose error logging in
ucache_bench server and client.

Fiber configuration changes:
- Add enable_fibers flag to enable fiber-based request processing
- Add fiber_stack_size for configuring IO thread fiber stack size (default 64KB)
- Add fiber_max_pool_size for max preallocated free fibers (default 1000)
- Add fiber_pool_resize_period_ms for fiber pool resize period (default 1000ms)

Verbose logging changes:
- Add verbose parameter to server and client configs (default 0)
- Print detailed error messages for SET/GET failures when verbose is enabled
- Include carbon::Result error codes in log output for debugging

Files modified:
- Config JSON files: Added verbose parameter to server configs
- ucache_bench_benchmark.py: Added fiber params to ALLOWED_PARAMS
- jobs_internal.yml: Added CLI args for fiber config and verbose flag
- run.py: Added fiber and verbose CLI argument parsing
- UcacheBenchClient.cpp: Added verbose error logging for warmup and benchmark ops

Reviewed By: excelle08

Differential Revision: D96763783

Summary:
Add NIC IRQ affinity configuration to ucache_bench, ported from TaoBench.
This feature distributes network interrupt processing across CPUs to prevent
IRQ handling from bottlenecking on a few cores.

New parameters:
- nic_channel_ratio: Ratio of NIC channels to logical cores (0.0 = disabled)
- interface_name: Network interface for IRQ affinity tuning (default: eth0)
- hard_binding: Hard bind NIC channels to specific CPU cores (default: 0)

Changes:
- Add affinitize_nic() function to configure NIC channels via ethtool and
  redistribute IRQ affinity using affinitize_nic.py script
- Add new CLI arguments to server: --nic-channel-ratio, --interface-name,
  --hard-binding
- Update install script to copy affinitize_nic scripts for OSS builds
- Add NIC affinity params to benchmark configs and jobs_internal.yml
- Add ucache_bench_debug_nic_affinity_configs.json for testing

Differential Revision: D96763816

Summary:
The affinitize_nic() function computed n_channels = int(n_cores * ratio),
which could exceed the NIC's maximum supported combined channel count. On T2
Turin machines with 316 logical cores and ratio=0.5, this yielded 158 channels,
but the NIC (Mellanox) supports at most 128. The ethtool command silently
degraded to 79 channels, breaking network connectivity.

Fix: Query ethtool -l to get the pre-set maximum combined channels and clamp
n_channels to that value before calling ethtool -L.
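
The clamp described above can be sketched as follows. This is an illustration rather than the code in the commit: the helper names `parse_max_combined` and `clamp_channels` are hypothetical, and the parser assumes the usual `ethtool -l` layout with a "Pre-set maximums" section followed by "Current hardware settings".

```python
import re


def parse_max_combined(ethtool_l_output):
    """Return the 'Combined' value under 'Pre-set maximums', or None if absent."""
    in_preset = False
    for line in ethtool_l_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Pre-set maximums"):
            in_preset = True
        elif stripped.startswith("Current hardware settings"):
            in_preset = False
        elif in_preset:
            match = re.match(r"Combined:\s*(\d+)", stripped)
            if match:
                return int(match.group(1))
    return None  # e.g. the driver reports "n/a"


def clamp_channels(n_cores, ratio, max_combined):
    """Compute the requested channel count and cap it at the NIC maximum."""
    requested = int(n_cores * ratio)
    if max_combined is None:
        return requested  # no limit known; keep the original behaviour
    return min(requested, max_combined)


SAMPLE = """Channel parameters for eth0:
Pre-set maximums:
RX:\t\t0
TX:\t\t0
Other:\t\t0
Combined:\t128
Current hardware settings:
RX:\t\t0
TX:\t\t0
Other:\t\t0
Combined:\t79
"""
```

With the numbers from this commit message, `clamp_channels(316, 0.5, parse_max_combined(SAMPLE))` requests 158 channels and is capped at 128, which is the value that would then be passed to `ethtool -L`.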

Differential Revision: D98269551

Summary:
## Problem

When `additional_fanout=500` is used to simulate production's high connection count
(num_proxies=64 × 501 = 32K connections per client), multiple clients cause a TKO
cascade during warmup:

1. **Connection storm**: All 32K lazy connections are established simultaneously on first
   requests, overwhelming the server's TCP accept queue (default backlog=1024).

2. **Warmup burst**: After connections are established, all 128 threads × 500 max_inflight
   per client fire simultaneously. With 2 clients, this is 128K concurrent requests
   hitting the server at once, causing mcrouter internal queue overflow (mc_res_local_error)
   and server TKO marking. Once TKO is set, all subsequent requests fail immediately.

**Previous 2-client benchmark results (without this fix):**
- Client 0: 97.7% error rate
- Client 1: 48.8% error rate

## Solution

Three changes to prevent TKO:

### 1. Server: Increase TCP listen backlog (65536)
Prevents connection refusals during connection storms from multiple clients.

### 2. Client: Connection ramp-up phase (new flag: `--connection_ramp_seconds`)
Before warmup, sends paced requests with `maxOutstanding=1` over a configurable period
(default 10s) to gradually establish mcrouter's lazy TCP connections without overwhelming
the server.

### 3. Client: Adaptive load control during warmup (TCP-style congestion control)
Instead of launching all 128 threads at max_inflight=500 simultaneously, the client
adjusts the per-thread inflight limit with an AIMD (Additive Increase, Multiplicative
Decrease) scheme preceded by a slow-start phase:
- Starts at `--warmup_initial_inflight=2` per thread (256 total with 128 threads)
- **Slow start**: doubles inflight every 2s while error rate < 1% (2→4→8→16→32→64→128)
- **Congestion avoidance**: linear increase (+50/step) once past 25% of max (128→178→228→...→500)
- **Backoff**: halves inflight if error rate > 5%
- All workers share a dynamic `currentMaxInflight` atomic variable

New flags: `--warmup_adaptive_load` (default true), `--warmup_initial_inflight` (default 2)
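
The adjustment rule above can be sketched as a single step function. This is a minimal Python illustration, not the client's C++ implementation: the function name and parameter names are hypothetical, and holding the limit steady when the error rate falls between 1% and 5% is an assumption, since the description does not say what happens in that band.

```python
def next_inflight(current, error_rate, max_inflight=500, linear_step=50,
                  grow_below=0.01, backoff_above=0.05):
    """Return the per-thread inflight limit for the next 2s adjustment step."""
    if error_rate > backoff_above:
        return max(1, current // 2)                   # backoff: halve
    if error_rate >= grow_below:
        return current                                # assumed: hold steady
    if current <= 0.25 * max_inflight:
        return min(max_inflight, current * 2)         # slow start: double
    return min(max_inflight, current + linear_step)   # congestion avoidance


def ramp(start, steps, error_rate=0.0):
    """Apply next_inflight repeatedly and return every intermediate limit."""
    limits = [start]
    for _ in range(steps):
        limits.append(next_inflight(limits[-1], error_rate))
    return limits
```

With no errors, `ramp(2, 15)` reproduces the sequence in the description: doubling from 2 up to 128, then +50 per step until the limit is capped at 500. At 2s per step that is roughly 28s to reach full load, compared with hitting the server with every request slot at once.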

## Results

**2-client benchmark with fix (adaptive load control):**

| Metric | Client 0 | Client 1 |
|--------|----------|----------|
| Warmup QPS | 428,540 | 428,989 |
| Warmup Errors | **0** | **0** |
| Benchmark QPS | **482,367** | **482,775** |
| GET Errors | **0** | **0** |
| SET Errors | **0** | **0** |
| Hit Ratio | 100% | 100% |
| P50 Latency | 130ms | 130ms |
| P99 Latency | 263ms | 263ms |

Combined: **~965K QPS with 0 errors** across both clients.

Differential Revision: D98351095
)

Summary:

- Add createSameThreadClient() support to eliminate cross-thread message queue hops
- Workers run directly on McRouter proxy EventBases instead of separate thread pool
- Add use_same_thread_client flag/config through full stack (client, run.py, jobs YAML, automark)
- Add experiment config files for various benchmark configurations

Differential Revision: D98968871
…acebookresearch#638)

Summary:

Add configurable per-request CPU overhead simulation to ucachebench server to
help close the CPU utilization gap between ucachebench (~35% idle) and
production ucache (~9% idle). The simulation includes hash computation,
clock_gettime calls, and memory allocations that mimic production ACL checks,
CacheTable key construction, and serialization overhead.

Changes:
- Add cpu_overhead_level flag (0=disabled, 1=light, 2=medium, 3=heavy)
- Wire flag through run.py and jobs_internal.yml
- Add folly::hash and BenchmarkUtil deps
- Add exp_y config (fibers enabled) and exp_z config (fibers + overhead)

Experiment results:
- Exp V (baseline): 35% idle, 6.91M QPS
- Exp Y (fibers only): 24% idle, 6.91M QPS -- fibers reduce idle by 11pp
- Exp Z (fibers + overhead=3): 37% idle, 6.94M QPS -- simulation ineffective

Differential Revision: D99338676

Summary:

Implement production-like per-request overhead features to close the CPU
utilization gap between ucachebench (~46% idle) and production ucache (~9% idle).

Features added:
- Compound key construction (McStoredKey-style: "uc:pool:key:v1")
- MurmurHash2 key hashing (matching production getHashForKey)
- ACL prefix checks with F14FastMap lookup
- Overload protection with inflight request counting
- Stats tracking (12+ atomic increments per request)
- Ticket staleness checks
- Egress hash computation
- Response timestamps via clock_gettime

Also adds --production-features flag to run.py, jobs_internal.yml, and
server main.cpp to enable these features via automark config.
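
To make the first three features concrete, here is a small Python sketch of compound key construction, 32-bit MurmurHash2, and a prefix ACL lookup. It is illustrative only: the key layout is inferred from the "uc:pool:key:v1" example, the dict stands in for the F14FastMap, the ACL being keyed by the text before the first ':' is an assumption, and production's getHashForKey may use a different MurmurHash2 variant or seed.

```python
M = 0x5BD1E995
MASK32 = 0xFFFFFFFF


def build_stored_key(pool, key, version="v1"):
    """Compose a compound key in the 'uc:pool:key:v1' style (assumed layout)."""
    return f"uc:{pool}:{key}:{version}"


def murmurhash2_32(data, seed=0):
    """32-bit MurmurHash2 over bytes, reading 4-byte blocks as little-endian."""
    length = len(data)
    h = (seed ^ length) & MASK32
    i = 0
    while length - i >= 4:
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * M) & MASK32
        k ^= k >> 24
        k = (k * M) & MASK32
        h = (h * M) & MASK32
        h ^= k
        i += 4
    tail = data[i:]
    if len(tail) == 3:
        h ^= tail[2] << 16
    if len(tail) >= 2:
        h ^= tail[1] << 8
    if len(tail) >= 1:
        h ^= tail[0]
        h = (h * M) & MASK32
    h ^= h >> 13
    h = (h * M) & MASK32
    h ^= h >> 15
    return h


def acl_allows(key, acl):
    """Look up the key's prefix (text before the first ':') in the ACL map."""
    prefix = key.split(":", 1)[0]
    return acl.get(prefix, False)
```

A request path built on these helpers would construct the stored key, hash its bytes with `murmurhash2_32(stored_key.encode())`, and reject the request early if `acl_allows` returns False.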

Differential Revision: D99338673
…bookresearch#641)

Summary:

Adds three new production-like CPU overhead simulations to close the CPU
utilization gap between ucachebench and production ucache:
- CRC32C hardware-accelerated value checksums (integrity verification)
- Thrift compact protocol serialization simulation (varint encoding, field headers)
- IOBuf chain construction and coalescing (header + value chaining)
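
For readers unfamiliar with the varint encoding and field headers mentioned above, the following Python sketch shows how the Thrift compact protocol encodes a single i32 field in its short form. It illustrates the kind of work being simulated and is not the simulation code added in this commit; the function names are hypothetical.

```python
CT_I32 = 5  # compact-protocol type id for i32


def zigzag_encode(n, bits=64):
    """Map signed ints to unsigned so small magnitudes stay small."""
    return ((n << 1) ^ (n >> (bits - 1))) & ((1 << bits) - 1)


def encode_varint(value):
    """Encode a non-negative int as a base-128 varint, low 7 bits first."""
    if value < 0:
        raise ValueError("varint input must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_i32_field(field_delta, value):
    """Short-form field header (delta in high nibble, type in low) plus value."""
    if not 1 <= field_delta <= 15:
        raise ValueError("short-form header needs a field-id delta of 1..15")
    header = (field_delta << 4) | CT_I32
    return bytes([header]) + encode_varint(zigzag_encode(value, bits=32))
```

For example, `encode_i32_field(1, 150)` yields three bytes: the header `0x15`, then `0xAC 0x02`, the varint for the zigzag value 300.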

Also adds benchmark config files for various experiment configurations.

Differential Revision: D99338677

Summary:
Adds a thread-local HotHashDetector that mirrors production's TLHotKeyTracker.
Production maintains two detectors per IO thread (QPS hotness and egress
hotness) and calls bumpHash() on every request and response. Each bumpHash()
call performs an L1 counter increment, a conditional L2 probe, and periodic
maintenance (counter decay, threshold adjustment). This adds ~2-3% CPU
overhead, in line with production ucache.

Differential Revision: D99338674
@charles-typ charles-typ force-pushed the export-D99338674-to-v2-beta branch from bee195c to 80fae05 Compare May 29, 2026 17:58
charles-typ added a commit to charles-typ/DCPerf that referenced this pull request May 29, 2026
…ch#642)

charles-typ added a commit to charles-typ/DCPerf that referenced this pull request May 29, 2026
…ch#642)

Labels

CLA Signed, fb-exported, meta-exported
