Add HotHashDetector for per-request hot key detection by charles-typ · Pull Request #642 · facebookresearch/DCPerf

charles-typ · 2026-05-29T16:44:57Z

Summary:
Adds a thread-local HotHashDetector matching production's TLHotKeyTracker.
Production maintains two detectors per IO thread (QPS hotness and egress
hotness) and calls bumpHash() on every request and response. Each bumpHash()
performs an L1 counter increment, a conditional L2 probe, and periodic
maintenance (counter decay, threshold adjustment). This adds ~2-3% CPU
overhead, matching production ucache.

Differential Revision: D99338674
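
To make the three steps inside bumpHash() concrete, here is a minimal Python sketch of a two-level detector. It is not the C++ code in this PR or the production TLHotKeyTracker: the class name, table sizes, thresholds, decay rule (halving), and threshold adjustment rule are all assumptions chosen for illustration.

```python
class HotHashDetectorSketch:
    """Illustrative two-level hot-hash detector (all parameters are assumed)."""

    def __init__(self, l1_size=1024, l2_size=64, l1_threshold=8,
                 hot_threshold=4, maintenance_interval=4096):
        self.l1 = [0] * l1_size          # cheap, collision-prone counters
        self.l2 = {}                     # exact hash -> count, bounded size
        self.l2_size = l2_size
        self.l1_threshold = l1_threshold
        self.hot_threshold = hot_threshold
        self.maintenance_interval = maintenance_interval
        self.bumps = 0

    def bump_hash(self, h):
        """Record one occurrence of hash h; return True if h looks hot."""
        hot = self._bump(h)
        self.bumps += 1
        if self.bumps % self.maintenance_interval == 0:
            self._maintain()             # periodic maintenance
        return hot

    def _bump(self, h):
        slot = h % len(self.l1)
        self.l1[slot] += 1               # step 1: L1 counter increment
        if self.l1[slot] < self.l1_threshold:
            return False                 # common case: L1 only
        # Step 2: conditional L2 probe, only for slots that passed L1.
        count = self.l2.get(h)
        if count is None:
            if len(self.l2) >= self.l2_size:
                return False             # L2 full: drop the candidate
            count = 0
        self.l2[h] = count + 1
        return self.l2[h] >= self.hot_threshold

    def _maintain(self):
        # Counter decay: halve every counter so old traffic fades out.
        self.l1 = [c >> 1 for c in self.l1]
        self.l2 = {h: c >> 1 for h, c in self.l2.items() if c >> 1 > 0}
        # Threshold adjustment (assumed rule): tighten when L2 is full,
        # relax when L2 is nearly empty.
        if len(self.l2) >= self.l2_size:
            self.l1_threshold *= 2
        elif len(self.l2) < self.l2_size // 4:
            self.l1_threshold = max(1, self.l1_threshold // 2)
```

In a server modeled on the description above, each IO thread would hold two such objects (one bumped per request for QPS hotness, one per response for egress hotness), so no locking is needed on the hot path.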

meta-codesync · 2026-05-29T16:45:06Z

@charles-typ has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99338674.

) Summary: The cachelib_num_shards parameter was parsed from gflags and stored in UcacheBenchConfig but never actually applied to the CacheAllocator::Config. This meant the config value was silently ignored and CacheLib used its default of 8192 shards. Now call setNumShards() when cachelib_num_shards > 0, allowing the benchmark to match production shard counts for more accurate CPU utilization profiling. Differential Revision: D96087814

Summary: Add support for configuring ThriftServer's socketMaxReadsPerEvent via CLI flag. This controls how many reads a single connection can perform per event loop iteration, which affects multi-client scalability.

Changes:
- Add rpc_socket_max_reads_per_event gflag to UcacheBenchRpcServer.cpp
- Apply flag value to thriftServer_->setSocketMaxReadsPerEvent()
- Add parameter to benchmark configs (debug/large/medium/small) with default value of 1 matching production ucache
- Add --rpc-socket-max-reads-per-event CLI arg in jobs_internal.yml
- Add parameter to ALLOWED_PARAMS in ucache_bench_benchmark.py

Reviewed By: excelle08

Differential Revision: D96763733

Summary: Add support for fiber-based request processing and verbose error logging in ucache_bench server and client.

Fiber configuration changes:
- Add enable_fibers flag to enable fiber-based request processing
- Add fiber_stack_size for configuring IO thread fiber stack size (default 64KB)
- Add fiber_max_pool_size for max preallocated free fibers (default 1000)
- Add fiber_pool_resize_period_ms for fiber pool resize period (default 1000ms)

Verbose logging changes:
- Add verbose parameter to server and client configs (default 0)
- Print detailed error messages for SET/GET failures when verbose is enabled
- Include carbon::Result error codes in log output for debugging

Files modified:
- Config JSON files: Added verbose parameter to server configs
- ucache_bench_benchmark.py: Added fiber params to ALLOWED_PARAMS
- jobs_internal.yml: Added CLI args for fiber config and verbose flag
- run.py: Added fiber and verbose CLI argument parsing
- UcacheBenchClient.cpp: Added verbose error logging for warmup and benchmark ops

Reviewed By: excelle08

Differential Revision: D96763783

Summary: Add NIC IRQ affinity configuration to ucache_bench, ported from TaoBench. This feature distributes network interrupt processing across CPUs to prevent IRQ handling from bottlenecking on a few cores.

New parameters:
- nic_channel_ratio: Ratio of NIC channels to logical cores (0.0 = disabled)
- interface_name: Network interface for IRQ affinity tuning (default: eth0)
- hard_binding: Hard bind NIC channels to specific CPU cores (default: 0)

Changes:
- Add affinitize_nic() function to configure NIC channels via ethtool and redistribute IRQ affinity using affinitize_nic.py script
- Add new CLI arguments to server: --nic-channel-ratio, --interface-name, --hard-binding
- Update install script to copy affinitize_nic scripts for OSS builds
- Add NIC affinity params to benchmark configs and jobs_internal.yml
- Add ucache_bench_debug_nic_affinity_configs.json for testing

Differential Revision: D96763816

Summary: The affinitize_nic() function was computing n_channels = int(n_cores * ratio) which could exceed the NIC's maximum supported combined channels. On T2 Turin machines with 316 logical cores and ratio=0.5, this computed 158 channels, but the NIC (Mellanox) only supports 128 max. The ethtool command silently degraded to 79 channels, breaking network connectivity. Fix: Query ethtool -l to get the pre-set maximum combined channels and clamp n_channels to that value before calling ethtool -L. Differential Revision: D98269551
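
The clamping fix described above can be sketched as follows. This is an illustration, not the code in affinitize_nic(): the helper names `parse_max_combined` and `clamp_channels` are hypothetical, and the parser assumes the usual `ethtool -l` layout with a "Pre-set maximums" section followed by a "Current hardware settings" section.

```python
def parse_max_combined(ethtool_l_output):
    """Return the pre-set maximum 'Combined' channel count, or None."""
    in_max_section = False
    for line in ethtool_l_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Pre-set maximums"):
            in_max_section = True
        elif stripped.startswith("Current hardware settings"):
            in_max_section = False
        elif in_max_section and stripped.startswith("Combined:"):
            value = stripped.split(":", 1)[1].strip()
            # Some drivers report "n/a" instead of a number.
            return int(value) if value.isdigit() else None
    return None


def clamp_channels(n_cores, ratio, max_combined):
    """Compute the channel count to pass to `ethtool -L`, never above the max."""
    requested = int(n_cores * ratio)
    if max_combined is not None:
        requested = min(requested, max_combined)
    return max(requested, 1)  # assumption: always keep at least one channel


# Sample output shaped like the T2 Turin case from the commit message.
SAMPLE = """Channel parameters for eth0:
Pre-set maximums:
RX:             0
TX:             0
Other:          0
Combined:       128
Current hardware settings:
RX:             0
TX:             0
Other:          0
Combined:       79
"""

max_combined = parse_max_combined(SAMPLE)
n_channels = clamp_channels(316, 0.5, max_combined)  # 158 requested, 128 allowed
```

In a real script the sample text would come from running `ethtool -l <interface>` via `subprocess`, and `n_channels` would then be passed to `ethtool -L <interface> combined <n>`.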

Summary:

## Problem

When `additional_fanout=500` is used to simulate production's high connection count (num_proxies=64 × 501 = 32K connections per client), multiple clients cause a TKO cascade during warmup:

1. **Connection storm**: All 32K lazy connections are established simultaneously on first requests, overwhelming the server's TCP accept queue (default backlog=1024).
2. **Warmup burst**: After connections are established, all 128 threads × 500 max_inflight per client fire simultaneously. With 2 clients, this is 128K concurrent requests hitting the server at once, causing mcrouter internal queue overflow (mc_res_local_error) and server TKO marking. Once TKO is set, all subsequent requests fail immediately.

**Previous 2-client benchmark results (without this fix):**
- Client 0: 97.7% error rate
- Client 1: 48.8% error rate

## Solution

Three changes to prevent TKO:

### 1. Server: Increase TCP listen backlog (65536)

Prevents connection refusals during connection storms from multiple clients.

### 2. Client: Connection ramp-up phase (new flag: `--connection_ramp_seconds`)

Before warmup, sends paced requests with `maxOutstanding=1` over a configurable period (default 10s) to gradually establish mcrouter's lazy TCP connections without overwhelming the server.

### 3. Client: Adaptive load control during warmup (TCP congestion control)

Instead of launching all 128 threads at max_inflight=500 simultaneously, uses AIMD (Additive Increase, Multiplicative Decrease):
- Starts at `--warmup_initial_inflight=2` per thread (256 total with 128 threads)
- **Slow start**: doubles inflight every 2s while error rate < 1% (2→4→8→16→32→64→128)
- **Congestion avoidance**: linear increase (+50/step) once past 25% of max (128→178→228→...→500)
- **Backoff**: halves inflight if error rate > 5%
- All workers share a dynamic `currentMaxInflight` atomic variable

New flags: `--warmup_adaptive_load` (default true), `--warmup_initial_inflight` (default 2)

## Results

**2-client benchmark with fix (adaptive load control):**

| Metric | Client 0 | Client 1 |
|--------|----------|----------|
| Warmup QPS | 428,540 | 428,989 |
| Warmup Errors | **0** | **0** |
| Benchmark QPS | **482,367** | **482,775** |
| GET Errors | **0** | **0** |
| SET Errors | **0** | **0** |
| Hit Ratio | 100% | 100% |
| P50 Latency | 130ms | 130ms |
| P99 Latency | 263ms | 263ms |

Combined: **~965K QPS with 0 errors** across both clients.

Differential Revision: D98351095
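
The adaptive warmup schedule can be sketched as a single step function. This is a Python illustration of the rules listed in the commit message, not the client's C++ code; the function name `next_inflight` is hypothetical, and holding the limit steady when the error rate is between 1% and 5% is an assumption, since the commit message does not say what happens in that band.

```python
def next_inflight(current, error_rate, max_inflight=500, step=50, floor=1):
    """Return the per-thread inflight limit for the next 2s interval."""
    if error_rate > 0.05:
        # Backoff: multiplicative decrease.
        return max(current // 2, floor)
    if error_rate >= 0.01:
        # Assumption: between 1% and 5% errors, hold the current limit.
        return current
    if current < 0.25 * max_inflight:
        # Slow start: double while below 25% of the maximum.
        return min(current * 2, max_inflight)
    # Congestion avoidance: additive increase.
    return min(current + step, max_inflight)


# Error-free ramp starting from --warmup_initial_inflight=2.
schedule = [2]
while schedule[-1] < 500:
    schedule.append(next_inflight(schedule[-1], error_rate=0.0))
```

With no errors, `schedule` follows 2, 4, 8, ..., 128, then 178, 228, and so on up to 500, which matches the progression quoted above. In the client, the result of each step would be stored in the shared `currentMaxInflight` atomic that all workers read.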

) Summary:
- Add createSameThreadClient() support to eliminate cross-thread message queue hops
- Workers run directly on McRouter proxy EventBases instead of separate thread pool
- Add use_same_thread_client flag/config through full stack (client, run.py, jobs YAML, automark)
- Add experiment config files for various benchmark configurations

Differential Revision: D98968871

…acebookresearch#638) Summary: Add configurable per-request CPU overhead simulation to ucachebench server to help close the CPU utilization gap between ucachebench (~35% idle) and production ucache (~9% idle). The simulation includes hash computation, clock_gettime calls, and memory allocations that mimic production ACL checks, CacheTable key construction, and serialization overhead.

Changes:
- Add cpu_overhead_level flag (0=disabled, 1=light, 2=medium, 3=heavy)
- Wire flag through run.py and jobs_internal.yml
- Add folly::hash and BenchmarkUtil deps
- Add exp_y config (fibers enabled) and exp_z config (fibers + overhead)

Experiment results:
- Exp V (baseline): 35% idle, 6.91M QPS
- Exp Y (fibers only): 24% idle, 6.91M QPS; fibers reduce idle by 11pp
- Exp Z (fibers + overhead=3): 37% idle, 6.94M QPS; simulation ineffective

Differential Revision: D99338676

Summary: Implement production-like per-request overhead features to close the CPU utilization gap between ucachebench (~46% idle) and production ucache (~9% idle).

Features added:
- Compound key construction (McStoredKey-style: "uc:pool:key:v1")
- MurmurHash2 key hashing (matching production getHashForKey)
- ACL prefix checks with F14FastMap lookup
- Overload protection with inflight request counting
- Stats tracking (12+ atomic increments per request)
- Ticket staleness checks
- Egress hash computation
- Response timestamps via clock_gettime

Also adds --production-features flag to run.py, jobs_internal.yml, and server main.cpp to enable these features via automark config.

Differential Revision: D99338673
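
As an illustration of the first three features, the sketch below builds a compound key, hashes it, and checks it against a prefix ACL. Several things here are assumptions: the helper names, the exact key layout beyond the "uc:pool:key:v1" example, the use of the 32-bit MurmurHash2 variant (the commit message does not say which variant getHashForKey uses), and a plain dict standing in for F14FastMap.

```python
def build_compound_key(pool, key, version=1):
    """Assumed layout, modeled on the "uc:pool:key:v1" example."""
    return f"uc:{pool}:{key}:v{version}"


def murmur_hash2(data, seed=0):
    """32-bit MurmurHash2 over a bytes object."""
    m, r, mask = 0x5BD1E995, 24, 0xFFFFFFFF
    h = (seed ^ len(data)) & mask
    n_blocks = len(data) // 4
    for i in range(n_blocks):
        # Mix one little-endian 4-byte block into the hash.
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * m) & mask
        k ^= k >> r
        k = (k * m) & mask
        h = (h * m) & mask
        h ^= k
    # Fold in the remaining 0-3 tail bytes.
    tail = data[4 * n_blocks:]
    if len(tail) == 3:
        h ^= tail[2] << 16
    if len(tail) >= 2:
        h ^= tail[1] << 8
    if len(tail) >= 1:
        h ^= tail[0]
        h = (h * m) & mask
    # Final avalanche.
    h ^= h >> 13
    h = (h * m) & mask
    h ^= h >> 15
    return h


def is_allowed(compound_key, acl):
    """Hypothetical ACL check keyed by the "uc:<pool>" prefix."""
    prefix = ":".join(compound_key.split(":", 2)[:2])
    return acl.get(prefix, False)


acl = {"uc:feed": True}                      # stand-in for an F14FastMap
key = build_compound_key("feed", "user42")   # "uc:feed:user42:v1"
key_hash = murmur_hash2(key.encode())
allowed = is_allowed(key, acl)
```

Each of these steps costs a small, fixed amount of CPU per request, which is the point of the change: the benchmark server does the same kind of bookkeeping as production before touching the cache.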

…bookresearch#641) Summary: Adds three new production-like CPU overhead simulations to close the CPU utilization gap between ucachebench and production ucache:
- CRC32C hardware-accelerated value checksums (integrity verification)
- Thrift compact protocol serialization simulation (varint encoding, field headers)
- IOBuf chain construction and coalescing (header + value chaining)

Also adds benchmark config files for various experiment configurations.

Differential Revision: D99338677
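
For readers unfamiliar with what "varint encoding, field headers" involves, here is a small Python sketch of those two pieces of the Thrift compact protocol. It follows the published compact protocol encoding (zigzag plus base-128 varints, and a one-byte short-form field header), but it is not the simulation code added in this commit, and the function names are hypothetical.

```python
COMPACT_TYPE_I32 = 5  # compact-protocol type id for i32


def zigzag64(n):
    """Map signed ints to unsigned so small magnitudes stay small."""
    return ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF


def encode_varint(value):
    """Encode a non-negative int as a little-endian base-128 varint."""
    if value < 0:
        raise ValueError("varint input must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_field_header(prev_field_id, field_id, type_id):
    """Short form packs the id delta and type into one byte."""
    delta = field_id - prev_field_id
    if 1 <= delta <= 15:
        return bytes([(delta << 4) | type_id])
    # Long form: type byte, then the field id as a zigzag varint.
    return bytes([type_id]) + encode_varint(zigzag64(field_id))


def encode_i32_field(prev_field_id, field_id, value):
    """Field header followed by the zigzag-varint value."""
    header = encode_field_header(prev_field_id, field_id, COMPACT_TYPE_I32)
    return header + encode_varint(zigzag64(value))


encoded = encode_i32_field(prev_field_id=0, field_id=1, value=150)
```

The per-byte shifting and branching in `encode_varint` is the kind of work the simulation is meant to reproduce on every response.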

…ch#642) Summary: Adds thread-local HotHashDetector matching production TLHotKeyTracker. Production maintains two detectors per IO thread (QPS + egress hotness), calling bumpHash() on every request and response. Each bumpHash() does L1 counter increment, conditional L2 probe, and periodic maintenance (counter decay, threshold adjustment). This adds ~2-3% CPU overhead matching production ucache. Differential Revision: D99338674

meta-cla Bot added the CLA Signed label May 29, 2026

meta-codesync Bot added fb-exported meta-exported labels May 29, 2026

charles-typ force-pushed the export-D99338674-to-v2-beta branch from e549a5c to bee195c Compare May 29, 2026 16:48

charles-typ added 11 commits May 29, 2026 10:57

charles-typ force-pushed the export-D99338674-to-v2-beta branch from bee195c to 80fae05 Compare May 29, 2026 17:58

charles-typ wants to merge 11 commits into facebookresearch:v2-beta from charles-typ:export-D99338674-to-v2-beta
