Add production-like per-request CPU overhead #640

Open
charles-typ wants to merge 9 commits into facebookresearch:v2-beta from charles-typ:export-D99338673-to-v2-beta

Conversation

Contributor

@charles-typ charles-typ commented May 29, 2026

Summary:
Implement production-like per-request overhead features to close the CPU
utilization gap between ucachebench (~46% idle) and production ucache (~9% idle).

Features added:

  • Compound key construction (McStoredKey-style: "uc:pool:key:v1")
  • MurmurHash2 key hashing (matching production getHashForKey)
  • ACL prefix checks with F14FastMap lookup
  • Overload protection with inflight request counting
  • Stats tracking (12+ atomic increments per request)
  • Ticket staleness checks
  • Egress hash computation
  • Response timestamps via clock_gettime

Also adds --production-features flag to run.py, jobs_internal.yml, and
server main.cpp to enable these features via automark config.
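
As a rough illustration of the kind of per-request work listed above, here is a minimal Python sketch (the actual change is in the C++ server). It assumes the 32-bit MurmurHash2 variant with seed 0; the source does not say which variant or seed production's getHashForKey uses. The `RequestOverhead` class, the ACL rule (prefix before the first ":" in the raw key), and the stat names are hypothetical, and a plain dict stands in for F14FastMap.

```python
import time
from collections import Counter

M, R, MASK32 = 0x5BD1E995, 24, 0xFFFFFFFF

def murmurhash2_32(data: bytes, seed: int = 0) -> int:
    """32-bit MurmurHash2, little-endian (variant and seed are assumptions)."""
    length = len(data)
    h = (seed ^ length) & MASK32
    i = 0
    while length - i >= 4:          # mix four bytes at a time
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * M) & MASK32
        k ^= k >> R
        k = (k * M) & MASK32
        h = ((h * M) & MASK32) ^ k
        i += 4
    tail = length - i               # handle the last 0-3 bytes
    if tail == 3:
        h ^= data[i + 2] << 16
    if tail >= 2:
        h ^= data[i + 1] << 8
    if tail >= 1:
        h ^= data[i]
        h = (h * M) & MASK32
    h ^= h >> 13                    # final avalanche
    h = (h * M) & MASK32
    h ^= h >> 15
    return h

def build_compound_key(pool: str, key: str, version: int = 1) -> str:
    """Compound key in the "uc:pool:key:v1" shape from the description."""
    return f"uc:{pool}:{key}:v{version}"

class RequestOverhead:
    """Hypothetical per-request bookkeeping; not the server's real class."""

    def __init__(self, acl: dict, max_inflight: int):
        self.acl = acl              # dict stands in for F14FastMap
        self.max_inflight = max_inflight
        self.inflight = 0
        self.stats = Counter()      # stands in for atomic counters

    def begin(self, pool: str, key: str):
        self.stats["requests"] += 1
        if self.inflight >= self.max_inflight:   # overload protection
            self.stats["rejected_overload"] += 1
            return None
        prefix = key.split(":", 1)[0]            # assumed ACL rule
        if not self.acl.get(prefix, False):
            self.stats["rejected_acl"] += 1
            return None
        self.inflight += 1
        compound = build_compound_key(pool, key)
        return compound, murmurhash2_32(compound.encode())

    def end(self) -> int:
        self.inflight -= 1
        self.stats["responses"] += 1
        return time.monotonic_ns()  # stands in for clock_gettime
```

Each call to `begin()` does a handful of small, cache-unfriendly operations (string building, hashing, a map lookup, counter bumps), which is the sort of overhead the flag is meant to add per request.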

Differential Revision: D99338673

@meta-cla meta-cla Bot added the CLA Signed label May 29, 2026

meta-codesync Bot commented May 29, 2026

@charles-typ has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99338673.

@meta-codesync meta-codesync Bot changed the title from "Add production-like per-request CPU overhead" to "Add production-like per-request CPU overhead (#640)" May 29, 2026
charles-typ added a commit to charles-typ/DCPerf that referenced this pull request May 29, 2026
@charles-typ charles-typ force-pushed the export-D99338673-to-v2-beta branch from b112741 to 91af76e Compare May 29, 2026 16:44
)

Summary:

The cachelib_num_shards parameter was parsed from gflags and stored in
UcacheBenchConfig but never actually applied to the CacheAllocator::Config.
This meant the config value was silently ignored and CacheLib used its
default of 8192 shards.

Now call setNumShards() when cachelib_num_shards > 0, allowing the
benchmark to match production shard counts for more accurate CPU
utilization profiling.

Differential Revision: D96087814
Summary:
Add support for configuring ThriftServer's socketMaxReadsPerEvent via CLI
flag. This controls how many reads a single connection can perform per
event loop iteration, which affects multi-client scalability.

Changes:
- Add rpc_socket_max_reads_per_event gflag to UcacheBenchRpcServer.cpp
- Apply flag value to thriftServer_->setSocketMaxReadsPerEvent()
- Add parameter to benchmark configs (debug/large/medium/small) with
  default value of 1 matching production ucache
- Add --rpc-socket-max-reads-per-event CLI arg in jobs_internal.yml
- Add parameter to ALLOWED_PARAMS in ucache_bench_benchmark.py

Reviewed By: excelle08

Differential Revision: D96763733
Summary:
Add support for fiber-based request processing and verbose error logging in
ucache_bench server and client.

Fiber configuration changes:
- Add enable_fibers flag to enable fiber-based request processing
- Add fiber_stack_size for configuring IO thread fiber stack size (default 64KB)
- Add fiber_max_pool_size for max preallocated free fibers (default 1000)
- Add fiber_pool_resize_period_ms for fiber pool resize period (default 1000ms)

Verbose logging changes:
- Add verbose parameter to server and client configs (default 0)
- Print detailed error messages for SET/GET failures when verbose is enabled
- Include carbon::Result error codes in log output for debugging

Files modified:
- Config JSON files: Added verbose parameter to server configs
- ucache_bench_benchmark.py: Added fiber params to ALLOWED_PARAMS
- jobs_internal.yml: Added CLI args for fiber config and verbose flag
- run.py: Added fiber and verbose CLI argument parsing
- UcacheBenchClient.cpp: Added verbose error logging for warmup and benchmark ops

Reviewed By: excelle08

Differential Revision: D96763783
Summary:
Add NIC IRQ affinity configuration to ucache_bench, ported from TaoBench.
This feature distributes network interrupt processing across CPUs to prevent
IRQ handling from bottlenecking on a few cores.

New parameters:
- nic_channel_ratio: Ratio of NIC channels to logical cores (0.0 = disabled)
- interface_name: Network interface for IRQ affinity tuning (default: eth0)
- hard_binding: Hard bind NIC channels to specific CPU cores (default: 0)

Changes:
- Add affinitize_nic() function to configure NIC channels via ethtool and
  redistribute IRQ affinity using affinitize_nic.py script
- Add new CLI arguments to server: --nic-channel-ratio, --interface-name,
  --hard-binding
- Update install script to copy affinitize_nic scripts for OSS builds
- Add NIC affinity params to benchmark configs and jobs_internal.yml
- Add ucache_bench_debug_nic_affinity_configs.json for testing

Differential Revision: D96763816
Summary:
The affinitize_nic() function was computing n_channels = int(n_cores * ratio)
which could exceed the NIC's maximum supported combined channels. On T2 Turin
machines with 316 logical cores and ratio=0.5, this computed 158 channels, but
the NIC (Mellanox) only supports 128 max. The ethtool command silently degraded
to 79 channels, breaking network connectivity.

Fix: Query ethtool -l to get the pre-set maximum combined channels and clamp
n_channels to that value before calling ethtool -L.
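
A minimal sketch of the clamp described above, assuming the usual `ethtool -l` text layout with a "Pre-set maximums:" section followed by "Current hardware settings:". The helper names `parse_max_combined` and `clamp_channels` are hypothetical and are not the names used in run.py; the floor of one channel is also an assumption.

```python
from typing import Optional

def parse_max_combined(ethtool_l_output: str) -> Optional[int]:
    """Return the pre-set maximum 'Combined' channel count, or None if absent."""
    in_maximums = False
    for line in ethtool_l_output.splitlines():
        s = line.strip()
        if s.startswith("Pre-set maximums"):
            in_maximums = True
        elif s.startswith("Current hardware settings"):
            in_maximums = False
        elif in_maximums and s.startswith("Combined:"):
            value = s.split(":", 1)[1].strip()
            # Some drivers report "n/a" instead of a number.
            return int(value) if value.isdigit() else None
    return None

def clamp_channels(n_cores: int, ratio: float, max_combined: Optional[int]) -> int:
    """Compute int(n_cores * ratio), clamped to the NIC maximum when known."""
    n_channels = int(n_cores * ratio)
    if max_combined is not None:
        n_channels = min(n_channels, max_combined)
    return max(n_channels, 1)  # assumed floor: never request zero channels

SAMPLE = """Channel parameters for eth0:
Pre-set maximums:
RX:             0
TX:             0
Other:          0
Combined:       128
Current hardware settings:
RX:             0
TX:             0
Other:          0
Combined:       79
"""

print(clamp_channels(316, 0.5, parse_max_combined(SAMPLE)))
```

With the numbers from the summary (316 logical cores, ratio 0.5, NIC maximum 128), the unclamped value is 158 and the clamped value is 128.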

Differential Revision: D98269551
Summary:
## Problem

When `additional_fanout=500` is used to simulate production's high connection count
(num_proxies=64 × 501 = 32K connections per client), multiple clients cause a TKO
cascade during warmup:

1. **Connection storm**: All 32K lazy connections are established simultaneously on first
   requests, overwhelming the server's TCP accept queue (default backlog=1024).

2. **Warmup burst**: After connections are established, all 128 threads × 500 max_inflight
   per client fire simultaneously. With 2 clients, this is 128K concurrent requests
   hitting the server at once, causing mcrouter internal queue overflow (mc_res_local_error)
   and server TKO marking. Once TKO is set, all subsequent requests fail immediately.

**Previous 2-client benchmark results (without this fix):**
- Client 0: 97.7% error rate
- Client 1: 48.8% error rate

## Solution

Three changes to prevent TKO:

### 1. Server: Increase TCP listen backlog (65536)
Prevents connection refusals during connection storms from multiple clients.

### 2. Client: Connection ramp-up phase (new flag: `--connection_ramp_seconds`)
Before warmup, sends paced requests with `maxOutstanding=1` over a configurable period
(default 10s) to gradually establish mcrouter's lazy TCP connections without overwhelming
the server.

### 3. Client: Adaptive load control during warmup (TCP-style congestion control)
Instead of launching all 128 threads at max_inflight=500 simultaneously, uses AIMD
(Additive Increase, Multiplicative Decrease):
- Starts at `--warmup_initial_inflight=2` per thread (256 total with 128 threads)
- **Slow start**: doubles inflight every 2s while error rate < 1% (2→4→8→16→32→64→128)
- **Congestion avoidance**: linear increase (+50/step) once past 25% of max (128→178→228→...→500)
- **Backoff**: halves inflight if error rate > 5%
- All workers share a dynamic `currentMaxInflight` atomic variable

New flags: `--warmup_adaptive_load` (default true), `--warmup_initial_inflight` (default 2)
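
The stepping rule above can be sketched as a pure function. This is an illustration in Python, not the client's C++ code: `next_inflight` is a hypothetical name, and two details are assumptions not stated in the summary, namely holding steady when the error rate is between 1% and 5% and never backing off below the initial value.

```python
def next_inflight(current: int, error_rate: float,
                  max_inflight: int = 500, initial: int = 2,
                  step: int = 50) -> int:
    """Return the per-thread inflight limit for the next 2s interval."""
    if error_rate > 0.05:
        # Backoff: multiplicative decrease (floor at `initial` is assumed).
        return max(initial, current // 2)
    if error_rate < 0.01:
        if current < 0.25 * max_inflight:
            # Slow start: double until past 25% of max.
            return min(max_inflight, current * 2)
        # Congestion avoidance: additive increase.
        return min(max_inflight, current + step)
    # Between 1% and 5% errors: hold (assumed; not specified in the summary).
    return current

# Error-free ramp from the default initial value up to the cap.
levels = [2]
while levels[-1] < 500:
    levels.append(next_inflight(levels[-1], error_rate=0.0))
print(levels)
```

With no errors the ramp reproduces the sequence in the summary: 2, 4, 8, 16, 32, 64, 128, then 178, 228, and so on in steps of 50 until it is capped at 500.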

## Results

**2-client benchmark with fix (adaptive load control):**

| Metric | Client 0 | Client 1 |
|--------|----------|----------|
| Warmup QPS | 428,540 | 428,989 |
| Warmup Errors | **0** | **0** |
| Benchmark QPS | **482,367** | **482,775** |
| GET Errors | **0** | **0** |
| SET Errors | **0** | **0** |
| Hit Ratio | 100% | 100% |
| P50 Latency | 130ms | 130ms |
| P99 Latency | 263ms | 263ms |

Combined: **~965K QPS with 0 errors** across both clients.

Differential Revision: D98351095
)

Summary:

- Add createSameThreadClient() support to eliminate cross-thread message queue hops
- Workers run directly on McRouter proxy EventBases instead of separate thread pool
- Add use_same_thread_client flag/config through full stack (client, run.py, jobs YAML, automark)
- Add experiment config files for various benchmark configurations

Differential Revision: D98968871
…acebookresearch#638)

Summary:

Add configurable per-request CPU overhead simulation to ucachebench server to
help close the CPU utilization gap between ucachebench (~35% idle) and
production ucache (~9% idle). The simulation includes hash computation,
clock_gettime calls, and memory allocations that mimic production ACL checks,
CacheTable key construction, and serialization overhead.

Changes:
- Add cpu_overhead_level flag (0=disabled, 1=light, 2=medium, 3=heavy)
- Wire flag through run.py and jobs_internal.yml
- Add folly::hash and BenchmarkUtil deps
- Add exp_y config (fibers enabled) and exp_z config (fibers + overhead)

Experiment results:
- Exp V (baseline): 35% idle, 6.91M QPS
- Exp Y (fibers only): 24% idle, 6.91M QPS -- fibers reduce idle by 11pp
- Exp Z (fibers + overhead=3): 37% idle, 6.94M QPS -- simulation ineffective

Differential Revision: D99338676
Summary:
Implement production-like per-request overhead features to close the CPU
utilization gap between ucachebench (~46% idle) and production ucache (~9% idle).

Features added:
- Compound key construction (McStoredKey-style: "uc:pool:key:v1")
- MurmurHash2 key hashing (matching production getHashForKey)
- ACL prefix checks with F14FastMap lookup
- Overload protection with inflight request counting
- Stats tracking (12+ atomic increments per request)
- Ticket staleness checks
- Egress hash computation
- Response timestamps via clock_gettime

Also adds --production-features flag to run.py, jobs_internal.yml, and
server main.cpp to enable these features via automark config.

Differential Revision: D99338673
@meta-codesync meta-codesync Bot changed the title from "Add production-like per-request CPU overhead (#640)" to "Add production-like per-request CPU overhead" May 29, 2026
@charles-typ charles-typ force-pushed the export-D99338673-to-v2-beta branch from 91af76e to bf8119b Compare May 29, 2026 17:56
charles-typ added a commit to charles-typ/DCPerf that referenced this pull request May 29, 2026
Labels

CLA Signed, fb-exported, meta-exported
