branch-4.0: [feat](cloud) Add system rate limit for meta-service #61516 by github-actions[bot] · Pull Request #63931 · apache/doris

github-actions · 2026-06-01T03:57:35Z

Cherry-picked from #61516

## Summary

This PR introduces an automatic rate limiting mechanism for the Meta Service (MS) in Doris Cloud. When the Meta Service or its underlying FoundationDB (FDB) cluster is under heavy load, incoming RPC requests are proactively rejected with a `MS_TOO_BUSY` error code, preventing cascading failures and protecting system stability.

## Motivation

In production environments, the Meta Service can become overwhelmed due to high concurrency, FDB cluster performance degradation, or resource exhaustion (CPU/memory). Without a self-protection mechanism, this can lead to cascading failures, elevated latencies, and potential system-wide outages. This change adds a multi-dimensional stress detection system that automatically throttles requests when the system is under significant pressure.

## Design

### Stress Detection Dimensions

The rate limiter evaluates system stress across three independent dimensions, any of which can trigger rate limiting:

1. **FDB Cluster Pressure** (`fdb_cluster_under_pressure`)
   - Triggered when FDB commit latency exceeds `ms_rate_limit_fdb_commit_latency_ms` (default: 50ms) **OR** FDB read latency exceeds `ms_rate_limit_fdb_read_latency_ms` (default: 5ms)
   - **AND** the FDB `performance_limited_by` indicator reports a non-workload bottleneck (e.g., storage server, log server)
   - This ensures rate limiting only kicks in when FDB itself is the bottleneck, not when the cluster is simply handling a normal high workload
2. **FDB Client Thread Pressure** (`fdb_client_thread_under_pressure`)
   - Uses a sliding window (default: 60 seconds) to compute the average FDB client thread busyness percentage
   - Triggered when the window average exceeds `ms_rate_limit_fdb_client_thread_busyness_avg_percent` (default: 70%) **AND** the instantaneous busyness exceeds `ms_rate_limit_fdb_client_thread_busyness_instant_percent` (default: 90%)
   - The dual-threshold (average + instant) design avoids false positives from transient spikes
3. **MS Process Resource Pressure** (`ms_resource_under_pressure`)
   - Monitors the Meta Service process's own CPU and memory usage
   - Triggered when CPU usage (both current and window average) exceeds `ms_rate_limit_cpu_usage_percent` (default: 95%) **OR** memory usage (both current and window average) exceeds `ms_rate_limit_memory_usage_percent` (default: 95%)
   - CPU usage is calculated via `getrusage()` delta over wall-clock time, normalized by CPU core count
   - Memory usage is read from `/proc/self/status` (VmRSS) relative to total system memory via `sysinfo()`

### Sliding Window Mechanism

- A `MsStressDetector` class maintains a `std::deque<WindowSample>` of per-second samples
- Each sample records: FDB client thread busyness, MS CPU usage, MS memory usage
- Samples outside the configured window (`ms_rate_limit_window_seconds`, default: 60s) are evicted
- Window averages are only considered valid when the window is fully populated (i.e., the time span of samples covers the full window)

### Request Rejection Flow

- The `RPC_PREPROCESS` macro in `meta_service_helper.h` is augmented with rate limit checking logic
- Before processing any RPC request, `get_ms_stress_decision()` is called to collect current metrics and evaluate stress
- If `under_greate_stress()` returns true, the request is immediately rejected with `MetaServiceCode::MS_TOO_BUSY` (6002) and a detailed debug string describing the trigger reason
- On the BE side (`cloud_meta_mgr.cpp`), the `MS_TOO_BUSY` error code is recognized and the error message is propagated

### Fault Injection for Testing

- A fault injection mechanism is included for testing rate limiting behavior without actual system stress
- Controlled by `enable_ms_rate_limit_injection` (default: false) and `ms_rate_limit_injection_probability` (default: 5%, range: 0-100)
- When enabled, each request has a configurable probability of being artificially rate-limited
- Uses a thread-local `std::mt19937` random number generator for efficiency

### FDB Performance Limited By Metric

- A new bvar `g_bvar_fdb_performance_limited_by_name` is added to track the FDB `performance_limited_by.name` field from the FDB status JSON
- The value is mapped to: `0` if the limiter is "workload" (normal), `-1` otherwise (indicating an infrastructure bottleneck)
- This metric is collected in `metric.cpp` via a new `get_string_value` lambda that parses the FDB status JSON

## Configuration Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `enable_ms_rate_limit` | Bool | `true` | Master switch for rate limiting |
| `enable_ms_rate_limit_injection` | mBool | `false` | Enable fault injection for testing |
| `ms_rate_limit_injection_probability` | mInt32 | `5` | Injection probability (0-100%) |
| `ms_rate_limit_window_seconds` | mInt64 | `60` | Sliding window size in seconds |
| `ms_rate_limit_fdb_commit_latency_ms` | mInt64 | `50` | FDB commit latency threshold (ms) |
| `ms_rate_limit_fdb_read_latency_ms` | mInt64 | `5` | FDB read latency threshold (ms) |
| `ms_rate_limit_fdb_client_thread_busyness_avg_percent` | mInt64 | `70` | FDB client thread avg busyness threshold (%) |
| `ms_rate_limit_fdb_client_thread_busyness_instant_percent` | mInt64 | `90` | FDB client thread instant busyness threshold (%) |
| `ms_rate_limit_cpu_usage_percent` | mInt64 | `95` | MS process CPU usage threshold (%) |
| `ms_rate_limit_memory_usage_percent` | mInt64 | `95` | MS process memory usage threshold (%) |

All threshold parameters (those whose type is prefixed with `m`) are mutable at runtime without restart.

## Update RPC whitelist

Update the list:

```
curl -X POST http://<meta-service-host>:<port>/MetaService/http/set_rpc_rate_limit_whitelist \
  -H "Content-Type: application/json" \
  -d '{ "rpcs": ["commit_txn", "begin_txn", "get_txn"] }'
```

Get the list:

```
curl http://<meta-service-host>:<port>/MetaService/http/get_rpc_rate_limit_whitelist
{ "rpcs": [ "commit_txn", "begin_txn", "get_txn" ] }
```

Unset the list:

```
curl -X POST http://<meta-service-host>:<port>/MetaService/http/set_rpc_rate_limit_whitelist \
  -H "Content-Type: application/json" \
  -d '{"rpcs": []}'
```
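
For readers who want a concrete picture of the Design section, the sliding window and the three stress dimensions could be sketched roughly as below. This is not the code from the PR: `Sample`, `Thresholds`, `StressWindow`, and every method signature are hypothetical names chosen for illustration, the `performance_limited_by` name is passed as a plain string instead of going through the bvar, and treating a partially filled window as "not under pressure" is an assumption. Only the default values and the AND/OR structure come from the description.

```cpp
#include <cstdint>
#include <deque>
#include <string>

// Illustrative sketch only; all type and member names here are hypothetical.
struct Sample {
    int64_t second;       // sample timestamp, in seconds
    double fdb_busyness;  // FDB client thread busyness (%)
    double cpu;           // MS process CPU usage (%)
    double mem;           // MS process memory usage (%)
};

struct Thresholds {  // defaults copied from the configuration table
    int64_t window_seconds = 60;
    double commit_latency_ms = 50;
    double read_latency_ms = 5;
    double busyness_avg = 70;
    double busyness_instant = 90;
    double cpu = 95;
    double mem = 95;
};

class StressWindow {
public:
    explicit StressWindow(Thresholds t) : t_(t) {}

    // Append one per-second sample and evict samples older than the window.
    void add(const Sample& s) {
        samples_.push_back(s);
        while (samples_.back().second - samples_.front().second > t_.window_seconds) {
            samples_.pop_front();
        }
    }

    // Averages are only trusted once the samples span the whole window.
    bool full() const {
        return !samples_.empty() &&
               samples_.back().second - samples_.front().second >= t_.window_seconds;
    }

    // Dimension 1: slow commits OR slow reads, AND a non-workload limiter.
    bool fdb_cluster_under_pressure(double commit_ms, double read_ms,
                                    const std::string& limited_by) const {
        bool slow = commit_ms > t_.commit_latency_ms || read_ms > t_.read_latency_ms;
        return slow && limited_by != "workload";
    }

    // Dimension 2: window average AND latest sample both above their thresholds.
    bool fdb_client_thread_under_pressure() const {
        return full() && avg(&Sample::fdb_busyness) > t_.busyness_avg &&
               samples_.back().fdb_busyness > t_.busyness_instant;
    }

    // Dimension 3: CPU (current and average) OR memory (current and average).
    bool ms_resource_under_pressure() const {
        if (!full()) return false;
        const Sample& now = samples_.back();
        bool cpu_hot = now.cpu > t_.cpu && avg(&Sample::cpu) > t_.cpu;
        bool mem_hot = now.mem > t_.mem && avg(&Sample::mem) > t_.mem;
        return cpu_hot || mem_hot;
    }

    // Any single dimension is enough to reject the request.
    bool under_stress(double commit_ms, double read_ms,
                      const std::string& limited_by) const {
        return fdb_cluster_under_pressure(commit_ms, read_ms, limited_by) ||
               fdb_client_thread_under_pressure() || ms_resource_under_pressure();
    }

private:
    // Mean of one Sample field over the samples currently in the window.
    double avg(double Sample::*field) const {
        if (samples_.empty()) return 0.0;
        double sum = 0.0;
        for (const Sample& s : samples_) sum += s.*field;
        return sum / static_cast<double>(samples_.size());
    }

    Thresholds t_;
    std::deque<Sample> samples_;
};
```

In this sketch a single sample at 99% busyness after a quiet window does not trigger the client-thread dimension, because the window average stays below the 70% threshold. That is the transient-spike case the dual-threshold design is meant to filter out.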

hello-stephen · 2026-06-01T03:57:41Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

hello-stephen · 2026-06-01T03:57:45Z

run buildall

hello-stephen · 2026-06-01T05:23:02Z

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
|---|---|
| Function Coverage | 53.23% (19412/36466) |
| Line Coverage | 36.37% (181313/498486) |
| Region Coverage | 33.00% (140946/427136) |
| Branch Coverage | 33.87% (61001/180080) |

github-actions Bot requested review from morningman and yiguolei as code owners June 1, 2026 03:57

hello-stephen closed this Jun 1, 2026

hello-stephen reopened this Jun 1, 2026

github-actions[bot] wants to merge 1 commit into branch-4.0 from auto-pick-61516-branch-4.0
