branch-4.0: [feat](cloud) Add system rate limit for meta-service #61516 #63931
Open
github-actions[bot] wants to merge 1 commit into
## Summary
This PR introduces an automatic rate limiting mechanism for the Meta
Service (MS) in Doris Cloud. When the Meta Service or its underlying
FoundationDB (FDB) cluster is under heavy load, incoming RPC requests
will be proactively rejected with a `MS_TOO_BUSY` error code, preventing
cascading failures and protecting system stability.
## Motivation
In production environments, the Meta Service can become overwhelmed due
to high concurrency, FDB cluster performance degradation, or resource
exhaustion (CPU/memory). Without a self-protection mechanism, this can
lead to cascading failures, elevated latencies, and potential
system-wide outages. This change adds a multi-dimensional stress
detection system that automatically throttles requests when the system
is under significant pressure.
## Design
### Stress Detection Dimensions
The rate limiter evaluates system stress across three independent
dimensions, any of which can trigger rate limiting:
1. **FDB Cluster Pressure** (`fdb_cluster_under_pressure`)
   - Triggered when FDB commit latency exceeds
     `ms_rate_limit_fdb_commit_latency_ms` (default: 50ms) **OR** FDB read
     latency exceeds `ms_rate_limit_fdb_read_latency_ms` (default: 5ms)
   - **AND** the FDB `performance_limited_by` indicator reports a
     non-workload bottleneck (e.g., storage server, log server)
   - This ensures rate limiting only kicks in when FDB itself is the
     bottleneck, not when the cluster is simply handling a normal high
     workload
2. **FDB Client Thread Pressure** (`fdb_client_thread_under_pressure`)
   - Uses a sliding window (default: 60 seconds) to compute the average FDB
     client thread busyness percentage
   - Triggered when the window average exceeds
     `ms_rate_limit_fdb_client_thread_busyness_avg_percent` (default: 70%)
     **AND** the instantaneous busyness exceeds
     `ms_rate_limit_fdb_client_thread_busyness_instant_percent` (default:
     90%)
   - The dual-threshold (average + instant) design avoids false positives
     from transient spikes
3. **MS Process Resource Pressure** (`ms_resource_under_pressure`)
   - Monitors the Meta Service process's own CPU and memory usage
   - Triggered when CPU usage (both current and window average) exceeds
     `ms_rate_limit_cpu_usage_percent` (default: 95%) **OR** memory usage
     (both current and window average) exceeds
     `ms_rate_limit_memory_usage_percent` (default: 95%)
   - CPU usage is calculated via `getrusage()` delta over wall-clock time,
     normalized by CPU core count
   - Memory usage is read from `/proc/self/status` (VmRSS) relative to
     total system memory via `sysinfo()`
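The three dimensions above can be read as three boolean predicates that are OR-ed together. The following is a minimal sketch of that reading, not the code in this PR: the struct names, field names, and `evaluate()` are hypothetical, and treating a not-yet-full window as "no pressure" for dimensions 2 and 3 is an assumption based on the window-validity rule described in the next section. `cpu_usage_percent()` shows one plausible interpretation of "normalized by CPU core count".

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of the inputs; names are illustrative, not the PR's.
struct StressInputs {
    int64_t fdb_commit_latency_ms = 0;
    int64_t fdb_read_latency_ms = 0;
    bool fdb_limited_by_workload = true;  // performance_limited_by is "workload"
    bool window_full = false;             // window averages are valid
    double busyness_avg = 0.0, busyness_now = 0.0;  // FDB client thread, percent
    double cpu_avg = 0.0, cpu_now = 0.0;            // MS process, percent
    double mem_avg = 0.0, mem_now = 0.0;            // MS process, percent
};

// Defaults copied from the configuration table in this description.
struct StressThresholds {
    int64_t fdb_commit_latency_ms = 50;
    int64_t fdb_read_latency_ms = 5;
    double busyness_avg = 70.0, busyness_instant = 90.0;
    double cpu = 95.0, mem = 95.0;
};

struct StressDecision {
    bool fdb_cluster_under_pressure = false;
    bool fdb_client_thread_under_pressure = false;
    bool ms_resource_under_pressure = false;
    bool any() const {
        return fdb_cluster_under_pressure || fdb_client_thread_under_pressure ||
               ms_resource_under_pressure;
    }
};

// CPU percent from a getrusage()-style CPU-time delta over a wall-clock delta,
// divided by the core count (one reading of "normalized by CPU core count").
inline double cpu_usage_percent(double cpu_sec_delta, double wall_sec_delta, int cores) {
    assert(wall_sec_delta > 0.0 && cores > 0);
    return 100.0 * cpu_sec_delta / (wall_sec_delta * cores);
}

inline StressDecision evaluate(const StressInputs& in,
                               const StressThresholds& t = StressThresholds()) {
    StressDecision d;
    // 1. Slow FDB AND the limiter is something other than the workload itself.
    bool fdb_slow = in.fdb_commit_latency_ms > t.fdb_commit_latency_ms ||
                    in.fdb_read_latency_ms > t.fdb_read_latency_ms;
    d.fdb_cluster_under_pressure = fdb_slow && !in.fdb_limited_by_workload;
    // 2. Window average AND instantaneous busyness both over threshold.
    d.fdb_client_thread_under_pressure = in.window_full &&
                                         in.busyness_avg > t.busyness_avg &&
                                         in.busyness_now > t.busyness_instant;
    // 3. CPU OR memory, each requiring current and average over threshold.
    bool cpu_hot = in.cpu_now > t.cpu && in.cpu_avg > t.cpu;
    bool mem_hot = in.mem_now > t.mem && in.mem_avg > t.mem;
    d.ms_resource_under_pressure = in.window_full && (cpu_hot || mem_hot);
    return d;
}
```

Under this sketch, an 80ms commit latency alone does not trigger limiting while `performance_limited_by` still reports `workload`; it only does so once the limiter changes to an infrastructure component.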
### Sliding Window Mechanism
- A `MsStressDetector` class maintains a `std::deque<WindowSample>` of
per-second samples
- Each sample records: FDB client thread busyness, MS CPU usage, MS
memory usage
- Samples outside the configured window (`ms_rate_limit_window_seconds`,
default: 60s) are evicted
- Window averages are only considered valid when the window is fully
populated (i.e., the time span of samples covers the full window)
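A minimal sketch of such a window follows, assuming one sample per second with integer timestamps. The `SlidingWindow` class, its method names, and the `WindowSample` layout are illustrative stand-ins; the actual `MsStressDetector` may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical per-second sample; the PR's WindowSample may have other fields.
struct WindowSample {
    int64_t ts_sec;       // sample time in seconds
    double fdb_busyness;  // FDB client thread busyness, percent
    double cpu;           // MS CPU usage, percent
    double mem;           // MS memory usage, percent
};

class SlidingWindow {
public:
    explicit SlidingWindow(int64_t window_seconds) : window_seconds_(window_seconds) {
        assert(window_seconds > 0);
    }

    // Append a sample, then evict everything older than the window.
    void add(const WindowSample& s) {
        samples_.push_back(s);
        while (s.ts_sec - samples_.front().ts_sec > window_seconds_) {
            samples_.pop_front();
        }
    }

    // Averages are trusted only once the samples span the whole window.
    bool full() const {
        return !samples_.empty() &&
               samples_.back().ts_sec - samples_.front().ts_sec >= window_seconds_;
    }

    // Average of one field, selected by pointer-to-member.
    double average(double WindowSample::*field) const {
        if (samples_.empty()) return 0.0;
        double sum = 0.0;
        for (const auto& s : samples_) sum += s.*field;
        return sum / static_cast<double>(samples_.size());
    }

    std::size_t size() const { return samples_.size(); }

private:
    int64_t window_seconds_;
    std::deque<WindowSample> samples_;
};
```

A caller would check `full()` before comparing `average(&WindowSample::cpu)` against a threshold, so that a freshly started process is not throttled on a handful of samples.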
### Request Rejection Flow
- The `RPC_PREPROCESS` macro in `meta_service_helper.h` is augmented
with rate limit checking logic
- Before processing any RPC request, `get_ms_stress_decision()` is
called to collect current metrics and evaluate stress
- If `under_greate_stress()` returns true, the request is immediately
rejected with `MetaServiceCode::MS_TOO_BUSY` (6002) and a detailed debug
string describing the trigger reason
- On the BE side (`cloud_meta_mgr.cpp`), the `MS_TOO_BUSY` error code is
recognized and the error message is propagated
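The flow above can be sketched as a small pre-check that runs before the RPC handler. This is an illustration only: `Code`, `Decision`, `Response`, and `preprocess()` are hypothetical stand-ins for the macro logic, and the only value taken from this description is `MS_TOO_BUSY` (6002); `OK = 0` is an assumption.

```cpp
#include <cassert>
#include <string>

// Only MS_TOO_BUSY (6002) comes from this description; OK = 0 is assumed.
enum class Code : int { OK = 0, MS_TOO_BUSY = 6002 };

// Hypothetical result of the stress evaluation.
struct Decision {
    bool under_stress = false;
    std::string reason;  // e.g. which dimension triggered
};

// Hypothetical response header carrying the status back to the caller.
struct Response {
    Code code = Code::OK;
    std::string msg;
};

// Returns true if the handler may proceed, false if the request was rejected.
inline bool preprocess(bool rate_limit_enabled, const Decision& d, Response* resp) {
    assert(resp != nullptr);
    if (!rate_limit_enabled || !d.under_stress) {
        return true;
    }
    resp->code = Code::MS_TOO_BUSY;
    resp->msg = "meta service is too busy: " + d.reason;
    return false;
}
```

Rejecting before any handler work is done keeps the cost of a throttled request close to zero, which matters most precisely when the service is already overloaded.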
### Fault Injection for Testing
- A fault injection mechanism is included for testing rate limiting
behavior without actual system stress
- Controlled by `enable_ms_rate_limit_injection` (default: false) and
`ms_rate_limit_injection_probability` (default: 5%, range: 0-100)
- When enabled, each request has a configurable probability of being
artificially rate-limited
- Uses a thread-local `std::mt19937` random number generator for
efficiency
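A minimal sketch of the probabilistic injection, assuming the probability is an integer percentage in the documented 0-100 range. The function name `should_inject` and its parameters are hypothetical; in the PR the values come from the two config options above.

```cpp
#include <cassert>
#include <random>

// Returns true with roughly `probability_percent` percent chance when enabled.
inline bool should_inject(bool enabled, int probability_percent) {
    assert(probability_percent >= 0 && probability_percent <= 100);
    if (!enabled) {
        return false;
    }
    // One engine per thread: no lock is needed and the engine is seeded once.
    thread_local std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<int> dist(0, 99);  // 100 equally likely values
    return dist(gen) < probability_percent;
}
```

With the draw taken from `[0, 99]`, a probability of `0` never fires and `100` always fires, so the endpoints of the range behave as expected.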
### FDB Performance Limited By Metric
- A new bvar `g_bvar_fdb_performance_limited_by_name` is added to track
the FDB `performance_limited_by.name` field from the FDB status JSON
- The value is mapped to: `0` if the limiter is "workload" (normal),
`-1` otherwise (indicating an infrastructure bottleneck)
- This metric is collected in `metric.cpp` via a new `get_string_value`
lambda that parses the FDB status JSON
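The mapping itself is small enough to sketch directly. The function names below are hypothetical, and JSON parsing is left out; the input is assumed to be the already-extracted `performance_limited_by.name` string.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Maps the limiter name to the metric value described above:
// 0 for "workload", -1 for anything else. Note that in this sketch an
// empty or unrecognized name also maps to -1.
inline int64_t performance_limited_by_metric(const std::string& name) {
    return name == "workload" ? 0 : -1;
}

// Interprets the metric on the consumer side.
inline bool limited_by_infrastructure(int64_t metric) {
    assert(metric == 0 || metric == -1);
    return metric != 0;
}
```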
## Configuration Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `enable_ms_rate_limit` | Bool | `true` | Master switch for rate limiting |
| `enable_ms_rate_limit_injection` | mBool | `false` | Enable fault injection for testing |
| `ms_rate_limit_injection_probability` | mInt32 | `5` | Injection probability (0-100%) |
| `ms_rate_limit_window_seconds` | mInt64 | `60` | Sliding window size in seconds |
| `ms_rate_limit_fdb_commit_latency_ms` | mInt64 | `50` | FDB commit latency threshold (ms) |
| `ms_rate_limit_fdb_read_latency_ms` | mInt64 | `5` | FDB read latency threshold (ms) |
| `ms_rate_limit_fdb_client_thread_busyness_avg_percent` | mInt64 | `70` | FDB client thread avg busyness threshold (%) |
| `ms_rate_limit_fdb_client_thread_busyness_instant_percent` | mInt64 | `90` | FDB client thread instant busyness threshold (%) |
| `ms_rate_limit_cpu_usage_percent` | mInt64 | `95` | MS process CPU usage threshold (%) |
| `ms_rate_limit_memory_usage_percent` | mInt64 | `95` | MS process memory usage threshold (%) |
All parameters whose type is prefixed with `m` (`mBool`, `mInt32`,
`mInt64`) are mutable at runtime without restart.
## Update RPC whitelist
Update the list:
```
curl -X POST http://<meta-service-host>:<port>/MetaService/http/set_rpc_rate_limit_whitelist \
-H "Content-Type: application/json" \
-d '{
"rpcs": ["commit_txn", "begin_txn", "get_txn"]
}'
```
Get the list:
```
curl http://<meta-service-host>:<port>/MetaService/http/get_rpc_rate_limit_whitelist
{
"rpcs": [
"commit_txn",
"begin_txn",
"get_txn"
]
}
```
Clear the list:
```
curl -X POST http://<meta-service-host>:<port>/MetaService/http/set_rpc_rate_limit_whitelist \
-H "Content-Type: application/json" \
-d '{"rpcs": []}'
```
Cherry-picked from #61516