
branch-4.0: [feat](cloud) Add system rate limit for meta-service #61516 #63931

Open
github-actions[bot] wants to merge 1 commit into branch-4.0 from auto-pick-61516-branch-4.0

@github-actions github-actions Bot commented Jun 1, 2026

Cherry-picked from #61516

## Summary

This PR introduces an automatic rate limiting mechanism for the Meta
Service (MS) in Doris Cloud. When the Meta Service or its underlying
FoundationDB (FDB) cluster is under heavy load, incoming RPC requests
will be proactively rejected with a `MS_TOO_BUSY` error code, preventing
cascading failures and protecting system stability.

## Motivation

In production environments, the Meta Service can become overwhelmed due
to high concurrency, FDB cluster performance degradation, or resource
exhaustion (CPU/memory). Without a self-protection mechanism, this can
lead to cascading failures, elevated latencies, and potential
system-wide outages. This change adds a multi-dimensional stress
detection system that automatically throttles requests when the system
is under significant pressure.

## Design

### Stress Detection Dimensions

The rate limiter evaluates system stress across three independent
dimensions, any of which can trigger rate limiting:

1. **FDB Cluster Pressure** (`fdb_cluster_under_pressure`)
   - Triggered when FDB commit latency exceeds
     `ms_rate_limit_fdb_commit_latency_ms` (default: 50ms) **OR** FDB read
     latency exceeds `ms_rate_limit_fdb_read_latency_ms` (default: 5ms)
   - **AND** the FDB `performance_limited_by` indicator reports a
     non-workload bottleneck (e.g., storage server, log server)
   - This ensures rate limiting only kicks in when FDB itself is the
     bottleneck, not when the cluster is simply handling a normal high
     workload

2. **FDB Client Thread Pressure** (`fdb_client_thread_under_pressure`)
   - Uses a sliding window (default: 60 seconds) to compute the average FDB
     client thread busyness percentage
   - Triggered when the window average exceeds
     `ms_rate_limit_fdb_client_thread_busyness_avg_percent` (default: 70%)
     **AND** the instantaneous busyness exceeds
     `ms_rate_limit_fdb_client_thread_busyness_instant_percent` (default:
     90%)
   - The dual-threshold (average + instant) design avoids false positives
     from transient spikes

3. **MS Process Resource Pressure** (`ms_resource_under_pressure`)
   - Monitors the Meta Service process's own CPU and memory usage
   - Triggered when CPU usage (both current and window average) exceeds
     `ms_rate_limit_cpu_usage_percent` (default: 95%) **OR** memory usage
     (both current and window average) exceeds
     `ms_rate_limit_memory_usage_percent` (default: 95%)
   - CPU usage is calculated via `getrusage()` delta over wall-clock time,
     normalized by CPU core count
   - Memory usage is read from `/proc/self/status` (VmRSS) relative to
     total system memory via `sysinfo()`
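
For reviewers, the three predicates above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `StressMetrics`, `StressThresholds`, `under_stress`, and all field names are hypothetical, and the sketch assumes that a window average cannot trigger anything until the window is fully populated.

```cpp
#include <cstdint>

// Hypothetical snapshot of the collected metrics (names are illustrative).
struct StressMetrics {
    std::int64_t fdb_commit_latency_ms = 0;
    std::int64_t fdb_read_latency_ms = 0;
    bool fdb_limited_by_workload = true;  // performance_limited_by == "workload"
    bool window_full = false;             // window averages are valid
    double busyness_avg = 0.0, busyness_now = 0.0;  // FDB client thread busyness (%)
    double cpu_avg = 0.0, cpu_now = 0.0;            // MS CPU usage (%)
    double mem_avg = 0.0, mem_now = 0.0;            // MS memory usage (%)
};

// Defaults copied from the configuration table in this description.
struct StressThresholds {
    std::int64_t fdb_commit_latency_ms = 50;
    std::int64_t fdb_read_latency_ms = 5;
    double busyness_avg = 70.0;
    double busyness_instant = 90.0;
    double cpu = 95.0;
    double mem = 95.0;
};

// Dimension 1: slow commits OR slow reads, AND a non-workload bottleneck.
inline bool fdb_cluster_under_pressure(const StressMetrics& m, const StressThresholds& t) {
    const bool slow = m.fdb_commit_latency_ms > t.fdb_commit_latency_ms ||
                      m.fdb_read_latency_ms > t.fdb_read_latency_ms;
    return slow && !m.fdb_limited_by_workload;
}

// Dimension 2: window average AND instantaneous busyness both above threshold.
inline bool fdb_client_thread_under_pressure(const StressMetrics& m, const StressThresholds& t) {
    return m.window_full && m.busyness_avg > t.busyness_avg &&
           m.busyness_now > t.busyness_instant;
}

// Dimension 3: CPU (current and average) OR memory (current and average).
inline bool ms_resource_under_pressure(const StressMetrics& m, const StressThresholds& t) {
    if (!m.window_full) return false;
    const bool cpu = m.cpu_now > t.cpu && m.cpu_avg > t.cpu;
    const bool mem = m.mem_now > t.mem && m.mem_avg > t.mem;
    return cpu || mem;
}

// Any single dimension is enough to trigger rate limiting.
inline bool under_stress(const StressMetrics& m, const StressThresholds& t) {
    return fdb_cluster_under_pressure(m, t) ||
           fdb_client_thread_under_pressure(m, t) ||
           ms_resource_under_pressure(m, t);
}
```

Note that in this sketch a high commit latency alone does not trip dimension 1 while FDB reports `workload` as its limiter, which matches the intent stated above.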

### Sliding Window Mechanism

- A `MsStressDetector` class maintains a `std::deque<WindowSample>` of
per-second samples
- Each sample records: FDB client thread busyness, MS CPU usage, MS
memory usage
- Samples outside the configured window (`ms_rate_limit_window_seconds`,
default: 60s) are evicted
- Window averages are only considered valid when the window is fully
populated (i.e., the time span of samples covers the full window)
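
A rough sketch of such a window, tracking a single value per sample for brevity. The class name `SlidingWindow`, its methods, and the exact eviction boundary are assumptions for illustration; the real `MsStressDetector` may differ. Here a sample is evicted once it is more than `window_seconds` older than the newest one, and the average is reported only when the retained samples span the full window.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// One per-second sample (a single metric here; the PR records three).
struct WindowSample {
    std::int64_t second;  // sample timestamp, in seconds
    double value;         // e.g. FDB client thread busyness (%)
};

class SlidingWindow {
public:
    explicit SlidingWindow(std::int64_t window_seconds)
            : window_seconds_(window_seconds) {}

    // Append a sample and evict everything older than the window.
    void add(std::int64_t now_second, double value) {
        samples_.push_back(WindowSample{now_second, value});
        while (!samples_.empty() &&
               samples_.front().second < now_second - window_seconds_) {
            samples_.pop_front();
        }
    }

    // Writes the average to *out and returns true only when the samples
    // span the full window; otherwise the average is not considered valid.
    bool average(double* out) const {
        if (samples_.empty()) return false;
        const std::int64_t span = samples_.back().second - samples_.front().second;
        if (span < window_seconds_) return false;
        double sum = 0.0;
        for (const WindowSample& s : samples_) sum += s.value;
        *out = sum / static_cast<double>(samples_.size());
        return true;
    }

    std::size_t size() const { return samples_.size(); }

private:
    std::int64_t window_seconds_;
    std::deque<WindowSample> samples_;
};
```

Summing on demand keeps the sketch simple; a running sum would avoid the per-query loop at the cost of some floating-point drift.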

### Request Rejection Flow

- The `RPC_PREPROCESS` macro in `meta_service_helper.h` is augmented
with rate limit checking logic
- Before processing any RPC request, `get_ms_stress_decision()` is
called to collect current metrics and evaluate stress
- If `under_greate_stress()` returns true, the request is immediately
rejected with `MetaServiceCode::MS_TOO_BUSY` (6002) and a detailed debug
string describing the trigger reason
- On the BE side (`cloud_meta_mgr.cpp`), the `MS_TOO_BUSY` error code is
recognized and the error message is propagated
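
The pre-check can be pictured roughly like this. Everything below is a simplified stand-in rather than the actual `RPC_PREPROCESS` macro: `Code`, `Response`, `StressDecision`, and `preprocess` are hypothetical names, and only the `MS_TOO_BUSY` value (6002) and the `under_greate_stress()` spelling are taken from this description.

```cpp
#include <string>

// Stand-in for MetaServiceCode; only the value 6002 comes from the PR text.
enum class Code : int { OK = 0, MS_TOO_BUSY = 6002 };

struct Response {
    Code code = Code::OK;
    std::string msg;
};

// Hypothetical result of evaluating the three stress dimensions.
struct StressDecision {
    bool fdb_cluster_under_pressure = false;
    bool fdb_client_thread_under_pressure = false;
    bool ms_resource_under_pressure = false;

    bool under_greate_stress() const {
        return fdb_cluster_under_pressure || fdb_client_thread_under_pressure ||
               ms_resource_under_pressure;
    }

    // Names every dimension that triggered, for the error message.
    std::string debug_string() const {
        std::string s;
        if (fdb_cluster_under_pressure) s += "fdb_cluster_under_pressure ";
        if (fdb_client_thread_under_pressure) s += "fdb_client_thread_under_pressure ";
        if (ms_resource_under_pressure) s += "ms_resource_under_pressure ";
        return s;
    }
};

// Returns true when the RPC handler should proceed, false when the
// request was rejected and resp has been filled in.
inline bool preprocess(const StressDecision& decision, bool rate_limit_enabled,
                       Response* resp) {
    if (rate_limit_enabled && decision.under_greate_stress()) {
        resp->code = Code::MS_TOO_BUSY;
        resp->msg = "meta service is too busy: " + decision.debug_string();
        return false;
    }
    return true;
}
```

Rejecting before any handler work is done keeps the cost of a throttled request close to zero, which is the point of doing the check in the preprocessing step.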

### Fault Injection for Testing

- A fault injection mechanism is included for testing rate limiting
behavior without actual system stress
- Controlled by `enable_ms_rate_limit_injection` (default: false) and
`ms_rate_limit_injection_probability` (default: 5%, range: 0-100)
- When enabled, each request has a configurable probability of being
artificially rate-limited
- Uses thread-local `std::mt19937` random number generator for
efficiency
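
A minimal sketch of the injection check, assuming the probability is an integer percentage in the range 0-100. The function name `should_inject_rate_limit` is hypothetical, and seeding from `std::random_device` is an assumption, not necessarily what the PR does.

```cpp
#include <random>

// Returns true when this request should be artificially rate-limited.
// `enabled` mirrors enable_ms_rate_limit_injection and `probability_percent`
// mirrors ms_rate_limit_injection_probability (0-100).
inline bool should_inject_rate_limit(bool enabled, int probability_percent) {
    if (!enabled || probability_percent <= 0) return false;
    if (probability_percent >= 100) return true;
    // One generator per thread: no locking and no shared state between RPC threads.
    thread_local std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<int> dist(0, 99);  // uniform over 100 outcomes
    return dist(rng) < probability_percent;
}
```

With the default of 5, roughly one request in twenty would be rejected while injection is enabled.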

### FDB Performance Limited By Metric

- A new bvar `g_bvar_fdb_performance_limited_by_name` is added to track
the FDB `performance_limited_by.name` field from the FDB status JSON
- The value is mapped to: `0` if the limiter is "workload" (normal),
`-1` otherwise (indicating an infrastructure bottleneck)
- This metric is collected in `metric.cpp` via a new `get_string_value`
lambda that parses the FDB status JSON
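
The mapping itself is small enough to state as code. The helper names below are hypothetical and the JSON parsing done in `metric.cpp` is omitted; only the `0` / `-1` encoding comes from this description.

```cpp
#include <cstdint>
#include <string>

// Encodes the FDB performance_limited_by.name string as a numeric metric:
// 0 for "workload" (normal), -1 for anything else.
inline std::int64_t performance_limited_by_to_metric(const std::string& name) {
    return name == "workload" ? 0 : -1;
}

// True when the metric indicates an infrastructure bottleneck.
inline bool limited_by_infrastructure(std::int64_t metric_value) {
    return metric_value != 0;
}
```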

## Configuration Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `enable_ms_rate_limit` | Bool | `true` | Master switch for rate limiting |
| `enable_ms_rate_limit_injection` | mBool | `false` | Enable fault injection for testing |
| `ms_rate_limit_injection_probability` | mInt32 | `5` | Injection probability (0-100%) |
| `ms_rate_limit_window_seconds` | mInt64 | `60` | Sliding window size in seconds |
| `ms_rate_limit_fdb_commit_latency_ms` | mInt64 | `50` | FDB commit latency threshold (ms) |
| `ms_rate_limit_fdb_read_latency_ms` | mInt64 | `5` | FDB read latency threshold (ms) |
| `ms_rate_limit_fdb_client_thread_busyness_avg_percent` | mInt64 | `70` | FDB client thread avg busyness threshold (%) |
| `ms_rate_limit_fdb_client_thread_busyness_instant_percent` | mInt64 | `90` | FDB client thread instant busyness threshold (%) |
| `ms_rate_limit_cpu_usage_percent` | mInt64 | `95` | MS process CPU usage threshold (%) |
| `ms_rate_limit_memory_usage_percent` | mInt64 | `95` | MS process memory usage threshold (%) |

All parameters whose type is prefixed with `m` (e.g., `mInt64`,
`mBool`) are mutable at runtime without restart. The master switch
`enable_ms_rate_limit` is a plain `Bool` and is not.

## Update RPC whitelist

Set the whitelist:
```
curl -X POST http://<meta-service-host>:<port>/MetaService/http/set_rpc_rate_limit_whitelist \
-H "Content-Type: application/json" \
-d '{
  "rpcs": ["commit_txn", "begin_txn", "get_txn"]
}'
```
Get the current whitelist:
```
curl http://<meta-service-host>:<port>/MetaService/http/get_rpc_rate_limit_whitelist
{
"rpcs": [
  "commit_txn",
  "begin_txn",
  "get_txn"
]
}
```
Clear the whitelist by posting an empty list:
```
curl -X POST http://<meta-service-host>:<port>/MetaService/http/set_rpc_rate_limit_whitelist \
-H "Content-Type: application/json" \
-d '{"rpcs": []}'
```
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@hello-stephen
Contributor

run buildall

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
|---|---|
| Function Coverage | 53.23% (19412/36466) |
| Line Coverage | 36.37% (181313/498486) |
| Region Coverage | 33.00% (140946/427136) |
| Branch Coverage | 33.87% (61001/180080) |
