[feat](cloud) Add system rate limit for meta-service by wyxxxcat · Pull Request #61516 · apache/doris

wyxxxcat · 2026-03-19T06:54:40Z

Summary

This PR introduces an automatic rate limiting mechanism for the Meta Service (MS) in Doris Cloud. When the Meta Service or its underlying FoundationDB (FDB) cluster is under heavy load, incoming RPC requests are proactively rejected with an MS_RATE_LIMIT error code, preventing cascading failures and protecting system stability.

Motivation

In production environments, the Meta Service can become overwhelmed due to high concurrency, FDB cluster performance degradation, or resource exhaustion (CPU/memory). Without a self-protection mechanism, this can lead to cascading failures, elevated latencies, and potential system-wide outages. This change adds a multi-dimensional stress detection system that automatically throttles requests when the system is under significant pressure.

Design

Stress Detection Dimensions

The rate limiter evaluates system stress across three independent dimensions, any of which can trigger rate limiting:

FDB Cluster Pressure (fdb_cluster_under_pressure)
- Triggered when FDB commit latency exceeds ms_rate_limit_fdb_commit_latency_ms (default: 50ms) OR FDB read latency exceeds ms_rate_limit_fdb_read_latency_ms (default: 5ms)
- AND the FDB performance_limited_by indicator reports a non-workload bottleneck (e.g., storage server, log server)
- This ensures rate limiting only kicks in when FDB itself is the bottleneck, not when the cluster is simply handling a normal high workload

FDB Client Thread Pressure (fdb_client_thread_under_pressure)
- Uses a sliding window (default: 60 seconds) to compute the average FDB client thread busyness percentage
- Triggered when the window average exceeds ms_rate_limit_fdb_client_thread_busyness_avg_percent (default: 70%) AND the instantaneous busyness exceeds ms_rate_limit_fdb_client_thread_busyness_instant_percent (default: 90%)
- The dual-threshold (average + instant) design avoids false positives from transient spikes

MS Process Resource Pressure (ms_resource_under_pressure)
- Monitors the Meta Service process's own CPU and memory usage
- Triggered when CPU usage (both current and window average) exceeds ms_rate_limit_cpu_usage_percent (default: 95%) OR memory usage (both current and window average) exceeds ms_rate_limit_memory_usage_percent (default: 95%)
- CPU usage is calculated via getrusage() delta over wall-clock time, normalized by CPU core count
- Memory usage is read from /proc/self/status (VmRSS) relative to total system memory via sysinfo()
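
The three checks above can be summarized as one pure function. This is a minimal sketch rather than the PR's implementation: `StressInputs`, `StressThresholds`, `StressDecision` and `evaluate_stress` are hypothetical names, and gating the window-based checks on a `window_full` flag is an assumption drawn from the sliding window description. Only the default thresholds and the AND/OR combinations come from the text.

```cpp
// Hypothetical metric snapshot; field names are illustrative, not from the PR.
struct StressInputs {
    double fdb_commit_latency_ms = 0;
    double fdb_read_latency_ms = 0;
    bool fdb_limited_by_workload = true;  // performance_limited_by is "workload"
    double busyness_avg = 0;              // FDB client thread busyness, window average (%)
    double busyness_instant = 0;          // FDB client thread busyness, latest sample (%)
    double cpu_now = 0, cpu_avg = 0;      // MS process CPU usage (%)
    double mem_now = 0, mem_avg = 0;      // MS process memory usage (%)
    bool window_full = false;             // assumption: averages are ignored until the window is full
};

// Defaults mirror the configuration values quoted in the description.
struct StressThresholds {
    double fdb_commit_latency_ms = 50;
    double fdb_read_latency_ms = 5;
    double busyness_avg_percent = 70;
    double busyness_instant_percent = 90;
    double cpu_usage_percent = 95;
    double memory_usage_percent = 95;
};

struct StressDecision {
    bool fdb_cluster_under_pressure = false;
    bool fdb_client_thread_under_pressure = false;
    bool ms_resource_under_pressure = false;

    // Any single dimension is enough to trigger rate limiting.
    bool any() const {
        return fdb_cluster_under_pressure || fdb_client_thread_under_pressure ||
               ms_resource_under_pressure;
    }
};

inline StressDecision evaluate_stress(const StressInputs& in,
                                      const StressThresholds& t = StressThresholds{}) {
    StressDecision d;

    // Dimension 1: (commit OR read latency too high) AND FDB is not merely workload-limited.
    const bool latency_high = in.fdb_commit_latency_ms > t.fdb_commit_latency_ms ||
                              in.fdb_read_latency_ms > t.fdb_read_latency_ms;
    d.fdb_cluster_under_pressure = latency_high && !in.fdb_limited_by_workload;

    // Dimension 2: window average AND instantaneous busyness both above their thresholds.
    d.fdb_client_thread_under_pressure = in.window_full &&
                                         in.busyness_avg > t.busyness_avg_percent &&
                                         in.busyness_instant > t.busyness_instant_percent;

    // Dimension 3: CPU (current AND average) OR memory (current AND average) above threshold.
    const bool cpu_high = in.cpu_now > t.cpu_usage_percent && in.cpu_avg > t.cpu_usage_percent;
    const bool mem_high = in.mem_now > t.memory_usage_percent &&
                          in.mem_avg > t.memory_usage_percent;
    d.ms_resource_under_pressure = in.window_full && (cpu_high || mem_high);

    return d;
}
```

In this sketch a transient spike (for example, instantaneous busyness of 99% with a window average of 40%) leaves `any()` false, which is the behavior the dual-threshold design is meant to give.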

Sliding Window Mechanism

- A MsStressDetector class maintains a std::deque<WindowSample> of per-second samples
- Each sample records: FDB client thread busyness, MS CPU usage, MS memory usage
- Samples outside the configured window (ms_rate_limit_window_seconds, default: 60s) are evicted
- Window averages are only considered valid when the window is fully populated (i.e., the time span of samples covers the full window)
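
A deque-backed window along these lines might look like the sketch below. It is not the PR's `MsStressDetector`: the `SlidingWindow` class, the `second` timestamp field, and the exact boundary rules (evict when a sample is more than `window_seconds` older than the newest one, treat the window as full when the span reaches `window_seconds`) are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// One per-second sample; the metric fields follow the description, the timestamp is assumed.
struct WindowSample {
    std::int64_t second;
    double fdb_client_thread_busyness;
    double ms_cpu_usage;
    double ms_memory_usage;
};

class SlidingWindow {
public:
    explicit SlidingWindow(std::int64_t window_seconds) : window_seconds_(window_seconds) {}

    // Append a sample, then evict everything that fell out of the window.
    void add(const WindowSample& s) {
        const std::int64_t now = s.second;
        samples_.push_back(s);
        while (now - samples_.front().second > window_seconds_) {
            samples_.pop_front();
        }
    }

    // Averages are only meaningful once the samples span the whole window.
    bool full() const {
        return !samples_.empty() &&
               samples_.back().second - samples_.front().second >= window_seconds_;
    }

    // Average of one metric, selected by pointer-to-member; 0 for an empty window.
    double average(double WindowSample::*field) const {
        if (samples_.empty()) return 0.0;
        double sum = 0.0;
        for (const WindowSample& s : samples_) sum += s.*field;
        return sum / static_cast<double>(samples_.size());
    }

    std::size_t size() const { return samples_.size(); }

private:
    std::int64_t window_seconds_;
    std::deque<WindowSample> samples_;
};
```

A caller would add one sample per second and consult `average(&WindowSample::ms_cpu_usage)` only when `full()` returns true, so a freshly started process does not act on a partially filled window.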

Request Rejection Flow

- The RPC_PREPROCESS macro in meta_service_helper.h is augmented with rate limit checking logic
- Before processing any RPC request, get_ms_stress_decision() is called to collect current metrics and evaluate stress
- If under_greate_stress() returns true, the request is immediately rejected with MetaServiceCode::MS_RATE_LIMIT (6002) and a detailed debug string describing the trigger reason
- On the BE side (cloud_meta_mgr.cpp), the MS_RATE_LIMIT error code is recognized and the error message is propagated
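
The pre-check can be pictured as a guard that runs before the RPC body. The sketch below is illustrative only and does not reproduce the RPC_PREPROCESS macro: the local `MetaServiceCode` enum is a stand-in for the real protobuf enum (only the 6002 value is taken from the text), and `RpcStatus`, `StressDecision`, `under_stress`, `debug_string` and `rate_limit_precheck` are hypothetical names.

```cpp
#include <string>

// Stand-in for the real protobuf enum; only MS_RATE_LIMIT = 6002 comes from the PR text.
enum class MetaServiceCode : int { OK = 0, MS_RATE_LIMIT = 6002 };

// Hypothetical response status carried back to the caller.
struct RpcStatus {
    MetaServiceCode code = MetaServiceCode::OK;
    std::string msg;
};

// Hypothetical result of the stress evaluation, one flag per dimension.
struct StressDecision {
    bool fdb_cluster_under_pressure = false;
    bool fdb_client_thread_under_pressure = false;
    bool ms_resource_under_pressure = false;

    bool under_stress() const {
        return fdb_cluster_under_pressure || fdb_client_thread_under_pressure ||
               ms_resource_under_pressure;
    }

    // Names every dimension that fired, so the rejection explains itself.
    std::string debug_string() const {
        std::string out = "rate limited:";
        if (fdb_cluster_under_pressure) out += " fdb_cluster_under_pressure";
        if (fdb_client_thread_under_pressure) out += " fdb_client_thread_under_pressure";
        if (ms_resource_under_pressure) out += " ms_resource_under_pressure";
        return out;
    }
};

// Returns true when the request may proceed; otherwise fills in the rejection status.
inline bool rate_limit_precheck(bool rate_limit_enabled, const StressDecision& decision,
                                RpcStatus* status) {
    if (!rate_limit_enabled || !decision.under_stress()) return true;
    status->code = MetaServiceCode::MS_RATE_LIMIT;
    status->msg = decision.debug_string();
    return false;
}
```

Returning early before any request work is done keeps the rejection path cheap, which matters because it runs precisely when the process is short on resources.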

Fault Injection for Testing

- A fault injection mechanism is included for testing rate limiting behavior without actual system stress
- Controlled by enable_ms_rate_limit_injection (default: false) and ms_rate_limit_injection_probability (default: 5%, range: 0-100)
- When enabled, each request has a configurable probability of being artificially rate-limited
- Uses thread-local std::mt19937 random number generator for efficiency
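
A probability check with a thread-local generator could be sketched as follows. The function name `should_inject_rate_limit` and its parameters are hypothetical; they stand in for the two config values named above, and seeding from `std::random_device` is an assumption.

```cpp
#include <random>

// Returns true when this request should be artificially rate-limited.
// probability_percent is expected in [0, 100]; values outside are clamped by the early returns.
inline bool should_inject_rate_limit(bool injection_enabled, int probability_percent) {
    if (!injection_enabled || probability_percent <= 0) return false;
    if (probability_percent >= 100) return true;

    // One generator per thread: no locking, and it is seeded only once per thread.
    thread_local std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<int> dist(0, 99);  // 100 equally likely outcomes
    return dist(rng) < probability_percent;
}
```

With the default of 5, roughly one request in twenty would be rejected in this sketch, which is enough to exercise client retry paths without stalling a test cluster.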

FDB Performance Limited By Metric

- A new bvar g_bvar_fdb_performance_limited_by_name is added to track the FDB performance_limited_by.name field from the FDB status JSON
- The value is mapped to: 0 if the limiter is "workload" (normal), -1 otherwise (indicating an infrastructure bottleneck)
- This metric is collected in metric.cpp via a new get_string_value lambda that parses the FDB status JSON
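
The string-to-number mapping is small enough to show directly. This sketch covers only the mapping step; `map_performance_limited_by` is a hypothetical helper name, and the JSON parsing done by the get_string_value lambda is not reproduced here.

```cpp
#include <cstdint>
#include <string>

// Maps the performance_limited_by.name string to the numeric value exported by the metric:
// 0 when FDB is limited only by workload, -1 for any other limiter.
inline std::int64_t map_performance_limited_by(const std::string& name) {
    return name == "workload" ? 0 : -1;
}
```

Collapsing every non-workload limiter to -1 keeps the metric numeric, at the cost of hiding which component is the bottleneck; the specific name would still have to be read from the FDB status JSON.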

Configuration Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `enable_ms_rate_limit` | Bool | `true` | Master switch for rate limiting |
| `enable_ms_rate_limit_injection` | mBool | `false` | Enable fault injection for testing |
| `ms_rate_limit_injection_probability` | mInt32 | `5` | Injection probability (0-100%) |
| `ms_rate_limit_window_seconds` | mInt64 | `60` | Sliding window size in seconds |
| `ms_rate_limit_fdb_commit_latency_ms` | mInt64 | `50` | FDB commit latency threshold (ms) |
| `ms_rate_limit_fdb_read_latency_ms` | mInt64 | `5` | FDB read latency threshold (ms) |
| `ms_rate_limit_fdb_client_thread_busyness_avg_percent` | mInt64 | `70` | FDB client thread avg busyness threshold (%) |
| `ms_rate_limit_fdb_client_thread_busyness_instant_percent` | mInt64 | `90` | FDB client thread instant busyness threshold (%) |
| `ms_rate_limit_cpu_usage_percent` | mInt64 | `95` | MS process CPU usage threshold (%) |
| `ms_rate_limit_memory_usage_percent` | mInt64 | `95` | MS process memory usage threshold (%) |

All threshold parameters (those whose type is prefixed with m, such as mInt64) are mutable at runtime without a restart.

Thearas · 2026-03-19T06:54:49Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

wyxxxcat requested review from dataroaring, gavinchou and w41ter as code owners March 19, 2026 06:54

wyxxxcat force-pushed the ms_rate_auto_adjust branch 5 times, most recently from 5937d85 to 9195eb3 March 19, 2026 08:46

wyxxxcat force-pushed the ms_rate_auto_adjust branch from 9195eb3 to 4ee3658 March 19, 2026 08:48

wyxxxcat wants to merge 1 commit into apache:master from wyxxxcat:ms_rate_auto_adjust
