[feat](cloud) Add system rate limit for meta-service #61516

Open
wyxxxcat wants to merge 1 commit into apache:master from wyxxxcat:ms_rate_auto_adjust

wyxxxcat commented Mar 19, 2026

Summary

This PR introduces an automatic rate limiting mechanism for the Meta Service (MS) in Doris Cloud. When the Meta Service or its underlying FoundationDB (FDB) cluster is under heavy load, incoming RPC requests are proactively rejected with an MS_RATE_LIMIT error code, preventing cascading failures and protecting system stability.

Motivation

In production environments, the Meta Service can become overwhelmed due to high concurrency, FDB cluster performance degradation, or resource exhaustion (CPU/memory). Without a self-protection mechanism, this can lead to cascading failures, elevated latencies, and potential system-wide outages. This change adds a multi-dimensional stress detection system that automatically throttles requests when the system is under significant pressure.

Design

Stress Detection Dimensions

The rate limiter evaluates system stress across three independent dimensions, any of which can trigger rate limiting:

  1. FDB Cluster Pressure (fdb_cluster_under_pressure)

    • Triggered when FDB commit latency exceeds ms_rate_limit_fdb_commit_latency_ms (default: 50ms) OR FDB read latency exceeds ms_rate_limit_fdb_read_latency_ms (default: 5ms)
    • AND the FDB performance_limited_by indicator reports a non-workload bottleneck (e.g., storage server, log server)
    • This ensures rate limiting only kicks in when FDB itself is the bottleneck, not when the cluster is simply handling a normal high workload
  2. FDB Client Thread Pressure (fdb_client_thread_under_pressure)

    • Uses a sliding window (default: 60 seconds) to compute the average FDB client thread busyness percentage
    • Triggered when the window average exceeds ms_rate_limit_fdb_client_thread_busyness_avg_percent (default: 70%) AND the instantaneous busyness exceeds ms_rate_limit_fdb_client_thread_busyness_instant_percent (default: 90%)
    • The dual-threshold (average + instant) design avoids false positives from transient spikes
  3. MS Process Resource Pressure (ms_resource_under_pressure)

    • Monitors the Meta Service process's own CPU and memory usage
    • Triggered when CPU usage (both current and window average) exceeds ms_rate_limit_cpu_usage_percent (default: 95%) OR memory usage (both current and window average) exceeds ms_rate_limit_memory_usage_percent (default: 95%)
    • CPU usage is calculated via getrusage() delta over wall-clock time, normalized by CPU core count
    • Memory usage is read from /proc/self/status (VmRSS) relative to total system memory via sysinfo()
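
The three trigger conditions above can be sketched as plain predicates. This is a minimal illustration, not the PR's implementation: the StressMetrics and Thresholds structs, their field names, and the window_valid gate are assumptions made for the example. Only the threshold defaults and the AND/OR structure come from the description. The cpu_usage_percent helper shows the normalization described for getrusage() deltas, with the deltas passed in as already-computed seconds.

```cpp
#include <cassert>
#include <string>

// Hypothetical snapshot of the inputs named in the description. The field
// names are illustrative and are not taken from the PR's code.
struct StressMetrics {
    double fdb_commit_latency_ms = 0;
    double fdb_read_latency_ms = 0;
    std::string fdb_limited_by = "workload";    // performance_limited_by.name
    bool window_valid = false;                  // assumed: window fully populated
    double busyness_avg = 0, busyness_now = 0;  // FDB client thread, percent
    double cpu_avg = 0, cpu_now = 0;            // MS process CPU, percent
    double mem_avg = 0, mem_now = 0;            // MS process memory, percent
};

// Defaults copied from the configuration table in the description.
struct Thresholds {
    double commit_latency_ms = 50;
    double read_latency_ms = 5;
    double busyness_avg = 70, busyness_instant = 90;
    double cpu = 95, mem = 95;
};

// (commit latency OR read latency too high) AND a non-workload limiter.
bool fdb_cluster_under_pressure(const StressMetrics& m, const Thresholds& t) {
    const bool slow = m.fdb_commit_latency_ms > t.commit_latency_ms ||
                      m.fdb_read_latency_ms > t.read_latency_ms;
    const bool infra_bottleneck = m.fdb_limited_by != "workload";
    return slow && infra_bottleneck;
}

// Window average AND instantaneous busyness must both exceed their thresholds.
bool fdb_client_thread_under_pressure(const StressMetrics& m, const Thresholds& t) {
    return m.window_valid && m.busyness_avg > t.busyness_avg &&
           m.busyness_now > t.busyness_instant;
}

// CPU (current AND average) OR memory (current AND average) above threshold.
bool ms_resource_under_pressure(const StressMetrics& m, const Thresholds& t) {
    if (!m.window_valid) return false;
    const bool cpu_hot = m.cpu_now > t.cpu && m.cpu_avg > t.cpu;
    const bool mem_hot = m.mem_now > t.mem && m.mem_avg > t.mem;
    return cpu_hot || mem_hot;
}

// Any single dimension is enough to trigger rate limiting.
bool any_dimension_under_pressure(const StressMetrics& m, const Thresholds& t) {
    return fdb_cluster_under_pressure(m, t) ||
           fdb_client_thread_under_pressure(m, t) ||
           ms_resource_under_pressure(m, t);
}

// CPU time consumed over a wall-clock interval, normalized by core count.
double cpu_usage_percent(double cpu_seconds, double wall_seconds, int cores) {
    assert(wall_seconds > 0 && cores > 0);
    return 100.0 * cpu_seconds / (wall_seconds * cores);
}
```

For example, 4 CPU-seconds consumed during 1 wall-clock second on an 8-core host normalizes to 50%. A high commit latency alone does not trip the FDB cluster check while performance_limited_by still reports "workload".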

Sliding Window Mechanism

  • A MsStressDetector class maintains a std::deque<WindowSample> of per-second samples
  • Each sample records: FDB client thread busyness, MS CPU usage, MS memory usage
  • Samples outside the configured window (ms_rate_limit_window_seconds, default: 60s) are evicted
  • Window averages are only considered valid when the window is fully populated (i.e., the time span of samples covers the full window)
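
A rough sketch of the window bookkeeping follows. The class name SlidingWindowSketch, its methods, and the WindowSample field names are hypothetical. The eviction boundary (samples older than `now - window` are dropped) and the "fully populated" test (newest minus oldest timestamp is at least the window length) are one plausible reading of the bullets above, not necessarily what MsStressDetector does.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// One per-second sample; field names are illustrative.
struct WindowSample {
    int64_t ts_sec;
    double fdb_busyness;
    double cpu;
    double mem;
};

class SlidingWindowSketch {
public:
    explicit SlidingWindowSketch(int64_t window_seconds) : window_(window_seconds) {}

    // Append a sample and evict everything older than (newest - window).
    void add(const WindowSample& s) {
        samples_.push_back(s);
        while (samples_.front().ts_sec < s.ts_sec - window_) {
            samples_.pop_front();
        }
    }

    // The window counts as populated once the retained samples span it fully.
    bool full() const {
        return !samples_.empty() &&
               samples_.back().ts_sec - samples_.front().ts_sec >= window_;
    }

    // Writes per-field averages to *out and returns true, or returns false
    // while the window is not yet fully populated.
    bool average(WindowSample* out) const {
        if (!full()) return false;
        WindowSample sum{samples_.back().ts_sec, 0, 0, 0};
        for (const WindowSample& s : samples_) {
            sum.fdb_busyness += s.fdb_busyness;
            sum.cpu += s.cpu;
            sum.mem += s.mem;
        }
        const double n = static_cast<double>(samples_.size());
        sum.fdb_busyness /= n;
        sum.cpu /= n;
        sum.mem /= n;
        *out = sum;
        return true;
    }

    std::size_t size() const { return samples_.size(); }

private:
    int64_t window_;
    std::deque<WindowSample> samples_;
};
```

Refusing to report an average until the window is full means a freshly started process cannot be throttled on the basis of a handful of early samples.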

Request Rejection Flow

  • The RPC_PREPROCESS macro in meta_service_helper.h is augmented with rate limit checking logic
  • Before processing any RPC request, get_ms_stress_decision() is called to collect current metrics and evaluate stress
  • If under_greate_stress() returns true, the request is immediately rejected with MetaServiceCode::MS_RATE_LIMIT (6002) and a detailed debug string describing the trigger reason
  • On the BE side (cloud_meta_mgr.cpp), the MS_RATE_LIMIT error code is recognized and the error message is propagated
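
The early-return shape of this flow might look roughly like the sketch below. It does not reproduce the RPC_PREPROCESS macro. SketchCode, StressDecision, Response, and handle_rpc are hypothetical stand-ins, and only the numeric code 6002 and the reject-before-processing order are taken from the description.

```cpp
#include <string>

// Stand-in for MetaServiceCode; only MS_RATE_LIMIT = 6002 is from the PR text.
enum class SketchCode { OK = 0, MS_RATE_LIMIT = 6002 };

// Hypothetical result of evaluating the three stress dimensions.
struct StressDecision {
    bool fdb_cluster = false;
    bool fdb_client_thread = false;
    bool ms_resource = false;
    std::string debug;  // human-readable trigger reason

    bool under_stress() const {
        return fdb_cluster || fdb_client_thread || ms_resource;
    }
};

struct Response {
    SketchCode code;
    std::string msg;
};

// Check the decision first; run the real handler only when not throttled.
template <typename Handler>
Response handle_rpc(bool rate_limit_enabled, const StressDecision& d, Handler handler) {
    if (rate_limit_enabled && d.under_stress()) {
        return Response{SketchCode::MS_RATE_LIMIT, "rate limited: " + d.debug};
    }
    return handler();
}
```

Because the check happens before the handler runs, a rejected request does not reach the handler at all, so it adds no further FDB work while the system is overloaded.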

Fault Injection for Testing

  • A fault injection mechanism is included for testing rate limiting behavior without actual system stress
  • Controlled by enable_ms_rate_limit_injection (default: false) and ms_rate_limit_injection_probability (default: 5%, range: 0-100)
  • When enabled, each request has a configurable probability of being artificially rate-limited
  • Uses thread-local std::mt19937 random number generator for efficiency
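
As an illustration, a probability check with a thread-local std::mt19937 could be written as follows. The function name should_inject_rate_limit and the handling of out-of-range probabilities are assumptions for the sketch. The two parameters correspond to enable_ms_rate_limit_injection and ms_rate_limit_injection_probability.

```cpp
#include <random>

// Returns true when this request should be artificially rate-limited.
// probability_percent is interpreted on a 0-100 scale.
bool should_inject_rate_limit(bool injection_enabled, int probability_percent) {
    if (!injection_enabled || probability_percent <= 0) return false;
    if (probability_percent >= 100) return true;
    // One generator per thread: no locking and no shared state between
    // RPC worker threads.
    thread_local std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<int> dist(0, 99);
    return dist(rng) < probability_percent;
}
```

With the default probability of 5, roughly one request in twenty would be rejected while injection is enabled.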

FDB Performance Limited By Metric

  • A new bvar g_bvar_fdb_performance_limited_by_name is added to track the FDB performance_limited_by.name field from the FDB status JSON
  • The value is mapped to: 0 if the limiter is "workload" (normal), -1 otherwise (indicating an infrastructure bottleneck)
  • This metric is collected in metric.cpp via a new get_string_value lambda that parses the FDB status JSON
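
The value mapping itself is small enough to show directly. The function name below is hypothetical, and the JSON parsing done by get_string_value is left out. The sketch only covers the step from the extracted name to the numeric bvar value.

```cpp
#include <cstdint>
#include <string>

// Maps performance_limited_by.name to the numeric metric value:
// 0 for "workload", -1 for any other limiter name.
int64_t limited_by_metric_value(const std::string& name) {
    return name == "workload" ? 0 : -1;
}
```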

Configuration Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| enable_ms_rate_limit | Bool | true | Master switch for rate limiting |
| enable_ms_rate_limit_injection | mBool | false | Enable fault injection for testing |
| ms_rate_limit_injection_probability | mInt32 | 5 | Injection probability (0-100%) |
| ms_rate_limit_window_seconds | mInt64 | 60 | Sliding window size in seconds |
| ms_rate_limit_fdb_commit_latency_ms | mInt64 | 50 | FDB commit latency threshold (ms) |
| ms_rate_limit_fdb_read_latency_ms | mInt64 | 5 | FDB read latency threshold (ms) |
| ms_rate_limit_fdb_client_thread_busyness_avg_percent | mInt64 | 70 | FDB client thread avg busyness threshold (%) |
| ms_rate_limit_fdb_client_thread_busyness_instant_percent | mInt64 | 90 | FDB client thread instant busyness threshold (%) |
| ms_rate_limit_cpu_usage_percent | mInt64 | 95 | MS process CPU usage threshold (%) |
| ms_rate_limit_memory_usage_percent | mInt64 | 95 | MS process memory usage threshold (%) |

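All parameters whose type is prefixed with m (mBool, mInt32, mInt64) are mutable at runtime without a restart. The master switch enable_ms_rate_limit has type Bool and is therefore not runtime-mutable.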

Thearas commented Mar 19, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

wyxxxcat force-pushed the ms_rate_auto_adjust branch 5 times, most recently from 5937d85 to 9195eb3, March 19, 2026 08:46
wyxxxcat force-pushed the ms_rate_auto_adjust branch from 9195eb3 to 4ee3658, March 19, 2026 08:48