[feat](cloud) Add system rate limit for meta-service #61516
Open
wyxxxcat wants to merge 1 commit into apache:master from
Summary
This PR introduces an automatic rate limiting mechanism for the Meta Service (MS) in Doris Cloud. When the Meta Service or its underlying FoundationDB (FDB) cluster is under heavy load, incoming RPC requests are proactively rejected with the MS_RATE_LIMIT error code, preventing cascading failures and protecting system stability.
Motivation
In production environments, the Meta Service can become overwhelmed due to high concurrency, FDB cluster performance degradation, or resource exhaustion (CPU/memory). Without a self-protection mechanism, this can lead to cascading failures, elevated latencies, and potential system-wide outages. This change adds a multi-dimensional stress detection system that automatically throttles requests when the system is under significant pressure.
Design
Stress Detection Dimensions
The rate limiter evaluates system stress across three independent dimensions, any of which can trigger rate limiting:
FDB Cluster Pressure (fdb_cluster_under_pressure)
- FDB commit latency exceeds ms_rate_limit_fdb_commit_latency_ms (default: 50ms) OR FDB read latency exceeds ms_rate_limit_fdb_read_latency_ms (default: 5ms)
- The performance_limited_by indicator reports a non-workload bottleneck (e.g., storage server, log server)

FDB Client Thread Pressure (fdb_client_thread_under_pressure)
- Average client thread busyness exceeds ms_rate_limit_fdb_client_thread_busyness_avg_percent (default: 70%) AND the instantaneous busyness exceeds ms_rate_limit_fdb_client_thread_busyness_instant_percent (default: 90%)

MS Process Resource Pressure (ms_resource_under_pressure)
- CPU usage exceeds ms_rate_limit_cpu_usage_percent (default: 95%) OR memory usage (both current and window average) exceeds ms_rate_limit_memory_usage_percent (default: 95%)
- CPU usage: getrusage() delta over wall-clock time, normalized by CPU core count
- Memory usage: /proc/self/status (VmRSS) relative to total system memory via sysinfo()

Sliding Window Mechanism
- The MsStressDetector class maintains a std::deque<WindowSample> of per-second samples
- Samples older than the window (ms_rate_limit_window_seconds, default: 60s) are evicted

Request Rejection Flow
- The RPC_PREPROCESS macro in meta_service_helper.h is augmented with rate limit checking logic
- get_ms_stress_decision() is called to collect current metrics and evaluate stress
- If under_greate_stress() returns true, the request is immediately rejected with MetaServiceCode::MS_RATE_LIMIT (6002) and a detailed debug string describing the trigger reason
- On the client side (cloud_meta_mgr.cpp), the MS_RATE_LIMIT error code is recognized and the error message is propagated

Fault Injection for Testing
- Controlled by enable_ms_rate_limit_injection (default: false) and ms_rate_limit_injection_probability (default: 5%, range: 0-100)
- Uses a std::mt19937 random number generator for efficiency

FDB Performance Limited By Metric
- g_bvar_fdb_performance_limited_by_name is added to track the FDB performance_limited_by.name field from the FDB status JSON
- The value is 0 if the limiter is "workload" (normal), -1 otherwise (indicating an infrastructure bottleneck)
- It is collected in metric.cpp via a new get_string_value lambda that parses the FDB status JSON

Configuration Parameters
| Parameter | Default |
| --- | --- |
| enable_ms_rate_limit | true |
| enable_ms_rate_limit_injection | false |
| ms_rate_limit_injection_probability | 5 |
| ms_rate_limit_window_seconds | 60 |
| ms_rate_limit_fdb_commit_latency_ms | 50 |
| ms_rate_limit_fdb_read_latency_ms | 5 |
| ms_rate_limit_fdb_client_thread_busyness_avg_percent | 70 |
| ms_rate_limit_fdb_client_thread_busyness_instant_percent | 90 |
| ms_rate_limit_cpu_usage_percent | 95 |
| ms_rate_limit_memory_usage_percent | 95 |

All threshold parameters (prefixed with m) are mutable at runtime without restart.
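
For reviewers, the per-second sampling and eviction described under Sliding Window Mechanism can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: the class name SlidingWindowSketch, the WindowSample fields, and the choice to expose a plain average are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical sample layout; the PR's real WindowSample fields are not shown here.
struct WindowSample {
    std::int64_t second; // wall-clock second at which the sample was taken
    double value;        // sampled metric, e.g. a latency in milliseconds
};

class SlidingWindowSketch {
public:
    explicit SlidingWindowSketch(std::int64_t window_seconds)
            : window_seconds_(window_seconds) {
        assert(window_seconds > 0);
    }

    // Append one sample, then evict everything that fell out of the window.
    void add(std::int64_t now_second, double value) {
        samples_.push_back(WindowSample{now_second, value});
        while (!samples_.empty() &&
               samples_.front().second <= now_second - window_seconds_) {
            samples_.pop_front();
        }
    }

    // Mean of the samples currently in the window (0 when the window is empty).
    double average() const {
        if (samples_.empty()) return 0.0;
        double sum = 0.0;
        for (const WindowSample& s : samples_) sum += s.value;
        return sum / static_cast<double>(samples_.size());
    }

    std::size_t size() const { return samples_.size(); }

private:
    std::int64_t window_seconds_;
    std::deque<WindowSample> samples_;
};
```

With a 60-second window and one sample per second, the deque holds at most 60 entries, so recomputing the average on demand stays cheap.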
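
The threshold rules listed under Stress Detection Dimensions can be sketched as a pure function over a metrics snapshot. The struct and function names below (StressMetrics, StressThresholds, StressDecisionSketch, evaluate_stress) are hypothetical and do not come from the PR. The sketch only encodes the conditions spelled out above: it leaves out the performance_limited_by check, because how that check combines with the latency thresholds is not stated here, and it treats CPU usage as a single value.

```cpp
#include <cassert>
#include <string>

// Defaults mirror the Configuration Parameters table.
struct StressThresholds {
    double fdb_commit_latency_ms = 50;
    double fdb_read_latency_ms = 5;
    double busyness_avg_percent = 70;
    double busyness_instant_percent = 90;
    double cpu_usage_percent = 95;
    double memory_usage_percent = 95;
};

// Hypothetical snapshot of collected metrics; field names are illustrative.
struct StressMetrics {
    double fdb_commit_latency_ms = 0;
    double fdb_read_latency_ms = 0;
    double busyness_avg_percent = 0;
    double busyness_instant_percent = 0;
    double cpu_usage_percent = 0;
    double memory_current_percent = 0;
    double memory_avg_percent = 0;
};

struct StressDecisionSketch {
    bool fdb_cluster_under_pressure = false;
    bool fdb_client_thread_under_pressure = false;
    bool ms_resource_under_pressure = false;

    // Any single dimension is enough to trigger rate limiting.
    bool any_under_pressure() const {
        return fdb_cluster_under_pressure || fdb_client_thread_under_pressure ||
               ms_resource_under_pressure;
    }

    // Comma-separated list of the dimensions that fired, or "none".
    std::string debug_string() const {
        std::string out;
        auto append = [&out](bool fired, const char* name) {
            if (!fired) return;
            if (!out.empty()) out += ",";
            out += name;
        };
        append(fdb_cluster_under_pressure, "fdb_cluster_under_pressure");
        append(fdb_client_thread_under_pressure, "fdb_client_thread_under_pressure");
        append(ms_resource_under_pressure, "ms_resource_under_pressure");
        return out.empty() ? "none" : out;
    }
};

inline StressDecisionSketch evaluate_stress(const StressMetrics& m,
                                            const StressThresholds& t = StressThresholds()) {
    StressDecisionSketch d;
    // Commit OR read latency over its threshold.
    d.fdb_cluster_under_pressure = m.fdb_commit_latency_ms > t.fdb_commit_latency_ms ||
                                   m.fdb_read_latency_ms > t.fdb_read_latency_ms;
    // Average AND instantaneous busyness must both be over their thresholds.
    d.fdb_client_thread_under_pressure =
            m.busyness_avg_percent > t.busyness_avg_percent &&
            m.busyness_instant_percent > t.busyness_instant_percent;
    // CPU over threshold, OR memory over threshold for both current and average.
    d.ms_resource_under_pressure =
            m.cpu_usage_percent > t.cpu_usage_percent ||
            (m.memory_current_percent > t.memory_usage_percent &&
             m.memory_avg_percent > t.memory_usage_percent);
    return d;
}
```

Requiring both the average and the instantaneous busyness, and both the current and the averaged memory usage, means a single short spike does not trip the limiter on its own.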
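
The probability-based injection described under Fault Injection for Testing could look something like the sketch below. The function name should_inject_rate_limit is hypothetical, and both the thread_local generator and the random_device seeding are assumptions rather than details confirmed by this description.

```cpp
#include <cassert>
#include <random>

// Decide whether to inject a rate-limit rejection.
// probability_percent is interpreted on a 0-100 scale.
template <typename Rng>
bool should_inject_rate_limit(bool enabled, int probability_percent, Rng& rng) {
    if (!enabled || probability_percent <= 0) return false;
    if (probability_percent >= 100) return true;
    std::uniform_int_distribution<int> dist(0, 99);
    return dist(rng) < probability_percent;
}

// Convenience overload using one generator per thread (an assumption),
// which avoids locking and repeated seeding on the RPC path.
inline bool should_inject_rate_limit(bool enabled, int probability_percent) {
    thread_local std::mt19937 rng{std::random_device{}()};
    return should_inject_rate_limit(enabled, probability_percent, rng);
}
```

Passing the generator explicitly, as in the first overload, makes the decision reproducible in unit tests by seeding std::mt19937 with a fixed value.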
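
The CPU measurement mentioned under MS Process Resource Pressure (getrusage() delta over wall-clock time, normalized by core count) can be sketched as below. The helper names are hypothetical, and summing user and system time is an assumption about what the delta covers.

```cpp
#include <cassert>
#include <sys/resource.h>
#include <sys/time.h>

// Total CPU time (user + system) consumed by this process so far, in seconds.
inline double process_cpu_seconds() {
    struct rusage ru {};
    if (getrusage(RUSAGE_SELF, &ru) != 0) return 0.0;
    auto to_seconds = [](const timeval& tv) {
        return static_cast<double>(tv.tv_sec) + static_cast<double>(tv.tv_usec) / 1e6;
    };
    return to_seconds(ru.ru_utime) + to_seconds(ru.ru_stime);
}

// CPU usage as a percentage of total machine capacity:
// CPU seconds consumed, divided by (wall seconds elapsed * number of cores).
inline double cpu_usage_percent(double cpu_seconds_delta, double wall_seconds_delta,
                                unsigned cores) {
    if (wall_seconds_delta <= 0.0 || cores == 0) return 0.0;
    return 100.0 * cpu_seconds_delta / (wall_seconds_delta * cores);
}
```

A sampler would call process_cpu_seconds() once per second, subtract the previous reading, and pass the difference with the elapsed wall time and a core count (for example from std::thread::hardware_concurrency(), again an assumption) to cpu_usage_percent().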