Skip to content

[opt](cloud) cache cluster id per query and drop redundant locks on getBackendId hot path#63636

Open
liaoxin01 wants to merge 5 commits into
apache:masterfrom
liaoxin01:opt-getbackendid-hotpath
Open

[opt](cloud) cache cluster id per query and drop redundant locks on getBackendId hot path#63636
liaoxin01 wants to merge 5 commits into
apache:masterfrom
liaoxin01:opt-getbackendid-hotpath

Conversation

@liaoxin01
Copy link
Copy Markdown
Contributor

Proposed changes

Issue Number: close #xxx

Problem

CloudReplica.getBackendId was the dominant FE hotspot in cloud-mode query/load planning. On a profiled query it accounted for 65.6% of total FE CPU samples (async-profiler). Every replica in the plan re-ran the full cluster-id resolution pipeline even though the resolved cluster id is identical for every tablet in the same request.

Call breakdown from the flame graph:

Tablet.getNormalReplicaBackendPathMap            68.6%
└─ CloudReplica.getBackendId                     65.6%
   └─ getCurrentClusterId                        61.5%
      ├─ getCloudClusterIdByName                 35.6%
      │  ├─ getCloudClusterNames                 24.7%   (rw-lock + ArrayList + stream.sorted)
      │  └─ waitForAutoStart                      9.7%   (StopWatch.start/stop even on NORMAL)
      ├─ getPhysicalCluster                      14.5%
      │  └─ getComputeGroupByName                14.5%   (rw-lock around ConcurrentHashMap)
      └─ getCloudStatusByName                     8.3%
         └─ getCloudStatusByIdNoLock             14.0%   (two stream passes, String.valueOf per BE)

ReadLock.unlock consistently outweighed ReadLock.lock in the profile -- a classic cache-line bouncing signature, meaning the read-lock CAS was already operating in the non-linear regime where adding concurrent threads disproportionately hurts throughput. With high tablet counts the call rate (tablets × replicas × concurrent_queries) easily reached 100k+ lock/unlock per second per FE, putting the FE one nudge away from a metastable collapse.

Changes

  • CloudReplica.getCurrentClusterId becomes public static so callers can resolve once per request. Adds getBackendIdWithClusterId(String) that bypasses the per-replica pipeline.
  • OlapTableSink.createLocation and FrontendServiceImpl.{createPartition, replacePartition} resolve the cluster id once before iterating tablets and pass it down via the new CloudTablet.getNormalReplicaBackendPathMap(String) overload. For a 10k-tablet query this collapses 10k full pipelines into 1.
  • CloudSystemInfoService.getComputeGroupByName: drop the rw-lock around ConcurrentHashMap reads, merge containsKey+get into a single get, guard the debug log behind isDebugEnabled (the map toString is expensive).
  • CloudSystemInfoService.containsCloudCluster (new): cheap existence check that replaces getCloudClusterNames().contains(name). Avoids an ArrayList copy + stream filter + natural sort + collect under a read lock for a single existence query.
  • CloudSystemInfoService.getCloudStatusByIdNoLock: single-pass loop with a precomputed NORMAL constant in place of two stream pipelines that re-evaluated String.valueOf(NORMAL) per backend.
  • CloudSystemInfoService.waitForAutoStart: fast-path return when the cluster is already NORMAL, skipping the withTemporaryNereidsTimeout wrap and the StopWatch.start/stop inside waitForClusterToResume (the while loop never executed in that state but the wrapping still ran per call -- ~3% of total FE CPU in the profile).

Behavior

Semantics preserved on all paths:

  • waitForAutoStart NORMAL fast-path: the original code already had existAliveBe = true as initializer, so waitForClusterToResume was a no-op for NORMAL clusters.
  • getComputeGroupByName without rw-lock: the brief window between a rename's two map updates can now return null; same as the existing read-only paths everywhere else in this class that already access these maps without locks.
  • containsCloudCluster matches getCloudClusterNames().contains(name) for non-empty name (empty-name filtering in getCloudClusterNames was for the returned list, not for .contains semantics).

Further comments

The cluster id resolution is hoisted only on the no-BE-endpoint paths (the hot ones in the profile). The endpoint-resolved path in FrontendServiceImpl already has its own resolution logic and is untouched.

Checklist(Required)

  1. Does it affect the original behavior:
    • Yes
    • No
    • I don't know
  2. Has unit tests been added:
    • Yes
    • No Need
    • Same as the original logic
  3. Has document been added or modified:
    • Yes
    • No Need
  4. Does it need to update dependencies:
    • Yes
    • No
  5. Are there any changes that cannot be rolled back:
    • Yes (If Yes, please explain WHY)
    • No

…etBackendId hot path

CloudReplica.getBackendId was the dominant FE hotspot (>65% CPU on profiled
queries): every replica in a query/load plan re-ran the full cluster-id
resolution pipeline (ConnectContext lookup, priv check, status check,
autoStart, existence check, plus rw-lock CAS pairs inside two of those calls)
even though the resolved cluster id is identical for every tablet in the same
request.

Changes:

- CloudReplica.getCurrentClusterId becomes public static so callers can resolve
  once per request; add getBackendIdWithClusterId(String) that bypasses the
  per-replica pipeline.
- OlapTableSink.createLocation and FrontendServiceImpl.{createPartition,
  replacePartition} resolve the cluster id once before iterating tablets and
  pass it down via the new CloudTablet.getNormalReplicaBackendPathMap(String)
  overload.
- CloudSystemInfoService.getComputeGroupByName: drop the rw-lock around
  ConcurrentHashMap reads, merge containsKey+get into a single get, and guard
  the debug log behind isDebugEnabled (the map toString is expensive).
- CloudSystemInfoService.containsCloudCluster: new cheap existence check that
  replaces getCloudClusterNames().contains(name); used by the hot path to
  avoid an ArrayList copy, stream filter, natural sort, and collect under a
  read lock for a single existence query.
- CloudSystemInfoService.getCloudStatusByIdNoLock: single-pass loop with a
  precomputed NORMAL constant in place of two stream pipelines that
  re-evaluated String.valueOf(NORMAL) per backend.
- CloudSystemInfoService.waitForAutoStart: fast path return when the cluster
  is already NORMAL, skipping the withTemporaryNereidsTimeout wrap and the
  StopWatch.start/stop inside waitForClusterToResume (the while loop never
  executed in that state but the wrapping still ran per call).
Copilot AI review requested due to automatic review settings May 25, 2026 15:25
@liaoxin01 liaoxin01 requested a review from gavinchou as a code owner May 25, 2026 15:25
@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@liaoxin01
Copy link
Copy Markdown
Contributor Author

run buildall

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes FE hot paths in cloud mode planning by hoisting “current cluster id” resolution to a per-request cache and reducing contention/allocations in CloudSystemInfoService, targeting the CloudReplica.getBackendId hotspot described in the PR.

Changes:

  • Cache resolved cloud cluster id once per request in location-building loops (planner sink + FE service RPCs) and thread it into tablet/replica backend-id resolution.
  • Reduce overhead in CloudSystemInfoService hot methods (drop redundant rw-lock reads, add containsCloudCluster, optimize cloud-status lookup, add NORMAL fast-path for auto-start).
  • Refactor CloudReplica.getCurrentClusterId to be public static and add a “cluster-id already known” backend-id variant.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java Cache cluster id once for create/replace partition tablet location building when no BE endpoint hint is provided.
fe/fe-core/src/main/java/org/apache/doris/planner/OlapTableSink.java Cache cluster id once per createLocation call and reuse for all tablets in cloud mode.
fe/fe-core/src/main/java/org/apache/doris/cloud/system/CloudSystemInfoService.java Remove redundant locking on CHM reads; add cheap cluster-existence check; optimize status lookup; add NORMAL fast-path in auto-start.
fe/fe-core/src/main/java/org/apache/doris/cloud/catalog/CloudTablet.java Add intended “clusterId fast path” for backend path map resolution (currently has a compilation issue due to signature clash).
fe/fe-core/src/main/java/org/apache/doris/cloud/catalog/CloudReplica.java Make cluster-id resolution reusable (public static), add backend-id fast path taking a pre-resolved cluster id, and reuse CloudSystemInfoService instance.
Comments suppressed due to low confidence (1)

fe/fe-core/src/main/java/org/apache/doris/cloud/catalog/CloudTablet.java:97

  • CloudTablet currently defines two methods with the exact same signature getNormalReplicaBackendPathMap(String) (one intended for clusterId at line 89, and one for beEndpoint at line 96). Java cannot overload methods that differ only by parameter name, so this will not compile. Rename one of the methods (e.g., getNormalReplicaBackendPathMapWithClusterId(...)) and update call sites accordingly to disambiguate cluster-id vs BE-endpoint usage.
    // getCurrentClusterId pipelines (ConnectContext lookup, priv check, status
    // check, autoStart, existence check, plus the read-lock CAS pair inside each)
    // down to a single resolution at the top of the call.
    public Multimap<Long, Long> getNormalReplicaBackendPathMap(String clusterId) throws UserException {
        TabletSlidingWindowAccessStats.recordTablet(getId());
        Multimap<Long, Long> pathMap = super.getNormalReplicaBackendPathMapImpl(null,
                (rep, be) -> ((CloudReplica) rep).getBackendIdWithClusterId(clusterId));
        return backendPathMapReprocess(pathMap);
    }

    public Multimap<Long, Long> getNormalReplicaBackendPathMap(String beEndpoint) throws UserException {
        TabletSlidingWindowAccessStats.recordTablet(getId());

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

try {
cachedClusterId = CloudReplica.getCurrentClusterId();
} catch (ComputeGroupException e) {
throw new UserException(InternalErrorCode.INTERNAL_ERR, e.getMessage());
@liaoxin01
Copy link
Copy Markdown
Contributor Author

run buildall

…nature

Rename the cluster-id-based overload to getNormalReplicaBackendPathMapByClusterId
to avoid colliding with the existing CloudTablet.getNormalReplicaBackendPathMap(String beEndpoint)
that is used by the BE-endpoint stream load path.
@liaoxin01
Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 31042 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 964a2a2eb158060b5f522ed0ec62cda9f4c91560, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17764	4073	4015	4015
q2	q3	10762	1389	815	815
q4	4685	464	349	349
q5	7547	2237	2117	2117
q6	243	174	134	134
q7	936	769	655	655
q8	9447	1731	1671	1671
q9	6895	4963	4907	4907
q10	6487	2217	1855	1855
q11	438	276	240	240
q12	694	426	295	295
q13	18194	3421	2745	2745
q14	264	256	232	232
q15	q16	830	780	708	708
q17	992	866	980	866
q18	6829	5698	5536	5536
q19	1183	1208	1001	1001
q20	533	421	266	266
q21	5621	2527	2330	2330
q22	429	353	305	305
Total cold run time: 100773 ms
Total hot run time: 31042 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4340	4263	4260	4260
q2	q3	4532	4901	4361	4361
q4	2108	2210	1406	1406
q5	4419	4292	4323	4292
q6	279	299	157	157
q7	2045	1813	1638	1638
q8	2490	2208	2163	2163
q9	7923	8025	7848	7848
q10	4864	4778	4454	4454
q11	591	425	404	404
q12	766	746	538	538
q13	3236	3613	2936	2936
q14	304	294	269	269
q15	q16	721	729	650	650
q17	1381	1325	1351	1325
q18	8116	7328	6995	6995
q19	1112	1106	1078	1078
q20	2216	2215	1954	1954
q21	5265	4601	4475	4475
q22	510	460	440	440
Total cold run time: 57218 ms
Total hot run time: 51643 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 172567 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 964a2a2eb158060b5f522ed0ec62cda9f4c91560, data reload: false

query5	4317	664	528	528
query6	332	215	196	196
query7	4268	553	311	311
query8	329	238	221	221
query9	8841	4054	4039	4039
query10	459	339	281	281
query11	5774	2378	2234	2234
query12	173	124	122	122
query13	1282	605	448	448
query14	6080	5469	5194	5194
query14_1	4492	4473	4475	4473
query15	213	207	181	181
query16	982	439	420	420
query17	944	725	572	572
query18	2454	478	348	348
query19	214	205	169	169
query20	133	132	128	128
query21	216	138	117	117
query22	13762	13640	13387	13387
query23	17340	16509	16242	16242
query23_1	16414	16474	16467	16467
query24	7422	1765	1345	1345
query24_1	1356	1331	1326	1326
query25	581	502	455	455
query26	1313	324	180	180
query27	2714	585	348	348
query28	4500	2050	2019	2019
query29	1009	655	523	523
query30	307	242	208	208
query31	1122	1083	952	952
query32	88	80	76	76
query33	547	363	297	297
query34	1187	1114	666	666
query35	778	790	704	704
query36	1425	1420	1237	1237
query37	161	115	97	97
query38	3230	3124	3050	3050
query39	936	917	908	908
query39_1	895	862	890	862
query40	230	150	131	131
query41	71	69	70	69
query42	110	113	110	110
query43	328	339	295	295
query44	
query45	217	210	194	194
query46	1058	1213	727	727
query47	2372	2369	2272	2272
query48	419	439	301	301
query49	651	513	407	407
query50	1040	351	253	253
query51	4364	4345	4345	4345
query52	109	110	96	96
query53	265	295	214	214
query54	330	309	273	273
query55	98	98	90	90
query56	331	330	320	320
query57	1491	1475	1351	1351
query58	308	304	281	281
query59	1634	1741	1508	1508
query60	311	306	304	304
query61	149	152	156	152
query62	690	653	589	589
query63	237	202	209	202
query64	2404	794	639	639
query65	
query66	1702	479	348	348
query67	29740	29691	29663	29663
query68	
query69	456	345	323	323
query70	1060	1033	943	943
query71	310	273	254	254
query72	3136	2648	2428	2428
query73	837	778	424	424
query74	5102	4969	4787	4787
query75	2683	2591	2269	2269
query76	2262	1172	789	789
query77	408	417	337	337
query78	12360	12449	11890	11890
query79	1461	1027	767	767
query80	651	532	443	443
query81	462	280	240	240
query82	1379	160	125	125
query83	365	279	248	248
query84	308	145	110	110
query85	864	584	448	448
query86	409	325	327	325
query87	3441	3361	3244	3244
query88	3620	2688	2725	2688
query89	446	387	343	343
query90	1978	178	179	178
query91	176	164	136	136
query92	79	76	72	72
query93	1489	1476	812	812
query94	521	357	314	314
query95	660	375	347	347
query96	1065	804	358	358
query97	2748	2746	2632	2632
query98	234	228	223	223
query99	1177	1154	1031	1031
Total cold run time: 254336 ms
Total hot run time: 172567 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 38.46% (35/91) 🎉
Increment coverage report
Complete coverage report

@liaoxin01
Copy link
Copy Markdown
Contributor Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a correctness regression in the hoisted cloud compute-group resolution. The optimization is valid for the hot tablet loop, but it currently executes before proving that any tablet location needs to be computed, which changes empty-location and cache-only paths from success to possible compute-group failure.

Critical checkpoint conclusions:

  • Goal/test: The PR targets reducing per-tablet FE overhead for cloud backend selection. The hot path is improved, but the changed eager resolution is not fully guarded; I did not see tests covering empty-location or auto-partition cache-only retry paths.
  • Scope/minimality: The main change is focused, but the hoist should be narrowed/lazy to preserve previous behavior.
  • Concurrency/lifecycle: No new persistent lifecycle state. The lock removal around getComputeGroupByName was reviewed; no blocking issue found beyond ordinary transient concurrent-DDL behavior.
  • Compatibility/config: No storage/protocol/config compatibility issue found.
  • Parallel paths: The same eager-resolution pattern appears in OlapTableSink.createLocation and FrontendServiceImpl create/replace partition paths; the fix should cover all of them.
  • Testing: Missing regression coverage for no-tablet and cached-location paths in cloud mode.
  • Observability/performance: The optimization intent is reasonable; after making resolution lazy, the hot path should still resolve once per request.

User focus: No additional user-provided review focus was specified.

// tablet queries this was the single dominant FE hotspot.
String cachedClusterId = null;
if (Config.isCloudMode()) {
try {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This now resolves the cloud compute group before knowing whether createLocation needs any tablet backend map. The method still has the partitionIds.isEmpty() path below that returns empty tablet lists for partition-by-function/initial no-partition cases; before this PR that path never called CloudReplica.getCurrentClusterId(), while now cloud mode can fail on missing context/default cluster/auth/manual-shutdown even though no tablet location is needed. Please resolve lazily on the first tablet, or at least skip this when partitionIds is empty, so the no-location path preserves its previous behavior.

// check, autoStart, existence check) per tablet across every partition and index.
String cachedClusterId = null;
if (Config.isCloudMode() && !request.isSetBeEndpoint()) {
try {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has the same eager-resolution problem for the auto-partition cache path: the compute group is resolved before checking getAutoPartitionInfo(). If all requested partitions are already cached for the txn, the old code returned those cached tablet locations without touching CloudReplica.getCurrentClusterId(); after this change, a cache-only retry can fail because the current context/default cluster/auth/status is unavailable even though no new backend mapping is needed. Please move the resolution into the cache-miss path and reuse it after the first miss; the replacePartition hoist below needs the same treatment.

…udSystemInfoService and resolve lazily

- Move getCurrentClusterId and resolveClusterIdByName from CloudReplica to
  CloudSystemInfoService. The methods only consult ConnectContext and Env,
  so they have no business on a Replica entity; CloudSystemInfoService is
  the natural home for cluster-name/id metadata.
- Drop the eager cluster-id resolution in OlapTableSink.createLocation,
  FrontendServiceImpl.createPartition and FrontendServiceImpl.replacePartition.
  Use a single-element holder that resolves on the first tablet that actually
  needs it. This preserves the pre-optimization behavior for paths that do not
  build any tablet location (empty partitions / materialized indices, and the
  auto-partition cache hit fast path in createPartition), which the previous
  eager resolve broke by failing on missing ConnectContext / unauthorized
  default cluster / manual-shutdown status even when no tablet mapping was
  needed.
- Drop the wrapped UserException in OlapTableSink that lost the original
  ComputeGroupException cause; the existing inner catch already handles
  ComputeGroupException (which extends UserException) and chains it with the
  tablet error context, so the wrapper was both lossy and redundant.
@liaoxin01 liaoxin01 requested a review from luwei16 as a code owner May 26, 2026 09:37
@liaoxin01
Copy link
Copy Markdown
Contributor Author

Addressed all three review comments in 096eabb:

  1. Moved getCurrentClusterId out of CloudReplica (it had no business there — pure ConnectContext + Env lookup, zero replica state). Now lives on CloudSystemInfoService as a public static, with the helper resolveClusterIdByName as an instance method on the same class.

  2. Lazy resolution in OlapTableSink.createLocation: dropped the eager pre-loop resolve. Cluster id is now resolved inside the existing tablet try { ... } catch (ComputeGroupException) block on first use, via a single-element String[] holder. Paths that produce no tablet locations (empty partitions/indices) no longer pay the resolution cost or fail on missing context.

  3. Lazy resolution in FrontendServiceImpl.{createPartition, replacePartition}: same pattern. createPartition's auto-partition cache hit fast path now skips the resolve entirely, restoring pre-optimization behavior for cache-only retries.

  4. Exception cause preserved: removed the lossy throw new UserException(INTERNAL_ERR, e.getMessage()) wrapper in OlapTableSink. The existing inner catch (ComputeGroupException e) already chains the original exception with the tablet error context.

run buildall

…g[1]

No lambda captures cachedClusterId, so the single-element array workaround is
unnecessary. A plain String local variable works the same and is clearer.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants