[opt](cloud) cache cluster id per query and drop redundant locks on getBackendId hot path#63636
[opt](cloud) cache cluster id per query and drop redundant locks on getBackendId hot path#63636liaoxin01 wants to merge 5 commits into
Conversation
…etBackendId hot path
CloudReplica.getBackendId was the dominant FE hotspot (>65% CPU on profiled
queries): every replica in a query/load plan re-ran the full cluster-id
resolution pipeline (ConnectContext lookup, priv check, status check,
autoStart, existence check, plus rw-lock CAS pairs inside two of those calls)
even though the resolved cluster id is identical for every tablet in the same
request.
Changes:
- CloudReplica.getCurrentClusterId becomes public static so callers can resolve
once per request; add getBackendIdWithClusterId(String) that bypasses the
per-replica pipeline.
- OlapTableSink.createLocation and FrontendServiceImpl.{createPartition,
replacePartition} resolve the cluster id once before iterating tablets and
pass it down via the new CloudTablet.getNormalReplicaBackendPathMap(String)
overload.
- CloudSystemInfoService.getComputeGroupByName: drop the rw-lock around
ConcurrentHashMap reads, merge containsKey+get into a single get, and guard
the debug log behind isDebugEnabled (the map toString is expensive).
- CloudSystemInfoService.containsCloudCluster: new cheap existence check that
replaces getCloudClusterNames().contains(name); used by the hot path to
avoid an ArrayList copy, stream filter, natural sort, and collect under a
read lock for a single existence query.
- CloudSystemInfoService.getCloudStatusByIdNoLock: single-pass loop with a
precomputed NORMAL constant in place of two stream pipelines that
re-evaluated String.valueOf(NORMAL) per backend.
- CloudSystemInfoService.waitForAutoStart: fast path return when the cluster
is already NORMAL, skipping the withTemporaryNereidsTimeout wrap and the
StopWatch.start/stop inside waitForClusterToResume (the while loop never
executed in that state but the wrapping still ran per call).
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
There was a problem hiding this comment.
Pull request overview
This PR optimizes FE hot paths in cloud mode planning by hoisting “current cluster id” resolution to a per-request cache and reducing contention/allocations in CloudSystemInfoService, targeting the CloudReplica.getBackendId hotspot described in the PR.
Changes:
- Cache resolved cloud cluster id once per request in location-building loops (planner sink + FE service RPCs) and thread it into tablet/replica backend-id resolution.
- Reduce overhead in
CloudSystemInfoServicehot methods (drop redundant rw-lock reads, addcontainsCloudCluster, optimize cloud-status lookup, add NORMAL fast-path for auto-start). - Refactor
CloudReplica.getCurrentClusterIdto bepublic staticand add a “cluster-id already known” backend-id variant.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java | Cache cluster id once for create/replace partition tablet location building when no BE endpoint hint is provided. |
| fe/fe-core/src/main/java/org/apache/doris/planner/OlapTableSink.java | Cache cluster id once per createLocation call and reuse for all tablets in cloud mode. |
| fe/fe-core/src/main/java/org/apache/doris/cloud/system/CloudSystemInfoService.java | Remove redundant locking on CHM reads; add cheap cluster-existence check; optimize status lookup; add NORMAL fast-path in auto-start. |
| fe/fe-core/src/main/java/org/apache/doris/cloud/catalog/CloudTablet.java | Add intended “clusterId fast path” for backend path map resolution (currently has a compilation issue due to signature clash). |
| fe/fe-core/src/main/java/org/apache/doris/cloud/catalog/CloudReplica.java | Make cluster-id resolution reusable (public static), add backend-id fast path taking a pre-resolved cluster id, and reuse CloudSystemInfoService instance. |
Comments suppressed due to low confidence (1)
fe/fe-core/src/main/java/org/apache/doris/cloud/catalog/CloudTablet.java:97
- CloudTablet currently defines two methods with the exact same signature
getNormalReplicaBackendPathMap(String)(one intended forclusterIdat line 89, and one forbeEndpointat line 96). Java cannot overload methods that differ only by parameter name, so this will not compile. Rename one of the methods (e.g.,getNormalReplicaBackendPathMapWithClusterId(...)) and update call sites accordingly to disambiguate cluster-id vs BE-endpoint usage.
// getCurrentClusterId pipelines (ConnectContext lookup, priv check, status
// check, autoStart, existence check, plus the read-lock CAS pair inside each)
// down to a single resolution at the top of the call.
public Multimap<Long, Long> getNormalReplicaBackendPathMap(String clusterId) throws UserException {
TabletSlidingWindowAccessStats.recordTablet(getId());
Multimap<Long, Long> pathMap = super.getNormalReplicaBackendPathMapImpl(null,
(rep, be) -> ((CloudReplica) rep).getBackendIdWithClusterId(clusterId));
return backendPathMapReprocess(pathMap);
}
public Multimap<Long, Long> getNormalReplicaBackendPathMap(String beEndpoint) throws UserException {
TabletSlidingWindowAccessStats.recordTablet(getId());
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| try { | ||
| cachedClusterId = CloudReplica.getCurrentClusterId(); | ||
| } catch (ComputeGroupException e) { | ||
| throw new UserException(InternalErrorCode.INTERNAL_ERR, e.getMessage()); |
|
run buildall |
…nature Rename the cluster-id-based overload to getNormalReplicaBackendPathMapByClusterId to avoid colliding with the existing CloudTablet.getNormalReplicaBackendPathMap(String beEndpoint) that is used by the BE-endpoint stream load path.
|
run buildall |
TPC-H: Total hot run time: 31042 ms |
TPC-DS: Total hot run time: 172567 ms |
FE Regression Coverage ReportIncrement line coverage |
|
/review |
There was a problem hiding this comment.
I found a correctness regression in the hoisted cloud compute-group resolution. The optimization is valid for the hot tablet loop, but it currently executes before proving that any tablet location needs to be computed, which changes empty-location and cache-only paths from success to possible compute-group failure.
Critical checkpoint conclusions:
- Goal/test: The PR targets reducing per-tablet FE overhead for cloud backend selection. The hot path is improved, but the changed eager resolution is not fully guarded; I did not see tests covering empty-location or auto-partition cache-only retry paths.
- Scope/minimality: The main change is focused, but the hoist should be narrowed/lazy to preserve previous behavior.
- Concurrency/lifecycle: No new persistent lifecycle state. The lock removal around
getComputeGroupByNamewas reviewed; no blocking issue found beyond ordinary transient concurrent-DDL behavior. - Compatibility/config: No storage/protocol/config compatibility issue found.
- Parallel paths: The same eager-resolution pattern appears in
OlapTableSink.createLocationandFrontendServiceImplcreate/replace partition paths; the fix should cover all of them. - Testing: Missing regression coverage for no-tablet and cached-location paths in cloud mode.
- Observability/performance: The optimization intent is reasonable; after making resolution lazy, the hot path should still resolve once per request.
User focus: No additional user-provided review focus was specified.
| // tablet queries this was the single dominant FE hotspot. | ||
| String cachedClusterId = null; | ||
| if (Config.isCloudMode()) { | ||
| try { |
There was a problem hiding this comment.
This now resolves the cloud compute group before knowing whether createLocation needs any tablet backend map. The method still has the partitionIds.isEmpty() path below that returns empty tablet lists for partition-by-function/initial no-partition cases; before this PR that path never called CloudReplica.getCurrentClusterId(), while now cloud mode can fail on missing context/default cluster/auth/manual-shutdown even though no tablet location is needed. Please resolve lazily on the first tablet, or at least skip this when partitionIds is empty, so the no-location path preserves its previous behavior.
| // check, autoStart, existence check) per tablet across every partition and index. | ||
| String cachedClusterId = null; | ||
| if (Config.isCloudMode() && !request.isSetBeEndpoint()) { | ||
| try { |
There was a problem hiding this comment.
This has the same eager-resolution problem for the auto-partition cache path: the compute group is resolved before checking getAutoPartitionInfo(). If all requested partitions are already cached for the txn, the old code returned those cached tablet locations without touching CloudReplica.getCurrentClusterId(); after this change, a cache-only retry can fail because the current context/default cluster/auth/status is unavailable even though no new backend mapping is needed. Please move the resolution into the cache-miss path and reuse it after the first miss; the replacePartition hoist below needs the same treatment.
…udSystemInfoService and resolve lazily - Move getCurrentClusterId and resolveClusterIdByName from CloudReplica to CloudSystemInfoService. The methods only consult ConnectContext and Env, so they have no business on a Replica entity; CloudSystemInfoService is the natural home for cluster-name/id metadata. - Drop the eager cluster-id resolution in OlapTableSink.createLocation, FrontendServiceImpl.createPartition and FrontendServiceImpl.replacePartition. Use a single-element holder that resolves on the first tablet that actually needs it. This preserves the pre-optimization behavior for paths that do not build any tablet location (empty partitions / materialized indices, and the auto-partition cache hit fast path in createPartition), which the previous eager resolve broke by failing on missing ConnectContext / unauthorized default cluster / manual-shutdown status even when no tablet mapping was needed. - Drop the wrapped UserException in OlapTableSink that lost the original ComputeGroupException cause; the existing inner catch already handles ComputeGroupException (which extends UserException) and chains it with the tablet error context, so the wrapper was both lossy and redundant.
|
Addressed all three review comments in 096eabb:
run buildall |
…g[1] No lambda captures cachedClusterId, so the single-element array workaround is unnecessary. A plain String local variable works the same and is clearer.
Proposed changes
Issue Number: close #xxx
Problem
CloudReplica.getBackendIdwas the dominant FE hotspot in cloud-mode query/load planning. On a profiled query it accounted for 65.6% of total FE CPU samples (async-profiler). Every replica in the plan re-ran the full cluster-id resolution pipeline even though the resolved cluster id is identical for every tablet in the same request.Call breakdown from the flame graph:
ReadLock.unlockconsistently outweighedReadLock.lockin the profile -- a classic cache-line bouncing signature, meaning the read-lock CAS was already operating in the non-linear regime where adding concurrent threads disproportionately hurts throughput. With high tablet counts the call rate (tablets × replicas × concurrent_queries) easily reached 100k+ lock/unlock per second per FE, putting the FE one nudge away from a metastable collapse.Changes
CloudReplica.getCurrentClusterIdbecomespublic staticso callers can resolve once per request. AddsgetBackendIdWithClusterId(String)that bypasses the per-replica pipeline.OlapTableSink.createLocationandFrontendServiceImpl.{createPartition, replacePartition}resolve the cluster id once before iterating tablets and pass it down via the newCloudTablet.getNormalReplicaBackendPathMap(String)overload. For a 10k-tablet query this collapses 10k full pipelines into 1.CloudSystemInfoService.getComputeGroupByName: drop the rw-lock aroundConcurrentHashMapreads, mergecontainsKey+getinto a singleget, guard the debug log behindisDebugEnabled(the maptoStringis expensive).CloudSystemInfoService.containsCloudCluster(new): cheap existence check that replacesgetCloudClusterNames().contains(name). Avoids an ArrayList copy + stream filter + natural sort + collect under a read lock for a single existence query.CloudSystemInfoService.getCloudStatusByIdNoLock: single-pass loop with a precomputedNORMALconstant in place of two stream pipelines that re-evaluatedString.valueOf(NORMAL)per backend.CloudSystemInfoService.waitForAutoStart: fast-path return when the cluster is alreadyNORMAL, skipping thewithTemporaryNereidsTimeoutwrap and theStopWatch.start/stopinsidewaitForClusterToResume(the while loop never executed in that state but the wrapping still ran per call -- ~3% of total FE CPU in the profile).Behavior
Semantics preserved on all paths:
waitForAutoStartNORMAL fast-path: the original code already hadexistAliveBe = trueas initializer, sowaitForClusterToResumewas a no-op for NORMAL clusters.getComputeGroupByNamewithout rw-lock: the brief window between a rename's two map updates can now returnnull; same as the existing read-only paths everywhere else in this class that already access these maps without locks.containsCloudClustermatchesgetCloudClusterNames().contains(name)for non-emptyname(empty-name filtering ingetCloudClusterNameswas for the returned list, not for.containssemantics).Further comments
The cluster id resolution is hoisted only on the no-BE-endpoint paths (the hot ones in the profile). The endpoint-resolved path in
FrontendServiceImplalready has its own resolution logic and is untouched.Checklist(Required)