branch-4.1: pick #61318, #60543, #60705 #61874
…educe memory (apache#61318)

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Reduce FE memory by:
1. moving top-N table stats filtering from PrometheusMetricVisitor into CloudTabletStatMgr so it's computed once per stat cycle instead of per Prometheus scrape,
2. removing the unused beToTablets field from InfightTask to avoid retaining a large map reference,
3. changing InfightTablet.tabletId from Long to long to avoid boxing overhead.

None

- Test
  - [ ] Regression test
  - [ ] Unit Test
  - [ ] Manual test (add detailed scripts or steps below)
  - [ ] No need to test or manual test. Explain why:
    - [ ] This is a refactor/code format and no logic has been changed.
    - [ ] Previous test can cover this change.
    - [ ] No code files have been changed.
    - [ ] Other reason
- Behavior changed:
  - [ ] No.
  - [ ] Yes.
- Does this need documentation?
  - [ ] No.
  - [ ] Yes.

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label
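
To illustrate the first item in the summary, here is a minimal sketch of computing the top-N list once per stat cycle and letting each metrics scrape read the cached result. All class, field, and method names (`TopNStatsCache`, `TableStat`, `refresh`, `topTables`) are hypothetical and chosen for illustration; this is not the code in CloudTabletStatMgr or PrometheusMetricVisitor. Ranking by data size is also an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical names throughout; not the Doris implementation.
final class TopNStatsCache {
    /** One row of per-table stats. The id is a primitive long, so no Long boxing. */
    static final class TableStat {
        final long tableId;
        final long dataSize;

        TableStat(long tableId, long dataSize) {
            this.tableId = tableId;
            this.dataSize = dataSize;
        }
    }

    private final int topN;
    // Replaced wholesale once per stat cycle; scrapes only read it.
    private volatile List<TableStat> snapshot = List.of();

    TopNStatsCache(int topN) {
        this.topN = topN;
    }

    /** Called once at the end of each stat cycle: sort and keep the N largest tables. */
    void refresh(List<TableStat> allStats) {
        List<TableStat> sorted = new ArrayList<>(allStats);
        sorted.sort(Comparator.comparingLong((TableStat s) -> s.dataSize).reversed());
        snapshot = List.copyOf(sorted.subList(0, Math.min(topN, sorted.size())));
    }

    /** Called on every metrics scrape: returns the cached list without sorting. */
    List<TableStat> topTables() {
        return snapshot;
    }
}
```

With this shape, the cost of sorting is paid by the stat cycle rather than by each scrape, and the scrape path allocates nothing beyond what it needs to render the metrics.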
run buildall
Pull request overview
Backports three cloud-mode improvements to the 4.1 branch to reduce FE memory usage, reduce meta-service get_tablet_stats RPC volume, and ensure cloud tablet stats are preserved/updated via checkpoints and FE-to-FE sync.
Changes:
- Add tablet-id propagation on commit/compaction notifications and use it to mark tablets “active” for faster stats refresh.
- Rework CloudTabletStatMgr to compute/prom-filter top-N table stats once per stats cycle, add an interval-ladder polling strategy, and push active tablet stats from master FE to other FEs.
- Enhance checkpointing to optionally trigger in cloud mode based on image staleness and to copy serving-env tablet stats into the checkpoint image.
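
The overview does not spell out how the interval ladder works. One plausible reading, sketched below, is that a recently changed tablet is polled at the shortest interval, each fetch that finds no change moves it one rung up to a longer interval, and marking it active drops it back to the first rung. The class name `IntervalLadder`, its methods, and the rung values are all assumptions made for illustration, not taken from the PR.

```java
// A sketch under stated assumptions; not the CloudTabletStatMgr implementation.
final class IntervalLadder {
    // Hypothetical rung values in milliseconds; not Doris defaults.
    private static final long[] RUNGS_MS = {60_000L, 300_000L, 1_800_000L};

    private int rung = 0;
    private long lastFetchMs;

    IntervalLadder(long nowMs) {
        this.lastFetchMs = nowMs;
    }

    /** A commit, compaction, or schema change touched the tablet: poll it soon again. */
    void markActive() {
        rung = 0;
    }

    /** True when the current rung's interval has elapsed since the last fetch. */
    boolean isDue(long nowMs) {
        return nowMs - lastFetchMs >= RUNGS_MS[rung];
    }

    /** Record a fetch: unchanged stats climb one rung, changed stats reset to the first. */
    void onFetched(long nowMs, boolean statsChanged) {
        lastFetchMs = nowMs;
        rung = statsChanged ? 0 : Math.min(rung + 1, RUNGS_MS.length - 1);
    }

    long currentIntervalMs() {
        return RUNGS_MS[rung];
    }
}
```

Under this reading, idle tablets drift toward the longest interval and stop generating frequent get_tablet_stats calls, while tablets that receive writes are refreshed quickly.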
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| gensrc/thrift/FrontendService.thrift | Adds tabletIds to commit report requests and introduces FE-to-FE syncCloudTabletStats RPC. |
| fe/fe-core/src/main/java/org/apache/doris/transaction/GlobalTransactionMgrIface.java | Extends commit callback to accept tabletIds for downstream stats refresh. |
| fe/fe-core/src/main/java/org/apache/doris/transaction/GlobalTransactionMgr.java | Updates interface implementation signature (no-op in non-cloud mgr). |
| fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java | Handles tabletIds on commit/compaction report; implements syncCloudTabletStats. |
| fe/fe-core/src/main/java/org/apache/doris/qe/SessionVariable.java | Adds session var to force tablet-stats sync via proc-tablets path. |
| fe/fe-core/src/main/java/org/apache/doris/persist/Storage.java | Tracks latest image mtime to support “stale image” checkpoint triggering. |
| fe/fe-core/src/main/java/org/apache/doris/metric/PrometheusMetricVisitor.java | Uses CloudTabletStatMgr precomputed totals/top-N instead of recomputing per scrape. |
| fe/fe-core/src/main/java/org/apache/doris/master/Checkpoint.java | Adds stale-image-based checkpoint trigger (cloud) + copies serving-env tablet stats into checkpoint image. |
| fe/fe-core/src/main/java/org/apache/doris/common/proc/TabletsProcDir.java | Optional “force sync tablet stats” behavior gated by session variable (cloud). |
| fe/fe-core/src/main/java/org/apache/doris/common/ClientPool.java | Adds a dedicated FE client pool for tablet-stats sync RPC. |
| fe/fe-core/src/main/java/org/apache/doris/cloud/transaction/CloudGlobalTransactionMgr.java | Threads tabletIds through commit path and marks tablets active post-commit. |
| fe/fe-core/src/main/java/org/apache/doris/cloud/catalog/CloudTabletRebalancer.java | Reduces memory retention by removing unused map and avoiding boxing for tabletId. |
| fe/fe-core/src/main/java/org/apache/doris/cloud/catalog/CloudReplica.java | Adds persisted fields for interval-ladder bookkeeping; shortens persisted keys. |
| fe/fe-core/src/main/java/org/apache/doris/catalog/CloudTabletStatMgr.java | Implements active/interval-ladder fetching, precomputed top-N filtering, and master-to-follower stats push. |
| fe/fe-core/src/main/java/org/apache/doris/alter/CloudSchemaChangeJobV2.java | Marks related tablets active after schema change completion. |
| fe/fe-core/src/main/java/org/apache/doris/alter/CloudRollupJobV2.java | Marks related tablets active after rollup completion. |
| fe/fe-common/src/main/java/org/apache/doris/common/Config.java | Adds cloud checkpoint staleness threshold + tablet stats sync/version configs. |
| be/src/cloud/cloud_meta_mgr.cpp | Sends tabletIds alongside commit/compaction notifications to FE. |
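
The stale-image checkpoint trigger described for Storage.java, Checkpoint.java, and Config.java can be pictured as a comparison between the latest image's modification time and a configured threshold. The sketch below assumes that reading; `StaleImageCheck`, `isImageStale`, and the rule that a non-positive threshold disables the check are hypothetical and not taken from the PR.

```java
// A sketch under stated assumptions; not the Checkpoint.java implementation.
final class StaleImageCheck {
    /**
     * Returns true when the latest image is older than the threshold.
     * A non-positive threshold disables the check (an assumption of this sketch).
     */
    static boolean isImageStale(long imageMtimeMs, long nowMs, long thresholdMs) {
        if (thresholdMs <= 0) {
            return false;
        }
        return nowMs - imageMtimeMs > thresholdMs;
    }
}
```

A checkpoint loop could combine such a check with its existing trigger condition, so that a cloud-mode FE still writes a fresh image, including the copied tablet stats, when the existing trigger has not fired for a long time.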
pick: