[fix](fe) cache version and get tablet stats actively for RestoreJob by mymeiyi · Pull Request #62704 · apache/doris

mymeiyi · 2026-04-22T07:36:51Z

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

Test
- Regression test
- Unit Test
- Manual test (add detailed scripts or steps below)
- No need to test or manual test. Explain why:
  - This is a refactor/code format and no logic has been changed.
  - Previous test can cover this change.
  - No code files have been changed.
  - Other reason
Behavior changed:
- No.
- Yes.
Does this need documentation?
- No.
- Yes.

Check List (For Reviewer who merge this PR)

Confirm the release note
Confirm test cases
Confirm document
Add branch pick label

Copilot

Pull request overview

This PR adjusts restore finalization behavior to better support cloud restores by syncing table/partition version caches across FEs and proactively triggering tablet stats collection after a restore completes.

Changes:

Make RestoreJob.updateOlapTablesVersion overridable and pass through the isReplay flag.
In CloudRestoreJob, push cached table/partition versions to other FEs after restore completion and mark restored tablets as “active” for stats collection.
Cache cloud table version after commitPartition(...) returns a table version; attempt to refresh cloud partition cached versions during PREPARE.
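
The first two changes amount to a template-method hook: the base job exposes a protected method that receives the replay flag, and the cloud subclass adds its own work while skipping side effects on replay. A minimal sketch of that shape, assuming simplified stand-in classes; `BaseRestoreJobSketch`, `CloudRestoreJobSketch`, and the `events` log are hypothetical names for illustration and are not the Doris classes:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for RestoreJob; only the hook shape follows the PR overview.
class BaseRestoreJobSketch {
    protected final List<String> events = new ArrayList<>();

    // Invoked when the restore finishes, and again when the edit log is replayed.
    void finish(boolean isReplay) {
        updateOlapTablesVersion("db1", isReplay);
    }

    // Protected so a subclass can extend it; isReplay is passed through unchanged.
    protected void updateOlapTablesVersion(String db, boolean isReplay) {
        events.add("update-local-versions:" + db);
    }

    List<String> events() {
        return events;
    }
}

// Hypothetical stand-in for CloudRestoreJob.
class CloudRestoreJobSketch extends BaseRestoreJobSketch {
    @Override
    protected void updateOlapTablesVersion(String db, boolean isReplay) {
        super.updateOlapTablesVersion(db, isReplay);
        if (isReplay) {
            return; // assumption: replay must not push to other FEs or trigger stats work
        }
        events.add("push-versions-to-other-fes:" + db);
        events.add("mark-restored-tablets-active:" + db);
    }
}
```

In this sketch a live finish records three events and a replayed finish records only the local update, which matches the review note that the cloud-specific sync is skipped on replay.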

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
fe/fe-core/src/main/java/org/apache/doris/cloud/backup/CloudRestoreJob.java	Adds cloud-specific version sync + active tablet stats hooks; caches table version on commit; updates partition cache handling in PREPARE.
fe/fe-core/src/main/java/org/apache/doris/backup/RestoreJob.java	Changes `updateOlapTablesVersion` to `protected` with `(db, isReplay)` signature and updates call site.

Copilot · 2026-04-22T07:43:13Z

+
+            // sync version
+            Map<CloudPartition, Pair<Long, Long>> partitionVersionMap = new HashMap<>(olapTable.getPartitions().size());
+            for (Partition partition : olapTable.getPartitions()) {


This loop unconditionally casts Partition to CloudPartition. In the same file you guard with partition instanceof CloudPartition (PREPARE path), which suggests non-cloud partitions are possible here as well; this cast can then throw ClassCastException and fail the restore. Filter/guard with instanceof (and decide how to handle non-cloud partitions) before adding to partitionVersionMap.

Suggested change

-            for (Partition partition : olapTable.getPartitions()) {
+            for (Partition partition : olapTable.getPartitions()) {
+                if (!(partition instanceof CloudPartition)) {
+                    continue;
+                }

mymeiyi · 2026-04-22T07:57:58Z

/review

github-actions

Superseded by the later corrected REQUEST_CHANGES review. Please use the latest review summary and inline comments as the authoritative result.

github-actions

Summary

This PR improves cloud restore cache refresh, but I found one blocking reliability issue and one restore scope performance issue.

Findings

The new live partition version lookup in the PREPARE path adds an unnecessary restore failure point before preparePartition runs.
The new active tablet refresh is scoped to the whole table instead of only the restored tablets, so partition restore can trigger a full table stats burst.
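
To illustrate the second finding, the sketch below narrows the refresh set to tablets that belong to restored partitions. It assumes the table layout is available as a map from partition id to tablet ids; the class `RestoreScopedTablets`, the method `tabletsToRefresh`, and both parameters are hypothetical and do not correspond to Doris APIs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

class RestoreScopedTablets {
    // partitionToTablets: partition id -> tablet ids for the whole table (assumed layout).
    // restoredPartitionIds: the partitions this restore job actually wrote.
    static List<Long> tabletsToRefresh(Map<Long, List<Long>> partitionToTablets,
                                       Set<Long> restoredPartitionIds) {
        List<Long> result = new ArrayList<>();
        for (Map.Entry<Long, List<Long>> entry : partitionToTablets.entrySet()) {
            // Skip partitions the restore did not touch, so a partition restore
            // does not schedule stats collection for every tablet in the table.
            if (restoredPartitionIds.contains(entry.getKey())) {
                result.addAll(entry.getValue());
            }
        }
        return result;
    }
}
```

With a two-partition table where only one partition is restored, only that partition's tablets are returned.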

Critical Checkpoints

Goal: cache cloud restore versions and proactively refresh tablet stats. The code only partially achieves this because the new PREPARE hook makes restore depend on an extra RPC that is not needed for correctness.
Scope: mostly small, but the tablet stats refresh is broader than the restore scope because it scans whole table tablets.
Concurrency: no new deadlock or lock order issue found.
Lifecycle and replay: the restore state machine ordering stays intact and the cloud specific sync is skipped on replay.
Configuration items: none.
Compatibility: no protocol or storage compatibility issue found.
Parallel code paths: non cloud restore is unchanged. The cloud path should stay aligned with existing cloud update flows that only refresh changed tablets.
Conditional checks: the new CloudPartition branch has no clear reason for the live version read.
Test coverage: no new test covers cloud restore version sync, replay behavior, or restore scoped stats refresh.
Test result changes: none.
Observability: sufficient.
Transaction and persistence: no new journal or replay ordering issue found.
Data writes and modifications: MetaService remains the source of truth, but the added pre PREPARE version read is an unnecessary dependency.
FE and BE variable passing: not applicable.
Performance: whole table tablet refresh is avoidable work for partition restore.
Other issues: none beyond the findings above.

github-actions · 2026-04-22T09:25:25Z

@@ -459,6 +491,13 @@ private void handleOlapTableMeta(MetaSeriviceOperation operation, OlapTable olap
                partitions.forEach(partition -> {
                    visibleVersions.add(partition.getCachedVisibleVersion());


visibleVersions is already captured from partition.getCachedVisibleVersion() on the previous line, so this extra partition.getVisibleVersion() call does not affect what preparePartition() persists. For restored CloudPartition objects, lastVersionCachedTimeMs starts at 0 after deserialization or copy, so getVisibleVersion() immediately issues a get_version RPC against the fresh partition id before preparePartition() runs. That makes restore depend on an extra RPC succeeding even though CloudInternalCatalog.preparePartition() already has its own retry loop. Please seed the local cache from the known restore version instead of doing a live lookup here.
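
The difference between a live lookup and seeding the cache can be sketched with a small TTL cache. This is a simplified model under stated assumptions, not the `CloudPartition` implementation: the class `CachedVersionSketch`, the explicit `nowMs` parameter, the `remoteCalls` counter, and the two-argument `setCachedVisibleVersion` signature are all hypothetical, and `remoteLookup` merely stands in for the `get_version` RPC.

```java
import java.util.function.LongSupplier;

class CachedVersionSketch {
    private final long ttlMs;
    private final LongSupplier remoteLookup; // stands in for the get_version RPC
    private long cachedVersion;
    private long lastCachedTimeMs = 0;       // 0 after copy/deserialization: entry looks expired
    int remoteCalls = 0;                     // counts remote lookups, for illustration only

    CachedVersionSketch(long ttlMs, LongSupplier remoteLookup) {
        this.ttlMs = ttlMs;
        this.remoteLookup = remoteLookup;
    }

    // Returns the cached value and never goes remote.
    long getCachedVisibleVersion() {
        return cachedVersion;
    }

    // Seeds the cache with a version the caller already knows and restarts the TTL.
    void setCachedVisibleVersion(long version, long nowMs) {
        cachedVersion = version;
        lastCachedTimeMs = nowMs;
    }

    // Goes remote whenever the cached entry is older than the TTL.
    long getVisibleVersion(long nowMs) {
        if (nowMs - lastCachedTimeMs > ttlMs) {
            remoteCalls++;
            setCachedVisibleVersion(remoteLookup.getAsLong(), nowMs);
        }
        return cachedVersion;
    }
}
```

In this model a freshly copied object always hits the remote lookup on its first `getVisibleVersion` call, while an object seeded through `setCachedVisibleVersion` serves reads locally until the TTL expires, so a failing lookup cannot fail the caller in that window.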

mymeiyi · 2026-04-22T09:26:20Z

run buildall

github-actions

Additional blocking finding: the new version sync loop can propagate stale cached versions for untouched partitions in a partition restore.

github-actions · 2026-04-22T09:33:44Z

+                partitionVersionMap.put(cloudPartition, Pair.of(version, partition.getVisibleVersionTime()));
+            }
+            ((CloudEnv) env).getCloudFEVersionSynchronizer()
+                    .pushVersionAsync(dbId, tableVersionMap, partitionVersionMap);


For restore into an existing cloud table, this loop pushes getCachedVisibleVersion() for every partition in the table, not just the partitions touched by the restore. In cloud mode that value is only FE cache. After restart or failover, untouched partitions can still hold old serialized versions. Pushing them here resets follower cache TTL via setCachedVisibleVersion(...), so other FEs can keep using stale versions for unrelated partitions until TTL expires. Please only sync versions for partitions affected by this restore, or fetch authoritative versions before pushing untouched partitions.
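
One way to express the first option, syncing only the partitions affected by the restore, is sketched below. It also applies the `instanceof` guard raised earlier in this review instead of an unconditional cast. The classes `PartitionSketch`, `CloudPartitionSketch`, and `VersionPushScope`, and the `restoredIds` parameter, are hypothetical simplifications and not the Doris types:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for Partition.
class PartitionSketch {
    final long id;

    PartitionSketch(long id) {
        this.id = id;
    }
}

// Hypothetical stand-in for CloudPartition, carrying only the FE-cached version.
class CloudPartitionSketch extends PartitionSketch {
    final long cachedVisibleVersion;

    CloudPartitionSketch(long id, long cachedVisibleVersion) {
        super(id);
        this.cachedVisibleVersion = cachedVisibleVersion;
    }
}

class VersionPushScope {
    // Returns partition id -> version for the partitions that are safe to push.
    static Map<Long, Long> versionsToPush(List<PartitionSketch> partitions, Set<Long> restoredIds) {
        Map<Long, Long> toPush = new HashMap<>();
        for (PartitionSketch partition : partitions) {
            if (!(partition instanceof CloudPartitionSketch)) {
                continue; // guard instead of an unconditional cast
            }
            if (!restoredIds.contains(partition.id)) {
                continue; // untouched partition: its cached version may be stale
            }
            toPush.put(partition.id, ((CloudPartitionSketch) partition).cachedVisibleVersion);
        }
        return toPush;
    }
}
```

Untouched partitions are simply left out of the push in this sketch, so followers keep whatever cache state and TTL they already had for them.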

mymeiyi · 2026-04-22T11:13:41Z

run cloud_p0

hello-stephen · 2026-04-22T11:36:56Z

FE Regression Coverage Report

Increment line coverage 3.12% (1/32) 🎉
Increment coverage report
Complete coverage report

hello-stephen · 2026-04-22T12:38:43Z

FE Regression Coverage Report

Increment line coverage 3.12% (1/32) 🎉
Increment coverage report
Complete coverage report

github-actions · 2026-06-02T10:12:38Z

PR approved by at least one committer and no changes requested.

Copilot AI review requested due to automatic review settings April 22, 2026 07:36

mymeiyi requested review from dataroaring, gavinchou and w41ter as code owners April 22, 2026 07:36

Copilot started reviewing on behalf of mymeiyi April 22, 2026 07:37

Copilot AI reviewed Apr 22, 2026


[fix](fe) cache version and get tablet stats actively for RestoreJob

eaf3bf7

mymeiyi force-pushed the fix-retsore-job-sync-version-and-stats branch from 1568bc2 to eaf3bf7 April 22, 2026 07:54

github-actions Bot requested changes Apr 22, 2026


Comment thread fe/fe-core/src/main/java/org/apache/doris/cloud/backup/CloudRestoreJob.java

Comment thread fe/fe-core/src/main/java/org/apache/doris/cloud/backup/CloudRestoreJob.java

github-actions Bot previously requested changes Apr 22, 2026


github-actions Bot reviewed Apr 22, 2026


gavinchou approved these changes Jun 2, 2026


github-actions Bot added the approved Indicates a PR has been approved by one committer. label Jun 2, 2026

gavinchou merged commit 596862b into apache:master Jun 2, 2026
32 checks passed

mymeiyi mentioned this pull request Jun 3, 2026

branch-4.1: [fix](fe) cache version and get tablet stats actively for RestoreJob (#62704) #64045

Merged

yiguolei pushed a commit that referenced this pull request Jun 3, 2026

branch-4.1: [fix](fe) cache version and get tablet stats actively for…

443897b

… RestoreJob (#62704) (#64045) pick #62704

yiguolei added the dev/4.1.2-merged label Jun 3, 2026

5 participants

github-actions Bot left a comment •

edited

Loading