Skip to content

[fix](fe) cache version and get tablet stats actively for RestoreJob#62704

Merged
gavinchou merged 1 commit into
apache:masterfrom
mymeiyi:fix-retsore-job-sync-version-and-stats
Jun 2, 2026
Merged

[fix](fe) cache version and get tablet stats actively for RestoreJob#62704
gavinchou merged 1 commit into
apache:masterfrom
mymeiyi:fix-retsore-job-sync-version-and-stats

Conversation

@mymeiyi
Copy link
Copy Markdown
Contributor

@mymeiyi mymeiyi commented Apr 22, 2026

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Copilot AI review requested due to automatic review settings April 22, 2026 07:36
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adjusts restore finalization behavior to better support cloud restores by syncing table/partition version caches across FEs and proactively triggering tablet stats collection after a restore completes.

Changes:

  • Make RestoreJob.updateOlapTablesVersion overridable and pass through the isReplay flag.
  • In CloudRestoreJob, push cached table/partition versions to other FEs after restore completion and mark restored tablets as “active” for stats collection.
  • Cache cloud table version after commitPartition(...) returns a table version; attempt to refresh cloud partition cached versions during PREPARE.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
fe/fe-core/src/main/java/org/apache/doris/cloud/backup/CloudRestoreJob.java Adds cloud-specific version sync + active tablet stats hooks; caches table version on commit; updates partition cache handling in PREPARE.
fe/fe-core/src/main/java/org/apache/doris/backup/RestoreJob.java Changes updateOlapTablesVersion to protected with (db, isReplay) signature and updates call site.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


// sync version
Map<CloudPartition, Pair<Long, Long>> partitionVersionMap = new HashMap<>(olapTable.getPartitions().size());
for (Partition partition : olapTable.getPartitions()) {
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loop unconditionally casts Partition to CloudPartition. In the same file you guard with partition instanceof CloudPartition (PREPARE path), which suggests non-cloud partitions are possible here as well; this cast can then throw ClassCastException and fail the restore. Filter/guard with instanceof (and decide how to handle non-cloud partitions) before adding to partitionVersionMap.

Suggested change
for (Partition partition : olapTable.getPartitions()) {
for (Partition partition : olapTable.getPartitions()) {
if (!(partition instanceof CloudPartition)) {
continue;
}

Copilot uses AI. Check for mistakes.
@mymeiyi mymeiyi force-pushed the fix-retsore-job-sync-version-and-stats branch from 1568bc2 to eaf3bf7 Compare April 22, 2026 07:54
@mymeiyi
Copy link
Copy Markdown
Contributor Author

mymeiyi commented Apr 22, 2026

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Superseded by the later corrected REQUEST_CHANGES review. Please use the latest review summary and inline comments as the authoritative result.

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

This PR improves cloud restore cache refresh, but I found one blocking reliability issue and one restore scope performance issue.

Findings

  1. The new live partition version lookup in the PREPARE path adds an unnecessary restore failure point before preparePartition runs.
  2. The new active tablet refresh is scoped to the whole table instead of only the restored tablets, so partition restore can trigger a full table stats burst.

Critical Checkpoints

  • Goal: cache cloud restore versions and proactively refresh tablet stats. The code only partially achieves this because the new PREPARE hook makes restore depend on an extra RPC that is not needed for correctness.
  • Scope: mostly small, but the tablet stats refresh is broader than the restore scope because it scans whole table tablets.
  • Concurrency: no new deadlock or lock order issue found.
  • Lifecycle and replay: the restore state machine ordering stays intact and the cloud specific sync is skipped on replay.
  • Configuration items: none.
  • Compatibility: no protocol or storage compatibility issue found.
  • Parallel code paths: non cloud restore is unchanged. The cloud path should stay aligned with existing cloud update flows that only refresh changed tablets.
  • Conditional checks: the new CloudPartition branch has no clear reason for the live version read.
  • Test coverage: no new test covers cloud restore version sync, replay behavior, or restore scoped stats refresh.
  • Test result changes: none.
  • Observability: sufficient.
  • Transaction and persistence: no new journal or replay ordering issue found.
  • Data writes and modifications: MetaService remains the source of truth, but the added pre PREPARE version read is an unnecessary dependency.
  • FE and BE variable passing: not applicable.
  • Performance: whole table tablet refresh is avoidable work for partition restore.
  • Other issues: none beyond the findings above.

@@ -459,6 +491,13 @@ private void handleOlapTableMeta(MetaSeriviceOperation operation, OlapTable olap
partitions.forEach(partition -> {
visibleVersions.add(partition.getCachedVisibleVersion());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

visibleVersions is already captured from partition.getCachedVisibleVersion() on the previous line, so this extra partition.getVisibleVersion() call does not affect what preparePartition() persists. For restored CloudPartition objects, lastVersionCachedTimeMs starts at 0 after deserialization or copy, so getVisibleVersion() immediately issues a get_version RPC against the fresh partition id before preparePartition() runs. That makes restore depend on an extra RPC succeeding even though CloudInternalCatalog.preparePartition() already has its own retry loop. Please seed the local cache from the known restore version instead of doing a live lookup here.

@mymeiyi
Copy link
Copy Markdown
Contributor Author

mymeiyi commented Apr 22, 2026

run buildall

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional blocking finding: the new version sync loop can propagate stale cached versions for untouched partitions in a partition restore.

partitionVersionMap.put(cloudPartition, Pair.of(version, partition.getVisibleVersionTime()));
}
((CloudEnv) env).getCloudFEVersionSynchronizer()
.pushVersionAsync(dbId, tableVersionMap, partitionVersionMap);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For restore into an existing cloud table, this loop pushes getCachedVisibleVersion() for every partition in the table, not just the partitions touched by the restore. In cloud mode that value is only FE cache. After restart or failover, untouched partitions can still hold old serialized versions. Pushing them here resets follower cache TTL via setCachedVisibleVersion(...), so other FEs can keep using stale versions for unrelated partitions until TTL expires. Please only sync versions for partitions affected by this restore, or fetch authoritative versions before pushing untouched partitions.

@mymeiyi
Copy link
Copy Markdown
Contributor Author

mymeiyi commented Apr 22, 2026

run cloud_p0

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 3.12% (1/32) 🎉
Increment coverage report
Complete coverage report

1 similar comment
@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 3.12% (1/32) 🎉
Increment coverage report
Complete coverage report

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label Jun 2, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 2, 2026

PR approved by at least one committer and no changes requested.

@gavinchou gavinchou merged commit 596862b into apache:master Jun 2, 2026
32 checks passed
yiguolei pushed a commit that referenced this pull request Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. dev/4.1.2-merged

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants