[fix](fe) cache version and get tablet stats actively for RestoreJob#62704
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adjusts restore finalization behavior to better support cloud restores by syncing table/partition version caches across FEs and proactively triggering tablet stats collection after a restore completes.
Changes:
- Make
RestoreJob.updateOlapTablesVersionoverridable and pass through theisReplayflag. - In
CloudRestoreJob, push cached table/partition versions to other FEs after restore completion and mark restored tablets as “active” for stats collection. - Cache cloud table version after
commitPartition(...)returns a table version; attempt to refresh cloud partition cached versions during PREPARE.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| fe/fe-core/src/main/java/org/apache/doris/cloud/backup/CloudRestoreJob.java | Adds cloud-specific version sync + active tablet stats hooks; caches table version on commit; updates partition cache handling in PREPARE. |
| fe/fe-core/src/main/java/org/apache/doris/backup/RestoreJob.java | Changes updateOlapTablesVersion to protected with (db, isReplay) signature and updates call site. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| // sync version | ||
| Map<CloudPartition, Pair<Long, Long>> partitionVersionMap = new HashMap<>(olapTable.getPartitions().size()); | ||
| for (Partition partition : olapTable.getPartitions()) { |
There was a problem hiding this comment.
This loop unconditionally casts Partition to CloudPartition. In the same file you guard with partition instanceof CloudPartition (PREPARE path), which suggests non-cloud partitions are possible here as well; this cast can then throw ClassCastException and fail the restore. Filter/guard with instanceof (and decide how to handle non-cloud partitions) before adding to partitionVersionMap.
| for (Partition partition : olapTable.getPartitions()) { | |
| for (Partition partition : olapTable.getPartitions()) { | |
| if (!(partition instanceof CloudPartition)) { | |
| continue; | |
| } |
1568bc2 to
eaf3bf7
Compare
|
/review |
There was a problem hiding this comment.
Summary
This PR improves cloud restore cache refresh, but I found one blocking reliability issue and one restore scope performance issue.
Findings
- The new live partition version lookup in the PREPARE path adds an unnecessary restore failure point before preparePartition runs.
- The new active tablet refresh is scoped to the whole table instead of only the restored tablets, so partition restore can trigger a full table stats burst.
Critical Checkpoints
- Goal: cache cloud restore versions and proactively refresh tablet stats. The code only partially achieves this because the new PREPARE hook makes restore depend on an extra RPC that is not needed for correctness.
- Scope: mostly small, but the tablet stats refresh is broader than the restore scope because it scans whole table tablets.
- Concurrency: no new deadlock or lock order issue found.
- Lifecycle and replay: the restore state machine ordering stays intact and the cloud specific sync is skipped on replay.
- Configuration items: none.
- Compatibility: no protocol or storage compatibility issue found.
- Parallel code paths: non cloud restore is unchanged. The cloud path should stay aligned with existing cloud update flows that only refresh changed tablets.
- Conditional checks: the new CloudPartition branch has no clear reason for the live version read.
- Test coverage: no new test covers cloud restore version sync, replay behavior, or restore scoped stats refresh.
- Test result changes: none.
- Observability: sufficient.
- Transaction and persistence: no new journal or replay ordering issue found.
- Data writes and modifications: MetaService remains the source of truth, but the added pre PREPARE version read is an unnecessary dependency.
- FE and BE variable passing: not applicable.
- Performance: whole table tablet refresh is avoidable work for partition restore.
- Other issues: none beyond the findings above.
| @@ -459,6 +491,13 @@ private void handleOlapTableMeta(MetaSeriviceOperation operation, OlapTable olap | |||
| partitions.forEach(partition -> { | |||
| visibleVersions.add(partition.getCachedVisibleVersion()); | |||
There was a problem hiding this comment.
visibleVersions is already captured from partition.getCachedVisibleVersion() on the previous line, so this extra partition.getVisibleVersion() call does not affect what preparePartition() persists. For restored CloudPartition objects, lastVersionCachedTimeMs starts at 0 after deserialization or copy, so getVisibleVersion() immediately issues a get_version RPC against the fresh partition id before preparePartition() runs. That makes restore depend on an extra RPC succeeding even though CloudInternalCatalog.preparePartition() already has its own retry loop. Please seed the local cache from the known restore version instead of doing a live lookup here.
|
run buildall |
| partitionVersionMap.put(cloudPartition, Pair.of(version, partition.getVisibleVersionTime())); | ||
| } | ||
| ((CloudEnv) env).getCloudFEVersionSynchronizer() | ||
| .pushVersionAsync(dbId, tableVersionMap, partitionVersionMap); |
There was a problem hiding this comment.
For restore into an existing cloud table, this loop pushes getCachedVisibleVersion() for every partition in the table, not just the partitions touched by the restore. In cloud mode that value is only FE cache. After restart or failover, untouched partitions can still hold old serialized versions. Pushing them here resets follower cache TTL via setCachedVisibleVersion(...), so other FEs can keep using stale versions for unrelated partitions until TTL expires. Please only sync versions for partitions affected by this restore, or fetch authoritative versions before pushing untouched partitions.
|
run cloud_p0 |
FE Regression Coverage ReportIncrement line coverage |
1 similar comment
FE Regression Coverage ReportIncrement line coverage |
|
PR approved by at least one committer and no changes requested. |
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)