[fix](fe) Fix simple aggregate cache after partition recycle#63175
Conversation
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Auto partition recycle updates table visible version from the dynamic partition thread, while simple aggregate cache validation reads the table visible version without holding the table lock. Make table visible version fields visible across threads and defensively invalidate simple aggregate cache on partition drop/replay to avoid stale count(*) constants after partition recycle.
### Release note
None
### Check List (For Author)
- Test: Regression test / Build
- FE checkstyle: cd fe && mvn checkstyle:check -pl fe-core -DskipTests
- FE build: ./build.sh --fe
- Regression test: ./run-regression-test.sh --run -d nereids_rules_p0/rewrite_simple_agg_to_constant -s auto_partition_recycle_count_cache
- Manual test: DORIS_QUERY_PORT=39031 DORIS_HTTP_PORT=38031 bash test.sh
- Behavior changed: No
- Does this need documentation: No
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
| Env.getCurrentRecycleBin().setRecycleTimeByIdForReplay(partition.getId(), info.getRecycleTime()); | ||
| } | ||
| } | ||
| if (partition != null && !info.isTempPartition()) { |
There was a problem hiding this comment.
what will happen if we query on temp partition?
There was a problem hiding this comment.
对temp partition的count() 不会被记录到缓存中,也不会被应用到 select count() from t; 上。
fe/fe-core/src/main/java/org/apache/doris/nereids/rules/rewrite/RewriteSimpleAggToConstantRule.java 的tryRewrite() 函数里会检查是否有指定分区。如果指定分区就不rewrite
|
run buildall |
TPC-H: Total hot run time: 29448 ms |
TPC-DS: Total hot run time: 170292 ms |
FE Regression Coverage ReportIncrement line coverage |
|
run cloud_p0 |
FE Regression Coverage ReportIncrement line coverage |
|
/review |
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
There was a problem hiding this comment.
Review result: no blocking issues found.
Critical checkpoint conclusions:
- Goal and proof: The PR targets stale
RewriteSimpleAggToConstantRuleresults after partition recycle/drop by making table visible-version fields volatile and explicitly invalidatingSimpleAggCacheMgron non-temp partition drop and replay. The added regression covers auto-partition retention reducingcount(*)from 6 to 3 after the cache has warmed. - Scope/focus: The change is small and focused on table visible-version visibility, simple-aggregate cache invalidation, and one targeted regression case.
- Concurrency: The affected state is read by query/rewrite/cache-loader threads and written by metadata mutation/replay paths.
volatileimproves cross-thread visibility for the version fields. Cache invalidation uses Caffeine concurrent cache APIs. No new lock-order inversion was found; the invalidation is performed while the table write lock is held, but it is bounded by the configured cache size and does not perform RPC or external IO. - Lifecycle/static initialization:
SimpleAggCacheMgr.internalInstance()remains lazy and does not introduce cross-TU/static initialization ordering risk. - Configuration: No new configuration item is added. The test temporarily changes
dynamic_partition_check_interval_secondsand restores it infinally. - Compatibility/persistence: No storage/protocol format change. The replay path mirrors the master-side drop invalidation for non-temp partitions. Existing edit-log fields are reused.
- Parallel paths: I checked related table version mutation paths, including transaction publish, recover partition, and replace temp partitions. Existing version/versionTime validation or explicit invalidation paths appear sufficient for this PR’s scope.
- Conditional checks: The non-temp partition guard matches the existing behavior that temp-partition-specific queries are not applied to whole-table
count(*)rewrites; this was also covered by the existing review thread. - Tests: Added regression test is targeted to the stale count-cache scenario. I did not run tests locally in this review runner.
- Observability: No new observability appears necessary for this fix; existing cache load warnings and metadata logs remain sufficient.
- Transaction/persistence/data correctness: The change does not modify transaction state. Partition drop continues to update/journal visible version metadata as before, with cache invalidation preventing stale constants after data visibility changes.
- Performance: No obvious hot-path regression; invalidating min/max keys scans the simple-aggregate cache on partition drop, which is not a query hot path.
- User focus: No additional user-provided review focus was specified.
What problem does this PR solve?
Issue Number: None
Related PR: #61183
Problem Summary: Auto partition recycle updates table visible version from the dynamic partition thread, while simple aggregate cache validation reads the table visible version without holding the table lock. Make table visible version fields visible across threads and defensively invalidate simple aggregate cache on partition drop/replay to avoid stale count(*) constants after partition recycle.
Release note
None
Check List (For Author)
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)