
[Feature](cloud) Table-level event-driven warmup with progress observation #62501

Closed
bobhan1 wants to merge 31 commits into apache:master from bobhan1:passive-warmup-table-dev

@bobhan1
Contributor

@bobhan1 bobhan1 commented Apr 15, 2026

What problem does this PR solve?

Problem Summary:

The current event-driven warmup only supports cluster-level granularity: once enabled, all table writes on the source cluster are warmed up to the target cluster. For scenarios where only specific tables need warming (e.g., core business tables, high-frequency query tables), cluster-level warmup wastes bandwidth and cache space. Additionally, the existing bvar metrics are global accumulators that cannot distinguish between jobs or tables, and a BE crash leaves a permanent residual gap between the requested and finished counts.

This PR implements two core features:

1. Table-Level Event-Driven Warmup (ON TABLES)

Specify the warmup scope via ON TABLES (INCLUDE/EXCLUDE) SQL syntax with * and ? wildcard support. Key features:

  • INCLUDE / EXCLUDE rule composition: INCLUDE declares the warmup scope, EXCLUDE removes specific tables from the INCLUDE result set
  • Dynamic table change tracking: Automatically updates matched table list after CREATE TABLE / DROP TABLE / RENAME TABLE
  • Async materialized view support: ON TABLES patterns match both regular tables and async materialized views (MTMV)
  • Multi-destination clusters: The same table can be warmed up to multiple target clusters simultaneously with independent filters
  • Rule canonicalization and deduplication: Rules are sorted before comparison to prevent duplicate jobs with semantically identical but differently ordered rules
  • SQL syntax extensions: New ON TABLES clause; SHOW WARM UP JOB adds TableFilter/MatchedTables columns
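The INCLUDE/EXCLUDE semantics above can be sketched as follows. This is a minimal illustration, not the PR's `OnTablesFilter` implementation: the function names are hypothetical, and it assumes `*` may match across the `db.table` dot, which the description does not spell out.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Compile a glob: '*' matches any run of characters, '?' exactly one."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            # Escape regex metacharacters such as '.' so they match literally.
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def should_warm_up(qualified_name, includes, excludes):
    """True if the name matches some INCLUDE pattern and no EXCLUDE pattern."""
    included = any(glob_to_regex(p).fullmatch(qualified_name) for p in includes)
    excluded = any(glob_to_regex(p).fullmatch(qualified_name) for p in excludes)
    return included and not excluded
```

With `includes = ["*.sales_*", "log_db.log_?"]` and `excludes = ["*.*_archive"]`, a table named `shop.sales_eu` would be warmed, while `shop.sales_archive` would be removed by the EXCLUDE rule.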

2. Warmup Progress Observation

Per-job windowed metrics based on bvar::MultiDimension + bvar::Window, solving the problems of global accumulators being unable to distinguish Jobs/tables and permanent gap residuals after BE crashes:

  • BE side: Per-job windowed metrics (5min/30min/1h) automatically maintaining requested/finish/fail statistics
  • BE HTTP API: /api/warmup_event_driven_stats exposes per-job JSON statistics
  • FE on-demand collection: SHOW WARM UP JOB concurrently collects stats from all BEs and displays them in an aggregated SyncStats JSON column
  • Gap calculation: gap = requested - finished; windowed metrics cause crash-induced gaps to expire automatically
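To illustrate why a windowed gap heals itself after a crash, here is a small sliding-window sketch. It is an assumption-laden stand-in for `bvar::Window` (which samples periodically rather than storing every event); the class and function names are hypothetical.

```python
from collections import deque


class WindowedAdder:
    """Sums values recorded within the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, value), oldest first

    def add(self, value, now):
        self._events.append((now, value))

    def value(self, now):
        # Drop events that have fallen out of the window, then sum the rest.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()
        return sum(v for _, v in self._events)


def gap(requested, finished, now):
    """gap = requested - finished over the same window."""
    return requested.value(now) - finished.value(now)
```

If 10 segments are requested but only 8 finish before a crash, the 5-minute gap reads 2 while those events are inside the window and falls back to 0 once they age out, whereas a global accumulator would keep reporting 2 indefinitely.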

SQL Syntax Examples

-- Create a table-level event-driven warmup job
WARM UP COMPUTE GROUP query_cg WITH COMPUTE GROUP write_cg
ON TABLES (
    INCLUDE 'core_db.config',
    INCLUDE 'core_db.metadata',
    INCLUDE 'report_db.monthly_*',
    INCLUDE '*.sales_*',
    INCLUDE 'log_db.log_?',
    EXCLUDE '*.*_archive'
)
PROPERTIES (
    "sync_mode" = "event_driven",
    "sync_event" = "load"
);
-- View warmup job status and sync progress
SHOW WARM UP JOB;
+-------+-----------------+-----------------+---------+---------+---------------------+---------------------------+-----------+--------+-----------------------------------------------+-----------------------------------------+-----------+
| JobId | SrcComputeGroup | DstComputeGroup | Status  | Type    | SyncMode            | CreateTime                | ...       | Tables | TableFilter                                   | MatchedTables                           | SyncStats |
+-------+-----------------+-----------------+---------+---------+---------------------+---------------------------+-----------+--------+-----------------------------------------------+-----------------------------------------+-----------+
| 13419 | ingestion_cg    | analytics_cg    | RUNNING | TABLES  | EVENT_DRIVEN (LOAD) | 2024-01-01 10:05:00.000   | ...       |        | {"include":["ods.*"],"exclude":["ods.tmp_*"]} | ods.orders, ods.payments, ods.users     | {...}     |
+-------+-----------------+-----------------+---------+---------+---------------------+---------------------------+-----------+--------+-----------------------------------------------+-----------------------------------------+-----------+
-- SyncStats column shows per-job aggregated windowed sync progress in JSON
{
  "seg_num": {
    "requested_5m": 42, "finish_5m": 40, "gap_5m": 2, "fail_5m": 0,
    "requested_30m": 180, "finish_30m": 178, "gap_30m": 2, "fail_30m": 0,
    "requested_1h": 320, "finish_1h": 318, "gap_1h": 2, "fail_1h": 0
  },
  "seg_size": {
    "requested_5m": "12.5mb", "finish_5m": "11.8mb", "gap_5m": "716kb", "fail_5m": "0b",
    "requested_30m": "58.2mb", "finish_30m": "57.5mb", "gap_30m": "716kb", "fail_30m": "0b",
    "requested_1h": "102.3mb", "finish_1h": "101.6mb", "gap_1h": "716kb", "fail_1h": "0b"
  },
  "idx_num": { "requested_5m": 10, "finish_5m": 10, "gap_5m": 0, "fail_5m": 0, "..." : "..." },
  "idx_size": { "requested_5m": "2.1mb", "finish_5m": "2.1mb", "gap_5m": "0b", "fail_5m": "0b", "..." : "..." },
  "last_trigger_ts": "14:32:15",
  "last_finish_ts": "14:32:18"
}
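A client could scan this JSON for windows that still show a backlog or failures. The sketch below assumes only the key names shown in the example above (`seg_num`, `idx_num`, `gap_5m`, `fail_5m`, and so on); the function name is hypothetical and the size sections are skipped because their values are formatted strings.

```python
import json

WINDOWS = ("5m", "30m", "1h")


def find_backlog(sync_stats_json, sections=("seg_num", "idx_num")):
    """Return (section, window, gap, fail) for every window with gap or fail > 0."""
    stats = json.loads(sync_stats_json)
    findings = []
    for section in sections:
        metrics = stats.get(section, {})
        for window in WINDOWS:
            # Missing keys are treated as zero, so partial output is tolerated.
            gap = metrics.get(f"gap_{window}", 0)
            fail = metrics.get(f"fail_{window}", 0)
            if gap > 0 or fail > 0:
                findings.append((section, window, gap, fail))
    return findings
```

An empty result means no segment or index backlog is visible in any window.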

Code Change Summary

FE Core Changes:

  • CacheHotspotManager.java: Table ID resolution, dynamic refresh, Auth-Token authentication, on-demand metrics collection
  • CloudWarmUpJob.java: JobType.TABLES enum, table filter rule persistence, SyncStats display
  • OnTablesFilter.java (new): Glob wildcard matching engine (INCLUDE/EXCLUDE)
  • TableWarmUpWindowedStats.java / JobWarmUpStats.java (new): Windowed statistics data models
  • WarmUpClusterCommand.java / LogicalPlanBuilder.java: ON TABLES SQL parsing

BE Core Changes:

  • bvar_windowed_adder.h (new): Multi-dimension windowed adder wrapper
  • cloud_warm_up_manager.cpp/h: JobReplicaInfo struct, per-job requested metrics
  • cloud_internal_service.cpp: Per-job finish/fail metrics
  • warmup_stats_action.cpp/h (new): HTTP API /api/warmup_event_driven_stats
  • internal_service.proto: Added job_id field to PWarmUpRowsetRequest

Regression Tests (Docker multi-cluster): 8 independent test files

  • INCLUDE wildcard, INCLUDE+EXCLUDE combination, multi-INCLUDE cross-database, rule canonicalization
  • Dynamic table change tracking, multi-destination clusters, error handling & lifecycle, SyncStats progress observation

FE Unit Tests: 7 test classes, ~70 test methods

  • OnTablesFilterTest, CloudWarmUpJobTableFilterTest, CacheHotspotManagerTableFilterTest
  • WarmUpClusterOnTablesParseTest, WarmUpStatsTest, etc.

Design Documents

  • table-level-event-driven-warmup-design.md: Table-level warmup feature design
  • table-level-event-driven-warmup-user-guide.md: User guide
  • warmup-progress-observation-design.md: Progress observation design
  • warmup-progress-observation-user-guide.md: Progress observation user guide
  • table-level-event-driven-warmup-test-doc.md: Complete test documentation

Release note

Support table-level Event-Driven Warmup: specify warmup table scope via ON TABLES (INCLUDE/EXCLUDE) syntax (supports */? wildcards), with automatic table change tracking. Added per-job windowed sync progress observation (SHOW WARM UP JOB SyncStats column + BE HTTP API /api/warmup_event_driven_stats).

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes. Added ON TABLES SQL clause; SHOW WARM UP JOB adds TableFilter/MatchedTables/SyncStats columns; new BE HTTP API /api/warmup_event_driven_stats; CloudWarmUpJob.JobType adds TABLES enum value
  • Does this need documentation?

    • No.
    • Yes. Design docs and user guides are included in this PR

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@bobhan1 bobhan1 force-pushed the passive-warmup-table-dev branch 3 times, most recently from f929ffd to 3e03d23 Compare April 15, 2026 08:16
bobhan1 added 2 commits April 15, 2026 16:23
add user guide doc

don't persist table_id on FE

normalize filters

show warmup example

update
Add support for table-level filtering in event-driven warmup jobs via
ON TABLES (INCLUDE/EXCLUDE 'glob_pattern') SQL syntax. This allows
users to selectively warm up only specific tables matching glob patterns
instead of warming up all tables in a cluster.

Changes:
- FE: Add OnTablesFilter class with glob-to-regex compilation,
  INCLUDE/EXCLUDE semantics, and shouldWarmUp() method
- FE: Extend ANTLR grammar with onTablesClause and INCLUDE token
- FE: Extend CloudWarmUpJob with table filter persistence, canonical
  JSON representation, and dynamic table ID resolution
- FE: Extend CacheHotspotManager with table filter dedup (JobKey),
  resolveTableIds(), and periodic refreshAllTableFilters()
- FE: Add TableFilter and MatchedTables columns to ShowWarmUpCommand
- BE: Add EventDrivenJobFilter type alias with per-job table_id
  filtering in get_replica_info()
- BE: Pass table_id from tablet level through commit_rowset() to
  warm_up_rowset() instead of extracting from rs_meta
- Thrift: Add optional table_ids field to TWarmUpTabletsRequest
- Tests: Add FE tests (OnTablesFilterTest, CloudWarmUpJobTableFilterTest)
  and BE tests (CloudWarmUpManagerFilterTest)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Refactor table-level warmup: separate refresh daemon and fix empty table_ids handling

- FE: Move refreshAllTableFilters() from JobDaemon into a new
  TableFilterRefreshDaemon with its own configurable interval
  (cloud_warm_up_table_filter_refresh_interval_ms, default 60s)
- FE: Always send table_ids when hasTableFilter() is true, even if empty
- BE: When table_ids is non-null but empty, set empty filter set (warm up
  nothing) instead of nullopt (warm up everything). This correctly handles
  the scenario where all matched tables have been deleted.
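The distinction between an absent filter and an empty one can be sketched like this. It is a simplified model of the behavior described above, with hypothetical function names, not the BE's actual C++ code.

```python
from typing import FrozenSet, Iterable, Optional


def build_filter(table_ids: Optional[Iterable[int]]) -> Optional[FrozenSet[int]]:
    """None means 'no table filter'; an empty iterable means 'match nothing'."""
    if table_ids is None:
        return None
    return frozenset(table_ids)


def should_warm_table(table_id: int, job_filter: Optional[FrozenSet[int]]) -> bool:
    if job_filter is None:
        # Cluster-level job: every table is warmed up.
        return True
    # Table-level job: only listed tables, so an empty set warms nothing.
    return table_id in job_filter
```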

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Improve FE unit tests for table-level warmup

- Rewrite OnTablesFilterTest: remove redundancy, consolidate into 10
  focused tests covering glob wildcards, INCLUDE/EXCLUDE semantics,
  regex metachar escaping, complex multi-db scenarios
- Rewrite CloudWarmUpJobTableFilterTest: 12 tests covering canonicalize(),
  rebuildOnTablesFilter(), hasTableFilter(), getJobInfo(), Builder validation
- Add WarmUpClusterOnTablesParseTest: 10 tests verifying ON TABLES grammar
  parsing (single/multiple rules, FORCE, COMPUTE GROUP), syntax errors
  (empty parens, missing parens, missing pattern), and parsed field values
- Use Mockito (not JMockit) for ConnectContext mocking

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Improve FE unit tests for table-level event-driven warmup

- Rewrite CloudWarmUpJobTableFilterTest (13 tests): add SHOW WARM UP JOB
  column verification (all 15 columns), matched tables string output,
  dynamic table ID tracking (create/drop/rename scenarios)
- Create CacheHotspotManagerTableFilterTest (13 tests): test resolveTableIds
  with mocked Env/InternalCatalog, dynamic table changes (new table, drop,
  rename), refreshAllTableFilters with running jobs, cluster-level job skip
- Fix unused import in OnTablesFilterTest

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

format

refactor: derive tableFilterExpr from rules, show db.table names in MatchedTables

- Make tableFilterExpr transient (not persisted), computed from
  tableFilterRules via canonicalize() as single source of truth
- Add computeTableFilterExpr() helper, called in constructor and
  rebuildOnTablesFilter()
- Remove setTableFilterExpr() from Builder
- Change currentTableIds from Set<Long> to Map<Long, String> mapping
  table ID to 'db.table' qualified name
- Update getMatchedTablesString() to show sorted db.table names
  (e.g., 'ods.orders, ods.products') instead of numeric IDs
- Update resolveTableIds() in CacheHotspotManager to return
  Map<Long, String> and update all callers
- Add 3 new tests, update all existing tests for new API

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

[fix](cloud) Tighten table-level warmup validation and tests

Issue Number: None

Related PR: None

Problem Summary: Fix table-level event-driven warmup to reject ON TABLES jobs with no initial matches and refresh matched table names when a matching table is renamed. Add FE and BE unit tests for validation, refresh, duplicate detection, cluster-level coexistence, and BE table-filter handling. Keep event-driven warmup out of modify-log alter coverage.

None

- Test: Unit Test
    - FE UT: ./run-fe-ut.sh --run org.apache.doris.cloud.CacheHotspotManagerTableFilterTest,org.apache.doris.cloud.WarmUpClusterOnTablesParseTest,org.apache.doris.nereids.trees.plans.commands.ShowWarmUpCommandTest,org.apache.doris.persist.ModifyCloudWarmUpJobTest (ran before removing the unsupported ModifyCloudWarmUpJob table-filter case; not rerun after that test-only deletion)
    - BE UT: ./run-be-ut.sh --run --filter=CloudWarmUpManagerFilterTest.* -j 64
- Behavior changed: Yes (reject ON TABLES jobs with no initial matches and refresh matched table names on rename)
- Does this need documentation: No

rm some doc

[improvement](cloud) Log matched tables for warmup filters

Issue Number: None

Related PR: None

Problem Summary: Log all matched table ids, full names, and counts in FE when an event-driven ON TABLES warmup job is created and whenever its table filter is refreshed.

None

- Test: Unit Test
    - FE UT: ./run-fe-ut.sh --run org.apache.doris.cloud.CacheHotspotManagerTableFilterTest
- Behavior changed: Yes (FE logs now print all matched tables and counts for table-filter create/refresh)
- Does this need documentation: No

[test](cloud) Cover warmup create and replay paths

Issue Number: None

Related PR: None

Problem Summary: Adjust CacheHotspotManager table-filter unit tests so warmup jobs are not always injected through replay. The refresh-related tests now cover both createJob and replayCloudWarmUpJob paths.

None

- Test: Unit Test
    - FE UT: ./run-fe-ut.sh --run org.apache.doris.cloud.CacheHotspotManagerTableFilterTest
- Behavior changed: No
- Does this need documentation: No

[improvement](cloud) Normalize persisted warmup table filters

Issue Number: None

Related PR: None

Problem Summary: Normalize persisted CloudWarmUpJob tableFilterRules so they are deduplicated and ordered with INCLUDE rules before EXCLUDE rules, and rebuild a canonical table filter expression after deserialization.

None

- Test: Unit Test
    - ./run-fe-ut.sh --run org.apache.doris.cloud.CloudWarmUpJobTableFilterTest,org.apache.doris.cloud.CacheHotspotManagerTableFilterTest
- Behavior changed: Yes (persisted warmup table filter rules are now normalized on build and replay)
- Does this need documentation: No
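
The normalization described in this commit can be sketched as below. The INCLUDE-before-EXCLUDE ordering and deduplication come from the commit message; sorting patterns alphabetically within each group is an assumption, and the function name mirrors `canonicalize()` only loosely.

```python
def canonicalize(rules):
    """rules: iterable of (kind, pattern) with kind 'INCLUDE' or 'EXCLUDE'."""
    order = {"INCLUDE": 0, "EXCLUDE": 1}
    # A set removes exact duplicates; kind is upper-cased so case does not matter.
    unique = {(kind.upper(), pattern) for kind, pattern in rules}
    return sorted(unique, key=lambda rule: (order[rule[0]], rule[1]))
```

Two rule lists that differ only in order or repetition then compare equal, which is what lets duplicate jobs be detected.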
@bobhan1 bobhan1 force-pushed the passive-warmup-table-dev branch from 3e03d23 to 3c8e8c9 Compare April 15, 2026 08:24
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Remove the alternatives/comparison section from the table-level event-driven warmup design document, keep the final design only, and renumber the affected subsections.

### Release note

None

### Check List (For Author)

- Test: No need to test (documentation-only change)
- Behavior changed: No
- Does this need documentation: No
@bobhan1 bobhan1 force-pushed the passive-warmup-table-dev branch from 112603a to 15dbfe1 Compare April 16, 2026 02:26
@bobhan1 bobhan1 force-pushed the passive-warmup-table-dev branch from 15dbfe1 to 6b299b5 Compare April 16, 2026 04:20
bobhan1 and others added 18 commits April 16, 2026 12:21
…riven warmup

Add 6 Docker-based regression tests covering common scenarios for the
ON TABLES (INCLUDE/EXCLUDE) clause in WARM UP CLUSTER commands:

- test_warm_up_event_on_tables_include: INCLUDE wildcard with positive/negative proof
- test_warm_up_event_on_tables_include_exclude: INCLUDE+EXCLUDE pattern filtering
- test_warm_up_event_on_tables_multi_include: Multiple INCLUDE patterns across databases
- test_warm_up_event_on_tables_canonicalization: Rule order canonicalization and dedup
- test_warm_up_event_on_tables_dynamic: Auto-include new table, DROP auto-exclude, RENAME
- test_warm_up_event_on_tables_error_and_lifecycle: Error cases, coexistence, cancel, ? wildcard

All tests use quantitative bvar metrics (requested/submitted/finished/failed) for
verification, Groovy power assert, single BE, and proper Docker isolation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…cated directory

- Create WarmupMetricsUtils.groovy shared utility class with static
  helper methods (getBrpcMetric, getClusterMetricSum, waitForWarmupFinish,
  waitForMatchedTables, waitForMetricsStable, etc.)
- Move 6 test files from warm_up/cluster/ to warm_up/on_tables/ to avoid
  DB name length exceeding MySQL 64-char limit
- Replace inline helper methods in each test with WarmupMetricsUtils calls
- All 6 tests verified passing after refactoring

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Covers all 6 regression tests and 8 unit test classes (~76 methods),
including coverage matrix and run commands.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add test_warm_up_event_on_tables_multi_dst:
- Tests a table being warmed to multiple destination clusters simultaneously
- Job1: source -> target1 with INCLUDE 'db.orders' (specific table)
- Job2: source -> target2 with INCLUDE 'db.*' (all tables)
- Verifies orders insertion warms both targets independently
- Verifies logs insertion warms only target2 (negative proof for target1)

Update test documentation with Case 6 (multi_dst) details and coverage matrix.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three key design changes:
1. BE HTTP API: unified collection - each BE outputs all fields (requested +
   finish + fail) without distinguishing src/dst role
2. BE JSON: hierarchical structure {requested|finish|fail}.{seg|idx}.{num|size}.{5m|30m|2h}
   replacing verbose flat field names
3. FE collection: single-pass approach - collect all clusters at once into
   unified clusterStats map, then aggregate per-job, avoiding duplicate
   requests to same cluster

Updated sections: 3.1 architecture diagram, 4.7 BE implementation + API
example, 5.1-5.4 FE collection/parsing/aggregation/data model, 7 data flow
summary

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Change collection strategy: instead of sequential per-cluster per-BE requests,
enumerate all (cluster, BE) pairs first, then submit all HTTP requests
concurrently via CompletionService. This ensures all BEs are sampled in the
same time window, improving consistency of cross-cluster comparisons.
Aggregation happens only after all requests complete.
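
The submit-everything-then-aggregate strategy can be sketched with a thread pool, using `as_completed` as a rough analogue of Java's CompletionService. The `fetch` callable and its return shape are hypothetical placeholders for the real BE HTTP request.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def collect_all(targets, fetch, max_workers=8):
    """targets: list of (cluster, be) pairs; fetch(cluster, be) returns that BE's stats."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every request up front so all BEs are sampled at about the same time.
        futures = {pool.submit(fetch, cluster, be): (cluster, be) for cluster, be in targets}
        for future in as_completed(futures):
            key = futures[future]
            try:
                results[key] = future.result()
            except Exception as exc:  # one failed BE must not discard the others
                failures[key] = exc
    # Aggregation would start here, once every request has completed.
    return results, failures
```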

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tent thread pool

Major changes to warmup-progress-observation-design.md:
- Add (job_id, table_id) 2D dimension to all MBvarWindowedAdder metrics
- Source BE gets job_id via FE-pushed mapping; target BE via PWarmUpRowsetRequest
- Update BE HTTP API JSON output to group by job_id
- Fix gap formula consistently to: gap = requested - finished (positive = backlog)
- Fix FE thread pool: persistent warmupStatsExecutor, not created per cycle
- Fix timestamp merge: use Math.max() not +=
- Update clusterStats structure: cluster → jobId → tableId
- Update all sections: 4.5-4.7, 5.1-5.5, 6, 7, 8, 9, 10
- Update compatibility section: PWarmUpRowsetRequest.job_id, update_warmup_job_mapping RPC
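
The timestamp-merge fix can be illustrated with a small sketch: counters from different BEs add up, but timestamps must take the latest value. The field names follow the SyncStats JSON; representing timestamps as integer epoch seconds is an assumption.

```python
def merge_stats(a, b):
    """Merge two per-BE stat dicts for the same job."""
    return {
        # Counters accumulate across BEs.
        "requested": a["requested"] + b["requested"],
        "finish": a["finish"] + b["finish"],
        "fail": a["fail"] + b["fail"],
        # Timestamps keep the most recent value; adding them would be meaningless.
        "last_trigger_ts": max(a["last_trigger_ts"], b["last_trigger_ts"]),
        "last_finish_ts": max(a["last_finish_ts"], b["last_finish_ts"]),
    }
```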

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t to globals

- New JobReplicaInfo struct (job_id + TReplicaInfo) replaces std::pair
- get_replica_info returns vector<JobReplicaInfo>, job_id from _tablet_replica_cache
- Source BE: per-(job,table) metrics updated inline right after each global metric
  in _do_warm_up_rowset's existing segment loop, no separate aggregation
- Target BE: same pattern in done_cb - each global metric line immediately
  followed by its per-(job,table) counterpart
- _warm_up_rowset is now pure orchestration, no metrics code
- Remove update_warmup_job_mapping RPC (no longer needed)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add per-(job_id, table_id) windowed metrics to observe warmup sync progress:

BE changes:
- Add MBvarWindowedAdder utility class wrapping bvar::MultiDimension + bvar::Window
- Add source BE metrics: requested segment/index num and size
- Add target BE metrics: finish/fail segment/index num and size
- Add job_id field to PWarmUpRowsetRequest proto
- Add JobReplicaInfo struct to carry job_id alongside replica info
- Add HTTP API /api/warmup_event_driven_stats exposing per-(job,table) JSON stats

FE changes:
- Add TableWarmUpWindowedStats and JobWarmUpStats data models
- Add ProgressCollectDaemon in CacheHotspotManager for periodic stats collection
- Add SyncStats column to SHOW WARM UP JOB output
- Add warmup_progress_collect_interval_ms config (default 30s)
- Fix currentTableIdNames to use ConcurrentHashMap for thread safety

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…b_id only

Remove the table_id dimension from MBvarWindowedAdder metrics in both
source BE (cloud_warm_up_manager) and target BE (cloud_internal_service).
Flatten the HTTP API JSON output to per-job stats. Simplify FE collection
and aggregation to work with the single job_id dimension.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…removal

- warmup_stats_action.h: update comment from per-(job_id, table_id) to per-job_id
- bvar_windowed_adder.h: update example from {"job_id", "table_id"} to {"job_id"}
- warmup-progress-observation-design.md: comprehensive update to reflect job_id-only
  dimension throughout all sections (formulas, code examples, JSON output, FE types)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…emand

SHOW WARM UP JOB is a low-frequency operation, so periodic background
collection via ProgressCollectDaemon is unnecessary overhead. Changed to
on-demand collection: stats are fetched from BEs only when the user
executes SHOW WARM UP JOB.

Changes:
- Remove ProgressCollectDaemon inner class and startup code
- Remove warmup_progress_collect_interval_ms config
- Remove clusterStats/jobWarmUpStatsMap instance fields
- collectAndAggregate() now returns Map<Long, JobWarmUpStats> directly
- aggregateStatsForJob() takes clusterStats as parameter
- getSingleJobInfo()/getAllJobInfos() call collectAndAggregate() on demand
- CloudWarmUpJob.getJobInfo() accepts JobWarmUpStats parameter
- Update design document to reflect on-demand architecture

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…2H > MAX_SECONDS_LIMIT

- Add BE unit test for MBvarWindowedAdder (9 tests): put/get, multi-dimension,
  list_dimensions, invalid window index, composite keys, idempotent windows
- Add FE unit tests for TableWarmUpWindowedStats and JobWarmUpStats (12 tests):
  JSON parsing, merge accumulation, gap computation, toJsonString, humanReadableSize,
  end-to-end source+target aggregation
- Fix WINDOW_2H (7200) exceeding bvar MAX_SECONDS_LIMIT (3600): rename to WINDOW_1H
  in both cloud_warm_up_manager.cpp and cloud_internal_service.cpp, update JSON keys
  and FE field names from '2h' to '1h'
- Fix CloudWarmUpJobTableFilterTest TOTAL_COLUMNS for new SyncStats column
- Fix cloud_warm_up_manager_filter_test.cpp: access replica.backend_id via
  JobReplicaInfo wrapper

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…to 1h

- Replace custom humanReadableSize() in JobWarmUpStats with ByteSizeValue.toString()
- Update warmup-progress-observation-design.md: all 2h/7200 references to 1h/3600
- Update test expectations for ByteSizeValue output format (lowercase, no space)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add section 2.2.2: MBvarWindowedAdder BE unit tests (9 tests)
- Add section 2.3: WarmUpStatsTest FE unit tests (12 tests)
- Update summary: 17 test classes, ~119 methods total
- Add new coverage rows for BE metrics and FE data models

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Verifies end-to-end warmup progress observation via SHOW WARM UP JOB:
- SyncStats column (index 15) contains valid JSON for event-driven jobs
- JSON structure validation: seg_num, seg_size, idx_num, idx_size, timestamps
- Absolute segment counts: requested_5m and finish_5m match bvar deltas
- gap_5m == 0 and fail_5m == 0 after successful warmup
- Timestamps (last_trigger_ts, last_finish_ts) are non-empty
- Polls SHOW WARM UP JOB until windowed bvar values stabilize

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
User guide:
- Updated SHOW WARM UP JOB column count from 15 to 16
- Added SyncStats column description
- Added 'Sync Progress Observation' (同步进度观测) section with JSON structure, field descriptions,
  3 window sizes, 4 sub-metrics, and usage recommendations
- Added progress observation step in Quick Start

Test doc:
- Added Case 8: sync_stats (9 verification items)
- Updated coverage matrix with 'Progress Observation' (进度观测) column
- Updated summary: 8 Docker tests, 18 classes, ~128 methods

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bobhan1
Contributor Author

bobhan1 commented Apr 17, 2026

/review

@github-actions
Contributor

OpenCode automated review failed and did not complete.

Error: Review step was failure (possibly timeout or cancelled)
Workflow run: https://github.com/apache/doris/actions/runs/24560471071

Please inspect the workflow logs and rerun the review after the underlying issue is resolved.

bobhan1 and others added 5 commits April 18, 2026 06:30
- CloudWarmUpJob: add TABLES enum value to distinguish from TABLE
  (explicit table list) and CLUSTER (full cluster warmup)
- CacheHotspotManager: set JobType.TABLES when ON TABLES rules present
- fetchBeToTabletIdBatches: handle TABLES same as TABLE (early return)
- CloudWarmUpJobTableFilterTest: update expected type to TABLES
- User guide: update Type column description and example outputs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tputs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When enable_all_http_auth is enabled on BE, the /api/warmup_event_driven_stats
endpoint requires authentication. Add Auth-Token header using TokenManager to
the concurrent HTTP requests from FE to BE for warmup metrics collection.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add 5 FE unit tests verifying async MV matching in resolveTableIds:
  - MTMV matched by wildcard (ods.*)
  - MTMV matched by mv_* pattern specifically
  - MTMV excluded by EXCLUDE rule
  - Mixed table types across databases
  - New MTMV auto-discovered by refreshAllTableFilters
- Fix mockTable to set TableType.OLAP, add mockMtmv helper
- Fix setUp to mock CloudSystemInfoService for getAllJobInfos
- Update user guide: add materialized view support section
- Update test doc: add new test methods and count

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
User-facing and tester-facing guide covering:
- SHOW WARM UP JOB SyncStats column (JSON structure, field meanings)
- BE HTTP API /api/warmup_event_driven_stats (request/response format)
- Key metrics interpretation (gap, fail, timestamps)
- Common scenarios and troubleshooting
- Test verification checklist with expected values

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bobhan1
Contributor Author

bobhan1 commented Apr 20, 2026

run buildall

@bobhan1 bobhan1 changed the title from "table level warmup dev" to "[Feature](cloud) Table-level event-driven warmup with progress observation" Apr 20, 2026
@bobhan1 bobhan1 marked this pull request as ready for review April 20, 2026 09:03
…dd perf tests

Refactor resolveTableIds to:
1. Get table names via getTableNamesOrEmptyWithLock() (lightweight strings)
2. Match regex patterns against names only
3. Look up TableIf only for matched names (to get ID)

This avoids creating a full List<TableIf> copy when only names are needed
for pattern matching. The Table object lookup is deferred to matched entries.
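
The three steps can be sketched as follows. The nested-dict catalog and the `matches` callable are hypothetical stand-ins for the FE catalog and the compiled ON TABLES filter.

```python
def resolve_table_ids(catalog, matches):
    """catalog: {db_name: {table_name: table_id}}; matches(qualified_name) -> bool."""
    resolved = {}
    for db_name, tables in catalog.items():
        # Step 1: work from table names only (cheap strings).
        for table_name in list(tables):
            qualified = f"{db_name}.{table_name}"
            # Step 2: match the pattern against the name.
            if not matches(qualified):
                continue
            # Step 3: look up the table object (here just its ID) for matches only.
            table_id = tables.get(table_name)
            if table_id is not None:  # the table may have been dropped meanwhile
                resolved[table_id] = qualified
    return resolved
```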

Also adds 8 performance tests for shouldWarmUp regex matching:
- 10K/50K/200K/500K table scales
- Selective patterns, multi-rule include+exclude, repeated cycles
- Tight assertions: 500K < 2s, 200K+15rules < 2s, avg cycle < 1s

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.02% (1842/2361)
Line Coverage 64.65% (32946/50958)
Region Coverage 65.26% (16338/25034)
Branch Coverage 55.81% (8716/15616)

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 77.34% (454/587) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 3.55% (21/592) 🎉
Increment coverage report
Complete coverage report

@bobhan1
Contributor Author

bobhan1 commented Apr 21, 2026

run buildall

@bobhan1
Contributor Author

bobhan1 commented Apr 21, 2026

/review

Contributor

@github-actions github-actions Bot left a comment


I found blocking correctness issues in the new ON TABLES / progress-observation path, so I can't approve this PR yet.

Findings:

  • FE image/checkpoint load does not rebuild transient ON TABLES state, so restored jobs can stop warming tables after FE restart.
  • WITH TABLE ... ON TABLES (...) is accepted but ON TABLES is silently ignored; the statement becomes a one-shot explicit-table warmup.
  • resolveTableIds() matches non-writable objects such as VIEWs, so a job can be accepted with only view matches and then never trigger.
  • Rolling-upgrade compatibility is not handled for new thrift field table_ids; old BEs will ignore it and treat filtered jobs as cluster-level warmup.
  • /api/warmup_event_driven_stats drops index-only jobs because it discovers job ids only from segment metrics.
  • The per-destination fan-out path still aborts on the first transport failure, so one bad destination can block later destinations in a multi-destination setup.
  • cloud_warm_up_table_filter_refresh_interval_ms is marked mutable but the daemon only snapshots it at startup.
  • One new regression test still expects Type = CLUSTER for an ON TABLES job even though the implementation now returns TABLES.
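
As one illustration of the fan-out finding, a per-destination loop can isolate failures so that a bad destination does not block the ones after it. This is a sketch of a possible direction only, not code from the PR; `send` is a hypothetical callable standing in for the warmup RPC.

```python
def fan_out(destinations, send):
    """Call send(dest) for every destination; return {dest: exception} for failures."""
    failures = {}
    for dest in destinations:
        try:
            send(dest)
        except Exception as exc:
            # Record the failure and keep going instead of aborting the loop.
            failures[dest] = exc
    return failures
```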

Critical checkpoints:

  • Goal of current task: Partially accomplished. Table-level event-driven warmup and per-job progress observation are implemented on the happy path, but the issues above break restart safety, compatibility, object selection, and independent fan-out.
  • Modification size/focus: Reasonably focused on warmup, but it spans FE parser/persistence/runtime, BE protocol/metrics, HTTP, tests, and docs, so the regression surface is large.
  • Concurrency: Applicable. FE adds a refresh daemon and concurrent BE HTTP collection; BE updates metrics from async download callbacks. I did not find a new lock-order/deadlock issue in the touched paths, but correctness still fails in concurrent execution because one RPC failure can abort later destinations.
  • Lifecycle / static initialization: Applicable. Journal replay rebuilds transient table-filter state, but image/checkpoint load does not. I did not find a blocking cross-TU static-init issue in the new globals.
  • Configuration items: Applicable. The new refresh-interval config is declared mutable, but runtime changes are not observed without FE restart.
  • Incompatible changes / rolling upgrade: Applicable. New FE->BE table_ids and BE->BE job_id fields are introduced. job_id is backward-safe as an optional protobuf field, but table_ids is not rolling-upgrade safe because old BEs will ignore it and warm all tables.
  • Parallel code paths: Applicable. Journal replay and image load are not equivalent for filtered jobs, and the explicit-table path conflicts with the new ON TABLES path when both syntaxes are combined.
  • Special conditions / validation: Applicable. ON TABLES validation misses the WITH TABLE combination, and table resolution misses the restriction to writable object types.
  • Test coverage: Broad but incomplete. The PR adds substantial FE/BE/regression coverage, but it misses image/checkpoint restore, mixed-version BE rollout, and non-writable-object matching; one new regression assertion is also stale.
  • Test result changes: I reviewed the new tests only; I did not run the suite in this runner.
  • Observability: Applicable. The new per-job stats are useful, but index-only warmup can disappear from the HTTP/SHOW path.
  • Persistence / failover: Applicable. FE metadata persistence is touched; EditLog replay is handled, but image/checkpoint load is not equivalent.
  • Data writes / transactionality: Not a primary area of change. I did not find a new visible-version / MoW correctness regression in the touched storage paths.
  • FE/BE variable propagation: Applicable. table_ids propagation is not safe across mixed BE versions; job_id propagation itself looks consistent on the new-code path.
  • Performance: No new blocking hot-path performance issue stood out beyond the correctness issues above.

Requesting changes.

? new ArrayList<>(currentTableIdNames.keySet()) : new ArrayList<>());
}
LOG.debug("send warm up request to BE {} ({}). job_id={}, event={}, "
+ "request_type=SET_JOB(EVENT), table_ids_count={}",

read() only reconstructs tableFilterExpr. During image/checkpoint load CloudEnv.loadCloudWarmUpJob() calls addCloudWarmUpJob() rather than replayCloudWarmUpJob(), so the restored job never gets rebuildOnTablesFilter() / currentTableIdNames. After an FE restart an ON TABLES job comes back with a null filter and starts sending an empty table_ids list (or the refresh daemon keeps failing if Gson left the transient map null), so the job silently stops warming any tables until it is recreated. We need to rebuild the transient filter state on image load too, not just on journal replay.
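
One way to keep journal replay and image load equivalent is to route both through a single registration step that rebuilds the transient state. The sketch below assumes hypothetical `SketchWarmUpJob` and `SketchJobRegistry` classes, and uses prefix matching as a stand-in for the real glob filter; it illustrates the shape of the fix, not the PR's actual code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a persisted warm-up job; not the Doris class.
final class SketchWarmUpJob {
    final String tableFilterExpr;                    // persisted with the job
    transient Map<Long, String> currentTableIdNames; // transient: null after deserialization

    SketchWarmUpJob(String tableFilterExpr) {
        this.tableFilterExpr = tableFilterExpr;
    }

    // Rebuilds transient state from the persisted expression. Prefix matching
    // stands in for the real glob filter to keep the sketch short.
    void rebuildTransientState(Map<Long, String> catalog) {
        Map<Long, String> rebuilt = new HashMap<>();
        if (tableFilterExpr != null) {
            for (Map.Entry<Long, String> e : catalog.entrySet()) {
                if (e.getValue().startsWith(tableFilterExpr)) {
                    rebuilt.put(e.getKey(), e.getValue());
                }
            }
        }
        currentTableIdNames = rebuilt;
    }
}

final class SketchJobRegistry {
    private final Map<Long, SketchWarmUpJob> jobs = new HashMap<>();
    private final Map<Long, String> catalog; // table ID -> "db.table"

    SketchJobRegistry(Map<Long, String> catalog) {
        this.catalog = catalog;
    }

    // Single registration path, so both load routes rebuild transient state.
    private void register(long jobId, SketchWarmUpJob job) {
        job.rebuildTransientState(catalog);
        jobs.put(jobId, job);
    }

    void replayFromJournal(long jobId, SketchWarmUpJob job) {
        register(jobId, job);
    }

    void loadFromImage(long jobId, SketchWarmUpJob job) {
        register(jobId, job);
    }
}
```

Because the rebuild always assigns a fresh map, a job restored by either route never exposes a null `currentTableIdNames` to a refresh daemon.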

* Resolve glob-based ON TABLES filter to a map of matching table ID → "db.table" name
* by iterating all databases and tables in the internal catalog.
*/
public Map<Long, String> resolveTableIds(OnTablesFilter filter) {

getTableNamesOrEmptyWithLock() returns every catalog entry name here, and we never filter the resolved object type. That means ordinary VIEWs can satisfy the initial ON TABLES match check and show up in MatchedTables even though load-driven warmup is only triggered from BE rowset writes. A job like ON TABLES (INCLUDE 'db.v_*') can be accepted with only views matched and then never warm anything. We should restrict this to writable types (OlapTable / MTMV) before adding the ID.
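
The restriction asked for here can be sketched as a type check applied before the glob match. Everything below is hypothetical: `SketchTableType`, `SketchTable`, and `SketchTableResolver` are illustrative names, and the glob translation handles only the `*` and `?` wildcards mentioned in the PR description.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical object kinds; the real catalog has its own type hierarchy.
enum SketchTableType { OLAP, MATERIALIZED_VIEW, VIEW }

final class SketchTable {
    final long id;
    final String qualifiedName; // "db.table"
    final SketchTableType type;

    SketchTable(long id, String qualifiedName, SketchTableType type) {
        this.id = id;
        this.qualifiedName = qualifiedName;
        this.type = type;
    }
}

final class SketchTableResolver {
    // Translates a glob into a regex: * -> .*, ? -> ., everything else quoted literally.
    static Pattern globToRegex(String glob) {
        StringBuilder sb = new StringBuilder();
        for (char c : glob.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    // Only objects that receive rowset writes can ever trigger load-driven warmup.
    static boolean isWritable(SketchTableType type) {
        return type == SketchTableType.OLAP || type == SketchTableType.MATERIALIZED_VIEW;
    }

    static Map<Long, String> resolve(String includeGlob, List<SketchTable> tables) {
        Pattern pattern = globToRegex(includeGlob);
        Map<Long, String> matched = new LinkedHashMap<>();
        for (SketchTable t : tables) {
            // matches() anchors the pattern to the whole qualified name.
            if (isWritable(t.type) && pattern.matcher(t.qualifiedName).matches()) {
                matched.put(t.id, t.qualifiedName);
            }
        }
        return matched;
    }
}
```

With this ordering, a pattern such as `db.v_*` that matches only views resolves to an empty map, so the job can be rejected at creation time instead of being accepted and never firing.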

}
return normalizedRules;
}

This starts sending filtered jobs to every BE unconditionally, but table_ids is only a new thrift field. Pre-upgrade BEs will drop the unknown field and register the job as cluster-level warmup, so an ON TABLES job silently warms all tables during an FE-first / mixed-BE rollout. We need a version gate / feature check before allowing filtered jobs, or some compatibility fallback that refuses to install the job on old BEs.
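
One possible shape for the gate this comment asks for is a pre-install check that refuses a filtered job while any backend cannot interpret the filter. This is a sketch under assumptions: the `supportsTableFilter` flag, `SketchBackend`, and `SketchFilteredJobGate` are all hypothetical, and how a backend would actually report such a capability is not specified in the thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical view of a backend node.
final class SketchBackend {
    final String host;
    final boolean supportsTableFilter; // assumed capability flag, not a real Doris field

    SketchBackend(String host, boolean supportsTableFilter) {
        this.host = host;
        this.supportsTableFilter = supportsTableFilter;
    }
}

final class SketchFilteredJobGate {
    // Hosts that would drop the unknown field and fall back to cluster-level warmup.
    static List<String> backendsWithoutFilterSupport(List<SketchBackend> backends) {
        List<String> hosts = new ArrayList<>();
        for (SketchBackend be : backends) {
            if (!be.supportsTableFilter) {
                hosts.add(be.host);
            }
        }
        return hosts;
    }

    // Unfiltered jobs pass through; filtered jobs require every backend to support the filter.
    static void checkCanInstall(boolean jobHasTableFilter, List<SketchBackend> backends) {
        if (!jobHasTableFilter) {
            return;
        }
        List<String> unsupported = backendsWithoutFilterSupport(backends);
        if (!unsupported.isEmpty()) {
            throw new IllegalStateException(
                "ON TABLES job rejected; backends without table filter support: " + unsupported);
        }
    }
}
```

Failing loudly at job creation trades availability during a mixed-version rollout for the guarantee that a filtered job never silently widens to all tables.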

Comment thread be/src/service/http/action/warmup_stats_action.cpp
Comment thread be/src/cloud/cloud_warm_up_manager.cpp
Comment thread fe/fe-common/src/main/java/org/apache/doris/common/Config.java
bobhan1 added 2 commits April 21, 2026 13:08
### What problem does this PR solve?

Issue Number: None

Related PR: apache#62501

Problem Summary: Fix review feedback for ON TABLES warm-up filtering by removing the backend capability gate, restoring table filter state for loaded jobs, skipping non-managed tables when resolving matches, validating unsupported ON TABLES combinations, and making warm-up fan-out and stats discovery more robust.

### Release note

None

### Check List (For Author)

- Test: Regression test / Unit Test
    - FE UT: ./run-fe-ut.sh --run org.apache.doris.cloud.CloudWarmUpJobTableFilterTest,org.apache.doris.cloud.CacheHotspotManagerTableFilterTest,org.apache.doris.cloud.WarmUpClusterOnTablesParseTest,org.apache.doris.cloud.WarmUpStatsTest
    - BE UT: ./run-be-ut.sh --run --filter=CloudWarmUpManagerFilterTest.*
    - Regression test: ./run-regression-test.sh --run -d cloud_p0/cache/multi_cluster/warm_up/on_tables -runMode cloud -dockerSuiteParallel 1
- Behavior changed: Yes, ON TABLES warm-up now resolves only managed tables and no longer depends on a backend capability flag.
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: None

Related PR: apache#62501

Problem Summary: CloudWarmUpManager::_do_warm_up_rowset previously returned only the last failed replica status. This could hide earlier failures when a rowset warm-up partially failed. Aggregate all replica failures into the returned status while preserving OK when every replica succeeds and keeping TABLE_NOT_FOUND as the aggregate code when present so cache-refresh retry still works.
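
The aggregation rule described above can be illustrated with a short sketch. The real change is in the C++ backend; the Java below is only an illustration with hypothetical names (`SketchCode`, `SketchStatus`, `SketchStatusAggregator`). Using the first failure's code when no `TABLE_NOT_FOUND` is present is an assumption, since the commit message does not say which code is chosen in that case.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical status codes; only the three needed for the illustration.
enum SketchCode { OK, TABLE_NOT_FOUND, INTERNAL_ERROR }

final class SketchStatus {
    final SketchCode code;
    final String message;

    SketchStatus(SketchCode code, String message) {
        this.code = code;
        this.message = message;
    }

    static SketchStatus ok() {
        return new SketchStatus(SketchCode.OK, "");
    }

    boolean isOk() {
        return code == SketchCode.OK;
    }
}

final class SketchStatusAggregator {
    // Visits every replica result instead of stopping at, or keeping only, one failure.
    static SketchStatus aggregate(List<SketchStatus> replicaResults) {
        List<String> failures = new ArrayList<>();
        SketchCode code = SketchCode.OK;
        for (SketchStatus s : replicaResults) {
            if (s.isOk()) {
                continue;
            }
            failures.add(s.code + ": " + s.message);
            // TABLE_NOT_FOUND always wins so a cache-refresh retry can still be triggered;
            // otherwise the first failure's code is kept (an assumption of this sketch).
            if (s.code == SketchCode.TABLE_NOT_FOUND || code == SketchCode.OK) {
                code = s.code;
            }
        }
        if (failures.isEmpty()) {
            return SketchStatus.ok();
        }
        return new SketchStatus(code, String.join("; ", failures));
    }
}
```

A caller that inspects only the code still sees `TABLE_NOT_FOUND` when any replica reported it, while the message carries every failed replica's reason.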

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - BE UT: ./run-be-ut.sh --run --filter=CloudWarmUpManagerFilterTest.*
    - Style: build-support/check-format.sh
- Behavior changed: Yes, partial warm-up rowset failures now return all failed replica reasons instead of only the last failure.
- Does this need documentation: No
@bobhan1
Contributor Author

bobhan1 commented Apr 21, 2026

run buildall

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 78.06% (1843/2361) |
| Line Coverage | 64.78% (33001/50947) |
| Region Coverage | 65.29% (16379/25085) |
| Branch Coverage | 55.86% (8743/15652) |
