[feature](statistics) Skip collecting stats for long string columns#62686
Conversation
### What problem does this PR solve?
Issue Number: close #0
Problem Summary: When a string column contains extremely long values, running
ANALYZE on it can consume excessive memory and may even OOM the BE. This PR
introduces an inline per-row guard that short-circuits stats collection for any
string column whose rows exceed `Config.statistics_max_string_column_length`
(default 1024 bytes).
The guard is injected into every statistics-collection SQL template
(FULL_ANALYZE_TEMPLATE, LINEAR_ANALYZE_TEMPLATE, DUJ1_ANALYZE_TEMPLATE,
HMS full/sample) as a `NoneMovableFunction`:
assert_true(col IS NULL OR LENGTH(col) <= N, 'ANALYZE_SKIP_LONG_STRING_COLUMN')
AS __lc
When any row violates the predicate, `assert_true` triggers a controlled error
carrying the `ANALYZE_SKIP_LONG_STRING_COLUMN` marker. `BaseAnalysisTask`
walks the exception cause chain for this marker and, when detected, converts
the failure into a FINISHED task with a skip message so users see why the
column was dropped (via SHOW ANALYZE and via OK-packet info for sync analyze).
Other columns in the same job are unaffected.
Setting `statistics_max_string_column_length = 0` disables the guard entirely
(upstream behavior).
### Release note
Added FE config `statistics_max_string_column_length` (default 1024). String
columns whose row byte length exceeds this value are skipped during ANALYZE
with a visible skip message, instead of risking OOM. Set to 0 to disable.
### Check List (For Author)
- Test:
- Regression test: `regression-test/suites/statistics/test_analyze_long_string.groovy`
(nonConcurrent, 6 cases covering full / sample LINEAR / partition / DUJ1 via
FE debug point / sync analyze / disabled config)
- Regression test: `regression-test/suites/external_table_p0/hive/test_hive_analyze_long_string.groovy`
- Unit Test: additions in OlapAnalysisTaskTest, HMSAnalysisTaskTest, AnalysisManagerTest
- Behavior changed: Yes. Long-string columns are now skipped during ANALYZE by
default (threshold 1024 bytes). Existing workloads can restore the old
behavior by setting `statistics_max_string_column_length = 0`.
- Does this need documentation: Yes (follow-up doc PR to describe the config).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
|
/review |
There was a problem hiding this comment.
I found blocking issues in the new long-string ANALYZE skip path.
- Goal / correctness: The goal is to avoid BE OOM by skipping overly long string columns during ANALYZE while keeping other columns analyzable. The current patch does not fully accomplish that safely: one code path is explicitly left unprotected, and skipped columns are still recorded as freshly analyzed in metadata. The added regression tests cover the happy-path skip behavior, but they do not cover these metadata / scheduling consequences.
- Scope / minimality: The SQL-template changes are reasonably focused, but the state-management changes in
BaseAnalysisTaskandAnalysisManagerintroduce broader metadata side effects that are not accounted for. - Concurrency: No new lock-order issue stood out in the touched code. The main concern is logical state consistency between task completion, buffer flush, and table-stats metadata updates.
- Lifecycle: No special lifecycle / static-init issue found in the touched code.
- Configuration:
statistics_max_string_column_lengthis mutable and master-only, which is appropriate for FE-side SQL generation. - Compatibility: No storage-format or FE/BE protocol compatibility issue found for the new config and function registration.
- Parallel paths: There is a parallel full-analyze path for partitioned tables, and this PR intentionally leaves it unprotected. That means the feature is not applied consistently to all relevant ANALYZE paths.
- Conditional checks: The
assert_true-based guard itself is reasonable, but the surrounding skip conversion needs tighter scoping to avoid metadata inconsistencies. - Test coverage: Added tests cover async/sync OLAP and HMS skip behavior, but they miss the metadata state after a skipped column and the partitioned-table OOM path that this PR still leaves exposed.
- Test results / outputs: I did not run the tests in this review environment.
- Observability: The new SHOW ANALYZE / OK-packet messaging is helpful.
- Transaction / persistence: Table-stats metadata persistence is affected. The current implementation persists skipped columns as if they had valid fresh stats.
- Data writes / modifications: No data-write atomicity issue outside the statistics metadata path.
- FE-BE variable passing: Not applicable here.
- Performance: The guard adds an extra projected expression, which is acceptable for this safety feature.
- Other issues: See inline comments for the concrete blockers.
Because of the blocking issues below, this should not be approved yet.
| // it into job.message for SHOW ANALYZE visibility. | ||
| Env.getCurrentEnv().getAnalysisManager() | ||
| .updateTaskStatus(info, AnalysisState.FINISHED, skipMsg, System.currentTimeMillis()); | ||
| // Signal job: this task produced no data but is done. Downstream |
There was a problem hiding this comment.
Marking the task FINISHED here is fine for user-facing status, but nothing removes this column from jobInfo.jobColumns. Later, when the job finishes, AnalysisManager.updateTableStats() passes the original jobInfo into TableStatsMeta.update(), which blindly creates/refreshes ColStatsMeta for every entry in jobColumns.
So a skipped column ends up looking freshly analyzed in metadata even though no row was inserted into column_statistics. That breaks downstream logic such as StatisticsUtil.needAnalyzeColumn(): the next auto/manual scheduling pass will see non-null ColStatsMeta and may decide the column is healthy, while SHOW COLUMN STATS still returns nothing.
This needs to exclude skipped columns from the job-level metadata update (for example by removing the pair from jobInfo.jobColumns when handling the skip, or by teaching updateTableStats() / TableStatsMeta.update() to ignore skip-finished tasks).
There was a problem hiding this comment.
Thanks. We investigated this concern and believe it is not an actual issue:
needAnalyzeColumn (StatisticsUtil.java:902) does NOT use ColStatsMeta != null as a healthy signal on its own. When ColStatsMeta exists, it compares rowCount / updatedRows against the current table values:
changeRate = |currentRowCount - lastAnalyzeRowCount| / lastAnalyzeRowCount * 100- re-analyze is triggered only when
changeRate > (100 - health_threshold)
So after a skip:
- If data is unchanged, the skipped column will correctly not be re-analyzed (it would just skip again, wasted work).
- If the table changes materially (rows added/removed or updates beyond threshold), the column IS re-analyzed → goes through the skip path again → user sees an up-to-date skip reason in
SHOW ANALYZE.
SHOW COLUMN STATS <col> returning nothing is the correct observable state for a skipped column (there is no row in column_statistics); the skip reason is visible via SHOW ANALYZE at the job/task level.
Therefore removing the column from jobInfo.jobColumns would actually degrade behavior (we would re-run the full analyze pipeline on every subsequent trigger just to hit the same assert). Leaving the ColStatsMeta shell in place is the desired behavior.
FE UT Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
### What problem does this PR solve? Fix two regression test failures surfaced by CI for PR apache#62686: 1. statistics/analyze_stats.groovy test_analyze: the suite inserts a >1800 byte string into a string column, triggering the new long-string skip guard so no stats row is produced. Shorten the inserted value to exactly 1024 chars so it passes the default guard and the stored min/max equal the value itself. 2. external_table_p0/hive/test_hive_analyze_long_string.groovy: the test referenced hive table test_analyze_long_string_db.t1 which was not seeded. Make the test self-contained: create the hive DB/table and insert the long-string row via hive_docker at the start, drop them at the end. ### Release note None ### Check List (For Author) - Test: Regression test - Behavior changed: No - Does this need documentation: No Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
run buildall |
|
/review |
There was a problem hiding this comment.
I found 2 blocking issues.
- A skipped re-analyze does not clear any previously persisted table-level stats for that column, so stale
column_statisticsrows stay visible to bothSHOW COLUMN STATSand the planner. - The new skip lifecycle drives
updateTaskStatus(FINISHED, ...)twice for the same task, which overwrites the original runtime intime_cost_in_ms.
Checkpoint conclusions:
- Goal / tests: The PR mostly achieves its goal for fresh tables, and the added unit/regression tests cover full/sample/sync/external happy paths. It does not yet prove correctness when a column already had stats before the skip path is taken, and it does not cover skipped-task timing metadata.
- Scope / minimality: The change is fairly focused and reuses the existing analyze infrastructure.
- Concurrency: I did not find a new lock-order problem, but the async job lifecycle now has a double-
FINISHEDpath for skipped tasks. - Lifecycle: The new skip path is the main risk area. It marks tasks
FINISHEDwithout producing data, which needs extra cleanup/idempotency for persisted stats and task metadata. - Config:
statistics_max_string_column_lengthis read at execution time, so the FE-side config appears dynamically effective. - Compatibility: I did not find a new FE/BE protocol or storage-format compatibility problem.
- Parallel paths: Full/sample OLAP and SQL-based external-table paths are covered. Partition analyze remains intentionally unprotected and is documented as such; I am not blocking on that specific tradeoff here.
- Special conditions/comments: The new comments around partition analyze clearly explain the tradeoff.
- Test coverage: Good fresh-table coverage, but missing the stale-existing-stats case and the repeated-
FINISHEDtiming case. - Test results: I only performed code review; I did not run tests locally.
- Observability: Skip reasons are surfaced in
SHOW ANALYZEand sync OK messages, which is good. - Persistence / data correctness: This is where the blocking issue is: skip does not invalidate old persisted column stats.
- Data writes / atomicity: No new journal issue found, but the user-visible stats state is not corrected when a column transitions from “has stats” to “skipped”.
- FE/BE variable passing: No new transmitted FE/BE variables.
- Performance: The guard SQL itself is acceptable for the intended protection.
- Other issues: see inline comments.
ee5f6bd to
519eeab
Compare
|
run buildall |
|
/review |
There was a problem hiding this comment.
Findings:
OlapAnalysisTask#doSample()is still not fully protected: it callscollectMinMax()first, andBASIC_STATS_TEMPLATEstill scans the target string column without the new guard. A sample analyze on a long-string column therefore still executes an unguarded full-table stats query before the task can be converted into a skip.- The new
statistics_max_string_column_lengthcontract overpromises coverage. Whenenable_partition_analyzeis on, partitioned OLAP full analyze still routes throughdoPartitionTable()/PARTITION_ANALYZE_TEMPLATEwithout the guard, and auto analyze uses that path too once partition analyze is enabled. - Skipped re-analyzes leave stats state inconsistent. The skip path does not invalidate/drop any existing
column_statisticsrow, butupdateTableStats()still refreshesColStatsMetafor the skipped column. That can leave stale old stats visible while metadata says the column was freshly analyzed.
Critical checkpoint conclusions:
- Goal/task: Partially accomplished. The main full/sample SQL paths and external SQL fallback now surface skip messages, but sample
collectMinMax()and partition analyze still miss the protection. - Scope/focus: The change is reasonably focused on FE statistics collection and tests.
- Concurrency: I did not find a new locking/threading bug in the skip-state transition; it stays on the existing
AnalysisJobsynchronization path. - Lifecycle/static init: No special lifecycle or static-initialization issue found. Removing
afterExecution()is safe because nothing overrides it. - Configuration: The new config is dynamically read where needed, but its user-visible description is inaccurate for partition analyze.
- Compatibility: No FE/BE protocol or storage-format compatibility issue found.
- Parallel code paths: Not all relevant analyze paths were updated (
collectMinMax()and partition analyze remain outside the guard). - Conditional checks: The guard itself is clear, but the surrounding state handling for skipped columns is still inconsistent with persisted stats rows.
- Test coverage: Positive-path coverage is good, but coverage is still missing for the unguarded sample
collectMinMax()path and for the "previous stats row exists, then re-analyze skips" case. - Observability:
SHOW ANALYZE/ task messages are a good improvement. - Persistence/metadata: Skipped columns still refresh table-level metadata without writing replacement stats rows.
- Performance: Sample analyze still does an unguarded full-table min/max scan on long-string columns.
519eeab to
b979556
Compare
|
Amended: fixed AnalysisJobTest.testUpdateTaskState NPE (mocked task has null Verified locally:
run buildall |
|
/review |
There was a problem hiding this comment.
I found two blocking issues that keep this PR from delivering the advertised behavior end-to-end.
- Skipped columns are still recorded as analyzed in FE metadata even though no stats row is written.
- OLAP sample analyze still performs an unguarded MIN/MAX pre-scan before the new skip guard can fire.
Critical checkpoint conclusions:
- Goal / correctness: partially met. Full and external guarded paths are covered, but the two issues above still break correctness.
- Change size / focus: reasonably small and focused.
- Concurrency: I did not find a new locking or thread-safety regression in the touched FE paths.
- Config:
statistics_max_string_column_lengthis wired as a mutable master-only FE config and the reads are dynamic. - Parallel code paths: not fully covered; OLAP sample analyze still has an unguarded path.
- Tests: regression and unit coverage were added, but they miss both the metadata divergence and the unguarded sample pre-scan.
- Persistence / data correctness: skip messages are preserved correctly, but skipped columns still diverge between
column_statisticsandTableStatsMeta.
### What problem does this PR solve? Issue Number: close #xxx Problem Summary: In the long-string skip path, BaseAnalysisTask.handleSkip previously called AnalysisManager.updateTaskStatus(FINISHED, skipMsg) first and then routed through job.taskDoneWithoutData, which eventually caused AnalysisJob.flushBuffer to call updateTaskState(FINISHED, "") again on the same task. The second call overwrote AnalysisInfo.timeCostInMs with the near-zero delta between the two FINISHED timestamps, so SHOW ANALYZE TASK STATUS reported the wrong cost for skipped tasks, and logCreateAnalysisTask was invoked twice. Fix: handleSkip now only stashes the skip message on info.message and calls taskDoneWithoutData. AnalysisJob.updateTaskState falls back to task.info.message when the outer message is empty on the FINISHED path, so the single FINISHED transition still propagates the skip reason into job.message for SHOW ANALYZE visibility. This makes the state transition idempotent and keeps timing/persistence accurate. ### Release note None ### Check List (For Author) - Test: existing regression tests (test_analyze_long_string, test_hive_analyze_long_string) - Behavior changed: No (observable state/message unchanged; only fixes the time_cost_in_ms reported for skipped tasks) - Does this need documentation: No Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
b979556 to
87ce02e
Compare
|
run buildall |
|
/review |
|
PR approved by at least one committer and no changes requested. |
There was a problem hiding this comment.
Found blocking correctness gaps in the new long-string skip logic.
Critical checkpoint conclusions:
- Goal / correctness: Partially met. The PR skips long string columns on the main full/sample templates and surfaces a reason, but it still leaves two real execution paths unprotected: OLAP sample
collectMinMax()and all partition-analyze paths (enable_partition_analyze=true). Those paths can still evaluate oversized strings and hit the original OOM. - Small / focused change: Reasonably focused, but the implementation is not applied consistently across functionally parallel analyze paths.
- Concurrency: No new lock-order or thread-safety problem found. The new logic reuses existing task/job state transitions.
- Lifecycle / state management: Job/task lifecycle is mostly consistent after the follow-up
FINISHEDfix, but skipped tasks still flow through normal table-stats finalization and currently poison per-column metadata. - Config changes:
statistics_max_string_column_lengthis a mutable master-only FE config. Dynamic read is fine, but its documented user-visible guarantee is currently incomplete because partition analyze bypasses it. - Compatibility: No FE/BE protocol or storage-format compatibility issue found.
- Parallel code paths: Not fully covered.
PARTITION_ANALYZE_TEMPLATEand OLAP samplecollectMinMax()need the same protection or an explicit safe fallback. - Conditional checks / comments: The new NOTE explains why partition analyze is excluded, but documentation alone does not prevent the original failure mode.
- Test coverage: Good coverage for the happy-path skip behavior, DUJ1 forcing, HMS, sync message propagation, and duplicate-FINISHED fix. Missing coverage for the unguarded sample
collectMinMax()path and for metadata behavior after a skip. Current regression tests also codify the partition-analyze gap instead of catching it. - Test results: New/updated tests are plausible; no handwritten
.outchanges involved. - Observability: Skip reasons are surfaced at task/job level and via sync OK packet; this part looks good.
- Transaction / persistence: No new edit-log format or FE/BE variable-passing change found. Persisted task/job states reuse existing mechanisms.
- Data writes / atomicity: Stats-table writes reuse existing paths. However, because skipped columns currently still participate in final
updateTableStats(job), the in-memoryTableStatsMetacan claim a skipped column was analyzed even though nocolumn_statisticsrow exists. - Performance: The extra inline view /
assert_trueprojection is acceptable in guarded paths; the more important issue is that the remaining unguarded paths still do the expensive work the feature is trying to avoid. - Other issues: None beyond the blockers called out inline.
Overall opinion: Request changes. The feature direction is good and the state/message handling fix is helpful, but the long-string protection is not yet end-to-end correct for all analyze paths and it currently leaves incorrect per-column metadata after a skip.
FE Regression Coverage ReportIncrement line coverage |
…62710) Cherry-pick of #62686 onto the \`adaptive_batch_size\` branch. ### What problem does this PR solve? Related PR: #62686 Problem Summary: When a string column contains extremely long values, running ANALYZE on it can consume excessive memory and may even OOM the BE. This PR introduces an inline per-row guard that short-circuits stats collection for any string column whose rows exceed \`Config.statistics_max_string_column_length\` (default 1024 bytes). The guard is injected into every statistics-collection SQL template (FULL_ANALYZE_TEMPLATE, LINEAR_ANALYZE_TEMPLATE, DUJ1_ANALYZE_TEMPLATE, HMS full/sample) as a \`NoneMovableFunction\`: \`\`\` assert_true(col IS NULL OR LENGTH(col) <= N, 'ANALYZE_SKIP_LONG_STRING_COLUMN') AS __lc \`\`\` When any row violates the predicate, \`assert_true\` triggers a controlled error carrying the \`ANALYZE_SKIP_LONG_STRING_COLUMN\` marker. \`BaseAnalysisTask\` walks the exception cause chain for this marker and, when detected, converts the failure into a FINISHED task with a skip message so users see why the column was dropped. Other columns in the same job are unaffected. Setting \`statistics_max_string_column_length = 0\` disables the guard entirely (upstream behavior). ### Release note Added FE config \`statistics_max_string_column_length\` (default 1024). String columns whose row byte length exceeds this value are skipped during ANALYZE with a visible skip message, instead of risking OOM. Set to 0 to disable. ### Check List (For Author) - Test: Regression test / Unit Test - Behavior changed: No - Does this need documentation: No Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…62720) ### What problem does this PR solve? Issue Number: N/A Related PR: #62710 Problem Summary: `AnalysisManagerTest` in the `adaptive_batch_size` branch is missing the `import org.mockito.Mockito;` statement needed by the Mockito-based test methods introduced in #62686/#62710. Without this import the FE build fails at compilation with `cannot find symbol: variable Mockito`. ### Release note None ### Check List (For Author) - Test: No need to test (compile fix only, import statement only) - Behavior changed: No - Does this need documentation: No Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
What problem does this PR solve?
Issue Number: close #0
Problem Summary: When a string column contains extremely long values, running
ANALYZE on it can consume excessive memory and may even OOM the BE. This PR
introduces an inline per-row guard that short-circuits stats collection for any
string column whose rows exceed
Config.statistics_max_string_column_length(default 1024 bytes).
The guard is injected into every statistics-collection SQL template
(FULL_ANALYZE_TEMPLATE, LINEAR_ANALYZE_TEMPLATE, DUJ1_ANALYZE_TEMPLATE,
HMS full/sample) as a
NoneMovableFunction:When any row violates the predicate,
assert_truetriggers a controlled errorcarrying the
ANALYZE_SKIP_LONG_STRING_COLUMNmarker.BaseAnalysisTaskwalks the exception cause chain for this marker and, when detected, converts
the failure into a FINISHED task with a skip message so users see why the
column was dropped (via SHOW ANALYZE and via OK-packet info for sync analyze).
Other columns in the same job are unaffected.
Setting
statistics_max_string_column_length = 0disables the guard entirely(upstream behavior).
Release note
Added FE config
statistics_max_string_column_length(default 1024). Stringcolumns whose row byte length exceeds this value are skipped during ANALYZE
with a visible skip message, instead of risking OOM. Set to 0 to disable.
Check List (For Author)
regression-test/suites/statistics/test_analyze_long_string.groovy(nonConcurrent, 6 cases covering full / sample LINEAR / partition / DUJ1 via
FE debug point / sync analyze / disabled config)
regression-test/suites/external_table_p0/hive/test_hive_analyze_long_string.groovydefault (threshold 1024 bytes). Existing workloads can restore the old
behavior by setting
statistics_max_string_column_length = 0.