[feature](statistics) Skip collecting stats for long string columns#62710
Conversation
…pache#62686) Issue Number: close #0 Problem Summary: When a string column contains extremely long values, running ANALYZE on it can consume excessive memory and may even OOM the BE. This PR introduces an inline per-row guard that short-circuits stats collection for any string column whose rows exceed `Config.statistics_max_string_column_length` (default 1024 bytes). The guard is injected into every statistics-collection SQL template (FULL_ANALYZE_TEMPLATE, LINEAR_ANALYZE_TEMPLATE, DUJ1_ANALYZE_TEMPLATE, HMS full/sample) as a `NoneMovableFunction`: assert_true(col IS NULL OR LENGTH(col) <= N, 'ANALYZE_SKIP_LONG_STRING_COLUMN') AS __lc When any row violates the predicate, `assert_true` triggers a controlled error carrying the `ANALYZE_SKIP_LONG_STRING_COLUMN` marker. `BaseAnalysisTask` walks the exception cause chain for this marker and, when detected, converts the failure into a FINISHED task with a skip message so users see why the column was dropped (via SHOW ANALYZE and via OK-packet info for sync analyze). Other columns in the same job are unaffected. Setting `statistics_max_string_column_length = 0` disables the guard entirely (upstream behavior). Added FE config `statistics_max_string_column_length` (default 1024). String columns whose row byte length exceeds this value are skipped during ANALYZE with a visible skip message, instead of risking OOM. Set to 0 to disable. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
run buildall |
|
/review |
There was a problem hiding this comment.
Found 2 blocking correctness issues.
Critical checkpoints:
- Goal / correctness: Not fully met. Full analyze is guarded, but sample analyze still runs an unguarded
collectMinMax()prepass, and skipped columns can leave stale existing stats active. - Small / focused: Mostly focused, but the end-to-end skip semantics are incomplete because task/job state is updated without reconciling persisted/cache state.
- Concurrency: No new concurrency or locking issue found in the touched code.
- Lifecycle / static init: No special lifecycle or static-init issue found.
- Configuration:
statistics_max_string_column_lengthis mutable/masterOnly and read at execution time, so dynamic-config behavior is fine. - Compatibility: No FE/BE protocol or storage-format compatibility issue found.
- Parallel paths: Not all relevant analyze paths were updated; the sample prepass is still outside the guard.
- Conditional checks: The marker-based skip logic is reasonable, but it is bypassed in the sample prepass.
- Test coverage: Added tests cover fresh-table full/sample/external cases, but they miss the stale-stats-after-skip case and the sample prepass before
assert_true. - Observability: Skip logging/messages are sufficient.
- Transaction / persistence / data writes: Problematic. A skipped column does not write new stats, but existing stats/cache are not cleared and table metadata is still refreshed as if analyze succeeded.
- FE-BE variable passing: Not applicable.
- Performance: No additional major performance concern beyond the intended extra guard/subquery.
Blocking findings are inline.
| params.put("index", getIndex()); | ||
| params.put("preAggHint", ""); | ||
| addLengthAssertParam(params); | ||
| return params; |
There was a problem hiding this comment.
This only guards the main LINEAR/DUJ1 query. doSample() still calls collectMinMax() first, and BASIC_STATS_TEMPLATE there does not include ${lengthAssert}. If that prepass is the one that hits the long-string failure you are trying to avoid, the task will still go down the normal failure path because no skip marker is ever produced. Please guard or short-circuit the collectMinMax() path as well so sample analyze consistently becomes FINISHED + skipped for over-limit string columns.
| // the task's own info.message when the outer msg is empty, so skipMsg | ||
| // reaches job.message for SHOW ANALYZE visibility. | ||
| job.taskDoneWithoutData(this); | ||
| } |
There was a problem hiding this comment.
handleSkip() turns the task into a no-data success, but it never clears any pre-existing stats for the skipped column. If this column was analyzed successfully before and a later run hits the long-string guard, the old row in column_statistics and the cached ColumnStatistic remain in place, while updateTableStats(job) later still refreshes TableStatsMeta for this column as if analyze had succeeded. After the user sees a skip message, SHOW COLUMN STATS / the optimizer can keep using stale stats from the previous run. This path needs to delete/invalidate the old stats (and avoid refreshing ColStatsMeta for skipped columns) before treating the task as finished.
…62720) ### What problem does this PR solve? Issue Number: N/A Related PR: #62710 Problem Summary: `AnalysisManagerTest` in the `adaptive_batch_size` branch is missing the `import org.mockito.Mockito;` statement needed by the Mockito-based test methods introduced in #62686/#62710. Without this import the FE build fails at compilation with `cannot find symbol: variable Mockito`. ### Release note None ### Check List (For Author) - Test: No need to test (compile fix only, import statement only) - Behavior changed: No - Does this need documentation: No Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Cherry-pick of #62686 onto the `adaptive_batch_size` branch.
What problem does this PR solve?
Related PR: #62686
Problem Summary: When a string column contains extremely long values, running ANALYZE on it can consume excessive memory and may even OOM the BE. This PR introduces an inline per-row guard that short-circuits stats collection for any string column whose rows exceed `Config.statistics_max_string_column_length` (default 1024 bytes).
The guard is injected into every statistics-collection SQL template (FULL_ANALYZE_TEMPLATE, LINEAR_ANALYZE_TEMPLATE, DUJ1_ANALYZE_TEMPLATE, HMS full/sample) as a `NoneMovableFunction`:
```
assert_true(col IS NULL OR LENGTH(col) <= N, 'ANALYZE_SKIP_LONG_STRING_COLUMN') AS __lc
```
When any row violates the predicate, `assert_true` triggers a controlled error carrying the `ANALYZE_SKIP_LONG_STRING_COLUMN` marker. `BaseAnalysisTask` walks the exception cause chain for this marker and, when detected, converts the failure into a FINISHED task with a skip message so users see why the column was dropped. Other columns in the same job are unaffected.
Setting `statistics_max_string_column_length = 0` disables the guard entirely (upstream behavior).
Release note
Added FE config `statistics_max_string_column_length` (default 1024). String columns whose row byte length exceeds this value are skipped during ANALYZE with a visible skip message, instead of risking OOM. Set to 0 to disable.
Check List (For Author)