[feature](statistics) Skip collecting stats for long string columns#62710

Merged
mrhhsg merged 1 commit into apache:adaptive_batch_size from yujun777:pick-62686-to-adaptive-batch-size on Apr 22, 2026

Conversation

@yujun777
Contributor

Cherry-pick of #62686 onto the `adaptive_batch_size` branch.

What problem does this PR solve?

Related PR: #62686

Problem Summary: When a string column contains extremely long values, running ANALYZE on it can consume excessive memory and may even OOM the BE. This PR introduces an inline per-row guard that short-circuits stats collection for any string column containing a value longer than `Config.statistics_max_string_column_length` (default 1024 bytes).

The guard is injected into every statistics-collection SQL template (FULL_ANALYZE_TEMPLATE, LINEAR_ANALYZE_TEMPLATE, DUJ1_ANALYZE_TEMPLATE, HMS full/sample) as a `NoneMovableFunction`:

```sql
assert_true(col IS NULL OR LENGTH(col) <= N, 'ANALYZE_SKIP_LONG_STRING_COLUMN') AS __lc
```

When any row violates the predicate, `assert_true` raises a controlled error carrying the `ANALYZE_SKIP_LONG_STRING_COLUMN` marker. `BaseAnalysisTask` walks the exception cause chain for this marker and, when it is found, converts the failure into a FINISHED task with a skip message (visible via SHOW ANALYZE, and via the OK-packet info for sync analyze) so users can see why the column was dropped. Other columns in the same job are unaffected.
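The marker detection described above can be sketched as a simple cause-chain walk. This is a minimal illustration only; `SkipMarkerDetector` and its method names are invented here and are not the actual `BaseAnalysisTask` code:

```java
// Illustrative sketch of detecting the skip marker in a wrapped exception.
// The class and method names are assumptions, not actual Doris APIs.
public class SkipMarkerDetector {
    static final String SKIP_MARKER = "ANALYZE_SKIP_LONG_STRING_COLUMN";

    // Walk the cause chain: the assert_true failure may be wrapped in
    // several layers of execution exceptions before it reaches the task.
    static boolean isLongStringSkip(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            String msg = cur.getMessage();
            if (msg != null && msg.contains(SKIP_MARKER)) {
                return true;
            }
        }
        return false;
    }
}
```

Walking the whole chain (rather than checking only the top-level message) matters because the BE error typically arrives wrapped in query-execution exceptions.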

Setting `statistics_max_string_column_length = 0` disables the guard entirely (upstream behavior).

Release note

Added FE config `statistics_max_string_column_length` (default 1024). String columns containing values whose byte length exceeds this limit are skipped during ANALYZE, with a visible skip message, instead of risking an OOM. Set it to 0 to disable.

Check List (For Author)

  • Test: Regression test / Unit Test
  • Behavior changed: No
  • Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@yujun777
Contributor Author

run buildall

@yujun777
Contributor Author

/review

@mrhhsg mrhhsg merged commit d2ebe1a into apache:adaptive_batch_size Apr 22, 2026
8 of 9 checks passed

@github-actions Bot left a comment


Found 2 blocking correctness issues.

Critical checkpoints:

  • Goal / correctness: Not fully met. Full analyze is guarded, but sample analyze still runs an unguarded collectMinMax() prepass, and skipped columns can leave stale existing stats active.
  • Small / focused: Mostly focused, but the end-to-end skip semantics are incomplete because task/job state is updated without reconciling persisted/cache state.
  • Concurrency: No new concurrency or locking issue found in the touched code.
  • Lifecycle / static init: No special lifecycle or static-init issue found.
  • Configuration: statistics_max_string_column_length is mutable/masterOnly and read at execution time, so dynamic-config behavior is fine.
  • Compatibility: No FE/BE protocol or storage-format compatibility issue found.
  • Parallel paths: Not all relevant analyze paths were updated; the sample prepass is still outside the guard.
  • Conditional checks: The marker-based skip logic is reasonable, but it is bypassed in the sample prepass.
  • Test coverage: Added tests cover fresh-table full/sample/external cases, but they miss the stale-stats-after-skip case and the sample prepass before assert_true.
  • Observability: Skip logging/messages are sufficient.
  • Transaction / persistence / data writes: Problematic. A skipped column does not write new stats, but existing stats/cache are not cleared and table metadata is still refreshed as if analyze succeeded.
  • FE-BE variable passing: Not applicable.
  • Performance: No additional major performance concern beyond the intended extra guard/subquery.

Blocking findings are inline.

```java
params.put("index", getIndex());
params.put("preAggHint", "");
addLengthAssertParam(params);
return params;
```
This only guards the main LINEAR/DUJ1 query. doSample() still calls collectMinMax() first, and BASIC_STATS_TEMPLATE there does not include ${lengthAssert}. If that prepass is the one that hits the long-string failure you are trying to avoid, the task will still go down the normal failure path because no skip marker is ever produced. Please guard or short-circuit the collectMinMax() path as well so sample analyze consistently becomes FINISHED + skipped for over-limit string columns.
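One way to realize this suggestion is to inject the same guard into the min/max prepass query. This is a hedged sketch: the query shape and all names below are modeled on the PR description, not on the actual `BASIC_STATS_TEMPLATE` or `collectMinMax()` code:

```java
// Hypothetical sketch: build the min/max prepass SQL with the same length
// guard so sample analyze also produces the skip marker instead of a
// generic failure. Names and template shape are illustrative assumptions.
public class PrepassGuard {
    static String guardedMinMaxSql(String table, String col, long maxLen) {
        String guard = maxLen > 0
                ? ", assert_true(`" + col + "` IS NULL OR LENGTH(`" + col + "`) <= "
                        + maxLen + ", 'ANALYZE_SKIP_LONG_STRING_COLUMN') AS __lc"
                : "";
        return "SELECT MIN(`" + col + "`) AS min_value, MAX(`" + col + "`) AS max_value"
                + guard + " FROM " + table;
    }
}
```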

```java
// the task's own info.message when the outer msg is empty, so skipMsg
// reaches job.message for SHOW ANALYZE visibility.
job.taskDoneWithoutData(this);
}
```
handleSkip() turns the task into a no-data success, but it never clears any pre-existing stats for the skipped column. If this column was analyzed successfully before and a later run hits the long-string guard, the old row in column_statistics and the cached ColumnStatistic remain in place, while updateTableStats(job) later still refreshes TableStatsMeta for this column as if analyze had succeeded. After the user sees a skip message, SHOW COLUMN STATS / the optimizer can keep using stale stats from the previous run. This path needs to delete/invalidate the old stats (and avoid refreshing ColStatsMeta for skipped columns) before treating the task as finished.
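The requested behavior amounts to evicting the column's previously stored statistic before finishing the skipped task. A toy model of the hazard (the map below stands in for `column_statistics` and the `ColumnStatistic` cache; all names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the stale-stats hazard: a skipped analyze must evict the
// previously stored statistic, otherwise readers keep seeing the old value
// even after the skip message is shown.
public class ColumnStatsStore {
    private final Map<String, Double> ndvByColumn = new HashMap<>();

    void save(String col, double ndv) { ndvByColumn.put(col, ndv); }

    Double lookup(String col) { return ndvByColumn.get(col); }

    // Called when the long-string guard skips the column: drop the old
    // statistic instead of leaving it active alongside the skip message.
    void invalidateOnSkip(String col) { ndvByColumn.remove(col); }
}
```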

mrhhsg pushed a commit that referenced this pull request Apr 22, 2026
…62720)

### What problem does this PR solve?

Issue Number: N/A

Related PR: #62710

Problem Summary: `AnalysisManagerTest` in the `adaptive_batch_size`
branch is missing the `import org.mockito.Mockito;` statement needed by
the Mockito-based test methods introduced in #62686/#62710. Without this
import the FE build fails at compilation with `cannot find symbol:
variable Mockito`.

### Release note

None

### Check List (For Author)

- Test: No need to test (compile fix only, import statement only)
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>