[feature](statistics) Skip collecting stats for long string columns#62710

Merged
mrhhsg merged 1 commit into apache:adaptive_batch_size from yujun777:pick-62686-to-adaptive-batch-size on Apr 22, 2026

Conversation

@yujun777
Contributor

Cherry-pick of #62686 onto the `adaptive_batch_size` branch.

What problem does this PR solve?

Related PR: #62686

Problem Summary: When a string column contains extremely long values, running ANALYZE on it can consume excessive memory and may even OOM the BE. This PR introduces an inline per-row guard that short-circuits stats collection for any string column containing a value longer than `Config.statistics_max_string_column_length` (default 1024 bytes).

The guard is injected into every statistics-collection SQL template (FULL_ANALYZE_TEMPLATE, LINEAR_ANALYZE_TEMPLATE, DUJ1_ANALYZE_TEMPLATE, HMS full/sample) as a `NoneMovableFunction`:

```sql
assert_true(col IS NULL OR LENGTH(col) <= N, 'ANALYZE_SKIP_LONG_STRING_COLUMN') AS __lc
```

When any row violates the predicate, `assert_true` raises a controlled error carrying the `ANALYZE_SKIP_LONG_STRING_COLUMN` marker. `BaseAnalysisTask` walks the exception cause chain for this marker and, when it is found, converts the failure into a FINISHED task with a skip message (visible via SHOW ANALYZE, and via the OK-packet info for sync analyze) so users can see why the column was dropped. Other columns in the same job are unaffected.
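The marker detection described above can be sketched as a simple cause-chain walk. This is a minimal illustration only; `SkipMarkerDetector` and its method names are invented here and are not the actual `BaseAnalysisTask` code:

```java
// Illustrative sketch of detecting the skip marker in a wrapped exception.
// The class and method names are assumptions, not actual Doris APIs.
public class SkipMarkerDetector {
    static final String SKIP_MARKER = "ANALYZE_SKIP_LONG_STRING_COLUMN";

    // Walk the cause chain: the assert_true failure may be wrapped in
    // several layers of execution exceptions before it reaches the task.
    static boolean isLongStringSkip(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            String msg = cur.getMessage();
            if (msg != null && msg.contains(SKIP_MARKER)) {
                return true;
            }
        }
        return false;
    }
}
```

Walking the whole chain (rather than checking only the top-level message) matters because the BE error typically arrives wrapped in query-execution exceptions.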

Setting `statistics_max_string_column_length = 0` disables the guard entirely (upstream behavior).

Release note

Added FE config `statistics_max_string_column_length` (default 1024). String columns containing values whose byte length exceeds this limit are skipped during ANALYZE, with a visible skip message, instead of risking an OOM. Set it to 0 to disable.

Check List (For Author)

  • Test: Regression test / Unit Test
  • Behavior changed: No
  • Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@yujun777
Contributor Author

run buildall

@yujun777
Contributor Author

/review

@mrhhsg mrhhsg merged commit d2ebe1a into apache:adaptive_batch_size Apr 22, 2026
8 of 9 checks passed

@github-actions Bot left a comment


Found 2 blocking correctness issues.

Critical checkpoints:

  • Goal / correctness: Not fully met. Full analyze is guarded, but sample analyze still runs an unguarded collectMinMax() prepass, and skipped columns can leave stale existing stats active.
  • Small / focused: Mostly focused, but the end-to-end skip semantics are incomplete because task/job state is updated without reconciling persisted/cache state.
  • Concurrency: No new concurrency or locking issue found in the touched code.
  • Lifecycle / static init: No special lifecycle or static-init issue found.
  • Configuration: statistics_max_string_column_length is mutable/masterOnly and read at execution time, so dynamic-config behavior is fine.
  • Compatibility: No FE/BE protocol or storage-format compatibility issue found.
  • Parallel paths: Not all relevant analyze paths were updated; the sample prepass is still outside the guard.
  • Conditional checks: The marker-based skip logic is reasonable, but it is bypassed in the sample prepass.
  • Test coverage: Added tests cover fresh-table full/sample/external cases, but they miss the stale-stats-after-skip case and the sample prepass before assert_true.
  • Observability: Skip logging/messages are sufficient.
  • Transaction / persistence / data writes: Problematic. A skipped column does not write new stats, but existing stats/cache are not cleared and table metadata is still refreshed as if analyze succeeded.
  • FE-BE variable passing: Not applicable.
  • Performance: No additional major performance concern beyond the intended extra guard/subquery.

Blocking findings are inline.

```java
params.put("index", getIndex());
params.put("preAggHint", "");
addLengthAssertParam(params);
return params;
```
This only guards the main LINEAR/DUJ1 query. doSample() still calls collectMinMax() first, and BASIC_STATS_TEMPLATE there does not include ${lengthAssert}. If that prepass is the one that hits the long-string failure you are trying to avoid, the task will still go down the normal failure path because no skip marker is ever produced. Please guard or short-circuit the collectMinMax() path as well so sample analyze consistently becomes FINISHED + skipped for over-limit string columns.
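One way to realize this suggestion is to inject the same guard into the min/max prepass query. This is a hedged sketch: the query shape and all names below are modeled on the PR description, not on the actual `BASIC_STATS_TEMPLATE` or `collectMinMax()` code:

```java
// Hypothetical sketch: build the min/max prepass SQL with the same length
// guard so sample analyze also produces the skip marker instead of a
// generic failure. Names and template shape are illustrative assumptions.
public class PrepassGuard {
    static String guardedMinMaxSql(String table, String col, long maxLen) {
        String guard = maxLen > 0
                ? ", assert_true(`" + col + "` IS NULL OR LENGTH(`" + col + "`) <= "
                        + maxLen + ", 'ANALYZE_SKIP_LONG_STRING_COLUMN') AS __lc"
                : "";
        return "SELECT MIN(`" + col + "`) AS min_value, MAX(`" + col + "`) AS max_value"
                + guard + " FROM " + table;
    }
}
```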

```java
// the task's own info.message when the outer msg is empty, so skipMsg
// reaches job.message for SHOW ANALYZE visibility.
job.taskDoneWithoutData(this);
}
```
handleSkip() turns the task into a no-data success, but it never clears any pre-existing stats for the skipped column. If this column was analyzed successfully before and a later run hits the long-string guard, the old row in column_statistics and the cached ColumnStatistic remain in place, while updateTableStats(job) later still refreshes TableStatsMeta for this column as if analyze had succeeded. After the user sees a skip message, SHOW COLUMN STATS / the optimizer can keep using stale stats from the previous run. This path needs to delete/invalidate the old stats (and avoid refreshing ColStatsMeta for skipped columns) before treating the task as finished.
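The requested behavior amounts to evicting the column's previously stored statistic before finishing the skipped task. A toy model of the hazard (the map below stands in for `column_statistics` and the `ColumnStatistic` cache; all names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the stale-stats hazard: a skipped analyze must evict the
// previously stored statistic, otherwise readers keep seeing the old value
// even after the skip message is shown.
public class ColumnStatsStore {
    private final Map<String, Double> ndvByColumn = new HashMap<>();

    void save(String col, double ndv) { ndvByColumn.put(col, ndv); }

    Double lookup(String col) { return ndvByColumn.get(col); }

    // Called when the long-string guard skips the column: drop the old
    // statistic instead of leaving it active alongside the skip message.
    void invalidateOnSkip(String col) { ndvByColumn.remove(col); }
}
```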

mrhhsg pushed a commit that referenced this pull request Apr 22, 2026
…62720)

### What problem does this PR solve?

Issue Number: N/A

Related PR: #62710

Problem Summary: `AnalysisManagerTest` in the `adaptive_batch_size`
branch is missing the `import org.mockito.Mockito;` statement needed by
the Mockito-based test methods introduced in #62686/#62710. Without this
import the FE build fails at compilation with `cannot find symbol:
variable Mockito`.

### Release note

None

### Check List (For Author)

- Test: No need to test (compile fix only, import statement only)
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>