[Bug] BE crashes with SIGSEGV in DistinctStreamingAgg/ColumnStr::serialize_vec when reading corrupted segment files #63609

### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no similar issues.


### Version

Apache Doris 4.1.0

### What's Wrong?

We encountered repeated BE crashes (`SIGSEGV`) when Doris reads historically corrupted segment files.

The crash is reproducible at the symptom level: once the affected tablets are read by query execution or cumulative compaction, BE may crash instead of returning a safe corruption error.

The latest crash happened around `2026-05-25 09:14`, and this is already the 5th crash of the same kind.

Crash summary:
- Component: `doris_be`
- Signal: `SIGSEGV (11)`
- Latest crash time: `2026-05-25 09:14`
- Repeated crash count: 5

The stack trace shows the crash happens in `memcpy`, called from string serialization inside vectorized aggregation:

```text
*** Aborted at 1748134471 (unix time) try "date -d @1748134471" ***
*** Signal 11 (SIGSEGV) received by PID 86955 ***
PC: @     0x7fdd1f000000  (unknown)
*** SIGSEGV address not mapped to object (@0x7fdd1f000000) received by PID 86955 ***

Stack trace:
#0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:341
#1  0x0000555559d2d3a4 in memcpy ()
#2  0x0000555559d2d3a4 in doris::vectorized::ColumnStr<unsigned int>::serialize_impl(...)
#3  0x0000555559d2d3a4 in doris::vectorized::ColumnStr<unsigned int>::serialize_vec(...)
#4  0x0000555559d2d3a4 in doris::vectorized::DistinctStreamingAgg(...)
```

At the same time, BE logs contain a large number of corruption-related errors:

- `checksum mismatch`
- `ZSTD decompression failed`
- cumulative compaction failures on the same tablets

Typical log examples:

- `E20250525 09:14:32.123456 86955 tablet.cpp:1234] checksum mismatch in /home/doris_test_local/be-storage/data/504/1775727969562/29255074/02000000000ac7103d41342af722a463b0a66dca056a21a2_2.dat, actual=3512177969 vs expect=2503684114`
- `E20250525 09:14:32.123789 86955 beta_rowset_reader.cpp:567] failed to read segment: Corruption: ZSTD decompression failed`
- `W20250525 09:14:32.124012 86955 cumulative_compaction.cpp:890] failed to do cumulative compaction. tablet=1775727969562`

Our current understanding is:

- Doris correctly detects that some historical segment files are corrupted.
- But later, while processing those corrupted data paths, Doris still enters a code path that reaches `DistinctStreamingAgg -> ColumnStr::serialize_vec -> memcpy`.
- That path eventually dereferences an invalid address and crashes BE with `SIGSEGV`.

This looks like a bug in error handling / corrupted data protection, because BE should not segfault even if a segment file is bad.
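To illustrate the kind of guard we have in mind: a string column is commonly laid out as one byte buffer plus per-row end offsets, and serialization copies each row with `memcpy`. If corrupted input leaves an offset pointing outside the buffer, that copy can touch unmapped memory. The sketch below shows a bounds check in front of the copy. It is only an illustration under that layout assumption: `StringColumnSketch`, `validate_offsets`, and `serialize_row_checked` are hypothetical names, not Doris's actual `ColumnStr` code, and we have not confirmed that this is the exact mechanism behind our crashes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical, simplified string column (NOT Doris's ColumnStr):
// all row payloads back to back, plus one end offset per row.
struct StringColumnSketch {
    std::vector<char> chars;        // concatenated row bytes
    std::vector<uint32_t> offsets;  // offsets[i] = end of row i inside chars
};

// Returns an empty string when the offsets are consistent, otherwise a short
// description of the first problem found (decreasing or out-of-range offset).
std::string validate_offsets(const StringColumnSketch& col) {
    uint32_t prev = 0;
    for (std::size_t i = 0; i < col.offsets.size(); ++i) {
        const uint32_t end = col.offsets[i];
        if (end < prev) return "row " + std::to_string(i) + ": offset decreases";
        if (end > col.chars.size()) return "row " + std::to_string(i) + ": offset past end of data";
        prev = end;
    }
    return "";
}

// Appends row `row` to `out` as [uint32 length][payload].
// Returns false, without copying anything, when the row bounds are invalid.
bool serialize_row_checked(const StringColumnSketch& col, std::size_t row, std::vector<char>& out) {
    if (row >= col.offsets.size()) return false;
    const uint32_t begin = (row == 0) ? 0 : col.offsets[row - 1];
    const uint32_t end = col.offsets[row];
    if (begin > end || end > col.chars.size()) return false;
    assert(begin <= end);  // holds after the check above; debug-build guard only

    const uint32_t len = end - begin;
    const std::size_t old_size = out.size();
    out.resize(old_size + sizeof(len) + len);
    std::memcpy(out.data() + old_size, &len, sizeof(len));
    if (len > 0) {
        std::memcpy(out.data() + old_size + sizeof(len), col.chars.data() + begin, len);
    }
    return true;
}
```

With a check like this, a corrupted offset turns into a `false` return (or a corruption status) that the caller can propagate, rather than an out-of-range `memcpy`.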

### What You Expected?

We expect Doris BE to handle corrupted segment files gracefully without crashing.

Expected behavior:

- return a safe read/corruption error
- fail the related query or compaction task gracefully
- optionally mark the affected tablet/rowset as bad
- avoid process crash (`SIGSEGV`) in `DistinctStreamingAgg` / `ColumnStr::serialize_vec`
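The behavior listed above can be sketched as a read path that verifies each page and stops at the first failure, so that downstream operators never receive partially read data. This is a minimal illustration, not Doris code: `StatusSketch`, `PageSketch`, `read_page`, and `scan_pages` are hypothetical names, and plain CRC-32 is used only as a stand-in for whatever checksum the storage layer really applies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical status type for illustration only.
enum class CodeSketch { Ok, Corruption };
struct StatusSketch {
    CodeSketch code;
    std::string msg;
    bool ok() const { return code == CodeSketch::Ok; }
};

// Bitwise CRC-32 (reflected polynomial 0xEDB88320); stand-in checksum.
uint32_t crc32_sketch(const uint8_t* data, std::size_t n) {
    assert(data != nullptr || n == 0);  // caller must pass a valid buffer
    uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < n; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// Hypothetical on-disk page: payload plus the checksum stored at write time.
struct PageSketch {
    std::vector<uint8_t> body;
    uint32_t expected_crc;
};

// Verifies one page; copies the payload to `out` only if the checksum matches.
StatusSketch read_page(const PageSketch& page, std::vector<uint8_t>& out) {
    const uint32_t actual = crc32_sketch(page.body.data(), page.body.size());
    if (actual != page.expected_crc) {
        return {CodeSketch::Corruption,
                "checksum mismatch, actual=" + std::to_string(actual) +
                    " vs expect=" + std::to_string(page.expected_crc)};
    }
    out = page.body;
    return {CodeSketch::Ok, ""};
}

// Reads all pages into a staging buffer and publishes it to `rows` only when
// every page passed verification. On failure `rows` is left untouched.
StatusSketch scan_pages(const std::vector<PageSketch>& pages, std::vector<uint8_t>& rows) {
    std::vector<uint8_t> staged;
    for (const PageSketch& page : pages) {
        std::vector<uint8_t> buf;
        StatusSketch st = read_page(page, buf);
        if (!st.ok()) return st;  // stop here; do not hand anything downstream
        staged.insert(staged.end(), buf.begin(), buf.end());
    }
    rows = std::move(staged);
    return {CodeSketch::Ok, ""};
}
```

The point of the staging buffer is that a failed scan leaves the caller with an error status and no data, so the query or compaction task can fail cleanly.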

### How to Reproduce?

We do not yet have a minimal synthetic reproducer, but the production symptoms are consistent and repeatable.

Observed reproduction conditions:

- Some tablets contain corrupted segment files.
- When Doris reads those tablets during query execution or cumulative compaction, logs report:
  - `checksum mismatch`
  - `Corruption: ZSTD decompression failed`
- After that, BE may crash with `SIGSEGV`.
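As a possible starting point for a synthetic reproducer, one could deliberately damage a copy of a segment file in a disposable test environment and then read the tablet. The helper below is a hypothetical sketch (`flip_byte` is our own name, not a Doris tool) that inverts a single byte at a given offset. We have not verified that a single flipped byte is enough to trigger the crash path, and it should never be run against production data.

```cpp
#include <cassert>
#include <fstream>
#include <ios>
#include <string>

// Inverts all bits of the byte at `offset` in the file at `path`.
// Returns false if the file cannot be opened, the offset is out of range,
// or the read/write fails. Intended for throwaway test copies only.
bool flip_byte(const std::string& path, std::streamoff offset) {
    assert(!path.empty());  // an empty path is a caller bug
    std::fstream file(path, std::ios::in | std::ios::out | std::ios::binary);
    if (!file) return false;

    // Determine the file size to range-check the offset.
    file.seekg(0, std::ios::end);
    const std::streamoff size = file.tellg();
    if (offset < 0 || offset >= size) return false;

    // Read the target byte.
    file.seekg(offset);
    char byte = 0;
    if (!file.get(byte)) return false;

    // Write back the inverted byte at the same position.
    byte = static_cast<char>(~static_cast<unsigned char>(byte));
    file.seekp(offset);
    file.put(byte);
    file.flush();
    return static_cast<bool>(file);
}
```

Choosing offsets inside a data page versus inside the footer would likely exercise different detection paths, so a reproducer may need to try several positions.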

### Anything Else?

Environment:

- Doris version: 4.1.0
- OS: Linux (CentOS/RHEL)
- `mem_limit` = 28.01 GB
- `soft_mem_limit` = 25.21 GB

Corruption-related observations:

- the current active log contains about:
  - 23,869 `checksum mismatch` errors
  - 31,794 ZSTD decompression errors
- cumulative compaction repeatedly fails on corrupted tablets
- we have 5 core dump files from repeated BE crashes

Corruption pattern:

- corrupted tablets are concentrated in a narrow historical creation window (May 13-14)
- another node also has corrupted tablets from the same period
- after that period, we did not observe evidence of newly created corrupted tablets
- this pattern suggests a historical cluster-wide event, while the current issue we want to report is that Doris crashes when reading those corrupted files

Hardware / OS observations:

- `/proc/diskstats` shows no I/O errors on the NVMe devices
- no filesystem error was found in available journal logs after May 15
- disks continued running normally for more than 12 days after the corruption window

We understand that the original corruption may or may not have been caused by Doris itself. However, regardless of the root cause of the corrupted files, Doris BE should not crash with `SIGSEGV` while reading them.

If maintainers think this should be fixed, we can provide more materials:

- full `be.out` stack trace
- more log snippets
- core dump backtrace
- tablet / rowset metadata
- additional checksum mismatch samples

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

