Skip to content

fix: string literals longer than 48 chars crash query pipeline (issue 345)#350

Merged
shirly121 merged 8 commits into
mainfrom
fix/issue-345
May 20, 2026
Merged

fix: string literals longer than 48 chars crash query pipeline (issue 345)#350
shirly121 merged 8 commits into
mainfrom
fix/issue-345

Conversation

@ShunyangLi
Copy link
Copy Markdown
Collaborator

@ShunyangLi ShunyangLi commented May 13, 2026

What do these changes do?

修复 CALL <procedure>(<string literal>, ...) 在字符串字面量超过 48 字节时段错误的问题。

根因

1. InMemOverflowBuffer::allocateNewBlock 是一个空实现。
当字符串字面量超过 48 字节的 short-string 内联缓冲区时,StringVector::addString 会向 overflow buffer 申请新块,但该函数什么都不做,紧随其后的 allocateOverflow 写入便会破坏内存。原先这里直接 THROW
掩盖了这一问题——一旦长字符串路径被允许走通(见下条),这个 stub 才第一次被真正调用到。

2. Binder::bindTableFunc 对裸 LITERAL 参数做了多余的 fold。
fold 会让字面量绕道经过一次 ValueVector 物化,而 JSON_SCANSTRING 参数在这条路径上会命中 LogicalType::getPhysicalType 中的 NEUG_UNREACHABLEtypes.cpp:990)。裸字面量本身已经持有最终值,只有当隐式 cast 将其包成 CAST 函数表达式时,fold
才是必要的。

改动内容

  • include/neug/compiler/common/in_mem_overflow_buffer.h
    BufferBlock 不再包装 storage::MemoryBuffer,改为直接持有一个在构造时按指定大小分配的 std::unique_ptr<uint8_t[]>size() / data() 改为内联。

  • src/compiler/common/in_mem_overflow_buffer.cpp
    实现 allocateNewBlock:按 max(请求大小, OVERFLOW_BLOCK_DEFAULT_SIZE) 分配一个 BufferBlock 并 push 进 blocks。这样小的 overflow 字符串可以共用同一块 4 KiB 内存,而超大请求仍能拿到一块刚好够用的独立分配。

  • src/compiler/common/vector/value_vector.cpp
    StringVector::addString 入口处校验 LogicalType::getMaxStringMaxLen() 上限,超长直接抛带文件/行号的异常,避免恶意或异常输入把单次 overflow 分配撑到任意大小。

  • src/compiler/binder/bind/bind_table_function.cpp
    跳过对裸 LITERAL 参数的 foldExpression;仅当隐式 cast 已经把表达式包成 CAST 包装时才继续 fold。

相关配置

  • Overflow 块的默认大小OVERFLOW_BLOCK_DEFAULT_SIZE = 4096 字节(in_mem_overflow_buffer.cpp 中的文件内常量)。
  • 单个字符串的最大长度65536 字节(64 KiB),取自 LogicalType::getMaxStringMaxLen()include/neug/compiler/common/types/types.h:349),与 VARCHAR 修饰符校验所用的上限一致。超出该长度的输入会抛出明确的异常,而不再破坏 overflow buffer。

Related issue number

Fixes [#345 ]

@ShunyangLi ShunyangLi requested a review from shirly121 May 13, 2026 07:50
@ShunyangLi ShunyangLi added bug Something isn't working compiler Compiler infrastructure labels May 13, 2026
positionalParams[i], inputTypes[i]);
// Skip fold on bare LITERAL: fold materializes through a ValueVector
// whose overflow allocator (InMemOverflowBuffer::allocateNewBlock) is
// an unimplemented stub, so string literals longer than
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

直接把48这个限制去掉吗?你得了解下为啥之前有这个限制,然后看是否是增加一个更大的限制,直接去掉显得不太妥当。

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

调研结论

48 这个数字到底是什么?

它不是一个被「设计」出来的限制,而是 neug_string_t 的 SSO(small-string-optimization)阈值意外暴露出来的崩溃点。

  • neug_string_t::SHORT_STR_LENGTH = PREFIX_LENGTH(16) + INLINED_SUFFIX_LENGTH(32) = 48 (include/neug/compiler/common/types/neug_string.h:34-38)。≤48 字节的字符串走内联存储,不需要分配;>48 字节走 overflow buffer。
  • overflow buffer 分配链路:StringVector::addString → StringAuxiliaryBuffer::allocateOverflow → InMemOverflowBuffer::allocateSpace → allocateNewBlock
  • InMemOverflowBuffer::allocateNewBlock(uint64_t) {} 是个空函数桩(src/compiler/common/in_mem_overflow_buffer.cpp:64)。allocateSpace
    调用完它之后 blocks 仍然为空,紧接着 currentBlock()->data() 在空 vector 上调 back() → UB / 段错误。

所以「>48 字符崩溃」是底层 overflow allocator 未实现导致的 bug,不是任何业务限制。SHORT_STR_LENGTH 只是恰好成了「能走通的最大字符串长度」。

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我理解48不算一个写死的限制,超过48还会走overflow buffer,只是overflow buffer部分实现有bug (InMemOverflowBuffer::allocateNewBlock(uint64_t) {} 是个空函数桩),这个限制我会complier那边同步一下看看要不要添加一个大小限制

Copy link
Copy Markdown
Collaborator

@longbinlai longbinlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

请补充一个测试用例到neug-test。另外,如果你调研下来这个 长度限制是有必要的,需要把这个限制补充到cypher文档关于 call 调用部分的内容

ShunyangLi and others added 5 commits May 14, 2026 16:53
shirly121
shirly121 previously approved these changes May 20, 2026
THROW_EXCEPTION_WITH_FILE_LINE(
"StringVector::addString: length is too long, exceeds " +
neug_string_t::SHORT_STR_LENGTH);
dstStr.overflowPtr =
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我们现在系统 string 的最大值为 65536,这里 length 也得满足这个限制,https://github.com/alibaba/neug/pull/25/changes#diff-b8dfbabe04e974fe0e54d39cc7d37d26a8450bdc559dad0cb37e546724447d95R349

// requests larger than this still get a dedicated, exactly-sized block via
// std::max in allocateNewBlock; this just lets typical small overflow strings
// share one allocation.
static constexpr uint64_t OVERFLOW_BLOCK_DEFAULT_SIZE = 4096;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个 OVERFLOW_BLOCK_DEFAULT_SIZE 有什么用处?

…LL CSV_SCAN

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@shirly121 shirly121 merged commit d4a3827 into main May 20, 2026
22 checks passed
@lnfjpt lnfjpt mentioned this pull request May 26, 2026
@longbinlai longbinlai deleted the fix/issue-345 branch June 3, 2026 08:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working compiler Compiler infrastructure

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants