fix: string literals longer than 48 chars crash query pipeline (issue 345)#350
Conversation
| positionalParams[i], inputTypes[i]); | ||
| // Skip fold on bare LITERAL: fold materializes through a ValueVector | ||
| // whose overflow allocator (InMemOverflowBuffer::allocateNewBlock) is | ||
| // an unimplemented stub, so string literals longer than |
There was a problem hiding this comment.
直接把48这个限制去掉吗?你得了解下为啥之前有这个限制,然后看是否是增加一个更大的限制,直接去掉显得不太妥当。
There was a problem hiding this comment.
调研结论
48 这个数字到底是什么?
它不是一个被「设计」出来的限制,而是 neug_string_t 的 SSO(small-string-optimization)阈值意外暴露出来的崩溃点。
neug_string_t::SHORT_STR_LENGTH = PREFIX_LENGTH(16) + INLINED_SUFFIX_LENGTH(32) = 48 (include/neug/compiler/common/types/neug_string.h:34-38)。≤48 字节的字符串走内联存储,不需要分配;>48 字节走 overflow buffer。- overflow buffer 分配链路:
StringVector::addString → StringAuxiliaryBuffer::allocateOverflow → InMemOverflowBuffer::allocateSpace → allocateNewBlock。 InMemOverflowBuffer::allocateNewBlock(uint64_t) {}是个空函数桩(src/compiler/common/in_mem_overflow_buffer.cpp:64)。allocateSpace
调用完它之后 blocks 仍然为空,紧接着 currentBlock()->data() 在空 vector 上调 back() → UB / 段错误。
所以「>48 字符崩溃」是底层 overflow allocator 未实现导致的 bug,不是任何业务限制。SHORT_STR_LENGTH 只是恰好成了「能走通的最大字符串长度」。
There was a problem hiding this comment.
我理解48不算一个写死的限制,超过48还会走overflow buffer,只是overflow buffer部分实现有bug (InMemOverflowBuffer::allocateNewBlock(uint64_t) {} 是个空函数桩),这个限制我会complier那边同步一下看看要不要添加一个大小限制
longbinlai
left a comment
There was a problem hiding this comment.
请补充一个测试用例到neug-test。另外,如果你调研下来这个 长度限制是有必要的,需要把这个限制补充到cypher文档关于 call 调用部分的内容
Move test_call_string_literal.py from a separate CI step into the existing "Run Query Test" step to reduce workflow verbosity.
| THROW_EXCEPTION_WITH_FILE_LINE( | ||
| "StringVector::addString: length is too long, exceeds " + | ||
| neug_string_t::SHORT_STR_LENGTH); | ||
| dstStr.overflowPtr = |
There was a problem hiding this comment.
我们现在系统 string 的最大值为 65536,这里 length 也得满足这个限制,https://github.com/alibaba/neug/pull/25/changes#diff-b8dfbabe04e974fe0e54d39cc7d37d26a8450bdc559dad0cb37e546724447d95R349
| // requests larger than this still get a dedicated, exactly-sized block via | ||
| // std::max in allocateNewBlock; this just lets typical small overflow strings | ||
| // share one allocation. | ||
| static constexpr uint64_t OVERFLOW_BLOCK_DEFAULT_SIZE = 4096; |
There was a problem hiding this comment.
这个 OVERFLOW_BLOCK_DEFAULT_SIZE 有什么用处?
…LL CSV_SCAN Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
What do these changes do?
修复
CALL <procedure>(<string literal>, ...)在字符串字面量超过 48 字节时段错误的问题。根因
1.
InMemOverflowBuffer::allocateNewBlock是一个空实现。当字符串字面量超过 48 字节的 short-string 内联缓冲区时,
StringVector::addString会向 overflow buffer 申请新块,但该函数什么都不做,紧随其后的allocateOverflow写入便会破坏内存。原先这里直接THROW掩盖了这一问题——一旦长字符串路径被允许走通(见下条),这个 stub 才第一次被真正调用到。
2.
Binder::bindTableFunc对裸 LITERAL 参数做了多余的 fold。fold 会让字面量绕道经过一次
ValueVector物化,而JSON_SCAN的STRING参数在这条路径上会命中LogicalType::getPhysicalType中的NEUG_UNREACHABLE(types.cpp:990)。裸字面量本身已经持有最终值,只有当隐式 cast 将其包成CAST函数表达式时,fold才是必要的。
改动内容
include/neug/compiler/common/in_mem_overflow_buffer.hBufferBlock不再包装storage::MemoryBuffer,改为直接持有一个在构造时按指定大小分配的std::unique_ptr<uint8_t[]>;size()/data()改为内联。src/compiler/common/in_mem_overflow_buffer.cpp实现
allocateNewBlock:按max(请求大小, OVERFLOW_BLOCK_DEFAULT_SIZE)分配一个BufferBlock并 push 进blocks。这样小的 overflow 字符串可以共用同一块 4 KiB 内存,而超大请求仍能拿到一块刚好够用的独立分配。src/compiler/common/vector/value_vector.cppStringVector::addString入口处校验LogicalType::getMaxStringMaxLen()上限,超长直接抛带文件/行号的异常,避免恶意或异常输入把单次 overflow 分配撑到任意大小。src/compiler/binder/bind/bind_table_function.cpp跳过对裸
LITERAL参数的foldExpression;仅当隐式 cast 已经把表达式包成CAST包装时才继续 fold。相关配置
OVERFLOW_BLOCK_DEFAULT_SIZE = 4096字节(in_mem_overflow_buffer.cpp中的文件内常量)。LogicalType::getMaxStringMaxLen()(include/neug/compiler/common/types/types.h:349),与VARCHAR修饰符校验所用的上限一致。超出该长度的输入会抛出明确的异常,而不再破坏 overflow buffer。Related issue number
Fixes [#345 ]