Skip to content

Fix deadly-signal crash in CSV segmenter (#718)#718

Closed
daniellerozenblit wants to merge 1 commit into
facebook:devfrom
daniellerozenblit:export-D103001388
Closed

Fix deadly-signal crash in CSV segmenter (#718)#718
daniellerozenblit wants to merge 1 commit into
facebook:devfrom
daniellerozenblit:export-D103001388

Conversation

@daniellerozenblit
Copy link
Copy Markdown
Contributor

@daniellerozenblit daniellerozenblit commented May 1, 2026

Summary:

The CSV segmenter crashed (fatal assertion) when processing input with high token density (e.g., many consecutive separators). The root cause was twofold:

  1. maxNumTokens was calculated as min(byteSize, chunkByteSizeMax), assuming at most 1 token per byte. But consecutive separators produce 2 tokens per byte (empty field + separator), causing the lexer to exhaust the token buffer before consuming enough bytes.

  2. When the token buffer was exhausted mid-chunk, the segmenter's internal consistency checks fired with logicError, which triggers a fatal assertion in ZL_E_create_va ("Logic errors should never actually be generated").

Fix:

  • Increase maxNumTokens to 2 * chunkSize + 1 to handle worst-case token density

Reviewed By: Victor-C-Zhang

Differential Revision: D103001388

@meta-cla meta-cla Bot added the cla signed label May 1, 2026
@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented May 1, 2026

@daniellerozenblit has exported this pull request. If you are a Meta employee, you can view the originating Diff in D103001388.

@meta-codesync meta-codesync Bot changed the title Fix deadly-signal crash in CSV segmenter Fix deadly-signal crash in CSV segmenter (#718) May 5, 2026
daniellerozenblit added a commit to daniellerozenblit/openzl-1 that referenced this pull request May 5, 2026
Summary:

The CSV segmenter crashed (fatal assertion) when processing input with high token density (e.g., many consecutive separators). The root cause was twofold:

1. `maxNumTokens` was calculated as `min(byteSize, chunkByteSizeMax)`, assuming at most 1 token per byte. But consecutive separators produce 2 tokens per byte (empty field + separator), causing the lexer to exhaust the token buffer before consuming enough bytes.

2. When the token buffer was exhausted mid-chunk, the segmenter's internal consistency checks fired with `logicError`, which triggers a fatal assertion in `ZL_E_create_va` ("Logic errors should never actually be generated").

Fix:
- Increase `maxNumTokens` to `2 * chunkSize + 1` to handle worst-case token density
- Change `logicError` to `node_invalid_input` for defense-in-depth, so any remaining edge cases produce a clean error instead of a crash

Reviewed By: Victor-C-Zhang

Differential Revision: D103001388
daniellerozenblit added a commit to daniellerozenblit/openzl-1 that referenced this pull request May 5, 2026
Summary:

The CSV segmenter crashed (fatal assertion) when processing input with high token density (e.g., many consecutive separators). The root cause was twofold:

1. `maxNumTokens` was calculated as `min(byteSize, chunkByteSizeMax)`, assuming at most 1 token per byte. But consecutive separators produce 2 tokens per byte (empty field + separator), causing the lexer to exhaust the token buffer before consuming enough bytes.

2. When the token buffer was exhausted mid-chunk, the segmenter's internal consistency checks fired with `logicError`, which triggers a fatal assertion in `ZL_E_create_va` ("Logic errors should never actually be generated").

Fix:
- Increase `maxNumTokens` to `2 * chunkSize + 1` to handle worst-case token density
- Change `logicError` to `node_invalid_input` for defense-in-depth, so any remaining edge cases produce a clean error instead of a crash

Reviewed By: Victor-C-Zhang

Differential Revision: D103001388
daniellerozenblit added a commit to daniellerozenblit/openzl-1 that referenced this pull request May 5, 2026
Summary:

The CSV segmenter crashed (fatal assertion) when processing input with high token density (e.g., many consecutive separators). The root cause was twofold:

1. `maxNumTokens` was calculated as `min(byteSize, chunkByteSizeMax)`, assuming at most 1 token per byte. But consecutive separators produce 2 tokens per byte (empty field + separator), causing the lexer to exhaust the token buffer before consuming enough bytes.

2. When the token buffer was exhausted mid-chunk, the segmenter's internal consistency checks fired with `logicError`, which triggers a fatal assertion in `ZL_E_create_va` ("Logic errors should never actually be generated").

Fix:
- Increase `maxNumTokens` to `2 * chunkSize + 1` to handle worst-case token density

Reviewed By: Victor-C-Zhang

Differential Revision: D103001388
Summary:

The CSV segmenter crashed (fatal assertion) when processing input with high token density (e.g., many consecutive separators). The root cause was twofold:

1. `maxNumTokens` was calculated as `min(byteSize, chunkByteSizeMax)`, assuming at most 1 token per byte. But consecutive separators produce 2 tokens per byte (empty field + separator), causing the lexer to exhaust the token buffer before consuming enough bytes.

2. When the token buffer was exhausted mid-chunk, the segmenter's internal consistency checks fired with `logicError`, which triggers a fatal assertion in `ZL_E_create_va` ("Logic errors should never actually be generated").

Fix:
- Increase `maxNumTokens` to `2 * chunkSize + 1` to handle worst-case token density

Reviewed By: Victor-C-Zhang

Differential Revision: D103001388
@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented May 5, 2026

This pull request has been merged in 6959466.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant