v0.10.0
What's New
Multi-byte pattern support for chunking (Closes #2)
New .patterns(&[&str]) API on Chunker and OwnedChunker, composable with .delimiters() for mixed ASCII + multi-byte delimiter chunking.
chunk(content.as_bytes())
.delimiters(b"\n.?!")
.patterns(&["。", "!", "?"])
.forward_fallback()
.size(4096)
Highlights
- Hybrid search strategy: memmem (SIMD) for 1-3 patterns, Aho-Corasick for 4+, selected automatically
- Zero regression: pure delimiter chunking stays at 70+ GiB/s
- Composable: .delimiters() and .patterns() work together, picking the best split point across both
- UTF-8 safe: multi-byte characters are never split mid-codepoint when using .patterns() + .forward_fallback()
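To make "picking the best split point across both" concrete, here is a minimal std-only sketch. It is not the crate's implementation: best_split is a hypothetical helper, it uses naive scanning instead of memmem or Aho-Corasick, and it assumes "best" means the latest candidate inside the current window.

```rust
/// Hypothetical helper, not part of the crate's API.
/// Returns the offset just past the latest single-byte delimiter or
/// multi-byte pattern found in `window`, whichever ends later.
fn best_split(window: &[u8], delimiters: &[u8], patterns: &[&str]) -> Option<usize> {
    // Latest single-byte delimiter: split just after it.
    let by_delim = window
        .iter()
        .rposition(|b| delimiters.contains(b))
        .map(|i| i + 1);

    // Latest occurrence of any multi-byte pattern: split just after it.
    let by_pattern = patterns
        .iter()
        .filter(|p| !p.is_empty()) // `windows(0)` would panic
        .filter_map(|p| {
            let pat = p.as_bytes();
            window
                .windows(pat.len())
                .rposition(|w| w == pat)
                .map(|i| i + pat.len())
        })
        .max();

    // `None` orders below `Some(_)`, so this keeps the later candidate.
    by_delim.max(by_pattern)
}

fn main() {
    let text = "你好。世界.ab".as_bytes();

    // Both kinds of delimiter are present: the ASCII '.' ends later.
    assert_eq!(best_split(text, b".", &["。"]), Some(16));

    // In a shorter window only the multi-byte pattern is available.
    let split = best_split(&text[..12], b".", &["。"]).unwrap();
    assert_eq!(split, 9);

    // The split lands on a character boundary, so the prefix is valid UTF-8.
    assert!(std::str::from_utf8(&text[..split]).is_ok());
}
```

Matching each pattern as a whole byte sequence is what keeps the split on a character boundary; the search strategy behind it only affects speed.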
Fixes
- Fixed #2: passing multi-byte UTF-8 characters (e.g. CJK punctuation) to .delimiters() decomposed them into individual bytes, causing mid-character splits
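The failure mode in #2 can be reproduced with the standard library alone. The sketch below is an illustration, not the crate's code: both helper functions are hypothetical. It treats the three bytes of "。" (E3 80 82) as independent delimiters and shows that they match inside a different character, "、" (E3 80 81), while whole-pattern matching does not.

```rust
/// Hypothetical model of the old behaviour: every byte of a multi-byte
/// character is treated as a standalone delimiter.
fn last_delimiter_byte_end(text: &str, delimiters: &[u8]) -> Option<usize> {
    text.as_bytes()
        .iter()
        .rposition(|b| delimiters.contains(b))
        .map(|i| i + 1)
}

/// Hypothetical model of pattern matching: the character is matched as a
/// whole byte sequence.
fn last_pattern_end(text: &str, pattern: &str) -> Option<usize> {
    text.rfind(pattern).map(|i| i + pattern.len())
}

fn main() {
    let text = "你好、世界"; // contains "、" but no "。"
    let full_stop = "。";

    // Byte-wise matching hits the 0x80 byte inside "、" and proposes a
    // split in the middle of that character.
    let split = last_delimiter_byte_end(text, full_stop.as_bytes()).unwrap();
    assert_eq!(split, 8);
    assert!(!text.is_char_boundary(split));

    // Whole-pattern matching correctly finds nothing to split on.
    assert_eq!(last_pattern_end(text, full_stop), None);

    // When "。" is present, the split lands on a character boundary.
    let text = "你好。世界";
    let split = last_pattern_end(text, full_stop).unwrap();
    assert_eq!(split, 9);
    assert!(text.is_char_boundary(split));
}
```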