v0.10.0

@chonknick chonknick released this 29 Mar 06:58
· 9 commits to main since this release

What's New

Multi-byte pattern support for chunking (Closes #2)

New .patterns(&[&str]) API on Chunker and OwnedChunker, composable with .delimiters() for mixed ASCII + multi-byte delimiter chunking.

chunk(content.as_bytes())
    .delimiters(b"\n.?!")
    .patterns(&["。", ",", "!"])
    .forward_fallback()
    .size(4096)

Highlights

  • Hybrid search strategy: memmem (SIMD) for 1-3 patterns, Aho-Corasick for 4+, selected automatically
  • Zero regression: pure delimiter chunking stays at 70+ GiB/s
  • Composable: .delimiters() and .patterns() work together, picking the best split point across both
  • UTF-8 safe: multi-byte characters are never split mid-codepoint when using .patterns() + .forward_fallback()
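
To make the strategy selection and the "best split point across both" idea concrete, here is a minimal std-only sketch. It is not the crate's implementation: SearchStrategy, pick_strategy, and best_split are hypothetical names, the DelimitersOnly case for zero patterns is an assumption, and the naive window scan stands in for the SIMD memmem and Aho-Corasick searchers mentioned above.

```rust
/// Hypothetical strategy tag; illustrative only, not a type from the crate.
#[derive(Debug, PartialEq)]
enum SearchStrategy {
    DelimitersOnly, // assumption: no patterns means plain byte-delimiter search
    Memmem,         // 1-3 patterns: one substring search per pattern
    AhoCorasick,    // 4+ patterns: a single multi-pattern automaton
}

/// Picks a strategy from the pattern count, following the thresholds above.
fn pick_strategy(pattern_count: usize) -> SearchStrategy {
    match pattern_count {
        0 => SearchStrategy::DelimitersOnly,
        1..=3 => SearchStrategy::Memmem,
        _ => SearchStrategy::AhoCorasick,
    }
}

/// Returns the byte offset just past the last delimiter or pattern found in
/// `window`, or None if neither occurs. Naive scan, for illustration only.
fn best_split(window: &[u8], delimiters: &[u8], patterns: &[&str]) -> Option<usize> {
    // Last single-byte delimiter: split right after it.
    let by_delim = window
        .iter()
        .rposition(|b| delimiters.contains(b))
        .map(|i| i + 1);

    // Last occurrence of any multi-byte pattern: split after the whole pattern,
    // so the split never lands inside the pattern's bytes.
    let by_pattern = patterns
        .iter()
        .filter(|p| !p.is_empty()) // `windows(0)` would panic
        .filter_map(|p| {
            window
                .windows(p.len())
                .rposition(|w| w == p.as_bytes())
                .map(|i| i + p.len())
        })
        .max();

    // Option orders None below Some, so `max` keeps the later candidate.
    by_delim.max(by_pattern)
}
```

In this sketch, best_split("a.b。c".as_bytes(), b".", &["。"]) picks the offset after 。 because it lies later in the window than the ASCII period.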

Fixes

  • Fixed #2: passing multi-byte UTF-8 characters (e.g. CJK punctuation) to .delimiters() decomposed them into individual bytes, causing mid-character splits. Such characters can now be passed to .patterns() instead.
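
The failure mode behind #2 can be reproduced with the standard library alone. The sketch below does not use the crate; byte_split_points, pattern_split_points, and mid_char_splits are hypothetical helpers written for illustration. It compares treating the three UTF-8 bytes of 。 (E3 80 82) as independent delimiters against matching 。 as a whole pattern.

```rust
/// Split offsets produced when every byte in `delims` counts as a delimiter.
fn byte_split_points(text: &str, delims: &[u8]) -> Vec<usize> {
    text.bytes()
        .enumerate()
        .filter(|(_, b)| delims.contains(b))
        .map(|(i, _)| i + 1) // split right after the matching byte
        .collect()
}

/// Split offsets produced when `pattern` is matched as a whole string.
fn pattern_split_points(text: &str, pattern: &str) -> Vec<usize> {
    text.match_indices(pattern)
        .map(|(i, m)| i + m.len()) // split after the full pattern
        .collect()
}

/// The subset of `points` that falls inside a multi-byte character.
fn mid_char_splits(text: &str, points: &[usize]) -> Vec<usize> {
    points
        .iter()
        .copied()
        .filter(|&i| !text.is_char_boundary(i))
        .collect()
}
```

For the text "あ。", あ encodes as E3 81 82 and shares two byte values with 。, so the byte-level approach yields offsets inside both characters, while whole-pattern matching yields only the offset after 。.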