v0.8.0

@chonknick chonknick released this 21 Jan 04:46

What's New

Token-aware Merging for RecursiveChunker

  • Added merge_splits function to Rust, Python, and JavaScript bindings
  • Equivalent to Chonkie's Cython _merge_splits function
  • Supports whitespace-aware merging: a merged chunk of n segments counts n-1 extra join tokens
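
To make the merging rule above concrete, here is a minimal pure-Python sketch of a greedy token-aware merge. It is an illustration only, not the library's implementation: the name merge_splits_sketch, the MergeResult type, and the combine_whitespace parameter are hypothetical, and the reading of indices as exclusive end offsets is inferred from the usage examples in these notes.

```python
from typing import List, NamedTuple


class MergeResult(NamedTuple):
    # Hypothetical result type mirroring the fields shown in the release notes.
    indices: List[int]       # exclusive end index of each merged chunk (assumed meaning)
    token_counts: List[int]  # token count of each merged chunk


def merge_splits_sketch(
    token_counts: List[int],
    chunk_size: int,
    combine_whitespace: bool = False,  # hypothetical flag name
) -> MergeResult:
    """Greedily pack consecutive segments into chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")

    indices: List[int] = []
    merged: List[int] = []
    current = 0  # tokens in the chunk being built
    size = 0     # segments in the chunk being built

    for i, count in enumerate(token_counts):
        # Joining a segment onto a non-empty chunk costs one extra token
        # when whitespace-aware merging is on (n-1 joins for n segments).
        join = 1 if (combine_whitespace and size > 0) else 0
        if size > 0 and current + join + count > chunk_size:
            # Close the current chunk just before segment i and start a new one.
            indices.append(i)
            merged.append(current)
            current, size = count, 1
        else:
            current += join + count
            size += 1

    if size > 0:
        # Flush the final, possibly short, chunk.
        indices.append(len(token_counts))
        merged.append(current)

    return MergeResult(indices, merged)


result = merge_splits_sketch([1, 1, 1, 1, 1, 1, 1], chunk_size=3)
print(result.indices)       # [3, 6, 7]
print(result.token_counts)  # [3, 3, 1]
```

In this sketch a segment that is larger than chunk_size on its own is emitted as a single oversized chunk rather than split further; whether the released function behaves the same way is not stated in these notes.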

Usage

Rust:

use chunk::merge_splits;

let token_counts = vec![1, 1, 1, 1, 1, 1, 1];
let result = merge_splits(&token_counts, 3, false);
// result.indices = [3, 6, 7]
// result.token_counts = [3, 3, 1]

Python:

from chonkie_core import merge_splits

result = merge_splits([1, 1, 1, 1, 1, 1, 1], chunk_size=3)
# result.indices = [3, 6, 7]
# result.token_counts = [3, 3, 1]

JavaScript:

import { init, merge_splits } from '@chonkiejs/chunk';

await init();
const result = merge_splits([1, 1, 1, 1, 1, 1, 1], 3);
// result.indices = [3, 6, 7]
// result.tokenCounts = [3, 3, 1]
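
The examples above return only indices and token counts, so the caller still has to regroup the original segments. A minimal sketch of that step, assuming each index is the exclusive end of a merged chunk (an inference from the outputs shown, not something these notes state); group_segments and the sample segments are hypothetical:

```python
from typing import List


def group_segments(segments: List[str], indices: List[int]) -> List[str]:
    """Join consecutive segments into chunks using exclusive end indices."""
    chunks: List[str] = []
    start = 0
    for end in indices:
        chunks.append("".join(segments[start:end]))
        start = end
    return chunks


# Seven one-token segments and the indices from the examples above.
segments = ["a", "b", "c", "d", "e", "f", "g"]
indices = [3, 6, 7]

chunks = group_segments(segments, indices)
print(chunks)  # ['abc', 'def', 'g']
```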