chunkflow

A document chunking pipeline for RAG. Splits text into chunks using recursive, sentence-based, token-based, and sliding-window strategies.

Install

From a clone of the repository:

pip install -e .

Sentence chunking requires NLTK punkt data (downloaded automatically on first use).
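
If you prefer to fetch the punkt data ahead of time (for example in a Docker build or on an offline machine), you can download it explicitly through NLTK itself. A minimal sketch, independent of chunkflow:

import nltk

# Fetch the punkt sentence tokenizer models once, so the first
# SentenceChunker call does not need network access.
nltk.download("punkt")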

Usage

from chunkflow import RecursiveChunker, SentenceChunker, TokenChunker, SlidingWindowChunker

# Recursive (default separators: paragraph, newline, sentence, word)
chunker = RecursiveChunker(chunk_size=500, chunk_overlap=50)
chunks = chunker.chunk(document_text)

# Sentence-level
chunker = SentenceChunker(chunk_size=500, chunk_overlap=1)
chunks = chunker.chunk(document_text)

# Fixed token count
chunker = TokenChunker(chunk_size=256, chunk_overlap=20)
chunks = chunker.chunk(document_text)

# Sliding window
chunker = SlidingWindowChunker(window_size=256, stride=128)
chunks = chunker.chunk(document_text)
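
All four chunkers expose the same chunk(text) call, so the strategies are easy to compare on a single document. A minimal sketch, assuming (as the examples above suggest) that chunk() returns a list of strings:

with open("my_doc.txt", encoding="utf-8") as f:
    document_text = f.read()

# Compare how many chunks each strategy produces for the same input,
# using the constructors shown above.
for chunker in (
    RecursiveChunker(chunk_size=500, chunk_overlap=50),
    SentenceChunker(chunk_size=500, chunk_overlap=1),
    TokenChunker(chunk_size=256, chunk_overlap=20),
    SlidingWindowChunker(window_size=256, stride=128),
):
    chunks = chunker.chunk(document_text)
    print(type(chunker).__name__, len(chunks), "chunks")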

Pipeline

from chunkflow import ChunkPipeline, RecursiveChunker

pipeline = (
    ChunkPipeline(RecursiveChunker(chunk_size=500))
    .pre_process(lambda t: t.replace("\r\n", "\n"))
    .filter(lambda c: len(c) > 20)
    .with_metadata(source="my_doc.txt")
)
results = pipeline.run(text)
for r in results:
    print(r.index, r.text[:80])
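
Pipeline results can be serialized for downstream indexing or embedding. A minimal sketch that writes one JSON object per chunk, assuming each result exposes index and text as shown above (the metadata attribute name is an assumption, not confirmed by the examples):

import json

with open("chunks.jsonl", "w", encoding="utf-8") as f:
    for r in results:
        # "metadata" is assumed to carry the key/value pairs passed
        # to with_metadata(); adjust to the actual attribute name.
        record = {"index": r.index, "text": r.text, "metadata": r.metadata}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")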

License

MIT
