Domain-aware chunking for technical documentation RAG pipelines.
Most chunking libraries treat all text the same — they split on token counts or simple delimiters, which tears apart code blocks, configuration snippets, CLI output, and tables. This destroys the context that makes these elements useful for retrieval-augmented generation.
ragchunk understands document structure. It detects and preserves technical content boundaries, producing chunks that keep code blocks intact, tables whole, and configurations unbroken — while still respecting token budgets for your vector store.
Install with pip:

```bash
pip install ragchunk
```

For development:

```bash
pip install "ragchunk[dev]"
```

Chunk a document from Python:

```python
from ragchunk import DocumentChunker

chunker = DocumentChunker(source="router-guide.md")

with open("router-guide.md") as f:
    text = f.read()

chunks = chunker.chunk(text, max_tokens=512, overlap=50)

for chunk in chunks:
    print(f"[{chunk.metadata['chunk_type']}] {chunk.metadata['section']}")
    print(f"  {chunk.token_estimate} tokens, offset {chunk.start_offset}-{chunk.end_offset}")
    print(f"  {chunk.text[:80]}...")
    print()
```

To set options once and attach custom enrichers, use a pipeline:

```python
from ragchunk import ChunkingPipeline
from ragchunk.pipeline import PipelineConfig

config = PipelineConfig(
    max_tokens=256,
    overlap=30,
    source="handbook.md",
    custom_metadata={"project": "infra-docs", "version": "2.1"},
)

pipeline = ChunkingPipeline(config)

# Add custom enrichment
pipeline.add_enricher(lambda chunk: chunk)  # your logic here

chunks = pipeline.process_file("handbook.md")
print(pipeline.to_json(chunks))
```

From the command line:

```bash
# Chunk a file, output JSON to stdout
ragchunk chunk docs/guide.md

# With options
ragchunk chunk docs/guide.md --max-tokens 256 --overlap 30 --output chunks.json

# Show document structure
ragchunk info docs/guide.md

# Use a YAML config
ragchunk chunk docs/guide.md --config pipeline.yaml
```

ragchunk detects and preserves these structural elements:

| Element Type | Detection Method | Example |
|---|---|---|
| Fenced code blocks | ``` or ~~~ delimiters | Python, YAML, Bash snippets |
| Indented code blocks | 4+ space indent | Legacy markdown code |
| Markdown tables | Pipe-delimited with separator row | `\| Col \| Col \|` |
| ASCII tables | Box-drawing borders (+---+) | Network inventory tables |
| Cisco IOS config | Keyword detection (interface, router, etc.) | Router/switch configs |
| JunOS config | Curly-brace block detection | Firewall filters |
| CLI output | Prompt patterns ($, #, >, >>>) | Command + output blocks |
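
To give a feel for two rows of the table above, here is a minimal sketch of delimiter-based detection for fenced code blocks and markdown tables. The regular expressions and the function names `find_fenced_blocks` and `find_markdown_tables` are illustrative assumptions, not ragchunk's actual implementation.

```python
import re

# Hypothetical patterns for illustration; not taken from ragchunk's source.
FENCE_RE = re.compile(r"^(`{3,}|~{3,})")
PIPE_ROW_RE = re.compile(r"^\s*\|.*\|\s*$")
SEPARATOR_RE = re.compile(r"^\s*\|?\s*:?-{3,}:?\s*(\|\s*:?-{3,}:?\s*)*\|?\s*$")


def find_fenced_blocks(lines):
    """Return inclusive (start, end) line indexes of fenced code blocks."""
    blocks = []
    open_fence = None  # (start index, fence character, fence length)
    for i, line in enumerate(lines):
        stripped = line.strip()
        match = FENCE_RE.match(stripped)
        if not match:
            continue
        marker = match.group(1)
        if open_fence is None:
            open_fence = (i, marker[0], len(marker))
        elif (
            marker[0] == open_fence[1]
            and len(marker) >= open_fence[2]
            and stripped == marker  # a closing fence carries no info string
        ):
            blocks.append((open_fence[0], i))
            open_fence = None
    return blocks


def find_markdown_tables(lines):
    """Return inclusive (start, end) line indexes of pipe tables.

    A table is a pipe-delimited header row followed by a separator row,
    plus any pipe-delimited rows that come directly after.
    """
    tables = []
    i = 0
    while i < len(lines) - 1:
        if PIPE_ROW_RE.match(lines[i]) and SEPARATOR_RE.match(lines[i + 1]):
            end = i + 1
            while end + 1 < len(lines) and PIPE_ROW_RE.match(lines[end + 1]):
                end += 1
            tables.append((i, end))
            i = end + 1
        else:
            i += 1
    return tables


sample = """Intro text.

~~~yaml
key: value
~~~

| Name | Role |
|---|---|
| r1 | core |

Done."""

sample_lines = sample.splitlines()
print(find_fenced_blocks(sample_lines))    # [(2, 4)]
print(find_markdown_tables(sample_lines))  # [(6, 8)]
```

The returned line ranges are the spans a chunker would treat as protected, so that no split point falls inside them.
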
- Detect — Scan for protected blocks (code, config, tables, CLI output)
- Split by structure — Divide on markdown headers, keeping section metadata
- Split prose — Break remaining prose at sentence boundaries
- Preserve blocks — Never split mid-code-block, mid-table, or mid-config
- Apply overlap — Add token overlap between consecutive prose chunks
- Enrich — Attach metadata (section, type, word count, offsets)
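
The "Split prose" and "Apply overlap" steps can be sketched roughly as follows. This is a simplified illustration, not ragchunk's code: it assumes a naive regex sentence splitter, and it measures budgets in whitespace-separated words rather than tokens to keep the arithmetic easy to follow. The name `split_prose` and its parameters are hypothetical.

```python
import re


def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())


def split_prose(text, max_words, overlap_words):
    """Pack whole sentences into chunks of at most max_words words.

    When a chunk is closed, trailing sentences totalling at most
    overlap_words words are repeated at the start of the next chunk.
    A single sentence longer than max_words is kept whole.
    """
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and word_count(candidate) > max_words:
            chunks.append(" ".join(current))
            # Carry over trailing sentences that fit the overlap budget.
            carried = []
            for previous in reversed(current):
                if word_count(" ".join([previous] + carried)) > overlap_words:
                    break
                carried.insert(0, previous)
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
for chunk in split_prose(text, max_words=6, overlap_words=3):
    print(chunk)
# One two three. Four five six.
# Four five six. Seven eight nine.
# Seven eight nine. Ten eleven twelve.
```

Splitting only at sentence boundaries means the overlap is always a whole sentence, so each chunk still reads as complete prose.
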
If a protected block exceeds max_tokens, it is kept intact (marked oversized
in metadata) rather than broken. Better to have one large chunk than destroy
the structure that makes it useful for retrieval.
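
A minimal sketch of this rule, under stated assumptions: the `Chunk` dataclass is a hypothetical stand-in for ragchunk's chunk type, and the `oversized` metadata key is a guess at how the flag might be spelled. Only the behavior (keep the block whole, flag it) comes from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a chunk; not ragchunk's actual class."""
    text: str
    metadata: dict = field(default_factory=dict)


def estimate_tokens(text):
    """Rough token estimate: word count times 1.3, truncated."""
    return int(len(text.split()) * 1.3)


def emit_protected(block_text, chunk_type, max_tokens):
    """Emit a protected block as one chunk, flagging it if it is over budget.

    The block text is never split or truncated.
    """
    tokens = estimate_tokens(block_text)
    return Chunk(
        text=block_text,
        metadata={
            "chunk_type": chunk_type,
            "oversized": tokens > max_tokens,  # assumed key name
        },
    )


config_block = (
    "interface Gi0/1\n"
    " description uplink\n"
    " ip address 10.0.0.1 255.255.255.0"
)
chunk = emit_protected(config_block, "config", max_tokens=5)
print(chunk.metadata["oversized"])  # True (8 words, estimated at 10 tokens)
print(chunk.text == config_block)   # True
```

Downstream code can then decide per chunk whether to embed an oversized block as is, summarize it, or route it to a model with a larger context window.
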
ragchunk uses a word_count * 1.3 heuristic for token estimation. This avoids
requiring heavy tokenizer dependencies while providing reasonable approximations
for English technical text. For precise token counts, post-process chunks with
your model's tokenizer.
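
As a rough illustration of the heuristic and of the post-processing step, the sketch below estimates tokens from word counts and then recounts with a caller-supplied function. Truncating to an integer is an assumption (the library may round differently), and `simple_tokenizer` is only a stand-in: in practice you would pass your model's real tokenizer.

```python
import re


def estimate_tokens(text, factor=1.3):
    """Approximate tokens as whitespace-separated words times factor."""
    return int(len(text.split()) * factor)


def recount(texts, count_tokens):
    """Pair each chunk text with a count from a caller-supplied tokenizer."""
    return [(text, count_tokens(text)) for text in texts]


def simple_tokenizer(text):
    """Stand-in tokenizer: counts word runs and punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))


sentence = "Configure the uplink interface before enabling OSPF."
print(estimate_tokens(sentence))                # 9  (7 words * 1.3 = 9.1)
print(recount([sentence], simple_tokenizer))    # [(sentence, 8)]
```

The two counts differ, which is expected: the estimate is only meant to keep chunks near the budget, and the exact count is what should gate any hard limit.
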
Example YAML config:

```yaml
max_tokens: 512
overlap: 50
source: "my-document.md"
detect_code: true
detect_config: true
detect_tables: true
detect_cli: true
enrich_metadata: true
custom_metadata:
  project: "infrastructure"
  team: "netops"
```

To set up a development environment:

```bash
git clone https://github.com/cwccie/ragchunk.git
cd ragchunk
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=ragchunk --cov-report=term-missing

# Lint
ruff check src/ tests/

# Format
ruff format src/ tests/
```

MIT License. Copyright (c) 2026 Corey Wade