ragchunk

Domain-aware chunking for technical documentation RAG pipelines.

Most chunking libraries treat all text the same — they split on token counts or simple delimiters, which tears apart code blocks, configuration snippets, CLI output, and tables. This destroys the context that makes these elements useful for retrieval-augmented generation.

ragchunk understands document structure. It detects and preserves technical content boundaries, producing chunks that keep code blocks intact, tables whole, and configurations unbroken — while still respecting token budgets for your vector store.

Installation

pip install ragchunk

For development:

pip install "ragchunk[dev]"

Quick Start

Python API

from ragchunk import DocumentChunker

chunker = DocumentChunker(source="router-guide.md")

with open("router-guide.md", encoding="utf-8") as f:
    text = f.read()

chunks = chunker.chunk(text, max_tokens=512, overlap=50)

for chunk in chunks:
    print(f"[{chunk.metadata['chunk_type']}] {chunk.metadata['section']}")
    print(f"  {chunk.token_estimate} tokens, offset {chunk.start_offset}-{chunk.end_offset}")
    print(f"  {chunk.text[:80]}...")
    print()
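The `overlap=50` argument in the call above asks for roughly 50 tokens of shared context between neighbouring chunks. How ragchunk implements this internally is not shown here; the following is only a minimal sketch of one way to do it, assuming the overlap is taken from the tail of the previous chunk and measured with the same words-times-1.3 heuristic. `apply_overlap` and its parameters are hypothetical names for illustration.

```python
def apply_overlap(chunks: list[str], overlap_tokens: int,
                  tokens_per_word: float = 1.3) -> list[str]:
    """Prefix each chunk (except the first) with the tail of the previous one.

    Illustrative only: the real library may choose overlap differently.
    """
    # Convert the token budget into a whole number of words.
    overlap_words = int(overlap_tokens / tokens_per_word)
    if overlap_words <= 0:
        return list(chunks)

    result = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            result.append(chunk)
            continue
        # Take the tail from the ORIGINAL previous chunk so overlap never compounds.
        tail = chunks[i - 1].split()[-overlap_words:]
        result.append(" ".join(tail) + " " + chunk)
    return result


prose_chunks = ["a b c d e", "f g h"]
print(apply_overlap(prose_chunks, overlap_tokens=3))
```

Taking the tail from the original chunks, not the already-extended ones, keeps the overlap bounded no matter how many chunks follow each other.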

Pipeline API

from ragchunk import ChunkingPipeline
from ragchunk.pipeline import PipelineConfig

config = PipelineConfig(
    max_tokens=256,
    overlap=30,
    source="handbook.md",
    custom_metadata={"project": "infra-docs", "version": "2.1"},
)

pipeline = ChunkingPipeline(config)

# Add custom enrichment
pipeline.add_enricher(lambda chunk: chunk)  # your logic here

chunks = pipeline.process_file("handbook.md")
print(pipeline.to_json(chunks))
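The `lambda chunk: chunk` enricher above is a no-op placeholder. As a rough sketch of what a real enricher might look like, the example below assumes (as the lambda and the Quick Start suggest) that an enricher receives a chunk and returns it, and that `chunk.metadata` is a mutable dict. `FakeChunk` is a stand-in class and `tag_interfaces` a hypothetical function, both introduced only so the example runs without ragchunk installed.

```python
import re
from dataclasses import dataclass, field


@dataclass
class FakeChunk:
    """Stand-in for illustration only; not ragchunk's own chunk class."""
    text: str
    metadata: dict = field(default_factory=dict)


def tag_interfaces(chunk):
    """Hypothetical enricher: record interface names mentioned in the chunk."""
    names = re.findall(r"\b(?:GigabitEthernet|FastEthernet|Loopback)\S*", chunk.text)
    chunk.metadata["interfaces"] = sorted(set(names))
    return chunk  # return the chunk so enrichers can be chained


sample = FakeChunk("interface GigabitEthernet0/1\n ip address 10.0.0.1 255.255.255.0\ninterface Loopback0")
print(tag_interfaces(sample).metadata)
```

If the assumptions hold, a function like this could be passed to `pipeline.add_enricher` in place of the lambda.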

CLI

# Chunk a file, output JSON to stdout
ragchunk chunk docs/guide.md

# With options
ragchunk chunk docs/guide.md --max-tokens 256 --overlap 30 --output chunks.json

# Show document structure
ragchunk info docs/guide.md

# Use a YAML config
ragchunk chunk docs/guide.md --config pipeline.yaml

Detection Types

ragchunk detects and preserves these structural elements:

| Element Type | Detection Method | Example |
|---|---|---|
| Fenced code blocks | ``` or ~~~ delimiters | Python, YAML, Bash snippets |
| Indented code blocks | 4+ space indent | Legacy markdown code |
| Markdown tables | Pipe-delimited with separator row | `\| Col \| Col \|` |
| ASCII tables | Box-drawing borders (+---+) | Network inventory tables |
| Cisco IOS config | Keyword detection (interface, router, etc.) | Router/switch configs |
| JunOS config | Curly-brace block detection | Firewall filters |
| CLI output | Prompt patterns ($, #, >, >>>) | Command + output blocks |
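To make the detection methods in the table more concrete, here is a small line classifier covering a few of them. It is a sketch under stated assumptions, not ragchunk's actual rules: every pattern and the `classify_line` function are hypothetical. In particular, the prompt pattern deliberately requires a hostname before `#` or `>` so that markdown headings and blockquotes are not mistaken for CLI prompts. The real library may resolve that ambiguity differently.

```python
import re

# Hypothetical patterns for illustration; not taken from ragchunk's source.
FENCE_RE = re.compile(r"^ {0,3}(`{3}|~{3})")              # fenced code delimiter
ASCII_BORDER_RE = re.compile(r"^\+(-+\+)+$")              # +-----+------+
CLI_PROMPT_RE = re.compile(r"^(\$ |>>> |[\w.-]+[#>])")    # "$ ", ">>> ", "Router#"
INDENTED_RE = re.compile(r"^( {4,}|\t)\S")                # 4+ spaces or a tab
SEPARATOR_CELL_RE = re.compile(r":?-+:?")                 # one markdown separator cell


def is_table_separator(line: str) -> bool:
    """True for a markdown separator row such as '| --- | :---: |'."""
    stripped = line.strip()
    if "|" not in stripped:
        return False  # a bare '---' is a horizontal rule, not a table row
    cells = [cell.strip() for cell in stripped.strip("|").split("|")]
    return all(SEPARATOR_CELL_RE.fullmatch(cell) for cell in cells)


def classify_line(line: str) -> str:
    """Return a coarse label for one line of a technical document."""
    if FENCE_RE.match(line):
        return "fence"
    if is_table_separator(line):
        return "table_separator"
    if ASCII_BORDER_RE.match(line.strip()):
        return "ascii_table_border"
    if CLI_PROMPT_RE.match(line):
        return "cli_prompt"
    if INDENTED_RE.match(line):
        return "indented_code"
    return "prose"


for sample in ["| --- | --- |", "Router#show ip route", "# Heading", "    x = 1"]:
    print(f"{sample!r:28} -> {classify_line(sample)}")
```

Heuristics like these produce false positives (a prose line beginning with `C#` would match the prompt pattern), so a production detector would likely look at neighbouring lines as well.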

How It Works

  1. Detect — Scan for protected blocks (code, config, tables, CLI output)
  2. Split by structure — Divide on markdown headers, keeping section metadata
  3. Split prose — Break remaining prose at sentence boundaries
  4. Preserve blocks — Never split mid-code-block, mid-table, or mid-config
  5. Apply overlap — Add token overlap between consecutive prose chunks
  6. Enrich — Attach metadata (section, type, word count, offsets)
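Steps 1, 3, and 4 above can be sketched in a few dozen lines. The example below handles only fenced code blocks and assumes a simple punctuation-based sentence splitter. `segment`, `split_prose`, and `chunk_text` are hypothetical names, and the token estimate rounds up, which is also an assumption. It is meant to show the shape of the approach, not ragchunk's implementation.

```python
import math
import re
from dataclasses import dataclass

FENCE = "`" * 3  # built programmatically so the sample nests safely in markdown
FENCE_RE = re.compile(r"^(`{3}|~{3})")


@dataclass
class Segment:
    kind: str  # "code" or "prose"
    text: str


def estimate_tokens(text: str) -> int:
    return math.ceil(len(text.split()) * 1.3)


def segment(text: str) -> list[Segment]:
    """Step 1: separate fenced code blocks from the surrounding prose."""
    segments: list[Segment] = []
    buffer: list[str] = []
    open_fence = None  # delimiter that opened the current block, if any

    def flush(kind: str) -> None:
        body = "\n".join(buffer).strip("\n")
        if body.strip():
            segments.append(Segment(kind, body))
        buffer.clear()

    for line in text.splitlines():
        match = FENCE_RE.match(line)
        if open_fence is None and match:
            flush("prose")
            open_fence = match.group(1)
            buffer.append(line)
        elif open_fence is not None and line.strip() == open_fence:
            buffer.append(line)
            flush("code")
            open_fence = None
        else:
            buffer.append(line)
    flush("code" if open_fence else "prose")
    return segments


def split_prose(text: str, max_tokens: int) -> list[str]:
    """Step 3: pack whole sentences into chunks that fit the token budget."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        if current and estimate_tokens(" ".join(current + [sentence])) > max_tokens:
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


def chunk_text(text: str, max_tokens: int) -> list[str]:
    """Step 4: code segments pass through whole; only prose is split."""
    chunks: list[str] = []
    for seg in segment(text):
        if seg.kind == "code":
            chunks.append(seg.text)
        else:
            chunks.extend(split_prose(seg.text, max_tokens))
    return chunks


doc = "\n".join([
    "Configure the interface first. Then verify the link state.",
    FENCE + "bash",
    "$ show interfaces status",
    "$ show ip route",
    FENCE,
    "Save the configuration when done.",
])
for piece in chunk_text(doc, max_tokens=6):
    print(repr(piece))
```

Note that a single sentence longer than the budget is still emitted whole in this sketch. Splitting inside a sentence would need a further fallback.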

If a protected block exceeds max_tokens, it is kept intact and marked oversized in its metadata rather than split. One large chunk is better than destroying the structure that makes the block useful for retrieval.
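The oversized policy can be illustrated with a short sketch. The metadata key `oversized`, the output layout, and the function name `finalize_protected_block` are all assumptions made for this example. The README only states that such blocks are kept whole and flagged in metadata.

```python
import math


def estimate_tokens(text: str) -> int:
    """Words times 1.3, as described in this README (rounding up is an assumption)."""
    return math.ceil(len(text.split()) * 1.3)


def finalize_protected_block(text: str, chunk_type: str, max_tokens: int) -> dict:
    """Emit a protected block as one chunk, flagging it if it exceeds the budget."""
    tokens = estimate_tokens(text)
    return {
        "text": text,  # emitted whole, never split
        "metadata": {
            "chunk_type": chunk_type,
            "token_estimate": tokens,
            "oversized": tokens > max_tokens,  # hypothetical key name
        },
    }


config = "\n".join(f"interface GigabitEthernet0/{i}" for i in range(20))
chunk = finalize_protected_block(config, "config", max_tokens=16)
print(chunk["metadata"])
```

With a flag like this, a downstream consumer can decide per chunk whether to embed it as is, summarise it, or route it to a model with a larger context window.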

Token Estimation

ragchunk uses a word_count * 1.3 heuristic for token estimation. This avoids requiring heavy tokenizer dependencies while providing reasonable approximations for English technical text. For precise token counts, post-process chunks with your model's tokenizer.
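For reference, the heuristic amounts to something like the following. This is a sketch, assuming words are whitespace-delimited and the result is rounded up; the library's exact rounding and word-splitting rules may differ. `estimate_tokens` here is a local illustration, not an import from ragchunk.

```python
import math


def estimate_tokens(text: str, tokens_per_word: float = 1.3) -> int:
    """Approximate the token count as whitespace-delimited words times 1.3."""
    word_count = len(text.split())
    return math.ceil(word_count * tokens_per_word)


sample = "interface GigabitEthernet0/1 is up, line protocol is up"
print(estimate_tokens(sample))  # 8 words * 1.3 = 10.4, rounded up to 11
```

Identifiers such as `GigabitEthernet0/1` count as one word here but usually expand to several tokens in a real tokenizer, which is why the estimate tends to run low on config-heavy text.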

Pipeline Configuration (YAML)

max_tokens: 512
overlap: 50
source: "my-document.md"
detect_code: true
detect_config: true
detect_tables: true
detect_cli: true
enrich_metadata: true
custom_metadata:
  project: "infrastructure"
  team: "netops"

Development

git clone https://github.com/cwccie/ragchunk.git
cd ragchunk
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=ragchunk --cov-report=term-missing

# Lint
ruff check src/ tests/

# Format
ruff format src/ tests/

License

MIT License — Copyright (c) 2026 Corey Wade
