ragchunk

Domain-aware chunking for technical documentation RAG pipelines.

Most chunking libraries treat all text the same — they split on token counts or simple delimiters, which tears apart code blocks, configuration snippets, CLI output, and tables. This destroys the context that makes these elements useful for retrieval-augmented generation.

ragchunk understands document structure. It detects and preserves technical content boundaries, producing chunks that keep code blocks intact, tables whole, and configurations unbroken — while still respecting token budgets for your vector store.

Installation

pip install ragchunk

For development:

pip install "ragchunk[dev]"

Quick Start

Python API

from ragchunk import DocumentChunker

chunker = DocumentChunker(source="router-guide.md")

with open("router-guide.md", encoding="utf-8") as f:
    text = f.read()

chunks = chunker.chunk(text, max_tokens=512, overlap=50)

for chunk in chunks:
    print(f"[{chunk.metadata['chunk_type']}] {chunk.metadata['section']}")
    print(f"  {chunk.token_estimate} tokens, offset {chunk.start_offset}-{chunk.end_offset}")
    print(f"  {chunk.text[:80]}...")
    print()

Pipeline API

from ragchunk import ChunkingPipeline
from ragchunk.pipeline import PipelineConfig

config = PipelineConfig(
    max_tokens=256,
    overlap=30,
    source="handbook.md",
    custom_metadata={"project": "infra-docs", "version": "2.1"},
)

pipeline = ChunkingPipeline(config)

# Add custom enrichment
pipeline.add_enricher(lambda chunk: chunk)  # your logic here

chunks = pipeline.process_file("handbook.md")
print(pipeline.to_json(chunks))
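
As an illustration of what an enricher might look like, here is a minimal sketch. It assumes an enricher is any callable that receives a chunk and returns it (as the `lambda chunk: chunk` placeholder suggests) and that chunks expose `.text` and a `.metadata` dict, as in the Python API example. The `contains_ipv4` key and the `tag_ipv4` name are hypothetical, and the `SimpleNamespace` object is only a stand-in for a real ragchunk chunk.

```python
import re
from types import SimpleNamespace

# Loose IPv4 pattern: four dot-separated groups of 1-3 digits (no range check).
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def tag_ipv4(chunk):
    """Hypothetical enricher: flag chunks whose text mentions an IPv4-like address."""
    chunk.metadata["contains_ipv4"] = bool(IPV4_RE.search(chunk.text))
    return chunk


# Stand-in chunks for demonstration only; real chunks would come from ragchunk.
with_ip = tag_ipv4(SimpleNamespace(text="ip address 10.0.0.1 255.255.255.0", metadata={}))
without_ip = tag_ipv4(SimpleNamespace(text="Enable OSPF on the core routers.", metadata={}))
```

If those assumptions hold, the function could be registered with `pipeline.add_enricher(tag_ipv4)` in place of the placeholder lambda.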

CLI

# Chunk a file, output JSON to stdout
ragchunk chunk docs/guide.md

# With options
ragchunk chunk docs/guide.md --max-tokens 256 --overlap 30 --output chunks.json

# Show document structure
ragchunk info docs/guide.md

# Use a YAML config
ragchunk chunk docs/guide.md --config pipeline.yaml

Detection Types

ragchunk detects and preserves these structural elements:

| Element Type | Detection Method | Example |
| --- | --- | --- |
| Fenced code blocks | `` ``` `` or `~~~` delimiters | Python, YAML, Bash snippets |
| Indented code blocks | 4+ space indent | Legacy markdown code |
| Markdown tables | Pipe-delimited with separator row | `\| Col \| Col \|` |
| ASCII tables | Box-drawing borders (`+---+`) | Network inventory tables |
| Cisco IOS config | Keyword detection (`interface`, `router`, etc.) | Router/switch configs |
| JunOS config | Curly-brace block detection | Firewall filters |
| CLI output | Prompt patterns (`$`, `#`, `>`, `>>>`) | Command + output blocks |
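
To make two of these detection methods concrete, the sketch below shows one way fenced code blocks and markdown table separator rows could be recognized line by line. It is not ragchunk's actual implementation; the function names and regular expressions are assumptions chosen for illustration, and the fence matching is simplified (it compares only the fence character, not its length).

```python
import re

# A fence line starts with three or more backticks or tildes.
FENCE_RE = re.compile(r"^(`{3,}|~{3,})")
# A separator row consists of dash runs (optional colons) divided by pipes.
TABLE_SEP_RE = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$")


def find_fenced_blocks(lines):
    """Return inclusive (start, end) line-index pairs for closed fenced blocks."""
    spans = []
    open_at = None
    fence_char = None
    for i, line in enumerate(lines):
        match = FENCE_RE.match(line.strip())
        if not match:
            continue
        if open_at is None:
            # Opening fence: remember where it started and which character it used.
            open_at, fence_char = i, match.group(1)[0]
        elif match.group(1)[0] == fence_char:
            # Closing fence of the same kind: record the span.
            spans.append((open_at, i))
            open_at = None
    return spans


def is_table_separator(line):
    """True for rows such as '| --- | --- |'; a pipe is required to exclude '---' rules."""
    return "|" in line and bool(TABLE_SEP_RE.match(line))


fence = "`" * 3  # built at runtime so this example can itself sit inside a fence
sample = [
    "Intro prose.",
    fence + "python",
    "print('hi')",
    fence,
    "| Col | Col |",
    "| --- | --- |",
    "| a   | b   |",
]
blocks = find_fenced_blocks(sample)
```

An unclosed fence is simply ignored here; a real detector would need a policy for that case.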

How It Works

1. Detect: Scan for protected blocks (code, config, tables, CLI output)
2. Split by structure: Divide on markdown headers, keeping section metadata
3. Split prose: Break remaining prose at sentence boundaries
4. Preserve blocks: Never split mid-code-block, mid-table, or mid-config
5. Apply overlap: Add token overlap between consecutive prose chunks
6. Enrich: Attach metadata (section, type, word count, offsets)

If a protected block exceeds max_tokens, it is kept intact and marked oversized in its metadata rather than split. One large chunk is preferable to destroying the structure that makes the block useful for retrieval.
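
The core of this flow can be sketched in a few lines. The example below is a simplified illustration, not ragchunk's code: it protects only backtick-fenced blocks, packs prose sentence by sentence under the budget, and flags anything over budget as oversized. Header splitting, overlap, and the other block types are omitted, and the dict layout and function names are assumptions.

```python
import math
import re

# Matches a backtick-fenced block from its opening line to its closing fence.
FENCED_RE = re.compile(r"(^`{3}.*?^`{3}[ \t]*$)", re.DOTALL | re.MULTILINE)


def estimate_tokens(text):
    """Word-count heuristic; rounding up is an assumption of this sketch."""
    return math.ceil(len(text.split()) * 1.3)


def make_chunk(kind, text, max_tokens):
    return {"type": kind, "text": text, "oversized": estimate_tokens(text) > max_tokens}


def chunk_text(text, max_tokens=512):
    chunks = []
    # re.split with a capturing group keeps the fenced blocks in the result.
    for part in FENCED_RE.split(text):
        part = part.strip()
        if not part:
            continue
        if part.startswith("`" * 3):
            # Protected block: always emitted whole, even when over budget.
            chunks.append(make_chunk("code", part, max_tokens))
            continue
        current = []
        for sentence in re.split(r"(?<=[.!?])\s+", part):
            candidate = " ".join(current + [sentence])
            if current and estimate_tokens(candidate) > max_tokens:
                # Budget reached: close the current chunk at a sentence boundary.
                chunks.append(make_chunk("prose", " ".join(current), max_tokens))
                current = [sentence]
            else:
                current.append(sentence)
        if current:
            chunks.append(make_chunk("prose", " ".join(current), max_tokens))
    return chunks


fence = "`" * 3  # built at runtime so this example can itself sit inside a fence
doc = (
    "First sentence here. Second sentence here.\n"
    + fence + "python\nx = 1\ny = 2\n" + fence
    + "\nThird sentence."
)
result = chunk_text(doc, max_tokens=5)
```

With the deliberately tiny budget of 5, the two prose sentences land in separate chunks while the code block stays in one piece and is flagged as oversized.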

Token Estimation

ragchunk uses a word_count * 1.3 heuristic for token estimation. This avoids requiring heavy tokenizer dependencies while providing reasonable approximations for English technical text. For precise token counts, post-process chunks with your model's tokenizer.
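
As a rough illustration of the heuristic and of the suggested post-processing step, consider the sketch below. The rounding behaviour (rounding up) and both function names are assumptions; ragchunk's internal estimator may differ. The `count_tokens` parameter is where a model-specific tokenizer's counting function could be plugged in.

```python
import math


def estimate_tokens(text, factor=1.3):
    """Approximate token count as whitespace-separated words times a factor."""
    return math.ceil(len(text.split()) * factor)


def over_budget(texts, max_tokens, count_tokens=estimate_tokens):
    """Return indices of texts whose token count exceeds max_tokens.

    Pass a precise counter as count_tokens to re-check chunks that were
    sized with the heuristic.
    """
    return [i for i, text in enumerate(texts) if count_tokens(text) > max_tokens]


command_estimate = estimate_tokens("show ip interface brief")  # 4 words
flagged = over_budget(["a b c", "a b c d e f g h i j"], max_tokens=5)
```

Because the heuristic counts words, it tends to undercount text dense with symbols or long identifiers, which is one reason to re-check chunks near the budget with the real tokenizer.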

Pipeline Configuration (YAML)

max_tokens: 512
overlap: 50
source: "my-document.md"
detect_code: true
detect_config: true
detect_tables: true
detect_cli: true
enrich_metadata: true
custom_metadata:
  project: "infrastructure"
  team: "netops"

Development

git clone https://github.com/cwccie/ragchunk.git
cd ragchunk
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=ragchunk --cov-report=term-missing

# Lint
ruff check src/ tests/

# Format
ruff format src/ tests/
