chaffer

Lint for RAG corpora and retrievers. One import, one call, one report — catches the silent corpus bugs that pass review and quietly destroy retrieval quality in production.

pip install chaffer            # core (pure Python stdlib)
pip install chaffer[tokens]    # adds tiktoken for exact token counts (RG003)
pip install chaffer[pdf]       # adds PDF report support (fpdf2)

import chaffer

report = chaffer.check_corpus(
    chunks,                                          # list[str] or list[{"text": str, ...}]
    embed_model_name="text-embedding-3-small",
    eval_queries=["What is X?", "How does Y work?"],
)
print(report)

if not report.ok():
    raise SystemExit("Fix the critical corpus issues before indexing.")

clean = report.cleaned_chunks(chunks)                # drop chunks flagged critical

That's the whole corpus-checking API. Strings or dicts work as inputs. chaffer does not embed any chunk, call any LLM, or hit your vector DB. It is deterministic, runs in seconds on a 50k-chunk corpus, and its core depends only on the Python standard library.

Why this exists

The bugs that wreck production RAG aren't retriever bugs. They're corpus hygiene bugs that pass code review:

The same boilerplate header is in every doc → top-5 retrieval returns 4 copies of the same chunk.
A chunker emits oversized chunks → the embed model silently truncates and the tail of every long chunk is unreachable.
Eval queries were written by reading the source docs → reported recall is partly a string-match exercise.
An empty chunk slips in → it produces a zero-vector embedding and pollutes top-k.
The embed model dim doesn't match the index dim → inserts produce garbage similarity scores.

chaffer.check_corpus(...) is a single call that catches these before you spend money embedding the corpus, with a concrete fix for each.
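To make "corpus hygiene bug" concrete, here is a minimal sketch of the two simplest checks in the list above: empty chunks and exact duplicates. The helper names `find_empty_chunks` and `find_exact_duplicates` are hypothetical and are not part of chaffer's API; the sketch only illustrates the idea under the assumption that chunks are plain strings.

```python
import hashlib
from collections import defaultdict


def find_empty_chunks(chunks):
    """Return indices of chunks that are empty or whitespace-only."""
    return [i for i, text in enumerate(chunks) if not text.strip()]


def find_exact_duplicates(chunks):
    """Group chunk indices by the MD5 digest of their text.

    Only groups with more than one member are returned, i.e. exact duplicates.
    """
    groups = defaultdict(list)
    for i, text in enumerate(chunks):
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        groups[digest].append(i)
    return [indices for indices in groups.values() if len(indices) > 1]


chunks = [
    "Copyright Acme 2024",
    "How to reset a password",
    "   ",
    "Copyright Acme 2024",
]
print(find_empty_chunks(chunks))      # [2]
print(find_exact_duplicates(chunks))  # [[0, 3]]
```

Both checks need only the raw text, which is why they can run before any embedding cost is incurred.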

What it catches

Code	Severity	What it catches
`RG001`	critical	Exact-duplicate chunks (identical MD5 digests)
`RG003`	critical	Oversized chunks silently truncated by the named embed model
`RG004`	critical	Embedding-dim mismatch between model and index
`RG006`	critical	Eval queries leaking verbatim/near-verbatim into the corpus
`RG012`	critical	Empty or whitespace-only chunks
`RG002`	warning	Near-duplicate chunks (5-shingle Jaccard ≥ 0.85)
`RG005`	warning	PII detected in chunks (email / phone / SSN / credit-card / IBAN)
`RG014`	warning	BM25 vs semantic top-k disagreement (one retriever likely broken)

Each finding tells you the affected chunk indices, the severity, and how to fix it — not just that something is wrong.
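For readers unfamiliar with the RG002 metric, the following is a rough sketch of 5-shingle Jaccard similarity. It assumes word-level shingles on lowercased, whitespace-split text; whether chaffer shingles words or characters is not stated here, and the function names are hypothetical.

```python
def shingles(text, n=5):
    """Return the set of n-word shingles in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    if len(words) < n:
        # Texts shorter than n words become a single short shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(chunks, threshold=0.85):
    """Return (i, j, similarity) for chunk pairs at or above the threshold.

    Compares every pair, so cost grows quadratically with the chunk count.
    """
    sets = [shingles(c) for c in chunks]
    pairs = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            sim = jaccard(sets[i], sets[j])
            if sim >= threshold:
                pairs.append((i, j, sim))
    return pairs


base = " ".join(f"word{i}" for i in range(40))
chunks = [base, base + " extra", "an unrelated chunk about a different topic entirely"]
for i, j, sim in near_duplicate_pairs(chunks):
    print(i, j, round(sim, 3))  # 0 1 0.973
```

The all-pairs loop is fine for a sketch but would not scale to 50k chunks; a real implementation would likely need MinHash or a similar shortcut.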

Demo: with vs without chaffer

The repo ships examples/demo.py — a synthetic 60-chunk corpus with five bugs baked in:

Exact-duplicate boilerplate footer copied across 6 documents → RG001
One chunk pasted at 4× the embed model's max_seq_length → RG003
Empty chunk from a malformed parser → RG012
An eval query quoted verbatim into the corpus → RG006
Configured embed model produces 1536-dim vectors but the index expects 768 → RG004

Run it:

cd examples
python demo.py

chaffer flags all five as critical, so report.ok() returns False.
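The eval-query leak in the demo (RG006) is the least obvious of the five, so here is a hedged sketch of how a verbatim leak could be detected. It normalizes case, punctuation, and whitespace, then does a plain substring test. This covers only verbatim leaks, not the near-verbatim case, and `leaked_queries` and its `min_words` parameter are hypothetical names, not chaffer's implementation.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def leaked_queries(chunks, eval_queries, min_words=4):
    """Map each leaked query to the indices of chunks that contain it.

    Queries shorter than min_words are skipped to avoid trivial matches.
    """
    normalized_chunks = [normalize(c) for c in chunks]
    leaks = {}
    for query in eval_queries:
        q = normalize(query)
        if len(q.split()) < min_words:
            continue
        hits = [i for i, c in enumerate(normalized_chunks) if q in c]
        if hits:
            leaks[query] = hits
    return leaks


chunks = [
    "FAQ: How do I rotate an API key? Open Settings and click Rotate.",
    "Billing runs on the first day of each month.",
]
queries = ["How do I rotate an API key?", "When am I charged?"]
print(leaked_queries(chunks, queries))
# {'How do I rotate an API key?': [0]}
```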

Use it in CI

import chaffer, sys

report = chaffer.check_corpus(
    chunks,
    embed_model_name="text-embedding-3-small",
    eval_queries=eval_questions,
)
sys.exit(0 if report.ok() else 1)

A failed report.ok() blocks the merge before a bad corpus gets embedded.
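If you also want the findings preserved as a CI artifact, a small wrapper along these lines may help. It relies only on report.ok() and report.to_dict() as described in the API reference; the function name `gate_on_report` and the default file name are assumptions made for this sketch.

```python
import json


def gate_on_report(report, artifact_path="chaffer-report.json"):
    """Write the report to a JSON file and return a process exit code.

    `report` is expected to expose ok() and to_dict(). The artifact path
    is an arbitrary choice for this sketch, not a chaffer convention.
    """
    with open(artifact_path, "w", encoding="utf-8") as fh:
        json.dump(report.to_dict(), fh, indent=2)
    return 0 if report.ok() else 1
```

In a CI script this would be called as `sys.exit(gate_on_report(report))`, so the JSON file is written whether or not the check passes.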

Audit a retriever

chaffer.check_corpus() looks at data. chaffer.check_retriever() looks at retrieval behavior:

import chaffer

def my_dense(query, k):     # your semantic retriever
    return vector_db.search(embed(query), k=k)

def my_bm25(query, k):      # any BM25 over the same corpus
    return bm25_index.search(query, k=k)

report = chaffer.check_retriever(
    my_dense,
    bm25=my_bm25,
    eval_queries=["What is X?", ...],
    k=10,
)
print(report)

When BM25 and your semantic retriever share less than 10% of their top-k on average, one of them is probably broken — chaffer flags this as RG014.
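The overlap statistic behind RG014 can be sketched as follows. This is one plausible definition (shared results divided by the longer list's length, averaged over queries); chaffer's exact formula is not stated here, and the sketch assumes both retrievers return hashable document ids. All function names are hypothetical.

```python
def topk_overlap(dense_ids, bm25_ids):
    """Fraction of shared results, relative to the longer result list."""
    denom = max(len(dense_ids), len(bm25_ids))
    if denom == 0:
        return 0.0
    return len(set(dense_ids) & set(bm25_ids)) / denom


def mean_overlap(dense, bm25, queries, k=10):
    """Average top-k overlap of two retrievers over a list of queries."""
    if not queries:
        raise ValueError("need at least one eval query")
    scores = [topk_overlap(dense(q, k), bm25(q, k)) for q in queries]
    return sum(scores) / len(scores)


# Toy retrievers that ignore the query and return fixed ids.
def toy_dense(query, k):
    return ["d1", "d2", "d3", "d4"][:k]


def toy_bm25(query, k):
    return ["d3", "d4", "d5", "d6"][:k]


score = mean_overlap(toy_dense, toy_bm25, ["q1", "q2"], k=4)
print(score)         # 0.5
print(score < 0.10)  # False, so this pair would not be flagged
```

Some disagreement between lexical and semantic retrieval is expected; only near-zero overlap suggests that one side is indexing the wrong corpus or is otherwise broken.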

API reference

chaffer.check_corpus(
    chunks,                              # list[str] or list[{"text": str, ...}]
    *,
    embed_model_name=None,               # enables RG003 / RG004
    index_dim=None,                      # enables RG004
    eval_queries=None,                   # enables RG006
    near_dupe_threshold=0.85,            # RG002 threshold
) -> Report

chaffer.check_retriever(
    retriever,                           # callable (query, k) -> list
    bm25,                                # callable (query, k) -> list, or None
    eval_queries,                        # list[str]
    *,
    k=10,
) -> Report

Report:

report.ok() — True if no critical findings.
report.findings, report.critical, report.warnings, report.infos — lists of Finding.
report.cleaned_chunks(chunks) — drops chunks flagged by any critical finding.
print(report) — human-readable terminal summary.
report.to_dict() — JSON-serializable dict (good for CI logs / artifacts).

Each Finding has: code, severity (critical / warning / info), message, fix, chunks (tuple of indices), details.
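To illustrate how the Finding fields relate to cleaned_chunks, here is a hypothetical stand-in. The `Finding` dataclass below only mirrors the field names listed above, and `drop_critical` is an assumed approximation of the behavior, not chaffer's code. In particular, whether chaffer keeps one copy from each duplicate group is not stated here, so this sketch simply drops every flagged index.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    """Hypothetical stand-in with the fields listed above."""
    code: str
    severity: str                 # "critical", "warning", or "info"
    message: str
    fix: str
    chunks: tuple = ()            # indices into the chunk list
    details: dict = field(default_factory=dict)


def drop_critical(chunks, findings):
    """Return the chunks whose index is not named by any critical finding."""
    flagged = {i for f in findings if f.severity == "critical" for i in f.chunks}
    return [c for i, c in enumerate(chunks) if i not in flagged]


chunks = ["Reset your password in Settings.", "", "Contact bob@example.com"]
findings = [
    Finding("RG012", "critical", "empty chunk", "drop it", (1,)),
    Finding("RG005", "warning", "email address found", "redact it", (2,)),
]
print(drop_critical(chunks, findings))
# ['Reset your password in Settings.', 'Contact bob@example.com']
```

Warnings leave chunks in place, which matches the description of cleaned_chunks as dropping only chunks flagged by critical findings.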

Known embed models

chaffer ships with limits for OpenAI text-embedding-3-{small,large} and text-embedding-ada-002, the Cohere embed-english-v3.0 family, Voyage voyage-3 / voyage-3-lite, all-MiniLM-L6-v2, all-mpnet-base-v2, BAAI/bge-{small,base,large}-en-v1.5, and intfloat/e5-{small,base}-v2. If your model isn't in the registry, RG003 emits an info finding ("truncation check skipped") instead of crashing. Open an issue or PR to add it.
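A registry of this kind could look roughly like the sketch below. The three entries use token limits and dimensions as published in the respective model documentation, but treat them as assumptions and verify before relying on them. The 4-characters-per-token estimate is a crude heuristic chosen for this sketch, and `KNOWN_MODELS`, `estimate_tokens`, and `check_model` are hypothetical names rather than chaffer internals.

```python
KNOWN_MODELS = {
    # name: (max input tokens, embedding dimension)
    "text-embedding-3-small": (8191, 1536),
    "all-MiniLM-L6-v2": (256, 384),
    "all-mpnet-base-v2": (384, 768),
}


def estimate_tokens(text):
    """Crude estimate: about 4 characters per token for English text."""
    return (len(text) + 3) // 4


def check_model(chunks, model_name, index_dim=None):
    """Return (severity, message) tuples for truncation and dim mismatch."""
    if model_name not in KNOWN_MODELS:
        return [("info", f"unknown model {model_name!r}: truncation check skipped")]
    max_tokens, dim = KNOWN_MODELS[model_name]
    results = []
    for i, text in enumerate(chunks):
        n = estimate_tokens(text)
        if n > max_tokens:
            results.append(("critical", f"chunk {i}: ~{n} tokens exceeds {max_tokens}"))
    if index_dim is not None and index_dim != dim:
        results.append(("critical", f"model emits {dim} dims, index expects {index_dim}"))
    return results


chunks = ["short chunk", "x" * 2000]
for severity, message in check_model(chunks, "all-MiniLM-L6-v2", index_dim=768):
    print(severity, message)
# critical chunk 1: ~500 tokens exceeds 256
# critical model emits 384 dims, index expects 768
print(check_model(chunks, "my-custom-model"))
# [('info', "unknown model 'my-custom-model': truncation check skipped")]
```

A character-based estimate can be off by a wide margin for code or non-English text, which is presumably why the tokens extra offers exact counts via tiktoken.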

Scope, on purpose

chaffer is only a linter for RAG corpus and retriever bugs. It doesn't:

embed text (use OpenAI / Cohere / sentence-transformers),
store or search vectors (use pinecone / weaviate / qdrant / faiss / chroma),
evaluate end-to-end answer quality (use RAGAS / TruLens / DeepEval),
check answer faithfulness (the sibling library corroborate does that),
chunk documents (use unstructured / llama-index / langchain).

Doing one thing well is the point. If chaffer.check_corpus() returns clean, your corpus isn't silently broken — and that's all it claims to do.

Development

git clone https://github.com/asmitdash/chaffer
cd chaffer
pip install -e ".[dev]"
pytest

License

MIT — see LICENSE.
