Lint for RAG corpora and retrievers. One import, one call, one report — catches the silent corpus bugs that pass review and quietly destroy retrieval quality in production.
```bash
pip install chaffer             # core (pure Python stdlib)
pip install "chaffer[tokens]"   # adds tiktoken for exact token counts (RG003)
pip install "chaffer[pdf]"      # adds PDF report support (fpdf2)
```

```python
import chaffer

report = chaffer.check_corpus(
    chunks,                                    # list[str] or list[{"text": str, ...}]
    embed_model_name="text-embedding-3-small",
    eval_queries=["What is X?", "How does Y work?"],
)
print(report)
if not report.ok():
    raise SystemExit("Fix the critical corpus issues before indexing.")
clean = report.cleaned_chunks(chunks)          # drop chunks flagged critical
```

That's the whole API. Strings or dicts work as inputs. chaffer does not embed any chunk, call any LLM, or hit your vector DB: it's deterministic, runs in seconds on a 50k-chunk corpus, and depends only on the Python standard library.
The bugs that wreck production RAG aren't retriever bugs. They're corpus hygiene bugs that pass code review:
- The same boilerplate header is in every doc → top-5 retrieval returns 4 copies of the same chunk.
- A chunker emits oversized chunks → the embed model silently truncates and the tail of every long chunk is unreachable.
- Eval queries were written by reading the source docs → reported recall is partly a string-match exercise.
- An empty chunk slips in → it produces a zero-vector embedding and pollutes top-k.
- The embed model dim doesn't match the index dim → inserts produce garbage similarity scores.
chaffer.check_corpus(...) is a single call that catches these before you spend money embedding the corpus, with a concrete fix for each.
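To make two of these failure modes concrete, here is a minimal stdlib-only sketch of an empty-chunk check and a verbatim eval-leak check. It is an illustration of the idea, not chaffer's implementation: the function names (`normalize`, `find_empty_chunks`, `find_leaked_queries`) and the normalization rule are assumptions made for this example.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse every run of non-alphanumerics into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def find_empty_chunks(chunks: list[str]) -> list[int]:
    """Indices of chunks that are empty or whitespace-only."""
    return [i for i, chunk in enumerate(chunks) if not chunk.strip()]


def find_leaked_queries(chunks: list[str], eval_queries: list[str]) -> dict[str, list[int]]:
    """Map each eval query to the chunk indices that contain it verbatim
    (after normalization). Padding with spaces keeps matches on word boundaries."""
    padded_chunks = [f" {normalize(chunk)} " for chunk in chunks]
    leaks: dict[str, list[int]] = {}
    for query in eval_queries:
        needle = normalize(query)
        if not needle:
            continue  # nothing to search for
        hits = [i for i, chunk in enumerate(padded_chunks) if f" {needle} " in chunk]
        if hits:
            leaks[query] = hits
    return leaks


chunks = [
    "Refunds are processed within 5 business days.",
    "   ",
    "FAQ: How long do refunds take? Usually five days.",
]
eval_queries = ["How long do refunds take?", "Who approves refunds?"]

empty = find_empty_chunks(chunks)
leaks = find_leaked_queries(chunks, eval_queries)
print(empty)   # [1]
print(leaks)   # {'How long do refunds take?': [2]}
```

This only catches verbatim leaks after normalization. Near-verbatim leaks would need a fuzzier comparison, such as shingle overlap.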
| Code | Severity | What it catches |
|---|---|---|
| RG001 | critical | Exact-duplicate chunks (identical MD5 digest) |
| RG003 | critical | Oversized chunks silently truncated by the named embed model |
| RG004 | critical | Embedding-dim mismatch between model and index |
| RG006 | critical | Eval queries leaking verbatim/near-verbatim into the corpus |
| RG012 | critical | Empty or whitespace-only chunks |
| RG002 | warning | Near-duplicate chunks (5-shingle Jaccard ≥ 0.85) |
| RG005 | warning | PII detected in chunks (email / phone / SSN / credit-card / IBAN) |
| RG014 | warning | BM25 vs semantic top-k disagreement (one retriever likely broken) |
Each finding tells you the affected chunk indices, the severity, and how to fix it — not just that something is wrong.
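For intuition about the two duplicate checks, the sketch below groups chunks by MD5 digest and compares 5-shingle sets with Jaccard similarity. It is not chaffer's code. It assumes word-level shingles (the table does not say whether shingles are words or characters), and the all-pairs loop is quadratic, so it is only suitable for small corpora.

```python
import hashlib
from itertools import combinations


def exact_duplicates(chunks: list[str]) -> list[list[int]]:
    """Group chunk indices by MD5 digest; keep groups with more than one member."""
    groups: dict[str, list[int]] = {}
    for i, chunk in enumerate(chunks):
        digest = hashlib.md5(chunk.encode("utf-8")).hexdigest()
        groups.setdefault(digest, []).append(i)
    return [indices for indices in groups.values() if len(indices) > 1]


def shingles(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Set of n-word shingles (assumption: word-level, lowercased)."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """|a ∩ b| / |a ∪ b|, defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicates(chunks: list[str], threshold: float = 0.85) -> list[tuple[int, int]]:
    """All index pairs whose shingle sets reach the Jaccard threshold."""
    sets = [shingles(chunk) for chunk in chunks]
    return [
        (i, j)
        for i, j in combinations(range(len(chunks)), 2)
        if jaccard(sets[i], sets[j]) >= threshold
    ]


base = " ".join(f"token{i}" for i in range(40))      # 40 words -> 36 shingles
chunks = [base, base, base + " extra", "an unrelated chunk"]

print(exact_duplicates(chunks))   # [[0, 1]]
print(near_duplicates(chunks))    # [(0, 1), (0, 2), (1, 2)]
```

Exact duplicates also appear as near-duplicate pairs here, since their Jaccard similarity is 1.0. A linter would presumably report them once under the stricter code.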
The repo ships examples/demo.py — a synthetic 60-chunk corpus with five bugs baked in:
- Exact-duplicate boilerplate footer copied across 6 documents → RG001
- One chunk pasted at 4× the embed model's max_seq_length → RG003
- Empty chunk from a malformed parser → RG012
- An eval query quoted verbatim into the corpus → RG006
- Configured embed model produces 1536-dim vectors but the index expects 768 → RG004
Run it:
```bash
cd examples
python demo.py
```

chaffer flags all five as critical, so report.ok() returns False.
```python
import chaffer, sys

report = chaffer.check_corpus(
    chunks,
    embed_model_name="text-embedding-3-small",
    eval_queries=eval_questions,
)
sys.exit(0 if report.ok() else 1)
```

A failed report.ok() blocks the merge before a bad corpus gets embedded.
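In CI it is often useful to keep the findings as a build artifact as well as failing the job. The helper below is a small sketch that relies only on the documented `report.ok()` and `report.to_dict()` methods. The function name `gate` and the default file name are hypothetical choices for this example.

```python
import json
from pathlib import Path


def gate(report, artifact_path: str = "chaffer-report.json") -> int:
    """Write the report as JSON and return a process exit code.

    `report` is any object exposing ok() and to_dict(), as chaffer's Report
    is documented to do. Returns 0 when there are no critical findings, else 1.
    """
    payload = json.dumps(report.to_dict(), indent=2)
    Path(artifact_path).write_text(payload, encoding="utf-8")
    return 0 if report.ok() else 1
```

In the script above, the last line would then become `sys.exit(gate(report))`, and the CI system can upload the JSON file alongside the logs.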
chaffer.check_corpus() looks at data. chaffer.check_retriever() looks at retrieval behavior:
```python
import chaffer

def my_dense(query, k):   # your semantic retriever
    return vector_db.search(embed(query), k=k)

def my_bm25(query, k):    # any BM25 over the same corpus
    return bm25_index.search(query, k=k)

report = chaffer.check_retriever(
    my_dense,
    bm25=my_bm25,
    eval_queries=["What is X?", ...],
    k=10,
)
print(report)
```

When BM25 and your semantic retriever share less than 10% of their top-k on average, one of them is probably broken, and chaffer flags this as RG014.
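The disagreement signal can be approximated in a few lines. The sketch below assumes both retrievers return hashable chunk IDs and defines overlap as the size of the intersection divided by k. chaffer's exact formula is not documented here, so treat this as one plausible reading. The toy retrievers are stand-ins made up for the example.

```python
def mean_topk_overlap(retriever_a, retriever_b, queries, k=10):
    """Average fraction of top-k results shared by two retrievers.

    Both retrievers are callables (query, k) -> list of hashable chunk IDs.
    """
    if not queries:
        raise ValueError("need at least one query")
    total = 0.0
    for query in queries:
        top_a = set(retriever_a(query, k)[:k])
        top_b = set(retriever_b(query, k)[:k])
        total += len(top_a & top_b) / k
    return total / len(queries)


# Toy retrievers standing in for real ones (hypothetical, fixed results).
def dense(query, k):
    return list(range(k))                      # IDs 0..k-1

def bm25_healthy(query, k):
    return list(range(k // 2, k // 2 + k))     # shares half of dense's IDs

def bm25_broken(query, k):
    return list(range(100, 100 + k))           # shares nothing

queries = ["What is X?", "How does Y work?"]
healthy = mean_topk_overlap(dense, bm25_healthy, queries, k=10)
broken = mean_topk_overlap(dense, bm25_broken, queries, k=10)
print(healthy)   # 0.5
print(broken)    # 0.0, below the 10% threshold
```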
```python
chaffer.check_corpus(
    chunks,                     # list[str] or list[{"text": str, ...}]
    *,
    embed_model_name=None,      # enables RG003 / RG004
    index_dim=None,             # enables RG004
    eval_queries=None,          # enables RG006
    near_dupe_threshold=0.85,   # RG002 threshold
) -> Report

chaffer.check_retriever(
    retriever,                  # callable (query, k) -> list
    bm25,                       # callable (query, k) -> list, or None
    eval_queries,               # list[str]
    *,
    k=10,
) -> Report
```

Report:
- report.ok(): True if no critical findings.
- report.findings, report.critical, report.warnings, report.infos: lists of Finding.
- report.cleaned_chunks(chunks): drops chunks flagged by any critical finding.
- print(report): human-readable terminal summary.
- report.to_dict(): JSON-serializable dict (good for CI logs / artifacts).
Each Finding has: code, severity (critical / warning / info), message, fix, chunks (tuple of indices), details.
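As a rough picture of what dropping critically flagged chunks means, the sketch below reproduces the documented behavior from the `severity` and `chunks` fields alone. It uses `SimpleNamespace` objects as stand-ins for Finding, and `drop_critical` is a hypothetical helper, not part of chaffer. How chaffer lists indices for duplicate groups (every copy, or all but the first) is not specified here.

```python
from types import SimpleNamespace


def drop_critical(chunks, findings):
    """Return the chunks whose index is not listed in any critical finding."""
    flagged = {
        index
        for finding in findings
        if finding.severity == "critical"
        for index in finding.chunks
    }
    return [chunk for i, chunk in enumerate(chunks) if i not in flagged]


chunks = ["intro text", "", "contact: a@example.com", "pricing details"]
findings = [
    # Stand-ins with the documented Finding fields used here.
    SimpleNamespace(code="RG012", severity="critical", chunks=(1,)),
    SimpleNamespace(code="RG005", severity="warning", chunks=(2,)),
]

cleaned = drop_critical(chunks, findings)
print(cleaned)   # ['intro text', 'contact: a@example.com', 'pricing details']
```

Only the critical finding removes a chunk. The chunk with the PII warning stays in place for a human to review.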
chaffer ships with limits for OpenAI text-embedding-3-small, text-embedding-3-large, and text-embedding-ada-002, the Cohere embed-english-v3.0 family, Voyage voyage-3 / voyage-3-lite, all-MiniLM-L6-v2, all-mpnet-base-v2, BAAI/bge-{small,base,large}-en-v1.5, and intfloat/e5-{small,base}-v2. If your model isn't in the registry, RG003 emits an info finding ("truncation check skipped") instead of crashing. Open an issue or PR to add it.
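The registry lookup and its fallback can be sketched as follows. This is an illustration under stated assumptions, not chaffer's internals: `LIMITS`, `rough_token_count`, and `check_limits` are hypothetical names, findings are plain tuples rather than Finding objects, and the default token counter is a crude four-characters-per-token estimate that an exact tokenizer would replace. The two limit entries reflect the published figures for those models.

```python
# Hypothetical mini-registry: max input tokens and output dimension per model.
LIMITS = {
    "text-embedding-3-small": {"max_tokens": 8191, "dim": 1536},
    "all-MiniLM-L6-v2": {"max_tokens": 256, "dim": 384},
}


def rough_token_count(text: str) -> int:
    """Crude estimate (about 4 characters per token); an assumption, not exact."""
    return (len(text) + 3) // 4


def check_limits(chunks, model_name, index_dim=None, count_tokens=rough_token_count):
    """Return (severity, message, chunk_indices) tuples for size and dim problems."""
    limits = LIMITS.get(model_name)
    if limits is None:
        # Unknown model: report and skip rather than crash.
        return [("info", f"truncation check skipped: unknown model {model_name!r}", ())]

    findings = []
    oversized = tuple(
        i for i, chunk in enumerate(chunks) if count_tokens(chunk) > limits["max_tokens"]
    )
    if oversized:
        findings.append(
            ("critical", f"chunks exceed {limits['max_tokens']} tokens", oversized)
        )
    if index_dim is not None and index_dim != limits["dim"]:
        findings.append(
            ("critical", f"model emits {limits['dim']}-dim vectors, index expects {index_dim}", ())
        )
    return findings


chunks = ["short chunk", "x" * 40_000]   # second chunk is roughly 10,000 tokens
known = check_limits(chunks, "text-embedding-3-small", index_dim=768)
unknown = check_limits(chunks, "my-custom-model")
for finding in known + unknown:
    print(finding)
```

Passing a real tokenizer through `count_tokens` keeps the check logic unchanged while making the counts exact.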
chaffer is only a linter for RAG corpus and retriever bugs. It doesn't:
- embed text (use OpenAI / Cohere / sentence-transformers),
- store or search vectors (use pinecone / weaviate / qdrant / faiss / chroma),
- evaluate end-to-end answer quality (use RAGAS / TruLens / DeepEval),
- check answer faithfulness (corroborate does — sibling library),
- chunk documents (use unstructured / llama-index / langchain).
Doing one thing well is the point. If chaffer.check_corpus() returns clean, your corpus isn't silently broken — and that's all it claims to do.
- corroborate — deterministic answer-grounding check. Sibling library: chaffer lints the corpus before retrieval, corroborate lints the answer after generation.
- dash-mlguard — same author, same form factor, but for ML training pipelines instead of RAG.
```bash
git clone https://github.com/asmitdash/chaffer
cd chaffer
pip install -e ".[dev]"
pytest
```

MIT, see LICENSE.