chaffer

Lint for RAG corpora and retrievers. One import, one call, one report — catches the silent corpus bugs that pass review and quietly destroy retrieval quality in production.

pip install chaffer            # core (pure Python stdlib)
pip install chaffer[tokens]    # adds tiktoken for exact token counts (RG003)
pip install chaffer[pdf]       # adds PDF report support (fpdf2)

import chaffer

report = chaffer.check_corpus(
    chunks,                                          # list[str] or list[{"text": str, ...}]
    embed_model_name="text-embedding-3-small",
    eval_queries=["What is X?", "How does Y work?"],
)
print(report)

if not report.ok():
    raise SystemExit("Fix the critical corpus issues before indexing.")

clean = report.cleaned_chunks(chunks)                # drop chunks flagged critical

That's the whole corpus-checking API. Strings or dicts work as inputs. chaffer does not embed any chunk, call any LLM, or hit your vector DB: check_corpus is deterministic, runs in seconds on a 50k-chunk corpus, and the core package depends only on the Python standard library.
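
Since both input shapes are accepted, code that sits around the call often needs the plain text of each chunk. A minimal sketch of that normalization, assuming dict chunks carry their content under the "text" key as in the comment above; chunk_texts is a hypothetical helper, not part of chaffer:

```python
from typing import Mapping, Sequence, Union

Chunk = Union[str, Mapping[str, object]]


def chunk_texts(chunks: Sequence[Chunk]) -> list[str]:
    """Return the text of each chunk, accepting str or {"text": str, ...}."""
    texts = []
    for i, chunk in enumerate(chunks):
        if isinstance(chunk, str):
            texts.append(chunk)
        elif isinstance(chunk, Mapping) and isinstance(chunk.get("text"), str):
            texts.append(chunk["text"])
        else:
            raise TypeError(f"chunk {i} is neither a str nor a dict with a 'text' string")
    return texts


mixed = ["plain string chunk", {"text": "dict chunk", "source": "doc-1"}]
print(chunk_texts(mixed))  # ['plain string chunk', 'dict chunk']
```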


Why this exists

The bugs that wreck production RAG aren't retriever bugs. They're corpus hygiene bugs that pass code review:

  • The same boilerplate header is in every doc → top-5 retrieval returns 4 copies of the same chunk.
  • A chunker emits oversized chunks → the embed model silently truncates and the tail of every long chunk is unreachable.
  • Eval queries were written by reading the source docs → reported recall is partly a string-match exercise.
  • An empty chunk slips in → it produces a zero-vector embedding and pollutes top-k.
  • The embed model dim doesn't match the index dim → inserts produce garbage similarity scores.

chaffer.check_corpus(...) is a single call that catches these before you spend money embedding the corpus, with a concrete fix for each.
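
To make two of the bullets above concrete, here is a standard-library sketch of an empty-chunk check and a verbatim eval-query leak check. It illustrates the idea only and is not chaffer's implementation: the function names and the normalization (lowercasing, whitespace collapsing) are assumptions, and it catches verbatim matches only, not near-verbatim ones.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial edits don't hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def empty_chunks(chunks: list[str]) -> list[int]:
    """Indices of chunks that are empty or whitespace-only."""
    return [i for i, c in enumerate(chunks) if not c.strip()]


def leaked_queries(chunks: list[str], eval_queries: list[str]) -> dict[str, list[int]]:
    """Map each eval query to the chunk indices that contain it verbatim."""
    normalized = [normalize(c) for c in chunks]
    hits = {}
    for query in eval_queries:
        q = normalize(query)
        found = [i for i, c in enumerate(normalized) if q and q in c]
        if found:
            hits[query] = found
    return hits


chunks = [
    "Refunds are processed within 5 business days.",
    "   ",
    "FAQ: How long do refunds take?  Usually 5 business days.",
]
print(empty_chunks(chunks))                                   # [1]
print(leaked_queries(chunks, ["How long do refunds take?"]))  # {'How long do refunds take?': [2]}
```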


What it catches

Code   Severity  What it catches
RG001  critical  Exact-duplicate chunks (identical MD5 hashes)
RG003  critical  Oversized chunks silently truncated by the named embed model
RG004  critical  Embedding-dim mismatch between model and index
RG006  critical  Eval queries leaking verbatim/near-verbatim into the corpus
RG012  critical  Empty or whitespace-only chunks
RG002  warning   Near-duplicate chunks (5-shingle Jaccard ≥ 0.85)
RG005  warning   PII detected in chunks (email / phone / SSN / credit-card / IBAN)
RG014  warning   BM25 vs semantic top-k disagreement (one retriever likely broken)

Each finding tells you the affected chunk indices, the severity, and how to fix it — not just that something is wrong.
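
The two duplicate checks in the table can be sketched with the standard library alone. This is an illustration of the techniques named there (MD5 matching and 5-shingle Jaccard), not chaffer's code; in particular, word-level shingles are an assumption, since the table doesn't say whether shingles are built from words or characters.

```python
import hashlib
from collections import defaultdict


def exact_duplicates(chunks: list[str]) -> list[list[int]]:
    """Group chunk indices whose text has the same MD5 digest."""
    groups = defaultdict(list)
    for i, text in enumerate(chunks):
        groups[hashlib.md5(text.encode("utf-8")).hexdigest()].append(i)
    return [idxs for idxs in groups.values() if len(idxs) > 1]


def shingles(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Set of n-word shingles; texts of n words or fewer yield one shingle."""
    words = text.lower().split()
    if len(words) <= n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Intersection size over union size; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


print(exact_duplicates(["footer", "body", "footer"]))  # [[0, 2]]

a = ("the quick brown fox jumps over the lazy dog while "
     "the cat sleeps on the warm mat near the door")
b = a.replace("door", "window")  # one changed word at the end
print(jaccard(shingles(a), shingles(b)) >= 0.85)  # True
```

Comparing every pair of chunks this way is quadratic in corpus size, so a real linter would likely need an index such as MinHash to stay fast on large corpora.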


Demo: with vs without chaffer

The repo ships examples/demo.py — a synthetic 60-chunk corpus with five bugs baked in:

  1. Exact-duplicate boilerplate footer copied across 6 documents → RG001
  2. One chunk pasted at 4× the embed model's max_seq_length → RG003
  3. Empty chunk from a malformed parser → RG012
  4. An eval query quoted verbatim into the corpus → RG006
  5. Configured embed model produces 1536-dim vectors but the index expects 768 → RG004

Run it:

cd examples
python demo.py

chaffer flags all five as critical, so report.ok() returns False.


Use it in CI

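import sys

import chaffer

# chunks and eval_questions are loaded by your own pipeline code.
report = chaffer.check_corpus(
    chunks,
    embed_model_name="text-embedding-3-small",
    eval_queries=eval_questions,
)
sys.exit(0 if report.ok() else 1)   # non-zero exit fails the CI job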

A failed report.ok() blocks the merge before a bad corpus gets embedded.
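
For CI it is often useful to keep the report as a build artifact as well as gating on it. A sketch of that pattern, relying only on report.ok() and report.to_dict() as described in the API reference; gate and FakeReport are hypothetical names, and FakeReport's dict shape is invented so the example runs without a real corpus:

```python
import json
import tempfile
from pathlib import Path


def gate(report, artifact: Path) -> int:
    """Write the report as JSON and return a process exit code."""
    artifact.write_text(json.dumps(report.to_dict(), indent=2), encoding="utf-8")
    return 0 if report.ok() else 1


class FakeReport:
    """Hypothetical stand-in for a chaffer Report (not the real class)."""

    def __init__(self, critical: int):
        self._critical = critical

    def ok(self) -> bool:
        return self._critical == 0

    def to_dict(self) -> dict:
        return {"critical": self._critical}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "chaffer-report.json"
    exit_code = gate(FakeReport(critical=2), path)
    saved = json.loads(path.read_text(encoding="utf-8"))

print(exit_code, saved)  # 1 {'critical': 2}
```

In a real job you would pass the report returned by chaffer.check_corpus(...) and finish with sys.exit(gate(report, Path("chaffer-report.json"))).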


Audit a retriever

chaffer.check_corpus() looks at data. chaffer.check_retriever() looks at retrieval behavior:

import chaffer

def my_dense(query, k):     # your semantic retriever
    return vector_db.search(embed(query), k=k)

def my_bm25(query, k):      # any BM25 over the same corpus
    return bm25_index.search(query, k=k)

report = chaffer.check_retriever(
    my_dense,
    bm25=my_bm25,
    eval_queries=["What is X?", ...],
    k=10,
)
print(report)

When BM25 and your semantic retriever share less than 10% of their top-k on average, one of them is probably broken — chaffer flags this as RG014.
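
The overlap measure can be sketched as follows. The README doesn't spell out the exact formula, so this assumes overlap is the number of shared results divided by k, averaged over queries, and that retrievers return hashable document ids. mean_topk_overlap and the toy retrievers are hypothetical names for illustration.

```python
from statistics import mean
from typing import Callable, Hashable, Sequence

Retriever = Callable[[str, int], Sequence[Hashable]]


def mean_topk_overlap(
    dense: Retriever,
    bm25: Retriever,
    queries: Sequence[str],
    k: int = 10,
) -> float:
    """Average fraction of the top-k that both retrievers return."""
    overlaps = []
    for query in queries:
        dense_ids = set(dense(query, k))
        bm25_ids = set(bm25(query, k))
        overlaps.append(len(dense_ids & bm25_ids) / k)
    return mean(overlaps)


def toy_dense(query: str, k: int) -> list[str]:
    return [f"doc-{i}" for i in range(k)]


def toy_bm25(query: str, k: int) -> list[str]:
    # Returns a disjoint set of ids, simulating a broken retriever.
    return [f"doc-{i}" for i in range(k, 2 * k)]


overlap = mean_topk_overlap(toy_dense, toy_bm25, ["What is X?", "How does Y work?"])
print(overlap < 0.10)  # True: below the 10% threshold described above
```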


API reference

chaffer.check_corpus(
    chunks,                              # list[str] or list[{"text": str, ...}]
    *,
    embed_model_name=None,               # enables RG003 / RG004
    index_dim=None,                      # enables RG004
    eval_queries=None,                   # enables RG006
    near_dupe_threshold=0.85,            # RG002 threshold
) -> Report

chaffer.check_retriever(
    retriever,                           # callable (query, k) -> list
    bm25,                                # callable (query, k) -> list, or None
    eval_queries,                        # list[str]
    *,
    k=10,
) -> Report

Report:

  • report.ok() returns True if there are no critical findings.
  • report.findings, report.critical, report.warnings, report.infos — lists of Finding.
  • report.cleaned_chunks(chunks) — drops chunks flagged by any critical finding.
  • print(report) — human-readable terminal summary.
  • report.to_dict() — JSON-serializable dict (good for CI logs / artifacts).

Each Finding has: code, severity (critical / warning / info), message, fix, chunks (tuple of indices), details.
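
To show how the chunks indices on a finding can be used, here is a sketch of dropping every chunk that a critical finding points at. The Finding dataclass below is a hypothetical stand-in that mirrors the listed fields, and drop_critical is an illustration, not the body of report.cleaned_chunks (which may, for example, treat duplicate groups differently).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Finding:
    """Hypothetical stand-in mirroring the fields listed above."""
    code: str
    severity: str                      # "critical", "warning" or "info"
    message: str
    fix: str
    chunks: tuple[int, ...] = ()
    details: dict = field(default_factory=dict)


def drop_critical(chunks: list, findings: list[Finding]) -> list:
    """Keep only chunks that no critical finding points at."""
    flagged = {i for f in findings if f.severity == "critical" for i in f.chunks}
    return [c for i, c in enumerate(chunks) if i not in flagged]


findings = [
    Finding("RG012", "critical", "empty chunk", "drop it", chunks=(1,)),
    Finding("RG005", "warning", "email found", "redact it", chunks=(2,)),
]
corpus = ["intro text", "", "mail me at x@example.com"]
print(drop_critical(corpus, findings))  # ['intro text', 'mail me at x@example.com']
```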


Known embed models

chaffer ships with limits for OpenAI text-embedding-3-{small,large} and text-embedding-ada-002, the Cohere embed-english-v3.0 family, Voyage voyage-3 / voyage-3-lite, all-MiniLM-L6-v2, all-mpnet-base-v2, BAAI/bge-{small,base,large}-en-v1.5, and intfloat/e5-{small,base}-v2. If your model isn't in the registry, RG003 emits an info finding ("truncation check skipped") instead of crashing. Open an issue or PR to add it.
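
A registry of this kind can be pictured as a lookup table from model name to limits. The sketch below is an assumption about the general shape, not chaffer's registry: the names LIMITS, oversized and dim_mismatch are hypothetical, the two entries use commonly published figures that should be checked against each provider's documentation, and tokens are approximated by whitespace splitting rather than a real tokenizer.

```python
from typing import NamedTuple, Optional


class ModelLimits(NamedTuple):
    max_tokens: int
    dim: int


# Illustrative entries only (assumed values; verify against provider docs).
LIMITS = {
    "text-embedding-3-small": ModelLimits(max_tokens=8191, dim=1536),
    "all-MiniLM-L6-v2": ModelLimits(max_tokens=256, dim=384),
}


def approx_tokens(text: str) -> int:
    """Crude whitespace count; a real check would use the model's tokenizer."""
    return len(text.split())


def oversized(chunks: list[str], model: str) -> Optional[list[int]]:
    """Indices of chunks over the model's limit, or None for an unknown model."""
    limits = LIMITS.get(model)
    if limits is None:
        return None  # unknown model: skip the check instead of failing
    return [i for i, c in enumerate(chunks) if approx_tokens(c) > limits.max_tokens]


def dim_mismatch(model: str, index_dim: int) -> Optional[bool]:
    """True if the model's output dim differs from the index dim."""
    limits = LIMITS.get(model)
    return None if limits is None else limits.dim != index_dim


chunks = ["short chunk", "word " * 300]
print(oversized(chunks, "all-MiniLM-L6-v2"))         # [1]
print(oversized(chunks, "my-custom-model"))          # None
print(dim_mismatch("text-embedding-3-small", 768))   # True
```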


Scope, on purpose

chaffer is only a linter for RAG corpus and retriever bugs. It doesn't:

  • embed text (use OpenAI / Cohere / sentence-transformers),
  • store or search vectors (use pinecone / weaviate / qdrant / faiss / chroma),
  • evaluate end-to-end answer quality (use RAGAS / TruLens / DeepEval),
  • check answer faithfulness (corroborate does — sibling library),
  • chunk documents (use unstructured / llama-index / langchain).

Doing one thing well is the point. If chaffer.check_corpus() returns clean, your corpus is free of the issues chaffer checks for, and that's all it claims.


See also

  • corroborate — deterministic answer-grounding check. Sibling library: chaffer lints the corpus before retrieval, corroborate lints the answer after generation.
  • dash-mlguard — same author, same form factor, but for ML training pipelines instead of RAG.

Development

git clone https://github.com/asmitdash/chaffer
cd chaffer
pip install -e ".[dev]"
pytest

License

MIT — see LICENSE.
