Added
Semantic chunking: embedding-based boundary detection in HybridChunker using all-MiniLM-L6-v2 (optional, enabled via use_semantic_chunking).
Cross-reference resolution: $O(N)$ resolution of explicit ("Figure 3") and implicit ("the table below") references via spatial proximity.
Summary chunks: asynchronous ARQ background worker (enrich_summaries_job) that auto-generates LLM section summaries for hierarchical RAG retrieval.
Chunk quality scorer: Zero-ML, heuristic-based chunk scoring using block token confidences, dictionary word coverage (/usr/share/dict/words), and fastText language-ID validation.
PII redaction: hybrid approach using fast regex matching with Luhn validation (emails, phone numbers, SSNs, credit card numbers, IP addresses) and optional spaCy NER (en_core_web_sm) for names, organizations, and locations. Original values are preserved in secure block metadata for human-in-the-loop (HITL) review.
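To illustrate the regex-plus-Luhn half of the PII redaction entry, here is a minimal sketch. It is not the project's implementation: the patterns, the `redact` and `luhn_valid` names, and the placeholder strings are all assumptions made for this example. Only emails and credit card numbers are covered.

```python
import re

# Hypothetical patterns for illustration only; the project's real regexes
# are not shown in these release notes.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact(text: str) -> tuple[str, list[dict]]:
    """Replace emails and Luhn-valid card numbers with placeholders.

    Returns the redacted text and a list of the original values, which a
    caller could store in metadata for later human review.
    """
    found: list[dict] = []

    def _card(match: re.Match) -> str:
        # Leave digit runs that fail the checksum untouched.
        if not luhn_valid(match.group()):
            return match.group()
        found.append({"type": "CREDIT_CARD", "value": match.group()})
        return "[CREDIT_CARD]"

    def _email(match: re.Match) -> str:
        found.append({"type": "EMAIL", "value": match.group()})
        return "[EMAIL]"

    text = CARD_RE.sub(_card, text)
    text = EMAIL_RE.sub(_email, text)
    return text, found
```

The Luhn check keeps the regex from redacting arbitrary long digit runs such as order numbers, which is presumably why the two are paired.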
Changed
Bumped marker-pdf version support in dependencies.
Added ner optional dependency group (spacy>=3.7.0) in pyproject.toml.
Expanded ChunkingConfig and ProcessingConfig with new semantic, summary, and PII toggle options.
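As a rough sketch of how a toggle such as use_semantic_chunking might gate embedding-based boundary detection, consider the following. Only the names ChunkingConfig and use_semantic_chunking come from these notes. The `similarity_threshold` field, the `chunk_boundaries` function, and the toy vectors are assumptions. In practice the vectors would come from a sentence-embedding model such as all-MiniLM-L6-v2.

```python
import math
from dataclasses import dataclass


@dataclass
class ChunkingConfig:
    # `use_semantic_chunking` is named in the release notes; the threshold
    # field below is a hypothetical addition for this example.
    use_semantic_chunking: bool = False
    similarity_threshold: float = 0.5


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def chunk_boundaries(embeddings: list[list[float]], config: ChunkingConfig) -> list[int]:
    """Return the sentence indices at which a new chunk would start.

    A boundary is placed wherever the similarity between adjacent sentence
    embeddings drops below the configured threshold. With the toggle off,
    no semantic boundaries are reported.
    """
    if not config.use_semantic_chunking:
        return []
    return [
        i
        for i in range(1, len(embeddings))
        if cosine(embeddings[i - 1], embeddings[i]) < config.similarity_threshold
    ]
```

With toy vectors `[[1, 0], [0.9, 0.1], [0, 1]]` and the toggle enabled, the only boundary falls before the third sentence, since the first two vectors point in nearly the same direction.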