A deterministic hybrid retrieval pipeline for financial document search, specialised for US insurance and managed-care financials. No LLM is used internally — retrieval is purely algorithmic, reproducible, and requires no API keys.
Given a natural-language query like "Why did Progressive combined ratio improve in Q3 2024?", the pipeline:
- Infers hard metadata filters — extracts company, year, and quarter from the query text
- Hard-filters the corpus down to only matching chunks (e.g., Progressive + Q3 + 2024)
- Runs BM25 keyword search over the filtered candidates to find exact term matches
- Runs semantic vector search over the same candidates to find conceptually related chunks
- Fuses both ranked lists using Reciprocal Rank Fusion — no LLM reranker needed
- Returns ranked chunks ready to be fed to any LLM as grounded context
Architecture:

                          ┌─────────────────────────┐
    Natural-language ───► │ MetadataFilterEngine    │ ─── infers year / company / quarter
    query                 └──────────┬──────────────┘
                                     │ filtered candidates
                       ┌─────────────┴─────────────┐
                       │                           │
              ┌────────▼────────┐       ┌──────────▼────────┐
              │ BM25Keyword     │       │ VectorRetriever   │
              │ Retriever       │       │ (semantic search) │
              │ (exact terms)   │       │                   │
              └────────┬────────┘       └──────────┬────────┘
                       │ ranked list               │ ranked list
                       └─────────────┬─────────────┘
                              ┌──────▼──────┐
                              │ Reciprocal  │
                              │ Rank Fusion │
                              └──────┬──────┘
                                     │
                             top-k FusedChunks
                     (fed to LLM or returned directly)

`MetadataFilterEngine` extracts structured filters before any ranking:
- Year extraction: regex `\b(20\d{2})\b` matches `2024`. A year glued to letters, as in `FY2025`, is matched only if the leading `\b` is relaxed, because there is no word boundary between `Y` and `2`.
- Quarter extraction: regex `\bQ([1-4])\b`, applied case-insensitively, matches `Q3`, `q1`
- Company extraction: collision-safe alias matching using word-boundary regex across all canonical company names. Short aliases (e.g. `progressive`) are only registered when they uniquely match one company and don't appear as a substring of another.
- Ranking query stripping: metadata terms are removed from the query before it goes to BM25/vector, so `"Progressive Q3 2024 combined ratio"` becomes `"combined ratio"` for the ranking stage.

Hard filtering means zero wrong-company results: it is not a soft penalty but exclusion before any scoring.
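The inference, stripping, and hard-filter steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the `MetadataFilterEngine` source: `COMPANY_ALIASES`, `Filters`, `infer_filters`, and `hard_filter` are hypothetical names, the alias table is hand-written, and the year pattern accepts an optional `FY` prefix so that `FY2025` is recognised.

```python
import re
from dataclasses import dataclass, field

# Hypothetical alias table; the real engine builds a collision-safe map.
COMPANY_ALIASES = {
    "progressive": "Progressive Corporation",
    "allstate": "Allstate Corporation",
}

# Assumption: an optional "FY" prefix, since a bare \b does not match
# between "Y" and "2" in "FY2025".
YEAR_RE = re.compile(r"\b(?:FY)?(20\d{2})\b", re.IGNORECASE)
QUARTER_RE = re.compile(r"\bQ([1-4])\b", re.IGNORECASE)


@dataclass
class Filters:
    companies: set = field(default_factory=set)
    years: set = field(default_factory=set)
    quarters: set = field(default_factory=set)


def infer_filters(query: str) -> tuple[Filters, str]:
    """Return inferred filters plus the query with metadata terms removed."""
    filters = Filters()
    remaining = query
    for alias, canonical in COMPANY_ALIASES.items():
        pattern = re.compile(rf"\b{re.escape(alias)}\b", re.IGNORECASE)
        if pattern.search(remaining):
            filters.companies.add(canonical)
            remaining = pattern.sub(" ", remaining)
    filters.years = {int(y) for y in YEAR_RE.findall(remaining)}
    filters.quarters = {f"Q{q}" for q in QUARTER_RE.findall(remaining)}
    remaining = QUARTER_RE.sub(" ", YEAR_RE.sub(" ", remaining))
    return filters, " ".join(remaining.split())


def hard_filter(chunks: list[dict], filters: Filters) -> list[dict]:
    """Keep only chunks whose metadata satisfies every non-empty filter."""
    def keep(meta: dict) -> bool:
        return (
            (not filters.companies or meta.get("company") in filters.companies)
            and (not filters.years or meta.get("year") in filters.years)
            and (not filters.quarters or meta.get("quarter") in filters.quarters)
        )
    return [c for c in chunks if keep(c["metadata"])]
```

Because `hard_filter` runs before any scoring, a chunk from the wrong company never reaches BM25 or the vector retriever, however similar its text.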
`BM25KeywordRetriever` implements standard Okapi BM25 with one key improvement: separate field statistics for unigrams and phrases.
Implementations that add n-grams often index them in the same field as unigrams, which inflates average document length by ~2× and distorts length normalisation. This implementation maintains:
| Field | Content | Weight |
|---|---|---|
| Unigram | Individual content tokens | 1.0× |
| Phrase | Bigrams + trigrams | 1.5× (higher precision) |
Each field has its own `avg_dl` and document-frequency table. The final score is:

    score(chunk) = BM25_unigram(chunk, query) + 1.5 × BM25_phrase(chunk, query)

Parameters: `k1=1.5`, `b=0.75` (Robertson & Zaragoza defaults).
Insurance terms like `combined ratio`, `loss ratio`, and `net written premiums` score highly as phrases even when the individual words are common.
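A compact sketch of dual-field scoring, assuming a whitespace tokenizer with no stop-word removal and the Lucene-style IDF `log(1 + (N - df + 0.5) / (df + 0.5))`. `FieldIndex`, `fields`, and `rank` are hypothetical names, and the IDF variant is an assumption rather than something the description above specifies.

```python
import math
from collections import Counter

K1, B = 1.5, 0.75        # parameters quoted above
PHRASE_WEIGHT = 1.5      # phrase-field weight quoted above


def ngrams(tokens: list[str], n: int) -> list[str]:
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def fields(text: str) -> dict[str, list[str]]:
    """Split text into a unigram field and a bigram+trigram phrase field."""
    tokens = text.lower().split()
    return {"unigram": tokens, "phrase": ngrams(tokens, 2) + ngrams(tokens, 3)}


class FieldIndex:
    """BM25 statistics (document frequency, avg_dl) for one field only."""

    def __init__(self, docs: list[list[str]]):
        self.docs = [Counter(d) for d in docs]
        self.lengths = [len(d) for d in docs]
        self.avg_dl = sum(self.lengths) / len(docs) if docs else 0.0
        self.df = Counter(term for d in self.docs for term in d)

    def score(self, doc_id: int, query_terms: list[str]) -> float:
        tf, dl, n = self.docs[doc_id], self.lengths[doc_id], len(self.docs)
        total = 0.0
        for term in set(query_terms):
            if term not in tf:
                continue
            idf = math.log(1 + (n - self.df[term] + 0.5) / (self.df[term] + 0.5))
            norm = tf[term] + K1 * (1 - B + B * dl / self.avg_dl)
            total += idf * tf[term] * (K1 + 1) / norm
        return total


def rank(texts: list[str], query: str) -> list[tuple[int, float]]:
    """Score every text as unigram BM25 + 1.5 x phrase BM25, best first."""
    split = [fields(t) for t in texts]
    uni = FieldIndex([f["unigram"] for f in split])
    phr = FieldIndex([f["phrase"] for f in split])
    q = fields(query)
    scores = [
        (i, uni.score(i, q["unigram"]) + PHRASE_WEIGHT * phr.score(i, q["phrase"]))
        for i in range(len(texts))
    ]
    return sorted(scores, key=lambda s: s[1], reverse=True)
```

With this split, a chunk containing the exact phrase `combined ratio` outranks one that merely contains both words apart, because only the former earns a phrase-field score.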
`VectorRetriever` runs semantic similarity search over embedded chunks. It ships with `HashingSemanticEmbedder`, a local, deterministic, zero-dependency embedder built from:
- Token-hashing signal: BLAKE2b hashes of tokens and n-grams mapped to a 375-dim bucket space
- Financial concept signal: 9 domain concept dimensions (profitability, growth, decline, revenue, cash flow, debt, insurance ratios, claims, capital) filled with `sqrt(match_count)` weighting

The result is an L2-normalised 384-dim vector (375 hashing buckets + 9 concept dimensions). Because the vectors are normalised, their dot product equals cosine similarity.
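The two signals can be sketched with the standard library alone. This is an illustration, not the `HashingSemanticEmbedder` source: `embed_sketch` and `bucket` are hypothetical names, the concept vocabularies are invented placeholders, features are limited to unigrams and bigrams, and unsigned bucket counts are an assumption.

```python
import hashlib
import math

HASH_DIMS = 375

# Hypothetical vocabularies for the nine concept dimensions named above.
CONCEPT_VOCAB = {
    "profitability": {"profit", "margin"},
    "growth": {"growth", "grew"},
    "decline": {"decline", "fell"},
    "revenue": {"revenue", "sales"},
    "cash flow": {"cash"},
    "debt": {"debt", "leverage"},
    "insurance ratios": {"ratio"},
    "claims": {"claims", "losses"},
    "capital": {"capital", "buyback"},
}


def bucket(feature: str) -> int:
    """Map a token or n-gram to one of the 375 hashing buckets via BLAKE2b."""
    digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % HASH_DIMS


def embed_sketch(text: str) -> list[float]:
    tokens = text.lower().split()
    features = tokens + [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
    vec = [0.0] * (HASH_DIMS + len(CONCEPT_VOCAB))
    for feat in features:
        vec[bucket(feat)] += 1.0
    for i, vocab in enumerate(CONCEPT_VOCAB.values()):
        matches = sum(1 for t in tokens if t in vocab)
        vec[HASH_DIMS + i] = math.sqrt(matches)   # sqrt(match_count) weighting
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity for L2-normalised inputs."""
    return sum(x * y for x, y in zip(a, b))
```

Because the hash is keyed only by the feature text, the same input always produces the same vector, which is what keeps the pipeline reproducible without a model download.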
Production swap: replace `HashingSemanticEmbedder` with any model that implements `embed(text: str) -> list[float]`. Examples: `bge-small-en`, `text-embedding-3-small`, FinBERT, E5-small. Pass it as `embedder=` to `HybridRetrievalPipeline`.
`ReciprocalRankFusion` merges the keyword and vector ranked lists using rank-only math:

    RRF_score(chunk) = Σ 1 / (k + rank_in_list)

The sum runs over every list that contains the chunk, with `k=60` (the Cormack et al., 2009 default). Ties are broken by: number of contributing sources → vector score → keyword score → max of both.
Chunks appearing in both ranked lists receive a contribution from each list, so they outrank chunks found by only one retriever at a similar rank. Matching both exact terms and semantic meaning is the strongest retrieval signal.
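The fusion formula fits in a few lines. In this sketch, `reciprocal_rank_fusion` is a hypothetical name, ranks are assumed to be 1-based, and only the first tie-break (number of contributing sources) is implemented; the score-based tie-breaks are left out.

```python
from collections import defaultdict

RRF_K = 60  # k value quoted above


def reciprocal_rank_fusion(ranked_lists: dict[str, list[str]], k: int = RRF_K):
    """Fuse ranked lists of chunk ids into (id, score, sources) tuples."""
    scores = defaultdict(float)
    sources = defaultdict(list)
    for source, ids in ranked_lists.items():
        for rank, chunk_id in enumerate(ids, start=1):   # 1-based ranks (assumed)
            scores[chunk_id] += 1.0 / (k + rank)
            sources[chunk_id].append(source)
    # Sort by fused score, then by how many lists contributed.
    ordered = sorted(scores, key=lambda c: (scores[c], len(sources[c])), reverse=True)
    return [(c, scores[c], sources[c]) for c in ordered]


fused = reciprocal_rank_fusion({
    "keyword": ["a", "b", "c"],
    "vector": ["b", "d", "a"],
})
```

Here `b` (ranks 2 and 1) scores 1/62 + 1/61, ahead of `a` (ranks 1 and 3) at 1/61 + 1/63, and both outrank `d` and `c`, which appear in only one list.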
The pipeline has two explicit exit paths to surface clear error messages:
| Condition | Message |
|---|---|
| No chunks survive metadata filtering | "No chunks matched the metadata filters." |
| Ranking query is empty after stripping metadata terms | "The query only contained metadata terms. Add a financial topic such as revenue, margin, combined ratio..." |
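A sketch of how the two exit paths might short-circuit a search. The `Response` shape, the `message` field, and the order of the two checks are assumptions; the message strings are copied from the table, including the truncation.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

NO_MATCH = "No chunks matched the metadata filters."
# Truncated exactly as quoted in the table above.
METADATA_ONLY = (
    "The query only contained metadata terms. "
    "Add a financial topic such as revenue, margin, combined ratio..."
)


@dataclass
class Response:
    results: list = field(default_factory=list)
    message: Optional[str] = None   # hypothetical field name


def search_sketch(candidates: list, ranking_query: str, rank_fn: Callable) -> Response:
    """Return early with a clear message instead of ranking nothing."""
    if not candidates:
        return Response(message=NO_MATCH)
    if not ranking_query.strip():
        return Response(message=METADATA_ONLY)
    return Response(results=rank_fn(candidates, ranking_query))
```

Returning a message rather than an empty list lets a caller tell "nothing matched the filters" apart from "the query had no topic to rank on".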
Why hard filters, not soft penalties?
Soft penalties (e.g. multiply by 0.5 for wrong company) still return wrong-company results at lower ranks. Hard filters guarantee that results from a Progressive query never include an Allstate chunk, regardless of how similar the text is.
Why separate BM25 fields?
A chunk with the text "combined ratio improved" has 3 unigrams and 2 bigrams. Stored in one field, it appears to have 5 tokens, inflating `dl / avg_dl`. Separate fields give each field its own calibrated length normalisation.
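The arithmetic can be checked directly. `field_lengths` is a hypothetical helper that counts n-grams for a whitespace-tokenized text; it also counts trigrams, since the phrase field described earlier holds bigrams and trigrams.

```python
def field_lengths(text: str) -> dict[str, int]:
    """Count the unigrams, bigrams, and trigrams a text produces."""
    n = len(text.lower().split())
    return {"unigram": n, "bigram": max(n - 1, 0), "trigram": max(n - 2, 0)}


lengths = field_lengths("combined ratio improved")

# Unigrams and bigrams mixed in one field: the 5 tokens quoted above.
mixed_len = lengths["unigram"] + lengths["bigram"]

# Length seen by a separate phrase field (bigrams + trigrams).
phrase_len = lengths["bigram"] + lengths["trigram"]

# For a 100-token chunk, mixing unigrams and bigrams nearly doubles the length.
long_chunk = field_lengths(" ".join(["tok"] * 100))
inflation = (long_chunk["unigram"] + long_chunk["bigram"]) / long_chunk["unigram"]
```

For the three-word chunk, `mixed_len` is 5 and `phrase_len` is 3; for the 100-token chunk, `inflation` is 1.99, which is the ~2× figure mentioned earlier.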
Why RRF instead of a learned reranker?
RRF is deterministic, requires no training data, and adds negligible latency because it needs only the ranks it is given. Cormack et al. (2009) reported that it outperformed Condorcet fusion and individual learning-to-rank methods in their experiments. For a financial retrieval pipeline with a stable schema, that tradeoff is reasonable.
Why stopword expansion?
The tokenizer removes ~35 stop words including natural language filler ("all", "any", "which", "every"). Without this, a query like "all combined ratio Q3 2024" would score Allstate chunks highly because "all" appears frequently in Allstate filings ("across all segments", "approved in all 47 states"). Removing it makes the query truly about combined ratio.
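A minimal sketch of the stop-word guard. `tokenize_sketch` is a hypothetical stand-in for the shared tokenizer, and `STOP_WORDS` is a small illustrative subset rather than the real list of ~35 entries.

```python
import re

# Illustrative subset only; the actual list is described as ~35 words.
STOP_WORDS = {"all", "any", "which", "every", "the", "a", "an", "of", "in"}


def tokenize_sketch(text: str) -> list[str]:
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


query_tokens = tokenize_sketch("all combined ratio Q3 2024")
chunk_tokens = tokenize_sketch("Allstate rates were approved in all 47 states")
```

The query no longer carries `all`, so it cannot match the filler word in the Allstate chunk, while the company token `allstate` itself is kept because stop words are compared against whole tokens.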
Why collision-safe alias resolution?
A naive alias map could silently resolve "american" to American International Group when both AIG and American Financial Group are in the corpus. The two-pass approach checks word-boundary regex across all other canonicals and only registers an alias when it is unambiguous.
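One way to picture the two-pass approach, under stated assumptions: `build_alias_map` is a hypothetical function, and the only candidate aliases it proposes are each company's full name and its first word.

```python
import re


def build_alias_map(canonicals: list[str]) -> dict[str, str]:
    """Register an alias only when exactly one canonical name contains it."""
    # Pass 1: collect candidate aliases (full name and first word, lowercased).
    candidates = set()
    for name in canonicals:
        lowered = name.lower()
        candidates.update({lowered, lowered.split()[0]})

    # Pass 2: check each alias against every canonical with a word-boundary regex.
    alias_map = {}
    for alias in sorted(candidates):
        pattern = re.compile(rf"\b{re.escape(alias)}\b", re.IGNORECASE)
        owners = [name for name in canonicals if pattern.search(name)]
        if len(owners) == 1:            # ambiguous aliases are silently skipped
            alias_map[alias] = owners[0]
    return alias_map


aliases = build_alias_map([
    "American International Group",
    "American Financial Group",
    "Progressive Corporation",
])
```

With these three names, `progressive` resolves to Progressive Corporation, while `american` is never registered because it matches two companies; a query has to use a longer, unambiguous form.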
| File | Class / Function | Responsibility |
|---|---|---|
| `rag_pipeline/models.py` | `DocumentChunk`, `QueryFilters`, `ScoredChunk`, `FusedChunk` | Core data types |
| `rag_pipeline/pipeline.py` | `HybridRetrievalPipeline` | Orchestrates the full search flow |
| `rag_pipeline/metadata.py` | `MetadataFilterEngine` | Filter inference, hard filtering, query stripping |
| `rag_pipeline/keyword.py` | `BM25KeywordRetriever` | Dual-field BM25 |
| `rag_pipeline/vector.py` | `VectorRetriever`, `HashingSemanticEmbedder` | Semantic search |
| `rag_pipeline/rrf.py` | `ReciprocalRankFusion` | Rank merging |
| `rag_pipeline/text.py` | `tokenize`, `iter_ngrams` | Shared tokenizer |
| `rag_pipeline/pdf_ingestor.py` | `ingest_pdf` | PDF → chunks |
| `app.py` | - | Streamlit UI |
| `demo.py` | - | CLI demo |
| `scripts/ingest_sec_filings.py` | - | Live SEC EDGAR ingestion |
No build step or package install is needed for the core pipeline, which has zero external dependencies.

    # Clone and enter
    git clone https://github.com/your-org/rag-pipeline.git
    cd rag-pipeline
    # Run the CLI demo against the curated dataset
    python demo.py "Progressive combined ratio Q3 2024"
    # Run the Streamlit UI (requires: pip install streamlit pdfplumber)
    streamlit run app.py

Example queries:

    # Single company, specific quarter
    python demo.py "Progressive combined ratio Q3 2024"
    python demo.py "Allstate loss ratio improvement 2024"
    python demo.py "UnitedHealth medical loss ratio Medicare Advantage Q3 2024"
    python demo.py "AIG net written premiums commercial lines 2024"
    python demo.py "MetLife RBC ratio capital Q1 2024"
    python demo.py "Travelers investment income 2024"
    python demo.py "Cigna EPS earnings guidance Q3 2024"
    # Multi-company comparison (no company filter, searches all)
    python demo.py "Which insurer had the best combined ratio improvement in Q3 2024?"
    python demo.py "catastrophe losses homeowners 2024"
    # Semantic, no exact terms
    python demo.py "underwriting margins deteriorated claims experience"
    python demo.py "reserve development adverse social inflation"
    # Complex multi-metric
    python demo.py "Progressive underwriting income loss ratio expense ratio Q3 2024"
    python demo.py "Allstate catastrophe losses homeowners rate actions 2024"
    python demo.py "UnitedHealth Optum pharmacy revenue Medicare Advantage utilization Q3 2024"

Queries against the live SEC dataset:

    python demo.py "Progressive net written premiums combined ratio 2025" --data data\sec_live_chunks.jsonl
    python demo.py "UnitedHealth medical loss ratio Optum Q1 2025" --data data\sec_live_chunks.jsonl
    python demo.py "Travelers catastrophe losses investment income Q1 2025" --data data\sec_live_chunks.jsonl

Python API:

    from rag_pipeline import DocumentChunk, HybridRetrievalPipeline

    chunks = [
        DocumentChunk(
            id="pgr-q3-2024-combined-ratio",
            text="Progressive Q3 2024 combined ratio improved to 89.4%, loss ratio 63.1%, expense ratio 26.3%.",
            metadata={"company": "Progressive Corporation", "year": 2024, "quarter": "Q3"},
        ),
        DocumentChunk(
            id="all-q3-2024-combined-ratio",
            text="Allstate Q3 2024 combined ratio was 93.2%, loss ratio 67.1%, with $420M catastrophe losses.",
            metadata={"company": "Allstate Corporation", "year": 2024, "quarter": "Q3"},
        ),
    ]

    pipeline = HybridRetrievalPipeline(chunks)
    response = pipeline.search("Progressive combined ratio Q3 2024")

    print(f"Filters inferred: {response.trace.filters}")
    print(f"Candidates after filter: {response.trace.candidate_count}")
    print(f"Ranking query sent to BM25/vector: '{response.trace.ranking_query}'")
    for result in response.results:
        print(f" Rank {result.rank} score={result.score:.4f} sources={result.contributing_sources}")
        print(f" {result.chunk.text}")

Passing explicit filters instead of inferring them from the query:

    from rag_pipeline import QueryFilters

    filters = QueryFilters(
        companies={"Progressive Corporation"},
        years={2024},
        quarters={"Q3"},
    )
    response = pipeline.search("underwriting income loss trend", filters=filters)

Swapping in a sentence-transformers embedder:

    from sentence_transformers import SentenceTransformer

    class STEmbedder:
        def __init__(self):
            self.model = SentenceTransformer("BAAI/bge-small-en-v1.5")

        def embed(self, text: str) -> list[float]:
            return self.model.encode(text, normalize_embeddings=True).tolist()

    pipeline = HybridRetrievalPipeline(chunks, embedder=STEmbedder())

The curated dataset contains 47 chunks covering 7 US insurers and managed-care companies across 2023–2025:
| Company | Ticker |
|---|---|
| Progressive Corporation | PGR |
| Allstate Corporation | ALL |
| UnitedHealth Group | UNH |
| AIG | AIG |
| MetLife | MET |
| Cigna Group | CI |
| Travelers Companies | TRV |
Metrics covered: combined ratio, loss ratio, expense ratio, net written premiums, earned premiums, medical loss ratio (MLR), RBC ratio, investment income, catastrophe losses, reserve development, EPS, book value, capital returns.
Chunk schema:

    {
      "id": "pgr-2024-q3-underwriting",
      "text": "Progressive Q3 2024 combined ratio improved to 89.4%...",
      "metadata": {
        "company": "Progressive Corporation",
        "year": 2024,
        "quarter": "Q3",
        "section": "underwriting"
      }
    }

The live dataset contains 21 Q1 2025 chunks generated from SEC EDGAR 10-Q filings. To ingest fresh filings (PowerShell syntax):

    python scripts\ingest_sec_filings.py `
      --tickers PGR ALL UNH AIG MET CI TRV `
      --forms 10-Q `
      --limit 1 `
      --output data\sec_live_chunks.jsonl `
      --user-agent "RAGPipelineDemo/0.1 your-email@example.com"

The Streamlit UI (`app.py`) accepts uploaded PDF files (10-Q filings, earnings releases, investor presentations) and indexes them alongside the built-in dataset.
`ingest_pdf()` in `rag_pipeline/pdf_ingestor.py`:
- Extracts text and tables from each page via `pdfplumber`
- Splits into overlapping chunks with a sliding-window word chunker
- Auto-detects company from the filename stem
- Detects year and quarter from the filename first, then from body text
- Labels each chunk with a `section` tag (underwriting, investment, capital, claims, revenue, profitability) based on dominant vocabulary in that window
- Stores `page`, `source`, and `section` in chunk metadata

Detected metadata can be overridden per-file in the UI.
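The sliding-window word chunker mentioned in the list can be sketched as below. `sliding_window_chunks` is a hypothetical name and the default `chunk_size` and `overlap` values are placeholders, not the ingestor's actual settings.

```python
def sliding_window_chunks(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word windows that share `overlap` words with their neighbour."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break   # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
windows = sliding_window_chunks(sample, chunk_size=4, overlap=1)
```

With ten words, a window of 4 and an overlap of 1, this yields three chunks, each starting on the last word of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.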
To launch the UI:

    pip install streamlit pdfplumber
    streamlit run app.py

The UI has three tabs:
| Tab | Purpose |
|---|---|
| Search | Query the pipeline; toggle between built-in and uploaded PDF data; dynamically populated metadata filters |
| Upload PDFs | Drag-and-drop PDF ingestion; metadata overrides; configurable chunk size and overlap |
| Chunk Data | Browse the full indexed dataset; filter by company, year, quarter, section, or free text |
Running the tests:

    # All 42 tests
    python -m unittest discover -s tests -v
    # Individual test classes
    python -m unittest tests.test_pipeline.BM25Test -v
    python -m unittest tests.test_pipeline.MetadataFilterTest -v
    python -m unittest tests.test_pipeline.EmbedderTest -v
    python -m unittest tests.test_pipeline.RRFTest -v
    python -m unittest tests.test_pipeline.PipelineIntegrationTest -v

Key integration tests:
- Exact-match filter inference from query text (year, quarter, company)
- Multi-company and cross-year queries return correct scoping
- Collision-safe alias: short names do not match the wrong company
- Empty-query short-circuit: metadata-only query returns a clear error message
- Stopword guard: `"all combined ratio Q3 2024"` returns results from multiple companies, not just Allstate
- RRF tie-breaking: chunks appearing in both keyword and vector results rank above single-source results

Production upgrade path:
| Item | Current state | Recommended upgrade |
|---|---|---|
| Embedder | Local hashing (demo-grade) | bge-small-en, text-embedding-3-small, or FinBERT |
| Corpus storage | JSONL file scan | Qdrant / Weaviate / Pinecone for millions of chunks |
| Metadata filtering | In-memory list scan | Push `where` filters into the vector DB |
| Keyword search | In-process BM25 | Elasticsearch / OpenSearch for large corpora |
| LLM answer generation | Not included | Feed top-k chunks as context to GPT-4o, Llama 3, Mistral, or any open-weight model |
| Auth / rate limiting | Not included | Add before public deployment |
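For the answer-generation row, the hand-off to an LLM can be as simple as formatting the top-k chunks into a prompt. This sketch makes no API call; `build_prompt`, the prompt wording, and the dict-shaped chunks are assumptions for illustration.

```python
def build_prompt(question: str, chunks: list[dict], max_chunks: int = 5) -> str:
    """Format retrieved chunks as numbered context blocks ahead of the question."""
    blocks = []
    for i, chunk in enumerate(chunks[:max_chunks], start=1):
        meta = chunk["metadata"]
        header = f"[{i}] {meta.get('company')} {meta.get('quarter')} {meta.get('year')}"
        blocks.append(f"{header}\n{chunk['text']}")
    context = "\n\n".join(blocks)
    return (
        "Answer using only the context below. Cite sources by their bracketed number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


prompt = build_prompt(
    "Why did Progressive combined ratio improve in Q3 2024?",
    [{
        "text": "Progressive Q3 2024 combined ratio improved to 89.4%, loss ratio 63.1%, expense ratio 26.3%.",
        "metadata": {"company": "Progressive Corporation", "year": 2024, "quarter": "Q3"},
    }],
)
```

Numbering the blocks gives the model something to cite, which makes it easier to trace an answer back to the retrieved chunk.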
License: MIT