Hybrid RAG Pipeline — US Insurance Financial Data

A deterministic hybrid retrieval pipeline for financial document search, specialised for US insurance and managed-care financials. No LLM is used internally — retrieval is purely algorithmic, reproducible, and requires no API keys.

What It Does

Given a natural-language query like "Why did Progressive combined ratio improve in Q3 2024?", the pipeline:

1. Infers hard metadata filters: extracts company, year, and quarter from the query text
2. Hard-filters the corpus down to only matching chunks (e.g., Progressive + Q3 + 2024)
3. Runs BM25 keyword search over the filtered candidates to find exact term matches
4. Runs semantic vector search over the same candidates to find conceptually related chunks
5. Fuses both ranked lists using Reciprocal Rank Fusion, with no LLM reranker needed
6. Returns ranked chunks ready to be fed to any LLM as grounded context

Architecture

                          ┌─────────────────────────┐
   Natural-language  ───► │  MetadataFilterEngine   │ ─── infers year / company / quarter
   query                  └──────────┬──────────────┘
                                     │ filtered candidates
                       ┌─────────────┴─────────────┐
                       │                           │
              ┌────────▼────────┐       ┌──────────▼────────┐
              │ BM25Keyword     │       │  VectorRetriever   │
              │ Retriever       │       │  (semantic search) │
              │ (exact terms)   │       │                    │
              └────────┬────────┘       └──────────┬────────┘
                       │  ranked list              │  ranked list
                       └─────────────┬─────────────┘
                              ┌──────▼──────┐
                              │  Reciprocal │
                              │ Rank Fusion │
                              └──────┬──────┘
                                     │
                              top-k FusedChunks
                              (fed to LLM or returned directly)

Stage 1 — MetadataFilterEngine (`rag_pipeline/metadata.py`)

Extracts structured filters before any ranking:

Year extraction: regex \b(20\d{2})\b matches standalone years such as 2024. A fused form such as FY2025 has no word boundary between Y and 2, so matching it requires dropping the leading \b or handling the FY prefix separately.
Quarter extraction: regex \bQ([1-4])\b, applied case-insensitively, matches Q3, q1
Company extraction: collision-safe alias matching using word-boundary regex across all canonical company names. Short aliases (e.g. progressive) are only registered when they uniquely match one company and don't appear as a substring of another.
Ranking query stripping: metadata terms are removed from the query before it goes to BM25/vector, so "Progressive Q3 2024 combined ratio" becomes "combined ratio" for the ranking stage.

Hard filtering means zero wrong-company results — it is not a soft penalty, it is exclusion before any scoring.
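The inference and stripping steps above can be sketched as follows. This is a minimal illustration, not the code in `rag_pipeline/metadata.py`: the `ALIASES` table and the `infer_filters` function are hypothetical names, and the quarter pattern is compiled with `re.IGNORECASE` on the assumption that lowercase `q1` should match.

```python
import re

# Patterns quoted from the description above; the quarter pattern is
# compiled case-insensitively so that "q1" matches as well as "Q3".
YEAR_RE = re.compile(r"\b(20\d{2})\b")
QUARTER_RE = re.compile(r"\bQ([1-4])\b", re.IGNORECASE)

# Hypothetical alias table: lowercase alias -> canonical company name.
ALIASES = {
    "progressive": "Progressive Corporation",
    "allstate": "Allstate Corporation",
}


def infer_filters(query: str) -> tuple[dict, str]:
    """Return (filters, ranking_query) for a natural-language query."""
    years = {int(y) for y in YEAR_RE.findall(query)}
    quarters = {f"Q{q}" for q in QUARTER_RE.findall(query)}
    companies = set()

    # Strip year and quarter terms so they do not influence ranking.
    ranking_query = YEAR_RE.sub(" ", query)
    ranking_query = QUARTER_RE.sub(" ", ranking_query)

    # Match each alias on word boundaries, then strip it as well.
    for alias, canonical in ALIASES.items():
        alias_re = re.compile(rf"\b{re.escape(alias)}\b", re.IGNORECASE)
        if alias_re.search(ranking_query):
            companies.add(canonical)
            ranking_query = alias_re.sub(" ", ranking_query)

    ranking_query = " ".join(ranking_query.split())  # collapse whitespace
    filters = {"companies": companies, "years": years, "quarters": quarters}
    return filters, ranking_query


filters, ranking_query = infer_filters("Progressive Q3 2024 combined ratio")
print(filters)
print(ranking_query)
```

The returned `ranking_query` is what would be passed on to the keyword and vector stages; the `filters` dict drives the hard filter.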

Stage 2 — BM25 Keyword Retriever (`rag_pipeline/keyword.py`)

Standard BM25 (Okapi BM25) with a key improvement: separate field statistics for unigrams and phrases.

Most BM25 implementations mix unigrams and bigrams in a single field, which inflates average document length by ~2× and breaks length normalisation. This implementation maintains:

Field	Content	Weight
Unigram	Individual content tokens	1.0×
Phrase	Bigrams + trigrams	1.5× (higher precision)

Each field has its own avg_dl and document frequency table. The final score is:

score(chunk) = BM25_unigram(chunk, query) + 1.5 × BM25_phrase(chunk, query)

Parameters: k1=1.5, b=0.75. Both sit within the ranges Robertson & Zaragoza report as working well in many settings (roughly 1.2 < k1 < 2 and 0.5 < b < 0.8).

Insurance terms like combined ratio, loss ratio, net written premiums score highly as phrases even when the individual words are common.
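As a rough sketch of the dual-field idea, the snippet below keeps one set of BM25 statistics per field and combines them with the 1.5× phrase weight. It assumes whitespace tokenisation and one common smoothed IDF variant; `FieldIndex`, `ngrams`, and `score` are illustrative names, not the `BM25KeywordRetriever` API.

```python
import math
from collections import Counter

K1, B = 1.5, 0.75        # parameters quoted above
PHRASE_WEIGHT = 1.5      # phrase-field weight quoted above


def ngrams(tokens, sizes):
    """All n-grams of the given sizes, each joined with spaces."""
    return [" ".join(tokens[i:i + n]) for n in sizes for i in range(len(tokens) - n + 1)]


class FieldIndex:
    """BM25 statistics for one field: its own avg_dl and document frequencies."""

    def __init__(self, docs):
        self.docs = [Counter(d) for d in docs]
        self.lengths = [len(d) for d in docs]
        self.avg_dl = sum(self.lengths) / len(docs) if docs else 0.0
        self.df = Counter(term for d in self.docs for term in d)

    def score(self, doc_id, query_terms):
        tf, dl, n = self.docs[doc_id], self.lengths[doc_id], len(self.docs)
        total = 0.0
        for term in set(query_terms):
            if term not in tf:
                continue
            # Smoothed IDF (always positive); an assumption, other variants exist.
            idf = math.log(1 + (n - self.df[term] + 0.5) / (self.df[term] + 0.5))
            norm = tf[term] + K1 * (1 - B + B * dl / self.avg_dl)
            total += idf * tf[term] * (K1 + 1) / norm
        return total


texts = [
    "combined ratio improved on lower catastrophe losses",
    "ratio of debt to capital improved as combined entity refinanced",
]
tokenized = [t.split() for t in texts]
unigram = FieldIndex(tokenized)
phrase = FieldIndex([ngrams(t, (2, 3)) for t in tokenized])


def score(doc_id, query):
    q = query.split()
    return unigram.score(doc_id, q) + PHRASE_WEIGHT * phrase.score(doc_id, ngrams(q, (2, 3)))


scores = [score(0, "combined ratio"), score(1, "combined ratio")]
print(scores)
```

Both example texts contain the words "combined" and "ratio", but only the first contains the phrase "combined ratio", so the phrase field separates them.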

Stage 3 — Vector Retriever (`rag_pipeline/vector.py`)

Semantic similarity search over embedded chunks. Ships with HashingSemanticEmbedder — a local, deterministic, zero-dependency embedder built from:

Token-hashing signal — BLAKE2b hashes of tokens and n-grams mapped to a 375-dim bucket space
Financial concept signal — 9 domain concept dimensions (profitability, growth, decline, revenue, cash flow, debt, insurance ratios, claims, capital) filled with sqrt(match_count) weighting

The result is an L2-normalised 384-dim vector. Dot product of two normalised vectors equals cosine similarity.
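A minimal sketch of an embedder built along these lines is shown below. It is an assumption-laden illustration rather than the `HashingSemanticEmbedder` source: the concept vocabularies are hypothetical (only three of the nine are filled in), only unigrams and bigrams are hashed, and the 8-byte digest size is an arbitrary choice.

```python
import hashlib
import math

HASH_DIMS = 375  # token-hashing buckets, as described above

# Hypothetical concept vocabularies; six of the nine are left empty here.
CONCEPTS = [
    {"profit", "profitability", "margin"},     # profitability
    {"growth", "grew", "increase"},            # growth
    {"decline", "decrease", "deteriorated"},   # decline
    set(), set(), set(), set(), set(), set(),  # remaining concepts
]


def bucket(feature: str) -> int:
    """Map a token or n-gram to a bucket with a BLAKE2b hash."""
    digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % HASH_DIMS


def embed(text: str) -> list[float]:
    tokens = text.lower().split()
    vec = [0.0] * (HASH_DIMS + len(CONCEPTS))

    # Token-hashing signal: unigrams plus bigrams.
    features = tokens + [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
    for feature in features:
        vec[bucket(feature)] += 1.0

    # Concept signal: sqrt(match_count) per concept dimension.
    for i, vocab in enumerate(CONCEPTS):
        matches = sum(1 for t in tokens if t in vocab)
        vec[HASH_DIMS + i] = math.sqrt(matches)

    # L2-normalise so that a dot product equals cosine similarity.
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


vector = embed("combined ratio improved margin")
print(len(vector))
```

Because the hash is keyed only on the text, the same input always yields the same vector, which is what makes the retrieval reproducible.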

Production swap: replace HashingSemanticEmbedder with any model that implements embed(text: str) -> list[float]. Examples: bge-small-en, text-embedding-3-small, FinBERT, E5-small. Pass it as embedder= to HybridRetrievalPipeline.

Stage 4 — Reciprocal Rank Fusion (`rag_pipeline/rrf.py`)

Merges the keyword and vector ranked lists using rank-only math:

RRF_score(chunk) = Σ  1 / (k + rank_in_list)

With k=60 (Cormack et al., 2009 default). Ties are broken by: number of contributing sources → vector score → keyword score → max of both.

Chunks appearing in both ranked lists get a double boost — they matched both exact terms and semantic meaning, which is the strongest retrieval signal.
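The fusion formula can be sketched in a few lines. This is an illustrative version, not the `ReciprocalRankFusion` class: it takes lists of chunk ids, and its tie-break (source count, then id) is a simplification of the tie-break chain described above.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of chunk ids; returns (chunk_id, score) pairs, best first."""
    scores = defaultdict(float)
    sources = defaultdict(int)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[chunk_id] += 1.0 / (k + rank)
            sources[chunk_id] += 1
    # Simplified tie-break: more contributing sources first, then id for stability.
    return sorted(scores.items(), key=lambda item: (-item[1], -sources[item[0]], item[0]))


keyword = ["a", "b", "c"]   # BM25 order
vector = ["b", "d", "a"]    # vector-search order
fused = reciprocal_rank_fusion([keyword, vector])
print([chunk_id for chunk_id, _ in fused])  # ['b', 'a', 'd', 'c']
```

Here "b" (ranks 2 and 1) edges out "a" (ranks 1 and 3), and both outrank the single-source chunks "d" and "c".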

Pipeline Short-Circuits

The pipeline has two explicit exit paths to surface clear error messages:

Condition	Message
No chunks survive metadata filtering	`"No chunks matched the metadata filters."`
Ranking query is empty after stripping metadata terms	`"The query only contained metadata terms. Add a financial topic such as revenue, margin, combined ratio..."`
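The two exit paths might look like the sketch below. `SearchResponse`, `search`, and `rank_by_overlap` are hypothetical names introduced for illustration, and the second message is abbreviated rather than quoted in full.

```python
from dataclasses import dataclass, field

NO_MATCH_MESSAGE = "No chunks matched the metadata filters."
# Abbreviated here; the full wording is shown in the table above.
METADATA_ONLY_MESSAGE = "The query only contained metadata terms."


@dataclass
class SearchResponse:
    """Hypothetical response type: ranked results plus an optional message."""
    results: list = field(default_factory=list)
    message: str = ""


def search(candidates, ranking_query, rank):
    """Check both exit conditions before any scoring takes place."""
    if not candidates:
        return SearchResponse(message=NO_MATCH_MESSAGE)
    if not ranking_query.strip():
        return SearchResponse(message=METADATA_ONLY_MESSAGE)
    return SearchResponse(results=rank(candidates, ranking_query))


def rank_by_overlap(candidates, query):
    """Toy ranker for the example: order texts by shared query words."""
    terms = set(query.lower().split())
    return sorted(candidates, key=lambda text: -len(terms & set(text.lower().split())))


empty_corpus = search([], "combined ratio", rank_by_overlap)
metadata_only = search(["Progressive Q3 2024 combined ratio improved"], "  ", rank_by_overlap)
normal = search(
    ["net investment income rose", "combined ratio improved"],
    "combined ratio",
    rank_by_overlap,
)
print(empty_corpus.message)
print(metadata_only.message)
print(normal.results)
```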

Key Design Decisions

Why hard filters, not soft penalties? Soft penalties (e.g. multiply by 0.5 for wrong company) still return wrong-company results at lower ranks. Hard filters guarantee that results from a Progressive query never include an Allstate chunk, regardless of how similar the text is.
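A minimal sketch of exclusion before scoring, assuming chunks are plain dicts with a `metadata` mapping. The function name `hard_filter` and its parameters are illustrative, not the `MetadataFilterEngine` API.

```python
def hard_filter(chunks, companies=None, years=None, quarters=None):
    """Keep only chunks whose metadata satisfies every supplied filter.

    A filter left as None (or empty) is treated as "no constraint".
    """
    def keep(meta):
        return (
            (not companies or meta.get("company") in companies)
            and (not years or meta.get("year") in years)
            and (not quarters or meta.get("quarter") in quarters)
        )

    return [chunk for chunk in chunks if keep(chunk["metadata"])]


corpus = [
    {"id": "pgr-q3-2024-combined-ratio",
     "metadata": {"company": "Progressive Corporation", "year": 2024, "quarter": "Q3"}},
    {"id": "all-q3-2024-combined-ratio",
     "metadata": {"company": "Allstate Corporation", "year": 2024, "quarter": "Q3"}},
]

survivors = hard_filter(corpus, companies={"Progressive Corporation"}, years={2024}, quarters={"Q3"})
print([chunk["id"] for chunk in survivors])  # ['pgr-q3-2024-combined-ratio']
```

The Allstate chunk never reaches the rankers, so no scoring function can promote it, however similar its text is.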

Why separate BM25 fields? A chunk with text "combined ratio improved" has 3 unigrams, 2 bigrams, and 1 trigram. If stored in one field it appears to have 6 tokens, inflating dl / avg_dl. Separate fields give each dimension its own calibrated length normalisation.

Why RRF instead of a learned reranker? RRF is deterministic, requires no training data, and adds negligible latency because it only sums reciprocal ranks. Cormack et al. (2009) reported that it outperformed Condorcet fusion and individual learning-to-rank methods in their experiments. For a financial retrieval pipeline with a stable schema, that is a reasonable tradeoff.

Why stopword expansion? The tokenizer removes ~35 stop words including natural language filler ("all", "any", "which", "every"). Without this, a query like "all combined ratio Q3 2024" would score Allstate chunks highly because "all" appears frequently in Allstate filings ("across all segments", "approved in all 47 states"). Removing it makes the query truly about combined ratio.
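The effect can be illustrated with a small tokenizer sketch. The stop-word set below is a hypothetical subset (the four filler words named above plus a few common function words), and the regex tokenisation is an assumption rather than the behaviour of `rag_pipeline/text.py`.

```python
import re

# Hypothetical subset of the stop-word list described above.
STOP_WORDS = {"all", "any", "which", "every", "the", "of", "in", "a"}


def tokenize(text: str) -> list[str]:
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


print(tokenize("all combined ratio Q3 2024"))
# ['combined', 'ratio', 'q3', '2024']
print(tokenize("Allstate approved in all 47 states"))
# ['allstate', 'approved', '47', 'states']
```

Removal is per token, so the company name "allstate" survives while the filler word "all" does not.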

Why collision-safe alias resolution? A naive alias map could silently resolve "american" to American International Group when both AIG and American Financial Group are in the corpus. The two-pass approach checks word-boundary regex across all other canonicals and only registers an alias when it is unambiguous.
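One way to sketch the two-pass check is shown below. The rule for proposing aliases (lowercase first word of each canonical name) is an assumption made for the example, and `build_alias_map` is a hypothetical function name.

```python
import re


def build_alias_map(canonicals):
    """Two-pass alias registration: propose aliases, then keep unambiguous ones."""
    # Pass 1: propose the lowercase first word of each canonical name (assumed rule).
    proposals = {name: name.lower().split()[0] for name in canonicals}

    # Pass 2: keep an alias only if no other canonical contains it as a whole word.
    aliases = {}
    for name, alias in proposals.items():
        pattern = re.compile(rf"\b{re.escape(alias)}\b", re.IGNORECASE)
        if not any(pattern.search(other) for other in canonicals if other != name):
            aliases[alias] = name
    return aliases


names = [
    "American International Group",
    "American Financial Group",
    "Progressive Corporation",
    "Allstate Corporation",
]
aliases = build_alias_map(names)
print(aliases)
# {'progressive': 'Progressive Corporation', 'allstate': 'Allstate Corporation'}
```

With both "American" companies in the list, "american" is rejected for each of them, so a query containing only that word resolves to no company instead of the wrong one.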

Modules

File	Class / Function	Responsibility
`rag_pipeline/models.py`	`DocumentChunk`, `QueryFilters`, `ScoredChunk`, `FusedChunk`	Core data types
`rag_pipeline/pipeline.py`	`HybridRetrievalPipeline`	Orchestrates the full search flow
`rag_pipeline/metadata.py`	`MetadataFilterEngine`	Filter inference, hard filtering, query stripping
`rag_pipeline/keyword.py`	`BM25KeywordRetriever`	Dual-field BM25
`rag_pipeline/vector.py`	`VectorRetriever`, `HashingSemanticEmbedder`	Semantic search
`rag_pipeline/rrf.py`	`ReciprocalRankFusion`	Rank merging
`rag_pipeline/text.py`	`tokenize`, `iter_ngrams`	Shared tokenizer
`rag_pipeline/pdf_ingestor.py`	`ingest_pdf`	PDF → chunks
`app.py`	—	Streamlit UI
`demo.py`	—	CLI demo
`scripts/ingest_sec_filings.py`	—	Live SEC EDGAR ingestion

Quick Start

No build step or package install needed for the core pipeline — zero external dependencies.

# Clone and enter
git clone https://github.com/your-org/rag-pipeline.git
cd rag-pipeline

# Run the CLI demo against the curated dataset
python demo.py "Progressive combined ratio Q3 2024"

# Run the Streamlit UI (requires: pip install streamlit pdfplumber)
streamlit run app.py

Demo Queries

# Single company, specific quarter
python demo.py "Progressive combined ratio Q3 2024"
python demo.py "Allstate loss ratio improvement 2024"
python demo.py "UnitedHealth medical loss ratio Medicare Advantage Q3 2024"
python demo.py "AIG net written premiums commercial lines 2024"
python demo.py "MetLife RBC ratio capital Q1 2024"
python demo.py "Travelers investment income 2024"
python demo.py "Cigna EPS earnings guidance Q3 2024"

# Multi-company comparison (no company filter — searches all)
python demo.py "Which insurer had the best combined ratio improvement in Q3 2024?"
python demo.py "catastrophe losses homeowners 2024"

# Semantic — no exact terms
python demo.py "underwriting margins deteriorated claims experience"
python demo.py "reserve development adverse social inflation"

# Complex multi-metric
python demo.py "Progressive underwriting income loss ratio expense ratio Q3 2024"
python demo.py "Allstate catastrophe losses homeowners rate actions 2024"
python demo.py "UnitedHealth Optum pharmacy revenue Medicare Advantage utilization Q3 2024"

Live SEC Filing Queries

python demo.py "Progressive net written premiums combined ratio 2025" --data data/sec_live_chunks.jsonl
python demo.py "UnitedHealth medical loss ratio Optum Q1 2025" --data data/sec_live_chunks.jsonl
python demo.py "Travelers catastrophe losses investment income Q1 2025" --data data/sec_live_chunks.jsonl

Use in Code

from rag_pipeline import DocumentChunk, HybridRetrievalPipeline

chunks = [
    DocumentChunk(
        id="pgr-q3-2024-combined-ratio",
        text="Progressive Q3 2024 combined ratio improved to 89.4%, loss ratio 63.1%, expense ratio 26.3%.",
        metadata={"company": "Progressive Corporation", "year": 2024, "quarter": "Q3"},
    ),
    DocumentChunk(
        id="all-q3-2024-combined-ratio",
        text="Allstate Q3 2024 combined ratio was 93.2%, loss ratio 67.1%, with $420M catastrophe losses.",
        metadata={"company": "Allstate Corporation", "year": 2024, "quarter": "Q3"},
    ),
]

pipeline = HybridRetrievalPipeline(chunks)
response = pipeline.search("Progressive combined ratio Q3 2024")

print(f"Filters inferred: {response.trace.filters}")
print(f"Candidates after filter: {response.trace.candidate_count}")
print(f"Ranking query sent to BM25/vector: '{response.trace.ranking_query}'")

for result in response.results:
    print(f"  Rank {result.rank}  score={result.score:.4f}  sources={result.contributing_sources}")
    print(f"    {result.chunk.text}")

Supplying Explicit Filters

from rag_pipeline import QueryFilters

filters = QueryFilters(
    companies={"Progressive Corporation"},
    years={2024},
    quarters={"Q3"},
)
response = pipeline.search("underwriting income loss trend", filters=filters)

Swapping the Embedder

from sentence_transformers import SentenceTransformer

class STEmbedder:
    def __init__(self):
        self.model = SentenceTransformer("BAAI/bge-small-en-v1.5")

    def embed(self, text: str) -> list[float]:
        return self.model.encode(text, normalize_embeddings=True).tolist()

pipeline = HybridRetrievalPipeline(chunks, embedder=STEmbedder())

Data

Curated Dataset (`data/financial_chunks.jsonl`)

47 chunks covering 7 US insurers and managed-care companies across 2023–2025:

Company	Ticker
Progressive Corporation	PGR
Allstate Corporation	ALL
UnitedHealth Group	UNH
AIG	AIG
MetLife	MET
Cigna Group	CI
Travelers Companies	TRV

Metrics covered: combined ratio, loss ratio, expense ratio, net written premiums, earned premiums, medical loss ratio (MLR), RBC ratio, investment income, catastrophe losses, reserve development, EPS, book value, capital returns.

Chunk schema:

{
  "id": "pgr-2024-q3-underwriting",
  "text": "Progressive Q3 2024 combined ratio improved to 89.4%...",
  "metadata": {
    "company": "Progressive Corporation",
    "year": 2024,
    "quarter": "Q3",
    "section": "underwriting"
  }
}
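A short sketch of reading records in this schema, one JSON object per line. `load_chunks` is a hypothetical helper, not part of `rag_pipeline`; it takes any iterable of lines, so an open file handle works as well as the in-memory list used here.

```python
import json

REQUIRED_KEYS = {"id", "text", "metadata"}


def load_chunks(lines):
    """Parse JSONL lines into chunk dicts, validating the top-level keys."""
    chunks = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        chunks.append(record)
    return chunks


sample = [
    '{"id": "pgr-2024-q3-underwriting", "text": "Progressive Q3 2024 combined ratio improved to 89.4%...", '
    '"metadata": {"company": "Progressive Corporation", "year": 2024, "quarter": "Q3", "section": "underwriting"}}',
    "",
]
loaded = load_chunks(sample)
print(loaded[0]["metadata"]["company"])  # Progressive Corporation
```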

Live SEC Filings (`data/sec_live_chunks.jsonl`)

21 Q1 2025 chunks generated from SEC EDGAR 10-Q filings. Ingest fresh filings:

# PowerShell syntax: a trailing backtick continues the line (in bash, use \ instead)
python scripts/ingest_sec_filings.py `
    --tickers PGR ALL UNH AIG MET CI TRV `
    --forms 10-Q `
    --limit 1 `
    --output data/sec_live_chunks.jsonl `
    --user-agent "RAGPipelineDemo/0.1 your-email@example.com"

PDF Upload and Ingestion

The Streamlit UI (app.py) accepts uploaded PDF files (10-Q filings, earnings releases, investor presentations) and indexes them alongside the built-in dataset.

rag_pipeline/pdf_ingestor.py — ingest_pdf():

Extracts text and tables from each page via pdfplumber
Splits into overlapping chunks with a sliding-window word chunker
Auto-detects company from the filename stem
Detects year and quarter from the filename first, then from body text
Labels each chunk with a section tag (underwriting, investment, capital, claims, revenue, profitability) based on dominant vocabulary in that window
Stores page, source, and section in chunk metadata

Detected metadata can be overridden per-file in the UI.
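The sliding-window word chunker can be sketched as follows. The function name `chunk_words` and the default `size` and `overlap` values are assumptions for illustration; the actual defaults in `ingest_pdf()` are not stated here.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


text = " ".join(f"w{i}" for i in range(10))
windows = chunk_words(text, size=4, overlap=1)
print(windows)
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.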

Streamlit UI

pip install streamlit pdfplumber
streamlit run app.py

Three tabs:

Tab	Purpose
Search	Query the pipeline; toggle between built-in and uploaded PDF data; dynamically populated metadata filters
Upload PDFs	Drag-and-drop PDF ingestion; metadata overrides; configurable chunk size and overlap
Chunk Data	Browse the full indexed dataset; filter by company, year, quarter, section, or free text

Tests

# All 42 tests
python -m unittest discover -s tests -v

# Individual test classes
python -m unittest tests.test_pipeline.BM25Test -v
python -m unittest tests.test_pipeline.MetadataFilterTest -v
python -m unittest tests.test_pipeline.EmbedderTest -v
python -m unittest tests.test_pipeline.RRFTest -v
python -m unittest tests.test_pipeline.PipelineIntegrationTest -v

Key integration tests:

Exact-match filter inference from query text (year, quarter, company)
Multi-company and cross-year queries return correct scoping
Collision-safe alias: short names do not match the wrong company
Empty-query short-circuit: metadata-only query returns a clear error message
Stopword guard: "all combined ratio Q3 2024" returns results from multiple companies, not just Allstate
RRF tie-breaking: chunks appearing in both keyword and vector results rank above single-source results

Production Checklist

Item	Current state	Recommended upgrade
Embedder	Local hashing (demo-grade)	`bge-small-en`, `text-embedding-3-small`, or FinBERT
Corpus storage	JSONL file scan	Qdrant / Weaviate / Pinecone for millions of chunks
Metadata filtering	In-memory list scan	Push `where` filters into the vector DB
Keyword search	In-process BM25	Elasticsearch / OpenSearch for large corpora
LLM answer generation	Not included	Feed top-k chunks as context to GPT-4o, Llama 3, Mistral, or any open-weight model
Auth / rate limiting	Not included	Add before public deployment

License

MIT
