gaurjin/Omna

Omna

Semantic search, PII masking, and schema understanding, directly on your Polars DataFrames. No vector database. No API key, and your data never leaves your machine, unless you call the optional ask() method.


The problem

# Finding every insurance claim denial — painful
import polars as pl

keywords = ["claim denied", "coverage rejected", "policy voided"]  # ...and many more
pattern  = "(?i)" + "|".join(keywords)  # case-insensitive regex alternation
results  = df.filter(pl.col("text").str.contains(pattern))  # null rows are dropped
# Still misses: "insurer refused to honour the policy"
# Still misses: "claim outcome: not payable"
# Still misses: medical claim rejections using clinical terminology
# ...50+ lines per task. Grows with every edge case. Still wrong.

# With Omna
results = df.omna.search("insurance claim denied", on="text", k=5)
# Returns the 5 closest matches by meaning, including docs that never say "denied" literally.
# 9ms. 50,000 documents. Zero cloud.

filtered = df.omna.filter("insurance claim denied", on="text", threshold=0.73)
# Every semantically matching document above the threshold.
# No keyword lists. No guesswork. Pure meaning.

answer = results.omna.ask("What personal data do these documents expose?")
# → "These insurance documents expose SSNs, medical record numbers,
#    dates of birth, health plan numbers, and claimant identifiers."
# Instant. One line.

# Auditing for PII before the data ships — painful
import re

for col in df.columns:
    for i, val in enumerate(df[col].to_list()):
        if re.search(r'\b\d{3}-\d{2}-\d{4}\b', str(val)):   # SSNs only
            print(f"row {i}, {col}: {str(val)[:60]}")
# Catches one pattern. Misses emails, phones, names, IBANs.
# No confidence score. No audit trail. No redaction.

# With Omna
df.omna.pii_report()   # audit — find every leak, every column
df.omna.mask_pii()     # redact — one line, full audit log
# Names, SSNs, emails, phone numbers — all gone. Local. No cloud.

Demo

The Sword — semantic search, filter, and ask across 50,000 documents:

Omna Sword Demo

The Shield — PII audit and redaction in one line:

Omna Shield Demo

Dataset: the PII benchmark from Gretel (a company since acquired by NVIDIA), 50,000 synthetic documents built to test data privacy tools.


Install

pip install omna
python -m spacy download en_core_web_lg   # one-time, for PII detection

Requires Python 3.10+. No API key is needed for search, filter, embed, pii_report, mask_pii, or understand_df. Only ask() requires ANTHROPIC_API_KEY.


Quick start

import polars as pl
import omna

df = pl.read_csv("documents.csv")

# 1 — explore the schema
omna.understand_df(df)

# 2 — audit for PII before anything touches the data
df.omna.pii_report()

# 3 — redact
clean = df.omna.mask_pii()

# 4 — build a search index once
clean.omna.embed("text")

# 5 — search by meaning
results = clean.omna.search("insurance claim denied", on="text", k=5)

# 6 — filter everything above a threshold
flagged = clean.omna.filter("insurance claim denied", on="text", threshold=0.73)

# 7 — ask a question in plain English
results.omna.ask("What personal data do these documents expose?")

What Omna does

| Method | What it does |
| --- | --- |
| omna.understand_df(df) | Schema inference: labels, null rates, samples. No LLM. |
| df.omna.embed(column) | Vectorize a text column once; reuse across sessions |
| df.omna.search(query, on, k) | Top-k results by semantic meaning |
| df.omna.filter(query, on, threshold) | Every row above a similarity threshold |
| df.omna.pii_report() | Audit every string column for PII |
| df.omna.mask_pii() | Redact PII, auto-save audit log |
| df.omna.ask(question) | Natural language queries over your DataFrame |

API reference

omna.understand_df(df) — explore before you do anything

No LLM. No API call. Analyzes column names, dtypes, null rates, and sample values.

omna.understand_df(df)
 column                dtype    null_pct   label     sample
 uid                   String     0.0%     category  24bb757...
 domain                String     0.0%     category  insurance, healthcare...
 document_type         String     0.0%     category  Invoice, ClaimForm...
 document_description  String     0.0%     text      An insurance claim...
 text                  String     0.0%     text      **Claim ID: 285-14...

Labels: email phone name id date text numeric boolean category unknown
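
To make the idea of LLM-free labeling concrete, here is a minimal sketch that computes a null rate and guesses a label from sample values. The function names (null_pct, guess_label), the regexes, and the 50-character cutoff are assumptions made for this example, not Omna's actual rules, and the sketch covers only some of the labels listed above.

```python
import re

# Illustrative patterns only (assumptions, not Omna's rules).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
PHONE_RE = re.compile(r"^\+?[\d\s().-]{7,20}$")
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def null_pct(values):
    """Percentage of None entries in a column."""
    if not values:
        return 0.0
    return 100.0 * sum(v is None for v in values) / len(values)


def guess_label(values):
    """Guess a coarse label from the non-null values (hypothetical rules)."""
    present = [v for v in values if v is not None]
    if not present:
        return "unknown"
    # bool is a subclass of int, so test for booleans first.
    if all(isinstance(v, bool) for v in present):
        return "boolean"
    if all(isinstance(v, (int, float)) for v in present):
        return "numeric"
    texts = [str(v) for v in present]
    if all(EMAIL_RE.match(t) for t in texts):
        return "email"
    # ISO dates also look like digit-and-hyphen phone numbers, so check dates first.
    if all(DATE_RE.match(t) for t in texts):
        return "date"
    if all(PHONE_RE.match(t) for t in texts):
        return "phone"
    # Long free-form strings read as text; short values as category.
    avg_len = sum(len(t) for t in texts) / len(texts)
    return "text" if avg_len > 50 else "category"
```

Everything here runs on plain Python lists, which is why no model or API call is needed. A real implementation would also look at column names and dtypes, as described above.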

df.omna.embed(column) — vectorize once, search forever

Converts text to 384-dimensional vectors using FastEmbed (local ONNX, no API key). Saves to .omna/{column}.parquet. Run once — search() and filter() load it automatically on every subsequent call.

df.omna.embed("text")
# → .omna/text.parquet

Model: BAAI/bge-small-en-v1.5 (~130 MB, downloaded once). Embed is a one-time cost.

| Hardware | Time to embed 50k rows |
| --- | --- |
| MacBook Air M5 | ~45 min |
| MacBook Pro M4 Max | ~15 min |
| AWS GPU instance | ~2 min |

df.omna.search(query, on, k) — semantic search

Requires df.omna.embed("column") first.

results = df.omna.search("insurance claim denied", on="text", k=5)
 uid            document_type         domain      text                               _score
 67fccc1e207…   ClaimSummary          insurance   **Claim ID: 285-14-1755, Policy…   0.762
 b8ae088cd21…   ClaimSummary          insurance   **Claim Summary**…                 0.749
 de5bba0a2cc…   Insurance Claim Form  healthcare  **Insurance Claim Form**…          0.748
 ebccdde3b42…   Insurance Claim       healthcare  Insurance Claim for MED74974358…   0.747
 aebb0eb55fb…   ClaimForm             healthcare  **Claim Form** - Patient ID…       0.747

_score is cosine similarity: higher means closer in meaning, with 1 as the maximum. None of these documents contain the phrase "insurance claim denied"; Omna finds them by meaning.

df.omna.filter(query, on, threshold) — semantic filter

Requires df.omna.embed("column") first.

filtered = df.omna.filter("insurance claim denied", on="text", threshold=0.73)
# → N documents matched — all semantically related to claim denials

Returns every row scoring above the threshold. The default threshold is 0.3; raise it for precision, lower it for recall.

Use search() for the top k. Use filter() for everything above a threshold.
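
The difference between the two is easiest to see in code. The sketch below is plain Python over toy 2-dim vectors, not Omna's Rust kernel; the helper names (cosine, top_k, above_threshold) are hypothetical and exist only for this illustration.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, vectors, k):
    """search()-style: the k best (row_index, score) pairs, best first."""
    scored = ((i, cosine(query, v)) for i, v in enumerate(vectors))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


def above_threshold(query, vectors, threshold):
    """filter()-style: every (row_index, score) pair scoring above threshold."""
    return [(i, s) for i, v in enumerate(vectors)
            if (s := cosine(query, v)) > threshold]


# Toy 2-dim "embeddings" standing in for real 384-dim vectors.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.0]
best = top_k(query, vectors, k=2)                       # fixed result size
kept = above_threshold(query, vectors, threshold=0.5)   # size depends on the data
```

top_k always returns at most k rows, however weak the matches are. above_threshold can return anything from zero rows to the whole table, which is why the threshold needs tuning per dataset.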

df.omna.pii_report() — audit before you redact

df.omna.pii_report()
 column    detected types                                    hit rate   flagged
 entities  CREDIT_CARD, EMAIL_ADDRESS, PERSON, PHONE_NUMBER   85.4%    ✓ YES
 text      CREDIT_CARD, EMAIL_ADDRESS, PERSON, PHONE_NUMBER   78.1%    ✓ YES

Scans every string column. Returns hit rates, PII types, and confidence scores. Nothing is modified.
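
As a rough picture of what a per-column hit rate means, the sketch below counts the share of rows in each column that contain at least one match. It uses two simplified regexes and plain Python lists; the PATTERNS table and the audit function are assumptions for this example, and Omna's own detection (Presidio + spaCy, per the FAQ) is considerably more thorough.

```python
import re

# Simplified, illustrative patterns (assumptions; real detectors are stricter).
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def audit(columns):
    """Map each column name to (sorted detected types, hit rate in percent)."""
    report = {}
    for name, values in columns.items():
        found, hits = set(), 0
        for value in values:
            matched = {t for t, rx in PATTERNS.items() if rx.search(str(value))}
            found |= matched
            hits += bool(matched)  # a row counts once, however many entities it has
        rate = 100.0 * hits / len(values) if values else 0.0
        report[name] = (sorted(found), rate)
    return report


report = audit({
    "text": ["SSN 123-45-6789", "mail jo@example.com", "nothing here", "ok"],
    "domain": ["insurance", "healthcare", "insurance", "finance"],
})
```

Like pii_report(), this only reads the data; nothing is modified.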

df.omna.mask_pii() — redact in one line

clean = df.omna.mask_pii()
# → <REDACTED> replaces every detected entity
# → audit log saved to .omna/pii_audit.parquet automatically

# Fast mode — regex only, ~10x faster, catches email/phone/SSN/URL
clean = df.omna.mask_pii(fast=True)

Detects: PERSON EMAIL_ADDRESS PHONE_NUMBER CREDIT_CARD US_SSN US_PASSPORT IP_ADDRESS IBAN_CODE URL and more.
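
For intuition, regex-only redaction with an audit trail can be sketched in a few lines. This is a standalone illustration, not the code behind mask_pii(fast=True): the patterns are simplified assumptions, and the log format (row, entity type, match count) is invented for the example.

```python
import re

# Illustrative patterns only (assumptions, not Omna's rule set).
PATTERNS = [
    ("EMAIL_ADDRESS", re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")),
    ("US_SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("URL", re.compile(r"https?://\S+")),
]


def mask(values, placeholder="<REDACTED>"):
    """Return (masked values, audit log of (row, entity_type, match_count))."""
    masked, log = [], []
    for row, text in enumerate(values):
        for entity_type, rx in PATTERNS:
            # subn returns the rewritten text plus the number of replacements.
            text, count = rx.subn(placeholder, text)
            if count:
                log.append((row, entity_type, count))
        masked.append(text)
    return masked, log


clean, audit_log = mask([
    "Contact jo@example.com, SSN 123-45-6789",
    "See https://example.com/claim for details",
    "No PII here",
])
```

Regexes cannot recognize names or locations, which is the trade-off of a regex-only mode: it is faster, but it only catches entities with a fixed textual shape.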

df.omna.ask(question) — natural language queries

Sends schema + up to 20 sample rows to Claude. Requires ANTHROPIC_API_KEY.

export ANTHROPIC_API_KEY=sk-ant-...   # shell: set the key before starting Python

results.omna.ask("What personal data do these documents expose?")
# → "These insurance documents expose SSNs, medical record numbers,
#    dates of birth, health plan numbers, and claimant identifiers."

# Override model
results.omna.ask("Summarise the key themes", model="claude-sonnet-4-6")

Default model: claude-haiku-4-5-20251001.
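
Because ask() is the one method that sends data off your machine, it helps to picture what "schema + up to 20 sample rows" looks like as a payload. The sketch below builds such a prompt from plain Python objects and makes no network call. The function name build_prompt and the prompt layout are assumptions for this example; Omna's actual prompt format is not documented here.

```python
import json


def build_prompt(schema, rows, question, max_rows=20):
    """Combine schema, a capped row sample, and the question into one prompt."""
    sample = rows[:max_rows]  # never include more than max_rows rows
    return "\n".join([
        "Schema: " + json.dumps(schema),
        f"Sample rows ({len(sample)} of {len(rows)}):",
        *(json.dumps(row) for row in sample),
        "Question: " + question,
    ])


schema = {"uid": "String", "text": "String"}
rows = [{"uid": str(i), "text": f"doc {i}"} for i in range(50)]
prompt = build_prompt(schema, rows, "What personal data do these documents expose?")
```

Whatever is in those sample rows is what leaves your machine, so running mask_pii() before ask() is the safer order when the data is sensitive.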


How it works

df.omna.search("insurance claim denied", on="text", k=5)
         │
         ▼
   embedder.py       FastEmbed — BAAI/bge-small-en-v1.5, local ONNX
                     query → [0.12, -0.34, 0.87, ...]  384-dim vector
         │
         ▼
   index.py          loads .omna/text.parquet → Arrow memory, zero-copy
                     50,000 stored vectors in Polars' own allocation
         │
         ▼
   similarity.rs     Rust kernel — cosine similarity over all vectors
                     returns top-k sorted descending, no Python loop
         │
         ▼
   frame.py          slices result rows, attaches _score → pl.DataFrame

The Rust kernel is 23 lines. It computes dot products and norms in machine code with no intermediate allocations, scoring 50,000 × 384-dim vectors in under 10ms on a single core.
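
The Rust source is not part of this README, so the following is only a Python sketch of the idea described above: walk a flat, contiguous buffer of vectors and accumulate the dot product and the squared norm for each row in a single pass. The function name cosine_scores and the use of array("f") as the buffer are assumptions for this illustration.

```python
import math
from array import array


def cosine_scores(query, flat, dim):
    """Score every dim-sized row of a flat float buffer against the query."""
    q_norm = math.sqrt(sum(x * x for x in query))
    scores = array("f")
    for start in range(0, len(flat), dim):
        dot = 0.0
        sq = 0.0
        # One pass per row: accumulate the dot product and squared norm together.
        for j in range(dim):
            v = flat[start + j]
            dot += query[j] * v
            sq += v * v
        denom = q_norm * math.sqrt(sq)
        scores.append(dot / denom if denom else 0.0)
    return scores


# Three 2-dim rows stored back to back, as a columnar buffer would hold them.
flat = array("f", [1.0, 0.0, 0.0, 2.0, 3.0, 3.0])
scores = cosine_scores([1.0, 0.0], flat, dim=2)
```

Reading rows straight out of one buffer avoids building a temporary list per vector. In Python the inner loop is slow; compiled code runs the same loop without interpreter overhead.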


Performance

| Method | 50k rows | 500k rows |
| --- | --- | --- |
| Omna search | 9ms | 27ms |
| Omna filter | 9ms | 27ms |
| Pandas + FAISS | ~25ms + index build | ~25ms + index build |
| Polars keyword regex | 1ms (exact match only) | 1ms (exact match only) |

Benchmarked on MacBook Air M5, BAAI/bge-small-en-v1.5 (384-dim), 10-query median, warm index.
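
If you want to reproduce a "10-query median, warm index" style of measurement on your own hardware, a small harness along these lines is one way to do it. It is a sketch, not the script used for the numbers above; median_latency_ms is a hypothetical helper, and the workload below is a stand-in so the example runs without Omna installed.

```python
import statistics
import time


def median_latency_ms(fn, queries, warmup=1):
    """Median wall-clock latency of fn(query) in milliseconds."""
    for q in queries[:warmup]:
        fn(q)  # warm-up call so one-time loading is not timed
    timings = []
    for q in queries:
        start = time.perf_counter()
        fn(q)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Stand-in workload; with Omna installed this would be something like
# lambda q: df.omna.search(q, on="text", k=5)
queries = [f"query {i}" for i in range(10)]
latency = median_latency_ms(lambda q: sorted(q), queries)
```

The median is less sensitive than the mean to a single slow run, and the warm-up call keeps the first index load out of the timings.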

Omna inherits Polars' Arrow columnar memory. The Rust similarity kernel operates on the same memory — no copy into NumPy, no copy into a C buffer.


FAQ

Does Omna send my data to the cloud?

No. Embedding, search, filter, PII detection, and masking all run locally. The only method that makes a network call is ask(), which sends schema metadata and sample rows to Claude via the Anthropic API — and only when you explicitly call it.

Do I need a GPU?

No. FastEmbed uses ONNX and runs on CPU. On Apple Silicon, it uses CoreML automatically. Embedding 50,000 documents takes ~45 minutes on a MacBook Air M5 — a one-time cost. After that, search() and filter() run in milliseconds from the saved index.

Why not FAISS / ChromaDB / Pinecone?

Those are vector databases. Omna is a Polars plugin. If your data already lives in a DataFrame, Omna adds semantic search with zero infrastructure — no separate process, no index server, no network hop. It's the difference between df.omna.search(...) and spinning up a separate service just to query your own data.

What PII types does Omna detect?

PERSON, EMAIL_ADDRESS, PHONE_NUMBER, CREDIT_CARD, US_SSN, US_PASSPORT, IP_ADDRESS, IBAN_CODE, URL, DATE_TIME, LOCATION, and more. Detection uses Microsoft Presidio + spaCy NER, running fully local.

Which Polars versions are supported?

Omna is tested on Polars 0.20+. Running import omna registers the df.omna namespace on every Polars DataFrame; no other import is needed.

The embed step took 45 minutes. Do I have to redo it every time?

No. embed() saves the index to .omna/{column}.parquet. Every subsequent search() or filter() call loads it in ~300ms. You only re-run embed() if your data changes.
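
This README does not say whether Omna detects changed data for you, so deciding when to re-embed may be up to you. One possible approach, sketched below, is to keep a content fingerprint of the column from the time you embedded it and compare it later. The helpers column_fingerprint and needs_reembed are hypothetical and are not part of Omna.

```python
import hashlib


def column_fingerprint(values):
    """Stable SHA-256 digest of a text column's contents, order-sensitive."""
    h = hashlib.sha256()
    for value in values:
        if value is None:
            h.update(b"\x00")  # marker so None differs from an empty string
            continue
        data = value.encode("utf-8")
        # Length prefix keeps ["ab", "c"] distinct from ["a", "bc"].
        h.update(b"\x01" + len(data).to_bytes(8, "little") + data)
    return h.hexdigest()


def needs_reembed(values, stored_fingerprint):
    """True when the column no longer matches the fingerprint saved at embed time."""
    return column_fingerprint(values) != stored_fingerprint


saved = column_fingerprint(["claim denied", "policy voided"])
```

With a Polars DataFrame you could pass df["text"].to_list() as values and store the fingerprint next to the .omna directory.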


Roadmap

# Coming in v0.2
matched = transactions.omna.join(regulatory_categories, on="description")
# Match rows between two DataFrames by meaning, not exact key.

Star the repo to follow progress.


License

| Layer | License |
| --- | --- |
| Python package (omna/) | MIT |
| Rust engine (src/) | Proprietary; ships as a compiled binary in the pip wheel |

omna.dev · PyPI · GitHub
