Skip to content

freeeve/roaringrange

Repository files navigation

roaringrange

CI Coverage Status Go Report Card Go Reference Maintainability Rating Reliability Rating Security Rating Snyk

Static, range-fetchable full-text search built on roaring bitmaps — query a multi-million-document trigram index in the browser, over HTTP Range requests, with no backend.

The index is one static file on object storage (S3/CDN). The browser fetches a tiny sparse view once (~tens of KB), then each query pulls a few small byte ranges — independent of corpus size. The multi-GB index is never downloaded whole. Postings are portable RoaringBitmaps, so writers and readers interoperate byte-for-byte with zero re-encoding: build an index directly with the Rust build module, or transcode an existing roaringsearch index with the Go writer — both emit the same files the Rust/WASM (or Go) reader reads.

Live demos

How it fits together

build (Go):   corpus ─roaringsearch─▶ FTSR ─roaringrange.Transcode─▶ RRS (.rrs) ─▶ S3/CDN
              (optional) facets ──────────────roaringrange.WriteFacets─▶ RRSF (.rrf) ─▶ S3/CDN
build (Rust): corpus ──build::write_index / write_facets / write_records──▶ .rrs + .rrf + records
browser (Rust/WASM): .rrs/.rrf/records on CDN ─HTTP Range─▶ Catalog (= Index + FacetIndex + RecordStore)

Doc IDs are assigned at build time in descending static rank (citations, holdings, …), so ascending doc ID = rank. Each posting is split into a head (the 65,536 top-ranked docs) and a tail (the rest); a query ANDs the small heads to get the ranked top-K and only fetches a tail when paginating past it. See docs/format.svg and docs/search.svg.

This is the design trade-off: ranking is baked in up front (no query-time relevance scoring), in exchange for near-constant per-query fetch cost.

How it compares

Inspired by lunr.js and Pagefind — both deliver full-text search to static sites with no backend. lunr loads the whole index into memory; Pagefind pioneered fetching only the index shards a query needs. roaringrange pushes that "fetch only what you query" idea to millions of records by HTTP-Range-reading a single roaring-bitmap index file, trading query-time relevance ranking for a baked-in static rank.

lunr.js Pagefind roaringrange
backend none none none
index transport whole index in memory many shard files, per query one file, HTTP Range
typical scale hundreds–few thousand docs up to ~100k+ pages millions–100M+ records
per-query bytes 0 after full load (load can be MBs+) ~tens–hundreds KB ~KB–few MB (≈ constant)
matching stemmed terms; wildcard, fuzzy, boosts stemmed words; partial trigram substring; fuzzy (tolerate-N)
ranking TF-IDF / BM25 relevance BM25-like relevance static rank (query-independent importance) in doc-ID order — no query-time relevance
facets / filters fielded search (no facet counts) filters + facet counts facets + live counts (sidecar)
build input JS objects / prebuilt JSON crawls built HTML pages any records (Go via roaringsearch, or the Rust builder)
sweet spot embedding a search library in a JS app (you own the index) static sites & docs, small to large very large catalogs & datasets

In short: lunr.js when you want to embed a search library and control indexing in code; Pagefind for static-site search from small to large; roaringrange when you have a lot of records and want a single range-fetched file with static-rank ordering and facets.

Repository layout

path what
go/ core Go module (github.com/freeeve/roaringrange): Transcode (FTSR→RRS), Open/Index reference reader, WriteFacets, NgramKeys
FORMAT.md, FACETS.md, RECORDS.md, VECTORS.md the frozen on-disk specs (RRSI index, RRSF facet sidecar, RRSR record store, RRVI vector index)
rust/ Rust crate roaringrange: reader (Catalog over Index/FacetIndex/RecordStore) + native build writers; both exposed to WASM (wasm-pack). Optional vector feature adds the RRVI similarity-search reader + IVFPQ trainer
python/ PyO3 bindings (pip install roaringrange): Builder (text index) + VectorBuilder (similarity index) over the core build writers
go/conformance/ cross-library test: roaringsearch build ⇄ roaringrange(go) read must agree
examples/openalex/ the OpenAlex demo: Go loader, parallel Rust builder/, download.sh, static web UI
docs/ architecture diagrams (SVG)

The core Go module has no dependency on roaringsearch — it parses the FTSR byte format directly. The n-gram key derivation is reproduced independently in Go (here), Go (roaringsearch), and Rust (the reader); the go/conformance/ module is the guard that keeps all three byte-compatible.

Quick start

Build an index — two paths to the same files. Assign doc IDs in descending static rank first, so the head holds the top-K.

Rust (direct): split each posting into head/tail, then write the index + an optional facet sidecar + record store:

use roaringrange::build::{write_index, write_facets, write_records, split_posting};
let entries = postings.iter()
    .map(|(k, bm)| { let (h, t) = split_posting(bm); (*k, h, t) })
    .collect();
write_index(rrs_w, 3, 0, entries)?;            // → .rrs
write_facets(rrf_w, facet_fields)?;            // → .rrf  (optional)
write_records(bin_w, idx_w, &records)?;        // → record store (optional)

For a corpus whose index exceeds RAM, build it in doc-ID-range chunks and fold them into one standard .rrs with build::chunk::{write_partial, merge_partials_to_rrs}.

Go (transcode a roaringsearch index):

rr.Transcode(ftsrReader, rrsWriter)             // FTSR → .rrs
rr.WriteFacets(rrfWriter, []rr.FacetField{...}) // → .rrf  (optional)
rr.WriteRecords(binW, idxW, records)            // → record store (optional)

Read it (Rust/WASM): build the reader, then open a Catalog and search:

cd rust && wasm-pack build --target web --features wasm
import init, { RrsCatalog } from "./roaringrange.js";
await init();
const cat = await RrsCatalog.openAll("index.rrs", "index.rrf", "records.idx", "records.bin");
const page = await cat.search("query", 0, 25, 0, '[["type","article"]]');
// page.ids = ranked doc IDs · page.records · page.facetCounts

Catalog/Index/FacetIndex/RecordStore (and RrsIndex/RrsRecords) stay available for advanced use. Host the files on anything that supports HTTP Range (S3 + CloudFront works well) and point the reader at the URLs.

Measured (full corpora, range-fetched)

library catalog 9.6M OpenAlex 47.8M (with abstracts)
text index (.rrs) 1.4 GB 11.5 GB
boot — index sparse ~52 KB ~0.5 MB
typical query tens–hundreds of KB ~1–6 MB head+tail (less head-only)
compute 2–14 µs

Boot and per-query cost stay ~constant as the corpus grows; size lives in the postings (≈0.4 bytes per trigram-document incidence — roaring is near-optimal), so the lever for a smaller index is indexing less text per doc, not the encoding.

The Rust builder reaches the 47.8M-work OpenAlex corpus that backs the live demo: an 11.5 GB index of 30.3M unique trigrams, built in ~57 min at ~52 GB peak RAM with no swap — and sublinearly (≈half the naive linear projection), as the trigram vocabulary saturates and roaring absorbs the added postings. With facets and records attached, the demo's full boot is ~3 MB: the index sparse, the facet metadata + top-category heads (the facet tails stay range-fetched, not loaded up front), and the record-store header.

Tried and shelved

Experiments the data argued against — recorded so we don't relitigate them.

Inverting common-trigram postings

Idea: a very common trigram has a huge posting, but its complement (the docs that lack it) is small — so store the complement plus a one-bit "inverted" flag and ANDNOT it during intersection, cutting the bytes a query fetches.

Why it didn't pan out: roaring stores each 65,536-doc block as either an array (≤4096 docs, 2 B each) or a flat 8 KB bitmap for any cardinality in (4096, 61440]. A common trigram's complement usually lands in that same band, so it is also an 8 KB bitmap — inversion only shrinks a block denser than ~94% (complement becomes a small array, or empty). Measured on the live 47.8M index (rust/examples/density, e.g. cargo run --release --example density -- "machine learning" "posthuman became"): the hottest trigram, the, is only ~52% dense — OpenAlex lacks an abstract for roughly half its works, so half the corpus is title-only text — and just 0.0–0.6% of posting bytes sit in >94%-dense blocks. Net saving across those queries: ~0.1%. Not worth a format-version bump + reindex.

It would pay on a corpus where common trigrams are near-universal (full text with an abstract on every doc); here it's the partial abstract coverage that flattens the density. The density example re-runs the analysis on any index/query.

Rarest-trigram candidates + verify

Idea: skip the common trigrams' multi-MB postings entirely — seed candidates by intersecting only the rarest trigrams, then verify each against its record's stored text (title + abstract + authors + venue), keeping the true matches.

Why it didn't pan out: client-side, egress is floored by the result-set size. The candidate set can't shrink below the number of results, and verifying that many records costs ~result_count × record_size. Measured on the live 47.8M index (rust/examples/candidates): machine learning (171k results) still has ~195k candidates after seeding the 4 rarest trigrams → ~190 MB of record verification, worse than the 53 MB full intersection. It helps only sparse results (posthuman became, 317 → ~12 MB vs 30 MB), which the lazy tail already gates behind an explicit load. Index::search_candidates and the candidates example stay for re-measuring.

Returning just result IDs needs the intersection to run server-side (a thin Lambda@Edge over in-region postings) — the one lever that beats the result-set floor.

Vector / similarity search (optional)

Beyond trigram text search, the crate has a range-fetchable similarity index in the same ethos: an RRVI (IVFPQ) file whose coarse centroids and PQ codebooks boot once, after which each query range-fetches only the nprobe nearest clusters' codes and scans them with asymmetric distance computation — top-k nearest vectors in ~constant bytes, independent of corpus size. vector_id == doc_id, so a hit reuses the same record store and can hybridize with the trigram result set.

Build it from Rust (build_ivfpq, behind the vector feature) or Python (VectorBuilder), or train at scale with FAISS and export the same layout (build_ivfpq_from_parts / roaringrange.write_rrvi_from_faiss, verified against the reader at recall@10 ≈ 0.9995). The reader VectorIndex is pure Rust with a browser binding (RrviIndex, wasm-pack build --features "wasm vector"). See VECTORS.md. Still in progress: the query embedder (in-browser model2vec / a Lambda proxy) and a trigram hybrid (tasks/004_vector_search).

Development

Enable the formatting pre-commit hook (runs gofmt + cargo fmt --check on staged changes, matching CI):

git config core.hooksPath .githooks

CI runs go test ./... in go/, the go/conformance/ module, go vet on the example, cargo test + fmt + clippy for the reader (a second pass with --features vector), and builds + tests the Python extension on CPython 3.12–3.14.

License

MIT — see LICENSE.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors