Static, range-fetchable full-text search built on roaring bitmaps — query a multi-million-document trigram index in the browser, over HTTP Range requests, with no backend.
The index is one static file on object storage (S3/CDN). The browser fetches a
tiny sparse view once (~tens of KB), then each query pulls a few small byte
ranges — independent of corpus size. The multi-GB index is never downloaded
whole. Postings are portable RoaringBitmaps, so writers and readers interoperate
byte-for-byte with zero re-encoding: build an index directly with the Rust
build module, or transcode an existing
roaringsearch index with the Go
writer — both emit the same files the Rust/WASM (or Go) reader reads.
- OpenAlex — openalex.evefreeman.com: ~47.8M
scholarly works, citation-ranked, faceted (year/type/open-access/language/topic),
searched entirely client-side. (Reproducible — see
examples/openalex/.)
build (Go): corpus ─roaringsearch─▶ FTSR ─roaringrange.Transcode─▶ RRS (.rrs) ─▶ S3/CDN
(optional) facets ──────────────roaringrange.WriteFacets─▶ RRSF (.rrf) ─▶ S3/CDN
build (Rust): corpus ──build::write_index / write_facets / write_records──▶ .rrs + .rrf + records
browser (Rust/WASM): .rrs/.rrf/records on CDN ─HTTP Range─▶ Catalog (= Index + FacetIndex + RecordStore)
Doc IDs are assigned at build time in descending static rank (citations, holdings,
…), so ascending doc ID = rank. Each posting is split into a head (the 65,536
top-ranked docs) and a tail (the rest); a query ANDs the small heads to get
the ranked top-K and only fetches a tail when paginating past it. See
docs/format.svg and docs/search.svg.
This is the design trade-off: ranking is baked in up front (no query-time relevance scoring), in exchange for near-constant per-query fetch cost.
Inspired by lunr.js and Pagefind — both deliver full-text search to static sites with no backend. lunr loads the whole index into memory; Pagefind pioneered fetching only the index shards a query needs. roaringrange pushes that "fetch only what you query" idea to millions of records by HTTP-Range-reading a single roaring-bitmap index file, trading query-time relevance ranking for a baked-in static rank.
| lunr.js | Pagefind | roaringrange | |
|---|---|---|---|
| backend | none | none | none |
| index transport | whole index in memory | many shard files, per query | one file, HTTP Range |
| typical scale | hundreds–few thousand docs | up to ~100k+ pages | millions–100M+ records |
| per-query bytes | 0 after full load (load can be MBs+) | ~tens–hundreds KB | ~KB–few MB (≈ constant) |
| matching | stemmed terms; wildcard, fuzzy, boosts | stemmed words; partial | trigram substring; fuzzy (tolerate-N) |
| ranking | TF-IDF / BM25 relevance | BM25-like relevance | static rank (query-independent importance) in doc-ID order — no query-time relevance |
| facets / filters | fielded search (no facet counts) | filters + facet counts | facets + live counts (sidecar) |
| build input | JS objects / prebuilt JSON | crawls built HTML pages | any records (Go via roaringsearch, or the Rust builder) |
| sweet spot | embedding a search library in a JS app (you own the index) | static sites & docs, small to large | very large catalogs & datasets |
In short: lunr.js when you want to embed a search library and control indexing in code; Pagefind for static-site search from small to large; roaringrange when you have a lot of records and want a single range-fetched file with static-rank ordering and facets.
| path | what |
|---|---|
go/ |
core Go module (github.com/freeeve/roaringrange): Transcode (FTSR→RRS), Open/Index reference reader, WriteFacets, NgramKeys |
FORMAT.md, FACETS.md, RECORDS.md, VECTORS.md |
the frozen on-disk specs (RRSI index, RRSF facet sidecar, RRSR record store, RRVI vector index) |
rust/ |
Rust crate roaringrange: reader (Catalog over Index/FacetIndex/RecordStore) + native build writers; both exposed to WASM (wasm-pack). Optional vector feature adds the RRVI similarity-search reader + IVFPQ trainer |
python/ |
PyO3 bindings (pip install roaringrange): Builder (text index) + VectorBuilder (similarity index) over the core build writers |
go/conformance/ |
cross-library test: roaringsearch build ⇄ roaringrange(go) read must agree |
examples/openalex/ |
the OpenAlex demo: Go loader, parallel Rust builder/, download.sh, static web UI |
docs/ |
architecture diagrams (SVG) |
The core Go module has no dependency on roaringsearch — it parses the FTSR
byte format directly. The n-gram key derivation is reproduced independently in Go
(here), Go (roaringsearch), and Rust (the reader); the go/conformance/ module is
the guard that keeps all three byte-compatible.
Build an index — two paths to the same files. Assign doc IDs in descending static rank first, so the head holds the top-K.
Rust (direct): split each posting into head/tail, then write the index + an optional facet sidecar + record store:
use roaringrange::build::{write_index, write_facets, write_records, split_posting};
let entries = postings.iter()
.map(|(k, bm)| { let (h, t) = split_posting(bm); (*k, h, t) })
.collect();
write_index(rrs_w, 3, 0, entries)?; // → .rrs
write_facets(rrf_w, facet_fields)?; // → .rrf (optional)
write_records(bin_w, idx_w, &records)?; // → record store (optional)For a corpus whose index exceeds RAM, build it in doc-ID-range chunks and fold
them into one standard .rrs with build::chunk::{write_partial, merge_partials_to_rrs}.
Go (transcode a roaringsearch index):
rr.Transcode(ftsrReader, rrsWriter) // FTSR → .rrs
rr.WriteFacets(rrfWriter, []rr.FacetField{...}) // → .rrf (optional)
rr.WriteRecords(binW, idxW, records) // → record store (optional)Read it (Rust/WASM): build the reader, then open a Catalog and search:
cd rust && wasm-pack build --target web --features wasmimport init, { RrsCatalog } from "./roaringrange.js";
await init();
const cat = await RrsCatalog.openAll("index.rrs", "index.rrf", "records.idx", "records.bin");
const page = await cat.search("query", 0, 25, 0, '[["type","article"]]');
// page.ids = ranked doc IDs · page.records · page.facetCountsCatalog/Index/FacetIndex/RecordStore (and RrsIndex/RrsRecords) stay
available for advanced use. Host the files on anything that supports HTTP Range
(S3 + CloudFront works well) and point the reader at the URLs.
| library catalog 9.6M | OpenAlex 47.8M (with abstracts) | |
|---|---|---|
text index (.rrs) |
1.4 GB | 11.5 GB |
| boot — index sparse | ~52 KB | ~0.5 MB |
| typical query | tens–hundreds of KB | ~1–6 MB head+tail (less head-only) |
| compute | 2–14 µs | — |
Boot and per-query cost stay ~constant as the corpus grows; size lives in the postings (≈0.4 bytes per trigram-document incidence — roaring is near-optimal), so the lever for a smaller index is indexing less text per doc, not the encoding.
The Rust builder reaches the 47.8M-work OpenAlex corpus that backs the live demo: an 11.5 GB index of 30.3M unique trigrams, built in ~57 min at ~52 GB peak RAM with no swap — and sublinearly (≈half the naive linear projection), as the trigram vocabulary saturates and roaring absorbs the added postings. With facets and records attached, the demo's full boot is ~3 MB: the index sparse, the facet metadata + top-category heads (the facet tails stay range-fetched, not loaded up front), and the record-store header.
Experiments the data argued against — recorded so we don't relitigate them.
Idea: a very common trigram has a huge posting, but its complement (the docs
that lack it) is small — so store the complement plus a one-bit "inverted" flag
and ANDNOT it during intersection, cutting the bytes a query fetches.
Why it didn't pan out: roaring stores each 65,536-doc block as either an array
(≤4096 docs, 2 B each) or a flat 8 KB bitmap for any cardinality in
(4096, 61440]. A common trigram's complement usually lands in that same band, so
it is also an 8 KB bitmap — inversion only shrinks a block denser than ~94%
(complement becomes a small array, or empty). Measured on the live 47.8M index
(rust/examples/density, e.g. cargo run --release --example density -- "machine learning" "posthuman became"):
the hottest trigram, the, is only ~52% dense — OpenAlex lacks an abstract for
roughly half its works, so half the corpus is title-only text — and just
0.0–0.6% of posting bytes sit in >94%-dense blocks. Net saving across those
queries: ~0.1%. Not worth a format-version bump + reindex.
It would pay on a corpus where common trigrams are near-universal (full text
with an abstract on every doc); here it's the partial abstract coverage that
flattens the density. The density example re-runs the analysis on any
index/query.
Idea: skip the common trigrams' multi-MB postings entirely — seed candidates by intersecting only the rarest trigrams, then verify each against its record's stored text (title + abstract + authors + venue), keeping the true matches.
Why it didn't pan out: client-side, egress is floored by the result-set size.
The candidate set can't shrink below the number of results, and verifying that
many records costs ~result_count × record_size. Measured on the live 47.8M
index (rust/examples/candidates): machine learning (171k results) still has
~195k candidates after seeding the 4 rarest trigrams → ~190 MB of record
verification, worse than the 53 MB full intersection. It helps only sparse
results (posthuman became, 317 → ~12 MB vs 30 MB), which the lazy tail already
gates behind an explicit load. Index::search_candidates and the candidates
example stay for re-measuring.
Returning just result IDs needs the intersection to run server-side (a thin Lambda@Edge over in-region postings) — the one lever that beats the result-set floor.
Beyond trigram text search, the crate has a range-fetchable similarity index
in the same ethos: an RRVI (IVFPQ) file whose coarse centroids and PQ codebooks
boot once, after which each query range-fetches only the nprobe nearest
clusters' codes and scans them with asymmetric distance computation — top-k
nearest vectors in ~constant bytes, independent of corpus size. vector_id == doc_id, so a hit reuses the same record store and can hybridize with the trigram
result set.
Build it from Rust (build_ivfpq, behind the vector feature) or Python
(VectorBuilder), or train at scale with FAISS and export the same layout
(build_ivfpq_from_parts / roaringrange.write_rrvi_from_faiss, verified against
the reader at recall@10 ≈ 0.9995). The reader VectorIndex
is pure Rust with a browser binding (RrviIndex, wasm-pack build --features "wasm vector"). See VECTORS.md. Still in progress: the query
embedder (in-browser model2vec / a Lambda proxy) and a trigram hybrid
(tasks/004_vector_search).
Enable the formatting pre-commit hook (runs gofmt + cargo fmt --check on
staged changes, matching CI):
git config core.hooksPath .githooksCI runs go test ./... in go/, the go/conformance/ module, go vet on the
example, cargo test + fmt + clippy for the reader (a second pass with
--features vector), and builds + tests the Python extension on CPython
3.12–3.14.
MIT — see LICENSE.