roaringrange

Static, range-fetchable full-text search built on roaring bitmaps — query a multi-million-document trigram index in the browser, over HTTP Range requests, with no backend.

The index is one static file on object storage (S3/CDN). The browser fetches a tiny sparse view once (~tens of KB), then each query pulls a few small byte ranges — independent of corpus size. The multi-GB index is never downloaded whole. Postings are portable RoaringBitmaps, so writers and readers interoperate byte-for-byte with zero re-encoding: build an index directly with the Rust build module, or transcode an existing roaringsearch index with the Go writer — both emit the same files the Rust/WASM (or Go) reader reads.

Live demos

OpenAlex — openalex.evefreeman.com: ~47.8M scholarly works, citation-ranked, faceted (year/type/open-access/language/topic), searched entirely client-side. (Reproducible — see examples/openalex/.)

How it fits together

build (Go):   corpus ─roaringsearch─▶ FTSR ─roaringrange.Transcode─▶ RRS (.rrs) ─▶ S3/CDN
              (optional) facets ──────────────roaringrange.WriteFacets─▶ RRSF (.rrf) ─▶ S3/CDN
build (Rust): corpus ──build::write_index / write_facets / write_records──▶ .rrs + .rrf + records
browser (Rust/WASM): .rrs/.rrf/records on CDN ─HTTP Range─▶ Catalog (= Index + FacetIndex + RecordStore)

Doc IDs are assigned at build time in descending static rank (citations, holdings, …), so ascending doc ID = rank. Each posting is split into a head (the 65,536 top-ranked docs) and a tail (the rest); a query ANDs the small heads to get the ranked top-K and only fetches a tail when paginating past it. See docs/format.svg and docs/search.svg.

This is the design trade-off: ranking is baked in up front (no query-time relevance scoring), in exchange for near-constant per-query fetch cost.

How it compares

Inspired by lunr.js and Pagefind — both deliver full-text search to static sites with no backend. lunr loads the whole index into memory; Pagefind pioneered fetching only the index shards a query needs. roaringrange pushes that "fetch only what you query" idea to millions of records by HTTP-Range-reading a single roaring-bitmap index file, trading query-time relevance ranking for a baked-in static rank.

	lunr.js	Pagefind	roaringrange
backend	none	none	none
index transport	whole index in memory	many shard files, per query	one file, HTTP Range
typical scale	hundreds–few thousand docs	up to ~100k+ pages	millions–100M+ records
per-query bytes	0 after full load (load can be MBs+)	~tens–hundreds KB	~KB–few MB (≈ constant)
matching	stemmed terms; wildcard, fuzzy, boosts	stemmed words; partial	trigram substring; fuzzy (tolerate-N)
ranking	TF-IDF / BM25 relevance	BM25-like relevance	static rank (query-independent importance) in doc-ID order — no query-time relevance
facets / filters	fielded search (no facet counts)	filters + facet counts	facets + live counts (sidecar)
build input	JS objects / prebuilt JSON	crawls built HTML pages	any records (Go via roaringsearch, or the Rust builder)
sweet spot	embedding a search library in a JS app (you own the index)	static sites & docs, small to large	very large catalogs & datasets

In short: lunr.js when you want to embed a search library and control indexing in code; Pagefind for static-site search from small to large; roaringrange when you have a lot of records and want a single range-fetched file with static-rank ordering and facets.

Repository layout

path	what
`go/`	core Go module (`github.com/freeeve/roaringrange`): `Transcode` (FTSR→RRS), `Open`/`Index` reference reader, `WriteFacets`, `NgramKeys`
`FORMAT.md`, `FACETS.md`, `RECORDS.md`, `VECTORS.md`	the frozen on-disk specs (`RRSI` index, `RRSF` facet sidecar, `RRSR` record store, `RRVI` vector index)
`rust/`	Rust crate `roaringrange`: reader (`Catalog` over `Index`/`FacetIndex`/`RecordStore`) + native `build` writers; both exposed to WASM (`wasm-pack`). Optional `vector` feature adds the `RRVI` similarity-search reader + IVFPQ trainer
`python/`	PyO3 bindings (`pip install roaringrange`): `Builder` (text index) + `VectorBuilder` (similarity index) over the core build writers
`go/conformance/`	cross-library test: roaringsearch build ⇄ roaringrange(go) read must agree
`examples/openalex/`	the OpenAlex demo: Go loader, parallel Rust `builder/`, `download.sh`, static web UI
`docs/`	architecture diagrams (SVG)

The core Go module has no dependency on roaringsearch — it parses the FTSR byte format directly. The n-gram key derivation is reproduced independently in Go (here), Go (roaringsearch), and Rust (the reader); the go/conformance/ module is the guard that keeps all three byte-compatible.

Quick start

Build an index — two paths to the same files. Assign doc IDs in descending static rank first, so the head holds the top-K.

Rust (direct): split each posting into head/tail, then write the index + an optional facet sidecar + record store:

use roaringrange::build::{write_index, write_facets, write_records, split_posting};
let entries = postings.iter()
    .map(|(k, bm)| { let (h, t) = split_posting(bm); (*k, h, t) })
    .collect();
write_index(rrs_w, 3, 0, entries)?;            // → .rrs
write_facets(rrf_w, facet_fields)?;            // → .rrf  (optional)
write_records(bin_w, idx_w, &records)?;        // → record store (optional)

For a corpus whose index exceeds RAM, build it in doc-ID-range chunks and fold them into one standard .rrs with build::chunk::{write_partial, merge_partials_to_rrs}.

Go (transcode a roaringsearch index):

rr.Transcode(ftsrReader, rrsWriter)             // FTSR → .rrs
rr.WriteFacets(rrfWriter, []rr.FacetField{...}) // → .rrf  (optional)
rr.WriteRecords(binW, idxW, records)            // → record store (optional)

Read it (Rust/WASM): build the reader, then open a Catalog and search:

cd rust && wasm-pack build --target web --features wasm

import init, { RrsCatalog } from "./roaringrange.js";
await init();
const cat = await RrsCatalog.openAll("index.rrs", "index.rrf", "records.idx", "records.bin");
const page = await cat.search("query", 0, 25, 0, '[["type","article"]]');
// page.ids = ranked doc IDs · page.records · page.facetCounts

Catalog/Index/FacetIndex/RecordStore (and RrsIndex/RrsRecords) stay available for advanced use. Host the files on anything that supports HTTP Range (S3 + CloudFront works well) and point the reader at the URLs.

Measured (full corpora, range-fetched)

	library catalog 9.6M	OpenAlex 47.8M (with abstracts)
text index (`.rrs`)	1.4 GB	11.5 GB
boot — index sparse	~52 KB	~0.5 MB
typical query	tens–hundreds of KB	~1–6 MB head+tail (less head-only)
compute	2–14 µs	—

Boot and per-query cost stay ~constant as the corpus grows; size lives in the postings (≈0.4 bytes per trigram-document incidence — roaring is near-optimal), so the lever for a smaller index is indexing less text per doc, not the encoding.

The Rust builder reaches the 47.8M-work OpenAlex corpus that backs the live demo: an 11.5 GB index of 30.3M unique trigrams, built in ~57 min at ~52 GB peak RAM with no swap — and sublinearly (≈half the naive linear projection), as the trigram vocabulary saturates and roaring absorbs the added postings. With facets and records attached, the demo's full boot is ~3 MB: the index sparse, the facet metadata + top-category heads (the facet tails stay range-fetched, not loaded up front), and the record-store header.

Tried and shelved

Experiments the data argued against — recorded so we don't relitigate them.

Inverting common-trigram postings

Idea: a very common trigram has a huge posting, but its complement (the docs that lack it) is small — so store the complement plus a one-bit "inverted" flag and ANDNOT it during intersection, cutting the bytes a query fetches.

Why it didn't pan out: roaring stores each 65,536-doc block as either an array (≤4096 docs, 2 B each) or a flat 8 KB bitmap for any cardinality in (4096, 61440]. A common trigram's complement usually lands in that same band, so it is also an 8 KB bitmap — inversion only shrinks a block denser than ~94% (complement becomes a small array, or empty). Measured on the live 47.8M index (rust/examples/density, e.g. cargo run --release --example density -- "machine learning" "posthuman became"): the hottest trigram, the, is only ~52% dense — OpenAlex lacks an abstract for roughly half its works, so half the corpus is title-only text — and just 0.0–0.6% of posting bytes sit in >94%-dense blocks. Net saving across those queries: ~0.1%. Not worth a format-version bump + reindex.

It would pay on a corpus where common trigrams are near-universal (full text with an abstract on every doc); here it's the partial abstract coverage that flattens the density. The density example re-runs the analysis on any index/query.

Rarest-trigram candidates + verify

Idea: skip the common trigrams' multi-MB postings entirely — seed candidates by intersecting only the rarest trigrams, then verify each against its record's stored text (title + abstract + authors + venue), keeping the true matches.

Why it didn't pan out: client-side, egress is floored by the result-set size. The candidate set can't shrink below the number of results, and verifying that many records costs ~result_count × record_size. Measured on the live 47.8M index (rust/examples/candidates): machine learning (171k results) still has ~195k candidates after seeding the 4 rarest trigrams → ~190 MB of record verification, worse than the 53 MB full intersection. It helps only sparse results (posthuman became, 317 → ~12 MB vs 30 MB), which the lazy tail already gates behind an explicit load. Index::search_candidates and the candidates example stay for re-measuring.

Returning just result IDs needs the intersection to run server-side (a thin Lambda@Edge over in-region postings) — the one lever that beats the result-set floor.

Vector / similarity search (optional)

Beyond trigram text search, the crate has a range-fetchable similarity index in the same ethos: an RRVI (IVFPQ) file whose coarse centroids and PQ codebooks boot once, after which each query range-fetches only the nprobe nearest clusters' codes and scans them with asymmetric distance computation — top-k nearest vectors in ~constant bytes, independent of corpus size. vector_id == doc_id, so a hit reuses the same record store and can hybridize with the trigram result set.

Build it from Rust (build_ivfpq, behind the vector feature) or Python (VectorBuilder), or train at scale with FAISS and export the same layout (build_ivfpq_from_parts / roaringrange.write_rrvi_from_faiss, verified against the reader at recall@10 ≈ 0.9995). The reader VectorIndex is pure Rust with a browser binding (RrviIndex, wasm-pack build --features "wasm vector"). See VECTORS.md. Still in progress: the query embedder (in-browser model2vec / a Lambda proxy) and a trigram hybrid (tasks/004_vector_search).

Development

Enable the formatting pre-commit hook (runs gofmt + cargo fmt --check on staged changes, matching CI):

git config core.hooksPath .githooks

CI runs go test ./... in go/, the go/conformance/ module, go vet on the example, cargo test + fmt + clippy for the reader (a second pass with --features vector), and builds + tests the Python extension on CPython 3.12–3.14.

License

MIT — see LICENSE.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

roaringrange

Live demos

How it fits together

How it compares

Repository layout

Quick start

Measured (full corpora, range-fetched)

Tried and shelved

Inverting common-trigram postings

Rarest-trigram candidates + verify

Vector / similarity search (optional)

Development

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 138 Commits
.githooks		.githooks
.github/workflows		.github/workflows
docs		docs
examples		examples
go		go
python		python
rust		rust
tasks		tasks
.gitignore		.gitignore
FACETS.md		FACETS.md
FORMAT.md		FORMAT.md
HOTCACHE.md		HOTCACHE.md
LICENSE		LICENSE
README.md		README.md
RECORDS.md		RECORDS.md
SORTCOLS.md		SORTCOLS.md
TERMS.md		TERMS.md
VECTORS.md		VECTORS.md

Folders and files

Latest commit

History

Repository files navigation

roaringrange

Live demos

How it fits together

How it compares

Repository layout

Quick start

Measured (full corpora, range-fetched)

Tried and shelved

Inverting common-trigram postings

Rarest-trigram candidates + verify

Vector / similarity search (optional)

Development

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages