aosingh/lexrs

lexrs

A Rust library implementing two efficient lexicon data structures — Trie and DAWG (Directed Acyclic Word Graph) — with an optional Python binding via PyO3 and Maturin.

lexrs is the successor to lexpy (pure Python). It exposes the same API while delivering 10–100× faster insertion and search by moving the core data structures to Rust.

Data structures

Trie

A standard prefix tree using arena allocation (Vec<Node>). Supports insertion in any order.
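As an illustrative sketch (not the lexrs internals), an arena-allocated trie can be mimicked in Python with a flat list of nodes and child references stored as indices, mirroring Rust's Vec<Node>:

```python
# Sketch of an arena-allocated trie: every node lives in one flat list
# (the "arena"), and children are referenced by index, not by pointer.

class ArenaTrie:
    def __init__(self):
        # Each node: {"children": {char: index}, "count": frequency or None}
        self.nodes = [{"children": {}, "count": None}]  # index 0 is the root

    def add(self, word, count=1):
        idx = 0
        for ch in word:
            children = self.nodes[idx]["children"]
            if ch not in children:
                self.nodes.append({"children": {}, "count": None})
                children[ch] = len(self.nodes) - 1
            idx = children[ch]
        self.nodes[idx]["count"] = count  # non-None count marks end of word

    def contains(self, word):
        idx = 0
        for ch in word:
            idx = self.nodes[idx]["children"].get(ch)
            if idx is None:
                return False
        return self.nodes[idx]["count"] is not None
```

Keeping all nodes in one contiguous allocation avoids per-node heap allocations and pointer-chasing, which is part of why the Rust core outperforms a pointer-based pure-Python trie.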

DAWG

A minimized Trie that compresses shared suffixes in addition to shared prefixes, resulting in a significantly smaller node count for large lexicons. Words must be inserted in lexicographic (alphabetical) order; call reduce() after all insertions to finalize minimization.
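The node savings come from merging structurally identical subtrees. A minimal sketch of that idea (bottom-up minimization of a plain dict-based trie, not the lexrs implementation, which minimizes incrementally):

```python
# Sketch of suffix sharing: build a trie, then merge structurally
# identical subtrees so shared suffixes are stored once.

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = {}  # end-of-word marker
    return root

def minimize(node, registry):
    # Canonicalize children first, then this node, keyed by its structure.
    for ch, child in node.items():
        node[ch] = minimize(child, registry)
    sig = tuple(sorted((ch, id(child)) for ch, child in node.items()))
    return registry.setdefault(sig, node)

def count_nodes(node, seen):
    if id(node) in seen:
        return
    seen.add(id(node))
    for child in node.values():
        count_nodes(child, seen)
```

For a list like ["tap", "taps", "top", "tops"], the shared "p", "ps" suffix subtrees collapse into one, so the minimized graph has far fewer nodes than the trie while accepting exactly the same words.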

Features

Both Trie and DAWG expose the same API:

| Method | Description |
| --- | --- |
| add(word, count) | Insert a word with an optional frequency count |
| add_all(words) | Insert from any iterable |
| add_from_file(path) | Insert words from a file (one per line) |
| contains(word) | Exact membership test |
| contains_prefix(prefix) | Check if any word starts with the given prefix |
| search(pattern) | Wildcard search (* = zero or more chars, ? = exactly one) |
| search_with_count(pattern) | Like search, but returns (word, count) pairs |
| search_with_prefix(prefix) | All words beginning with the prefix |
| search_with_prefix_count(prefix) | Like above, with counts |
| search_within_distance(word, dist) | Levenshtein fuzzy search |
| search_within_distance_count(word, dist) | Like above, with counts |
| word_count() | Number of words stored |
| node_count() | Number of nodes in the structure |
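The fuzzy-search methods are built on a bounded edit-distance test. A standalone sketch of that check (the classic Levenshtein DP row with early exit; lexrs walks the trie with such rows rather than comparing whole words, so this is only the underlying idea):

```python
# Sketch of a bounded Levenshtein check: fill the DP row by row and
# stop early once every cell exceeds the allowed distance.

def within_distance(a, b, dist):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        if min(curr) > dist:
            return False  # the bound can no longer be met
        prev = curr
    return prev[-1] <= dist
```

The early exit is what makes trie-based fuzzy search fast in practice: entire subtrees are pruned as soon as the minimum of the current row exceeds the distance bound.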

Rust usage

# Cargo.toml
[dependencies]
lexrs = { path = "." }
use lexrs::{Trie, Dawg};

// Trie
let mut trie = Trie::new();
trie.add("hello", 1).unwrap();
trie.add_all(vec!["world".to_string(), "foo".to_string()]).unwrap();

assert!(trie.contains("hello"));
assert!(trie.contains_prefix("wor"));

let results = trie.search("h*").unwrap();       // wildcard
let fuzzy   = trie.search_within_distance("helo", 1); // Levenshtein ≤ 1

// DAWG (words must be inserted in alphabetical order)
let mut dawg = Dawg::new();
dawg.add_all(vec!["apple".to_string(), "apply".to_string(), "apt".to_string()]).unwrap();
// add_all sorts its input automatically; reduce() finalizes minimization
dawg.reduce();

assert!(dawg.contains("apple"));

Python usage

The library is also available as the lexrs Python package, built with Maturin.

Build & install

# Install Maturin (once)
pip install maturin

# Build and install into the current virtualenv
maturin develop --features python

API

from lexrs import Trie, DAWG

# ── Trie ──────────────────────────────────────────────────────────────────────
t = Trie()
t.add("hello", 5)          # word + optional count
t.add_all(["world", "foo"])
t.add_from_file("words.txt")

"hello" in t               # True
t.contains_prefix("wor")   # True
t.get_word_count()         # total words
len(t)                     # total nodes

t.search("h*")                          # wildcard, returns list of words
t.search("h*", with_count=True)         # returns list of (word, count)
t.search_with_prefix("wo")              # prefix completion
t.search_with_prefix_count("wo")        # with counts
t.search_within_distance("helo", 1)     # fuzzy (Levenshtein ≤ 1)
t.search_within_distance("helo", 1, with_count=True)

# ── DAWG ──────────────────────────────────────────────────────────────────────
d = DAWG()
d.add_all(["apple", "apply", "apt"])   # sorted automatically

"apple" in d               # True
d.search("ap*")            # wildcard
d.search_within_distance("aple", dist=1)

Wildcard syntax

| Pattern | Meaning |
| --- | --- |
| * | Zero or more characters |
| ? | Exactly one character |
| h* | All words starting with h |
| ?at | Three-letter words ending in at |
| a?* | Words of two or more characters starting with a |

Runs of consecutive wildcards are normalized before matching (e.g. *** collapses to *, and ?** to ?*).
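The semantics above can be sketched by normalizing the pattern and translating it to an anchored regular expression (an illustrative equivalent, not the lexrs matcher, which walks the trie directly):

```python
import re

def normalize(pattern):
    # Collapse runs of '*': '***' -> '*', '?**' -> '?*'.
    out = []
    for ch in pattern:
        if ch == "*" and out and out[-1] == "*":
            continue
        out.append(ch)
    return "".join(out)

def matches(pattern, word):
    # '*' -> '.*' (zero or more chars), '?' -> '.' (exactly one char),
    # everything else matched literally; fullmatch anchors both ends.
    regex = "".join(".*" if c == "*" else "." if c == "?" else re.escape(c)
                    for c in normalize(pattern))
    return re.fullmatch(regex, word) is not None
```

Walking the trie with the pattern instead of testing every word lets whole subtrees be skipped as soon as a literal character fails to match.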

Production HTTP server

lexrs ships a production-ready HTTP service in lexrs-server/ with two binaries:

| Binary | Role |
| --- | --- |
| writer | Accepts word ingestion (POST /ingest), buffers a delta Trie in memory, and periodically compacts it into a versioned snapshot on disk. |
| reader | Loads the latest snapshot into a DAWG and serves all search queries. Multiple readers can run as replicas and hot-reload new snapshots without downtime. |

Consul is used for snapshot version coordination — the writer publishes new snapshot versions to the Consul KV store and readers watch for changes via blocking queries, atomically swapping the in-memory DAWG when a new snapshot arrives.
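The reader's watch-and-swap loop follows a common pattern, sketched here in miniature. This is not the lexrs-server code: poll_version stands in for a Consul KV blocking query, and load_snapshot stands in for deserializing a DAWG snapshot from disk.

```python
import threading

# Sketch of hot reloading: watch a version key, and when it changes,
# build the new index off to the side, then swap it in atomically so
# in-flight queries never see a half-loaded structure.

class HotReloader:
    def __init__(self, poll_version, load_snapshot):
        self.poll_version = poll_version    # blocking-query stand-in
        self.load_snapshot = load_snapshot  # snapshot-loader stand-in
        self.version = None
        self.index = None
        self._lock = threading.Lock()

    def check_once(self):
        new_version = self.poll_version(self.version)
        if new_version != self.version:
            new_index = self.load_snapshot(new_version)  # expensive, off-lock
            with self._lock:                             # cheap, atomic swap
                self.index, self.version = new_index, new_version

    def current(self):
        with self._lock:
            return self.index
```

The key design point is that the expensive load happens outside the lock; only the pointer swap is synchronized, so readers serve queries from the old snapshot until the new one is fully ready.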

Docker Compose — full stack in one command

The docker/ directory has a Compose file that brings up the entire stack:

Consul ──► writer ──► (snapshots on shared volume)
                              │
              ┌───────────────┘
              ▼
         reader-1   reader-2
              │         │
              └────┬────┘
                   ▼
                 nginx  (port 80)
                 /ingest  → writer
                 /search  → readers (round-robin)

cd docker
docker compose up -d
# ingest words
curl -s -X POST http://localhost/ingest \
  -H 'Content-Type: application/json' \
  -d '["apple", "apply", {"word": "application", "count": 5}]'

# prefix search
curl -s 'http://localhost/search?prefix=app'

# wildcard search
curl -s 'http://localhost/search?pattern=app*'

# fuzzy search (Levenshtein distance ≤ 1)
curl -s 'http://localhost/search?word=aple&dist=1'
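Note that the ingest payload mixes bare strings with {"word": ..., "count": ...} objects. A handler might normalize both forms before inserting into the delta Trie; the sketch below is illustrative (the default count of 1 for bare strings is an assumption, not documented server behavior):

```python
import json

# Normalize a mixed ingest payload into uniform (word, count) pairs.
def parse_ingest(body):
    pairs = []
    for entry in json.loads(body):
        if isinstance(entry, str):
            pairs.append((entry, 1))                      # bare string
        else:
            pairs.append((entry["word"], entry.get("count", 1)))
    return pairs
```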

The writer and reader are built from a single Dockerfile — the command: field in Compose selects which binary to start. See lexrs-server/README.md for the full API reference and configuration options.

Running tests

# Rust unit tests
cargo test

# Python tests (requires the package to be built first)
maturin develop --features python
pytest tests/

Project structure

src/
  lib.rs      — public re-exports and Python module registration
  trie.rs     — Trie implementation + wildcard / Levenshtein helpers
  dawg.rs     — DAWG implementation
  node.rs     — shared Node type (arena-allocated)
  utils.rs    — file I/O and pattern normalization
  error.rs    — LexError enum
tests/
  test_python_api.py — pytest suite for the Python bindings

Related components

| Directory | Description |
| --- | --- |
| lexrs-server/ | Production HTTP server — a writer binary for word ingestion and compaction, and a reader binary for search. Readers scale horizontally and hot-reload snapshots via Consul. |
| docker/ | Docker Compose setup running the full stack: Consul, writer, two reader replicas, and an nginx reverse proxy that routes reads and writes to the right service. |
| lexpy-shim/ | Source for lexpy==2.x — a one-file compatibility shim that re-exports lexrs so existing lexpy users can upgrade without changing their imports. |
| benchmarks/ | Python scripts comparing lexrs against lexpy (pure Python) across insertion, prefix, wildcard, and Levenshtein workloads. |
| tests/ | Integration tests — pytest suite for the Python API and Rust-level integration tests. |

License

MIT
