ngi — N-gram Indexed Regex Search

Fast regex search over large codebases. Builds a trigram index to pre-filter files, then delegates matching to ripgrep for SIMD-accelerated regex.

On the Linux kernel benchmarks below, ngi is 13–69× faster than grep and roughly 2–6× faster than ripgrep on selective queries.

How it works

regex pattern
  → extract trigrams from the pattern
  → intersect posting lists (mmap'd binary search)
  → if <10% of files match → pass candidates to rg
  → if >10% of files match → let rg scan everything (its walker is faster)
  → results

The index is built once and updated incrementally. First search auto-builds it.
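As a rough illustration of the 10% cutoff in the pipeline above, the decision can be sketched in Rust. The `Strategy` enum and `choose_strategy` function are hypothetical names, not ngi's actual code. Sending exactly 10% to the full scan is an assumption of this sketch, since the pipeline only describes the two sides of the threshold.

```rust
/// Which search path to take after the index lookup.
#[derive(Debug, PartialEq)]
enum Strategy {
    /// Hand only the candidate files to rg.
    Candidates,
    /// Let rg walk and scan the whole tree.
    FullScan,
}

/// Hypothetical helper: pick a strategy from the candidate ratio.
/// The 10% cutoff comes from the README; treating exactly 10% (and an
/// empty file set) as a full scan is an assumption of this sketch.
fn choose_strategy(candidates: usize, total_files: usize) -> Strategy {
    // candidates / total_files < 1/10, kept in integer arithmetic
    // so that no floating-point rounding is involved.
    if total_files > 0 && candidates * 10 < total_files {
        Strategy::Candidates
    } else {
        Strategy::FullScan
    }
}
```

With the kernel numbers quoted later in this README (526 candidates out of 92,916 files), `choose_strategy(526, 92_916)` takes the candidate path.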

Benchmarks

Linux kernel (92,916 files)

| Pattern | ngi | rg | grep | ngi vs rg | ngi vs grep |
|---|---|---|---|---|---|
| `__attribute__.*section` | 27ms | 161ms | 1866ms | 5.9× | 69× |
| `dma_alloc_coherent` | 42ms | 161ms | 1646ms | 3.8× | 39× |
| `struct file_operations` | 71ms | 160ms | 1642ms | 2.3× | 23× |
| `EXPORT_SYMBOL_GPL` | 90ms | 174ms | 1607ms | 1.9× | 17× |
| `mutex_lock` | 95ms | 162ms | 1314ms | 1.7× | 13× |

Correctness: 100% match with ripgrep across all tested queries (13/13 exact).

Index overhead

| Codebase | Files | Index size | Build time |
|---|---|---|---|
| Small project (500 files) | 500 | ~1 MB | <300ms |
| CPython | 5,354 | 13 MB | 1.9s |
| Linux kernel | 92,916 | 146 MB | 16s |

Install

# From GitHub
cargo install --git https://github.com/erogol/ngi

# From source
git clone https://github.com/erogol/ngi && cd ngi
cargo build --release
cp target/release/ngi /usr/local/bin/

Requires the Rust toolchain. Optional: ripgrep on $PATH for best performance; if it is unavailable, ngi falls back to the built-in Rust regex engine.

Usage

# Just search — index is built automatically on first run
ngi search 'fn.*parse'

# Explicit index management
ngi index                      # Build/rebuild index
ngi index --force              # Force full rebuild
ngi index --max-file-size 50M  # Include files up to 50MB

# Search options
ngi search 'pattern'              # Regex search
ngi search -i 'pattern'           # Case-insensitive
ngi search -l 'pattern'           # File names only
ngi search -f '\.rs$' 'pattern'   # Filter files by path regex (here: .rs)
ngi search -C 3 'pattern'         # 3 lines of context around matches
ngi search -A 2 -B 1 'pattern'    # 2 after, 1 before
ngi search -m 10 'pattern'        # Stop after 10 matches
ngi search --json 'pattern'       # JSONL output (machine-readable)
ngi search --json -C 2 -m 5 'fn'  # All flags compose
ngi search --no-index 'pattern'   # Skip index, search all files
ngi search --explain 'pattern'    # Show query plan

# Maintenance
ngi status                     # Show index stats
ngi clean                      # Remove index

How the index works

ngi extracts all 3-byte substrings (trigrams) from every file and builds an inverted index mapping each trigram to the files containing it.
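To make the inverted index concrete, here is a minimal in-memory sketch. The `build_index` function, the `Trigram` alias, and the numeric file ids are hypothetical; ngi's real index is an mmap'd on-disk structure whose layout is not described here.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// A trigram is any run of three consecutive bytes.
type Trigram = [u8; 3];

/// Build an in-memory inverted index: trigram -> sorted list of file ids.
/// `files` holds (file id, contents) pairs. Both the ids and this function
/// are illustrative assumptions, not ngi's on-disk format.
fn build_index(files: &[(u32, &[u8])]) -> BTreeMap<Trigram, Vec<u32>> {
    let mut sets: BTreeMap<Trigram, BTreeSet<u32>> = BTreeMap::new();
    for &(id, contents) in files {
        // windows(3) yields every overlapping 3-byte substring;
        // files shorter than three bytes contribute nothing.
        for w in contents.windows(3) {
            sets.entry([w[0], w[1], w[2]]).or_default().insert(id);
        }
    }
    // Turn each set into a sorted, duplicate-free posting list.
    sets.into_iter()
        .map(|(tri, ids)| (tri, ids.into_iter().collect()))
        .collect()
}
```

Keeping each posting list sorted is what makes the later intersection step cheap.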

When you search for `fn.*parse`:

1. Extract trigrams from the regex: `fn `, `par`, `ars`, `rse`
2. Look up each trigram's file list in the index (binary search on mmap'd data)
3. Intersect the lists → only files containing ALL trigrams survive
4. Run ripgrep on the survivors for the actual regex match

For the pattern `__attribute__.*section` in the Linux kernel, this narrows 92,916 files down to 526 candidates (0.6%) before rg even starts.
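The lookup and intersection steps above can be sketched as follows. This is an illustration under stated assumptions, not ngi's implementation: the caller passes in the literal fragments a match must contain instead of parsing the regex, a `BTreeMap` stands in for the mmap'd index, and the names `query_trigrams`, `intersect`, and `candidates` are hypothetical. The sketch derives trigrams only from literals of at least three bytes, so for fn.*parse it yields par, ars, rse; ngi's own extractor may derive a different set.

```rust
use std::cmp::Ordering;
use std::collections::BTreeMap;

type Trigram = [u8; 3];

/// Trigrams of each literal fragment that every match must contain.
/// Splitting a regex into required literals is out of scope here, so the
/// caller supplies them (e.g. ["fn", "parse"] for fn.*parse).
fn query_trigrams(literals: &[&str]) -> Vec<Trigram> {
    let mut out: Vec<Trigram> = literals
        .iter()
        .flat_map(|lit| lit.as_bytes().windows(3))
        .map(|w| [w[0], w[1], w[2]])
        .collect();
    out.sort();
    out.dedup();
    out
}

/// Intersect two sorted, duplicate-free posting lists with a linear merge.
fn intersect(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}

/// Files that contain every query trigram. Returns None when the query has
/// no trigrams, meaning the index cannot narrow the search at all.
fn candidates(index: &BTreeMap<Trigram, Vec<u32>>, trigrams: &[Trigram]) -> Option<Vec<u32>> {
    let mut iter = trigrams.iter();
    let first = iter.next()?;
    // A trigram absent from the index means no file can match.
    let mut acc = index.get(first).cloned().unwrap_or_default();
    for tri in iter {
        let list = index.get(tri).map(Vec::as_slice).unwrap_or(&[]);
        acc = intersect(&acc, list);
    }
    Some(acc)
}
```

The `None` case matters in practice: a pattern with no usable trigrams gives the index nothing to filter on, so a caller would have to fall back to scanning every file.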

Auto-index

The first time you run ngi search in a project, it automatically builds the index. Subsequent searches detect file changes and incrementally reindex.

The index lives in .ngi/ at the project root (detected via .git/). Add .ngi/ to your .gitignore.
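One plausible way to locate the project root is to walk up from the current directory until a `.git` entry appears. The sketch below assumes that approach; `find_project_root` and `index_dir` are hypothetical helpers, and ngi's own detection logic may differ.

```rust
use std::path::{Path, PathBuf};

/// Walk up from `start` (inclusive) until a directory containing `.git`
/// is found. Hypothetical helper; ngi's detection may work differently.
fn find_project_root(start: &Path) -> Option<PathBuf> {
    start
        .ancestors()
        .find(|dir| dir.join(".git").exists())
        .map(Path::to_path_buf)
}

/// Where the index directory would live for a given project root.
fn index_dir(root: &Path) -> PathBuf {
    root.join(".ngi")
}
```

Because `ancestors()` starts with `start` itself, running from the root or from any nested subdirectory resolves to the same `.ngi/` location.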

Ignoring files

ngi respects .gitignore and skips common non-source directories (node_modules, __pycache__, .venv, etc.). For custom exclusions, create a .ngiignore file with the same syntax as .gitignore.

Machine-readable output

--json produces JSONL (one JSON object per line):

{"type":"match","path":"src/main.rs","line_number":42,"line_text":"fn parse()","context_before":["// comment"],"context_after":["  let x = 1;"]}
{"type":"summary","match_count":15,"file_count":3,"files_searched":100,"total_files":5000,"duration_ms":45,"mode":"indexed"}

Context arrays are empty unless -C/-A/-B is used. Modes: indexed, rg-fullscan, full-scan, no-index.
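For consumers of the JSONL stream, the following std-only sketch separates match records from the trailing summary. It relies on the compact serialization shown in the sample above (no whitespace around the colon), which is an assumption; a real consumer would more likely use a proper JSON parser. `record_type` and `tally` are hypothetical names.

```rust
/// Extract the value of the "type" field from one JSONL record.
/// This is a plain string scan that assumes the compact form
/// `"type":"..."`; it is not a general JSON parser.
fn record_type(line: &str) -> Option<&str> {
    const KEY: &str = "\"type\":\"";
    let start = line.find(KEY)? + KEY.len();
    let len = line[start..].find('"')?;
    Some(&line[start..start + len])
}

/// Count match records in a JSONL stream and report whether a
/// summary record was seen.
fn tally(output: &str) -> (usize, bool) {
    let mut matches = 0;
    let mut saw_summary = false;
    for line in output.lines() {
        match record_type(line) {
            Some("match") => matches += 1,
            Some("summary") => saw_summary = true,
            _ => {} // blank or unrecognized line
        }
    }
    (matches, saw_summary)
}
```

The scan is safe against pattern text inside `line_text` because quotes inside a JSON string are escaped, so the unescaped sequence `"type":"` can only occur at a real key.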

Inspired by

Cursor's blog post on fast code search, which describes sparse n-gram indexing. ngi implements only the trigram subset of that approach: on CPython, sparse n-grams produced 12M+ unique entries versus 181K trigrams, making the index impractically large for a marginal selectivity gain.

License

MIT
