PostgreSQL Search Benchmarks (Playground)

This project is a playground for comparing different PostgreSQL search strategies:

- Full‑text search with BM25 ranking
- Dense vector similarity (cosine, L2, L1, inner product) using pgvector
- Binary vector similarity (Hamming / Jaccard) built from the same embeddings

Prerequisites

- Elixir ≥ 1.18, Erlang/OTP ≥ 27
- PostgreSQL 16+ with the pgvector extension (vector) installed

Setup

mix deps.get
mix ecto.create
mix ecto.migrate

These commands install dependencies, create the database, and run migrations. If you pull updates that add new SQL functions (e.g. the binary distance helpers), run mix ecto.migrate again to ensure your database has them.

Embedding model

Embeddings are generated with thenlper/gte-small using Bumblebee.
The loader fetches the model from Hugging Face ({:hf, "thenlper/gte-small"}) and caches it under ~/.cache/bumblebee/.

Loading markdown data

Put your .md files in priv/data/. To ingest them:

mix hybrid_search.load
# or specify a different directory / glob
mix hybrid_search.load --dir /path/to/md --glob "**/*.md"

During loading we:

- Split each markdown file into paragraph-sized chunks.
- Generate dense embeddings with Bumblebee (vector length 384).
- Convert each dense vector to a binary 0/1 string for Hamming/Jaccard search.
- Upsert everything into dataset_chunks.
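
The chunking and binarization steps above can be sketched in plain Elixir. This is only an illustration, not the project's loader: the module name LoaderSketch, the blank-line splitting rule, and the "positive component becomes 1" threshold are all assumptions, since the README does not say how the real loader splits text or picks its threshold.

```elixir
defmodule LoaderSketch do
  # Hypothetical helper: split markdown into paragraph-sized chunks,
  # treating one or more blank lines as the paragraph separator.
  def split_paragraphs(markdown) when is_binary(markdown) do
    markdown
    |> String.split(~r/\n\s*\n/, trim: true)
    |> Enum.map(&String.trim/1)
    |> Enum.reject(&(&1 == ""))
  end

  # Hypothetical helper: turn a dense vector into a 0/1 string.
  # Assumed rule: components greater than zero map to "1", all others to "0".
  def binarize(vector) when is_list(vector) do
    Enum.map_join(vector, fn x -> if x > 0, do: "1", else: "0" end)
  end
end
```

With a 384-dimensional embedding, binarize/1 would yield a 384-character string, which matches the shape a text column of 0/1 characters would need.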

You can also call the loader from IEx:

{:ok, stats} = HybridSearch.load_dataset()

Running searches

Start an IEx session with the application loaded:

iex -S mix

Example helpers (all functions live under HybridSearch):

# Full text (BM25)
HybridSearch.search_bm25("liveview")

# Dense vector searches
HybridSearch.search_embeddings("liveview")        # cosine distance
HybridSearch.search_embeddings_l2("liveview")     # L2 / Euclidean
HybridSearch.search_embeddings_l1("liveview")     # L1 / Manhattan
HybridSearch.search_embeddings_dot("liveview")    # inner product (larger is better)

# Binary similarity
HybridSearch.search_embeddings_hamming("liveview")   # Hamming distance over bitstrings
HybridSearch.search_embeddings_jaccard("liveview")   # Jaccard distance over bitstrings
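
For reference, the four dense metrics named above can be written out on plain lists. This is a sketch of the textbook definitions only; the module DenseMetrics is hypothetical and is not how the project computes distances (that happens inside PostgreSQL via pgvector). Note that pgvector's inner product operator returns the negated product so that smaller values sort first, whereas the function below returns the raw product.

```elixir
defmodule DenseMetrics do
  # Raw inner product: larger means more similar.
  def dot(a, b) do
    a |> Enum.zip(b) |> Enum.reduce(0.0, fn {x, y}, acc -> acc + x * y end)
  end

  # Euclidean length of a vector.
  def norm(a), do: :math.sqrt(dot(a, a))

  # Cosine distance: 1 - cosine similarity (0.0 for identical directions).
  # Undefined for zero vectors; callers are assumed to pass non-zero input.
  def cosine_distance(a, b), do: 1.0 - dot(a, b) / (norm(a) * norm(b))

  # L2 / Euclidean distance.
  def l2_distance(a, b) do
    a
    |> Enum.zip(b)
    |> Enum.reduce(0.0, fn {x, y}, acc -> acc + (x - y) * (x - y) end)
    |> :math.sqrt()
  end

  # L1 / Manhattan distance.
  def l1_distance(a, b) do
    a |> Enum.zip(b) |> Enum.reduce(0.0, fn {x, y}, acc -> acc + abs(x - y) end)
  end
end
```

For the three distance functions, a smaller value means a closer match; only the inner product is ordered the other way round.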

Each search returns a list of maps containing:

- :chunk – the %HybridSearch.Datasets.DatasetChunk{} row
- :distance or :score – the metric used for ordering
- Additional fields such as :headline for BM25

Notes on benchmarking

- All dense searches rely on the embedding column (a pgvector vector with 384 dimensions).
- Binary searches use the embedding_binary text column and custom SQL functions:
  - binary_hamming_distance/2
  - binary_jaccard_distance/2
- IVFFlat indexes are created where supported. If your pgvector build does not include the optional operator classes (e.g. vector_l1_ops, bit_hamming_ops), the migration skips those indexes gracefully.
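
To make the two binary metrics concrete, here is a minimal Elixir sketch of Hamming and Jaccard distance over 0/1 strings. It illustrates the standard definitions and is not a port of the project's SQL functions; the module name BinaryMetrics and the choice to return 0.0 when both strings contain no 1 bits are assumptions.

```elixir
defmodule BinaryMetrics do
  # Hamming distance: number of positions where the two strings differ.
  def hamming(a, b) when byte_size(a) == byte_size(b) do
    a
    |> String.to_charlist()
    |> Enum.zip(String.to_charlist(b))
    |> Enum.count(fn {x, y} -> x != y end)
  end

  # Jaccard distance: 1 - |A and B| / |A or B|, counting positions set to "1".
  def jaccard(a, b) when byte_size(a) == byte_size(b) do
    pairs = Enum.zip(String.to_charlist(a), String.to_charlist(b))
    both = Enum.count(pairs, fn {x, y} -> x == ?1 and y == ?1 end)
    either = Enum.count(pairs, fn {x, y} -> x == ?1 or y == ?1 end)

    # Assumed convention: two all-zero strings are treated as identical.
    if either == 0, do: 0.0, else: 1.0 - both / either
  end
end
```

Hamming distance is an integer between 0 and the string length, while Jaccard distance is a float between 0.0 and 1.0, so the two are not directly comparable as scores.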

Benchmarks

Compare search runtimes with Benchee:

mix bench.search --query "phoenix" --limit 5

Available flags:

- --query – text to search (default: "phoenix")
- --limit – number of results per run (default: 5)
- --time – seconds spent benchmarking each scenario (default: 1.0)
- --warmup – warmup time in seconds (default: 0.5)
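
The flags and defaults listed above could be parsed with Elixir's standard OptionParser. The sketch below is an assumption about how such a task might handle its arguments, not the project's actual implementation; the module name BenchArgsSketch is hypothetical.

```elixir
defmodule BenchArgsSketch do
  # Defaults taken from the flag list in this README.
  @defaults [query: "phoenix", limit: 5, time: 1.0, warmup: 0.5]

  # Expected type for each switch; unknown switches are reported as invalid.
  @switches [query: :string, limit: :integer, time: :float, warmup: :float]

  # Returns a keyword list with every flag present, user values overriding defaults.
  def parse(argv) when is_list(argv) do
    {opts, _rest, _invalid} = OptionParser.parse(argv, strict: @switches)
    Keyword.merge(@defaults, opts)
  end
end
```

For example, BenchArgsSketch.parse(["--query", "liveview", "--limit", "10"]) keeps the default :time and :warmup values while overriding :query and :limit.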

Run mix hybrid_search.load beforehand so the benchmark has data to work with. The benchmark also needs to generate an embedding for the query text; the first run will download the model from Hugging Face (internet connection required). If the embedding is not available, the task reports the failure and only the BM25 scenario is executed.

Development

mix test          # run the test suite
mix precommit     # compile w/ warnings-as-errors, format, tests

License

This project is released under the MIT License. See LICENSE for details.

elchemista/postgres_rag_bench
