This project is a playground for comparing different PostgreSQL search strategies:

- Full‑text search with BM25 ranking
- Dense vector similarity (cosine, L2, L1, inner product) using pgvector
- Binary vector similarity (Hamming / Jaccard) built from the same embeddings
- Elixir ≥ 1.18, Erlang/OTP ≥ 27
- PostgreSQL 16+ with the `vector` extension installed
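If the extension is not yet enabled in your database, it can be created once per database (a standard pgvector setup step, shown here for convenience):

```sql
-- Enable the pgvector extension (run once per database, requires superuser
-- or a role with CREATE privilege on the database).
CREATE EXTENSION IF NOT EXISTS vector;
```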
```shell
mix deps.get
mix ecto.create
mix ecto.migrate
```

These commands install dependencies, create the database, and run migrations. If you
pull updates that add new SQL functions (e.g. the binary distance helpers), run
`mix ecto.migrate` again to ensure your database has them.
Embeddings are generated with `thenlper/gte-small` using Bumblebee.
The loader fetches the model from Hugging Face (`{:hf, "thenlper/gte-small"}`) and caches it under `~/.cache/bumblebee/`.
Put your `.md` files in `priv/data/`. To ingest them:

```shell
mix hybrid_search.load
# or specify a different directory / glob
mix hybrid_search.load --dir /path/to/md --glob "**/*.md"
```

During loading we:
- Split each markdown file into paragraph-sized chunks.
- Generate dense embeddings with Bumblebee (vector length 384).
- Convert the dense vector to a binary 0/1 string for Hamming/Jaccard search.
- Upsert everything into `dataset_chunks`.
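The binary conversion step can be sketched as sign-thresholding each component (an assumed rule for illustration; the project's actual thresholding may differ, and `BinarizeSketch` is a hypothetical module name):

```elixir
defmodule BinarizeSketch do
  # Map each embedding component to "1" when positive, "0" otherwise,
  # producing the 0/1 string stored alongside the dense vector.
  def to_bitstring(embedding) when is_list(embedding) do
    embedding
    |> Enum.map(fn x -> if x > 0.0, do: ?1, else: ?0 end)
    |> List.to_string()
  end
end

BinarizeSketch.to_bitstring([0.12, -0.5, 0.0, 0.8])
#=> "1001"
```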
You can also call the loader from IEx:

```elixir
{:ok, stats} = HybridSearch.load_dataset()
```

Start an iex session with the application loaded:

```shell
iex -S mix
```

Example helpers (all functions live under `HybridSearch`):

```elixir
# Full text (BM25)
HybridSearch.search_bm25("liveview")

# Dense vector searches
HybridSearch.search_embeddings("liveview")      # cosine distance
HybridSearch.search_embeddings_l2("liveview")   # L2 / Euclidean
HybridSearch.search_embeddings_l1("liveview")   # L1 / Manhattan
HybridSearch.search_embeddings_dot("liveview")  # inner product (larger is better)

# Binary similarity
HybridSearch.search_embeddings_hamming("liveview")  # Hamming distance over bitstrings
HybridSearch.search_embeddings_jaccard("liveview")  # Jaccard distance over bitstrings
```

Each search returns a list of maps containing:
- `:chunk` – the `%HybridSearch.Datasets.DatasetChunk{}` row
- `:distance` or `:score` – the metric used for ordering
- Additional fields such as `:headline` for BM25
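A sketch of consuming such a result list (the maps below are illustrative stand-ins, not real database rows):

```elixir
# Hypothetical results, shaped like the maps described above.
results = [
  %{chunk: %{content: "LiveView forms"}, distance: 0.31},
  %{chunk: %{content: "LiveView basics"}, distance: 0.12}
]

# Distance metrics order ascending: smaller means closer.
best = Enum.min_by(results, & &1.distance)
best.chunk.content
#=> "LiveView basics"
```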
- All dense searches rely on the `embedding` column (pgvector, 384 dimensions).
- Binary searches use the `embedding_binary` text column and custom SQL functions: `binary_hamming_distance/2`, `binary_jaccard_distance/2`.
- IVFFlat indexes are created where supported. If your pgvector build does not include the optional operator classes (e.g. `vector_l1_ops`, `bit_hamming_ops`), the migration skips those indexes gracefully.
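For intuition, the two binary metrics can be sketched over 0/1 strings in plain Elixir (illustrative only; the real queries use the SQL helpers above, and `BinaryDistanceSketch` is a hypothetical name):

```elixir
defmodule BinaryDistanceSketch do
  # Hamming distance: count of positions where the bits differ.
  def hamming(a, b) when byte_size(a) == byte_size(b) do
    Enum.zip(String.to_charlist(a), String.to_charlist(b))
    |> Enum.count(fn {x, y} -> x != y end)
  end

  # Jaccard distance: 1 - |intersection| / |union| over the set bits.
  def jaccard(a, b) when byte_size(a) == byte_size(b) do
    pairs = Enum.zip(String.to_charlist(a), String.to_charlist(b))
    inter = Enum.count(pairs, fn {x, y} -> x == ?1 and y == ?1 end)
    union = Enum.count(pairs, fn {x, y} -> x == ?1 or y == ?1 end)
    if union == 0, do: 0.0, else: 1.0 - inter / union
  end
end

BinaryDistanceSketch.hamming("1010", "1001")
#=> 2
```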
Compare search runtimes with Benchee:

```shell
mix bench.search --query "phoenix" --limit 5
```

Available flags:

- `--query` – text to search (default: `"phoenix"`)
- `--limit` – number of results per run (default: `5`)
- `--time` – seconds spent benchmarking each scenario (default: `1.0`)
- `--warmup` – warmup time in seconds (default: `0.5`)
Run `mix hybrid_search.load` beforehand so the benchmark has data to work with.
The benchmark also needs to generate an embedding for the query text; the first run will
download the model from Hugging Face (internet connection required). If the embedding
is not available, the task reports the failure and only the BM25 scenario is executed.
```shell
mix test       # run the test suite
mix precommit  # compile w/ warnings-as-errors, format, tests
```

This project is released under the MIT License. See LICENSE for details.