You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Add first-class support for pgvector — schema parsing for vector(N) columns, generation of valid vector values, and a set of recipes for property-testing the most common pgvector workloads (embedding search, hybrid retrieval, RLS-gated tenant isolation, versioned embedding migrations).
This is a sqlproof differentiator. No mainstream property-based testing library exercises vector workloads today; pgTAP has no concept of "is the index returning the right neighbors?". Most pgvector bugs surface as silent ranking degradation rather than hard failures, which is exactly the failure class property tests are best at catching.
Background
pgvector ships:
A vector type with a per-column dimension (embedding vector(1536) for OpenAI ada-002, vector(384) for sentence-transformers MiniLM, etc.).
Two index types: HNSW (newer, faster, no training) and IVFFlat (older, requires training).
It's ubiquitous in Supabase, RAG pipelines, and recommendation systems. Real-world bugs in pgvector deployments don't usually crash — they silently return wrong neighbors.
Schema support gaps today
sqlproof.schema.parse_sql._parse_type_node falls back to scalar text for vector(N). The introspector treats it as scalar. The row generator emits random strings, which Postgres rejects at INSERT with invalid input syntax for type vector.
This blocks every pgvector workload from sqlproof entirely.
Use cases worth supporting
1. Embedding-based semantic search (~80% of real-world usage)
CREATETABLEdocuments (
id bigserialPRIMARY KEY,
content textNOT NULL,
embedding vector(384) NOT NULL
);
CREATEINDEXdocs_embedding_hnswON documents
USING hnsw (embedding vector_cosine_ops);
Tests want to assert: given a query embedding, the K nearest neighbors are the K rows whose embeddings have the smallest cosine distance. Index-backed queries must agree with brute-force ground truth up to recall@K thresholds.
2. Hybrid search (vector + structured filter)
SELECT id, embedding <=> $1AS dist
FROM documents
WHERE published_at >= $2AND language = $3ORDER BY embedding <=> $1LIMIT10;
Tests want to assert: the structured filter does NOT change the ranking of remaining rows (only their inclusion). A bug where the optimizer swaps the scan strategy can shuffle results.
3. Multi-tenant RLS-gated vector search
CREATETABLEknowledge_base (
id bigserialPRIMARY KEY,
tenant_id uuid NOT NULL,
embedding vector(1536) NOT NULL,
-- ...
);
CREATE POLICY tenant_isolation ON knowledge_base
USING (tenant_id = current_tenant_id());
Tests want to assert: a user querying for nearest neighbors NEVER receives rows from a different tenant, even when their query doesn't include an explicit tenant filter (RLS enforces it). This is the most security-sensitive pgvector workload — a single leak compromises every customer's data.
4. Versioned embedding migration
Production deployments routinely upgrade embedding models (ada-002 → 3-small, or moving from OpenAI to a fine-tuned in-house model). The typical pattern:
ALTERTABLE documents ADD COLUMN embedding_v2 vector(1536);
-- backfill in batches-- shadow-query v2 alongside v1 for some period-- atomic cutover
Tests want to assert: during the migration window, queries against embedding_v2 return rankings consistent with queries against a freshly-rebuilt index — i.e., the backfill produced the same vectors as a from-scratch rebuild would have. The bug: a partial backfill leaves stale or zero vectors in some rows.
5. Deduplication / fuzzy matching
-- Find candidate duplicates for a newly-inserted record.SELECT id FROM products
WHERE embedding <=> $1<0.1ORDER BY embedding <=> $1LIMIT5;
Tests want to assert: a record's NN-of-itself is itself (self <=> self == 0 for cosine). And that "near-duplicates" defined by an application-level threshold remain stable under reembedding.
6. Clustering / periodic recompute
Some apps cluster embeddings nightly. Tests want to assert: cluster assignments are stable under small perturbations of input embeddings (the clustering function is robust), and unstable under large perturbations (clustering is responsive to real changes).
7. Reranking pipelines
Vector similarity is one signal in a larger ranking function (BM25 + recency + vector + business rules). Tests want to assert: knocking out any one signal degrades NDCG@10 by less than X (each signal is contributing), but knocking out all of them tanks it (the pipeline is doing work).
Concrete invariants to property-test
For each invariant: what shape, why it matters, what bug it catches.
Schema-level
I1. Dimension consistency. Every row in a vector(N) column has exactly N components.
Why: Mismatched dimensions cause Postgres to reject INSERTs at runtime, but the bug usually surfaces in a migration when a different embedding model was used for a subset of rows.
Catches: bad batch reembeddings, mismatched model versions in concurrent backfills.
I2. Non-null embedding contract. If a column is documented as having an embedding for every row, the count of WHERE embedding IS NULL is zero.
Why: A failed background job often leaves a partial nulled-out set behind. Counts surface this immediately.
Math-level
I3. Self-distance is 0 (L2) or 0 (cosine, normalized). For every stored vector v: v <-> v == 0 and v <=> v == 0.
Why: This is mathematically true; the only way it fails is storage corruption.
Catches: a bug where vectors are being passed through a lossy conversion (e.g., float32 → float16 → float32) before storage.
I4. Distance symmetry. For any two stored a, b: a <-> b == b <-> a and a <=> b == b <=> a.
Why: Mathematically true; failing means values are read differently in different scan paths.
Catches: a bug in a custom UDF wrapping distance, or a planner choosing different distance implementations under different conditions.
I5. Cosine ↔ inner-product equivalence (when vectors are L2-normalized). For normalized a, b: (a <=> b) ≈ 1 - (a <#> b * -1) within float tolerance.
Why: A common bug: the schema documents that vectors are normalized but the embedding pipeline skipped the normalization step. Then queries using <=> and <#> give inconsistent rankings.
Index-level
I6. Recall@K vs ground truth. Generate a corpus, build an HNSW index, run K-NN queries with and without the index. The indexed query must achieve ≥ recall threshold (e.g., 0.95) against the exact result.
Why: HNSW parameters (m, ef_construction, ef) trade off recall vs latency. A change to defaults can silently drop recall below acceptable.
Catches: a config tweak that looks faster but degrades quality.
I7. Top-K stability under structured filter. For a query q with no filter vs. a query with a non-restrictive filter (e.g., language = 'en' when all rows are English), the top-K rankings must match exactly.
Why: This catches optimizer bugs where adding a redundant filter triggers a different scan path that uses a different index, returning different neighbors.
I8. Top-K stability across distance operators (for normalized vectors). For normalized vectors, the top-K under <->, <=>, <#> rank the same neighbors (up to ties).
Why: A bug where a query accidentally uses the wrong operator (<-> instead of <=>) returns subtly wrong results, especially in a reranking pipeline.
Security-level
I9. RLS isolation under vector search. With a tenant_id-gating policy, a user querying nearest neighbors NEVER sees a row from another tenant — even when the user doesn't include an explicit WHERE tenant_id = ....
Why: The most security-sensitive pgvector invariant. RLS bypass in vector search is a customer-data leak.
Catches: bugs where the vector index path doesn't apply RLS (rare but has happened), or policies that gate SELECT but miss LIMIT-with-ORDER-BY-vector plan paths.
Recipe shape: generate cross-tenant data, simulate a user, run vector queries, assert all returned tenant_ids match the user.
Application-level
I10. Upsert idempotency. For INSERT ... ON CONFLICT (id) DO UPDATE SET embedding = EXCLUDED.embedding, after a sequence (id, vec_1), (id, vec_2), ..., (id, vec_n), the stored embedding for id equals vec_n.
Why: A trigger or row-level policy interaction can corrupt upserts. Property-testing surfaces it immediately.
I11. Reembedding cutover consistency. Backfill column embedding_v2. The result of SELECT id ORDER BY embedding_v2 <=> q LIMIT 10 should equal the result of a freshly-rebuilt embedding_v2 for the same q.
I12. Near-duplicate detection threshold stability. For a fixed similarity threshold T, the set of near-duplicates returned for a query vector is invariant under reembedding with the same model.
Why: Application logic that branches on "duplicate? yes/no" is fragile to small embedding drift; testing surfaces sensitivity issues.
Acceptance criteria
Phase 1 — schema + generator (minimum viable)
Parser recognizes vector(N) and vector (unsized) → PgType(kind="vector", name="vector", modifiers=(N,)). Pick a new kind since scalar doesn't carry dimension semantics.
Introspector reads pg_attribute.atttypmod to recover the dimension on a vector column.
strategy_for_type for kind="vector" produces list[float] of length N, optionally L2-normalized via a column attribute (default unnormalized).
psycopg adapter integration — generated lists round-trip cleanly to a real Postgres + pgvector instance.
Unit tests pin dimension, value type, and normalization invariants.
Live-DB integration test runs SqlProof.check() against a schema with a vector(384) column.
Phase 2 — recipes for the canonical patterns
Bundled in sqlproof.contrib.pgvector (mirroring how sqlproof.contrib.supabase is structured today):
assert_recall_at_k(db, table, embedding_column, k=10, recall_threshold=0.95) — generates query vectors, compares index-backed top-K to brute-force top-K, asserts the recall.
assert_self_distance_zero(db, table, embedding_column) — for every row, verifies v <-> v == 0.
assert_rls_isolated_search(db, table, tenant_column, embedding_column) — generates cross-tenant data, runs vector queries from each tenant's perspective, asserts no cross-tenant results.
assert_upsert_idempotent(db, table, key_columns, embedding_column) — drives a property test that random INSERT-ON-CONFLICT sequences end at the last value written.
Phase 3 — advanced
Index parameter exploration: a property-testing variant of assert_recall_at_k that searches HNSW parameters and reports the recall/latency frontier.
IVFFlat-specific tests (training set composition affects index quality — surface that).
Open questions
Should vector get its own PgType.kind ("vector") or be lumped under "scalar" with a dimension modifier? My instinct: new kind. Vector values are structurally different from scalars (they have a per-column dim) and downstream generators need to dispatch on them. Modifiers are already used for varchar(N) and numeric(p, s) so the modifier mechanism CAN carry dim — but type dispatch reads kind not modifiers, so a new kind is cleaner.
L2 normalization by column attribute, or always? Most apps using cosine want L2-normalized vectors. But pgvector itself doesn't enforce this. Options:
(a) Always emit normalized vectors → safer default, breaks rare workloads that need unnormalized.
(b) Always emit unnormalized → matches pgvector's behavior, but breaks the cosine equivalence invariant (I5).
(c) Add a column override / attribute to choose. Recommended.
Recall@K threshold defaults. Industry consensus is 0.95 at K=10 with HNSW defaults. We should ship sensible defaults but let users tune per-recipe.
Should we depend on pgvector (the psycopg adapter)? Probably an [pgvector] optional dependency. Wire-format compatibility is the alternative but more fragile.
Float precision strategy. Hypothesis's st.floats() produces full-range values including subnormals and exact zeros. Should vector_strategy clip to [-1, 1] by default? For embeddings, the realistic value range is small. Recommended: clip to [-1, 1] unless explicitly overridden.
Out of scope (for now)
Sparse vector types (sparsevec) — pgvector v0.7+. Add after the dense path is stable.
Phase 1: medium (~300 LoC + tests). Schema parsing is small; psycopg adapter is the unknown.
Phase 2: each recipe is 50–100 LoC. Four recipes = ~300 LoC + integration tests against pgvector.
Phase 3: large; gate on Phase 1+2 adoption.
This is the kind of feature that sqlproof would be the FIRST property-based library to ship. The closest precedents are ANN benchmark suites (ann-benchmarks.com) which don't integrate with test runners; pgvector's own test suite covers correctness but not application-level invariants like RLS isolation or upsert idempotency.
Summary
Add first-class support for pgvector — schema parsing for
vector(N)columns, generation of valid vector values, and a set of recipes for property-testing the most common pgvector workloads (embedding search, hybrid retrieval, RLS-gated tenant isolation, versioned embedding migrations).This is a sqlproof differentiator. No mainstream property-based testing library exercises vector workloads today; pgTAP has no concept of "is the index returning the right neighbors?". Most pgvector bugs surface as silent ranking degradation rather than hard failures, which is exactly the failure class property tests are best at catching.
Background
pgvectorships:vectortype with a per-column dimension (embedding vector(1536)for OpenAI ada-002,vector(384)for sentence-transformers MiniLM, etc.).<->(L2),<#>(negative inner product),<=>(cosine distance).It's ubiquitous in Supabase, RAG pipelines, and recommendation systems. Real-world bugs in pgvector deployments don't usually crash — they silently return wrong neighbors.
Schema support gaps today
sqlproof.schema.parse_sql._parse_type_nodefalls back to scalar text forvector(N). The introspector treats it as scalar. The row generator emits random strings, which Postgres rejects at INSERT withinvalid input syntax for type vector.This blocks every pgvector workload from sqlproof entirely.
Use cases worth supporting
1. Embedding-based semantic search (~80% of real-world usage)
Tests want to assert: given a query embedding, the K nearest neighbors are the K rows whose embeddings have the smallest cosine distance. Index-backed queries must agree with brute-force ground truth up to recall@K thresholds.
2. Hybrid search (vector + structured filter)
Tests want to assert: the structured filter does NOT change the ranking of remaining rows (only their inclusion). A bug where the optimizer swaps the scan strategy can shuffle results.
3. Multi-tenant RLS-gated vector search
Tests want to assert: a user querying for nearest neighbors NEVER receives rows from a different tenant, even when their query doesn't include an explicit tenant filter (RLS enforces it). This is the most security-sensitive pgvector workload — a single leak compromises every customer's data.
4. Versioned embedding migration
Production deployments routinely upgrade embedding models (ada-002 → 3-small, or moving from OpenAI to a fine-tuned in-house model). The typical pattern:
Tests want to assert: during the migration window, queries against
embedding_v2return rankings consistent with queries against a freshly-rebuilt index — i.e., the backfill produced the same vectors as a from-scratch rebuild would have. The bug: a partial backfill leaves stale or zero vectors in some rows.5. Deduplication / fuzzy matching
Tests want to assert: a record's NN-of-itself is itself (
self <=> self == 0for cosine). And that "near-duplicates" defined by an application-level threshold remain stable under reembedding.6. Clustering / periodic recompute
Some apps cluster embeddings nightly. Tests want to assert: cluster assignments are stable under small perturbations of input embeddings (the clustering function is robust), and unstable under large perturbations (clustering is responsive to real changes).
7. Reranking pipelines
Vector similarity is one signal in a larger ranking function (BM25 + recency + vector + business rules). Tests want to assert: knocking out any one signal degrades NDCG@10 by less than X (each signal is contributing), but knocking out all of them tanks it (the pipeline is doing work).
Concrete invariants to property-test
For each invariant: what shape, why it matters, what bug it catches.
Schema-level
I1. Dimension consistency. Every row in a
vector(N)column has exactly N components.I2. Non-null embedding contract. If a column is documented as having an embedding for every row, the count of
WHERE embedding IS NULLis zero.Math-level
I3. Self-distance is 0 (L2) or 0 (cosine, normalized). For every stored vector
v:v <-> v == 0andv <=> v == 0.I4. Distance symmetry. For any two stored
a, b:a <-> b == b <-> aanda <=> b == b <=> a.I5. Cosine ↔ inner-product equivalence (when vectors are L2-normalized). For normalized
a, b:(a <=> b) ≈ 1 - (a <#> b * -1)within float tolerance.<=>and<#>give inconsistent rankings.Index-level
I6. Recall@K vs ground truth. Generate a corpus, build an HNSW index, run K-NN queries with and without the index. The indexed query must achieve ≥ recall threshold (e.g., 0.95) against the exact result.
m,ef_construction,ef) trade off recall vs latency. A change to defaults can silently drop recall below acceptable.I7. Top-K stability under structured filter. For a query
qwith no filter vs. a query with a non-restrictive filter (e.g.,language = 'en'when all rows are English), the top-K rankings must match exactly.I8. Top-K stability across distance operators (for normalized vectors). For normalized vectors, the top-K under
<->,<=>,<#>rank the same neighbors (up to ties).<->instead of<=>) returns subtly wrong results, especially in a reranking pipeline.Security-level
I9. RLS isolation under vector search. With a
tenant_id-gating policy, a user querying nearest neighbors NEVER sees a row from another tenant — even when the user doesn't include an explicitWHERE tenant_id = ....SELECTbut missLIMIT-with-ORDER-BY-vectorplan paths.tenant_ids match the user.Application-level
I10. Upsert idempotency. For
INSERT ... ON CONFLICT (id) DO UPDATE SET embedding = EXCLUDED.embedding, after a sequence(id, vec_1), (id, vec_2), ..., (id, vec_n), the stored embedding foridequalsvec_n.I11. Reembedding cutover consistency. Backfill column
embedding_v2. The result ofSELECT id ORDER BY embedding_v2 <=> q LIMIT 10should equal the result of a freshly-rebuiltembedding_v2for the sameq.I12. Near-duplicate detection threshold stability. For a fixed similarity threshold T, the set of near-duplicates returned for a query vector is invariant under reembedding with the same model.
Acceptance criteria
Phase 1 — schema + generator (minimum viable)
vector(N)andvector(unsized) →PgType(kind="vector", name="vector", modifiers=(N,)). Pick a newkindsincescalardoesn't carry dimension semantics.pg_attribute.atttypmodto recover the dimension on avectorcolumn.strategy_for_typeforkind="vector"produceslist[float]of length N, optionally L2-normalized via a column attribute (default unnormalized).SqlProof.check()against a schema with avector(384)column.Phase 2 — recipes for the canonical patterns
Bundled in
sqlproof.contrib.pgvector(mirroring howsqlproof.contrib.supabaseis structured today):assert_recall_at_k(db, table, embedding_column, k=10, recall_threshold=0.95)— generates query vectors, compares index-backed top-K to brute-force top-K, asserts the recall.assert_self_distance_zero(db, table, embedding_column)— for every row, verifiesv <-> v == 0.assert_rls_isolated_search(db, table, tenant_column, embedding_column)— generates cross-tenant data, runs vector queries from each tenant's perspective, asserts no cross-tenant results.assert_upsert_idempotent(db, table, key_columns, embedding_column)— drives a property test that random INSERT-ON-CONFLICT sequences end at the last value written.Phase 3 — advanced
assert_recall_at_kthat searches HNSW parameters and reports the recall/latency frontier.Open questions
Should
vectorget its ownPgType.kind("vector") or be lumped under"scalar"with a dimension modifier? My instinct: new kind. Vector values are structurally different from scalars (they have a per-column dim) and downstream generators need to dispatch on them. Modifiers are already used forvarchar(N)andnumeric(p, s)so the modifier mechanism CAN carry dim — but type dispatch readskindnotmodifiers, so a new kind is cleaner.L2 normalization by column attribute, or always? Most apps using cosine want L2-normalized vectors. But pgvector itself doesn't enforce this. Options:
Recall@K threshold defaults. Industry consensus is 0.95 at K=10 with HNSW defaults. We should ship sensible defaults but let users tune per-recipe.
Should we depend on
pgvector(the psycopg adapter)? Probably an[pgvector]optional dependency. Wire-format compatibility is the alternative but more fragile.Float precision strategy. Hypothesis's
st.floats()produces full-range values including subnormals and exact zeros. Shouldvector_strategyclip to[-1, 1]by default? For embeddings, the realistic value range is small. Recommended: clip to[-1, 1]unless explicitly overridden.Out of scope (for now)
sparsevec) — pgvector v0.7+. Add after the dense path is stable.halfvec) — same.Estimated effort
This is the kind of feature that sqlproof would be the FIRST property-based library to ship. The closest precedents are ANN benchmark suites (ann-benchmarks.com) which don't integrate with test runners; pgvector's own test suite covers correctness but not application-level invariants like RLS isolation or upsert idempotency.