Skip to content

Support pgvector: schema parsing, generation, and recipes for embedding/RLS/migration invariants #69

@alialavia

Description

@alialavia

Summary

Add first-class support for pgvector — schema parsing for vector(N) columns, generation of valid vector values, and a set of recipes for property-testing the most common pgvector workloads (embedding search, hybrid retrieval, RLS-gated tenant isolation, versioned embedding migrations).

This is a sqlproof differentiator. No mainstream property-based testing library exercises vector workloads today; pgTAP has no concept of "is the index returning the right neighbors?". Most pgvector bugs surface as silent ranking degradation rather than hard failures, which is exactly the failure class property tests are best at catching.

Background

pgvector ships:

  • A vector type with a per-column dimension (embedding vector(1536) for OpenAI ada-002, vector(384) for sentence-transformers MiniLM, etc.).
  • Three distance operators: <-> (L2), <#> (negative inner product), <=> (cosine distance).
  • Two index types: HNSW (newer, faster, no training) and IVFFlat (older, requires training).

It's ubiquitous in Supabase, RAG pipelines, and recommendation systems. Real-world bugs in pgvector deployments don't usually crash — they silently return wrong neighbors.

Schema support gaps today

sqlproof.schema.parse_sql._parse_type_node falls back to scalar text for vector(N). The introspector treats it as scalar. The row generator emits random strings, which Postgres rejects at INSERT with invalid input syntax for type vector.

This blocks every pgvector workload from sqlproof entirely.

Use cases worth supporting

1. Embedding-based semantic search (~80% of real-world usage)

CREATE TABLE documents (
    id bigserial PRIMARY KEY,
    content text NOT NULL,
    embedding vector(384) NOT NULL
);
CREATE INDEX docs_embedding_hnsw ON documents
    USING hnsw (embedding vector_cosine_ops);

Tests want to assert: given a query embedding, the K nearest neighbors are the K rows whose embeddings have the smallest cosine distance. Index-backed queries must agree with brute-force ground truth up to recall@K thresholds.

2. Hybrid search (vector + structured filter)

SELECT id, embedding <=> $1 AS dist
FROM documents
WHERE published_at >= $2 AND language = $3
ORDER BY embedding <=> $1
LIMIT 10;

Tests want to assert: the structured filter does NOT change the ranking of remaining rows (only their inclusion). A bug where the optimizer swaps the scan strategy can shuffle results.

3. Multi-tenant RLS-gated vector search

CREATE TABLE knowledge_base (
    id bigserial PRIMARY KEY,
    tenant_id uuid NOT NULL,
    embedding vector(1536) NOT NULL,
    -- ...
);
CREATE POLICY tenant_isolation ON knowledge_base
    USING (tenant_id = current_tenant_id());

Tests want to assert: a user querying for nearest neighbors NEVER receives rows from a different tenant, even when their query doesn't include an explicit tenant filter (RLS enforces it). This is the most security-sensitive pgvector workload — a single leak compromises every customer's data.

4. Versioned embedding migration

Production deployments routinely upgrade embedding models (ada-002 → 3-small, or moving from OpenAI to a fine-tuned in-house model). The typical pattern:

ALTER TABLE documents ADD COLUMN embedding_v2 vector(1536);
-- backfill in batches
-- shadow-query v2 alongside v1 for some period
-- atomic cutover

Tests want to assert: during the migration window, queries against embedding_v2 return rankings consistent with queries against a freshly-rebuilt index — i.e., the backfill produced the same vectors as a from-scratch rebuild would have. The bug: a partial backfill leaves stale or zero vectors in some rows.

5. Deduplication / fuzzy matching

-- Find candidate duplicates for a newly-inserted record.
SELECT id FROM products
WHERE embedding <=> $1 < 0.1
ORDER BY embedding <=> $1
LIMIT 5;

Tests want to assert: a record's NN-of-itself is itself (self <=> self == 0 for cosine). And that "near-duplicates" defined by an application-level threshold remain stable under reembedding.

6. Clustering / periodic recompute

Some apps cluster embeddings nightly. Tests want to assert: cluster assignments are stable under small perturbations of input embeddings (the clustering function is robust), and unstable under large perturbations (clustering is responsive to real changes).

7. Reranking pipelines

Vector similarity is one signal in a larger ranking function (BM25 + recency + vector + business rules). Tests want to assert: knocking out any one signal degrades NDCG@10 by less than X (each signal is contributing), but knocking out all of them tanks it (the pipeline is doing work).

Concrete invariants to property-test

For each invariant: what shape, why it matters, what bug it catches.

Schema-level

I1. Dimension consistency. Every row in a vector(N) column has exactly N components.

  • Why: Mismatched dimensions cause Postgres to reject INSERTs at runtime, but the bug usually surfaces in a migration when a different embedding model was used for a subset of rows.
  • Catches: bad batch reembeddings, mismatched model versions in concurrent backfills.

I2. Non-null embedding contract. If a column is documented as having an embedding for every row, the count of WHERE embedding IS NULL is zero.

  • Why: A failed background job often leaves a partial nulled-out set behind. Counts surface this immediately.

Math-level

I3. Self-distance is 0 (L2) or 0 (cosine, normalized). For every stored vector v: v <-> v == 0 and v <=> v == 0.

  • Why: This is mathematically true; the only way it fails is storage corruption.
  • Catches: a bug where vectors are being passed through a lossy conversion (e.g., float32 → float16 → float32) before storage.

I4. Distance symmetry. For any two stored a, b: a <-> b == b <-> a and a <=> b == b <=> a.

  • Why: Mathematically true; failing means values are read differently in different scan paths.
  • Catches: a bug in a custom UDF wrapping distance, or a planner choosing different distance implementations under different conditions.

I5. Cosine ↔ inner-product equivalence (when vectors are L2-normalized). For normalized a, b: (a <=> b) ≈ 1 - (a <#> b * -1) within float tolerance.

  • Why: A common bug: the schema documents that vectors are normalized but the embedding pipeline skipped the normalization step. Then queries using <=> and <#> give inconsistent rankings.

Index-level

I6. Recall@K vs ground truth. Generate a corpus, build an HNSW index, run K-NN queries with and without the index. The indexed query must achieve ≥ recall threshold (e.g., 0.95) against the exact result.

  • Why: HNSW parameters (m, ef_construction, ef) trade off recall vs latency. A change to defaults can silently drop recall below acceptable.
  • Catches: a config tweak that looks faster but degrades quality.

I7. Top-K stability under structured filter. For a query q with no filter vs. a query with a non-restrictive filter (e.g., language = 'en' when all rows are English), the top-K rankings must match exactly.

  • Why: This catches optimizer bugs where adding a redundant filter triggers a different scan path that uses a different index, returning different neighbors.

I8. Top-K stability across distance operators (for normalized vectors). For normalized vectors, the top-K under <->, <=>, <#> rank the same neighbors (up to ties).

  • Why: A bug where a query accidentally uses the wrong operator (<-> instead of <=>) returns subtly wrong results, especially in a reranking pipeline.

Security-level

I9. RLS isolation under vector search. With a tenant_id-gating policy, a user querying nearest neighbors NEVER sees a row from another tenant — even when the user doesn't include an explicit WHERE tenant_id = ....

  • Why: The most security-sensitive pgvector invariant. RLS bypass in vector search is a customer-data leak.
  • Catches: bugs where the vector index path doesn't apply RLS (rare but has happened), or policies that gate SELECT but miss LIMIT-with-ORDER-BY-vector plan paths.
  • Recipe shape: generate cross-tenant data, simulate a user, run vector queries, assert all returned tenant_ids match the user.

Application-level

I10. Upsert idempotency. For INSERT ... ON CONFLICT (id) DO UPDATE SET embedding = EXCLUDED.embedding, after a sequence (id, vec_1), (id, vec_2), ..., (id, vec_n), the stored embedding for id equals vec_n.

  • Why: A trigger or row-level policy interaction can corrupt upserts. Property-testing surfaces it immediately.

I11. Reembedding cutover consistency. Backfill column embedding_v2. The result of SELECT id ORDER BY embedding_v2 <=> q LIMIT 10 should equal the result of a freshly-rebuilt embedding_v2 for the same q.

  • Why: Catches partial backfills, stale-rate-limit-failed rows, batch-coalescing bugs.

I12. Near-duplicate detection threshold stability. For a fixed similarity threshold T, the set of near-duplicates returned for a query vector is invariant under reembedding with the same model.

  • Why: Application logic that branches on "duplicate? yes/no" is fragile to small embedding drift; testing surfaces sensitivity issues.

Acceptance criteria

Phase 1 — schema + generator (minimum viable)

  • Parser recognizes vector(N) and vector (unsized) → PgType(kind="vector", name="vector", modifiers=(N,)). Pick a new kind since scalar doesn't carry dimension semantics.
  • Introspector reads pg_attribute.atttypmod to recover the dimension on a vector column.
  • strategy_for_type for kind="vector" produces list[float] of length N, optionally L2-normalized via a column attribute (default unnormalized).
  • psycopg adapter integration — generated lists round-trip cleanly to a real Postgres + pgvector instance.
  • Unit tests pin dimension, value type, and normalization invariants.
  • Live-DB integration test runs SqlProof.check() against a schema with a vector(384) column.

Phase 2 — recipes for the canonical patterns

Bundled in sqlproof.contrib.pgvector (mirroring how sqlproof.contrib.supabase is structured today):

  • assert_recall_at_k(db, table, embedding_column, k=10, recall_threshold=0.95) — generates query vectors, compares index-backed top-K to brute-force top-K, asserts the recall.
  • assert_self_distance_zero(db, table, embedding_column) — for every row, verifies v <-> v == 0.
  • assert_rls_isolated_search(db, table, tenant_column, embedding_column) — generates cross-tenant data, runs vector queries from each tenant's perspective, asserts no cross-tenant results.
  • assert_upsert_idempotent(db, table, key_columns, embedding_column) — drives a property test that random INSERT-ON-CONFLICT sequences end at the last value written.

Phase 3 — advanced

  • Index parameter exploration: a property-testing variant of assert_recall_at_k that searches HNSW parameters and reports the recall/latency frontier.
  • Cross-model migration recipe: shadow-query helper.
  • IVFFlat-specific tests (training set composition affects index quality — surface that).

Open questions

  1. Should vector get its own PgType.kind ("vector") or be lumped under "scalar" with a dimension modifier? My instinct: new kind. Vector values are structurally different from scalars (they have a per-column dim) and downstream generators need to dispatch on them. Modifiers are already used for varchar(N) and numeric(p, s) so the modifier mechanism CAN carry dim — but type dispatch reads kind not modifiers, so a new kind is cleaner.

  2. L2 normalization by column attribute, or always? Most apps using cosine want L2-normalized vectors. But pgvector itself doesn't enforce this. Options:

    • (a) Always emit normalized vectors → safer default, breaks rare workloads that need unnormalized.
    • (b) Always emit unnormalized → matches pgvector's behavior, but breaks the cosine equivalence invariant (I5).
    • (c) Add a column override / attribute to choose. Recommended.
  3. Recall@K threshold defaults. Industry consensus is 0.95 at K=10 with HNSW defaults. We should ship sensible defaults but let users tune per-recipe.

  4. Should we depend on pgvector (the psycopg adapter)? Probably an [pgvector] optional dependency. Wire-format compatibility is the alternative but more fragile.

  5. Float precision strategy. Hypothesis's st.floats() produces full-range values including subnormals and exact zeros. Should vector_strategy clip to [-1, 1] by default? For embeddings, the realistic value range is small. Recommended: clip to [-1, 1] unless explicitly overridden.

Out of scope (for now)

  • Sparse vector types (sparsevec) — pgvector v0.7+. Add after the dense path is stable.
  • Half-precision vectors (halfvec) — same.
  • Index tuning advisor (could be its own Tier-1 feature) — leave to mutation-testing-style Mutation testing harness for SQL function bodies #11.

Estimated effort

  • Phase 1: medium (~300 LoC + tests). Schema parsing is small; psycopg adapter is the unknown.
  • Phase 2: each recipe is 50–100 LoC. Four recipes = ~300 LoC + integration tests against pgvector.
  • Phase 3: large; gate on Phase 1+2 adoption.

This is the kind of feature that sqlproof would be the FIRST property-based library to ship. The closest precedents are ANN benchmark suites (ann-benchmarks.com) which don't integrate with test runners; pgvector's own test suite covers correctness but not application-level invariants like RLS isolation or upsert idempotency.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions