Support pgvector: schema parsing, generation, and recipes for embedding/RLS/migration invariants

## Summary

Add first-class support for **pgvector** — schema parsing for `vector(N)` columns, generation of valid vector values, and a set of recipes for property-testing the most common pgvector workloads (embedding search, hybrid retrieval, RLS-gated tenant isolation, versioned embedding migrations).

This is a sqlproof differentiator. No mainstream property-based testing library exercises vector workloads today; pgTAP has no concept of "is the index returning the right neighbors?". Most pgvector bugs surface as silent ranking degradation rather than hard failures, which is exactly the failure class property tests are best at catching.

## Background

`pgvector` ships:
- A `vector` type with a per-column dimension (`embedding vector(1536)` for OpenAI ada-002, `vector(384)` for sentence-transformers MiniLM, etc.).
- Three distance operators: `<->` (L2), `<#>` (negative inner product), `<=>` (cosine distance).
- Two index types: HNSW (newer, faster, no training) and IVFFlat (older, requires training).

It's ubiquitous in Supabase, RAG pipelines, and recommendation systems. Real-world bugs in pgvector deployments don't usually crash — they silently return wrong neighbors.

## Schema support gaps today

`sqlproof.schema.parse_sql._parse_type_node` falls back to scalar text for `vector(N)`. The introspector treats it as scalar. The row generator emits random strings, which Postgres rejects at INSERT with `invalid input syntax for type vector`.

This blocks every pgvector workload from sqlproof entirely.

## Use cases worth supporting

### 1. Embedding-based semantic search (~80% of real-world usage)

```sql
CREATE TABLE documents (
    id bigserial PRIMARY KEY,
    content text NOT NULL,
    embedding vector(384) NOT NULL
);
CREATE INDEX docs_embedding_hnsw ON documents
    USING hnsw (embedding vector_cosine_ops);
```

Tests want to assert: given a query embedding, the K nearest neighbors are the K rows whose embeddings have the smallest cosine distance. Index-backed queries must agree with brute-force ground truth up to recall@K thresholds.

### 2. Hybrid search (vector + structured filter)

```sql
SELECT id, embedding <=> $1 AS dist
FROM documents
WHERE published_at >= $2 AND language = $3
ORDER BY embedding <=> $1
LIMIT 10;
```

Tests want to assert: the structured filter does NOT change the ranking of remaining rows (only their inclusion). A bug where the optimizer swaps the scan strategy can shuffle results.

### 3. Multi-tenant RLS-gated vector search

```sql
CREATE TABLE knowledge_base (
    id bigserial PRIMARY KEY,
    tenant_id uuid NOT NULL,
    embedding vector(1536) NOT NULL,
    -- ...
);
CREATE POLICY tenant_isolation ON knowledge_base
    USING (tenant_id = current_tenant_id());
```

Tests want to assert: a user querying for nearest neighbors NEVER receives rows from a different tenant, even when their query doesn't include an explicit tenant filter (RLS enforces it). This is the most security-sensitive pgvector workload — a single leak compromises every customer's data.

### 4. Versioned embedding migration

Production deployments routinely upgrade embedding models (ada-002 → 3-small, or moving from OpenAI to a fine-tuned in-house model). The typical pattern:

```sql
ALTER TABLE documents ADD COLUMN embedding_v2 vector(1536);
-- backfill in batches
-- shadow-query v2 alongside v1 for some period
-- atomic cutover
```

Tests want to assert: during the migration window, queries against `embedding_v2` return rankings consistent with queries against a freshly-rebuilt index — i.e., the backfill produced the same vectors as a from-scratch rebuild would have. The bug: a partial backfill leaves stale or zero vectors in some rows.

### 5. Deduplication / fuzzy matching

```sql
-- Find candidate duplicates for a newly-inserted record.
SELECT id FROM products
WHERE embedding <=> $1 < 0.1
ORDER BY embedding <=> $1
LIMIT 5;
```

Tests want to assert: a record's NN-of-itself is itself (`self <=> self == 0` for cosine). And that "near-duplicates" defined by an application-level threshold remain stable under reembedding.

### 6. Clustering / periodic recompute

Some apps cluster embeddings nightly. Tests want to assert: cluster assignments are stable under small perturbations of input embeddings (the clustering function is robust), and unstable under large perturbations (clustering is responsive to real changes).

### 7. Reranking pipelines

Vector similarity is one signal in a larger ranking function (BM25 + recency + vector + business rules). Tests want to assert: knocking out any one signal degrades NDCG@10 by less than X (each signal is contributing), but knocking out all of them tanks it (the pipeline is doing work).

## Concrete invariants to property-test

For each invariant: what shape, why it matters, what bug it catches.

### Schema-level

**I1. Dimension consistency.** Every row in a `vector(N)` column has exactly N components.
- **Why**: Mismatched dimensions cause Postgres to reject INSERTs at runtime, but the bug usually surfaces in a migration when a different embedding model was used for a subset of rows.
- **Catches**: bad batch reembeddings, mismatched model versions in concurrent backfills.

**I2. Non-null embedding contract.** If a column is documented as having an embedding for every row, the count of `WHERE embedding IS NULL` is zero.
- **Why**: A failed background job often leaves a partial nulled-out set behind. Counts surface this immediately.

### Math-level

**I3. Self-distance is 0** (L2) **or 0** (cosine, normalized). For every stored vector `v`: `v <-> v == 0` and `v <=> v == 0`.
- **Why**: This is mathematically true; the only way it fails is storage corruption.
- **Catches**: a bug where vectors are being passed through a lossy conversion (e.g., float32 → float16 → float32) before storage.

**I4. Distance symmetry.** For any two stored `a, b`: `a <-> b == b <-> a` and `a <=> b == b <=> a`.
- **Why**: Mathematically true; failing means values are read differently in different scan paths.
- **Catches**: a bug in a custom UDF wrapping distance, or a planner choosing different distance implementations under different conditions.

**I5. Cosine ↔ inner-product equivalence** (when vectors are L2-normalized). For normalized `a, b`: `(a <=> b) ≈ 1 - (a <#> b * -1)` within float tolerance.
- **Why**: A common bug: the schema documents that vectors are normalized but the embedding pipeline skipped the normalization step. Then queries using `<=>` and `<#>` give inconsistent rankings.

### Index-level

**I6. Recall@K vs ground truth.** Generate a corpus, build an HNSW index, run K-NN queries with and without the index. The indexed query must achieve ≥ recall threshold (e.g., 0.95) against the exact result.
- **Why**: HNSW parameters (`m`, `ef_construction`, `ef`) trade off recall vs latency. A change to defaults can silently drop recall below acceptable.
- **Catches**: a config tweak that looks faster but degrades quality.

**I7. Top-K stability under structured filter.** For a query `q` with no filter vs. a query with a non-restrictive filter (e.g., `language = 'en'` when all rows are English), the top-K rankings must match exactly.
- **Why**: This catches optimizer bugs where adding a redundant filter triggers a different scan path that uses a different index, returning different neighbors.

**I8. Top-K stability across distance operators** (for normalized vectors). For normalized vectors, the top-K under `<->`, `<=>`, `<#>` rank the same neighbors (up to ties).
- **Why**: A bug where a query accidentally uses the wrong operator (`<->` instead of `<=>`) returns subtly wrong results, especially in a reranking pipeline.

### Security-level

**I9. RLS isolation under vector search.** With a `tenant_id`-gating policy, a user querying nearest neighbors NEVER sees a row from another tenant — even when the user doesn't include an explicit `WHERE tenant_id = ...`.
- **Why**: The most security-sensitive pgvector invariant. RLS bypass in vector search is a customer-data leak.
- **Catches**: bugs where the vector index path doesn't apply RLS (rare but has happened), or policies that gate `SELECT` but miss `LIMIT-with-ORDER-BY-vector` plan paths.
- **Recipe shape**: generate cross-tenant data, simulate a user, run vector queries, assert all returned `tenant_id`s match the user.

### Application-level

**I10. Upsert idempotency.** For `INSERT ... ON CONFLICT (id) DO UPDATE SET embedding = EXCLUDED.embedding`, after a sequence `(id, vec_1), (id, vec_2), ..., (id, vec_n)`, the stored embedding for `id` equals `vec_n`.
- **Why**: A trigger or row-level policy interaction can corrupt upserts. Property-testing surfaces it immediately.

**I11. Reembedding cutover consistency.** Backfill column `embedding_v2`. The result of `SELECT id ORDER BY embedding_v2 <=> q LIMIT 10` should equal the result of a freshly-rebuilt `embedding_v2` for the same `q`.
- **Why**: Catches partial backfills, stale-rate-limit-failed rows, batch-coalescing bugs.

**I12. Near-duplicate detection threshold stability.** For a fixed similarity threshold T, the set of near-duplicates returned for a query vector is invariant under reembedding with the same model.
- **Why**: Application logic that branches on "duplicate? yes/no" is fragile to small embedding drift; testing surfaces sensitivity issues.

## Acceptance criteria

### Phase 1 — schema + generator (minimum viable)

- [ ] Parser recognizes `vector(N)` and `vector` (unsized) → `PgType(kind="vector", name="vector", modifiers=(N,))`. Pick a new `kind` since `scalar` doesn't carry dimension semantics.
- [ ] Introspector reads `pg_attribute.atttypmod` to recover the dimension on a `vector` column.
- [ ] `strategy_for_type` for `kind="vector"` produces `list[float]` of length N, optionally L2-normalized via a column attribute (default unnormalized).
- [ ] psycopg adapter integration — generated lists round-trip cleanly to a real Postgres + pgvector instance.
- [ ] Unit tests pin dimension, value type, and normalization invariants.
- [ ] Live-DB integration test runs `SqlProof.check()` against a schema with a `vector(384)` column.

### Phase 2 — recipes for the canonical patterns

Bundled in `sqlproof.contrib.pgvector` (mirroring how `sqlproof.contrib.supabase` is structured today):

- [ ] **`assert_recall_at_k(db, table, embedding_column, k=10, recall_threshold=0.95)`** — generates query vectors, compares index-backed top-K to brute-force top-K, asserts the recall.
- [ ] **`assert_self_distance_zero(db, table, embedding_column)`** — for every row, verifies `v <-> v == 0`.
- [ ] **`assert_rls_isolated_search(db, table, tenant_column, embedding_column)`** — generates cross-tenant data, runs vector queries from each tenant's perspective, asserts no cross-tenant results.
- [ ] **`assert_upsert_idempotent(db, table, key_columns, embedding_column)`** — drives a property test that random INSERT-ON-CONFLICT sequences end at the last value written.

### Phase 3 — advanced

- [ ] Index parameter exploration: a property-testing variant of `assert_recall_at_k` that searches HNSW parameters and reports the recall/latency frontier.
- [ ] Cross-model migration recipe: shadow-query helper.
- [ ] IVFFlat-specific tests (training set composition affects index quality — surface that).

## Open questions

1. **Should `vector` get its own `PgType.kind`** (`"vector"`) or be lumped under `"scalar"` with a dimension modifier? My instinct: new kind. Vector values are structurally different from scalars (they have a per-column dim) and downstream generators need to dispatch on them. Modifiers are already used for `varchar(N)` and `numeric(p, s)` so the modifier mechanism CAN carry dim — but type dispatch reads `kind` not `modifiers`, so a new kind is cleaner.

2. **L2 normalization by column attribute, or always?** Most apps using cosine want L2-normalized vectors. But pgvector itself doesn't enforce this. Options:
   - (a) Always emit normalized vectors → safer default, breaks rare workloads that need unnormalized.
   - (b) Always emit unnormalized → matches pgvector's behavior, but breaks the cosine equivalence invariant (I5).
   - (c) Add a column override / attribute to choose. **Recommended.**

3. **Recall@K threshold defaults.** Industry consensus is 0.95 at K=10 with HNSW defaults. We should ship sensible defaults but let users tune per-recipe.

4. **Should we depend on `pgvector` (the psycopg adapter)?** Probably an `[pgvector]` optional dependency. Wire-format compatibility is the alternative but more fragile.

5. **Float precision strategy.** Hypothesis's `st.floats()` produces full-range values including subnormals and exact zeros. Should `vector_strategy` clip to `[-1, 1]` by default? For embeddings, the realistic value range is small. **Recommended: clip to `[-1, 1]` unless explicitly overridden.**

## Out of scope (for now)

- Sparse vector types (`sparsevec`) — pgvector v0.7+. Add after the dense path is stable.
- Half-precision vectors (`halfvec`) — same.
- Index tuning advisor (could be its own Tier-1 feature) — leave to mutation-testing-style #11.

## Estimated effort

- Phase 1: medium (~300 LoC + tests). Schema parsing is small; psycopg adapter is the unknown.
- Phase 2: each recipe is 50–100 LoC. Four recipes = ~300 LoC + integration tests against pgvector.
- Phase 3: large; gate on Phase 1+2 adoption.

This is the kind of feature that sqlproof would be the FIRST property-based library to ship. The closest precedents are ANN benchmark suites (ann-benchmarks.com) which don't integrate with test runners; pgvector's own test suite covers correctness but not application-level invariants like RLS isolation or upsert idempotency.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support pgvector: schema parsing, generation, and recipes for embedding/RLS/migration invariants #69

Summary

Background

Schema support gaps today

Use cases worth supporting

1. Embedding-based semantic search (~80% of real-world usage)

2. Hybrid search (vector + structured filter)

3. Multi-tenant RLS-gated vector search

4. Versioned embedding migration

5. Deduplication / fuzzy matching

6. Clustering / periodic recompute

7. Reranking pipelines

Concrete invariants to property-test

Schema-level

Math-level

Index-level

Security-level

Application-level

Acceptance criteria

Phase 1 — schema + generator (minimum viable)

Phase 2 — recipes for the canonical patterns

Phase 3 — advanced

Open questions

Out of scope (for now)

Estimated effort

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Support pgvector: schema parsing, generation, and recipes for embedding/RLS/migration invariants #69

Description

Summary

Background

Schema support gaps today

Use cases worth supporting

1. Embedding-based semantic search (~80% of real-world usage)

2. Hybrid search (vector + structured filter)

3. Multi-tenant RLS-gated vector search

4. Versioned embedding migration

5. Deduplication / fuzzy matching

6. Clustering / periodic recompute

7. Reranking pipelines

Concrete invariants to property-test

Schema-level

Math-level

Index-level

Security-level

Application-level

Acceptance criteria

Phase 1 — schema + generator (minimum viable)

Phase 2 — recipes for the canonical patterns

Phase 3 — advanced

Open questions

Out of scope (for now)

Estimated effort

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions