CRDB vectorizer #168739

Draft
suj-krishnan wants to merge 27 commits into cockroachdb:master from suj-krishnan:crdb_vectorizer

Conversation


@suj-krishnan suj-krishnan commented Apr 21, 2026

Summary

Add an in-database vectorizer that automatically generates and maintains vector embeddings for table rows.

Epic: none

🤖 Generated with Claude Code

suj-krishnan and others added 9 commits April 20, 2026 11:01
Add a new package that dynamically loads the ONNX Runtime C library
via dlopen/dlsym and exposes Go bindings for loading ONNX models and
running neural network inference. This follows the same pattern used
by pkg/geo/geos for GEOS integration.

The ONNX Runtime C API exposes a single dlsym entry point
(OrtGetApiBase) that returns a struct of ~200+ function pointers,
making the integration much simpler than GEOS (which requires ~80
individual dlsym calls).

Key components:
- onnxruntime.h: C ABI contract defining opaque handles (CR_ONNX,
  CR_ONNX_Model), data types (Slice, String, Status), and function
  declarations for init, model loading, and inference.
- onnxruntime.cc: C++ dlopen/dlsym shim that resolves OrtApi function
  pointers, manages ORT environment and sessions, and wraps inference
  calls with proper tensor creation and cleanup.
- onnxruntime.go: Go bindings with sync.Once initialization, platform-
  specific library search (flag, env var, Bazel runfiles, parent dirs),
  and a Model type with LoadModel, RunInference, and Close methods.
- postprocess.go: Pure Go mean pooling and L2 normalization for
  converting token embeddings to sentence embeddings.

The library is loaded at runtime — operations gracefully error if
libonnxruntime is not found. Integration tests skip when the library
is unavailable; pure Go post-processing tests always run.

A tiny test ONNX model (vocab_size=32, dims=8) is included in
testdata/ along with the Python script that generated it.

Epic: none
Release note: None

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a WordPiece tokenizer package and a high-level embedding engine
that wires the tokenizer to the ONNX Runtime integration from the
previous commit, providing a complete text-to-embedding pipeline.

The tokenizer (pkg/embedding/tokenizer/) implements the BERT uncased
tokenization algorithm in pure Go:
- Vocab loading from vocab.txt files (one token per line, 0-indexed)
- BERT pre-tokenization: lowercasing, NFD accent stripping, CJK
  character spacing, and punctuation splitting (matching HuggingFace's
  BasicTokenizer behavior exactly)
- WordPiece subword tokenization using the greedy longest-match-first
  algorithm, with ## continuation prefixes
- Encode/EncodeBatch methods that assemble [CLS] + tokens + [SEP] +
  padding, with dynamic batch padding for efficient inference
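
The greedy longest-match-first step can be sketched as a minimal standalone function (the real tokenizer also enforces a max word length and uses the full BERT vocab):

```go
package main

import "fmt"

// wordPiece splits one pre-tokenized, lowercased word into subword pieces.
// At each position it takes the longest vocab entry that matches, adding
// the "##" continuation prefix to non-initial pieces; if nothing matches,
// the whole word becomes [UNK].
func wordPiece(word string, vocab map[string]bool) []string {
	var pieces []string
	for start := 0; start < len(word); {
		match := ""
		for end := len(word); end > start; end-- {
			sub := word[start:end]
			if start > 0 {
				sub = "##" + sub
			}
			if vocab[sub] {
				match = sub
				start = end
				break
			}
		}
		if match == "" {
			return []string{"[UNK]"}
		}
		pieces = append(pieces, match)
	}
	return pieces
}

func main() {
	vocab := map[string]bool{"un": true, "##aff": true, "##able": true}
	fmt.Println(wordPiece("unaffable", vocab))
}
```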

The embedding engine (pkg/embedding/engine.go) combines the tokenizer,
ONNX model inference, and post-processing into a single API:
- NewEngine(modelPath, vocabPath) loads both model and vocabulary
- Embed(text) returns a unit-normalized float32 embedding vector
- EmbedBatch(texts) processes multiple texts in a single inference call
- Thread-safe: tokenizer is immutable, ORT sessions are thread-safe

The end-to-end pipeline is:
  text → pre-tokenize → WordPiece → model inference → mean pool → L2 normalize → embedding

All 33 tokenizer tests are pure Go and always run. Engine integration
tests skip gracefully when ONNX Runtime is not available.

Epic: none
Release note: None

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add the embed() SQL builtin function that generates vector embeddings
from text using the ONNX model configured at server startup. This
completes the in-database vectorizer pipeline: text → tokenize → infer
→ post-process → VECTOR.

The implementation follows the GEOS pattern:
- CLI flags (--embedding-libs, --embedding-model, --embedding-vocab)
  configure the ONNX runtime library, model, and vocabulary paths.
- initEmbedding() runs at server startup (after initGEOS) and
  initializes a global Engine singleton via sync.Once.
- If initialization fails (missing library, model, or vocab), the
  server starts normally but embed() returns a pgerror with pgcode
  ConfigFile and a hint about the required flags.
- The builtin calls embedding.GetEngine() to obtain the singleton,
  then engine.Embed(text) to produce the vector.

The embed() function has Stable volatility since the same model
produces the same output for the same input within a session, but
the model could change across server restarts.

Epic: none
Release note: None

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The embed() SQL builtin previously required users to manually download
model and vocabulary files and pass them via --embedding-model and
--embedding-vocab CLI flags. This commit adds automatic model management:
when those flags are omitted, the server downloads the default
all-MiniLM-L6-v2 model from Hugging Face on first startup and caches it
locally.

The new pkg/embedding/modelcache package handles downloading, SHA256
verification, and atomic file placement. Cache location defaults to
<store-dir>/embedding-cache/ for persistent stores or
~/Library/Caches/cockroach/embedding-models/ for in-memory stores
(e.g. cockroach demo). A new --embedding-cache-dir flag allows
overriding this.

Only --embedding-libs (ONNX Runtime library path) is still required.
Explicit --embedding-model/--embedding-vocab flags continue to work
and skip auto-download entirely.

Epic: none
Release note: None

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add sentence-aware text chunking for embedding long documents that
exceed the transformer model's token limit (256 tokens for
all-MiniLM-L6-v2).

The chunker splits text on sentence boundaries (., !, ?), then
greedily packs sentences into chunks up to a configurable token
limit with overlap for context continuity between chunks.

The embed_chunks() SQL generator builtin wraps chunking and
embedding into a single function:

  SELECT * FROM embed_chunks('long text here...');

Returns rows of (chunk_seq INT, chunk STRING, embedding VECTOR),
where each row is an independently embedded segment of the input
text. This is the foundation for the vectorizer's background job
that will embed table rows in batches.

New packages and changes:
- pkg/embedding/chunker: sentence-aware text chunker with
  configurable max tokens and overlap
- pkg/embedding/tokenizer: add TokenCount() method for efficient
  token counting without full encoding
- pkg/embedding/engine: add Tokenizer() accessor for chunker use
- pkg/sql/sem/builtins: embed_chunks() generator builtin

Epic: none
Release note: None

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add catalog metadata for the vectorizer feature, which will
automatically generate and maintain vector embeddings for table rows.

The Vectorizer message in catpb stores the configuration:
- source_columns: which columns to embed
- template: how to combine columns into text
- embedding_table_id: companion table storing embeddings
- schedule_id: periodic job that generates embeddings
- model: embedding model name (e.g. all-MiniLM-L6-v2)
- schedule_cron: how often the job runs
- batch_size: rows per job invocation

The message follows the RowLevelTTL pattern: defined in
catpb/catalog.proto and referenced as an optional field (73)
on TableDescriptor in descpb/structured.proto.

Epic: none
Release note: None

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add the `pkg/sql/vectorizer` package with helper functions that
generate SQL for the companion embedding table and view. These will
be called by the CREATE VECTORIZER DDL execution (Phase 6d).

The companion table stores one embedding row per chunk:

    <table>_embeddings (
      embedding_uuid UUID PRIMARY KEY,
      source_<pk> <pk_type> NOT NULL REFERENCES <table> ON DELETE CASCADE,
      chunk_seq INT8 NOT NULL DEFAULT 0,
      chunk STRING NOT NULL,
      embedding VECTOR(<dims>) NOT NULL,
      UNIQUE (source_<pk>, chunk_seq)
    )

A companion view joins source and embedding tables for convenient
querying:

    <table>_embeddings_view AS
    SELECT s.*, e.chunk_seq, e.chunk, e.embedding
    FROM <table> s JOIN <table>_embeddings e ON ...

Both single-column and composite primary keys are supported.

Epic: none
Release note: None

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add SQL parser support for the vectorizer DDL statements:

  CREATE VECTORIZER ON <table>
    USING COLUMN (col1, col2, ...)
    [WITH model = '...', schedule = '...', batch_size = '...']

  DROP VECTORIZER [IF EXISTS] ON <table>

This adds:
- AST nodes (CreateVectorizer, DropVectorizer) in tree/vectorizer.go
- Grammar rules in sql.y with VECTORIZER as an unreserved keyword
- Statement dispatch via opaque.go to stub planner methods
- Parser test data and contextual help test coverage

The planner methods currently return "not yet implemented" errors;
actual DDL execution will follow in Phase 6d.

Epic: none
Release note: None

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement the planner methods for CREATE VECTORIZER and DROP VECTORIZER
that were previously returning "not yet implemented" stubs.

CREATE VECTORIZER ON <table> USING COLUMN (<cols>) [WITH <opts>]:
- Resolves the source table and validates ownership
- Validates that specified columns exist
- Parses WITH options (model, template, schedule, batch_size) using
  the exprutil.Evaluator.KVOptions pattern
- Creates a companion embeddings table via internal SQL with:
  - UUID primary key, source FK with CASCADE delete
  - chunk_seq/chunk/embedding columns, VECTOR(384) type
  - UNIQUE constraint on (source columns, chunk_seq)
- Creates a companion view joining source with embeddings
- Sets the Vectorizer protobuf config on the source table descriptor

DROP VECTORIZER [IF EXISTS] ON <table>:
- Resolves the source table, validates ownership
- Checks that a vectorizer is configured (or handles IF EXISTS)
- Drops the companion view and table via internal SQL
- Clears the Vectorizer config from the descriptor

Also registers both plan node types in plan_names.go.

Epic: none
Release note: None

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

trunk-io Bot commented Apr 21, 2026

Merging to master in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here.


blathers-crl Bot commented Apr 21, 2026

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity

This change is Reviewable

cockroach-teamcity added the X-perf-gain label (Microbenchmarks CI: added if a performance gain is detected) on Apr 21, 2026
Add the jobs framework integration for the vectorizer feature. When
CREATE VECTORIZER runs, it now also creates a scheduled job that
periodically fires a vectorizer job. When DROP VECTORIZER runs, it
deletes the schedule.

The vectorizer job (resumer) processes rows from the source table that
don't yet have embeddings in the companion table: it queries for pending
rows, generates embeddings via the embedding engine, and inserts them.

Components added:
- `jobs.proto`: VectorizerDetails, VectorizerProgress messages and
  TypeVectorizer enum value (35)
- `catalog.proto`: ScheduledVectorizerArgs for schedule executor args
- `tree/show.go`: ScheduledVectorizerExecutor constant
- `vectorizerschedule/`: scheduled job executor that creates vectorizer
  jobs when the schedule fires
- `vectorizerjob/`: job resumer that processes pending rows and generates
  embeddings
- `create_vectorizer.go`: creates scheduled job and stores schedule ID
  in the Vectorizer proto
- `drop_vectorizer.go`: deletes the schedule on DROP VECTORIZER

Epic: none
Release note: None

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

blathers-crl Bot commented Apr 21, 2026

Detected infrastructure failure (matched: self-hosted runner lost communication with the server). Automatically rerunning failed jobs. (run link)

suj-krishnan and others added 15 commits April 21, 2026 21:55
Add a new crdb_internal.vectorizer_status virtual table that exposes
vectorizer configuration for tables with active vectorizers. The table
shows source table identity, source columns, model, schedule, batch
size, and companion table/schedule IDs.

This provides operational visibility into configured vectorizers without
requiring direct descriptor inspection.

Epic: none
Release note: None

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sql/vectorizer: add S3 URI loading mode and read_uri builtin

Add support for fetching and embedding content from cloud storage URIs
(S3, GCS, HTTP, etc.) in the vectorizer pipeline.

- Add a loading mode field to the Vectorizer protobuf message to
  distinguish between inline column text (column) and URI-based
  loading (uri).
- Add a URI loading option to CREATE VECTORIZER DDL with
  validation: it requires exactly one STRING column.
- Add a read_uri() SQL builtin that fetches file content via the
  existing cloud storage infrastructure and extracts text. Supports
  text formats (.txt, .md, .csv, .json, .xml, .html, .yml) and PDF
  text extraction via ledongthuc/pdf.
- Add a content package for content type detection and text
  extraction with tests.
Introduces the Embedder interface and adds support for remote embedding
models alongside the existing local ONNX engine. Users can now call
embed() and embed_chunks() with a model argument to use remote providers:

  SELECT embed('hello', 'openai/text-embedding-3-small');
  SELECT embed('hello', 'google/text-embedding-004');

Remote models require a matching external connection:

  CREATE EXTERNAL CONNECTION openai AS 'https://api.openai.com/v1?api_key=sk-...';
  CREATE EXTERNAL CONNECTION google AS 'https://REGION-aiplatform.googleapis.com/v1?project=PROJECT&credentials=BASE64_SA_KEY';

Key changes:
- Extract Embedder interface; add context.Context to Engine methods.
- Add model registry with dimension/provider metadata for 6 models.
- Implement OpenAI and Google Vertex AI clients with retry, error
  classification (pgcodes), and response size limits.
- Google client supports both static access tokens (for testing) and
  service account key authentication with automatic token refresh.
- Add ResolveRemoteEmbedder that parses connection URIs per-provider.
- Wire up 2-arg overloads of embed() and embed_chunks() that resolve
  the model, look up the external connection, and call the remote API.
- CREATE VECTORIZER now uses the model registry for dimensions and
  validates the external connection at DDL time.

Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
The vectorizer background job was hardcoded to use the local ONNX
engine. Update it to read the model and connection name from the
vectorizer config and resolve the appropriate embedder (local or
remote) via ResolveRemoteEmbedder.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
The vectorizer background job previously only detected missing rows
(new inserts). This extends it with two capabilities:

1. Stale row detection: compares each source row's MVCC timestamp
   against last_embedded_at in the companion table to find rows whose
   content changed after embedding. Stale rows are re-embedded via
   ON CONFLICT DO UPDATE.

2. URI loading mode: when the vectorizer's loading mode is "uri", the
   job fetches file content from cloud storage (S3, GCS, HTTP) via
   ExternalStorageFromURI, extracts text, and embeds that instead of
   the raw column value.

The companion table schema gains a last_embedded_at TIMESTAMPTZ column
to support the stale detection comparison.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Adds support for embedding images alongside text using Google Vertex
AI's multimodalembedding@001 model. Text and images are projected into
the same 1408-dim vector space, enabling cross-modal semantic search
(e.g., searching images with text queries).

Key changes:
- Add ImageEmbedder interface (EmbedImage, EmbedImageBatch) and
  Modality bitmask (ModalityText, ModalityImage) to the embedding
  package.
- Extend the Google Vertex AI client to handle both text-only and
  multimodal request/response formats, with EmbedImage/EmbedImageBatch
  methods that base64-encode images for the API.
- Add embed_image(image BYTES, model STRING) SQL builtin for ad-hoc
  image embedding queries.
- Add VectorizerInputType enum (TEXT/IMAGE) to the Vectorizer protobuf.
  CREATE VECTORIZER inspects column types: BYTES columns automatically
  select the image path and validate the model supports images.
- Update the background job to dispatch to ImageEmbedder for image
  input types.
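
The Modality bitmask follows the usual Go flag-set pattern; a sketch (the Supports helper is an assumption added for illustration):

```go
package main

import "fmt"

// Modality is a bitmask so a model can advertise support for several
// input types at once (e.g. a multimodal model supports both).
type Modality uint8

const (
	ModalityText Modality = 1 << iota
	ModalityImage
)

// Supports reports whether m includes every modality in want.
func (m Modality) Supports(want Modality) bool { return m&want == want }

func main() {
	multimodal := ModalityText | ModalityImage
	fmt.Println(multimodal.Supports(ModalityImage), ModalityText.Supports(ModalityImage))
}
```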

Tested end-to-end: 5 distinct images (cat, dog, car, beach, mountain)
embedded via Vertex AI, text queries rank the correct image first in
all cases.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Add two enhancements to the vectorizer:

1. Add a `sql.vectorizer.default_schedule` cluster setting that controls
   the default cron expression for vectorizer background jobs. Previously
   hardcoded to `@every 5m`, this can now be changed globally. Individual
   vectorizers can still override via `WITH schedule = '...'`.

2. Automatically create a vector index with cosine distance on the
   companion embeddings table. The `VECTOR INDEX (embedding
   vector_cosine_ops)` is included inline in the CREATE TABLE statement,
   so every new vectorizer gets an efficient nearest-neighbor index
   without requiring manual index creation.

Epic: none
Release note: None

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sql/vectorizer: integrate S3 URI loading with multimodal image embedding

Enable embedding of image files (jpg, png, etc.) fetched from cloud
storage URIs using multimodal models like Vertex AI's
multimodalembedding@001.

- Add content type classification (ContentType, ClassifyURI, IsImage)
  to the content package for routing image URIs to the ImageEmbedder
  pipeline instead of text extraction.
- Update the vectorizer background job to detect image URIs and embed
  them via ImageEmbedder.EmbedImage(), while continuing to batch-embed
  text URIs via EmbedBatch(). Mixed batches containing both text and
  image URIs are handled within a single job invocation.
- Add a read_uri_bytes(uri) SQL builtin that returns raw bytes from
  cloud storage, enabling manual image embedding via embed_image().
- Fix the Vertex AI multimodal client to send one instance per API
  request in both EmbedBatch and EmbedImageBatch, since the multimodal
  endpoint rejects multi-instance requests.
- Remove the DDL restriction that prevented URI loading mode with
  image-capable models.
- Consolidate readURIContent as a thin wrapper over readURIBytes to
  eliminate duplicated cloud storage fetching logic.
Add a standalone Go terminal chat application that demonstrates
CockroachDB's vectorizer feature. The app performs semantic search
over a books catalog stored in CockroachDB and generates responses
using a local Ollama LLM (llama 3.1).

Key features:
- Two-step hybrid search: LLM extracts structured filters (price,
  region) from natural language, then CockroachDB executes a filtered
  vector similarity query combining SQL WHERE clauses with the
  embed() builtin and cosine distance operator.
- Streaming LLM responses via Ollama's native chat API.
- Demo dataset of 10 programming books with price and region data.
- Automatic setup via --setup flag (creates table, inserts data,
  creates vectorizer).

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Add a `sql.vectorize.enabled` cluster setting (default: false) that
gates all vectorization features. When disabled, `embed()`,
`embed_chunks()`, `embed_image()`, `CREATE VECTORIZER`, `DROP
VECTORIZER`, and the vectorizer background job all return a clear
user-facing error directing users to enable the setting.

Also default `--embedding-libs` to `./onnxruntime/lib` relative to
the working directory for easier local development.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Extend the LLM query classification step to detect messages that
don't require a database search (greetings, follow-ups from history,
unrelated questions). When needs_search is false, the app responds
directly from conversation history without hitting CockroachDB,
avoiding irrelevant vector search results and unnecessary latency.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
suj-krishnan and others added 2 commits April 27, 2026 13:59
Replace the hardcoded bookstore demo with a live CS research paper
archive that fetches ~500 papers from arXiv across 10 categories
(cs.AI, cs.CL, cs.CR, cs.CV, cs.DB, cs.DC, cs.DS, cs.LG, cs.SE,
cs.PL).

Key changes:
- Add arxiv.go with arXiv API client (XML/Atom parsing, pagination,
  rate limiting, dedup).
- New papers table schema with arxiv_id, title, authors, abstract,
  category, pdf_link, and published columns.
- Split --setup into subcommands: `setup load` (fetch and insert
  papers) and `setup vectorizer` (create vectorizer). Bare `setup`
  runs both.
- Update RAG pipeline: search filters use category and year range
  instead of price and region. Prompts rewritten for research paper
  context.
- Improve handling of relative time expressions ("last year",
  "recent") in filter extraction prompt.
- Change default vectorizer schedule to @every 30s.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>