CRDB vectorizer #168739
Draft
suj-krishnan wants to merge 27 commits into cockroachdb:master from
Conversation
Add a new package that dynamically loads the ONNX Runtime C library via dlopen/dlsym and exposes Go bindings for loading ONNX models and running neural network inference. This follows the same pattern used by pkg/geo/geos for GEOS integration. The ONNX Runtime C API exposes a single dlsym entry point (OrtGetApiBase) that returns a struct of ~200+ function pointers, making the integration much simpler than GEOS (which requires ~80 individual dlsym calls).
Key components:
- onnxruntime.h: C ABI contract defining opaque handles (CR_ONNX, CR_ONNX_Model), data types (Slice, String, Status), and function declarations for init, model loading, and inference.
- onnxruntime.cc: C++ dlopen/dlsym shim that resolves OrtApi function pointers, manages the ORT environment and sessions, and wraps inference calls with proper tensor creation and cleanup.
- onnxruntime.go: Go bindings with sync.Once initialization, platform-specific library search (flag, env var, Bazel runfiles, parent dirs), and a Model type with LoadModel, RunInference, and Close methods.
- postprocess.go: Pure Go mean pooling and L2 normalization for converting token embeddings to sentence embeddings.
The library is loaded at runtime — operations gracefully error if libonnxruntime is not found. Integration tests skip when the library is unavailable; pure Go post-processing tests always run. A tiny test ONNX model (vocab_size=32, dims=8) is included in testdata/ along with the Python script that generated it.
Epic: none
Release note: None
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a WordPiece tokenizer package and a high-level embedding engine that wires the tokenizer to the ONNX Runtime integration from the previous commit, providing a complete text-to-embedding pipeline.
The tokenizer (pkg/embedding/tokenizer/) implements the BERT uncased tokenization algorithm in pure Go:
- Vocab loading from vocab.txt files (one token per line, 0-indexed)
- BERT pre-tokenization: lowercasing, NFD accent stripping, CJK character spacing, and punctuation splitting (matching HuggingFace's BasicTokenizer behavior exactly)
- WordPiece subword tokenization using the greedy longest-match-first algorithm, with ## continuation prefixes
- Encode/EncodeBatch methods that assemble [CLS] + tokens + [SEP] + padding, with dynamic batch padding for efficient inference
The embedding engine (pkg/embedding/engine.go) combines the tokenizer, ONNX model inference, and post-processing into a single API:
- NewEngine(modelPath, vocabPath) loads both model and vocabulary
- Embed(text) returns a unit-normalized float32 embedding vector
- EmbedBatch(texts) processes multiple texts in a single inference call
- Thread-safe: the tokenizer is immutable and ORT sessions are thread-safe
The end-to-end pipeline is: text → pre-tokenize → WordPiece → model inference → mean pool → L2 normalize → embedding
All 33 tokenizer tests are pure Go and always run. Engine integration tests skip gracefully when ONNX Runtime is not available.
Epic: none
Release note: None
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
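The greedy longest-match-first WordPiece step can be sketched in a few lines of Go. This is an illustrative standalone version, not the package's actual API:

```go
package main

import "fmt"

// wordPiece splits one pre-tokenized, lowercased word into subword
// tokens: at each position it takes the longest vocabulary match,
// where continuation pieces carry a "##" prefix. If some position
// has no match at all, the whole word maps to [UNK].
func wordPiece(word string, vocab map[string]bool) []string {
	var pieces []string
	start := 0
	for start < len(word) {
		match := ""
		end := len(word)
		for end > start {
			sub := word[start:end]
			if start > 0 {
				sub = "##" + sub
			}
			if vocab[sub] {
				match = sub
				break
			}
			end--
		}
		if match == "" {
			return []string{"[UNK]"}
		}
		pieces = append(pieces, match)
		start = end
	}
	return pieces
}

func main() {
	vocab := map[string]bool{"un": true, "##affable": true}
	fmt.Println(wordPiece("unaffable", vocab)) // prints [un ##affable]
}
```

The greedy strategy is what makes WordPiece fast: there is no backtracking, which is why a single unmatched position collapses the whole word to [UNK].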
Add the embed() SQL builtin function that generates vector embeddings from text using the ONNX model configured at server startup. This completes the in-database vectorizer pipeline: text → tokenize → infer → post-process → VECTOR.
The implementation follows the GEOS pattern:
- CLI flags (--embedding-libs, --embedding-model, --embedding-vocab) configure the ONNX runtime library, model, and vocabulary paths.
- initEmbedding() runs at server startup (after initGEOS) and initializes a global Engine singleton via sync.Once.
- If initialization fails (missing library, model, or vocab), the server starts normally, but embed() returns a pgerror with pgcode ConfigFile and a hint about the required flags.
- The builtin calls embedding.GetEngine() to obtain the singleton, then engine.Embed(text) to produce the vector.
The embed() function has Stable volatility: the same model produces the same output for the same input within a session, but the model could change across server restarts.
Epic: none
Release note: None
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
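The interesting part of this pattern is "the server starts normally, only the builtin errors". A minimal sketch of that behavior (names and error text are illustrative, not the actual CockroachDB code):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Engine stands in for the real embedding engine.
type Engine struct{ model string }

var (
	engineOnce sync.Once
	engine     *Engine
	engineErr  error
)

// initEmbedding runs once at server startup. A failure is recorded
// but does not prevent the server from starting.
func initEmbedding(modelPath string) {
	engineOnce.Do(func() {
		if modelPath == "" {
			engineErr = errors.New(
				"embedding not configured; pass --embedding-libs, --embedding-model, --embedding-vocab")
			return
		}
		engine = &Engine{model: modelPath}
	})
}

// GetEngine is what an embed() call would use: it surfaces the
// recorded startup error to the SQL user instead of panicking.
func GetEngine() (*Engine, error) {
	if engine == nil {
		return nil, engineErr
	}
	return engine, nil
}

func main() {
	initEmbedding("") // startup with no flags: the server still comes up
	if _, err := GetEngine(); err != nil {
		fmt.Println("embed() would error:", err)
	}
}
```

Deferring the failure from startup to call time keeps the embedding feature strictly optional: clusters that never call embed() pay nothing for a missing library.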
The embed() SQL builtin previously required users to manually download model and vocabulary files and pass them via --embedding-model and --embedding-vocab CLI flags. This commit adds automatic model management: when those flags are omitted, the server downloads the default all-MiniLM-L6-v2 model from Hugging Face on first startup and caches it locally.
The new pkg/embedding/modelcache package handles downloading, SHA256 verification, and atomic file placement. The cache location defaults to <store-dir>/embedding-cache/ for persistent stores or ~/Library/Caches/cockroach/embedding-models/ for in-memory stores (e.g. cockroach demo). A new --embedding-cache-dir flag allows overriding this.
Only --embedding-libs (ONNX Runtime library path) is still required. Explicit --embedding-model/--embedding-vocab flags continue to work and skip auto-download entirely.
Epic: none
Release note: None
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add sentence-aware text chunking for embedding long documents that
exceed the transformer model's token limit (256 tokens for
all-MiniLM-L6-v2).
The chunker splits text on sentence boundaries (., !, ?), then
greedily packs sentences into chunks up to a configurable token
limit with overlap for context continuity between chunks.
The embed_chunks() SQL generator builtin wraps chunking and
embedding into a single function:
SELECT * FROM embed_chunks('long text here...');
Returns rows of (chunk_seq INT, chunk STRING, embedding VECTOR),
where each row is an independently embedded segment of the input
text. This is the foundation for the vectorizer's background job
that will embed table rows in batches.
New packages and changes:
- pkg/embedding/chunker: sentence-aware text chunker with
configurable max tokens and overlap
- pkg/embedding/tokenizer: add TokenCount() method for efficient
token counting without full encoding
- pkg/embedding/engine: add Tokenizer() accessor for chunker use
- pkg/sql/sem/builtins: embed_chunks() generator builtin
Epic: none
Release note: None
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
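The split-then-greedily-pack-with-overlap algorithm above can be sketched as follows. This is illustrative: the real chunker counts WordPiece tokens via TokenCount, which is approximated here by whitespace word count:

```go
package main

import (
	"fmt"
	"strings"
)

// chunkSentences splits text on sentence enders (., !, ?), then
// greedily packs sentences into chunks of at most maxTokens,
// carrying the last sentence of each chunk forward as overlap.
func chunkSentences(text string, maxTokens int) []string {
	// Split into sentences (no abbreviation handling in this sketch).
	var sentences []string
	start := 0
	for i, r := range text {
		if r == '.' || r == '!' || r == '?' {
			if s := strings.TrimSpace(text[start : i+1]); s != "" {
				sentences = append(sentences, s)
			}
			start = i + 1
		}
	}
	if s := strings.TrimSpace(text[start:]); s != "" {
		sentences = append(sentences, s)
	}
	// Greedily pack sentences into chunks.
	var chunks, cur []string
	curTokens := 0
	for _, s := range sentences {
		n := len(strings.Fields(s))
		if curTokens+n > maxTokens && len(cur) > 0 {
			chunks = append(chunks, strings.Join(cur, " "))
			// Overlap: the next chunk starts with the previous
			// chunk's last sentence for context continuity.
			last := cur[len(cur)-1]
			cur = []string{last}
			curTokens = len(strings.Fields(last))
		}
		cur = append(cur, s)
		curTokens += n
	}
	if len(cur) > 0 {
		chunks = append(chunks, strings.Join(cur, " "))
	}
	return chunks
}

func main() {
	for _, c := range chunkSentences("One two three. Four five six. Seven eight.", 6) {
		fmt.Println(c)
	}
	// prints:
	// One two three. Four five six.
	// Four five six. Seven eight.
}
```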
Add catalog metadata for the vectorizer feature, which will automatically generate and maintain vector embeddings for table rows.
The Vectorizer message in catpb stores the configuration:
- source_columns: which columns to embed
- template: how to combine columns into text
- embedding_table_id: companion table storing embeddings
- schedule_id: periodic job that generates embeddings
- model: embedding model name (e.g. all-MiniLM-L6-v2)
- schedule_cron: how often the job runs
- batch_size: rows per job invocation
The message follows the RowLevelTTL pattern: defined in catpb/catalog.proto and referenced as an optional field (73) on TableDescriptor in descpb/structured.proto.
Epic: none
Release note: None
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add the `pkg/sql/vectorizer` package with helper functions that
generate SQL for the companion embedding table and view. These will
be called by the CREATE VECTORIZER DDL execution (Phase 6d).
The companion table stores one embedding row per chunk:
<table>_embeddings (
embedding_uuid UUID PRIMARY KEY,
source_<pk> <pk_type> NOT NULL REFERENCES <table> ON DELETE CASCADE,
chunk_seq INT8 NOT NULL DEFAULT 0,
chunk STRING NOT NULL,
embedding VECTOR(<dims>) NOT NULL,
UNIQUE (source_<pk>, chunk_seq)
)
A companion view joins source and embedding tables for convenient
querying:
<table>_embeddings_view AS
SELECT s.*, e.chunk_seq, e.chunk, e.embedding
FROM <table> s JOIN <table>_embeddings e ON ...
Both single-column and composite primary keys are supported.
Epic: none
Release note: None
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add SQL parser support for the vectorizer DDL statements:
CREATE VECTORIZER ON <table>
USING COLUMN (col1, col2, ...)
[WITH model = '...', schedule = '...', batch_size = '...']
DROP VECTORIZER [IF EXISTS] ON <table>
This adds:
- AST nodes (CreateVectorizer, DropVectorizer) in tree/vectorizer.go
- Grammar rules in sql.y with VECTORIZER as an unreserved keyword
- Statement dispatch via opaque.go to stub planner methods
- Parser test data and contextual help test coverage
The planner methods currently return "not yet implemented" errors;
actual DDL execution will follow in Phase 6d.
Epic: none
Release note: None
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement the planner methods for CREATE VECTORIZER and DROP VECTORIZER that previously returned "not yet implemented" stubs.
CREATE VECTORIZER ON <table> USING COLUMN (<cols>) [WITH <opts>]:
- Resolves the source table and validates ownership
- Validates that the specified columns exist
- Parses WITH options (model, template, schedule, batch_size) using the exprutil.Evaluator.KVOptions pattern
- Creates a companion embeddings table via internal SQL with:
  - UUID primary key, source FK with CASCADE delete
  - chunk_seq/chunk/embedding columns, VECTOR(384) type
  - UNIQUE constraint on (source columns, chunk_seq)
- Creates a companion view joining source with embeddings
- Sets the Vectorizer protobuf config on the source table descriptor
DROP VECTORIZER [IF EXISTS] ON <table>:
- Resolves the source table and validates ownership
- Checks that a vectorizer is configured (or handles IF EXISTS)
- Drops the companion view and table via internal SQL
- Clears the Vectorizer config from the descriptor
Also registers both plan node types in plan_names.go.
Epic: none
Release note: None
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add the jobs framework integration for the vectorizer feature. When CREATE VECTORIZER runs, it now also creates a scheduled job that periodically fires a vectorizer job. When DROP VECTORIZER runs, it deletes the schedule.
The vectorizer job (resumer) processes rows from the source table that don't yet have embeddings in the companion table: it queries for pending rows, generates embeddings via the embedding engine, and inserts them.
Components added:
- `jobs.proto`: VectorizerDetails, VectorizerProgress messages and TypeVectorizer enum value (35)
- `catalog.proto`: ScheduledVectorizerArgs for schedule executor args
- `tree/show.go`: ScheduledVectorizerExecutor constant
- `vectorizerschedule/`: scheduled job executor that creates vectorizer jobs when the schedule fires
- `vectorizerjob/`: job resumer that processes pending rows and generates embeddings
- `create_vectorizer.go`: creates the scheduled job and stores the schedule ID in the Vectorizer proto
- `drop_vectorizer.go`: deletes the schedule on DROP VECTORIZER
Epic: none
Release note: None
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a new crdb_internal.vectorizer_status virtual table that exposes vectorizer configuration for tables with active vectorizers. The table shows source table identity, source columns, model, schedule, batch size, and companion table/schedule IDs.
This provides operational visibility into configured vectorizers without requiring direct descriptor inspection.
Epic: none
Release note: None
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add support for fetching and embedding content from cloud storage URIs (S3, GCS, HTTP, etc.) in the vectorizer pipeline.
- Add a field to the Vectorizer protobuf message to distinguish between inline column text (column) and URI-based loading (uri).
- Add an option to the CREATE VECTORIZER DDL with validation: URI mode requires exactly one STRING column.
- Add a SQL builtin that fetches file content via the existing cloud storage infrastructure and extracts text. Supports text formats (.txt, .md, .csv, .json, .xml, .html, .yml) and PDF text extraction via ledongthuc/pdf.
- Add a package for content type detection and text extraction, with tests.
sql/vectorizer: add S3 URI loading mode and read_uri builtin
Introduces the Embedder interface and adds support for remote embedding
models alongside the existing local ONNX engine. Users can now call
embed() and embed_chunks() with a model argument to use remote providers:
SELECT embed('hello', 'openai/text-embedding-3-small');
SELECT embed('hello', 'google/text-embedding-004');
Remote models require a matching external connection:
CREATE EXTERNAL CONNECTION openai AS 'https://api.openai.com/v1?api_key=sk-...';
CREATE EXTERNAL CONNECTION google AS 'https://REGION-aiplatform.googleapis.com/v1?project=PROJECT&credentials=BASE64_SA_KEY';
Key changes:
- Extract Embedder interface; add context.Context to Engine methods.
- Add model registry with dimension/provider metadata for 6 models.
- Implement OpenAI and Google Vertex AI clients with retry, error
classification (pgcodes), and response size limits.
- Google client supports both static access tokens (for testing) and
service account key authentication with automatic token refresh.
- Add ResolveRemoteEmbedder that parses connection URIs per-provider.
- Wire up 2-arg overloads of embed() and embed_chunks() that resolve
the model, look up the external connection, and call the remote API.
- CREATE VECTORIZER now uses the model registry for dimensions and
validates the external connection at DDL time.
Release note: None
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
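Resolving a provider from an external connection URI of the form shown above might look like this. This is an illustrative sketch; ResolveRemoteEmbedder's actual parsing is richer (per-provider rules, credential validation):

```go
package main

import (
	"fmt"
	"net/url"
)

// remoteConfig holds what a remote client needs to dial the provider.
// Field names here are illustrative.
type remoteConfig struct {
	Endpoint string
	APIKey   string
	Project  string
}

// parseConnectionURI pulls credentials out of the query string and
// keeps the rest of the URI as the endpoint to dial.
func parseConnectionURI(raw string) (remoteConfig, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return remoteConfig{}, err
	}
	q := u.Query()
	cfg := remoteConfig{
		APIKey:  q.Get("api_key"),
		Project: q.Get("project"),
	}
	u.RawQuery = "" // never dial or log with credentials in the URL
	cfg.Endpoint = u.String()
	return cfg, nil
}

func main() {
	cfg, _ := parseConnectionURI("https://api.openai.com/v1?api_key=sk-demo")
	fmt.Println(cfg.Endpoint, cfg.APIKey) // prints https://api.openai.com/v1 sk-demo
}
```

Keeping credentials in the external connection rather than in SQL text means the model argument to embed() stays safe to log and cache.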
The vectorizer background job was hardcoded to use the local ONNX engine. Update it to read the model and connection name from the vectorizer config and resolve the appropriate embedder (local or remote) via ResolveRemoteEmbedder. Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
The vectorizer background job previously only detected missing rows (new inserts). This extends it with two capabilities:
1. Stale row detection: compares each source row's MVCC timestamp against last_embedded_at in the companion table to find rows whose content changed after embedding. Stale rows are re-embedded via ON CONFLICT DO UPDATE.
2. URI loading mode: when the vectorizer's loading mode is "uri", the job fetches file content from cloud storage (S3, GCS, HTTP) via ExternalStorageFromURI, extracts text, and embeds that instead of the raw column value.
The companion table schema gains a last_embedded_at TIMESTAMPTZ column to support the stale detection comparison.
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Adds support for embedding images alongside text using Google Vertex AI's multimodalembedding@001 model. Text and images are projected into the same 1408-dim vector space, enabling cross-modal semantic search (e.g., searching images with text queries).
Key changes:
- Add ImageEmbedder interface (EmbedImage, EmbedImageBatch) and Modality bitmask (ModalityText, ModalityImage) to the embedding package.
- Extend the Google Vertex AI client to handle both text-only and multimodal request/response formats, with EmbedImage/EmbedImageBatch methods that base64-encode images for the API.
- Add embed_image(image BYTES, model STRING) SQL builtin for ad-hoc image embedding queries.
- Add VectorizerInputType enum (TEXT/IMAGE) to the Vectorizer protobuf. CREATE VECTORIZER inspects column types: BYTES columns automatically select the image path and validate that the model supports images.
- Update the background job to dispatch to ImageEmbedder for image input types.
Tested end-to-end: 5 distinct images (cat, dog, car, beach, mountain) embedded via Vertex AI; text queries rank the correct image first in all cases.
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Pradyum/remote models
Add two enhancements to the vectorizer:
1. Add a `sql.vectorizer.default_schedule` cluster setting that controls the default cron expression for vectorizer background jobs. Previously hardcoded to `@every 5m`, this can now be changed globally. Individual vectorizers can still override it via `WITH schedule = '...'`.
2. Automatically create a vector index with cosine distance on the companion embeddings table. The `VECTOR INDEX (embedding vector_cosine_ops)` is included inline in the CREATE TABLE statement, so every new vectorizer gets an efficient nearest-neighbor index without requiring manual index creation.
Epic: none
Release note: None
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Enable embedding of image files (jpg, png, etc.) fetched from cloud storage URIs using multimodal models like Vertex AI's multimodalembedding@001.
- Add content type classification (ContentType, ClassifyURI, IsImage) to the content package for routing image URIs to the ImageEmbedder pipeline instead of text extraction.
- Update the vectorizer background job to detect image URIs and embed them via ImageEmbedder.EmbedImage(), while continuing to batch-embed text URIs via EmbedBatch(). Mixed batches containing both text and image URIs are handled within a single job invocation.
- Add a read_uri_bytes(uri) SQL builtin that returns raw bytes from cloud storage, enabling manual image embedding via embed_image().
- Fix the Vertex AI multimodal client to send one instance per API request in both EmbedBatch and EmbedImageBatch, since the multimodal endpoint rejects multi-instance requests.
- Remove the DDL restriction that prevented URI loading mode with image-capable models.
- Consolidate readURIContent as a thin wrapper over readURIBytes to eliminate duplicated cloud storage fetching logic.
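The ClassifyURI routing described above can be sketched as a simple extension switch (illustrative; the real content package's names and coverage may differ):

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

type contentType int

const (
	contentUnknown contentType = iota
	contentText
	contentImage
)

// classifyURI routes a storage URI by file extension: image
// extensions go to the ImageEmbedder path, known text extensions
// to text extraction.
func classifyURI(uri string) contentType {
	p := uri
	if u, err := url.Parse(uri); err == nil && u.Path != "" {
		p = u.Path // drop scheme, host, and any query string
	}
	switch strings.ToLower(path.Ext(p)) {
	case ".jpg", ".jpeg", ".png", ".gif", ".webp":
		return contentImage
	case ".txt", ".md", ".csv", ".json", ".xml", ".html", ".yml", ".pdf":
		return contentText
	default:
		return contentUnknown
	}
}

func main() {
	fmt.Println(classifyURI("s3://bucket/photos/dog.JPG") == contentImage) // prints true
	fmt.Println(classifyURI("gs://bucket/readme.md") == contentText)       // prints true
}
```

Parsing the URI first matters for presigned URLs, where query parameters after the path would otherwise confuse a naive extension check.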
sql/vectorizer: integrate S3 URI loading with multimodal image embedding
Add a standalone Go terminal chat application that demonstrates CockroachDB's vectorizer feature. The app performs semantic search over a books catalog stored in CockroachDB and generates responses using a local Ollama LLM (llama 3.1).
Key features:
- Two-step hybrid search: the LLM extracts structured filters (price, region) from natural language, then CockroachDB executes a filtered vector similarity query combining SQL WHERE clauses with the embed() builtin and cosine distance operator.
- Streaming LLM responses via Ollama's native chat API.
- Demo dataset of 10 programming books with price and region data.
- Automatic setup via --setup flag (creates table, inserts data, creates vectorizer).
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Add a `sql.vectorize.enabled` cluster setting (default: false) that gates all vectorization features. When disabled, `embed()`, `embed_chunks()`, `embed_image()`, `CREATE VECTORIZER`, `DROP VECTORIZER`, and the vectorizer background job all return a clear user-facing error directing users to enable the setting. Also default `--embedding-libs` to `./onnxruntime/lib` relative to the working directory for easier local development. Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
…nto crdb_vectorizer
Extend the LLM query classification step to detect messages that don't require a database search (greetings, follow-ups from history, unrelated questions). When needs_search is false, the app responds directly from conversation history without hitting CockroachDB, avoiding irrelevant vector search results and unnecessary latency. Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Replace the hardcoded bookstore demo with a live CS research paper
archive that fetches ~500 papers from arXiv across 10 categories
(cs.AI, cs.CL, cs.CR, cs.CV, cs.DB, cs.DC, cs.DS, cs.LG, cs.SE,
cs.PL).
Key changes:
- Add arxiv.go with arXiv API client (XML/Atom parsing, pagination,
rate limiting, dedup).
- New papers table schema with arxiv_id, title, authors, abstract,
category, pdf_link, and published columns.
- Split --setup into subcommands: `setup load` (fetch and insert
papers) and `setup vectorizer` (create vectorizer). Bare `setup`
runs both.
- Update RAG pipeline: search filters use category and year range
instead of price and region. Prompts rewritten for research paper
context.
- Improve handling of relative time expressions ("last year",
"recent") in filter extraction prompt.
- Change default vectorizer schedule to @every 30s.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Add an in-database vectorizer that automatically generates and maintains vector embeddings for table rows.
Epic: none
🤖 Generated with Claude Code