diff --git a/docs/en/guides/51-ai-functions/02-built-in-functions.md b/docs/en/guides/51-ai-functions/02-built-in-functions.md index e54b382fac..70c2913820 100644 --- a/docs/en/guides/51-ai-functions/02-built-in-functions.md +++ b/docs/en/guides/51-ai-functions/02-built-in-functions.md @@ -2,6 +2,10 @@ title: Built-in AI Functions --- +import FunctionDescription from '@site/src/components/FunctionDescription'; + + + # Built-in AI Functions Databend provides built-in AI functions powered by Azure OpenAI Service for seamless integration of AI capabilities into your SQL workflows. @@ -18,7 +22,7 @@ Databend provides built-in AI functions powered by Azure OpenAI Service for seam ## Vector Storage in Databend -Databend stores embedding vectors using the `ARRAY(FLOAT NOT NULL)` data type, enabling direct similarity calculations with the `cosine_distance` function in SQL. +Databend stores embedding vectors using the `VECTOR(1536)` data type, enabling direct similarity calculations with the `cosine_distance` function in SQL. ## Example: Semantic Search with Embeddings @@ -28,7 +32,8 @@ CREATE TABLE articles ( id INT, title VARCHAR, content VARCHAR, - embedding ARRAY(FLOAT NOT NULL) + embedding VECTOR(1536), + VECTOR INDEX idx_embedding(embedding) distance='cosine' ); -- Store documents with their vector embeddings diff --git a/docs/en/sql-reference/00-sql-reference/10-data-types/index.md b/docs/en/sql-reference/00-sql-reference/10-data-types/index.md index eaf017ca19..4515639cb1 100644 --- a/docs/en/sql-reference/00-sql-reference/10-data-types/index.md +++ b/docs/en/sql-reference/00-sql-reference/10-data-types/index.md @@ -34,6 +34,7 @@ The following is a list of semi-structured data types in Databend: | [TUPLE](tuple.md) | N/A | ('2023-02-14','Valentine Day') | An ordered collection of values of different data types, accessed by their index. | | [MAP](map.md) | N/A | `{"a":1, "b":2, "c":3}` | A set of key-value pairs where each key is unique and maps to a value. 
| | [VARIANT](variant.md) | JSON | `[1,{"a":1,"b":{"c":2}}]` | Collection of elements of different data types, including `ARRAY` and `OBJECT`. | +| [VECTOR](vector.md) | N/A | [1.0, 2.1, 3.2] | Multi-dimensional arrays of 32-bit floating-point numbers for machine learning and similarity search operations. | | [BITMAP](bitmap.md) | N/A | 0101010101 | A binary data type used to represent a set of values, where each bit represents the presence or absence of a value. | ## Data Type Conversions diff --git a/docs/en/sql-reference/00-sql-reference/10-data-types/vector.md b/docs/en/sql-reference/00-sql-reference/10-data-types/vector.md new file mode 100644 index 0000000000..7d7f93e4b4 --- /dev/null +++ b/docs/en/sql-reference/00-sql-reference/10-data-types/vector.md @@ -0,0 +1,142 @@ +--- +title: Vector +--- + +import FunctionDescription from '@site/src/components/FunctionDescription'; + + + +import EEFeature from '@site/src/components/EEFeature'; + + + + +The VECTOR data type stores multi-dimensional arrays of 32-bit floating-point numbers, designed for machine learning, AI applications, and similarity search operations. Each vector has a fixed dimension (length) specified at creation time. + +## Syntax + +```sql +column_name VECTOR(<dimension>) +``` + +Where: +- `dimension`: The dimension (length) of the vector. Must be a positive integer with a maximum value of 4096. +- Elements are 32-bit floating-point numbers. + +## Vector Indexing + +Databend supports creating vector indexes using the HNSW (Hierarchical Navigable Small World) algorithm for fast approximate nearest neighbor search, delivering **23x faster** query performance. + +### Index Syntax + +```sql +VECTOR INDEX index_name(column_name) distance='cosine,l1,l2' +``` + +Where: +- `index_name`: Name of the vector index +- `column_name`: Name of the VECTOR column to index +- `distance`: Distance functions to support. 
Can be `'cosine'`, `'l1'`, `'l2'`, or combinations like `'cosine,l1,l2'` + + +### Supported Distance Functions + +| Function | Description | Use Case | +|----------|-------------|----------| +| **[cosine_distance](/sql/sql-functions/vector-distance-functions/vector-cosine-distance)** | Calculates cosine distance between vectors | Semantic similarity, text embeddings | +| **[l1_distance](/sql/sql-functions/vector-distance-functions/vector-l1-distance)** | Calculates L1 distance (Manhattan distance) | Feature comparison, sparse data | +| **[l2_distance](/sql/sql-functions/vector-distance-functions/vector-l2-distance)** | Calculates L2 distance (Euclidean distance) | Geometric similarity, image features | + +## Basic Usage + +### Step 1: Create Table with Vector + +```sql +-- Create table with vector index for efficient similarity search +CREATE OR REPLACE TABLE products ( + id INT, + name VARCHAR, + features VECTOR(3), + VECTOR INDEX idx_features(features) distance='cosine' +); +``` + +**Note**: The vector index is automatically built when data is inserted into the table. 
+ +### Step 2: Insert Vector Data + +```sql +-- Insert product feature vectors +INSERT INTO products VALUES + (1, 'Product A', [1.0, 2.0, 3.0]::VECTOR(3)), + (2, 'Product B', [2.0, 1.0, 4.0]::VECTOR(3)), + (3, 'Product C', [1.5, 2.5, 2.0]::VECTOR(3)), + (4, 'Product D', [3.0, 1.0, 1.0]::VECTOR(3)); +``` + +### Step 3: Perform Similarity Search + +```sql +-- Find products similar to a query vector [1.2, 2.1, 2.8] +SELECT + id, + name, + features, + cosine_distance(features, [1.2, 2.1, 2.8]::VECTOR(3)) AS distance +FROM products +ORDER BY distance ASC +LIMIT 3; +``` + +Result: +``` +┌─────┬───────────┬───────────────┬──────────────────┐ +│ id │ name │ features │ distance │ +├─────┼───────────┼───────────────┼──────────────────┤ +│ 2 │ Product B │ [2.0,1.0,4.0] │ 0.5384207 │ +│ 3 │ Product C │ [1.5,2.5,2.0] │ 0.5772848 │ +│ 1 │ Product A │ [1.0,2.0,3.0] │ 0.60447836 │ +└─────┴───────────┴───────────────┴──────────────────┘ +``` + +**Explanation**: The query finds the 3 most similar products to the search vector `[1.2, 2.1, 2.8]`. Lower cosine distance values indicate higher similarity. 
+ +## Unloading and Loading Vector Data + +### Unloading Vector Data + +```sql +-- Export vector data to stage +COPY INTO @mystage/unload/ +FROM ( + SELECT + id, + name, + features + FROM products +) +FILE_FORMAT = (TYPE = 'PARQUET'); +``` + +### Loading Vector Data + +```sql +-- Create target table for import +CREATE OR REPLACE TABLE products_imported ( + id INT, + name VARCHAR, + features VECTOR(3), + VECTOR INDEX idx_features(features) distance='cosine' +); + +-- Import vector data +COPY INTO products_imported (id, name, features) +FROM ( + SELECT + id, + name, + features + FROM @mystage/unload/ +) +FILE_FORMAT = (TYPE = 'PARQUET'); +``` diff --git a/docs/en/sql-reference/20-sql-functions/11-ai-functions/02-ai-embedding-vector.md b/docs/en/sql-reference/20-sql-functions/11-ai-functions/02-ai-embedding-vector.md index cc40822b2b..d4248402ad 100644 --- a/docs/en/sql-reference/20-sql-functions/11-ai-functions/02-ai-embedding-vector.md +++ b/docs/en/sql-reference/20-sql-functions/11-ai-functions/02-ai-embedding-vector.md @@ -1,8 +1,11 @@ --- title: "AI_EMBEDDING_VECTOR" -description: "Creating embeddings using the ai_embedding_vector function in Databend" --- +import FunctionDescription from '@site/src/components/FunctionDescription'; + + + This document provides an overview of the ai_embedding_vector function in Databend and demonstrates how to create document embeddings using this function. The main code implementation can be found [here](https://github.com/databendlabs/databend/blob/1e93c5b562bd159ecb0f336bb88fd1b7f9dc4a62/src/common/openai/src/embedding.rs). 
@@ -50,7 +53,8 @@ CREATE TABLE documents ( id INT, title VARCHAR, content VARCHAR, - embedding ARRAY(FLOAT NOT NULL) + embedding VECTOR(1536), + VECTOR INDEX idx_embedding(embedding) distance='cosine' ); ``` diff --git a/docs/en/sql-reference/20-sql-functions/11-vector-distance-functions/00-vector-cosine-distance.md b/docs/en/sql-reference/20-sql-functions/11-vector-distance-functions/00-vector-cosine-distance.md index e23e9e7af7..d0c9561434 100644 --- a/docs/en/sql-reference/20-sql-functions/11-vector-distance-functions/00-vector-cosine-distance.md +++ b/docs/en/sql-reference/20-sql-functions/11-vector-distance-functions/00-vector-cosine-distance.md @@ -13,8 +13,8 @@ COSINE_DISTANCE(vector1, vector2) ## Arguments -- `vector1`: First vector (ARRAY(FLOAT NOT NULL)) -- `vector2`: Second vector (ARRAY(FLOAT NOT NULL)) +- `vector1`: First vector (VECTOR Data Type) +- `vector2`: Second vector (VECTOR Data Type) ## Returns @@ -51,7 +51,8 @@ Create a table with vector data: ```sql CREATE OR REPLACE TABLE vectors ( id INT, - vec ARRAY(FLOAT NOT NULL) + vec VECTOR(3), + VECTOR INDEX idx_vec(vec) distance='cosine' ); INSERT INTO vectors VALUES @@ -65,7 +66,7 @@ Find the vector most similar to [1, 2, 3]: ```sql SELECT vec, - COSINE_DISTANCE(vec, [1.0000, 2.0000, 3.0000]) AS distance + COSINE_DISTANCE(vec, [1.0000, 2.0000, 3.0000]::VECTOR(3)) AS distance FROM vectors ORDER BY diff --git a/docs/en/sql-reference/20-sql-functions/11-vector-distance-functions/01-vector-l2-distance.md b/docs/en/sql-reference/20-sql-functions/11-vector-distance-functions/01-vector-l2-distance.md index 90c5247100..fefb301c73 100644 --- a/docs/en/sql-reference/20-sql-functions/11-vector-distance-functions/01-vector-l2-distance.md +++ b/docs/en/sql-reference/20-sql-functions/11-vector-distance-functions/01-vector-l2-distance.md @@ -1,8 +1,11 @@ --- title: 'L2_DISTANCE' -description: 'Measuring Euclidean distance between vectors in Databend' --- +import FunctionDescription from 
'@site/src/components/FunctionDescription'; + + + Calculates the Euclidean (L2) distance between two vectors, measuring the straight-line distance between them in vector space. ## Syntax @@ -13,8 +16,8 @@ L2_DISTANCE(vector1, vector2) ## Arguments -- `vector1`: First vector (ARRAY(FLOAT NOT NULL)) -- `vector2`: Second vector (ARRAY(FLOAT NOT NULL)) +- `vector1`: First vector (VECTOR Data Type) +- `vector2`: Second vector (VECTOR Data Type) ## Returns @@ -51,7 +54,8 @@ Create a table with vector data: ```sql CREATE OR REPLACE TABLE vectors ( id INT, - vec ARRAY(FLOAT NOT NULL) + vec VECTOR(3), + VECTOR INDEX idx_vec(vec) distance='l2' ); INSERT INTO vectors VALUES @@ -66,7 +70,7 @@ Find the vector closest to [1, 2, 3] using L2 distance: SELECT id, vec, - L2_DISTANCE(vec, [1.0000, 2.0000, 3.0000]) AS distance + L2_DISTANCE(vec, [1.0000, 2.0000, 3.0000]::VECTOR(3)) AS distance FROM vectors ORDER BY diff --git a/docs/en/sql-reference/20-sql-functions/11-vector-distance-functions/02-vector-l1-distance.md b/docs/en/sql-reference/20-sql-functions/11-vector-distance-functions/02-vector-l1-distance.md new file mode 100644 index 0000000000..a9b2c085ae --- /dev/null +++ b/docs/en/sql-reference/20-sql-functions/11-vector-distance-functions/02-vector-l1-distance.md @@ -0,0 +1,74 @@ +--- +title: 'L1_DISTANCE' +--- + +import FunctionDescription from '@site/src/components/FunctionDescription'; + + + +Calculates the Manhattan (L1) distance between two vectors, measuring the sum of absolute differences between corresponding elements. + +## Syntax + +```sql +L1_DISTANCE(vector1, vector2) +``` + +## Arguments + +- `vector1`: First vector (VECTOR Data Type) +- `vector2`: Second vector (VECTOR Data Type) + +## Returns + +Returns a FLOAT value representing the Manhattan (L1) distance between the two vectors. 
The value is always non-negative: +- 0: Identical vectors +- Larger values: Vectors that are farther apart + +## Description + +The L1 distance, also known as Manhattan distance or taxicab distance, calculates the sum of absolute differences between corresponding elements of two vectors. It's useful for feature comparison and sparse data analysis. + +Formula: `L1_DISTANCE(a, b) = |a1 - b1| + |a2 - b2| + ... + |an - bn|` + +## Examples + +### Basic Usage + +```sql +-- Calculate L1 distance between two vectors +SELECT L1_DISTANCE([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]) AS distance; +``` + +Result: +``` +┌──────────┐ +│ distance │ +├──────────┤ +│ 9.0 │ +└──────────┘ +``` + +### Using with VECTOR Type + +```sql +-- Create table with VECTOR columns +CREATE TABLE products ( + id INT, + features VECTOR(3), + VECTOR INDEX idx_features(features) distance='l1' +); + +INSERT INTO products VALUES + (1, [1.0, 2.0, 3.0]::VECTOR(3)), + (2, [2.0, 3.0, 4.0]::VECTOR(3)); + +-- Find products similar to a query vector using L1 distance +SELECT + id, + features, + L1_DISTANCE(features, [1.5, 2.5, 3.5]::VECTOR(3)) AS distance +FROM products +ORDER BY distance ASC +LIMIT 5; +``` diff --git a/docs/en/sql-reference/20-sql-functions/11-vector-distance-functions/index.md b/docs/en/sql-reference/20-sql-functions/11-vector-distance-functions/index.md index 8f4d4bd8cb..e9c8f1a5bd 100644 --- a/docs/en/sql-reference/20-sql-functions/11-vector-distance-functions/index.md +++ b/docs/en/sql-reference/20-sql-functions/11-vector-distance-functions/index.md @@ -10,11 +10,13 @@ This section provides reference information for vector distance functions in Dat | Function | Description | Example | |----------|-------------|--------| | [COSINE_DISTANCE](./00-vector-cosine-distance.md) | Calculates angular distance between vectors (range: 0-1) | `COSINE_DISTANCE([1,2,3], [4,5,6])` | +| [L1_DISTANCE](./02-vector-l1-distance.md) | Calculates Manhattan (L1) distance between vectors | `L1_DISTANCE([1,2,3], [4,5,6])` | 
| [L2_DISTANCE](./01-vector-l2-distance.md) | Calculates Euclidean (straight-line) distance | `L2_DISTANCE([1,2,3], [4,5,6])` | ## Function Comparison | Function | Description | Range | Best For | Use Cases | |----------|-------------|-------|----------|-----------| -| [L2_DISTANCE](./01-vector-l2-distance.md) | Euclidean (straight-line) distance | [0, ∞) | When magnitude matters | • Image similarity
• Geographical data
• Anomaly detection
• Feature-based clustering | | [COSINE_DISTANCE](./00-vector-cosine-distance.md) | Angular distance between vectors | [0, 1] | When direction matters more than magnitude | • Document similarity
• Semantic search
• Recommendation systems
• Text analysis | +| [L1_DISTANCE](./02-vector-l1-distance.md) | Manhattan (L1) distance: sum of absolute element-wise differences | [0, ∞) | When element-wise differences matter | • Feature comparison
• Sparse data analysis | +| [L2_DISTANCE](./01-vector-l2-distance.md) | Euclidean (straight-line) distance | [0, ∞) | When magnitude matters | • Image similarity
• Geographical data
• Anomaly detection
• Feature-based clustering |
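
The trade-offs in the comparison table can be checked outside the database. The following is a minimal Python sketch, not Databend's implementation: the helper names `l1_distance`, `l2_distance`, and `cosine_distance` are hypothetical stand-ins that follow the textbook formulas, and it assumes cosine distance is defined as 1 minus cosine similarity.

```python
import math


def _check_same_length(a, b):
    # Distances are only defined for vectors of the same dimension.
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")


def l1_distance(a, b):
    """Manhattan distance: sum of absolute element-wise differences."""
    _check_same_length(a, b)
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance: square root of the sum of squared differences."""
    _check_same_length(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity (assumed definition; undefined for zero vectors)."""
    _check_same_length(a, b)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
doubled = [2.0, 4.0, 6.0]  # same direction as `a`, twice the magnitude

print(l1_distance(a, b))            # 9.0, matching the L1_DISTANCE example
print(l2_distance(a, b))            # sqrt(27), about 5.196
print(cosine_distance(a, b))        # about 0.0254
print(l1_distance(a, doubled))      # 6.0
print(l2_distance(a, doubled))      # sqrt(14), about 3.742
print(cosine_distance(a, doubled))  # approximately 0
```

Scaling a vector changes its L1 and L2 distance from the original but leaves the cosine distance at approximately zero, which is why cosine distance suits cases where direction matters more than magnitude, while L1 and L2 remain sensitive to magnitude.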