feat: add vector distance and array math functions #21371
Open
crm26 wants to merge 3 commits into apache:main
Add 6 new scalar functions to datafusion-functions-nested:

- cosine_distance(array, array): cosine distance (1 - cosine similarity)
- inner_product(array, array): dot product
- array_normalize(array): L2 unit normalization
- array_add(array, array): element-wise addition
- array_subtract(array, array): element-wise subtraction
- array_scale(array, float): scalar multiplication

Shared math primitives (dot_product, magnitude, sum_of_squares) extracted into vector_math.rs to avoid duplication across functions.

Includes aliases (list_*, dot_product), 29 unit tests, and a sqllogictest file with vector search pattern coverage.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
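To make the element-wise semantics in this commit message concrete, here is a minimal sketch over plain `f64` slices. It is not the PR's code: the PR works on Arrow list arrays, the function names `add`, `subtract`, `scale`, and `normalize` are hypothetical, and returning `None` for mismatched lengths or a zero vector is an assumed policy that the PR may handle differently.

```rust
/// Element-wise sum. `None` on mismatched lengths (assumed policy).
fn add(a: &[f64], b: &[f64]) -> Option<Vec<f64>> {
    if a.len() != b.len() {
        return None;
    }
    Some(a.iter().zip(b).map(|(x, y)| x + y).collect())
}

/// Element-wise difference. `None` on mismatched lengths (assumed policy).
fn subtract(a: &[f64], b: &[f64]) -> Option<Vec<f64>> {
    if a.len() != b.len() {
        return None;
    }
    Some(a.iter().zip(b).map(|(x, y)| x - y).collect())
}

/// Multiply every element by a scalar factor.
fn scale(a: &[f64], factor: f64) -> Vec<f64> {
    a.iter().map(|x| x * factor).collect()
}

/// L2 unit normalization: divide each element by the vector's Euclidean norm.
/// `None` for a zero-magnitude vector (assumed policy).
fn normalize(a: &[f64]) -> Option<Vec<f64>> {
    let norm = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    if norm == 0.0 {
        return None;
    }
    Some(a.iter().map(|x| x / norm).collect())
}
```

For example, `add(&[3.0, 4.0], &[1.0, 2.0])` yields `Some(vec![4.0, 6.0])`, and `normalize(&[3.0, 4.0])` yields a vector close to `[0.6, 0.8]`, whose magnitude is 1.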
Adds cosine_distance, inner_product, array_normalize, array_add, array_subtract, and array_scale to datafusion-functions-nested.

Shared primitives in vector_math.rs (dot_product_f64, magnitude_f64, sum_of_squares_f64, convert_to_f64_array) are reused across all functions and the existing array_distance. Consolidates the duplicate convert_to_f64_array from distance.rs into the shared module.

Functions:

    cosine_distance(a, b) → float64 (aliases: list_cosine_distance)
    inner_product(a, b) → float64 (aliases: list_inner_product, dot_product)
    array_normalize(a) → list(float64) (aliases: list_normalize)
    array_add(a, b) → list(float64) (aliases: list_add)
    array_subtract(a, b) → list(float64) (aliases: list_subtract)
    array_scale(a, f) → list(float64) (aliases: list_scale)

Enables vector search in standard SQL:

    SELECT id, cosine_distance(embedding, ARRAY[0.1, 0.2, ...]) as dist
    FROM documents
    ORDER BY dist
    LIMIT 10

79 tests, sqllogictest coverage, clippy clean.
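The vector search query in this commit message amounts to a brute-force nearest-neighbor scan. The sketch below shows that logic in plain Rust, assuming in-memory `(id, embedding)` pairs in place of a `documents` table. The `nearest` helper is hypothetical, and both the `None` result for zero or mismatched vectors and the choice to skip such rows are assumptions rather than the PR's documented null handling.

```rust
/// Cosine distance: 1 - (a . b) / (|a| * |b|).
/// `None` for mismatched lengths or a zero-magnitude input (assumed policy).
fn cosine_distance(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(1.0 - dot / (norm_a * norm_b))
}

/// Brute-force analogue of `ORDER BY dist LIMIT k`: score every row,
/// sort ascending by distance, keep the first `k`.
/// Rows with no defined distance are skipped (an assumption).
fn nearest(docs: &[(u32, Vec<f64>)], query: &[f64], k: usize) -> Vec<(u32, f64)> {
    let mut scored: Vec<(u32, f64)> = docs
        .iter()
        .filter_map(|(id, emb)| cosine_distance(emb, query).map(|d| (*id, d)))
        .collect();
    // total_cmp gives a total order on f64, so sorting cannot panic on NaN.
    scored.sort_by(|l, r| l.1.total_cmp(&r.1));
    scored.truncate(k);
    scored
}
```

With documents `[1, 0]`, `[0, 1]`, and `[1, 1]` under ids 1, 2, and 3, a query of `[1, 0]` with `k = 2` returns ids 1 and 3 in that order: the identical vector has distance 0, and the orthogonal one (distance 1) is cut off by the limit.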
Summary

Adds vector distance and array math functions to datafusion-functions-nested, enabling vector search and array algebra in standard SQL.

Functions

- cosine_distance(a, b)
- inner_product(a, b)
- array_normalize(a)
- array_add(a, b)
- array_subtract(a, b)
- array_scale(a, f)

All have list_* aliases. inner_product is also aliased as dot_product.

Design

Shared primitives in vector_math.rs:

- dot_product_f64(a, b): used by inner_product and cosine_distance
- magnitude_f64(a): used by cosine_distance and array_normalize
- sum_of_squares_f64(a): used by magnitude_f64
- convert_to_f64_array(a): shared with the existing array_distance

The duplicate convert_to_f64_array in the existing distance.rs is consolidated into the shared module.

Follows the exact pattern of the existing array_distance function: same signature style, coerce_types, null handling, and type support (Float32, Float64, Int32, Int64, FixedSizeList, LargeList, List).

Tests

79 tests including: normal inputs, null handling, zero vectors, orthogonal vectors, empty arrays, Float32/Float64, mismatched lengths, vector search ranking pattern. Sqllogictest coverage in vector_functions.slt. Clippy clean.
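The convert-then-compute layering described under Design can be sketched in plain Rust as follows. Everything here is illustrative: the `NumericVec` enum is a hypothetical stand-in for the Arrow numeric arrays the PR accepts, `to_f64_vec` only loosely mirrors the role of convert_to_f64_array, and returning `None` on mismatched lengths is an assumption.

```rust
/// Hypothetical stand-in for the numeric element types the PR lists.
enum NumericVec {
    Float32(Vec<f32>),
    Float64(Vec<f64>),
    Int32(Vec<i32>),
    Int64(Vec<i64>),
}

/// Widen any supported element type to f64 so the math is written once.
fn to_f64_vec(v: &NumericVec) -> Vec<f64> {
    match v {
        NumericVec::Float32(xs) => xs.iter().map(|&x| f64::from(x)).collect(),
        NumericVec::Float64(xs) => xs.clone(),
        NumericVec::Int32(xs) => xs.iter().map(|&x| f64::from(x)).collect(),
        // i64 values beyond 2^53 in magnitude may lose precision in f64.
        NumericVec::Int64(xs) => xs.iter().map(|&x| x as f64).collect(),
    }
}

/// Shared primitive: sum of squared elements.
fn sum_of_squares(a: &[f64]) -> f64 {
    a.iter().map(|x| x * x).sum()
}

/// Shared primitive: Euclidean norm, built on `sum_of_squares`.
fn magnitude(a: &[f64]) -> f64 {
    sum_of_squares(a).sqrt()
}

/// Shared primitive: dot product of two equal-length slices.
fn dot_product(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// A user-facing function becomes a thin wrapper: convert, validate, compute.
fn inner_product(a: &NumericVec, b: &NumericVec) -> Option<f64> {
    let (a, b) = (to_f64_vec(a), to_f64_vec(b));
    if a.len() != b.len() {
        return None; // assumed policy for mismatched lengths
    }
    Some(dot_product(&a, &b))
}
```

Converting once at the boundary lets an Int32 vector be combined with a Float32 one without per-type math code, which is presumably the motivation for sharing a single conversion helper.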