feat(vector): Add Spark VECTOR Search TVF with initial KNN algorithm #18432

Merged
yihua merged 15 commits into apache:master from rahil-c:rahil/vector-tvf-master on Apr 6, 2026

Conversation

@rahil-c (Collaborator) commented Mar 31, 2026

Describe the issue this Pull Request addresses

This PR adds distributed vector search capability to Hudi's Spark SQL integration by introducing two new table-valued functions for single-query and batch-query KNN search. Currently there is no way to perform nearest-neighbor search over embedding columns stored in Hudi tables without an external vector database.

Summary and Changelog

Users can now find k-nearest neighbors over Hudi tables using standard Spark SQL, with support for cosine, L2, and dot product distance metrics across float, double, and byte embedding types.
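For intuition, a minimal pure-Scala sketch of the three metrics follows. This is illustrative only; it does not reflect the PR's actual VectorDistanceUtils, which goes through Spark MLlib's DenseVector, and the dot-product sign convention here is an assumption.

```scala
// Illustrative sketch of the three distance metrics; the PR's actual
// implementation uses Spark MLlib's DenseVector with BLAS-backed ops.
object DistanceSketch {
  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "dimension mismatch")
    var s = 0.0; var i = 0
    while (i < a.length) { s += a(i) * b(i); i += 1 }
    s
  }

  def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))

  // cosine distance = 1 - cosine similarity
  def cosineDistance(a: Array[Double], b: Array[Double]): Double =
    1.0 - dot(a, b) / (norm(a) * norm(b))

  // L2 (euclidean) distance
  def l2Distance(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // dot-product "distance": negated so smaller means more similar
  // (sign convention assumed here, not taken from the PR)
  def dotProductDistance(a: Array[Double], b: Array[Double]): Double =
    -dot(a, b)
}
```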

Changes:

  • Added hudi_vector_search TVF for single-query KNN search
  • Added hudi_vector_search_batch TVF for batch-query KNN with broadcast cross-join and window-based top-k ranking
  • Implemented brute-force search behind an extensible algorithm trait for future index-based algorithms
  • Added distance utility UDFs that pre-compute query vector and norm to halve per-row allocations
  • Extended ResolveReferences Catalyst rule to resolve vector search TVF nodes into executable Spark logical plans
  • Added VECTOR metadata dimension validation at analysis time and case-insensitive column lookup
  • Added input validation guards for null k, non-array query vectors, byte corpus integrality, and element type compatibility
  • Added 38 end-to-end tests covering all metrics, element types, COW/MOR tables, edge cases, and exact distance verification
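To picture the batch-query path (broadcast cross-join, then window-based top-k ranking per query), here is a hedged pure-Scala analog; the names Scored and topKPerQuery are invented for illustration, and the real plan uses a row_number() window rather than in-memory grouping.

```scala
// Pure-Scala analog of the batch top-k ranking described above.
// The actual plan uses broadcast cross-join + a row_number() window per query id.
case class Scored(qid: Long, docId: String, distance: Double)

def topKPerQuery(scored: Seq[Scored], k: Int): Map[Long, Seq[Scored]] =
  scored
    .groupBy(_.qid) // one partition per query id, like the window's PARTITION BY
    .map { case (qid, rows) => qid -> rows.sortBy(_.distance).take(k) } // rank ascending, keep top-k
```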

Impact

Two new Spark SQL table-valued functions with output columns _hudi_distance and _hudi_qid. Adds spark-mllib-local dependency (local-only, no cluster overhead). No breaking changes to existing APIs, table format, or write/read paths.

Risk Level

Low - Purely additive feature isolated in new files, registered alongside existing Hudi TVFs using the same SparkSessionExtensions mechanism.

Documentation Update

Requires documentation for TVF syntax/parameters, supported distance metrics and embedding types, output schema contract, and batch-query semantics.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

Implements a new Spark TVF `hudi_vector_search` that performs exact
nearest-neighbor search over embeddings stored in a Hudi table using
brute-force KNN, without requiring a persistent vector index.

New files:
- HoodieVectorSearchTableValuedFunction: unresolved logical node holding
  raw args; parses single-query (table, col, ARRAY(...), k [, metric])
  and batch-query (corpus, col, query_table, col, k [, metric]) modes;
  supports cosine, l2/euclidean, dot_product distance metrics.
- HoodieVectorSearchPlanBuilder: builds the execution plan — single-query
  mode uses withColumn + orderBy + limit(k); batch-query mode uses
  crossJoin(broadcast(queries)) + window rank per _query_id; handles
  Float/Double/Byte embedding types via cast to Double before UDF.
  Fixes cross-join column ambiguity when corpus and query share column
  names by renaming clashing query columns to _query_<colname> before
  the join.

Modified files:
- TableValuedFunctions: registers hudi_vector_search alongside existing
  Hudi TVFs so all Spark adapters (3.3/3.4/3.5/4.0) inject it via
  SparkSessionExtensions.
- HoodieSparkBaseAnalysis (ResolveReferences): resolves
  HoodieVectorSearchTableValuedFunction — evaluates the query vector
  constant at analysis time, resolves table by name or path, delegates
  to HoodieVectorSearchPlanBuilder.
- TestHoodieVectorSearchFunction: end-to-end tests covering single/batch
  query modes, all three distance metrics, DataFrame API, path-based and
  view-based resolution, Float/Double embedding types, MOR tables,
  composability (WHERE, subqueries), error handling, exact distance
  verification, and same-column-name batch query regression test.
@github-actions github-actions bot added the size:XL PR with lines of changes > 1000 label Mar 31, 2026
@github-actions github-actions bot added size:XS PR with lines of changes in <= 10 and removed size:XL PR with lines of changes > 1000 labels Mar 31, 2026
@github-actions github-actions bot added size:XL PR with lines of changes > 1000 and removed size:XS PR with lines of changes in <= 10 labels Mar 31, 2026
while (i < numElements) { result(i) = arrayData.getFloat(i).toDouble; i += 1 }
case IntegerType =>
while (i < numElements) { result(i) = arrayData.getInt(i).toDouble; i += 1 }
case LongType =>
Collaborator (author):

Why is LongType here? I thought VECTOR in Hudi only supports float, double, and int?

Contributor:

Spark SQL can infer integer literals as LongType in some contexts (e.g., ARRAY(1, 2, 3) might produce LongType depending on how the expression is folded). Supporting it defensively here avoids a confusing runtime error for users, even if Hudi's VECTOR type itself only stores float/double/int.

Collaborator (author):

OK, let me look into adding this.

@github-actions github-actions bot added size:XS PR with lines of changes in <= 10 and removed size:XL PR with lines of changes > 1000 labels Apr 1, 2026
val meta = field.metadata
if (meta.contains(HoodieSchema.TYPE_METADATA_FIELD)) {
val typeDesc = meta.getString(HoodieSchema.TYPE_METADATA_FIELD)
val dimPattern = """VECTOR\((\d+)""".r
Collaborator (author):

Is there a cleaner way of getting the dimension? This also feels brittle.

Contributor:

Unfortunately Spark's ArrayType doesn't encode dimension in the schema, so you'd always need to inspect actual data. One option: require the user to pass dimension explicitly as a TVF argument, which avoids the first-row inspection entirely and makes validation upfront. Otherwise, evaluating the first row or the query vector length is about as clean as it gets.

Collaborator (author) @rahil-c commented Apr 3, 2026

I actually think this is a cleaner solution: we can leverage HoodieSchema.parseTypeDescriptor and then, if it is a VECTOR type, get the dimension.

private def extractVectorDimension(df: DataFrame, colName: String): Option[Int] = {
    df.schema.fields.find(_.name == colName).flatMap { field =>
      val meta = field.metadata
      if (meta.contains(HoodieSchema.TYPE_METADATA_FIELD)) {
        val typeDesc = meta.getString(HoodieSchema.TYPE_METADATA_FIELD)
        Try(HoodieSchema.parseTypeDescriptor(typeDesc)) match {
          case Success(v: HoodieSchema.Vector) => Some(v.getDimension)
          case Success(_) => throw new HoodieAnalysisException(
            s"Column '$colName' has type '$typeDesc' which is not a VECTOR type. " +
              "Only VECTOR columns are supported for vector search.")
          case Failure(e) => throw new HoodieAnalysisException(
            s"Column '$colName' has malformed type metadata '$typeDesc': ${e.getMessage}")
        }
      } else None
    }
  }

// Rename any query column that clashes with a corpus column or internal columns.
// Uses a double prefix if the standard rename would itself clash (e.g. "id" -> "_query_id"
// would collide with the internal _query_id column).
val renamedQuery = queryWithId.columns.foldLeft(queryWithId) { (df, qCol) =>
Collaborator (author):

Is there an easier way we can avoid clashes, so we don't need to maintain this extra code?

Contributor:

One common approach is to alias the DataFrames before the join (e.g., corpusDf.as("corpus") / queryDf.as("query")) and reference columns by qualified name throughout. That way you avoid the rename/restore dance entirely, though it does require all downstream column references to be qualified.

Collaborator (author):

@yihua I think you are right; that seems to be the cleanest solution. Let me try this out.
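For reference, the double-prefix rule in the snippet above can be captured as a pure function. This is a sketch only: the helper name and the internal-column set are assumptions, and the aliasing approach discussed here would make the rename unnecessary.

```scala
// String-level analog of the clash-renaming fold shown above.
// The "_query_" prefix comes from the PR; clashFreeName and the exact
// internal-column set are hypothetical.
def clashFreeName(qCol: String, corpusCols: Set[String], internalCols: Set[String]): String =
  if (corpusCols.contains(qCol) || internalCols.contains(qCol)) {
    val renamed = s"_query_$qCol"
    // double-prefix if the standard rename itself collides
    // (e.g. "id" -> "_query_id" would hit the internal _query_id column)
    if (corpusCols.contains(renamed) || internalCols.contains(renamed)) s"_query_$renamed"
    else renamed
  } else qCol
```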

@github-actions github-actions bot added size:XL PR with lines of changes > 1000 and removed size:XS PR with lines of changes in <= 10 labels Apr 1, 2026
@rahil-c rahil-c requested review from bvaradar, voonhous and yihua April 1, 2026 04:00
@rahil-c rahil-c marked this pull request as ready for review April 1, 2026 04:01

// doc_4 [0.707,0.707,0]: cosine distance to [1,0,0] = 1 - 0.707 ~= 0.293
assertEquals("doc_4", result(1).getAs[String]("id"))
assertEquals(1.0 - 0.70710678, result(1).getAs[Double]("_hudi_distance"), 1e-4)
Collaborator (author):

Is there a cleaner way we can do these assertions?

@rahil-c rahil-c requested review from bvaradar and voonhous April 2, 2026 00:51
Member @voonhous left a comment:

Let's fix the PR validation failures. LGTM

@rahil-c rahil-c changed the title from "feat{vector): Add Spark VECTOR Search TVF with intial KNN algorithm" to "feat(vector): Add Spark VECTOR Search TVF with intial KNN algorithm" on Apr 2, 2026
@rahil-c (Collaborator, author) commented Apr 2, 2026

@bvaradar @yihua, if we can do one more review pass today that would be great. I think we are close.

@rahil-c (Collaborator, author) commented Apr 2, 2026

Thanks @bvaradar for taking a look, will just check with @yihua once and then we should be good.

Contributor @yihua left a comment:

@rahil-c could you fix the PR description to follow the template?

@yihua (Contributor) commented Apr 2, 2026

Codecov Report

❌ Patch coverage is 72.39057% with 82 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.55%. Comparing base (35e2bbf) to head (c19c701).
⚠️ Report is 10 commits behind head on master.

Files with missing lines Patch % Lines
.../spark/sql/hudi/analysis/VectorDistanceUtils.scala 55.17% 18 Missing and 8 partials ⚠️
...ogical/HoodieVectorSearchTableValuedFunction.scala 75.29% 13 Missing and 8 partials ⚠️
.../hudi/analysis/HoodieVectorSearchPlanBuilder.scala 79.20% 10 Missing and 11 partials ⚠️
...rk/sql/hudi/analysis/HoodieSparkBaseAnalysis.scala 68.88% 7 Missing and 7 partials ⚠️
Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18432      +/-   ##
============================================
+ Coverage     68.52%   68.55%   +0.03%     
- Complexity    27968    28045      +77     
============================================
  Files          2440     2447       +7     
  Lines        134456   135051     +595     
  Branches      16226    16357     +131     
============================================
+ Hits          92138    92588     +450     
- Misses        35054    35151      +97     
- Partials       7264     7312      +48     

Flag Coverage Δ
common-and-other-modules 44.26% <4.39%> (-0.08%) ⬇️
hadoop-mr-java-client 44.93% <ø> (-0.06%) ⬇️
spark-client-hadoop-common 48.46% <ø> (+0.08%) ⬆️
spark-java-tests 48.86% <72.39%> (+0.09%) ⬆️
spark-scala-tests 45.51% <4.37%> (-0.14%) ⬇️
utilities 38.26% <4.39%> (-0.10%) ⬇️
Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...spark/sql/hudi/analysis/TableValuedFunctions.scala 100.00% <100.00%> (ø)
...rk/sql/hudi/analysis/HoodieSparkBaseAnalysis.scala 73.00% <68.88%> (-1.17%) ⬇️
...ogical/HoodieVectorSearchTableValuedFunction.scala 75.29% <75.29%> (ø)
.../hudi/analysis/HoodieVectorSearchPlanBuilder.scala 79.20% <79.20%> (ø)
.../spark/sql/hudi/analysis/VectorDistanceUtils.scala 55.17% <55.17%> (ø)
... and 52 files with indirect coverage changes


@rahil-c could you check the lines that miss coverage and see if more tests should be added?

Contributor @yihua left a comment:

Nice work on adding the vector search TVF. Left a few comments.

assertEquals(2, result.length)
assertEquals("b1", result(0).getAs[String]("id"))
assertEquals(0.0, result(0).getAs[Double]("_hudi_distance"), 1e-5)
// b2: sqrt(10^2 + 10^2) = sqrt(200) ~= 14.14
Contributor:

Is requiring element-type match (float vs double) between corpus and query table intentional? Many vector search systems silently upcast. If a user stores embeddings as float but their query pipeline produces double, this would be a confusing error. Worth a comment in the TVF docs if this is by design.

Collaborator (author):

Yes, I think this is intentional for now, to minimize the complexity around doing this implicit casting. If you happen to know which systems do this silent upcast let me know, or I can do some more research on which systems allow it and follow up in another PR (I have some future optimization PRs planned for this work, so I can batch it there).

For now, I can leave a comment in the implementation saying we have this defensive type-matching requirement.

Collaborator (author):

To create a GitHub issue for follow-up.

Contributor:

I was thinking of Spark SQL's support for the VALUES clause with raw values in the query, where Spark infers the value type. Not a blocker on this PR; we can follow up.

requireSameLength(corpus.length, queryVector.length)
distFn(new DenseVector(corpus.iterator.map(_.toDouble).toArray), queryDv, queryNorm)
})
case DoubleType => udf((corpus: Seq[Double]) => {
Contributor:

In the FloatType branch, corpus.iterator.map(_.toDouble).toArray allocates a full Array[Double] copy on every row. For the batch UDF factories (createFloatDistanceUdf at line 140), the same pattern appears for both arguments. Have you considered using a reusable buffer or at least corpus.view.map(...) to reduce GC pressure? For single-query mode this is on the hot path for every corpus row.

Collaborator (author):

Let me look into this further.

Collaborator (author):

@yihua I tried to think of some alternatives, but I think leveraging DenseVector from Spark MLlib makes this tricky, so I am thinking we might just have to document this for now.

Some ideas considered which are not ideal:

  1. A reusable buffer inside the UDF is not safe: Spark doesn't guarantee single-threaded UDF execution per partition.
  2. corpus.view.map(...) does not help either: the DenseVector constructor forces materialization anyway, so the array allocation still happens.
  3. Dropping the use of Spark MLlib's DenseVector and inlining the math ourselves: we'd lose MLlib's optimized BLAS-backed dot() and sqdist(). For high-dimensional vectors, the native BLAS path is significantly faster than a manual Scala loop.

Based on this I think the Array[Double] allocation is fundamentally required by DenseVector. Every approach either keeps the same allocation or trades something else off (losing BLAS, unsafe concurrency).
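As a reference point for the trade-off in idea 3, a manual loop like the following avoids the per-row Array[Double] copy but gives up MLlib's BLAS path. Sketch only; dotNoCopy is a hypothetical name, not code from the PR.

```scala
// Hedged sketch of idea 3 above: inlining the dot product over the float
// corpus avoids materializing an Array[Double], at the cost of losing
// MLlib's BLAS-backed implementation.
def dotNoCopy(corpus: Seq[Float], query: Array[Double]): Double = {
  require(corpus.length == query.length, "dimension mismatch")
  var s = 0.0
  var i = 0
  val it = corpus.iterator // no intermediate Array[Double] copy is allocated
  while (it.hasNext) { s += it.next().toDouble * query(i); i += 1 }
  s
}
```

For high-dimensional vectors the BLAS path may still win despite the extra allocation, which is why a micro-benchmark (as discussed below in the thread) is the right follow-up.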

Collaborator (author):

To create a GitHub issue to follow up on the micro-benchmark idea.

Contributor:

Got it. Overall, my concern is that per-vector processing through DenseVector could introduce latency overhead. Given this is the initial implementation, we can check in the code and should follow up with micro-benchmarks.

private def resolveTableToDf(tableName: String): DataFrame = {
if (tableName.contains(StoragePath.SEPARATOR)) {
spark.read.format("hudi").load(tableName)
} else {
Contributor:

When tableName doesn't contain a path separator, spark.table(tableName) is called — but this won't resolve a multi-part identifier like catalog.db.table. Is that intentional? Also, if the table doesn't exist, the Spark exception won't mention hudi_vector_search, which could be confusing. It might be worth wrapping this in a try-catch that rethrows with a more contextual message.

Collaborator (author):

@yihua Actually, it seems that spark.table(tableName) can resolve multi-part identifiers. However, I will add the exception to make it clearer.

private def resolveTableToDf(tableName: String): DataFrame = {
    try {
      if (tableName.contains(StoragePath.SEPARATOR)) {
        spark.read.format("hudi").load(tableName)
      } else {
        // spark.table() supports multi-part identifiers (e.g. catalog.db.table)
        spark.table(tableName)
      }
    } catch {
      case e: Exception => throw new HoodieAnalysisException(
        s"hudi_vector_search: unable to resolve table '$tableName': ${e.getMessage}")
    }
  }

Collaborator (author):

Rahil to test this once.

// Verify output columns
val columns = resultDf.columns
assertTrue(columns.contains("_hudi_distance"))
assertTrue(columns.contains("_hudi_qid"))
Contributor:

Should _hudi_qid column value be validated?

|""".stripMargin
).collect()

assertEquals(3, result.length)
Contributor:

Also validate the third row?

Contributor @yihua left a comment:

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

CodeRabbit Walkthrough: Adds vector-search table-valued functions (hudi_vector_search, hudi_vector_search_batch) to Hudi Spark SQL, plus analysis, plan-building, distance UDFs, Maven dependency on Spark MLlib local, and an extensive end-to-end test suite.

Greptile Summary: This PR adds hudi_vector_search and hudi_vector_search_batch SQL table-valued functions for brute-force KNN vector similarity search over Hudi tables with support for cosine, L2, and dot-product distance metrics across float, double, and byte embeddings.

Key changes:

  • New unresolved TVF plan nodes (HoodieVectorSearchTableValuedFunction / HoodieVectorSearchBatchTableValuedFunction) with robust argument parsing for 4–7 arguments
  • BruteForceSearchAlgorithm implementing single-query (orderBy + limit) and batch-query (broadcast cross-join + row_number() window) plans behind a pluggable VectorSearchAlgorithm trait that cleanly supports future algorithm additions (HNSW, RowMatrix, etc.)
  • VectorDistanceUtils with per-type UDF factories that pre-compute the query DenseVector once for single-query mode, avoiding per-row DenseVector allocation overhead
  • evaluateQueryVector correctly handles DecimalType inference for SQL ARRAY(1.0, 0.5) literals, but is missing a type guard before asInstanceOf[ArrayData] that would surface as a ClassCastException instead of a user-friendly HoodieAnalysisException when a non-array expression is supplied as the query vector

Greptile Confidence Score: 4/5
Safe to merge after the ClassCastException guard in evaluateQueryVector is fixed; remaining issues are non-blocking style improvements

The feature is well-structured with a clean pluggable algorithm design, pre-compute optimizations for single-query UDFs, and comprehensive test coverage. One concrete P1 bug (ClassCastException on invalid query vector type) is straightforward to fix with a two-line type guard. The prior NPE issue in parseK has already been addressed. Score of 4 reflects a single targeted fix remaining before the happy path to merge.

HoodieSparkBaseAnalysis.scala (evaluateQueryVector method, lines 359-361)

Sequence Diagram (CodeRabbit):

sequenceDiagram
    participant User as User / SQL
    participant Analyzer as Spark Analyzer
    participant ResolveRefs as ResolveReferences
    participant PlanBuilder as HoodieVectorSearchPlanBuilder
    participant DistanceUtils as VectorDistanceUtils
    participant Executor as Spark Executor

    User->>Analyzer: SELECT hudi_vector_search(...)
    Analyzer->>ResolveRefs: Resolve TVF
    ResolveRefs->>ResolveRefs: parse args, resolve table, eval query vector
    ResolveRefs->>PlanBuilder: buildSingleQueryPlan(corpusDf, ..., queryVector, k, metric)
    PlanBuilder->>DistanceUtils: create distance UDF
    PlanBuilder->>PlanBuilder: build logical plan (filter, map distance, order, limit)
    PlanBuilder->>Analyzer: return analyzed LogicalPlan
    Analyzer->>Executor: execute plan
    Executor->>Executor: compute distances, sort, return top-k

Sequence Diagram (CodeRabbit):

sequenceDiagram
    participant User as User / SQL
    participant Analyzer as Spark Analyzer
    participant ResolveRefs as ResolveReferences
    participant PlanBuilder as HoodieVectorSearchPlanBuilder
    participant DistanceUtils as VectorDistanceUtils
    participant Executor as Spark Executor

    User->>Analyzer: SELECT hudi_vector_search_batch(corpus, query, ...)
    Analyzer->>ResolveRefs: Resolve TVF
    ResolveRefs->>ResolveRefs: parse args, resolve corpus & query tables
    ResolveRefs->>PlanBuilder: buildBatchQueryPlan(corpusDf, queryDf, ...)
    PlanBuilder->>DistanceUtils: create distance UDF
    PlanBuilder->>PlanBuilder: build plan (broadcast, compute distances, window rank, top-k)
    PlanBuilder->>Analyzer: return analyzed LogicalPlan
    Analyzer->>Executor: execute plan
    Executor->>Executor: cross-join, compute distances, rank per query, return results

Sequence Diagram (Greptile):

sequenceDiagram
    participant User as SQL User
    participant Spark as Spark Analyzer
    participant TVF as HoodieVectorSearch TVF
    participant Analysis as ResolveReferences Rule
    participant Builder as BruteForceSearchAlgorithm
    participant UDF as VectorDistanceUtils

    User->>Spark: SELECT * FROM hudi_vector_search(...)
    Spark->>TVF: Create HoodieVectorSearchTableValuedFunction(args)
    Spark->>Analysis: Apply ResolveReferences rule
    Analysis->>TVF: parseArgs(args)
    TVF-->>Analysis: ParsedArgs(table, embeddingCol, queryVectorExpr, k, metric)
    Analysis->>Analysis: resolveTableToDf(table) → corpusDf
    Analysis->>Analysis: evaluateQueryVector(expr) → Array[Double]
    Analysis->>Builder: buildSingleQueryPlan(spark, corpusDf, col, queryVector, k, metric)
    Builder->>Builder: validateEmbeddingColumn(corpusDf, col)
    Builder->>UDF: createSingleQueryDistanceUdf(metric, elemType, queryVector)
    UDF-->>Builder: distanceUdf (closes over queryDv + queryNorm)
    Builder->>Builder: filteredDf.withColumn(_hudi_distance).drop(col).orderBy.limit(k)
    Builder-->>Analysis: analyzed LogicalPlan
    Analysis-->>Spark: Resolved plan replaces TVF node
    Spark-->>User: Result rows with _hudi_distance column

CodeRabbit: yihua#10 (review)
Greptile: yihua#10 (review)

s"query vector element at index $i is null")
getElement(i)
}.toArray
}
Contributor:

⚠️ Potential issue | 🟡 Minor

Add validation that expr.dataType is ArrayType before casting.

If a user mistakenly passes a non-array expression as the query vector, lines 362 and 364 will throw ClassCastException instead of a descriptive HoodieAnalysisException. Consider adding an early check:

🛡️ Proposed fix to add type validation
   private def evaluateQueryVector(expr: Expression): Array[Double] = {
+    expr.dataType match {
+      case _: ArrayType => // valid
+      case other => throw new HoodieAnalysisException(
+        s"Function '${HoodieVectorSearchTableValuedFunction.FUNC_NAME}': " +
+          s"query vector must be an array type, got: $other")
+    }
     if (!expr.foldable) {
       throw new HoodieAnalysisException(

CodeRabbit (original) (source:comment#3036004780)

Collaborator (author) @rahil-c commented Apr 5, 2026

Don't think this is valid; not sure why a user would pass a non-array expression for this case, but I will address it.

).collect()
})
assertTrue(ex.getMessage.contains("nonexistent_col") ||
ex.getCause.getMessage.contains("nonexistent_col"))
Contributor:

⚠️ Potential issue | 🟡 Minor

Stop dereferencing getCause blindly in these negative tests.

If Spark throws a top-level exception with no cause, these assertions NPE and hide the real regression. You already use the safer pattern on Lines 1404-1405; hoisting that into a shared helper would make these checks deterministic.

Suggested cleanup
+  private def rootMessage(t: Throwable): String =
+    Option(t.getCause).map(rootMessage).getOrElse(Option(t.getMessage).getOrElse(""))
-    assertTrue(ex.getMessage.contains("nonexistent_col") ||
-      ex.getCause.getMessage.contains("nonexistent_col"))
+    assertTrue(rootMessage(ex).contains("nonexistent_col"))

Also applies to: 565-566, 582-583


✅ Addressed in commits 41bcb03 to f43a6ea

CodeRabbit (original) (source:comment#3036004782)

Collaborator (author) @rahil-c commented Apr 5, 2026

Ack, can address.

Contributor:

This is a minor problem; we can pass on this. I would suggest validating the full exception message, as a general way of writing stronger tests.


val arrayData = value.asInstanceOf[ArrayData]
val numElements = arrayData.numElements()
val elementType = expr.dataType.asInstanceOf[ArrayType].elementType
Contributor:

P1 Unchecked cast to ArrayData/ArrayType yields ClassCastException on non-array input

If a user passes any non-array foldable expression as the query_vector argument — e.g. hudi_vector_search('t', 'emb', 1.0, 5) or hudi_vector_search('t', 'emb', 'text', 5) — both value.asInstanceOf[ArrayData] and expr.dataType.asInstanceOf[ArrayType] throw an unhandled ClassCastException rather than a HoodieAnalysisException. The user sees a raw JVM stack trace with no hint about how to fix the call.

A type guard on expr.dataType should precede both casts:

Suggested change
val elementType = expr.dataType.asInstanceOf[ArrayType].elementType
if (!expr.dataType.isInstanceOf[ArrayType]) {
throw new HoodieAnalysisException(
s"Function '${HoodieVectorSearchTableValuedFunction.FUNC_NAME}': " +
s"query vector must be an array type, got ${expr.dataType.simpleString}")
}
val arrayData = value.asInstanceOf[ArrayData]
val numElements = arrayData.numElements()
val elementType = expr.dataType.asInstanceOf[ArrayType].elementType

Greptile (original) (source:comment#3036042593)

Collaborator (author):

I still do not understand why the user would pass a single scalar value like
hudi_vector_search('t', 'emb', 1.0, 5) or hudi_vector_search('t', 'emb', 'text', 5).

I can address this, but it seems highly unlikely.

Contributor:

I think this is a suggestion of defensive coding to avoid invalid input causing undefined behavior.

@hudi-bot (Collaborator) commented Apr 6, 2026

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

Contributor @yihua left a comment:

LGTM as initial support

@yihua yihua merged commit fd20018 into apache:master Apr 6, 2026
61 of 66 checks passed
@codecov-commenter:

Codecov Report

❌ Patch coverage is 70.35831% with 91 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.75%. Comparing base (35e2bbf) to head (402b24e).
⚠️ Report is 21 commits behind head on master.

Files with missing lines Patch % Lines
...ogical/HoodieVectorSearchTableValuedFunction.scala 71.59% 15 Missing and 10 partials ⚠️
.../hudi/analysis/HoodieVectorSearchPlanBuilder.scala 78.50% 14 Missing and 9 partials ⚠️
.../spark/sql/hudi/analysis/VectorDistanceUtils.scala 57.40% 18 Missing and 5 partials ⚠️
...rk/sql/hudi/analysis/HoodieSparkBaseAnalysis.scala 60.00% 11 Missing and 9 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18432      +/-   ##
============================================
+ Coverage     68.52%   68.75%   +0.22%     
- Complexity    27968    28064      +96     
============================================
  Files          2440     2449       +9     
  Lines        134456   134765     +309     
  Branches      16226    16319      +93     
============================================
+ Hits          92138    92655     +517     
+ Misses        35054    34793     -261     
- Partials       7264     7317      +53     
Flag Coverage Δ
common-and-other-modules 44.41% <4.24%> (+0.07%) ⬆️
hadoop-mr-java-client 44.91% <ø> (-0.08%) ⬇️
spark-client-hadoop-common 48.49% <ø> (+0.11%) ⬆️
spark-java-tests 48.83% <70.35%> (+0.06%) ⬆️
spark-scala-tests 45.49% <4.23%> (-0.16%) ⬇️
utilities 38.24% <4.24%> (-0.11%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...spark/sql/hudi/analysis/TableValuedFunctions.scala 100.00% <100.00%> (ø)
...rk/sql/hudi/analysis/HoodieSparkBaseAnalysis.scala 70.99% <60.00%> (-3.19%) ⬇️
.../hudi/analysis/HoodieVectorSearchPlanBuilder.scala 78.50% <78.50%> (ø)
.../spark/sql/hudi/analysis/VectorDistanceUtils.scala 57.40% <57.40%> (ø)
...ogical/HoodieVectorSearchTableValuedFunction.scala 71.59% <71.59%> (ø)

... and 62 files with indirect coverage changes



Labels

size:XL PR with lines of changes > 1000


6 participants