feat(vector): Add Spark VECTOR Search TVF with initial KNN algorithm #18432

Merged
yihua merged 15 commits into apache:master from rahil-c:rahil/vector-tvf-master on Apr 6, 2026

Conversation

@rahil-c (Collaborator) commented Mar 31, 2026

Describe the issue this Pull Request addresses

This PR adds distributed vector search capability to Hudi's Spark SQL integration by introducing two new table-valued functions for single-query and batch-query KNN search. Currently there is no way to perform nearest-neighbor search over embedding columns stored in Hudi tables without an external vector database.

Summary and Changelog

Users can now find k-nearest neighbors over Hudi tables using standard Spark SQL, with support for cosine, L2, and dot product distance metrics across float, double, and byte embedding types.
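For intuition, a minimal pure-Scala sketch of the three metrics follows. This is illustrative only; it does not reflect the PR's actual VectorDistanceUtils, which goes through Spark MLlib's DenseVector, and the dot-product sign convention here is an assumption.

```scala
// Illustrative sketch of the three distance metrics; the PR's actual
// implementation uses Spark MLlib's DenseVector with BLAS-backed ops.
object DistanceSketch {
  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "dimension mismatch")
    var s = 0.0; var i = 0
    while (i < a.length) { s += a(i) * b(i); i += 1 }
    s
  }

  def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))

  // cosine distance = 1 - cosine similarity
  def cosineDistance(a: Array[Double], b: Array[Double]): Double =
    1.0 - dot(a, b) / (norm(a) * norm(b))

  // L2 (euclidean) distance
  def l2Distance(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // dot-product "distance": negated so smaller means more similar
  // (sign convention assumed here, not taken from the PR)
  def dotProductDistance(a: Array[Double], b: Array[Double]): Double =
    -dot(a, b)
}
```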

Changes:

  • Added hudi_vector_search TVF for single-query KNN search
  • Added hudi_vector_search_batch TVF for batch-query KNN with broadcast cross-join and window-based top-k ranking
  • Implemented brute-force search behind an extensible algorithm trait for future index-based algorithms
  • Added distance utility UDFs that pre-compute query vector and norm to halve per-row allocations
  • Extended ResolveReferences Catalyst rule to resolve vector search TVF nodes into executable Spark logical plans
  • Added VECTOR metadata dimension validation at analysis time and case-insensitive column lookup
  • Added input validation guards for null k, non-array query vectors, byte corpus integrality, and element type compatibility
  • Added 38 end-to-end tests covering all metrics, element types, COW/MOR tables, edge cases, and exact distance verification
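To picture the batch-query path (broadcast cross-join, then window-based top-k ranking per query), here is a hedged pure-Scala analog; the names Scored and topKPerQuery are invented for illustration, and the real plan uses a row_number() window rather than in-memory grouping.

```scala
// Pure-Scala analog of the batch top-k ranking described above.
// The actual plan uses broadcast cross-join + a row_number() window per query id.
case class Scored(qid: Long, docId: String, distance: Double)

def topKPerQuery(scored: Seq[Scored], k: Int): Map[Long, Seq[Scored]] =
  scored
    .groupBy(_.qid) // one partition per query id, like the window's PARTITION BY
    .map { case (qid, rows) => qid -> rows.sortBy(_.distance).take(k) } // rank ascending, keep top-k
```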

Impact

Two new Spark SQL table-valued functions with output columns _hudi_distance and _hudi_qid. Adds spark-mllib-local dependency (local-only, no cluster overhead). No breaking changes to existing APIs, table format, or write/read paths.

Risk Level

Low - Purely additive feature isolated in new files, registered alongside existing Hudi TVFs using the same SparkSessionExtensions mechanism.

Documentation Update

Requires documentation for TVF syntax/parameters, supported distance metrics and embedding types, output schema contract, and batch-query semantics.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

Implements a new Spark TVF `hudi_vector_search` that performs exact
nearest-neighbor search over embeddings stored in a Hudi table using
brute-force KNN, without requiring a persistent vector index.

New files:
- HoodieVectorSearchTableValuedFunction: unresolved logical node holding
  raw args; parses single-query (table, col, ARRAY(...), k [, metric])
  and batch-query (corpus, col, query_table, col, k [, metric]) modes;
  supports cosine, l2/euclidean, dot_product distance metrics.
- HoodieVectorSearchPlanBuilder: builds the execution plan — single-query
  mode uses withColumn + orderBy + limit(k); batch-query mode uses
  crossJoin(broadcast(queries)) + window rank per _query_id; handles
  Float/Double/Byte embedding types via cast to Double before UDF.
  Fixes cross-join column ambiguity when corpus and query share column
  names by renaming clashing query columns to _query_<colname> before
  the join.

Modified files:
- TableValuedFunctions: registers hudi_vector_search alongside existing
  Hudi TVFs so all Spark adapters (3.3/3.4/3.5/4.0) inject it via
  SparkSessionExtensions.
- HoodieSparkBaseAnalysis (ResolveReferences): resolves
  HoodieVectorSearchTableValuedFunction — evaluates the query vector
  constant at analysis time, resolves table by name or path, delegates
  to HoodieVectorSearchPlanBuilder.
- TestHoodieVectorSearchFunction: end-to-end tests covering single/batch
  query modes, all three distance metrics, DataFrame API, path-based and
  view-based resolution, Float/Double embedding types, MOR tables,
  composability (WHERE, subqueries), error handling, exact distance
  verification, and same-column-name batch query regression test.
@github-actions github-actions bot added the size:XL PR with lines of changes > 1000 label Mar 31, 2026
@github-actions github-actions bot added size:XS PR with lines of changes in <= 10 and removed size:XL PR with lines of changes > 1000 labels Mar 31, 2026
@github-actions github-actions bot added size:XL PR with lines of changes > 1000 and removed size:XS PR with lines of changes in <= 10 labels Mar 31, 2026
while (i < numElements) { result(i) = arrayData.getFloat(i).toDouble; i += 1 }
case IntegerType =>
while (i < numElements) { result(i) = arrayData.getInt(i).toDouble; i += 1 }
case LongType =>
Collaborator (author):

Why is LongType here? I thought VECTOR in Hudi only supports float, double, and int?

Contributor:

Spark SQL can infer integer literals as LongType in some contexts (e.g., ARRAY(1, 2, 3) might produce LongType depending on how the expression is folded). Supporting it defensively here avoids a confusing runtime error for users, even if Hudi's VECTOR type itself only stores float/double/int.

Collaborator (author):

OK, let me look into adding this.

@github-actions github-actions bot added size:XS PR with lines of changes in <= 10 and removed size:XL PR with lines of changes > 1000 labels Apr 1, 2026
val meta = field.metadata
if (meta.contains(HoodieSchema.TYPE_METADATA_FIELD)) {
val typeDesc = meta.getString(HoodieSchema.TYPE_METADATA_FIELD)
val dimPattern = """VECTOR\((\d+)""".r
Collaborator (author):

Is there a cleaner way of getting the dimension? This also feels brittle.

Contributor:

Unfortunately Spark's ArrayType doesn't encode dimension in the schema, so you'd always need to inspect actual data. One option: require the user to pass dimension explicitly as a TVF argument, which avoids the first-row inspection entirely and makes validation upfront. Otherwise, evaluating the first row or the query vector length is about as clean as it gets.

Collaborator (author) @rahil-c commented Apr 3, 2026

I actually think this is a cleaner solution: we can leverage HoodieSchema.parseTypeDescriptor and then, if it is a VECTOR type, get the dimension.

private def extractVectorDimension(df: DataFrame, colName: String): Option[Int] = {
    df.schema.fields.find(_.name == colName).flatMap { field =>
      val meta = field.metadata
      if (meta.contains(HoodieSchema.TYPE_METADATA_FIELD)) {
        val typeDesc = meta.getString(HoodieSchema.TYPE_METADATA_FIELD)
        Try(HoodieSchema.parseTypeDescriptor(typeDesc)) match {
          case Success(v: HoodieSchema.Vector) => Some(v.getDimension)
          case Success(_) => throw new HoodieAnalysisException(
            s"Column '$colName' has type '$typeDesc' which is not a VECTOR type. " +
              "Only VECTOR columns are supported for vector search.")
          case Failure(e) => throw new HoodieAnalysisException(
            s"Column '$colName' has malformed type metadata '$typeDesc': ${e.getMessage}")
        }
      } else None
    }
  }

// Rename any query column that clashes with a corpus column or internal columns.
// Uses a double prefix if the standard rename would itself clash (e.g. "id" -> "_query_id"
// would collide with the internal _query_id column).
val renamedQuery = queryWithId.columns.foldLeft(queryWithId) { (df, qCol) =>
Collaborator (author):

Is there an easier way we can avoid clashes, so we don't need to maintain this extra code?

Contributor:

One common approach is to alias the DataFrames before the join (e.g., corpusDf.as("corpus") / queryDf.as("query")) and reference columns by qualified name throughout. That way you avoid the rename/restore dance entirely, though it does require all downstream column references to be qualified.

Collaborator (author):

@yihua I think you are right; that seems to be the cleanest solution. Let me try this out.
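For reference, the double-prefix rule in the snippet above can be captured as a pure function. This is a sketch only: the helper name and the internal-column set are assumptions, and the aliasing approach discussed here would make the rename unnecessary.

```scala
// String-level analog of the clash-renaming fold shown above.
// The "_query_" prefix comes from the PR; clashFreeName and the exact
// internal-column set are hypothetical.
def clashFreeName(qCol: String, corpusCols: Set[String], internalCols: Set[String]): String =
  if (corpusCols.contains(qCol) || internalCols.contains(qCol)) {
    val renamed = s"_query_$qCol"
    // double-prefix if the standard rename itself collides
    // (e.g. "id" -> "_query_id" would hit the internal _query_id column)
    if (corpusCols.contains(renamed) || internalCols.contains(renamed)) s"_query_$renamed"
    else renamed
  } else qCol
```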

@github-actions github-actions bot added size:XL PR with lines of changes > 1000 and removed size:XS PR with lines of changes in <= 10 labels Apr 1, 2026
@rahil-c rahil-c requested review from bvaradar, voonhous and yihua April 1, 2026 04:00
@rahil-c rahil-c marked this pull request as ready for review April 1, 2026 04:01

// doc_4 [0.707,0.707,0]: cosine distance to [1,0,0] = 1 - 0.707 ~= 0.293
assertEquals("doc_4", result(1).getAs[String]("id"))
assertEquals(1.0 - 0.70710678, result(1).getAs[Double]("_hudi_distance"), 1e-4)
Collaborator (author):

Is there a cleaner way we can do these assertions?

@rahil-c rahil-c requested review from bvaradar and voonhous April 2, 2026 00:51
Member @voonhous left a comment:

Let's fix the PR validation failures. LGTM

@rahil-c rahil-c changed the title from "feat{vector): Add Spark VECTOR Search TVF with intial KNN algorithm" to "feat(vector): Add Spark VECTOR Search TVF with intial KNN algorithm" on Apr 2, 2026
@rahil-c (Collaborator, author) commented Apr 2, 2026

@bvaradar @yihua, if we can do one more review pass today that would be great. I think we are close.

@rahil-c (Collaborator, author) commented Apr 2, 2026

Thanks @bvaradar for taking a look, will just check with @yihua once and then we should be good.

Contributor @yihua left a comment:

@rahil-c could you fix the PR description to follow the template?

@yihua (Contributor) commented Apr 2, 2026

Codecov Report

❌ Patch coverage is 72.39057% with 82 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.55%. Comparing base (35e2bbf) to head (c19c701).
⚠️ Report is 10 commits behind head on master.

Files with missing lines Patch % Lines
.../spark/sql/hudi/analysis/VectorDistanceUtils.scala 55.17% 18 Missing and 8 partials ⚠️
...ogical/HoodieVectorSearchTableValuedFunction.scala 75.29% 13 Missing and 8 partials ⚠️
.../hudi/analysis/HoodieVectorSearchPlanBuilder.scala 79.20% 10 Missing and 11 partials ⚠️
...rk/sql/hudi/analysis/HoodieSparkBaseAnalysis.scala 68.88% 7 Missing and 7 partials ⚠️
Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18432      +/-   ##
============================================
+ Coverage     68.52%   68.55%   +0.03%     
- Complexity    27968    28045      +77     
============================================
  Files          2440     2447       +7     
  Lines        134456   135051     +595     
  Branches      16226    16357     +131     
============================================
+ Hits          92138    92588     +450     
- Misses        35054    35151      +97     
- Partials       7264     7312      +48     

Flag Coverage Δ
common-and-other-modules 44.26% <4.39%> (-0.08%) ⬇️
hadoop-mr-java-client 44.93% <ø> (-0.06%) ⬇️
spark-client-hadoop-common 48.46% <ø> (+0.08%) ⬆️
spark-java-tests 48.86% <72.39%> (+0.09%) ⬆️
spark-scala-tests 45.51% <4.37%> (-0.14%) ⬇️
utilities 38.26% <4.39%> (-0.10%) ⬇️
Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...spark/sql/hudi/analysis/TableValuedFunctions.scala 100.00% <100.00%> (ø)
...rk/sql/hudi/analysis/HoodieSparkBaseAnalysis.scala 73.00% <68.88%> (-1.17%) ⬇️
...ogical/HoodieVectorSearchTableValuedFunction.scala 75.29% <75.29%> (ø)
.../hudi/analysis/HoodieVectorSearchPlanBuilder.scala 79.20% <79.20%> (ø)
.../spark/sql/hudi/analysis/VectorDistanceUtils.scala 55.17% <55.17%> (ø)
... and 52 files with indirect coverage changes


@rahil-c could you check the lines that miss coverage and see if more tests should be added?

Contributor @yihua left a comment:

Nice work on adding the vector search TVF. Left a few comments.

assertEquals(2, result.length)
assertEquals("b1", result(0).getAs[String]("id"))
assertEquals(0.0, result(0).getAs[Double]("_hudi_distance"), 1e-5)
// b2: sqrt(10^2 + 10^2) = sqrt(200) ~= 14.14
Contributor:

Is requiring element-type match (float vs double) between corpus and query table intentional? Many vector search systems silently upcast. If a user stores embeddings as float but their query pipeline produces double, this would be a confusing error. Worth a comment in the TVF docs if this is by design.

Collaborator (author):

Yes, I think this is intentional for now, to minimize the complexity around doing this implicit casting. If you happen to know which systems do this silent upcast let me know, or I can do some more research on which systems allow it and follow up in another PR (I have some future optimization PRs planned for this work, so I can batch it there).

For now, I can leave a comment in the implementation saying we have this defensive type-matching requirement.

Collaborator (author):

To create a GitHub issue for follow-up.

Contributor:

I was thinking of Spark SQL's support for the VALUES clause with raw values in the query, where Spark infers the value type. Not a blocker on this PR; we can follow up.

requireSameLength(corpus.length, queryVector.length)
distFn(new DenseVector(corpus.iterator.map(_.toDouble).toArray), queryDv, queryNorm)
})
case DoubleType => udf((corpus: Seq[Double]) => {
Contributor:

In the FloatType branch, corpus.iterator.map(_.toDouble).toArray allocates a full Array[Double] copy on every row. For the batch UDF factories (createFloatDistanceUdf at line 140), the same pattern appears for both arguments. Have you considered using a reusable buffer or at least corpus.view.map(...) to reduce GC pressure? For single-query mode this is on the hot path for every corpus row.

Collaborator (author):

Let me look into this further.

Collaborator (author):

@yihua I tried to think of some alternatives, but I think leveraging DenseVector from Spark MLlib makes this tricky, so I am thinking we might just have to document this for now.

Some ideas considered which are not ideal:

  1. A reusable buffer inside the UDF is not safe: Spark doesn't guarantee single-threaded UDF execution per partition.
  2. corpus.view.map(...) does not help either: the DenseVector constructor forces materialization anyway, so the array allocation still happens.
  3. Dropping the use of Spark MLlib's DenseVector and inlining the math ourselves: we'd lose MLlib's optimized BLAS-backed dot() and sqdist(). For high-dimensional vectors, the native BLAS path is significantly faster than a manual Scala loop.

Based on this I think the Array[Double] allocation is fundamentally required by DenseVector. Every approach either keeps the same allocation or trades something else off (losing BLAS, unsafe concurrency).
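As a reference point for the trade-off in idea 3, a manual loop like the following avoids the per-row Array[Double] copy but gives up MLlib's BLAS path. Sketch only; dotNoCopy is a hypothetical name, not code from the PR.

```scala
// Hedged sketch of idea 3 above: inlining the dot product over the float
// corpus avoids materializing an Array[Double], at the cost of losing
// MLlib's BLAS-backed implementation.
def dotNoCopy(corpus: Seq[Float], query: Array[Double]): Double = {
  require(corpus.length == query.length, "dimension mismatch")
  var s = 0.0
  var i = 0
  val it = corpus.iterator // no intermediate Array[Double] copy is allocated
  while (it.hasNext) { s += it.next().toDouble * query(i); i += 1 }
  s
}
```

For high-dimensional vectors the BLAS path may still win despite the extra allocation, which is why a micro-benchmark (as discussed below in the thread) is the right follow-up.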

Collaborator (author):

To create a GitHub issue to follow up on the micro-benchmark idea.

Contributor:

Got it. Overall, my concern is that per-vector processing through DenseVector could introduce latency overhead. Given this is the initial implementation, we can check in the code and should follow up with micro-benchmarks.

private def resolveTableToDf(tableName: String): DataFrame = {
if (tableName.contains(StoragePath.SEPARATOR)) {
spark.read.format("hudi").load(tableName)
} else {
Contributor:

When tableName doesn't contain a path separator, spark.table(tableName) is called — but this won't resolve a multi-part identifier like catalog.db.table. Is that intentional? Also, if the table doesn't exist, the Spark exception won't mention hudi_vector_search, which could be confusing. It might be worth wrapping this in a try-catch that rethrows with a more contextual message.

Collaborator (author):

@yihua Actually, it seems that spark.table(tableName) can resolve multi-part identifiers. However, I will add the exception to make it clearer.

private def resolveTableToDf(tableName: String): DataFrame = {
    try {
      if (tableName.contains(StoragePath.SEPARATOR)) {
        spark.read.format("hudi").load(tableName)
      } else {
        // spark.table() supports multi-part identifiers (e.g. catalog.db.table)
        spark.table(tableName)
      }
    } catch {
      case e: Exception => throw new HoodieAnalysisException(
        s"hudi_vector_search: unable to resolve table '$tableName': ${e.getMessage}")
    }
  }

Collaborator (author):

Rahil to test this once.

// Verify output columns
val columns = resultDf.columns
assertTrue(columns.contains("_hudi_distance"))
assertTrue(columns.contains("_hudi_qid"))
Contributor:

Should _hudi_qid column value be validated?

|""".stripMargin
).collect()

assertEquals(3, result.length)
Contributor:

Also validate the third row?

Contributor @yihua left a comment:

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

CodeRabbit Walkthrough: Adds vector-search table-valued functions (hudi_vector_search, hudi_vector_search_batch) to Hudi Spark SQL, plus analysis, plan-building, distance UDFs, Maven dependency on Spark MLlib local, and an extensive end-to-end test suite.

Greptile Summary: This PR adds hudi_vector_search and hudi_vector_search_batch SQL table-valued functions for brute-force KNN vector similarity search over Hudi tables with support for cosine, L2, and dot-product distance metrics across float, double, and byte embeddings.

Key changes:

  • New unresolved TVF plan nodes (HoodieVectorSearchTableValuedFunction / HoodieVectorSearchBatchTableValuedFunction) with robust argument parsing for 4–7 arguments
  • BruteForceSearchAlgorithm implementing single-query (orderBy + limit) and batch-query (broadcast cross-join + row_number() window) plans behind a pluggable VectorSearchAlgorithm trait that cleanly supports future algorithm additions (HNSW, RowMatrix, etc.)
  • VectorDistanceUtils with per-type UDF factories that pre-compute the query DenseVector once for single-query mode, avoiding per-row DenseVector allocation overhead
  • evaluateQueryVector correctly handles DecimalType inference for SQL ARRAY(1.0, 0.5) literals, but is missing a type guard before asInstanceOf[ArrayData] that would surface as a ClassCastException instead of a user-friendly HoodieAnalysisException when a non-array expression is supplied as the query vector

Greptile Confidence Score: 4/5
Safe to merge after the ClassCastException guard in evaluateQueryVector is fixed; remaining issues are non-blocking style improvements

The feature is well-structured with a clean pluggable algorithm design, pre-compute optimizations for single-query UDFs, and comprehensive test coverage. One concrete P1 bug (ClassCastException on invalid query vector type) is straightforward to fix with a two-line type guard. The prior NPE issue in parseK has already been addressed. Score of 4 reflects a single targeted fix remaining before the happy path to merge.

HoodieSparkBaseAnalysis.scala (evaluateQueryVector method, lines 359-361)

Sequence Diagram (CodeRabbit):

sequenceDiagram
    participant User as User / SQL
    participant Analyzer as Spark Analyzer
    participant ResolveRefs as ResolveReferences
    participant PlanBuilder as HoodieVectorSearchPlanBuilder
    participant DistanceUtils as VectorDistanceUtils
    participant Executor as Spark Executor

    User->>Analyzer: SELECT hudi_vector_search(...)
    Analyzer->>ResolveRefs: Resolve TVF
    ResolveRefs->>ResolveRefs: parse args, resolve table, eval query vector
    ResolveRefs->>PlanBuilder: buildSingleQueryPlan(corpusDf, ..., queryVector, k, metric)
    PlanBuilder->>DistanceUtils: create distance UDF
    PlanBuilder->>PlanBuilder: build logical plan (filter, map distance, order, limit)
    PlanBuilder->>Analyzer: return analyzed LogicalPlan
    Analyzer->>Executor: execute plan
    Executor->>Executor: compute distances, sort, return top-k

Sequence Diagram (CodeRabbit):

sequenceDiagram
    participant User as User / SQL
    participant Analyzer as Spark Analyzer
    participant ResolveRefs as ResolveReferences
    participant PlanBuilder as HoodieVectorSearchPlanBuilder
    participant DistanceUtils as VectorDistanceUtils
    participant Executor as Spark Executor

    User->>Analyzer: SELECT hudi_vector_search_batch(corpus, query, ...)
    Analyzer->>ResolveRefs: Resolve TVF
    ResolveRefs->>ResolveRefs: parse args, resolve corpus & query tables
    ResolveRefs->>PlanBuilder: buildBatchQueryPlan(corpusDf, queryDf, ...)
    PlanBuilder->>DistanceUtils: create distance UDF
    PlanBuilder->>PlanBuilder: build plan (broadcast, compute distances, window rank, top-k)
    PlanBuilder->>Analyzer: return analyzed LogicalPlan
    Analyzer->>Executor: execute plan
    Executor->>Executor: cross-join, compute distances, rank per query, return results

Sequence Diagram (Greptile):

sequenceDiagram
    participant User as SQL User
    participant Spark as Spark Analyzer
    participant TVF as HoodieVectorSearch TVF
    participant Analysis as ResolveReferences Rule
    participant Builder as BruteForceSearchAlgorithm
    participant UDF as VectorDistanceUtils

    User->>Spark: SELECT * FROM hudi_vector_search(...)
    Spark->>TVF: Create HoodieVectorSearchTableValuedFunction(args)
    Spark->>Analysis: Apply ResolveReferences rule
    Analysis->>TVF: parseArgs(args)
    TVF-->>Analysis: ParsedArgs(table, embeddingCol, queryVectorExpr, k, metric)
    Analysis->>Analysis: resolveTableToDf(table) → corpusDf
    Analysis->>Analysis: evaluateQueryVector(expr) → Array[Double]
    Analysis->>Builder: buildSingleQueryPlan(spark, corpusDf, col, queryVector, k, metric)
    Builder->>Builder: validateEmbeddingColumn(corpusDf, col)
    Builder->>UDF: createSingleQueryDistanceUdf(metric, elemType, queryVector)
    UDF-->>Builder: distanceUdf (closes over queryDv + queryNorm)
    Builder->>Builder: filteredDf.withColumn(_hudi_distance).drop(col).orderBy.limit(k)
    Builder-->>Analysis: analyzed LogicalPlan
    Analysis-->>Spark: Resolved plan replaces TVF node
    Spark-->>User: Result rows with _hudi_distance column

CodeRabbit: yihua#10 (review)
Greptile: yihua#10 (review)

s"query vector element at index $i is null")
getElement(i)
}.toArray
}
Contributor:

⚠️ Potential issue | 🟡 Minor

Add validation that expr.dataType is ArrayType before casting.

If a user mistakenly passes a non-array expression as the query vector, lines 362 and 364 will throw ClassCastException instead of a descriptive HoodieAnalysisException. Consider adding an early check:

🛡️ Proposed fix to add type validation
   private def evaluateQueryVector(expr: Expression): Array[Double] = {
+    expr.dataType match {
+      case _: ArrayType => // valid
+      case other => throw new HoodieAnalysisException(
+        s"Function '${HoodieVectorSearchTableValuedFunction.FUNC_NAME}': " +
+          s"query vector must be an array type, got: $other")
+    }
     if (!expr.foldable) {
       throw new HoodieAnalysisException(

CodeRabbit (original) (source:comment#3036004780)

Collaborator (author) @rahil-c commented Apr 5, 2026

Don't think this is valid; not sure why a user would pass a non-array expression for this case, but I will address it.

).collect()
})
assertTrue(ex.getMessage.contains("nonexistent_col") ||
ex.getCause.getMessage.contains("nonexistent_col"))
Contributor:

⚠️ Potential issue | 🟡 Minor

Stop dereferencing getCause blindly in these negative tests.

If Spark throws a top-level exception with no cause, these assertions NPE and hide the real regression. You already use the safer pattern on Lines 1404-1405; hoisting that into a shared helper would make these checks deterministic.

Suggested cleanup
+  private def rootMessage(t: Throwable): String =
+    Option(t.getCause).map(rootMessage).getOrElse(Option(t.getMessage).getOrElse(""))
-    assertTrue(ex.getMessage.contains("nonexistent_col") ||
-      ex.getCause.getMessage.contains("nonexistent_col"))
+    assertTrue(rootMessage(ex).contains("nonexistent_col"))

Also applies to: 565-566, 582-583


✅ Addressed in commits 41bcb03 to f43a6ea

CodeRabbit (original) (source:comment#3036004782)

Collaborator (author) @rahil-c commented Apr 5, 2026

Ack, can address.

Contributor:

This is a minor problem; we can pass on this. I would suggest validating the full exception message, as a general way of writing stronger tests.


val arrayData = value.asInstanceOf[ArrayData]
val numElements = arrayData.numElements()
val elementType = expr.dataType.asInstanceOf[ArrayType].elementType
Contributor:

P1 Unchecked cast to ArrayData/ArrayType yields ClassCastException on non-array input

If a user passes any non-array foldable expression as the query_vector argument — e.g. hudi_vector_search('t', 'emb', 1.0, 5) or hudi_vector_search('t', 'emb', 'text', 5) — both value.asInstanceOf[ArrayData] and expr.dataType.asInstanceOf[ArrayType] throw an unhandled ClassCastException rather than a HoodieAnalysisException. The user sees a raw JVM stack trace with no hint about how to fix the call.

A type guard on expr.dataType should precede both casts:

Suggested change
val elementType = expr.dataType.asInstanceOf[ArrayType].elementType
if (!expr.dataType.isInstanceOf[ArrayType]) {
throw new HoodieAnalysisException(
s"Function '${HoodieVectorSearchTableValuedFunction.FUNC_NAME}': " +
s"query vector must be an array type, got ${expr.dataType.simpleString}")
}
val arrayData = value.asInstanceOf[ArrayData]
val numElements = arrayData.numElements()
val elementType = expr.dataType.asInstanceOf[ArrayType].elementType

Greptile (original) (source:comment#3036042593)

Collaborator (author):

I still do not understand why the user would pass a single scalar value like
hudi_vector_search('t', 'emb', 1.0, 5) or hudi_vector_search('t', 'emb', 'text', 5).

I can address this, but it seems highly unlikely.

Contributor:

I think this is a suggestion of defensive coding to avoid invalid input causing undefined behavior.

@hudi-bot (Collaborator) commented Apr 6, 2026

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

Contributor @yihua left a comment:

LGTM as initial support

@yihua yihua merged commit fd20018 into apache:master Apr 6, 2026
61 of 66 checks passed
@codecov-commenter:

Codecov Report

❌ Patch coverage is 70.35831% with 91 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.75%. Comparing base (35e2bbf) to head (402b24e).
⚠️ Report is 21 commits behind head on master.

Files with missing lines Patch % Lines
...ogical/HoodieVectorSearchTableValuedFunction.scala 71.59% 15 Missing and 10 partials ⚠️
.../hudi/analysis/HoodieVectorSearchPlanBuilder.scala 78.50% 14 Missing and 9 partials ⚠️
.../spark/sql/hudi/analysis/VectorDistanceUtils.scala 57.40% 18 Missing and 5 partials ⚠️
...rk/sql/hudi/analysis/HoodieSparkBaseAnalysis.scala 60.00% 11 Missing and 9 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18432      +/-   ##
============================================
+ Coverage     68.52%   68.75%   +0.22%     
- Complexity    27968    28064      +96     
============================================
  Files          2440     2449       +9     
  Lines        134456   134765     +309     
  Branches      16226    16319      +93     
============================================
+ Hits          92138    92655     +517     
+ Misses        35054    34793     -261     
- Partials       7264     7317      +53     
Flag Coverage Δ
common-and-other-modules 44.41% <4.24%> (+0.07%) ⬆️
hadoop-mr-java-client 44.91% <ø> (-0.08%) ⬇️
spark-client-hadoop-common 48.49% <ø> (+0.11%) ⬆️
spark-java-tests 48.83% <70.35%> (+0.06%) ⬆️
spark-scala-tests 45.49% <4.23%> (-0.16%) ⬇️
utilities 38.24% <4.24%> (-0.11%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...spark/sql/hudi/analysis/TableValuedFunctions.scala 100.00% <100.00%> (ø)
...rk/sql/hudi/analysis/HoodieSparkBaseAnalysis.scala 70.99% <60.00%> (-3.19%) ⬇️
.../hudi/analysis/HoodieVectorSearchPlanBuilder.scala 78.50% <78.50%> (ø)
.../spark/sql/hudi/analysis/VectorDistanceUtils.scala 57.40% <57.40%> (ø)
...ogical/HoodieVectorSearchTableValuedFunction.scala 71.59% <71.59%> (ø)

... and 62 files with indirect coverage changes



Labels

size:XL PR with lines of changes > 1000


6 participants