[Feature] Support Spark expression: array_size #3155

@andygrove

Description

What is the problem the feature request solves?

Note: This issue was generated with AI assistance. The specification details have been extracted from Spark documentation and may need verification.

Comet does not currently support the Spark array_size function, causing queries using this function to fall back to Spark's JVM execution instead of running natively on DataFusion.

The ArraySize expression returns the number of elements in an array. It is a runtime-replaceable expression that internally delegates to the Size expression with legacySizeOfNull set to false, providing consistent null handling behavior.

Supporting this expression would allow more Spark workloads to benefit from Comet's native acceleration.
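The delegation described above fixes the null-handling contract: legacy `Size` returns -1 for a null input, while `array_size` (i.e. `Size` with `legacySizeOfNull = false`) returns null. A minimal sketch of that contract, in plain Rust rather than Comet's actual code (the `size` function here is illustrative only):

```rust
// Sketch of Spark's Size null-handling semantics. With legacySizeOfNull = true,
// a null input yields -1; with it false (the array_size path), it yields null.
fn size(arr: Option<&[Option<i32>]>, legacy_size_of_null: bool) -> Option<i32> {
    match arr {
        Some(elems) => Some(elems.len() as i32), // null elements still count
        None if legacy_size_of_null => Some(-1), // legacy Size behavior
        None => None,                            // array_size behavior
    }
}
```

Any native implementation needs to preserve exactly this distinction to stay compatible with Spark.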

Describe the potential solution

Spark Specification

Syntax:

array_size(array_expr)

// DataFrame API
df.select(array_size(col("array_column")))

Arguments:

Argument | Type       | Description
-------- | ---------- | --------------------------------------------------
child    | Expression | The array expression whose size will be calculated

Return Type: Returns an IntegerType representing the number of elements in the array.

Supported Data Types:

  • ArrayType - Arrays of any element type (string, numeric, complex types, etc.)

Edge Cases:

  • Null arrays: Returns null when the input array is null (non-legacy behavior)

  • Empty arrays: Returns 0 for arrays with no elements

  • Nested arrays: Counts only the top-level elements, not nested array contents

  • Arrays with null elements: Null elements within the array are counted as regular elements

Examples:

-- Basic array size calculation
SELECT array_size(array('b', 'd', 'c', 'a'));
-- Returns: 4

-- Empty array
SELECT array_size(array());
-- Returns: 0

-- Null array
SELECT array_size(NULL);
-- Returns: NULL

-- Array with null elements
SELECT array_size(array('a', NULL, 'c'));
-- Returns: 3

// DataFrame API usage
import org.apache.spark.sql.functions._

df.select(expr("array_size(array_column)"))

// With array creation
df.select(expr("array_size(array('a', 'b', 'c'))"))

Implementation Approach

See the Comet guide on adding new expressions for detailed instructions.

  1. Scala Serde: Add expression handler in spark/src/main/scala/org/apache/comet/serde/
  2. Register: Add to appropriate map in QueryPlanSerde.scala
  3. Protobuf: Add message type in native/proto/src/proto/expr.proto if needed
  4. Rust: Implement in native/spark-expr/src/ (check if DataFusion has built-in support first)
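For step 4, the core of a native kernel is simple: in Arrow's list layout, each array's element count is the difference between consecutive offsets, and the validity mask determines whether to emit null instead. A hypothetical sketch of that computation (plain Rust, not using the actual `arrow` crate types):

```rust
// Hypothetical sketch of what a native kernel over Arrow list data computes:
// sizes come straight from the list offsets; the validity (null) mask decides
// whether a row produces a count or a null (array_size semantics).
fn array_sizes(offsets: &[i32], validity: &[bool]) -> Vec<Option<i32>> {
    offsets
        .windows(2)
        .zip(validity)
        .map(|(pair, &is_valid)| {
            if is_valid {
                Some(pair[1] - pair[0]) // size = end offset - start offset
            } else {
                None // null input array -> null size
            }
        })
        .collect()
}
```

For example, offsets `[0, 4, 4, 7]` with validity `[true, true, false]` describe a 4-element array, an empty array, and a null array. Before writing a new kernel, check whether DataFusion's existing array functions (e.g. its array-length support) can be reused directly.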

Additional context

Difficulty: Medium
Spark Expression Class: org.apache.spark.sql.catalyst.expressions.ArraySize

Related:

  • Size - The underlying expression that performs the actual size calculation
  • array - Function to create arrays
  • cardinality - Alternative function for getting array/map size

This issue was auto-generated from Spark reference documentation.
