Skip to content

feat: add native support for cardinality() expression#4376

Closed
oglego wants to merge 3 commits into
apache:mainfrom
oglego:feature/support-cardinality-expression
Closed

feat: add native support for cardinality() expression#4376
oglego wants to merge 3 commits into
apache:mainfrom
oglego:feature/support-cardinality-expression

Conversation

@oglego
Copy link
Copy Markdown

@oglego oglego commented May 20, 2026

Which issue does this PR close?

Closes #.

Rationale for this change

cardinality is a commonly used Spark SQL function (since 2.4.0) that was not yet supported natively by Comet, causing fallback to Spark execution. It returns the number of elements in an array or the number of key-value pairs in a map.

Per the Spark docs, cardinality returns -1 for null input only when both spark.sql.legacy.sizeOfNull=true AND spark.sql.ansi.enabled=false. With default settings it always returns NULL for null input. In Spark's implementation, cardinality is a direct alias for the Size expression with legacySizeOfNull = false hardcoded at parse time — the two-config gate never applies to cardinality.

What changes are included in this PR?

Adds native support for cardinality(expr) for both array and map inputs by extending the existing CometSize serde. There was no separate Cardinality class to wire — both cardinality and size produce a Size node in the logical plan. Two gaps in the Scala layer and one in the Rust layer prevented cardinality from offloading:

  1. CometSize.getSupportLevel was returning Unsupported for MapType inputs
  2. CometSize.convert was reading SQLConf.get.legacySizeOfNull (the global config) rather than expr.legacySizeOfNull (the instance field set at parse time), so cardinality could incorrectly return -1 for NULL when the legacy config was enabled
  3. SparkSizeFunc in the Rust layer used Signature::uniform with exact type stubs, causing DataFusion's type coercion to reject real map columns with concrete inner field types

Changes:

  • Extended CometSize in arrays.scala to accept MapType as Compatible()
  • Fixed CometSize.convert to use expr.legacySizeOfNull instead of the global config
  • Changed SparkSizeFunc signature to Signature::any(1) so real map and array types are accepted regardless of inner field types
  • Fixed null map entries in SparkSizeFunc to append null instead of hardcoded -1
  • Removed spark_answer_only from the size(arr), size(m) column query in size.sql since map inputs now run natively
  • Annotated the Size row in expressions.md to note it also covers the cardinality SQL alias

The implement-comet-expression and wire-datafusion-function skills were used to scaffold this implementation.

How are these changes tested?

New Comet SQL Test at spark/src/test/resources/sql-tests/expressions/array/cardinality.sql. Covers:

  • Array column input
  • Map column input
  • Array and map columns in the same query
  • NULL array and map inputs via column reference (asserts NULL is returned, not -1)
  • Nested array column input (array<array<int>>)
  • Array-of-structs column input (array<struct<a: int>>)
  • Literal array and map arguments (spark_answer_onlyCreateArray/CreateMap not yet natively supported by Comet, but correctness is verified against Spark)

The existing size.sql test now also covers the map-input path natively (removed spark_answer_only).

./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite cardinality" -Dtest=none
./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite size" -Dtest=none

…p inputs

test: update cardinality test suite
SELECT cardinality(struct_arr) FROM test_cardinality

-- literal array and map arguments (spark_answer_only: CreateArray/CreateMap not yet natively supported)
query spark_answer_only
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually we should support literals for array 🤔

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid tests when not selecting from the parquet table, Comet is not needed for such scalar queries

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about those mistakes, I've got those removed now!

@oglego
Copy link
Copy Markdown
Author

oglego commented May 21, 2026

I found the below when looking into the build errors:

CometIfSuite:

  • if expression (489 milliseconds)
    CometCoalesceSuite:
  • coalesce should return correct datatype (393 milliseconds)
  • test coalesce lazy eval (165 milliseconds)
  • string with coalesce (215 milliseconds)
    CometCaseWhenSuite:
  • case_when (857 milliseconds)
    Run completed in 15 minutes, 19 seconds.
    Total number of tests run: 736
    Suites: completed 37, aborted 0
    Tests: succeeded 734, failed 2, canceled 6, ignored 14, pending 0
    *** 2 TESTS FAILED ***

I don't think these are related to any changes I've made but If I am missing something on this just let me know and I can dig into it further. Thank you for the review on this I really appreciate it!

@oglego
Copy link
Copy Markdown
Author

oglego commented May 21, 2026

Apologies, it looks like the errors are related to

CometMapExpressionSuite:

  • read map[int, int] from parquet (648 milliseconds)
  • read map[struct, struct] from parquet (875 milliseconds)
  • map_from_arrays (283 milliseconds)
  • fallback for size with map input *** FAILED *** (126 milliseconds)

I'll dig into this more on my side, thanks again for the help!

@oglego
Copy link
Copy Markdown
Author

oglego commented May 21, 2026

I dug into this a bit more, and the underlying cardinality support is coming through Size. Enabling MapType there does allow cardinality/size on existing map-typed inputs, but constructor cases like cardinality(map(...)) still fall back because CreateMap is not supported yet. Given that, I’m going to hold off on developing this for now and wait until CreateMap support is in place so the feature boundary is cleaner and the map support story is more consistent.

@oglego oglego closed this May 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants