Skip to content

collection expression audit follow-ups (from #4473) #4504

@andygrove

Description

@andygrove

Tracking issue for follow-up work surfaced by the collection expression audit in #4473. Each item below is either a support-level / serde correctness fix that the audit deliberately deferred, or a behavioural gap the audit documented but did not implement.

High priority

1. CometConcat does not mark non-default Spark 4.0+ collations as Incompatible

Spark 4.0 widens Concat.allowedTypes from StringType to StringTypeWithCollation(supportsTrimCollation = true) and preserves the collation in the merged result type. CometConcat.getSupportLevel (spark/src/main/scala/org/apache/comet/serde/strings.scala:232-238) returns Compatible() whenever every child is StringType, regardless of collation, and the native concat UDF always produces Utf8 (UTF8_BINARY semantics). Per the audit-comet-expression skill (rule 11), Spark 4.0+ collation gaps must flip the support level to Incompatible(Some(reason)) so EXPLAIN and the auto-generated compatibility doc surface the divergence. Cross-reference #2190 / #4496.

2. CometReverse does not mark non-default Spark 4.0+ collations as Incompatible

Spark 4.0 widens Reverse.inputTypes from TypeCollection(StringType, ArrayType) to TypeCollection(StringTypeWithCollation(supportsTrimCollation = true), ArrayType) and propagates the collation through dataType. CometReverse.getSupportLevel (spark/src/main/scala/org/apache/comet/serde/collectionOperations.scala:32-38) returns Compatible() for the non-array (string) branch and the native reverse UDF reverses code units, producing Utf8. The string branch should report Incompatible(Some(reason)) for non-UTF8_BINARY collations, mirroring the concat fix. Cross-reference #2190 / #4496.

3. CometReverse.getIncompatibleReasons delegates only to the array branch

CometReverse.getIncompatibleReasons() (spark/src/main/scala/org/apache/comet/serde/collectionOperations.scala:29-30) returns CometArrayReverse.getIncompatibleReasons(). Once the collation gap from item 2 is reflected in getSupportLevel, the string-collation reason also needs to appear in getIncompatibleReasons() so it reaches the compatibility doc. Per skill rule 2, every distinct Incompatible(Some(r)) branch must be represented in the reasons method.

Medium priority

4. array/size.sql MapType column-reference path uses spark_answer_only

spark/src/test/resources/sql-tests/expressions/array/size.sql:24-25 runs the MapType column-reference case as query spark_answer_only, which accepts whatever Spark returns and does not lock in that Comet falls back. The intended behaviour after #4472 is that size(map_col) falls back to Spark, so the case should be query expect_fallback(...) with the same reason string that CometSize.getSupportLevel returns for MapType.

5. CometSize defensive non-array / non-map branch

CometSize.getSupportLevel (spark/src/main/scala/org/apache/comet/serde/arrays.scala:649-652) ends with an Unsupported(Some(s"Unsupported child data type: $other")) branch tagged "this should be unreachable because Spark only supports map and array inputs". Either drop the branch and let the match fail (turning a planner bug into a clear exception), or list the reason in getUnsupportedReasons() so the audit consistency check (skill rule 2) passes. Today the reason can be returned by getSupportLevel but is not enumerated in getUnsupportedReasons(), which only lists the MapType reason.


Surfaced by the audit-comet-expression skill run in #4473.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions