collection expression audit follow-ups (from #4473)

Tracking issue for follow-up work surfaced by the collection expression audit in #4473. Each item below is either a support-level / serde correctness fix that the audit deliberately deferred, or a behavioural gap the audit documented but did not implement.

## High priority

### 1. `CometConcat` does not mark non-default Spark 4.0+ collations as `Incompatible`

Spark 4.0 widens `Concat.allowedTypes` from `StringType` to `StringTypeWithCollation(supportsTrimCollation = true)` and preserves the collation in the merged result type. `CometConcat.getSupportLevel` (`spark/src/main/scala/org/apache/comet/serde/strings.scala:232-238`) returns `Compatible()` whenever every child is `StringType`, regardless of collation, and the native `concat` UDF always produces `Utf8` (`UTF8_BINARY` semantics). Per the `audit-comet-expression` skill (rule 11), Spark 4.0+ collation gaps must flip the support level to `Incompatible(Some(reason))` so EXPLAIN and the auto-generated compatibility doc surface the divergence. Cross-reference #2190 / #4496.

### 2. `CometReverse` does not mark non-default Spark 4.0+ collations as `Incompatible`

Spark 4.0 widens `Reverse.inputTypes` from `TypeCollection(StringType, ArrayType)` to `TypeCollection(StringTypeWithCollation(supportsTrimCollation = true), ArrayType)` and propagates the collation through `dataType`. `CometReverse.getSupportLevel` (`spark/src/main/scala/org/apache/comet/serde/collectionOperations.scala:32-38`) returns `Compatible()` for the non-array (string) branch and the native `reverse` UDF reverses code units, producing `Utf8`. The string branch should report `Incompatible(Some(reason))` for non-`UTF8_BINARY` collations, mirroring the concat fix. Cross-reference #2190 / #4496.

### 3. `CometReverse.getIncompatibleReasons` delegates only to the array branch

`CometReverse.getIncompatibleReasons()` (`spark/src/main/scala/org/apache/comet/serde/collectionOperations.scala:29-30`) returns `CometArrayReverse.getIncompatibleReasons()`. Once the collation gap from item 2 is reflected in `getSupportLevel`, the string-collation reason also needs to appear in `getIncompatibleReasons()` so it reaches the compatibility doc. Per skill rule 2, every distinct `Incompatible(Some(r))` branch must be represented in the reasons method.

## Medium priority

### 4. `array/size.sql` MapType column-reference path uses `spark_answer_only`

`spark/src/test/resources/sql-tests/expressions/array/size.sql:24-25` runs the MapType column-reference case as `query spark_answer_only`, which accepts whatever Spark returns and does not lock in that Comet falls back. The intended behaviour after #4472 is that `size(map_col)` falls back to Spark, so the case should be `query expect_fallback(...)` with the same reason string that `CometSize.getSupportLevel` returns for `MapType`.

### 5. `CometSize` defensive non-array / non-map branch

`CometSize.getSupportLevel` (`spark/src/main/scala/org/apache/comet/serde/arrays.scala:649-652`) ends with an `Unsupported(Some(s"Unsupported child data type: $other"))` branch tagged "this should be unreachable because Spark only supports map and array inputs". Either drop the branch and let the match fail (turning a planner bug into a clear exception), or list the reason in `getUnsupportedReasons()` so the audit consistency check (skill rule 2) passes. Today the reason can be returned by `getSupportLevel` but is not enumerated in `getUnsupportedReasons()`, which only lists the `MapType` reason.

---

Surfaced by the `audit-comet-expression` skill run in #4473.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

collection expression audit follow-ups (from #4473) #4504

High priority

1. `CometConcat` does not mark non-default Spark 4.0+ collations as `Incompatible`

2. `CometReverse` does not mark non-default Spark 4.0+ collations as `Incompatible`

3. `CometReverse.getIncompatibleReasons` delegates only to the array branch

Medium priority

4. `array/size.sql` MapType column-reference path uses `spark_answer_only`

5. `CometSize` defensive non-array / non-map branch

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

collection expression audit follow-ups (from #4473) #4504

Description

High priority

1. CometConcat does not mark non-default Spark 4.0+ collations as Incompatible

2. CometReverse does not mark non-default Spark 4.0+ collations as Incompatible

3. CometReverse.getIncompatibleReasons delegates only to the array branch

Medium priority

4. array/size.sql MapType column-reference path uses spark_answer_only

5. CometSize defensive non-array / non-map branch

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

1. `CometConcat` does not mark non-default Spark 4.0+ collations as `Incompatible`

2. `CometReverse` does not mark non-default Spark 4.0+ collations as `Incompatible`

3. `CometReverse.getIncompatibleReasons` delegates only to the array branch

4. `array/size.sql` MapType column-reference path uses `spark_answer_only`

5. `CometSize` defensive non-array / non-map branch