Tracking issue for follow-up work surfaced by the collection expression audit in #4473. Each item below is either a support-level / serde correctness fix that the audit deliberately deferred, or a behavioural gap the audit documented but did not implement.
High priority
1. CometConcat does not mark non-default Spark 4.0+ collations as Incompatible
Spark 4.0 widens Concat.allowedTypes from StringType to StringTypeWithCollation(supportsTrimCollation = true) and preserves the collation in the merged result type. CometConcat.getSupportLevel (spark/src/main/scala/org/apache/comet/serde/strings.scala:232-238) returns Compatible() whenever every child is StringType, regardless of collation, and the native concat UDF always produces Utf8 (UTF8_BINARY semantics). Per the audit-comet-expression skill (rule 11), Spark 4.0+ collation gaps must flip the support level to Incompatible(Some(reason)) so EXPLAIN and the auto-generated compatibility doc surface the divergence. Cross-reference #2190 / #4496.
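A minimal sketch of the intended check, not the actual Comet code. The types below are stand-ins so the example compiles without Spark or Comet on the classpath: StringType(collation) is a hypothetical model of a collated string type, and the reason text is a placeholder.

```scala
// Stand-in types; the real SupportLevel hierarchy lives in Comet.
sealed trait SupportLevel
final case class Compatible(notes: Option[String] = None) extends SupportLevel
final case class Incompatible(notes: Option[String] = None) extends SupportLevel

// Hypothetical model: a string type tagged with a collation name.
final case class StringType(collation: String = "UTF8_BINARY")

object ConcatSupportSketch {
  // Placeholder wording; the real reason string is up to the fix.
  val CollationReason: String =
    "native concat uses UTF8_BINARY semantics and ignores non-default collations"

  // Only the all-string case is modelled; other child types are out of scope here.
  def getSupportLevel(childTypes: Seq[StringType]): SupportLevel = {
    val hasNonDefaultCollation = childTypes.exists(_.collation != "UTF8_BINARY")
    if (hasNonDefaultCollation) Incompatible(Some(CollationReason))
    else Compatible()
  }
}
```

Under these assumptions, a single collated child is enough to flip the whole expression to Incompatible, which is what lets EXPLAIN and the compatibility doc report the gap.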
2. CometReverse does not mark non-default Spark 4.0+ collations as Incompatible
Spark 4.0 widens Reverse.inputTypes from TypeCollection(StringType, ArrayType) to TypeCollection(StringTypeWithCollation(supportsTrimCollation = true), ArrayType) and propagates the collation through dataType. CometReverse.getSupportLevel (spark/src/main/scala/org/apache/comet/serde/collectionOperations.scala:32-38) returns Compatible() for the non-array (string) branch and the native reverse UDF reverses code units, producing Utf8. The string branch should report Incompatible(Some(reason)) for non-UTF8_BINARY collations, mirroring the concat fix. Cross-reference #2190 / #4496.
3. CometReverse.getIncompatibleReasons delegates only to the array branch
CometReverse.getIncompatibleReasons() (spark/src/main/scala/org/apache/comet/serde/collectionOperations.scala:29-30) returns CometArrayReverse.getIncompatibleReasons(). Once the collation gap from item 2 is reflected in getSupportLevel, the string-collation reason also needs to appear in getIncompatibleReasons() so it reaches the compatibility doc. Per skill rule 2, every distinct Incompatible(Some(r)) branch must be represented in the reasons method.
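Items 2 and 3 could be addressed together along the following lines. This is a sketch under stated assumptions: the types are stand-ins rather than Spark or Comet classes, ArrayReverseSketch is a hypothetical substitute for CometArrayReverse, and both reason strings are placeholders.

```scala
// Stand-in types so the sketch compiles on its own.
sealed trait SupportLevel
final case class Compatible(notes: Option[String] = None) extends SupportLevel
final case class Incompatible(notes: Option[String] = None) extends SupportLevel

sealed trait DataType
// Hypothetical model: a string type tagged with a collation name.
final case class StringType(collation: String = "UTF8_BINARY") extends DataType
final case class ArrayType(elementType: DataType) extends DataType

// Hypothetical substitute for the array branch; its reason is a placeholder.
object ArrayReverseSketch {
  def getIncompatibleReasons(): Seq[String] = Seq("placeholder array-reverse reason")
  def getSupportLevel(t: ArrayType): SupportLevel = Compatible()
}

object ReverseSketch {
  // Placeholder wording for the string-collation reason.
  val CollationReason: String =
    "native reverse ignores non-UTF8_BINARY string collations"

  def getSupportLevel(childType: DataType): SupportLevel = childType match {
    case a: ArrayType => ArrayReverseSketch.getSupportLevel(a)
    case StringType(collation) if collation != "UTF8_BINARY" =>
      Incompatible(Some(CollationReason))
    case _ => Compatible()
  }

  // Array reasons plus the string-collation reason, so every
  // Incompatible(Some(r)) branch above is represented here.
  def getIncompatibleReasons(): Seq[String] =
    ArrayReverseSketch.getIncompatibleReasons() :+ CollationReason
}
```

Sharing one constant between getSupportLevel and getIncompatibleReasons keeps the two from drifting apart, which is the consistency that skill rule 2 asks for.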
Medium priority
4. array/size.sql MapType column-reference path uses spark_answer_only
spark/src/test/resources/sql-tests/expressions/array/size.sql:24-25 runs the MapType column-reference case as query spark_answer_only, which accepts whatever Spark returns and does not verify that Comet falls back. The intended behaviour after #4472 is that size(map_col) falls back to Spark, so the case should be query expect_fallback(...) with the same reason string that CometSize.getSupportLevel returns for MapType.
5. CometSize defensive non-array / non-map branch
CometSize.getSupportLevel (spark/src/main/scala/org/apache/comet/serde/arrays.scala:649-652) ends with an Unsupported(Some(s"Unsupported child data type: $other")) branch tagged "this should be unreachable because Spark only supports map and array inputs". Either drop the branch and let the match fail (turning a planner bug into a clear exception), or list the reason in getUnsupportedReasons() so the audit consistency check (skill rule 2) passes. Today the reason can be returned by getSupportLevel but is not enumerated in getUnsupportedReasons(), which only lists the MapType reason.
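The second option (keep the branch and enumerate its reason) might look like the sketch below. The types are stand-ins, the MapType reason is a placeholder, the ArrayType result is assumed to be Compatible(), and the "<type>" placeholder in the enumerated reason is an illustrative convention rather than anything the audit prescribes.

```scala
// Stand-in types so the sketch compiles on its own.
sealed trait SupportLevel
final case class Compatible(notes: Option[String] = None) extends SupportLevel
final case class Unsupported(notes: Option[String] = None) extends SupportLevel

sealed trait DataType
case object ArrayType extends DataType
case object MapType extends DataType
case object IntegerType extends DataType

object SizeSketch {
  // Placeholder; the real MapType reason lives in CometSize.
  val MapReason: String = "placeholder MapType fallback reason"
  val OtherReasonPrefix: String = "Unsupported child data type"

  def getSupportLevel(child: DataType): SupportLevel = child match {
    case ArrayType => Compatible() // assumed result for arrays
    case MapType => Unsupported(Some(MapReason))
    // Defensive branch: Spark should only pass map or array inputs.
    case other => Unsupported(Some(s"$OtherReasonPrefix: $other"))
  }

  // Both Unsupported branches are enumerated, including the defensive one.
  def getUnsupportedReasons(): Seq[String] =
    Seq(MapReason, s"$OtherReasonPrefix: <type>")
}
```

The first option is simply to delete the case other arm, in which case an unexpected input type raises a scala.MatchError instead of returning a support level.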
Surfaced by the audit-comet-expression skill run in #4473.