Describe the bug
Spark allows storing arbitrary byte sequences in STRING columns, including bytes that are not valid UTF-8 (for example via CAST(X'C1' AS STRING) and CAST(X'80' AS STRING)). Comet's native Parquet scan rejects these rows with:
org.apache.comet.CometNativeException
Arrow: Parquet argument error: Parquet error: encountered non UTF-8 data
This surfaces in Spark 4.1.1's hll.sql (newly added in 4.1) at query #10:
SELECT hll_sketch_estimate(hll_sketch_agg(s)) utf8_b FROM hll_string_test;
where hll_string_test is populated with INSERT INTO hll_string_test VALUES (''), (' '), (CAST(X'C1' AS STRING)), (CAST(X'80' AS STRING)), ... to exercise different collations including invalid Unicode bytes.
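To see why these particular bytes trip strict validation: 0xC1 is a lead byte that UTF-8 forbids (it could only begin an over-long encoding), and 0x80 is a continuation byte with no preceding lead byte. A minimal stdlib-only sketch (illustration of the validation rule, not Comet's code path):

```python
# Check the two byte sequences from hll.sql against strict UTF-8 decoding,
# which is the same validation rule Arrow's Parquet reader enforces for
# UTF8/STRING columns. Both are rejected.
for raw in (b"\xc1", b"\x80"):
    try:
        raw.decode("utf-8")  # strict mode: raises on invalid sequences
        valid = True
    except UnicodeDecodeError:
        valid = False
    print(raw, "valid UTF-8:", valid)
# → b'\xc1' valid UTF-8: False
# → b'\x80' valid UTF-8: False
```

Spark, by contrast, stores STRING values as raw bytes without validating them, so these rows survive the INSERT and only fail when Comet's native scan re-validates on read.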
Steps to reproduce
Run Spark 4.1.1's SQL test suite with Comet enabled (the Spark SQL Tests matrix entry for 4.1.1). The test hll.sql in SQLQueryTestSuite fails.
Expected behavior
Comet should either (a) accept invalid-UTF-8 bytes in STRING columns the way Spark does, or (b) fall back to Spark when reading STRING columns whose Parquet data contains invalid UTF-8.
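The semantics option (a) asks for can be modeled as lossless byte round-tripping: carry the raw bytes through the string type unvalidated, the way Spark's UTF8String does. A stdlib-only illustration using Python's "surrogateescape" error handler (an analogy only, not Comet's or Spark's actual mechanism):

```python
# "surrogateescape" maps each invalid byte to a lone surrogate code point on
# decode and back to the original byte on encode, so arbitrary byte sequences
# round-trip through str without loss -- the behavior a permissive STRING
# reader would need.
raw = b"\xc1\x80hello"
s = raw.decode("utf-8", errors="surrogateescape")  # never raises
assert s.encode("utf-8", errors="surrogateescape") == raw  # lossless round trip
print("round-tripped", len(raw), "bytes")
```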
Workaround
hll.sql is currently disabled when Comet is enabled via --SET spark.comet.enabled = false at the top of the file in dev/diffs/4.1.1.diff (introduced as part of the Spark 4.1.1 SQL test matrix landing).
Additional context
PR #4093 enables Spark 4.1.1 in the Spark SQL Tests workflow.