Comet native scan rejects invalid UTF-8 byte sequences in STRING column (hll.sql on Spark 4.1) #4121

@andygrove

Description

Describe the bug

Spark allows storing arbitrary byte sequences in STRING columns, including bytes that are not valid UTF-8 (for example via CAST(X'C1' AS STRING) and CAST(X'80' AS STRING)). Comet's native Parquet scan rejects these rows with:

org.apache.comet.CometNativeException
Arrow: Parquet argument error: Parquet error: encountered non UTF-8 data

This surfaces in Spark 4.1.1's hll.sql (newly added in 4.1) at query #10:

SELECT hll_sketch_estimate(hll_sketch_agg(s)) utf8_b FROM hll_string_test;

where hll_string_test is populated with INSERT INTO hll_string_test VALUES (''), (' '), (CAST(X'C1' AS STRING)), (CAST(X'80' AS STRING)), ... so that the suite exercises different collations, including values whose bytes are not valid Unicode.
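For context on why the native scan rejects exactly these rows: both literals are individually invalid as UTF-8 (0xC1 is a lead byte reserved for forbidden overlong encodings, and 0x80 is a continuation byte with no preceding lead byte). A minimal standalone check, independent of Spark or Comet:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# The literals inserted by hll.sql:
print(is_valid_utf8(b""))      # True  -- empty string is valid UTF-8
print(is_valid_utf8(b" "))     # True
print(is_valid_utf8(b"\xc1"))  # False -- overlong-encoding lead byte
print(is_valid_utf8(b"\x80"))  # False -- stray continuation byte
```

Spark's internal UTF8String tolerates such bytes, whereas Arrow's StringArray requires valid UTF-8, which is where the native scan and Spark diverge.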

Steps to reproduce

Run Spark 4.1.1's SQL test suite with Comet enabled (the Spark SQL Tests matrix entry for 4.1.1). The hll.sql test in SQLQueryTestSuite fails with the error above.

Expected behavior

Comet should either (a) accept invalid-UTF-8 bytes in STRING columns the way Spark does, or (b) fall back to Spark when reading STRING columns whose Parquet data contains invalid UTF-8.
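Option (b) amounts to a per-batch validity check before handing data to the native path. The sketch below is purely illustrative (the function name and the list-of-bytes input are hypothetical, not Comet's actual API, which would operate on Parquet byte-array pages in Rust):

```python
def should_fall_back_to_spark(string_column: list[bytes]) -> bool:
    """Hypothetical decision helper: fall back to Spark's scan if any
    STRING value in the batch is not valid UTF-8, since Arrow's string
    arrays cannot represent such bytes on the native path."""
    for value in string_column:
        try:
            value.decode("utf-8")
        except UnicodeDecodeError:
            return True  # invalid UTF-8 found: native scan would reject it
    return False

# The hll.sql test data would trigger the fallback:
print(should_fall_back_to_spark([b"", b" ", b"\xc1", b"\x80"]))  # True
print(should_fall_back_to_spark([b"hello", "café".encode()]))    # False
```

Option (a) would instead require reading the column as raw binary (skipping Arrow's UTF-8 validation) and carrying the bytes through unchanged, matching Spark's permissive UTF8String semantics.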

Workaround

As a workaround, hll.sql is currently disabled under Comet by adding --SET spark.comet.enabled = false at the top of the file in dev/diffs/4.1.1.diff (introduced as part of the Spark 4.1.1 SQL test matrix landing).

Additional context

PR #4093 enables Spark 4.1.1 in the Spark SQL Tests workflow.
