Comet native scan rejects invalid UTF-8 byte sequences in STRING column (hll.sql on Spark 4.1) #4121

@andygrove

Description

Describe the bug

Spark allows storing arbitrary byte sequences in STRING columns, including bytes that are not valid UTF-8 (for example via CAST(X'C1' AS STRING) and CAST(X'80' AS STRING)). Comet's native Parquet scan rejects these rows with:

org.apache.comet.CometNativeException
Arrow: Parquet argument error: Parquet error: encountered non UTF-8 data

This surfaces in Spark 4.1.1's hll.sql (newly added in 4.1) at query #10:

SELECT hll_sketch_estimate(hll_sketch_agg(s)) utf8_b FROM hll_string_test;

where hll_string_test is populated with INSERT INTO hll_string_test VALUES (''), (' '), (CAST(X'C1' AS STRING)), (CAST(X'80' AS STRING)), ... so that the suite exercises different collations, including values whose bytes are not valid Unicode.
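For context on why the native scan rejects exactly these rows: both literals are individually invalid as UTF-8 (0xC1 is a lead byte reserved for forbidden overlong encodings, and 0x80 is a continuation byte with no preceding lead byte). A minimal standalone check, independent of Spark or Comet:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# The literals inserted by hll.sql:
print(is_valid_utf8(b""))      # True  -- empty string is valid UTF-8
print(is_valid_utf8(b" "))     # True
print(is_valid_utf8(b"\xc1"))  # False -- overlong-encoding lead byte
print(is_valid_utf8(b"\x80"))  # False -- stray continuation byte
```

Spark's internal UTF8String tolerates such bytes, whereas Arrow's StringArray requires valid UTF-8, which is where the native scan and Spark diverge.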

Steps to reproduce

Run Spark 4.1.1's SQL test suite with Comet enabled (the Spark SQL Tests matrix entry for 4.1.1). The hll.sql test in SQLQueryTestSuite fails with the error above.

Expected behavior

Comet should either (a) accept invalid-UTF-8 bytes in STRING columns the way Spark does, or (b) fall back to Spark when reading STRING columns whose Parquet data contains invalid UTF-8.
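Option (b) amounts to a per-batch validity check before handing data to the native path. The sketch below is purely illustrative (the function name and the list-of-bytes input are hypothetical, not Comet's actual API, which would operate on Parquet byte-array pages in Rust):

```python
def should_fall_back_to_spark(string_column: list[bytes]) -> bool:
    """Hypothetical decision helper: fall back to Spark's scan if any
    STRING value in the batch is not valid UTF-8, since Arrow's string
    arrays cannot represent such bytes on the native path."""
    for value in string_column:
        try:
            value.decode("utf-8")
        except UnicodeDecodeError:
            return True  # invalid UTF-8 found: native scan would reject it
    return False

# The hll.sql test data would trigger the fallback:
print(should_fall_back_to_spark([b"", b" ", b"\xc1", b"\x80"]))  # True
print(should_fall_back_to_spark([b"hello", "café".encode()]))    # False
```

Option (a) would instead require reading the column as raw binary (skipping Arrow's UTF-8 validation) and carrying the bytes through unchanged, matching Spark's permissive UTF8String semantics.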

Workaround

As a workaround, hll.sql is currently disabled under Comet by adding --SET spark.comet.enabled = false at the top of the file in dev/diffs/4.1.1.diff (introduced as part of the Spark 4.1.1 SQL test matrix landing).

Additional context

PR #4093 enables Spark 4.1.1 in the Spark SQL Tests workflow.
