Native scan file-read failures should surface as Spark's FAILED_READ_FILE.NO_HINT #4529

### Problem

When a native Parquet scan hits a corrupt footer, corrupt page/column data, a truncated/empty file, or a deleted file, Comet rethrows the raw DataFusion / arrow-rs / object_store message instead of Spark's `FAILED_READ_FILE.NO_HINT`. The native path does **not** go through Spark's `FileScanRDD`, so `InputFileBlockHolder` is usually unpopulated and the offending file path is missing from the error. On `main` the only handling is a regex on `^Parquet error:` that rewrites the failure to `_LEGACY_ERROR_TEMP_2254` / "File is not a Parquet file." That handling misses most file-read failures and doesn't match what Spark's own reader produces.

Observed native messages that should all map to `FAILED_READ_FILE`:

- `Parquet error: ...` — DataFusion, corrupt footer etc.
- `Arrow: Parquet argument error: External: failed to fill whole buffer` — arrow-rs, corrupt page/column data hit **during execution** (Spark's JVM-side footer pre-check passes, so this only surfaces natively).
- `Requested range was invalid` — object_store, 0-byte / truncated file.
- `Object at location ... not found` — object_store, deleted file.

### Proposed fix

- `CometExecIterator.isFileReadError` classifies file-read failures by matching those specific phrasings — deliberately **not** the broad `Generic <Store> error:` prefix, which also covers non-file config errors (e.g. `Generic HadoopFileSystem error: Hdfs support is not enabled in this build`) that must surface as-is.
- `ShimSparkErrorConverter.wrapNativeParquetError` (in both the spark-3.5 and spark-4.x shims) wraps the cause via `QueryExecutionErrors.cannotReadFilesError(cause, path)`.
- Thread per-partition file paths from `CometNativeScanExec` → `CometExecRDD` → `CometExecIterator` so the wrapped error names the actual file, with an `InputFileBlockHolder` fallback for any code path that does populate it.
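
A minimal sketch of the classification rule in the first bullet, to make the intended matching concrete. It is not the actual `CometExecIterator` code: the object name `FileReadErrorSketch`, the use of substring matching rather than anchored regexes, and the exact pattern strings are assumptions derived only from the messages listed under Problem.

```scala
// Hypothetical stand-alone sketch; the real method lives on CometExecIterator.
object FileReadErrorSketch {

  // Fixed phrasings taken from the observed native messages listed above.
  private val phrases: Seq[String] = Seq(
    "Parquet error:",              // DataFusion, corrupt footer etc.
    "failed to fill whole buffer", // arrow-rs, corrupt page/column data
    "Requested range was invalid"  // object_store, 0-byte / truncated file
  )

  // object_store "not found" message; the location in the middle varies.
  private val notFound = "Object at location .* not found".r

  /** Returns true only for messages that look like file-read failures. */
  def isFileReadError(message: String): Boolean =
    message != null &&
      (phrases.exists(p => message.contains(p)) ||
        notFound.findFirstIn(message).isDefined)
}
```

With these patterns, `Generic HadoopFileSystem error: Hdfs support is not enabled in this build` is not classified as a file-read failure, because the broad `Generic <Store> error:` prefix is deliberately absent from the list, so that message would still surface as-is.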

Scope note: this wraps the direct native-scan execution path (`CometNativeScanExec.doExecuteColumnar`). Surfacing the path when a scan is fused into a larger native block is a follow-up.

### Relationship to the Delta integration

Standalone error-compatibility improvement for all native Parquet scans. It is **required for** the in-progress Delta Lake contrib integration (Delta's corrupt-file / broken-checkpoint suites assert the `FAILED_READ_FILE` message and path), so it would help to prioritize it accordingly.

