Native scan file-read failures should surface as Spark's FAILED_READ_FILE.NO_HINT #4529

@schenksj

Problem

When a native Parquet scan hits a corrupt footer, corrupt page/column data, a truncated/empty file, or a deleted file, Comet rethrows the raw DataFusion / arrow-rs / object_store message instead of Spark's FAILED_READ_FILE.NO_HINT. The native path does not go through Spark's FileScanRDD, so InputFileBlockHolder is usually unpopulated and the offending path is missing. On main the only handling is a regex on ^Parquet error: that rewrites to _LEGACY_ERROR_TEMP_2254 / "File is not a Parquet file." — which both misses most file-read failures and doesn't match what Spark's own reader produces.

Observed native messages that should all map to FAILED_READ_FILE:

  • Parquet error: ... — DataFusion, corrupt footer etc.
  • Arrow: Parquet argument error: External: failed to fill whole buffer — arrow-rs, corrupt page/column data hit during execution (Spark's JVM-side footer pre-check passes, so this only surfaces natively).
  • Requested range was invalid — object_store, 0-byte / truncated file.
  • Object at location ... not found — object_store, deleted file.

Proposed fix

  • CometExecIterator.isFileReadError classifies file-read failures by matching those specific phrasings — deliberately not the broad Generic <Store> error: prefix, which also covers non-file config errors (e.g. Generic HadoopFileSystem error: Hdfs support is not enabled in this build) that must surface as-is.
  • ShimSparkErrorConverter.wrapNativeParquetError (in both the spark-3.5 and spark-4.x shims) wraps the cause via QueryExecutionErrors.cannotReadFilesError(cause, path).
  • Thread per-partition file paths from CometNativeScanExec → CometExecRDD → CometExecIterator so the wrapped error names the actual file, with an InputFileBlockHolder fallback for any path that does populate it.
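
To make the classification step concrete, here is a minimal sketch of the phrase matching described above. It is written in Java for illustration only and is not the actual CometExecIterator.isFileReadError code: the class name FileReadErrorClassifier and the exact regular expressions are assumptions derived from the four messages quoted in this issue.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: the name and patterns are illustrative, not Comet's code.
final class FileReadErrorClassifier {

    // One pattern per native phrasing listed under "Observed native messages".
    // The broad "Generic <Store> error:" prefix is deliberately absent, so
    // non-file configuration errors are not classified as file-read failures.
    private static final List<Pattern> FILE_READ_PATTERNS = List.of(
        Pattern.compile("^Parquet error:"),                   // DataFusion, corrupt footer etc.
        Pattern.compile("failed to fill whole buffer"),       // arrow-rs, corrupt page/column data
        Pattern.compile("Requested range was invalid"),       // object_store, 0-byte / truncated file
        Pattern.compile("Object at location .* not found"));  // object_store, deleted file

    private FileReadErrorClassifier() {}

    // Returns true when the native message matches one of the known phrasings.
    static boolean isFileReadError(String message) {
        if (message == null) {
            return false;
        }
        for (Pattern pattern : FILE_READ_PATTERNS) {
            if (pattern.matcher(message).find()) {
                return true;
            }
        }
        return false;
    }
}
```

Under these assumed patterns, the HadoopFileSystem configuration error quoted above is not classified as a file-read failure and would surface as-is, while each of the four observed phrasings is.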

Scope note: this wraps the direct native-scan execution path (CometNativeScanExec.doExecuteColumnar). Surfacing the path when a scan is fused into a larger native block is a follow-up.

Relationship to the Delta integration

Standalone error-compatibility improvement for all native Parquet scans. It is required for the in-progress Delta Lake contrib integration (Delta's corrupt-file / broken-checkpoint suites assert the FAILED_READ_FILE message and path), so it would help to prioritize it accordingly.
