Problem
When a native Parquet scan hits a corrupt footer, corrupt page/column data, a truncated/empty file, or a deleted file, Comet rethrows the raw DataFusion / arrow-rs / object_store message instead of Spark's FAILED_READ_FILE.NO_HINT. The native path does not go through Spark's FileScanRDD, so InputFileBlockHolder is usually unpopulated and the offending path is missing. On main the only handling is a regex on ^Parquet error: that rewrites to _LEGACY_ERROR_TEMP_2254 / "File is not a Parquet file." — which both misses most file-read failures and doesn't match what Spark's own reader produces.
Observed native messages that should all map to FAILED_READ_FILE:
Parquet error: ... — DataFusion, corrupt footer etc.
Arrow: Parquet argument error: External: failed to fill whole buffer — arrow-rs, corrupt page/column data hit during execution (Spark's JVM-side footer pre-check passes, so this only surfaces natively).
Requested range was invalid — object_store, 0-byte / truncated file.
Object at location ... not found — object_store, deleted file.
Proposed fix
CometExecIterator.isFileReadError classifies file-read failures by matching those specific phrasings — deliberately not the broad Generic <Store> error: prefix, which also covers non-file config errors (e.g. Generic HadoopFileSystem error: Hdfs support is not enabled in this build) that must surface as-is.
ShimSparkErrorConverter.wrapNativeParquetError (in both the spark-3.5 and spark-4.x shims) wraps the cause via QueryExecutionErrors.cannotReadFilesError(cause, path).
Thread per-partition file paths from CometNativeScanExec → CometExecRDD → CometExecIterator so the wrapped error names the actual file, with an InputFileBlockHolder fallback for any path that does populate it.
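As a rough illustration of the classification step, the sketch below matches only the four phrasings listed under "Observed native messages". The object name FileReadErrorSketch and the exact regexes are assumptions made for illustration; the real CometExecIterator.isFileReadError may be structured differently.

```scala
import scala.util.matching.Regex

// Hypothetical stand-in for CometExecIterator.isFileReadError.
object FileReadErrorSketch {

  // One pattern per observed native message. The broad
  // "Generic <Store> error:" prefix is intentionally absent so that
  // non-file configuration errors are not classified as file-read failures.
  private val fileReadPatterns: Seq[Regex] = Seq(
    "^Parquet error:".r,                  // DataFusion: corrupt footer etc.
    "failed to fill whole buffer".r,      // arrow-rs: corrupt page/column data
    "Requested range was invalid".r,      // object_store: 0-byte / truncated file
    "Object at location .* not found".r   // object_store: deleted file
  )

  // True when the native message looks like a file-read failure.
  // A null message is treated as "not a file-read error".
  def isFileReadError(message: String): Boolean =
    message != null && fileReadPatterns.exists(_.findFirstIn(message).isDefined)
}
```

With these patterns, a message such as "Generic HadoopFileSystem error: Hdfs support is not enabled in this build" is not classified as a file-read error and would surface unchanged.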
Scope note: this wraps the direct native-scan execution path (CometNativeScanExec.doExecuteColumnar). Surfacing the path when a scan is fused into a larger native block is a follow-up.
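The path selection for the wrapped error (threaded per-partition paths first, InputFileBlockHolder as the fallback) could be sketched as follows. Everything here is hypothetical: the object name, the resolvePath helper, and in particular the choice to join several paths when a partition covers more than one file are assumptions, not something this proposal specifies.

```scala
// Hypothetical helper illustrating which path the wrapped error would name.
object OffendingPathSketch {

  // partitionFiles: file paths threaded down for the current partition.
  // holderPath: the path reported by InputFileBlockHolder, passed in as a
  //             plain String; null or empty is treated as "unpopulated".
  // Returns None when neither source knows a path.
  def resolvePath(partitionFiles: Seq[String], holderPath: String): Option[String] =
    if (partitionFiles.nonEmpty) {
      // Assumption: report every file in the partition when there are several.
      Some(partitionFiles.mkString(", "))
    } else {
      Option(holderPath).filter(_.nonEmpty)
    }
}
```

The resolved value is what would then be handed to ShimSparkErrorConverter.wrapNativeParquetError alongside the native cause.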
Relationship to the Delta integration
Standalone error-compatibility improvement for all native Parquet scans. It is required for the in-progress Delta Lake contrib integration (Delta's corrupt-file / broken-checkpoint suites assert the FAILED_READ_FILE message and path), so it would help to prioritize it accordingly.