feat: surface native parquet read failures as FAILED_READ_FILE by schenksj · Pull Request #4534 · apache/datafusion-comet

schenksj · 2026-05-30T11:23:32Z

Which issue does this PR close?

Closes #4529.

Rationale for this change

When Comet's native DataFusion scan hits a corrupt footer, corrupt page/column data, a truncated/empty file, or a deleted file, it rethrew the raw native message instead of Spark's FAILED_READ_FILE.NO_HINT. The native path does not go through Spark's FileScanRDD, so InputFileBlockHolder is usually unpopulated and the offending path was missing.

The prior handling matched only a ^Parquet error: regex and rewrote to _LEGACY_ERROR_TEMP_2254 / "File is not a Parquet file.", which both missed most file-read failures and didn't match what Spark's own reader produces.

This is a standalone error-compatibility improvement for all native Parquet scans. It was surfaced while working on the Delta Lake contrib integration (Delta's corrupt-file / broken-checkpoint suites assert the FAILED_READ_FILE message and path), so prioritizing it helps that effort.

What changes are included in this PR?

CometExecIterator.isFileReadError classifies file-read failures by matching the specific native phrasings: DataFusion Parquet error:, arrow-rs Parquet argument error / failed to fill whole buffer (corrupt page/column data hit during execution), and object_store Requested range was invalid (truncated/empty) and Object at location ... not found (deleted). It deliberately does not match the broad Generic <Store> error: prefix, which also covers non-file config errors (e.g. Generic HadoopFileSystem error: Hdfs support is not enabled in this build) that must surface as-is.
ShimSparkErrorConverter.wrapNativeParquetError (in both the spark-3.5 and spark-4.x shims) wraps the cause via QueryExecutionErrors.cannotReadFilesError(cause, path).
Per-partition file paths are threaded from CometNativeScanExec → CometExecRDD → CometExecIterator so the wrapped error names the actual file, with an InputFileBlockHolder fallback for any path that does populate it.

Scope note: this wraps the direct native-scan execution path (CometNativeScanExec.doExecuteColumnar); surfacing the path when a scan is fused into a larger native block is a follow-up.

How are these changes tested?

New test in CometExecSuite writes a valid parquet file, corrupts column/page data in the middle while leaving the footer intact (so Spark's JVM footer pre-check passes and the failure surfaces from the native reader during execution), reads it through Comet, and asserts the cause chain contains a FAILED_READ_FILE exception naming the file. The test fails on main (raw CometNativeException) and passes with this change. CometExecSuite passes (127/0).

When Comet's native DataFusion scan hits a corrupt footer, corrupt page/column data, a truncated/empty file, or a deleted file, it rethrew the raw native message instead of Spark's FAILED_READ_FILE.NO_HINT. The native path does not go through Spark's FileScanRDD, so InputFileBlockHolder is usually unpopulated and the offending path was missing. The prior handling matched only a "^Parquet error:" regex and rewrote to _LEGACY_ERROR_TEMP_2254 / "File is not a Parquet file.", which both missed most file-read failures and didn't match what Spark's own reader produces. - CometExecIterator.isFileReadError classifies file-read failures by matching the specific native phrasings (DataFusion "Parquet error:", arrow-rs "Parquet argument error" / "failed to fill whole buffer", object_store "Requested range was invalid" and "Object at location ... not found") -- deliberately not the broad "Generic <Store> error:" prefix, which also covers non-file config errors that must surface as-is. - ShimSparkErrorConverter.wrapNativeParquetError (spark-3.5 and spark-4.x) wraps the cause via QueryExecutionErrors.cannotReadFilesError. - Per-partition file paths are threaded from CometNativeScanExec through CometExecRDD to CometExecIterator so the wrapped error names the actual file, with an InputFileBlockHolder fallback. Closes apache#4529 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

schenksj · 2026-05-30T12:04:55Z

Moving to draft: reworking the JVM-side string classification (isFileReadError) into a typed native error channel. Instead of substring-matching DataFusion/arrow-rs/object_store error prose on the JVM, the native side will classify file-read failures into a typed SparkError variant that crosses JNI as the existing structured CometQueryExecutionException JSON payload and is decoded by SparkErrorConverter — removing the fragile string matching entirely. Will mark ready once the rework is in.

schenksj · 2026-05-30T13:14:57Z

closing until i can land my fix

schenksj marked this pull request as draft May 30, 2026 12:04

schenksj closed this May 30, 2026

schenksj mentioned this pull request May 30, 2026

feat: surface native parquet read failures as FAILED_READ_FILE #4536

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: surface native parquet read failures as FAILED_READ_FILE#4534

feat: surface native parquet read failures as FAILED_READ_FILE#4534
schenksj wants to merge 1 commit into
apache:mainfrom
schenksj:fix/failed-read-file-wrapping

schenksj commented May 30, 2026

Uh oh!

schenksj commented May 30, 2026

Uh oh!

schenksj commented May 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

schenksj commented May 30, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

How are these changes tested?

Uh oh!

schenksj commented May 30, 2026

Uh oh!

schenksj commented May 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant