Skip to content

feat: surface native parquet read failures as FAILED_READ_FILE#4534

Closed
schenksj wants to merge 1 commit into
apache:mainfrom
schenksj:fix/failed-read-file-wrapping
Closed

feat: surface native parquet read failures as FAILED_READ_FILE#4534
schenksj wants to merge 1 commit into
apache:mainfrom
schenksj:fix/failed-read-file-wrapping

Conversation

@schenksj
Copy link
Copy Markdown

Which issue does this PR close?

Closes #4529.

Rationale for this change

When Comet's native DataFusion scan hits a corrupt footer, corrupt page/column data, a truncated/empty file, or a deleted file, it rethrew the raw native message instead of Spark's FAILED_READ_FILE.NO_HINT. The native path does not go through Spark's FileScanRDD, so InputFileBlockHolder is usually unpopulated and the offending path was missing.

The prior handling matched only a ^Parquet error: regex and rewrote to _LEGACY_ERROR_TEMP_2254 / "File is not a Parquet file.", which both missed most file-read failures and didn't match what Spark's own reader produces.

This is a standalone error-compatibility improvement for all native Parquet scans. It was surfaced while working on the Delta Lake contrib integration (Delta's corrupt-file / broken-checkpoint suites assert the FAILED_READ_FILE message and path), so prioritizing it helps that effort.

What changes are included in this PR?

  • CometExecIterator.isFileReadError classifies file-read failures by matching the specific native phrasings: DataFusion Parquet error:, arrow-rs Parquet argument error / failed to fill whole buffer (corrupt page/column data hit during execution), and object_store Requested range was invalid (truncated/empty) and Object at location ... not found (deleted). It deliberately does not match the broad Generic <Store> error: prefix, which also covers non-file config errors (e.g. Generic HadoopFileSystem error: Hdfs support is not enabled in this build) that must surface as-is.
  • ShimSparkErrorConverter.wrapNativeParquetError (in both the spark-3.5 and spark-4.x shims) wraps the cause via QueryExecutionErrors.cannotReadFilesError(cause, path).
  • Per-partition file paths are threaded from CometNativeScanExecCometExecRDDCometExecIterator so the wrapped error names the actual file, with an InputFileBlockHolder fallback for any path that does populate it.

Scope note: this wraps the direct native-scan execution path (CometNativeScanExec.doExecuteColumnar); surfacing the path when a scan is fused into a larger native block is a follow-up.

How are these changes tested?

New test in CometExecSuite writes a valid parquet file, corrupts column/page data in the middle while leaving the footer intact (so Spark's JVM footer pre-check passes and the failure surfaces from the native reader during execution), reads it through Comet, and asserts the cause chain contains a FAILED_READ_FILE exception naming the file. The test fails on main (raw CometNativeException) and passes with this change. CometExecSuite passes (127/0).

When Comet's native DataFusion scan hits a corrupt footer, corrupt page/column
data, a truncated/empty file, or a deleted file, it rethrew the raw native
message instead of Spark's FAILED_READ_FILE.NO_HINT. The native path does not
go through Spark's FileScanRDD, so InputFileBlockHolder is usually unpopulated
and the offending path was missing. The prior handling matched only a
"^Parquet error:" regex and rewrote to _LEGACY_ERROR_TEMP_2254 / "File is not a
Parquet file.", which both missed most file-read failures and didn't match what
Spark's own reader produces.

- CometExecIterator.isFileReadError classifies file-read failures by matching the
  specific native phrasings (DataFusion "Parquet error:", arrow-rs "Parquet
  argument error" / "failed to fill whole buffer", object_store "Requested range
  was invalid" and "Object at location ... not found") -- deliberately not the
  broad "Generic <Store> error:" prefix, which also covers non-file config errors
  that must surface as-is.
- ShimSparkErrorConverter.wrapNativeParquetError (spark-3.5 and spark-4.x) wraps
  the cause via QueryExecutionErrors.cannotReadFilesError.
- Per-partition file paths are threaded from CometNativeScanExec through
  CometExecRDD to CometExecIterator so the wrapped error names the actual file,
  with an InputFileBlockHolder fallback.

Closes apache#4529

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@schenksj schenksj marked this pull request as draft May 30, 2026 12:04
@schenksj
Copy link
Copy Markdown
Author

Moving to draft: reworking the JVM-side string classification (isFileReadError) into a typed native error channel. Instead of substring-matching DataFusion/arrow-rs/object_store error prose on the JVM, the native side will classify file-read failures into a typed SparkError variant that crosses JNI as the existing structured CometQueryExecutionException JSON payload and is decoded by SparkErrorConverter — removing the fragile string matching entirely. Will mark ready once the rework is in.

@schenksj
Copy link
Copy Markdown
Author

closing until i can land my fix

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Native scan file-read failures should surface as Spark's FAILED_READ_FILE.NO_HINT

1 participant