feat: surface native parquet read failures as FAILED_READ_FILE by schenksj · Pull Request #4536 · apache/datafusion-comet

schenksj · 2026-05-30T13:37:08Z

Which issue does this PR close?

Closes #4529.

(Supersedes #4534, which was closed; this PR carries the typed-classification design described below.)

Rationale for this change

When Comet's native DataFusion scan hits a corrupt footer, corrupt page/column data, or a truncated/empty file, it rethrows the raw native message instead of Spark's FAILED_READ_FILE, and the offending path is usually missing.

A genuinely missing file (one deleted/vacuumed between planning and execution) is a different case: Spark reports it not as FAILED_READ_FILE but via readCurrentFileNotFoundError ("It is possible the underlying files have been updated. You can explicitly invalidate the cache ... by recreating the Dataset/DataFrame involved."). Comet must distinguish the two so that the right Spark error and message are produced.

This is a standalone error-compatibility improvement for all native Parquet scans. It was surfaced while working on the Delta Lake contrib integration (Delta's corrupt-file / broken-checkpoint suites assert the FAILED_READ_FILE message and path; Delta's CDC-after-VACUUM suites assert the readCurrentFileNotFoundError message).

What changes are included in this PR?

File-read failures are classified by typed DataFusionError variant in the native error path — not by matching error-message prose. The messages come from three upstream crates (DataFusion, arrow-rs, object_store) and drift across version bumps with no compile-time signal; the typed match arms are checked by the compiler.

native:
- SparkError::CannotReadFile { file_path, message } for a readable-but-broken file (corrupt footer/page, truncated, generic IO). try_classify_file_read_error in the JNI bridge maps the relevant file-read DataFusionError variants (ParquetError, ObjectStore, ArrowError wrapping a ParquetError, IoError; unwrapping Context/Shared) into it.
- SparkError::FileNotFound { message } for a missing file — an object_store::Error::NotFound. The NotFound may arrive directly (DataFusionError::ObjectStore) or wrapped by the parquet reader as ParquetError::External(..) / ArrowError::ParquetError, so a source_chain_has_object_store_not_found helper walks the typed source chain (still no message matching).
- It deliberately does not match object_store::Error::Generic / DataFusionError::Execution, which also carry non-file config errors (e.g. "Hdfs support is not enabled in this build") that must surface as-is.
JVM: the structured error crosses JNI as the existing CometQueryExecutionException JSON payload; SparkErrorConverter decodes "CannotReadFile" (filling the path from the per-task file list when the native error carried none) and "FileNotFound". The spark-3.5 and spark-4.x shims wrap CannotReadFile via QueryExecutionErrors.cannotReadFilesError (FAILED_READ_FILE) and FileNotFound via QueryExecutionErrors.readCurrentFileNotFoundError. No JVM-side message matching.
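The variant-based classification described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `StoreError`, `ParquetError`, and `EngineError` are simplified stand-ins for the upstream `object_store`, parquet, and `DataFusionError` enums, and `classify` is a hypothetical analogue of `try_classify_file_read_error`. Only the `SparkError` variant names and fields are taken from the description.

```rust
// Stand-in types: simplified mirrors of the upstream error enums,
// NOT the real datafusion / parquet / object_store definitions.
#[derive(Debug)]
enum StoreError {
    NotFound { path: String },
    Generic(String), // upstream also uses this for non-file config errors
}

#[derive(Debug)]
enum ParquetError {
    General(String),
    External(Box<StoreError>),
}

#[derive(Debug)]
enum EngineError {
    Parquet(ParquetError),
    ObjectStore(StoreError),
    Execution(String),
    Context(String, Box<EngineError>),
}

#[derive(Debug, PartialEq)]
enum SparkError {
    CannotReadFile { file_path: String, message: String },
    FileNotFound { message: String },
}

fn not_found(path: &str) -> SparkError {
    SparkError::FileNotFound { message: format!("object not found: {path}") }
}

/// Classify by variant only. `None` means "not a file-read failure; surface as-is".
fn classify(err: &EngineError) -> Option<SparkError> {
    match err {
        // Unwrap context wrappers and classify the inner error.
        EngineError::Context(_, inner) => classify(inner),
        // A missing object that arrives directly.
        EngineError::ObjectStore(StoreError::NotFound { path }) => Some(not_found(path)),
        // A reader error wrapping a store error: missing vs. otherwise unreadable.
        // Treating the non-NotFound case as CannotReadFile is an assumption of this sketch.
        EngineError::Parquet(ParquetError::External(inner)) => match &**inner {
            StoreError::NotFound { path } => Some(not_found(path)),
            StoreError::Generic(msg) => Some(SparkError::CannotReadFile {
                file_path: String::new(), // no path known here; the JVM side may fill it in
                message: msg.clone(),
            }),
        },
        // A readable-but-broken file.
        EngineError::Parquet(ParquetError::General(msg)) => Some(SparkError::CannotReadFile {
            file_path: String::new(),
            message: msg.clone(),
        }),
        // Deliberately unmatched: these also carry non-file config errors.
        EngineError::ObjectStore(StoreError::Generic(_)) | EngineError::Execution(_) => None,
    }
}
```

Because the `match` is exhaustive, adding a variant to one of the stand-in enums makes the compiler flag `classify` until the new case is handled, which is the compile-time signal that message matching lacks.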

How are these changes tested?

Coverage across all three layers of the new design:

native classifier (errors.rs unit tests, 8): each DataFusionError variant — ParquetError, ArrowError(ParquetError), IoError, Context/Shared unwrapping — classifies as a file-read error; an object_store::Error::NotFound, both direct and wrapped in ParquetError::External, classifies as FileNotFound; a corrupt ParquetError stays CannotReadFile; Execution/Internal (non-file) do not classify.
JVM decode (SparkErrorConverterSuite, 3): CannotReadFile → FAILED_READ_FILE naming the file; empty native path falls back to the per-task file list; a native path is preferred over the fallback.
end-to-end (CometExecSuite, 2):
- corrupt file: writes a valid parquet file, corrupts column/page data while leaving the footer intact (so Spark's JVM footer pre-check passes and the failure surfaces from the native reader during execution), reads it through Comet, and asserts the cause chain contains a cannotReadFilesError (FAILED_READ_FILE) naming the file. The assertion is spark-version-stable — it checks the message ("Encountered error while reading file ..."), which is present on both Spark 3.5 and 4.x; only 4.x prepends the [FAILED_READ_FILE.NO_HINT] error-class tag.
- missing file: reads a parquet file deleted between planning and execution and asserts the cause chain contains readCurrentFileNotFoundError ("It is possible the underlying files have been updated."). Fails on main (a missing file was wrapped as cannotReadFilesError), passes here.
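The corrupt-file setup above (damage the data region, leave the tail of the file intact) can be illustrated with a small std-only helper. This is only a sketch of the byte-level idea: the actual test lives in CometExecSuite and works on a real Parquet file, while `corrupt_span` and its offsets are hypothetical and know nothing about the Parquet layout.

```rust
use std::fs::OpenOptions;
use std::io::{self, Seek, SeekFrom, Write};
use std::path::Path;

/// Overwrite `len` bytes starting at `offset` with 0xFF, leaving every other
/// byte of the file (including whatever sits at the end) untouched.
fn corrupt_span(path: &Path, offset: u64, len: usize) -> io::Result<()> {
    let mut file = OpenOptions::new().write(true).open(path)?;
    let size = file.metadata()?.len();
    // Refuse to write past the end, which would grow the file instead of corrupting it.
    if offset.checked_add(len as u64).map_or(true, |end| end > size) {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "span extends past end of file",
        ));
    }
    file.seek(SeekFrom::Start(offset))?;
    file.write_all(&vec![0xFFu8; len])?;
    file.flush()
}
```

Choosing a span that ends before the footer is what lets a footer pre-check pass while the read of the damaged pages fails later, during execution.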

Verified locally on both Spark 4.x and Spark 3.5 / Delta 3.3.2; the missing-file classification fixes Delta's DeltaVacuumSuite "vacuum for cdc" suites (which read _change_data files removed by VACUUM).

When Comet's native DataFusion scan hits a corrupt footer, corrupt page/column data, a truncated/empty file, or a deleted file, it rethrew the raw native message instead of Spark's FAILED_READ_FILE. The native path does not go through Spark's FileScanRDD, so the offending path was usually missing too.

Classify these failures by TYPED DataFusionError variant in the native error path (ParquetError / ObjectStore / ArrowError-wrapping-ParquetError / IoError, unwrapping Context/Shared) rather than by matching error-message prose -- the strings come from three upstream crates (DataFusion, arrow-rs, object_store) and drift across version bumps with no compile-time signal. The match arms are checked by the compiler.

- native: new SparkError::CannotReadFile { file_path, message } variant; a typed try_classify_file_read_error in the JNI bridge converts a file-read DataFusionError into it, replacing the previous "not found"/"No such file" string match. file_path is taken from object_store::Error::NotFound when available. Deliberately does NOT match object_store Generic errors (also used for non-file config errors that must surface as-is).
- JVM: the structured error crosses JNI as the existing CometQueryExecutionException JSON payload; SparkErrorConverter decodes "CannotReadFile" and, when the native error carried no path, fills it from the per-task file list threaded from CometNativeScanExec via CometExecRDD. The shims wrap it via QueryExecutionErrors.cannotReadFilesError. No JVM-side message matching.

Closes apache#4529

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

andygrove · 2026-05-31T13:54:02Z

There are some test failures:

SparkErrorConverterSuite:
- CannotReadFile converts to a FAILED_READ_FILE SparkException naming the file *** FAILED *** (3 milliseconds)
  "Encountered error while reading file file:/tmp/data/part-0.parquet. Details:" did not contain "FAILED_READ_FILE" (SparkErrorConverterSuite.scala:37)
- CannotReadFile with empty native path falls back to the per-task file list *** FAILED *** (1 millisecond)
  "Encountered error while reading file file:/tmp/data/part-7.parquet. Details:" did not contain "FAILED_READ_FILE" (SparkErrorConverterSuite.scala:50)

andygrove · 2026-05-31T14:05:05Z

Nice work on this. Classifying file-read failures by typed DataFusionError variant rather than matching message prose is the right call, since those messages come from three upstream crates and drift silently across version bumps. Deliberately leaving Execution and object_store::Generic unmatched so config errors still surface as-is is a thoughtful touch too.

The CI failures sort into three root causes, and the encouraging part is that the feature itself works. The Spark 4.0 PR Build passes the end-to-end test, so the wrapping is sound. The rest is reachability and message-matching.

1. The spark-3.4 shim was not updated. spark-3.4/.../ShimSparkErrorConverter.scala is a separate full implementation, and it does not have the CannotReadFile case; only 3.5 and 4.x got it. So on Spark 3.4 the error is never wrapped, and the new CometExecSuite test fails with "messages.exists(_.contains("FAILED_READ_FILE")) was false". Could the case be added here too, adjusting for any 3.4 differences in cannotReadFilesError?

2. The tests assert on a version-dependent message substring. On 3.5 the error actually is wrapped correctly: the stack trace shows cannotReadFilesError being reached through the new shim case. The test still fails because it asserts getMessage.contains("FAILED_READ_FILE"), and on 3.5 the rendered message is "Encountered error while reading file ... Details:" with no FAILED_READ_FILE literal. Spark 4.0 prepends [FAILED_READ_FILE] to getMessage, which is why only 3.4 and 3.5 fail. This hits two SparkErrorConverterSuite tests and the CometExecSuite test. Could these assert on the error class instead, via SparkThrowable.getErrorClass on 3.x and getCondition on 4.0 behind the shim? The third decode test passes precisely because it checks the path rather than the literal, which is a good hint about the robust assertion style.

3. The corrupt/missing-file message needs to match what Spark's tests expect. The Spark SQL Test jobs fail on ParquetQuerySuite "Enabling/disabling ignoreCorruptFiles" and "ignoreMissingFiles using parquet" because Spark asserts that the cause contains "is not a Parquet file", while Comet's native reader reports "Invalid Parquet file. Corrupt footer". For these to pass, the surfaced message has to carry Spark's wording rather than DataFusion's. It would be worth deciding where to normalize that. The classifier already has the typed variant in hand, so it could map a corrupt-footer ParquetError to a message Spark's tests expect, and similarly for the truncated/too-short file case that drives the missing-files test.

There is also a behavior question hiding in those test names. They exercise spark.sql.files.ignoreCorruptFiles and ignoreMissingFiles, which tell Spark to skip the bad file rather than fail the query. Does Comet's native scan honor those configs, or does it now always throw FAILED_READ_FILE? If it does not honor them, enabling the config would no longer skip corrupt or missing files under Comet, which would be a behavior gap on top of the wording mismatch.

Smaller thing on the native side. In try_classify_file_read_error, DataFusionError::IoError(_) is classified as a file read unconditionally. Parquet and ObjectStore are clearly file reads, but IoError can also come from non-scan paths like spill or shuffle. Is there a risk this mislabels a non-scan IO error as FAILED_READ_FILE and attaches a per-task path to it? Scoping that arm a bit more tightly might be safer. The guard arm also calls try_classify_file_read_error(source) twice (.is_some() then .unwrap()), which an if let Some(...) would tidy up.
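The double evaluation mentioned above, and the `if let Some(...)` tidy-up, look roughly like this in isolation. The names and the integer input are hypothetical placeholders; only the pattern (`.is_some()` guard followed by `.unwrap()`, versus a single binding) reflects the review comment.

```rust
/// Hypothetical classifier: returns a label only for inputs it recognises.
fn try_classify(code: i32) -> Option<&'static str> {
    match code {
        2 => Some("FileNotFound"),
        5 => Some("CannotReadFile"),
        _ => None,
    }
}

/// Before: the classifier runs twice, once in the guard and once for the value.
fn describe_twice(code: i32) -> String {
    match code {
        c if try_classify(c).is_some() => format!("file error: {}", try_classify(c).unwrap()),
        c => format!("generic error: {c}"),
    }
}

/// After: a single call whose result is bound by `if let`.
fn describe_once(code: i32) -> String {
    if let Some(kind) = try_classify(code) {
        format!("file error: {kind}")
    } else {
        format!("generic error: {code}")
    }
}
```

Both functions return the same strings; the second avoids the repeated work and the `unwrap()` that depends on the guard staying in sync with it.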

Disclosure: I used Claude Code to help review this PR, including pulling the failing CI logs and tracing each failure to its root cause.

…Error), not CannotReadFile

A genuinely-missing file (object_store NotFound) is distinct from a corrupt/truncated one: Spark surfaces it via `readCurrentFileNotFoundError` ("It is possible the underlying files have been updated."), not `cannotReadFilesError` (FAILED_READ_FILE). `try_classify_file_read_error` mapped every per-file read failure -- including NotFound -- to `SparkError::CannotReadFile`, so a file removed between planning and execution produced the wrong Spark error.

Classify object_store NotFound as `SparkError::FileNotFound` instead. The NotFound may arrive directly (`DataFusionError::ObjectStore`) or wrapped by the parquet reader as `ParquetError::External(..)` / `ArrowError::ParquetError`, so a `source_chain_has_object_store_not_found` helper walks the typed source chain (never message text). Corrupt/truncated reads stay CannotReadFile -> FAILED_READ_FILE. The JVM shim already maps the `FileNotFound` errorType to `readCurrentFileNotFoundError`, so no shim change is needed.

Surfaced by Delta's CDC-after-VACUUM read: `DeltaVacuumSuite` "vacuum for cdc - update/merge" and "... - delete tombstones" vacuum the `_change_data` files and assert the subsequent read throws `readCurrentFileNotFoundError`; with the native scan these failed because Comet returned the cannotReadFilesError message. Both pass with this fix (verified locally).

Tests:
- Rust unit tests for the classifier: object_store NotFound (direct and ParquetError::External-wrapped) -> FileNotFound; corrupt ParquetError stays CannotReadFile.
- Spark `CometExecSuite` "native parquet read of a missing file surfaces readCurrentFileNotFoundError" (red before, green after): reads a file deleted between planning and execution.
- Made the existing FAILED_READ_FILE corrupt-file assertion spark-version-stable (assert "Encountered error while reading file" -- present on both 3.5 and 4.x; only 4.x prepends the [FAILED_READ_FILE.NO_HINT] class tag), so the test passes under -Pspark-3.5 as well.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
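For readers unfamiliar with walking a typed source chain, the idea behind a helper like `source_chain_has_object_store_not_found` can be sketched with the standard library alone. The types below are hypothetical stand-ins (`NotFound` for `object_store::Error::NotFound`, `External` for a reader error that wraps its cause); the real helper's signature and matching logic are not shown in this thread.

```rust
use std::error::Error;
use std::fmt;

/// Stand-in for a "missing object" error (hypothetical, std-only).
#[derive(Debug)]
struct NotFound {
    path: String,
}

impl fmt::Display for NotFound {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "object at {} not found", self.path)
    }
}

impl Error for NotFound {}

/// Stand-in for a reader error that wraps an arbitrary cause.
#[derive(Debug)]
struct External {
    source: Box<dyn Error + 'static>,
}

impl fmt::Display for External {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "external error")
    }
}

impl Error for External {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&*self.source)
    }
}

/// Walk `err` and its `source()` chain, looking for a `NotFound` by type.
/// The message text is never inspected.
fn chain_has_not_found(err: &(dyn Error + 'static)) -> bool {
    let mut current = Some(err);
    while let Some(e) = current {
        if e.downcast_ref::<NotFound>().is_some() {
            return true;
        }
        current = e.source();
    }
    false
}
```

An error whose message merely says "not found" but whose type is something else is not matched, which is the property the commit message calls out.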

schenksj · 2026-05-31T14:27:58Z

spark 3.4 is still a thing?! (just kidding). Working on it

@andygrove

…ons, IoError scoping

Addresses @andygrove's review on apache#4536:
- spark-3.4 shim: add the `CannotReadFile` case (it only existed in the 3.5 and 4.x shims), so a corrupt/truncated read is wrapped via `cannotReadFilesError` (FAILED_READ_FILE) on Spark 3.4 too. (The `FileNotFound` case was already present on 3.4.)
- SparkErrorConverterSuite: assert on the version-stable message ("Encountered error while reading file ...") instead of the `FAILED_READ_FILE` literal, which only Spark 4.x prepends to getMessage as the error-class tag (3.4/3.5 render only the message). Fixes the two failing tests on 3.4/3.5; same version-stable style already applied to the CometExecSuite e2e test.
- native classifier: stop treating a bare `DataFusionError::IoError` as a file read. Scans surface read failures as a typed ParquetError/ObjectStore error; a bare IoError can also come from non-scan paths (spill, shuffle), which must not be relabelled FAILED_READ_FILE with a per-task path attached. Test updated accordingly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@andygrove

…ow path

Follow-up to @andygrove's review on apache#4536:
- (point 3, wording) parquet-rs reports a bad magic / unreadable footer as "Invalid Parquet file. Corrupt footer", whereas Spark's reader -- and Spark's `ParquetQuerySuite` ("ignoreCorruptFiles", "ignoreMissingFiles using parquet") -- phrase it as "<file> is not a Parquet file". `cannot_read_file_message` now appends Spark's phrasing for the magic/footer case so the FAILED_READ_FILE cause carries it. The outer `cannotReadFilesError` wrapper ("Encountered error while reading file …") is unchanged, so this composes with Spark's tests and does not disturb the Delta shims that match Comet's outer message. Other read failures keep their native message. (On behavior: the native scan already declines and falls back to Spark when `spark.sql.files.ignoreCorruptFiles`/`ignoreMissingFiles` is enabled -- CometNativeScan.scala -- so the skip semantics are preserved; no behavior gap.)
- (point 5, tidy) `try_classify_file_read_error` is no longer evaluated twice (`.is_some()` guard + `.unwrap()`): the DataFusion arm is a single `if let Some(..)`, and the generic fallback is extracted to `throw_generic_exception`.

Tests: classifier unit tests for the magic/footer wording (added) vs other parquet errors (unchanged native message).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

schenksj · 2026-05-31T17:17:48Z

Thanks for the thorough review, @andygrove — and for pulling the CI logs and tracing each failure. All points addressed:

1. spark-3.4 shim. Added the missing CannotReadFile case (3.4's cannotReadFilesError(Throwable, String) matches 3.5; the FileNotFound case was already present on 3.4).

2. Version-dependent message assertions. The two SparkErrorConverterSuite tests and the CometExecSuite e2e test now assert on the version-stable message — "Encountered error while reading file …" — instead of the FAILED_READ_FILE literal, which only Spark 4.x prepends to getMessage. (Went with the stable substring over a getErrorClass/getCondition test shim; glad to switch if you'd rather assert the error class.)

3. ignoreCorruptFiles / ignoreMissingFiles. Two parts:

- Behavior: no gap. CometNativeScan already declines the native scan and falls back to Spark when either config is enabled (CometNativeScan.scala:75-87), so Spark's skip semantics are preserved.
- Wording: cannot_read_file_message now appends Spark's phrasing ("is not a Parquet file") for parquet-rs's magic/footer error ("Invalid Parquet file. Corrupt footer"), so ParquetQuerySuite's assertion matches. The outer cannotReadFilesError wrapper is unchanged, so it composes with Spark's tests and doesn't disturb the existing message expectations.
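As a rough illustration of the wording change, the sketch below appends Spark's phrasing only for the magic/footer case and passes every other message through unchanged. How the real `cannot_read_file_message` recognises that case, and the exact text it composes, are not shown in this thread; `ReadFailure`, the function name, and the output format here are all assumptions made for the example.

```rust
/// Hypothetical, simplified view of a parquet read failure.
enum ReadFailure {
    BadMagicOrFooter,
    Other,
}

/// Sketch only: keep the native message and, for the magic/footer case,
/// append Spark's "<file> is not a Parquet file" phrasing.
fn cannot_read_file_message_sketch(
    kind: &ReadFailure,
    native_message: &str,
    file_path: &str,
) -> String {
    match kind {
        ReadFailure::BadMagicOrFooter => {
            format!("{native_message} ({file_path} is not a Parquet file)")
        }
        // Other read failures keep their native message untouched.
        ReadFailure::Other => native_message.to_string(),
    }
}
```

Appending rather than replacing keeps the native detail available for debugging while still satisfying an assertion that looks for Spark's wording.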

4. IoError scoping. Dropped the unconditional IoError arm — scans surface read failures as typed ParquetError/ObjectStore, so a bare IoError (spill/shuffle) is no longer relabelled FAILED_READ_FILE.

5. Double try_classify call. Tidied — the DataFusion arm is a single if let Some(..), with the generic fallback extracted to throw_generic_exception.

Also (a separate case surfaced by the same area): a missing file — an object_store NotFound, including when wrapped by the parquet reader as ParquetError::External — now classifies as FileNotFound → readCurrentFileNotFoundError, matching what Spark does for a file deleted between planning and execution. This fixes Delta's DeltaVacuumSuite "vacuum for cdc" suites.

Added Rust classifier unit tests for the NotFound / wrapped-NotFound, corrupt-vs-missing, IoError, and magic-footer-wording cases. Verified locally on Spark 4.x and Spark 3.5 / Delta 3.3.2 (the DeltaVacuumSuite "vacuum for cdc" tests now pass).

schenksj and others added 2 commits May 31, 2026 10:35

schenksj wants to merge 4 commits into apache:main from schenksj:fix/failed-read-file-wrapping
