feat: surface native parquet read failures as FAILED_READ_FILE #4536
schenksj wants to merge 4 commits into
When Comet's native DataFusion scan hits a corrupt footer, corrupt page/column
data, a truncated/empty file, or a deleted file, it rethrew the raw native
message instead of Spark's FAILED_READ_FILE. The native path does not go through
Spark's FileScanRDD, so the offending path was usually missing too.
Classify these failures by TYPED DataFusionError variant in the native error
path (ParquetError / ObjectStore / ArrowError-wrapping-ParquetError / IoError,
unwrapping Context/Shared) rather than by matching error-message prose -- the
strings come from three upstream crates (DataFusion, arrow-rs, object_store) and
drift across version bumps with no compile-time signal. The match arms are
checked by the compiler.
- native: new SparkError::CannotReadFile { file_path, message } variant; a typed
try_classify_file_read_error in the JNI bridge converts a file-read
DataFusionError into it, replacing the previous "not found"/"No such file"
string match. file_path is taken from object_store::Error::NotFound when
available. Deliberately does NOT match object_store Generic errors (also used
for non-file config errors that must surface as-is).
- JVM: the structured error crosses JNI as the existing CometQueryExecutionException
JSON payload; SparkErrorConverter decodes "CannotReadFile" and, when the native
error carried no path, fills it from the per-task file list threaded from
CometNativeScanExec via CometExecRDD. The shims wrap it via
QueryExecutionErrors.cannotReadFilesError. No JVM-side message matching.
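The typed classification described in the bullets above could look roughly like the following. This is a minimal sketch under stated assumptions, not Comet's code: `EngineError`, `StoreError`, and `classify` are hypothetical stand-ins for `DataFusionError`, `object_store::Error`, and `try_classify_file_read_error`, and the real enums carry many more variants.

```rust
use std::sync::Arc;

// Hypothetical stand-in for object_store::Error (only two variants shown).
#[derive(Debug)]
enum StoreError {
    NotFound { path: String },
    Generic { message: String },
}

// Hypothetical stand-in for DataFusionError (only the variants discussed here).
#[derive(Debug)]
enum EngineError {
    Parquet(String),
    ObjectStore(StoreError),
    Io(std::io::Error),
    Execution(String),
    Context(String, Box<EngineError>),
    Shared(Arc<EngineError>),
}

// Mirrors the shape of SparkError::CannotReadFile { file_path, message }.
#[derive(Debug, PartialEq)]
struct CannotReadFile {
    file_path: String,
    message: String,
}

// Classify by variant only; no error-message text is inspected.
fn classify(err: &EngineError) -> Option<CannotReadFile> {
    match err {
        // Unwrap the wrapper variants and classify what is inside.
        EngineError::Context(_, inner) => classify(inner),
        EngineError::Shared(inner) => classify(inner),
        // File-read failures: the path is unknown at this layer, so it is
        // left empty for a later stage to fill in.
        EngineError::Parquet(msg) => Some(CannotReadFile {
            file_path: String::new(),
            message: msg.clone(),
        }),
        EngineError::Io(e) => Some(CannotReadFile {
            file_path: String::new(),
            message: e.to_string(),
        }),
        // NotFound carries the path, so it can be reported directly.
        EngineError::ObjectStore(StoreError::NotFound { path }) => Some(CannotReadFile {
            file_path: path.clone(),
            message: format!("not found: {}", path),
        }),
        // Generic store errors and execution errors may be configuration
        // problems; they are deliberately left unclassified.
        EngineError::ObjectStore(StoreError::Generic { .. }) | EngineError::Execution(_) => None,
    }
}
```

Because the `match` has no wildcard arm, adding a variant to the stand-in enum fails compilation until it is handled. That is the compile-time signal string matching lacks.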
Closes apache#4529
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There are some test failures:
Nice work on this. Classifying file-read failures by typed [...]
The CI failures sort into three root causes, and the encouraging part is that the feature itself works. The Spark 4.0 PR Build passes the end-to-end test, so the wrapping is sound. The rest is reachability and message-matching.
1. The spark-3.4 shim was not updated.
2. The tests assert on a version-dependent message substring. On 3.5 the error actually is wrapped correctly. The stack trace shows [...]
3. The corrupt/missing-file message needs to match what Spark's tests expect. The Spark SQL Test jobs fail on [...]
There is also a behavior question hiding in those test names. They exercise [...]
Smaller thing on the native side. In [...]
Disclosure: I used Claude Code to help review this PR, including pulling the failing CI logs and tracing each failure to its root cause.
…Error), not CannotReadFile
A genuinely-missing file (object_store NotFound) is distinct from a corrupt/truncated one:
Spark surfaces it via `readCurrentFileNotFoundError` ("It is possible the underlying files
have been updated."), not `cannotReadFilesError` (FAILED_READ_FILE). `try_classify_file_read_error`
mapped every per-file read failure -- including NotFound -- to `SparkError::CannotReadFile`, so a
file removed between planning and execution produced the wrong Spark error.
Classify object_store NotFound as `SparkError::FileNotFound` instead. The NotFound may arrive
directly (`DataFusionError::ObjectStore`) or wrapped by the parquet reader as
`ParquetError::External(..)` / `ArrowError::ParquetError`, so a `source_chain_has_object_store_not_found`
helper walks the typed source chain (never message text). Corrupt/truncated reads stay
CannotReadFile -> FAILED_READ_FILE. The JVM shim already maps the `FileNotFound` errorType to
`readCurrentFileNotFoundError`, so no shim change is needed.
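Walking a typed source chain, as the helper above is described as doing, can be sketched with nothing but `std::error::Error`. The types below are hypothetical stand-ins (`NotFound` for `object_store::Error::NotFound`, `External` for a wrapper such as `ParquetError::External(..)`), and `chain_has_not_found` is an illustrative name, not the actual helper.

```rust
use std::error::Error;
use std::fmt;

// Hypothetical stand-in for object_store::Error::NotFound.
#[derive(Debug)]
struct NotFound {
    path: String,
}

impl fmt::Display for NotFound {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "object not found: {}", self.path)
    }
}

impl Error for NotFound {}

// Hypothetical wrapper that exposes its inner error through `source()`.
#[derive(Debug)]
struct External(Box<dyn Error>);

impl fmt::Display for External {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "external error: {}", self.0)
    }
}

impl Error for External {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&*self.0)
    }
}

// Follow `source()` links and test each error's concrete type.
// No message text is inspected at any point.
fn chain_has_not_found(err: &(dyn Error + 'static)) -> bool {
    let mut current = Some(err);
    while let Some(e) = current {
        if e.downcast_ref::<NotFound>().is_some() {
            return true;
        }
        current = e.source();
    }
    false
}
```

The check succeeds however many wrappers sit between the caller and the `NotFound`, provided each wrapper reports its inner error via `source()`.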
Surfaced by Delta's CDC-after-VACUUM read: `DeltaVacuumSuite` "vacuum for cdc - update/merge" and
"... - delete tombstones" vacuum the `_change_data` files and assert the subsequent read throws
`readCurrentFileNotFoundError`; with the native scan these failed because Comet returned the
cannotReadFilesError message. Both pass with this fix (verified locally).
Tests:
- Rust unit tests for the classifier: object_store NotFound (direct and ParquetError::External-wrapped)
-> FileNotFound; corrupt ParquetError stays CannotReadFile.
- Spark `CometExecSuite` "native parquet read of a missing file surfaces readCurrentFileNotFoundError"
(red before, green after): reads a file deleted between planning and execution.
- Made the existing FAILED_READ_FILE corrupt-file assertion spark-version-stable (assert
"Encountered error while reading file" -- present on both 3.5 and 4.x; only 4.x prepends the
[FAILED_READ_FILE.NO_HINT] class tag), so the test passes under -Pspark-3.5 as well.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
spark 3.4 is still a thing?! (just kidding). Working on it
…ons, IoError scoping
Addresses @andygrove's review on apache#4536:
- spark-3.4 shim: add the `CannotReadFile` case (it only existed in the 3.5 and 4.x shims), so a corrupt/truncated read is wrapped via `cannotReadFilesError` (FAILED_READ_FILE) on Spark 3.4 too. (The `FileNotFound` case was already present on 3.4.)
- SparkErrorConverterSuite: assert on the version-stable message ("Encountered error while reading file ...") instead of the `FAILED_READ_FILE` literal, which only Spark 4.x prepends to getMessage as the error-class tag (3.4/3.5 render only the message). Fixes the two failing tests on 3.4/3.5; same version-stable style already applied to the CometExecSuite e2e test.
- native classifier: stop treating a bare `DataFusionError::IoError` as a file read. Scans surface read failures as a typed ParquetError/ObjectStore error; a bare IoError can also come from non-scan paths (spill, shuffle), which must not be relabelled FAILED_READ_FILE with a per-task path attached. Test updated accordingly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ow path
Follow-up to @andygrove's review on apache#4536:
- (point 3, wording) parquet-rs reports a bad magic / unreadable footer as "Invalid Parquet file. Corrupt footer", whereas Spark's reader -- and Spark's `ParquetQuerySuite` ("ignoreCorruptFiles", "ignoreMissingFiles using parquet") -- phrase it as "<file> is not a Parquet file". `cannot_read_file_message` now appends Spark's phrasing for the magic/footer case so the FAILED_READ_FILE cause carries it. The outer `cannotReadFilesError` wrapper ("Encountered error while reading file …") is unchanged, so this composes with Spark's tests and does not disturb the Delta shims that match Comet's outer message. Other read failures keep their native message. (On behavior: the native scan already declines and falls back to Spark when `spark.sql.files.ignoreCorruptFiles`/`ignoreMissingFiles` is enabled -- CometNativeScan.scala -- so the skip semantics are preserved; no behavior gap.)
- (point 5, tidy) `try_classify_file_read_error` is no longer evaluated twice (`.is_some()` guard + `.unwrap()`): the DataFusion arm is a single `if let Some(..)`, and the generic fallback is extracted to `throw_generic_exception`.
Tests: classifier unit tests for the magic/footer wording (added) vs other parquet errors (unchanged native message).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
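The point-5 tidy (one `if let Some(..)` instead of an `.is_some()` guard followed by `.unwrap()`) can be illustrated as below. This is a sketch with hypothetical names: `Classifier`, `NativeError`, `Thrown`, and `throw_generic` only stand in for the JNI bridge code, and the call counter exists purely to make the double evaluation visible.

```rust
use std::cell::Cell;

// Hypothetical native error with one file-read variant and one other variant.
#[derive(Debug)]
enum NativeError {
    Parquet(String),
    Other(String),
}

// What the bridge would raise; returned as a value here to keep the sketch testable.
#[derive(Debug, PartialEq)]
enum Thrown {
    CannotReadFile(String),
    Generic(String),
}

// Counts how many times classification runs.
struct Classifier {
    calls: Cell<u32>,
}

impl Classifier {
    fn new() -> Self {
        Classifier { calls: Cell::new(0) }
    }

    fn try_classify(&self, err: &NativeError) -> Option<String> {
        self.calls.set(self.calls.get() + 1);
        match err {
            NativeError::Parquet(msg) => Some(msg.clone()),
            NativeError::Other(_) => None,
        }
    }
}

// Generic fallback extracted into its own function.
fn throw_generic(err: &NativeError) -> Thrown {
    match err {
        NativeError::Parquet(msg) | NativeError::Other(msg) => Thrown::Generic(msg.clone()),
    }
}

// Before: the guard and the unwrap each run the classifier.
fn before(c: &Classifier, err: &NativeError) -> Thrown {
    if c.try_classify(err).is_some() {
        Thrown::CannotReadFile(c.try_classify(err).unwrap())
    } else {
        throw_generic(err)
    }
}

// After: a single evaluation, bound directly by `if let`.
fn after(c: &Classifier, err: &NativeError) -> Thrown {
    if let Some(message) = c.try_classify(err) {
        Thrown::CannotReadFile(message)
    } else {
        throw_generic(err)
    }
}
```

Both functions return the same value for the same input. The `if let` form classifies once, and it has no `unwrap()` that could panic if the two calls ever disagreed.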
Thanks for the thorough review, @andygrove, and for pulling the CI logs and tracing each failure. All points addressed:
1. spark-3.4 shim. Added the missing [...]
2. Version-dependent message assertions. The two [...]
3. ignoreCorruptFiles / ignoreMissingFiles. Two parts: [...]
4. [...]
5. Double [...]
Also (a separate case surfaced by the same area): a missing file, an [...]
Added Rust classifier unit tests for the NotFound / wrapped-NotFound, corrupt-vs-missing, IoError, and magic-footer-wording cases.
Verified locally on Spark 4.x and Spark 3.5 / Delta 3.3.2 (the [...]
Which issue does this PR close?
Closes #4529.
(Supersedes #4534, which was closed; this PR carries the typed-classification design described below.)
Rationale for this change
When Comet's native DataFusion scan hits a corrupt footer, corrupt page/column data, or a truncated/empty file, it rethrew the raw native message instead of Spark's `FAILED_READ_FILE`, and the offending path was usually missing.
A genuinely-missing file (one deleted/vacuumed between planning and execution) is a different case: Spark does not report it as `FAILED_READ_FILE`, but via `readCurrentFileNotFoundError` ("It is possible the underlying files have been updated. You can explicitly invalidate the cache ... by recreating the Dataset/DataFrame involved."). Comet must distinguish the two so the right Spark error and message are produced.
This is a standalone error-compatibility improvement for all native Parquet scans. It was surfaced while working on the Delta Lake contrib integration (Delta's corrupt-file / broken-checkpoint suites assert the `FAILED_READ_FILE` message and path; Delta's CDC-after-`VACUUM` suites assert the `readCurrentFileNotFoundError` message).
What changes are included in this PR?
File-read failures are classified by typed `DataFusionError` variant in the native error path, not by matching error-message prose. The messages come from three upstream crates (DataFusion, arrow-rs, object_store) and drift across version bumps with no compile-time signal; the typed match arms are checked by the compiler.
- `SparkError::CannotReadFile { file_path, message }` for a readable-but-broken file (corrupt footer/page, truncated, generic IO). `try_classify_file_read_error` in the JNI bridge maps the relevant file-read `DataFusionError` variants (`ParquetError`, `ObjectStore`, `ArrowError` wrapping a `ParquetError`, `IoError`; unwrapping `Context`/`Shared`) into it.
- `SparkError::FileNotFound { message }` for a missing file, that is, an `object_store::Error::NotFound`. The NotFound may arrive directly (`DataFusionError::ObjectStore`) or wrapped by the parquet reader as `ParquetError::External(..)` / `ArrowError::ParquetError`, so a `source_chain_has_object_store_not_found` helper walks the typed source chain (still no message matching).
- Deliberately does not match `object_store::Error::Generic` / `DataFusionError::Execution`, which also carry non-file config errors (e.g. "Hdfs support is not enabled in this build") that must surface as-is.
- JVM: the structured error crosses JNI as the existing `CometQueryExecutionException` JSON payload; `SparkErrorConverter` decodes `"CannotReadFile"` (filling the path from the per-task file list when the native error carried none) and `"FileNotFound"`. The spark-3.5 and spark-4.x shims wrap `CannotReadFile` via `QueryExecutionErrors.cannotReadFilesError` (`FAILED_READ_FILE`) and `FileNotFound` via `QueryExecutionErrors.readCurrentFileNotFoundError`. No JVM-side message matching.
How are these changes tested?
Coverage across all three layers of the new design:
- `errors.rs` unit tests (8): each `DataFusionError` variant (`ParquetError`, `ArrowError(ParquetError)`, `IoError`, `Context`/`Shared` unwrapping) classifies as a file-read error; an `object_store::Error::NotFound`, both direct and wrapped in `ParquetError::External`, classifies as FileNotFound; a corrupt `ParquetError` stays CannotReadFile; `Execution`/`Internal` (non-file) do not classify.
- `SparkErrorConverterSuite` (3): `CannotReadFile` → `FAILED_READ_FILE` naming the file; empty native path falls back to the per-task file list; a native path is preferred over the fallback.
- `CometExecSuite` (2):
  - A corrupt file surfaces `cannotReadFilesError` (`FAILED_READ_FILE`) naming the file. The assertion is spark-version-stable: it checks the message ("Encountered error while reading file ..."), which is present on both Spark 3.5 and 4.x; only 4.x prepends the `[FAILED_READ_FILE.NO_HINT]` error-class tag.
  - A missing file surfaces `readCurrentFileNotFoundError` ("It is possible the underlying files have been updated."). Fails on `main` (a missing file was wrapped as `cannotReadFilesError`), passes here.
Verified locally on both Spark 4.x and Spark 3.5 / Delta 3.3.2; the missing-file classification fixes Delta's `DeltaVacuumSuite` "vacuum for cdc" tests (which read `_change_data` files removed by `VACUUM`).