fix(parquet): return error instead of panicking on first-write failure#824
fix(parquet): return error instead of panicking on first-write failure#824zeroshade wants to merge 1 commit into
Conversation
apacheGH-820) (*file.Writer).startFile() previously panicked with "failed to write magic number" when the underlying sink's first Write returned an error or short-wrote. Because NewParquetWriter has no error return, callers on flaky network sinks (e.g. cloud-storage upload writers) had no idiomatic recovery path, and the higher-level pqarrow.NewFileWriter propagated the panic unchanged through its (*FileWriter, error) return. Add a sibling constructor file.NewParquetWriterWithError that returns (*Writer, error). The legacy file.NewParquetWriter is now a thin wrapper that panics with the original message string for back-compat. Switch pqarrow.NewFileWriter to the error-returning constructor so its error return finally covers this failure mode without any change to its own public signature. Closes apache#820.
Uhh, so @zeroshade do you (or your agent) want to file that issue? 😅 |
|
Filed #826 🫤 |
| // cloud-storage upload writer or other network-attached writer) should | ||
| // prefer [NewParquetWriterWithError], which surfaces the failure as a | ||
| // returned error instead. | ||
| func NewParquetWriter(w io.Writer, sc *schema.GroupNode, opts ...WriteOption) *Writer { |
There was a problem hiding this comment.
Should we deprecate this as well?
There was a problem hiding this comment.
For someone coming to the codebase without being a maintainer, these doc comments are too focused on historical knowledge. I think NewParquetWriterWithError should be documented without reference to NewParquetWriter and simply mention that it may fail if it cannot write the header (e.g. because of a flaky writer), and NewParquetWriter can document the history if needed.
Rationale for this change
Closes #820.
(*file.Writer).startFile()panics with the literal string"failed to write magic number"when the underlying sink's firstWritecall returns an error or short-writes. Becausefile.NewParquetWritercallsstartFilesynchronously inside the constructor and has noerrorreturn, callers had no idiomatic Go recovery path. The higher-levelpqarrow.NewFileWriteralready returns(*FileWriter, error), but the panic propagated through it unchanged, so itserrorreturn was never reached for this failure mode.In production this manifests when the sink is a network-attached writer (
*storage.Writerfrom GCS, an S3 multipart upload, an HDFS client, etc.). A transient network blip on the first 4-byte magic-header write reliably crashes the worker process, orphans any in-flight upload session, and forces a container restart.What changes are included in this PR?
(*Writer).startFile()now returnserrorinstead of panicking on sink-write failures. The configuration-validation panic for "encrypted column not found in file schema" is preserved — that path is a programmer error, not an I/O failure.file.NewParquetWriterWithError(w, sc, opts...) (*Writer, error)is the preferred entry point for callers whose sink may transiently fail.file.NewParquetWriteris now a thin wrapper overNewParquetWriterWithErrorthat panics with the exact same string ("failed to write magic number") on init failure. Any consumer with arecover()block that string-matches the panic value continues to work unchanged.pqarrow.NewFileWriterswitches toNewParquetWriterWithErrorinternally. Its public signature is unchanged; itserrorreturn now meaningfully covers transient sink failures, with no API change for any of its callers.Are these changes tested?
Yes. Six new tests in
parquet/file/file_writer_test.goexercise the new behavior using aflakyMagicSinktest helper modeled on the existingerrCloseWriterpattern in the same file:TestNewParquetWriterWithError_Success— happy path on abytes.Buffer.TestNewParquetWriterWithError_FirstWriteFails— firstWritereturnsio.ErrUnexpectedEOF; constructor returns a wrapped error and does not panic.TestNewParquetWriterWithError_FirstWriteShortWrites— firstWritereturnsn=len(p)-1, nil; constructor returns an error mentioning the short write.TestNewParquetWriter_PreservesPanicMessage— the legacy constructor still panics, the panic value is still astring, and the message is still the literal"failed to write magic number"(back-compat guard).TestPqarrowNewFileWriter_PropagatesInitError—pqarrow.NewFileWriterreturns the wrapped sink error rather than panicking.TestPqarrowNewFileWriter_PropagatesShortWrite— same for the short-write case.The existing
TestCloseError, fullparquet/file/...andparquet/pqarrow/...test suites all pass (withPARQUET_TEST_DATApointed atparquet-testing/data).Are there any user-facing changes?
No breaking changes.
file.NewParquetWriter(w, sc, opts...) *Writer— signature unchanged, still panics on first-write failure with the identical literal string"failed to write magic number". Existingrecover()workarounds keep working.file.NewParquetWriterWithError(w, sc, opts...) (*Writer, error)— new exported symbol. Adding an exported function is non-breaking in Go.pqarrow.NewFileWriter(...) (*FileWriter, error)— signature unchanged. Itserrorreturn now also covers the magic-header-write failure that previously escaped as a panic. Callers who already checkerrget strictly better behavior.The new constructor's docstring is the recommended path for users on cloud-storage upload sinks. The legacy constructor's docstring now documents the panic behavior the original issue called out as undocumented.
Out of scope (separate follow-up)
While investigating, I noticed
WriteBatchSpacedinparquet/file/column_writer_types.gen.golacks thedefer recover()thatWriteBatchhas, so an I/O error during a spaced write panics out of the calling goroutine instead of returning an error. Same class of bug as #820 but a different code path; happy to file a separate issue.