
[spark] Refactor BatchWrite subclasses into base logic + per-version wrappers#7723

Merged
JingsongLi merged 1 commit into apache:master from kerwin-zk:spark-batchwrite-refactor
May 23, 2026
Conversation

@kerwin-zk
Contributor

@kerwin-zk kerwin-zk commented Apr 28, 2026

Purpose

Follow-up of #7648 (Spark 4.1 module) and a sibling of #7721. After the reverse-shim layout landed, two of the files under paimon-spark-4.0/src/main existed as shadows only because their compilation unit defined a Scala class that extends BatchWrite. Spark 4.1 added a default method BatchWrite.commit(WriterCommitMessage[], WriteSummary), and the WriteSummary parameter type does not exist on Spark 4.0. A class compiled against 4.1 that mixes in BatchWrite carries the inherited commit(.., WriteSummary) signature in its method table. The JVM's ObjectStreamClass.getPrivateMethod lazy-links that signature during Spark task serialization, which crashes Spark 4.0 with ClassNotFoundException: WriteSummary.

This PR refactors both affected classes into the same base + per-version wrapper pattern:

  • PaimonBatchWrite (used by V2 writes)
  • FormatTableBatchWrite (used by FormatTable V2 writes — was previously a private case class inside PaimonFormatTable.scala)

For each, the logic lives in a new abstract base class in paimon-spark-common that deliberately does not extend BatchWrite; its entry points are renamed protected helpers (commitMessages, abortMessages, createPaimonDataWriterFactory, createFormatTableDataWriterFactory). Each per-version module (paimon-spark3-common, paimon-spark4-common, paimon-spark-4.0/src/main) ships a thin wrapper that mixes in BatchWrite and forwards the four BatchWrite methods to the base helpers. Instantiation is routed through two new SparkShim factories, so that each Spark version's scalac compiles its own extends BatchWrite mixin.

The Spark 4.0 shadow of PaimonFormatTable.scala is no longer needed and is deleted; only the new thin FormatTableBatchWrite.scala wrapper remains under paimon-spark-4.0/src/main.
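
The base + per-version wrapper split described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: BatchWriteLike is a hypothetical stand-in for Spark's BatchWrite so the sketch compiles without Spark, messages are plain strings, and WriteBaseSketch and VersionedWriteSketch are invented names. Only the helper names commitMessages and abortMessages come from the description.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for Spark's BatchWrite interface (assumption, not the real API).
trait BatchWriteLike {
  def commit(messages: Array[String]): Unit
  def abort(messages: Array[String]): Unit
}

// Shared logic lives in a base class that deliberately does NOT extend the interface,
// so its compiled class never references version-specific interface signatures.
abstract class WriteBaseSketch(val tableName: String) {
  private val log = ArrayBuffer.empty[String]

  protected def commitMessages(messages: Array[String]): Unit =
    log += s"commit $tableName: ${messages.mkString(",")}"

  protected def abortMessages(messages: Array[String]): Unit =
    log += s"abort $tableName: ${messages.mkString(",")}"

  // Exposed only so the sketch can be inspected.
  def history: List[String] = log.toList
}

// Thin wrapper: the only class that mixes in the interface. In the PR's layout,
// one such copy would be compiled per Spark version module.
class VersionedWriteSketch(name: String)
    extends WriteBaseSketch(name)
    with BatchWriteLike {
  override def commit(messages: Array[String]): Unit = commitMessages(messages)
  override def abort(messages: Array[String]): Unit = abortMessages(messages)
}
```

The wrapper contains no logic of its own, which is what makes duplicating it per module cheap.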

Tests

CI

API and Format

No new public API. Two internal factories added to org.apache.spark.sql.paimon.shims.SparkShim:

  • createPaimonBatchWrite(table, writeSchema, dataSchema, overwritePartitions, copyOnWriteScan)
  • createFormatTableBatchWrite(table, overwriteDynamic, overwritePartitions, writeSchema)
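
As a rough sketch of how routing through a shim factory might look, assuming a much simplified shim with a single factory method: every name below (WriteHandle, ShimSketch, Spark3ShimSketch, Spark4ShimSketch, ShimLoaderSketch) is hypothetical, and the version-string lookup is an assumption for illustration only. How the real SparkShimLoader selects a shim is not shown in this PR.

```scala
// Hypothetical stand-in for the object a factory returns.
trait WriteHandle {
  def compiledFor: String
}

// Hypothetical shim interface with one factory method.
trait ShimSketch {
  def createWrite(table: String): WriteHandle
}

// Each version module would ship its own shim, constructing its own wrapper class.
class Spark3ShimSketch extends ShimSketch {
  override def createWrite(table: String): WriteHandle =
    new WriteHandle { override val compiledFor: String = s"spark3:$table" }
}

class Spark4ShimSketch extends ShimSketch {
  override def createWrite(table: String): WriteHandle =
    new WriteHandle { override val compiledFor: String = s"spark4:$table" }
}

object ShimLoaderSketch {
  // Assumption: select the shim by version prefix. The real loader may work differently.
  def load(sparkVersion: String): ShimSketch =
    if (sparkVersion.startsWith("3.")) new Spark3ShimSketch else new Spark4ShimSketch
}
```

Call sites in shared code would then ask the loaded shim for the write instead of invoking a constructor directly, so shared code never names a version-specific class.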

Documentation

No user-facing changes.


Copilot AI left a comment

Pull request overview

Refactors Spark V2 BatchWrite implementations so that the BatchWrite.commit(.., WriteSummary) signature inherited under Spark 4.1 does not leak into classes used on Spark 4.0 runtimes (where it can trigger ClassNotFoundException: WriteSummary during task serialization). The change centralizes the write logic in Spark-version-agnostic base classes and moves the extends BatchWrite mixin into thin per-version wrappers constructed via SparkShim factories.

Changes:

  • Introduce PaimonBatchWriteBase / FormatTableBatchWriteBase in paimon-spark-common (do not extend BatchWrite) and add per-version wrapper classes that mix in BatchWrite.
  • Add SparkShim.createPaimonBatchWrite and SparkShim.createFormatTableBatchWrite factories and route call sites through SparkShimLoader.
  • Add Spark 4.0 shadow wrappers (and shim wiring) to ensure Spark 4.0-target-compiled class metadata is used at runtime.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| paimon-spark/paimon-spark4-common/src/main/scala/org/apache/spark/sql/paimon/shims/Spark4Shim.scala | Implements new SparkShim factories for Spark 4.x to construct version-compiled BatchWrite wrappers. |
| paimon-spark/paimon-spark4-common/src/main/scala/org/apache/paimon/spark/write/PaimonBatchWrite.scala | Spark 4.x thin BatchWrite wrapper delegating to PaimonBatchWriteBase. |
| paimon-spark/paimon-spark4-common/src/main/scala/org/apache/paimon/spark/format/FormatTableBatchWrite.scala | Spark 4.x thin BatchWrite wrapper delegating to FormatTableBatchWriteBase. |
| paimon-spark/paimon-spark3-common/src/main/scala/org/apache/spark/sql/paimon/shims/Spark3Shim.scala | Implements new SparkShim factories for Spark 3.x to construct version-compiled BatchWrite wrappers. |
| paimon-spark/paimon-spark3-common/src/main/scala/org/apache/paimon/spark/write/PaimonBatchWrite.scala | Spark 3.x thin BatchWrite wrapper delegating to PaimonBatchWriteBase. |
| paimon-spark/paimon-spark3-common/src/main/scala/org/apache/paimon/spark/format/FormatTableBatchWrite.scala | Spark 3.x thin BatchWrite wrapper delegating to FormatTableBatchWriteBase. |
| paimon-spark/paimon-spark-common/src/main/scala/org/apache/spark/sql/paimon/shims/SparkShim.scala | Adds BatchWrite factory methods to route instantiation through per-version shims. |
| paimon-spark/paimon-spark-common/src/main/scala/org/apache/paimon/spark/write/PaimonV2Write.scala | Switches V2 batch write construction to SparkShimLoader.shim.createPaimonBatchWrite. |
| paimon-spark/paimon-spark-common/src/main/scala/org/apache/paimon/spark/write/PaimonBatchWriteBase.scala | Extracts BatchWrite logic into a Spark-version-agnostic base class (no extends BatchWrite). |
| paimon-spark/paimon-spark-common/src/main/scala/org/apache/paimon/spark/format/PaimonFormatTable.scala | Switches FormatTable batch write construction to SparkShimLoader.shim.createFormatTableBatchWrite and removes the inline BatchWrite impl. |
| paimon-spark/paimon-spark-common/src/main/scala/org/apache/paimon/spark/format/FormatTableBatchWriteBase.scala | Extracts FormatTable BatchWrite logic into a base class (no extends BatchWrite). |
| paimon-spark/paimon-spark-4.0/src/main/scala/org/apache/spark/sql/paimon/shims/Spark4Shim.scala | Spark 4.0 shim override wiring the new factories to Spark-4.0-compiled wrappers. |
| paimon-spark/paimon-spark-4.0/src/main/scala/org/apache/paimon/spark/write/PaimonBatchWrite.scala | Spark 4.0-compatible shadow BatchWrite wrapper delegating to PaimonBatchWriteBase. |
| paimon-spark/paimon-spark-4.0/src/main/scala/org/apache/paimon/spark/format/FormatTableBatchWrite.scala | Spark 4.0-compatible shadow BatchWrite wrapper delegating to FormatTableBatchWriteBase. |
Comments suppressed due to low confidence (2)

paimon-spark/paimon-spark-common/src/main/scala/org/apache/paimon/spark/format/FormatTableBatchWriteBase.scala:74

  • batchWriteBuilder.newCommit() returns a commit object that should be closed. Here the commit is never closed (including on exceptions), which can leak resources/file handles. Wrap the commit in try/finally (or equivalent) and close it in the finally block while preserving the existing error logging.

paimon-spark/paimon-spark-common/src/main/scala/org/apache/paimon/spark/format/FormatTableBatchWriteBase.scala:86

  • abortMessages creates a new commit via batchWriteBuilder.newCommit() but never closes it. Close the commit in a finally block after aborting to avoid leaking resources.
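
The try/finally shape these suppressed comments suggest could look roughly like the sketch below. It assumes the commit object is AutoCloseable, as the comments imply; CommitSketch is a hypothetical stand-in rather than the type returned by batchWriteBuilder.newCommit(), and CommitHelperSketch is an invented helper.

```scala
// Hypothetical stand-in for a closeable commit object (assumption for illustration).
class CommitSketch extends AutoCloseable {
  var closed = false

  // Fails on empty input so the error path can be exercised.
  def commit(messages: Seq[String]): Unit =
    if (messages.isEmpty) throw new IllegalStateException("nothing to commit")

  override def close(): Unit = closed = true
}

object CommitHelperSketch {
  // Runs the commit, logs and rethrows on failure, and always closes the commit.
  def commitAndClose(c: CommitSketch, messages: Seq[String]): Unit =
    try {
      c.commit(messages)
    } catch {
      case e: Exception =>
        System.err.println(s"commit failed: ${e.getMessage}") // keep the error logging
        throw e
    } finally {
      c.close() // runs on both the success path and the failure path
    }
}
```

The same shape would apply to the abort path: create the commit, abort inside try, close inside finally.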

@kerwin-zk kerwin-zk force-pushed the spark-batchwrite-refactor branch from d274193 to 642014f Compare May 22, 2026 02:01
@Zouxxyy
Contributor

Zouxxyy commented May 22, 2026

Let me check it

Contributor

@JingsongLi JingsongLi left a comment

Review: base + per-version wrapper refactoring for BatchWrite

Overall

The approach is sound. Spark 4.1's addition of a default method BatchWrite.commit(.., WriteSummary) means any class compiled against 4.1 that mixes in BatchWrite will carry the WriteSummary reference in its method table, breaking Spark 4.0 at runtime via lazy-linking. Splitting the implementation into a base class (no extends BatchWrite) in paimon-spark-common plus thin per-version wrappers compiled against each Spark version is the correct fix. The code is well-documented with clear scaladoc explaining the rationale.

Correctness

  1. commitStarted visibility change (private -> protected): Correct and necessary since abortMessages now lives in the base class and checks this flag.

  2. FormatTableBatchWriteBase.abortMessages has no commitStarted guard: This is consistent with the original FormatTableBatchWrite which also lacked the guard. The PaimonBatchWriteBase.abortMessages retains it. No regression here.

  3. FormatTableDataWriter relocation: The class moved out of PaimonFormatTable.scala (in both the 4.0 shadow and common) into FormatTableBatchWriteBase.scala in common. This keeps the writer co-located with the batch-write base logic which makes sense, but please verify it is visible to per-version wrappers that call createFormatTableDataWriterFactory().

Design observations

  1. Companion object PaimonBatchWrite.apply in per-version modules: After this PR, all instantiation routes through SparkShimLoader.shim.createPaimonBatchWrite(...) which calls new PaimonBatchWrite(...) directly. The companion apply methods appear unused. If they exist solely for binary compatibility with call-sites that used the former case class constructor syntax, please confirm whether any such call-sites remain; otherwise these could be removed to avoid confusion.

  2. with Serializable on both base classes: Since BatchWrite instances live exclusively on the driver (only DataWriterFactory is serialized to executors), Serializable is not strictly required. It is harmless and preserves the implicit contract of the former case class, but worth noting. The transitive serializability of SparkMetricRegistry and BatchWriteBuilder held by PaimonBatchWriteBase could cause surprising NotSerializableExceptions if something does accidentally attempt serialization.

  3. Source duplication across wrappers: The three PaimonBatchWrite wrappers (spark3-common, spark4-common, spark-4.0) and three FormatTableBatchWrite wrappers are near-identical. This is unavoidable given the design constraint. A brief comment at the top of each (already present -- good) explaining this is a deliberate compile-target copy helps future maintainers.
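
To illustrate the serialization concern in observation 2, here is a small sketch of what happens when a class marked Serializable holds a field whose type is not. RegistrySketch and BaseSketch are hypothetical stand-ins; whether the real SparkMetricRegistry or BatchWriteBuilder are serializable is not established here.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical driver-side helper that does not implement Serializable.
class RegistrySketch

// Marked Serializable, but transitively holds a non-serializable field.
class BaseSketch(val registry: RegistrySketch) extends Serializable

object SerializationCheck {
  // Returns true if Java serialization of obj succeeds, false if it hits
  // a NotSerializableException somewhere in the object graph.
  def canSerialize(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      try out.writeObject(obj)
      finally out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

Marking the outer class Serializable does not make its fields serializable; the failure only surfaces at the moment something actually tries to write the object.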

Minor nits

  • In the paimon-spark-4.0 PaimonBatchWrite.scala scaladoc, the phrase "maven shade order picks paimon-spark-4.0/target/classes ahead of the shaded 4-common copy" is an implementation detail that may become stale; consider referencing the build mechanism more generically (e.g., "classpath ordering ensures the 4.0 shadow takes precedence").
  • The FormatTableBatchWriteBase constructor exposes batchWriteBuilder as protected val. If no subclass other than the thin wrappers needs it, private[format] would tighten visibility.

Summary

Well-structured refactoring that correctly isolates the Spark version-specific mixin. The pattern is consistent across both affected classes. No functional regressions observed. Only cosmetic/hygiene items noted above.

@JingsongLi
Contributor

+1

@JingsongLi JingsongLi merged commit 50ffca8 into apache:master May 23, 2026
12 of 13 checks passed