Core: implement writeManifestsWith executor on SnapshotUpdate #16108

Open
dramaticlly wants to merge 3 commits into apache:main from dramaticlly:commitManifestsWith
Conversation

@dramaticlly
Contributor

Why

The SnapshotUpdate interface today already has scanManifestsWith(ExecutorService) (from #4147) which lets callers control the executor used for reading/scanning manifests. However, writing manifests during snapshot commit still hardcodes ThreadPools.getWorkerPool() for parallel manifest writes.

This means callers cannot control the thread pool used for I/O-heavy manifest writes or coordinate its shutdown behavior (see #15031). It is also difficult to pass additional contextual information, such as tags, logging, or metrics, to help debug problems.

@dramaticlly
Contributor Author

@stevenzwu @huaxingao if you are interested to take a look

Comment thread core/src/main/java/org/apache/iceberg/SnapshotProducer.java
Comment thread core/src/main/java/org/apache/iceberg/SnapshotProducer.java
Comment thread core/src/main/java/org/apache/iceberg/SnapshotProducer.java Outdated
  }

- private static <F> List<ManifestFile> writeManifests(
+ private <F> List<ManifestFile> writeManifests(
Contributor Author

@dramaticlly dramaticlly May 18, 2026

Removed static because the method now needs to access the commitPoolSized instance variable. This method is also only called during apply(), which runs single-threaded per snapshot operation.

Comment thread core/src/main/java/org/apache/iceberg/SnapshotProducer.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/SnapshotProducer.java Outdated
Contributor

@huaxingao huaxingao left a comment

LGTM

Comment thread api/src/main/java/org/apache/iceberg/SnapshotUpdate.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/SnapshotProducer.java
@dramaticlly dramaticlly force-pushed the commitManifestsWith branch from 1fefff3 to 112d0d9 Compare May 19, 2026 22:54
@dramaticlly dramaticlly reopened this May 20, 2026
@dramaticlly
Contributor Author

CI hit an intermittent testcontainers failure; reopening to trigger a rerun.

TestADLSFileIO > initializationError FAILED
    org.testcontainers.containers.ContainerFetchException at GenericContainer.java:1308
        Caused by: com.github.dockerjava.api.exception.NotFoundException at DefaultInvocationBuilder.java:241

* @param executorService the provided executor
* @return this for method chaining
*/
default ThisT commitManifestsWith(ExecutorService executorService) {
Contributor

Two parts: a context note on the dual-method shape, then a recommendation on this specific method.

Dual-method shape is fine — it matches existing convention

Separate scanManifestsWith (read-side) + a write-side companion is consistent with ExpireSnapshots, which has had planWith(ExecutorService) for read-side planning and executeDeleteWith(ExecutorService) for delete-side IO as separate methods since 2020 — same pattern, same reason: two distinct parallelizable phases, one executor per phase. Callers who want a single pool can pass the same ExecutorService to both methods.

For history (this PR closes a long-standing gap rather than introducing a new shape): #4147 (Feb 2022) added scanManifestsWith scoped to reads because manifest writes weren't parallelized at the time. Writes got parallelized later in #11086, which hardcoded ThreadPools.getWorkerPool() instead of plumbing the existing executor hook through. The original Flink classloader/lifecycle motivation (#3776) applies just as strongly to writes, so plumbing a caller-provided pool here is a natural extension.

Two design suggestions on this method specifically

1. Rename commitManifestsWith → writeManifestsWith.

The natural counterpart of scanManifestsWith is writeManifestsWith, not commitManifestsWith. scan and write are verbs describing the operation on manifests; commit describes a phase. Three concrete reasons to prefer the verb pairing:

  • "Commit" is overloaded in Iceberg. Transaction.commitTransaction(), PendingUpdate.commit(), the catalog RPC at the end of SnapshotProducer.commit(), the commit retry budget — all reuse the word for different things. A reader landing on commitManifestsWith(executor) cold can reasonably read it as "use this executor to perform the commit," which is wrong: the catalog commit is a single RPC with nothing to parallelize.
  • The pool literally writes. commitPool() is invoked exactly once, inside the writeManifests method, around the writeFunc.apply(group) calls that produce manifest files. The internal naming (commitPool field driving writeManifests) already has a small inconsistency that the rename resolves.
  • Iceberg style: "method names should describe the specific behavior, not just be generic" — same instinct as selectInIdOrder over selectOrdered.

2. Take parallelism as an explicit second arg: writeManifestsWith(ExecutorService executor, int parallelism).

The current 1-arg shape conflates two concerns:

  • Lifecycle/threading: where the Runnables execute.
  • Layout determinism: how many manifest groups get produced (currently inferred from getMaximumPoolSize()).

Coupling them forces the instanceof ThreadPoolExecutor precondition, which:

  • Rejects Executors.newSingleThreadExecutor() — it returns a FinalizableDelegatedExecutorService wrapper, not a raw ThreadPoolExecutor. Asymmetric with Executors.newFixedThreadPool(1), which works.
  • Forecloses ForkJoinPool and virtual-thread executors permanently — a real cost as JDK 21+ pushes Executors.newVirtualThreadPerTaskExecutor() for I/O-bound work like manifest writes.
  • The recent getMaximumPoolSize() < Integer.MAX_VALUE guard plugs the newCachedThreadPool hole but doesn't address the structural issue.
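
To make the asymmetry concrete, here is a minimal JDK-only sketch. The helper name passesStrictCheck is hypothetical and only approximates the precondition discussed above (an instanceof ThreadPoolExecutor test plus the bounded getMaximumPoolSize() guard); it is not the code in this PR.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ThreadPoolExecutor;

final class ExecutorTypeCheck {
  private ExecutorTypeCheck() {}

  // Hypothetical stand-in for the strict precondition: only a raw
  // ThreadPoolExecutor with a bounded maximum pool size is accepted.
  static boolean passesStrictCheck(ExecutorService executor) {
    return executor instanceof ThreadPoolExecutor
        && ((ThreadPoolExecutor) executor).getMaximumPoolSize() < Integer.MAX_VALUE;
  }

  public static void main(String[] args) {
    ExecutorService fixed = Executors.newFixedThreadPool(1);
    ExecutorService single = Executors.newSingleThreadExecutor();
    ExecutorService cached = Executors.newCachedThreadPool();
    ExecutorService forkJoin = new ForkJoinPool(2);
    try {
      System.out.println("newFixedThreadPool(1):     " + passesStrictCheck(fixed));
      System.out.println("newSingleThreadExecutor(): " + passesStrictCheck(single));
      System.out.println("newCachedThreadPool():     " + passesStrictCheck(cached));
      System.out.println("ForkJoinPool(2):           " + passesStrictCheck(forkJoin));
    } finally {
      // No tasks were submitted, so shutdown returns promptly.
      fixed.shutdown();
      single.shutdown();
      cached.shutdown();
      forkJoin.shutdown();
    }
  }
}
```

Only the fixed pool passes. The single-thread executor is hidden behind a delegating wrapper, the cached pool has an unbounded maximum size, and ForkJoinPool is not a ThreadPoolExecutor at all.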

Decoupling fixes all three:

  • Any ExecutorService works — ForkJoinPool, virtual threads, single-thread, cached pool, fixed pool.
  • Layout determinism is stated, not inferred — the caller knows the manifest count.
  • The strict-type precondition reduces to Preconditions.checkArgument(parallelism > 0).
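
A rough sketch of what the decoupled shape could look like internally, assuming the group count is taken from the explicit parallelism argument rather than from the pool. ParallelGroupWriter, divide, and writeGroups are hypothetical names for illustration, and the near-equal split is an assumption, not Iceberg's actual grouping logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ParallelGroupWriter {
  private ParallelGroupWriter() {}

  // Splits items into at most `parallelism` contiguous groups whose sizes
  // differ by at most one. The group count never depends on the executor.
  static <T> List<List<T>> divide(List<T> items, int parallelism) {
    if (parallelism <= 0) {
      throw new IllegalArgumentException("parallelism must be > 0: " + parallelism);
    }
    int groupCount = Math.min(parallelism, items.size());
    List<List<T>> groups = new ArrayList<>();
    int start = 0;
    for (int i = 0; i < groupCount; i++) {
      int size = items.size() / groupCount + (i < items.size() % groupCount ? 1 : 0);
      groups.add(items.subList(start, start + size));
      start += size;
    }
    return groups;
  }

  // Runs writeFunc once per group on the supplied executor and returns the
  // results in group order. Any ExecutorService implementation is accepted.
  static <T, R> List<R> writeGroups(
      List<T> items, Function<List<T>, R> writeFunc, ExecutorService executor, int parallelism) {
    List<Future<R>> futures = new ArrayList<>();
    for (List<T> group : divide(items, parallelism)) {
      Callable<R> task = () -> writeFunc.apply(group);
      futures.add(executor.submit(task));
    }
    List<R> results = new ArrayList<>();
    for (Future<R> future : futures) {
      try {
        results.add(future.get());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("Interrupted while writing groups", e);
      } catch (ExecutionException e) {
        throw new IllegalStateException("Group write failed", e.getCause());
      }
    }
    return results;
  }

  public static void main(String[] args) {
    List<Integer> items = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
    ExecutorService pool = new ForkJoinPool(2); // not a ThreadPoolExecutor
    try {
      // Each "write" just reports its group size; prints [4, 3, 3].
      System.out.println(writeGroups(items, List::size, pool, 3));
    } finally {
      pool.shutdown();
    }
  }
}
```

With this shape the pool's thread count affects only how fast the groups are written; the number of outputs is fixed by the parallelism argument, so swapping in a different executor does not change the layout.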

On the "match Iceberg convention" pull for the second arg

Every existing executor method (scanManifestsWith, Scan.planWith, ExpireSnapshots.planWith, executeDeleteWith, executeWith) is single-arg with no explicit parallelism. But those methods all share a property writeManifestsWith doesn't: the pool is a pure performance knob. Pool size affects how fast, not what gets produced. The write-side pool is unique in that pool size determines manifest count — a correctness / durable-output concern. Different semantics, different shape is justifiable.

If staying aligned with convention is preferred, the TableMigrationUtil precedent of overload pairs (ExecutorService) and (ExecutorService, int parallelism) is available. But since this method hasn't shipped yet, landing the two-arg version directly is cleaner — no overload split, no future migration story.

Net recommendation

default ThisT writeManifestsWith(ExecutorService executor, int parallelism) {
  throw new UnsupportedOperationException(
      this.getClass().getName() + " does not support writeManifestsWith");
}

Contributor

I unresolved this comment for visibility to other reviewers on the API design choices.

@dramaticlly dramaticlly changed the title from "Core: implement commitManifestsWith executor on SnapshotUpdate" to "Core: implement writeManifestsWith executor on SnapshotUpdate" May 21, 2026
@dramaticlly
Contributor Author

@amogh-jahagirdar wondering if you want to take a look as well? Appreciated

…parallelism

Rename the API method from commitManifestsWith(ExecutorService) to
writeManifestsWith(ExecutorService, int parallelism) per review feedback.

The "write" verb naturally pairs with scanManifestsWith, and the explicit
parallelism parameter decouples executor lifecycle from layout determinism.
This removes the instanceof ThreadPoolExecutor constraint, allowing any
ExecutorService (ForkJoinPool, virtual threads, etc.) to be used.
@dramaticlly dramaticlly force-pushed the commitManifestsWith branch from 177e9c1 to 17e0f64 Compare May 22, 2026 22:47
3 participants