Docs: recommendation for packaging uberjars#14292

Merged
nastra merged 2 commits into apache:main from qinghui-xu:issue-14232
Oct 23, 2025

@qinghui-xu
Contributor

@qinghui-xu qinghui-xu commented Oct 10, 2025

Sometimes Iceberg may use versions of Parquet, Avro, or other libraries that are incompatible with what's deployed in the engine runtime. Users should package only iceberg-spark-runtime (it provides shaded transitive dependencies) into the uberjar to avoid such issues.
Highlight this in the site docs.
Close #14232
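
As a rough illustration of the problem described above, the following sketch scans an uberjar for Avro or Parquet classes that sit in their original, unshaded packages. The function name `find_unshaded_classes` and the prefix list are hypothetical choices made for this example, and the check assumes that shaded copies live under a relocated package, so they do not start with the original prefix.

```python
import zipfile

# Package prefixes assumed (for illustration only) to be supplied by the engine runtime.
ENGINE_PROVIDED_PREFIXES = ("org/apache/avro/", "org/apache/parquet/")


def find_unshaded_classes(jar_path, prefixes=ENGINE_PROVIDED_PREFIXES):
    """Return the class entries in the jar that sit directly under one of the prefixes."""
    with zipfile.ZipFile(jar_path) as jar:
        return sorted(
            name
            for name in jar.namelist()
            # str.startswith accepts a tuple and matches any of its members.
            if name.endswith(".class") and name.startswith(prefixes)
        )
```

An empty result suggests that the uberjar does not bundle its own unshaded copies of these libraries. A non-empty result is only a hint to review the build's dependency exclusions, not proof of a conflict.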

@github-actions github-actions Bot added the docs label Oct 10, 2025
@qinghui-xu
Contributor Author

cc @nastra @RussellSpitzer

@qinghui-xu
Contributor Author

Could somebody provide some feedback? Thanks a lot!

Comment thread README.md Outdated
Iceberg also has modules for adding Iceberg support to processing engines:

* `iceberg-spark` is an implementation of Spark's Datasource V2 API for Iceberg with submodules for each spark versions (use runtime jars for a shaded version)
* When packaging user projects, keep only `iceberg-spark-runtime` in the uberjar. All other functional modules such as `iceberg-core` or `iceberg-parquet` (and their transitive dependencies) should be excluded from uberjar, because some libraries (`parquet`, `avro`, etc) they use may be of versions incompatible with those provided in Spark runtime classpath.
Contributor

Suggested change
* When packaging user projects, keep only `iceberg-spark-runtime` in the uberjar. All other functional modules such as `iceberg-core` or `iceberg-parquet` (and their transitive dependencies) should be excluded from uberjar, because some libraries (`parquet`, `avro`, etc) they use may be of versions incompatible with those provided in Spark runtime classpath.
* When packaging user projects in an uberjar, only the `iceberg-spark-runtime` jar and potentially one of the storage-specific bundles, such as `iceberg-aws-bundle` or `iceberg-gcp-bundle` are needed. No other Iceberg modules, such as `iceberg-core` or `iceberg-parquet` (and their transitive dependencies) are needed, as this will lead to dependency mismatches/conflicts on the classpath.

Contributor

we should probably add the same wording for iceberg-flink below

Contributor Author

Thanks for the reply. I'd like to use less strong wording for the last phrase "they may lead to dependency mismatches/conflicts on runtime classpath" because with previous versions of iceberg (up to 1.8) I did not encounter such issues with my crooked setup.

@manuzhang
Member

manuzhang commented Oct 21, 2025

We actually have guidance at https://iceberg.apache.org/multi-engine-support/#runtime-jar. The README is not up-to-date. Shall we link to docs site in README like https://github.com/apache/iceberg-python/blob/main/README.md?

@nastra
Contributor

nastra commented Oct 21, 2025

thanks for pointing that out @manuzhang. @qinghui-xu can you just add a link to the respective section of the docs then?

Comment thread README.md Outdated
Iceberg also has modules for adding Iceberg support to processing engines:

* `iceberg-spark` is an implementation of Spark's Datasource V2 API for Iceberg with submodules for each spark versions (use runtime jars for a shaded version)
* When packaging user projects in an uberjar, only the `iceberg-spark-runtime` jar and potentially one of the storage-specific bundles, such as `iceberg-aws-bundle` or `iceberg-gcp-bundle` are needed, as suggested by [the documentation](https://iceberg.apache.org/multi-engine-support/#runtime-jar). No other Iceberg modules, such as `iceberg-core` or `iceberg-parquet` (and their transitive dependencies) are needed, as they may lead to dependency version mismatches/conflicts on the runtime classpath.
Contributor

I think we can remove this entire subsection and rather add a link to the line above which then links to the respective section: ... for each spark version (use [runtime jars](link goes here) for a shaded version)

And then we can do the same for iceberg-flink-runtime below

Contributor Author

I agree it seems a bit redundant as information. But I'd like to highlight that we should exclude other modules from runtime classpath (eg. uberjar). Maybe I should put this into the doc instead?

Contributor

Yes just add it to the linked doc and here please only add the link as otherwise we're duplicating information

Contributor Author

Updated the site/docs and added a pointer in the README.

Sometimes iceberg may use versions of parquet, avro, or other libs that are incompatible with what's deployed in Spark runtime.
Add a reminder in site docs to highlight such issues.

This closes apache#14232
@qinghui-xu qinghui-xu changed the title from "README: recommendation for packaging spark jobs" to "Docs: recommendation for packaging uberjars" Oct 23, 2025
When using Iceberg with these engines, the runtime jar is the only addition to the classpath needed in addition to vendor dependencies.
For example, to use Iceberg with Spark 3.5 and AWS integrations, `iceberg-spark-runtime-3.5_2.12` and AWS SDK dependencies are needed for the Spark installation.

> ℹ️ It's important to make sure that only the runtime jars (plus storage specific bundles if needed, eg. `iceberg-aws-bundle` or `iceberg-gcp-bundle`) are included in the runtime classpath.
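
The rule in the note above can be sketched as a simple check over jar file names. This is only an illustration: the function name `unexpected_iceberg_jars`, the allow-list pattern, and the file names in the usage example are assumptions made for this sketch, not part of Iceberg or of this change.

```python
import re

# Iceberg artifacts treated as acceptable on the runtime classpath in this sketch:
# engine runtime jars and the storage bundles named in the note (assumed allow-list).
ALLOWED = re.compile(r"^iceberg-(spark-runtime|flink-runtime|aws-bundle|gcp-bundle)\b")


def unexpected_iceberg_jars(jar_names):
    """Return the Iceberg jars that are neither a runtime jar nor a storage bundle."""
    return [
        name
        for name in jar_names
        if name.startswith("iceberg-") and not ALLOWED.match(name)
    ]
```

For example, given the hypothetical list `["iceberg-spark-runtime-3.5_2.12-1.10.0.jar", "iceberg-core-1.10.0.jar"]`, only the `iceberg-core` jar is reported, which mirrors the recommendation to keep such modules out of the uberjar.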
Contributor

how does this page render? Can you provide a screenshot when running the site locally?

Contributor Author

It should look like this. (Attached image: screenshot-iceberg-doc)

Comment thread site/docs/multi-engine-support.md Outdated
Comment thread README.md Outdated
Comment thread README.md Outdated
Contributor

@nastra nastra left a comment

LGTM, thanks @qinghui-xu

@nastra nastra merged commit de213af into apache:main Oct 23, 2025
3 checks passed
thomaschow pushed a commit to thomaschow/iceberg that referenced this pull request Jan 19, 2026
talatuyarer pushed a commit to talatuyarer/iceberg that referenced this pull request Apr 1, 2026
Development

Successfully merging this pull request may close these issues.

Avro version incompatible with Spark 3.5/3.4