Skip to content

fix(table): enable Parquet dictionary encoding by default#931

Open
twuebi wants to merge 1 commit into
apache:mainfrom
twuebi:tp/enable-dict-write-parquet
Open

fix(table): enable Parquet dictionary encoding by default#931
twuebi wants to merge 1 commit into
apache:mainfrom
twuebi:tp/enable-dict-write-parquet

Conversation

@twuebi
Copy link
Copy Markdown
Contributor

@twuebi twuebi commented Apr 22, 2026

The Parquet writer hardcoded WithDictionaryDefault(false), overriding the arrow-go writer's own default (true) with no way for users to opt back in.

Remove the override so the library default applies.

This matches Java Iceberg (which removed its equivalent table property in apache/iceberg#7665) and parquet-java/parquet-cpp defaults.

@twuebi twuebi requested a review from zeroshade as a code owner April 22, 2026 15:49
@twuebi
Copy link
Copy Markdown
Contributor Author

twuebi commented Apr 22, 2026

closing for now, for high cardinality columns this is a regression since the dictionary is not discarded on overflow

@twuebi twuebi closed this Apr 22, 2026
@zeroshade
Copy link
Copy Markdown
Member

@twuebi what do you mean? With the high-cardinality case it would fallback to plain encoding instead of dictionary, or is there a bug there?

@twuebi
Copy link
Copy Markdown
Contributor Author

twuebi commented Apr 23, 2026

@zeroshade yes, there seems to be a bug in arrow-go where both dictionary and plain ends up getting written for that page

@C-Loftus
Copy link
Copy Markdown
Contributor

Hi all, in light of this being closed, I was just curious to clarify the current status, so does iceberg-go currently not support dictionary encoding? If so, would it be reasonable to make a PR to enable opting in but not need to make it enabled by default (i.e. to prevent any issues with high cardinality as mentioned above)

Thank you very much for all your hard work on this library!

@twuebi twuebi reopened this May 13, 2026
@twuebi
Copy link
Copy Markdown
Contributor Author

twuebi commented May 13, 2026

Hi @C-Loftus, thanks for the poke, we've fixed the double encoding in arrow-go, once arrow-go makes another release, and we bump the dependency here, we should be able to allow dictionary encoding. Until then, enabling it would mean a regression for high cardinality columns.

@C-Loftus
Copy link
Copy Markdown
Contributor

Makes sense, thank you for the update and that context!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants