fix(table): enable Parquet dictionary encoding by default #931

Closing for now. For high-cardinality columns this is a regression, since the dictionary is not discarded on overflow.

@twuebi What do you mean? In the high-cardinality case it would fall back to plain encoding instead of dictionary. Or is there a bug there?

@zeroshade Yes, there seems to be a bug in arrow-go where both the dictionary and the plain encoding end up being written for that page.

Hi all, in light of this being closed, I'd like to clarify the current status: does iceberg-go currently not support dictionary encoding? If so, would it be reasonable to open a PR that makes it opt-in rather than enabled by default (i.e. to avoid the high-cardinality issue mentioned above)? Thank you very much for all your hard work on this library!

Hi @C-Loftus, thanks for the poke. We've fixed the double encoding in arrow-go. Once arrow-go makes another release and we bump the dependency here, we should be able to allow dictionary encoding. Until then, enabling it would mean a regression for high-cardinality columns.

Makes sense, thank you for the update and that context!
The Parquet writer hardcoded WithDictionaryDefault(false), overriding the arrow-go writer's own default (true) with no way for users to opt back in.
Remove the override so the library default applies.
This matches Java Iceberg (which removed its equivalent table property in apache/iceberg#7665) and parquet-java/parquet-cpp defaults.
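
For readers following the high-cardinality discussion in the thread, here is a minimal toy sketch of the fallback behavior the commenters expect: dictionary-encode a column until the dictionary would exceed a size limit, then discard the dictionary and emit plain values only. This is not arrow-go's implementation. The `encodedColumn` type, the `encodeColumn` function, and the `maxDictEntries` parameter are all hypothetical names invented for illustration, and a real Parquet writer works page by page with byte-size limits rather than entry counts.

```go
package main

import "fmt"

// encodedColumn is a hypothetical, simplified stand-in for an encoded
// column chunk. It is not related to any arrow-go type.
type encodedColumn struct {
	Encoding string   // "dictionary" or "plain"
	Dict     []string // distinct values; nil after fallback
	Indices  []int    // positions into Dict; nil after fallback
	Plain    []string // raw values; set only after fallback
}

// encodeColumn dictionary-encodes values until a new distinct value would
// push the dictionary past maxDictEntries (an assumed, illustrative limit).
// On overflow it discards the dictionary and returns plain values only,
// so the output never carries both representations.
func encodeColumn(values []string, maxDictEntries int) encodedColumn {
	lookup := make(map[string]int)
	var dict []string
	indices := make([]int, 0, len(values))

	for _, v := range values {
		idx, seen := lookup[v]
		if !seen {
			if len(dict) >= maxDictEntries {
				// Overflow: drop dict and indices, keep a copy of the raw values.
				return encodedColumn{
					Encoding: "plain",
					Plain:    append([]string(nil), values...),
				}
			}
			idx = len(dict)
			lookup[v] = idx
			dict = append(dict, v)
		}
		indices = append(indices, idx)
	}
	return encodedColumn{Encoding: "dictionary", Dict: dict, Indices: indices}
}

func main() {
	// Low cardinality: two distinct values fit in the dictionary.
	low := encodeColumn([]string{"us", "eu", "us", "us", "eu"}, 4)
	fmt.Println(low.Encoding, low.Dict, low.Indices)
	// prints: dictionary [us eu] [0 1 0 0 1]

	// High cardinality: the fifth distinct value overflows the limit of 4.
	high := encodeColumn([]string{"id-1", "id-2", "id-3", "id-4", "id-5"}, 4)
	fmt.Println(high.Encoding, len(high.Dict), len(high.Plain))
	// prints: plain 0 5
}
```

The regression described in the thread corresponds to the overflow branch keeping the dictionary alongside the plain values, which makes a high-cardinality column larger than plain encoding alone.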