feat(parquet): dictionary fallback heuristic based on compression efficiency #9700
mzabaluev wants to merge 9 commits into apache:main
Conversation
In parquet, add a `dictionary_fallback` option to `ColumnProperties`. Its value type is defined as the `DictionaryFallback` enum with the variants `OnDataPageSize` (the previous behavior, the default) and `OnUnfavorableCompression`. The latter replicates the behavior of parquet-java, which falls back to non-dictionary encoding when the estimated compressed size of the data page with dictionary encoding is not smaller than the estimated size of the data page with plain encoding.
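The option and the decision it enables can be sketched as follows. This is a simplified illustration, not the PR's actual implementation; the enum variant names follow the description above, while `should_fall_back` and its parameters are hypothetical.

```rust
// Hypothetical sketch of the new property and the fallback check it enables.
#[derive(Clone, Copy, Debug, PartialEq)]
#[non_exhaustive]
pub enum DictionaryFallback {
    /// Fall back once the dictionary page exceeds the configured size limit
    /// (the previous behavior, and the default).
    OnDataPageSize,
    /// Additionally fall back when dictionary encoding is estimated not to
    /// compress better than plain encoding (parquet-java's behavior).
    OnUnfavorableCompression,
}

/// Decide whether to abandon dictionary encoding for a column chunk.
/// All sizes are estimates accumulated by the encoder.
pub fn should_fall_back(
    mode: DictionaryFallback,
    dict_page_size: usize,
    dict_page_limit: usize,
    estimated_dict_encoded_size: usize,
    estimated_plain_size: usize,
) -> bool {
    match mode {
        DictionaryFallback::OnDataPageSize => dict_page_size > dict_page_limit,
        DictionaryFallback::OnUnfavorableCompression => {
            dict_page_size > dict_page_limit
                // parquet-java falls back when the dictionary-encoded estimate
                // is not smaller than the plain-encoded estimate
                || estimated_dict_encoded_size >= estimated_plain_size
        }
    }
}
```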
The method as it is used now does not need to return the tuple of base size and number of elements; instead, it can just compute the encoded size as appropriate for the type.
It's not really necessary because the counter is only used while dictionary encoding is enabled, but it's good to safeguard against future refactoring.
Thanks @mzabaluev, I was unaware of this parquet-java behavior. I wonder, however, if even the parquet-java version needs to be updated. I looked at the Java code, and it has been around for quite some time (it was added in late 2014). At that time, I believe the default page size was on the order of a megabyte, so using this heuristic after a single page was probably not a bad idea. However, when the page indexes were added, parquet-java was modified to limit pages to 20000 rows by default (this crate adopted the 20k default quite some time later).

IMO, 20000 values is too small a sample to decide if a dictionary is having a beneficial effect. Say one has a relatively low cardinality (32k) i64 column with a somewhat random distribution. After encoding one 20k-row page, I think the heuristic here will almost certainly choose plain over dictionary, but if one were to encode 10 pages, dictionary would then be seen to be superior by far. I like that this is opt-in, but then wonder: if a user knows this heuristic will be helpful (i.e. they know it's a high cardinality column), could they not instead simply disable dictionary encoding for the column in question?
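The sample-size concern can be illustrated with back-of-the-envelope arithmetic for the scenario above: an i64 column with 32 Ki distinct values and 20000-row pages. The numbers are simplified estimates (15-bit indices for a 32 Ki dictionary, ignoring RLE runs and page headers), assumed for illustration only.

```rust
/// Rough (plain, dictionary) encoded-size estimates after `pages` pages of
/// 20000 i64 values drawn from 32 Ki distinct values.
fn sizes(pages: usize) -> (usize, usize) {
    let rows = 20_000 * pages;
    let plain = rows * 8; // 8 bytes per plain-encoded i64
    // Dictionary page stores each distinct value once; indices need
    // ceil(log2(32768)) = 15 bits per value.
    let dict = 32 * 1024 * 8 + rows * 15 / 8;
    (plain, dict)
}

fn main() {
    for pages in [1, 10] {
        let (plain, dict) = sizes(pages);
        println!("{pages} page(s): plain = {plain} B, dict = {dict} B");
    }
}
```

After one page the dictionary side (~300 KB including the 256 KB dictionary page) loses to plain (~160 KB), so a per-page heuristic falls back; over ten pages the dictionary amortizes and wins by more than 2x.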
```diff
-        (std::mem::size_of::<u32>(), self.len())
+    fn dict_encoding_size(&self) -> usize {
+        4 + self.len()
+    }
```
It's a u32, so it will never be different from 4.
Yes, so you should use `size_of`, which makes it clear why the value is 4.
To be fair, there's magic all over this module. But I agree with @EmilyMatt there's no need to add more.
parquet/src/data_type.rs
Outdated
```rust
    }

    fn dict_encoding_size(&self) -> usize {
        12
    }
```
This is a type that encodes to 96 bits. The compound Rust type is furthermore not declared with a repr that nails down its in-memory size (the compiler might decide to align such small arrays to 16 bytes one day and jack up the size accordingly, for example) so I'd argue using size_of would not be squeaky-clean here.
I understand, so a const would be enough, just not a magic number.
OK, the array alignment is nailed down in the spec, but Int96 wrapping the array is a struct with a Rust repr, so... 🤷 Could be a case for declaring the struct with repr(packed), but I'd rather argue we use explicit numbers here because the encoding is not a straight memory copy.
Good suggestion about the const, I've added Int96::SIZE_IN_BYTES for this purpose.
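The outcome of this thread could look like the sketch below. The constant name `Int96::SIZE_IN_BYTES` is from the comment above; the surrounding struct is a stand-in for the real definition in parquet/src/data_type.rs.

```rust
/// Stand-in for parquet's Int96 type: three little-endian u32 words,
/// per the Parquet INT96 physical type.
pub struct Int96 {
    pub value: [u32; 3],
}

impl Int96 {
    /// Encoded width of an INT96 value. Deliberately an explicit constant
    /// rather than `size_of::<Int96>()`: the struct has the default Rust
    /// repr, so its in-memory size is not guaranteed to match the encoded
    /// size, and the encoding is not a straight memory copy anyway.
    pub const SIZE_IN_BYTES: usize = 12;
}
```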
```rust
        Ok(values_read)
    }

    fn dict_encoding_size(&self) -> usize {
```
Hmm, it's not only that using size_of would be dubious, the encoding actually uses one bit per value, so this method is a leaky abstraction. I'll see about reworking it.
The dictionary encoding is never used on bool types, so this just needs some clarifying comments.
... or, better, a panic because this method should never be called for BoolType: the column encoder does a check to fall back to plain/RLE in this case.
Yeah, this probably needs an `unreachable!()`.
Though I believe a main principle is that this project should sit really low in the dependency graph, which argues against panics. Maybe a const here would be fine as well, or 0, but I fear that could cause a never-ending block/row-group if this assumption ever changes (for example, when encoding an all-true group or something).
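The backstop discussed here might look like the following sketch, written against a stand-in trait; the real method lives on parquet's data type implementations, and `DictEncodingSize` is invented for illustration.

```rust
/// Stand-in for the trait that carries `dict_encoding_size` in parquet.
trait DictEncodingSize {
    fn dict_encoding_size(&self) -> usize;
}

/// Stand-in for parquet's BoolType.
struct BoolType;

impl DictEncodingSize for BoolType {
    fn dict_encoding_size(&self) -> usize {
        // The column writer falls back to plain/RLE for BOOLEAN before a
        // dictionary is ever built, so any value returned here would be
        // bogus for this method's contract; panic loudly instead of
        // silently reporting a wrong size.
        unreachable!("dictionary encoding is not used for BOOLEAN columns")
    }
}
```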
Good points @etseidl. Our motivation for adding this is that in some cases, e.g. with high cardinality, the Rust parquet writer produces much larger encoded Parquet than the Spark workloads we're aiming to replace. So using a default option that enables a heuristic akin to the one hardcoded into parquet-java would get us on par (or maybe better, because this implementation may choose to fall back at any page chunk in the dataframe).
I wonder if this could be adapted to take a larger sample before deciding. Maybe raise the number of values per page if enabled.
In dict_encoded_size, use size_of and symbolic constants instead of hardcoded values. For the boolean type, backstop with a panic since any returned value would be bogus for this method's contract (as plain encoding for BOOLEAN is bit-packed). The dictionary encoding is never used for boolean values.
Can this be effected with other existing properties by an educated programmer? There may be adverse effects to tweaking other defaults depending on this. If you mean I'd rather have the option to choose between:
For 3, I have purposefully left the
etseidl left a comment
I sympathize with wanting to be a bit smarter about when to give up on dictionary encoding. I would, however, like to see something a bit more defensible before proceeding with this change. For example, I'd like to see examples where this heuristic outperforms the current defaults by more than 10%, say, and also outperforms disabling dictionary encoding altogether (something which is already an opt-in option, as this new heuristic would be).
```rust
    #[test]
    fn test_dict_page_size_decided_by_compression_fallback() {
```
As a test, I saved the output from this and examined the sizing. Without the heuristic, the encoded size for col0 is 8658384 bytes (the default fallback mechanism kicked in after 7 pages). With the heuristic, col1 is 8391126 bytes, a savings of 3%.
I also modified the test to mod the index with 32767. In that instance, col1 was still 8391126 bytes, but col0 was only 2231581, nearly 4X smaller.
I know this is not entirely representative, but it does again point out the pitfalls of too simplistic an approach.
Edit: I did a test of spark with the latter file (32k cardinality). By default, it opts to fallback for all pages, so the file is even larger. If I modify the global parquet.page.row.count.limit to 132000, it then opts for dictionary encoding as it should.
I have modified the test in 1b6dd37 to demonstrate a case where even an early fallback decision yields about 12% better compression. But I generally agree with your assessment, so more work is needed.
Another quirk is seen in this test: a dictionary page is still flushed to encode the first data page, even though there is no benefit. Parquet-java takes care to hand over the accumulated values to the plain encoder to be re-encoded.
```rust
/// that the writer should fall back to the plain encoding when at a certain point,
/// e.g. after encoding the first batch, the total size of unencoded data
/// is calculated as smaller than `(encodedSize + dictionarySize)`.
pub struct PlainDataSizeCounter {
```
To my comment about sample size, perhaps this can be adapted to keep track of both the plain and dictionary encoded sizes, and then only make a decision after some critical number of rows or bytes have been processed.
I can add a value governing this (as the minimum number of rows) directly into the `OnUnfavorableCompression` variant. Or perhaps add another property alongside it for future extensibility, since it's not clear that everyone would want it to be rows; there may be a preference for a byte size threshold later, and I don't want to introduce a nested open-ended property struct.
Perhaps just count values entered, to cover lists as well. I think as long as there are a sufficient number of samples with which to make a decision, this could be quite nice in the end.
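The adaptation suggested in this thread could be sketched like this: accumulate both size estimates and defer the verdict until a minimum number of values has been seen. The type and field names here are hypothetical, not part of the PR.

```rust
/// Hypothetical sketch: defers the dictionary-vs-plain decision until a
/// minimum number of values has been observed.
pub struct FallbackDecider {
    plain_size: usize,
    dict_encoded_size: usize,
    values_seen: usize,
    min_values: usize,
}

impl FallbackDecider {
    pub fn new(min_values: usize) -> Self {
        Self { plain_size: 0, dict_encoded_size: 0, values_seen: 0, min_values }
    }

    /// Record one batch of values along with its two size estimates.
    pub fn observe(&mut self, values: usize, plain: usize, dict_encoded: usize) {
        self.values_seen += values;
        self.plain_size += plain;
        self.dict_encoded_size += dict_encoded;
    }

    /// `Some(true)` = fall back to plain, `Some(false)` = keep the
    /// dictionary, `None` = not enough samples yet to decide.
    pub fn should_fall_back(&self) -> Option<bool> {
        if self.values_seen < self.min_values {
            return None;
        }
        Some(self.dict_encoded_size >= self.plain_size)
    }
}
```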
Rename to plain_encoded_data_size as suggested in review.
Modify the compression fallback test to illustrate the benefit in an admittedly differently contrived case. The heuristic borrowed from parquet-java is still not ideal for all cases, so we'll need more configurability.
Which issue does this PR close?
What changes are included in this PR?
Added `ColumnProperties` option `dictionary_fallback`, getting a `DictionaryFallback` enum value. Two behavior variants are provided (initially, the enum is non-exhaustive to allow more to be added later if necessary):

- `OnPageSizeLimit` - the prior behavior and the default; triggers fallback on exceeding the dictionary page size limit.
- `OnUnfavorableCompression` - a new behavior; includes the page size limit check and adds a check for the encoded size not exceeding the plain data size.

Implemented the new optional behavior in the encoder.
Are these changes tested?
Added new tests exercising the `OnUnfavorableCompression` behavior. The existing tests exercise `OnPageSizeLimit`.

Are there any user-facing changes?
Added API in parquet:

- `DictionaryFallback` enum
- `ColumnProperties::dictionary_fallback`, `ColumnProperties::set_dictionary_fallback`
- `WriterPropertiesBuilder::set_dictionary_fallback`, `WriterPropertiesBuilder::set_column_dictionary_fallback`
- `WriterProperties::dictionary_fallback`
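The intended usage could look like the sketch below. The method names come from the API list above, but all the types here are self-contained stubs standing in for the real ones in the `parquet` crate, so the ergonomics can be shown without depending on the unmerged PR.

```rust
// Stub types mirroring the API surface this PR proposes to add.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
#[non_exhaustive]
pub enum DictionaryFallback {
    /// The prior behavior and the default.
    #[default]
    OnPageSizeLimit,
    /// Also fall back when dictionary encoding does not beat plain encoding.
    OnUnfavorableCompression,
}

#[derive(Default)]
pub struct WriterPropertiesBuilder {
    dictionary_fallback: DictionaryFallback,
}

impl WriterPropertiesBuilder {
    pub fn set_dictionary_fallback(mut self, fallback: DictionaryFallback) -> Self {
        self.dictionary_fallback = fallback;
        self
    }

    pub fn build(self) -> WriterProperties {
        WriterProperties { dictionary_fallback: self.dictionary_fallback }
    }
}

pub struct WriterProperties {
    dictionary_fallback: DictionaryFallback,
}

impl WriterProperties {
    pub fn dictionary_fallback(&self) -> DictionaryFallback {
        self.dictionary_fallback
    }
}
```

A writer opting into the new heuristic would then do something like `WriterPropertiesBuilder::default().set_dictionary_fallback(DictionaryFallback::OnUnfavorableCompression).build()`.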