feat(parquet): dictionary fallback heuristic based on compression efficiency #9700
mzabaluev wants to merge 9 commits into apache:main
Conversation
In parquet, add a `dictionary_fallback` option to `ColumnProperties`. Its value type is defined as the `DictionaryFallback` enum with the variants `OnDataPageSize` (the previous behavior, the default) and `OnUnfavorableCompression`. The latter replicates the behavior of parquet-java, which falls back to non-dictionary encoding when the estimated compressed size of the data page with dictionary encoding is not smaller than the estimated size of the data page with plain encoding.
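The option and the decision it enables can be sketched as follows. This is a simplified illustration, not the PR's actual implementation; the enum variant names follow the description above, while `should_fall_back` and its parameters are hypothetical.

```rust
// Hypothetical sketch of the new property and the fallback check it enables.
#[derive(Clone, Copy, Debug, PartialEq)]
#[non_exhaustive]
pub enum DictionaryFallback {
    /// Fall back once the dictionary page exceeds the configured size limit
    /// (the previous behavior, and the default).
    OnDataPageSize,
    /// Additionally fall back when dictionary encoding is estimated not to
    /// compress better than plain encoding (parquet-java's behavior).
    OnUnfavorableCompression,
}

/// Decide whether to abandon dictionary encoding for a column chunk.
/// All sizes are estimates accumulated by the encoder.
pub fn should_fall_back(
    mode: DictionaryFallback,
    dict_page_size: usize,
    dict_page_limit: usize,
    estimated_dict_encoded_size: usize,
    estimated_plain_size: usize,
) -> bool {
    match mode {
        DictionaryFallback::OnDataPageSize => dict_page_size > dict_page_limit,
        DictionaryFallback::OnUnfavorableCompression => {
            dict_page_size > dict_page_limit
                // parquet-java falls back when the dictionary-encoded estimate
                // is not smaller than the plain-encoded estimate
                || estimated_dict_encoded_size >= estimated_plain_size
        }
    }
}
```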
The method as it is used now does not need to return the tuple of base size and number of elements; instead, it can just compute the encoded size as appropriate for the type.
It's not really necessary because the counter is only used while dictionary encoding is enabled, but it's good to safeguard against future refactoring.
Thanks @mzabaluev, I was unaware of this parquet-java behavior. I wonder, however, if even the parquet-java version needs to be updated. I looked at the Java code, and it has been around for quite some time (it was added in late 2014). At that time, I believe the default page size was on the order of a megabyte, so using this heuristic after a single page was probably not a bad idea. However, when the page indexes were added, parquet-java was modified to limit pages to 20000 rows by default (this crate adopted the 20k default quite some time later).

IMO, 20000 values is too small a sample to decide if a dictionary is having a beneficial effect. Say one has a relatively low cardinality (32k) i64 column with a somewhat random distribution. After encoding one 20k-row page, I think the heuristic here will almost certainly choose plain over dictionary, but if one were to encode 10 pages, dictionary would then be seen to be superior by far. I like that this is opt-in, but then wonder: if a user knows this heuristic will be helpful (i.e. they know it's a high cardinality column), could they not instead simply disable dictionary encoding for the column in question?
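The sample-size concern can be illustrated with back-of-the-envelope arithmetic for the scenario above: an i64 column with 32 Ki distinct values and 20000-row pages. The numbers are simplified estimates (15-bit indices for a 32 Ki dictionary, ignoring RLE runs and page headers), assumed for illustration only.

```rust
/// Rough (plain, dictionary) encoded-size estimates after `pages` pages of
/// 20000 i64 values drawn from 32 Ki distinct values.
fn sizes(pages: usize) -> (usize, usize) {
    let rows = 20_000 * pages;
    let plain = rows * 8; // 8 bytes per plain-encoded i64
    // Dictionary page stores each distinct value once; indices need
    // ceil(log2(32768)) = 15 bits per value.
    let dict = 32 * 1024 * 8 + rows * 15 / 8;
    (plain, dict)
}

fn main() {
    for pages in [1, 10] {
        let (plain, dict) = sizes(pages);
        println!("{pages} page(s): plain = {plain} B, dict = {dict} B");
    }
}
```

After one page the dictionary side (~300 KB including the 256 KB dictionary page) loses to plain (~160 KB), so a per-page heuristic falls back; over ten pages the dictionary amortizes and wins by more than 2x.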
```diff
-        (std::mem::size_of::<u32>(), self.len())
+    fn dict_encoding_size(&self) -> usize {
+        4 + self.len()
+    }
```
It's a u32, so it will never be different from 4.
Yes, so you should use `size_of`, which makes it clear why the value is 4.
To be fair, there's magic all over this module. But I agree with @EmilyMatt there's no need to add more.
parquet/src/data_type.rs
Outdated
```rust
    }

    fn dict_encoding_size(&self) -> usize {
        12
    }
```
This is a type that encodes to 96 bits. The compound Rust type is furthermore not declared with a repr that nails down its in-memory size (the compiler might decide to align such small arrays to 16 bytes one day and jack up the size accordingly, for example) so I'd argue using size_of would not be squeaky-clean here.
I understand, so a const would be enough, just not a magic number.
OK, the array alignment is nailed down in the spec, but Int96 wrapping the array is a struct with a Rust repr, so... 🤷 Could be a case for declaring the struct with repr(packed), but I'd rather argue we use explicit numbers here because the encoding is not a straight memory copy.
Good suggestion about the const, I've added Int96::SIZE_IN_BYTES for this purpose.
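The outcome of this thread could look like the sketch below. The constant name `Int96::SIZE_IN_BYTES` is from the comment above; the surrounding struct is a stand-in for the real definition in parquet/src/data_type.rs.

```rust
/// Stand-in for parquet's Int96 type: three little-endian u32 words,
/// per the Parquet INT96 physical type.
pub struct Int96 {
    pub value: [u32; 3],
}

impl Int96 {
    /// Encoded width of an INT96 value. Deliberately an explicit constant
    /// rather than `size_of::<Int96>()`: the struct has the default Rust
    /// repr, so its in-memory size is not guaranteed to match the encoded
    /// size, and the encoding is not a straight memory copy anyway.
    pub const SIZE_IN_BYTES: usize = 12;
}
```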
```rust
        Ok(values_read)
    }

    fn dict_encoding_size(&self) -> usize {
```
Hmm, it's not only that using size_of would be dubious, the encoding actually uses one bit per value, so this method is a leaky abstraction. I'll see about reworking it.
The dictionary encoding is never used on bool types, so this just needs some clarifying comments.
... or, better, a panic because this method should never be called for BoolType: the column encoder does a check to fall back to plain/RLE in this case.
Yeah, this probably needs an `unreachable!()`.
Though I believe a main principle is that this project should sit really low in the dependency graph, which argues against panics. Maybe a const here would be fine as well, or 0, but I fear that could cause a never-ending block/row-group if this assumption ever changes (for example, when encoding an all-true group or something).
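The backstop discussed here might look like the following sketch, written against a stand-in trait; the real method lives on parquet's data type implementations, and `DictEncodingSize` is invented for illustration.

```rust
/// Stand-in for the trait that carries `dict_encoding_size` in parquet.
trait DictEncodingSize {
    fn dict_encoding_size(&self) -> usize;
}

/// Stand-in for parquet's BoolType.
struct BoolType;

impl DictEncodingSize for BoolType {
    fn dict_encoding_size(&self) -> usize {
        // The column writer falls back to plain/RLE for BOOLEAN before a
        // dictionary is ever built, so any value returned here would be
        // bogus for this method's contract; panic loudly instead of
        // silently reporting a wrong size.
        unreachable!("dictionary encoding is not used for BOOLEAN columns")
    }
}
```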
Good points @etseidl. Our motivation for adding this is that in some cases, e.g. with high cardinality, the Rust parquet writer produces much larger encoded Parquet than the Spark workloads we're aiming to replace. So using a default option that enables a heuristic akin to the one hardcoded into parquet-java would get us on par (or maybe better, because this implementation may choose to fall back at any page chunk in the dataframe).
I wonder if this could be adapted to take a larger sample before deciding. Maybe raise the number of values per page if enabled.
In dict_encoded_size, use size_of and symbolic constants instead of hardcoded values. For the boolean type, backstop with a panic since any returned value would be bogus for this method's contract (as plain encoding for BOOLEAN is bit-packed). The dictionary encoding is never used for boolean values.
Can this be effected with other existing properties by an educated programmer? There may be adverse effects to tweaking other defaults depending on this. If you mean I'd rather have the option to choose between:
For 3, I have purposefully left the
etseidl left a comment
I sympathize with wanting to be a bit smarter about when to give up on dictionary encoding. I would, however, like to see something a bit more defensible before proceeding with this change. For example, I'd like to see examples where this heuristic outperforms the current defaults by more than 10%, say, and also outperforms disabling dictionary encoding altogether (something which is already an opt-in option, as this new heuristic would be).
```rust
    #[test]
    fn test_dict_page_size_decided_by_compression_fallback() {
```
As a test, I saved the output from this and examined the sizing. Without the heuristic, the encoded size for col0 is 8658384 bytes (the default fallback mechanism kicked in after 7 pages). With the heuristic, col1 is 8391126 bytes, a savings of 3%.
I also modified the test to mod the index with 32767. In that instance, col1 was still 8391126 bytes, but col0 was only 2231581, nearly 4X smaller.
I know this is not entirely representative, but it does again point out the pitfalls of too simplistic an approach.
Edit: I did a test of spark with the latter file (32k cardinality). By default, it opts to fallback for all pages, so the file is even larger. If I modify the global parquet.page.row.count.limit to 132000, it then opts for dictionary encoding as it should.
I have modified the test in 1b6dd37 to demonstrate a case where even an early fallback decision yields about 12% better compression. But I generally agree with your assessment, so more work is needed.
Another quirk is seen in this test: a dictionary page is still flushed to encode the first data page, even though there is no benefit. Parquet-java takes care to hand over the accumulated values to the plain encoder to be re-encoded.
```rust
/// that the writer should fall back to the plain encoding when at a certain point,
/// e.g. after encoding the first batch, the total size of unencoded data
/// is calculated as smaller than `(encodedSize + dictionarySize)`.
pub struct PlainDataSizeCounter {
```
To my comment about sample size, perhaps this can be adapted to keep track of both the plain and dictionary encoded sizes, and then only make a decision after some critical number of rows or bytes have been processed.
I can add a value governing this (as the minimum number of rows) directly into the `OnUnfavorableCompression` variant. Or perhaps add another property alongside it for future extensibility, since it's not clear that everyone would want it to be rows; there may be a preference for a byte size threshold later, and I don't want to introduce a nested open-ended property struct.
Perhaps just count values entered, to cover lists as well. I think as long as there are a sufficient number of samples with which to make a decision, this could be quite nice in the end.
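The adaptation suggested in this thread could be sketched like this: accumulate both size estimates and defer the verdict until a minimum number of values has been seen. The type and field names here are hypothetical, not part of the PR.

```rust
/// Hypothetical sketch: defers the dictionary-vs-plain decision until a
/// minimum number of values has been observed.
pub struct FallbackDecider {
    plain_size: usize,
    dict_encoded_size: usize,
    values_seen: usize,
    min_values: usize,
}

impl FallbackDecider {
    pub fn new(min_values: usize) -> Self {
        Self { plain_size: 0, dict_encoded_size: 0, values_seen: 0, min_values }
    }

    /// Record one batch of values along with its two size estimates.
    pub fn observe(&mut self, values: usize, plain: usize, dict_encoded: usize) {
        self.values_seen += values;
        self.plain_size += plain;
        self.dict_encoded_size += dict_encoded;
    }

    /// `Some(true)` = fall back to plain, `Some(false)` = keep the
    /// dictionary, `None` = not enough samples yet to decide.
    pub fn should_fall_back(&self) -> Option<bool> {
        if self.values_seen < self.min_values {
            return None;
        }
        Some(self.dict_encoded_size >= self.plain_size)
    }
}
```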
Rename to plain_encoded_data_size as suggested in review.
Modify the compression fallback test to illustrate the benefit in an admittedly differently contrived case. The heuristic borrowed from parquet-java is still not ideal for all cases, so we'll need more configurability.
Which issue does this PR close?
What changes are included in this PR?
Added `ColumnProperties` option `dictionary_fallback`, getting a `DictionaryFallback` enum value. Two behavior variants are provided (initially, the enum is non-exhaustive to allow more to be added later if necessary):

- `OnPageSizeLimit` - the prior behavior and the default; triggers fallback on exceeding the dictionary page size limit.
- `OnUnfavorableCompression` - a new behavior; includes the page size limit check and adds a check for the encoded size not exceeding the plain data size.

Implemented the new optional behavior in the encoder.
Are these changes tested?
Added new tests exercising the `OnUnfavorableCompression` behavior. The existing tests exercise `OnPageSizeLimit`.

Are there any user-facing changes?
Added API in parquet:

- `DictionaryFallback` enum
- `ColumnProperties::dictionary_fallback`, `ColumnProperties::set_dictionary_fallback`
- `WriterPropertiesBuilder::set_dictionary_fallback`, `WriterPropertiesBuilder::set_column_dictionary_fallback`
- `WriterProperties::dictionary_fallback`
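The intended usage could look like the sketch below. The method names come from the API list above, but all the types here are self-contained stubs standing in for the real ones in the `parquet` crate, so the ergonomics can be shown without depending on the unmerged PR.

```rust
// Stub types mirroring the API surface this PR proposes to add.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
#[non_exhaustive]
pub enum DictionaryFallback {
    /// The prior behavior and the default.
    #[default]
    OnPageSizeLimit,
    /// Also fall back when dictionary encoding does not beat plain encoding.
    OnUnfavorableCompression,
}

#[derive(Default)]
pub struct WriterPropertiesBuilder {
    dictionary_fallback: DictionaryFallback,
}

impl WriterPropertiesBuilder {
    pub fn set_dictionary_fallback(mut self, fallback: DictionaryFallback) -> Self {
        self.dictionary_fallback = fallback;
        self
    }

    pub fn build(self) -> WriterProperties {
        WriterProperties { dictionary_fallback: self.dictionary_fallback }
    }
}

pub struct WriterProperties {
    dictionary_fallback: DictionaryFallback,
}

impl WriterProperties {
    pub fn dictionary_fallback(&self) -> DictionaryFallback {
        self.dictionary_fallback
    }
}
```

A writer opting into the new heuristic would then do something like `WriterPropertiesBuilder::default().set_dictionary_fallback(DictionaryFallback::OnUnfavorableCompression).build()`.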