@efredine
Contributor

Which issue does this PR close?

Closes #11145.

Rationale for this change

What changes are included in this PR?

Modifies create_max_min_accs to instantiate accumulators for the unpacked data type, i.e. the value DataType of the Dictionary, rather than for the Dictionary type itself.

This bug is very similar to a previous bug that affected the Min/Max aggregate functions (#1235).

In addition, the min_max_aggregate_data_type fn is copied from

// Min/max aggregation can take Dictionary encode input but always produces unpacked
// (aka non Dictionary) output. We need to adjust the output data type to reflect this.
// The reason min/max aggregate produces unpacked output because there is only one
// min/max value per group; there is no needs to keep them Dictionary encode
fn min_max_aggregate_data_type(input_type: DataType) -> DataType {
    if let DataType::Dictionary(_, value_type) = input_type {
        *value_type
    } else {
        input_type
    }
}
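
To make the helper's behaviour concrete, here is a minimal, self-contained sketch. It assumes a simplified stand-in for the Arrow DataType enum (only three variants, defined here purely for illustration), so it is not the real arrow type, but the unpacking logic has the same shape as the function quoted above.

```rust
// Simplified stand-in for arrow's DataType, for illustration only.
// The real enum has many more variants.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    // Dictionary(key type, value type)
    Dictionary(Box<DataType>, Box<DataType>),
}

// Same shape as the helper quoted above: a Dictionary type is replaced
// by its value type, and every other type is returned unchanged.
fn min_max_aggregate_data_type(input_type: DataType) -> DataType {
    if let DataType::Dictionary(_, value_type) = input_type {
        *value_type
    } else {
        input_type
    }
}
```

With this stand-in, Dictionary(Int32, Utf8) maps to Utf8, while a plain Int32 or Utf8 maps to itself, so an accumulator created from the result never sees a Dictionary type.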

I'm unsure whether copying the function is the right way to avoid coupling between the crates, or whether it should be moved to a core crate. It also seems that the dedicated min and max functions are going to be refactored into user-defined functions?

Are these changes tested?

Yes - a new test has been added.

Are there any user-facing changes?

The column statistics are currently returned as an unpacked type. So for DataType::Dictionary(Int32, Utf8), the min or max value is returned as Exact(Utf8("a")). Would it be better to return it as Exact(Dictionary(Int32, Utf8("a")))? I'm unsure what the previous implementation would have returned and whether it was correct. If it previously returned a Dictionary type, returning an unpacked type could be a breaking change.
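
As a rough illustration of what an "unpacked" min/max means, the sketch below computes the two values over a toy dictionary-encoded string column. DictColumn and min_max_unpacked are hypothetical names invented for this example. This is not how DataFusion reads Parquet metadata; it only shows that the result is a plain string value rather than a dictionary key or a Dictionary-typed scalar.

```rust
// Hypothetical, simplified dictionary-encoded string column:
// `keys[i]` indexes into `values`; `None` marks a null row.
struct DictColumn {
    keys: Vec<Option<usize>>,
    values: Vec<String>,
}

// Returns the unpacked (min, max) over the non-null rows,
// or None when every row is null.
fn min_max_unpacked(col: &DictColumn) -> Option<(String, String)> {
    // Resolve each non-null key to the string it refers to.
    let mut rows = col.keys.iter().flatten().map(|&k| col.values[k].as_str());
    let first = rows.next()?;
    let (min, max) = rows.fold((first, first), |(lo, hi), v| (lo.min(v), hi.max(v)));
    Some((min.to_string(), max.to_string()))
}
```

The returned pair plays the role of Exact(Utf8(..)) in the description above: the caller gets the string itself and does not need the dictionary to interpret it.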

Happy to change the implementation to return a Dictionary type. I'm just unsure which offers the best experience.

@github-actions github-actions bot added the core Core DataFusion crate label Jun 28, 2024
Contributor

@alamb alamb left a comment

Thank you @efredine -- I think this PR fixes the bug 🙏

I left some comments about how to improve the tests -- let me know what you think. I think we could also improve the tests in a follow-on PR.

Contributor

@appletreeisyellow appletreeisyellow left a comment

Thank you for fixing it! @efredine 💯

I left comments on the test to avoid confusion.

assert_eq!(c_dic_stats.null_count, Precision::Exact(0));
assert_eq!(
    c_dic_stats.max_value,
    Precision::Exact(Utf8(Some("c".into())))

Suggested change
- Precision::Exact(Utf8(Some("c".into())))
+ Precision::Exact(Utf8(Some("d".into())))

To avoid any confusion: with the new dictionary keys, the max value is "d".

@efredine
Contributor Author

Thanks @alamb @appletreeisyellow - I should have reviewed the tests more closely - thanks for the feedback. I will make the adjustments now.

Contributor

@alamb alamb left a comment

Thanks again @efredine and @appletreeisyellow

@alamb alamb merged commit 61ba655 into apache:main Jun 30, 2024
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…11169)

* fix: Support dictionary type in parquet metadata statistics.

* Simplify tests.

---------

Co-authored-by: Eric Fredine <eric.fredine@beanworks.com>

Development

Successfully merging this pull request may close these issues.

Support Dictionary in Parquet Metadata Statistics