Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Materialize dictionaries in group keys #8291

Merged

Conversation

qrilka
Copy link
Contributor

@qrilka qrilka commented Nov 21, 2023

Which issue does this PR close?

Closes #7647.

Rationale for this change

From the ticket:
Given that group keys inherently have few repeated values, especially when grouping on a single column, the use of dictionary encoding is unlikely to be yielding significant returns

What changes are included in this PR?

Now AggregateExec includes original_schema which allows us to differentiate between original schema with dictionaries in place and schema where we materialize them as their value types.

Are these changes tested?

Tested locally with cargo test

Are there any user-facing changes?

The main user-facing change will be in different type returned for group keys using dictionaries.

@github-actions github-actions bot added the core Core DataFusion crate label Nov 21, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @qrilka !

Can we please also add a test for this change too (via an explain plan that shows the schema?)

https://github.com/apache/arrow-datafusion/blob/3dbda1e2cfdbfb974c268887294e6cf3de350f71/datafusion/sqllogictest/test_files/aggregate.slt#L2280

Maybe we could add a way to show schema in DisplayableExecutionPlan 🤔

/// returns schema with dictionary group keys materialized as their value types
/// The actual convertion happens in `RowConverter` and we don't do unnecessary
/// conversion back into dictionaries
fn materialize_dict_group_keys(schema: &Schema, group_count: usize) -> Schema {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GroupValues is a trait and there are two implementations of it -- GroupValueRows and GroupValuesPrimitive

This conversion only applies to GroupValueRows (though GroupValuesPrimitive won't be used for input dictionary columns anyways)

I wonder if there is some way to move the conversion into GroupValueRows so it is clearer from the context when this is needed or not

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take a look

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb I do understand your point and it saw this when creating my PR but I don't quite see a good way around this: the problem is that GroupValues get's created only during execution in AggregateExec::execute_typed but the correct ("materialized") schema looks to be needed in AggregateExec itself, i.e. already at the planning stage (as a part of impl ExecutionPlan for AggregateExec).
Maybe I miss some option here?

@qrilka
Copy link
Contributor Author

qrilka commented Nov 23, 2023

@alamb I think the following test looks to be enough:

query T
select arrow_typeof(x_dict) from value_dict group by x_dict;
----
Int32
Int32
Int32
Int32
Int32

Adding schemas into EXPLAIN seems to be a bit of an overkill at least if we have only this case where it could be useful.

@alamb
Copy link
Contributor

alamb commented Nov 30, 2023

@alamb I think the following test looks to be enough:

query T
select arrow_typeof(x_dict) from value_dict group by x_dict;
----
Int32
Int32
Int32
Int32
Int32

Adding schemas into EXPLAIN seems to be a bit of an overkill at least if we have only this case where it could be useful.

Sorry for the late reply @qrilka -- what I was thinking is "if someone undid this code change, would any of the tests fail to know we broke it"

I think the answer is no. So that is what I was thinking about the test.

I am not sure if you are proposing a new test or saying an existing test is sufficient

Given that group keys inherently have few repeated values, especially
when grouping on a single column, the use of dictionary encoding is
unlikely to be yielding significant returns
@qrilka qrilka force-pushed the materialize-dictionaries-in-group-by-keys branch from 2b07114 to 2194cca Compare December 1, 2023 19:02
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Dec 1, 2023
@qrilka
Copy link
Contributor Author

qrilka commented Dec 1, 2023

@alamb I was proposing a new test (which is what is in my quote) now I added it into the PR. Please take a look

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @qrilka

@alamb alamb merged commit eb8aff7 into apache:main Dec 1, 2023
22 checks passed
appletreeisyellow pushed a commit to appletreeisyellow/datafusion that referenced this pull request Dec 14, 2023
Given that group keys inherently have few repeated values, especially
when grouping on a single column, the use of dictionary encoding is
unlikely to be yielding significant returns
@alamb
Copy link
Contributor

alamb commented Jan 3, 2024

This PR seems to have caused a regression: #8738

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Materialize Dictionaries in Group Keys
2 participants