Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deduplicate strings/binarys when building view types #6005

Merged
merged 4 commits into from
Jul 8, 2024

Conversation

XiangpengHao
Copy link
Contributor

Which issue does this PR close?

Closes #5910.

Rationale for this change

As discussed in #5910, when the array of strings have low number of unique values, it makes sense to deduplicate them while building it, especially when invoking gc

This pr implements an optional mechanism to deduplicate the strings. This is turned off by default, as it may slow down building array if the array is mostly unique.
However, the gc will by default enable it, so that gc generates an optimal layout.

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jul 4, 2024
@XiangpengHao XiangpengHao changed the title Deduplication strings/binarys when building view types Deduplicate strings/binarys when building view types Jul 4, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @XiangpengHao -- I think this PR looks great.

I think we need to extend the tests to cover 2 more cases (I left a comment) but otherwise it is ready to go

I left some suggestions that might improve the code, but I think they are optional

cc @alexwilcoxson-rel @ClSlaid and @ariesdevil

arrow-array/src/builder/generic_bytes_view_builder.rs Outdated Show resolved Hide resolved
arrow-array/src/builder/generic_bytes_view_builder.rs Outdated Show resolved Hide resolved
arrow-array/src/builder/generic_bytes_view_builder.rs Outdated Show resolved Hide resolved
arrow-array/src/builder/generic_bytes_view_builder.rs Outdated Show resolved Hide resolved
arrow-array/src/builder/generic_bytes_view_builder.rs Outdated Show resolved Hide resolved
XiangpengHao and others added 2 commits July 6, 2024 09:49
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@XiangpengHao XiangpengHao requested a review from alamb July 6, 2024 14:20
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @XiangpengHao 🙏


let mut builder = StringViewBuilder::new()
.with_deduplicate_strings()
.with_block_size(value_1.len() as u32 * 2); // so that we will have multiple buffers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

@alamb alamb merged commit 2b986df into apache:master Jul 8, 2024
25 checks passed
@XiangpengHao XiangpengHao deleted the deduplicate-view branch July 8, 2024 13:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider implementing some sort of deduplicate / intern functionality for StringView
2 participants