GH-33466: [Go][Parquet] Add support for Dictionary arrays to pqarrow #34342

Merged: 12 commits, Mar 3, 2023

Conversation

@zeroshade (Member) commented Feb 24, 2023

Rationale for this change

The Parquet package should handle dictionary array types properly, so that consumers can efficiently read and write dictionary-encoded arrays from and to dictionary-encoded Parquet files.

What changes are included in this PR?

Updates and fixes to allow Parquet read/write directly to/from dictionary arrays. Because it requires the Unique and Take compute functions, the dictionary handling requires go1.18+ just like the compute package does.

Updates the schema to handle dictionary types when storing the arrow schema. This also adds some new methods to the ColumnWriter interface and the BinaryRecordReader for handling Dictionaries.
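For readers unfamiliar with why dictionary handling leans on Unique and Take, here is a minimal conceptual sketch in plain Go. It is not the Arrow compute API and not the code in this PR: `uniqueWithIndices` and `take` are hypothetical helper names, and the sketch assumes only the general idea that a dictionary array is a list of distinct values plus one index per row. It uses generics, so it needs Go 1.18+.

```go
package main

import "fmt"

// uniqueWithIndices is a hypothetical helper (not an Arrow function).
// It returns the distinct values of in, in order of first appearance,
// together with the position of each input value in that list.
func uniqueWithIndices[T comparable](in []T) (dict []T, indices []int) {
	seen := make(map[T]int, len(in))
	indices = make([]int, len(in))
	for i, v := range in {
		idx, ok := seen[v]
		if !ok {
			// First time we see v: give it the next dictionary slot.
			idx = len(dict)
			seen[v] = idx
			dict = append(dict, v)
		}
		indices[i] = idx
	}
	return dict, indices
}

// take is a hypothetical helper (not an Arrow function).
// It gathers dict[idx] for every idx in indices, which expands a
// dictionary-encoded column back into its plain values.
func take[T any](dict []T, indices []int) []T {
	out := make([]T, len(indices))
	for i, idx := range indices {
		out[i] = dict[idx]
	}
	return out
}

func main() {
	values := []string{"apple", "pear", "apple", "fig", "pear"}

	dict, indices := uniqueWithIndices(values)
	fmt.Println(dict)    // [apple pear fig]
	fmt.Println(indices) // [0 1 0 2 1]

	// Round trip: decoding the indices restores the original values.
	fmt.Println(take(dict, indices)) // [apple pear apple fig pear]
}
```

The sketch only illustrates the encode/decode relationship; the real compute functions operate on Arrow arrays and also have to deal with nulls and memory allocation, which are left out here.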

Are these changes tested?

Yes, unit tests are added in the change.

@lidavidm (Member) left a comment

The Go-side changes look good to me, but I don't currently have the context for the Arrow-side changes.

Review thread on go/arrow/array/dictionary.go (resolved)

@lidavidm (Member) left a comment

Thanks. The arrow-go bits look good to me, but I unfortunately don't have the context for reviewing the Parquet bits right now...

github-actions bot added the "awaiting committer review" label Feb 28, 2023

@wjones127 (Member) left a comment

Mostly looked through the tests. Looks like you've covered the important cases there :) 👍

Review thread on go/parquet/pqarrow/encode_dictionary_test.go (outdated, resolved)

github-actions bot added the "awaiting review" and "awaiting merge" labels and removed the "awaiting committer review", "awaiting review", and "awaiting merge" labels Mar 1, 2023

@zeroshade merged commit afe5514 into apache:main Mar 3, 2023

@zeroshade deleted the parquet-write-dictionary branch March 3, 2023 22:05

ursabot commented Mar 4, 2023

Benchmark runs are scheduled for baseline = 73e2b56 and contender = afe5514. afe5514 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.06% ⬆️0.0%] test-mac-arm
[Finished ⬇️1.02% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.19% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] afe5514a ec2-t3-xlarge-us-east-2
[Finished] afe5514a test-mac-arm
[Finished] afe5514a ursa-i9-9960x
[Finished] afe5514a ursa-thinkcentre-m75q
[Finished] 73e2b561 ec2-t3-xlarge-us-east-2
[Failed] 73e2b561 test-mac-arm
[Finished] 73e2b561 ursa-i9-9960x
[Finished] 73e2b561 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Development

Successfully merging this pull request may close these issues.

[GO]: pqarrow (github.com/apache/arrow/go/v9/parquet/pqarrow) cannot handle arrow's DICTIONARY field
4 participants