
[GO]: pqarrow (github.com/apache/arrow/go/v9/parquet/pqarrow) cannot handle arrow's DICTIONARY field #33466

Closed
asfimport opened this issue Nov 9, 2022 · 1 comment · Fixed by #34342

Comments

@asfimport

Hey, Arrow Go Dev:
 
I was trying to save some Arrow tables out to Parquet files with the help of the "github.com/apache/arrow/go/v9/parquet/pqarrow" package. By the way, Arrow is generally a great design, and this is a great Go implementation.

 
However, one issue sticks out: my original Arrow table has some DICTIONARY fields, which pqarrow does NOT currently support.
 
I would assume supporting them is quite straightforward: just "denormalize" each DICTIONARY value into its underlying value (string, Timestamp, etc.), and leave it to Parquet to apply the right encoding (DICTIONARY encoding, etc.).
 
I could have done this conversion on the fly myself by converting each DICTIONARY field into its underlying values. However, the Arrow table schema is dynamic and outside my control, and I would need to iterate through the fields (possibly nested structs) to locate them, so it would be much better if pqarrow supported this natively.
 
Can anyone help? thanks!

Reporter: Kevin Yang
Assignee: Matthew Topol / @zeroshade

Note: This issue was originally created as ARROW-18288. Please see the migration documentation for further details.

@asfimport
Author

Matthew Topol / @zeroshade:
This isn't quite as straightforward as just denormalizing the values, because we need to ensure proper statistics handling and efficient propagation. (Sure, you could naively denormalize and then write, but that would cause unnecessary copies and other inefficiencies.) I've started working on this but ran into a couple of snags where I will need to use and enhance the compute package. I'll have a few things up for this soon as I work it out, since this is going to require:

  • Enabling proper casting from Dictionary types to values (unpacking dictionaries)
  • One of the following:
    • Implementing hash kernels for the Compute module to efficiently perform a unique operation on the dictionary indexes to find the min/max for stats
    • Implementing aggregation kernels so that MinMax can find the min/max on the dictionary array directly (more efficient than hashing for uniqueness, but longer and harder to implement).
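To make the first option concrete: since every row of a dictionary array is just an index into the value table, min/max statistics can be computed over the set of indices that actually occur, without materializing the full column. A stdlib-only sketch of that approach using a simplified representation (this is not the actual compute-package API):

```go
package main

import (
	"fmt"
	"sort"
)

// minMaxViaIndices finds min/max over only the dictionary entries that
// are actually referenced, touching each distinct index once instead of
// scanning a fully denormalized column.
func minMaxViaIndices(values []string, indices []int) (min, max string) {
	seen := make(map[int]struct{})
	for _, idx := range indices {
		seen[idx] = struct{}{} // the "unique" operation over the indices
	}
	used := make([]string, 0, len(seen))
	for idx := range seen {
		used = append(used, values[idx])
	}
	sort.Strings(used)
	return used[0], used[len(used)-1]
}

func main() {
	values := []string{"pear", "apple", "zebra"} // "zebra" is never referenced
	indices := []int{1, 0, 1, 1, 0}
	min, max := minMaxViaIndices(values, indices)
	fmt.Println(min, max) // prints apple pear
}
```

Note that unreferenced dictionary entries (like "zebra" above) must be excluded, which is why a plain min/max over the value table alone would give wrong statistics.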

zeroshade added a commit that referenced this issue Feb 14, 2023
### Rationale for this change

Implementing a kernel for computing the "unique" values in an arrow array, primarily for use in solving #33466. 

### What changes are included in this PR?
Adds a "unique" function to the compute list and helper convenience functions.

### Are these changes tested?
Yes, unit tests are included.

### Are there any user-facing changes?
Just the new available functions.

* Closes: #34171

Authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
gringasalpastor pushed a commit to gringasalpastor/arrow that referenced this issue Feb 17, 2023
zeroshade added a commit to zeroshade/arrow that referenced this issue Feb 24, 2023
zeroshade added a commit to zeroshade/arrow that referenced this issue Feb 24, 2023
fatemehp pushed a commit to fatemehp/arrow that referenced this issue Feb 24, 2023
zeroshade added a commit to zeroshade/arrow that referenced this issue Mar 2, 2023
zeroshade added a commit that referenced this issue Mar 3, 2023
…34342)

### Rationale for this change
The Parquet package should properly handle dictionary array types to allow consumers to efficiently read/write dictionary encoded arrays for Dictionary encoded parquet files.

### What changes are included in this PR?
Updates and fixes to allow Parquet read/write directly to/from dictionary arrays. Because it requires the `Unique` and `Take` compute functions, the dictionary handling requires go1.18+ just like the compute package does.

Updates the schema to handle dictionary types when storing the arrow schema. This also adds some new methods to the `ColumnWriter` interface and the `BinaryRecordReader` for handling Dictionaries. 

### Are these changes tested?
Yes, unit tests are added in the change.

* Closes: #33466

Lead-authored-by: Matt Topol <zotthewizard@gmail.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
@zeroshade zeroshade added this to the 12.0.0 milestone Mar 3, 2023