Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parquet: support setting the field_id with an ArrowWriter #4702

Closed
mhilton opened this issue Aug 16, 2023 · 3 comments · Fixed by #4710
Closed

parquet: support setting the field_id with an ArrowWriter #4702

mhilton opened this issue Aug 16, 2023 · 3 comments · Fixed by #4710
Assignees
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate parquet-derive

Comments

@mhilton
Copy link

mhilton commented Aug 16, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We would like to use the parquet files written from a set of arrow record batches as part of an apache-iceberg snapshot without modification. The apache-iceberg parquet specification requires that field-ids are present.

Describe the solution you'd like
The solution implemented by (at least) the go parquet package seems reasonable. This uses a metadata value with the key PARQUET:field_id to determine the field_id when converting an arrow schema into a parquet schema. If there is no such metadata entry then the field_id will not be present.

Describe alternatives you've considered
An alternative would be to add a mechanism to WriterProperties to specify the field_id to use with a column. This presumably would work in a similar manner to encoding.

Additional context
N/A

@mhilton mhilton added the enhancement Any new improvement worthy of a entry in the changelog label Aug 16, 2023
This was referenced Aug 16, 2023
@tustvold
Copy link
Contributor

Couple of notes from digging into this:

From https://iceberg.apache.org/spec/#column-projection:

Tables may also define a property schema.name-mapping.default with a JSON name mapping containing a list of field mapping objects. These mappings provide fallback field ids to be used when a data file does not contain field id information

So it would appear that field mappings are not strictly required to be present, this may be a way to avoid needing to rewrite data lacking such attributes

Additionally also from https://iceberg.apache.org/spec/#column-projection:

List types should contain a mapping in fields for element.
Map types should contain mappings in fields for key and value.

This would appear to suggest that iceberg only requires that field IDs are present for the bottom of the three-level list declaration

<list-repetition> group <name> (LIST) {
  repeated group list {
    <element-repetition> <element-type> element;
  }
}

I think the approach suggested in this PR is perfectly acceptable, as whilst it provides no mechanism to provide a field id for repeated group list, I suspect this is fine for most use-cases

@tustvold tustvold self-assigned this Aug 17, 2023
tustvold added a commit to tustvold/arrow-rs that referenced this issue Aug 17, 2023
@tustvold tustvold added the parquet Changes to the parquet crate label Aug 21, 2023
@tustvold
Copy link
Contributor

label_issue.py automatically added labels {'parquet'} from #4706

@tustvold
Copy link
Contributor

label_issue.py automatically added labels {'parquet-derive'} from #4706

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate parquet-derive
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants