parquet: support setting the field_id with an ArrowWriter #4702

mhilton · 2023-08-16T09:03:38Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We would like to use the parquet files written from a set of arrow record batches as part of an apache-iceberg snapshot without modification. The apache-iceberg parquet specification requires that field-ids are present.

Describe the solution you'd like
The solution implemented by (at least) the Go parquet package seems reasonable. It uses a metadata value with the key PARQUET:field_id to determine the field_id when converting an arrow schema into a parquet schema. If there is no such metadata entry, the field_id is not written.
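A minimal sketch of that lookup rule, using only the Rust standard library. `SketchField`, `field_id_of`, `describe`, `with_id` and `without_id` are hypothetical names introduced for illustration; they are not the arrow-rs or parquet API. Treating an unparseable value as "no field_id" is also an assumption; a real implementation might return an error instead.

```rust
use std::collections::HashMap;

/// Metadata key named in the proposal.
const FIELD_ID_KEY: &str = "PARQUET:field_id";

/// Hypothetical stand-in for an arrow field: a name plus string metadata.
struct SketchField {
    name: String,
    metadata: HashMap<String, String>,
}

/// Returns the field id when the metadata holds an integer under
/// FIELD_ID_KEY, and None when the key is absent or not an integer
/// (the second case is an assumption of this sketch).
fn field_id_of(field: &SketchField) -> Option<i32> {
    field.metadata.get(FIELD_ID_KEY)?.trim().parse::<i32>().ok()
}

/// Human-readable summary of what would be written for this field.
fn describe(field: &SketchField) -> String {
    match field_id_of(field) {
        Some(id) => format!("{} (field_id={})", field.name, id),
        None => format!("{} (no field_id)", field.name),
    }
}

/// Helper: build a field that carries a field id in its metadata.
fn with_id(name: &str, id: i32) -> SketchField {
    let mut metadata = HashMap::new();
    metadata.insert(FIELD_ID_KEY.to_string(), id.to_string());
    SketchField { name: name.to_string(), metadata }
}

/// Helper: build a field with no metadata at all.
fn without_id(name: &str) -> SketchField {
    SketchField { name: name.to_string(), metadata: HashMap::new() }
}
```

For example, `describe(&with_id("a", 1))` yields `a (field_id=1)`, while `describe(&without_id("b"))` yields `b (no field_id)`, mirroring the "not present" behaviour described above.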

Describe alternatives you've considered
An alternative would be to add a mechanism to WriterProperties to specify the field_id to use with a column. This presumably would work in a similar manner to encoding.

Additional context
N/A


tustvold · 2023-08-17T11:30:55Z

Couple of notes from digging into this:

From https://iceberg.apache.org/spec/#column-projection:

Tables may also define a property schema.name-mapping.default with a JSON name mapping containing a list of field mapping objects. These mappings provide fallback field ids to be used when a data file does not contain field id information

So it would appear that field IDs are not strictly required to be present in the data file. A name mapping may be a way to avoid needing to rewrite data lacking such attributes.
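The fallback described in the quoted passage can be sketched roughly as follows. This is a simplification under stated assumptions: the real schema.name-mapping.default property is a JSON document of field mapping objects, whereas here it is flattened into a plain name-to-id map, and `resolve_field_id` is a hypothetical helper rather than anything from Iceberg or arrow-rs.

```rust
use std::collections::HashMap;

/// Prefer the field id stored in the data file; if the file has none,
/// fall back to the (flattened, hypothetical) name mapping.
fn resolve_field_id(
    embedded: Option<i32>,
    column_name: &str,
    name_mapping: &HashMap<String, i32>,
) -> Option<i32> {
    embedded.or_else(|| name_mapping.get(column_name).copied())
}
```

A column with an id written in the file keeps that id; a column without one picks up the mapped id if its name is listed, and otherwise resolves to nothing.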

Also from https://iceberg.apache.org/spec/#column-projection:

List types should contain a mapping in fields for element.
Map types should contain mappings in fields for key and value.

This would appear to suggest that iceberg only requires field IDs to be present for the bottom (element) level of the three-level list declaration:

<list-repetition> group <name> (LIST) {
  repeated group list {
    <element-repetition> <element-type> element;
  }
}

I think the approach suggested in this PR is perfectly acceptable. Whilst it offers no mechanism to set a field id for the repeated group list, I suspect this is fine for most use-cases.

tustvold · 2023-08-21T15:06:36Z

label_issue.py automatically added labels {'parquet'} from #4706

tustvold · 2023-08-21T15:06:38Z

label_issue.py automatically added labels {'parquet-derive'} from #4706

mhilton added the enhancement Any new improvement worthy of a entry in the changelog label Aug 16, 2023

This was referenced Aug 16, 2023

Parquet Field IDs #3548

Closed

Cleanup parquet type builders #4706

Merged

tustvold self-assigned this Aug 17, 2023

tustvold added a commit to tustvold/arrow-rs that referenced this issue Aug 17, 2023

Support Field ID in ArrowWriter (apache#4702)

5c76c6a

tustvold mentioned this issue Aug 17, 2023

Support Field ID in ArrowWriter (#4702) #4710

Merged

tustvold closed this as completed in #4710 Aug 17, 2023

tustvold added a commit that referenced this issue Aug 17, 2023

Support Field ID in ArrowWriter (#4702) (#4710)

b810e8f

tustvold added the parquet Changes to the parquet crate label Aug 21, 2023

tustvold added the parquet-derive label Aug 21, 2023

Samrose-Ahmed mentioned this issue Sep 29, 2023

parquet: Field Ids are not read from a Parquet file without serialized arrow schema #4877

Closed
