[Python] Read ORC metadata #35304

Closed
Fokko opened this issue Apr 24, 2023 · 2 comments · Fixed by #35499

Comments

Fokko commented Apr 24, 2023

Describe the enhancement requested

When reading an ORC schema, the field-level metadata isn't exposed. For more information, see apache/iceberg#6973 (comment).

Component(s)

Python

jorisvandenbossche commented

In C++, there is an ORCFileReader::ReadMetadata(), which is exposed as the metadata property on the Python ORCReader. Does that already give the relevant information, or is this a different set of metadata? (I know that Arrow and Parquet both have file-level metadata as well as field-level metadata, so the same might be the case here.)

I see that the ORC C++ orc::Type has a getAttributeKeys, and the proto spec has an attributes field for a Type (https://github.com/apache/orc/blob/cd79029e41e9cfe434529f820e73900e6a72797e/proto/orc_proto.proto#L226). Is this the information that you are looking for?

wgtmac commented May 8, 2023

> In C++, there is an ORCFileReader::ReadMetadata(), which is exposed as the metadata property on the Python ORCReader. Does that already give the relevant information, or is this a different set of metadata? (I know that Arrow and Parquet both have file-level metadata as well as field-level metadata, so the same might be the case here.)
>
> I see that the ORC C++ orc::Type has a getAttributeKeys, and the proto spec has an attributes field for a Type (https://github.com/apache/orc/blob/cd79029e41e9cfe434529f820e73900e6a72797e/proto/orc_proto.proto#L226). Is this the information that you are looking for?

I think @Fokko is looking for a way to expose the type attributes from ORC. These attributes store the column field-id information defined by Apache Iceberg.

wgtmac added a commit to wgtmac/arrow that referenced this issue May 9, 2023
pitrou pushed a commit that referenced this issue May 11, 2023
### Rationale for this change

Apache ORC has a per-column attribute map, and Apache Iceberg depends on it to encode its field metadata. However, the C++ ORC adapter is not aware of these attributes, which makes it difficult to support this in pyarrow and pyiceberg.

### What changes are included in this PR?

Both the reader and the writer now convert ORC attributes from and to Arrow field metadata.

### Are these changes tested?

Added two test cases to make sure the ORC adapter preserves the attributes.

### Are there any user-facing changes?

No.
* Closes: #35304

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
pitrou added this to the 13.0.0 milestone May 11, 2023
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023