[Python] Expose ORC metadata() in Python ORCFile #17254

Closed
asfimport opened this issue Jul 2, 2020 · 8 comments

Comments

@asfimport

There is currently no way for a user to directly access the underlying ORC metadata of a given file. It seems the C++ functions and objects already exist, and only the plumbing to Cython/Python, plus potentially a few C++ shims, is missing. Giving users the ability to retrieve the metadata without first reading the entire file could help numerous applications improve their query performance by letting them determine which ORC stripes actually need to be read.

This would allow for something like:

import pyarrow as pa
import pyarrow.orc  # pa.orc is only available once the submodule is imported
orc_metadata = pa.orc.ORCFile(filename).metadata()
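
As a rough illustration of the stripe-selection idea described above, here is a minimal sketch in plain Python. It assumes per-stripe min/max statistics for one integer column have already been obtained somehow; `StripeStats` and `stripes_to_read` are hypothetical names invented for this example and are not part of pyarrow or the ORC library.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StripeStats:
    """Hypothetical per-stripe min/max for one column (not a pyarrow type)."""
    index: int
    minimum: Optional[int]
    maximum: Optional[int]


def stripes_to_read(stats: List[StripeStats], low: int, high: int) -> List[int]:
    """Return indices of stripes whose [minimum, maximum] may overlap [low, high]."""
    selected = []
    for s in stats:
        if s.minimum is None or s.maximum is None:
            # Missing statistics: the stripe cannot be ruled out, so keep it.
            selected.append(s.index)
        elif s.maximum >= low and s.minimum <= high:
            # Value ranges overlap, so the stripe may contain matching rows.
            selected.append(s.index)
    return selected


# Made-up statistics for four stripes, purely for illustration.
stats = [
    StripeStats(0, 1, 100),
    StripeStats(1, 101, 200),
    StripeStats(2, None, None),
    StripeStats(3, 201, 300),
]
print(stripes_to_read(stats, 150, 180))  # prints [1, 2]
```

The selected indices could then be read one at a time, for example with `ORCFile.read_stripe(n)`, instead of loading the whole file.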

Reporter: Jeremy Dyer
Assignee: Ian Alexander Joiner / @iajoiner

PRs and other links:

Note: This issue was originally created as ARROW-9299. Please see the migration documentation for further details.

@asfimport

Caleb Winston:
This would be very useful for our use-case in cuDF where we want to select stripes to read onto GPU based on statistics stored in the ORC metadata.

Edit: Didn't see who was posting this haha.

@asfimport

Caleb Winston:
[~jeremy.dyer] Is it possible to get the metadata using arrow-cpp, though? I'm seeing a private field [1] storing an ORC Reader, which could be used to get the metadata. There isn't a way to access this through the C++ API even though the metadata is in there, correct?

 [1] 

std::unique_ptr<liborc::Reader> reader_;

@asfimport

Jeremy Dyer:
[~calebwin] It is possible, but not currently visible, as you mentioned. I think the easiest thing to do would be to add a function in orc/adaptor.cc that does basically the same thing as is done here [1]. After that, it would be exposed so that Python could invoke it, I believe. I'm no expert here, but it seems like that would do the trick.

[1] 

std::list<std::string> keys = reader_->getMetadataKeys();
std::shared_ptr<KeyValueMetadata> metadata;
if (!keys.empty()) {
  metadata = std::make_shared<KeyValueMetadata>();
  for (auto it = keys.begin(); it != keys.end(); ++it) {
    metadata->Append(*it, reader_->getMetadataValue(*it));
  }
}
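
For readers more at home in Python, the logic of the C++ snippet above can be mirrored roughly as follows. This is only a sketch: `FakeOrcReader` is a hypothetical stand-in for the ORC reader (it is not a pyarrow or liborc class), and `read_user_metadata` is an invented name, not the function that was eventually added.

```python
from typing import Dict, List, Optional


class FakeOrcReader:
    """Hypothetical stand-in for the ORC reader; models only the two calls used above."""

    def __init__(self, user_metadata: Dict[str, str]):
        self._user_metadata = dict(user_metadata)

    def get_metadata_keys(self) -> List[str]:
        return list(self._user_metadata)

    def get_metadata_value(self, key: str) -> str:
        return self._user_metadata[key]


def read_user_metadata(reader: FakeOrcReader) -> Optional[Dict[str, str]]:
    """Collect all user key/value metadata, or None when the file has none."""
    keys = reader.get_metadata_keys()
    if not keys:
        # Mirrors the C++ snippet, which leaves `metadata` null in this case.
        return None
    return {key: reader.get_metadata_value(key) for key in keys}


reader = FakeOrcReader({"writer": "example", "version": "1"})
print(read_user_metadata(reader))
```

Note that this covers only user-supplied key/value metadata; stripe statistics and compression settings live elsewhere in the file footer and would need separate accessors.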

@asfimport

Ian Alexander Joiner / @iajoiner:
I will try to do it by Oct.

@asfimport

Ian Alexander Joiner / @iajoiner:
OK I'm working on this now.

@asfimport

Ian Alexander Joiner / @iajoiner:
[~calebwin] [~jeremy.dyer] One question: what kind of metadata do you want? User metadata, or things such as stripe statistics and compression kind?

@asfimport

Antoine Pitrou / @pitrou:
Issue resolved by pull request 10157
#10157

@asfimport asfimport added this to the 5.0.0 milestone Jan 11, 2023