RowGroupReader.get_row_iter() fails with Path ColumnPath not found #5064
Comments
Possibly a duplicate of #2394. Currently, repeated fields are only supported by the arrow interface or the lower-level ColumnReader.
That indeed looks related to #2394. Reading through the issue conversation, I also hit the error It sounds like there isn't enough volunteer time to expand
Doc/example contributions could also be fair game (I was also thinking about adding something around ObjectStore + Parquet async -- I had a hard time figuring this one out, but it actually works really well, and I think I understand that part well enough now to expand the docs).
I can't promise speedy reviews, but I can try to review PRs. I would ask, though, that if you plan on large feature work in this area, you file tickets first to get feedback on what you propose. That being said, I want to set expectations: these APIs will always be orders of magnitude slower than their columnar brethren. There is also a huge amount of subtlety to correctly handling nested schemas in all the weird forms they come in, which the current row readers don't even attempt to handle. If you want a hobby project to work on, I'm happy to help, but if you're looking to base an application around these APIs, I would strongly encourage just using the arrow interface.
Sounds fair. I am going to run tests on more datasets to see if the scenario can be unblocked through a targeted fix, and if not I'll resolve as Won't Fix. The call to
Apologies, it is on my list, but I've been a bit swamped recently, and I need some time to sit down and learn how that code works before I can review it effectively.
Describe the bug
I am trying to read a Parquet file generated by a Hadoop MapReduce job. The schema is a bit complex, with a minimal repro looking something like this:
Reading this schema fails with:
Path ColumnPath { parts: ["value1"] } not found
The error message is correct: value1 does not exist. It is off by one level and should be level1.value1.
Looking into the code, it seems like there is a double path.pop() happening in reader_tree().
To Reproduce
Minimal unit test reproing the issue:
Expected behavior
Reading the Parquet file succeeds.
Additional context
A fix may be to push back the value just popped so that it can be popped again right after. This 2-line change in reader_tree() got the test to pass. I am very new to that codebase but happy to send that as a Pull Request along with the test if that fix makes sense.
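To illustrate the kind of failure described above, here is a minimal, standard-library-only sketch of a tree traversal that shares one mutable path stack. It is a toy model, not the actual reader_tree() code: the Node type, the leaf_paths and demo functions, and the restore flag are all hypothetical names invented for this example. It only shows how an unbalanced pop makes a leaf resolve to value1 instead of level1.value1, and how pushing the popped segment back restores the full path.

```rust
/// Toy schema node; NOT the parquet crate's schema type.
enum Node {
    Leaf(&'static str),
    Group(&'static str, Vec<Node>),
}

/// Collects dotted leaf paths while sharing one mutable path stack.
/// With `restore == false`, the group arm pops its own segment and never
/// pushes it back, so the traversal's regular cleanup becomes a second pop.
fn leaf_paths(
    node: &Node,
    path: &mut Vec<&'static str>,
    out: &mut Vec<String>,
    restore: bool,
) {
    match node {
        Node::Leaf(name) => {
            path.push(*name);
            out.push(path.join("."));
            path.pop();
        }
        Node::Group(name, children) => {
            path.push(*name);
            // First pop: hypothetical special-case handling takes the segment off.
            let popped = path.pop();
            if restore {
                // The fix idea: push the segment back so children see the full path.
                if let Some(segment) = popped {
                    path.push(segment);
                }
            }
            for child in children {
                leaf_paths(child, path, out, restore);
            }
            // Second pop: the regular cleanup once the group has been visited.
            path.pop();
        }
    }
}

/// Convenience wrapper for the toy schema `level1 { value1, value2 }`.
fn demo(restore: bool) -> Vec<String> {
    let schema = Node::Group("level1", vec![Node::Leaf("value1"), Node::Leaf("value2")]);
    let mut out = Vec::new();
    leaf_paths(&schema, &mut Vec::new(), &mut out, restore);
    out
}
```

In this toy model, demo(false) yields the paths value1 and value2, which are off by one level, while demo(true) yields level1.value1 and level1.value2. Whether the real fix needs exactly this shape depends on the surrounding logic in reader_tree().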