Invalid point parquet file #2

Open
kylebarron opened this issue Jul 16, 2023 · 5 comments

@kylebarron
Member

Trying to load example-point-interleaved.parquet fails in both pyarrow and Rust.

pyarrow.parquet.read_table('example-point-interleaved.parquet') gives:

File ~/.pyenv/versions/3.9.16/lib/python3.9/site-packages/pyarrow/_dataset.pyx:546, in pyarrow._dataset.Dataset.to_table()

File ~/.pyenv/versions/3.9.16/lib/python3.9/site-packages/pyarrow/_dataset.pyx:3449, in pyarrow._dataset.Scanner.to_table()

File ~/.pyenv/versions/3.9.16/lib/python3.9/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File ~/.pyenv/versions/3.9.16/lib/python3.9/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()

ArrowInvalid: Expected all lists to be of size=2 but index 3 had size=0

Rust (arrow2/parquet2) gives:

thread 'array::point::array::test::parse_wkb_geoarrow_interleaved_example' panicked at 'called `Result::unwrap()` on an `Err` value: OutOfSpec("validity mask length must be equal to the number of values divided by size")', /Users/kyle/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow2-0.17.2/src/array/fixed_size_list/mod.rs:80:52

@paleolimbot
Contributor

I think that is a long-standing issue with the fixed-size list implementation in Parquet ( apache/arrow#35692 , apache/arrow#24425 ), at least on the Arrow C++ side. Practically it means you can't read NULL points from a Parquet file if you use the interleaved representation (although you can write them no problem).
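
To make the mismatch concrete, here is a minimal pure-Python sketch of why a null point trips a fixed-size check. It is a simplified model, not how Arrow C++ or arrow2 actually read Parquet: the function names (`arrow_interleaved`, `parquet_like`, `check_fixed_size`) and the sample coordinates are made up for illustration. The assumption is that Arrow's fixed-size list layout reserves 2 values for every slot, including null ones, while a null list on the Parquet side carries no leaf values, so a reader that does not re-insert padding sees a list of size 0.

```python
from typing import List, Optional, Sequence, Tuple

Point = Optional[Tuple[float, float]]


def arrow_interleaved(points: Sequence[Point]) -> Tuple[List[float], List[bool]]:
    """Arrow-style fixed-size list: each slot, null or not, owns 2 values."""
    values: List[float] = []
    validity: List[bool] = []
    for p in points:
        validity.append(p is not None)
        # A null slot still takes up space; the padding contents are arbitrary.
        values.extend(p if p is not None else (0.0, 0.0))
    return values, validity


def parquet_like(points: Sequence[Point]) -> List[List[float]]:
    """Simplified Parquet-style view: a null list carries no leaf values."""
    return [list(p) if p is not None else [] for p in points]


def check_fixed_size(lists: Sequence[Sequence[float]], size: int = 2) -> None:
    """Naive reader check that rejects any list whose length is not `size`."""
    for i, item in enumerate(lists):
        if len(item) != size:
            raise ValueError(
                f"Expected all lists to be of size={size} "
                f"but index {i} had size={len(item)}"
            )


# Hypothetical sample data with a null point at index 3.
points: List[Point] = [(30.0, 10.0), (40.0, 40.0), (20.0, 40.0), None, (10.0, 20.0)]

values, validity = arrow_interleaved(points)
# The Arrow-side invariant holds: len(values) == 2 * len(validity).

try:
    check_fixed_size(parquet_like(points))
    message = ""
except ValueError as err:
    message = str(err)  # the null at index 3 shows up as a zero-length list
```

In this toy model the failure has the same shape as the pyarrow error above (a size-0 list at the null's index), and the arrow2 message describes the same invariant from the other side: the number of values must equal the validity length times the list size.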

@paleolimbot
Contributor

For testing purposes I should probably render all the example files to Arrow IPC as well since it's unlikely any fix to that will be widely available in the next few months.

@kylebarron
Member Author

IMO saving as IPC makes the most sense, since this is nominally test data for geoarrow, not geoparquet. Also, IPC can exactly mirror every Arrow type, whereas unions won't be representable in Parquet in the future, right?

@paleolimbot
Contributor

I think Parquet can model a sparse union as a struct but I don't know if that's something that is useful or not. In any case, rendering those examples to IPC is the best fit since, as you noted, it can perfectly represent an Arrow type.

@paleolimbot
Contributor

I forgot to create a branch + PR (😬 ) but they should all be there in IPC format! (e.g., https://github.com/geoarrow/geoarrow-data/blob/main/example/README.md )
