BUG: fix support to read parquet files with list columns #597
…lways create them on-the-fly
jorisvandenbossche
left a comment
Thanks for looking into this!
I also commented on this inline, but personally I would just leave this as is and not try to reconcile it exactly (in the end the issue here is that pandas does not have a proper list type, and once it has one, this will change anyway).
Similarly here, I think we can just accept this as a difference from the pyarrow->pandas conversion.
So it might be related just to the version of libgdal-core, and not to this PR or the fact that libgdal-arrow-parquet was added? The last test run on main was still using libgdal 3.11.
I started the tests on main manually, and they use libgdal 3.12 now, but they passed. I also tried adding libgdal-arrow-parquet in main (#599), but this still didn't fail. In this PR I tried moving the pyarrow imports around in different ways... but it keeps failing. If I comment out the new tests it stops failing, but if they are enabled it breaks... I moved them down the tests_... file so they are, I suppose, executed after the breaking test, but that doesn't help either. It seems to be flaky in general :-(, so I wonder how it behaves "in the wild" in real code...
Managed to reproduce it locally with the test suite, taking a look.
…-parquet-list-columns
Thanks @theroggy!
In PR #556, support for list-type columns was added, with tests for .geojson files. However, GDAL apparently returns/treats list columns in .parquet files differently from list columns in .geojson files. This PR handles .parquet files correctly as well and adds tests for this case.
Remarks:
- … ndarrays rather than Python lists, which is a small difference compared to .geojson files. As discussed below, we keep this behaviour.
- … `use_arrow` or not gives differences as well, e.g. in `None` being returned versus `np.nan`.
- … `use_arrow=False` for now, as in this case the columns are flattened and it's not clear how we want to deal with this. To be further discussed/followed up in "BUG: reading file with JSON ogr subtype is broken with use_arrow=True" (#592).
- … `None` values in the list, these `None` values are returned as `0` or `""` when read with `use_arrow=False`, which is incorrect. With `use_arrow=True`, these `None` values are returned as `np.nan`, which is fine. This has been reported here: "Parquet: list field types with None values in the list give issues" (OSGeo/gdal#13448).

reference #592
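
The differences noted in the remarks (ndarrays versus Python lists, `None` versus `np.nan`) can get in the way when comparing results across file formats or `use_arrow` settings, for example in a test. Below is a minimal sketch of one way to normalize a list-column cell before comparing. The helper name `normalize_list_value` and the sample cells are hypothetical and not part of pyogrio; the sketch only relies on ndarray-like objects exposing `tolist()`.

```python
import math


def normalize_list_value(value):
    """Return a list-column cell as a plain Python list, using None for missing.

    Hypothetical helper for illustration only (not a pyogrio API).
    """
    # A missing cell may show up as None or as a float NaN.
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    # ndarray-like objects expose tolist(); plain sequences are copied as-is.
    items = value.tolist() if hasattr(value, "tolist") else list(value)
    # Map NaN items to None so both representations compare equal.
    return [
        None if isinstance(item, float) and math.isnan(item) else item
        for item in items
    ]


# Hypothetical cells: the same logical value in two representations.
cell_with_none = [1.0, None, 3.0]
cell_with_nan = [1.0, float("nan"), 3.0]
assert normalize_list_value(cell_with_none) == normalize_list_value(cell_with_nan)
```

This does not address the `0` or `""` values reported in OSGeo/gdal#13448: once a `None` has been replaced by `0`, it can no longer be told apart from a real `0`.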