@theroggy
Member

@theroggy theroggy commented Nov 18, 2025

In PR #556, support for list-type columns was added, with tests for .geojson files. However, GDAL apparently returns/treats list columns in .parquet files differently from list columns in .geojson files. This PR handles .parquet files correctly as well and adds tests for this case.

Remarks:

  • For .parquet files, list columns are already returned as lists without having to parse them. However, the lists returned are ndarrays rather than Python lists, which is a small difference compared to .geojson files. As discussed below, we keep this behaviour.
  • Whether or not use_arrow is used gives differences as well, e.g. None being returned versus np.nan.
  • A test for nested columns in a parquet file was added, but it is skipped when use_arrow=False for now: in that case the columns are flattened, and it's not clear how we want to deal with this. To be further discussed/followed up in BUG: reading file with JSON ogr subtype is broken with use_arrow=True #592
  • When a .parquet file contains list fields with None values in the list, these None values are returned as 0 or "" when read with use_arrow=False, which is incorrect. With use_arrow=True, these None values are returned as np.nan, which is fine. This has been reported here: Parquet: list field types with None values in the list give issues OSGeo/gdal#13448
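
For downstream code or tests that need to compare list-column values across these cases, a small normalization helper can hide the container and missing-value differences. The sketch below is illustrative only and not part of this PR: the name normalize_list_value is hypothetical, and it assumes a cell is either missing (None or NaN) or a sequence such as a Python list or anything with a .tolist() method (e.g. a numpy ndarray).

```python
import math


def _is_missing(item):
    """True for None and for float NaN (np.float64 subclasses float)."""
    return item is None or (isinstance(item, float) and math.isnan(item))


def normalize_list_value(value):
    """Hypothetical helper: return a plain Python list with missing
    entries as None, or None if the whole cell is missing."""
    if _is_missing(value):
        return None
    # ndarrays (and other array-likes) expose .tolist(); plain lists do not.
    if hasattr(value, "tolist"):
        value = value.tolist()
    return [None if _is_missing(item) else item for item in value]


# A cell with None inside the list versus one with NaN inside the list.
with_none = [1.0, None, 3.0]
with_nan = [1.0, float("nan"), 3.0]
assert normalize_list_value(with_none) == normalize_list_value(with_nan)
```

Note that this cannot recover None values that were already returned as 0 or "" (the use_arrow=False case reported in OSGeo/gdal#13448), since that information is lost before the values reach Python.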

reference #592

@theroggy theroggy changed the title ENH: add support for parquet list columns BUG: fix support to read parquet files with list columns Nov 18, 2025
@theroggy theroggy changed the title BUG: fix support to read parquet files with list columns BUG: also support to read parquet files with list columns Nov 18, 2025
@theroggy theroggy marked this pull request as ready for review November 18, 2025 15:08
Member

@jorisvandenbossche jorisvandenbossche left a comment

Thanks for looking into this!

@jorisvandenbossche
Member

jorisvandenbossche commented Nov 18, 2025

  • for ".parquet" files, list columns are returned already as lists without having to parse them. However, the lists returned are ndarrays rather than python lists. This PR changes the behaviour for .geojson files so they also return ndarrays, but it could as well be changed the other way around.

I also commented on that inline, but personally I would just leave this as is and not try to reconcile the two exactly (in the end the issue here is that pandas does not have a proper list type, and once it has one, this will change anyway).

use_arrow or not gives differences as well, e.g. in None being returned versus np.nan.

Similarly here, I think we can just accept these as differences that come with the pyarrow->pandas conversion.

@jorisvandenbossche
Member

  • TODO: after adding the libgdal-arrow-parquet package to a conda CI env, a test, test_read_dataframe_arrow_dtypes, started failing. If libgdal-core is limited to < 3.12, the error disappears again?

So it might be related just to the version of libgdal-core, and not to this PR or the fact that libgdal-arrow-parquet was added?

The last test run on main was still using libgdal 3.11

@theroggy
Member Author

theroggy commented Nov 19, 2025

  • TODO: after adding the libgdal-arrow-parquet package to a conda CI env, a test, test_read_dataframe_arrow_dtypes, started failing. If libgdal-core is limited to < 3.12, the error disappears again?

So it might be related just to the version of libgdal-core, and not this PR / the fact that libgdal-arrow-parquet was added?

The last test run on main was still using libgdal 3.11

I started the tests on main manually, and they use libgdal 3.12 now, but they passed. I also tried adding libgdal-arrow-parquet in main (#599), but the test still didn't fail there.

In this PR I tried moving the pyarrow imports around in different ways... but the test keeps failing. If I comment out the new tests it stops failing, but if they are enabled it breaks... I moved them down the tests_... file so they are - I suppose - executed after the breaking test, but that doesn't help either.

Seems like a flaky thing in general :-(, so I wonder how it behaves "in the wild" in real code...

@jorisvandenbossche
Member

Seems like a flaky thing in general :-(, so I wonder how it behaves "in the wild" in real code...

Managed to reproduce it locally with the test suite, taking a look.

@theroggy theroggy added this to the 0.12.0 milestone Nov 19, 2025
@theroggy theroggy changed the title BUG: also support to read parquet files with list columns BUG: fix support to read parquet files with list columns Nov 19, 2025
@jorisvandenbossche jorisvandenbossche merged commit d4f51b3 into geopandas:main Nov 20, 2025
3 of 25 checks passed
@theroggy theroggy deleted the ENH-add-support-for-parquet-list-columns branch November 20, 2025 10:31
@jorisvandenbossche
Member

Thanks @theroggy!
