PERF: improve reading of geoarrow encoded Parquet (avoid converting coords to geopandas object dtype) #3322

jorisvandenbossche · 2024-06-03T10:19:25Z

This reduces the time to read a simple GeoParquet file with 10 million points from about 6 seconds to between 3.5 and 4 seconds (timings below).
Surprisingly, it also speeds up reading such a file with the default WKB encoding, because for some reason pyarrow converts a variable-size binary column to numpy faster than to pandas. That doesn't really make sense, since both create the same object dtype array; I will have to investigate and report upstream to pyarrow.

This takes a similar approach to the GeoArrow import code (#3301): convert only the attribute columns from Arrow to pandas, and convert the geometry columns separately. For geoarrow-encoded columns, this avoids converting the nested struct to Python lists/dictionaries (which we don't use anyway, because we create the geometries directly from the raw Arrow data).
It is somewhat unfortunate that this logic is partly duplicated between arrow.py and geoarrow.py. The problem is that for from_arrow the logic is based on checking the Arrow extension metadata, while for GeoParquet we need to support generic files that might not have that Arrow-specific metadata.

import numpy as np
import geopandas

N = 10_000_000

# Random points plus one integer attribute column, written with both encodings.
df = geopandas.GeoDataFrame({"col": range(N)}, geometry=geopandas.GeoSeries.from_xy(np.random.rand(N), np.random.rand(N)))
df.to_parquet("/tmp/test_points_wkb.parquet", geometry_encoding="WKB")
df.to_parquet("/tmp/test_points_geoarrow.parquet", geometry_encoding="geoarrow")

In [11]: %time geopandas.read_parquet("/tmp/test_points_wkb.parquet")
CPU times: user 5.22 s, sys: 1.7 s, total: 6.91 s
Wall time: 6.54 s  # <-- main
Wall time: 4.09 s  # <-- PR

In [12]: %time geopandas.read_parquet("/tmp/test_points_geoarrow.parquet")
CPU times: user 4.92 s, sys: 1.71 s, total: 6.62 s
Wall time: 6.4 s  # <-- main
Wall time: 3.65 s  # <-- PR

(It's also a bit disappointing that creating Points from x/y values is only slightly faster than parsing WKB, but that's something to profile on the shapely side.)
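
As a rough illustration of the two construction paths compared in that remark, the sketch below builds the same points once from coordinate arrays and once by parsing WKB. The coordinate values are arbitrary, and the snippet shows only that the results are equal; it makes no claim about relative speed.

```python
import numpy as np
import shapely

x = np.array([0.0, 1.5, 3.0])
y = np.array([1.0, 2.5, 4.0])

# Path 1: build points directly from coordinate arrays.
from_xy = shapely.points(x, y)

# Path 2: serialize to WKB, then parse the WKB back into geometries.
wkb = shapely.to_wkb(from_xy)
from_wkb = shapely.from_wkb(wkb)

print(shapely.equals(from_xy, from_wkb))
```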

martinfleis

Interesting...

m-richards

Thanks Joris!

jorisvandenbossche · 2024-06-07T09:12:17Z

Opened apache/arrow#42026 for the pyarrow perf issue

PERF: improve reading of geoarrow encoded Parquet (avoid converting coords to geopandas object dtype)

aac77df

martinfleis approved these changes Jun 3, 2024

jorisvandenbossche added 2 commits June 3, 2024 14:51

temp

7f2bdfc

preserve and test column order

5e6c9a1

m-richards approved these changes Jun 6, 2024

martinfleis added this to the 1.0 milestone Jun 6, 2024

jorisvandenbossche mentioned this pull request Jun 7, 2024

[Python] Large performance difference in conversion of binary array to object dtype array in to_pandas vs to_numpy apache/arrow#42026

Open

jorisvandenbossche merged commit a61af6e into geopandas:main Jun 7, 2024
20 checks passed

jorisvandenbossche deleted the perf-read-parquet-geoarrow branch June 7, 2024 09:12
