ENH: Add export to GeoArrow #3219

Merged
merged 39 commits into from
May 24, 2024
Changes from 25 commits
Commits
fe43cbf
Add export to GeoArrow
kylebarron Mar 13, 2024
e5cb82c
add support for interleaved/separate + update field names + add basic…
jorisvandenbossche May 8, 2024
f1d2cfe
Merge remote-tracking branch 'upstream/main' into kyle/geoarrow
jorisvandenbossche May 15, 2024
4ff128f
add local test files + test and fix z dim
jorisvandenbossche May 15, 2024
806dbd5
fix + test crs per column
jorisvandenbossche May 15, 2024
b0e5348
skip test if no pyarrow
jorisvandenbossche May 17, 2024
32be8ec
Merge remote-tracking branch 'upstream/main' into kyle/geoarrow
jorisvandenbossche May 17, 2024
92d546a
refactor main geopandas->arrow conversion out of geoparquet writing
jorisvandenbossche May 17, 2024
196265d
add test for erro message and for missing values
jorisvandenbossche May 17, 2024
334eff0
clean-up assert helper
jorisvandenbossche May 17, 2024
4661fb6
add small test for mixed geometries
jorisvandenbossche May 17, 2024
3379c37
missing values for interleaved not supported for older pyarrow
jorisvandenbossche May 17, 2024
54901c3
Merge remote-tracking branch 'upstream/main' into kyle/geoarrow
jorisvandenbossche May 17, 2024
ba7ef41
flatten -> ravel
jorisvandenbossche May 17, 2024
39cb8b1
only require recent pyarrow for point geometries + bump minimum teste…
jorisvandenbossche May 17, 2024
ca91e5d
bump minimum tested version to 10.0
jorisvandenbossche May 17, 2024
07eb5d6
skip 3D for older GEOS
jorisvandenbossche May 17, 2024
20a60f5
fix for older shapely and missing values in point array
jorisvandenbossche May 18, 2024
e61c4ea
fix polygon creation for older pyarrow
jorisvandenbossche May 18, 2024
7607873
Merge remote-tracking branch 'upstream/main' into kyle/geoarrow
jorisvandenbossche May 18, 2024
42c657f
skip for older GEOS
jorisvandenbossche May 18, 2024
459f1b4
fix read-only bug with combo of nightly pandas+numpy
jorisvandenbossche May 18, 2024
bb53149
return generic ArrowTable object instead of pyarrow.Table
jorisvandenbossche May 18, 2024
61e4d0e
Merge remote-tracking branch 'upstream/main' into kyle/geoarrow
jorisvandenbossche May 19, 2024
cdc5f23
fixup linting
jorisvandenbossche May 19, 2024
df3b594
create data with nullable=False fields
jorisvandenbossche May 20, 2024
07a836d
Update geopandas/geodataframe.py
jorisvandenbossche May 23, 2024
d9aa311
add back empty extension metadata in case of no crs
jorisvandenbossche May 23, 2024
be49923
Merge remote-tracking branch 'upstream/main' into kyle/geoarrow
jorisvandenbossche May 23, 2024
f253482
fixup merge
jorisvandenbossche May 23, 2024
293a18f
cleanup comments
jorisvandenbossche May 23, 2024
ec41e85
add basic test with geoarrow-pyarrow
jorisvandenbossche May 23, 2024
c1699c5
Merge remote-tracking branch 'upstream/main' into kyle/geoarrow
jorisvandenbossche May 23, 2024
c16d2ed
add export for geoseries
jorisvandenbossche May 23, 2024
317a14b
update tests to actually test the point geoarrow case
jorisvandenbossche May 23, 2024
b4817d9
use is_nan compute function instead of method
jorisvandenbossche May 23, 2024
8f2bac7
fix docstring
jorisvandenbossche May 23, 2024
b3665e3
properly expose include_z
jorisvandenbossche May 24, 2024
fa0d79f
add versionadded
jorisvandenbossche May 24, 2024
2 changes: 1 addition & 1 deletion ci/envs/39-minimal.yaml
@@ -20,7 +20,7 @@ dependencies:
- geopy
- SQLalchemy
- libspatialite
- pyarrow=8.0.0
- pyarrow=10.0
- geodatasets
- pip
- pip:
1 change: 1 addition & 0 deletions geopandas/_compat.py
@@ -23,6 +23,7 @@
# Shapely / GEOS compat
# -----------------------------------------------------------------------------

SHAPELY_GE_204 = Version(shapely.__version__) >= Version("2.0.4")

GEOS_GE_390 = shapely.geos.geos_version >= (3, 9, 0)
GEOS_GE_310 = shapely.geos.geos_version >= (3, 10, 0)
74 changes: 74 additions & 0 deletions geopandas/geodataframe.py
@@ -1149,6 +1149,80 @@ def to_wkt(self, **kwargs):

return df

def to_arrow(self, index=None, geometry_encoding="WKB", interleaved=True):

coord_type: Literal["interleaved", "struct|separate"] might be more future-proof than interleaved=True?

Member

Yes, I know, but the interleaved is consistent with an existing keyword in shapely. But maybe that's not a good enough reason ;)

Contributor Author

It's maybe slightly confusing to have both interleaved and geometry_encoding parameters. Could we have geometry_encoding: "WKB" | "interleaved" | "separated"?

Keeping it separate might make it easier to specify future encodings/coordinate combinations that may not exist yet 🤷

Member

I was planning to allow the user to specify a more specific geoarrow type, like "point", "multipoints", etc. (i.e. essentially anything that comes after geoarrow.<> in the extension type name). That would give users the possibility to control the exact output (e.g. they know they can have a mix of polygons and multipolygons, and want to be sure to always export as multipolygon regardless of the actual values present).

If we do that, that wouldn't really match with allowing "interleaved" | "separated"

Maybe def to_arrow(self, index=None, *, geometry_encoding=None, interleaved=None) would leave some flexibility to sort out the details in a backwards-compatible way? As much as the completist in me would like the ability to specify all possible type constraints, geometry_encoding + interleaved is probably sufficient for almost everybody 🙂.

Contributor Author

@kylebarron May 22, 2024

I'm fine with separate options here. (As a general statement, big fan of using * to force named arguments, so +1 on that here)

Member

Yes, and also later on if we want to improve this interface by using names for the encoding or whatever, it should always be relatively straightforward to keep supporting the interface we come up with now for back compat. Getting something out there for now is more important ;)

Member

Maybe def to_arrow(self, index=None, *, geometry_encoding=None, interleaved=None) would leave some flexibility to sort out the details in a backwards-compatible way?

@paleolimbot with geometry_encoding being None by default, did you mean that the user would be required to always specify some encoding (i.e. not actually default to WKB)?

No, I think WKB is the default that makes the most sense for the foreseeable future. None would just make it clear that it's unspecified and that we're picking the default (but perhaps that's a change for a future version).
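
For reference, the signature floated in this thread can be sketched in plain Python as below. This is only an illustration of the argument handling, not the merged implementation: the name `to_arrow_sketch`, the returned dict, and the case-insensitive check are assumptions made for the example. The bare `*` forces the two options to be passed by name, and `None` stands for "unspecified" so the defaults (WKB, interleaved) are chosen inside the function.

```python
def to_arrow_sketch(index=None, *, geometry_encoding=None, interleaved=None):
    """Illustrative only: resolve unspecified options to their defaults."""
    # None means "not specified by the caller"; the defaults are picked here,
    # so they could change in a later version without changing the signature.
    if geometry_encoding is None:
        geometry_encoding = "WKB"
    if interleaved is None:
        interleaved = True

    # Compare case-insensitively so "wkb" and "WKB" are both accepted
    # (an assumption of this sketch).
    if geometry_encoding.upper() not in {"WKB", "GEOARROW"}:
        raise ValueError(f"unknown geometry_encoding: {geometry_encoding!r}")

    return {
        "index": index,
        "geometry_encoding": geometry_encoding,
        "interleaved": interleaved,
    }


print(to_arrow_sketch())
print(to_arrow_sketch(geometry_encoding="geoarrow", interleaved=False))

# Passing the options positionally fails because of the bare `*`.
try:
    to_arrow_sketch(None, "geoarrow")
except TypeError as exc:
    print("TypeError:", exc)
```

Keeping the defaults out of the signature is what leaves room to rename or extend the encodings later while still accepting today's calls.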

"""Encode a GeoDataFrame to GeoArrow format.

See https://geoarrow.org/ for details on the GeoArrow specification.

This function returns a generic Arrow data object implementing
the `Arrow PyCapsule Protocol`_ (i.e. having an ``__arrow_c_stream__``
method). This object can then be consumed by your Arrow implementation
of choice that supports this protocol.

.. _Arrow PyCapsule Protocol: https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html

Parameters
----------
index : bool, default None
If ``True``, always include the dataframe's index(es) as columns
in the file output.
If ``False``, the index(es) will not be written to the file.
If ``None``, the index(es) will be included as columns in the file
output except `RangeIndex` which is stored as metadata only.
geometry_encoding : {'WKB', 'geoarrow'}, default 'WKB'
Member

What is the idea behind WKB being the default? Wouldn't it make more sense to default to GeoArrow? Going forward, I suppose that GeoArrow is the encoding of interest and since this is a new method, we don't need to think about backwards compatibility as we do with Parquet IO.

WKB has the advantage that any GeoSeries can be converted to it (and that two distinct GeoSeries that are converted to it, perhaps containing different geometry types, can be combined with an Arrow C++ array concatenate operation). I love the GeoArrow encoding but I can see how users might have the most success opting in when it makes sense for them to do so.

Contributor Author

For completeness, we've also been discussing a mixed type (backed by an arrow union) that can hold multiple underlying arrays. That would always work here, but shapely doesn't support export to it yet.
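
To make the WKB argument in this thread concrete, here is a rough stdlib-only sketch. It is not how shapely or geopandas produce WKB, and the helper names `point_wkb` and `linestring_wkb` are hypothetical; it only handles 2D little-endian geometries. It shows that every geometry, whatever its type, becomes an opaque byte string, so values of different types fit in one binary column and can be combined freely.

```python
import struct


def point_wkb(x, y):
    # byte order (1 = little-endian), geometry type (1 = Point), then x, y
    return struct.pack("<BIdd", 1, 1, x, y)


def linestring_wkb(coords):
    # byte order, geometry type (2 = LineString), number of points, then x, y pairs
    header = struct.pack("<BII", 1, 2, len(coords))
    body = b"".join(struct.pack("<dd", x, y) for x, y in coords)
    return header + body


# Two "columns" holding different geometry types...
points = [point_wkb(1, 2), point_wkb(2, 1)]
lines = [linestring_wkb([(0, 0), (1, 1)])]

# ...are both just sequences of bytes, so they can be combined directly.
combined = points + lines
assert all(isinstance(value, bytes) for value in combined)

print(points[0].hex().upper())
# 0101000000000000000000F03F0000000000000040
```

The printed hex string is the same value shown for `POINT (1 2)` in the docstring example of this PR. A native GeoArrow point array, by contrast, has a type tied to one geometry kind, which is why mixing kinds needs either WKB or the union-backed mixed type mentioned above.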

The GeoArrow encoding to use for the data conversion.
interleaved : bool, default True
Only relevant for 'geoarrow' encoding. If True, the geometries'
coordinates are interleaved in a single fixed size list array.
If False, the coordinates are stored as separate arrays in a
struct type.

Returns
-------
ArrowTable
A generic Arrow table object with geometry columns encoded to
GeoArrow.

Examples
--------
>>> from shapely.geometry import Point
>>> data = {'col1': ['name1', 'name2'], 'geometry': [Point(1, 2), Point(2, 1)]}
>>> gdf = geopandas.GeoDataFrame(data)
>>> gdf
col1 geometry
0 name1 POINT (1 2)
1 name2 POINT (2 1)

>>> arrow_table = gdf.to_arrow()
>>> arrow_table
<geopandas.io.geoarrow.ArrowTable object at ...>

The returned data object needs to be consumed by a library implementing
the Arrow PyCapsule Protocol. For example, wrapping the data as a
pyarrow.Table (requires pyarrow >= 14.0):

>>> import pyarrow as pa
>>> table = pa.table(arrow_table)
>>> table
pyarrow.Table
col1: string
geometry: binary
----
col1: [["name1","name2"]]
geometry: [[0101000000000000000000F03F0000000000000040,\
01010000000000000000000040000000000000F03F]]

"""
from geopandas.io.geoarrow import ArrowTable, geopandas_to_arrow

table = geopandas_to_arrow(
self,
index=index,
geometry_encoding=geometry_encoding,
interleaved=interleaved,
)
return ArrowTable(table)

def to_parquet(
self, path, index=None, compression="snappy", schema_version=None, **kwargs
):
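
As a side note on the `interleaved` option documented in the `to_arrow` docstring above, the two coordinate layouts can be sketched with plain Python buffers. This is a conceptual illustration only, with no Arrow library involved; the helper names `point_from_interleaved` and `point_from_separated` are hypothetical, and the data are the two points from the docstring example.

```python
from array import array

coords = [(1.0, 2.0), (2.0, 1.0)]  # POINT (1 2), POINT (2 1)

# interleaved=True: one flat buffer x0, y0, x1, y1, ... read in chunks of 2,
# which is the idea behind a fixed size list array.
interleaved = array("d", [value for xy in coords for value in xy])

# interleaved=False: one buffer per dimension, like the fields of a struct.
separated = {
    "x": array("d", [x for x, _ in coords]),
    "y": array("d", [y for _, y in coords]),
}


def point_from_interleaved(buf, i, ndim=2):
    # Point i occupies the slice [i * ndim, (i + 1) * ndim) of the flat buffer.
    return tuple(buf[i * ndim : (i + 1) * ndim])


def point_from_separated(fields, i):
    # Point i is assembled from position i of each per-dimension buffer.
    return (fields["x"][i], fields["y"][i])


print(list(interleaved))
print(point_from_interleaved(interleaved, 1), point_from_separated(separated, 1))
```

Both layouts hold the same values; the interleaved form keeps each point contiguous, while the separated form keeps each dimension contiguous.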
18 changes: 4 additions & 14 deletions geopandas/io/arrow.py
@@ -5,8 +5,6 @@
import numpy as np
from pandas import DataFrame, Series

import shapely

import geopandas
from geopandas import GeoDataFrame
from geopandas._compat import import_optional_dependency
@@ -272,24 +270,16 @@ def _geopandas_to_arrow(df, index=None, schema_version=None):
"""
Helper function with main, shared logic for to_parquet/to_feather.
"""
from pyarrow import Table
from geopandas.io.geoarrow import geopandas_to_arrow

_validate_dataframe(df)

# create geo metadata before altering incoming data frame
geo_metadata = _create_metadata(df, schema_version=schema_version)

if shapely.geos_version > (3, 10, 0):
kwargs = {"flavor": "iso"}
else:
if any(
df[col].array.has_z.any() for col in df.columns[df.dtypes == "geometry"]
):
raise ValueError("Cannot write 3D geometries with GEOS<3.10")
kwargs = {}
df = df.to_wkb(**kwargs)

table = Table.from_pandas(df, preserve_index=index)
table = geopandas_to_arrow(
df, geometry_encoding="WKB", index=index, interleaved=True
)

# Store geopandas specific file-level metadata
# This must be done AFTER creating the table or it is not persisted