ENH: Add export to GeoArrow #3219
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1149,6 +1149,80 @@ def to_wkt(self, **kwargs): | |
|
||
return df | ||
|
||
def to_arrow(self, index=None, geometry_encoding="WKB", interleaved=True): | ||
"""Encode a GeoDataFrame to GeoArrow format. | ||
|
||
See https://geoarrow.org/ for details on the GeoArrow specification. | ||
|
||
This function returns a generic Arrow data object implementing | ||
the `Arrow PyCapsule Protocol`_ (i.e. having an ``__arrow_c_stream__`` | ||
method). This object can then be consumed by your Arrow implementation | ||
of choice that supports this protocol. | ||
|
||
.. _Arrow PyCapsule Protocol: https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html | ||
|
||
Parameters | ||
---------- | ||
index : bool, default None | ||
If ``True``, always include the dataframe's index(es) as columns | ||
in the file output. | ||
If ``False``, the index(es) will not be written to the file. | ||
If ``None``, the index(es) will be included as columns in the file | ||
output except `RangeIndex` which is stored as metadata only. | ||
geometry_encoding : {'WKB', 'geoarrow' }, default 'WKB' | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is the idea behind WKB being the default? Wouldn't it make more sense to default to GeoArrow? Going forward, I suppose that GeoArrow is the encoding of interest and since this is a new method, we don't need to think about backwards compatibility as we do with Parquet IO. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. WKB has the advantage that any There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For completeness, we've also been discussing a mixed type (backed by an arrow union) that can hold multiple underlying arrays. That would always work here, but shapely doesn't support export to it yet. |
||
The GeoArrow encoding to use for the data conversion. | ||
interleaved : bool, default True | ||
Only relevant for 'geoarrow' encoding. If True, the geometries | ||
coordinates are interleaved in a single fixed size list array. | ||
If False, the coordinates are stored as separate arrays in a | ||
struct type. | ||
|
||
Returns | ||
------- | ||
ArrowTable | ||
A generic Arrow table object with geometry columns encoded to | ||
GeoArrow. | ||
|
||
Examples | ||
-------- | ||
>>> from shapely.geometry import Point | ||
>>> data = {'col1': ['name1', 'name2'], 'geometry': [Point(1, 2), Point(2, 1)]} | ||
>>> gdf = geopandas.GeoDataFrame(data) | ||
>>> gdf | ||
col1 geometry | ||
0 name1 POINT (1 2) | ||
1 name2 POINT (2 1) | ||
|
||
>>> arrow_table = gdf.to_arrow() | ||
>>> arrow_table | ||
<geopandas.io.geoarrow.ArrowTable object at ...> | ||
|
||
The returned data object needs to be consumed by a library implementing | ||
the Arrow PyCapsule Protocol. For example, wrapping the data as a | ||
pyarrow.Table (requires pyarrow >= 14.0): | ||
|
||
>>> import pyarrow as pa | ||
>>> table = pa.table(arrow_table) | ||
>>> table | ||
pyarrow.Table | ||
col1: string | ||
geometry: binary | ||
---- | ||
col1: [["name1","name2"]] | ||
geometry: [[0101000000000000000000F03F0000000000000040,\ | ||
01010000000000000000000040000000000000F03F]] | ||
|
||
""" | ||
from geopandas.io.geoarrow import ArrowTable, geopandas_to_arrow | ||
|
||
table = geopandas_to_arrow( | ||
self, | ||
index=index, | ||
geometry_encoding=geometry_encoding, | ||
interleaved=interleaved, | ||
) | ||
return ArrowTable(table) | ||
|
||
def to_parquet( | ||
self, path, index=None, compression="snappy", schema_version=None, **kwargs | ||
): | ||
|
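
The `geometry: binary` column in the docstring example above holds one WKB value per row. As a minimal sketch of what those bytes contain, the following decodes the two hex strings from that output using only the standard library. The helper name `decode_wkb_point` is hypothetical and handles only 2D points; it is not part of geopandas or shapely.

```python
import struct


def decode_wkb_point(hex_string):
    """Decode a 2D WKB point given as a hex string (illustrative helper)."""
    raw = bytes.fromhex(hex_string)
    # Byte 0 is the byte-order flag: 1 = little-endian, 0 = big-endian.
    endian = "<" if raw[0] == 1 else ">"
    # Bytes 1-4 hold the geometry type code; 1 means Point.
    (geom_type,) = struct.unpack(endian + "I", raw[1:5])
    if geom_type != 1:
        raise ValueError(f"expected a 2D Point (type 1), got type {geom_type}")
    # Bytes 5-20 hold the x and y coordinates as two float64 values.
    x, y = struct.unpack(endian + "2d", raw[5:21])
    return x, y


# The two values printed in the docstring example.
wkb_values = [
    "0101000000000000000000F03F0000000000000040",
    "01010000000000000000000040000000000000F03F",
]
points = [decode_wkb_point(value) for value in wkb_values]
print(points)  # [(1.0, 2.0), (2.0, 1.0)]
```

The decoded coordinates match `POINT (1 2)` and `POINT (2 1)` from the example GeoDataFrame, which is why any consumer that understands WKB can read the default output.
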

`coord_type: Literal["interleaved", "struct|separate"]` might be more future-proof than `interleaved=True`?

Yes, I know, but the `interleaved` is consistent with an existing keyword in shapely. But maybe that's not a good enough reason ;)
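
For readers unfamiliar with the two coordinate layouts being discussed, here is a minimal sketch in plain Python. It models the layouts with lists and a dict rather than real Arrow arrays, and the function names `to_interleaved` and `to_separated` are hypothetical.

```python
def to_interleaved(points):
    """Return one fixed-size [x, y] entry per point (interleaved layout)."""
    return [[x, y] for x, y in points]


def to_separated(points):
    """Return one array per dimension, keyed by name (struct-like layout)."""
    return {"x": [x for x, _ in points], "y": [y for _, y in points]}


points = [(1.0, 2.0), (2.0, 1.0)]
interleaved = to_interleaved(points)
separated = to_separated(points)
print(interleaved)  # [[1.0, 2.0], [2.0, 1.0]]
print(separated)    # {'x': [1.0, 2.0], 'y': [2.0, 1.0]}
```

Both layouts carry the same coordinates; the keyword only selects how they are arranged in memory.
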

It's maybe slightly confusing to have both `interleaved` and `geometry_encoding` parameters. Could we have `geometry_encoding: "WKB" | "interleaved" | "separated"`?

Keeping it separate might make it easier to specify future encodings/coordinate combinations that may not exist yet 🤷

I was planning to allow the user to specify a more specific geoarrow type, like "point", "multipoints", etc. (i.e. essentially anything that comes after `geoarrow.<>` in the extension type name). That would give users the possibility to control the exact output (e.g. they know they can have a mix of polygons and multipolygons, and want to be sure to always export as multipolygon regardless of the actual values present).

If we do that, that wouldn't really match with allowing `"interleaved" | "separated"`.

Maybe `def to_arrow(self, index=None, *, geometry_encoding=None, interleaved=None)` would leave some flexibility to sort out the details in a backwards-compatible way? As much as the completist in me would like the ability to specify all possible type constraints, `geometry_encoding` + `interleaved` is probably sufficient for almost everybody 🙂.

I'm fine with separate options here. (As a general statement, big fan of using `*` to force named arguments, so +1 on that here)

Yes, and also later on if we want to improve this interface by using names for the encoding or whatever, it should always be relatively straightforward to keep supporting the interface we come up with now for back compat. Getting something out there for now is more important ;)

@paleolimbot with `geometry_encoding=None` being None by default, did you mean to require the user to always specify some encoding (i.e. not actually default to WKB)?

No, I think WKB is the default that makes the most sense for the foreseeable future. `None` would just make it clear that it's unspecified and that we're picking the default (but perhaps that's a change for a future version).
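
The signature proposed in this thread can be sketched as follows. This is an illustration of the idea only, not the geopandas implementation: `resolve_to_arrow_options` is a hypothetical stand-in, and resolving `None` to `"WKB"` and `True` simply mirrors the defaults shown in the diff.

```python
def resolve_to_arrow_options(index=None, *, geometry_encoding=None, interleaved=None):
    """Resolve unspecified options to defaults (hypothetical stand-in)."""
    # None means "not specified by the caller", so the default is picked here.
    if geometry_encoding is None:
        geometry_encoding = "WKB"
    if interleaved is None:
        interleaved = True
    if geometry_encoding not in ("WKB", "geoarrow"):
        raise ValueError(f"unsupported geometry_encoding: {geometry_encoding!r}")
    return {
        "index": index,
        "geometry_encoding": geometry_encoding,
        "interleaved": interleaved,
    }


defaults = resolve_to_arrow_options()
explicit = resolve_to_arrow_options(geometry_encoding="geoarrow", interleaved=False)
print(defaults)
print(explicit)

# The bare * makes everything after `index` keyword-only, so passing the
# encoding positionally is rejected with a TypeError.
try:
    resolve_to_arrow_options(None, "geoarrow")
    positional_rejected = False
except TypeError:
    positional_rejected = True
print(positional_rejected)
```

Because callers must name `geometry_encoding` and `interleaved`, the accepted values or the default can be revisited later without breaking positional call sites.
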