Mixed geometry types with DenseUnion
#23
Can you link to this? When searching for
|
Hey @kylebarron ! Thanks for the super quick response. So, we did decide to go with a nested array, and the nested array is implemented in http://www.github.com/rapidsai/cuspatial/pull/585 . My original implementation had a lot of extra logic for remembering the ordering of Features in a particular GeoSeries - a strength, perhaps, of GeoPandas is to allow any Features, in an order, with indexing. What I figured out while I was fixing the bug in cuspatial's GeoArrow identified in https://notebooksharing.space/view/517f3172b12354804179f248247ab5ffd6573214e9f9810d13494533f1aefd8a#displayOptions= is that what I had done was re-implement https://github.com/rapidsai/cuspatial/blob/8718c149ec56da2c408f7b7214f927f89e51ba5d/python/cuspatial/cuspatial/geometry/pygeoarrow.py contains a GeoArrow spec that includes the DenseUnion. As it is presented it still satisfies the original agreement behind GeoArrow: Each of the four |
@kylebarron We're using Arrow's Exploding a single GeoSeries into multiple Tables doesn't allow us to preserve relative row ordering efficiently. We'd either have to fill the "not-this-geometry-type" slots of each table with nulls (like a |
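To make the row-ordering point concrete, here is a minimal pure-Python sketch of a dense-union layout: one type id and one offset per row, plus one child array per geometry type. Plain lists stand in for Arrow buffers, and the names `build_dense_union` and `take` are hypothetical; this is not cuspatial's or pyarrow's implementation.

```python
# Pure-Python sketch of a dense-union layout. Plain lists stand in for Arrow
# buffers; every name here is hypothetical and chosen for illustration only.

def build_dense_union(rows, type_names):
    """rows is a list of (type_name, value) pairs in their original order."""
    children = {name: [] for name in type_names}
    type_ids = []  # one entry per row: which child array holds the value
    offsets = []   # one entry per row: position inside that child array
    for type_name, value in rows:
        type_ids.append(type_names.index(type_name))
        offsets.append(len(children[type_name]))
        children[type_name].append(value)
    return type_ids, offsets, children


def take(type_ids, offsets, children, type_names, i):
    """Read row i back using only its type id and offset."""
    name = type_names[type_ids[i]]
    return name, children[name][offsets[i]]


type_names = ["point", "linestring", "polygon"]
rows = [
    ("point", (0.0, 0.0)),
    ("polygon", [[(0, 0), (1, 0), (1, 1), (0, 0)]]),
    ("point", (2.0, 3.0)),
    ("linestring", [(0, 0), (1, 1)]),
]
type_ids, offsets, children = build_dense_union(rows, type_names)
print(type_ids)  # [0, 2, 0, 1]
print(offsets)   # [0, 0, 1, 0]
```

Each child array stays densely packed with no null padding, while the two per-row arrays are enough to recover the original row order.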
Thanks for the explanation; that makes a lot of sense. I'm wondering... do you think the I'm mainly worried about a growing spec becoming harder to implement. |
I'm willing to include it in the core spec because it allows us to represent and transmit a singular object as a GeoArrow datasource. However, I understand the concern about implementation complexity. I think the good news is that even with
|
I'm not sure I agree with this conclusion. If you're exclusively working with the low-level Arrow buffers, then yes it's easy to unpack a union array. But I expect that a union array might violate some of the assumptions that higher level libraries impose. Looking at polars again as an example, it expects that each column is strongly typed, and a union column might violate many assumptions baked into the library. While in theory a column could be downcast to an Arrow In #22 @paleolimbot is suggesting to have a variety of different arrow extension type names. I.e.:
What would you think about defining a new extension type for mixed data, say
(I think if I understand correctly your mixed-type columns are not necessarily the same as a For users who desire only a single geometry type, they can mark their data as one of the above types. For users like yourselves who require mixed geometry types, you can "opt-in" to the |
A Union is no less strongly typed than any other arrow dtype. If Polars doesn't support Unions, they probably should (after all, they claim to support Arrow!). It's unreasonable to compromise on memory layout because an unrelated library doesn't support it.
What do you mean by "downcasting" to an Arrow Union array? Unions are nested types -- if anything, you'd up-cast one of the potential sub-dtypes into a singly-typed Union.
Not sure what you mean? A Union's row can only be one of the types represented by the Union. If you have a
Users can still have columns of a single geometry dtype if they need. This work is about representing columns with mixed geometry dtypes, i.e.
Anyone using the Arrow libraries can send and receive Union Arrays. If Polars wants to force their users to transform their data before they use Polars, then that's a decision I assume they're making with full knowledge of how much more difficult it makes their users' lives. |
I think me referencing polars as an example was probably a mistake for the purposes of this thread. Regardless, my essential question is: are you proposing that every usage of geoarrow be wrapped in a dense union? If a user wishes to store a point column, the spec should disallow I have no issue at all with the spec defining support for mixed geometries; I'm just trying to figure out the pros and cons to requiring all geoarrow data be stored within a |
No, definitely not. The |
Apologies for the misunderstanding. I'm +1 on adding support for a |
Hi all, sorry for the delay (summer holidays, conference, Arrow release, etc.)...I wanted to make sure to give this proper reading since it's an important one! I'm definitely in support of a When I envisioned the union type...and with the brief discussions I've had with @jorisvandenbossche about this...I envisioned a union that did not have a fixed set of specified members. For example, a kernel computing a boundary of a I get how it's potentially useful to only ever have one data type (in the One advantage of the non fixed-members thing is that we can represent nested collections: the |
@thomcom I updated the title on this to try and reflect the discussion being mostly around |
I still need to read up on the layout specifics of DenseUnion and Union, but being able to represent geometry collections seems powerful. Could that even allow us to remove/not need WKB? |
We'd definitely like to leave WKB behind since it has fundamental I/O limitations. |
I agree that we should try to leave WKB behind and try to encode geometries using a native columnar representation. I'm currently working on geospatial support for DuckDB and have so far settled for a similar nested schema to what has been discussed here. I've also implemented Union types (although they are "sparse" for now, not sure if "dense" implementation would be worth it complexity/performance wise since consecutive nulls are pretty cheap for us to store) specifically to use in mixed-geometry scenarios. We similarly don't have any plans to support recursive geometry collections (or types in general). Although from my own experience I rarely find that mixed geometries are all that useful or common, in which case allowing "sparse" union implementations might be an acceptable complexity-performance tradeoff? My impression is that you basically always split/validate/categorize any raw data into specific geometry types before further analysis, mixed-geometry columns are mostly just an immediate representation used in the initial ETL step. It would be interesting to hear a use-case from someone who uses them regularly. |
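As a rough way to weigh the sparse-versus-dense tradeoff raised above, the sketch below counts slots under Arrow's definitions of the two layouts: in a sparse union every child array is as long as the union itself, while in a dense union the children together hold at least one slot per row and an extra per-row offsets array is needed. The function name `union_slot_counts` is hypothetical, and the counts ignore compression, so they say nothing about DuckDB's on-disk size.

```python
# Hypothetical helper: compare slot counts for sparse and dense union layouts,
# following the Arrow columnar format's description of each layout.

def union_slot_counts(type_ids, n_types):
    n_rows = len(type_ids)
    return {
        # Sparse: every child array has one slot per row, used or not.
        "sparse_child_slots": n_rows * n_types,
        # Dense: each row occupies exactly one slot in exactly one child
        # (the minimum; a dense union may carry unused child slots).
        "dense_child_slots": n_rows,
        # Dense also needs one int32 offset per row.
        "dense_offset_entries": n_rows,
    }


# Five rows drawn from three geometry types (0=point, 1=line, 2=polygon).
counts = union_slot_counts([0, 0, 2, 1, 0], n_types=3)
print(counts)
```

With only a handful of member types the sparse overhead is a small constant factor, which is consistent with the view that sparse unions can be an acceptable complexity-performance tradeoff when unused slots are cheap to store.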
Is that also true for points? For
I think that's true, or maybe I would add an intermediate representation. For example, when you do an intersection in GEOS you might get a POINT (vertices touching), LINESTRING (edges touching), POLYGON (overlapping), or any combination of those (GEOMETRYCOLLECTION). For the purposes of query engines that require an output type as a function of the input type (Arrow's query engine, Acero, or Substrait's spec), that function has to be able to represent the output type/data somehow (even if the immediate next step is a cast to one of the geometry types). For what it's worth, S2 lets you pick what dimension output you want and a GEOS-based kernel could emulate that. Recursive geometry collections and collections that have mixed dimensions are also only useful as intermediate representations...the next step is almost always to flatten them or drop dimensions or fill empty dimensions. It could probably work to not be able to represent them, but then there would need to be a set of conversion options for import of any type where this happens (in particular, mixed dimensions come up more than you'd think). I haven't prototyped these yet but I think it might not be all that hard to support both. |
Yeah, maybe I could have been clearer: in memory they still take the same space, but once they hit the disk nulls are squashed as part of our compression scheme, afaik.
Good call, I remember I actually brought this exact example up internally when we discussed adding Unions, but in our SQL case we could've just as well returned a struct or use a table-returning function instead to represent the multiple return values. That said I do think that it makes sense to have a GeometryCollection type, I'm just not sure how "advanced" it needs to be, so if implementation complexity is a concern my two cents would be to rather go for a list of sparse unions (or even a list of structs) than not having it at all.
This was my plan to tackle recursive collections as well: flatten and provide a "group" or "depth" column on import. I am secretly hoping they won't be that common (I know the GeoJSON spec discourages nested GeometryCollections), but I might be wrong about that. |
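The flatten-with-depth idea can be sketched in a few lines, assuming GeoJSON-style dictionaries as input. The function name `flatten_collection` and the choice to count depth from 0 for a top-level geometry are illustrative assumptions, not a description of DuckDB's importer.

```python
# Hypothetical sketch: flatten nested GeoJSON-style GeometryCollections into
# (depth, geometry) pairs, so nesting survives as a plain integer column.

def flatten_collection(geom, depth=0):
    if geom["type"] == "GeometryCollection":
        flat = []
        for child in geom["geometries"]:
            # Members of a collection sit one level deeper than the collection.
            flat.extend(flatten_collection(child, depth + 1))
        return flat
    return [(depth, geom)]


nested = {
    "type": "GeometryCollection",
    "geometries": [
        {"type": "Point", "coordinates": [0.0, 0.0]},
        {
            "type": "GeometryCollection",
            "geometries": [
                {"type": "LineString", "coordinates": [[0, 0], [1, 1]]},
            ],
        },
    ],
}
flat = flatten_collection(nested)
print([(depth, g["type"]) for depth, g in flat])
# [(1, 'Point'), (2, 'LineString')]
```

A "group" column identifying which collection each member came from would additionally need a running identifier per collection, which this sketch leaves out.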
Coming back to this issue as I'm thinking of how to implement a prototype in Rust. For the (non-nested) geometry collection array, you could just implement a |
It seems like |
In terms of the specification I was thinking of proposing a This means that implementors can choose whether they care to have the overhead of always storing multi-geometries, or whether they have a use case to store, for example |
I have a mostly-working "mixed geometry array" implementation (i.e. not supporting geometry collections) in Rust here geoarrow/geoarrow-rs#122. It passes a simple test round-tripping a collection of points, lines, and polygons, but it needs a little more work to ensure I'm thinking of fleshing out a full prototype of this and a geometry collection array and then writing a spec PR here for discussion 🙂 |
I put up #43 for discussion! Feedback is appreciated! |
In reviewing #43 I had a thought about how this might be able to work without a

A good example of an error that seems to make a DenseUnion-based solution problematic is:

    import pyarrow as pa
    import pyarrow.parquet as pq

    mixed_type = pa.union(
        [pa.field("", pa.int32())],
        mode="dense"
    )
    table = pa.Table.from_batches([], schema=pa.schema([pa.field("col", mixed_type)]))
    pq.write_table(table, "some_file.parquet")
    #> ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema conversion: dense_union<: int32=0>

For argument's sake, a type in the form |
JavaScript Arrow supports unions! |
Apologies! It may be worth updating https://github.com/apache/arrow/blob/main/docs/source/status.rst . It should be said that I can knock R off that list too since it's fairly straightforward to wrap the C++ implementation from R. |
I made a PR to update that table: apache/arrow#37108 |
Hey @jorisvandenbossche and @kylebarron, I wanted to share that an updated implementation of GeoArrow is up for review in cuspatial: http://www.github.com/rapidsai/cuspatial/pull/585.

I use the arrow DenseUnion type to contain four arrays: points, mpoints, lines, and polygons. Whether or not they are Multi is identified by whether a DenseUnion __getitem__ result has a length of 1 or more.

In particular look at: https://github.com/rapidsai/cuspatial/pull/585/files#diff-3cc28b8293d42e4a558968c1722e9a1a3e14af2386ad97345221e23f9007ecdeR237-R276 which contains the arrow-to-Shapely logic that I use to create a GeoPandas df from pure arrow buffers.

Also look at: https://github.com/rapidsai/cuspatial/pull/585/files#diff-2c7a56464dcf8e0309e008b0bf15d4b45a5c2e0a3eac2eb03fa9135f7a9cafd5R36 which is passed into https://github.com/rapidsai/cuspatial/pull/585/files#diff-ac0db319fff6fbea695e6a72ff71778e65aff8bafc696ea1ad2a1f374039b0d8R28 to parse a GeoSeries into pyarrow DenseUnion format.

How do you feel about altering the GeoArrow spec to be a DenseUnion like @trxcllnt suggested a year ago?

I included the above links to suggest how you could easily make pyarrow internals for GeoPandas. I might be able to contribute some to that effort!
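One reading of the length rule for lines and polygons can be sketched as follows. This is a hedged illustration only: the function name `classify_by_length` is hypothetical, the element is modeled as a plain list of parts rather than a real union element, and the linked cuspatial code may apply the rule differently.

```python
# Hypothetical sketch of the "Multi is decided by length" rule: an element is
# a list of parts; exactly one part is read as a single geometry, anything
# else as the Multi variant. How empty elements should be handled is not
# stated in the thread; this sketch simply labels them Multi as well.

def classify_by_length(kind, parts):
    if kind not in ("LineString", "Polygon"):
        raise ValueError(f"unsupported kind: {kind}")
    if len(parts) == 1:
        return kind, parts[0]
    return "Multi" + kind, parts


single = classify_by_length("LineString", [[(0, 0), (1, 1)]])
multi = classify_by_length("LineString", [[(0, 0), (1, 1)], [(2, 2), (3, 3)]])
print(single[0], multi[0])  # LineString MultiLineString
```

Under this reading no separate Multi child arrays are needed for lines and polygons, which matches the four-array layout described above, where only points have a dedicated multi array.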