Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mixed geometry types with DenseUnion #23

Open
thomcom opened this issue Jul 21, 2022 · 26 comments · May be fixed by #43
Open

Mixed geometry types with DenseUnion #23

thomcom opened this issue Jul 21, 2022 · 26 comments · May be fixed by #43

Comments

@thomcom
Copy link

thomcom commented Jul 21, 2022

Hey @jorisvandenbossche and @kylebarron, I wanted to share that an updated implementation of GeoArrow is up for review in cuspatial: http://www.github.com/rapidsai/cuspatial/pull/585.

I use the arrow DenseUnion type to contain four arrays: points, mpoints, lines, and polygons. Whether or not they are Multi is identified whether or not a DenseUnion __getitem__ is length of 1 or more.

In particular look at: https://github.com/rapidsai/cuspatial/pull/585/files#diff-3cc28b8293d42e4a558968c1722e9a1a3e14af2386ad97345221e23f9007ecdeR237-R276 which contains the arrow-to-Shapely logic that I use to create a GeoPandas df from pure arrow buffers

Also look at: https://github.com/rapidsai/cuspatial/pull/585/files#diff-2c7a56464dcf8e0309e008b0bf15d4b45a5c2e0a3eac2eb03fa9135f7a9cafd5R36 which is passed into https://github.com/rapidsai/cuspatial/pull/585/files#diff-ac0db319fff6fbea695e6a72ff71778e65aff8bafc696ea1ad2a1f374039b0d8R28 to parse a GeoSeries into pyarrow DenseUnion format.

How do you feel about altering the GeoArrow spec to be a DenseUnion like @trxcllnt suggested a year ago?

I included the above links to suggest how you could easily make pyarrow internals for GeoPandas. I might be able to contribute some to that effort!

@kylebarron
Copy link
Member

kylebarron commented Jul 21, 2022

like @trxcllnt suggested a year ago

Can you link to this? When searching for DenseUnion I found this. Maybe it's this comment you're referring to?

altering the GeoArrow spec to be a DenseUnion

My interpretation was that the conclusion of #4 was that we wanted to go with a nested array approach, and not a struct approach. They're very similar in physical arrow memory but the nested list approach has the added benefit that the logical layout is familiar to anyone who has used e.g. GeoJSON.

Is there a reason you can't use the nested list approach? You can't have a different table for each geometry type?

Judging from what has been already merged into the repo (https://github.com/geopandas/geo-arrow-spec/blob/main/format.md#format), my thought was the consensus was to use a single geometry type per table, and otherwise to fall back on WKB.

@thomcom
Copy link
Author

thomcom commented Jul 22, 2022

Hey @kylebarron ! Thanks for the super quick response.

So, we did decide to go with a nested array, and the nested array is implemented in http://www.github.com/rapidsai/cuspatial/pull/585 .

My original implementation had a lot of extra logic for remembering the ordering of Features in a particular GeoSeries - a strength, perhaps, of GeoPandas is to allow any Features, in an order, with indexing. What I figured out while I was fixing the bug in cuspatial's GeoArrow identified in https://notebooksharing.space/view/517f3172b12354804179f248247ab5ffd6573214e9f9810d13494533f1aefd8a#displayOptions= is that what I had done was re-implement DenseUnion, natively, so that I could support abstract ordering and heterogeneity.

https://github.com/rapidsai/cuspatial/blob/8718c149ec56da2c408f7b7214f927f89e51ba5d/python/cuspatial/cuspatial/geometry/pygeoarrow.py contains a GeoArrow spec that includes the DenseUnion. ArrowPointsType, ArrowMultiPointsType, ArrowLineStringsType, and ArrowPolygonsType are the four nested arrays that we agreed upon. The DenseUnion contains two more pieces of metadata: an array of the types of each item in the Union, and an array of the per type offset of each item in the union. This allowed me to have mixed types in the exact fashion that GeoPandas handles GeoSeries.

As it is presented it still satisfies the original agreement behind GeoArrow: Each of the four children arrays in the Union, accessible via union.child(0) etc, are the Nested Array Lists. The DenseUnion is just a bonus, and one that I think fits the standard well.

@trxcllnt
Copy link

trxcllnt commented Jul 22, 2022

@kylebarron We're using Arrow's DenseUnion to represent multiple geometry types in the same column so we can provide a similar high-level API (in terms of functionality) to geopandas.GeoSeries. Each geometric type the Union can contain are themselves implemented as lists/nested lists.

Exploding a single GeoSeries into multiple Tables doesn't allow us to preserve relative row ordering efficiently. We'd either have to fill the "not-this-geometry-type" slots of each table with nulls (like a SparseUnion would do), or re-create the "typeIdToChild" and "valueOffsets" mappings that DenseUnion already has, at which point we've just recreated DenseUnion.

@kylebarron
Copy link
Member

kylebarron commented Jul 26, 2022

Thanks for the explanation; that makes a lot of sense.

I'm wondering... do you think the DenseUnion approach should be part of the core spec? Or some sort of extension? Would you have every geoarrow column be contained within a dense union?

I'm mainly worried about a growing spec becoming harder to implement. For context, I've been exploring geoarrow in rust, and while the core Arrow libraries support the entire Arrow spec, libraries built on Arrow do not necessarily. The most popular dataframe library in Rust is Polars, and even getting them to consider support for FixedSizeList, which we need for today's geoarrow memory layout, is a hard sell out of worries of added complexity (pola-rs/polars#4014).

@thomcom
Copy link
Author

thomcom commented Jul 26, 2022

I'm willing to include it in the core spec because it allows us to represent and transmit a singular object as a GeoArrow datasource. However, I understand the concern about implementation complexity. I think the good news is that even with DenseUnion, the underlying FixedSizeLists are still present and trivial to implement and access. A DenseUnion is literally just a struct that holds the offsets buffer, the types buffer, and n Lists. Anyone who bothers to read those three structures has the four FixedSizeLists that we agreed upon and don't require any work.

DenseUnion is attractive also because it allows for heterogeneous, arbitrarily ordered FeatureSets. Without it, all points, mpoints, etc, must always be grouped together and can't be reordered without recreating at minimum the child buffers.

@kylebarron
Copy link
Member

I think the good news is that even with DenseUnion, the underlying FixedSizeLists are still present and trivial to implement and access. A DenseUnion is literally just a struct that holds the offsets buffer, the types buffer, and n Lists. Anyone who bothers to read those three structures has the four FixedSizeLists that we agreed upon and don't require any work.

I'm not sure I agree with this conclusion. If you're exclusively working with the low-level Arrow buffers, then yes it's easy to unpack a union array. But I expect that a union array might violate some of the assumptions that higher level libraries impose. Looking at polars again as an example, it expects that each column is strongly typed, and a union column might violate many assumptions baked in to the library. While in theory a column could be downcasted to an Arrow Union array, I could imagine such a suggestion to be refused out of complexity concerns.

In #22 @paleolimbot is suggesting to have a variety of different arrow extension type names. I.e.:

  • geoarrow.point: FixedSizeList<double>[n_dim]
  • geoarrow.linestring: List<FixedSizeList<double>[n_dim]>
  • geoarrow.polygon: List<List<FixedSizeList<double>[n_dim]>>
  • geoarrow.multipoint: List<FixedSizeList<double>[n_dim]>
  • geoarrow.multilinestring: List<List<FixedSizeList<double>[n_dim]>>
  • geoarrow.multipolygon: List<List<List<FixedSizeList<double>[n_dim]>>>
  • geoarrow.wkb: Binary

What would you think about defining a new extension type for mixed data, say geoarrow.mixed, where the definition is a DenseUnion including the above types?

  • geoarrow.mixed: DenseUnion<geoarrow.point, geoarrow.linestring, geoarrow.polygon, geoarrow.multipoint, geoarrow.multilinestring, geoarrow.multipolygon>

(I think if I understand correctly your mixed-type columns are not necessarily the same as a GeometryCollection column? Because the mixed type geometries would not necessarily be within a single row?)

For users who desire only a single geometry type, they can mark their data as one of the above types. For users like yourselves who require mixed geometry types, you can "opt-in" to the mixed type, but some consumers otherwise might not support receiving geoarrow.mixed data.

@trxcllnt
Copy link

trxcllnt commented Jul 28, 2022

Looking at polars again as an example, it expects that each column is strongly typed, and a union column might violate many assumptions baked in to the library,

A Union is no less strongly typed than any other arrow dtype. If Polars doesn't support Unions, they probably should (after all, they claim to support Arrow!). It's unreasonable to compromise on memory layout because an unrelated library doesn't support it.

While in theory a column could be downcasted to an Arrow Union array

What do you mean by "downcasting" to an Arrow Union array? Unions are nested types -- if anything, you'd up-cast a one of the potential sub-dtypes into a singley-typed Union.

(I think if I understand correctly your mixed-type columns are not necessarily the same as a GeometryCollection column? Because the mixed type geometries would not necessarily be within a single row?)

Not sure what you mean? A Union's row can only be one of the types represented by the Union. If you have a Union<[point | linestring]>, then each row can be either a point or a linestring (and this information is encoded in the Union's typeIds buffer).

For users who desire only a single geometry type, they can mark their data as one of the above types.

Users can still have columns of a single geometry dtype if they need. This work is about representing columns with mixed geometry dtypes, i.e. [ Point(0, 1), MultiPolygon(...), LineString(...), etc. ]. Making this possible is why Arrow Unions exist.

For users like yourselves who require mixed geometry types, you can "opt-in" to the mixed type, but some consumers otherwise might not support receiving geoarrow.mixed data.

Anyone using the Arrow libraries can send and receive Union Arrays. If Polars wants to force their users to transform their data before they use Polars, then that's a decision I assume they're making with full knowledge of how much more difficult it makes their users' lives.

@kylebarron
Copy link
Member

I think me referencing polars as an example was probably a mistake for the purposes of this thread.

Regardless, my essential question is: are you proposing that every usage of geoarrow be wrapped in a dense union? If a user wishes to store a point column, the spec should disallow FixedSizeList<double>[n_dim] and require the top-level Arrow type be a DenseUnion instead?

I have no issue at all with the spec defining support for mixed geometries; I'm just trying to figure out the pros and cons to requiring all geoarrow data be stored within a DenseUnion.

@trxcllnt
Copy link

are you proposing that every usage of geoarrow be wrapped in a dense union?

No, definitely not. The DenseUnion is only for representing columns with multiple geometry dtypes interleaved (to provide parity to geopandas.GeoSeries). In rapidsai/cuspatial#585 @thomcom added cuspatial.GeoSeries that can be constructed from a gpd.GeoSeries (or Arrow DenseUnion of the same types).

@kylebarron
Copy link
Member

No, definitely not.

Apologies for the misunderstanding. I'm +1 on adding support for a DenseUnion, I was just hesitant about requiring all data to be within a DenseUnion, which I mistakenly thought you were suggesting. 😅

@paleolimbot
Copy link
Contributor

Hi all, sorry for the delay (summer holidays, conference, Arrow release, etc.)...I wanted to make sure to give this proper reading since it's an important one!

I'm definitely in support of a DenseUnion as a way to represent heterogeneous arrays of geometries. I think that the cuSpatial's approach is good, generally, although I have a few notes about details that I'm happy to discuss!

When I envisioned the union type...and with the brief discussions I've had with @jorisvandenbossche about this...I envisioned a union that did not have a fixed set of specified members. For example, a kernel computing a boundary of a Union[linestring, multipolygon_z] would have an output type of Union[point, multilinestring_z].

I get how it's potentially useful to only ever have one data type (in the type.Equals() sense) representing a heterogeneous collection of geometries, but I rather like that the types and dimensions of the members are immediately inspectable (even in C since the names are just const char*). For those who do need exactly one data type, I would have expected Union[multipoint, multilinestring, multipolygon], personally, but I don't see a problem to adding in points in there if it makes your internals faster/prettier.

One advantage of the non fixed-members thing is that we can represent nested collections: the geometrycollections member of the union can itself be a Union[]. This is a bit of an esoteric detail but allows a more literal translation of something of a kernel using GEOSDelaunayTriangulation(), which happens to return a GEOMETRYCOLLECTION (containing polygons).

@kylebarron kylebarron changed the title Improved GeoArrow in cuSpatial with native pyarrow DenseUnion Mixed geometry types with DenseUnion Oct 15, 2022
@kylebarron
Copy link
Member

@thomcom I updated the title on this to try and reflect the discussion being mostly around DenseUnion. Happy to revert if you prefer the old title

@kylebarron
Copy link
Member

One advantage of the non fixed-members thing is that we can represent nested collections: the geometrycollections member of the union can itself be a Union[]. This is a bit of an esoteric detail but allows a more literal translation of something of a kernel using GEOSDelaunayTriangulation(), which happens to return a GEOMETRYCOLLECTION (containing polygons).

I still need to read up on the layout specifics of DenseUnion and Union, but being able to represent geometry collections seems powerful. Could that even allow us to remove/not need WKB?

@thomcom
Copy link
Author

thomcom commented Oct 17, 2022

We'd definitely like to leave WKB behind since it has fundamental I/O limitations. DenseUnion is working really well in cuspatial now, with a GeoSeries ala GeoPandas allowing arbitrary geometry collections. The only limitation I can think of is via nested DenseUnion, which I don't have an understanding of presently. IIRC in WKB you can have a collection of collections, which I can't do in cuspatial at the moment.

@Maxxen
Copy link

Maxxen commented Nov 21, 2022

I agree that we should try to leave WKB behind and try to encode geometries using a native columnar representation. I'm currently working on geospatial support for DuckDB and have so far settled for a similar nested schema to what has been discussed here. I've also implemented Union types (although they are "sparse" for now, not sure if "dense" implementation would be worth it complexity/performance wise since consecutive nulls are pretty cheap for us to store) specifically to use in mixed-geometry scenarios. We similarly don't have any plans to support recursive geometry collections (or types in general).

Although from my own experience I rarely find that mixed geometries are all that useful or common, in which case allowing "sparse" union implementations might be an acceptable complexity-performance tradeoff? My impression is that you basically always split/validate/categorize any raw data into specific geometry types before further analysis, mixed-geometry columns are mostly just an immediate representation used in the initial ETL step. It would be interesting to hear a use-case from someone who uses them regularly.

@paleolimbot
Copy link
Contributor

consecutive nulls are pretty cheap for us to store

Is that also true for points? For list()s that makes sense because it's just a 64-bit integer append to the offsets, but at least in Arrow a null point takes the same space as a non-null point.

mixed-geometry columns are mostly just an immediate representation

I think that's true, or maybe I would add an intermediate representation. For example, when you do an intersection in GEOS you might get a POINT (vertices touching), LINESTRING (edges touching), POLYGON (overlapping), or any combination of those (GEOMETRYCOLLECTION). For the purposes of query engines that require an output type as a function of the input type (Arrow's query engine, Acero, or Substrait's spec), that function has to be able to represent the output type/data somehow (even if the immediate next step is a cast to one of the geometry types). For what it's worth, S2 lets you pick what dimension output you want and a GEOS-based kernel could emulate that.

Recursive geometry collections and collections that have mixed dimensions are also only useful as intermediate representations...the next step is almost always to flatten them or drop dimensions or fill empty dimensions. It could probably work to not be able to represent them, but then there would be need to be a set of conversion options for import of any type where this happens (in particular mixed dimensions comes up more than you'd think). I haven't prototyped these yet but I think it might not be all that hard to support both.

@Maxxen
Copy link

Maxxen commented Nov 21, 2022

Is that also true for points? For list()s that makes sense because it's just a 64-bit integer append to the offsets, but at least in Arrow a null point takes the same space as a non-null point.

yeah maybe I could have been clearer, In-memory they still take the same space, but once they hit the disk nulls are squashed as part of our compression scheme afaik.

I think that's true, or maybe I would add an intermediate representation. For example, when you do an intersection in GEOS you might get a POINT (vertices touching), LINESTRING (edges touching), POLYGON (overlapping), or any combination of those (GEOMETRYCOLLECTION).

Good call, I remember I actually brought this exact example up internally when we discussed adding Unions, but in our SQL case we could've just as well returned a struct or use a table-returning function instead to represent the multiple return values. That said I do think that it makes sense to have a GeometryCollection type, I'm just not sure how "advanced" it needs to be, so if implementation complexity is a concern my two cents would be to rather go for a list of sparse unions (or even a list of structs) than not having it at all.

It could probably work to not be able to represent them, but then there would be need to be a set of conversion options for import of any type where this happens (in particular mixed dimensions comes up more than you'd think).

This was my plan to tackle recursive collections as well, flatten and provide a "group" or "depth" column as well on import. Although I am secretly hoping they won't be that common, (I know the GeoJSON spec discourages nested GeometryCollections) but I might be wrong about that.
Regarding collections of mixed-dimension geometries I think I would opt to convert and zero all entries to fit the highest specified dimension, but that might also be overridable with an import option.

@kylebarron
Copy link
Member

Coming back to this issue as I'm thinking of how to implement a prototype in Rust. For the (non-nested) geometry collection array, you could just implement a List[Union[point | line | ...]] right? That seems to be what @Maxxen is suggesting in the previous comment if I'm reading it correctly?

@paleolimbot
Copy link
Contributor

It seems like List[Union[multipoint | multiline | multipolygon]] would be the simplest version that could store an arbitrary "these are the points for which..." (e.g., the result of an arbitrary st_intersects()). I think cuSpatial's version also includes point but it's not clear exactly what the motivation was for that from this thread. The simplicity of 3 children instead of 6 is appealing to me but as long as the memory layout is fixed I think it's OK.

@kylebarron
Copy link
Member

In terms of the specification I was thinking of proposing a geoarrow.mixed type that can contain any combination of other geometry types (for now just the 6 primitive types, but potentially in the future also a recursive geoarrow.geometrycollection if we cared to support recursive geometry collections).

This means that implementors can choose whether they care to have the overhead of always storing multi-geometries, or whether they have a use case to store, for example Union[point | polygon]. In that case, the outer array would be defined as geoarrow.mixed, and then the inner fields would have extension types of geoarrow.point and geoarrow.polygon. Then a geoarrow.GeometryCollection could be defined as a list of the mixed type

@kylebarron
Copy link
Member

I have a mostly-working "mixed geometry array" implementation (i.e. not supporting geometry collections) in Rust here geoarrow/geoarrow-rs#122. It passes a simple test round-tripping a collection of points, lines, and polygons, but it needs a little more work to ensure O(1) export to other arrow libraries (around the ordering of the types array).

I'm thinking of fleshing out a full prototype of this and a geometry collection array and then writing a spec PR here for discussion 🙂

@kylebarron
Copy link
Member

I put up #43 for discussion! Feedback is appreciated!

@paleolimbot
Copy link
Contributor

In reviewing #43 I had a thought about how this might be able to work without a Union, since union support does not currently exist in cudf, polars, native Parquet, JavaScript Arrow, R Arrow, or C# Arrow. In cudf, the cuSpatial implementation takes care of the custom reimplementation of the DenseUnion; however, I wonder if it is reasonable that other projects will add that support since this type would, in theory, be rather common for GeoArrow implementations.

A good example of an error that seems to make a DenseUnion-based solution problematic is:

import pyarrow as pa
import pyarrow.parquet as pq

mixed_type = pa.union(
    [pa.field("", pa.int32())],
    mode="dense"
)

table = pa.Table.from_batches([], schema=pa.schema([pa.field("col", mixed_type)]))
pq.write_table(table, "some_file.parquet")
#> ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema conversion: dense_union<: int32=0>

For argument's sake, a type in the form Struct<geometry_type: int32, list<list<list<coordinate>>>> would also work. The unnecessary levels of nesting (e.g., for POINT) would all have a size of 1 (e.g., POINT (0 1) would be stored identically to MULTIPOLYGON (((0 1)))) except for EMPTY. This has the useful property that an empty POINT can be represented and also that primitive geometries are also valid MULTI geometries. It has another useful property, which is that there is exactly one coordinate array: this means that we get bounding box statistics automatically in Parquet with most Parquet writers/Parquet-based storage formats (e.g., Iceberg).

@trxcllnt
Copy link

JavaScript Arrow supports unions!

@paleolimbot
Copy link
Contributor

Apologies! It may be worth updating https://github.com/apache/arrow/blob/main/docs/source/status.rst . It should be said that I can knock R off that list too since it's fairly straightforward to wrap the C++ implementation from R.

@kylebarron
Copy link
Member

I made a PR to update that table: apache/arrow#37108

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants