Include Geometry and GeometryCollection arrays in spec #43

kylebarron · 2023-08-02T14:12:01Z

This updates the spec to include a Geometry array, represented by a union, and a GeometryCollection array, represented by a list of that union.

Closes #23

I have rust implementations of these here and here if anyone's curious.

paleolimbot

Thank you for writing this up! I really like this and think it's an important addition to the spec because a generic "geometry" is the cornerstone of most existing representations (e.g., GEOS).

For argument's sake, I'll also point out that one could also do:

Mixed == Struct<geometry_type: int8, multipolygon>

...where each unnecessary outer layer of nesting would have size 1 (unless the geometry was EMPTY). For example, POINT (0 1) would be encoded identically as MULTIPOLYGON (((0 1))) with the geometry_type as 1. That representation would have some extra overhead for each geometry but would retain the property of a single "coordinate" array. Union support is also not ubiquitous (e.g., not in Polars, cudf, Parquet, JavaScript Arrow, or C# Arrow).
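The struct-based alternative above can be sketched in plain Python, with nested lists standing in for Arrow list arrays. The names `to_mixed` and `NATIVE_DEPTH` are hypothetical and only illustrate the idea that every geometry is padded with size-1 outer layers until it has multipolygon nesting; this is not an implementation from the PR.

```python
# Illustrative only: nested Python lists stand in for Arrow list arrays.
# Type ids follow the WKB convention (1 = Point ... 6 = MultiPolygon).
POINT, LINESTRING, POLYGON, MULTIPOINT, MULTILINESTRING, MULTIPOLYGON = range(1, 7)

# Number of list levels each type already has around its coordinates.
NATIVE_DEPTH = {
    POINT: 0,
    LINESTRING: 1,
    POLYGON: 2,
    MULTIPOINT: 1,
    MULTILINESTRING: 2,
    MULTIPOLYGON: 3,
}


def to_mixed(geometry_type, coords):
    """Return (geometry_type, nested) where nested always has 3 list levels."""
    nested = coords
    for _ in range(3 - NATIVE_DEPTH[geometry_type]):
        nested = [nested]  # each added outer layer has size 1
    return geometry_type, nested


point = to_mixed(POINT, (0, 1))
multipolygon = to_mixed(MULTIPOLYGON, [[[(0, 1)]]])
```

Here `point[1]` and `multipolygon[1]` hold the same nested structure; only the `geometry_type` field distinguishes them.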

format.md

kylebarron · 2023-08-04T02:30:38Z

Mixed == Struct<geometry_type: int8, multipolygon>

That's interesting and hadn't occurred to me.

I do think we should explore that more, but in terms of implementation it seems a bit difficult. I really like the union conceptually because I was able to reuse existing implementations of strongly typed arrays per type. In particular the coordinates array I'd more aptly describe as a generic List[List[List[coordinates]]] rather than a "multi polygon array", because the semantics are very different.

For some use cases it's useful to have the arrays already separated by type (thinking of visualization).

But yes if union support is widely lacking, then that's a good reason to entertain something else.

paleolimbot · 2023-08-04T20:52:17Z

That's interesting and hadn't occurred to me.

It hadn't really occurred to me either before reading this PR! I didn't mean to spring this on you/anybody after year(s) of theoretical discussions about a Union-based solution.

In particular the coordinates array I'd more aptly describe as a generic List[List[List[coordinates]]] rather than a "multi polygon array", because the semantics are very different.

Definitely!

I do think we should explore that more, but in terms of implementation it seems a bit difficult. I really like the union conceptually because I was able to reuse existing implementations of strongly typed arrays per type.

That's true...pre-sorting the elements by geometry type probably means some operations are easier. You can probably do some re-using of implementations because, for example, if you have 12000 linestring and/or multilinestring elements that are all in a row, you can take a slice of one of the children and send it to the multilinestring implementation. If you're plotting outlines, you can pass the inner list to whatever is doing the plotting since it doesn't really care whether something is a ring, linestring, or part of a linestring.

The single-coordinate-array version is nice for column statistics, too. That would mean that all geoarrow-native encodings have exactly one column that contains the min/max statistics.

But yes if union support is widely lacking, then that's a good reason to entertain something else.

I'm not sure what happens when a union gets passed to cudf or polars or written to a parquet or sent to JavaScript...there might be some casts that happen that make the lack of support not important (or something we really hope would work might not). Basically, the version that I suggested means that we'd take on some union-like implementation details in place of leaning on the existing union implementation details for Arrow implementations. That may or may not be worth the trouble (or it may not be that much trouble since we don't have to support a fully generic union).

paleolimbot · 2023-12-12T18:21:11Z

Some updates since my last review!

Unions are definitely supported in JavaScript
I can implement unions in the R bindings (and they're already supported in the nanoarrow R bindings)
DuckDB now supports unions
The importance of putting these structures in Parquet files is limited.

The ability for implementations to lean on existing Arrow implementations for union support is probably a far more attractive/sustainable solution than implementing anything custom involving a struct/multipolygon solution.

Another problem with both the struct/multipolygon solution and the fixed-child union solution is that neither can handle mixed dimensions, whereas a Union/type id based solution with flexible children can handle this elegantly: the type_id buffer faithfully represents the geometry type/dimension, and the mapping of child index to type id is handled by the standard Arrow Union semantics (https://github.com/apache/arrow/blob/main/format/Schema.fbs#L149-L152).

Basically, ignore anything I said about structs and/or fixed memory layouts and see geoarrow/geoarrow-rs#308 for an implementation!
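For readers less familiar with the dense union layout being discussed, here is a minimal sketch in plain Python (no Arrow library involved). `build_dense_union` and `get` are hypothetical helpers, and the three-entry `TYPE_NAMES` table is an assumption made for brevity; the point is only to show how a type id buffer plus an offsets buffer index into per-type child arrays.

```python
# Assumed subset of type ids for this sketch; the type id selects the child.
TYPE_NAMES = {1: "point", 2: "linestring", 3: "polygon"}


def build_dense_union(geometries):
    """geometries: list of (type_id, coords). Returns (type_ids, offsets, children)."""
    type_ids, offsets = [], []
    children = {type_id: [] for type_id in TYPE_NAMES}
    for type_id, coords in geometries:
        type_ids.append(type_id)
        offsets.append(len(children[type_id]))  # position within that child
        children[type_id].append(coords)
    return type_ids, offsets, children


def get(union, i):
    """Look up element i by following its type id and offset."""
    type_ids, offsets, children = union
    return children[type_ids[i]][offsets[i]]


union = build_dense_union([
    (1, (0, 1)),
    (2, [(0, 0), (1, 1)]),
    (1, (2, 3)),
])
```

All points end up contiguous in one child and all linestrings in another, which is the property that makes reuse of the strongly typed per-geometry implementations possible.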

kylebarron · 2023-12-12T19:01:56Z

Thoughts on whether to include or exclude GeometryCollection in this PR?

paleolimbot · 2023-12-12T19:10:23Z

Go for it! I don't think it adds a lot of extra complexity (it's just List<mixed>, correct?).

Any thoughts on geoarrow.mixed vs geoarrow.geometry? The "geometry" as "generic set of geometry items" definitely has precedent (e.g., in R/sf, a generic set has the class sfc_GEOMETRY; in the Wikipedia WKB section of the WKT page, GEOMETRY has the type ID 0).

kylebarron · 2023-12-15T02:14:57Z

Go for it! I don't think it adds a lot of extra complexity (it's just List<mixed>, correct?).

This PR already contained it; it was more a question on whether it should be taken out 🙂

Any thoughts on geoarrow.mixed vs geoarrow.geometry? The "geometry" as "generic set of geometry items" definitely has precedent (e.g., in R/sf, a generic set has the class sfc_GEOMETRY; in the Wikipedia WKB section of the WKT page, GEOMETRY has the type ID 0).

At least in the rust crate, I've been referring to a GeometryArray as "a generic array that is one of the geoarrow (native) array types" (excluding wkb from the enum). (in terms of Rust design, I'm not 100% sure whether to make it an enum or a trait object, but either way, GeometryArray will be "either a PointArray or a LineStringArray etc").

In that sense, I originally used the term "MixedArray" to try and draw a distinction with the type of geometry. That is, our geoarrow.* suffixes refer to the type of a geometry, and so I don't like geoarrow.geometry because geometry doesn't add type meaning. After all a point is a geometry too.

To me this code would be confusing to read:

switch (geoarrow_name) {
  case "geoarrow.point":
  case "geoarrow.linestring":
  case "geoarrow.geometry":
    ...
}

paleolimbot · 2023-12-15T13:05:16Z

I agree that "mixed" is more descriptive; however I would prefer not to invent a new term for something that has existed for at least a decade (e.g., see the GeoPackage specification "geometry types" section: http://www.geopackage.org/spec/#geometry_types ).

kylebarron · 2023-12-26T20:56:23Z

I updated this to reflect our most recent discussion in chat. Namely

"mixed" -> "geometry"
Allow geometry collection as a child of geometry
some short text to require the type id ordering to be 1. Point, 2. LineString and so on

paleolimbot

Awesome! Just a few nits on metadata and would benefit from @jorisvandenbossche's take on it as well!

paleolimbot · 2023-12-27T01:06:03Z

extension-types.md

+For the geometry and geometry collection arrays, child arrays must include
+extension metadata.


Is this still necessary? For unions we could use either the type ID or the child name to communicate this information. Duplicating the CRS in each child metadata seems like it might not be a good idea?

kylebarron · 2024-01-02

I think now that we have stable type ids, we don't need the Geometry array's children to include extension metadata.

kylebarron · 2024-01-22

What if we have geometry-type-specific extension metadata in the future? E.g. a winding order flag from #49? If the children of the Geometry array don't have their own CRS, that would mean that the Geometry array's metadata would also need to optionally have a winding flag?

paleolimbot · 2024-01-22

Yes, I'd say the winding flag should live at the top level in that case. If you have a polygon marked as "wound" and a multipolygon (or nested something in a collection) that is not, for example, you'd have to drop the winding flag.

format.md

paleolimbot · 2024-01-03T01:08:41Z

format.md

+    37: GeometryCollection ZM
+    ```
+
+    This ordering was chosen to match the WKB specification.


Suggested change

This ordering was chosen to match the WKB specification.

These values were chosen to match the WKB specification exactly for 2D geometries and match the WKB specification conceptually for Z, M, and ZM geometries given the constraint that an Arrow union type ID must be between 0 and 127.

kylebarron · 2024-01-03

Is WKB or GeoPackage specification more precise?
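The relationship between WKB codes and union type ids described in the suggested wording can be sketched as follows. This assumes ISO WKB integer codes (where Z, M, and ZM variants add 1000, 2000, and 3000) and assumes the Z, M, and ZM blocks of type ids start at 10, 20, and 30. That pattern is consistent with the `37: GeometryCollection ZM` entry quoted above, but the full table is not shown in this excerpt. The function name is hypothetical.

```python
def wkb_code_to_type_id(wkb_code):
    """Map an ISO WKB geometry code (e.g. 3007) to a small union type id (e.g. 37)."""
    # dimension: 0 = XY, 1 = Z, 2 = M, 3 = ZM; geometry_type: 1..7
    dimension, geometry_type = divmod(wkb_code, 1000)
    if not (0 <= dimension <= 3 and 1 <= geometry_type <= 7):
        raise ValueError(f"unsupported WKB geometry code: {wkb_code}")
    # Stays well inside the 0..127 range allowed for Arrow union type ids.
    return dimension * 10 + geometry_type
```

For 2D geometries the result equals the WKB code itself, which is the "exactly" part of the suggested wording; for the other dimensions only the last digit matches, which is the "conceptually" part.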

paleolimbot

No further comments from me except the potential clarification you can take or leave!


kylebarron · 2024-01-03T01:32:39Z

Sounds good! In terms of implementation, having some integration tests here (or even a manual check) would be nice with your implementation, since it's less obvious that it's correct compared to the simpler arrays.

I'll wait for a few days to see if @jorisvandenbossche has time to review before merging.

paleolimbot · 2024-01-03T13:48:43Z

I think in general having an integration test setup is a good idea, and I think we're close to being able to do it (maybe not for the new mixed type yet, since I haven't started an implementation). Arrow has a system for this that I just implemented in nanoarrow ( https://arrow.apache.org/docs/format/Integration.html )...basically, you give a producer some data source and a geoarrow type, ask for a C data pointer, pass it to a consumer with the data source, and ask it to check for equality.

kylebarron · 2024-01-03T16:37:36Z

I think we can tackle that separately from this PR. It may even be simplest to start the integration testing just with interop through our python packages

kylebarron · 2024-01-07T21:10:50Z

format.md

+
+**GeometryCollection**: `List<Geometry>`
+
+An array of GeometryCollections is represented as a list containing the above geometry array. Each element of the array thus represents one or more geometries of varied type.


We should probably update this to include the field name expected in this list

kylebarron · 2024-01-09T23:34:46Z

How should nullability be handled here? For geoarrow.geometrycollection, should we disallow any null items in the children of the union array?

paleolimbot · 2024-01-10T00:41:05Z

For geoarrow.geometrycollection, should we disallow any null items in the children of the union array?

For other geometry types we also specify that nullability is only allowed at the outermost level, so I think it makes sense to disallow them for a child of a geometrycollection, too?

Unions are funny in that they don't have their own nullability, so you would have to add a child array and make that null to append one to the mixed type.

kylebarron · 2024-01-30T20:56:37Z

It would also be great to get feedback from cuSpatial developers, especially as this PR closes #23. cc @thomcom @trxcllnt. How similar is this to your existing data structures? I would guess that the ordering of type ids in this PR (e.g. a type id of 1 always means Point, and so on) may be different from your implementation?

kylebarron · 2024-05-30T09:33:42Z

One comment came up in geoarrow/geoarrow-rs#646 (comment). In practice, should the Geometry array include all possible child arrays or just the child arrays that have data? E.g. if you have points and polygons in one column, should you have a data type that only references points and polygons, or a data type that references all possible types? I suppose the former, although it's slightly annoying that you can't create the data type in isolation of data.

paleolimbot · 2024-05-30T12:50:34Z

It would be really nice to have a set of types that has a finite set of memory layouts, which is functionally what would happen if they all contained the same child elements. (There would still be the matrix of dimensions x coord type for the mixed type, but at least there would be a bounded number such that each one could have an ID or something). Another benefit would be that pyarrow.concat_arrays() could be used out of the box (I am not sure it would concatenate two unions that do not have an identical structure). I don't think there would be meaningful overhead to doing this because of the dense union layout.

To support geometry collections I think you wouldn't be able to do that (so maybe that's where the "in practice" comes in): the geometry collection member of the union would still need to be a list of something. If we limited the geometrycollection member to not contain geometrycollections (which I think covers almost all use cases), it could be a finite set of layouts but would be very verbose to specify (union<point, linestring, polygon, mpoint, mlinestring, mpolygon, list<union<point, linestring, polygon, mpoint, mlinestring, mpolygon>>>).
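To illustrate why identical children make concatenation straightforward, here is a plain-Python sketch of concatenating two dense unions. It does not describe what pyarrow actually does; `concat_dense_unions` is a hypothetical helper, and for simplicity the type id is assumed to equal the 0-based child index.

```python
def concat_dense_unions(a, b):
    """Concatenate two dense unions given as (type_ids, offsets, children).

    children is a list of per-type lists; both inputs must list the same
    children in the same order, even if some of them are empty.
    """
    a_type_ids, a_offsets, a_children = a
    b_type_ids, b_offsets, b_children = b
    if len(a_children) != len(b_children):
        raise ValueError("unions must have the same children")
    # Offsets from b must be shifted past the elements a already has
    # in the corresponding child.
    shifted = [off + len(a_children[t]) for t, off in zip(b_type_ids, b_offsets)]
    children = [x + y for x, y in zip(a_children, b_children)]
    return a_type_ids + b_type_ids, a_offsets + shifted, children


# Children are (points, linestrings, polygons); empty children cost nothing.
a = ([0, 1], [0, 0], [["p0"], ["l0"], []])
b = ([2, 0], [0, 0], [["p1"], [], ["g0"]])
combined = concat_dense_unions(a, b)
```

If the two inputs declared different sets of children, the type ids would first have to be remapped, which is the extra work a fixed set of children avoids.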

kylebarron · 2024-06-02T13:08:48Z

Is it true that if the types all have the same child ids, then the physical array layouts need to exist for all union children as well? I.e. I assume you can't have a type that describes 6 children but then export an array that only has 2 or 3 children?

paleolimbot · 2024-06-02T14:12:03Z

I assume you can't have a type that describes 6 children but then export an array that only has 2 or 3 children?

I believe that's true!

I think that having this type description as you have it here is good (i.e., it's "just" a union), with the convention that you either return union<point, linestring, polygon, mpoint, mlinestring, mpolygon> OR list<union<point, linestring, polygon, mpoint, mlinestring, mpolygon>> for maximum compatibility?

kylebarron · 2024-06-02T14:26:11Z

I think that having this type description as you have it here is good (i.e., it's "just" a union), with the convention that you either return union<point, linestring, polygon, mpoint, mlinestring, mpolygon> OR list<union<point, linestring, polygon, mpoint, mlinestring, mpolygon>> for maximum compatibility?

What do you mean by this OR? It seems the former type is a "geometry" and the latter is a "geometry collection"?

paleolimbot · 2024-06-02T17:38:01Z

It seems the former type is a "geometry" and the latter is a "geometry collection"?

What I had in mind was something like:

geoarrow.types.type_spec(Encoding.GEOARROW, GeometryType.GEOMETRY) -> union<point, linestring, polygon, mpoint, mlinestring, mpolygon>
geoarrow.types.type_spec(Encoding.GEOARROW, GeometryType.GEOMETRYCOLLECTION) -> list<union<point, linestring, polygon, mpoint, mlinestring, mpolygon>>

I suppose I only have a direct use-case for list<union<point, linestring, polygon, mpoint, mlinestring, mpolygon>> (a concrete type for the result of an arbitrary overlay operation like intersection or union). Even then, point, linestring, polygon are optional if they complicate things.

jorisvandenbossche · 2024-06-03T15:20:25Z

In your last comment, I would expect that the most generic "geometry" type also includes GeometryCollection support? So like union<point, linestring, ...., list<union<point, linestring, ...>>>

jorisvandenbossche · 2024-06-03T15:27:38Z

Other question: do we want to support mixed coordinate dimensionality?
A characteristic of the current "point", "linestring", "polygon", etc. encodings is that they always have a fixed dimension, depending on the storage type (xy, xyz, xym or xyzm). I think that is a nice characteristic, and I would expect that we provide (some) mixed type that follows that logic, e.g. it can have both points and linestrings, but they are still all 2D or all 3D.

Maybe there is also room and use case for a truly generic and mixed type, but a type with a bit more constraints seems nice as well (and for this one, it would reduce the number of type ids a lot in the definition of the union, as you only need one for each geometry type, and we could then require that each child field of the union should have the same coord dimensionality)

paleolimbot · 2024-06-03T20:18:10Z

So like union<point, linestring, ...., list<union<point, linestring, ...>>>

That's a great point! It's very verbose but more consistent (and more useful).

e.g. it can have both points and linestrings, but they are still all 2D or all 3D.

I would prefer the dimensionality to be the same to constrain the number of layouts an implementation would have to support (and allow those layouts to be compile-time constant). One can always include extra dimensions and fill them with nan if this is an issue. This concept is also in simple features, sort of (or at least: WKB/WKT geometrycollections can be marked as GEOMETRYCOLLECTION Z, although I don't know to what extent that is enforced).
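The "fill extra dimensions with nan" workaround mentioned above could look like the following minimal sketch, which only covers the XY to XYZ case. `pad_xy_to_xyz` is a hypothetical helper and coordinates are plain tuples rather than Arrow arrays.

```python
import math


def pad_xy_to_xyz(coords):
    """Promote 2D coordinates to XYZ, filling the missing Z with NaN."""
    return [(x, y, math.nan) for x, y in coords]


linestring_xy = [(0.0, 0.0), (1.0, 1.0)]
linestring_xyz = pad_xy_to_xyz(linestring_xy)
```

After padding, 2D and 3D geometries can share a single XYZ layout, at the cost of storing a NaN per 2D coordinate.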

jorisvandenbossche · 2024-06-04T08:41:09Z

So like union<point, linestring, ...., list<union<point, linestring, ...>>>

One drawback of this way of representing GeometryCollection (and my understanding is that the current version of this PR is like that), is that you no longer have the property of all points in one child array, all lines in another, etc.

I was thinking that in theory one could also do just list<union<point, linestring, ..., multipolygon>>, where then a length-1 list means one of the concrete types, and a longer list is a GeometryCollection (this means you wouldn't be able to roundtrip a GeometryCollection of one sub-geometry).

(this is essentially the format for GeometryCollection as currently written in the PR, except for not allowing recursive GeometryCollections, I think)

paleolimbot · 2024-06-04T12:19:15Z

I was thinking that in theory one could also do just list<union<point, linestring, ..., multipolygon>>

That would suffice for what I would use this for, and perhaps we could start here and add a top-level mixed version if this isn't needed? It is a good point that this version keeps all the linestrings (e.g.) together. The use-case for these is (probably) collecting truly arbitrary input and will almost always be followed by a step where a user errors, warns, or filters out unexpected geometry types.

The downside is that you can't roundtrip arbitrary WKB through it: POINT (0 1), GEOMETRYCOLLECTION (POINT (0 1)) would come back as either GEOMETRYCOLLECTION (POINT (0 1)), GEOMETRYCOLLECTION (POINT (0 1)) or POINT (0 1), POINT (0 1), depending on whether the WKB writer auto-simplifies. I am not sure that roundtripping WKB matters in practice, but it could be a nice property for testing.
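The roundtrip ambiguity can be made concrete with a small sketch. Geometries are modelled as `(type_name, payload)` tuples and the list<union<...>> storage as a plain Python list; `encode`, `decode`, and the `auto_simplify` flag are hypothetical names used only for illustration.

```python
def encode(geometry):
    """Store any geometry as a list of non-collection geometries."""
    kind, payload = geometry
    if kind == "GEOMETRYCOLLECTION":
        return list(payload)
    return [geometry]  # a concrete type becomes a length-1 list


def decode(items, auto_simplify):
    """Read a stored list back; a length-1 list is ambiguous."""
    if auto_simplify and len(items) == 1:
        return items[0]
    return ("GEOMETRYCOLLECTION", list(items))


point = ("POINT", (0, 1))
collection = ("GEOMETRYCOLLECTION", [point])

# Both inputs are stored identically, so one of them cannot roundtrip.
stored_point = encode(point)
stored_collection = encode(collection)
```

With `auto_simplify=True` the collection comes back as a bare point, and with `auto_simplify=False` the point comes back wrapped in a collection.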

Mixed and GeometryCollection arrays

55ffcb5

kylebarron requested review from jorisvandenbossche and paleolimbot August 2, 2023 14:12

kylebarron mentioned this pull request Aug 2, 2023

Mixed geometry types with DenseUnion #23

Open

paleolimbot reviewed Aug 3, 2023


paleolimbot mentioned this pull request Sep 8, 2023

feat: add geometric data types and functions substrait-io/substrait#543

Merged

kylebarron mentioned this pull request Nov 11, 2023

How to handle Postgis geometry column geoarrow/geoarrow-rs#229

Closed

kylebarron added 2 commits December 26, 2023 12:53

mixed -> geometry

f8bf939

Merge branch 'main' into kyle/mixed-gc

8b8a690

kylebarron changed the title ~~Include Mixed and GeometryCollection arrays in spec~~ Include Geometry and GeometryCollection arrays in spec Dec 26, 2023

wording

139789a

paleolimbot reviewed Dec 27, 2023


kylebarron added 2 commits January 2, 2024 16:12

edits

df518b0

remove extension requirement

87c603c

paleolimbot reviewed Jan 3, 2024


paleolimbot approved these changes Jan 3, 2024


Update format.md

fafe6b1

Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>

kylebarron commented Jan 7, 2024


paleolimbot mentioned this pull request Jan 31, 2024

Add GeoArrow encoding as an option to the specification opengeospatial/geoparquet#189

Merged

kylebarron mentioned this pull request Feb 16, 2024

Support Float16 data type pola-rs/polars#7288

Open

kylebarron mentioned this pull request Mar 21, 2024

polars has fixed_size_list and a rust extension framework/tool/something geopolars/geopolars#234

Open

kylebarron mentioned this pull request May 30, 2024

Use arrow-csv geoarrow/geoarrow-rs#646

Draft

