Include Geometry and GeometryCollection arrays in spec #43

Open
wants to merge 7 commits into
base: main

kylebarron
Member

@kylebarron kylebarron commented Aug 2, 2023

This updates the spec to include a Geometry array, represented by a union, and a GeometryCollection array, represented by a list of that union.

Closes #23

I have rust implementations of these here and here if anyone's curious.
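
For readers unfamiliar with the layout, here is a minimal pure-Python sketch of the idea described above. It is not the Rust implementation; the helper names (`geometry_at`, `collection_at`) and the sample data are hypothetical, and real Arrow arrays use typed buffers rather than Python lists.

```python
# Hypothetical, pure-Python model of a dense-union "Geometry" array.
# Type ids follow the convention discussed in this PR (1 = Point, 2 = LineString).
POINT, LINESTRING = 1, 2

# One child array per geometry type; coordinates are (x, y) tuples.
children = {
    POINT: [(0.0, 1.0), (5.0, 5.0)],
    LINESTRING: [[(0.0, 0.0), (1.0, 1.0)]],
}

# Dense union buffers: for logical element i, type_ids[i] selects the child
# and offsets[i] is the index into that child.
type_ids = [POINT, LINESTRING, POINT]
offsets = [0, 0, 1]


def geometry_at(i):
    """Return (type_id, value) for logical element i of the union."""
    return type_ids[i], children[type_ids[i]][offsets[i]]


# A GeometryCollection array is a list of the union above: list_offsets
# delimits which union elements belong to each collection.
list_offsets = [0, 2, 3]


def collection_at(j):
    """Return the geometries that make up collection j."""
    return [geometry_at(i) for i in range(list_offsets[j], list_offsets[j + 1])]
```

Here the first collection holds a point and a linestring, and the second holds a single point.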

Contributor

@paleolimbot paleolimbot left a comment

Thank you for writing this up! I really like this and think it's an important addition to the spec because a generic "geometry" is the cornerstone of most existing representations (e.g., GEOS).

For argument's sake, I'll also point out that one could also do:

Mixed == Struct<geometry_type: int8, multipolygon>

...where each unnecessary outer layer of nesting would have size 1 (unless the geometry was EMPTY). For example, POINT (0 1) would be encoded identically as MULTIPOLYGON (((0 1))) with the geometry_type as 1. That representation would have some extra overhead for each geometry but would retain the property of a single "coordinate" array. Union support is also not ubiquitous (e.g., not in Polars, cudf, Parquet, JavaScript Arrow, or C# Arrow).
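
A rough sketch of what this struct-based alternative could look like, assuming WKB-style codes for `geometry_type` and covering only the non-"multi" types plus multipolygon. The `NESTING` table and `to_struct_row` helper are hypothetical names for illustration, not part of any proposal text.

```python
# Number of list levels wrapped around a single coordinate for each type.
# Keys are WKB-style codes: 1 point, 2 linestring, 3 polygon, 6 multipolygon.
NESTING = {1: 0, 2: 1, 3: 2, 6: 3}
MULTIPOLYGON_DEPTH = 3


def to_struct_row(geometry_type, coords):
    """Wrap coords in size-1 lists until they have multipolygon nesting."""
    nested = coords
    for _ in range(MULTIPOLYGON_DEPTH - NESTING[geometry_type]):
        nested = [nested]
    return {"geometry_type": geometry_type, "multipolygon": nested}


# POINT (0 1) and MULTIPOLYGON (((0 1))) share the same nested coordinates
# and differ only in the geometry_type tag.
point_row = to_struct_row(1, (0.0, 1.0))
multipolygon_row = to_struct_row(6, [[[(0.0, 1.0)]]])
```

Every row ends up with the same physical layout, which is what keeps a single coordinate array.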

@kylebarron
Member Author

Mixed == Struct<geometry_type: int8, multipolygon>

That's interesting and hadn't occurred to me.

I do think we should explore that more, but in terms of implementation it seems a bit difficult. I really like the union conceptually because I was able to reuse existing implementations of strongly typed arrays per type. In particular the coordinates array I'd more aptly describe as a generic List[List[List[coordinates]]] rather than a "multi polygon array", because the semantics are very different.

For some use cases it's useful to have the arrays already separated by type (thinking of visualization).

But yes if union support is widely lacking, then that's a good reason to entertain something else.

@paleolimbot
Contributor

That's interesting and hadn't occurred to me.

It hadn't really occurred to me either before reading this PR! I didn't mean to spring this on you/anybody after year(s) of theoretical discussions about a Union-based solution.

In particular the coordinates array I'd more aptly describe as a generic List[List[List[coordinates]]] rather than a "multi polygon array", because the semantics are very different.

Definitely!

I do think we should explore that more, but in terms of implementation it seems a bit difficult. I really like the union conceptually because I was able to reuse existing implementations of strongly typed arrays per type.

That's true...pre-sorting the elements by geometry type probably means some operations are easier. You can probably do some re-using of implementations because, for example, if you have 12000 linestring and/or multilinestring elements that are all in a row, you can take a slice of one of the children and send it to the multilinestring implementation. If you're plotting outlines, you can pass the inner list to whatever is doing the plotting since it doesn't really care whether something is a ring, linestring, or part of a linestring.

The single-coordinate-array version is nice for column statistics, too. That would mean that all geoarrow-native encodings have exactly one column that contains the min/max statistics.

But yes if union support is widely lacking, then that's a good reason to entertain something else.

I'm not sure what happens when a union gets passed to cudf or polars or written to a parquet or sent to JavaScript...there might be some casts that happen that make the lack of support not important (or something we really hope would work might not). Basically, the version that I suggested means that we'd take on some union-like implementation details in place of leaning on the existing union implementation details for Arrow implementations. That may or may not be worth the trouble (or it may not be that much trouble since we don't have to support a fully generic union).

@paleolimbot
Contributor

Some updates since my last review!

  • Unions are definitely supported in JavaScript
  • I can implement unions in the R bindings (and they're already supported in the nanoarrow R bindings)
  • DuckDB now supports unions
  • The importance of putting these structures in Parquet files is limited.

The ability for implementations to lean on existing Arrow implementations for union support is probably a far more attractive/sustainable solution than implementing anything custom involving a struct/multipolygon solution.

Another problem with both the struct/polygon solution and fixed-child union solution is that neither can handle mixed dimensions, where a Union/type id based solution with flexible children can handle this elegantly: the type_id buffer faithfully represents the geometry type/dimension and the mapping of child index to type id is handled by the standard Union Arrow semantics ( https://github.com/apache/arrow/blob/main/format/Schema.fbs#L149-L152 )
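
As a small illustration of the type-id indirection mentioned above, here is a pure-Python sketch. The ids 1 (Point) and 3 (Polygon) follow the ordering discussed in this PR; 11 for Point Z is an assumed id used only for the example, and `child_for_row` is a hypothetical helper.

```python
# Type ids declared by a hypothetical union type, in child order.
# 1 = Point and 3 = Polygon follow the ordering discussed in this PR;
# 11 = Point Z is an assumed id used here only for illustration.
declared_type_ids = [3, 1, 11]

# Standard union semantics: the position of a type id in the declaration
# is the index of the child array that stores values with that id.
child_index = {type_id: index for index, type_id in enumerate(declared_type_ids)}


def child_for_row(type_id_buffer, row):
    """Return the index of the child array that holds the given row."""
    return child_index[type_id_buffer[row]]


# The type id buffer records the geometry type (and dimension) per row,
# independently of how the children happen to be ordered.
type_id_buffer = [1, 11, 3, 1]
rows_to_children = [child_for_row(type_id_buffer, r) for r in range(len(type_id_buffer))]
```

With flexible children, a column only needs to declare the types it contains, and the type ids keep their meaning regardless of child order.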

Basically, ignore anything I said about structs and/or fixed memory layouts and see geoarrow/geoarrow-rs#308 for an implementation!

@kylebarron
Member Author

Thoughts on whether to include or exclude GeometryCollection in this PR?

@paleolimbot
Contributor

Go for it! I don't think it adds a lot of extra complexity (it's just List<mixed>, correct?).

Any thoughts on geoarrow.mixed vs geoarrow.geometry? The "geometry" as "generic set of geometry items" definitely has precedent (e.g., in R/sf, a generic set has the class sfc_GEOMETRY; in the Wikipedia WKB section of the WKT page, GEOMETRY has the type ID 0).

@kylebarron
Member Author

kylebarron commented Dec 15, 2023

Go for it! I don't think it adds a lot of extra complexity (it's just List<mixed>, correct?).

This PR already contained it; it was more a question on whether it should be taken out 🙂

Any thoughts on geoarrow.mixed vs geoarrow.geometry? The "geometry" as "generic set of geometry items" definitely has precedent (e.g., in R/sf, a generic set has the class sfc_GEOMETRY; in the Wikipedia WKB section of the WKT page, GEOMETRY has the type ID 0).

At least in the rust crate, I've been referring to a GeometryArray as "a generic array that is one of the geoarrow (native) array types" (excluding wkb from the enum). (in terms of Rust design, I'm not 100% sure whether to make it an enum or a trait object, but either way, GeometryArray will be "either a PointArray or a LineStringArray etc").

In that sense, I originally used the term "MixedArray" to try and draw a distinction with the type of geometry. That is, our geoarrow.* suffixes refer to the type of a geometry, and so I don't like geoarrow.geometry because geometry doesn't add type meaning. After all a point is a geometry too.

To me this code would be confusing to read:

switch (geoarrow_name) {
  case "geoarrow.point":
  case "geoarrow.linestring":
  case "geoarrow.geometry":
    ...
}

@paleolimbot
Contributor

I agree that "mixed" is more descriptive; however I would prefer not to invent a new term for something that has existed for at least a decade (e.g., see the GeoPackage specification "geometry types" section: http://www.geopackage.org/spec/#geometry_types ).

@kylebarron kylebarron changed the title from "Include Mixed and GeometryCollection arrays in spec" to "Include Geometry and GeometryCollection arrays in spec" Dec 26, 2023
@kylebarron
Member Author

kylebarron commented Dec 26, 2023

I updated this to reflect our most recent discussion in chat. Namely

  • "mixed" -> "geometry"
  • Allow geometry collection as a child of geometry
  • some short text to require the type id ordering to be 1. Point, 2. LineString and so on

Contributor

@paleolimbot paleolimbot left a comment

Awesome! Just a few nits on metadata; this would benefit from @jorisvandenbossche's take as well!

Comment on lines 48 to 49
For the geometry and geometry collection arrays, child arrays must include
extension metadata.
Contributor

Is this still necessary? For unions we could use either the type ID or the child name to communicate this information. Duplicating the CRS in each child metadata seems like it might not be a good idea?

Member Author

I think now that we have stable type ids, we don't need for the Geometry array's children to include extension metadata

Member Author

What if we have geometry-type-specific extension metadata in the future? E.g. a winding order flag from #49? If the children of the Geometry array don't have their own CRS, that would mean that the Geometry array's metadata would also need to optionally have a winding flag?

Contributor

Yes, I'd say the winding flag should live at the top level in that case. If you have a polygon marked as "wound" and a multipolygon (or nested something in a collection) that is not, for example, you'd have to drop the winding flag.

format.md Outdated
37: GeometryCollection ZM
```

This ordering was chosen to match the WKB specification.
Contributor

Suggested change
This ordering was chosen to match the WKB specification.
These values were chosen to match the WKB specification exactly for 2D geometries and match the WKB specification conceptually for Z, M, and ZM geometries given the constraint that an Arrow union type ID must be between 0 and 127.
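
To make the relationship to WKB concrete, here is a hedged sketch of how such type ids could be computed. The Z/M/ZM offsets of 10/20/30 are an assumption consistent with the `37: GeometryCollection ZM` entry in the excerpt above; the ISO WKB offsets of 1000/2000/3000 are shown for comparison, and the function names are hypothetical.

```python
# Base geometry type codes shared with WKB.
GEOMETRY_TYPES = {
    "Point": 1,
    "LineString": 2,
    "Polygon": 3,
    "MultiPoint": 4,
    "MultiLineString": 5,
    "MultiPolygon": 6,
    "GeometryCollection": 7,
}

# Assumed union offsets per dimension (they must keep every id within 0..127).
UNION_DIM_OFFSET = {"XY": 0, "Z": 10, "M": 20, "ZM": 30}
# ISO WKB offsets per dimension, which would not fit in a union type id.
ISO_WKB_DIM_OFFSET = {"XY": 0, "Z": 1000, "M": 2000, "ZM": 3000}


def union_type_id(geometry_type, dimension="XY"):
    """Return the union type id under the assumed 10/20/30 offsets."""
    type_id = GEOMETRY_TYPES[geometry_type] + UNION_DIM_OFFSET[dimension]
    if not 0 <= type_id <= 127:
        raise ValueError("Arrow union type ids must be between 0 and 127")
    return type_id


def iso_wkb_type(geometry_type, dimension="XY"):
    """Return the ISO WKB geometry type code for comparison."""
    return GEOMETRY_TYPES[geometry_type] + ISO_WKB_DIM_OFFSET[dimension]
```

For 2D geometries the two functions agree exactly; for Z, M, and ZM only the pattern (dimension offset plus base code) is shared.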

Member Author

Is WKB or GeoPackage specification more precise?

Contributor

@paleolimbot paleolimbot left a comment

No further comments from me except the potential clarification you can take or leave!

Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
@kylebarron
Member Author

Sounds good! In terms of implementation, having some integration tests here (or even a manual check) would be nice with your implementation, since it's less obvious that it's correct compared to the simpler arrays.

I'll wait for a few days to see if @jorisvandenbossche has time to review before merging.

@paleolimbot
Contributor

I think in general having an integration test setup is a good idea, and I think we're close to being able to do it (maybe not for the new mixed type yet, since I haven't started an implementation). Arrow has a system for this that I just implemented in nanoarrow ( https://arrow.apache.org/docs/format/Integration.html )...basically, you give a producer some data source and a geoarrow type, ask for a C data pointer, pass it to a consumer with the data source, and ask it to check for equality.

@kylebarron
Member Author

I think we can tackle that separately from this PR. It may even be simplest to start the integration testing just with interop through our python packages


**GeometryCollection**: `List<Geometry>`

An array of GeometryCollections is represented as a list containing the above geometry array. Each element of the array thus represents one or more geometries of varied type.
Member Author

We should probably update this to include the field name expected in this list

@kylebarron
Member Author

How should nullability be handled here? For geoarrow.geometrycollection, should we disallow any null items in the children of the union array?

@paleolimbot
Contributor

For geoarrow.geometrycollection, should we disallow any null items in the children of the union array?

For other geometry types we also specify that nullability is only allowed at the outermost level, so I think it makes sense to disallow them for a child of a geometrycollection, too?

Unions are funny in that they don't have their own nullability, so you would have to add a child array and make that null to append one to the mixed type.

@kylebarron
Member Author

It would also be great to get feedback from cuSpatial developers, especially as this PR closes #23. cc @thomcom @trxcllnt. How similar is this to your existing data structures? I would guess that the ordering of type ids in this PR (e.g. a type id of 1 always means Point, and so on) may be different from your implementation?

@kylebarron
Member Author

One comment came up in geoarrow/geoarrow-rs#646 (comment). In practice, should the Geometry array include all possible child arrays or just the child arrays that have data? E.g. if you have points and polygons in one column, should you have a data type that only references points and polygons, or a data type that references all possible types? I suppose the former, although it's slightly annoying that you can't create the data type in isolation of data.

@paleolimbot
Contributor

It would be really nice to have a set of types that has a finite set of memory layouts, which is functionally what would happen if they all contained the same child elements. (There would still be the matrix of dimensions x coord type for the mixed type, but at least there would be a bounded number such that each one could have an ID or something). Another benefit would be that pyarrow.concatenate() could be used out of the box (I am not sure it would concatenate two unions that do not have an identical structure). I don't think there would be meaningful overhead to doing this because of the dense union layout.
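
To illustrate why identical child structure makes concatenation straightforward, here is a pure-Python sketch of concatenating two dense unions. It does not describe what pyarrow does; `concat_dense_unions` and the dict-based representation are hypothetical.

```python
def concat_dense_unions(a, b):
    """Concatenate two dense unions that declare the same children.

    Each union is a dict with 'type_ids', 'offsets', and 'children'
    (a mapping from type id to a list of values).
    """
    if a["children"].keys() != b["children"].keys():
        raise ValueError("unions must declare the same children")
    # Children are concatenated pairwise; empty children add no values.
    children = {t: a["children"][t] + b["children"][t] for t in a["children"]}
    # Offsets from b must be shifted by the length of the matching child in a.
    shifted = [off + len(a["children"][t]) for t, off in zip(b["type_ids"], b["offsets"])]
    return {
        "type_ids": a["type_ids"] + b["type_ids"],
        "offsets": a["offsets"] + shifted,
        "children": children,
    }


square = [[(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 0.0)]]
first = {"type_ids": [1], "offsets": [0], "children": {1: [(0.0, 0.0)], 3: []}}
second = {"type_ids": [3, 1], "offsets": [0, 0], "children": {1: [(1.0, 1.0)], 3: [square]}}
combined = concat_dense_unions(first, second)
```

Note that `first` declares a polygon child even though it holds no polygons; in a dense union that empty child stores no values.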

To support geometry collections I think you wouldn't be able to do that (so maybe that's where the "in practice" comes in): the geometry collection member of the union would still need to be a list of something. If we limited the geometrycollection member to not contain geometrycollections (which I think covers almost all use cases), it could be a finite set of layouts but would be very verbose to specify (union<point, linestring, polygon, mpoint, mlinestring, mpolygon, list<union<point, linestring, polygon, mpoint, mlinestring, mpolygon>>>).

@kylebarron
Member Author

Is it true that if the types all have the same child ids, then the physical array layouts need to exist for all union children as well? I.e. I assume you can't have a type that describes 6 children but then export an array that only has 2 or 3 children?

@paleolimbot
Contributor

I assume you can't have a type that describes 6 children but then export an array that only has 2 or 3 children?

I believe that's true!

I think that having this type description as you have it here is good (i.e., it's "just" a union), with the convention that you either return union<point, linestring, polygon, mpoint, mlinestring, mpolygon> OR list<union<point, linestring, polygon, mpoint, mlinestring, mpolygon>> for maximum compatibility?

@kylebarron
Member Author

I think that having this type description as you have it here is good (i.e., it's "just" a union), with the convention that you either return union<point, linestring, polygon, mpoint, mlinestring, mpolygon> OR list<union<point, linestring, polygon, mpoint, mlinestring, mpolygon>> for maximum compatibility?

What do you mean by this OR? It seems the former type is a "geometry" and the latter is a "geometry collection"?

@paleolimbot
Contributor

It seems the former type is a "geometry" and the latter is a "geometry collection"?

What I had in mind was something like:

  • geoarrow.types.type_spec(Encoding.GEOARROW, GeometryType.GEOMETRY) -> union<point, linestring, polygon, mpoint, mlinestring, mpolygon>
  • geoarrow.types.type_spec(Encoding.GEOARROW, GeometryType.GEOMETRYCOLLECTION) -> list<union<point, linestring, polygon, mpoint, mlinestring, mpolygon>>

I suppose I only have a direct use-case for list<union<point, linestring, polygon, mpoint, mlinestring, mpolygon>> (concrete type for the result of an arbitrary overlay operation like intersection or union). Even then, point, linestring, polygon are optional if they complicate things.

@jorisvandenbossche
Contributor

In your last comment, I would expect that the most generic "geometry" type also includes GeometryCollection support? So like union<point, linestring, ...., list<union<point, linestring, ...>>>

@jorisvandenbossche
Contributor

Other question: do we want to support mixed coordinate dimensionality?
A characteristic of the current "point", "linestring", "polygon", etc encodings is that they always have a fixed dimension, depending on the storage type (xy, xyz, xym or xyzm). I think that is a nice characteristic, and I would expect that we provide (some) mixed type that follows that logic, e.g. it can have both points and linestrings, but they are still all 2D or all 3D.

Maybe there is also room and use case for a truly generic and mixed type, but a type with a bit more constraints seems nice as well (and for this one, it would reduce the number of type ids a lot in the definition of the union, as you only need one for each geometry type, and we could then require that each child field of the union should have the same coord dimensionality)

@paleolimbot
Contributor

So like union<point, linestring, ...., list<union<point, linestring, ...>>>

That's a great point! It's very verbose but more consistent (and more useful).

e.g. it can have both points and linestrings, but they are still all 2D or all 3D.

I would prefer the dimensionality to be the same to constrain the number of layouts an implementation would have to support (and allow those layouts to be compile-time constant). One can always include extra dimensions and fill them with nan if this is an issue. This concept is also in simple features, sort of (or at least: WKB/WKT geometrycollections can be marked as GEOMETRYCOLLECTION Z, although I don't know to what extent that is enforced).
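
As a sketch of the "fill extra dimensions with nan" idea, assuming coordinates are plain tuples (the `pad_coords` helper is a hypothetical name):

```python
import math


def pad_coords(coords, n_dims):
    """Pad each coordinate tuple with nan up to n_dims ordinates."""
    return [tuple(c) + (math.nan,) * (n_dims - len(c)) for c in coords]


# 2D points stored in a layout whose children are all xyz.
padded = pad_coords([(0.0, 1.0), (2.0, 3.0)], 3)
```

Coordinates that already have `n_dims` ordinates pass through unchanged, so mixed input can share one fixed-dimension layout.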

@jorisvandenbossche
Contributor

So like union<point, linestring, ...., list<union<point, linestring, ...>>>

One drawback of this way of representing GeometryCollection (and my understanding is that the current version of this PR is like that), is that you no longer have the property of all points in one child array, all lines in another, etc.

I was thinking that in theory one could also do just list<union<point, linestring, ..., multipolygon>>, where a length-1 list means one of the concrete types and a longer list is a GeometryCollection (this means you wouldn't be able to roundtrip a GeometryCollection of one sub-geometry).

(this is essentially the format for GeometryCollection as currently written in the PR, except for not allowing recursive GeometryCollections, I think)

@paleolimbot
Contributor

I was thinking that in theory one could also do just list<union<point, linestring, ..., multipolygon>>

That would suffice for what I would use this for, and perhaps we could start here and add a top-level mixed version if this isn't needed? It is a good point that this version keeps all the linestrings (e.g.) together. The use-case for these is (probably) collecting truly arbitrary input and will almost always be followed by a step where a user errors, warns, or filters out unexpected geometry types.

The downside is that you can't roundtrip arbitrary WKB through it: POINT (0 1), GEOMETRYCOLLECTION (POINT (0 1)) would come back as either GEOMETRYCOLLECTION (POINT (0 1)), GEOMETRYCOLLECTION (POINT (0 1)) or POINT (0 1), POINT (0 1), depending on whether the WKB writer auto-simplifies or not. I am not sure that roundtripping WKB matters in practice, but it could be a nice property for testing.
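
The round-trip loss can be shown with a small pure-Python sketch. The `encode` and `decode` helpers and the tuple representation are hypothetical; the `simplify` flag stands in for a writer that auto-simplifies length-1 lists.

```python
def encode(geom):
    """Encode a (kind, payload) geometry as a list of members (list<union> only)."""
    kind, payload = geom
    if kind == "GeometryCollection":
        return list(payload)  # the members of the collection
    return [geom]  # a concrete geometry becomes a length-1 list


def decode(members, simplify):
    """Decode a list of members, optionally unwrapping length-1 lists."""
    if simplify and len(members) == 1:
        return members[0]
    return ("GeometryCollection", list(members))


point = ("Point", (0.0, 1.0))
collection = ("GeometryCollection", [point])

# Both inputs produce the same encoded list, so the distinction is lost.
same_encoding = encode(point) == encode(collection)
```

Whichever choice the writer makes, one of the two inputs comes back as a different geometry type than it went in.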

Successfully merging this pull request may close these issues.

Mixed geometry types with DenseUnion
3 participants