-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Include Geometry and GeometryCollection arrays in spec #43
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for writing this up! I really like this and think it's an important addition to the spec because a generic "geometry" is the cornerstone of most existing representations (e.g., GEOS).
For argument's sake, I'll also point out that one could also do:
Mixed
== Struct<geometry_type: int8, multipolygon>
...where each unnecessary outer layer of nesting would have size 1 (unless the geometry was EMPTY). For example, POINT (0 1)
would be encoded identically as MULTIPOLYGON (((0 1)))
with the geometry_type
as 1
. That representation would have some extra overhead for each geometry but would retain the property of a single "coordinate" array. Union support is also not ubiquitous (e.g., not in Polars, cudf, Parquet, JavaScript Arrow, or C# Arrow).
That's interesting and hadn't occurred to me. I do think we should explore that more, but in terms of implementation it seems a bit difficult. I really like the union conceptually because I was able to reuse existing implementations of strongly typed arrays per type. In particular the coordinates array I'd more aptly describe as a generic For some use cases it's useful to have the arrays already separated by type (thinking of visualization). But yes if union support is widely lacking, then that's a good reason to entertain something else. |
It hadn't really occurred to me either before reading this PR! I didn't mean to spring this on you/anybody after year(s) of theoretical discussions about a Union-based solution.
Definitely!
That's true...pre-sorting the elements by geometry type probably means some operations are easier. You can probably do some re-using of implementations because, for example, if you have 12000 linestring and/or multilinestring elements that are all in a row, you can take a slice of one of the children and send it to the multilinestring implementation. If you're plotting outlines, you can the inner The single-coordinate-array version is nice for column statistics, too. That would mean that all geoarrow-native encodings have exactly one column that contains the min/max statistics.
I'm not sure what happens when a union gets passed to cudf or polars or written to a parquet or sent to JavaScript...there might be some casts that happen that make the lack of support not important (or something we really hope would work might not). Basically, the version that I suggested means that we'd take on some union-like implementation details in place of leaning on the existing union implementation details for Arrow implementations. That may or may not be worth the trouble (or it may not be that much trouble since we don't have to support a fully generic union). |
Some updates since my last review!
The ability for implementations to lean on existing Arrow implementations for union support is probably a far more attractive/sustainable solution than implementing anything custom involving a struct/multipolygon solution. Another problem with both the struct/polygon solution and fixed-child union solution is that neither can handle mixed dimensions, where a Union/type id based solution with flexible children can handle this elegantly: the type_id buffer faithfully represents the geometry type/dimension and the mapping of child index to type id is handled by the standard Union Arrow semantics ( https://github.com/apache/arrow/blob/main/format/Schema.fbs#L149-L152 ) Basically, ignore anything I said about structs and/or fixed memory layouts and see geoarrow/geoarrow-rs#308 for an implementation! |
Thoughts on whether to include or exclude |
Go for it! I don't think it adds a lot of extra complexity (it's just Any thoughts on |
This PR already contained it; it was more a question on whether it should be taken out 🙂
At least in the rust crate, I've been referring to a In that sense, I originally used the term "MixedArray" to try and draw a distinction with the type of geometry. That is, our To me this code would be confusing to read: switch (geoarrow_name) {
case "geoarrow.point":
case "geoarrow.linestring":
case "geoarrow.geometry":
...
} |
I agree that "mixed" is more descriptive; however I would prefer not to invent a new term for something that has existed for at least a decade (e.g., see the GeoPackage specification "geometry types" section: http://www.geopackage.org/spec/#geometry_types ). |
I updated this to reflect our most recent discussion in chat. Namely
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! Just a few nits on metadata and would benefit from @jorisvandenbossche's take on it as well!
extension-types.md
Outdated
For the geometry and geometry collection arrays, child arrays must include | ||
extension metadata. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still necessary? For unions we could use either the type ID or the child name to communicate this information. Duplicating the CRS in each child metadata seems like it might not be a good idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think now that we have stable type ids, we don't need for the Geometry array's children to include extension metadata
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we have geometry-type-specific extension metadata in the future? E.g. a winding order flag from #49? If the children of the Geometry
array don't have their own CRS, that would mean that the Geometry
array's metadata would also need to optionally have a winding
flag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'd say the winding flag should live at the top level in that case. If you have a polygon marked as "wound" and a multipolygon (or nested something in a collection) that is not, for example, you'd have to drop the winding flag.
format.md
Outdated
37: GeometryCollection ZM | ||
``` | ||
|
||
This ordering was chosen to match the WKB specification. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ordering was chosen to match the WKB specification. | |
These values were chosen to match the WKB specification exactly for 2D geometries and match the WKB specification conceptually for Z, M, and ZM geometries given the constraint that an Arrow union type ID must be between 0 and 127. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is WKB
or GeoPackage
specification more precise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No further comments from me except the potential clarification you can take or leave!
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
Sounds good! In terms of implementation, having some integration tests here (or even a manual check) would be nice with your implementation, since it's less obvious that it's correct compared to the simpler arays. I'll wait for a few days to see if @jorisvandenbossche has time to review before merging. |
I think in general having an integration test setup is a good idea, and I think we're close to being able to do it (maybe not for the new mixed type yet, since I haven't started an implementation). Arrow has a system for this that I just implemented in nanoarrow ( https://arrow.apache.org/docs/format/Integration.html )...basically, you give a producer some data source and a geoarrow type, ask for a C data pointer, pass it to a consumer with the data source, and ask it to check for equality. |
I think we can tackle that separately from this PR. It may even be simplest to start the integration testing just with interop through our python packages |
|
||
**GeometryCollection**: `List<Geometry>` | ||
|
||
An array of GeometryCollections is represented as a list containing the above geometry array. Each element of the array thus represents one or more geometries of varied type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably update this to include the field name expected in this list
How should nullability be handled here? For |
For other geometry types we also specify that nullability is only allowed at the outermost level, so I think it makes sense to disallow them for a child of a geometrycollection, too? Unions are funny in that they don't have their own nullability, so you would have to add a child array and make that null to append one to the mixed type. |
It would also be great to get feedback from cuSpatial developers, especially as this PR closes #23. cc @thomcom @trxcllnt. How similar is this to your existing data structures? I would guess that the ordering of type ids in this PR (e.g. a type id of 1 always means Point, and so on) may be different from your implementation? |
One comment came up in geoarrow/geoarrow-rs#646 (comment). In practice, should the Geometry array include all possible child arrays or just the child arrays that have data? E.g. if you have points and polygons in one column, should you have a data type that only references |
It would be really nice to have a set of types that has a finite set of memory layouts, which is functionally what would happen if they all contained the same child elements. (There would still be the matrix of dimensions x coord type for the mixed type, but at least there would be a bounded number such that each one could have an ID or something). Another benefit would be that To support geometry collections I think you wouldn't be able to do that (so maybe that's where the "in practice" comes in): the geometry collection member of the union would still need to be a list of something. If we limited the geometrycollection member to not contain geometrycollections (which I think covers almost all use cases), it could be a finite set of layouts but would be very verbose to specify ( |
Is it true that if the types all have the same child ids, then the physical array layouts need to exist for all union children as well? I.e. I assume you can't have a type that describes 6 children but then export an array that only has 2 or 3 children? |
I believe that's true! I think that having this type description as you have it here is good (i.e., it's "just" a union), with the convention that you either return |
What do you mean by this |
What I had in mind was something like:
I suppose I only have a direct use-case for |
In your last comment, I would expect that the most generic "geometry" type also includes GeometryCollection support? So like |
Other question: do we want to support mixed coordinate dimensionality? Maybe there is also room and use case for a truly generic and mixed type, but a type with a bit more constraints seems nice as well (and for this one, it would reduce the number of type ids a lot in the definition of the union, as you only need one for each geometry type, and we could then require that each child field of the union should have the same coord dimensionality) |
That's a great point! It's very verbose but more consistent (and more useful).
I would prefer the dimensionality to be the same to constrain the number of layouts an implementation would have to support (and allow those layouts to be compile-time constant). One can always include extra dimensions and fill them with |
One drawback of this way of representing GeometryCollection (and my understanding is that the current version of this PR is like that), is that you no longer have the property of all points in one child array, all lines in another, etc. I was thinking that in theory one could also do just (this is essentially the format for GeometryCollection as currently written in the PR, except for not allowing recursive GeometryCollections, I think) |
That would suffice for what I would use this for, and perhaps we could start here and add a top-level mixed version if this isn't needed? It is a good point that this version keeps all the linestrings (e.g.) together. The use-case for these is (probably) collecting truly arbitrary input and will almost always be followed by a step where a user errors, warns, or filters out unexpected geometry types. The downside is that you can't roundtrip arbitrary WKB through it (i.e., |
This updates the spec to include a Geometry array, represented by a union, and a GeometryCollection array, represented by a list of that union.
Closes #23
I have rust implementations of these here and here if anyone's curious.