
Update CRS section of metadata specification #40

Open · wants to merge 3 commits into main

Conversation

paleolimbot (Contributor)

This PR updates the documentation for the "crs" metadata key to (1) better align with the GeoParquet specification and (2) ensure that producers have the option to provide whatever CRS information they have available. Real-world examples of producers where this matters are:

  • PostGIS database driver. The only information available in the SPATIAL_REF_SYS table is auth_name, auth_srid, srtext, and proj4text.
  • GeoPackage database driver. The only relevant information available in the gpkg_spatial_ref_sys table is srs_name, srs_id, and definition.
  • Shapefile reader: the .prj sidecar file contains "the crs" but can contain it in various (unspecified) forms.

Allowing flexibility in the CRS definition is a balance between making it easy for producers to participate (and pass on information that is essential for interpreting coordinates) and making it easy for CRS-aware consumers to consume GeoArrow arrays.
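
As a rough illustration (all values here are hypothetical and abbreviated), a producer like the GeoPackage driver could pass its definition string through verbatim, while a PROJ-aware producer would write a PROJJSON object:

# A driver that only has a definition string from gpkg_spatial_ref_sys passes it on as-is:
metadata_from_gpkg_driver = {
    "crs": 'PROJCS["NAD83 / UTM zone 20N", ...]'  # whatever "definition" contained
}

# A producer with PROJ available writes a PROJJSON object instead:
metadata_with_projjson = {
    "crs": {"type": "GeographicCRS", "name": "WGS 84"}  # abbreviated PROJJSON
}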

The other notable change here is that an omitted CRS defaults to longitude/latitude. This is to align with the GeoParquet specification and I recall that decision was arrived at after some very long debates. I don't particularly agree with this decision but I do see the value of this assumption in the data engineering world where there are a lot of practitioners without geospatial experience.

Comment on lines 51 to 56
information about the CRS. Note that regardless of the axis
order specified by the CRS, axis order will be interpreted
according to the wording in the
[GeoPackage WKB binary encoding](https://www.geopackage.org/spec130/index.html#gpb_format):
axis order is always (longitude, latitude) and (easting, northing)
regardless of the axis order encoded in the CRS specification.
Contributor

I would move this "Note that regardless .." down outside the "One of" sublist, since this is relevant for all different ways you can specify the CRS.

Comment on lines 50 to 51
This key can also be omitted if the producer does not have any
information about the CRS. Note that regardless of the axis
Contributor

The "this key can also be omitted .. " will need to be updated.

@jorisvandenbossche (Contributor)

+1 on adding the flexibility to allow any generic string repr (with the recommendation to use PROJJSON)

The other notable change here is that an omitted CRS defaults to longitude/latitude. This is to align with the GeoParquet specification and I recall that decision was arrived at after some very long debates. I don't particularly agree with this decision but I do see the value of this assumption in the data engineering world where there are a lot of practitioners without geospatial experience.

While I certainly agree the consistency with GeoParquet is nice, I do think that some of the arguments for making this choice in GeoParquet don't fully apply here. IIRC, for GeoParquet, one of the reasons was to make it easier to produce data without having a full-fledged geo stack available that supports PROJJSON. But, in our case, since we don't require the JSON format, you can actually also easily put "EPSG:4326" if you know it's lon/lat.

Also, for GeoArrow, one of the main use cases is to pass information along with the data from one system to another. I think we should very strongly recommend always being explicit about the CRS in such a case (while I know that for GeoParquet some argued for treating an absent CRS field as an indication of lon/lat).

@jorisvandenbossche (Contributor)

Another option could be to say that the CRS metadata is not optional, so we force producers to be explicit (explicitly lon/lat or null, instead of one of those as the default). Not sure I like this myself though.

@paleolimbot (Contributor Author)

Agreed on all points! I removed the "omitted == lon/lat" option...I think there is a real danger that accidentally omitting the CRS will result in spurious interpretations.

@kylebarron (Member)

Linking this to a bunch of previous discussion in #8

@paleolimbot (Contributor Author)

Thanks for all comments! If there are no objections, I'm planning to merge tomorrow afternoon 🙂

Comment on lines 48 to 53
- A JSON object describing the coordinate reference system (CRS)
using [PROJJSON](https://proj.org/specifications/projjson.html).
- A string containing an undefined CRS representation. This option
is intended as a fallback for producers (e.g., database drivers or
file readers) that are provided a CRS in some form but do not have the
means by which to convert it to PROJJSON.
Member

If it's either JSON or a random string, how do you know whether to try to parse it as json?

Member

If we're going to allow anything I think it would be a lot easier to have different keys for different value types

Contributor Author

The entire metadata is JSON, so at the point that it can be inspected it will already have been parsed (if using the JSON option). If using the "random string" option, it will remain an (unparsed) string.

  • Full metadata with CRS encoded as JSON: {"crs": {"key": "value"}}
  • Full metadata with CRS encoded as random string: {"crs": "EPSG:4326"}

I don't know of a language where differentiating between those two cases is not trivial (e.g., in Python it could be isinstance(metadata["crs"], str); in R it could be is.character(metadata$crs); in C the differentiation happens during parsing...if the first non-whitespace character after : is {, it's a JSON CRS).
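
A minimal Python sketch of that check (the helper name and metadata bytes here are illustrative, not from the spec):

import json

def read_crs(extension_metadata: bytes):
    """Return (crs, is_projjson) from serialized geoarrow extension metadata."""
    metadata = json.loads(extension_metadata)
    crs = metadata.get("crs")  # None if the key was omitted entirely
    if isinstance(crs, dict):
        return crs, True       # PROJJSON arrives already parsed as a JSON object
    return crs, False          # opaque string (or None)

assert read_crs(b'{"crs": "EPSG:4326"}') == ("EPSG:4326", False)
assert read_crs(b'{"crs": {"type": "GeographicCRS"}}') == ({"type": "GeographicCRS"}, True)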

Is there something here that I missed?

Member

Sorry I forgot that this was already inside a JSON object. I thought for a second that it was a top-level key in the Arrow extension metadata

@kylebarron (Member)

Maybe it would be useful to consider flatgeobuf's approach of having various ways to represent the CRS: https://github.com/flatgeobuf/flatgeobuf/blob/b013ac326568914aee43bcf3aa3c77574d629aa5/src/fbs/header.fbs#L58-L65. Or STAC's PROJ extension, which allows any of epsg, wkt, projjson but with informative keys.

  • PostGIS database driver. The only information available in the SPATIAL_REF_SYS table is auth_name, auth_srid, srtext, and proj4text.

In this case what did you have in mind for the CRS value in geoarrow? If the application has PROJ on hand, it can convert the WKT to PROJJSON, but otherwise it dumps the WKT in the same field? Then you're trying to maybe parse as JSON and maybe parse as WKT? Sounds like a mess.

@kylebarron (Member)

The other notable change here is that an omitted CRS defaults to longitude/latitude.

Did this get changed back in the PR?

@paleolimbot (Contributor Author)

Did this get changed back in the PR?

Yes - the updated text now reads that a missing CRS indicates that no information about the CRS is known.

@paleolimbot (Contributor Author)

Then you're trying to maybe parse as JSON and maybe parse as WKT?

I think the difference between the two linked specs is that the only representations available there are strings. Here, we have all of JSON available to us, so it can be an object, a string, a number, or a null; PROJJSON is already helpfully parsed by whatever is parsing the metadata. We could add a "crs_type" key alongside the "crs" to disambiguate the parsing of the string if this is a problem for implementors?

Sounds like a mess.

At the end of the day it's pretty much all just proj_create() ( https://github.com/OSGeo/PROJ/blob/master/src/proj.h#L589 ) which doesn't let you specify any of that anyway.

@kylebarron (Member)

We could add a "crs_type" key alongside the "crs" to disambiguate the parsing of the string if this is a problem for implementors?

I was thinking this might be useful

@paleolimbot (Contributor Author)

It very well may be! I think I would prefer to defer adding it for now and introduce it with a specific usage difficulty in mind. Adding the key later would be backward compatible (if absent, it would continue to mean that the producer does not know what the CRS representation is). Or if there's a current example of a usage difficulty we could tackle it now?

I would hope that implementations only use the "random string" method if absolutely necessary (e.g., anything with access to PROJ should be writing a PROJJSON object, which will be true for anything coming from GeoPandas or the R spatial stack). All of the use-cases I have in mind for the "random string" method involve not knowing anything about the representation, so I can't immediately think of a time I would actually set the value.
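
For example (a sketch assuming pyproj is available; pyproj's CRS.to_json() returns a PROJJSON string):

import json
from pyproj import CRS

# A PROJ-backed producer can always emit the recommended PROJJSON object
crs = CRS.from_user_input("EPSG:4326")
extension_metadata = {"crs": json.loads(crs.to_json())}
serialized = json.dumps(extension_metadata)  # what ends up in the Arrow field metadata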

@kylebarron (Member)

What about converting, e.g., from FlatGeobuf to GeoArrow without access to PROJ, such as in the browser or Node? FlatGeobuf doesn't include PROJJSON, so I'd have to include one of the other FlatGeobuf CRS forms, but it seems like it would be useful to know that the CRS is e.g. WKT or an EPSG code.

@paleolimbot (Contributor Author)

It definitely might be useful but does come with non-trivial implementation complexity. I'm game to add the change and do the implementation...I think my point here is just that it's a consideration that we can defer until one of us runs into an implementation problem (e.g., when it can be demonstrated that there is a performance cost to guessing, that there are cases where guessing is not reliable, or that there is an implementation that does not have the capability to guess). There are quite a lot of things to implement in the meantime!

@kylebarron (Member)

Being able to put anything in there just smells bad. Reading back through the discussion in #8... is 4326 valid input? Do we defer to PROJ for what is valid input? In #8 we didn't like WKID because it's only meaningful if you also have a spatial reference table defined. PROJ happens to default 4326 to looking up in the EPSG database, but that's not that explicit. EPSG:4326 as an (authority, code) pair is definitely better, but does that mean 4326 should be allowed but discouraged, or disallowed?

I agree there are other things to implement in the meantime, but it feels like it's always easier to expand a spec than narrow it down.

@paleolimbot (Contributor Author)

The sheer number of options is I think the reason why the specification should not enter into the discussion lightly. This change is intended to empower producers to be good citizens and make a clear path for them to pass on CRS information they have (the other options being to omit it or attempt to pass it along using a method not defined in this specification).

It's a valid discussion whether or not there should be a "crs_type" key to provide even more information; however, its contents, and how those contents constrain the contents of the "crs" key, are a potentially large can of worms and have a much wider scope than what I'm trying to do with this PR. Even if there is a "crs_type" key, there still needs to be a fallback option for producers that just don't know: this PR enables that fallback option.

I added some language to hopefully clarify the intent of allowing a string representation (and reiterate that if a producer needs any kind of guarantee about how a consumer will interpret the CRS, they need to use PROJJSON).

@jorisvandenbossche (Contributor)

With some delay, trying to wrap my head around this discussion (and sorry for the long post .. and more questions than answers)

To start, I think there is agreement on allowing something beyond PROJJSON, right? (with typical use case: implementation of some reader that doesn't have access to PROJ to convert the format-specific CRS representation to PROJJSON)

Then the next question is: what exactly do we allow? Do we want to be strict and clearly define a strict set of options (for example like the stac extension having epsg, wkt2 or projjson options)? Or do we want to be flexible and also allow essentially any, potentially even unknown, format (but always as a string, I suppose)?
Some thoughts on this question:

  • If we would limit it to a strict set of options, which ones would we go with? In addition to PROJJSON, also allow WKT2 and AUTH:CODE (as generalization of EPSG), similarly as stac? And then, how strictly would we define each of those options? For example is just any kind of WKT fine, or strictly WKT2 (although also here there are already multiple versions ..)? Is any authority fine? (or the ones recognized by PROJ's db, but that's again tied to a specific implementation)
    If we start to strictly define it, it might quickly become complicated, just because this topic/ecosystem is that way.
  • Not trying to strictly specify the content would make those questions easier for us (the geoarrow spec). Of course, that's not exactly solving the problem, but rather avoiding it / moving it to the user or consumer to handle. But for a transport mechanism, that stance might be fine?
  • And only allowing a strict subset of options, does that actually solve the driving use cases for this PR?

Let's take the example of flatgeobuf: this format has a WKT field, but is that guaranteed to be WKT2? My understanding is that it is not (flatgeobuf/flatgeobuf@c1e9f89), and so if we require WKT2, that would still be a problem (although for this use case you could also go the auth:code way, as that will typically be available in the fgb file).
Or the example of a custom shapefile reader, which typically uses ESRI's flavor of WKT. Would that be allowed as "wkt" value? Or do we add a custom type for that ("wkt_esri")?

Another question might also be: do we want to allow multiple formats at the same time? (i.e. in the same metadata) For example, STAC and FlatGeoBuf essentially do this, allowing to store multiple representations (and PostGIS spatial_ref_sys table in the end does the same as well)
I would personally say this might not be that important, but it would need a change of the single "crs" field name.

And a final question: if we want to allow multiple formats, how to specify the specific format that is being used?
The option in this PR is essentially: don't specify it, just let the user guess (with the idea that the common formats PROJJSON, WKT or AUTH:CODE are easy to distinguish, but it also allows anything else).
Another option, which Kyle mentions above and which was also discussed on the original issue (#8 (comment)) is to add an additional field "crs_encoding" (or "crs_type", or ..) which can optionally be added and specifies the format of the "crs" field value.
Some thoughts here:

  • If we only want to allow a strict subset of options, we could also add fields "crs_wkt2", "crs_epsg", .. instead of a generic "crs_encoding" (although the latter is probably more flexible if we want to extend it in the future)
  • If we allow specifying the "crs_encoding", we then have to specify whether the value here is only allowed to be a specific subset of well-defined options (related to the first question above), or if this is more free form.
  • I do agree that if you know that you have eg WKT, it seems better to be able to specify this.
  • If we do want to allow any free form text for the actual CRS, we could still support this with something like "crs_encoding": "unknown" if we want to be explicit.
  • What I like about the "crs_encoding" option, with the specification that the default for this field (if omitted) is PROJJSON, is that this keeps it compatible with what we already have and with the GeoParquet metadata. I think that would be a good outcome.
  • (if we didn't have the previous bullet point as history, I think I might like something like "crs": {"type": .., "value": ..}, or {"crs": {"wkt": ..}})

@paleolimbot (Contributor Author)

Thank you for the detailed summary! These are all good questions.

The motivation for this PR is to solve a right now problem: it is currently not possible for non-spatial aware producers to include CRS metadata. From the previous discussion, I think there are two ways to solve that:

  • Non-spatial-aware producers pass {"crs": "<whatever information the producer has>"} as extension type metadata.
  • Non-spatial-aware producers pass {"crs_encoding": "unknown", "crs": "<whatever information the producer has>"}.

Neither of these options would change the case where the producer has no CRS information (the "crs" key would be omitted); nor would they change how a spatial-aware producer SHOULD write the "crs" key (PROJJSON as a JSON object with no "crs_encoding" key). I do not think that the initial version of the spec should/needs to provide language about any CRS encoding other than PROJJSON or unknown (although I am happy to review PRs if anybody else wants to come up with that language).

Given the content of this PR, I obviously feel that the first option is best. This is because I don't know of any situation in which the content of "crs_encoding" would be inspected (in general, but also very much for the case where its value is "unknown").

I am primarily working to embed this specification with engineers who don't know about GIS and need convincing that any geospatial type metadata is important to propagate (I'm billing "crs" as a timezone but, like, for map things). I am happy to advocate that "crs_encoding" is also important geospatial metadata to propagate; however, I have not heard any arguments yet that suggest that its value would ever be inspected or used.

We could also provide some guidance as documentation for engineers writing producers that might not have any background in GIS (e.g., on AUTH:CODE, WKT2 vs WKT, etc.); however, that change is more closely tied to adding a "crs_encoding" option. In my opinion, at the time we specify guidance on how to encode CRS values other than PROJJSON, a "crs_encoding" would be appropriate.

@hobu commented Sep 6, 2023

A couple of points since this is exactly the same discussion we had with GeoJSON and you know where we ended up with that 😏

  • There was no JSON encoding of CRS when GeoJSON was created
  • For non-PROJ computing environments, creation of PROJJSON is often just a key-value lookup that returns a string/object, especially for the common CRSs like 3857 and 4326. Parsing and unpacking actual coordinate descriptions requires significant software. The only thing PROJJSON has going for it vs. WKT is that you can push it through json-schema to validate its syntax (this itself is a huge improvement IMO, as most WKT writers write syntactically invalid WKT due to it being difficult and a bit ambiguous).
  • The first iteration of GeoJSON (the non-IETF variety) had something like @paleolimbot described, where the writer could attempt to describe the CRS encoding. From what we could tell no one used it because actual interchange of data was happening in common coordinate systems (3857 or 4326), or people were writing their own extensive metadata to store things in other coordinate systems (these people showed up to complain about GeoJSON only purporting interop with 4326 even though writers could write coordinates in whatever they wanted).

It all boils down to "do I really need infinite possibilities of coordinate systems supported by my format?" The answer to that is almost always 'NO'. Now that I'm old, I've come to rely on one true thing... if GDAL's SetFromUserInput reads it, it's good enough. If it can't, what's the point of writing CRS info in the first place? 😄
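
A minimal sketch of the key-value lookup approach from the first bullet above (PROJJSON values abbreviated; a real table would carry complete documents):

# Ship a tiny table of pre-built PROJJSON documents for the handful of CRSs an
# application actually emits; no PROJ required at runtime.
COMMON_CRS_PROJJSON = {
    "EPSG:4326": {
        "type": "GeographicCRS",
        "name": "WGS 84",
        # ... remaining PROJJSON members elided for brevity ...
    },
    "EPSG:3857": {
        "type": "ProjectedCRS",
        "name": "WGS 84 / Pseudo-Mercator",
        # ...
    },
}

def projjson_for(code):
    """Return a PROJJSON dict for a known code, or None if it is not in the table."""
    return COMMON_CRS_PROJJSON.get(code)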

@paleolimbot (Contributor Author)

Thank you! One of the challenges with this spec is that we're not quite sure exactly how (or if?) it will be used. We sort of have three audiences in mind:

  • Data engineers looking for a way to get geometry columns from one place to another using arrow-based tooling (Flight, IPC, C Data interface, etc.). This audience is almost 100% lon/lat and occasionally gets glazed-looking eyes at the mention that a geometry type needs any parameters whatsoever.
  • Users of spatial data science tools (sf, geopandas, geojulia) looking to get geometry columns from one place to another using arrow-based tooling. This audience cares deeply about the CRS but typically has access to PROJ, so writing/reading CRS from PROJJSON is not a limitation.
  • Developers of very low-level tools (e.g., vectorized compute functions, file readers, file writers). This is the audience this PR is targeted at: the people (person? It may just be me...) writing libraries below the level of PROJ and GDAL. These libraries are mostly about using arrow-based tooling to pipe the data to a place where PROJ and/or GDAL exists.

Not all of those audiences may end up using this spec...I suppose with this PR I am asking for some flexibility to prototype those low-level tools within the (draft) spec. If nobody uses them, I have no problem tightening the spec back up (just as GeoJSON did!).

@hobu commented Sep 7, 2023

  • Users of spatial data science tools (sf, geopandas, geojulia) looking to get geometry columns from one place to another using arrow-based tooling. This audience cares deeply about the CRS but typically has access to PROJ, so writing/reading CRS from PROJJSON is not a limitation.

This is the reason why I've been implementing GeoParquet and Arrow (not GeoArrow, just XYZ columns of Arrow IPC) to PDAL. See PDAL/PDAL#4115

@kylebarron (Member)

person? It may just be me...

Hey me too! 😂 There's two of us!

This is the reason why I've been implementing GeoParquet and Arrow (not GeoArrow, just XYZ columns of Arrow IPC) to PDAL

Sorry, can you expand on why it's easier for you to implement Arrow rather than GeoArrow here?

@hobu commented Sep 7, 2023

Sorry, can you expand on why it's easier for you to implement Arrow rather than GeoArrow here?

Because point clouds are simply XYZ columns and arrow-cpp API has an easy to use IPC writer.

@paleolimbot (Contributor Author)

Because point clouds are simply XYZ columns

GeoArrow is simply an XYZ struct with a CRS that doesn't get lost when you put it across an application boundary 🙂

@kylebarron (Member)

Yeah, I wasn't clear but I meant to say: you're so close to supporting GeoArrow already! If you just wrap the three columns into a logical struct and give it the extension name geoarrow.point, then voilà!
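
A minimal pyarrow sketch of that wrapping (the field name and CRS value are illustrative; the ARROW:extension:* field metadata keys are Arrow's standard extension-type mechanism):

import json

import numpy as np
import pyarrow as pa

x = np.full(100, 1.0)
y = np.full(100, 2.0)
z = np.full(100, 3.0)

# Wrap the three coordinate columns into one logical struct array
xyz = pa.StructArray.from_arrays([pa.array(x), pa.array(y), pa.array(z)], ["x", "y", "z"])

# Attach the extension name (and illustrative CRS metadata) at the field level
field = pa.field(
    "geometry",
    xyz.type,
    metadata={
        "ARROW:extension:name": "geoarrow.point",
        "ARROW:extension:metadata": json.dumps({"crs": "EPSG:4326"}),  # hypothetical value
    },
)
table = pa.Table.from_arrays([xyz], schema=pa.schema([field]))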

@hobu commented Sep 7, 2023

If you just wrap the three columns into a logical struct and give it the extension name geoarrow.point, then voilà!

So what happens when you compress structs? As individual columns, they compress decently due to not much change in row-to-row value within each dimension. As a struct, the number of bits changing row-to-row is substantial.

@kylebarron (Member) commented Sep 7, 2023

The key is that in arrow the struct is logical. So each of the X, Y, Z columns (within the struct!) are still their own separate buffers. If you save those to parquet, the physical buffers are exactly the same as if you saved them as three different logical columns. There's no interleaving happening when you store a struct column (that only happens if you store a FixedSizeList column; the GeoArrow spec allows either). Maybe other confusion is in how Parquet and Arrow relate to each other.

Here's an example:

Setup:

import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np

x = np.full(100, 1)
y = np.full(100, 2)
z = np.full(100, 3)

Creating a table with three separate columns:

table_separate_columns = pa.table({
    'x': x,
    'y': y,
    'z': z
})
pq.write_table(table_separate_columns, 'table_separate.parquet')

pq.read_schema reads an Arrow schema from the file, while pq.read_metadata reads the Parquet metadata from the file, which is the difference between the logical (Arrow) layout and the physical (Parquet) layout.

So in this case, they overlap. If we read the Arrow schema and the Parquet metadata, they both have three columns:

pq.read_schema('table_separate.parquet')
# x: int64
# y: int64
# z: int64

pq.read_metadata('table_separate.parquet')
# <pyarrow._parquet.FileMetaData object at 0x16b424680>
#   created_by: parquet-cpp-arrow version 12.0.0
#   num_columns: 3
#   num_rows: 100
#   num_row_groups: 1
#   format_version: 2.6
#   serialized_size: 708

But consider what happens when we write a struct array:

struct_column = pa.StructArray.from_arrays([x, y, z], ['x', 'y', 'z'])
table_one_column = pa.table({
    'xyz': struct_column
})
pq.write_table(table_one_column, 'table_one_col.parquet')

Now the Arrow schema has one logical column, while three columns were written to Parquet:

pq.read_schema('table_one_col.parquet')
# xyz: struct<x: int64, y: int64, z: int64>
#   child 0, x: int64
#   child 1, y: int64
#   child 2, z: int64

pq.read_schema('table_one_col.parquet').names
# ['xyz']

pq.read_metadata('table_one_col.parquet')
# <pyarrow._parquet.FileMetaData object at 0x16b9309a0>
#   created_by: parquet-cpp-arrow version 12.0.0
#   num_columns: 3
#   num_rows: 100
#   num_row_groups: 1
#   format_version: 2.6
#   serialized_size: 794

And if you look at the file size on disk, it's virtually the same for each:

ls -l *.parquet
# -rw-r--r--@ 1 kyle  staff  1395 Sep  7 14:07 table_one_col.parquet
# -rw-r--r--@ 1 kyle  staff  1297 Sep  7 14:01 table_separate.parquet

And 86 of those extra bytes in the first file are from the larger size of the metadata stored in the Parquet file. That's Arrow's metadata ensuring that Arrow can accurately round trip the data to and from Parquet without any loss

@paleolimbot (Contributor Author)

@kylebarron @jorisvandenbossche Is there any middle ground we can settle on here? I am interested specifically in how to communicate a CRS whose encoding is unknown (in a way that does not prevent future specialization if required).

Joris suggested a separate key (i.e., {"crs_unknown": "some unknown content"} instead of {"crs": "some unknown content"}), which is slightly harder to implement but definitely possible.

The obsolete GeoJSON spec uses something like {"crs": {"type": "unknown", "properties": {"content": "some unknown content"}}}, which is a little too deeply nested for my liking, but also possible.

We could do {"crs_type": "unknown", "crs": "some unknown content"} as well.

There was some talk of compatibility with the GeoParquet spec; however, I don't think that keeping full compatibility with the GeoParquet "crs" key is possible (unless we are willing to completely align our definition with theirs, including the default of OGC:CRS84 if the key is absent).

@kylebarron (Member) commented Sep 22, 2023

I agree that we should have some way to declare an unknown CRS, just also feel that we want to store the CRS in some structured way. I actually hit this in geoarrow/geoarrow-rs#196, where I implemented writing from GeoArrow to FlatGeobuf, and realized that I won't be able to create the FgbCrs struct if I have only a "CRS blob".

I could be convinced but don't immediately love the crs_unknown key. I think it's ok to not align 100% with the GeoParquet spec, but the same problem as above will happen when writing GeoParquet files... I want to be able to read and write data without necessarily having my rust code require PROJ. If I have a CRS blob in my Arrow metadata, I have no way to write GeoParquet without PROJ as I have no way to convert it to PROJJSON.

It appears impossible to be able to write both FlatGeobuf and GeoParquet without access to PROJ, because FlatGeobuf doesn't support PROJJSON and GeoParquet requires PROJJSON.

@paleolimbot (Contributor Author)

I agree that we should have some way to declare an unknown CRS

I agree that the separate key thing is probably not quite what we want. That leaves:

  • {"crs_type": "unknown", "crs": "some crs string"}, or
  • {"crs": "some crs string"}

This PR as currently worded is option 2. I would also be happy to implement option 1! Which would you prefer?

I want to be able to read and write data without necessarily having my rust code require PROJ

This PR is entirely about being able to read data without PROJ, and I think it makes sense to have a "crs_type": "WKT2" (or similar) so that it is easier to figure out what conversion needs to happen (if any). I'm happy to come up with that language in another PR but it's harder than the unknown option because of the various WKT flavours. As I understand it, the transformation between PROJJSON and WKT2 can be done without PROJ or its database (even though nobody else that I know of has actually implemented it).

@kylebarron (Member)

Is there ever a case where you want to store multiple representations of the same CRS, so that a consumer can choose whichever is easiest for them to use? Like FlatGeobuf storing both EPSG:4326 and its WKT2 representation.

I think variant 1 might make it easier to adopt specified CRS types in the future at least?

As I understand it, the transformation between PROJJSON and WKT2 can be done without PROJ or its database

@rouault even implemented PROJJSON -> WKT2 in https://github.com/rouault/projjson_to_wkt; I made an issue about the converse and he said it's possible, especially if restricting to WKT2:2019 -> PROJJSON.

@jorisvandenbossche (Contributor)

Is there ever a case where you want to store multiple representations of the same CRS, so that a consumer can choose whichever is easiest for them to use? Like FlatGeobuf storing both EPSG:4326 and its WKT2 representation.

I suppose that would be useful for your example of writing FlatGeoBuf? (Because if the source data only has PROJJSON, you need PROJ access in the writer to convert PROJJSON to WKT, although fgb might also be fine with just an auth:code that you might be able to extract from PROJJSON?)
But that then also only works if the data producer actually puts the multiple representations in the metadata, and that might not be guaranteed, so your fgb writer still needs to handle the case where there is only PROJJSON (or only some random unknown representation)?

@jorisvandenbossche (Contributor)

Joris suggested a separate key (i.e., {"crs_unknown": "some unknown content"} instead of {"crs": "some unknown content"}), which is slightly harder to implement but definitely possible.

I could be convinced but don't immediately love the crs_unknown key.

I wouldn't call it "crs_unknown", but maybe rather something like "crs_source" (or "crs_text", or "crs_description"). Then you either have well-specified PROJJSON in the "crs" key, or the original representation from the source of the data in a "crs_source" (or both, if you can convert to PROJJSON but want to preserve the original source as well).
If you don't know what kind of repr you get, you just put it in "crs_source" to pass it along.
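
For concreteness, a hypothetical metadata payload under that suggestion (key name and values are illustrative only, not part of the spec):

# "crs" stays well-specified PROJJSON when present; "crs_source" carries whatever
# representation the producer happened to have.
metadata = {
    "crs": {"type": "GeographicCRS", "name": "WGS 84"},  # abbreviated PROJJSON
    "crs_source": 'GEOGCS["WGS 84", ...]',               # e.g. legacy WKT from the source
}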

To be clear, I don't know if this is better than the other two options that Dewey listed above. It was just a random thought to have something that is also backwards compatible with the existing "crs" key being PROJJSON (and compatible with the geoparquet spec). And it gives the guarantee that "crs" is always well specified, if present.

@paleolimbot (Contributor Author)

Is there ever a case where you want to store multiple representations of the same CRS

At the GeoArrow level I think this would result in some pretty verbose language in the spec around which one you should pick if both are specified. It is perhaps worth discussing, but definitely not required for the initial version of the spec (how to represent an unknown CRS is, though!).

I think variant 1 might make it easier to adopt specified CRS types in the future at least?

I am not sure. I think we should add "crs_type" as a top-level metadata key, but I don't think requiring it will add anything to the specific case where you do not know what it is. As some initial text, I would just write "if present, specifies the interpretation of the content of the "crs" key".

@rouault commented Sep 28, 2023

"crs_type": "WKT2"

There are several flavors of WKT2. WKT2 is actually not an official name at all. The first flavor of WKT2 is formally "Well-known text representation of coordinate reference systems v1.0" (OGC 12-063r5 : https://docs.ogc.org/is/12-063r5/12-063r5.html), also registered as ISO 19162:2015(E), that we dubbed WKT2:2015 in PROJ. And the latest one is "Well-known text representation of coordinate reference systems v2.1.11" (OGC 18-010r11 https://docs.ogc.org/is/18-010r11/18-010r11.pdf / ISO 19162:2019/Amd 1:2023), aka WKT2:2019 in PROJ. For WKT1, see https://proj.org/en/9.3/development/reference/cpp/cpp_general.html#general_doc_1WKT1 . Good luck to make that clear to non-geo folks (and even to geo ones :-))

@paleolimbot (Contributor Author)

Good luck to make that clear to non-geo folks (and even to geo ones :-))

Including me! I certainly don't feel qualified to come up with the language around what constraints should and should not be placed on the content of the "crs" key for various possible values of "crs_type" (if we do choose to add a "crs_type" key).

I think my main point in this PR is that (1) it is useful for some producers to put something that isn't PROJJSON in the "crs" key (e.g., Kyle's FlatGeoBuf reader) and (2) "crs_type": "unknown" is unlikely to disambiguate anything for anyone.
