Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read Arrow: Support for new options as of GDAL 3.8 #345

Open
kylebarron opened this issue Jan 24, 2024 · 7 comments
Open

Read Arrow: Support for new options as of GDAL 3.8 #345

kylebarron opened this issue Jan 24, 2024 · 7 comments

Comments

@kylebarron
Copy link
Contributor

I was just re-reading the GetArrowStream API docs and saw that there are two new options as of GDAL 3.8:

  • TIMEZONE="unknown", "UTC", "(+|:)HH:MM" or any other value supported by Arrow. (GDAL >= 3.8)
  • GEOMETRY_METADATA_ENCODING=OGC/GEOARROW (GDAL >= 3.8).

It would be great to support these. The metadata encoding would be useful for my purposes. I don't use timezone information as often, but someone probably would like that.

@brendan-ward
Copy link
Member

Looks like we'd need to add a keyword on ogr_open_arrow to set the geometry encoding, maybe geometry_encoding="wkb"|"geoarrow"
(not finding "ogc" as a meaningful value here compared to "wkb" which is specific).

@brendan-ward
Copy link
Member

Oops, I mistakenly conflated the actual geometry encoding - which looks like it is either WKB or native (which may be GeoArrow) without a way to specifically request GeoArrow geometry encoding - with the metadata that sets extension types (which I don't entirely follow here).

Does the geoarrow.wkb extension type mean that the actual geometry encoding is still (OGC) WKB, or is it serialized GeoArrow bytes. Apologies for not being more up to speed on GeoArrow!

@kylebarron
Copy link
Contributor Author

Looks like we'd need to add a keyword on ogr_open_arrow to set the geometry encoding, maybe geometry_encoding="wkb"|"geoarrow" (not finding "ogc" as a meaningful value here compared to "wkb" which is specific).

I agree this is confusing 🙂 . The key is that the switch does not change the actual encoding of the geometry. The geometries are always WKB. Rather, the switch only changes the metadata applied to the Arrow geometry column. As it is, the geometry column is always loaded with an Arrow extension name of ogc.wkb, which is "non-standard" (if you call GeoArrow a standard 😅).

What this toggle does is change the extension name from ogc.wkb to geoarrow.wkb and includes the CRS as projjson onto the arrow metadata. So for example, right now read_arrow returns a tuple with (metadata, table) = read_arrow() where the metadata object includes information about the CRS. When you set GEOMETRY_METADATA_ENCODING=GEOARROW, that metadata object is no longer needed because the CRS is stored on the table's internal metadata.

@brendan-ward
Copy link
Member

Thanks for the explanation. It seems like this is therefore more of a boolean, opt-in option. metadata="ogc"|"geoarrow" doesn't seem great though it somewhat parallels the options on the GDAL side.

What about geoarrow_metadata=False|True? Where True means GEOMETRY_METADATA_ENCODING=GEOARROW and returns the stream with the additional things in the metadata. And presumably if set it means that we don't need to separately try to resolve the CRS using other calls to GDAL.

@jorisvandenbossche
Copy link
Member

For the geoarrow metadata, I would personally consider making that the default here (I understand that GDAL didn't do that for backwards compatibility, but given this is quite new, I would still make the change here)

@jorisvandenbossche
Copy link
Member

The TIMEZONE option was added after an issue I opened ((OSGeo/gdal#8460)) triggered by our testing here when adding timezone read/write support in pyogrio.

The problem is that GDAL's data model stores a fixed offset to UTC per individual value, and not an actual time "zone" (like "Europe/Brussels"). While for the Arrow output, it needs to be a single timezone indicator for the full column (as it is part of the column's type). For some formats, Even added a specialized code path to preserve the timezone information. But for other formats, the best it can do is either preserve the wall time (unknown time zone) or convert to UTC.

And so my understanding is that the option allows you to choose between those. But the default is driver-dependent, so the option allows you to override the default of the driver. It's a bit complex to clearly explain ..

@kylebarron
Copy link
Contributor Author

I don't work with timezones a lot so in #366 I took the simpler route and only tried to add geoarrow metadata handling

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants