Description
The driver already emits geoarrow.wkb Arrow extension metadata (thanks! confirmed working in #1), but the ARROW:extension:metadata field has no CRS information. Since Redshift sends geometry as EWKB over the wire — which includes the SRID in the header — the driver could propagate it as PROJJSON CRS in the geoarrow metadata.
Why this matters
Without CRS metadata, consumers (DuckDB, GeoParquet writers, pyarrow) receive geometry with no coordinate reference system. When the data gets written to GeoParquet, the CRS field is empty. This means downstream tools can't distinguish EPSG:4326 from EPSG:3857 or any other projection — they have to guess or require the user to specify it out-of-band.
What the driver does today
- Receives EWKB from Redshift (flag bit 0x20000000 + 4-byte SRID in header)
- Strips the SRID, outputs plain WKB
- Tags the Arrow column with ARROW:extension:name = "geoarrow.wkb"
- ARROW:extension:metadata is empty (no CRS)
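For reference, the EWKB handling in the list above can be sketched roughly as below. This is an illustration in Python, not the driver's actual code: `split_ewkb` and `EWKB_SRID_FLAG` are hypothetical names, and the sketch only looks at the outermost geometry header, assuming that is where the SRID is written.

```python
import struct

EWKB_SRID_FLAG = 0x20000000  # set in the geometry type word when an SRID follows


def split_ewkb(ewkb: bytes):
    """Return (srid, wkb) for one EWKB value; srid is 0 when none is embedded."""
    if len(ewkb) < 5 or ewkb[0] not in (0, 1):
        raise ValueError("not a WKB/EWKB value")
    # Byte 0 is the byte-order marker: 1 = little-endian, 0 = big-endian.
    fmt = "<" if ewkb[0] == 1 else ">"
    (geom_type,) = struct.unpack_from(fmt + "I", ewkb, 1)
    if not geom_type & EWKB_SRID_FLAG:
        return 0, ewkb  # no SRID in the header, nothing to strip
    if len(ewkb) < 9:
        raise ValueError("SRID flag set but header is truncated")
    (srid,) = struct.unpack_from(fmt + "i", ewkb, 5)
    # Clear the SRID flag and drop the 4 SRID bytes; the payload is unchanged.
    header = ewkb[:1] + struct.pack(fmt + "I", geom_type & ~EWKB_SRID_FLAG)
    return srid, header + ewkb[9:]
```

The point is that the SRID is already available at the moment the driver strips it, so keeping it costs one extra return value. The sketch leaves the Z/M flag bits untouched; whether the driver rewrites those is not something this issue depends on.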
Proposed change
Extract the SRID from the first non-null EWKB value and encode it as PROJJSON in ARROW:extension:metadata:
{
  "crs": {
    "type": "ProjectedCRS" or "GeographicCRS",
    "name": "EPSG:4326",
    "id": { "authority": "EPSG", "code": 4326 }
  }
}
This is the same approach used in the Databricks ADBC driver: adbc-drivers/databricks#350. The Databricks case is more complex because it needs to flatten Struct<srid, wkb> first. Here the SRID is already in the EWKB header, so it's just a matter of reading it during decode and attaching it to the schema.
For SRID 0 (unset), the metadata should remain empty per the geoarrow spec (no CRS = unknown).
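A minimal sketch of the SRID-to-metadata mapping, following the JSON shape proposed above. `geoarrow_metadata` and its `crs_kind` parameter are hypothetical: the SRID alone does not say whether a CRS is geographic or projected, so the sketch takes that from the caller, and it assumes the SRID is an EPSG code.

```python
import json
from typing import Optional


def geoarrow_metadata(srid: int, crs_kind: str = "GeographicCRS") -> Optional[str]:
    """Build the ARROW:extension:metadata value, or None to leave it unset."""
    if srid == 0:
        return None  # SRID 0 = unset: attach no CRS at all
    if crs_kind not in ("GeographicCRS", "ProjectedCRS"):
        raise ValueError(f"unexpected CRS kind: {crs_kind}")
    return json.dumps(
        {
            "crs": {
                "type": crs_kind,
                "name": f"EPSG:{srid}",  # assumption: SRID is an EPSG code
                "id": {"authority": "EPSG", "code": srid},
            }
        }
    )
```

For example, `geoarrow_metadata(3857, "ProjectedCRS")` yields a JSON string whose `crs.id.code` is 3857, while `geoarrow_metadata(0)` returns `None` so the column keeps its current empty metadata.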
Two-phase schema
Since the SRID isn't known until the first record batch is read, this likely needs the same two-phase approach as Databricks: emit the schema without CRS initially, then rebuild it with CRS after seeing the first non-null geometry. Alternatively, the driver could query SELECT ST_SRID(geom_col) FROM table LIMIT 1 upfront, but reading it from the EWKB bytes is simpler and doesn't require an extra round-trip.
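One way to picture the "read it from the bytes" option is the sketch below. It is not taken from the Databricks PR; `resolve_srid`, `peek_srid`, and the list-of-bytes `Batch` type are hypothetical stand-ins for the driver's record batch reader. It buffers batches until the first non-null geometry appears, then hands back the SRID together with a stream that replays the buffered batches.

```python
import itertools
import struct
from typing import Iterable, Iterator, List, Optional, Tuple

Batch = List[Optional[bytes]]  # one geometry column; None stands for SQL NULL


def peek_srid(ewkb: bytes) -> int:
    """Read the SRID from an EWKB header without modifying the value."""
    fmt = "<" if ewkb[0] == 1 else ">"
    (geom_type,) = struct.unpack_from(fmt + "I", ewkb, 1)
    if geom_type & 0x20000000:
        return struct.unpack_from(fmt + "i", ewkb, 5)[0]
    return 0


def resolve_srid(batches: Iterable[Batch]) -> Tuple[int, Iterator[Batch]]:
    """Return (srid, stream); stream yields every batch, including buffered ones."""
    it = iter(batches)
    buffered: List[Batch] = []
    for batch in it:
        buffered.append(batch)
        for value in batch:
            if value is not None:
                # First non-null geometry decides the SRID for the schema.
                return peek_srid(value), itertools.chain(buffered, it)
    return 0, iter(buffered)  # all NULL (or no rows): CRS stays unknown
```

In most result sets the first batch already contains a geometry, so nothing beyond that batch is buffered. The sketch only consults the first value and would not notice mixed SRIDs within a column.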
Context
We're using the Redshift ADBC driver from DuckDB (adbc_scanner) to export geospatial data to GeoParquet. DuckDB picks up geoarrow.wkb as native GEOMETRY, which is great — but the missing CRS means the output GeoParquet has no coordinate reference system. We recently got this working end-to-end for Databricks via the PR linked above, and the Redshift case should be strictly simpler.