Propagate SRID from EWKB as PROJJSON CRS in geoarrow.wkb metadata #2

@jatorre

The driver already emits geoarrow.wkb Arrow extension metadata (thanks! confirmed working in #1), but the ARROW:extension:metadata field has no CRS information. Since Redshift sends geometry as EWKB over the wire — which includes the SRID in the header — the driver could propagate it as PROJJSON CRS in the geoarrow metadata.

Why this matters

Without CRS metadata, consumers (DuckDB, GeoParquet writers, pyarrow) receive geometry with no coordinate reference system. When the data gets written to GeoParquet, the CRS field is empty. This means downstream tools can't distinguish EPSG:4326 from EPSG:3857 or any other projection — they have to guess or require the user to specify it out-of-band.

What the driver does today

  1. Receives EWKB from Redshift (flag bit 0x20000000 + 4-byte SRID in header)
  2. Strips the SRID, outputs plain WKB
  3. Tags the Arrow column with ARROW:extension:name = "geoarrow.wkb"
  4. Leaves ARROW:extension:metadata empty (no CRS)
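
To make steps 1 and 2 concrete, here is a minimal Python sketch of reading the SRID out of an EWKB header and producing plain WKB. It is not the driver's actual code: `split_ewkb` is a hypothetical name, and the sketch assumes the usual EWKB layout (1 byte-order byte, a 4-byte geometry type carrying the 0x20000000 flag, then the 4-byte SRID). It only touches the outermost header and leaves any other flag bits in the type word as they are.

```python
import struct

EWKB_SRID_FLAG = 0x20000000  # flag bit from step 1 above


def split_ewkb(ewkb: bytes) -> tuple[int, bytes]:
    """Return (srid, plain_wkb). srid is 0 when the SRID flag is not set.

    Hypothetical helper for illustration; not part of the driver.
    """
    if len(ewkb) < 5:
        raise ValueError("EWKB value too short for a header")
    # Byte 0 is the byte-order marker: 1 = little-endian, 0 = big-endian.
    endian = "<" if ewkb[0] == 1 else ">"
    (geom_type,) = struct.unpack_from(endian + "I", ewkb, 1)
    if not geom_type & EWKB_SRID_FLAG:
        return 0, ewkb  # no SRID present, nothing to strip
    if len(ewkb) < 9:
        raise ValueError("SRID flag set but SRID bytes are missing")
    (srid,) = struct.unpack_from(endian + "I", ewkb, 5)
    # Clear the SRID flag and drop the 4 SRID bytes that follow the type.
    header = ewkb[:1] + struct.pack(endian + "I", geom_type & ~EWKB_SRID_FLAG)
    return srid, header + ewkb[9:]
```

The point is that the SRID is already in hand at the moment the header is rewritten, so keeping it costs nothing extra.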

Proposed change

Extract the SRID from the first non-null EWKB value and encode it as PROJJSON in ARROW:extension:metadata:

{
  "crs": {
    "type": "GeographicCRS",
    "name": "EPSG:4326",
    "id": { "authority": "EPSG", "code": 4326 }
  }
}

Here "type" is "GeographicCRS" or "ProjectedCRS", depending on the SRID.

This is the same approach used in the Databricks ADBC driver: adbc-drivers/databricks#350. The Databricks case is more complex because it needs to flatten Struct<srid, wkb> first — here the SRID is already in the EWKB header, so it's just a matter of reading it during decode and attaching it to the schema.

For SRID 0 (unset), the metadata should remain empty per the geoarrow spec (no CRS = unknown).
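
A rough Python sketch of that mapping, under a few assumptions: `geoarrow_metadata_for_srid` is a hypothetical name, the SRID is treated as an EPSG code, and the caller supplies the CRS type because the SRID alone does not say whether the CRS is geographic or projected. Returning `None` stands for "leave the metadata as it is today".

```python
import json
from typing import Optional


def geoarrow_metadata_for_srid(
    srid: int, crs_type: str = "GeographicCRS"
) -> Optional[str]:
    """Build the ARROW:extension:metadata value proposed above.

    Hypothetical helper. Returns None for SRID 0 so the caller leaves the
    metadata untouched (no CRS = unknown).
    """
    if srid == 0:
        return None
    if crs_type not in ("GeographicCRS", "ProjectedCRS"):
        raise ValueError(f"unexpected CRS type: {crs_type}")
    crs = {
        "type": crs_type,
        "name": f"EPSG:{srid}",  # assumption: SRID is an EPSG code
        "id": {"authority": "EPSG", "code": srid},
    }
    return json.dumps({"crs": crs})
```

How the driver would decide between the two CRS types is left open here; a lookup table or a fuller PROJJSON source would be one way to settle it.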

Two-phase schema

Since the SRID isn't known until the first record batch is read, this likely needs the same two-phase approach as Databricks: emit the schema without CRS initially, then rebuild it with CRS after seeing the first non-null geometry. Alternatively, the driver could query SELECT ST_SRID(geom_col) FROM table LIMIT 1 upfront, but reading it from the EWKB bytes is simpler and doesn't require an extra round-trip.
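
The two phases could look roughly like this in Python. This is only a sketch: field metadata is modeled as a plain dict instead of a real Arrow schema, `initial_metadata`, `read_srid` and `rebuild_with_crs` are hypothetical names, and the CRS type is passed in by the caller as an assumption.

```python
import json
import struct
from typing import Iterable, Optional

SRID_FLAG = 0x20000000


def read_srid(ewkb: bytes) -> int:
    """Read the SRID from an EWKB header; 0 if the SRID flag is not set."""
    endian = "<" if ewkb[0] == 1 else ">"
    (geom_type,) = struct.unpack_from(endian + "I", ewkb, 1)
    if not geom_type & SRID_FLAG:
        return 0
    (srid,) = struct.unpack_from(endian + "I", ewkb, 5)
    return srid


def initial_metadata() -> dict:
    # Phase 1: the schema is emitted before any rows arrive, so no CRS yet.
    return {"ARROW:extension:name": "geoarrow.wkb"}


def rebuild_with_crs(
    metadata: dict,
    first_batch: Iterable[Optional[bytes]],
    crs_type: str = "GeographicCRS",  # assumption: supplied by the caller
) -> dict:
    # Phase 2: take the SRID from the first non-null geometry in the batch.
    srid = next((read_srid(v) for v in first_batch if v is not None), 0)
    if srid == 0:
        return dict(metadata)  # all null or SRID unset: keep metadata as is
    crs = {
        "type": crs_type,
        "name": f"EPSG:{srid}",
        "id": {"authority": "EPSG", "code": srid},
    }
    rebuilt = dict(metadata)
    rebuilt["ARROW:extension:metadata"] = json.dumps({"crs": crs})
    return rebuilt
```

One open question this sketch does not address is what to do when later rows carry a different SRID than the first one.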

Context

We're using the Redshift ADBC driver from DuckDB (adbc_scanner) to export geospatial data to GeoParquet. DuckDB picks up geoarrow.wkb as native GEOMETRY, which is great — but the missing CRS means the output GeoParquet has no coordinate reference system. We recently got this working end-to-end for Databricks via the PR linked above, and the Redshift case should be strictly simpler.
