# Import geospatial formats into Delta Lake with DuckDB

This example covers a few formats, but the same workflow applies to any spatial format that DuckDB Spatial supports through its GDAL integration; see the output of [ST_Drivers](https://duckdb.org/docs/stable/core_extensions/spatial/functions.html#st_drivers) for the full list.

## Setup

In [0]:
%pip install duckdb --quiet

In [0]:
import duckdb

duckdb.sql("install spatial; load spatial")
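To check which formats your DuckDB Spatial build can read, you can query `ST_Drivers` directly. The following is a minimal sketch: the helper name `list_readable_drivers` is hypothetical, and it assumes the `short_name`, `long_name` and `can_open` columns that `ST_Drivers` is documented to return.

```python
# Query text only; it needs the spatial extension to be loaded when executed.
DRIVERS_SQL = """
select short_name, long_name
from st_drivers()
where can_open
order by short_name
"""


def list_readable_drivers(con):
    """Return the GDAL drivers that can be opened, as a pandas DataFrame.

    `con` is any object with a DuckDB-style `.sql()` method, for example the
    `duckdb` module itself or a connection from `duckdb.connect()`.
    """
    return con.sql(DRIVERS_SQL).df()
```

With the setup above, `list_readable_drivers(duckdb)` would list the drivers in a notebook cell.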

In [0]:
CATALOG = "workspace"
SCHEMA = "default"
VOLUME = "default"

## Import GeoPackage

::: {.callout-note}

NOTE: This example uses DuckDB to parse the GeoPackage. For more complex GeoPackages, you may need to [install GDAL](../appendix/databricks_ogr2ogr_parquet.ipynb) and use the GDAL command line tools.

:::

In [0]:
GPKG_URL = "https://service.pdok.nl/kadaster/bestuurlijkegebieden/atom/v1_0/downloads/BestuurlijkeGebieden_2025.gpkg"

In [0]:
layers = duckdb.sql(
    f"""
with t as (
    select unnest(layers) layer
     from st_read_meta('{GPKG_URL}'))
select
    layer.name layer_name,
    layer.geometry_fields[1].name geom_field
from t"""
).df()

layers

# Returns:

# layer_name	geom_field
# 0	gemeentegebied	geom
# 1	landgebied	geom
# 2	provinciegebied	geom

In [0]:
# pick a layer to read
layer_name, geom_field = layers.loc[0, ["layer_name", "geom_field"]]

duckdb.sql(
    f"""copy (
  select * replace(st_aswkb({geom_field}) as {geom_field})
  from
    st_read(
      '{GPKG_URL}',
      layer='{layer_name}')
  ) to '/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/{layer_name}.parquet' (format parquet)"""
)
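The cell above exports a single layer. If you want every layer, the same `copy` statement can be run in a loop. This is a sketch only: `layer_copy_sql` and `export_all_layers` are hypothetical helper names, and it assumes that every layer has a geometry column and that layer names are safe to embed in SQL and in file names.

```python
def layer_copy_sql(source, layer_name, geom_field, out_dir):
    """Build a COPY statement that writes one layer to <out_dir>/<layer_name>.parquet."""
    return f"""copy (
  select * replace(st_aswkb({geom_field}) as {geom_field})
  from st_read('{source}', layer='{layer_name}')
) to '{out_dir}/{layer_name}.parquet' (format parquet)"""


def export_all_layers(con, layer_rows, source, out_dir):
    """Run one COPY per (layer_name, geom_field) pair and return the written paths.

    `con` is any object with a DuckDB-style `.sql()` method.
    """
    written = []
    for layer_name, geom_field in layer_rows:
        con.sql(layer_copy_sql(source, layer_name, geom_field, out_dir))
        written.append(f"{out_dir}/{layer_name}.parquet")
    return written
```

In this notebook it could be called as `export_all_layers(duckdb, layers.itertuples(index=False, name=None), GPKG_URL, f"/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}")`.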

In [0]:
spark.read.parquet(
    f"/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/{layer_name}.parquet"
).display()  # noqa: S108

You can store the Spark DataFrame above as a Delta Lake table as needed.
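Persisting the Parquet file as a Delta Lake table could look like the sketch below. The helper name `parquet_to_delta` and the three-level table name in the usage note are assumptions, not part of the original notebook. The geometry column stays WKB-encoded binary, exactly as DuckDB wrote it.

```python
def parquet_to_delta(spark, parquet_path, table_name, mode="overwrite"):
    """Read a Parquet file with Spark and save it as a Delta table.

    `spark` is the SparkSession that Databricks notebooks provide.
    `mode="overwrite"` replaces an existing table with the same name.
    """
    df = spark.read.parquet(parquet_path)
    df.write.format("delta").mode(mode).saveAsTable(table_name)
    return table_name
```

For the layer exported above, a call might be `parquet_to_delta(spark, f"/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/{layer_name}.parquet", f"{CATALOG}.{SCHEMA}.{layer_name}")`.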

## Import OpenStreetMap data

If the OpenStreetMap (OSM) data you need is also available via Overture Maps, you are much better off using the latter. You could follow their [DuckDB tutorial](https://docs.overturemaps.org/getting-data/duckdb/) or, even better, use CARTO's pre-loaded Delta Lake tables via the [Marketplace](https://marketplace.databricks.com/provider/dd56dcf4-cb70-449e-abad-c8038c0de3d9/CARTO).

However, much of the OSM data is not available in Overture Maps. For example, transit data is absent. If you need such layers, you have to load the OSM data yourself, as shown below.

Pick the area you want to download at https://download.geofabrik.de/ . You could also try loading the whole world via https://planet.openstreetmap.org/ , although this is not recommended.

In [0]:
GEOFABRIK_URL = "https://download.geofabrik.de/europe/netherlands-latest.osm.pbf"

In [0]:
file_name = GEOFABRIK_URL.split("/")[-1]
file_basename = file_name.split(".", 1)[0]  # strip the compound ".osm.pbf" suffix
volume_file_path = f"/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/{file_name}"
volume_parquet_path = f"/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/{file_basename}.parquet"

In [0]:
!curl -L --fail -o {volume_file_path} {GEOFABRIK_URL}

The short script below does a lot of heavy lifting: DuckDB Spatial recognizes the `.osm.pbf` file as an OSM extract and calls [ST_ReadOSM](https://duckdb.org/docs/stable/core_extensions/spatial/functions.html#st_readosm) under the hood.

In [0]:
duckdb.sql(
    f"""
copy (
    select
        *
    from
        '{volume_file_path}'
) to '{volume_parquet_path}'
(format parquet)
;
"""
)
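The cell above relies on DuckDB recognizing the file extension. If you prefer to make the reader explicit, for example because the file has a different name, the same statement can call `ST_ReadOSM` directly. A minimal sketch, where `osm_copy_sql` is a hypothetical helper name:

```python
def osm_copy_sql(pbf_path, parquet_path):
    """Build a COPY statement that reads an OSM extract with an explicit ST_ReadOSM call."""
    return f"""
copy (
    select *
    from st_readosm('{pbf_path}')
) to '{parquet_path}'
(format parquet)
"""
```

Running `duckdb.sql(osm_copy_sql(volume_file_path, volume_parquet_path))` should then be equivalent to the cell above.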

In [0]:
spark.read.parquet(volume_parquet_path).createOrReplaceTempView("osm")

To persist the data, save it as a Delta Lake table instead of a temporary view.

You can further process this dataset into nodes, ways, and relations with SQL. That is beyond the scope of this document, but here is a minimal example that visualizes some data:

In [0]:
%sql
with route as (
  select
    id as route_id,
    posexplode(refs) as (pos, id)  -- posexplode returns the position first, then the element
  from
    osm
  where
    lower(osm.tags.brand) = 'intercity direct'
)
select
  st_makeline(collect_list(st_point(osm.lon, osm.lat)) over (
    partition by
      route_id
    order by
      pos
  ))
from
  route
  join osm using (id)
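As a starting point for the split into nodes, ways, and relations mentioned earlier, the sketch below generates one temporary view per element kind. It assumes that the Parquet file keeps the columns that `ST_ReadOSM` is documented to return (`kind`, `id`, `tags`, `refs`, `lat`, `lon`, `ref_roles`, `ref_types`) and that `kind` arrives in Spark as a plain string. The helper name and the column selection per kind are illustrative choices.

```python
# Columns that are meaningful for each OSM element kind (an illustrative choice).
KIND_COLUMNS = {
    "node": ["id", "tags", "lat", "lon"],
    "way": ["id", "tags", "refs"],
    "relation": ["id", "tags", "refs", "ref_roles", "ref_types"],
}


def kind_view_sql(kind, source="osm"):
    """Build a Spark SQL statement that creates a temp view such as `osm_nodes`."""
    columns = ", ".join(KIND_COLUMNS[kind])
    return (
        f"create or replace temporary view {source}_{kind}s as "
        f"select {columns} from {source} where kind = '{kind}'"
    )
```

In a notebook cell, `for kind in KIND_COLUMNS: spark.sql(kind_view_sql(kind))` would create `osm_nodes`, `osm_ways` and `osm_relations`.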