# Export Delta Lake table to other formats with DuckDB

## Setup

In [None]:
%pip install duckdb --quiet

import duckdb

duckdb.sql("install spatial; load spatial")

::: {.callout-note}

If `install spatial` fails (especially if you are _not_ using the Free Edition or Serverless Compute, but classic compute), check whether HTTP is blocked on your (corporate) network. If so, then you need to work around it as described [here](../appendix/https_install_duckdbextension.ipynb).

:::

In [None]:
CATALOG = "workspace"
SCHEMA = "default"
VOLUME = "default"

GEOMETRY_COLUMN = "geometry"

spark.sql(f"create volume if not exists {CATALOG}.{SCHEMA}.{VOLUME}")

Let's first create an example table with GEOMETRY columns:

In [None]:
%sql
create or replace table tmp_geometries as
select
  st_point(0, 0, 4326) as geometry,
  "Null Island" as name
union all
select
  st_transform(st_point(155000, 463000, 28992), 4326) as geometry,
  "Onze Lieve Vrouwetoren" as name
union all
select
  st_makepolygon(
    st_makeline(
      array(
        st_point(- 80.1935973, 25.7741566, 4326),
        st_point(- 64.7563086, 32.3040273, 4326),
        st_point(- 66.1166669, 18.4653003, 4326),
        st_point(- 80.1935973, 25.7741566, 4326)
      )
    )
  ) as geometry,
  "Bermuda Triangle" as name;

select
  *
from
  tmp_geometries
-- Returns:

-- _sqldf:pyspark.sql.connect.dataframe.DataFrame
-- geometry:geometry(OGC:CRS84)
-- name:string

-- geometry	name
-- SRID=4326;POINT(0 0)	Null Island
-- SRID=4326;POINT(5.3872035084137675 52.15517230119224)	Onze Lieve Vrouwetoren
-- SRID=4326;POLYGON((-80.1935973 25.7741566,-64.7563086 32.3040273,-66.1166669 18.4653003,-80.1935973 25.7741566))	Bermuda Triangle

## Parquet files

We'll use DuckDB Spatial to write the Geoparquet file, so first we write the above Delta Lake table out as a directory of Parquet files, using lon/lat coordinates.

(You could also use Databricks [Temporary Table Credentials API](https://docs.databricks.com/api/workspace/temporarytablecredentials) to directly read the Delta Lake table with the DuckDB [Delta Extension](https://duckdb.org/docs/stable/core_extensions/delta.html) instead.)

In [None]:
from pyspark.sql import functions as F

spark.table("tmp_geometries").withColumn(
    "geometry", F.expr("st_transform(geometry, 4326)")
).write.mode("overwrite").parquet(
    f"/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/geometries.parquet"
)

We will use the above parquet export as a stepping stone to produce other formats below.
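
The exports below all follow the same pattern: a DuckDB `copy ... to` over the Spark-written part files, differing only in the target path and the copy options. A minimal sketch of that shared pattern follows. The helper name `build_export_query` and its parameters are hypothetical, not part of DuckDB or of this notebook, and the paths are assumed to be trusted values since they are interpolated directly into the SQL string.

```python
def build_export_query(parquet_dir, target, copy_options, geometry_column="geometry"):
    """Build a DuckDB COPY statement that converts the WKB column and writes `target`.

    Hypothetical helper: it only assembles the SQL text and does not run it.
    """
    # Spark writes a directory of part files; read only those, not marker files.
    source = f"{parquet_dir}/part-*.parquet"
    return (
        "load spatial;\n"
        "copy (\n"
        f"  select * replace (st_geomfromwkb({geometry_column}) as {geometry_column})\n"
        f"  from read_parquet('{source}')\n"
        f") to '{target}' ({copy_options})"
    )


# Example: the text of a Geoparquet export for the default volume used above.
base = "/Volumes/workspace/default/default"
print(
    build_export_query(
        f"{base}/geometries.parquet",
        f"{base}/geometries_geo.parquet",
        "format parquet",
    )
)
```

The returned string could then be passed to `duckdb.sql(...)`, as in the cells below.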

## Geoparquet

We can use DuckDB to transform the Parquet files into a valid [Geoparquet](https://geoparquet.org/) file:

::: {.callout-note}

(Note that if you didn't load the DuckDB Spatial extension, the below would still succeed but Geoparquet metadata would _not_ be written.)

:::

In [None]:
query = f"""
load spatial;
copy (
select 
    * replace (st_geomfromwkb({GEOMETRY_COLUMN}) as {GEOMETRY_COLUMN})
from
    read_parquet('/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/geometries.parquet/part-*.parquet')
) to '/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/geometries_geo.parquet' (format parquet)"""
duckdb.sql(query)

There is more to writing Geoparquet, such as writing custom CRSs or defining a ["covering"](https://geoparquet.org/releases/v1.1.0/) using bounding boxes, but the above example already produces a valid Geoparquet file. For example, if your QGIS supports the Parquet format (as of Aug 2025, the latest Windows version does but the latest macOS version doesn't), you can open this file in QGIS after downloading it from Volumes:

![geoparquet in qgis](img/geoparquet_qgis.png)

(In fact, the [GDAL Parquet reader](https://gdal.org/en/stable/drivers/vector/parquet.html) used by QGIS can even open Parquet files that are not valid Geoparquet, as long as they have a WKB or WKT column and the column name and CRS either match the expected defaults or are correctly defined.)
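
What makes a Parquet file a Geoparquet file is a JSON document stored under the `geo` key of the Parquet file metadata. The sketch below checks such a JSON string for the keys the Geoparquet specification requires (`version`, `primary_column`, `columns`, and per column `encoding` and `geometry_types`). The function name `geo_metadata_problems` and the example metadata are illustrative assumptions, not output from the file written above. Reading the string from the file is left out here; it could be done, for instance, with DuckDB's `parquet_kv_metadata` table function.

```python
import json

REQUIRED_FILE_KEYS = {"version", "primary_column", "columns"}
REQUIRED_COLUMN_KEYS = {"encoding", "geometry_types"}


def geo_metadata_problems(geo_json: str) -> list:
    """Return a list of problems found in a Geoparquet 'geo' metadata string."""
    meta = json.loads(geo_json)
    problems = [
        f"missing file-level key: {key}"
        for key in sorted(REQUIRED_FILE_KEYS - set(meta))
    ]
    columns = meta.get("columns", {})
    primary = meta.get("primary_column")
    # The primary geometry column must itself be described under "columns".
    if primary is not None and primary not in columns:
        problems.append(f"primary_column {primary!r} not listed in columns")
    for name, column in columns.items():
        for key in sorted(REQUIRED_COLUMN_KEYS - set(column)):
            problems.append(f"column {name!r} missing key: {key}")
    return problems


# Illustrative metadata only (an assumption, not copied from a real file).
example = json.dumps(
    {
        "version": "1.0.0",
        "primary_column": "geometry",
        "columns": {
            "geometry": {"encoding": "WKB", "geometry_types": ["Point", "Polygon"]}
        },
    }
)
print(geo_metadata_problems(example))
```

An empty list means the required keys are present; it does not validate the geometries themselves.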

## Flatgeobuf

Exporting to Flatgeobuf is very similar to the above. Flatgeobuf as a format has two key advantages here:
- It is faster to render (e.g. in QGIS) than Geoparquet, and
- It can act as input to `tippecanoe` (see below), which we'll use to produce PMTiles, which is even better suited for web mapping.

In [None]:
query = f"""
load spatial;
copy (
select 
    * replace (st_geomfromwkb({GEOMETRY_COLUMN}) as {GEOMETRY_COLUMN})
from
    read_parquet('/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/geometries.parquet/part-*.parquet')
) to '/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/geometries.fgb'
(FORMAT GDAL, DRIVER flatgeobuf, LAYER_CREATION_OPTIONS 'TEMPORARY_DIR=/tmp/')"""
duckdb.sql(query)

### Streaming the Flatgeobuf file to QGIS

You can stream this to QGIS (i.e. without downloading the file first -- this is very useful for much larger datasets) via a token and the [Files API](https://docs.databricks.com/api/workspace/files/download). For example, after setting up a personal access token, you can stream the above file with a link like below:

In [None]:
f"/vsicurl?header.Authorization=Bearer%20<YOUR_PERSONAL_ACCESS_TOKEN>&url=https://{spark.conf.get('spark.databricks.workspaceUrl')}/api/2.0/fs/files/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/geometries.fgb"

Replace `<YOUR_PERSONAL_ACCESS_TOKEN>` in the output above with your token, then copy the resulting string (including the leading `/vsicurl`, but without the quotes) into QGIS when adding a vector layer.
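
The same string can be assembled by a small helper, which makes it harder to forget the `%20` between `Bearer` and the token. This is a minimal sketch: the function name `vsicurl_path`, the host `example.cloud.databricks.com` and the token `dapi123` are made-up placeholders, and the URL part is left unencoded to match the string above.

```python
from urllib.parse import quote


def vsicurl_path(workspace_host: str, token: str, volume_file: str) -> str:
    """Build a GDAL /vsicurl path that streams a Volumes file via the Files API.

    `volume_file` is the absolute path starting with /Volumes/.
    """
    # Percent-encode the header value so the space becomes %20.
    auth = quote(f"Bearer {token}")
    url = f"https://{workspace_host}/api/2.0/fs/files{volume_file}"
    return f"/vsicurl?header.Authorization={auth}&url={url}"


# Placeholder host and token, for illustration only.
print(
    vsicurl_path(
        "example.cloud.databricks.com",
        "dapi123",
        "/Volumes/workspace/default/default/geometries.fgb",
    )
)
```

In the notebook, the host would come from `spark.conf.get('spark.databricks.workspaceUrl')`, as in the cell above. Avoid printing a real token in a shared notebook.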

::: {.callout-note}

This way of streaming might work with other formats too, such as Parquet; however, for larger datasets, Flatgeobuf is a great choice. And for smaller datasets, simply downloading the file might be faster than setting up the above authentication.

:::

## PMTiles

For PMTiles, while theoretically we could keep using DuckDB Spatial with GDAL, we'll instead use [tippecanoe](https://github.com/felt/tippecanoe).

In [None]:
%sh
# ~5 min
cd /tmp && git clone https://github.com/felt/tippecanoe.git
cd tippecanoe
make -j
make install PREFIX=$HOME/.local

In [None]:
import os

HOME = os.environ["HOME"]

# see https://github.com/felt/tippecanoe/blob/main/README.md#try-this-first and e.g.
# https://github.com/OvertureMaps/overture-tiles/blob/main/scripts/2024-07-22/places.sh
# for possible options
!{HOME}/.local/bin/tippecanoe -zg -rg -o /tmp/geometries.pmtiles --drop-densest-as-needed --extend-zooms-if-still-dropping --maximum-tile-bytes=2500000 --progress-interval=10 -l geometries --force /Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/geometries.fgb
# NOTE: this mv will emit an error related to updating metadata ("mv: preserving
# permissions for ‘[...]’: Operation not permitted"), this can be ignored.
!mv /tmp/geometries.pmtiles /Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/geometries.pmtiles

To visualize, download the PMTiles file from Volumes and upload it to https://pmtiles.io/ (see the screenshot below). To visualize it directly in Databricks Apps instead, see TODO:.

![pmtiles_io](img/pmtilesio_geometries.png)

To be clear: the advantage of the PMTiles format is that it can visualize very large datasets, such as all of OpenStreetMap. This notebook only uses a very simple example, but see TODO: for a larger case.
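
For larger runs, the tippecanoe call and the move to Volumes shown earlier could also be scripted in plain Python, so that a failing tippecanoe run raises an exception instead of passing silently. The sketch below reuses the flags from the cell above; the function names are hypothetical. It also assumes that the `mv` permission error comes from trying to preserve file metadata (as the message suggests), and therefore copies only the file contents with `shutil.copyfile`.

```python
import os
import shutil
import subprocess


def tippecanoe_args(binary, source, target, layer):
    """Argument list mirroring the flags used in the tippecanoe cell above."""
    return [
        binary, "-zg", "-rg", "-o", target,
        "--drop-densest-as-needed",
        "--extend-zooms-if-still-dropping",
        "--maximum-tile-bytes=2500000",
        "--progress-interval=10",
        "-l", layer, "--force", source,
    ]


def move_contents(src, dst):
    """Copy file bytes only (no permissions or timestamps), then delete the source."""
    shutil.copyfile(src, dst)
    os.remove(src)


def build_pmtiles(binary, source, tmp_target, final_target, layer):
    """Run tippecanoe into a local temp file, then move the result to its final path."""
    # check=True raises CalledProcessError if tippecanoe exits with a non-zero status.
    subprocess.run(tippecanoe_args(binary, source, tmp_target, layer), check=True)
    move_contents(tmp_target, final_target)
```

A call would pass the installed binary (for example `os.path.join(os.environ["HOME"], ".local/bin/tippecanoe")`), the `.fgb` path on Volumes, a path under `/tmp`, the final `.pmtiles` path on Volumes, and the layer name `geometries`.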

## Cleanup

In [None]:
%sql
-- drop table tmp_geometries