# Read in a Delta Lake table with a geometry column in DuckDB

The best method depends on whether the table (or the sample you need) fits into (driver) memory or not.
- If it does, you can simply go through Arrow.
- If not, you can write out a copy of your data to plain Parquet file(s) in a Volume, which DuckDB can read.
- Finally, if your data set is so large that you want to avoid the copy, you can use Temporary Table Credentials together with the `delta` extension of DuckDB. This comes with some limitations: it requires extra permissions on the Unity Catalog object and the caller, and the `delta` extension does not support `GEOMETRY` types yet.

## Setup

In [None]:
%pip install duckdb --quiet

import duckdb

In [None]:
CATALOG = "mainworkspace_1863054340605750"
SCHEMA = "dsparing"
VOLUME = "default"
TABLENAME = "tmp_delta2duck"

table_fullname = f"{CATALOG}.{SCHEMA}.{TABLENAME}"

## Delta Lake to DuckDB via Arrow

If your (sample) data fits into memory, you can go through Arrow:


In [None]:
spark.sql("select st_point(1, 2, 28992) as geometry").write.mode(
    "overwrite"
).saveAsTable(table_fullname)

dfa = spark.table(table_fullname).toArrow()

> [!NOTE]
> If the install below stalls, HTTP traffic might be blocked; see [TODO: link] for the workaround.

In [None]:
HTTP_BLOCKED = True

if HTTP_BLOCKED:
    import os
    from urllib.parse import urlparse
    import requests

    ARCHITECTURE = "linux_amd64"
    duckdb_version = duckdb.__version__
    url = f"https://extensions.duckdb.org/v{duckdb_version}/{ARCHITECTURE}/httpfs.duckdb_extension.gz"

    output_file = os.path.basename(urlparse(url).path)
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    with open(output_file, "wb") as f:
        f.write(response.content)

    duckdb.install_extension(output_file)

    os.remove(output_file)

    duckdb.sql("SET custom_extension_repository='https://extensions.duckdb.org'")
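The `ARCHITECTURE` value in the cell above is hard-coded for x86-64 Linux. If the cluster might run on another platform (for example an ARM-based instance type), the platform string could be derived instead of typed in. The following is a minimal sketch: `duckdb_platform` is a hypothetical helper, not part of DuckDB, and its mapping only covers the common platform names (`linux_amd64`, `linux_arm64`, `osx_amd64`, `osx_arm64`, `windows_amd64`). Variant builds with other suffixes are not handled.

```python
import platform

# Assumed mapping from Python's platform names to DuckDB's extension platform names.
_OS_NAMES = {"Linux": "linux", "Darwin": "osx", "Windows": "windows"}
_ARCH_NAMES = {
    "x86_64": "amd64",
    "AMD64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def duckdb_platform(system=None, machine=None):
    """Return a DuckDB-style platform string such as 'linux_amd64'.

    Hypothetical helper: falls back to the current interpreter's platform
    when no arguments are given.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    try:
        return f"{_OS_NAMES[system]}_{_ARCH_NAMES[machine]}"
    except KeyError:
        raise ValueError(f"Unsupported platform: {system}/{machine}") from None


print(duckdb_platform("Linux", "x86_64"))
```

DuckDB can also report its own platform with `PRAGMA platform`, which is the more authoritative source when a connection is available; the helper above only avoids needing one.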

In [None]:
duckdb.sql("install spatial; load spatial")

In [None]:
duckdb.sql("select geometry.srid, st_geomfromwkb(geometry.wkb) geometry from dfa")
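The query above reads two separate fields from the Arrow column, `geometry.srid` and `geometry.wkb`. The SRID has to travel separately because plain WKB has no field for it. The sketch below illustrates this for a single 2D point, assuming plain ISO WKB rather than the EWKB variant; `point_to_wkb` and `wkb_to_point` are hypothetical helpers written only for this illustration.

```python
import struct


def point_to_wkb(x, y):
    """Encode a 2D point as little-endian WKB.

    Layout: 1 byte order flag (1 = little-endian), a uint32 geometry type
    (1 = Point), then two float64 coordinates. There is no SRID field.
    """
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(wkb):
    """Decode a 2D point from WKB and return (x, y)."""
    prefix = "<" if wkb[0] == 1 else ">"
    geom_type, x, y = struct.unpack(prefix + "Idd", wkb[1:21])
    if geom_type != 1:
        raise ValueError(f"Not a 2D point, geometry type is {geom_type}")
    return x, y


wkb = point_to_wkb(1.0, 2.0)
print(len(wkb), wkb.hex())
print(wkb_to_point(wkb))
```

The encoded point is 21 bytes long, all of which are used by the byte order flag, the geometry type, and the coordinates. This is why the SRID must be selected as its own column if it is needed downstream.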

## Delta Lake to DuckDB via a Parquet copy in Volumes

Note that the SRID is lost in this transformation, but we can capture it beforehand for later processing, e.g. for the Flatgeobuf export below.

In [None]:
srid = (
    spark.table(table_fullname)
    .selectExpr("any_value(st_srid(geometry)) as srid")
    .first()[0]
)

In [None]:
parquet_volume_path = (
    f"/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/parquet/{TABLENAME}.parquet"
)

spark.table(table_fullname).write.mode("overwrite").parquet(parquet_volume_path)

In [None]:
!ls {parquet_volume_path}/part-*.parquet

In [None]:
duckdb.sql(
    f"""select * replace(st_geomfromwkb(geometry) as geometry)
    from read_parquet('{parquet_volume_path}/part-*.parquet')"""
)

### Side story: Streaming Flatgeobuf
We can also use this Parquet copy and DuckDB to further convert it into a Flatgeobuf file, which can e.g. be very efficiently streamed to QGIS:

In [None]:
fgb_volume_path = f"/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/fgb/{TABLENAME}.fgb"

duckdb.sql(
    f"""COPY (
    select * replace(st_geomfromwkb(geometry) as geometry)
    from read_parquet('{parquet_volume_path}/part-*.parquet')
) TO '{fgb_volume_path}' (
    FORMAT GDAL,
    DRIVER flatgeobuf,
    LAYER_CREATION_OPTIONS 'TEMPORARY_DIR=/tmp/',
    SRS '{srid}'  -- doesn't seem to be used by QGIS downstream
)
"""
)

fgb_volume_path

You can download the above Flatgeobuf file and open it in QGIS -- or even better, with a PAT, you can stream it via the Files API. Copy the result of the below cell into the source of your new vector layer in QGIS, replacing the section `<INSERT PAT>` with your actual PAT:

In [None]:
f"/vsicurl?header.Authorization=Bearer%20<INSERT PAT>&url=https://{spark.conf.get('spark.databricks.workspaceUrl')}/api/2.0/fs/files{fgb_volume_path}"
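As far as I can tell from the GDAL documentation, option values in a `/vsicurl?` source string, including the value of `url`, are meant to be URL-encoded. The unencoded form above evidently works for this path, but a volume path with spaces or other special characters might not. The following sketch builds the encoded form; `vsicurl_source` is a hypothetical helper, and the host, path, and token in the example call are made-up placeholders.

```python
from urllib.parse import quote


def vsicurl_source(workspace_host, volume_path, token):
    """Build a /vsicurl source string for the Databricks Files API.

    Hypothetical helper: both the Authorization header value and the URL
    are percent-encoded, on the assumption that GDAL decodes them.
    """
    url = f"https://{workspace_host}/api/2.0/fs/files{volume_path}"
    auth = quote(f"Bearer {token}", safe="")
    return f"/vsicurl?header.Authorization={auth}&url={quote(url, safe='')}"


# Placeholder values for illustration only.
print(
    vsicurl_source(
        "example.cloud.databricks.com",
        "/Volumes/main/default/vol/fgb/my table.fgb",
        "dapi-example",
    )
)
```

In the notebook, the call would presumably take `spark.conf.get('spark.databricks.workspaceUrl')` and `fgb_volume_path` as the first two arguments. The token should be passed in as-is, since the helper does the encoding itself.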

## Delta Lake to DuckDB via Temporary Table Credentials

The `delta` extension of DuckDB does not support GEOMETRY types yet (as of July 2025), so the below approach only makes sense if your geometry column is still in WKB (or WKT).

In [None]:
spark.sql(
    f"""select * except (geometry), st_aswkb(geometry) as wkb_geometry
    from {table_fullname}"""
).write.mode("overwrite").saveAsTable(f"{table_fullname}_wkb")

In [None]:
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import TableOperation

w = WorkspaceClient()

In [None]:
ttc = w.temporary_table_credentials.generate_temporary_table_credentials(
    operation=TableOperation.READ,
    table_id=w.tables.get(f"{table_fullname}_wkb").table_id,
)

metastore_region = w.metastores.get(w.metastores.current().metastore_id).region

storage_location = w.tables.get(f"{table_fullname}_wkb").storage_location

In [None]:
import os  # only imported above when HTTP_BLOCKED is True

os.environ["AWS_ACCESS_KEY_ID"] = ttc.aws_temp_credentials.access_key_id
os.environ["AWS_SECRET_ACCESS_KEY"] = ttc.aws_temp_credentials.secret_access_key
os.environ["AWS_SESSION_TOKEN"] = ttc.aws_temp_credentials.session_token
os.environ["AWS_DEFAULT_REGION"] = metastore_region

# These explicit installs are probably only needed if HTTP is blocked; otherwise the
# extensions would be installed implicitly by `CREATE SECRET` and `delta_scan()`
duckdb.sql("install aws; load aws")
duckdb.sql("install delta; load delta")

duckdb.sql("""
CREATE OR REPLACE SECRET (
    TYPE s3,
    PROVIDER credential_chain
)""")
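An alternative to exporting the temporary credentials into process-wide environment variables would be to pass them to the secret directly. DuckDB's S3 secrets accept `KEY_ID`, `SECRET`, `SESSION_TOKEN`, and `REGION` parameters, plus an optional `SCOPE` that limits the secret to a path prefix. The following is a minimal sketch under that assumption; `s3_secret_sql` is a hypothetical helper that only builds the SQL string, and the values in the example call are made-up placeholders.

```python
def s3_secret_sql(key_id, secret, session_token, region, scope=None):
    """Build a CREATE SECRET statement with explicit temporary credentials.

    Hypothetical helper: values are embedded as SQL string literals, with
    single quotes doubled for escaping.
    """

    def literal(value):
        return "'" + value.replace("'", "''") + "'"

    options = [
        "TYPE s3",
        f"KEY_ID {literal(key_id)}",
        f"SECRET {literal(secret)}",
        f"SESSION_TOKEN {literal(session_token)}",
        f"REGION {literal(region)}",
    ]
    if scope is not None:
        # Restrict the secret to one path prefix, e.g. the table's storage location.
        options.append(f"SCOPE {literal(scope)}")
    return "CREATE OR REPLACE SECRET (\n    " + ",\n    ".join(options) + "\n)"


# Placeholder values for illustration only.
print(s3_secret_sql("AKIAEXAMPLE", "s3cr3t", "tok'en", "eu-west-1", "s3://bucket/path"))
```

In the notebook, the arguments would presumably come from `ttc.aws_temp_credentials`, `metastore_region`, and `storage_location`, and the resulting string would be passed to `duckdb.sql(...)`. Since the statement contains the credentials in clear text, it should not be printed or logged outside of an illustration like this one.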

In [None]:
duckdb.sql(f"""
select 
* exclude (wkb_geometry), st_geomfromwkb(wkb_geometry) geometry
from
delta_scan('{storage_location}')
""")