sql_database() & sql_table() source functions not working with DuckDB/MotherDuck #2421

@thenaturalist

Description

dlt version

1.8.0

Describe the problem

I want to load data from a DuckDB/MotherDuck source using either the sql_database or sql_table function.

Both fail because dlt does not resolve DuckDB's catalog structure when qualifying table and column names.

I am using https://github.com/Mause/duckdb_engine version 0.15.0.

Expected behavior

dlt generates SQL that correctly identifies the specified table according to DuckDB's catalog conventions, and the pipeline completes without error.
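For context, DuckDB addresses tables through a three-part hierarchy (catalog.schema.table), and an unqualified name is resolved against the current catalog and schema search path. A minimal, simplified sketch of that qualification, with illustrative defaults (this is a model of the convention, not DuckDB's actual resolution code):

```python
# Simplified model of DuckDB-style three-part name qualification
# (catalog.schema.table). Illustrative only: real DuckDB resolution
# also consults the schema search path and attached databases.

def qualify(name: str, catalog: str = "memory", schema: str = "main") -> str:
    """Expand a possibly partial identifier to catalog.schema.table."""
    parts = name.split(".")
    if len(parts) == 3:                      # already fully qualified
        return name
    if len(parts) == 2:                      # schema.table
        return f"{catalog}.{name}"
    return f"{catalog}.{schema}.{name}"      # bare table name

# A bare "foo" only resolves if the current catalog/schema contain it,
# which is why an unqualified "FROM foo" fails against MotherDuck.
print(qualify("foo"))                                            # memory.main.foo
print(qualify("foobar", catalog="db_name", schema="schema_name"))  # db_name.schema_name.foobar
```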

Steps to reproduce

  1. Create a DuckDB instance with a synthetic test table as described here: https://duckdb.org/docs/stable/guides/snippets/create_synthetic_data.html

  2. Create a pipeline which uses DuckDB as a source like so:

import dlt
from sqlalchemy import create_engine
from dlt.sources.sql_database import sql_database, sql_table

# db_name, access_token_suffix, pipeline_name, dataset_name, schema_name,
# and write_disposition are placeholders for the actual values.
conn_string = f"duckdb:///md:{db_name}{access_token_suffix}"

credentials = create_engine(
    conn_string,
    connect_args={
        'read_only': True,
    }
)

pipeline = dlt.pipeline(
    pipeline_name=pipeline_name,
    destination='postgres',
    dataset_name=dataset_name,
    progress="log",
)

source1 = sql_database(
    credentials,
    schema=schema_name,  # with or without, same effect
    table_names=['foobar'],  # or schema_name.foobar or db_name.schema_name.foobar
    reflection_level="full",
)

source2 = sql_database(
    credentials,
    schema=schema_name,  # with or without, same effect
    reflection_level="full",
).with_resources('foobar')

source3 = sql_table(
    credentials,
    schema="info",
    table="foobar",
)

load_info1 = pipeline.run(source1, write_disposition=write_disposition)
print(load_info1)

load_info2 = pipeline.run(source2, write_disposition=write_disposition)
print(load_info2)

load_info3 = pipeline.run(source3, write_disposition=write_disposition)
print(load_info3)

Errors you should expect to see:

dlt.pipeline.exceptions.PipelineStepFailed: Pipeline execution failed at stage extract when processing package 1742387100.2242408 with exception:

<class 'dlt.extract.exceptions.ResourceExtractionError'>
In processing pipe foo: extraction of resource foo in generator table_rows caused an exception: (duckdb.duckdb.CatalogException) Catalog Error: Table with name foo does not exist!
Did you mean "db_name.schema_name.foo"?

LINE 2: FROM foo
             ^
[SQL: SELECT foo.id, foo.created_at, foo.bar
FROM foo]
(Background on this error at: https://sqlalche.me/e/20/f405)

When the table name passed to these functions is fully qualified as db_name.schema_name.table_name instead, the error returned is:

sqlalchemy.exc.InvalidRequestError: Could not reflect: requested table(s) not available in Engine(duckdb:///foobar): db_name.schema_name.table_name
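One likely contributing factor: sql_table() takes schema and table as separate arguments, so a dotted db_name.schema_name.table_name string is looked up verbatim as a single table name and reflection fails. A stdlib-only sketch of splitting such a qualified name into its parts (the helper is mine, not part of dlt):

```python
def split_qualified(name: str):
    """Split 'catalog.schema.table', 'schema.table', or 'table' into
    (catalog, schema, table); missing parts are None. Simplified:
    does not handle quoted identifiers containing dots."""
    parts = name.split(".")
    if len(parts) == 3:
        return parts[0], parts[1], parts[2]
    if len(parts) == 2:
        return None, parts[0], parts[1]
    return None, None, parts[0]

catalog, schema, table = split_qualified("db_name.schema_name.foobar")
print(catalog, schema, table)  # db_name schema_name foobar
# These parts would then be passed separately, e.g. schema=schema, table=table.
```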

Operating system

macOS

Runtime environment

Local

Python version

3.12

dlt data source

sql_database

dlt destination

Postgres

Other deployment details

No response

Additional information

No response

Status: Done