
feat: add duckdb as DataSource - Fixes #14563 #19317

Merged: 7 commits into apache:master on Mar 24, 2022

Conversation

@rwhaling (Contributor) commented Mar 22, 2022

SUMMARY

Adds DuckDB as an embedded, in-process OLAP database engine.
DuckDB can directly query CSV or Parquet files on disk; eventually, we should be able to query Parquet files directly on S3 as well.
Supersedes #19265.
Relies on https://github.com/Mause/duckdb_engine for the SQLAlchemy implementation, building on top of @alitrack's original work.
Since DuckDB is, much like SQLite, an in-process, single-threaded engine, the error handling in TestConnectionDatabaseCommand.run feels a bit weird. Might want to work with @Mause on a narrow exception class for the threading blip.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

(Four screenshots attached, taken Mar 22, 2022 between 3:22 PM and 4:42 PM.)

TESTING INSTRUCTIONS

  • Install duckdb_engine via `pip install duckdb-engine`.
  • DuckDB appears in the "add a database" dropdown.
  • Provide the path to a DuckDB database on disk in the URL, e.g. `duckdb:////Users/whoever/path/to/duck.db`.
  • In SQL Lab, query any tables defined in your on-disk database; tables, schemas, and column definitions all appear as normal.
  • In-memory mode hasn't been tested much, but `duckdb:///:memory:` seems to work.
  • Once you are in SQL Lab, you can directly query Parquet or CSV files on disk, e.g. `SELECT * FROM 'test.parquet'` (see the sketch below).
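
For reference, here is a minimal standalone sketch of the same checks outside Superset, assuming `duckdb-engine` is installed; the file paths are illustrative:

```python
from sqlalchemy import create_engine, text

# On-disk database, matching the URL format above (path is illustrative).
engine = create_engine("duckdb:////Users/whoever/path/to/duck.db")
with engine.connect() as conn:
    # DuckDB can scan files on disk straight from SQL.
    rows = conn.execute(text("SELECT * FROM 'test.parquet' LIMIT 5")).fetchall()
    print(rows)

# An in-memory database uses the same dialect with no path.
mem_engine = create_engine("duckdb:///:memory:")
```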

ADDITIONAL INFORMATION

  • Has associated issue: duckdb as DataSource #14563
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

codecov bot commented Mar 22, 2022

Codecov Report

Merging #19317 (4aa477b) into master (d645579) will decrease coverage by 0.03%.
The diff coverage is 74.77%.

❗ Current head 4aa477b differs from pull request most recent head d1e16c4. Consider uploading reports for the commit d1e16c4 to get more accurate results

```
@@            Coverage Diff             @@
##           master   #19317      +/-   ##
==========================================
- Coverage   66.53%   66.50%   -0.04%
==========================================
  Files        1667     1672       +5
  Lines       64360    64564     +204
  Branches     6493     6493
==========================================
+ Hits        42824    42936     +112
- Misses      19854    19946      +92
  Partials     1682     1682
```
| Flag | Coverage Δ |
| --- | --- |
| hive | ? |
| mysql | 81.65% <88.63%> (+0.16%) ⬆️ |
| postgres | 81.69% <88.63%> (+0.16%) ⬆️ |
| presto | ? |
| python | 81.78% <88.63%> (-0.18%) ⬇️ |
| sqlite | 81.46% <88.63%> (+0.16%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
| --- | --- |
| ...t-ui-chart-controls/src/operators/pivotOperator.ts | 100.00% <ø> (ø) |
| ...i-chart-controls/src/operators/resampleOperator.ts | 100.00% <ø> (ø) |
| .../superset-ui-core/src/query/types/QueryFormData.ts | 100.00% <ø> (ø) |
| ...tend/packages/superset-ui-core/src/style/index.tsx | 100.00% <ø> (ø) |
| ...end/plugins/legacy-plugin-chart-chord/src/Chord.js | 0.00% <0.00%> (ø) |
| ...ns/legacy-plugin-chart-chord/src/transformProps.js | 0.00% <0.00%> (ø) |
| .../legacy-plugin-chart-country-map/src/CountryMap.js | 0.00% <0.00%> (ø) |
| ...acy-plugin-chart-country-map/src/transformProps.js | 0.00% <0.00%> (ø) |
| ...ns/legacy-plugin-chart-histogram/src/Histogram.jsx | 0.00% <0.00%> (ø) |
| ...egacy-plugin-chart-histogram/src/transformProps.js | 0.00% <0.00%> (ø) |
| ... and 258 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d645579...d1e16c4.

@zhaoyongjie self-requested a review March 23, 2022

@zhaoyongjie (Member) left a comment:

Except for the lint error, LGTM. BTW, I was impressed by DuckDB as a lightweight column-based database.

@villebro (Member) left a comment:

LGTM, with a few non-blocking style-related comments.

Comment on lines 97 to 99:

```python
            except RuntimeError:
                # Catches the equivalent single-threading error from duckdb.
                alive = engine.dialect.do_ping(conn)
```

nit: could we have these in the same except:

```python
                except (sqlite3.ProgrammingError, RuntimeError):
                    # SQLite can't run on a separate thread, so ``func_timeout`` fails.
                    # RuntimeError catches the equivalent single-threading error from duckdb.
                    alive = engine.dialect.do_ping(conn)
```
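
For context, here is a minimal standalone sketch (not from the PR) of the cross-thread error this handler guards against: sqlite3 connections refuse use from other threads by default and raise `sqlite3.ProgrammingError`, while per the PR description DuckDB raised a plain `RuntimeError` in the analogous situation.

```python
import sqlite3
import threading

conn = sqlite3.connect(":memory:")  # check_same_thread=True by default

def ping() -> None:
    try:
        conn.execute("SELECT 1")
    except sqlite3.ProgrammingError as exc:
        # Prints "SQLite objects created in a thread can only be used in
        # that same thread...", the error the combined except clause catches.
        print(exc)

t = threading.Thread(target=ping)
t.start()
t.join()
```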


```python
    @classmethod
    def get_table_names(
        cls, database: "Database", inspector: Inspector, schema: Optional[str]
```

nit: if we add `from __future__ import annotations` at the beginning of the file, we can remove the quotes. See an example here:

```python
from __future__ import annotations
```

Suggested change:

```diff
-        cls, database: "Database", inspector: Inspector, schema: Optional[str]
+        cls, database: Database, inspector: Inspector, schema: Optional[str]
```
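
For context, a short sketch of why the quotes become unnecessary: with postponed evaluation of annotations (PEP 563), annotations are stored as strings and never evaluated at runtime, so a name imported only under `TYPE_CHECKING` still type-checks. The import path for `Database` below is an assumption for illustration, not taken from the Superset source.

```python
from __future__ import annotations  # postpone evaluation of all annotations (PEP 563)

from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Seen only by the type checker, so no runtime import cost or cycle.
    # (Import path assumed for illustration.)
    from superset.models.core import Database

def describe(database: Database, schema: Optional[str]) -> str:
    # No quotes needed around Database: the annotation stays a string
    # at runtime and is never resolved here.
    return f"{database!r} / {schema}"
```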

@rwhaling (Contributor, Author) commented:

Thanks for the review @villebro @zhaoyongjie - I'll incorporate your feedback and get the linter passing.

@rwhaling (Contributor, Author) commented:

OK, I am about 90% confident that I am running the linter properly locally and that all the tests will pass now. It looks like I cannot trigger CI myself, but if someone kicks it off I think we'll see the build go green.

@zhaoyongjie (Member) commented:

> OK, I am about 90% confident that I am running the linter properly locally and that all the tests will pass now. It looks like I cannot trigger CI myself, but if someone kicks it off I think we'll see the build go green.

@rwhaling, CI looks like it's waiting for other tasks to finish. When CI is all green, I will merge it. Thanks for following up!

@zhaoyongjie reopened this Mar 24, 2022
@zhaoyongjie merged commit 202e34a into apache:master Mar 24, 2022
@villebro (Member) commented:

@rwhaling - One thing I forgot while reviewing: would you be able to open a PR that adds the relevant documentation for connecting to DuckDB? Something similar to what this PR did for Trino (there are some unrelated changes; you can ignore those): #13171

@rwhaling (Contributor, Author) commented:

> @rwhaling - One thing I forgot while reviewing: would you be able to open a PR that adds the relevant documentation for connecting to DuckDB? Something similar to what this PR did for Trino (there are some unrelated changes; you can ignore those): #13171

absolutely 👍

@Alex-Monahan commented Mar 24, 2022

> Since DuckDB is, much like SQLite, an in-process, single-threaded engine, the error handling in TestConnectionDatabaseCommand.run feels a bit weird. Might want to work with @Mause on a narrow exception class for the threading blip.

I had one small comment in case it is helpful! DuckDB is indeed embedded in your local process, but it is multi-threaded and can use as many CPU cores as you would like.

Thanks for building this connector!

villebro pushed a commit that referenced this pull request Apr 3, 2022
* + duckdb support

needs the forked version of [duckdb-engine](https://github.com/alitrack/duckdb_engine)

* Update duckdb.py

update  _time_grain_expressions

* removed superfluous get_all_datasource_names def in duckdb engine spec

* added exception handling for duckdb single-threaded RuntimeError

* fixed linter blips and other stylistic cleanup in duckdb.py

* one last round of linter tweaks in test_connection.py for duckdb support

Co-authored-by: Steven Lee <admin@alitrack.com>
Co-authored-by: Richard Whaling <richardwhaling@Richards-MacBook-Pro.local>
(cherry picked from commit 202e34a)
philipher29 pushed a commit to ValtechMobility/superset that referenced this pull request Jun 9, 2022
(Same commit message as above.)
@inakianduaga commented Jun 21, 2022

Hi, I'm testing this, and Superset doesn't work when duckdb-engine is installed: it throws a 500 internal error on every page in the UI, and the backend says:

```
module 'duckdb_engine' has no attribute 'name'
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1516, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1502, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/app/superset/utils/log.py", line 245, in wrapper
    value = f(*args, **kwargs)
  File "/app/superset/views/core.py", line 2677, in welcome
    "common": common_bootstrap_payload(),
  File "/app/superset/views/base.py", line 368, in common_bootstrap_payload
    available_specs = get_available_engine_specs()
  File "/app/superset/db_engine_specs/__init__.py", line 141, in get_available_engine_specs
    backend = dialect.name
```
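
For context, the last frame shows Superset reading a `name` attribute off each registered SQLAlchemy dialect while building the list of available engine specs. A rough sketch of that pattern, reconstructed from the traceback rather than copied from the Superset source, is:

```python
from importlib.metadata import entry_points

# Iterate every registered "sqlalchemy.dialects" entry point and read its
# `name`; a dialect module that does not expose `name` (as duckdb_engine
# before 0.1.11 did not) raises AttributeError here.
# (Dict-style access matches the Python 3.8 environment in the traceback.)
for ep in entry_points().get("sqlalchemy.dialects", []):
    dialect = ep.load()
    backend = dialect.name
```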

Here's my full Dockerfile:

```dockerfile
FROM apache/superset

# Switching to root to install the required packages
USER root

# Example: installing the MySQL driver to connect to the metadata database
# if you prefer Postgres, you may want to use `psycopg2-binary` instead
RUN pip install psycopg2-binary

# Example: installing a driver to connect to Redshift
# Find which driver you need based on the analytics database
# you want to connect to here:
# https://superset.apache.org/installation.html#database-dependencies
# RUN pip install sqlalchemy-redshift
RUN pip install pyhive[hive]
RUN pip install pyhive[presto]

# DuckDB support
RUN pip install duckdb-engine

ADD ./config/superset_config.py /app/pythonpath/superset_config.py

# Switching back to using the `superset` user
USER superset
```

@Mause commented Jun 21, 2022

@inakianduaga heya, that's on me. Unfortunately SQLAlchemy doesn't really document what's expected from a dialect, so it's hard to keep up with changes on their side. I'll push out a fixed version shortly.

@Mause commented Jun 21, 2022

@inakianduaga it should work now if you try with duckdb_engine 0.1.11, sorry about that 😅

@inakianduaga commented:
OK, thanks. It actually worked for me after switching to the pinned superset:1.5.1 Docker image instead of latest. I'm not sure whether that made the difference or whether your fix landed at the same time, but either way, it works now 🙇

@Mageswaran1989 commented:
I wanted to access S3 Parquet files from Superset's SQL editor. While I am able to do this with DuckDB in my Python shell, I am wondering how to do it with Superset and/or duckdb-engine.

My Python snippet to load Parquet files from S3:

```python
import duckdb

cursor = duckdb.connect()
cursor.execute("INSTALL httpfs;")
cursor.execute("LOAD httpfs;")
cursor.execute("SET s3_region='******'")
cursor.execute("SET s3_access_key_id='**************'")
cursor.execute("SET s3_secret_access_key='*****************************'")
cursor.execute("PRAGMA enable_profiling;")
cursor.execute("SELECT count(*) FROM read_parquet('s3://<bucket>/prefix/*.parquet')")
```

@Mageswaran1989 commented:
@Mause Could you please help with loading S3 parquet files with your engine?
Reference: https://duckdb.org/docs/guides/import/s3_import

@Mause commented Jul 14, 2022

@Mageswaran1989

if you're having an issue with

PS. @zhaoyongjie any chance you could lock this conversation?

@apache apache locked as off-topic and limited conversation to collaborators Jul 15, 2022
@apache apache unlocked this conversation Jul 15, 2022
@apache apache locked as resolved and limited conversation to collaborators Jul 15, 2022
@mistercrunch added labels 🍒 1.5.0, 🍒 1.5.1, 🍒 1.5.2, 🍒 1.5.3, 🏷️ bot, 🚢 2.0.0 on Mar 13, 2024
Labels: 🏷️ bot, lts-v1, size/M, 🍒 1.5.0, 🍒 1.5.1, 🍒 1.5.2, 🍒 1.5.3, 🚢 2.0.0

9 participants