Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: query datasets from SQL Lab #15241

Merged
merged 3 commits into from
Jun 1, 2022

Conversation

betodealmeida
Copy link
Member

@betodealmeida betodealmeida commented Jun 17, 2021

SUMMARY

Introduce the dataset macro, allowing users to query physical and virtual datasets in SQL Lab. This is useful if you've defined computed columns and metrics on your datasets, and want to reuse the definition in adhoc SQL Lab queries.

Because currently dataset names are not unique the macro uses the dataset ID:

SELECT * FROM {{ dataset(42) }} LIMIT 10

If users want to select the metric definitions as well, in addition to the columns, they can pass an additional keyword argument:

SELECT * FROM {{ dataset(42, include_metrics=True) }} LIMIT 10

Since metrics are aggregations, the resulting SQL expression will be grouped by all non-metric columns. Users can also specify a subset of columns to group by instead:

SELECT * FROM {{ dataset(42, include_metrics=True, groupby=["ds", "category"]) }} LIMIT 10

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

Screenshot 2021-06-17 at 15-12-21 Superset

TESTING INSTRUCTIONS

Added unit tests.

ADDITIONAL INFORMATION

  • Has associated issue:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@codecov
Copy link

codecov bot commented Jun 17, 2021

Codecov Report

Merging #15241 (ed43b79) into master (6244728) will increase coverage by 0.04%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master   #15241      +/-   ##
==========================================
+ Coverage   66.46%   66.51%   +0.04%     
==========================================
  Files        1721     1724       +3     
  Lines       64466    64647     +181     
  Branches     6794     6794              
==========================================
+ Hits        42846    42997     +151     
- Misses      19892    19922      +30     
  Partials     1728     1728              
Flag Coverage Δ
hive 53.68% <16.66%> (-0.01%) ⬇️
mysql ?
postgres 82.27% <100.00%> (+0.08%) ⬆️
presto 53.54% <16.66%> (-0.01%) ⬇️
python 82.62% <100.00%> (+<0.01%) ⬆️
sqlite ?
unit 50.00% <100.00%> (+0.60%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
superset/connectors/base/models.py 86.74% <ø> (ø)
superset/jinja_context.py 90.82% <100.00%> (+0.53%) ⬆️
superset/datasets/filters.py 86.95% <0.00%> (-13.05%) ⬇️
superset/common/utils/dataframe_utils.py 85.71% <0.00%> (-7.15%) ⬇️
superset/db_engine_specs/sqlite.py 91.89% <0.00%> (-5.41%) ⬇️
superset/db_engine_specs/mysql.py 93.97% <0.00%> (-3.62%) ⬇️
superset/utils/celery.py 86.20% <0.00%> (-3.45%) ⬇️
superset/connectors/sqla/utils.py 88.75% <0.00%> (-2.50%) ⬇️
superset/result_set.py 96.85% <0.00%> (-1.58%) ⬇️
superset/models/core.py 88.43% <0.00%> (-0.73%) ⬇️
... and 37 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6244728...ed43b79. Read the comment docs.

@rumbin
Copy link
Contributor

rumbin commented Sep 30, 2021

Any progress in this?
I wonder to what extent this functionality requires the correct database/schema of the dataset to be selected in order to make it work.
Especially when two datasets are to be joined this way, they need to refer to the same database.
Is this being checked?

Dreaming this functionality a bit further, I could imagine having a dataset browser included in the schema browser, which presents all datasets of the selected database, so they can be selected from like the existing tables/views of the DB.

@stale
Copy link

stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. For admin, please label this issue .pinned to prevent stale bot from closing the issue.

@stale stale bot added the inactive Inactive for >= 30 days label Apr 16, 2022
@rumbin
Copy link
Contributor

rumbin commented Apr 16, 2022

This issue should not be closed by the stale bot, in my eyes. The suggested functionality is just too promising.

@stale stale bot removed the inactive Inactive for >= 30 days label Apr 16, 2022
@betodealmeida
Copy link
Member Author

I agree, @rumbin, let me do some more work on this.

@betodealmeida betodealmeida force-pushed the dataset_macro_sqllab branch 2 times, most recently from d567e51 to 3cbdda6 Compare May 17, 2022 21:45
@pull-request-size pull-request-size bot added size/L and removed size/M labels May 17, 2022
@shenrie
Copy link

shenrie commented May 21, 2022

Want, want want....take my money!

Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really cool feature, great way to leverage the SQLA model! Left a couple non-blocking comments, LGTM.

superset/jinja_context.py Show resolved Hide resolved
Comment on lines 633 to 634
"columns": columns,
"groupby": groupby,
Copy link
Member

@villebro villebro May 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since groupby is deprecated in QueryObject and just gets mapped into columns, maybe we could just do this:

Suggested change
"columns": columns,
"groupby": groupby,
"columns": groupby or columns,

I believe it has the same effect

def dataset_macro(
dataset_id: int,
include_metrics: bool = False,
groupby: Optional[List[str]] = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually I was thinking, should we replace groupby with columns here? It feels it kinda makes more sense, as you should be able to pick a subset of columns from the dataset even when you don't have metrics with a GROUP BY

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do, great idea!

@betodealmeida betodealmeida merged commit 05a138a into apache:master Jun 1, 2022
philipher29 pushed a commit to ValtechMobility/superset that referenced this pull request Jun 9, 2022
* feat: Jinja2 macro for querying datasets

* Add docs

* Address comments
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 2.0.0 labels Feb 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/L 🚢 2.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants