Druid support via SQLAlchemy #4163

betodealmeida · 2018-01-05T20:51:23Z

I recently created a module called druiddb (merged into pydruid this week) that provides a SQLAlchemy dialect for Druid. This allows Superset to talk to Druid using its standard SQLAlchemy connector, instead of the custom one.

One problem with this approach is that Druid does not support joins, and some queries (timeseries with limit) perform an inner join to get the top overall groups. In order to handle Druid correctly I added a new attribute to engine specs called inner_joins, defaulting to true.
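To make the flag concrete, here is a minimal sketch of how such an attribute could be declared and consulted. Only the attribute name `inner_joins` and its default of true come from the description above; the class names and the helper `needs_prequery` are assumptions made for illustration and may not match the actual code.

```python
class BaseEngineSpec:
    """Hypothetical base class holding engine-specific behavior."""

    # Most SQL engines can run the inner join used for timeseries with a limit.
    inner_joins = True


class DruidEngineSpec(BaseEngineSpec):
    """Hypothetical spec for Druid, which does not support joins."""

    inner_joins = False


def needs_prequery(engine_spec, has_series_limit):
    """Return True when the top groups must be fetched by a separate query."""
    return has_series_limit and not engine_spec.inner_joins
```

With this shape, any engine that cannot join only has to override one class attribute.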

If this attribute is false, instead of building the inner join we run a "prequery" that fetches the top groups, similar to how the Druid connector works. The values are then used as an extra filter in the main query. For example, this is how a query with a join looks (from the birth_names dataset):

SELECT name AS name,
       ds AS __timestamp,
       SUM(birth_names.num) AS sum__num
FROM birth_names
JOIN
  (SELECT name AS name__,
          SUM(birth_names.num) AS mme_inner__
   FROM birth_names
   WHERE ds >= '1918-01-05 00:00:00.000000'
     AND ds <= '2018-01-05 11:59:32.000000'
   GROUP BY name
   ORDER BY mme_inner__ DESC
   LIMIT 5
   OFFSET 0) AS anon_1 ON name = name__
WHERE ds >= '1918-01-05 00:00:00.000000'
  AND ds <= '2018-01-05 11:59:32.000000'
GROUP BY name,
         ds
ORDER BY sum__num DESC
LIMIT 50000
OFFSET 0;

And here is how it looks when inner joins are not supported:

SELECT name AS name,
       SUM(birth_names.num) AS sum__num
FROM birth_names
WHERE ds >= '1918-01-05 00:00:00.000000'
  AND ds <= '2018-01-05 12:00:29.000000'
GROUP BY name
ORDER BY SUM(birth_names.num) DESC
LIMIT 5
OFFSET 0;

SELECT name AS name,
       ds AS __timestamp,
       SUM(birth_names.num) AS sum__num
FROM birth_names
WHERE ds >= '1918-01-05 00:00:00.000000'
  AND ds <= '2018-01-05 12:00:29.000000'
  AND (name = 'Michael'
       OR name = 'Christopher'
       OR name = 'David'
       OR name = 'James'
       OR name = 'John')
GROUP BY name,
         ds
ORDER BY sum__num DESC
LIMIT 50000
OFFSET 0;
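The two queries above can be reproduced end to end with a small, self-contained sketch. It uses the standard-library `sqlite3` module as a stand-in for Druid, a tiny made-up table, and hypothetical helper names (`top_groups`, `timeseries_for`); it illustrates the two-phase idea and is not the code from this PR.

```python
import sqlite3

# In-memory stand-in for the birth_names dataset (values are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE birth_names (name TEXT, ds TEXT, num INTEGER)")
conn.executemany(
    "INSERT INTO birth_names VALUES (?, ?, ?)",
    [
        ("Michael", "2000-01-01", 50), ("Michael", "2001-01-01", 40),
        ("David", "2000-01-01", 30), ("David", "2001-01-01", 35),
        ("Zed", "2000-01-01", 1), ("Zed", "2001-01-01", 2),
    ],
)


def top_groups(conn, limit):
    """Phase 1 (the prequery): fetch the names with the largest totals."""
    rows = conn.execute(
        "SELECT name FROM birth_names GROUP BY name "
        "ORDER BY SUM(num) DESC LIMIT ?",
        (limit,),
    ).fetchall()
    return [name for (name,) in rows]


def timeseries_for(conn, names):
    """Phase 2 (the main query): filter on the values from the prequery."""
    if not names:
        return []
    # One bound parameter per name, combined with OR as in the SQL above.
    condition = " OR ".join("name = ?" for _ in names)
    sql = (
        "SELECT name, ds, SUM(num) AS sum__num FROM birth_names "
        f"WHERE ({condition}) GROUP BY name, ds ORDER BY name, ds"
    )
    return conn.execute(sql, names).fetchall()


top = top_groups(conn, 2)
result = timeseries_for(conn, top)
```

Here `top` holds the two names with the largest totals, and `result` contains only the per-date rows for those names, which is the same set of groups the join-based query would select.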

Both queries are shown when clicking "View Query", instead of only the last one.

In order to do that, I added two new arguments to the query object:

prequeries is a list that stores prequeries;
is_prequery is a boolean indicating whether a given query is a prequery or the final one.

When a main query runs a prequery, the prequery is appended to prequeries. The functions query and get_query_string_response then take care of combining the prequeries with the main query, so they can be displayed to the user correctly.
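The bookkeeping described above might look roughly like the following sketch. The keys `prequeries` and `is_prequery` come from the description; the function names `record_sql` and `combine_for_display`, the use of plain dicts, and the `;` separator are assumptions for illustration only.

```python
def record_sql(sql, query_obj):
    """Return the SQL unchanged, recording it first if it is a prequery."""
    if query_obj["is_prequery"]:
        query_obj["prequeries"].append(sql)
    return sql


def combine_for_display(main_sql, query_obj):
    """Join the prequeries and the main query, in the order they ran."""
    statements = query_obj["prequeries"] + [main_sql]
    return ";\n\n".join(statements) + ";"


# The prequery's object shares the main query's list, so the main query
# can see every prequery that ran on its behalf.
main_obj = {"is_prequery": False, "prequeries": []}
pre_obj = {"is_prequery": True, "prequeries": main_obj["prequeries"]}

record_sql("SELECT name FROM birth_names LIMIT 5", pre_obj)
main_sql = record_sql("SELECT name, ds FROM birth_names", main_obj)
display = combine_for_display(main_sql, main_obj)
```

Sharing one list between the two query objects is a design choice of this sketch; it keeps the display logic in a single place.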

betodealmeida · 2018-01-05T21:13:00Z

I'll fix the unit tests.

mistercrunch · 2018-01-05T21:42:41Z

superset/connectors/sqla/models.py

+                result = self.query(subquery_obj)
+                dimensions = [c for c in result.df.columns if c not in metrics]
+                top_groups = self._get_top_groups(result.df, dimensions)
+                qry = qry.where(top_groups)


If where is called multiple times, does SQLAlchemy go for a logical AND? I couldn't find the documentation for that method quickly...

Found the answer: logical AND is applied.

For reference:

return a new select() construct with the given expression added to its WHERE clause, joined to the existing clause via AND, if any. (emphasis mine)
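A quick way to check this behavior is to chain two `where` calls and print the statement. This is a minimal sketch using SQLAlchemy's lightweight `table` and `column` constructs; the table and column names are borrowed from the example queries in this thread, and the filter values are arbitrary.

```python
from sqlalchemy import column, table

# Lightweight table description; no database connection is needed.
birth_names = table("birth_names", column("name"), column("ds"))

qry = (
    birth_names.select()
    .where(birth_names.c.ds >= "1918-01-05")
    .where(birth_names.c.name == "Michael")
)

# Render with the default dialect; values appear as bound parameters.
sql = str(qry)
print(sql)
```

The rendered statement has a single WHERE clause in which the two conditions are joined by AND.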

mistercrunch · 2018-01-05T21:52:55Z

LGTM

* Use druiddb
* Remove auto formatting
* Show prequeries
* Fix subtle bug with lists
* Move arguments to query object
* Fix druid run_query

betodealmeida added 6 commits January 4, 2018 16:09

Use druiddb

2f34ebd

Remove auto formatting

6a99e8a

Show prequeries

bd096ac

Fix subtle bug with lists

25a6d3f

Move arguments to query object

d898c1c

Fix druid run_query

3f77fb4

mistercrunch reviewed Jan 5, 2018


mistercrunch merged commit 686023c into apache:master Jan 5, 2018

mistercrunch mentioned this pull request Jan 8, 2018

SQLAlchemy connector query, add support for 2 phase queries #4085

Closed

betodealmeida added a commit to lyft/incubator-superset that referenced this pull request Jan 8, 2018

Druid support via SQLAlchemy (apache#4163)

69e2810


wenchma pushed a commit to wenchma/incubator-superset that referenced this pull request Nov 16, 2018

Druid support via SQLAlchemy (apache#4163)

5ebab98


This was referenced Jul 18, 2019

Filter Box Caching Incorrectly for Multi-Query Use Case #7666

Closed

[Bugfix] Remove prequery properties from query_obj #7896

Merged

WChCh mentioned this pull request Nov 13, 2019

Global "prequery" for dashboard #8555

Closed

mistercrunch added the 🏷️ bot and 🚢 0.23.0 labels Feb 27, 2024
