Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Druid support via SQLAlchemy #4163

Merged
merged 6 commits into from Jan 5, 2018
Merged

Conversation

betodealmeida
Copy link
Member

I recently created a module called druiddb (merged into pydruid this week) that provides a SQLAlchemy dialect for Druid. This allows Superset to talk to Druid using its standard SQLAlchemy connector, instead of the custom one.

One problem with this approach is that Druid does not support joins, and some queries (timeseries with limit) perform an inner join to get the top overall groups. In order to handle Druid correctly I added a new attribute to engine specs called inner_joins, defaulting to true.

If this attribute is false, instead of building the inner join we run a "prequery", fetching the top groups similar to how the Druid connector works. The values are then used as an extra filter in the main query. Eg, this is how a query with join works (from the birth_names dataset):

SELECT name AS name,
       ds AS __timestamp,
       SUM(birth_names.num) AS sum__num
FROM birth_names
JOIN
  (SELECT name AS name__,
          SUM(birth_names.num) AS mme_inner__
   FROM birth_names
   WHERE ds >= '1918-01-05 00:00:00.000000'
     AND ds <= '2018-01-05 11:59:32.000000'
   GROUP BY name
   ORDER BY mme_inner__ DESC
   LIMIT 5
   OFFSET 0) AS anon_1 ON name = name__
WHERE ds >= '1918-01-05 00:00:00.000000'
  AND ds <= '2018-01-05 11:59:32.000000'
GROUP BY name,
         ds
ORDER BY sum__num DESC
LIMIT 50000
OFFSET 0;

And here how it looks when inner joins are not supported:

SELECT name AS name,
       SUM(birth_names.num) AS sum__num
FROM birth_names
WHERE ds >= '1918-01-05 00:00:00.000000'
  AND ds <= '2018-01-05 12:00:29.000000'
GROUP BY name
ORDER BY SUM(birth_names.num) DESC
LIMIT 5
OFFSET 0;

SELECT name AS name,
       ds AS __timestamp,
       SUM(birth_names.num) AS sum__num
FROM birth_names
WHERE ds >= '1918-01-05 00:00:00.000000'
  AND ds <= '2018-01-05 12:00:29.000000'
  AND (name = 'Michael'
       OR name = 'Christopher'
       OR name = 'David'
       OR name = 'James'
       OR name = 'John')
GROUP BY name,
         ds
ORDER BY sum__num DESC
LIMIT 50000
OFFSET 0;

Both queries are shown when clicking "View Query", instead of only the last one. See the screenshot:

screen shot 2018-01-05 at 12 24 41 pm

In order to do that, I added two new arguments to the query object:

  1. prequeries is a list that stores prequeries;
  2. is_prequery is a boolean indicating if a given query is the final one, or a prequery.

When a main query runs a prequery, it will append it to prequeries. The functions query and get_query_string_response then take care of combining prequeries with the main query, so it can be displayed to the user correctly.

@betodealmeida
Copy link
Member Author

I'll fix the unit tests.

result = self.query(subquery_obj)
dimensions = [c for c in result.df.columns if c not in metrics]
top_groups = self._get_top_groups(result.df, dimensions)
qry = qry.where(top_groups)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If where is called multiple times, does SQLAlchemy goes for a logical AND? Couldn't find the documentation for that method quickly...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found the answer, logical AND is applied

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference:

return a new select() construct with the given expression added to its WHERE clause, joined to the existing clause via AND, if any. (emphasis mine)

@mistercrunch
Copy link
Member

LGTM

@mistercrunch mistercrunch merged commit 686023c into apache:master Jan 5, 2018
betodealmeida added a commit to lyft/incubator-superset that referenced this pull request Jan 8, 2018
* Use druiddb

* Remove auto formatting

* Show prequeries

* Fix subtle bug with lists

* Move arguments to query object

* Fix druid run_query
michellethomas pushed a commit to michellethomas/panoramix that referenced this pull request May 24, 2018
* Use druiddb

* Remove auto formatting

* Show prequeries

* Fix subtle bug with lists

* Move arguments to query object

* Fix druid run_query
wenchma pushed a commit to wenchma/incubator-superset that referenced this pull request Nov 16, 2018
* Use druiddb

* Remove auto formatting

* Show prequeries

* Fix subtle bug with lists

* Move arguments to query object

* Fix druid run_query
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.23.0 labels Feb 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.23.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants