Druid aims to be a more powerful analytical database, and implementing joins is a very common ask from the user community. Druid does support some related features today:
Real JOIN support would be more powerful, enabling even more star-schema and subquery-based use cases:
The idea is to add a "join" datasource, expose it through SQL, and add machinery to brokers and data servers to allow hash-based equijoins of zero or one "table" datasource and any number of "lookup", "inline", or "query" datasources. As a side effect, these proposed changes will add the ability to query lookups directly, using a datasource of type "lookup".
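As a sketch of what querying a lookup directly could look like, a Scan query might reference the lookup by name. The query shape, interval, and the "k"/"v" column names here are assumptions for illustration, not confirmed details of this proposal:

```json
{
  "queryType": "scan",
  "dataSource": { "type": "lookup", "lookup": "products" },
  "intervals": ["0000/3000"],
  "columns": ["k", "v"]
}
```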
I think this subset of functionality is a good start, because it adds meaningful new capabilities and helps unify existing ones. Things that are not in scope for this initial proposal include: joining two "table" datasources to each other, non-equijoins, and non-hash-based joins. I think these should all be implemented at some point, as future work.
There are four main areas to this proposal:
The next few sections expand on these areas.
An example SQL query might be:
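For illustration, such a query might look like the following. The sales datasource, the products lookup, and the column names are assumed for the example, not taken from the proposal itself:

```sql
SELECT
  products.v AS product_name,
  SUM(sales.revenue) AS revenue
FROM sales
INNER JOIN lookup.products ON sales.product_id = products.k
GROUP BY products.v
```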
This query takes advantage of the fact that unqualified tables like sales are assumed to be normal datasources. Lookups are referenced as part of the lookup schema, like lookup.products.
Multiple joins can be specified per SQL query. We will need to guide Calcite’s cost-based optimizer towards reordering them optimally. This may require more statistics than we currently possess, so adding these and improving join plans may end up being an area of future work.
The rows coming out of a join datasource would be the result of the join. Any query type could use a join datasource without being aware of the fact that joins exist.
Join datasources can be nested within each other. Unlike SQL, native query evaluation will not reorder joins. It will execute them in the order that the join tree is provided.
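As a sketch of that nesting, a join datasource whose left side is itself a join might look like the following. The datasource names, the "rightPrefix" values, the condition syntax, and the exact field spellings are illustrative assumptions:

```json
{
  "type": "join",
  "left": {
    "type": "join",
    "left": "sales",
    "right": { "type": "lookup", "lookup": "products" },
    "rightPrefix": "p.",
    "condition": "product_id == \"p.k\"",
    "joinType": "INNER"
  },
  "right": { "type": "lookup", "lookup": "stores" },
  "rightPrefix": "s.",
  "condition": "store_id == \"s.k\"",
  "joinType": "INNER"
}
```

Under this proposal the inner join would be evaluated first, in the order given, with no reordering.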
Join datasources will probably not allow "table" datasources to participate, except as the extreme left-hand side of a join.
In order to protect against column name ambiguity (what if the left and right side have a column of the same name?), I propose adding a "rightPrefix" parameter to the join datasource. This would be prefixed to every column name coming from the right side, and should be chosen by the caller to be something that won’t conflict with any left-side columns. Druid SQL will choose one automatically when planning a SQL join.
The join datasource used by the earlier SQL query, above, would be:
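A hedged guess at what that datasource could look like, using the fields described above (the exact field spellings, prefix value, and condition syntax are assumptions):

```json
{
  "type": "join",
  "left": "sales",
  "right": { "type": "lookup", "lookup": "products" },
  "rightPrefix": "p.",
  "condition": "product_id == \"p.k\"",
  "joinType": "INNER"
}
```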
The following technique should be implemented by CachingClusteredClient. The idea is to either evaluate the query locally, or else get it into a format that data servers can handle (see next section).
Data server behavior
The following technique should be implemented by ServerManager (historical) and SinkQuerySegmentWalker (indexer) when a join datasource is encountered.
Some other alternatives considered were:
Add lots of unit tests in the druid-sql and druid-processing modules.
Out of scope for this proposal, but would be nice in the future:
Hi, @kstrempel. IMO, I don't think this is a problem on the same level. Presto does provide richer SQL support, but it does not store data itself, so there would need to be a process to read data out of Druid, which inevitably incurs some performance loss. And if Presto pushes most of the computation down to Druid, then Druid still needs more SQL query capability, which is what this proposal is doing. I think this may be one of the reasons why Spark did not directly borrow the SQL features of Impala or Presto, but implemented Spark SQL on its own.
I think a Druid connector for Presto would help get Druid data into Presto and help support 100% of SQL use cases, so it's interesting. But as @asdf2014 mentioned, doing what we can in Druid directly should mean better performance.
@kstrempel we looked into join support through Presto a couple of years ago.
It's possible, but the Presto connector interface did not have an API to push down predicates. So Druid would have had to forward raw, non-aggregated data to Presto, and Presto would have done all the query processing, which compromises the performance benefits of Druid.
Great to see this initiative.