
Join and subquery support #5664

Open
siddharthteotia opened this issue Jul 7, 2020 · 9 comments

siddharthteotia (Contributor) commented Jul 7, 2020

There are two standard data-modeling approaches in analytical databases.

Star Schema

This is the de-facto standard for modeling data in data warehouses to efficiently run OLAP (analytical / BI) style queries. There is a single fact table (containing measures / numeric data, e.g. sales) surrounded by one or more dimension tables (e.g. product). There is no relationship between the dimension tables themselves; the only join path is between the fact table and a dimension table. In other words, you get a [1 : many] join, since a single record in a dimension table can be associated with multiple records in the fact table.

One key thing to note about modeling data this way is that the dimension tables are denormalized (which again reiterates that there is no relationship between two dimension tables).
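To make the 1:many shape concrete, here is a minimal sketch in Python, using hypothetical `sales` (fact) and `product` (dimension) tables; all table and column names are illustrative, not from Pinot.

```python
# Hypothetical star-schema tables: a "sales" fact table and a
# denormalized "product" dimension table.
sales = [  # fact rows: (product_id, amount)
    (1, 100.0), (1, 250.0), (2, 75.0),
]
product = [  # dimension rows: (product_id, name, category)
    (1, "widget", "tools"), (2, "gizmo", "toys"),
]

# The only join path is fact -> dimension: each dimension row can
# match many fact rows (a 1:many join on product_id).
dim_by_id = {pid: (name, cat) for pid, name, cat in product}
joined = [(dim_by_id[pid][0], dim_by_id[pid][1], amt) for pid, amt in sales]
```

Note that each of the three fact rows survives the join, while dimension rows are reused: exactly the 1:many relationship described above.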

Snowflake Schema

It is similar to the star schema in that there is a fact table and dimension tables. The key difference is that at least some of the dimension tables are normalized (leading to more dimension tables). This also establishes relationships between the dimension tables themselves.

The potential problem with this schema is the normalization itself, which forces joins even for simple queries. Moreover, writing a join query is considerably more complex against a snowflake schema than against a star schema. This is the main reason the star schema is generally the preferred choice in the OLAP world. Managing the schema (i.e. the collection of tables) is also more complicated in this case.
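The extra join hop that normalization introduces can be sketched as follows: the illustrative `product` dimension from the star example is split so the category text lives in its own `category` table (all names hypothetical), and reaching the category name now requires an additional join.

```python
# Snowflake variant of the earlier star-schema sketch: the category text
# is normalized out of "product" into a separate "category" table.
sales = [(1, 100.0), (2, 75.0)]                  # fact: (product_id, amount)
product = [(1, "widget", 10), (2, "gizmo", 20)]  # dim: (product_id, name, category_id)
category = [(10, "tools"), (20, "toys")]         # dim: (category_id, category_name)

# A query that needs the category name now takes two hops:
# fact -> product -> category.
cat_by_id = {cid: cname for cid, cname in category}
prod_by_id = {pid: (name, cat_by_id[cid]) for pid, name, cid in product}
joined = [(prod_by_id[pid][0], prod_by_id[pid][1], amt) for pid, amt in sales]
```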

There are a few other ways to model data as well, but the above two are the standard ones. The modeling step is critical, since it will dictate (to some extent) which kinds of joins we support, the complexity of such queries, and the complexity at the user end to write them.

Standard Distributed Join Techniques

Now, regardless of what we do above, there are a couple of ways to implement joins in distributed query engines like Pinot/Presto/Kusto/Spark, and broadcast join is one of them.

Broadcast join is a common way to execute a standard star-schema join, where we join a large fact table with smaller dimension table(s). The smaller table is broadcast to each server, which then executes a local in-memory join (potentially a hash join, using the dimension table as the build side and the fact table as the probe side). The reducer/aggregator layer does the final processing. In the Spark community, this is commonly referred to as a map-side join.
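The broadcast step can be sketched in a few lines of Python. This is only an illustration of the technique, not Pinot code: `fact_partitions` stands in for fact segments already spread across servers, and the small dimension table is handed (broadcast) to every server's local hash join.

```python
# Minimal broadcast hash join sketch (all names illustrative).
dim = [(1, "widget"), (2, "gizmo")]          # small dimension table
fact_partitions = [                          # fact rows spread over servers
    [(1, 100.0), (2, 75.0)],                 # server 0's local segment
    [(1, 250.0)],                            # server 1's local segment
]

def local_hash_join(fact_segment, broadcast_dim):
    # Build side: hash table over the broadcast dimension table.
    build = dict(broadcast_dim)
    # Probe side: stream the local fact segment through the hash table.
    return [(build[k], v) for k, v in fact_segment if k in build]

# Every server receives the same broadcast copy of `dim`; the
# reducer/aggregator layer then combines the partial results.
partial_results = [local_hash_join(seg, dim) for seg in fact_partitions]
result = [row for part in partial_results for row in part]
```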

Co-located join for partitioned tables

Another way is to do a co-located join for partitioned tables. Let's say we are joining tables T1 and T2 on the join key column K. If both tables are partitioned on the join key with the same uniform partition function, then each node already has all the data locally to execute its side of the join, without any data movement (as is involved in a broadcast join).

There are more known ways (e.g. shuffle) to do a distributed join, with varying degrees of data movement, complexity, etc. However, I feel the modeling approach we adopt, along with any restrictions, should be the first thing to get clarity on, and this will also be determined (to some extent) by what our users currently expect from Pinot in terms of joins. I feel we should start with limited support for star-schema joins (where we restrict the number of dimension tables) as a reasonable starting point. We can also look at dimension-to-dimension joins, but we need to be more careful when modeling the data for such a scenario.

I will start creating a proposal-cum-design document and share it with the community. Meanwhile, we can use this issue for discussion on requirements, etc.

dharakk (Contributor) commented Oct 28, 2020

Design document for a star-schema join involving a fact table and small dimension tables: https://docs.google.com/document/d/1InWmxbRqwcqIakzvoEWHLxtX4XR9H5L01256EbAUHV8/edit?usp=sharing

This design requires the dimension tables to be available on each host, hence the restriction on size. Future work will lift this restriction by implementing a broadcast join operator, which will enable larger, partitioned dimension tables and help solve more generic join use cases.

@siddharthteotia siddharthteotia changed the title Join support Join and subquery support Jul 12, 2021
siddharthteotia (Contributor, Author) commented:

At LinkedIn, we are exploring a combined design for subqueries and distributed joins, since there may be overlap in design/architecture and it is good to consider both together at the design stage.

cc @amrishlal @jackjlli

Once we have some preliminary thoughts / an initial design, we will share it with the community in the next couple of weeks/months to iterate on and collaborate further.

atris (Contributor) commented Jul 13, 2021

Great to see this. I am excited about this feature, but I don't see a strict need for subqueries to be designed in conjunction with joins. IMO subqueries are simpler and have independent use cases (e.g. nested aggregates).

I have been working on a subquery-only proposal that I plan to share today. Please share your input and feedback on it, and let's ensure it's aligned with your overall plan for join support.

siddharthteotia (Contributor, Author) commented:

@atris, subqueries can broadly be divided into two categories:

  • an inner query (and there could be multiple of these) referring to the same table as the outer query
  • an inner query (and there could be multiple of these) referring to different tables, not necessarily the same as the outer query

Since the overall query might touch more than one table and might require us to consider data movement / exchange operators (as in the case of joins), we think it is better to consider both during design. This is why we want to approach them together in the design phase, to prevent a situation where we design and implement one of the two features and later find ourselves retrofitting the other.
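The two categories above can be illustrated with a toy Python evaluation (all table names and values are hypothetical): the same-table case reduces to a scalar computed over the outer table, while the different-table case is effectively a semi-join, which is why it raises the same data-movement questions as a distributed join.

```python
# Toy illustration of the two subquery categories.
orders = [(1, 100), (2, 50), (3, 200)]  # (customer_id, amount)
vip = [(1,), (3,)]                      # a different table of VIP customer ids

# Category 1: inner query over the SAME table as the outer query, e.g.
#   SELECT * FROM orders WHERE amount > (SELECT AVG(amount) FROM orders)
avg_amount = sum(a for _, a in orders) / len(orders)
above_avg = [row for row in orders if row[1] > avg_amount]

# Category 2: inner query over a DIFFERENT table, e.g.
#   SELECT * FROM orders WHERE customer_id IN (SELECT customer_id FROM vip)
# This is effectively a semi-join between the two tables.
vip_ids = {cid for (cid,) in vip}
vip_orders = [row for row in orders if row[0] in vip_ids]
```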

atris (Contributor) commented Jul 13, 2021

If the outer query is being executed on a broker node, on top of the result of the underlying topmost Combine operator, would this still be a problem?

siddharthteotia (Contributor, Author) commented:

I don't completely follow. Can you give an example? Also, are you approaching this from the angle of making subqueries work for a particular scenario?

MrNeocore commented:

Any update on this topic?

mcvsubbu (Contributor) commented Nov 1, 2021

@kishoreg is anyone working on this?

kishoreg (Member) commented Nov 1, 2021

We are thinking about it. We will fix the query-spi layer first so that the implementation is less invasive.
