sql: Consider adding a GroupJoin physical operator #38707
Comments
Oh, this looks interesting, I'll take this on.
I ran into another query example where this operator would likely result in large speedups. It comes from this paper: https://dl.gi.de/bitstream/handle/20.500.12116/2418/383.pdf?sequence=1. Here's the schema and data generation commands I used:
The following query from the paper takes 10.5s to run:
I believe that would be substantially faster with a group-join operator, since the plan does an outer join which creates a huge output of ~7M rows, then immediately groups that output back down to ~700 rows:
We have marked this issue as stale because it has been inactive for
This paper (http://www.vldb.org/pvldb/vol4/p843-moerkotte.pdf) lays out an operator that combines `GroupBy` and `Join` operators into a unified operator that runs much faster. This operator would be used when the `Join` columns are the same as the `GroupBy` columns. In that special (but common) case, the `GroupJoin` operator can compute the join results and the aggregate functions in a single step.

For each row on the left side, matching rows are found on the right side, and the aggregate functions are immediately computed over those matching rows. The output cardinality of the `GroupJoin` is <= the left input cardinality. Contrast this with `Join` + `GroupBy`, where the `Join` first expands the cardinality, only to have the `GroupBy` immediately reduce it back again.

One of the biggest beneficiaries of a `GroupJoin` operator would be decorrelated queries. As an example of a common query pattern:

This results in the following query plan:

Notice that the `LeftJoin` is merging its two inputs on the `id` column of its left input, and that the `GroupBy` is then grouping on that same column in order to compute the counts. Combining these two operations into one physical operator should give big speedups on some important queries. Here are the numbers from the paper:

Jira issue: CRDB-5618
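To make the cardinality argument above concrete, here is a minimal Python sketch of the idea, not the paper's algorithm and not CockroachDB code. It assumes a hash-based strategy, a `COUNT` aggregate, and a left join key that is unique and non-NULL (for example a primary key `id`). The function names `group_join_count` and `join_then_group_count` and the list-based inputs are hypothetical, chosen only for illustration.

```python
def group_join_count(left_ids, right_keys):
    """Hypothetical group-join: left outer join plus COUNT in a single pass.

    Assumes left_ids are unique and non-NULL. No joined rows are ever
    materialized; the only state is one counter per left row.
    """
    # Build: one aggregate accumulator per left row (dicts keep insertion order).
    counts = {lid: 0 for lid in left_ids}
    # Probe: stream the right side and update the matching accumulator.
    for key in right_keys:
        if key in counts:  # unmatched and NULL (None) keys are skipped
            counts[key] += 1
    # Output cardinality equals the left input cardinality.
    return list(counts.items())


def join_then_group_count(left_ids, right_keys):
    """Reference plan: materialize the left outer join, then group it."""
    joined = []
    for lid in left_ids:
        matches = [k for k in right_keys if k == lid]
        if matches:
            joined.extend((lid, k) for k in matches)
        else:
            joined.append((lid, None))  # outer join pads unmatched left rows
    counts = {}
    for lid, k in joined:
        # COUNT(right column) ignores the NULL padding row.
        counts[lid] = counts.get(lid, 0) + (1 if k is not None else 0)
    # Also return the size of the intermediate join output for comparison.
    return list(counts.items()), len(joined)


left = [1, 2, 3]
right = [1, 1, 3, 4, None]
grouped, intermediate_rows = join_then_group_count(left, right)
```

Both functions return the same per-`id` counts, but the reference plan first builds an intermediate result with at least one row per left row and one per match, which is the expansion the group-join avoids.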