sql: Consider adding a GroupJoin physical operator #38707

andy-kimball · 2019-07-05T17:09:04Z

This paper (http://www.vldb.org/pvldb/vol4/p843-moerkotte.pdf) lays out an operator that combines GroupBy and Join operators into a unified operator that runs much faster. This operator would be used when the Join columns are the same as the GroupBy columns. In that special (but common) case, the GroupJoin operator can compute the join results and the aggregate functions in a single step.

For each row on the left side, matching rows are found on the right side, and the aggregate functions are immediately computed over those matching rows. The output cardinality of the GroupJoin is <= the left input cardinality. Contrast this with Join + GroupBy, where the Join first expands the cardinality, only to have the GroupBy immediately reduce it back again.

One of the biggest beneficiaries of a GroupJoin operator would be decorrelated queries. As an example of a common query pattern:

-- Return all customers with more than 100 orders.
SELECT *
FROM customers c
WHERE (SELECT COUNT(*) FROM orders o WHERE o.cust_id = c.id) > 100

This results in the following query plan:

render                    |                |
 └── filter               |                |
      │                   | filter         | count > 100
      └── group           |                |
           │              | aggregate 0    | id
           │              | aggregate 1    | count(cust_id)
           │              | aggregate 2    | any_not_null(name)
           │              | aggregate 3    | any_not_null(bill)
           │              | group by       | id
           │              | ordered        | +id
           └── merge-join |                |
                │         | type           | left outer
                │         | equality       | (id) = (cust_id)
                │         | mergeJoinOrder | +"(id=cust_id)"
                ├── scan  |                |
                │         | table          | customers@primary
                │         | spans          | ALL
                └── scan  |                |
                          | table          | orders@orders_auto_index_fk_cust_id_ref_customers
                          | spans          | ALL

Notice that the LeftJoin is merging its two inputs on the id column of its left input, and that the GroupBy is then grouping on that same column in order to compute the counts. Combining these two operations into one physical operator should give big speedups on some important queries. Here are the numbers from the paper:

TPC-H (SF=1)
    with       without
Q   GroupJoin  GroupJoin
------------------------
13  84ms       278ms
3   70ms       104ms
21  127ms      500ms
5   59ms       68ms
9   212ms      222ms
10  51ms       74ms
16  45ms       49ms
17  33ms       34ms
20  37ms       37ms

Jira issue: CRDB-5618

The text was updated successfully, but these errors were encountered:

yuzefovich · 2020-04-04T02:36:30Z

Oh, this looks interesting, I'll take this on.

andy-kimball · 2020-05-07T15:17:48Z

I ran into another query example where this operator would likely result in large speedups. It comes from this paper: https://dl.gi.de/bitstream/handle/20.500.12116/2418/383.pdf?sequence=1.

Here's the schema and data generation commands I used:

create table students (id int8 primary key, name text, major text, year int8);
create table exams (id int8 primary key, sid int8 references students(id), grade decimal, course text, curriculum text, date int8);
insert into students values (1, 'Andy', 'CS', 1996);
insert into students values (2, 'Lisa', 'CS', 2000);
insert into students values (3, 'Jack', 'CS', 1998);
insert into students values (4, 'Zack', 'CS', 1998);
insert into students values (5, 'Sarah', 'CS', 1999);
insert into students values (6, 'Beverly', 'EN', 1998);
insert into students values (7, 'Sam', 'MA', 1996);
insert into students values (8, 'Will', 'MA', 1998);
insert into students values (9, 'Julie', 'EN', 1999);
insert into students values (10, 'Rachel', 'ME', 2000);
insert into students select id+10, name, major, year from students;
insert into students select id+20, name, major, year from students;
insert into students select id+40, name, major, year from students;
insert into students select id+80, name, major, year from students;
insert into students select id+160, name, major, year from students;
insert into students select id+320, name, major, year from students;
insert into students select id+640, name, major, year from students;
insert into exams select unique_rowid(), 1+(random() * 1280)::int, (random() * 5)::decimal, '', 'CS', 1995 + (random() * 6)::int from generate_series(1,5000);

The following query from the paper takes 10.5s to run:

select *
from students s, exams e
where s.id=e.sid and
        (s.major = 'CS' or s.major = 'Games Eng') and
        e.grade >=
        (
                select avg(e2.grade)+1
                from exams e2
                where s.id=e2.sid or (e2.curriculum=s.major and s.year>e2.date)
        )

I believe that would be substantially faster with a group-join operator, since the plan does an outer join which creates a huge output of ~7M rows, then immediately groups that output back down to ~700 rows:

render                             |                    |
 └── filter                        |                    |
      │                            | filter             | any_not_null >= (avg + 1)
      └── group                    |                    |
           │                       | aggregate 0        | id
           │                       | aggregate 1        | avg(grade)
           │                       | aggregate 2        | any_not_null(sid)
           │                       | aggregate 3        | any_not_null(grade)
           │                       | aggregate 4        | any_not_null(course)
           │                       | aggregate 5        | any_not_null(curriculum)
           │                       | aggregate 6        | any_not_null(date)
           │                       | aggregate 7        | any_not_null(name)
           │                       | aggregate 8        | any_not_null(major)
           │                       | aggregate 9        | any_not_null(year)
           │                       | aggregate 10       | any_not_null(id)
           │                       | group by           | id
           └── render              |                    |
                └── cross-join     |                    |
                     │             | type               | right outer
                     │             | pred               | (id = sid) OR ((curriculum = major) AND (year > date))
                     ├── scan      |                    |
                     │             | table              | exams@primary
                     │             | spans              | FULL SCAN
                     └── hash-join |                    |
                          │        | type               | inner
                          │        | equality           | (sid) = (id)
                          │        | right cols are key |
                          ├── scan |                    |
                          │        | table              | exams@primary
                          │        | spans              | FULL SCAN
                          └── scan |                    |
                                   | table              | students@primary
                                   | spans              | FULL SCAN
                                   | filter             | (major = 'CS') OR (major = 'Games Eng')

github-actions · 2023-09-20T11:08:31Z

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

andy-kimball assigned jordanlewis Jul 5, 2019

andy-kimball added this to Triage in [DEPRECATED] Old SQLExec board. Don't move stuff here via automation Jul 5, 2019

awoods187 added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jul 8, 2019

jordanlewis moved this from Triage to Lower priority backlog in [DEPRECATED] Old SQLExec board. Don't move stuff here Jul 9, 2019

asubiotto added this to Triage in BACKLOG, NO NEW ISSUES: SQL Execution via automation Apr 2, 2020

asubiotto removed this from Lower priority backlog in [DEPRECATED] Old SQLExec board. Don't move stuff here Apr 2, 2020

yuzefovich assigned yuzefovich and unassigned jordanlewis Apr 4, 2020

asubiotto moved this from Triage to [VECTORIZED BACKLOG] Enhancements/Features in BACKLOG, NO NEW ISSUES: SQL Execution Apr 15, 2020

yuzefovich moved this from [VECTORIZED BACKLOG] Enhancements/Features to [GENERAL BACKLOG] Enhancements/Features in BACKLOG, NO NEW ISSUES: SQL Execution Apr 17, 2020

jlinder added the T-sql-queries SQL Queries Team label Jun 16, 2021

yuzefovich mentioned this issue Sep 9, 2021

WIP colexec: introduce HashGroupJoiner #69994

Closed

yuzefovich removed this from [GENERAL BACKLOG] Enhancements/Features/Investigations in BACKLOG, NO NEW ISSUES: SQL Execution May 31, 2022

yuzefovich added this to Triage in SQL Queries via automation May 31, 2022

yuzefovich moved this from Triage to Backlog in SQL Queries May 31, 2022

yuzefovich moved this from Backlog (DO NOT ADD NEW ISSUES) to New Backlog in SQL Queries Apr 24, 2023

github-actions bot added the no-issue-activity label Sep 20, 2023

yuzefovich removed the no-issue-activity label Sep 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql: Consider adding a GroupJoin physical operator #38707

sql: Consider adding a GroupJoin physical operator #38707

andy-kimball commented Jul 5, 2019 •

edited by cockroach-jira-scripts

yuzefovich commented Apr 4, 2020

andy-kimball commented May 7, 2020 •

edited

github-actions bot commented Sep 20, 2023

sql: Consider adding a GroupJoin physical operator #38707

sql: Consider adding a GroupJoin physical operator #38707

Comments

andy-kimball commented Jul 5, 2019 • edited by cockroach-jira-scripts

yuzefovich commented Apr 4, 2020

andy-kimball commented May 7, 2020 • edited

github-actions bot commented Sep 20, 2023

andy-kimball commented Jul 5, 2019 •

edited by cockroach-jira-scripts

andy-kimball commented May 7, 2020 •

edited