Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: Consider adding a GroupJoin physical operator #38707

Open
andy-kimball opened this issue Jul 5, 2019 · 3 comments
Open

sql: Consider adding a GroupJoin physical operator #38707

andy-kimball opened this issue Jul 5, 2019 · 3 comments
Assignees
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-queries SQL Queries Team
Projects

Comments

@andy-kimball
Copy link
Contributor

andy-kimball commented Jul 5, 2019

This paper (http://www.vldb.org/pvldb/vol4/p843-moerkotte.pdf) lays out an operator that combines GroupBy and Join operators into a unified operator that runs much faster. This operator would be used when the Join columns are the same as the GroupBy columns. In that special (but common) case, the GroupJoin operator can compute the join results and the aggregate functions in a single step.

For each row on the left side, matching rows are found on the right side, and the aggregate functions are immediately computed over those matching rows. The output cardinality of the GroupJoin is <= the left input cardinality. Contrast this with Join + GroupBy, where the Join first expands the cardinality, only to have the GroupBy immediately reduce it back again.

One of the biggest beneficiaries of a GroupJoin operator would be decorrelated queries. As an example of a common query pattern:

-- Return all customers with more than 100 orders.
SELECT *
FROM customers c
WHERE (SELECT COUNT(*) FROM orders o WHERE o.cust_id = c.id) > 100

This results in the following query plan:

render                    |                |
 └── filter               |                |
      │                   | filter         | count > 100
      └── group           |                |
           │              | aggregate 0    | id
           │              | aggregate 1    | count(cust_id)
           │              | aggregate 2    | any_not_null(name)
           │              | aggregate 3    | any_not_null(bill)
           │              | group by       | id
           │              | ordered        | +id
           └── merge-join |                |
                │         | type           | left outer
                │         | equality       | (id) = (cust_id)
                │         | mergeJoinOrder | +"(id=cust_id)"
                ├── scan  |                |
                │         | table          | customers@primary
                │         | spans          | ALL
                └── scan  |                |
                          | table          | orders@orders_auto_index_fk_cust_id_ref_customers
                          | spans          | ALL

Notice that the LeftJoin is merging its two inputs on the id column of its left input, and that the GroupBy is then grouping on that same column in order to compute the counts. Combining these two operations into one physical operator should give big speedups on some important queries. Here are the numbers from the paper:

TPC-H (SF=1)
    with       without
Q   GroupJoin  GroupJoin
------------------------
13  84ms       278ms
3   70ms       104ms
21  127ms      500ms
5   59ms       68ms
9   212ms      222ms
10  51ms       74ms
16  45ms       49ms
17  33ms       34ms
20  37ms       37ms

Jira issue: CRDB-5618

@awoods187 awoods187 added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jul 8, 2019
@jordanlewis jordanlewis moved this from Triage to Lower priority backlog in [DEPRECATED] Old SQLExec board. Don't move stuff here Jul 9, 2019
@asubiotto asubiotto added this to Triage in BACKLOG, NO NEW ISSUES: SQL Execution via automation Apr 2, 2020
@yuzefovich
Copy link
Member

Oh, this looks interesting, I'll take this on.

@yuzefovich yuzefovich assigned yuzefovich and unassigned jordanlewis Apr 4, 2020
@asubiotto asubiotto moved this from Triage to [VECTORIZED BACKLOG] Enhancements/Features in BACKLOG, NO NEW ISSUES: SQL Execution Apr 15, 2020
@yuzefovich yuzefovich moved this from [VECTORIZED BACKLOG] Enhancements/Features to [GENERAL BACKLOG] Enhancements/Features in BACKLOG, NO NEW ISSUES: SQL Execution Apr 17, 2020
@andy-kimball
Copy link
Contributor Author

andy-kimball commented May 7, 2020

I ran into another query example where this operator would likely result in large speedups. It comes from this paper: https://dl.gi.de/bitstream/handle/20.500.12116/2418/383.pdf?sequence=1.

Here's the schema and data generation commands I used:

create table students (id int8 primary key, name text, major text, year int8);
create table exams (id int8 primary key, sid int8 references students(id), grade decimal, course text, curriculum text, date int8);
insert into students values (1, 'Andy', 'CS', 1996);
insert into students values (2, 'Lisa', 'CS', 2000);
insert into students values (3, 'Jack', 'CS', 1998);
insert into students values (4, 'Zack', 'CS', 1998);
insert into students values (5, 'Sarah', 'CS', 1999);
insert into students values (6, 'Beverly', 'EN', 1998);
insert into students values (7, 'Sam', 'MA', 1996);
insert into students values (8, 'Will', 'MA', 1998);
insert into students values (9, 'Julie', 'EN', 1999);
insert into students values (10, 'Rachel', 'ME', 2000);
insert into students select id+10, name, major, year from students;
insert into students select id+20, name, major, year from students;
insert into students select id+40, name, major, year from students;
insert into students select id+80, name, major, year from students;
insert into students select id+160, name, major, year from students;
insert into students select id+320, name, major, year from students;
insert into students select id+640, name, major, year from students;
insert into exams select unique_rowid(), 1+(random() * 1280)::int, (random() * 5)::decimal, '', 'CS', 1995 + (random() * 6)::int from generate_series(1,5000);

The following query from the paper takes 10.5s to run:

select *
from students s, exams e
where s.id=e.sid and
        (s.major = 'CS' or s.major = 'Games Eng') and
        e.grade >=
        (
                select avg(e2.grade)+1
                from exams e2
                where s.id=e2.sid or (e2.curriculum=s.major and s.year>e2.date)
        )

I believe that would be substantially faster with a group-join operator, since the plan does an outer join which creates a huge output of ~7M rows, then immediately groups that output back down to ~700 rows:

render                             |                    |
 └── filter                        |                    |
      │                            | filter             | any_not_null >= (avg + 1)
      └── group                    |                    |
           │                       | aggregate 0        | id
           │                       | aggregate 1        | avg(grade)
           │                       | aggregate 2        | any_not_null(sid)
           │                       | aggregate 3        | any_not_null(grade)
           │                       | aggregate 4        | any_not_null(course)
           │                       | aggregate 5        | any_not_null(curriculum)
           │                       | aggregate 6        | any_not_null(date)
           │                       | aggregate 7        | any_not_null(name)
           │                       | aggregate 8        | any_not_null(major)
           │                       | aggregate 9        | any_not_null(year)
           │                       | aggregate 10       | any_not_null(id)
           │                       | group by           | id
           └── render              |                    |
                └── cross-join     |                    |
                     │             | type               | right outer
                     │             | pred               | (id = sid) OR ((curriculum = major) AND (year > date))
                     ├── scan      |                    |
                     │             | table              | exams@primary
                     │             | spans              | FULL SCAN
                     └── hash-join |                    |
                          │        | type               | inner
                          │        | equality           | (sid) = (id)
                          │        | right cols are key |
                          ├── scan |                    |
                          │        | table              | exams@primary
                          │        | spans              | FULL SCAN
                          └── scan |                    |
                                   | table              | students@primary
                                   | spans              | FULL SCAN
                                   | filter             | (major = 'CS') OR (major = 'Games Eng')

@jlinder jlinder added the T-sql-queries SQL Queries Team label Jun 16, 2021
@yuzefovich yuzefovich removed this from [GENERAL BACKLOG] Enhancements/Features/Investigations in BACKLOG, NO NEW ISSUES: SQL Execution May 31, 2022
@yuzefovich yuzefovich added this to Triage in SQL Queries via automation May 31, 2022
@yuzefovich yuzefovich moved this from Triage to Backlog in SQL Queries May 31, 2022
@yuzefovich yuzefovich moved this from Backlog (DO NOT ADD NEW ISSUES) to New Backlog in SQL Queries Apr 24, 2023
@github-actions
Copy link

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-queries SQL Queries Team
Projects
Status: Backlog
SQL Queries
New Backlog
Development

No branches or pull requests

5 participants