Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for GROUP BY ALL/ORDER BY ALL #2899

Merged
merged 6 commits into from
Jan 11, 2022
Merged

Conversation

Mytherin
Copy link
Collaborator

@Mytherin Mytherin commented Jan 10, 2022

The GROUP BY clause is often superfluous, as it often simply contains all of the elements in the SELECT clause that do not contain aggregations. For example, we might write a query like this:

SELECT state, city, COUNT(*) as population
FROM people
GROUP BY state, city;

We see here that the two columns state and city are repeated in the GROUP BY. In reality, the system knows we need to group by at least state and city. If we alter the query and remove either state or city from the group by, we will receive an error:

SELECT state, city, COUNT(*) as population
FROM people
GROUP BY state;
-- Error: Binder Error: column "city" must appear in the GROUP BY clause or be used in an aggregate function

While we can specify additional columns to group by that are not contained in the selection list (e.g. we could specify GROUP BY country, state, city) this is rare in practice. After all, if we group by certain columns but do not add them to the SELECT list, we do not know what we are looking at in the query result!

As an example, in the set of 22 TPC-H queries, every single group by clause contains only columns duplicated from the SELECT clause (and can thus be replaced by GROUP BY ALL).

As a result, in almost all scenarios, the GROUP BY simply contains duplicated columns from the SELECT list. This is not particularly useful, since the system can easily figure out by itself by which columns to group (in fact, it needs to know that to be able to emit such an error message!).

GROUP BY ALL

For that reason, we add support for the GROUP BY ALL, which causes a grouping to happen on all ungrouped columns in the SELECT list. For example:

SELECT state, city, COUNT(*) as population
FROM people
GROUP BY ALL;
-- group by state, city

Now we no longer need to duplicate the columns in the GROUP BY list, and it becomes much easier to add or remove groups, since we only need to add or remove groups in one location! For example, if we want to group just by state, we only need to change the query as follows:

SELECT state, COUNT(*) as population
FROM people
GROUP BY ALL;

ORDER BY ALL

In many cases you will want a result to be ordered for easy viewing/browsing. You might add a clause such as ORDER BY 1 ,2, 3, ... to the query. However, this requires you to keep track of the number of columns and also change the number of orders when you alter the query and change the number of columns. For that reason, we also add an ORDER BY ALL clause, which automatically orders by the columns from left to right.

SELECT state, COUNT(*) as population
FROM people
GROUP BY ALL
ORDER BY ALL;
-- GROUP BY state
-- ORDER BY state, population

ORDER BY ALL can also contain the usual order specifiers (ASC/DESC and NULLS FIRST/NULLS LAST), these will be applied to all columns:

SELECT state, COUNT(*) as population
FROM people
GROUP BY ALL
ORDER BY ALL DESC NULLS LAST;
-- GROUP BY state
-- ORDER BY state DESC NULLS LAST, population DESC NULLS LAST

WHERE
l_shipdate <= CAST('1998-09-02' AS date)
GROUP BY ALL
ORDER BY
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe ORDER BY * here to stress that code path?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, sounds good!

c_address,
c_comment
ORDER BY
revenue DESC
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be a fine stress test for ORDER BY * DESC.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly this won't work since revenue is not the first column in the select list.

GROUP BY
c_count
ORDER BY
custdist DESC,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple ORDER BY * DESC if you swap the SELECT clause order?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Swapping the SELECT clause will change the result, which means we can't read the result from the given CSV anymore.

1 3

query II
SELECT * FROM integers UNION SELECT * FROM integers ORDER BY ALL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the implied order here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as if there was no set operation, the first and then the second column. Set operations do not change the number of columns.


query IIIII nosort v1
SELECT * FROM v1
----
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm missing something here - why are these results empty?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The result checker here uses the labels (v1 and v2) to verify that the result does not change after restarting. The result is empty because when labels are present the result is ignored and instead results with the same label are compared.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh duh I knew that. Sorry!

@Alex-Monahan
Copy link
Contributor

First, I'm still super excited about this feature!

I had one small question. How would this interact with Grouping sets / cube / rollup? Are they mutually exclusive? Or would it be possible to do a syntax like:

Group By
Cube ALL 

Thanks!

@Mytherin
Copy link
Collaborator Author

For now they are mutually exclusive, but in theory this could work like that as well. This would require a bit more work, however. If there is enough demand to extend the syntax to allow CUBE ALL/ROLLUP ALL I would be happy to have a go at that in a future PR.

@Alex-Monahan
Copy link
Contributor

I think that your current solution solves 99% of cases! I'm just curious for my own use (but I'm an outlier and only use those features very occasionally!).

@Mytherin Mytherin merged commit cd8ea60 into duckdb:master Jan 11, 2022
@Mytherin Mytherin deleted the groupbyall branch January 17, 2022 21:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants