-
Notifications
You must be signed in to change notification settings - Fork 2.5k
Add support for Pivot/Unpivot statements #6387
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…upport for star and COLUMNS expressions in unpivot
|
This looks really awesome and is something I've wanted to do for a long time. Since which direction pivot and unpivot move the data, can I ask that this is very explicit and the functions are called just pivot_longer and pivot_wider? |
|
I can't wait to see this in duckdb since we rely on the Snowflake pivot function to deal with large denormalized tables similar to your initial example. Our assignment tables include extraneous columns and numerous categorical values in the 'name' field, so I'm sharing a sample query here for consideration while you design the duckdb solution. Where pivoting fails is when there are multiple userpropertyvalues for a single userid and userpropertyname. In this case, my query only returns a single "max" value. I know there's manual solutions but aggregated values would be a magnificent addition to any pivot function. No pressure, and thank you for adding pivot to duckdb. |
You could use a D PIVOT Cities ON Year USING LIST(Population) GROUP BY Country;
┌─────────┬─────────────┬─────────────┬─────────────┐
│ Country │ 2000 │ 2010 │ 2020 │
│ varchar │ int32[] │ int32[] │ int32[] │
├─────────┼─────────────┼─────────────┼─────────────┤
│ NL │ [1005] │ [1065] │ [1158] │
│ US │ [564, 8015] │ [608, 8175] │ [738, 8772] │
└─────────┴─────────────┴─────────────┴─────────────┘
|
Sure, I have added |
|
Thanks! I think this will lead to a lot less confusion and many more happy users. |
|
This is amazing. Now DuckDB has all basic data transformations. Really cool! |
This comment was marked as abuse.
This comment was marked as abuse.
|
v0.7.1 was a bug-fix release - this will be in v0.8.0 |
This comment was marked as abuse.
This comment was marked as abuse.
|
This is super awesome! I am going to use the heck out of this feature. Thank you!! I do have one request though @Mytherin... Sorry to even ask! However, it's the last piece that's needed for full Excel pivot table parity. Is there a way that the high level pivot syntax could support multiple aggregation metrics / measures? It appears that the SQL standard syntax that you implemented can support it based on this test: # multiple aggregations
query IIIIIIIII
SELECT year, q1_total, q1_avg, q2_total, q2_avg, q3_total, q3_avg, q4_total, q4_avg
FROM (SELECT year, quarter, sales FROM sales) AS s
PIVOT (sum(sales) AS total, avg(sales) AS avg
FOR quarter
IN (1 AS q1, 2 AS q2, 3 AS q3, 4 AS q4));Could we have multiple aggregates in the USING clause, maybe something like this? PIVOT Cities ON Year USING (SUM(Population) as sum_pop, count(population) as count_pop) GROUP BY Country;or maybe even just without the aliases? PIVOT Cities ON Year USING (SUM(Population), count(population)) GROUP BY Country;I added parentheses, but maybe they wouldn't be needed either? PIVOT Cities ON Year USING SUM(Population), count(population) GROUP BY Country; |
|
Yes, this is actually trivially implemented as it was already supported. Pushed it in #6644 |
|
GAME ON!! For those of you following along at home, that is a 10 minute turnaround time from the wizard himself, Dr. @Mytherin... Thank you! |
|
Dr. Mytherin please ;) |
Gah!! Sorry! Fixed... |
|
Hah, no worries :) not a stickler for titles, but if I get an opportunity to reference Austin Powers I will :) |
|
Hi, this is great! I have a question though and sorry in advance if it comes from ignorance. I hope I was clear, if not feel free to ask any clarifying questions Thanks! |
Implements #2061
This PR adds support for the pivot/unpivot statements in DuckDB. We add both the "SQL standard" manner of pivoting/unpivoting (supported by Snowflake, BigQuery, DataBricks, SQL Server, others?) and a pivot/unpivot statement that is (hopefully) easier to use/understand.
In the examples below we use this example data set:
Or for the non-database engines out there:
Pivot
Pivot, also known as
PIVOT_WIDER, takes the values of a column and turns them into separate columns.For example, in the above table, if we want to move the values from the "year" and "population" column into separate columns (pivoted over name) we can do that as follows:
And using the SQL Standard pivot syntax:
As part of the syntax, we need to specify:
Note that when using the new
Pivotsyntax we do not need to specify the values of the columns - they are automatically discovered from the table. We can still specify the columns, however, in case we want to pivot only a subset of values. Any values not specified will be ignored.We can also specify an optional
GROUP BYclause, in case we want to group by only a subset of the remaining columns. For example:Unpivot
Unpivot, also known as
PIVOT_LONGER, takes the names of a set of columns and turns them into values. For example, if we start with the previously pivoted table:We can revert it back to the original table using the following
UNPIVOTstatement:Using the SQL standard syntax, that looks like this:
The unpivot clause takes the given set of columns and decomposes them into two columns: a
NAMEand aVALUEcolumn. TheNAMEcolumn contains the column names, and theVALUEcolumn contains the column values.Note that providing names for these columns is optional. We could also run the query without specifying these. In that case, the columns will be called
nameandvalue.We also allow
COLUMNSand star expressions in theUNPIVOT. For example:Internals
Pivoting and unpivoting are implemented entirely as rewrites into SQL queries.
PIVOTis implemented as a set of aggregations withFILTERclauses, andUNPIVOTis implemented using lists andUNNESTstatement.Here are the equivalent SQL clauses:
Pivot
Becomes:
Both providing the following result:
Unpivot
Becomes:
Both providing the following result:
Pivot without specifying the names
The true magic (or hack - depending on your perspective :)) in this PR is in allowing
PIVOTwithout specifying the column names up front. DuckDB, like most SQL engines, is completely dependent on knowing all the column names and types within a single SQL statement up front. The magic - and what happens here - is that running aPIVOTwithout specifying the names will result in running multiple SQL statements.Specifically, we (ab)use the enums for this purpose. Enums hold a set of values that are known at query compilation time, and can be created from a query.
So the following query:
Actually gets translated into the following queries:
Using multiple pivots is also possible. For example, we can join on the results of a
PIVOTstatement with anotherPIVOTstatement: