Skip to content

Add support for Pivot/Unpivot statements #6387

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 39 commits into from
Feb 21, 2023
Merged

Conversation

Mytherin
Copy link
Collaborator

Implements #2061

This PR adds support for the pivot/unpivot statements in DuckDB. We add both the "SQL standard" manner of pivoting/unpivoting (supported by Snowflake, BigQuery, DataBricks, SQL Server, others?) and a pivot/unpivot statement that is (hopefully) easier to use/understand.

In the examples below we use this example data set:

CREATE TABLE Cities(Country VARCHAR, Name VARCHAR, Year INT, Population INT);
INSERT INTO Cities VALUES ('NL', 'Amsterdam', 2000, 1005);
INSERT INTO Cities VALUES ('NL', 'Amsterdam', 2010, 1065);
INSERT INTO Cities VALUES ('NL', 'Amsterdam', 2020, 1158);
INSERT INTO Cities VALUES ('US', 'Seattle', 2000, 564);
INSERT INTO Cities VALUES ('US', 'Seattle', 2010, 608);
INSERT INTO Cities VALUES ('US', 'Seattle', 2020, 738);
INSERT INTO Cities VALUES ('US', 'New York City', 2000, 8015);
INSERT INTO Cities VALUES ('US', 'New York City', 2010, 8175);
INSERT INTO Cities VALUES ('US', 'New York City', 2020, 8772);

Or for the non-database engines out there:

┌─────────┬───────────────┬───────┬────────────┐
│ Country │     Name      │ Year  │ Population │
│ varchar │    varchar    │ int32 │   int32    │
├─────────┼───────────────┼───────┼────────────┤
│ NL      │ Amsterdam     │  2000 │       1005 │
│ NL      │ Amsterdam     │  2010 │       1065 │
│ NL      │ Amsterdam     │  2020 │       1158 │
│ US      │ Seattle       │  2000 │        564 │
│ US      │ Seattle       │  2010 │        608 │
│ US      │ Seattle       │  2020 │        738 │
│ US      │ New York City │  2000 │       8015 │
│ US      │ New York City │  2010 │       8175 │
│ US      │ New York City │  2020 │       8772 │
└─────────┴───────────────┴───────┴────────────┘

Pivot

Pivot, also known as PIVOT_WIDER, takes the values of a column and turns them into separate columns.

For example, in the above table, if we want to move the values from the "year" and "population" column into separate columns (pivoted over name) we can do that as follows:

PIVOT Cities ON Year USING SUM(Population);
┌─────────┬───────────────┬────────┬────────┬────────┐
│ Country │     Name      │  200020102020  │
│ varcharvarchar    │ int128 │ int128 │ int128 │
├─────────┼───────────────┼────────┼────────┼────────┤
│ NL      │ Amsterdam     │   100510651158 │
│ US      │ Seattle       │    564608738 │
│ US      │ New York City │   801581758772 │
└─────────┴───────────────┴────────┴────────┴────────┘

And using the SQL Standard pivot syntax:

FROM Cities PIVOT (SUM(Population) FOR Year IN (2000, 2010, 2020));
┌─────────┬───────────────┬────────┬────────┬────────┐
│ Country │     Name      │  200020102020  │
│ varcharvarchar    │ int128 │ int128 │ int128 │
├─────────┼───────────────┼────────┼────────┼────────┤
│ NL      │ Amsterdam     │   100510651158 │
│ US      │ Seattle       │    564608738 │
│ US      │ New York City │   801581758772 │
└─────────┴───────────────┴────────┴────────┴────────┘

As part of the syntax, we need to specify:

  • The column to pivot on (i.e. from which we take the names for the new columns)
  • The aggregate used to fill in the values of the columns (this must be an aggregate because we might have multiple matches)
  • (Optionally) the columns to group by - by default this is all columns not mentioned in the above clauses

Note that when using the new Pivot syntax we do not need to specify the values of the columns - they are automatically discovered from the table. We can still specify the columns, however, in case we want to pivot only a subset of values. Any values not specified will be ignored.

PIVOT Cities ON Year IN (2000, 2020) USING SUM(Population);
┌─────────┬───────────────┬────────┬────────┐
│ Country │     Name      │  20002020  │
│ varcharvarchar    │ int128 │ int128 │
├─────────┼───────────────┼────────┼────────┤
│ NL      │ Amsterdam     │   10051158 │
│ US      │ Seattle       │    564738 │
│ US      │ New York City │   80158772 │
└─────────┴───────────────┴────────┴────────┘

We can also specify an optional GROUP BY clause, in case we want to group by only a subset of the remaining columns. For example:

PIVOT Cities ON Year USING SUM(Population) GROUP BY Country;
┌─────────┬────────┬────────┬────────┐
│ Country │  200020102020  │
│ varchar │ int128 │ int128 │ int128 │
├─────────┼────────┼────────┼────────┤
│ NL      │   100510651158 │
│ US      │   857987839510 │
└─────────┴────────┴────────┴────────┘

Unpivot

Unpivot, also known as PIVOT_LONGER, takes the names of a set of columns and turns them into values. For example, if we start with the previously pivoted table:

CREATE TABLE PivotedCities AS PIVOT Cities ON Year USING SUM(Population);
┌─────────┬───────────────┬────────┬────────┬────────┐
│ Country │     Name      │  200020102020  │
│ varcharvarchar    │ int128 │ int128 │ int128 │
├─────────┼───────────────┼────────┼────────┼────────┤
│ NL      │ Amsterdam     │   100510651158 │
│ US      │ Seattle       │    564608738 │
│ US      │ New York City │   801581758772 │
└─────────┴───────────────┴────────┴────────┴────────┘

We can revert it back to the original table using the following UNPIVOT statement:

UNPIVOT PivotedCities ON 2000, 2010, 2020 INTO NAME Year VALUE Population;
┌─────────┬───────────────┬─────────┬────────────┐
│ Country │     Name      │  Year   │ Population │
│ varchar │    varchar    │ varchar │   int128   │
├─────────┼───────────────┼─────────┼────────────┤
│ NL      │ Amsterdam     │ 2000    │       1005 │
│ NL      │ Amsterdam     │ 2010    │       1065 │
│ NL      │ Amsterdam     │ 2020    │       1158 │
│ US      │ Seattle       │ 2000    │        564 │
│ US      │ Seattle       │ 2010    │        608 │
│ US      │ Seattle       │ 2020    │        738 │
│ US      │ New York City │ 2000    │       8015 │
│ US      │ New York City │ 2010    │       8175 │
│ US      │ New York City │ 2020    │       8772 │
└─────────┴───────────────┴─────────┴────────────┘

Using the SQL standard syntax, that looks like this:

FROM PivotedCities UNPIVOT(Population FOR Year IN (2000, 2010, 2020));

The unpivot clause takes the given set of columns and decomposes them into two columns: a NAME and a VALUE column. The NAME column contains the column names, and the VALUE column contains the column values.

Note that providing names for these columns is optional. We could also run the query without specifying these. In that case, the columns will be called name and value.

UNPIVOT PivotedCities ON 2000, 2010, 2020;
┌─────────┬───────────────┬─────────┬────────┐
│ Country │     Name      │ name    │ value  │
│ varchar │    varchar    │ varchar │ int128 │
├─────────┼───────────────┼─────────┼────────┤
│ NL      │ Amsterdam     │ 2000    │   1005 │
│ NL      │ Amsterdam     │ 2010    │   1065 │
│ NL      │ Amsterdam     │ 2020    │   1158 │
│ US      │ Seattle       │ 2000    │    564 │
│ US      │ Seattle       │ 2010    │    608 │
│ US      │ Seattle       │ 2020    │    738 │
│ US      │ New York City │ 2000    │   8015 │
│ US      │ New York City │ 2010    │   8175 │
│ US      │ New York City │ 2020    │   8772 │
└─────────┴───────────────┴─────────┴────────┘

We also allow COLUMNS and star expressions in the UNPIVOT. For example:

UNPIVOT PivotedCities ON COLUMNS('\d+') INTO NAME Year VALUE Population;
UNPIVOT PivotedCities ON * EXCLUDE (Country, Name) INTO NAME Year VALUE Population;

Internals

Pivoting and unpivoting are implemented entirely as rewrites into SQL queries. PIVOT is implemented as a set of aggregations with FILTER clauses, and UNPIVOT is implemented using lists and UNNEST statement.

Here are the equivalent SQL clauses:

Pivot
PIVOT Cities ON Year USING SUM(Population);

Becomes:

SELECT Country,
       Name,
       SUM(Population) FILTER (Year=2000) AS "2000",
       SUM(Population) FILTER (Year=2010) AS "2010",
       SUM(Population) FILTER (Year=2020) AS "2020"
FROM Cities
GROUP BY ALL;

Both providing the following result:

┌─────────┬───────────────┬────────┬────────┬────────┐
│ Country │     Name      │  200020102020  │
│ varcharvarchar    │ int128 │ int128 │ int128 │
├─────────┼───────────────┼────────┼────────┼────────┤
│ NL      │ Amsterdam     │   100510651158 │
│ US      │ Seattle       │    564608738 │
│ US      │ New York City │   801581758772 │
└─────────┴───────────────┴────────┴────────┴────────┘
Unpivot
UNPIVOT PivotedCities ON 2000, 2010, 2020 INTO NAME Year VALUE Population;

Becomes:

SELECT Country,
       Name,
       UNNEST(['2000', '2010', '2020']) AS "Year",
       UNNEST(["2000", "2010", "2020"]) AS "Population"
FROM PivotedCities;

Both providing the following result:

┌─────────┬───────────────┬─────────┬────────────┐
│ Country │     Name      │  Year   │ Population │
│ varchar │    varchar    │ varchar │   int128   │
├─────────┼───────────────┼─────────┼────────────┤
│ NL      │ Amsterdam     │ 2000    │       1005 │
│ NL      │ Amsterdam     │ 2010    │       1065 │
│ NL      │ Amsterdam     │ 2020    │       1158 │
│ US      │ Seattle       │ 2000    │        564 │
│ US      │ Seattle       │ 2010    │        608 │
│ US      │ Seattle       │ 2020    │        738 │
│ US      │ New York City │ 2000    │       8015 │
│ US      │ New York City │ 2010    │       8175 │
│ US      │ New York City │ 2020    │       8772 │
└─────────┴───────────────┴─────────┴────────────┘

Pivot without specifying the names

The true magic (or hack - depending on your perspective :)) in this PR is in allowing PIVOT without specifying the column names up front. DuckDB, like most SQL engines, is completely dependent on knowing all the column names and types within a single SQL statement up front. The magic - and what happens here - is that running a PIVOT without specifying the names will result in running multiple SQL statements.

Specifically, we (ab)use the enums for this purpose. Enums hold a set of values that are known at query compilation time, and can be created from a query.

So the following query:

PIVOT Cities ON Year USING SUM(Population);

Actually gets translated into the following queries:

CREATE TEMPORARY TYPE __pivot_enum_0_0 AS ENUM (SELECT DISTINCT Year::VARCHAR FROM Cities ORDER BY Year);
PIVOT Cities ON Year IN __pivot_enum_0_0 USING SUM(Population);

Using multiple pivots is also possible. For example, we can join on the results of a PIVOT statement with another PIVOT statement:

FROM
	(PIVOT Cities ON Year USING SUM(Population) GROUP BY Country)
JOIN
	(PIVOT Cities ON Name USING SUM(Population) GROUP BY Country)
USING (Country);
┌─────────┬────────┬────────┬────────┬───────────┬───────────────┬─────────┐
│ Country │  2000  │  2010  │  2020  │ Amsterdam │ New York City │ Seattle │
│ varchar │ int128 │ int128 │ int128 │  int128   │    int128     │ int128  │
├─────────┼────────┼────────┼────────┼───────────┼───────────────┼─────────┤
│ NL      │   1005 │   1065 │   1158 │      3228 │          NULL │    NULL │
│ US      │   8579 │   8783 │   9510 │      NULL │         24962 │    1910 │
└─────────┴────────┴────────┴────────┴───────────┴───────────────┴─────────┘

@jaredlander
Copy link

This looks really awesome and is something I've wanted to do for a long time. Since which direction pivot and unpivot move the data, can I ask that this is very explicit and the functions are called just pivot_longer and pivot_wider?

@parodayco
Copy link

I can't wait to see this in duckdb since we rely on the Snowflake pivot function to deal with large denormalized tables similar to your initial example. Our assignment tables include extraneous columns and numerous categorical values in the 'name' field, so I'm sharing a sample query here for consideration while you design the duckdb solution.

select * 
from users as u 
left join 
	(select userid, userpropertyname, userpropertyvalue from userpropertyassignments)
	pivot(
		max(userpropertyvalue) 
		for userpropertyname in ('Geography', 'Market', 'Business Group', 'Business Unit')
	) 
	as up (user_id, geography, market, business_group, business_unit) 
on up.user_id = u.id 

Where pivoting fails is when there are multiple userpropertyvalues for a single userid and userpropertyname. In this case, my query only returns a single "max" value. I know there's manual solutions but aggregated values would be a magnificent addition to any pivot function. No pressure, and thank you for adding pivot to duckdb.

@Mytherin
Copy link
Collaborator Author

I can't wait to see this in duckdb since we rely on the Snowflake pivot function to deal with large denormalized tables similar to your initial example. Our assignment tables include extraneous columns and numerous categorical values in the 'name' field, so I'm sharing a sample query here for consideration while you design the duckdb solution.

select * 
from users as u 
left join 
	(select userid, userpropertyname, userpropertyvalue from userpropertyassignments)
	pivot(
		max(userpropertyvalue) 
		for userpropertyname in ('Geography', 'Market', 'Business Group', 'Business Unit')
	) 
	as up (user_id, geography, market, business_group, business_unit) 
on up.user_id = u.id 

Where pivoting fails is when there are multiple userpropertyvalues for a single userid and userpropertyname. In this case, my query only returns a single "max" value. I know there's manual solutions but aggregated values would be a magnificent addition to any pivot function. No pressure, and thank you for adding pivot to duckdb.

You could use a LIST aggregate for this, e.g. with my above example:

D PIVOT Cities ON Year USING LIST(Population) GROUP BY Country;
┌─────────┬─────────────┬─────────────┬─────────────┐
│ Country │    200020102020     │
│ varchar │   int32[]   │   int32[]   │   int32[]   │
├─────────┼─────────────┼─────────────┼─────────────┤
│ NL      │ [1005]      │ [1065]      │ [1158]      │
│ US      │ [564, 8015] │ [608, 8175] │ [738, 8772] │
└─────────┴─────────────┴─────────────┴─────────────┘

@Mytherin
Copy link
Collaborator Author

This looks really awesome and is something I've wanted to do for a long time. Since which direction pivot and unpivot move the data, can I ask that this is very explicit and the functions are called just pivot_longer and pivot_wider?

Sure, I have added PIVOT_WIDER as an alias for PIVOT and PIVOT_LONGER as an alias for UNPIVOT.

@jaredlander
Copy link

Thanks! I think this will lead to a lot less confusion and many more happy users.

@Mytherin Mytherin merged commit 0a4bb30 into duckdb:feature Feb 21, 2023
@domoritz
Copy link
Contributor

This is amazing. Now DuckDB has all basic data transformations. Really cool!

@l1t1

This comment was marked as abuse.

@Mytherin
Copy link
Collaborator Author

Mytherin commented Mar 1, 2023

v0.7.1 was a bug-fix release - this will be in v0.8.0

@l1t1

This comment was marked as abuse.

@Alex-Monahan
Copy link
Contributor

Alex-Monahan commented Mar 9, 2023

This is super awesome! I am going to use the heck out of this feature. Thank you!!

I do have one request though @Mytherin... Sorry to even ask! However, it's the last piece that's needed for full Excel pivot table parity.

Is there a way that the high level pivot syntax could support multiple aggregation metrics / measures? It appears that the SQL standard syntax that you implemented can support it based on this test:

# multiple aggregations
query IIIIIIIII
SELECT year, q1_total, q1_avg, q2_total, q2_avg, q3_total, q3_avg, q4_total, q4_avg
    FROM (SELECT year, quarter, sales FROM sales) AS s
    PIVOT (sum(sales) AS total, avg(sales) AS avg
      FOR quarter
      IN (1 AS q1, 2 AS q2, 3 AS q3, 4 AS q4));

Could we have multiple aggregates in the USING clause, maybe something like this?

PIVOT Cities ON Year USING (SUM(Population) as sum_pop, count(population) as count_pop) GROUP BY Country;

or maybe even just without the aliases?

PIVOT Cities ON Year USING (SUM(Population), count(population)) GROUP BY Country;

I added parentheses, but maybe they wouldn't be needed either?

PIVOT Cities ON Year USING SUM(Population), count(population) GROUP BY Country;

@Mytherin
Copy link
Collaborator Author

Mytherin commented Mar 9, 2023

Yes, this is actually trivially implemented as it was already supported. Pushed it in #6644

@Alex-Monahan
Copy link
Contributor

Alex-Monahan commented Mar 9, 2023

GAME ON!! For those of you following along at home, that is a 10 minute turnaround time from the wizard himself, Dr. @Mytherin... Thank you!

@Mytherin
Copy link
Collaborator Author

Mytherin commented Mar 9, 2023

Dr. Mytherin please ;)

@Alex-Monahan
Copy link
Contributor

Dr. Mytherin please ;)

Gah!! Sorry! Fixed...

@Mytherin
Copy link
Collaborator Author

Mytherin commented Mar 9, 2023

Hah, no worries :) not a stickler for titles, but if I get an opportunity to reference Austin Powers I will :)

@snth snth mentioned this pull request Mar 12, 2023
@Mytherin Mytherin mentioned this pull request Mar 22, 2023
2 tasks
@Mytherin Mytherin mentioned this pull request Apr 3, 2023
@Mytherin Mytherin deleted the pivot branch April 24, 2023 13:35
@lben
Copy link

lben commented Apr 28, 2023

Hi, this is great! I have a question though and sorry in advance if it comes from ignorance.
My understanding is that DuckDB builds a query plan and performs things like projection pushdown but since this functionality seems to require knowledge about the ON column values (column Year in the example) does that mean that it forces DuckDB to execute its query plan up to that point and then perform these rewritten queries? Or -somehow- the rewritten queries are still part of the query plan and everything will be lazy evaluated until the very end?

I hope I was clear, if not feel free to ask any clarifying questions

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants