DuckDB pivot 行列转换太好用了

作者

digoal

日期

2024-05-02

背景

使用DuckDB pivot语法进行行列转换实在是太好用了.

手册:

https://duckdb.org/docs/sql/statements/pivot

语法:

PIVOT ⟨dataset⟩    
ON ⟨columns⟩    
USING ⟨values⟩    
GROUP BY ⟨rows⟩    
ORDER BY ⟨columns_with_order_directions⟩    
LIMIT ⟨number_of_rows⟩;

例子:

《DuckDB 语法糖: Dynamic PIVOT and UNPIVOT 动态行列转换》

示例数据:

CREATE TABLE Cities (Country VARCHAR, Name VARCHAR, Year INTEGER, Population INTEGER);    
INSERT INTO Cities VALUES ('NL', 'Amsterdam', 2000, 1005);    
INSERT INTO Cities VALUES ('NL', 'Amsterdam', 2010, 1065);    
INSERT INTO Cities VALUES ('NL', 'Amsterdam', 2020, 1158);    
INSERT INTO Cities VALUES ('US', 'Seattle', 2000, 564);    
INSERT INTO Cities VALUES ('US', 'Seattle', 2010, 608);    
INSERT INTO Cities VALUES ('US', 'Seattle', 2020, 738);    
INSERT INTO Cities VALUES ('US', 'New York City', 2000, 8015);    
INSERT INTO Cities VALUES ('US', 'New York City', 2010, 8175);    
INSERT INTO Cities VALUES ('US', 'New York City', 2020, 8772);

数据

Country	Name	Year	Population
NL	Amsterdam	2000	1005
NL	Amsterdam	2010	1065
NL	Amsterdam	2020	1158
US	Seattle	2000	564
US	Seattle	2010	608
US	Seattle	2020	738
US	New York City	2000	8015
US	New York City	2010	8175
US	New York City	2020	8772

sql1:

PIVOT Cities      
ON Year   -- 表示要多出来的字段 , 每个Year value一个字段       
USING sum(Population);  -- 多出来的字段填啥值呢? 就这个, 隐含 group by Cities ALL except(Year, Population)

结果, pivot输出字段:

不包含cities表的 ON指定的year和 USING指定的Population字段
加上ON指定的year字段的distinct值

Country	Name	2000	2010	2020
NL	Amsterdam	1005	1065	1158
US	Seattle	564	608	738
US	New York City	8015	8175	8772

sql2:

PIVOT Cities    
ON Year    
USING sum(Population)。    
GROUP BY Country;  -- 以上结果之上再 sum(Population) + group by country, year's value

结果, pivot输出字段:

不包含cities表的 ON指定的year和 USING指定的Population字段
加上ON指定的year字段的distinct值
以上结果之上再 group by country,year + sum(Population)

Country	2000	2010	2020
NL	1005	1065	1158
US	8579	8783	9510

sql3:

PIVOT Cities    
ON Year IN (2000, 2010)  -- 过滤year输出字段的值    
USING sum(Population)      
GROUP BY Country;

结果, pivot输出字段:

不包含cities表的 ON指定的year 和 USING指定的Population字段
加上ON指定的year字段的 IN (2000, 2010) 值
以上结果之上再 group by country,year + sum(Population)

Country	2000	2010
NL	1005	1065
US	8579	8783

sql4:

PIVOT Cities    
ON Country, Name   -- 多个字段的distinct值 组成 输出字段      
USING sum(Population);  -- 多出来的字段填啥值呢? 就这个, 隐含 group by Cities ALL except(Population, Country, Name)

相当于:

PIVOT Cities     
ON Country || '_' || Name     
USING sum(Population);

结果, pivot输出字段:

不包含cities表的 ON指定的Country, Name 和 USING指定的Population字段
加上ON指定的 Country, Name 字段的 distinct 值

Year	NL_Amsterdam	NL_New York City	NL_Seattle	US_Amsterdam	US_New York City	US_Seattle
2000	1005	NULL	NULL	NULL	8015	564
2010	1065	NULL	NULL	NULL	8175	608
2020	1158	NULL	NULL	NULL	8772	738

sql5:

PIVOT Cities    
ON Year      
USING sum(Population) AS total, max(Population) AS max  -- 多出来的on字段值, 与using的2个group进行组合  输出       
GROUP BY Country;

结果, pivot输出字段:

不包含cities表的 ON指定的Year 和 USING指定的Population字段
加上ON指定的 Year 字段的 distinct 值 & USING sum(Population) AS total, max(Population) AS max 的组合
以上结果之上再 group by country,year + sum(Population) AS total, max(Population) AS max

Country	2000_total	2000_max	2010_total	2010_max	2020_total	2020_max
US	8579	8015	8783	8175	9510	8772
NL	1005	1005	1065	1065	1158	1158

sql6:

-- 多个pivot就是多个table alias.  相当于 select * from ....      
FROM (PIVOT Cities ON Year USING sum(Population) GROUP BY Country) year_pivot    
JOIN (PIVOT Cities ON Name USING sum(Population) GROUP BY Country) name_pivot    
USING (Country);

结果, pivot 1 输出字段:

不包含cities表的 ON指定的Year 和 USING指定的Population字段
加上ON指定的 Year 字段的 distinct 值
以上结果之上再 group by country,Year + sum(Population)

结果, pivot 2 输出字段:

不包含cities表的 ON指定的Name 和 USING指定的Population字段
加上ON指定的 Name 字段的 distinct 值
以上结果之上再 group by country,Name + sum(Population)

Country	2000	2010	2020	Amsterdam	New York City	Seattle
NL	1005	1065	1158	3228	NULL	NULL
US	8579	8783	9510	NULL	24962	1910

DuckDB也支持sql标准的pivot语法:

FROM ⟨dataset⟩    
PIVOT (    
    ⟨values⟩    
    FOR    
        ⟨column_1⟩ IN (⟨in_list⟩)    
        ⟨column_2⟩ IN (⟨in_list⟩)    
        ...    
    GROUP BY ⟨rows⟩    
);

例子

sql1:

FROM Cities    
PIVOT (    
    sum(Population)    
    FOR    
        Year IN (2000, 2010, 2020)    
    GROUP BY Country    
);

Country	2000	2010	2020
NL	1005	1065	1158
US	8579	8783	9510

sql2:

FROM Cities    
PIVOT (    
    sum(Population) AS total,    
    count(Population) AS count    
    FOR    
        Year IN (2000, 2010)    
        Country in ('NL', 'US')    
);

Name	2000_NL_total	2000_NL_count	2000_US_total	2000_US_count	2010_NL_total	2010_NL_count	2010_US_total	2010_US_count
Amsterdam	1005	1	NULL	0	1065	1	NULL	0
Seattle	NULL	0	564	1	NULL	0	608	1
New York City	NULL	0	8015	1	NULL	0	8175	1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

20240502_01.md

20240502_01.md

DuckDB pivot 行列转换太好用了

作者

日期

标签

背景

期望 PostgreSQL|开源PolarDB 增加什么功能?

PolarDB 开源数据库

PolarDB 学习图谱

购买PolarDB云服务折扣活动进行中, 55元起

PostgreSQL 解决方案集合

德哥 / digoal's Github - 公益是一辈子的事.

About 德哥

Files

20240502_01.md

Latest commit

History

20240502_01.md

File metadata and controls

DuckDB pivot 行列转换 太好用了

作者

日期

标签

背景

期望 PostgreSQL|开源PolarDB 增加什么功能?

PolarDB 开源数据库

PolarDB 学习图谱

购买PolarDB云服务折扣活动进行中, 55元起

PostgreSQL 解决方案集合

德哥 / digoal's Github - 公益是一辈子的事.

About 德哥

DuckDB pivot 行列转换太好用了