digoal
2024-05-02
PostgreSQL , PolarDB , DuckDB , pivot , 行列转换 , 统计
使用DuckDB pivot语法进行行列转换实在是太好用了.
手册:
语法:
PIVOT ⟨dataset⟩
ON ⟨columns⟩
USING ⟨values⟩
GROUP BY ⟨rows⟩
ORDER BY ⟨columns_with_order_directions⟩
LIMIT ⟨number_of_rows⟩;
例子:
《DuckDB 语法糖: Dynamic PIVOT and UNPIVOT 动态行列转换》
示例数据:
CREATE TABLE Cities (Country VARCHAR, Name VARCHAR, Year INTEGER, Population INTEGER);
INSERT INTO Cities VALUES ('NL', 'Amsterdam', 2000, 1005);
INSERT INTO Cities VALUES ('NL', 'Amsterdam', 2010, 1065);
INSERT INTO Cities VALUES ('NL', 'Amsterdam', 2020, 1158);
INSERT INTO Cities VALUES ('US', 'Seattle', 2000, 564);
INSERT INTO Cities VALUES ('US', 'Seattle', 2010, 608);
INSERT INTO Cities VALUES ('US', 'Seattle', 2020, 738);
INSERT INTO Cities VALUES ('US', 'New York City', 2000, 8015);
INSERT INTO Cities VALUES ('US', 'New York City', 2010, 8175);
INSERT INTO Cities VALUES ('US', 'New York City', 2020, 8772);
数据
Country | Name | Year | Population |
---|---|---|---|
NL | Amsterdam | 2000 | 1005 |
NL | Amsterdam | 2010 | 1065 |
NL | Amsterdam | 2020 | 1158 |
US | Seattle | 2000 | 564 |
US | Seattle | 2010 | 608 |
US | Seattle | 2020 | 738 |
US | New York City | 2000 | 8015 |
US | New York City | 2010 | 8175 |
US | New York City | 2020 | 8772 |
sql1:
PIVOT Cities
ON Year -- 表示要多出来的字段 , 每个Year value一个字段
USING sum(Population); -- 多出来的字段填啥值呢? 就这个, 隐含 group by Cities ALL except(Year, Population)
结果, pivot输出字段:
- 不包含cities表的 ON指定的year和 USING指定的Population字段
- 加上ON指定的year字段的distinct值
Country | Name | 2000 | 2010 | 2020 |
---|---|---|---|---|
NL | Amsterdam | 1005 | 1065 | 1158 |
US | Seattle | 564 | 608 | 738 |
US | New York City | 8015 | 8175 | 8772 |
sql2:
PIVOT Cities
ON Year
USING sum(Population)。
GROUP BY Country; -- 以上结果之上再 sum(Population) + group by country, year's value
结果, pivot输出字段:
- 不包含cities表的 ON指定的year和 USING指定的Population字段
- 加上ON指定的year字段的distinct值
- 以上结果之上再 group by country,year + sum(Population)
Country | 2000 | 2010 | 2020 |
---|---|---|---|
NL | 1005 | 1065 | 1158 |
US | 8579 | 8783 | 9510 |
sql3:
PIVOT Cities
ON Year IN (2000, 2010) -- 过滤year输出字段的值
USING sum(Population)
GROUP BY Country;
结果, pivot输出字段:
- 不包含cities表的 ON指定的year 和 USING指定的Population字段
- 加上ON指定的year字段的 IN (2000, 2010) 值
- 以上结果之上再 group by country,year + sum(Population)
Country | 2000 | 2010 |
---|---|---|
NL | 1005 | 1065 |
US | 8579 | 8783 |
sql4:
PIVOT Cities
ON Country, Name -- 多个字段的distinct值 组成 输出字段
USING sum(Population); -- 多出来的字段填啥值呢? 就这个, 隐含 group by Cities ALL except(Population, Country, Name)
相当于:
PIVOT Cities
ON Country || '_' || Name
USING sum(Population);
结果, pivot输出字段:
- 不包含cities表的 ON指定的Country, Name 和 USING指定的Population字段
- 加上ON指定的 Country, Name 字段的 distinct 值
Year | NL_Amsterdam | NL_New York City | NL_Seattle | US_Amsterdam | US_New York City | US_Seattle |
---|---|---|---|---|---|---|
2000 | 1005 | NULL | NULL | NULL | 8015 | 564 |
2010 | 1065 | NULL | NULL | NULL | 8175 | 608 |
2020 | 1158 | NULL | NULL | NULL | 8772 | 738 |
sql5:
PIVOT Cities
ON Year
USING sum(Population) AS total, max(Population) AS max -- 多出来的on字段值, 与using的2个group进行组合 输出
GROUP BY Country;
结果, pivot输出字段:
- 不包含cities表的 ON指定的Year 和 USING指定的Population字段
- 加上ON指定的 Year 字段的 distinct 值 & USING sum(Population) AS total, max(Population) AS max 的组合
- 以上结果之上再 group by country,year + sum(Population) AS total, max(Population) AS max
Country | 2000_total | 2000_max | 2010_total | 2010_max | 2020_total | 2020_max |
---|---|---|---|---|---|---|
US | 8579 | 8015 | 8783 | 8175 | 9510 | 8772 |
NL | 1005 | 1005 | 1065 | 1065 | 1158 | 1158 |
sql6:
-- 多个pivot就是多个table alias. 相当于 select * from ....
FROM (PIVOT Cities ON Year USING sum(Population) GROUP BY Country) year_pivot
JOIN (PIVOT Cities ON Name USING sum(Population) GROUP BY Country) name_pivot
USING (Country);
结果, pivot 1 输出字段:
- 不包含cities表的 ON指定的Year 和 USING指定的Population字段
- 加上ON指定的 Year 字段的 distinct 值
- 以上结果之上再 group by country,Year + sum(Population)
结果, pivot 2 输出字段:
- 不包含cities表的 ON指定的Name 和 USING指定的Population字段
- 加上ON指定的 Name 字段的 distinct 值
- 以上结果之上再 group by country,Name + sum(Population)
Country | 2000 | 2010 | 2020 | Amsterdam | New York City | Seattle |
---|---|---|---|---|---|---|
NL | 1005 | 1065 | 1158 | 3228 | NULL | NULL |
US | 8579 | 8783 | 9510 | NULL | 24962 | 1910 |
DuckDB也支持sql标准的pivot语法:
FROM ⟨dataset⟩
PIVOT (
⟨values⟩
FOR
⟨column_1⟩ IN (⟨in_list⟩)
⟨column_2⟩ IN (⟨in_list⟩)
...
GROUP BY ⟨rows⟩
);
例子
sql1:
FROM Cities
PIVOT (
sum(Population)
FOR
Year IN (2000, 2010, 2020)
GROUP BY Country
);
Country | 2000 | 2010 | 2020 |
---|---|---|---|
NL | 1005 | 1065 | 1158 |
US | 8579 | 8783 | 9510 |
sql2:
FROM Cities
PIVOT (
sum(Population) AS total,
count(Population) AS count
FOR
Year IN (2000, 2010)
Country in ('NL', 'US')
);
Name | 2000_NL_total | 2000_NL_count | 2000_US_total | 2000_US_count | 2010_NL_total | 2010_NL_count | 2010_US_total | 2010_US_count |
---|---|---|---|---|---|---|---|---|
Amsterdam | 1005 | 1 | NULL | 0 | 1065 | 1 | NULL | 0 |
Seattle | NULL | 0 | 564 | 1 | NULL | 0 | 608 | 1 |
New York City | NULL | 0 | 8015 | 1 | NULL | 0 | 8175 | 1 |