Skip to content

Latest commit

 

History

History
256 lines (197 loc) · 8.27 KB

20240502_01.md

File metadata and controls

256 lines (197 loc) · 8.27 KB

DuckDB pivot 行列转换 太好用了

作者

digoal

日期

2024-05-02

标签

PostgreSQL , PolarDB , DuckDB , pivot , 行列转换 , 统计


背景

使用DuckDB pivot语法进行行列转换实在是太好用了.

手册:

语法:

PIVOT ⟨dataset⟩    
ON ⟨columns⟩    
USING ⟨values⟩    
GROUP BY ⟨rows⟩    
ORDER BY ⟨columns_with_order_directions⟩    
LIMIT ⟨number_of_rows⟩;    

例子:

《DuckDB 语法糖: Dynamic PIVOT and UNPIVOT 动态行列转换》

示例数据:

CREATE TABLE Cities (Country VARCHAR, Name VARCHAR, Year INTEGER, Population INTEGER);    
INSERT INTO Cities VALUES ('NL', 'Amsterdam', 2000, 1005);    
INSERT INTO Cities VALUES ('NL', 'Amsterdam', 2010, 1065);    
INSERT INTO Cities VALUES ('NL', 'Amsterdam', 2020, 1158);    
INSERT INTO Cities VALUES ('US', 'Seattle', 2000, 564);    
INSERT INTO Cities VALUES ('US', 'Seattle', 2010, 608);    
INSERT INTO Cities VALUES ('US', 'Seattle', 2020, 738);    
INSERT INTO Cities VALUES ('US', 'New York City', 2000, 8015);    
INSERT INTO Cities VALUES ('US', 'New York City', 2010, 8175);    
INSERT INTO Cities VALUES ('US', 'New York City', 2020, 8772);    

数据

Country Name Year Population
NL Amsterdam 2000 1005
NL Amsterdam 2010 1065
NL Amsterdam 2020 1158
US Seattle 2000 564
US Seattle 2010 608
US Seattle 2020 738
US New York City 2000 8015
US New York City 2010 8175
US New York City 2020 8772

sql1:

PIVOT Cities      
ON Year   -- 表示要多出来的字段 , 每个Year value一个字段       
USING sum(Population);  -- 多出来的字段填啥值呢? 就这个, 隐含 group by Cities ALL except(Year, Population)         

结果, pivot输出字段:

  • 不包含cities表的 ON指定的year和 USING指定的Population字段
  • 加上ON指定的year字段的distinct值
Country Name 2000 2010 2020
NL Amsterdam 1005 1065 1158
US Seattle 564 608 738
US New York City 8015 8175 8772

sql2:

PIVOT Cities    
ON Year    
USING sum(Population)。    
GROUP BY Country;  -- 以上结果之上再 sum(Population) + group by country, year's value     

结果, pivot输出字段:

  • 不包含cities表的 ON指定的year和 USING指定的Population字段
  • 加上ON指定的year字段的distinct值
  • 以上结果之上再 group by country,year + sum(Population)
Country 2000 2010 2020
NL 1005 1065 1158
US 8579 8783 9510

sql3:

PIVOT Cities    
ON Year IN (2000, 2010)  -- 过滤year输出字段的值    
USING sum(Population)      
GROUP BY Country;      

结果, pivot输出字段:

  • 不包含cities表的 ON指定的year 和 USING指定的Population字段
  • 加上ON指定的year字段的 IN (2000, 2010) 值
  • 以上结果之上再 group by country,year + sum(Population)
Country 2000 2010
NL 1005 1065
US 8579 8783

sql4:

PIVOT Cities    
ON Country, Name   -- 多个字段的distinct值 组成 输出字段      
USING sum(Population);  -- 多出来的字段填啥值呢? 就这个, 隐含 group by Cities ALL except(Population, Country, Name)        

相当于:

PIVOT Cities     
ON Country || '_' || Name     
USING sum(Population);    

结果, pivot输出字段:

  • 不包含cities表的 ON指定的Country, Name 和 USING指定的Population字段
  • 加上ON指定的 Country, Name 字段的 distinct 值
Year NL_Amsterdam NL_New York City NL_Seattle US_Amsterdam US_New York City US_Seattle
2000 1005 NULL NULL NULL 8015 564
2010 1065 NULL NULL NULL 8175 608
2020 1158 NULL NULL NULL 8772 738

sql5:

PIVOT Cities    
ON Year      
USING sum(Population) AS total, max(Population) AS max  -- 多出来的on字段值, 与using的2个group进行组合  输出       
GROUP BY Country;      

结果, pivot输出字段:

  • 不包含cities表的 ON指定的Year 和 USING指定的Population字段
  • 加上ON指定的 Year 字段的 distinct 值 & USING sum(Population) AS total, max(Population) AS max 的组合
  • 以上结果之上再 group by country,year + sum(Population) AS total, max(Population) AS max
Country 2000_total 2000_max 2010_total 2010_max 2020_total 2020_max
US 8579 8015 8783 8175 9510 8772
NL 1005 1005 1065 1065 1158 1158

sql6:

-- 多个pivot就是多个table alias.  相当于 select * from ....      
FROM (PIVOT Cities ON Year USING sum(Population) GROUP BY Country) year_pivot    
JOIN (PIVOT Cities ON Name USING sum(Population) GROUP BY Country) name_pivot    
USING (Country);    

结果, pivot 1 输出字段:

  • 不包含cities表的 ON指定的Year 和 USING指定的Population字段
  • 加上ON指定的 Year 字段的 distinct 值
  • 以上结果之上再 group by country,Year + sum(Population)

结果, pivot 2 输出字段:

  • 不包含cities表的 ON指定的Name 和 USING指定的Population字段
  • 加上ON指定的 Name 字段的 distinct 值
  • 以上结果之上再 group by country,Name + sum(Population)
Country 2000 2010 2020 Amsterdam New York City Seattle
NL 1005 1065 1158 3228 NULL NULL
US 8579 8783 9510 NULL 24962 1910

DuckDB也支持sql标准的pivot语法:

FROM ⟨dataset⟩    
PIVOT (    
    ⟨values⟩    
    FOR    
        ⟨column_1⟩ IN (⟨in_list⟩)    
        ⟨column_2⟩ IN (⟨in_list⟩)    
        ...    
    GROUP BY ⟨rows⟩    
);    

例子

sql1:

FROM Cities    
PIVOT (    
    sum(Population)    
    FOR    
        Year IN (2000, 2010, 2020)    
    GROUP BY Country    
);    
Country 2000 2010 2020
NL 1005 1065 1158
US 8579 8783 9510

sql2:

FROM Cities    
PIVOT (    
    sum(Population) AS total,    
    count(Population) AS count    
    FOR    
        Year IN (2000, 2010)    
        Country in ('NL', 'US')    
);    
Name 2000_NL_total 2000_NL_count 2000_US_total 2000_US_count 2010_NL_total 2010_NL_count 2010_US_total 2010_US_count
Amsterdam 1005 1 NULL 0 1065 1 NULL 0
Seattle NULL 0 564 1 NULL 0 608 1
New York City NULL 0 8015 1 NULL 0 8175 1

digoal's wechat