Skip to content

Latest commit

 

History

History
279 lines (219 loc) · 10 KB

20240303_01.md

File metadata and controls

279 lines (219 loc) · 10 KB

DuckDB 语法糖macro,columns表达式,list lambda等结合例子: 动态聚合分析

作者

digoal

日期

2024-03-03

标签

PostgreSQL , PolarDB , DuckDB , 动态聚合 , macro , columns , list lambda


背景

DuckDB面向分析场景, 语法丰富对数据分析师来说可以大幅度降低开发周期, 快速从数据中获取目标信息.

因此DuckDB在语法方面的支持非常丰富灵活. 例如通过DuckDB 语法糖macro,columns表达式,list lambda等结合例子: 动态聚合分析

原文: https://duckdb.org/2024/03/01/sql-gymnastics

例子1

创建macro,

  • 从目标表example
  • 使用columns表达式获取included_columns作为要查询的列, 排除excluded_columns列.
  • 使用list_aggregate指定采用aggregate_function聚合函数聚合aggregated_columns列.
-- We use a table macro (or function) for reusability  
CREATE OR REPLACE MACRO dynamic_aggregates(  
        included_columns,  
        excluded_columns,  
        aggregated_columns,  
        aggregate_function  
    ) AS TABLE (  
    FROM example   
    SELECT   
        -- Use a COLUMNS expression to only select the columns  
        -- we include or do not exclude  
        COLUMNS(c -> (  
            -- If we are not using an input parameter (list is empty),  
            -- ignore it  
            (list_contains(included_columns, c) OR  
             len(included_columns) = 0)  
            AND  
            (NOT list_contains(excluded_columns, c) OR  
             len(excluded_columns) = 0)  
            )),  
        -- Use the list_aggregate function to apply an aggregate  
        -- function of our choice  
        list_aggregate(  
            -- Convert to a list (to enable the use of list_aggregate)  
            list(  
                -- Use a COLUMNS expression to choose which columns  
                -- to aggregate  
                COLUMNS(c -> list_contains(aggregated_columns, c))  
            ), aggregate_function  
        )  
    GROUP BY ALL -- Group by all selected but non-aggregated columns  
    ORDER BY ALL -- Order by each column from left to right   
);  

Select col3 and col4, and take the minimum values of col1 and col2:

FROM dynamic_aggregates(  
    ['col3', 'col4'], [], ['col1', 'col2'], 'min'  
);  

Select all columns except col1 and col2, and take the minimum values of col1 and col2:

FROM dynamic_aggregates(  
    [], ['col1', 'col2'], ['col1', 'col2'], 'min'  
);  

If the same column is in both the included and excluded list, it is excluded (exclusions win ties). If we include col2, col3, and col4, but we exclude col2, then it is as if we only included col3 and col4:

FROM dynamic_aggregates(  
    ['col2', 'col3', 'col4'], ['col2'], ['col1', 'col2'], 'min'  
);  

返回:

col3 col4 list_aggregate(list(example.col1), ‘min’) list_aggregate(list(example.col2), ‘min’)
0 1 2 0
1 1 1 0

例子2

上面的例子还不够灵活, 可以把目标表也变成动态的, 聚合函数也变成动态的.

目标表换成一个未定义的cte表, 聚合函数换成自定义的any_func, 需要的时候临时定义

CREATE OR REPLACE MACRO dynamic_aggregates_any_cte_any_func(  
    included_columns,  
    excluded_columns,  
    aggregated_columns  
    /* No more aggregate_function */  
) AS TABLE (  
    FROM any_cte -- No longer a fixed table!  
    SELECT   
        COLUMNS(c -> (  
            (list_contains(included_columns, c) OR  
            len(included_columns) = 0)  
            AND   
            (NOT list_contains(excluded_columns, c) OR  
            len(excluded_columns) = 0)  
            )),  
        -- We no longer convert to a list,   
        -- and we refer to the latest definition of any_func   
        any_func(COLUMNS(c -> list_contains(aggregated_columns, c)))   
    GROUP BY ALL   
    ORDER BY ALL   
);  

使用举例, 创建一个临时聚合函数

-- We can define or redefine any_func right before calling the macro   
CREATE OR REPLACE TEMP FUNCTION any_func(x)  
    AS 100.0 * sum(x) / count(x);  

使用cte查询如下, 返回'another_group', 'one_big_group', 不排除任何列, 分别聚合'id', 'my_group'列.

-- Any table structure is valid for this CTE!  
WITH any_cte AS (  
    SELECT  
        x % 11 AS id,  
        x % 5 AS my_group,  
        x % 2 AS another_group,  
        1 AS one_big_group  
    FROM range(1, 101) t(x)  
)  
-- 查询macro的结果:   
FROM dynamic_aggregates_any_cte_any_func(  
    ['another_group', 'one_big_group'], [], ['id', 'my_group']  
);  

返回:

another_group one_big_group any_func(any_cte.id) any_func(any_cte.my_group)
0 1 502.0 200.0
1 1 490.0 200.0

例子3

1、创建测试表.

CREATE OR REPLACE TABLE spotify_tracks AS (  
    FROM 'https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet?download=true'  
);  

2、创建Macro

CREATE OR REPLACE MACRO custom_summarize() AS TABLE (  
    WITH metrics AS (  
        FROM any_cte   
        SELECT   
            {  
                name: first(alias(COLUMNS(*))),  
                type: first(typeof(COLUMNS(*))),  
                max: max(COLUMNS(*))::VARCHAR,  
                min: min(COLUMNS(*))::VARCHAR,  
                approx_unique: approx_count_distinct(COLUMNS(*)),  
                nulls: count(*) - count(COLUMNS(*)),  
            }  
    ), stacked_metrics AS (  
        UNPIVOT metrics   
        ON COLUMNS(*)  
    )  
    SELECT value.* FROM stacked_metrics  
);  

3、使用Macro动态查询

WITH any_cte AS (FROM spotify_tracks)  
FROM custom_summarize();  

解释:

First let’s have a look at the metrics CTE and the shape of the data that is returned:

FROM any_cte   
SELECT   
    {  
        name: first(alias(COLUMNS(*))),  
        type: first(typeof(COLUMNS(*))),  
        max: max(COLUMNS(*))::VARCHAR,  
        min: min(COLUMNS(*))::VARCHAR,  
        approx_unique: approx_count_distinct(COLUMNS(*)),  
        nulls: count(*) - count(COLUMNS(*)),  
    }  

Next, the data is unpivoted to reshape the table from one row and multiple columns to two columns and multiple rows.

UNPIVOT metrics   
ON COLUMNS(*)  

The final step is the most gymnastics-like portion of this query. We explode the value column’s struct format so that each key becomes its own column using the STRUCT.* syntax. This is another way to make a query less reliant on column names – the split occurs automatically based on the keys in the struct.

SELECT value.*  
FROM stacked_metrics  

结果:

name type max min approx_unique nulls
Unnamed: 0 BIGINT 113999 0 114089 0
track_id VARCHAR 7zz7iNGIWhmfFE7zlXkMma 0000vdREvCVMxbQTkS888c 89815 0
artists VARCHAR 龍藏Ryuzo !nvite 31545 1

参考

https://duckdb.org/docs/sql/statements/create_macro

https://duckdb.org/docs/sql/functions/nested

《DuckDB 宏(MACRO) 的使用》

https://duckdb.org/2023/08/23/even-friendlier-sql#columns-with-lambda-functions

SELECT  
    episode_num,  
    COLUMNS(col -> col LIKE '%warp%')  
FROM trek_facts  
WHERE  
    COLUMNS(col -> col LIKE '%warp%') >= 2;  

《DuckDB 语法糖: List lambda functions》

《DuckDB 0.6.0 语法糖: Add Python-style list-comprehension syntax support to SQL》

《DuckDB lambda 函数使用 - list/array数据处理 - 元素的转换与过滤 (list_filter, list_transform)》

《DuckDB:list,struct,map 类型很强大(支持lambda计算) - PostgreSQL:数组、row|record、json字典》

《DuckDB 语法糖: List comprehensions 支持python语法, 过滤和转换同时支持》

《DuckDB 语法糖: Dynamic column selection 支持通配符,exclude,replace,lambda 等动态列选择|值替换|列选择》

《DuckDB 语法糖: Dynamic PIVOT and UNPIVOT 动态行列转换》

digoal's wechat