digoal
2024-03-03
PostgreSQL , PolarDB , DuckDB , 动态聚合 , macro , columns , list lambda
DuckDB面向分析场景, 语法丰富对数据分析师来说可以大幅度降低开发周期, 快速从数据中获取目标信息.
因此DuckDB在语法方面的支持非常丰富灵活. 例如通过DuckDB 语法糖macro,columns表达式,list lambda等结合例子: 动态聚合分析
原文: https://duckdb.org/2024/03/01/sql-gymnastics
创建macro,
- 从目标表example
- 使用columns表达式获取included_columns作为要查询的列, 排除excluded_columns列.
- 使用list_aggregate指定采用aggregate_function聚合函数聚合aggregated_columns列.
-- We use a table macro (or function) for reusability
CREATE OR REPLACE MACRO dynamic_aggregates(
included_columns,
excluded_columns,
aggregated_columns,
aggregate_function
) AS TABLE (
FROM example
SELECT
-- Use a COLUMNS expression to only select the columns
-- we include or do not exclude
COLUMNS(c -> (
-- If we are not using an input parameter (list is empty),
-- ignore it
(list_contains(included_columns, c) OR
len(included_columns) = 0)
AND
(NOT list_contains(excluded_columns, c) OR
len(excluded_columns) = 0)
)),
-- Use the list_aggregate function to apply an aggregate
-- function of our choice
list_aggregate(
-- Convert to a list (to enable the use of list_aggregate)
list(
-- Use a COLUMNS expression to choose which columns
-- to aggregate
COLUMNS(c -> list_contains(aggregated_columns, c))
), aggregate_function
)
GROUP BY ALL -- Group by all selected but non-aggregated columns
ORDER BY ALL -- Order by each column from left to right
);
Select col3 and col4, and take the minimum values of col1 and col2:
FROM dynamic_aggregates(
['col3', 'col4'], [], ['col1', 'col2'], 'min'
);
Select all columns except col1 and col2, and take the minimum values of col1 and col2:
FROM dynamic_aggregates(
[], ['col1', 'col2'], ['col1', 'col2'], 'min'
);
If the same column is in both the included and excluded list, it is excluded (exclusions win ties). If we include col2, col3, and col4, but we exclude col2, then it is as if we only included col3 and col4:
FROM dynamic_aggregates(
['col2', 'col3', 'col4'], ['col2'], ['col1', 'col2'], 'min'
);
返回:
col3 | col4 | list_aggregate(list(example.col1), ‘min’) |
list_aggregate(list(example.col2), ‘min’) |
---|---|---|---|
0 | 1 | 2 | 0 |
1 | 1 | 1 | 0 |
上面的例子还不够灵活, 可以把目标表也变成动态的, 聚合函数也变成动态的.
目标表换成一个未定义的cte表, 聚合函数换成自定义的any_func, 需要的时候临时定义
CREATE OR REPLACE MACRO dynamic_aggregates_any_cte_any_func(
included_columns,
excluded_columns,
aggregated_columns
/* No more aggregate_function */
) AS TABLE (
FROM any_cte -- No longer a fixed table!
SELECT
COLUMNS(c -> (
(list_contains(included_columns, c) OR
len(included_columns) = 0)
AND
(NOT list_contains(excluded_columns, c) OR
len(excluded_columns) = 0)
)),
-- We no longer convert to a list,
-- and we refer to the latest definition of any_func
any_func(COLUMNS(c -> list_contains(aggregated_columns, c)))
GROUP BY ALL
ORDER BY ALL
);
使用举例, 创建一个临时聚合函数
-- We can define or redefine any_func right before calling the macro
CREATE OR REPLACE TEMP FUNCTION any_func(x)
AS 100.0 * sum(x) / count(x);
使用cte查询如下, 返回'another_group', 'one_big_group'
, 不排除任何列, 分别聚合'id', 'my_group'
列.
-- Any table structure is valid for this CTE!
WITH any_cte AS (
SELECT
x % 11 AS id,
x % 5 AS my_group,
x % 2 AS another_group,
1 AS one_big_group
FROM range(1, 101) t(x)
)
-- 查询macro的结果:
FROM dynamic_aggregates_any_cte_any_func(
['another_group', 'one_big_group'], [], ['id', 'my_group']
);
返回:
another_group | one_big_group | any_func(any_cte.id) |
any_func(any_cte.my_group) |
---|---|---|---|
0 | 1 | 502.0 | 200.0 |
1 | 1 | 490.0 | 200.0 |
1、创建测试表.
CREATE OR REPLACE TABLE spotify_tracks AS (
FROM 'https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet?download=true'
);
2、创建Macro
CREATE OR REPLACE MACRO custom_summarize() AS TABLE (
WITH metrics AS (
FROM any_cte
SELECT
{
name: first(alias(COLUMNS(*))),
type: first(typeof(COLUMNS(*))),
max: max(COLUMNS(*))::VARCHAR,
min: min(COLUMNS(*))::VARCHAR,
approx_unique: approx_count_distinct(COLUMNS(*)),
nulls: count(*) - count(COLUMNS(*)),
}
), stacked_metrics AS (
UNPIVOT metrics
ON COLUMNS(*)
)
SELECT value.* FROM stacked_metrics
);
3、使用Macro动态查询
WITH any_cte AS (FROM spotify_tracks)
FROM custom_summarize();
解释:
First let’s have a look at the metrics CTE and the shape of the data that is returned:
FROM any_cte
SELECT
{
name: first(alias(COLUMNS(*))),
type: first(typeof(COLUMNS(*))),
max: max(COLUMNS(*))::VARCHAR,
min: min(COLUMNS(*))::VARCHAR,
approx_unique: approx_count_distinct(COLUMNS(*)),
nulls: count(*) - count(COLUMNS(*)),
}
Next, the data is unpivoted to reshape the table from one row and multiple columns to two columns and multiple rows.
UNPIVOT metrics
ON COLUMNS(*)
The final step is the most gymnastics-like portion of this query. We explode the value column’s struct format so that each key becomes its own column using the STRUCT.* syntax. This is another way to make a query less reliant on column names – the split occurs automatically based on the keys in the struct.
SELECT value.*
FROM stacked_metrics
结果:
name | type | max | min | approx_unique | nulls |
---|---|---|---|---|---|
Unnamed: 0 | BIGINT | 113999 | 0 | 114089 | 0 |
track_id | VARCHAR | 7zz7iNGIWhmfFE7zlXkMma | 0000vdREvCVMxbQTkS888c | 89815 | 0 |
artists | VARCHAR | 龍藏Ryuzo | !nvite | 31545 | 1 |
https://duckdb.org/docs/sql/statements/create_macro
https://duckdb.org/docs/sql/functions/nested
https://duckdb.org/2023/08/23/even-friendlier-sql#columns-with-lambda-functions
SELECT
episode_num,
COLUMNS(col -> col LIKE '%warp%')
FROM trek_facts
WHERE
COLUMNS(col -> col LIKE '%warp%') >= 2;
《DuckDB 语法糖: List lambda functions》
《DuckDB 0.6.0 语法糖: Add Python-style list-comprehension syntax support to SQL》
《DuckDB lambda 函数使用 - list/array数据处理 - 元素的转换与过滤 (list_filter, list_transform)》
《DuckDB:list,struct,map 类型很强大(支持lambda计算) - PostgreSQL:数组、row|record、json字典》
《DuckDB 语法糖: List comprehensions 支持python语法, 过滤和转换同时支持》
《DuckDB 语法糖: Dynamic column selection 支持通配符,exclude,replace,lambda 等动态列选择|值替换|列选择》