Skip to content

Latest commit

 

History

History
269 lines (213 loc) · 10.8 KB

20220808_03.md

File metadata and controls

269 lines (213 loc) · 10.8 KB

PostgreSQL 16 devel preview - 优化ORDER BY / DISTINCT aggreagtes聚合场景, 减少排序次数

作者

digoal

日期

2022-08-08

标签

PostgreSQL , 聚合 , order by , distinct , 排序 , incremental sort


背景

首先了解一下聚合的过程,

  • tuples-trans func-final func
  • tuples-trans func-combine func-final func

非并行模式的聚合流程大致如下:

循环    
sfunc( internal-state, next-data-values ) ---> next-internal-state    
    
最后调用一次(可选)    
ffunc( internal-state ) ---> aggregate-value    

pg-xc 的聚合与并行类似 cfunc可以理解为combine func:

sfunc( internal-state, next-data-values ) ---> next-internal-state  # 这个过程是在datanode节点完成的. input 是该datanode节点上的所有行(一次1行的进行调用).    
cfunc( internal-state, internal-state ) ---> next-internal-state  # 这个过程是在coordinator节点完成的. input是datanode节点的最终结果.    
ffunc( internal-state ) ---> aggregate-value # 这个过程是在coordinator节点完成的. input是cfunc的结果.    

《Postgres-XC customized aggregate introduction》

《PostgreSQL aggregate function customize》

《PostgreSQL 11 preview - 多阶段并行聚合array_agg, string_agg》

《PostgreSQL 10 自定义并行计算聚合函数的原理与实践 - (含array_agg合并多个数组为单个一元数组的例子)》

PostgreSQL 16的ORDER BY / DISTINCT aggregates性能优化解决了什么问题呢?

agg1(a ORDER BY a), agg2(a order by a,b), agg3(a order by c)  
agg1(distinct a), agg1(distinct a,b)  

以上两种聚合场景, 在未优化前, 每一个聚合函数都要单独排序, 不使用索引. 性能是不是很差?

PostgreSQL 16的优化方法比较暴力, 选出一种排序方法, 适合最多的agg, (可以采用索引). 如果覆盖的聚合排序数量一样多, 则选择第一种. 例如:

SELECT agg(a ORDER BY a),agg2(a ORDER BY a,b) ...  
  
would request the sort order to be {a, b} because {a} is a subset of the  
sort order of {a,b}, but;  
  
SELECT agg(a ORDER BY a),agg2(a ORDER BY c) ...  
  
would just pick a plan ordered by {a} (we give precedence to aggregates  
which are earlier in the targetlist).  
  
SELECT agg(a ORDER BY a),agg2(a ORDER BY b),agg3(a ORDER BY b) ...  
  
would choose to order by {b} since two aggregates suit that vs just one  
that requires input ordered by {a}.  

结合采用incremental sort的例子. 《PostgreSQL 11 preview - Incremental Sort(排序优化)》

 606 SELECT $$  
 607     SELECT col12, count(distinct a.col1), count(distinct a.col2), count(distinct b.col1), count(distinct b.col2), count(*)  
 608     FROM test_mark_restore a  
 609         JOIN test_mark_restore b USING(col12)  
 610     GROUP BY 1  
 611     HAVING count(*) > 1  
 612     ORDER BY 2 DESC, 1 DESC, 3 DESC, 4 DESC, 5 DESC, 6 DESC  
 613     LIMIT 10  
 614 $$ AS qry \gset  
 615 -- test mark/restore with in-memory sorts  
 616 EXPLAIN (COSTS OFF) :qry;  
 617                                                                                  QUERY PLAN                                                                                    
 618 -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 619  Limit  
 620    ->  Sort  
 621          Sort Key: (count(DISTINCT a.col1)) DESC, a.col12 DESC, (count(DISTINCT a.col2)) DESC, (count(DISTINCT b.col1)) DESC, (count(DISTINCT b.col2)) DESC, (count(*)) DESC  
 622          ->  GroupAggregate  
 623                Group Key: a.col12  
 624                Filter: (count(*) > 1)  
 625                ->  Incremental Sort  
 626                      Sort Key: a.col12 DESC, a.col1  
 627                      Presorted Key: a.col12  
 628                      ->  Merge Join  
 629                            Merge Cond: (a.col12 = b.col12)  
 630                            ->  Sort  
 631                                  Sort Key: a.col12 DESC  
 632                                  ->  Seq Scan on test_mark_restore a  
 633                            ->  Sort  
 634                                  Sort Key: b.col12 DESC  
 635                                  ->  Seq Scan on test_mark_restore b  
 636 (17 rows)  

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1349d2790bf48a4de072931c722f39337e72055e

Improve performance of ORDER BY / DISTINCT aggregates  
author	David Rowley <drowley@postgresql.org>	  
Tue, 2 Aug 2022 11:11:45 +0000 (23:11 +1200)  
committer	David Rowley <drowley@postgresql.org>	  
Tue, 2 Aug 2022 11:11:45 +0000 (23:11 +1200)  
commit	1349d2790bf48a4de072931c722f39337e72055e  
tree	3b525f30da6d37513522cdb5ea34ce14b653de87	tree  
parent	a69959fab2f3633992b5cabec85acecbac6074c8	commit | diff  
Improve performance of ORDER BY / DISTINCT aggregates  
  
ORDER BY / DISTINCT aggreagtes have, since implemented in Postgres, been  
executed by always performing a sort in nodeAgg.c to sort the tuples in  
the current group into the correct order before calling the transition  
function on the sorted tuples.  This was not great as often there might be  
an index that could have provided pre-sorted input and allowed the  
transition functions to be called as the rows come in, rather than having  
to store them in a tuplestore in order to sort them once all the tuples  
for the group have arrived.   
  
Here we change the planner so it requests a path with a sort order which  
supports the most amount of ORDER BY / DISTINCT aggregate functions and  
add new code to the executor to allow it to support the processing of  
ORDER BY / DISTINCT aggregates where the tuples are already sorted in the  
correct order.  
  
Since there can be many ORDER BY / DISTINCT aggregates in any given query  
level, it's very possible that we can't find an order that suits all of  
these aggregates.  The sort order that the planner chooses is simply the  
one that suits the most aggregate functions.  We take the most strictly  
sorted variation of each order and see how many aggregate functions can  
use that, then we try again with the order of the remaining aggregates to  
see if another order would suit more aggregate functions.  For example:  
  
SELECT agg(a ORDER BY a),agg2(a ORDER BY a,b) ...  
  
would request the sort order to be {a, b} because {a} is a subset of the  
sort order of {a,b}, but;  
  
SELECT agg(a ORDER BY a),agg2(a ORDER BY c) ...  
  
would just pick a plan ordered by {a} (we give precedence to aggregates  
which are earlier in the targetlist).  
  
SELECT agg(a ORDER BY a),agg2(a ORDER BY b),agg3(a ORDER BY b) ...  
  
would choose to order by {b} since two aggregates suit that vs just one  
that requires input ordered by {a}.  
  
Author: David Rowley  
Reviewed-by: Ronan Dunklau, James Coleman, Ranier Vilela, Richard Guo, Tom Lane  
Discussion: https://postgr.es/m/CAApHDvpHzfo92%3DR4W0%2BxVua3BUYCKMckWAmo-2t_KiXN-wYH%3Dw%40mail.gmail.com  
+--  
+-- Test planner's selection of pathkeys for ORDER BY aggregates  
+--  
+-- Ensure we order by four.  This suits the most aggregate functions.  
+explain (costs off)  
+select sum(two order by two),max(four order by four), min(four order by four)  
+from tenk1;  
+          QUERY PLAN             
+-------------------------------  
+ Aggregate  
+   ->  Sort  
+         Sort Key: four  
+         ->  Seq Scan on tenk1  
+(4 rows)  
+  
+-- Ensure we order by two.  It's a tie between ordering by two and four but  
+-- we tiebreak on the aggregate's position.  
+explain (costs off)  
+select  
+  sum(two order by two), max(four order by four),  
+  min(four order by four), max(two order by two)  
+from tenk1;  
+          QUERY PLAN             
+-------------------------------  
+ Aggregate  
+   ->  Sort  
+         Sort Key: two  
+         ->  Seq Scan on tenk1  
+(4 rows)  
+  
+-- Similar to above, but tiebreak on ordering by four  
+explain (costs off)  
+select  
+  max(four order by four), sum(two order by two),  
+  min(four order by four), max(two order by two)  
+from tenk1;  
+          QUERY PLAN             
+-------------------------------  
+ Aggregate  
+   ->  Sort  
+         Sort Key: four  
+         ->  Seq Scan on tenk1  
+(4 rows)  
+  
+-- Ensure this one orders by ten since there are 3 aggregates that require ten  
+-- vs two that suit two and four.  
+explain (costs off)  
+select  
+  max(four order by four), sum(two order by two),  
+  min(four order by four), max(two order by two),  
+  sum(ten order by ten), min(ten order by ten), max(ten order by ten)  
+from tenk1;  
+          QUERY PLAN             
+-------------------------------  
+ Aggregate  
+   ->  Sort  
+         Sort Key: ten  
+         ->  Seq Scan on tenk1  
+(4 rows)  
+  
+-- Try a case involving a GROUP BY clause where the GROUP BY column is also  
+-- part of an aggregate's ORDER BY clause.  We want a sort order that works  
+-- for the GROUP BY along with the first and the last aggregate.  
+explain (costs off)  
+select  
+  sum(unique1 order by ten, two), sum(unique1 order by four),  
+  sum(unique1 order by two, four)  
+from tenk1  
+group by ten;  
+            QUERY PLAN              
+----------------------------------  
+ GroupAggregate  
+   Group Key: ten  
+   ->  Sort  
+         Sort Key: ten, two, four  
+         ->  Seq Scan on tenk1  
+(5 rows)  
+  

digoal's wechat