Skip to content

Latest commit

 

History

History
303 lines (238 loc) · 15.5 KB

20220826_01.md

File metadata and controls

303 lines (238 loc) · 15.5 KB

PostgreSQL join+order by limit的优化例子 - 说明数据分布与扫描方法对优化的关键作用

作者

digoal

日期

2022-08-26

标签

PostgreSQL , 慢SQL , 数据库优化 , join , 顺序 , 数据分布 , 扫描方法 , merge sort


背景

数据的组织+扫描方法决定了数据过滤多少, 性能怎么做到极限. 那就是索引, 精准定位, 并且要求完全没有filter.

《PostgreSQL 范围过滤 + 其他字段排序OFFSET LIMIT(多字段区间过滤)的优化与加速》

《PostgreSQL 优化case - where A字段范围 order by B字段排序 limit x》

创建一张表, 数据分布如下:

gid=1,10万条, gid=2,10万条, ... gid=10,10万条. 连续分布. crt_time则从头到尾按顺序写入, gid 10在表的最末尾.

然后我们要做的是查找gid=9,10的数据, 并且按crt_time排序.

这个例子主要说明: 扫描方法决定了数据过滤多少, 性能怎么做到极限. 那就是索引, 精准定位, 并且要求完全没有filter.

这个例子采用的方法是merge sort.

例子如下:

数据写入tbl表, 并创建好gid,crt_time的索引

create unlogged table tbl (gid int, crt_time timestamp, info text, v numeric);  
insert into tbl select g, clock_timestamp(), md5(random()::text), random()*1000 from generate_series(1,10) g, generate_series(1,100000) t;  
create index idx_tbl_1 on tbl (gid,crt_time);  
  
postgres=# select * from tbl limit 10;  
 gid |          crt_time          |               info               |        v           
-----+----------------------------+----------------------------------+------------------  
   1 | 2022-08-26 10:33:40.167693 | dca2a67612cfe0502a984690088f9d4e | 926.480833187289  
   1 | 2022-08-26 10:33:40.169892 | bfd8a1c94b4c8d7f7da639c005462f1e | 553.004215673141  
   1 | 2022-08-26 10:33:40.169905 | 7d9ecdd22f261b361402620f48a5ffde | 992.857001112929  
   1 | 2022-08-26 10:33:40.169907 | 4c646b30e9f2a1442f20689179d7dbdc | 882.250459518574  
   1 | 2022-08-26 10:33:40.169909 | b9efd4224ed1e3f5a7db32eaa10e7c21 | 348.748854964515  
   1 | 2022-08-26 10:33:40.169911 | 09c7dcd2846b54c7f623010a9499de84 | 569.219270037151  
   1 | 2022-08-26 10:33:40.169913 | 955f1f35934c79a8ca3558c68a39377b | 11.7484035042552  
   1 | 2022-08-26 10:33:40.169915 | b660a864dab7a28e409e92d2aa769e7a | 719.183417035431  
   1 | 2022-08-26 10:33:40.169916 | 14c4c30150371a5a9bbc10373c45680c | 809.989509768999  
   1 | 2022-08-26 10:33:40.169918 | 00532483f3e44f2dcde934fb5e2c960f | 753.914120371643  
(10 rows)  

要查的gid放在tbl_gid表

postgres=# create table tbl_gid (gid int);  
CREATE TABLE  
postgres=# insert into tbl_gid values (9),(10);  
INSERT 0 2  

查询语句如下

postgres=# explain (analyze,verbose,timing,costs,buffers) select tbl.* from tbl join tbl_gid on tbl.gid=tbl_gid.gid order by tbl.crt_time limit 10;  
                                                                        QUERY PLAN                                                                           
-----------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=543741.56..543741.59 rows=10 width=56) (actual time=332.189..332.197 rows=10 loops=1)  
   Output: tbl.gid, tbl.crt_time, tbl.info, tbl.v  
   Buffers: shared hit=15206  
   ->  Sort  (cost=543741.56..575616.56 rows=12750000 width=56) (actual time=332.186..332.189 rows=10 loops=1)  
         Output: tbl.gid, tbl.crt_time, tbl.info, tbl.v  
         Sort Key: tbl.crt_time  
         Sort Method: top-N heapsort  Memory: 26kB  
         Buffers: shared hit=15206  
         ->  Merge Join  (cost=142.03..224991.56 rows=12750000 width=56) (actual time=202.986..305.187 rows=200000 loops=1)  
               Output: tbl.gid, tbl.crt_time, tbl.info, tbl.v  
               Merge Cond: (tbl.gid = tbl_gid.gid)  
               Buffers: shared hit=15203  
               ->  Index Scan using idx_tbl_1 on public.tbl  (cost=0.42..31099.96 rows=1000000 width=56) (actual time=0.038..176.019 rows=1000000 loops=1)  
                     Output: tbl.gid, tbl.crt_time, tbl.info, tbl.v  
                     Buffers: shared hit=15198  
               ->  Sort  (cost=179.78..186.16 rows=2550 width=4) (actual time=0.053..14.009 rows=100001 loops=1)  
                     Output: tbl_gid.gid  
                     Sort Key: tbl_gid.gid  
                     Sort Method: quicksort  Memory: 25kB  
                     Buffers: shared hit=5  
                     ->  Seq Scan on public.tbl_gid  (cost=0.00..35.50 rows=2550 width=4) (actual time=0.011..0.011 rows=2 loops=1)  
                           Output: tbl_gid.gid  
                           Buffers: shared hit=1  
 Planning:  
   Buffers: shared hit=199  
 Planning Time: 1.688 ms  
 Execution Time: 332.400 ms  
(27 rows)  

优化器使用了比较笨的方法, 表面上看外表是大表, 内表是小表, 而且没有用显排序. 看起来很聪明, 实际上很耗时.

postgres=# set enable_mergejoin =off;  
SET  
postgres=# explain (analyze,verbose,timing,costs,buffers) select tbl.* from tbl join tbl_gid on tbl.gid=tbl_gid.gid order by tbl.crt_time limit 10;  
                                                               QUERY PLAN                                                                 
----------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=788931.38..788931.40 rows=10 width=56) (actual time=236.632..236.637 rows=10 loops=1)  
   Output: tbl.gid, tbl.crt_time, tbl.info, tbl.v  
   Buffers: shared hit=11365  
   ->  Sort  (cost=788931.38..820806.38 rows=12750000 width=56) (actual time=236.631..236.633 rows=10 loops=1)  
         Output: tbl.gid, tbl.crt_time, tbl.info, tbl.v  
         Sort Key: tbl.crt_time  
         Sort Method: top-N heapsort  Memory: 26kB  
         Buffers: shared hit=11365  
         ->  Hash Join  (cost=67.38..470181.38 rows=12750000 width=56) (actual time=151.584..211.768 rows=200000 loops=1)  
               Output: tbl.gid, tbl.crt_time, tbl.info, tbl.v  
               Hash Cond: (tbl.gid = tbl_gid.gid)  
               Buffers: shared hit=11365  
               ->  Seq Scan on public.tbl  (cost=0.00..21364.00 rows=1000000 width=56) (actual time=0.023..79.634 rows=1000000 loops=1)  
                     Output: tbl.gid, tbl.crt_time, tbl.info, tbl.v  
                     Buffers: shared hit=11364  
               ->  Hash  (cost=35.50..35.50 rows=2550 width=4) (actual time=0.011..0.012 rows=2 loops=1)  
                     Output: tbl_gid.gid  
                     Buckets: 4096  Batches: 1  Memory Usage: 33kB  
                     Buffers: shared hit=1  
                     ->  Seq Scan on public.tbl_gid  (cost=0.00..35.50 rows=2550 width=4) (actual time=0.006..0.007 rows=2 loops=1)  
                           Output: tbl_gid.gid  
                           Buffers: shared hit=1  
 Planning Time: 0.173 ms  
 Execution Time: 236.682 ms  
(24 rows)  
  
  
postgres=# set enable_hashjoin =off;  
SET  

关闭了hashjoin, mergejoin后, 反而变快了.

而且优化器选择了大表作为内表, 说明这样过滤性更好. PG这一点还是很好的, 如果是MySQL估计你得写成STRAIGHT_JOIN来固定大表作为内表. https://blog.huoding.com/2013/06/04/261

postgres=# explain (analyze,verbose,timing,costs,buffers) select tbl.* from tbl join tbl_gid on tbl.gid=tbl_gid.gid order by tbl.crt_time limit 10;  
                                                                      QUERY PLAN                                                                         
-------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=7349107.95..7349107.98 rows=10 width=56) (actual time=96.025..96.028 rows=10 loops=1)  
   Output: tbl.gid, tbl.crt_time, tbl.info, tbl.v  
   Buffers: shared hit=3048  
   ->  Sort  (cost=7349107.95..7380982.95 rows=12750000 width=56) (actual time=96.023..96.025 rows=10 loops=1)  
         Output: tbl.gid, tbl.crt_time, tbl.info, tbl.v  
         Sort Key: tbl.crt_time  
         Sort Method: top-N heapsort  Memory: 26kB  
         Buffers: shared hit=3048  
         ->  Nested Loop  (cost=0.42..7030357.95 rows=12750000 width=56) (actual time=0.090..67.554 rows=200000 loops=1)  
               Output: tbl.gid, tbl.crt_time, tbl.info, tbl.v  
               Buffers: shared hit=3048  
               ->  Seq Scan on public.tbl_gid  (cost=0.00..35.50 rows=2550 width=4) (actual time=0.030..0.031 rows=2 loops=1)  
                     Output: tbl_gid.gid  
                     Buffers: shared hit=1  
               ->  Index Scan using idx_tbl_1 on public.tbl  (cost=0.42..1756.99 rows=100000 width=56) (actual time=0.032..17.225 rows=100000 loops=2)  
                     Output: tbl.gid, tbl.crt_time, tbl.info, tbl.v  
                     Index Cond: (tbl.gid = tbl_gid.gid)  
                     Buffers: shared hit=3047  
 Planning Time: 0.219 ms  
 Execution Time: 96.072 ms  
(20 rows)  

以上数据搜索方法是从tbl_gid获得gid, 然后在tbl搜索gid的所有记录, 然后

你也许会说, 为什么不直接从gid的crT_time自动按索引顺序扫描, 然后找到匹配的gid, limit返回呢?
前面已经说了, 数据是线性分布的, 按着crt_time的顺序, 等你找到gid=9,10的数据需要过滤80万行没用的记录.

真正的优化是: 完全消除过滤. 所以可以采用union all代替join, 使用merge sort返回.

如果优化器未来能支持skip scan, 可能就不需要这种写法了.

  • 优化器的一小步, 可能就解决了某个行业某些场景的大问题. 这就是开源的魅力, 有需求就有动力改造. 不断按用户需求的方向发展.

生成SQL:

do language plpgsql $$  
declare  
  sql text := '';  
  i int;  
  u text := ' union all ';  
begin  
  sql := 'select * from ';  
  for i in select tbl_gid.gid from tbl_gid loop  
    sql := sql || format (' (select * from tbl where gid=%s order by crt_time) ', i);  
    sql := sql || u;  
  end loop;  
  sql := rtrim(sql, u) || ' order by crt_time limit 10;';  
  raise notice '%', sql;  
end;  
$$;  

执行时间, 100毫秒降低到0.几毫秒:

select * from   
(select * from tbl where gid=9 order by crt_time )   -- 里面不需要limit, 因为PG支持merge append  
union all  
(select * from tbl where gid=10 order by crt_time )  -- 里面不需要limit, 因为PG支持merge append  
order by crt_time limit 10;   
  
                                                                      QUERY PLAN                                                                        
------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=0.86..1.80 rows=10 width=56) (actual time=0.036..0.045 rows=10 loops=1)  
   Output: tbl.gid, tbl.crt_time, tbl.info, tbl.v  
   Buffers: shared hit=9  
   ->  Merge Append  (cost=0.86..18461.52 rows=197233 width=56) (actual time=0.035..0.042 rows=10 loops=1)  
         Sort Key: tbl.crt_time  
         Buffers: shared hit=9  
         ->  Index Scan using idx_tbl_1 on public.tbl  (cost=0.42..8257.79 rows=99100 width=56) (actual time=0.022..0.026 rows=10 loops=1)  
               Output: tbl.gid, tbl.crt_time, tbl.info, tbl.v  
               Index Cond: (tbl.gid = 9)  
               Buffers: shared hit=5  
         ->  Index Scan using idx_tbl_1 on public.tbl tbl_1  (cost=0.42..8231.38 rows=98133 width=56) (actual time=0.011..0.011 rows=1 loops=1)  
               Output: tbl_1.gid, tbl_1.crt_time, tbl_1.info, tbl_1.v  
               Index Cond: (tbl_1.gid = 10)  
               Buffers: shared hit=4  
 Planning Time: 0.165 ms  
 Execution Time: 0.069 ms  
(16 rows)  

改成函数调用可以方便使用:

create or replace function get_tbl_from_tbl_gid() returns TABLE(gid int,crt_time timestamp, info text, v numeric) as $$  
declare  
  sql text := '';  
  i int;  
  u text := ' union all ';  
begin  
  sql := 'select * from ';  
  for i in select tbl_gid.gid from tbl_gid loop  
    sql := sql || format (' (select * from tbl where gid=%s order by crt_time) ', i);  
    sql := sql || u;  
  end loop;  
  sql := rtrim(sql, u) || ' order by crt_time limit 10;';  
  return query execute sql ;   
end;   
$$ language plpgsql strict;   
  
  
  
  
postgres=# select * from get_tbl_from_tbl_gid();  
 gid |          crt_time          |               info               |        v           
-----+----------------------------+----------------------------------+------------------  
   9 | 2022-08-26 10:33:41.986023 | 9ae6e7501235704db4a06a2583fa3869 | 934.268270867721  
   9 | 2022-08-26 10:33:41.986025 | 2f7b856bf6118a2f705550e50046db1a | 444.519415857999  
   9 | 2022-08-26 10:33:41.986027 | 7bbfcba972224fbd1feaedae7f468391 | 224.418366498535  
   9 | 2022-08-26 10:33:41.986028 | ac93a0fdd1e753bec5d102b92a829510 |  977.99573846894  
   9 | 2022-08-26 10:33:41.98603  | 1fa1c6aac66c12b80c8c35a6032be0e7 | 741.469261203189  
   9 | 2022-08-26 10:33:41.986032 | 65c4e99b0fd2d818159f504cc3238e1b | 434.032166357292  
   9 | 2022-08-26 10:33:41.986033 | c3a0e0f437145b2c0ccbf569e713d457 | 107.788739293836  
   9 | 2022-08-26 10:33:41.986035 | d0a9af284773e7b3516ea2d3753afb0d | 718.387729379674  
   9 | 2022-08-26 10:33:41.986037 | 2c8a1ae7d26b896563b5bcd89f0d0782 |  254.24026094129  
   9 | 2022-08-26 10:33:41.986058 | 551224cd6c0359c6076f2ea78e01576f | 80.4301336515814  
(10 rows)  
  
Time: 0.724 ms  

现实中会不会遇到这样的数据分布场景?

传感器、feed数据、监控数据、流数据、时序数据.

当数据是离散的, 例如只有报警时才有数据, 时间分为报警时间, 数据写入时间两个字段.
如果某些传感器的数据到达和报警时间的差异非常大, 我们按报警时间搜索, 就可能出现这样的问题.

digoal's wechat