digoal
2022-08-26
PostgreSQL , 慢SQL , 数据库优化 , join , 顺序 , 数据分布 , 扫描方法 , merge sort
数据的组织+扫描方法决定了数据过滤多少, 性能怎么做到极限. 那就是索引, 精准定位, 并且要求完全没有filter.
《PostgreSQL 范围过滤 + 其他字段排序OFFSET LIMIT(多字段区间过滤)的优化与加速》
《PostgreSQL 优化case - where A字段范围 order by B字段排序 limit x》
创建一张表, 数据分布如下:
gid=1,10万条, gid=2,10万条, ... gid=10,10万条. 连续分布. crt_time则从头到尾按顺序写入, gid 10在表的最末尾.
然后我们要做的是查找gid=9,10的数据, 并且按crt_time排序.
这个例子主要说明: 扫描方法决定了数据过滤多少, 性能怎么做到极限. 那就是索引, 精准定位, 并且要求完全没有filter.
这个例子采用的方法是merge sort.
例子如下:
数据写入tbl表, 并创建好gid,crt_time的索引
create unlogged table tbl (gid int, crt_time timestamp, info text, v numeric);
insert into tbl select g, clock_timestamp(), md5(random()::text), random()*1000 from generate_series(1,10) g, generate_series(1,100000) t;
create index idx_tbl_1 on tbl (gid,crt_time);
postgres=# select * from tbl limit 10;
gid | crt_time | info | v
-----+----------------------------+----------------------------------+------------------
1 | 2022-08-26 10:33:40.167693 | dca2a67612cfe0502a984690088f9d4e | 926.480833187289
1 | 2022-08-26 10:33:40.169892 | bfd8a1c94b4c8d7f7da639c005462f1e | 553.004215673141
1 | 2022-08-26 10:33:40.169905 | 7d9ecdd22f261b361402620f48a5ffde | 992.857001112929
1 | 2022-08-26 10:33:40.169907 | 4c646b30e9f2a1442f20689179d7dbdc | 882.250459518574
1 | 2022-08-26 10:33:40.169909 | b9efd4224ed1e3f5a7db32eaa10e7c21 | 348.748854964515
1 | 2022-08-26 10:33:40.169911 | 09c7dcd2846b54c7f623010a9499de84 | 569.219270037151
1 | 2022-08-26 10:33:40.169913 | 955f1f35934c79a8ca3558c68a39377b | 11.7484035042552
1 | 2022-08-26 10:33:40.169915 | b660a864dab7a28e409e92d2aa769e7a | 719.183417035431
1 | 2022-08-26 10:33:40.169916 | 14c4c30150371a5a9bbc10373c45680c | 809.989509768999
1 | 2022-08-26 10:33:40.169918 | 00532483f3e44f2dcde934fb5e2c960f | 753.914120371643
(10 rows)
要查的gid放在tbl_gid表
postgres=# create table tbl_gid (gid int);
CREATE TABLE
postgres=# insert into tbl_gid values (9),(10);
INSERT 0 2
查询语句如下
postgres=# explain (analyze,verbose,timing,costs,buffers) select tbl.* from tbl join tbl_gid on tbl.gid=tbl_gid.gid order by tbl.crt_time limit 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=543741.56..543741.59 rows=10 width=56) (actual time=332.189..332.197 rows=10 loops=1)
Output: tbl.gid, tbl.crt_time, tbl.info, tbl.v
Buffers: shared hit=15206
-> Sort (cost=543741.56..575616.56 rows=12750000 width=56) (actual time=332.186..332.189 rows=10 loops=1)
Output: tbl.gid, tbl.crt_time, tbl.info, tbl.v
Sort Key: tbl.crt_time
Sort Method: top-N heapsort Memory: 26kB
Buffers: shared hit=15206
-> Merge Join (cost=142.03..224991.56 rows=12750000 width=56) (actual time=202.986..305.187 rows=200000 loops=1)
Output: tbl.gid, tbl.crt_time, tbl.info, tbl.v
Merge Cond: (tbl.gid = tbl_gid.gid)
Buffers: shared hit=15203
-> Index Scan using idx_tbl_1 on public.tbl (cost=0.42..31099.96 rows=1000000 width=56) (actual time=0.038..176.019 rows=1000000 loops=1)
Output: tbl.gid, tbl.crt_time, tbl.info, tbl.v
Buffers: shared hit=15198
-> Sort (cost=179.78..186.16 rows=2550 width=4) (actual time=0.053..14.009 rows=100001 loops=1)
Output: tbl_gid.gid
Sort Key: tbl_gid.gid
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=5
-> Seq Scan on public.tbl_gid (cost=0.00..35.50 rows=2550 width=4) (actual time=0.011..0.011 rows=2 loops=1)
Output: tbl_gid.gid
Buffers: shared hit=1
Planning:
Buffers: shared hit=199
Planning Time: 1.688 ms
Execution Time: 332.400 ms
(27 rows)
优化器使用了比较笨的方法, 表面上看外表是大表, 内表是小表, 而且没有用显排序. 看起来很聪明, 实际上很耗时.
postgres=# set enable_mergejoin =off;
SET
postgres=# explain (analyze,verbose,timing,costs,buffers) select tbl.* from tbl join tbl_gid on tbl.gid=tbl_gid.gid order by tbl.crt_time limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=788931.38..788931.40 rows=10 width=56) (actual time=236.632..236.637 rows=10 loops=1)
Output: tbl.gid, tbl.crt_time, tbl.info, tbl.v
Buffers: shared hit=11365
-> Sort (cost=788931.38..820806.38 rows=12750000 width=56) (actual time=236.631..236.633 rows=10 loops=1)
Output: tbl.gid, tbl.crt_time, tbl.info, tbl.v
Sort Key: tbl.crt_time
Sort Method: top-N heapsort Memory: 26kB
Buffers: shared hit=11365
-> Hash Join (cost=67.38..470181.38 rows=12750000 width=56) (actual time=151.584..211.768 rows=200000 loops=1)
Output: tbl.gid, tbl.crt_time, tbl.info, tbl.v
Hash Cond: (tbl.gid = tbl_gid.gid)
Buffers: shared hit=11365
-> Seq Scan on public.tbl (cost=0.00..21364.00 rows=1000000 width=56) (actual time=0.023..79.634 rows=1000000 loops=1)
Output: tbl.gid, tbl.crt_time, tbl.info, tbl.v
Buffers: shared hit=11364
-> Hash (cost=35.50..35.50 rows=2550 width=4) (actual time=0.011..0.012 rows=2 loops=1)
Output: tbl_gid.gid
Buckets: 4096 Batches: 1 Memory Usage: 33kB
Buffers: shared hit=1
-> Seq Scan on public.tbl_gid (cost=0.00..35.50 rows=2550 width=4) (actual time=0.006..0.007 rows=2 loops=1)
Output: tbl_gid.gid
Buffers: shared hit=1
Planning Time: 0.173 ms
Execution Time: 236.682 ms
(24 rows)
postgres=# set enable_hashjoin =off;
SET
关闭了hashjoin, mergejoin后, 反而变快了.
而且优化器选择了大表作为内表, 说明这样过滤性更好. PG这一点还是很好的, 如果是MySQL估计你得写成STRAIGHT_JOIN来固定大表作为内表. https://blog.huoding.com/2013/06/04/261
postgres=# explain (analyze,verbose,timing,costs,buffers) select tbl.* from tbl join tbl_gid on tbl.gid=tbl_gid.gid order by tbl.crt_time limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=7349107.95..7349107.98 rows=10 width=56) (actual time=96.025..96.028 rows=10 loops=1)
Output: tbl.gid, tbl.crt_time, tbl.info, tbl.v
Buffers: shared hit=3048
-> Sort (cost=7349107.95..7380982.95 rows=12750000 width=56) (actual time=96.023..96.025 rows=10 loops=1)
Output: tbl.gid, tbl.crt_time, tbl.info, tbl.v
Sort Key: tbl.crt_time
Sort Method: top-N heapsort Memory: 26kB
Buffers: shared hit=3048
-> Nested Loop (cost=0.42..7030357.95 rows=12750000 width=56) (actual time=0.090..67.554 rows=200000 loops=1)
Output: tbl.gid, tbl.crt_time, tbl.info, tbl.v
Buffers: shared hit=3048
-> Seq Scan on public.tbl_gid (cost=0.00..35.50 rows=2550 width=4) (actual time=0.030..0.031 rows=2 loops=1)
Output: tbl_gid.gid
Buffers: shared hit=1
-> Index Scan using idx_tbl_1 on public.tbl (cost=0.42..1756.99 rows=100000 width=56) (actual time=0.032..17.225 rows=100000 loops=2)
Output: tbl.gid, tbl.crt_time, tbl.info, tbl.v
Index Cond: (tbl.gid = tbl_gid.gid)
Buffers: shared hit=3047
Planning Time: 0.219 ms
Execution Time: 96.072 ms
(20 rows)
以上数据搜索方法是从tbl_gid获得gid, 然后在tbl搜索gid的所有记录, 然后
你也许会说, 为什么不直接从gid的crT_time自动按索引顺序扫描, 然后找到匹配的gid, limit返回呢?
前面已经说了, 数据是线性分布的, 按着crt_time的顺序, 等你找到gid=9,10的数据需要过滤80万行没用的记录.
真正的优化是: 完全消除过滤. 所以可以采用union all代替join, 使用merge sort返回.
如果优化器未来能支持skip scan, 可能就不需要这种写法了.
- 优化器的一小步, 可能就解决了某个行业某些场景的大问题. 这就是开源的魅力, 有需求就有动力改造. 不断按用户需求的方向发展.
生成SQL:
do language plpgsql $$
declare
sql text := '';
i int;
u text := ' union all ';
begin
sql := 'select * from ';
for i in select tbl_gid.gid from tbl_gid loop
sql := sql || format (' (select * from tbl where gid=%s order by crt_time) ', i);
sql := sql || u;
end loop;
sql := rtrim(sql, u) || ' order by crt_time limit 10;';
raise notice '%', sql;
end;
$$;
执行时间, 100毫秒降低到0.几毫秒:
select * from
(select * from tbl where gid=9 order by crt_time ) -- 里面不需要limit, 因为PG支持merge append
union all
(select * from tbl where gid=10 order by crt_time ) -- 里面不需要limit, 因为PG支持merge append
order by crt_time limit 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.86..1.80 rows=10 width=56) (actual time=0.036..0.045 rows=10 loops=1)
Output: tbl.gid, tbl.crt_time, tbl.info, tbl.v
Buffers: shared hit=9
-> Merge Append (cost=0.86..18461.52 rows=197233 width=56) (actual time=0.035..0.042 rows=10 loops=1)
Sort Key: tbl.crt_time
Buffers: shared hit=9
-> Index Scan using idx_tbl_1 on public.tbl (cost=0.42..8257.79 rows=99100 width=56) (actual time=0.022..0.026 rows=10 loops=1)
Output: tbl.gid, tbl.crt_time, tbl.info, tbl.v
Index Cond: (tbl.gid = 9)
Buffers: shared hit=5
-> Index Scan using idx_tbl_1 on public.tbl tbl_1 (cost=0.42..8231.38 rows=98133 width=56) (actual time=0.011..0.011 rows=1 loops=1)
Output: tbl_1.gid, tbl_1.crt_time, tbl_1.info, tbl_1.v
Index Cond: (tbl_1.gid = 10)
Buffers: shared hit=4
Planning Time: 0.165 ms
Execution Time: 0.069 ms
(16 rows)
改成函数调用可以方便使用:
create or replace function get_tbl_from_tbl_gid() returns TABLE(gid int,crt_time timestamp, info text, v numeric) as $$
declare
sql text := '';
i int;
u text := ' union all ';
begin
sql := 'select * from ';
for i in select tbl_gid.gid from tbl_gid loop
sql := sql || format (' (select * from tbl where gid=%s order by crt_time) ', i);
sql := sql || u;
end loop;
sql := rtrim(sql, u) || ' order by crt_time limit 10;';
return query execute sql ;
end;
$$ language plpgsql strict;
postgres=# select * from get_tbl_from_tbl_gid();
gid | crt_time | info | v
-----+----------------------------+----------------------------------+------------------
9 | 2022-08-26 10:33:41.986023 | 9ae6e7501235704db4a06a2583fa3869 | 934.268270867721
9 | 2022-08-26 10:33:41.986025 | 2f7b856bf6118a2f705550e50046db1a | 444.519415857999
9 | 2022-08-26 10:33:41.986027 | 7bbfcba972224fbd1feaedae7f468391 | 224.418366498535
9 | 2022-08-26 10:33:41.986028 | ac93a0fdd1e753bec5d102b92a829510 | 977.99573846894
9 | 2022-08-26 10:33:41.98603 | 1fa1c6aac66c12b80c8c35a6032be0e7 | 741.469261203189
9 | 2022-08-26 10:33:41.986032 | 65c4e99b0fd2d818159f504cc3238e1b | 434.032166357292
9 | 2022-08-26 10:33:41.986033 | c3a0e0f437145b2c0ccbf569e713d457 | 107.788739293836
9 | 2022-08-26 10:33:41.986035 | d0a9af284773e7b3516ea2d3753afb0d | 718.387729379674
9 | 2022-08-26 10:33:41.986037 | 2c8a1ae7d26b896563b5bcd89f0d0782 | 254.24026094129
9 | 2022-08-26 10:33:41.986058 | 551224cd6c0359c6076f2ea78e01576f | 80.4301336515814
(10 rows)
Time: 0.724 ms
现实中会不会遇到这样的数据分布场景?
传感器、feed数据、监控数据、流数据、时序数据.
当数据是离散的, 例如只有报警时才有数据, 时间分为报警时间, 数据写入时间两个字段.
如果某些传感器的数据到达和报警时间的差异非常大, 我们按报警时间搜索, 就可能出现这样的问题.