digoal
2020-06-15
PostgreSQL , 窗口查询 , window , 递归 , 索引scan , skip scan
构造数据 :
create or replace function gen_res() returns setof numeric as $$
declare
s numeric := random();
begin
return query select s from generate_series(1,100);
end;
$$ language plpgsql strict;
create table a (uid int, score numeric, class text);
insert into a select id, gen_res(), md5(ceil(random()*100)::text) from generate_series(1,100000) t(id);
构造score相同的uid 100个.
with a1 as (select max(score) score from a)
insert into a select id, score , md5(ceil(random()*100)::text) from generate_series(100001,100100) t(id) , (select score from a1,generate_series(1,100)) t1;
postgres=# select * from a limit 10;
uid | score | class
-----+-------------------+----------------------------------
2 | 0.979678114877171 | 44f683a84163b3523afe57c2e008bc8c
2 | 0.979678114877171 | d645920e395fedad7bbbed0eca3fe2e0
2 | 0.979678114877171 | fe9fc289c3ff0af142b6d3bead98a923
2 | 0.979678114877171 | d09bf41544a3365a46c9077ebb5e35c3
2 | 0.979678114877171 | 1679091c5a880faf6fb5e6087eb1b2dc
2 | 0.979678114877171 | 34173cb38f07f89ddbebc2ac9128303f
2 | 0.979678114877171 | 70efdf2ec9b086079795c442636b55fb
2 | 0.979678114877171 | fc490ca45c00b1249bbe3554a4fdf6fb
2 | 0.979678114877171 | fe9fc289c3ff0af142b6d3bead98a923
2 | 0.979678114877171 | a5771bce93e200c36f7cd9dfd0e5deaa
(10 rows)
同一个uid有多条记录, 每条记录的score相同.
不同UID的score大多数情况下不同, 但是也有少许uid的score相同.
按score倒序, 查询UID. 每个UID 返回一条.
最简单直接的方法是使用窗口查询, 窗口: 按score和uid分组(不能仅按score分组, 因为有的uid 可能score一样), 然后按score倒序.
如下:
create index idx_a_1 on a (score desc,uid);
explain (analyze,verbose,timing,costs,buffers)
select uid,score from (
select row_number() over (partition by score,uid order by score desc) as rn,
uid, score from a) t
where t.rn=1 limit 50;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.56..607.79 rows=50 width=36) (actual time=0.081..4.117 rows=50 loops=1)
Output: t.uid, t.score
Buffers: shared hit=4613
-> Subquery Scan on t (cost=0.56..607839.84 rows=50050 width=36) (actual time=0.081..4.111 rows=50 loops=1)
Output: t.uid, t.score
Filter: (t.rn = 1)
Rows Removed by Filter: 4851
Buffers: shared hit=4613
-> WindowAgg (cost=0.56..482714.24 rows=10010048 width=44) (actual time=0.080..3.853 rows=4901 loops=1)
Output: row_number() OVER (?), a.uid, a.score
Buffers: shared hit=4613
-> Index Only Scan using idx_a_1 on public.a (cost=0.56..282513.28 rows=10010048 width=36) (actual time=0.018..1.532 rows=5001 loops=1)
Output: a.uid, a.score
Heap Fetches: 5001
Buffers: shared hit=4613
Planning Time: 0.074 ms
Execution Time: 4.135 ms
(17 rows)
性能看起来是不错的, 但是你会发现这条sql虽然只需要返回50条记录, 却扫描了5001行(因为窗口查询使用索引进行有序遍历, 而不是跳跃扫描的方式), 能不能只扫描50行呢?
结果如下:
uid | score
--------+-------------------
31536 | 0.999998891470238
100001 | 0.999998891470238
100002 | 0.999998891470238
100003 | 0.999998891470238
100004 | 0.999998891470238
100005 | 0.999998891470238
100006 | 0.999998891470238
100007 | 0.999998891470238
100008 | 0.999998891470238
100009 | 0.999998891470238
100010 | 0.999998891470238
100011 | 0.999998891470238
100012 | 0.999998891470238
100013 | 0.999998891470238
100014 | 0.999998891470238
100015 | 0.999998891470238
100016 | 0.999998891470238
100017 | 0.999998891470238
100018 | 0.999998891470238
100019 | 0.999998891470238
100020 | 0.999998891470238
100021 | 0.999998891470238
100022 | 0.999998891470238
100023 | 0.999998891470238
100024 | 0.999998891470238
100025 | 0.999998891470238
100026 | 0.999998891470238
100027 | 0.999998891470238
100028 | 0.999998891470238
100029 | 0.999998891470238
100030 | 0.999998891470238
100031 | 0.999998891470238
100032 | 0.999998891470238
100033 | 0.999998891470238
100034 | 0.999998891470238
100035 | 0.999998891470238
100036 | 0.999998891470238
100037 | 0.999998891470238
100038 | 0.999998891470238
100039 | 0.999998891470238
100040 | 0.999998891470238
100041 | 0.999998891470238
100042 | 0.999998891470238
100043 | 0.999998891470238
100044 | 0.999998891470238
100045 | 0.999998891470238
100046 | 0.999998891470238
100047 | 0.999998891470238
100048 | 0.999998891470238
100049 | 0.999998891470238
(50 rows)
如果想让这个查询只扫描50行, 需要使用跳跃扫描, 不能遍历整段索引. 可以使用递归实现:
直接按score排序limit递归会丢失相同score的UID:
create index idx_a_3 on a (score desc) include (uid) where score is not null;
with recursive tmp as (
(
select array[uid::numeric, score] as r from a
WHERE score is not null
order by score desc limit 1
)
union all
(
select
(select array[uid::numeric, score] as r from a t1
where t1.score < (tmp.r)[2]
and t1.score is not null
order by t1.score desc limit 1
)
from tmp where (tmp.r)[2] is not null
)
)
select (tmp.r)[1],(tmp.r)[2] from tmp
where tmp.* is not null
limit 50;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=55.78..56.79 rows=50 width=64) (actual time=0.013..0.539 rows=50 loops=1)
Output: (tmp.r[1]), (tmp.r[2])
Buffers: shared hit=241
CTE tmp
-> Recursive Union (cost=0.50..55.78 rows=101 width=32) (actual time=0.009..0.507 rows=50 loops=1)
Buffers: shared hit=241
-> Subquery Scan on "*SELECT* 1" (cost=0.50..0.52 rows=1 width=32) (actual time=0.009..0.009 rows=1 loops=1)
Output: "*SELECT* 1".r
Buffers: shared hit=5
-> Limit (cost=0.50..0.51 rows=1 width=43) (actual time=0.008..0.008 rows=1 loops=1)
Output: (ARRAY[(a.uid)::numeric, a.score]), a.score
Buffers: shared hit=5
-> Index Only Scan using idx_a_3 on public.a (cost=0.50..125126.42 rows=10009993 width=43) (actual time=0.008..0.008 rows=1 loops=1)
Output: ARRAY[(a.uid)::numeric, a.score], a.score
Heap Fetches: 0
Buffers: shared hit=5
-> WorkTable Scan on tmp tmp_1 (cost=0.00..5.33 rows=10 width=32) (actual time=0.010..0.010 rows=1 loops=49)
Output: (SubPlan 1)
Filter: (tmp_1.r[2] IS NOT NULL)
Buffers: shared hit=236
SubPlan 1
-> Limit (cost=0.50..0.51 rows=1 width=43) (actual time=0.009..0.009 rows=1 loops=49)
Output: (ARRAY[(t1.uid)::numeric, t1.score]), t1.score
Buffers: shared hit=236
-> Index Only Scan using idx_a_3 on public.a t1 (cost=0.50..41709.81 rows=3336664 width=43) (actual time=0.009..0.009 rows=1 loops=49)
Output: ARRAY[(t1.uid)::numeric, t1.score], t1.score
Index Cond: (t1.score < (tmp_1.r)[2])
Heap Fetches: 0
Buffers: shared hit=236
-> CTE Scan on tmp (cost=0.00..2.02 rows=100 width=64) (actual time=0.012..0.533 rows=50 loops=1)
Output: tmp.r[1], tmp.r[2]
Filter: (tmp.* IS NOT NULL)
Buffers: shared hit=241
Planning Time: 0.090 ms
Execution Time: 0.559 ms
(35 rows)
结果与窗口查询不同, 丢失了score相同的uid, 每个score只返回了一个uid, 与业务逻辑不相符
r | r
-------+-------------------
31536 | 0.999998891470238
83720 | 0.999991911982715
55722 | 0.999975578407255
23814 | 0.999972620235976
67381 | 0.999961911696577
29892 | 0.999943494870404
90079 | 0.999925486089911
21214 | 0.999922738312673
59001 | 0.99991736006243
87204 | 0.999917295777077
7030 | 0.999915557360705
47014 | 0.999902190202047
32264 | 0.999901911139158
38647 | 0.999860249737811
32674 | 0.999837253269899
87613 | 0.999832584158813
71232 | 0.9998249229434
41469 | 0.999812869635608
95274 | 0.999804819593745
70312 | 0.999779352434143
70923 | 0.999774257144811
26243 | 0.999741038483439
79093 | 0.999711161650342
51332 | 0.999708612428986
70293 | 0.999701119867272
749 | 0.999676152526778
82356 | 0.999673161844026
24750 | 0.99965793726258
24520 | 0.999657836562118
94013 | 0.999634740712228
32113 | 0.999620645922992
32524 | 0.999576429034658
64496 | 0.999547735428802
99351 | 0.999533980413151
74897 | 0.999506182824039
57650 | 0.999505328313486
12643 | 0.999502105912729
62484 | 0.999499621461251
98690 | 0.99949818874131
78253 | 0.999480342172379
12805 | 0.999474363236803
24470 | 0.999473037063549
42317 | 0.999456625347026
17058 | 0.999453040472755
67604 | 0.999448579359846
14985 | 0.99943607450459
72078 | 0.999425638849996
43290 | 0.999415581298248
10890 | 0.999408682154407
33988 | 0.999400768473588
(50 rows)
解决办法, 在score上加权, 增加一个值, 这个值要求每个uid都不一样, 并且对最终order by score desc的结果没有影响.
分数前几位保持不变, 在后面追加. 例如score有效位置为小数后15位, uid共10位, 那么将UID向小数点后只少移动25位即可, 不影响最终的排序结果.
如果你觉得这个移动太长了, 也可以使用一个简单(但是可能依旧有重复, 只是概率非常非常低)的方法, 将UID按999取模, 然后向后移动至少18位. order by (score::numeric + (mod(uid,999)/1e25)) desc
例子如下:
create index idx_a_2 on a ((score + (mod(uid,999)/1e25))) include (uid) where (score + (mod(uid,999)/1e25)) is not null ;
with recursive tmp as (
(
select array[uid::numeric, score, (score::numeric + (mod(uid,999)/1e25))] as r from a
WHERE (score + (mod(uid,999)/1e25)) is not null
order by (score + (mod(uid,999)/1e25)) desc limit 1
)
union all
(
select
(select array[uid::numeric, score, (score + (mod(uid,999)/1e25))] as r from a t1
where (t1.score + (mod(t1.uid,999)/1e25)) < (tmp.r)[3]
and (t1.score + (mod(t1.uid,999)/1e25)) is not null
order by (t1.score + (mod(t1.uid,999)/1e25)) desc limit 1
)
from tmp where (tmp.r)[3] is not null
)
)
select (tmp.r)[1],(tmp.r)[2] from tmp
where tmp.* is not null
limit 50;
此时得到了与窗口查询一样的结果, 重复的score没有丢失UID
r | r
--------+-------------------
31536 | 0.999998891470238
100100 | 0.999998891470238
100099 | 0.999998891470238
100098 | 0.999998891470238
100097 | 0.999998891470238
100096 | 0.999998891470238
100095 | 0.999998891470238
100094 | 0.999998891470238
100093 | 0.999998891470238
100092 | 0.999998891470238
100091 | 0.999998891470238
100090 | 0.999998891470238
100089 | 0.999998891470238
100088 | 0.999998891470238
100087 | 0.999998891470238
100086 | 0.999998891470238
100085 | 0.999998891470238
100084 | 0.999998891470238
100083 | 0.999998891470238
100082 | 0.999998891470238
100081 | 0.999998891470238
100080 | 0.999998891470238
100079 | 0.999998891470238
100078 | 0.999998891470238
100077 | 0.999998891470238
100076 | 0.999998891470238
100075 | 0.999998891470238
100074 | 0.999998891470238
100073 | 0.999998891470238
100072 | 0.999998891470238
100071 | 0.999998891470238
100070 | 0.999998891470238
100069 | 0.999998891470238
100068 | 0.999998891470238
100067 | 0.999998891470238
100066 | 0.999998891470238
100065 | 0.999998891470238
100064 | 0.999998891470238
100063 | 0.999998891470238
100062 | 0.999998891470238
100061 | 0.999998891470238
100060 | 0.999998891470238
100059 | 0.999998891470238
100058 | 0.999998891470238
100057 | 0.999998891470238
100056 | 0.999998891470238
100055 | 0.999998891470238
100054 | 0.999998891470238
100053 | 0.999998891470238
100052 | 0.999998891470238
(50 rows)
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=60.64..61.65 rows=50 width=64) (actual time=0.016..0.478 rows=50 loops=1)
Output: (tmp.r[1]), (tmp.r[2])
Buffers: shared hit=250
CTE tmp
-> Recursive Union (cost=0.50..60.64 rows=101 width=32) (actual time=0.012..0.445 rows=50 loops=1)
Buffers: shared hit=250
-> Subquery Scan on "*SELECT* 1" (cost=0.50..0.55 rows=1 width=32) (actual time=0.011..0.012 rows=1 loops=1)
Output: "*SELECT* 1".r
Buffers: shared hit=5
-> Limit (cost=0.50..0.54 rows=1 width=64) (actual time=0.011..0.011 rows=1 loops=1)
Output: (ARRAY[(a.uid)::numeric, a.score, (a.score + ((mod(a.uid, 999))::numeric / '10000000000000000000000000'::numeric))]), ((a.score + ((mod(a.uid, 999))::numeric / '10000000000000000000000000'::numeric)))
Buffers: shared hit=5
-> Index Scan Backward using idx_a_2 on public.a (cost=0.50..417341.60 rows=9959943 width=64) (actual time=0.011..0.011 rows=1 loops=1)
Output: ARRAY[(a.uid)::numeric, a.score, (a.score + ((mod(a.uid, 999))::numeric / '10000000000000000000000000'::numeric))], (a.score + ((mod(a.uid, 999))::numeric / '10000000000000000000000000'::numeric))
Buffers: shared hit=5
-> WorkTable Scan on tmp tmp_1 (cost=0.00..5.81 rows=10 width=32) (actual time=0.008..0.008 rows=1 loops=49)
Output: (SubPlan 1)
Filter: (tmp_1.r[3] IS NOT NULL)
Buffers: shared hit=245
SubPlan 1
-> Limit (cost=0.50..0.56 rows=1 width=64) (actual time=0.008..0.008 rows=1 loops=49)
Output: (ARRAY[(t1.uid)::numeric, t1.score, (t1.score + ((mod(t1.uid, 999))::numeric / '10000000000000000000000000'::numeric))]), ((t1.score + ((mod(t1.uid, 999))::numeric / '10000000000000000000000000'::numeric)))
Buffers: shared hit=245
-> Index Scan Backward using idx_a_2 on public.a t1 (cost=0.50..201702.36 rows=3319981 width=64) (actual time=0.008..0.008 rows=1 loops=49)
Output: ARRAY[(t1.uid)::numeric, t1.score, (t1.score + ((mod(t1.uid, 999))::numeric / '10000000000000000000000000'::numeric))], (t1.score + ((mod(t1.uid, 999))::numeric / '10000000000000000000000000'::numeric))
Index Cond: ((t1.score + ((mod(t1.uid, 999))::numeric / '10000000000000000000000000'::numeric)) < (tmp_1.r)[3])
Buffers: shared hit=245
-> CTE Scan on tmp (cost=0.00..2.02 rows=100 width=64) (actual time=0.015..0.473 rows=50 loops=1)
Output: tmp.r[1], tmp.r[2]
Filter: (tmp.* IS NOT NULL)
Buffers: shared hit=250
Planning Time: 0.127 ms
Execution Time: 0.502 ms
(33 rows)
完美, 递归后从4.1毫秒降低到0.5毫秒, limit 50只需要扫描50行。
《PostgreSQL 排序去重limit查询优化 - 递归 vs group分组 (loop降到极限, block scan降到极限)》
《PostgreSQL 家族图谱、社交图谱、树状关系、藤状分佣、溯源、等场景实践 - 递归,with recursive query (有向无环 , 有向有环)》
《累加链条件过滤 - 递归、窗口、UDF、游标、模拟递归、scan 剪切》
《PostgreSQL 递归应用实践 - 非“传销”的高并发实时藤、树状佣金分配体系》
《PostgreSQL 递归妙用案例 - 分组数据去重与打散》
《PostgreSQL Oracle 兼容性之 - INDEX SKIP SCAN (递归查询变态优化) 非驱动列索引扫描优化》
《PostgrSQL 递归SQL的几个应用 - 极客与正常人的思维》
您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森.