Skip to content

Latest commit

 

History

History
791 lines (750 loc) · 33.7 KB

20210326_02.md

File metadata and controls

791 lines (750 loc) · 33.7 KB

PostgreSQL 14 preview - BRIN (典型IoT 时序场景) 块级索引支持 bloom filter - 随机,大量distinct value, 等值查询

作者

digoal

日期

2021-03-26

标签

PostgreSQL , brin , 索引 , 块级索引 , bloom , 等值查询


背景

等值查询一般有些什么索引支持?

btree: 精准定位
hash: 精准定位
bloom: 超集定位, 支持任意组合等值查询, 少量bit占位符表示是否包含某个value, 由于不同value的bits不同, 当插入大量value后占位bit可能被大量设置为1, 导致超集出现, 等值查询真不一定为真, 假一定为假.
gin: 倒排索引
brin: 块级索引, 保存一个heap blocks段存储的被索引字段value 的范围. min,max

目前BRIN变种出现, 存储的不再是min, max, 而是支持了bloom filter, 也就是说它存储的是占位bits. 每个连续heap blocks, 存储一个占位bits, 被索引字段的hash value经过再次bloom hash填充占位bit.

brin bloom 的应用场景:

离散值较多时: 等值查询, 任意字段组合等值查询.

PostgreSQL 开发者确实很精致, 赞!!!

最佳设置实践:

1、false_positive_rate: false(0) bits 占比不能太低, 越低, 失真越高(都是1的占位bit, 意思是它包含任意value, 所以就没有过滤意义)
2、n_distinct_per_range: 一连串blocks包含多少个被索引字段的唯一值, 这个值越大越容易失真, 因为占位组合会更多
3、通过调整pages_per_range 来控制 n_distinct_per_range

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=77b88cd1bb9041a735f24072150cacfa06c699a3

BRIN bloom indexes  
author	Tomas Vondra <tomas.vondra@postgresql.org>	  
Fri, 26 Mar 2021 12:35:29 +0000 (13:35 +0100)  
committer	Tomas Vondra <tomas.vondra@postgresql.org>	  
Fri, 26 Mar 2021 12:35:32 +0000 (13:35 +0100)  
commit	77b88cd1bb9041a735f24072150cacfa06c699a3  
tree	be9ca84d673e3aa17e0e75ec579be414ae7eac18	tree  
parent	a681e3c107aa97eb554f118935c4d2278892c3dd	commit | diff  
BRIN bloom indexes  
  
Adds a BRIN opclass using a Bloom filter to summarize the range. Indexes  
using the new opclasses allow only equality queries (similar to hash  
indexes), but that works fine for data like UUID, MAC addresses etc. for  
which range queries are not very common. This also means the indexes  
work for data that is not well correlated to physical location within  
the table, or perhaps even entirely random (which is a common issue with  
existing BRIN minmax opclasses).  
  
It's possible to specify opclass parameters with the usual Bloom filter  
parameters, i.e. the desired false-positive rate and the expected number  
of distinct values per page range.  
  
  CREATE TABLE t (a int);  
  CREATE INDEX ON t  
   USING brin (a int4_bloom_ops(false_positive_rate = 0.05,  
                                n_distinct_per_range = 100));  
  
The opclasses do not operate on the indexed values directly, but compute  
a 32-bit hash first, and the Bloom filter is built on the hash value.  
Collisions should not be a huge issue though, as the number of distinct  
values in a page ranges is usually fairly small.  
  
Bump catversion, due to various catalog changes.  
  
Author: Tomas Vondra <tomas.vondra@postgresql.org>  
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>  
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>  
Reviewed-by: Sokolov Yura <y.sokolov@postgrespro.ru>  
Reviewed-by: Nico Williams <nico@cryptonector.com>  
Reviewed-by: John Naylor <john.naylor@enterprisedb.com>  
Discussion: https://postgr.es/m/c1138ead-7668-f0e1-0638-c3be3237e812@2ndquadrant.com  
Discussion: https://postgr.es/m/5d78b774-7e9c-c94e-12cf-fef51cc89b1a%402ndquadrant.com  

用法

   1 CREATE TABLE brintest_bloom (byteacol bytea,  
   2     charcol "char",  
   3     namecol name,  
   4     int8col bigint,  
   5     int2col smallint,  
   6     int4col integer,  
   7     textcol text,  
   8     oidcol oid,  
   9     float4col real,  
  10     float8col double precision,  
  11     macaddrcol macaddr,  
  12     inetcol inet,  
  13     cidrcol cidr,  
  14     bpcharcol character,  
  15     datecol date,  
  16     timecol time without time zone,  
  17     timestampcol timestamp without time zone,  
  18     timestamptzcol timestamp with time zone,  
  19     intervalcol interval,  
  20     timetzcol time with time zone,  
  21     numericcol numeric,  
  22     uuidcol uuid,  
  23     lsncol pg_lsn  
  24 ) WITH (fillfactor=10);  
  25 INSERT INTO brintest_bloom SELECT  
  26     repeat(stringu1, 8)::bytea,  
  27     substr(stringu1, 1, 1)::"char",  
  28     stringu1::name, 142857 * tenthous,  
  29     thousand,  
  30     twothousand,  
  31     repeat(stringu1, 8),  
  32     unique1::oid,  
  33     (four + 1.0)/(hundred+1),  
  34     odd::float8 / (tenthous + 1),  
  35     format('%s:00:%s:00:%s:00', to_hex(odd), to_hex(even), to_hex(hundred))::macaddr,  
  36     inet '10.2.3.4/24' + tenthous,  
  37     cidr '10.2.3/24' + tenthous,  
  38     substr(stringu1, 1, 1)::bpchar,  
  39     date '1995-08-15' + tenthous,  
  40     time '01:20:30' + thousand * interval '18.5 second',  
  41     timestamp '1942-07-23 03:05:09' + tenthous * interval '36.38 hours',  
  42     timestamptz '1972-10-10 03:00' + thousand * interval '1 hour',  
  43     justify_days(justify_hours(tenthous * interval '12 minutes')),  
  44     timetz '01:30:20+02' + hundred * interval '15 seconds',  
  45     tenthous::numeric(36,30) * fivethous * even / (hundred + 1),  
  46     format('%s%s-%s-%s-%s-%s%s%s', to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'))::uuid,  
  47     format('%s/%s%s', odd, even, tenthous)::pg_lsn  
  48 FROM tenk1 ORDER BY unique2 LIMIT 100;  
  49 -- throw in some NULL's and different values  
  50 INSERT INTO brintest_bloom (inetcol, cidrcol) SELECT  
  51     inet 'fe80::6e40:8ff:fea9:8c46' + tenthous,  
  52     cidr 'fe80::6e40:8ff:fea9:8c46' + tenthous  
  53 FROM tenk1 ORDER BY thousand, tenthous LIMIT 25;  
  54 -- test bloom specific index options  
  55 -- ndistinct must be >= -1.0  
  56 CREATE INDEX brinidx_bloom ON brintest_bloom USING brin (  
  57     byteacol bytea_bloom_ops(n_distinct_per_range = -1.1)  
  58 );  
  59 ERROR:  value -1.1 out of bounds for option "n_distinct_per_range"  
  60 DETAIL:  Valid values are between "-1.000000" and "2147483647.000000".  
  61 -- false_positive_rate must be between 0.0001 and 0.25  
  62 CREATE INDEX brinidx_bloom ON brintest_bloom USING brin (  
  63     byteacol bytea_bloom_ops(false_positive_rate = 0.00009)  
  64 );  
  65 ERROR:  value 0.00009 out of bounds for option "false_positive_rate"  
  66 DETAIL:  Valid values are between "0.000100" and "0.250000".  
  67 CREATE INDEX brinidx_bloom ON brintest_bloom USING brin (  
  68     byteacol bytea_bloom_ops(false_positive_rate = 0.26)  
  69 );  
  70 ERROR:  value 0.26 out of bounds for option "false_positive_rate"  
  71 DETAIL:  Valid values are between "0.000100" and "0.250000".  
  72 CREATE INDEX brinidx_bloom ON brintest_bloom USING brin (  
  73     byteacol bytea_bloom_ops,  
  74     charcol char_bloom_ops,  
  75     namecol name_bloom_ops,  
  76     int8col int8_bloom_ops,  
  77     int2col int2_bloom_ops,  
  78     int4col int4_bloom_ops,  
  79     textcol text_bloom_ops,  
  80     oidcol oid_bloom_ops,  
  81     float4col float4_bloom_ops,  
  82     float8col float8_bloom_ops,  
  83     macaddrcol macaddr_bloom_ops,  
  84     inetcol inet_bloom_ops,  
  85     cidrcol inet_bloom_ops,  
  86     bpcharcol bpchar_bloom_ops,  
  87     datecol date_bloom_ops,  
  88     timecol time_bloom_ops,  
  89     timestampcol timestamp_bloom_ops,  
  90     timestamptzcol timestamptz_bloom_ops,  
  91     intervalcol interval_bloom_ops,  
  92     timetzcol timetz_bloom_ops,  
  93     numericcol numeric_bloom_ops,  
  94     uuidcol uuid_bloom_ops,  
  95     lsncol pg_lsn_bloom_ops  
  96 ) with (pages_per_range = 1);  
  97 CREATE TABLE brinopers_bloom (colname name, typ text,  
  98     op text[], value text[], matches int[],  
  99     check (cardinality(op) = cardinality(value)),  
 100     check (cardinality(op) = cardinality(matches)));  
 101 INSERT INTO brinopers_bloom VALUES  
 102     ('byteacol', 'bytea',  
 103      '{=}',  
 104      '{BNAAAABNAAAABNAAAABNAAAABNAAAABNAAAABNAAAABNAAAA}',  
 105      '{1}'),  
 106     ('charcol', '"char"',  
 107      '{=}',  
 108      '{M}',  
 109      '{6}'),  
 110     ('namecol', 'name',  
 111      '{=}',  
 112      '{MAAAAA}',  
 113      '{2}'),  
 114     ('int2col', 'int2',  
 115      '{=}',  
 116      '{800}',  
 117      '{1}'),  
 118     ('int4col', 'int4',  
 119      '{=}',  
 120      '{800}',  
 121      '{1}'),  
 122     ('int8col', 'int8',  
 123      '{=}',  
 124      '{1257141600}',  
 125      '{1}'),  
 126     ('textcol', 'text',  
 127      '{=}',  
 128      '{BNAAAABNAAAABNAAAABNAAAABNAAAABNAAAABNAAAABNAAAA}',  
 129      '{1}'),  
 130     ('oidcol', 'oid',  
 131      '{=}',  
 132      '{8800}',  
 133      '{1}'),  
 134     ('float4col', 'float4',  
 135      '{=}',  
 136      '{1}',  
 137      '{4}'),  
 138     ('float8col', 'float8',  
 139      '{=}',  
 140      '{0}',  
 141      '{1}'),  
 142     ('macaddrcol', 'macaddr',  
 143      '{=}',  
 144      '{2c:00:2d:00:16:00}',  
 145      '{2}'),  
 146     ('inetcol', 'inet',  
 147      '{=}',  
 148      '{10.2.14.231/24}',  
 149      '{1}'),  
 150     ('inetcol', 'cidr',  
 151      '{=}',  
 152      '{fe80::6e40:8ff:fea9:8c46}',  
 153      '{1}'),  
 154     ('cidrcol', 'inet',  
 155      '{=}',  
 156      '{10.2.14/24}',  
 157      '{2}'),  
 158     ('cidrcol', 'inet',  
 159      '{=}',  
 160      '{fe80::6e40:8ff:fea9:8c46}',  
 161      '{1}'),  
 162     ('cidrcol', 'cidr',  
 163      '{=}',  
 164      '{10.2.14/24}',  
 165      '{2}'),  
 166     ('cidrcol', 'cidr',  
 167      '{=}',  
 168      '{fe80::6e40:8ff:fea9:8c46}',  
 169      '{1}'),  
 170     ('bpcharcol', 'bpchar',  
 171      '{=}',  
 172      '{W}',  
 173      '{6}'),  
 174     ('datecol', 'date',  
 175      '{=}',  
 176      '{2009-12-01}',  
 177      '{1}'),  
 178     ('timecol', 'time',  
 179      '{=}',  
 180      '{02:28:57}',  
 181      '{1}'),  
 182     ('timestampcol', 'timestamp',  
 183      '{=}',  
 184      '{1964-03-24 19:26:45}',  
 185      '{1}'),  
 186     ('timestamptzcol', 'timestamptz',  
 187      '{=}',  
 188      '{1972-10-19 09:00:00-07}',  
 189      '{1}'),  
 190     ('intervalcol', 'interval',  
 191      '{=}',  
 192      '{1 mons 13 days 12:24}',  
 193      '{1}'),  
 194     ('timetzcol', 'timetz',  
 195      '{=}',  
 196      '{01:35:50+02}',  
 197      '{2}'),  
 198     ('numericcol', 'numeric',  
 199      '{=}',  
 200      '{2268164.347826086956521739130434782609}',  
 201      '{1}'),  
 202     ('uuidcol', 'uuid',  
 203      '{=}',  
 204      '{52225222-5222-5222-5222-522252225222}',  
 205      '{1}'),  
 206     ('lsncol', 'pg_lsn',  
 207      '{=, IS, IS NOT}',  
 208      '{44/455222, NULL, NULL}',  
 209      '{1, 25, 100}');  
 210 DO $x$  
 211 DECLARE  
 212     r record;  
 213     r2 record;  
 214     cond text;  
 215     idx_ctids tid[];  
 216     ss_ctids tid[];  
 217     count int;  
 218     plan_ok bool;  
 219     plan_line text;  
 220 BEGIN  
 221     FOR r IN SELECT colname, oper, typ, value[ordinality], matches[ordinality] FROM brinopers_bloom, unnest(op) WITH ORDINALITY AS oper LOOP  
 222   
 223         -- prepare the condition  
 224         IF r.value IS NULL THEN  
 225             cond := format('%I %s %L', r.colname, r.oper, r.value);  
 226         ELSE  
 227             cond := format('%I %s %L::%s', r.colname, r.oper, r.value, r.typ);  
 228         END IF;  
 229   
 230         -- run the query using the brin index  
 231         SET enable_seqscan = 0;  
 232         SET enable_bitmapscan = 1;  
 233   
 234         plan_ok := false;  
 235         FOR plan_line IN EXECUTE format($y$EXPLAIN SELECT array_agg(ctid) FROM brintest_bloom WHERE %s $y$, cond) LOOP  
 236             IF plan_line LIKE '%Bitmap Heap Scan on brintest_bloom%' THEN  
 237                 plan_ok := true;  
 238             END IF;  
 239         END LOOP;  
 240         IF NOT plan_ok THEN  
 241             RAISE WARNING 'did not get bitmap indexscan plan for %', r;  
 242         END IF;  
 243   
 244         EXECUTE format($y$SELECT array_agg(ctid) FROM brintest_bloom WHERE %s $y$, cond)  
 245             INTO idx_ctids;  
 246   
 247         -- run the query using a seqscan  
 248         SET enable_seqscan = 1;  
 249         SET enable_bitmapscan = 0;  
 250   
 251         plan_ok := false;  
 252         FOR plan_line IN EXECUTE format($y$EXPLAIN SELECT array_agg(ctid) FROM brintest_bloom WHERE %s $y$, cond) LOOP  
 253             IF plan_line LIKE '%Seq Scan on brintest_bloom%' THEN  
 254                 plan_ok := true;  
 255             END IF;  
 256         END LOOP;  
 257         IF NOT plan_ok THEN  
 258             RAISE WARNING 'did not get seqscan plan for %', r;  
 259         END IF;  
 260   
 261         EXECUTE format($y$SELECT array_agg(ctid) FROM brintest_bloom WHERE %s $y$, cond)  
 262             INTO ss_ctids;  
 263   
 264         -- make sure both return the same results  
 265         count := array_length(idx_ctids, 1);  
 266   
 267         IF NOT (count = array_length(ss_ctids, 1) AND  
 268                 idx_ctids @> ss_ctids AND  
 269                 idx_ctids <@ ss_ctids) THEN  
 270             -- report the results of each scan to make the differences obvious  
 271             RAISE WARNING 'something not right in %: count %', r, count;  
 272             SET enable_seqscan = 1;  
 273             SET enable_bitmapscan = 0;  
 274             FOR r2 IN EXECUTE 'SELECT ' || r.colname || ' FROM brintest_bloom WHERE ' || cond LOOP  
 275                 RAISE NOTICE 'seqscan: %', r2;  
 276             END LOOP;  
 277   
 278             SET enable_seqscan = 0;  
 279             SET enable_bitmapscan = 1;  
 280             FOR r2 IN EXECUTE 'SELECT ' || r.colname || ' FROM brintest_bloom WHERE ' || cond LOOP  
 281                 RAISE NOTICE 'bitmapscan: %', r2;  
 282             END LOOP;  
 283         END IF;  
 284   
 285         -- make sure we found expected number of matches  
 286         IF count != r.matches THEN RAISE WARNING 'unexpected number of results % for %', count, r; END IF;  
 287     END LOOP;  
 288 END;  
 289 $x$;  
 290 RESET enable_seqscan;  
 291 RESET enable_bitmapscan;  
 292 INSERT INTO brintest_bloom SELECT  
 293     repeat(stringu1, 42)::bytea,  
 294     substr(stringu1, 1, 1)::"char",  
 295     stringu1::name, 142857 * tenthous,  
 296     thousand,  
 297     twothousand,  
 298     repeat(stringu1, 42),  
 299     unique1::oid,  
 300     (four + 1.0)/(hundred+1),  
 301     odd::float8 / (tenthous + 1),  
 302     format('%s:00:%s:00:%s:00', to_hex(odd), to_hex(even), to_hex(hundred))::macaddr,  
 303     inet '10.2.3.4' + tenthous,  
 304     cidr '10.2.3/24' + tenthous,  
 305     substr(stringu1, 1, 1)::bpchar,  
 306     date '1995-08-15' + tenthous,  
 307     time '01:20:30' + thousand * interval '18.5 second',  
 308     timestamp '1942-07-23 03:05:09' + tenthous * interval '36.38 hours',  
 309     timestamptz '1972-10-10 03:00' + thousand * interval '1 hour',  
 310     justify_days(justify_hours(tenthous * interval '12 minutes')),  
 311     timetz '01:30:20' + hundred * interval '15 seconds',  
 312     tenthous::numeric(36,30) * fivethous * even / (hundred + 1),  
 313     format('%s%s-%s-%s-%s-%s%s%s', to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'))::uuid,  
 314     format('%s/%s%s', odd, even, tenthous)::pg_lsn  
 315 FROM tenk1 ORDER BY unique2 LIMIT 5 OFFSET 5;  
 316 SELECT brin_desummarize_range('brinidx_bloom', 0);  
 317  brin_desummarize_range   
 318 ------------------------  
 319    
 320 (1 row)  
 321   
 322 VACUUM brintest_bloom;  -- force a summarization cycle in brinidx  
 323 UPDATE brintest_bloom SET int8col = int8col * int4col;  
 324 UPDATE brintest_bloom SET textcol = '' WHERE textcol IS NOT NULL;  
 325 -- Tests for brin_summarize_new_values  
 326 SELECT brin_summarize_new_values('brintest_bloom'); -- error, not an index  
 327 ERROR:  "brintest_bloom" is not an index  
 328 SELECT brin_summarize_new_values('tenk1_unique1'); -- error, not a BRIN index  
 329 ERROR:  "tenk1_unique1" is not a BRIN index  
 330 SELECT brin_summarize_new_values('brinidx_bloom'); -- ok, no change expected  
 331  brin_summarize_new_values   
 332 ---------------------------  
 333                          0  
 334 (1 row)  
 335   
 336 -- Tests for brin_desummarize_range  
 337 SELECT brin_desummarize_range('brinidx_bloom', -1); -- error, invalid range  
 338 ERROR:  block number out of range: -1  
 339 SELECT brin_desummarize_range('brinidx_bloom', 0);  
 340  brin_desummarize_range   
 341 ------------------------  
 342    
 343 (1 row)  
 344   
 345 SELECT brin_desummarize_range('brinidx_bloom', 0);  
 346  brin_desummarize_range   
 347 ------------------------  
 348    
 349 (1 row)  
 350   
 351 SELECT brin_desummarize_range('brinidx_bloom', 100000000);  
 352  brin_desummarize_range   
 353 ------------------------  
 354    
 355 (1 row)  
 356   
 357 -- Test brin_summarize_range  
 358 CREATE TABLE brin_summarize_bloom (  
 359     value int  
 360 ) WITH (fillfactor=10, autovacuum_enabled=false);  
 361 CREATE INDEX brin_summarize_bloom_idx ON brin_summarize_bloom USING brin (value) WITH (pages_per_range=2);  
 362 -- Fill a few pages  
 363 DO $$  
 364 DECLARE curtid tid;  
 365 BEGIN  
 366   LOOP  
 367     INSERT INTO brin_summarize_bloom VALUES (1) RETURNING ctid INTO curtid;  
 368     EXIT WHEN curtid > tid '(2, 0)';  
 369   END LOOP;  
 370 END;  
 371 $$;  
 372 -- summarize one range  
 373 SELECT brin_summarize_range('brin_summarize_bloom_idx', 0);  
 374  brin_summarize_range   
 375 ----------------------  
 376                     0  
 377 (1 row)  
 378   
 379 -- nothing: already summarized  
 380 SELECT brin_summarize_range('brin_summarize_bloom_idx', 1);  
 381  brin_summarize_range   
 382 ----------------------  
 383                     0  
 384 (1 row)  
 385   
 386 -- summarize one range  
 387 SELECT brin_summarize_range('brin_summarize_bloom_idx', 2);  
 388  brin_summarize_range   
 389 ----------------------  
 390                     1  
 391 (1 row)  
 392   
 393 -- nothing: page doesn't exist in table  
 394 SELECT brin_summarize_range('brin_summarize_bloom_idx', 4294967295);  
 395  brin_summarize_range   
 396 ----------------------  
 397                     0  
 398 (1 row)  
 399   
 400 -- invalid block number values  
 401 SELECT brin_summarize_range('brin_summarize_bloom_idx', -1);  
 402 ERROR:  block number out of range: -1  
 403 SELECT brin_summarize_range('brin_summarize_bloom_idx', 4294967296);  
 404 ERROR:  block number out of range: 4294967296  
 405 -- test brin cost estimates behave sanely based on correlation of values  
 406 CREATE TABLE brin_test_bloom (a INT, b INT);  
 407 INSERT INTO brin_test_bloom SELECT x/100,x%100 FROM generate_series(1,10000) x(x);  
 408 CREATE INDEX brin_test_bloom_a_idx ON brin_test_bloom USING brin (a) WITH (pages_per_range = 2);  
 409 CREATE INDEX brin_test_bloom_b_idx ON brin_test_bloom USING brin (b) WITH (pages_per_range = 2);  
 410 VACUUM ANALYZE brin_test_bloom;  
 411 -- Ensure brin index is used when columns are perfectly correlated  
 412 EXPLAIN (COSTS OFF) SELECT * FROM brin_test_bloom WHERE a = 1;  
 413                     QUERY PLAN                      
 414 --------------------------------------------------  
 415  Bitmap Heap Scan on brin_test_bloom  
 416    Recheck Cond: (a = 1)  
 417    ->  Bitmap Index Scan on brin_test_bloom_a_idx  
 418          Index Cond: (a = 1)  
 419 (4 rows)  
 420   
 421 -- Ensure brin index is not used when values are not correlated  
 422 EXPLAIN (COSTS OFF) SELECT * FROM brin_test_bloom WHERE b = 1;  
 423          QUERY PLAN            
 424 -----------------------------  
 425  Seq Scan on brin_test_bloom  
 426    Filter: (b = 1)  
 427 (2 rows)  

原理:

   1 /*  
   2  * brin_bloom.c  
   3  *      Implementation of Bloom opclass for BRIN  
   4  *  
   5  * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group  
   6  * Portions Copyright (c) 1994, Regents of the University of California  
   7  *  
   8  *  
   9  * A BRIN opclass summarizing page range into a bloom filter.  
  10  *  
  11  * Bloom filters allow efficient testing whether a given page range contains  
  12  * a particular value. Therefore, if we summarize each page range into a small  
  13  * bloom filter, we can easily (and cheaply) test whether it contains values  
  14  * we get later.  
  15  *  
  16  * The index only supports equality operators, similarly to hash indexes.  
  17  * Bloom indexes are however much smaller, and support only bitmap scans.  
  18  *  
  19  * Note: Don't confuse this with bloom indexes, implemented in a contrib  
  20  * module. That extension implements an entirely new AM, building a bloom  
  21  * filter on multiple columns in a single row. This opclass works with an  
  22  * existing AM (BRIN) and builds bloom filter on a column.  
  23  *  
  24  *  
  25  * values vs. hashes  
  26  * -----------------  
  27  *  
  28  * The original column values are not used directly, but are first hashed  
  29  * using the regular type-specific hash function, producing a uint32 hash.  
  30  * And this hash value is then added to the summary - i.e. it's hashed  
  31  * again and added to the bloom filter.  
  32  *  
  33  * This allows the code to treat all data types (byval/byref/...) the same  
  34  * way, with only minimal space requirements, because we're working with  
  35  * hashes and not the original values. Everything is uint32.  
  36  *  
  37  * Of course, this assumes the built-in hash function is reasonably good,  
  38  * without too many collisions etc. But that does seem to be the case, at  
  39  * least based on past experience. After all, the same hash functions are  
  40  * used for hash indexes, hash partitioning and so on.  
  41  *  
  42  *  
  43  * hashing scheme  
  44  * --------------  
  45  *  
  46  * Bloom filters require a number of independent hash functions. There are  
  47  * different schemes how to construct them - for example we might use  
  48  * hash_uint32_extended with random seeds, but that seems fairly expensive.  
  49  * We use a scheme requiring only two functions described in this paper:  
  50  *  
  51  * Less Hashing, Same Performance:Building a Better Bloom Filter  
  52  * Adam Kirsch, Michael Mitzenmacher†, Harvard School of Engineering and  
  53  * Applied Sciences, Cambridge, Massachusetts [DOI 10.1002/rsa.20208]  
  54  *  
  55  * The two hash functions h1 and h2 are calculated using hard-coded seeds,  
  56  * and then combined using (h1 + i * h2) to generate the hash functions.  
  57  *  
  58  *  
  59  * sizing the bloom filter  
  60  * -----------------------  
  61  *  
  62  * Size of a bloom filter depends on the number of distinct values we will  
  63  * store in it, and the desired false positive rate. The higher the number  
  64  * of distinct values and/or the lower the false positive rate, the larger  
  65  * the bloom filter. On the other hand, we want to keep the index as small  
  66  * as possible - that's one of the basic advantages of BRIN indexes.  
  67  *  
  68  * Although the number of distinct elements (in a page range) depends on  
  69  * the data, we can consider it fixed. This simplifies the trade-off to  
  70  * just false positive rate vs. size.  
  71  *  
  72  * At the page range level, false positive rate is a probability the bloom  
  73  * filter matches a random value. For the whole index (with sufficiently  
  74  * many page ranges) it represents the fraction of the index ranges (and  
  75  * thus fraction of the table to be scanned) matching the random value.  
  76  *  
  77  * Furthermore, the size of the bloom filter is subject to implementation  
  78  * limits - it has to fit onto a single index page (8kB by default). As  
  79  * the bitmap is inherently random (when "full" about half the bits is set  
  80  * to 1, randomly), compression can't help very much.  
  81  *  
  82  * To reduce the size of a filter (to fit to a page), we have to either  
  83  * accept higher false positive rate (undesirable), or reduce the number  
  84  * of distinct items to be stored in the filter. We can't alter the input  
  85  * data, of course, but we may make the BRIN page ranges smaller - instead  
  86  * of the default 128 pages (1MB) we may build index with 16-page ranges,  
  87  * or something like that. This should reduce the number of distinct values  
  88  * in the page range, making the filter smaller (with fixed false positive  
  89  * rate). Even for random data sets this should help, as the number of rows  
  90  * per heap page is limited (to ~290 with very narrow tables, likely ~20  
  91  * in practice).  
  92  *  
  93  * Of course, good sizing decisions depend on having the necessary data,  
  94  * i.e. number of distinct values in a page range (of a given size) and  
  95  * table size (to estimate cost change due to change in false positive  
  96  * rate due to having larger index vs. scanning larger indexes). We may  
  97  * not have that data - for example when building an index on empty table  
  98  * it's not really possible. And for some data we only have estimates for  
  99  * the whole table and we can only estimate per-range values (ndistinct).  
 100  *  
 101  * Another challenge is that while the bloom filter is per-column, it's  
 102  * the whole index tuple that has to fit into a page. And for multi-column  
 103  * indexes that may include pieces we have no control over (not necessarily  
 104  * bloom filters, the other columns may use other BRIN opclasses). So it's  
 105  * not entirely clear how to distribute the space between those columns.  
 106  *  
 107  * The current logic, implemented in brin_bloom_get_ndistinct, attempts to  
 108  * make some basic sizing decisions, based on the size of BRIN ranges, and  
 109  * the maximum number of rows per range.  
 110  *  
 111  *  
 112  * IDENTIFICATION  
 113  *    src/backend/access/brin/brin_bloom.c  
 114  */  
 115 #include "postgres.h"  
 116   
 117 #include "access/genam.h"  
 118 #include "access/brin.h"  
 119 #include "access/brin_internal.h"  
 120 #include "access/brin_page.h"  
 121 #include "access/brin_tuple.h"  
 122 #include "access/hash.h"  
 123 #include "access/htup_details.h"  
 124 #include "access/reloptions.h"  
 125 #include "access/stratnum.h"  
 126 #include "catalog/pg_type.h"  
 127 #include "catalog/pg_amop.h"  
 128 #include "utils/builtins.h"  
 129 #include "utils/datum.h"  
 130 #include "utils/lsyscache.h"  
 131 #include "utils/rel.h"  
 132 #include "utils/syscache.h"  
 133   
 134 #include <math.h>  
 135   
 136 #define BloomEqualStrategyNumber    1  
 137   
 138 /*  
 139  * Additional SQL level support functions. We only have one, which is  
 140  * used to calculate hash of the input value.  
 141  *  
 142  * Procedure numbers must not use values reserved for BRIN itself; see  
 143  * brin_internal.h.  
 144  */  
 145 #define     BLOOM_MAX_PROCNUMS      1   /* maximum support procs we need */  
 146 #define     PROCNUM_HASH            11  /* required */  
 147   
 148 /*  
 149  * Subtract this from procnum to obtain index in BloomOpaque arrays  
 150  * (Must be equal to minimum of private procnums).  
 151  */  
 152 #define     PROCNUM_BASE            11  
 153   
 154 /*  
 155  * Storage type for BRIN's reloptions.  
 156  */  
 157 typedef struct BloomOptions  
 158 {  
 159     int32       vl_len_;        /* varlena header (do not touch directly!) */  
 160     double      nDistinctPerRange;  /* number of distinct values per range */  
 161     double      falsePositiveRate;  /* false positive for bloom filter */  
 162 } BloomOptions;  
 163   
 164 /*  
 165  * The current min value (16) is somewhat arbitrary, but it's based  
 166  * on the fact that the filter header is ~20B alone, which is about  
 167  * the same as the filter bitmap for 16 distinct items with 1% false  
 168  * positive rate. So by allowing lower values we'd not gain much. In  
 169  * any case, the min should not be larger than MaxHeapTuplesPerPage  
 170  * (~290), which is the theoretical maximum for single-page ranges.  
 171  */  
 172 #define     BLOOM_MIN_NDISTINCT_PER_RANGE       16  
 173   
 174 /*  
 175  * Used to determine number of distinct items, based on the number of rows  
 176  * in a page range. The 10% is somewhat similar to what estimate_num_groups  
 177  * does, so we use the same factor here.  
 178  */  
 179 #define     BLOOM_DEFAULT_NDISTINCT_PER_RANGE   -0.1    /* 10% of values */  
 180   
 181 /*  
 182  * Allowed range and default value for the false positive range. The exact  
 183  * values are somewhat arbitrary, but were chosen considering the various  
 184  * parameters (size of filter vs. page size, etc.).  
 185  *  
 186  * The lower the false-positive rate, the more accurate the filter is, but  
 187  * it also gets larger - at some point this eliminates the main advantage  
 188  * of BRIN indexes, which is the tiny size. At 0.01% the index is about  
 189  * 10% of the table (assuming 290 distinct values per 8kB page).  
 190  *  
 191  * On the other hand, as the false-positive rate increases, larger part of  
 192  * the table has to be scanned due to mismatches - at 25% we're probably  
 193  * close to sequential scan being cheaper.  
 194  */  
 195 #define     BLOOM_MIN_FALSE_POSITIVE_RATE   0.0001  /* 0.01% fp rate */  
 196 #define     BLOOM_MAX_FALSE_POSITIVE_RATE   0.25    /* 25% fp rate */  
 197 #define     BLOOM_DEFAULT_FALSE_POSITIVE_RATE   0.01    /* 1% fp rate */  
 198   
 199 #define BloomGetNDistinctPerRange(opts) \  
 200     ((opts) && (((BloomOptions *) (opts))->nDistinctPerRange != 0) ? \  
 201      (((BloomOptions *) (opts))->nDistinctPerRange) : \  
 202      BLOOM_DEFAULT_NDISTINCT_PER_RANGE)  
 203   
 204 #define BloomGetFalsePositiveRate(opts) \  
 205     ((opts) && (((BloomOptions *) (opts))->falsePositiveRate != 0.0) ? \  
 206      (((BloomOptions *) (opts))->falsePositiveRate) : \  
 207      BLOOM_DEFAULT_FALSE_POSITIVE_RATE)  
 208   
 209 /*  
 210  * And estimate of the largest bloom we can fit onto a page. This is not  
 211  * a perfect guarantee, for a couple of reasons. For example, the row may  
 212  * be larger because the index has multiple columns.  
 213  */  
 214 #define BloomMaxFilterSize \  
 215     MAXALIGN_DOWN(BLCKSZ - \  
 216                   (MAXALIGN(SizeOfPageHeaderData + \  
 217                             sizeof(ItemIdData)) + \  
 218                    MAXALIGN(sizeof(BrinSpecialSpace)) + \  
 219                    SizeOfBrinTuple))  
 220   
 221 /*  
 222  * Seeds used to calculate two hash functions h1 and h2, which are then used  
 223  * to generate k hashes using the (h1 + i * h2) scheme.  
 224  */  
 225 #define BLOOM_SEED_1    0x71d924af  
 226 #define BLOOM_SEED_2    0xba48b314  
 227   
 228 /*  
 229  * Bloom Filter  
 230  *  
 231  * Represents a bloom filter, built on hashes of the indexed values. That is,  
 232  * we compute a uint32 hash of the value, and then store this hash into the  
 233  * bloom filter (and compute additional hashes on it).  
 234  *  
 235  * XXX We could implement "sparse" bloom filters, keeping only the bytes that  
 236  * are not entirely 0. But while indexes don't support TOAST, the varlena can  
 237  * still be compressed. So this seems unnecessary, because the compression  
 238  * should do the same job.  
 239  *  
 240  * XXX We can also watch the number of bits set in the bloom filter, and then  
 241  * stop using it (and not store the bitmap, to save space) when the false  
 242  * positive rate gets too high. But even if the false positive rate exceeds the  
 243  * desired value, it still can eliminate some page ranges.  
 244  */  
 245 typedef struct BloomFilter  
 246 {  

您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森.

digoal's wechat