Skip to content

Latest commit

 

History

History
983 lines (765 loc) · 43.8 KB

20160528_01.md

File metadata and controls

983 lines (765 loc) · 43.8 KB

深入浅出PostgreSQL B-Tree索引结构

作者

digoal

日期

2016-05-28

标签

PostgreSQL , b-tree , 索引结构


背景

PostgreSQL B-Tree是一种变种(high-concurrency B-tree management algorithm),算法详情请参考

src/backend/access/nbtree/README

PostgreSQL 的B-Tree索引页分为几种类别

meta page      
root page         #  btpo_flags=2      
branch page    #  btpo_flags=0      
leaf page         #  btpo_flags=1      
      
如果即是leaf又是root则  btpo_flags=3。        

其中meta page和root page是必须有的,meta page需要一个页来存储,表示指向root page的page id。

随着记录数的增加,一个root page可能存不下所有的heap item,就会有leaf page,甚至branch page,甚至多层的branch page。

一共有几层branch 和 leaf,就用btree page元数据的 level 来表示。

4

我们可以使用pageinspect插件,内窥B-Tree的结构。

层次可以从bt_page_stats的btpo得到,代表当前index page所处的层级。

注意层级并不是唯一的,例如btpo=3的层级,可能有分几个档。

打个比喻,腾讯的技术岗位级别T3,对应T3这个级别又有几个小的档位。和这里的含义差不多,只是没有区分小档位的值,但是后面我们能看到它的存在。

btpo=0级表示最底层,处于这个层级的index pages存储的items(ctid)是指向heap page的。

类别和层级不挂钩,类别里面又可以有多个层级,但是只有层级=0的index page存储的ctid内容才是指向heap page的; 其他层级index page存储的ctid内容都是指向同层级其他index page(双向链表),或者指下级的index page。

1.

0层结构,只有meta和root页。

root页最多可以存储的item数,取决于索引字段数据的长度、以及索引页的大小。

1

例子

postgres=# create extension pageinspect;      
      
postgres=# create table tab1(id int primary key, info text);      
CREATE TABLE      
postgres=# insert into tab1 select generate_series(1,100), md5(random()::text);      
INSERT 0 100      
postgres=# vacuum analyze tab1;      
VACUUM      

查看meta page,可以看到root page id = 1 。

索引的level = 0, 说明没有branch和leaf page。

postgres=# select * from bt_metap('tab1_pkey');      
 magic  | version | root | level | fastroot | fastlevel       
--------+---------+------+-------+----------+-----------      
 340322 |       2 |    1 |     0 |        1 |         0      
(1 row)      

根据root page id = 1查看root page的stats

btpo=0 说明已经到了最底层

btpo_flags=3,说明它既是leaf又是root页。

postgres=# select * from bt_page_stats('tab1_pkey',1);      
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags       
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------      
     1 | l    |        100 |          0 |            16 |      8192 |      6148 |         0 |         0 |    0 |          3      
(1 row)      

btpo_prev和btpo_next分别表示该页的相邻页(branch page是双向链表)。

btpo_flags 可以在代码中查看(src/include/access/nbtree.h),一共有几个

/* Bits defined in btpo_flags */      
#define BTP_LEAF                (1 << 0)        /* leaf page, i.e. not internal page */      
#define BTP_ROOT                (1 << 1)        /* root page (has no parent) */      
#define BTP_DELETED             (1 << 2)        /* page has been deleted from tree */      
#define BTP_META                (1 << 3)        /* meta-page */      
#define BTP_HALF_DEAD   (1 << 4)        /* empty, but still in tree */      
#define BTP_SPLIT_END   (1 << 5)        /* rightmost page of split group */      
#define BTP_HAS_GARBAGE (1 << 6)        /* page has LP_DEAD tuples */      
#define BTP_INCOMPLETE_SPLIT (1 << 7)   /* right sibling's downlink is missing */      

查看0级 page存储的ctid (即items)

0级ctid 表示存储的是 heap页的寻址。 (如果是多层结构,那么branch page中的ctid, 它表示的是同级btree页(链条项ctid)或者下级btree页的寻址) 。

当ctid指向heap时, data是对应的列值。(多级结构的data意义不一样,后面会讲)

postgres=# select * from bt_page_items('tab1_pkey',1);      
 itemoffset |  ctid   | itemlen | nulls | vars |          data                 
------------+---------+---------+-------+------+-------------------------      
          1 | (0,1)   |      16 | f     | f    | 01 00 00 00 00 00 00 00      
          2 | (0,2)   |      16 | f     | f    | 02 00 00 00 00 00 00 00      
...      
         99 | (0,99)  |      16 | f     | f    | 63 00 00 00 00 00 00 00      
        100 | (0,100) |      16 | f     | f    | 64 00 00 00 00 00 00 00      
(100 rows)      

根据ctid 查看heap记录

postgres=# select * from tab1 where ctid='(0,100)';      
 id  |               info                     
-----+----------------------------------      
 100 | 68b63c269ee8cc2d99fe204f04d0ffcb      
(1 row)      

2.

1层结构,包括meta page, root page, leaf page.

2

例子

postgres=# truncate tab1;      
TRUNCATE TABLE      
postgres=# insert into tab1 select generate_series(1,1000), md5(random()::text);      
INSERT 0 1000      
postgres=# vacuum analyze tab1;      
VACUUM      

查看meta page,可以看到root page id = 3, 索引的level = 1。

level = 1 表示包含了leaf page。

postgres=# select * from bt_metap('tab1_pkey');      
 magic  | version | root | level | fastroot | fastlevel       
--------+---------+------+-------+----------+-----------      
 340322 |       2 |    3 |     1 |        3 |         1      
(1 row)      

根据root page id 查看root page的stats

btpo = 1 说明还没有到最底层(最底层btpo=0, 这种页里面存储的ctid才代表指向heap page的地址)

btpo_flags=2 说明这个页是root page

postgres=# select * from bt_page_stats('tab1_pkey',3);      
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags       
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------      
     3 | r    |          3 |          0 |            13 |      8192 |      8096 |         0 |         0 |    1 |          2      
(1 row)      

查看root page存储的 leaf page items (指向leaf page)

一共3个leaf pages, data存储的是这个leaf page存储的最小值。

postgres=# select * from bt_page_items('tab1_pkey',3);      
 itemoffset | ctid  | itemlen | nulls | vars |          data                 
------------+-------+---------+-------+------+-------------------------      
          1 | (1,1) |       8 | f     | f    |       
          2 | (2,1) |      16 | f     | f    | 6f 01 00 00 00 00 00 00      
          3 | (4,1) |      16 | f     | f    | dd 02 00 00 00 00 00 00      
(3 rows)      

第一条为空,是因为这个leaf page是最左边的PAGE,不存最小值。

对于有右leaf page的leaf page,第一条存储的heap item为该页的右链路。

第二条才是起始ITEM。

另外需要注意,虽然在item里面只存储右链,leaf page还是双向链表,在stats能看到它的prev 和next page。

根据leaf page id查看stats

最左leaf page = 1

prev btpo 指向meta page

可以看到btpo = 0了,说明这个页是底层页。        
btpo_flags=1 说明是leaf page        
postgres=# select * from bt_page_stats('tab1_pkey',1);      
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags       
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------      
     1 | l    |        367 |          0 |            16 |      8192 |       808 |         0 |         2 |    0 |          1      
(1 row)      

next btpo 指向meta page

最右leaf page = 4

btpo_flags=1 说明是leaf page

postgres=# select * from bt_page_stats('tab1_pkey',4);      
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags       
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------      
     4 | l    |        268 |          0 |            16 |      8192 |      2788 |         2 |         0 |    0 |          1      
(1 row)      

中间leaf page = 2

btpo_flags=1 说明是leaf page

postgres=# select * from bt_page_stats('tab1_pkey',2);      
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags       
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------      
     2 | l    |        367 |          0 |            16 |      8192 |       808 |         1 |         4 |    0 |          1      
(1 row)      

查看leaf page存储的 heap ctid (即heap items)

含右页的例子, index page 1

第一条为右链表的第一条item, 第二条才是起始item

postgres=# select * from bt_page_items('tab1_pkey',1);      
 itemoffset |  ctid   | itemlen | nulls | vars |          data                 
------------+---------+---------+-------+------+-------------------------      
          1 | (3,7)   |      16 | f     | f    | 6f 01 00 00 00 00 00 00      
          2 | (0,1)   |      16 | f     | f    | 01 00 00 00 00 00 00 00      
          3 | (0,2)   |      16 | f     | f    | 02 00 00 00 00 00 00 00      
...      
        367 | (3,6)   |      16 | f     | f    | 6e 01 00 00 00 00 00 00      
(367 rows)      

不含右页的例子, index page 4

第一条就是起始ctid (即items)

postgres=# select * from bt_page_items('tab1_pkey',4);      
 itemoffset |  ctid   | itemlen | nulls | vars |          data                 
------------+---------+---------+-------+------+-------------------------      
          1 | (6,13)  |      16 | f     | f    | dd 02 00 00 00 00 00 00      
          2 | (6,14)  |      16 | f     | f    | de 02 00 00 00 00 00 00      
...      
        268 | (8,40)  |      16 | f     | f    | e8 03 00 00 00 00 00 00      
(268 rows)      

根据ctid 查看heap记录

postgres=#              select * from tab1 where ctid='(0,1)';      
 id |               info                     
----+----------------------------------      
  1 | 6ebc6b77aebf5dd11621a2ed846c08c4      
(1 row)      

3.

记录数超过1层结构的索引可以存储的记录数时,会分裂为2层结构,除了meta page和root page,还可能包含1层branch page以及1层leaf page。

如果是边界页(branch or leaf),那么其中一个方向没有PAGE,这个方向的链表信息都统一指向meta page。

3

例子

create table tbl1(id int primary key, info text);        
postgres=# select 285^2;      
 ?column?       
----------      
    81225      
(1 row)      
postgres=# insert into tab2 select trunc(random()*10000000), md5(random()::text) from generate_series(1,1000000) on conflict on constraint tab2_pkey do nothing;      
INSERT 0 951379      
postgres=# vacuum analyze tab2;      
VACUUM      

查看meta page,可以看到root page id = 412, 索引的level=2,即包括1级 branch 和 1级 leaf。

postgres=# select * from bt_metap('tab2_pkey');      
 magic  | version | root | level | fastroot | fastlevel       
--------+---------+------+-------+----------+-----------      
 340322 |       2 |  412 |     2 |      412 |         2      
(1 row)      

根据root page id 查看root page的stats

btpo = 2 当前在第二层,另外还表示下层是1

btpo_flags = 2 说明是root page

postgres=# select * from bt_page_stats('tab2_pkey', 412);      
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags       
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------      
   412 | r    |         11 |          0 |            15 |      8192 |      7936 |         0 |         0 |    2 |          2      
(1 row)      

查看root page存储的 branch page items (指向branch page)

postgres=# select * from bt_page_items('tab2_pkey', 412);      
 itemoffset |   ctid   | itemlen | nulls | vars |          data                 
------------+----------+---------+-------+------+-------------------------      
          1 | (3,1)    |       8 | f     | f    |       
          2 | (2577,1) |      16 | f     | f    | e1 78 0b 00 00 00 00 00      
          3 | (1210,1) |      16 | f     | f    | ec 3a 18 00 00 00 00 00      
          4 | (2316,1) |      16 | f     | f    | de 09 25 00 00 00 00 00      
          5 | (574,1)  |      16 | f     | f    | aa e8 33 00 00 00 00 00      
          6 | (2278,1) |      16 | f     | f    | 85 90 40 00 00 00 00 00      
          7 | (1093,1) |      16 | f     | f    | f6 e9 4e 00 00 00 00 00      
          8 | (2112,1) |      16 | f     | f    | a3 60 5c 00 00 00 00 00      
          9 | (411,1)  |      16 | f     | f    | b2 ea 6b 00 00 00 00 00      
         10 | (2073,1) |      16 | f     | f    | db de 79 00 00 00 00 00      
         11 | (1392,1) |      16 | f     | f    | df b0 8a 00 00 00 00 00      
(11 rows)      

根据branch page id查看stats

btpo = 1 当前在第一层 ,另外还表示下层是0

btpo_flags = 0 说明是branch page

postgres=# select * from bt_page_stats('tab2_pkey', 3);      
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags       
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------      
     3 | i    |        254 |          0 |            15 |      8192 |      3076 |         0 |      2577 |    1 |          0      
(1 row)      

查看branch page存储的 leaf page ctid (指向leaf page)

只要不是最右边的页,第一条都代表右页的起始item。

第二条才是当前页的起始ctid

注意所有branch page的起始item对应的data都是空的。

也就是说它不存储当前branch page包含的所有leaf pages的索引字段内容的最小值。

postgres=# select * from bt_page_items('tab2_pkey', 3);      
 itemoffset |   ctid   | itemlen | nulls | vars |          data                 
------------+----------+---------+-------+------+-------------------------      
          1 | (735,1)  |      16 | f     | f    | e1 78 0b 00 00 00 00 00      
          2 | (1,1)    |       8 | f     | f    |       
          3 | (2581,1) |      16 | f     | f    | a8 09 00 00 00 00 00 00      
          4 | (1202,1) |      16 | f     | f    | f8 13 00 00 00 00 00 00      
...      
        254 | (3322,1) |      16 | f     | f    | ee 6f 0b 00 00 00 00 00      
(254 rows)      

根据ctid 查看leaf page

btpo = 0 当前在第0层,即最底层,这里存储的是heap ctid

btpo_flags = 1 说明是leaf page

postgres=# select * from bt_page_stats('tab2_pkey', 1);      
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags       
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------      
     1 | l    |        242 |          0 |            16 |      8192 |      3308 |         0 |      2581 |    0 |          1      
(1 row)      
      
postgres=# select * from bt_page_items('tab2_pkey', 1);      
 itemoffset |    ctid    | itemlen | nulls | vars |          data                 
------------+------------+---------+-------+------+-------------------------      
          1 | (4985,16)  |      16 | f     | f    | a8 09 00 00 00 00 00 00      
          2 | (7305,79)  |      16 | f     | f    | 01 00 00 00 00 00 00 00      
          3 | (2757,120) |      16 | f     | f    | 09 00 00 00 00 00 00 00      
...      
        242 | (1329,101) |      16 | f     | f    | a0 09 00 00 00 00 00 00      
(242 rows)      

查看leaf page中包含的heap page items。

如果我们根据索引页结构的原理,能推算出来(7305,79)是最小值,取它就没错了。

postgres=# select * from tab2 where ctid='(7305,79)';      
 id |               info                     
----+----------------------------------      
  1 | 18aaeb74c359355311ac825ae2aeb22a      
(1 row)      
      
postgres=# select min(id) from tab2;      
 min       
-----      
   1      
(1 row)      

4.

多层结构,除了meta page,还可能包含多层branch page,以及一层leaf page。

4

例子

postgres=# create table tab3(id int primary key, info text);      
CREATE TABLE      
postgres=# insert into tab3 select generate_series(1, 100000000), md5(random()::text);        

查看meta page, 注意level,已经是3级了。

meta page      
postgres=# select * from bt_metap('tab3_pkey');      
 magic  | version |  root  | level | fastroot | fastlevel       
--------+---------+--------+-------+----------+-----------      
 340322 |       2 | 116816 |     3 |   116816 |         3      
(1 row)      

btpo_flags=2 代表 root page

btpo = 3 代表第3层

postgres=# select * from bt_page_stats('tab3_pkey', 116816);      
 blkno  | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags       
--------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------      
 116816 | r    |          3 |          0 |            13 |      8192 |      8096 |         0 |         0 |    3 |          2      
(1 row)      
      
postgres=# select * from bt_page_items('tab3_pkey', 116816);      
 itemoffset |    ctid    | itemlen | nulls | vars |          data                 
------------+------------+---------+-------+------+-------------------------      
          1 | (412,1)    |       8 | f     | f    |       
          2 | (116815,1) |      16 | f     | f    | 5f 9e c5 01 00 00 00 00      
          3 | (198327,1) |      16 | f     | f    | bd 3c 8b 03 00 00 00 00      
(3 rows)      

btpo_flags=0 代表 branch page

btpo = 2 代表第2层

postgres=# select * from bt_page_stats('tab3_pkey', 412);      
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags       
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------      
   412 | i    |        286 |          0 |            15 |      8192 |      2436 |         0 |    116815 |    2 |          0      
(1 row)      
      
postgres=# select * from bt_page_items('tab3_pkey', 412);      
 itemoffset |   ctid    | itemlen | nulls | vars |          data                 
------------+-----------+---------+-------+------+-------------------------      
          1 | (81636,1) |      16 | f     | f    | 5f 9e c5 01 00 00 00 00  -- 这是指向当前层级右页的ctid      
          2 | (3,1)     |       8 | f     | f    |    -- 注意第一条初始值是这      
          3 | (411,1)   |      16 | f     | f    | 77 97 01 00 00 00 00 00      
          4 | (698,1)   |      16 | f     | f    | ed 2e 03 00 00 00 00 00      
...      
        286 | (81350,1) |      16 | f     | f    | e9 06 c4 01 00 00 00 00      
(286 rows)      

btpo_flags=0 代表 branch page

btpo = 1 代表第1层

postgres=# select * from bt_page_stats('tab3_pkey', 3);      
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags       
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------      
     3 | i    |        286 |          0 |            15 |      8192 |      2436 |         0 |       411 |    1 |          0      
(1 row)      
      
postgres=# select * from bt_page_items('tab3_pkey', 3);      
 itemoffset |  ctid   | itemlen | nulls | vars |          data                 
------------+---------+---------+-------+------+-------------------------      
          1 | (287,1) |      16 | f     | f    | 77 97 01 00 00 00 00 00      
          2 | (1,1)   |       8 | f     | f    |       
          3 | (2,1)   |      16 | f     | f    | 6f 01 00 00 00 00 00 00      
          4 | (4,1)   |      16 | f     | f    | dd 02 00 00 00 00 00 00      
...      
        286 | (286,1) |      16 | f     | f    | 09 96 01 00 00 00 00 00      
(286 rows)      

btpo_flags=1 代表 leaf page

btpo = 0 代表第0层

postgres=# select * from bt_page_stats('tab3_pkey', 1);      
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags       
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------      
     1 | l    |        367 |          0 |            16 |      8192 |       808 |         0 |         2 |    0 |          1      
(1 row)      
      
postgres=# select * from bt_page_items('tab3_pkey', 1);      
 itemoffset |  ctid   | itemlen | nulls | vars |          data                 
------------+---------+---------+-------+------+-------------------------      
          1 | (3,7)   |      16 | f     | f    | 6f 01 00 00 00 00 00 00      
          2 | (0,1)   |      16 | f     | f    | 01 00 00 00 00 00 00 00      
          3 | (0,2)   |      16 | f     | f    | 02 00 00 00 00 00 00 00      
...      
        367 | (3,6)   |      16 | f     | f    | 6e 01 00 00 00 00 00 00      
(367 rows)      

通过第0层的ctid就可以获取到heap了.

heap tuple例子

postgres=# select * from tab3 where ctid='(0,1)';      
 id |               info                     
----+----------------------------------      
  1 | 370ee1989a2b7f5d8a5b43243596d91f      
(1 row)      

如何解释explain analyze中的扫描了多少个btree page

实战例子1

postgres=# create table tbl1(id int primary key, info text);      
CREATE TABLE      
postgres=# insert into tbl1 select trunc(random()*10000000), md5(random()::text) from generate_series(1,5000000) on conflict on constraint tbl1_pkey do nothing;      
INSERT 0 3934875      
postgres=# select ctid,* from tbl1 limit 10;      
  ctid  |   id    |               info                     
--------+---------+----------------------------------      
 (0,1)  | 2458061 | 5c91812b54bdcae602321dceaf22e276      
 (0,2)  | 8577271 | fe8e7a8be0d71a94e13b1b5a7786010b      
 (0,3)  | 4612744 | 56983e47f044b5a4655300e1868d2850      
 (0,4)  | 3690167 | 4a5ec8abf67bc018dcc113be829a59da      
 (0,5)  | 2646638 | 7686b47dcb94e56c11d69ec04d6017f3      
 (0,6)  | 6023272 | 4779d9a849c8287490be9d37a27b4637      
 (0,7)  | 7163674 | 35af37f479f48caa65033a5ef56cd75e      
 (0,8)  | 4049257 | 12fa110d927c88dce0773b546cc600c6      
 (0,9)  | 5815903 | 69ed9770ede59917d15ac2373ca8c797      
 (0,10) | 4068194 | 738595f73670da7ede40aefa8cb3d00c      
(10 rows)      
postgres=# vacuum analyze tbl1;      
VACUUM      

首先我们需要了解索引的level,才能正确的判断需要扫描多少个index page才能取出1条记录。

postgres=# select * from bt_metap('tbl1_pkey');      
 magic  | version | root | level | fastroot | fastlevel       
--------+---------+------+-------+----------+-----------      
 340322 |       2 |  412 |     2 |      412 |         2      
(1 row)      

level = 2的btree应该长这样

6

1. 以下查询,命中了1条记录,并且走的是index only scan。

读了4个INDEX PAGE, 包括1 meta page, 1 root page, 1 branch page, 1 leaf page. 1个heap visibility map page

postgres=#  explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id = 1;      
                                                         QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------      
 Index Only Scan using tbl1_pkey on public.tbl1  (cost=0.42..1.44 rows=1 width=4) (actual time=0.019..0.020 rows=1 loops=1)      
   Output: id      
   Index Cond: (tbl1.id = 1)      
   Heap Fetches: 0      
   Buffers: shared hit=4      
 Planning time: 0.072 ms      
 Execution time: 0.072 ms      
(7 rows)      

2. 以下查询,命中了0条记录,并且走的是index only scan。

读了4个INDEX PAGE, 包括1 meta page, 1 root page, 1 branch page, 1 leaf page. 0个heap visibility map page

但是explain只算了3个,因为rows=0, 没有匹配的行。不需要查询visibility map文件。

postgres=# explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id in (3);      
                                                         QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------      
 Index Only Scan using tbl1_pkey on public.tbl1  (cost=0.43..1.45 rows=1 width=4) (actual time=0.010..0.010 rows=0 loops=1)      
   Output: id      
   Index Cond: (tbl1.id = 3)      
   Heap Fetches: 0      
   Buffers: shared hit=3      
 Planning time: 0.073 ms      
 Execution time: 0.031 ms      
(7 rows)      

3. 以下查询,命中了7条记录,并且走的是index only scan。

读了22个INDEX PAGE,

1 meta page + 7 * (1 root + 1 branch + 1 leaf) = 22

也就是说,每个value都扫了root,branch,leaf。

x个heap visibility map page

postgres=#  explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id in (1,2,3,4,100,1000,10000);      
                                                         QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------      
 Index Only Scan using tbl1_pkey on public.tbl1  (cost=0.42..10.10 rows=7 width=4) (actual time=0.018..0.033 rows=7 loops=1)      
   Output: id      
   Index Cond: (tbl1.id = ANY ('{1,2,3,4,100,1000,10000}'::integer[]))      
   Heap Fetches: 0      
   Buffers: shared hit=22      
 Planning time: 0.083 ms      
 Execution time: 0.056 ms      
(7 rows)      

4. 以下查询,命中了2条记录,并且走的是index only scan。

读了22个INDEX PAGE,

1 meta page + 7 * (1 root + 1 branch + 1 leaf) = 22

也就是说,每个value都扫了root,branch,leaf。

x个heap visibility map page

postgres=# explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id in (1,2,3,4,5,6,7);      
                                                         QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------      
 Index Only Scan using tbl1_pkey on public.tbl1  (cost=0.43..10.13 rows=7 width=4) (actual time=0.039..0.046 rows=2 loops=1)      
   Output: id      
   Index Cond: (tbl1.id = ANY ('{1,2,3,4,5,6,7}'::integer[]))      
   Heap Fetches: 0      
   Buffers: shared hit=22      
 Planning time: 0.232 ms      
 Execution time: 0.086 ms      
(7 rows)      

5. 以下查询结果和以上查询一样,也命中了3条记录,并且走的是index only scan。

但是只读了4个INDEX PAGE,

1 meta page + 1 root + 1 branch + 1 leaf

x个heap visibility map page

postgres=# explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id>0 and id <=7;      
                                                         QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------      
 Index Only Scan using tbl1_pkey on public.tbl1  (cost=0.43..1.49 rows=3 width=4) (actual time=0.008..0.009 rows=2 loops=1)      
   Output: id      
   Index Cond: ((tbl1.id > 0) AND (tbl1.id <= 7))      
   Heap Fetches: 0      
   Buffers: shared hit=4      
 Planning time: 0.127 ms      
 Execution time: 0.028 ms      
(7 rows)      

对于第四个查询,扫描了22个块,这个查询,优化器有优化的空间,比如找到1和7作为边界值,在查询到第一个值时,就可以取到leaf page的下一个page的最小值,从而得到1,2,3,4,5,6,7的值在当前page就可以完全取到,不需要去重复扫描。

hit计数为什么感觉不正确?: 算上visibility map时.

src/include/common/relpath.h

/*  
 * Stuff for fork names.  
 *  
 * The physical storage of a relation consists of one or more forks.  
 * The main fork is always created, but in addition to that there can be  
 * additional forks for storing various metadata. ForkNumber is used when  
 * we need to refer to a specific fork in a relation.  
 */  
typedef enum ForkNumber  
{  
        InvalidForkNumber = -1,  
        MAIN_FORKNUM = 0,  
        FSM_FORKNUM,  
        VISIBILITYMAP_FORKNUM,  
        INIT_FORKNUM  
  
        /*  
         * NOTE: if you add a new fork, change MAX_FORKNUM and possibly  
         * FORKNAMECHARS below, and update the forkNames array in  
         * src/common/relpath.c  
         */  
} ForkNumber;  
  
#define MAX_FORKNUM             INIT_FORKNUM  

src/backend/storage/buffer/bufmgr.c

/*  
 * ReadBuffer_common -- common logic for all ReadBuffer variants  
 *  
 * *hit is set to true if the request was satisfied from shared buffer cache.  
 */  
static Buffer  
ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,  
                                  BlockNumber blockNum, ReadBufferMode mode,  
                                  BufferAccessStrategy strategy, bool *hit)  
{  
        BufferDesc *bufHdr;  
        Block           bufBlock;  
        bool            found;  
        bool            isExtend;  
        bool            isLocalBuf = SmgrIsTemp(smgr);  
  
        *hit = false;  
... ...  
        /* Substitute proper block number if caller asked for P_NEW */  
        if (isExtend)  
                blockNum = smgrnblocks(smgr, forkNum);  
  
        if (isLocalBuf)  
        {  
                bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);  
                if (found)  
                        pgBufferUsage.local_blks_hit++;  
                else if (isExtend)  
                        pgBufferUsage.local_blks_written++;  
                else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||  
                                 mode == RBM_ZERO_ON_ERROR)  
                        pgBufferUsage.local_blks_read++;  
        }  
        else  
        {  
                /*  
                 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is  
                 * not currently in memory.  
                 */  
                bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,  
                                                         strategy, &found);  
                if (found)  
                        pgBufferUsage.shared_blks_hit++;  
                else if (isExtend)  
                        pgBufferUsage.shared_blks_written++;  
                else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||  
                                 mode == RBM_ZERO_ON_ERROR)  
                        pgBufferUsage.shared_blks_read++;  
        }  

src/backend/commands/explain.c

/*  
 * Show buffer usage details.  
 */  
static void  
show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)  
{  
  
...  
                /* Show only positive counter values. */  
                if (has_shared || has_local || has_temp)  
                {  
                        ExplainIndentText(es);  
                        appendStringInfoString(es->str, "Buffers:");  
  
                        if (has_shared)  
                        {  
                                appendStringInfoString(es->str, " shared");  
                                if (usage->shared_blks_hit > 0)  
                                        appendStringInfo(es->str, " hit=%ld",  
                                                                         usage->shared_blks_hit);  
create extension pageinspect;  
create extension pg_buffercache;  

创建解析page data的函数. 倒转, 二进制转int

create or replace function idx2int(text) returns int as $$    
declare  
  res text := '';   
  res1 int8;   
  x text := '';  
begin  
  for x in select regexp_split_to_table($1,' ')   
  loop  
    res := x||res;  
  end loop;  
  execute format($_$ select x'%s'::int8 $_$, res) into res1;  
  return res1::int;   
end;  
$$ language plpgsql strict;  
postgres=# explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id = 5;   
                                                         QUERY PLAN                                                           
----------------------------------------------------------------------------------------------------------------------------  
 Index Only Scan using tbl1_pkey on public.tbl1  (cost=0.43..1.55 rows=1 width=4) (actual time=0.027..0.029 rows=1 loops=1)  
   Output: id  
   Index Cond: (tbl1.id = 5)  
   Heap Fetches: 0  
   Buffers: shared hit=4  
 Planning Time: 0.071 ms  
 Execution Time: 0.046 ms  
(7 rows)  
  
postgres=#  SELECT * FROM bt_metap('tbl1_pkey');  -- meta   
 magic  | version | root | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples | allequalimage   
--------+---------+------+-------+----------+-----------+-------------+-------------------------+---------------  
 340322 |       4 |  412 |     2 |      412 |         2 |           0 |                 3934241 | t  
(1 row)  
  
postgres=# select *,idx2int(data) from bt_page_items('tbl1_pkey', 412);  -- root  
 itemoffset |   ctid   | itemlen | nulls | vars |          data           | dead | htid | tids | idx2int   
------------+----------+---------+-------+------+-------------------------+------+------+------+---------  
          1 | (3,0)    |       8 | f     | f    |                         |      |      |      |       0  
          2 | (9202,1) |      16 | f     | f    | 2b 5e 03 00 00 00 00 00 |      |      |      |  220715  
          3 | (4616,1) |      16 | f     | f    | ac b1 06 00 00 00 00 00 |      |      |      |  438700  
          4 | (9227,1) |      16 | f     | f    | 7a 4c 0a 00 00 00 00 00 |      |      |      |  674938  
...  
  
postgres=# select *,idx2int(data) from bt_page_items('tbl1_pkey', 3);  -- branch  
 itemoffset |   ctid    | itemlen | nulls | vars |          data           | dead | htid | tids | idx2int   
------------+-----------+---------+-------+------+-------------------------+------+------+------+---------  
          1 | (4521,1)  |      16 | f     | f    | 2b 5e 03 00 00 00 00 00 |      |      |      |  220715  
          2 | (1,0)     |       8 | f     | f    |                         |      |      |      |       0  
          3 | (10555,1) |      16 | f     | f    | 56 02 00 00 00 00 00 00 |      |      |      |     598  
          4 | (5423,1)  |      16 | f     | f    | 05 05 00 00 00 00 00 00 |      |      |      |    1285  
...  
  
postgres=# select *,idx2int(data) from bt_page_items('tbl1_pkey', 1);  -- leaf  
 itemoffset |    ctid     | itemlen | nulls | vars |          data           | dead |    htid     | tids | idx2int   
------------+-------------+---------+-------+------+-------------------------+------+-------------+------+---------  
          1 | (25828,1)   |      16 | f     | f    | 56 02 00 00 00 00 00 00 |      |             |      |     598  
          2 | (594,78)    |      16 | f     | f    | 00 00 00 00 00 00 00 00 | f    | (594,78)    |      |       0  
          3 | (19518,66)  |      16 | f     | f    | 05 00 00 00 00 00 00 00 | f    | (19518,66)  |      |       5  
          4 | (32751,68)  |      16 | f     | f    | 06 00 00 00 00 00 00 00 | f    | (32751,68)  |      |       6  
          5 | (20678,31)  |      16 | f     | f    | 07 00 00 00 00 00 00 00 | f    | (20678,31)  |      |       7  
...  
  
postgres=# select ctid,* from tbl1 where ctid='(19518,66)';  -- heap   
    ctid    | id |               info                 
------------+----+----------------------------------  
 (19518,66) |  5 | 01113e164911cf0eaf5db51b4c6e086b  
(1 row)  

计算heap page 19518 的2 bit位属于那个vm page num

postgres=# show block_size ;  
 block_size   
------------  
 8192    
(1 row)  
  
postgres=# select 2*19518/8.0;  
       ?column?          
-----------------------  
 4879.5000000000000000  -- bytes    
(1 row)  

访问了第1个vm数据块, 0号块

postgres=# select * from pg_buffercache  where relfilenode=pg_relation_filenode('tbl1'::regclass) and relforknumber=2 ;  -- table vm form  
 bufferid | relfilenode | reltablespace | reldatabase | relforknumber | relblocknumber | isdirty | usagecount | pinning_backends   
----------+-------------+---------------+-------------+---------------+----------------+---------+------------+------------------  
   402016 |     1093455 |          1663 |       14174 |             2 |              0 | f       |          5 |                0  -- 访问这个     
   402017 |     1093455 |          1663 |       14174 |             2 |              1 | f       |          4 |                0  
(2 rows)  

存在的值, Buffers: shared hit=4 -- (meta+root+branch+leaf) + vm

postgres=# explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id = 5;   
                                                         QUERY PLAN                                                           
----------------------------------------------------------------------------------------------------------------------------  
 Index Only Scan using tbl1_pkey on public.tbl1  (cost=0.43..1.55 rows=1 width=4) (actual time=0.027..0.029 rows=1 loops=1)  
   Output: id  
   Index Cond: (tbl1.id = 5)  
   Heap Fetches: 0  
   Buffers: shared hit=4  
 Planning Time: 0.071 ms  
 Execution Time: 0.046 ms  
(7 rows)  

不存在的值, Buffers: shared hit=3 -- (meta+root+branch+leaf)

postgres=# explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id = 1;   
                                                         QUERY PLAN                                                           
----------------------------------------------------------------------------------------------------------------------------  
 Index Only Scan using tbl1_pkey on public.tbl1  (cost=0.43..1.55 rows=1 width=4) (actual time=0.025..0.066 rows=0 loops=1)  
   Output: id  
   Index Cond: (tbl1.id = 1)  
   Heap Fetches: 0  
   Buffers: shared hit=3  
 Planning Time: 0.086 ms  
 Execution Time: 0.082 ms  
(7 rows)  

存在、不存在唯一的差别是: 不存在时不需要访问vm.

所以从现象看, shared hit统计不准确: 可能的解释是 bt index meta page 没有被count? 有兴趣的同学可以从代码侧再分析一下。

还有访问多次时vm page是否被重复计算?

此前有对比过bitmap index scan和index scan, 在离散扫描上index sacn会重复算每次的heap page扫描, 二bitmap index scan只算一次heap scan.

您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森.

digoal's wechat