PostgreSQL large row|column performance tuning case - including large objects

Author

digoal

Date

2013-01-09

Tags

PostgreSQL , large column optimization , large row optimization , toast , large object , MVCC , multi-version , IO amplification


Background

This post targets optimization of the following scenario:

1. Rows are large, e.g. 1KB or more.

2. There are many indexes, e.g. 4 or more.

3. Most of the table's records get updated, e.g. 7 million rows of a 10-million-row table are updated every day.

4. The table contains large columns, such as text columns where a single value may take about 1KB. These columns are also updated, but much less frequently.

5. Most columns need to be updated.

Because PostgreSQL implements data snapshots and transaction isolation through multi-version concurrency control (MVCC), every update produces a new row version. If updates are frequent, it is advisable to keep frequently updated columns apart from rarely updated ones, for example via the TOAST mechanism.
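The effect is easy to observe. Below is a minimal sketch (the table mvcc_demo is purely illustrative): each UPDATE leaves the old tuple in place and writes a complete new version, which shows up as a change of the row's physical address (ctid):

create table mvcc_demo (id int primary key, info text);
insert into mvcc_demo values (1, 'a');
select ctid, xmin, info from mvcc_demo;   -- e.g. ctid = (0,1)
update mvcc_demo set info = 'b' where id = 1;
select ctid, xmin, info from mvcc_demo;   -- ctid = (0,2): a whole new row version was written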

Body

First, build the test environment:

The test table:

                                   Table "digoal.test"    
 Column |            Type             | Modifiers | Storage  | Stats target | Description     
--------+-----------------------------+-----------+----------+--------------+-------------    
 id     | bigint                      | not null  | plain    |              |     
 col1   | name                        |           | plain    |              |     
 col2   | text                        |           | extended |              |     
 col3   | text                        |           | extended |              |     
 col4   | timestamp without time zone |           | plain    |              |     
Indexes:    
    "test_pkey" PRIMARY KEY, btree (id)    
    "idx_test_col1" btree (col1)    
    "idx_test_col2" btree (col2)    
    "idx_test_col3" btree (col3)    
    "idx_test_col4" btree (col4)    
Has OIDs: no    

Test data:

digoal=> insert into test select generate_series(1,10000),repeat(md5(clock_timestamp()::text), 10), repeat(md5(clock_timestamp()::text), 10), repeat(md5(clock_timestamp()::text), 3000), clock_timestamp();    
INSERT 0 10000    
Time: 92796.813 ms    
    
digoal=> select id,col1,substr(col2,1,32),substr(col3,1,32),col4 from test limit 2;    
-[ RECORD 1 ]-----------------------------------------------------------    
id     | 1    
col1   | 6d4d815f31340d18a6126cdc0527a0556d4d815f31340d18a6126cdc0527a05    
substr | 52a7ab1474b77d03cfb496f6c461ab69    
substr | 4c56be81579819e034764a562648929d    
col4   | 2013-01-09 20:37:38.081669    
-[ RECORD 2 ]-----------------------------------------------------------    
id     | 2    
col1   | 33db862bf58a9c7d080a879efbf9af8c33db862bf58a9c7d080a879efbf9af8    
substr | 41663453b02f0dc57932feb1f2388949    
substr | fbb5fe7f75f86fb2406f340073a0ca3c    
col4   | 2013-01-09 20:37:38.084888    
Time: 0.722 ms    

Each record (including tuple headers, etc.) occupies 1638 bytes on average:

digoal=> select oid,reltoastrelid,relname from pg_class where relname='test';    
  oid   | reltoastrelid | relname     
--------+---------------+---------    
 108687 |        108690 | test    
(1 row)    
Time: 0.658 ms    
    
digoal=> \c digoal postgres    
You are now connected to database "digoal" as user "postgres".    
    
digoal=> select pg_relation_size(108687::regclass)+pg_relation_size(108690::regclass);    
 ?column?     
----------    
 16384000    
(1 row)    
Time: 0.750 ms    
digoal=# select 16384000/10000;    
 ?column?     
----------    
     1638    
(1 row)    
Time: 0.668 ms    
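Roughly the same figure comes from pg_table_size, which covers the heap plus its TOAST table (and the free space/visibility maps) while excluding indexes:

select pg_table_size('test');   -- heap + TOAST (+ FSM/VM), indexes excluded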

Quick tests:

1. INSERT

As shown above, inserting the 10000 records took 92796.813 ms.

2. UPDATE

digoal=> update test set col1=repeat(md5(clock_timestamp()::text), 10);    
UPDATE 10000    
Time: 78812.367 ms    

3. SELECT

digoal=> select count(*) from test ;    
 count     
-------    
 10000    
(1 row)    
Time: 2.420 ms    

4. DELETE

digoal=> delete from test ;    
DELETE 10000    
Time: 93.877 ms    

The UPDATE bottleneck comes mainly from index maintenance, especially the index on col3; details follow.

Optimization examples

Approach 1

1. Since most of the cost is in the indexes, reducing index updates is one way to optimize.

Split the table by columns and create a view over the pieces, with INSTEAD OF triggers on the view for insert, update, and delete operations. As follows:

After the split, an update touches fewer indexes. The splitting principles are discussed at the end.

Split the test table used above into 4 tables related through id. The indexes on col1, col2, col3, col4 of the original table must also be created on the split tables, as sketched below.
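A minimal DDL sketch matching the structures shown next:

create table t1 (id int8 primary key, col1 name);
create index idx_col1 on t1 (col1);
create table t2 (id int8 primary key, col2 text);
create index idx_col2 on t2 (col2);
create table t3 (id int8 primary key, col3 text);
create index idx_col3 on t3 (col3);
create table t4 (id int8 primary key, col4 timestamp without time zone);
create index idx_col4 on t4 (col4);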

digoal=> \d+ t1    
                         Table "digoal.t1"    
 Column |  Type  | Modifiers | Storage | Stats target | Description     
--------+--------+-----------+---------+--------------+-------------    
 id     | bigint | not null  | plain   |              |     
 col1   | name   |           | plain   |              |     
Indexes:    
    "t1_pkey" PRIMARY KEY, btree (id)    
    "idx_col1" btree (col1)    
Has OIDs: no    
    
digoal=> \d+ t2    
                          Table "digoal.t2"    
 Column |  Type  | Modifiers | Storage  | Stats target | Description     
--------+--------+-----------+----------+--------------+-------------    
 id     | bigint | not null  | plain    |              |     
 col2   | text   |           | extended |              |     
Indexes:    
    "t2_pkey" PRIMARY KEY, btree (id)    
    "idx_col2" btree (col2)    
Has OIDs: no    
    
digoal=> \d+ t3    
                          Table "digoal.t3"    
 Column |  Type  | Modifiers | Storage  | Stats target | Description     
--------+--------+-----------+----------+--------------+-------------    
 id     | bigint | not null  | plain    |              |     
 col3   | text   |           | extended |              |     
Indexes:    
    "t3_pkey" PRIMARY KEY, btree (id)    
    "idx_col3" btree (col3)    
Has OIDs: no    
    
digoal=> \d+ t4    
                                    Table "digoal.t4"    
 Column |            Type             | Modifiers | Storage | Stats target | Description     
--------+-----------------------------+-----------+---------+--------------+-------------    
 id     | bigint                      | not null  | plain   |              |     
 col4   | timestamp without time zone |           | plain   |              |     
Indexes:    
    "t4_pkey" PRIMARY KEY, btree (id)    
    "idx_col4" btree (col4)    
Has OIDs: no    

The table is split into 4 tables, each containing the id column (the PK of the original test table), through which they are related.

The detailed splitting principles are given in the summary at the end.

Create the view:

digoal=> create or replace view v_t as select t1.id,t1.col1,t2.col2,t3.col3,t4.col4 from t1 inner join t2 using (id) inner join t3 using (id) inner join t4 using (id);    
CREATE VIEW    

Its structure is as follows:

digoal=> \d+ v_t    
                             View "digoal.v_t"    
 Column |            Type             | Modifiers | Storage  | Description     
--------+-----------------------------+-----------+----------+-------------    
 id     | bigint                      |           | plain    |     
 col1   | name                        |           | plain    |     
 col2   | text                        |           | extended |     
 col3   | text                        |           | extended |     
 col4   | timestamp without time zone |           | plain    |     
View definition:    
 SELECT t1.id, t1.col1, t2.col2, t3.col3, t4.col4    
   FROM t1    
   JOIN t2 USING (id)    
   JOIN t3 USING (id)    
   JOIN t4 USING (id);    

Next, create the trigger function. For convenience, applications run inserts, updates, deletes, and queries directly against the view, so it behaves exactly like the original test table.

CREATE OR REPLACE FUNCTION digoal.tg_v_t()    
 RETURNS trigger    
 LANGUAGE plpgsql    
AS $function$                                  
declare    
begin    
  IF (TG_OP = 'UPDATE') THEN    
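    -- touch each sub-table only when its column value actually changed,
    -- so unchanged sub-tables (and their indexes) are not rewritten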
    UPDATE t1 set id=NEW.id,col1=NEW.col1 where id=NEW.id and col1<>NEW.col1;    
    UPDATE t2 set id=NEW.id,col2=NEW.col2 where id=NEW.id and col2<>NEW.col2;    
    UPDATE t3 set id=NEW.id,col3=NEW.col3 where id=NEW.id and col3<>NEW.col3;    
    UPDATE t4 set id=NEW.id,col4=NEW.col4 where id=NEW.id and col4<>NEW.col4;    
    return NEW;    
  ELSIF (TG_OP = 'INSERT') THEN    
    insert into t1 (id, col1) values (NEW.id,NEW.col1);    
    insert into t2 (id, col2) values (NEW.id,NEW.col2);    
    insert into t3 (id, col3) values (NEW.id,NEW.col3);    
    insert into t4 (id, col4) values (NEW.id,NEW.col4);    
    return NEW;    
  ELSIF (TG_OP = 'DELETE') THEN    
    delete from t1 where id=OLD.id;    
    delete from t2 where id=OLD.id;    
    delete from t3 where id=OLD.id;    
    delete from t4 where id=OLD.id;    
    return OLD;    
  END IF;    
  return null;    
end;    
$function$;    

Create the trigger:

digoal=> create trigger tg_v_t INSTEAD OF INSERT OR DELETE OR UPDATE ON v_t FOR EACH ROW EXECUTE PROCEDURE tg_v_t();    
CREATE TRIGGER    

The trigger function can also be split into separate INSERT, UPDATE, and DELETE functions, eliminating the TG_OP branch and gaining a little more performance. Note too that the col<>NEW.col comparisons above evaluate to unknown when either side is NULL, so rows holding NULLs would be skipped; col IS DISTINCT FROM NEW.col is the NULL-safe test.
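For example, the INSERT-only variant could look like this (a sketch; it would replace the combined trigger for the INSERT event):

create or replace function tg_v_t_ins() returns trigger as $$
begin
  insert into t1 (id, col1) values (NEW.id, NEW.col1);
  insert into t2 (id, col2) values (NEW.id, NEW.col2);
  insert into t3 (id, col3) values (NEW.id, NEW.col3);
  insert into t4 (id, col4) values (NEW.id, NEW.col4);
  return NEW;
end;
$$ language plpgsql;

create trigger tg_v_t_ins instead of insert on v_t for each row execute procedure tg_v_t_ins();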

Data validation test:

digoal=> insert into v_t values (10001,'abc','test','test1',now());    
INSERT 0 1    
Time: 30.599 ms    
digoal=> update v_t set col1='digoal',col2='digoal',col3='digoal',col4=now() where id > 9995;    
UPDATE 6    
Time: 17.493 ms    
digoal=> delete from v_t where col1='digoal' and id in (9997,9999);    
DELETE 2    
Time: 3.197 ms    
digoal=> select * from v_t where id>9995;    
  id   |  col1  |  col2  |  col3  |            col4                
-------+--------+--------+--------+----------------------------    
  9996 | digoal | digoal | digoal | 2013-01-09 15:53:04.247144    
  9998 | digoal | digoal | digoal | 2013-01-09 15:53:04.247144    
 10000 | digoal | digoal | digoal | 2013-01-09 15:53:04.247144    
 10001 | digoal | digoal | digoal | 2013-01-09 15:53:04.247144    
(4 rows)    
Time: 2.265 ms    

Performance tests after splitting:

1. INSERT

digoal=> insert into v_t select generate_series(1,10000),repeat(md5(clock_timestamp()::text), 10), repeat(md5(clock_timestamp()::text), 10), repeat(md5(clock_timestamp()::text), 3000), clock_timestamp();    
INSERT 0 10000    
Time: 95542.662 ms    

2. UPDATE

digoal=> update v_t set col1=repeat(md5(clock_timestamp()::text), 10);    
UPDATE 10000    
Time: 6956.229 ms    

3. SELECT

digoal=> select count(*) from v_t;    
 count     
-------    
 10000    
(1 row)    
Time: 30.902 ms    

4. DELETE

digoal=> delete from v_t;    
DELETE 10000    
Time: 1250.829 ms    

pgbench tests

Load 100,000 rows of test data, as follows:

digoal=> insert into test select generate_series(1,100000),repeat(md5(clock_timestamp()::text), 10), repeat(md5(clock_timestamp()::text), 10), repeat(md5(clock_timestamp()::text), 3000), clock_timestamp();    
digoal=> insert into v_t select generate_series(1,100000),repeat(md5(clock_timestamp()::text), 10), repeat(md5(clock_timestamp()::text), 10), repeat(md5(clock_timestamp()::text), 3000), clock_timestamp();    

Scenario 1

Updating 2 non-large columns

Before splitting

ocz@db-172-16-3-150-> vi t.sql    
\setrandom id 1 100000    
update test set col1=repeat(md5(clock_timestamp()::text), 10), col2=repeat(md5(clock_timestamp()::text), 10) where id=:id;    

Results:

ocz@db-172-16-3-150-> pgbench -M prepared -f ./t.sql -n -r -c 8 -j 4 -T 60 -h 127.0.0.1 -U digoal digoal    
transaction type: Custom query    
scaling factor: 1    
query mode: prepared    
number of clients: 8    
number of threads: 4    
duration: 60 s    
number of transactions actually processed: 45262    
tps = 754.215032 (including connections establishing)    
tps = 754.324088 (excluding connections establishing)    
statement latencies in milliseconds:    
        0.003502        \setrandom id 1 100000    
        10.598039       update test set col1=repeat(md5(clock_timestamp()::text), 10), col2=repeat(md5(clock_timestamp()::text), 10) where id=:id;    

After splitting: updating through the view

ocz@db-172-16-3-150-> vi t.sql    
\setrandom id 1 100000    
update v_t set col1=repeat(md5(clock_timestamp()::text), 10), col2=repeat(md5(clock_timestamp()::text), 10) where id=:id;    

Results:

ocz@db-172-16-3-150-> pgbench -M prepared -f ./t.sql -n -r -c 8 -j 4 -T 60 -h 127.0.0.1 -U digoal digoal    
transaction type: Custom query    
scaling factor: 1    
query mode: prepared    
number of clients: 8    
number of threads: 4    
duration: 60 s    
number of transactions actually processed: 394536    
tps = 6575.466080 (including connections establishing)    
tps = 6576.538308 (excluding connections establishing)    
statement latencies in milliseconds:    
        0.003344        \setrandom id 1 100000    
        1.209849        update v_t set col1=repeat(md5(clock_timestamp()::text), 10), col2=repeat(md5(clock_timestamp()::text), 10) where id=:id;    

After splitting: updating the underlying tables directly in one transaction

ocz@db-172-16-3-150-> vi t.sql    
\setrandom id 1 100000    
begin    
update t1 set col1=repeat(md5(clock_timestamp()::text), 10) where id=:id;    
update t2 set col2=repeat(md5(clock_timestamp()::text), 10) where id=:id;    
end;    

Results:

ocz@db-172-16-3-150-> pgbench -M prepared -f ./t.sql -n -r -c 8 -j 4 -T 60 -h 127.0.0.1 -U digoal digoal    
transaction type: Custom query    
scaling factor: 1    
query mode: prepared    
number of clients: 8    
number of threads: 4    
duration: 60 s    
number of transactions actually processed: 734025    
tps = 12227.678754 (including connections establishing)    
tps = 12229.395011 (excluding connections establishing)    
statement latencies in milliseconds:    
        0.002815        \setrandom id 1 100000    
        0.071055        begin    
        0.229502        update t1 set col1=repeat(md5(clock_timestamp()::text), 10) where id=:id;    
        0.242445        update t2 set col2=repeat(md5(clock_timestamp()::text), 10) where id=:id;    
        0.101776        end;    

Scenario 1 shows that after splitting, updating through the view no longer touches col3 and col4, which brings an enormous performance gain.

Scenario 2

Updating one large column together with one small column

Before splitting

ocz@db-172-16-3-150-> vi t.sql    
\setrandom id 1 100000    
update test set col1=repeat(md5(clock_timestamp()::text), 10), col3=repeat(md5(clock_timestamp()::text), 3000) where id=:id;    

Results:

ocz@db-172-16-3-150-> pgbench -M prepared -f ./t.sql -n -r -c 8 -j 4 -T 60 -h 127.0.0.1 -U digoal digoal    
transaction type: Custom query    
scaling factor: 1    
query mode: prepared    
number of clients: 8    
number of threads: 4    
duration: 60 s    
number of transactions actually processed: 35627    
tps = 593.663136 (including connections establishing)    
tps = 593.742321 (excluding connections establishing)    
statement latencies in milliseconds:    
        0.003783        \setrandom id 1 100000    
        13.465606       update test set col1=repeat(md5(clock_timestamp()::text), 10), col3=repeat(md5(clock_timestamp()::text), 3000) where id=:id;    

After splitting: updating through the view

ocz@db-172-16-3-150-> vi t.sql    
\setrandom id 1 100000    
update v_t set col1=repeat(md5(clock_timestamp()::text), 10), col3=repeat(md5(clock_timestamp()::text), 3000) where id=:id;    

Results:

ocz@db-172-16-3-150-> pgbench -M prepared -f ./t.sql -n -r -c 8 -j 4 -T 60 -h 127.0.0.1 -U digoal digoal    
transaction type: Custom query    
scaling factor: 1    
query mode: prepared    
number of clients: 8    
number of threads: 4    
duration: 60 s    
number of transactions actually processed: 34437    
tps = 573.834822 (including connections establishing)    
tps = 573.932045 (excluding connections establishing)    
statement latencies in milliseconds:    
        0.004346        \setrandom id 1 100000    
        13.929713       update v_t set col1=repeat(md5(clock_timestamp()::text), 10), col3=repeat(md5(clock_timestamp()::text), 3000) where id=:id;    

Note that updates touching the large column got no improvement. But since the large column is updated infrequently, this is acceptable overall.

After splitting: updating the underlying tables directly in one transaction

ocz@db-172-16-3-150-> vi t.sql    
\setrandom id 1 100000    
begin    
update t1 set col1=repeat(md5(clock_timestamp()::text), 10) where id=:id;    
update t3 set col3=repeat(md5(clock_timestamp()::text), 3000) where id=:id;    
end;    

Results:

ocz@db-172-16-3-150-> pgbench -M prepared -f ./t.sql -n -r -c 8 -j 4 -T 60 -h 127.0.0.1 -U digoal digoal    
transaction type: Custom query    
scaling factor: 1    
query mode: prepared    
number of clients: 8    
number of threads: 4    
duration: 60 s    
number of transactions actually processed: 35155    
tps = 585.795505 (including connections establishing)    
tps = 585.875121 (excluding connections establishing)    
statement latencies in milliseconds:    
        0.003436        \setrandom id 1 100000    
        0.089372        begin    
        0.253391        update t1 set col1=repeat(md5(clock_timestamp()::text), 10) where id=:id;    
        13.160805       update t3 set col3=repeat(md5(clock_timestamp()::text), 3000) where id=:id;    
        0.139259        end;    

Scenario 2 reveals that although the view update no longer touches col2 and col4, performance is about the same as before splitting, which is to say still poor.

So updating the large column itself is the crux of the performance bottleneck.

Scenario 3

After dropping the indexes on the large column col3, the results are as follows:

ocz@db-172-16-3-150-> psql     
postgres=# \c digoal digoal    
digoal=> drop index idx_col3;    
DROP INDEX    
digoal=> drop index idx_test_col3;    
DROP INDEX    
    
ocz@db-172-16-3-150-> pgbench -M prepared -f ./t.sql -n -r -c 8 -j 4 -T 60 -h 127.0.0.1 -U digoal digoal    
transaction type: Custom query    
scaling factor: 1    
query mode: prepared    
number of clients: 8    
number of threads: 4    
duration: 60 s    
number of transactions actually processed: 269946    
tps = 4498.935114 (including connections establishing)    
tps = 4499.613875 (excluding connections establishing)    
statement latencies in milliseconds:    
        0.003341        \setrandom id 1 100000    
        1.771405        update test set col1=repeat(md5(clock_timestamp()::text), 10), col3=repeat(md5(clock_timestamp()::text), 3000) where id=:id;    
ocz@db-172-16-3-150-> pgbench -M prepared -f ./t.sql -n -r -c 8 -j 4 -T 60 -h 127.0.0.1 -U digoal digoal    
transaction type: Custom query    
scaling factor: 1    
query mode: prepared    
number of clients: 8    
number of threads: 4    
duration: 60 s    
number of transactions actually processed: 196676    
tps = 3277.755460 (including connections establishing)    
tps = 3278.233032 (excluding connections establishing)    
statement latencies in milliseconds:    
        0.003820        \setrandom id 1 100000    
        2.432612        update v_t set col1=repeat(md5(clock_timestamp()::text), 10), col3=repeat(md5(clock_timestamp()::text), 3000) where id=:id;    
ocz@db-172-16-3-150-> pgbench -M prepared -f ./t.sql -n -r -c 8 -j 4 -T 60 -h 127.0.0.1 -U digoal digoal    
transaction type: Custom query    
scaling factor: 1    
query mode: prepared    
number of clients: 8    
number of threads: 4    
duration: 60 s    
number of transactions actually processed: 237624    
tps = 3959.030176 (including connections establishing)    
tps = 3959.582280 (excluding connections establishing)    
statement latencies in milliseconds:    
        0.003228        \setrandom id 1 100000    
        0.058514        begin    
        0.205088        update t1 set col1=repeat(md5(clock_timestamp()::text), 10) where id=:id;    
        1.660293        update t3 set col3=repeat(md5(clock_timestamp()::text), 3000) where id=:id;    
        0.086006        end;    

Comparing the Scenario 3 results with Scenario 2 shows that the index on the large column causes an enormous performance loss.

Optimization approach 2

1. Drop the index on the large column.

Performance after dropping it is shown in Scenario 3 above.

2. Alternatively, do not index the full value; index only a prefix, e.g. the first 32 characters: substr(col3,1,32).

digoal=> create index idx_test_3 on test (substr(col3,1,32));    
digoal=> create index idx_col3 on t3 (substr(col3,1,32));    

The results are as follows:

ocz@db-172-16-3-150-> pgbench -M prepared -f ./t.sql -n -r -c 8 -j 4 -T 60 -h 127.0.0.1 -U digoal digoal    
transaction type: Custom query    
scaling factor: 1    
query mode: prepared    
number of clients: 8    
number of threads: 4    
duration: 60 s    
number of transactions actually processed: 191874    
tps = 3197.762123 (including connections establishing)    
tps = 3198.203949 (excluding connections establishing)    
statement latencies in milliseconds:    
        0.003331        \setrandom id 1 100000    
        2.494849        update test set col1=repeat(md5(clock_timestamp()::text), 10), col3=repeat(md5(clock_timestamp()::text), 3000) where id=:id;    
ocz@db-172-16-3-150-> pgbench -M prepared -f ./t.sql -n -r -c 8 -j 4 -T 60 -h 127.0.0.1 -U digoal digoal    
transaction type: Custom query    
scaling factor: 1    
query mode: prepared    
number of clients: 8    
number of threads: 4    
duration: 60 s    
number of transactions actually processed: 150292    
tps = 2504.685327 (including connections establishing)    
tps = 2505.064421 (excluding connections establishing)    
statement latencies in milliseconds:    
        0.003861        \setrandom id 1 100000    
        3.185736        update v_t set col1=repeat(md5(clock_timestamp()::text), 10), col3=repeat(md5(clock_timestamp()::text), 3000) where id=:id;    
ocz@db-172-16-3-150-> pgbench -M prepared -f ./t.sql -n -r -c 8 -j 4 -T 60 -h 127.0.0.1 -U digoal digoal    
transaction type: Custom query    
scaling factor: 1    
query mode: prepared    
number of clients: 8    
number of threads: 4    
duration: 60 s    
number of transactions actually processed: 174549    
tps = 2908.974298 (including connections establishing)    
tps = 2909.398826 (excluding connections establishing)    
statement latencies in milliseconds:    
        0.003254        \setrandom id 1 100000    
        0.060880        begin    
        0.199939        update t1 set col1=repeat(md5(clock_timestamp()::text), 10) where id=:id;    
        2.391802        update t3 set col3=repeat(md5(clock_timestamp()::text), 3000) where id=:id;    
        0.086923        end;    

For large columns that receive few updates, this optimization is quite workable.
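One caveat worth adding (my note, not part of the original benchmark): an expression index is only considered when the query uses the same expression, so lookups should filter on the prefix and recheck the full value:

-- can use the substr(col3,1,32) expression indexes created above
select * from t3
 where substr(col3, 1, 32) = substr('the full search value', 1, 32)
   and col3 = 'the full search value';   -- recheck eliminates prefix collisions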

Optimization approach 3

Store col3 as a large object.

digoal=> alter table test drop column col3;    
ALTER TABLE    
digoal=> alter table test add column col3 oid;    
ALTER TABLE    
digoal=> \d test    
               Table "digoal.test"    
 Column |            Type             | Modifiers     
--------+-----------------------------+-----------    
 id     | bigint                      | not null    
 col1   | name                        |     
 col2   | text                        |     
 col4   | timestamp without time zone |     
 col3   | oid                         |     
Indexes:    
    "test_pkey" PRIMARY KEY, btree (id)    
    "idx_test_col1" btree (col1)    
    "idx_test_col2" btree (col2)    
    "idx_test_col4" btree (col4)    
digoal=> truncate test ;    
TRUNCATE TABLE    

Create a function that writes a large object. For the file-open flags, see src/include/libpq/libpq-fs.h:

create or replace function write_lo (i_bytea bytea) returns oid as $$    
declare    
  oid_new_lo oid;    
  fd_new_lo int;    
begin    
  select lo_creat(-1) into oid_new_lo;                 -- create a new, empty large object    
  select lo_open(oid_new_lo, 131072) into fd_new_lo;   -- 131072 = 0x00020000 = INV_WRITE    
  perform lowrite(fd_new_lo, i_bytea);                 -- write the whole bytea value    
  perform lo_close(fd_new_lo);    
  return oid_new_lo;    
end;    
$$ language plpgsql;    

Create a function that reads a large object; the flags again come from src/include/libpq/libpq-fs.h:

create or replace function read_lo (i_lo_oid oid, i_size int4) returns bytea as $$    
declare    
  result bytea;    
  fd_lo int;    
begin    
  select lo_open(i_lo_oid, 262144) into fd_lo;   -- 262144 = 0x00040000 = INV_READ    
  select loread(fd_lo, i_size) into result;      -- read up to i_size bytes from the start    
  perform lo_close(fd_lo);    
  return result;    
end;    
$$ language plpgsql;    
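A quick round trip to sanity-check the two functions (a sketch; it creates a throwaway large object that should later be removed with lo_unlink):

select encode(read_lo(write_lo('hello'::bytea), 5), 'escape');   -- returns 'hello'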

Insert the test data:

digoal=> insert into test (id,col1,col2,col3,col4) select     
digoal->   generate_series(1,100000),     
digoal->   repeat(md5(clock_timestamp()::text), 10),     
digoal->   repeat(md5(clock_timestamp()::text), 10),     
digoal->   write_lo(decode(repeat(md5(clock_timestamp()::text), 3000), 'base64')),     
digoal->   clock_timestamp();    
INSERT 0 100000    
Time: 273653.042 ms    

Conversion between text and bytea can be done with the decode and encode functions.

For usage, see:

http://www.postgresql.org/docs/9.2/static/functions-binarystring.html

http://www.postgresql.org/docs/9.2/static/functions-string.html

src/backend/utils/adt/encode.c

pgbench test updating the non-large columns, i.e. the same workload as Scenario 1 above:

ocz@db-172-16-3-150-> pgbench -M prepared -f ./t.sql -n -r -c 8 -j 4 -T 60 -h 127.0.0.1 -U digoal digoal    
transaction type: Custom query    
scaling factor: 1    
query mode: prepared    
number of clients: 8    
number of threads: 4    
duration: 60 s    
number of transactions actually processed: 1337021    
tps = 22283.174163 (including connections establishing)    
tps = 22286.160069 (excluding connections establishing)    
statement latencies in milliseconds:    
        0.002805        \setrandom id 1 100000    
        0.353505        update test set col1=repeat(md5(clock_timestamp()::text), 10), col2=repeat(md5(clock_timestamp()::text), 10) where id=:id;    

Performance improved from 754 tps to 22286 tps.

Large objects are stored in pg_largeobject, so their number is capped by the OID range (an unsigned 32-bit integer): a single database cannot hold more than 2^32-1 large objects.
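How close a database is to that limit can be checked by counting pg_largeobject_metadata (one row per large object, available since 9.0):

select count(*) from pg_largeobject_metadata;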

A single large object is split into multiple data chunks for storage, as shown below.

digoal=# \d pg_largeobject    
Table "pg_catalog.pg_largeobject"    
 Column |  Type   | Modifiers     
--------+---------+-----------    
 loid   | oid     | not null    
 pageno | integer | not null    
 data   | bytea   |     
Indexes:    
    "pg_largeobject_loid_pn_index" UNIQUE, btree (loid, pageno)    

To clean up large objects that are no longer needed:

select lo_unlink(col3) from test;    

Remember: unlink large objects before deleting the rows that reference them; otherwise the objects linger in pg_largeobject forever, much like a memory leak.
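One way to guard against such leaks (a sketch, not part of the tests above) is a DELETE trigger that unlinks the referenced object; the contrib utility vacuumlo can also clean up orphaned large objects after the fact:

create or replace function unlink_col3() returns trigger as $$
begin
  if OLD.col3 is not null then
    perform lo_unlink(OLD.col3);   -- free the large object the deleted row pointed to
  end if;
  return OLD;
end;
$$ language plpgsql;

create trigger tg_unlink_col3 before delete on test for each row execute procedure unlink_col3();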

Summary

1. First, whether or not the large column will be updated, split it out. Update performance then rises from 754 tps to 6576 tps (updating through the view) or 12229 tps (updating the split underlying tables directly).

Second, avoid indexes on large columns wherever possible. Without the index, updates touching the large column go from 593 tps to 4499 tps.

Third, if the large column must be indexed, shorten what is indexed where the workload allows, e.g. index only the first 32 characters (a functional/expression index). Updates touching the large column then go from 593 tps to 3198 tps.

Finally, consider changing the large column's type from text to oid, i.e. storing it as a large object (text holds at most 1GB; a large object can hold 2GB). But mind the per-database cap on the number of large objects (the OID limit); to go beyond it, split the table and stay with text or bytea. With large-object storage, updates of the non-large columns improved from 754 tps to 22286 tps.

2. Advantages of splitting: a lower probability of index updates and faster processing. Query performance degrades after splitting; but if the query does not touch the large column it can actually improve, because fewer heap blocks are scanned.

3. Splitting needs extra storage for the PK, and the finer the split, the higher the cost. In this example the int8 key takes 8 bytes; after splitting, each row carries 3 extra copies of it, so 10 million rows cost an extra 24 bytes × 10,000K rows = 240MB. In exchange you get excellent update performance.

4. Splitting principles

4.1 Split by index: if there is a composite index on col1,col2, split out (id, col1, col2).

4.2 Split by size: col3 typically stores about 1KB of content, which is large, so split out (id, col3).

4.3 After the above, split by update frequency: col4 is updated frequently, so split out (id, col4).

After these splits, test becomes 3 tables, related through id.

5. After splitting, TRUNCATE must be executed on the underlying tables, because a TRUNCATE trigger cannot be created on a view. These TRUNCATEs should be run in a single transaction, e.g. as sketched below.
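begin;
truncate t1, t2, t3, t4;   -- a single multi-table TRUNCATE keeps the pieces consistent
commit;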

6. Why does splitting improve update performance?

6.1 A PostgreSQL update actually adds a new tuple version: updating a 1KB row produces a brand-new 1KB row. The old tuple is erased by the autovacuum process after the transaction ends, or by a manual VACUUM.

So the larger the row, the more expensive the update. Splitting shrinks the update granularity, so performance naturally improves.

6.2 Before splitting, a row update maintains every index, unless it is a HOT update, and for tables with large rows the probability of HOT is very low.

So every update updates all the indexes, greatly increasing IO and CPU cost.

(For HOT, see: src/backend/access/heap/README.HOT)

After splitting, the probability of index updates drops dramatically; the HOT ratio can be checked as shown below.
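The HOT ratio of a table can be checked in the statistics views (my addition):

-- n_tup_hot_upd / n_tup_upd = fraction of updates that avoided index maintenance
select n_tup_upd, n_tup_hot_upd from pg_stat_user_tables where relname = 'test';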

7. The real bottleneck in this case is the CPU consumed by the index on the large column; once it is removed, performance soars.

8. Another major update bottleneck can come from irregular update patterns causing index page splits and migrations, with the associated IO and CPU overhead.

See:

http://blog.163.com/digoal@126/blog/static/16387704020129249646421/

9. http://www.postgresql.org/docs/9.2/static/largeobjects.html

Available file-open flags:

src/include/libpq/libpq-fs.h

/*    
 *      Read/write mode flags for inversion (large object) calls    
 */    
#define INV_WRITE               0x00020000    
#define INV_READ                0x00040000    

Other

If a tuple grows large enough that it must be stored in the TOAST table, performance again changes dramatically, for reasons similar to the large-object case above: the heap then stores only a small stub pointing at the TOASTed value (similar to a pointer).

For TOAST details, see src/include/access/tuptoaster.h .

Below is a test:

digoal=> alter table test drop column col3;    
ALTER TABLE    
digoal=> truncate table test;    
TRUNCATE TABLE    
digoal=> alter table test add column col3 text;    
ALTER TABLE    
digoal=> \d test    
               Table "digoal.test"    
 Column |            Type             | Modifiers     
--------+-----------------------------+-----------    
 id     | bigint                      | not null    
 col1   | name                        |     
 col2   | text                        |     
 col4   | timestamp without time zone |     
 col3   | text                        |     
Indexes:    
    "test_pkey" PRIMARY KEY, btree (id)    
    "idx_test_col1" btree (col1)    
    "idx_test_col2" btree (col2)    
    "idx_test_col4" btree (col4)    
digoal=> create index idx_test_col3 on test(substr(col3,1,8192));    
CREATE INDEX    
digoal=> insert into test (id,col1,col2,col3,col4) select     
digoal->    generate_series(1,10),     
digoal->    repeat(md5(clock_timestamp()::text), 10),     
digoal->    repeat(md5(clock_timestamp()::text), 10),     
digoal->    repeat(md5(clock_timestamp()::text), 30000),     
digoal->    clock_timestamp();    
INSERT 0 10    
digoal=> select pg_column_size(col3) from test limit 1;    
 pg_column_size     
----------------    
          11031    
(1 row)    

A single record reaches 11KB, exceeding the 8KB heap page, so it must be stored in TOAST.

digoal=> select reltoastrelid from pg_class where relname='test';    
 reltoastrelid     
---------------    
        108690    
(1 row)    
digoal=> select pg_relation_size(108690::regclass);    
 pg_relation_size     
------------------    
           139264    
(1 row)    
digoal=> truncate test ;    
TRUNCATE TABLE    
digoal=> insert into test (id,col1,col2,col3,col4) select     
digoal->    generate_series(1,100000),     
digoal->    repeat(md5(clock_timestamp()::text), 10),     
digoal->    repeat(md5(clock_timestamp()::text), 10),     
digoal->    repeat(md5(clock_timestamp()::text), 30000),     
digoal->    clock_timestamp();    
INSERT 0 100000    
ocz@db-172-16-3-150-> pgbench -M prepared -f ./t.sql -n -r -c 8 -j 4 -T 60 -h 127.0.0.1 -U digoal digoal    
transaction type: Custom query    
scaling factor: 1    
query mode: prepared    
number of clients: 8    
number of threads: 4    
duration: 60 s    
number of transactions actually processed: 126582    
tps = 2109.062220 (including connections establishing)    
tps = 2109.413805 (excluding connections establishing)    
statement latencies in milliseconds:    
        0.003987        \setrandom id 1 100000    
        3.783014        update test set col1=repeat(md5(clock_timestamp()::text), 10), col2=repeat(md5(clock_timestamp()::text), 10) where id=:id;    
digoal=> drop index idx_test_col3;    
DROP INDEX    
ocz@db-172-16-3-150-> pgbench -M prepared -f ./t.sql -n -r -c 8 -j 4 -T 60 -h 127.0.0.1 -U digoal digoal    
transaction type: Custom query    
scaling factor: 1    
query mode: prepared    
number of clients: 8    
number of threads: 4    
duration: 60 s    
number of transactions actually processed: 1436122    
tps = 23934.814969 (including connections establishing)    
tps = 23938.109376 (excluding connections establishing)    
statement latencies in milliseconds:    
        0.002752        \setrandom id 1 100000    
        0.328780        update test set col1=repeat(md5(clock_timestamp()::text), 10), col2=repeat(md5(clock_timestamp()::text), 10) where id=:id;    

With the default 8KB block size, the TOAST threshold is about 2KB. For how to change the threshold at which column values are moved into TOAST, see:

《HOW to Change PostgreSQL's TOAST_TUPLE_THRESHOLD》

Lowering the TOAST threshold, so that values are sliced out of the heap as much as possible, reduces the IO amplification caused by variable-length columns that are not themselves updated.
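On newer releases this no longer requires recompiling: PostgreSQL 11+ exposes a per-table storage parameter, and a column's storage strategy can be forced out of line (a hedged sketch; the linked article describes the compile-time approach):

alter table test set (toast_tuple_target = 128);          -- PG 11+: lower the size at which tuples are sliced into TOAST
alter table test alter column c1 set storage external;    -- always store c1 out of line, uncompressed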

1. When the long column is stored inline in the heap block, update performance is poor and IO amplification is obvious. (In the output below, the pgbench progress lines are followed by dstat-style system stats; note the roughly 480-630MB/s of disk writes for a logically tiny update.)

create table test(id int primary key, c1 text);  
insert into test select id, repeat(md5(clock_timestamp()::text), 10000) from generate_series(1,1000) t(id);  
vi test.sql  
\set id random(1,1000)  
update test set id=:id where id=:id;  

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 32 -j 32 -T 100

progress: 1.0 s, 125435.6 tps, lat 0.250 ms stddev 0.438  
progress: 2.0 s, 130494.6 tps, lat 0.245 ms stddev 0.419  
progress: 3.0 s, 120103.6 tps, lat 0.266 ms stddev 1.848  
progress: 4.0 s, 143489.7 tps, lat 0.223 ms stddev 0.120  
progress: 5.0 s, 153019.4 tps, lat 0.209 ms stddev 0.095  
progress: 6.0 s, 154066.3 tps, lat 0.208 ms stddev 0.095  
progress: 7.0 s, 144861.8 tps, lat 0.221 ms stddev 0.280  
progress: 8.0 s, 77655.0 tps, lat 0.412 ms stddev 2.466  
progress: 9.0 s, 66832.0 tps, lat 0.461 ms stddev 2.815  
progress: 10.0 s, 67559.0 tps, lat 0.483 ms stddev 3.153  
 24  17  58   1   0   0|   0   628M| 198B  366B|   0     0 | 512k  930k  
 22  16  61   1   0   0|   0   617M| 108B  252B|   0     0 | 507k  934k  
 26  17  56   1   0   0|   0   483M| 108B  324B|   0     0 | 566k 1044k  
 27  17  55   1   0   0|4096B  483M| 108B  152B|   0     0 | 566k 1048k  
 26  17  56   1   0   0|4096B  483M| 108B  340B|   0     0 | 552k 1013k  
 18  13  69   1   0   0|   0   598M| 150B  414B|   0     0 | 361k  630k  
 13  10  76   1   0   0|4096B  609M| 144B  462B|   0     0 | 256k  432k  

2. When the variable-length columns are stored in TOAST and are not updated (running the same update as in 1), there is essentially no impact: no IO amplification, and performance is good. (Disk writes drop to roughly 100-126MB/s in the stats below.)

create table test(id int primary key, c1 text, c2 text, c3 text, c4 text, c5 text);  
  
insert into test select id,   
repeat(md5(clock_timestamp()::text), 30000),   
repeat(md5(clock_timestamp()::text), 30000),   
repeat(md5(clock_timestamp()::text), 30000),   
repeat(md5(clock_timestamp()::text), 30000),   
repeat(md5(clock_timestamp()::text), 30000)   
from generate_series(1,1000) t(id);  
progress: 1.0 s, 195992.1 tps, lat 0.160 ms stddev 0.070  
progress: 2.0 s, 204398.9 tps, lat 0.157 ms stddev 0.087  
progress: 3.0 s, 203810.0 tps, lat 0.157 ms stddev 0.063  
progress: 4.0 s, 206567.5 tps, lat 0.155 ms stddev 0.063  
progress: 5.0 s, 211927.1 tps, lat 0.151 ms stddev 0.056  
 28  18  54   0   0   0|   0   126M|  54B  276B|   0     0 | 671k 1242k  
 28  18  54   0   0   0|   0   102M| 144B  322B|   0     0 | 668k 1235k  
 28  18  54   0   0   0|   0   102M|  54B  324B|   0     0 | 665k 1229k  
 28  18  54   0   0   0|   0   118M|  54B  268B|   0     0 | 665k 1229k  
 29  18  53   0   0   0|   0   104M|  54B  308B|   0     0 | 684k 1266k  

3. Performance suffers only when the TOASTed content itself is updated.

vi test.sql  
\set id random(1,1000)  
update test set id=:id,c1=c2 where id=:id;  
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 32 -j 32 -T 100  
progress: 1.0 s, 18838.8 tps, lat 1.663 ms stddev 1.745  
progress: 2.0 s, 18938.1 tps, lat 1.688 ms stddev 1.248  
progress: 3.0 s, 18339.9 tps, lat 1.745 ms stddev 1.936  
progress: 4.0 s, 19444.1 tps, lat 1.646 ms stddev 1.777  
progress: 5.0 s, 19555.0 tps, lat 1.638 ms stddev 1.762  
progress: 6.0 s, 19198.0 tps, lat 1.667 ms stddev 1.937  
progress: 7.0 s, 19275.1 tps, lat 1.660 ms stddev 1.822  
progress: 8.0 s, 18469.6 tps, lat 1.727 ms stddev 2.888  
  7   9  82   1   0   0|   0   808M|1676B 2100B|   0     0 | 332k  557k  
  7   9  83   1   0   0|   0   739M|2014B 2284B|   0     0 | 325k  544k  
  7   9  83   1   0   0|   0   637M|1720B 1990B|   0     0 | 317k  531k  
  7   9  83   1   0   0|  12k  692M|3386B 3656B|   0     0 | 326k  550k  
  7   8  84   1   0   0|   0   555M|2602B 2872B|   0     0 | 306k  513k  
  7   9  83   1   0   0|   0   670M|2014B 2284B|   0     0 | 313k  522k  

Even so, updating TOASTed values performs better than updating the same values stored in the main heap.

Other

Examples of encode, decode, convert_from:

postgres=# select convert_from(decode('你好', 'escape'), 'UTF8'::name);    
 convert_from     
--------------    
 你好    
(1 row)    
    
postgres=# select decode(clock_timestamp()::text, 'escape');    
                            decode                                
--------------------------------------------------------------    
 \x323031332d30312d31352030373a34353a30382e3430303737322b3038    
(1 row)    
    
postgres=# select encode(decode(array[1,2,3,4,5]::text, 'escape'), 'escape');    
   encode        
-------------    
 {1,2,3,4,5}    
(1 row)    
