digoal
2021-04-28
PostgreSQL , citus , columnar , partition , timescaledb
What time-series workloads demand from a database:
1. High-speed ingestion.
2. Storage-level compression.
3. Efficient range queries and range analytics.
PostgreSQL's time-series solution:
1. Use partitioned tables, partitioned by time, with the current partition in heap format (high-performance writes; with batch loading, several hundred thousand rows per second on a single node is no problem).
2. Convert historical partitions to columnar storage (using citus's columnar access method), which provides a very good compression ratio. In other words, one table whose old partitions are columnar and whose new partitions are heap.
3. PostgreSQL supports parallel query, which delivers good analytics performance.
4. Point queries and small time-range queries usually target recent data (e.g. the last month); keep those partitions on the heap access method, because it supports indexes and columnar does not support indexes yet.
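For example, a recent heap partition can carry a btree index on the time column to serve point and small-range lookups; columnar partitions cannot. A minimal sketch (the table and partition names here are illustrative, matching the `ts` example later in this post):

```sql
-- Index only the current (heap) partition; columnar partitions
-- do not support indexes, so old partitions stay index-free.
create index on ts_202105 using btree (crt_time);

-- Recent point/small-range queries are then routed to the heap
-- partition and can use the index:
-- select * from ts
--   where crt_time between now() - interval '1 day' and now();
```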
Differences between columnar and heap:
1. columnar is a column-oriented format and supports compression.
2. heap is a row-oriented format; variable-length fields get TOAST compression, but the compression ratio is lower than columnar's.
3. columnar supports only batch writes (COPY or INSERT INTO), so it suits static data converted in one pass.
4. heap supports arbitrary writes and is a good match for highly concurrent small transactions.
5. columnar does not support indexes, so full-table scans are fast, but point queries and small range queries are slow.
6. heap supports indexes; its full-table scans are slower than columnar's, but point queries and small range queries are fast.
7. columnar does not support intra-table parallelism, so parallel computation must come from multiple partitions via a parallel append scan.
8. heap supports single-table parallel query.
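Points 3 and 4 are easy to see in practice. A quick sketch (as of Citus 10, the version current when this was written; the table name is illustrative):

```sql
-- Batch writes work on a columnar table:
create table t_col (id int, v text) using columnar;
insert into t_col select i, md5(i::text) from generate_series(1, 100000) i;

-- But row-level modifications do not; as of Citus 10 the
-- following statements fail with an error on a columnar table:
-- update t_col set v = 'x' where id = 1;
-- delete from t_col where id = 1;
```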
A related product is timescaledb, which is not covered here.
https://github.com/timescale/timescaledb
PostgreSQL 13 is used for the examples below.
1. citus columnar usage example
https://github.com/citusdata/citus
Install the extension
vi postgresql.conf
shared_preload_libraries='citus'
pg_ctl restart -m fast
postgres=# create extension citus;
CREATE EXTENSION
Create a columnar table and load 10 million rows of test data
postgres=# create table cstore (id int, c1 int, c2 int, c3 int, c4 int, c5 int, c6 int, c7 int, crt_time timestamp, info text) using columnar;
CREATE TABLE
postgres=# insert into cstore select generate_series(1,10000000), random()*10, random()*20,random()*100,random()*1000,random()*10000, random()*500, random()*2000, clock_timestamp(), random()::text;
INSERT 0 10000000
Create a heap table and load the same 10 million rows
postgres=# create table heap (like cstore);
CREATE TABLE
postgres=# insert into heap select * from cstore ;
INSERT 0 10000000
Inspect the columnar storage parameters, including compression settings
postgres=# show columnar.compression;
columnar.compression
----------------------
zstd
(1 row)
postgres=# show columnar.compression_level;
columnar.compression_level
----------------------------
3
(1 row)
postgres=# show columnar.stripe_row_limit;
columnar.stripe_row_limit
---------------------------
150000
(1 row)
postgres=# show columnar.chunk_group_row_limit;
columnar.chunk_group_row_limit
--------------------------------
10000
(1 row)
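These GUCs set the database-wide defaults. Citus 10 also ships an alter_columnar_table_set() UDF to override them per table; a sketch (the parameter values below are illustrative, not recommendations):

```sql
-- Override columnar options for one table; any argument may be
-- omitted to keep the current setting.
select alter_columnar_table_set(
  'cstore',
  compression => 'zstd',
  compression_level => 10,
  stripe_row_limit => 150000,
  chunk_group_row_limit => 10000
);
```

Higher compression levels trade CPU at load time for a smaller on-disk footprint; the settings only affect stripes written after the change.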
Compare the on-disk size of the columnar and heap tables
postgres=# \dt+ cstore
List of relations
Schema | Name | Type | Owner | Persistence | Size | Description
--------+--------+-------+----------+-------------+--------+-------------
public | cstore | table | postgres | permanent | 243 MB |
(1 row)
postgres=# \dt+ heap
List of relations
Schema | Name | Type | Owner | Persistence | Size | Description
--------+------+-------+----------+-------------+--------+-------------
public | heap | table | postgres | permanent | 888 MB |
(1 row)
Performance of a simple aggregate query
postgres=# select count(*) from heap ;
count
----------
10000000
(1 row)
Time: 512.432 ms
postgres=# select count(*) from cstore;
count
----------
10000000
(1 row)
Time: 386.415 ms
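count(*) touches no column data, so it understates columnar's advantage. Aggregates that reference only a few columns benefit more, because columnar reads just the referenced columns while heap must read entire rows. A sketch:

```sql
-- columnar reads only the c1 and c4 column data:
select c1, avg(c4) from cstore group by c1;

-- heap must read full rows (all 10 columns) to get c1 and c4:
select c1, avg(c4) from heap group by c1;
```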
Execution plans
postgres=# explain (analyze,verbose,costs,settings,buffers,wal,timing,summary) select count(*) from heap ;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=238637.75..238637.76 rows=1 width=8) (actual time=1142.604..1142.605 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=14684 read=98953
-> Seq Scan on public.heap (cost=0.00..213637.60 rows=10000060 width=0) (actual time=0.018..686.369 rows=10000000 loops=1)
Output: id, c1, c2, c3, c4, c5, c6, c7, crt_time, info
Buffers: shared hit=14684 read=98953
Settings: max_parallel_workers_per_gather = '0'
Planning Time: 0.035 ms
JIT:
Functions: 2
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 0.186 ms, Inlining 0.000 ms, Optimization 0.081 ms, Emission 1.006 ms, Total 1.273 ms
Execution Time: 1142.839 ms
(13 rows)
Time: 1143.141 ms (00:01.143)
postgres=# explain (analyze,verbose,costs,settings,buffers,wal,timing,summary) select count(*) from cstore;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=25000.00..25000.01 rows=1 width=8) (actual time=996.225..996.227 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=3
-> Custom Scan (ColumnarScan) on public.cstore (cost=0.00..0.00 rows=10000000 width=0) (actual time=0.021..530.998 rows=10000000 loops=1)
Columnar Chunk Groups Removed by Filter: 0
Buffers: shared hit=3
Settings: max_parallel_workers_per_gather = '0'
Planning:
Buffers: shared hit=6
Planning Time: 0.127 ms
Execution Time: 996.249 ms
(11 rows)
Time: 996.632 ms
2. Time-series practice
Create a heap partitioned table, partitioned by month
CREATE table ts (id int, c1 int, c2 int, c3 int, c4 int, c5 int, c6 int, c7 int, crt_time timestamp, info text)
PARTITION by range (crt_time);
create table ts_202101 PARTITION OF ts FOR VALUES FROM ('2021-01-01') to ('2021-02-01');
create table ts_202102 PARTITION OF ts FOR VALUES FROM ('2021-02-01') to ('2021-03-01');
create table ts_202103 PARTITION OF ts FOR VALUES FROM ('2021-03-01') to ('2021-04-01');
create table ts_202104 PARTITION OF ts FOR VALUES FROM ('2021-04-01') to ('2021-05-01');
create table ts_202105 PARTITION OF ts FOR VALUES FROM ('2021-05-01') to ('2021-06-01');
Load 10 million rows of test data
insert into ts select generate_series(1,10000000), random()*10, random()*20, random()*30,
random()*100, random()*200, random()*300, random()*1000,
'2021-01-01'::timestamp + ((random()*5*30*24)||' hour')::interval,
random()::text;
At the start of each month, create a columnar table for data older than N months, copy the old partition's rows into it, then swap the columnar partition in for the heap one
begin;
lock table ts_202101 in exclusive mode;
CREATE table ts_columnar_202101 (like ts) using columnar;
insert into ts_columnar_202101 select * from ts_202101;
alter table ts DETACH PARTITION ts_202101;
alter table ts ATTACH PARTITION ts_columnar_202101 FOR VALUES FROM ('2021-01-01') to ('2021-02-01');
end;
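The swap above can be generalized into a small procedure that rotates any one heap partition to columnar. A sketch, assuming PostgreSQL 13 with Citus 10; the procedure name, the `_columnar` suffix convention, and passing the bounds explicitly are all choices of this sketch, not part of citus:

```sql
create or replace procedure rotate_to_columnar(
    parent regclass,  -- partitioned table, e.g. 'ts'
    part   text,      -- heap partition to convert, e.g. 'ts_202102'
    lo     date,      -- partition lower bound (inclusive)
    hi     date)      -- partition upper bound (exclusive)
language plpgsql as $$
declare
  colpart text := part || '_columnar';   -- naming convention assumed here
begin
  -- Block concurrent writes to the partition during the swap.
  execute format('lock table %I in exclusive mode', part);
  execute format('create table %I (like %s) using columnar', colpart, parent);
  execute format('insert into %I select * from %I', colpart, part);
  execute format('alter table %s detach partition %I', parent, part);
  execute format('alter table %s attach partition %I for values from (%L) to (%L)',
                 parent, colpart, lo, hi);
end $$;

-- Usage:
-- call rotate_to_columnar('ts', 'ts_202102', '2021-02-01', '2021-03-01');
```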
The swap completed in just over 2 seconds
INSERT 0 2066564
Time: 2654.981 ms (00:02.655)
After the swap, the 202101 partition of the ts table is a columnar partition.
postgres=# \dt+ ts*
List of relations
Schema | Name | Type | Owner | Persistence | Size | Description
--------+--------------------+-------------------+----------+-------------+---------+-------------
public | ts | partitioned table | postgres | permanent | 0 bytes |
public | ts_202101 | table | postgres | permanent | 184 MB |
public | ts_202102 | table | postgres | permanent | 166 MB |
public | ts_202103 | table | postgres | permanent | 184 MB |
public | ts_202104 | table | postgres | permanent | 178 MB |
public | ts_202105 | table | postgres | permanent | 178 MB |
public | ts_columnar_202101 | table | postgres | permanent | 58 MB |
(7 rows)
postgres=# \d+ ts
Partitioned table "public.ts"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
----------+-----------------------------+-----------+----------+---------+----------+--------------+-------------
id | integer | | | | plain | |
c1 | integer | | | | plain | |
c2 | integer | | | | plain | |
c3 | integer | | | | plain | |
c4 | integer | | | | plain | |
c5 | integer | | | | plain | |
c6 | integer | | | | plain | |
c7 | integer | | | | plain | |
crt_time | timestamp without time zone | | | | plain | |
info | text | | | | extended | |
Partition key: RANGE (crt_time)
Partitions: ts_202102 FOR VALUES FROM ('2021-02-01 00:00:00') TO ('2021-03-01 00:00:00'),
ts_202103 FOR VALUES FROM ('2021-03-01 00:00:00') TO ('2021-04-01 00:00:00'),
ts_202104 FOR VALUES FROM ('2021-04-01 00:00:00') TO ('2021-05-01 00:00:00'),
ts_202105 FOR VALUES FROM ('2021-05-01 00:00:00') TO ('2021-06-01 00:00:00'),
ts_columnar_202101 FOR VALUES FROM ('2021-01-01 00:00:00') TO ('2021-02-01 00:00:00')
Queries against ts are routed correctly to the corresponding partition
postgres=# select count(*) from ts where crt_time between '2021-01-02' and '2021-01-15';
count
--------
866004
(1 row)
Time: 140.311 ms
postgres=# explain select count(*) from ts where crt_time between '2021-01-02' and '2021-01-15';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=766.30..766.31 rows=1 width=8)
-> Custom Scan (ColumnarScan) on ts_columnar_202101 ts (cost=0.00..740.46 rows=10333 width=0)
Filter: ((crt_time >= '2021-01-02 00:00:00'::timestamp without time zone) AND (crt_time <= '2021-01-15 00:00:00'::timestamp without time zone))
(3 rows)
Time: 0.435 ms
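The plan above applies the filter inside the ColumnarScan. columnar keeps min/max metadata per chunk group, so when rows within a stripe correlate with the filter column, whole chunk groups can be skipped without decompression; in this test the crt_time values were generated randomly, so little skipping is expected. A sketch for observing it:

```sql
-- "Columnar Chunk Groups Removed by Filter: N" in the output
-- shows how many chunk groups were skipped via min/max metadata;
-- data loaded in crt_time order would yield a much larger N.
explain (analyze)
select count(*) from ts_columnar_202101
where crt_time < '2021-01-03';
```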
Example of parallelism across partitions (parallel append) with a columnar partition in the mix:
postgres=# explain select count(*) from ts;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=140940.59..140940.60 rows=1 width=8)
-> Gather (cost=140940.27..140940.58 rows=3 width=8)
Workers Planned: 3
-> Partial Aggregate (cost=139940.27..139940.28 rows=1 width=8)
-> Parallel Append (cost=0.00..131875.76 rows=3225806 width=0)
-> Custom Scan (ColumnarScan) on ts_columnar_202101 ts_1 (cost=0.00..0.00 rows=2066564 width=0)
-> Parallel Seq Scan on ts_202103 ts_3 (cost=0.00..30163.08 rows=666908 width=0)
-> Parallel Seq Scan on ts_202104 ts_4 (cost=0.00..29186.01 rows=645301 width=0)
-> Parallel Seq Scan on ts_202105 ts_5 (cost=0.00..29183.47 rows=645247 width=0)
-> Parallel Seq Scan on ts_202102 ts_2 (cost=0.00..27214.16 rows=601716 width=0)
JIT:
Functions: 9
Options: Inlining false, Optimization false, Expressions true, Deforming true
(13 rows)
Time: 1.113 ms
postgres=# select count(*) from ts;
count
----------
10000000
(1 row)
Time: 545.859 ms
The detached heap partition is yours to deal with; drop it whenever you no longer need it.
Summary
1. columnar has a much higher compression ratio.
2. columnar gives better analytic query performance, because the compressed storage means far fewer blocks to scan.
3. columnar currently only parallelizes across multiple partitions via parallel append scan; a single columnar partition cannot be scanned in parallel, so partitioning is recommended for the best performance.
4. For columnar's limitations, see
https://github.com/citusdata/citus/blob/master/src/backend/columnar/README.md
5. When scale-out is not required, citus columnar is a solid choice for high compression ratios and high analytics performance in time-series scenarios. Combined with PostgreSQL partitioned tables, it yields a hybrid row/column partitioned layout that satisfies both high-concurrency ingestion and high-performance analytics with good compression.
Other
https://github.com/citusdata/citus
https://github.com/timescale/timescaledb