digoal
2017-09-18
PostgreSQL , PostGIS , geohash , brin , gist索引 , Greenplum , HybridDB for PostgreSQL
通常一个人的常驻地可能会包括:家、儿女家、双方父母家、情人、异性伴侣家、公司、商圈若干等。
通过对这些数据的运营,可以实现很多业务需求。例如:
1、寻人
《海量用户实时定位和圈人 - 团圆社会公益系统(位置寻人\圈人)》
2、线下广告投放人群圈选,选址,商圈人群画像。
《数据寻龙点穴(空间聚集分析) - 阿里云RDS PostgreSQL最佳实践》
《(新零售)商户网格化(基于位置GIS)运营 - 阿里云RDS PostgreSQL、HybridDB for PostgreSQL最佳实践》
3、基于位置的用户画像透视、基于用户群体的位置透视、以上需求再叠加时间区间条件进行透视。比如
以地图为底,在图上展示每个BOX、GRID(例如每方圆100米,用width_bucket或自定义UDF可以得到这些box id)的平均收入、平均消费。通过颜色深浅来表示收入和消费的数值。
再细一点,可以再分消费领域(饮食、衣服、电子产品、书籍、。。。。),这样是不是一眼就能看出该去哪里开什么类型的店了呢?当然电商发达的今天,一定要考虑线上和线下结合的。
这类空间圈人 + 人物透视的场景中,Greenplum无疑是一个很好的选择(简单粗暴、功能性能都满足),PostgreSQL 10也可以,到底选择PostgreSQL还是Greenplum呢?
1、PostgreSQL 10 适合以10TB ~ 100TB,OLTP为主,OLAP为辅的场景。与Oracle覆盖的场景非常类似。
兼容SQL:2011,百万+级tpmC。
支持多核并行计算。
支持可读写的OSS对象存储外部表。
支持常用类型、扩展数据类型:JSON(B)、Hstore(KV), PostGIS空间数据库、pgrouting(路由,图式搜索)、数组、ltree树类型、HLL估值类型, smlar, imgsmlr等。
支持SQL流计算插件
支持时序插件
支持btree, hash, gin, gist, sp-gist, bloom, brin等索引。
支持plpgsql, sql服务端编程。
支持分析型语法(多维计算、窗口查询)、递归查询(树形查询、图式搜索、等场景)。支持文本全文检索、模糊查询、相似查询、正则查询。支持数组相似查询,图像相似查询。
1.1 适合业务场景:
TB+级OLTP(在线事务处理)+OLAP(实时分析)。
模糊查询、相似搜索、正则搜索
全文检索
物联网
流式数据处理
社交
图式搜索
独立事件分析
冷热分离
异步消息
多值类型、图像特征值 相似搜索
实时数据清洗
GIS应用
任意字段实时搜索
... ...
1.2 主打:功能、稳定性、性能、高可用、可靠性、Oracle兼容性、HTAP。
2、HybridDB for PostgreSQL(Greenplum开源版GPDB改进而来) 适合PB级实时OLAP,非常典型的海量数仓。
兼容SQL:2008,兼容TPC-H,TPC-DS。有数十年的商业化历练经验。
支持可读写的OSS对象存储外部表。
支持常用类型、扩展数据类型:JSON、PostGIS空间数据库、数组、HLL估值类型。
支持bitmap, hash, btree索引。
支持pljava服务端编程。
支持分析型语法(多维计算、窗口查询、MADlib机器学习)、支持全文检索语法。
支持列存储、行存储、压缩、混合存储。
支持4阶段聚合,支持节点间自动重分布。
支持水平扩容。
2.1 适合业务场景:
PB+级实时分析。(传统统计;时间、空间、属性多维属性透视、圈人;任意表、任意维度JOIN;)
2.2 主打:分析型SQL兼容性、功能、稳定性、性能、高可用、扩展性。
1、表结构设计1,宽表(当标签种类在1600以内时)
create table tbl_pos (
uid int8, -- 用户ID
att1 int8, -- 用户标签1
att2 int8, -- 用户标签2
att3 int8, -- 用户标签3
....
pos1 geometry, -- 用户家庭位置
pos2 geometry, -- 用户公司位置
pos3 geometry, -- 用户xx位置
pos4 geometry, -- 用户xxx位置
...
);
或者
create table tbl_tag (
uid int8, -- 用户ID
att1 int8, -- 用户标签1
att2 int8, -- 用户标签2
att3 int8, -- 用户标签3
....
);
create table tbl_pos (
uid int8,
pos_att int2, -- 位置属性,(家、公司、。。。)
pos geometry, -- 位置
);
2、表结构设计2,JSONB作为标签字段,当表签种类大于1600时。
create table tbl_pos (
uid int8, -- 用户ID
att jsonb, -- 用户标签,用JSONB表示
....
pos1 geometry, -- 用户家庭位置
pos2 geometry, -- 用户公司位置
pos3 geometry, -- 用户xx位置
pos4 geometry, -- 用户xxx位置
...
);
3、表结构设计3,数组存标签设计(与结构2的覆盖范围一样),这个设计曾经用在个方案里面:
《恭迎万亿级营销(圈人)潇洒的迈入毫秒时代 - 万亿user_tags级实时推荐系统数据库设计》
create table tbl_tag (
uid int8,
tag text[],
...
pos1 geometry, -- 用户家庭位置
pos2 geometry, -- 用户公司位置
pos3 geometry, -- 用户xx位置
pos4 geometry, -- 用户xxx位置
...
);
4、表结构设计4,标签倒排设计(当标签种类超过1600,并且标签为YES OR NO的类别时,变更标签需要使用合并和UDF的方式,仅仅适合于PostgreSQL),这个设计层级用在这个方案里面:
《基于 阿里云 RDS PostgreSQL 打造实时用户画像推荐系统》
create table tbl_pos (
uid int8,
pos1 geometry, -- 用户家庭位置
pos2 geometry, -- 用户公司位置
pos3 geometry, -- 用户xx位置
pos4 geometry, -- 用户xxx位置
...
);
create table tbl_tag (
tag int,
userbits varbit
);
以上设计各有优劣以及覆盖的场景,看场景进行选择。
接下来就对比PG 10和GPDB,采用第一种设计进行对比,用例中有200种数值类型的标签种类,有10个常用地址。
1、建标签表、位置表,写入10亿标签数据,100亿位置数据。
create extension postgis;
create or replace function ct1 () returns void as $$
declare
sql text := '';
begin
sql := 'create table tbl_tag(uid int8,';
for i in 1..200 loop
sql := sql||'c'||i||' int2 default random()*32767,';
end loop;
sql := rtrim(sql, ',');
sql := sql||')';
execute sql;
end;
$$ language plpgsql strict;
create table tbl_pos(
uid int8 primary key,
pos_att int2,
pos geometry default st_setsrid(st_makepoint(73+random()*62, 3+random()*50), 4326)
);
create table tbl_pos1 (like tbl_pos including all) inherits (tbl_pos);
create table tbl_pos2 (like tbl_pos including all) inherits (tbl_pos);
create table tbl_pos3 (like tbl_pos including all) inherits (tbl_pos);
create table tbl_pos4 (like tbl_pos including all) inherits (tbl_pos);
create table tbl_pos5 (like tbl_pos including all) inherits (tbl_pos);
create table tbl_pos6 (like tbl_pos including all) inherits (tbl_pos);
create table tbl_pos7 (like tbl_pos including all) inherits (tbl_pos);
create table tbl_pos8 (like tbl_pos including all) inherits (tbl_pos);
create table tbl_pos9 (like tbl_pos including all) inherits (tbl_pos);
create table tbl_pos10 (like tbl_pos including all) inherits (tbl_pos);
select ct1();
nohup psql -c "insert into tbl_tag select generate_series(1,1000000000)" >/dev/null 2>&1 &
nohup psql -c "insert into tbl_pos1 select generate_series(1,1000000000),1" >/dev/null 2>&1 &
nohup psql -c "insert into tbl_pos2 select generate_series(1,1000000000),2" >/dev/null 2>&1 &
nohup psql -c "insert into tbl_pos3 select generate_series(1,1000000000),3" >/dev/null 2>&1 &
nohup psql -c "insert into tbl_pos4 select generate_series(1,1000000000),4" >/dev/null 2>&1 &
nohup psql -c "insert into tbl_pos5 select generate_series(1,1000000000),5" >/dev/null 2>&1 &
nohup psql -c "insert into tbl_pos6 select generate_series(1,1000000000),6" >/dev/null 2>&1 &
nohup psql -c "insert into tbl_pos7 select generate_series(1,1000000000),7" >/dev/null 2>&1 &
nohup psql -c "insert into tbl_pos8 select generate_series(1,1000000000),8" >/dev/null 2>&1 &
nohup psql -c "insert into tbl_pos9 select generate_series(1,1000000000),9" >/dev/null 2>&1 &
nohup psql -c "insert into tbl_pos10 select generate_series(1,1000000000),10" >/dev/null 2>&1 &
2、根据位置整理位置表,(因为用户的位置数据是通过行为算出来的,并且通常变化非常小,所以可以视为半静态数据,适合整理)。
create table tbl_pos_1 (like tbl_pos including all, check (pos_att=1)) inherits(tbl_pos);
create table tbl_pos_2 (like tbl_pos including all, check (pos_att=2)) inherits(tbl_pos);
create table tbl_pos_3 (like tbl_pos including all, check (pos_att=3)) inherits(tbl_pos);
create table tbl_pos_4 (like tbl_pos including all, check (pos_att=4)) inherits(tbl_pos);
create table tbl_pos_5 (like tbl_pos including all, check (pos_att=5)) inherits(tbl_pos);
create table tbl_pos_6 (like tbl_pos including all, check (pos_att=6)) inherits(tbl_pos);
create table tbl_pos_7 (like tbl_pos including all, check (pos_att=7)) inherits(tbl_pos);
create table tbl_pos_8 (like tbl_pos including all, check (pos_att=8)) inherits(tbl_pos);
create table tbl_pos_9 (like tbl_pos including all, check (pos_att=9)) inherits(tbl_pos);
create table tbl_pos_10 (like tbl_pos including all, check (pos_att=10)) inherits(tbl_pos);
-- 10位已经精确到米级, 足够使用
nohup psql -c "insert into tbl_pos_1 select * from tbl_pos1 order by pos_att, st_geohash(pos, 10); " >/dev/null 2>&1 &
nohup psql -c "insert into tbl_pos_2 select * from tbl_pos2 order by pos_att, st_geohash(pos, 10); " >/dev/null 2>&1 &
nohup psql -c "insert into tbl_pos_3 select * from tbl_pos3 order by pos_att, st_geohash(pos, 10); " >/dev/null 2>&1 &
nohup psql -c "insert into tbl_pos_4 select * from tbl_pos4 order by pos_att, st_geohash(pos, 10); " >/dev/null 2>&1 &
nohup psql -c "insert into tbl_pos_5 select * from tbl_pos5 order by pos_att, st_geohash(pos, 10); " >/dev/null 2>&1 &
nohup psql -c "insert into tbl_pos_6 select * from tbl_pos6 order by pos_att, st_geohash(pos, 10); " >/dev/null 2>&1 &
nohup psql -c "insert into tbl_pos_7 select * from tbl_pos7 order by pos_att, st_geohash(pos, 10); " >/dev/null 2>&1 &
nohup psql -c "insert into tbl_pos_8 select * from tbl_pos8 order by pos_att, st_geohash(pos, 10); " >/dev/null 2>&1 &
nohup psql -c "insert into tbl_pos_9 select * from tbl_pos9 order by pos_att, st_geohash(pos, 10); " >/dev/null 2>&1 &
nohup psql -c "insert into tbl_pos_10 select * from tbl_pos10 order by pos_att, st_geohash(pos, 10);" >/dev/null 2>&1 &
3、创建位置表的geohash brin块级索引,加速空间数据圈选。
create index idx_tbl_pos1_pos on tbl_pos1 using brin( pos );
create index idx_tbl_pos2_pos on tbl_pos2 using brin( pos );
create index idx_tbl_pos3_pos on tbl_pos3 using brin( pos );
create index idx_tbl_pos4_pos on tbl_pos4 using brin( pos );
create index idx_tbl_pos5_pos on tbl_pos5 using brin( pos );
create index idx_tbl_pos6_pos on tbl_pos6 using brin( pos );
create index idx_tbl_pos7_pos on tbl_pos7 using brin( pos );
create index idx_tbl_pos8_pos on tbl_pos8 using brin( pos );
create index idx_tbl_pos9_pos on tbl_pos9 using brin( pos );
create index idx_tbl_pos10_pos on tbl_pos10 using brin( pos );
解除未整理表的继承关系(只需要保留整理后的数据)。
alter table tbl_pos1 no inherit tbl_pos;
alter table tbl_pos2 no inherit tbl_pos;
alter table tbl_pos3 no inherit tbl_pos;
alter table tbl_pos4 no inherit tbl_pos;
alter table tbl_pos5 no inherit tbl_pos;
alter table tbl_pos6 no inherit tbl_pos;
alter table tbl_pos7 no inherit tbl_pos;
alter table tbl_pos8 no inherit tbl_pos;
alter table tbl_pos9 no inherit tbl_pos;
alter table tbl_pos10 no inherit tbl_pos;
4、创建标签数据btree索引。
for ((i=1;i<=200;i++))
do
nohup psql -c "set maintenance_work_mem='512MB'; create index idx_tbl_tag_$i on tbl_tag (c$i);" >/dev/null 2>&1 &
done
5、空间使用情况
数据,10亿记录标签表 424 GB,100亿记录位置表 640 GB。
postgres=# \dt+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+-----------------+-------+----------+------------+-------------
public | tbl_pos | table | postgres | 8192 bytes |
public | tbl_pos_1 | table | postgres | 64 GB |
public | tbl_pos_10 | table | postgres | 64 GB |
public | tbl_pos_2 | table | postgres | 64 GB |
public | tbl_pos_3 | table | postgres | 64 GB |
public | tbl_pos_4 | table | postgres | 64 GB |
public | tbl_pos_5 | table | postgres | 64 GB |
public | tbl_pos_6 | table | postgres | 64 GB |
public | tbl_pos_7 | table | postgres | 64 GB |
public | tbl_pos_8 | table | postgres | 64 GB |
public | tbl_pos_9 | table | postgres | 64 GB |
public | tbl_tag | table | postgres | 424 GB |
索引,
标签表单列索引21GB,总共4200 GB。
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+----------------------+-------+----------+-----------------+--------+-------------
public | idx_tbl_tag_1 | index | postgres | tbl_tag | 21 GB |
位置表单列BRIN索引2.7MB,总共27MB。
postgres=# \di+
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+----------------------+-------+----------+-----------------+---------+-------------
public | idx_tbl_pos10_pos | index | postgres | tbl_pos10 | 2728 kB |
public | idx_tbl_pos1_pos | index | postgres | tbl_pos1 | 2728 kB |
public | idx_tbl_pos2_pos | index | postgres | tbl_pos2 | 2728 kB |
public | idx_tbl_pos3_pos | index | postgres | tbl_pos3 | 2728 kB |
public | idx_tbl_pos4_pos | index | postgres | tbl_pos4 | 2728 kB |
public | idx_tbl_pos5_pos | index | postgres | tbl_pos5 | 2728 kB |
public | idx_tbl_pos6_pos | index | postgres | tbl_pos6 | 2728 kB |
public | idx_tbl_pos7_pos | index | postgres | tbl_pos7 | 2728 kB |
public | idx_tbl_pos8_pos | index | postgres | tbl_pos8 | 2728 kB |
public | idx_tbl_pos9_pos | index | postgres | tbl_pos9 | 2728 kB |
1、100亿空间数据,按空间圈出约1000万人,400毫秒。
实际可以按游标返回
postgres=# select count(*) from tbl_pos where pos_att=1 and st_within(pos, st_setsrid(st_makebox2d(st_makepoint(120,5), st_makepoint(125.5,10.5)),4326));
count
---------
9757154
(1 row)
Time: 399.846 ms
执行计划如下,BRIN索引并行扫描
postgres=# explain select count(*) from tbl_pos where pos_att=1 and st_within(pos, st_setsrid(st_makebox2d(st_makepoint(120,5), st_makepoint(125.5,10.5)),4326));
QUERY PLAN
-------------------------------------------------------------------------------------------
Finalize Aggregate (cost=10016617202.13..10016617202.14 rows=1 width=8)
-> Gather (cost=10016617202.04..10016617202.05 rows=32 width=8)
Workers Planned: 32
-> Partial Aggregate (cost=10016617202.04..10016617202.05 rows=1 width=8)
-> Parallel Append (cost=0.00..10016617175.99 rows=10418 width=0)
-> Parallel Seq Scan on tbl_pos (cost=10000000000.00..10000000000.00 rows=1 width=0)
Filter: (('0103000020E610000001000000050000000000000000005E4000000000000014400000000000005E4000000000000025400000000000605F4000000000000025400000000000605F4000000000000014400000000000005E400000000000001440'::geometry ~ pos) AND (pos_att = 1) AND _st_contains('0103000020E610000001000000050000000000000000005E4000000000000014400000000000005E4000000000000025400000000000605F4000000000000025400000000000605F4000000000000014400000000000005E400000000000001440'::geometry, pos))
-> Parallel Bitmap Heap Scan on tbl_pos_1 (cost=2591.99..16617175.99 rows=10417 width=0)
Recheck Cond: ('0103000020E610000001000000050000000000000000005E4000000000000014400000000000005E4000000000000025400000000000605F4000000000000025400000000000605F4000000000000014400000000000005E400000000000001440'::geometry ~ pos)
Filter: ((pos_att = 1) AND _st_contains('0103000020E610000001000000050000000000000000005E4000000000000014400000000000005E4000000000000025400000000000605F4000000000000025400000000000605F4000000000000014400000000000005E400000000000001440'::geometry, pos))
-> Bitmap Index Scan on idx_tbl_pos1_posbn (cost=0.00..2508.66 rows=1000000000 width=0)
Index Cond: ('0103000020E610000001000000050000000000000000005E4000000000000014400000000000005E4000000000000025400000000000605F4000000000000025400000000000605F4000000000000014400000000000005E400000000000001440'::geometry ~ pos)
(12 rows)
2、100亿空间数据,按空间圈出约1000万人,JOIN 10亿标签数据,透视这群人的标签属性,7秒。
执行计划如下,使用了并行BRIN扫描和并行nestloop JOIN。
postgres=# explain analyze select c1,count(*),avg(c2),max(c3) from tbl_tag t2 join (select uid from tbl_pos where pos_att=1 and st_within(pos, st_setsrid(st_makebox2d(st_makepoint(120,5), st_makepoint(125.5,10.5)),4326))) t1 on (t1.uid=t2.uid) group by c1;
QUERY PLAN
--------------------------------------------------------------------------
Finalize GroupAggregate (cost=10016650419.09..10016663638.78 rows=32873 width=44) (actual time=5417.105..6404.328 rows=32768 loops=1)
Group Key: t2.c1
-> Gather Merge (cost=10016650419.09..10016659894.42 rows=333344 width=44) (actual time=5417.071..6212.057 rows=1081163 loops=1)
Workers Planned: 32
Workers Launched: 32
-> Partial GroupAggregate (cost=10016650418.26..10016650652.64 rows=10417 width=44) (actual time=5392.695..5506.923 rows=32763 loops=33)
Group Key: t2.c1
-> Sort (cost=10016650418.26..10016650444.30 rows=10417 width=6) (actual time=5392.676..5442.197 rows=295671 loops=33)
Sort Key: t2.c1
Sort Method: quicksort Memory: 30914kB
-> Nested Loop (cost=0.57..10016649723.09 rows=10417 width=6) (actual time=8.413..5277.270 rows=295671 loops=33)
-> Parallel Append (cost=0.00..10016617175.99 rows=10418 width=8) (actual time=8.342..407.141 rows=295671 loops=33)
-> Parallel Seq Scan on tbl_pos (cost=10000000000.00..10000000000.00 rows=1 width=8) (actual time=0.000..0.000 rows=0 loops=33)
Filter: (('0103000020E610000001000000050000000000000000005E4000000000000014400000000000005E4000000000000025400000000000605F4000000000000025400000000000605F4000000000000014400000000000005E400000000000001440'::geometry ~ pos) AND (pos_att = 1) AND _st_contains('0103000020E610000001000000050000000000000000005E4000000000000014400000000000005E4000000000000025400000000000605F4000000000000025400000000000605F4000000000000014400000000000005E400000000000001440'::geometry, pos))
-> Parallel Bitmap Heap Scan on tbl_pos_1 (cost=2591.99..16617175.99 rows=10417 width=8) (actual time=8.341..381.660 rows=295671 loops=33)
Recheck Cond: ('0103000020E610000001000000050000000000000000005E4000000000000014400000000000005E4000000000000025400000000000605F4000000000000025400000000000605F4000000000000014400000000000005E400000000000001440'::geometry ~ pos)
Rows Removed by Index Recheck: 32474
Filter: ((pos_att = 1) AND _st_contains('0103000020E610000001000000050000000000000000005E4000000000000014400000000000005E4000000000000025400000000000605F4000000000000025400000000000605F4000000000000014400000000000005E400000000000001440'::geometry, pos))
Heap Blocks: lossy=3688
-> Bitmap Index Scan on idx_tbl_pos1_posbn (cost=0.00..2508.66 rows=1000000000 width=0) (actual time=19.182..19.182 rows=902400 loops=1)
Index Cond: ('0103000020E610000001000000050000000000000000005E4000000000000014400000000000005E4000000000000025400000000000605F4000000000000025400000000000605F4000000000000014400000000000005E400000000000001440'::geometry ~ pos)
-> Index Scan using idx on tbl_tag t2 (cost=0.57..3.11 rows=1 width=14) (actual time=0.016..0.016 rows=1 loops=9757154)
Index Cond: (uid = tbl_pos.uid)
Planning time: 0.690 ms
Execution time: 7098.662 ms
(25 rows)
3、10亿标签数据,按标签圈人约1000万,并行bitmap Or扫描,14.5秒。
postgres=# select count(*) from tbl_tag where c1=1 or c2 between 1 and 100 or c13=100 or c4 between 1 and 200;
count
---------
9196602
(1 row)
Time: 14491.705 ms (00:14.492)
执行计划如下
postgres=# explain analyze select count(*) from tbl_tag where c1=1 or c2 between 1 and 100 or c13=100 or c4 between 1 and 200;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=10080648.93..10080648.94 rows=1 width=8) (actual time=16170.482..16170.482 rows=1 loops=1)
-> Gather (cost=10080648.84..10080648.85 rows=32 width=8) (actual time=16148.118..16170.475 rows=33 loops=1)
Workers Planned: 32
Workers Launched: 32
-> Partial Aggregate (cost=10080648.84..10080648.85 rows=1 width=8) (actual time=16135.754..16135.754 rows=1 loops=33)
-> Parallel Bitmap Heap Scan on tbl_tag (cost=132192.09..10079940.37 rows=283386 width=0) (actual time=2174.588..16107.422 rows=278685 loops=33)
Recheck Cond: ((c1 = 1) OR ((c2 >= 1) AND (c2 <= 100)) OR (c13 = 100) OR ((c4 >= 1) AND (c4 <= 200)))
Rows Removed by Index Recheck: 4311619
Heap Blocks: exact=3516 lossy=293793
-> BitmapOr (cost=132192.09..132192.09 rows=9087533 width=0) (actual time=2094.773..2094.773 rows=0 loops=1)
-> Bitmap Index Scan on idx_tbl_tag_1 (cost=0.00..333.62 rows=30020 width=0) (actual time=9.718..9.718 rows=30332 loops=1)
Index Cond: (c1 = 1)
-> Bitmap Index Scan on idx_tbl_tag_2 (cost=0.00..43418.50 rows=3200783 width=0) (actual time=787.952..787.952 rows=3053594 loops=1)
Index Cond: ((c2 >= 1) AND (c2 <= 100))
-> Bitmap Index Scan on idx_tbl_tag_13 (cost=0.00..332.99 rows=29936 width=0) (actual time=3.662..3.662 rows=30554 loops=1)
Index Cond: (c13 = 100)
-> Bitmap Index Scan on idx_tbl_tag_4 (cost=0.00..79038.62 rows=5826795 width=0) (actual time=1293.437..1293.437 rows=6101279 loops=1)
Index Cond: ((c4 >= 1) AND (c4 <= 200))
Planning time: 0.289 ms
Execution time: 16733.719 ms
(20 rows)
阿里云的多维metascan特性可以解决这个扫描数据块过多的问题,性能提升到500毫秒以内。
4、10亿标签数据,按标签圈人约1000万,透视这群人的空间属性,hashjoin, 并行,203秒。
explain (analyze,verbose,timing,costs,buffers)
select st_geohash(t1.pos,6), count(*) from
tbl_pos_1 t1 join (select uid from tbl_tag where c1=1 or c2 between 1 and 100 or c13=100 or c4 between 1 and 200) t2
on (t1.uid=t2.uid)
group by st_geohash(t1.pos, 6);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize GroupAggregate (cost=292262288.63..292678044.97 rows=9068342 width=40) (actual time=187392.324..197157.012 rows=8421096 loops=1)
Output: (st_geohash(t1.pos, 6)), count(*)
Group Key: (st_geohash(t1.pos, 6))
Buffers: shared hit=38723554 read=635308
-> Gather Merge (cost=292262288.63..292519348.94 rows=9068352 width=40) (actual time=187392.317..194293.700 rows=9171920 loops=1)
Output: (st_geohash(t1.pos, 6)), (PARTIAL count(*))
Workers Planned: 32
Workers Launched: 32
Buffers: shared hit=38723554 read=635308
-> Partial GroupAggregate (cost=292262287.80..292267955.52 rows=283386 width=40) (actual time=187348.727..187501.151 rows=277937 loops=33)
Output: (st_geohash(t1.pos, 6)), PARTIAL count(*)
Group Key: (st_geohash(t1.pos, 6))
Buffers: shared hit=1272770058 read=11675191
Worker 0: actual time=187342.771..187498.250 rows=282452 loops=1
Buffers: shared hit=39055272 read=348022
...... 并行聚合
-> Sort (cost=292262287.80..292262996.26 rows=283386 width=32) (actual time=187348.715..187401.757 rows=278685 loops=33)
Output: (st_geohash(t1.pos, 6))
Sort Key: (st_geohash(t1.pos, 6))
Sort Method: quicksort Memory: 25570kB
Buffers: shared hit=1272770058 read=11675191
Worker 0: actual time=187342.758..187396.443 rows=283206 loops=1
Buffers: shared hit=39055272 read=348022
...... 并行SORT
-> Hash Join (cost=10413383.91..292236623.78 rows=283386 width=32) (actual time=79890.153..186716.320 rows=278685 loops=33)
Output: st_geohash(t1.pos, 6)
Hash Cond: (t1.uid = tbl_tag.uid)
Buffers: shared hit=1272769802 read=11675191
Worker 0: actual time=81406.113..186712.149 rows=283206 loops=1
Buffers: shared hit=39055264 read=348022
...... 并行索引扫描
-> Parallel Index Scan using idx1 on public.tbl_pos_1 t1 (cost=0.57..281390010.62 rows=31250000 width=40) (actual time=0.040..92949.279 rows=30303030 loops=33)
Output: t1.uid, t1.pos_att, t1.pos
Buffers: shared hit=991056941 read=11675191
Worker 0: actual time=0.078..91228.338 rows=30782430 loops=1
Buffers: shared hit=30518510 read=348022
...... 并行HASH
-> Hash (cost=10300029.06..10300029.06 rows=9068342 width=8) (actual time=77789.991..77789.991 rows=9196602 loops=33)
Output: tbl_tag.uid
Buckets: 16777216 Batches: 1 Memory Usage: 490315kB
Buffers: shared hit=281712413
Worker 0: actual time=79153.913..79153.913 rows=9196602 loops=1
Buffers: shared hit=8536740
...... 并行bitmap扫描
-> Bitmap Heap Scan on public.tbl_tag (cost=132192.09..10300029.06 rows=9068342 width=8) (actual time=44896.981..74587.551 rows=9196602 loops=33)
Output: tbl_tag.uid
Recheck Cond: ((tbl_tag.c1 = 1) OR ((tbl_tag.c2 >= 1) AND (tbl_tag.c2 <= 100)) OR (tbl_tag.c13 = 100) OR ((tbl_tag.c4 >= 1) AND (tbl_tag.c4 <= 200)))
Heap Blocks: exact=8511538
Buffers: shared hit=281712413
Worker 0: actual time=45358.544..75896.906 rows=9196602 loops=1
Buffers: shared hit=8536740
...... 并行bitmap扫描
-> BitmapOr (cost=132192.09..132192.09 rows=9087533 width=0) (actual time=38429.522..38429.522 rows=0 loops=33)
Buffers: shared hit=831659
Worker 0: actual time=38869.151..38869.151 rows=0 loops=1
Buffers: shared hit=25202
...... 并行bitmap扫描
-> Bitmap Index Scan on idx_tbl_tag_1 (cost=0.00..333.62 rows=30020 width=0) (actual time=9.922..9.922 rows=30332 loops=33)
Index Cond: (tbl_tag.c1 = 1)
Buffers: shared hit=2999
Worker 0: actual time=10.045..10.045 rows=30332 loops=1
Buffers: shared hit=91
...... 并行bitmap扫描
-> Bitmap Index Scan on idx_tbl_tag_2 (cost=0.00..43418.50 rows=3200783 width=0) (actual time=9529.886..9529.886 rows=3053594 loops=33)
Index Cond: ((tbl_tag.c2 >= 1) AND (tbl_tag.c2 <= 100))
Buffers: shared hit=275483
Worker 0: actual time=9710.639..9710.639 rows=3053594 loops=1
Buffers: shared hit=8348
...... 并行bitmap扫描
-> Bitmap Index Scan on idx_tbl_tag_13 (cost=0.00..332.99 rows=29936 width=0) (actual time=9019.691..9019.691 rows=30554 loops=33)
Index Cond: (tbl_tag.c13 = 100)
Buffers: shared hit=2903
Worker 0: actual time=9143.024..9143.024 rows=30554 loops=1
Buffers: shared hit=88
...... 并行bitmap扫描
-> Bitmap Index Scan on idx_tbl_tag_4 (cost=0.00..79038.62 rows=5826795 width=0) (actual time=19870.013..19870.013 rows=6101279 loops=33)
Index Cond: ((tbl_tag.c4 >= 1) AND (tbl_tag.c4 <= 200))
Buffers: shared hit=550274
Worker 0: actual time=20005.432..20005.432 rows=6101279 loops=1
Buffers: shared hit=16675
......
Planning time: 0.302 ms
Execution time: 203637.896 ms
(754 rows)
create or replace function ct1 () returns void as $$
declare
sql text := '';
begin
sql := 'create table tbl_tag(uid int8,';
for i in 1..200 loop
sql := sql||'c'||i||' int2 default random()*32767,';
end loop;
sql := rtrim(sql, ',');
sql := sql||') with (APPENDONLY=true, ORIENTATION=column, COMPRESSTYPE=zlib, CHECKSUM=false)';
execute sql;
end;
$$ language plpgsql strict;
select ct1();
create table tbl_pos(
uid int8,
pos_att int2,
pos geometry default st_setsrid(st_makepoint(73+random()*62, 3+random()*50), 4326)
)
with (APPENDONLY=true, ORIENTATION=row, COMPRESSTYPE=zlib, CHECKSUM=false)
partition by list (pos_att)
(
partition p1 values (1),
partition p2 values (2),
partition p3 values (3),
partition p4 values (4),
partition p5 values (5),
partition p6 values (6),
partition p7 values (7),
partition p8 values (8),
partition p9 values (9),
partition p10 values (10)
)
;
nohup psql -c "insert into tbl_tag select generate_series(1,1000000000)" >/dev/null 2>&1 &
nohup psql -c "insert into tbl_pos select generate_series(1,1000000000),1" >/dev/null 2>&1 &
nohup psql -c "insert into tbl_pos select generate_series(1,1000000000),2" >/dev/null 2>&1 &
nohup psql -c "insert into tbl_pos select generate_series(1,1000000000),3" >/dev/null 2>&1 &
nohup psql -c "insert into tbl_pos select generate_series(1,1000000000),4" >/dev/null 2>&1 &
nohup psql -c "insert into tbl_pos select generate_series(1,1000000000),5" >/dev/null 2>&1 &
nohup psql -c "insert into tbl_pos select generate_series(1,1000000000),6" >/dev/null 2>&1 &
nohup psql -c "insert into tbl_pos select generate_series(1,1000000000),7" >/dev/null 2>&1 &
nohup psql -c "insert into tbl_pos select generate_series(1,1000000000),8" >/dev/null 2>&1 &
nohup psql -c "insert into tbl_pos select generate_series(1,1000000000),9" >/dev/null 2>&1 &
nohup psql -c "insert into tbl_pos select generate_series(1,1000000000),10" >/dev/null 2>&1 &
nohup psql -c "copy (select uid,pos_att,st_geohash(pos, 10) from tbl_pos) to stdout"|psql -c "copy tbl_pos from stdin" >/dev/null 2>&1 &
使用阿里云metascan特性(类似PostgreSQL BRIN索引),加速tbl_pos.pos字段的过滤。
测试Greenplum时,由于环境限制没有使用PostGIS空间插件,使用geohash text code代替,测试结果如下。
1、100亿空间数据,按空间圈出约1000万人,21秒。
select count(*) from tbl_pos where pos_att=1 and st_within(pos, st_setsrid(st_makebox2d(st_makepoint(120,5), st_makepoint(125.5,10.5)),4326));
postgres=# select count(*) from tbl_pos where pos_att=1 and pos between 't9m' and 'tbbd' ;
count
---------
9635855
(1 row)
Time: 21371.543 ms
2、100亿空间数据,按空间圈出约1000万人,JOIN 10亿标签数据,透视这群人的标签属性,29.3秒。
explain analyze select c1,count(*),avg(c2),max(c3) from tbl_tag t2 join (select uid from tbl_pos where pos_att=1 and st_within(pos, st_setsrid(st_makebox2d(st_makepoint(120,5), st_makepoint(125.5,10.5)),4326))) t1 on (t1.uid=t2.uid) group by c1;
postgres=# explain analyze select c1,count(*),avg(c2),max(c3) from tbl_tag t2 join (select uid from tbl_pos where pos_att=1 and pos between 't9m' and 'tbbd') t1 on (t1.uid=t2.uid) group by c1;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather Motion 48:1 (slice2; segments: 48) (cost=43547383.92..43547955.26 rows=32648 width=44)
Rows out: 32768 rows at destination with 28854 ms to end, start offset by 448 ms.
-> HashAggregate (cost=43547383.92..43547955.26 rows=681 width=44)
Group By: t2.c1
Rows out: Avg 682.7 rows x 48 workers. Max 689 rows (seg6) with 0.001 ms to first row, 11625 ms to end, start offset by 466 ms.
-> Redistribute Motion 48:48 (slice1; segments: 48) (cost=43546078.00..43546730.96 rows=681 width=44)
Hash Key: t2.c1
Rows out: Avg 32694.9 rows x 48 workers at destination. Max 33008 rows (seg31) with 17172 ms to end, start offset by 494 ms.
-> HashAggregate (cost=43546078.00..43546078.00 rows=681 width=44)
Group By: t2.c1
Rows out: Avg 32694.9 rows x 48 workers. Max 32719 rows (seg22) with 0.009 ms to first row, 82 ms to end, start offset by 491 ms.
-> Hash Join (cost=18492191.00..43506178.00 rows=83125 width=6)
Hash Cond: t2.uid = postgres.tbl_pos.uid
Rows out: Avg 200747.0 rows x 48 workers. Max 201863 rows (seg25) with 0.044 ms to first row, 25419 ms to end, start offset by 494 ms.
Executor memory: 6274K bytes avg, 6309K bytes max (seg25).
Work_mem used: 6274K bytes avg, 6309K bytes max (seg25). Workfile: (0 spilling, 0 reused)
-> Append-only Columnar Scan on tbl_tag t2 (cost=0.00..22464112.00 rows=20833334 width=14)
Rows out: 0 rows (seg0) with 0.004 ms to end, start offset by 501 ms.
-> Hash (cost=17993441.00..17993441.00 rows=831251 width=8)
Rows in: (No row requested) 0 rows (seg0) with 0 ms to end.
-> Append (cost=0.00..17993441.00 rows=831251 width=8)
Rows out: 0 rows (seg0) with 15 ms to end, start offset by 503 ms.
-> Append-only Scan on tbl_pos_1_prt_p1 tbl_pos (cost=0.00..17993441.00 rows=831251 width=8)
Filter: pos_att = 1 AND pos >= 't9m'::text AND pos <= 'tbbd'::text
Rows out: Avg 200747.0 rows x 48 workers. Max 201863 rows (seg25) with 48 ms to end, start offset by 494 ms.
Slice statistics:
(slice0) Executor memory: 501K bytes.
(slice1) Executor memory: 1613K bytes avg x 48 workers, 1613K bytes max (seg0). Work_mem: 6309K bytes max.
(slice2) Executor memory: 524K bytes avg x 48 workers, 524K bytes max (seg0).
Statement statistics:
Memory used: 128000K bytes
Settings: optimizer=off
Optimizer status: legacy query optimizer
Total runtime: 29302.351 ms
(34 rows)
Time: 29306.897 ms
3、10亿标签数据,按标签圈人约1000万,3.4秒。
select count(*) from tbl_tag where c1=1 or c2 between 1 and 100 or c13=100 or c4 between 1 and 200;
postgres=# select count(*) from tbl_tag where c1=1 or c2 between 1 and 100 or c13=100 or c4 between 1 and 200;
count
---------
9198749
(1 row)
Time: 3426.337 ms
4、10亿标签数据,按标签圈人约1000万,透视这群人的空间属性,26.2秒。
explain (analyze,verbose,timing,costs,buffers)
select st_geohash(t1.pos,6), count(*) from
tbl_pos_1 t1 join (select uid from tbl_tag where c1=1 or c2 between 1 and 100 or c13=100 or c4 between 1 and 200) t2
on (t1.uid=t2.uid)
group by st_geohash(t1.pos, 6);
postgres=# explain analyze
postgres-# select substring(pos,1,6), count(*) from
postgres-# tbl_pos t1 join (select uid from tbl_tag where c1=1 or c2 between 1 and 100 or c13=100 or c4 between 1 and 200) t2
postgres-# on (t1.uid=t2.uid and t1.pos_att=1)
postgres-# group by substring(pos,1,6);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather Motion 48:1 (slice2; segments: 48) (cost=53124325.91..53135748.71 rows=913824 width=40)
Rows out: 8421444 rows at destination with 25714 ms to end, start offset by 449 ms.
-> HashAggregate (cost=53124325.91..53135748.71 rows=19038 width=40)
Group By: "?column1?"
Rows out: Avg 175446.8 rows x 48 workers. Max 176265 rows (seg2) with 0.001 ms to first row, 8243 ms to end, start offset by 466 ms.
-> Redistribute Motion 48:48 (slice1; segments: 48) (cost=53090057.51..53110618.55 rows=19038 width=40)
Hash Key: unnamed_attr_1
Rows out: Avg 191284.2 rows x 48 workers at destination. Max 192297 rows (seg37) with 15634 ms to end, start offset by 529 ms.
-> HashAggregate (cost=53090057.51..53092342.07 rows=19038 width=40)
Group By: "substring"(t1.pos, 1, 6)
Rows out: Avg 191284.2 rows x 48 workers. Max 191966 rows (seg1) with 0.006 ms to first row, 134 ms to end, start offset by 468 ms.
-> Hash Join (cost=37578340.02..53085488.39 rows=19039 width=11)
Hash Cond: t1.uid = tbl_tag.uid
Rows out: Avg 191640.6 rows x 48 workers. Max 192331 rows (seg1) with 0.039 ms to first row, 18171 ms to end, start offset by 468 ms.
Executor memory: 5989K bytes avg, 6011K bytes max (seg1).
Work_mem used: 5989K bytes avg, 6011K bytes max (seg1). Workfile: (0 spilling, 0 reused)
-> Append (cost=0.00..12993441.00 rows=20833334 width=19)
Rows out: 0 rows (seg0) with 1228 ms to end, start offset by 531 ms.
-> Append-only Scan on tbl_pos_1_prt_p1 t1 (cost=0.00..12993441.00 rows=20833334 width=19)
Filter: pos_att = 1
Rows out: Avg 20833333.3 rows x 48 workers. Max 20833547 rows (seg37) with 0.005 ms to first row, 0.006 ms to end, start offset by 531 ms.
-> Hash (cost=37464112.00..37464112.00 rows=190381 width=8)
Rows in: (No row requested) 0 rows (seg0) with 0 ms to end.
-> Append-only Columnar Scan on tbl_tag (cost=0.00..37464112.00 rows=190381 width=8)
Filter: c1 = 1 OR (c2 >= 1 AND c2 <= 100) OR c13 = 100 OR (c4 >= 1 AND c4 <= 200)
Rows out: 0 rows (seg0) with 57 ms to end, start offset by 528 ms.
Slice statistics:
(slice0) Executor memory: 487K bytes.
(slice1) Executor memory: 1725K bytes avg x 48 workers, 1725K bytes max (seg0). Work_mem: 6011K bytes max.
(slice2) Executor memory: 524K bytes avg x 48 workers, 524K bytes max (seg0).
Statement statistics:
Memory used: 128000K bytes
Settings: optimizer=off
Optimizer status: legacy query optimizer
Total runtime: 26166.164 ms
(35 rows)
Time: 26170.031 ms
对比:
数据库 | 表 | 记录数 | SIZE |
---|---|---|---|
PostgreSQL 10 | 标签表(201个字段) | 10 亿 | 424 GB |
Greenplum | 标签表(201个字段) | 10 亿 | 369 GB |
PostgreSQL 10 | 位置表(12个字段) | 100 亿 | 640 GB |
Greenplum | 位置表(12个字段) | 100 亿 | 150 GB |
数据库 | 索引 | 索引类型 | SIZE |
---|---|---|---|
PostgreSQL 10 | 标签表 | btree | 4200 GB |
PostgreSQL 10 | 位置表 | brin | 27 MB |
数据库 | 业务 | 耗时 |
---|---|---|
PostgreSQL 10 | 100亿空间数据,按空间圈出约1000万人 | 400 毫秒 |
Greenplum | 100亿空间数据,按空间圈出约1000万人 | 21 秒 |
PostgreSQL 10 | 100亿空间数据,按空间圈出约1000万人,JOIN 10亿标签数据,透视这群人的标签属性 | 7 秒 |
Greenplum | 100亿空间数据,按空间圈出约1000万人,JOIN 10亿标签数据,透视这群人的标签属性 | 29.3 秒 |
PostgreSQL 10 | 10亿标签数据,按标签圈人约1000万 | 14.5 秒(能通过metascan优化到500毫秒以内) |
Greenplum | 10亿标签数据,按标签圈人约1000万 | 3.4 秒 |
PostgreSQL 10 | 10亿标签数据,按标签圈人约1000万,透视这群人的空间属性 | 203秒 (PG 11版本 merge join partial scan可以大幅提高性能) |
Greenplum | 10亿标签数据,按标签圈人约1000万,透视这群人的空间属性 | 26.2 秒 |
PG 10 通过brin索引,bitmap scan,在部分场景的性能已经超过同等资源的Greenplum。
引入列存引擎、parallel hash补丁、range merge join不对,在同等资源下,另外几个场景的性能会做到和Greenplum差不多(甚至更好)。
1、parallel hash join
https://commitfest.postgresql.org/14/871/
2、range merge join
https://commitfest.postgresql.org/14/1106/
3、parallel append scan
https://commitfest.postgresql.org/14/987/
但是Greenplum的强项是更大的数据量,例如通过打散,并行玩转PB级的实时分析。
而PG,更加适合以TP为主,AP为辅的场景,即Oracle数据库覆盖到的场景。
Greenplum和PostgreSQL两个产品的选择还是请参考前面所述。
您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森.