digoal
2023-08-24
PostgreSQL , PolarDB , 数据库 , 教学
非常欢迎数据库用户提出场景给我, 在此issue回复即可, 一起来建设沉浸式数据库学习教学素材库, 帮助开发者用好数据库, 提升开发者职业竞争力, 同时为企业降本提效.
本文的实验可以使用永久免费的阿里云云起实验室来完成.
如果你本地有docker环境也可以把镜像拉到本地来做实验:
x86_64机器使用以下docker image:
Apple Chip机器使用以下docker image:
在营销场景中, 通常会对用户的属性、行为等数据进行统计分析, 生成用户的标签, 也就是常说的用户画像.
标签举例: 男性、女性、年轻人、大学生、90后、司机、白领、健身达人、博士、技术达人、科技产品爱好者、2胎妈妈、老师、浙江省、15天内逛过手机电商店铺、... ...
有了用户画像, 在营销场景中一个重要的营销手段是根据条件选中目标人群, 进行精准营销.
例如圈选出包含这些标签的人群: 白领、科技产品爱好者、浙江省、技术达人、15天内逛过手机电商店铺 .
这个实验的目的是在有画像的基础上, 如何快速根据标签组合进行人群圈选 .
设计1张标签元数据表, 后面的用户画像表从这张标签表随机抽取标签. 业务查询时也从这里搜索存在的标签并进行圈选条件的组合, 得到对应的标签ID组合.
drop table if exists tbl_tag;
create table tbl_tag (
tid int primary key, -- 标签id
tag text, -- 标签名
info text -- 标签描述
);
假设有1万个标签, 写入标签元数据表.
insert into tbl_tag select id, md5(id::text), md5(random()::text) from generate_series(1, 10000) id;
创建2个函数, 产生若干的标签. 用来模拟产生每个用户对应的标签数据. 分别返回字符串和数组类型.
第一个函数, 随机提取若干个标签, 始终包含1-100的热门标签8个, 返回用户标签字符串:
create or replace function get_tags_text(int) returns text as $$
with a as (select string_agg(tid::text, ',') s from tbl_tag where tid = any (array(select ceil(random()*100)::int from generate_series(1,8) group by 1)))
, b as (select string_agg(tid::text, ',') s from tbl_tag where tid = any (array(select ceil(100+random()*9900)::int from generate_series(1,$1) group by 1)))
select ','||a.s||','||b.s||',' from a,b;
$$ language sql strict;
得到类似这样的结果:
postgres=# select get_tags_text(10);
get_tags_text
----------------------------------------------------------------------
,11,12,39,44,45,59,272,1001,1322,1402,2514,6888,7404,8922,9200,9409,
(1 row)
postgres=# select get_tags_text(10);
get_tags_text
------------------------------------------------------------------------
,12,34,52,55,71,79,88,302,582,1847,3056,5156,8231,8542,8572,8747,9727,
(1 row)
第二个函数, 随机提取若干个标签, 始终包含1-100的热门标签8个, 返回用户标签数组:
create or replace function get_tags_arr(int) returns int[] as $$
with a as (select array_agg(tid) s from tbl_tag where tid = any (array(select ceil(random()*100)::int from generate_series(1,8) group by 1)))
, b as (select array_agg(tid) s from tbl_tag where tid = any (array(select ceil(100+random()*9900)::int from generate_series(1,$1) group by 1)))
select a.s||b.s from a,b;
$$ language sql strict;
得到类似这样的结果:
postgres=# select * from get_tags_arr(10);
get_tags_arr
----------------------------------------------------------------------------
{13,35,42,61,67,69,76,78,396,2696,3906,4356,5064,5711,7363,9417,9444,9892}
(1 row)
postgres=# select * from get_tags_arr(10);
get_tags_arr
-------------------------------------------------------------------------
{2,10,20,80,84,85,89,3410,3515,4159,4182,5217,6549,6775,7289,9141,9431}
(1 row)
传统数据库没有数组类型, 所以需要用字符串存储标签.
创建用户画像表
drop table if exists tbl_users;
create unlogged table tbl_users ( -- 为便于加速生成测试数据, 使用unlogged table
uid int primary key, -- 用户id
tags text -- 该用户拥有的标签 , 使用字符串类型
);
创建100万个用户, 用户被贴的标签数从32到256个, 随机产生, 其中8个为热门标签(例如性别、年龄段等都属于热门标签).
insert into tbl_users select id, get_tags_text(ceil(24+random()*224)::int) from generate_series(1,1000000) id;
测试如下, 分别搜索包含如下标签组合的用户:
- 2
- 2,8
- 2,2696
- 2,4356,5064,5711,7363,9417,9444
- 4356,5064,5711,7363,9417,9444
使用如下SQL:
select uid from tbl_users where tags like '%,2,%';
select uid from tbl_users where tags like '%,2,%' or tags like '%,8,%';
select uid from tbl_users where tags like '%,2,%' or tags like '%,2696,%';
select uid from tbl_users where tags like '%,2,%' or tags like '%,4356,%' or tags like '%,5064,%' or tags like '%,5711,%' or tags like '%,7363,%' or tags like '%,9417,%' or tags like '%,9444,%' ;
select uid from tbl_users where tags like '%,4356,%' or tags like '%,5064,%' or tags like '%,5711,%' or tags like '%,7363,%' or tags like '%,9417,%' or tags like '%,9444,%' ;
查看以上SQL运行的执行计划和耗时如下:
postgres=# explain analyze select uid from tbl_users where tags like '%,2,%';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Seq Scan on tbl_users (cost=0.00..103268.00 rows=80808 width=4) (actual time=0.018..1108.805 rows=77454 loops=1)
Filter: (tags ~~ '%,2,%'::text)
Rows Removed by Filter: 922546
Planning Time: 1.095 ms
Execution Time: 1110.267 ms
(5 rows)
postgres=# explain analyze select uid from tbl_users where tags like '%,2,%' or tags like '%,8,%';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Seq Scan on tbl_users (cost=0.00..105768.00 rows=127232 width=4) (actual time=0.029..2001.379 rows=149132 loops=1)
Filter: ((tags ~~ '%,2,%'::text) OR (tags ~~ '%,8,%'::text))
Rows Removed by Filter: 850868
Planning Time: 1.209 ms
Execution Time: 2004.062 ms
(5 rows)
postgres=# explain analyze select uid from tbl_users where tags like '%,2,%' or tags like '%,2696,%';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Seq Scan on tbl_users (cost=0.00..105768.00 rows=90093 width=4) (actual time=0.035..2058.797 rows=90084 loops=1)
Filter: ((tags ~~ '%,2,%'::text) OR (tags ~~ '%,2696,%'::text))
Rows Removed by Filter: 909916
Planning Time: 1.190 ms
Execution Time: 2060.434 ms
(5 rows)
postgres=# explain analyze select uid from tbl_users where tags like '%,2,%' or tags like '%,4356,%' or tags like '%,5064,%' or tags like '%,5711,%' or tags like '%,7363,%' or tags like '%,9417,%' or tags like '%,9444,%' ;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on tbl_users (cost=0.00..118268.00 rows=135482 width=4) (actual time=0.024..6765.315 rows=150218 loops=1)
Filter: ((tags ~~ '%,2,%'::text) OR (tags ~~ '%,4356,%'::text) OR (tags ~~ '%,5064,%'::text) OR (tags ~~ '%,5711,%'::text) OR (tags ~~ '%,7363,%'::text) OR (tags ~~ '%,9417,%'::text) OR (tags ~~ '%,9444,%'::text))
Rows Removed by Filter: 849782
Planning Time: 4.344 ms
Execution Time: 6767.990 ms
(5 rows)
postgres=# explain analyze select uid from tbl_users where tags like '%,4356,%' or tags like '%,5064,%' or tags like '%,5711,%' or tags like '%,7363,%' or tags like '%,9417,%' or tags like '%,9444,%' ;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on tbl_users (cost=0.00..115768.00 rows=59480 width=4) (actual time=0.112..6206.775 rows=78827 loops=1)
Filter: ((tags ~~ '%,4356,%'::text) OR (tags ~~ '%,5064,%'::text) OR (tags ~~ '%,5711,%'::text) OR (tags ~~ '%,7363,%'::text) OR (tags ~~ '%,9417,%'::text) OR (tags ~~ '%,9444,%'::text))
Rows Removed by Filter: 921173
Planning Time: 4.223 ms
Execution Time: 6208.191 ms
(5 rows)
第一种方法没有用到任何的索引, 每次请求都要扫描用户画像表的所有记录, 计算每一个LIKE的算子, 性能比较差.
为了提升查询性能, 我们可以使用gin索引和pg_trgm插件, 支持字符串内的模糊查询索引加速.
复用方法1的数据, 创建gin索引, 支持索引加速模糊查询.
create extension pg_trgm;
create index on tbl_users using gin (tags gin_trgm_ops);
使用索引后, 查看执行计划和耗时如下:
postgres=# explain analyze select uid from tbl_users where tags like '%,2,%';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl_users (cost=555.93..58686.88 rows=80808 width=4) (actual time=30.315..76.314 rows=77454 loops=1)
Recheck Cond: (tags ~~ '%,2,%'::text)
Heap Blocks: exact=53210
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..535.73 rows=80808 width=0) (actual time=22.967..22.967 rows=77454 loops=1)
Index Cond: (tags ~~ '%,2,%'::text)
Planning Time: 0.991 ms
Execution Time: 78.163 ms
(7 rows)
postgres=# explain analyze select uid from tbl_users where tags like '%,2,%' or tags like '%,8,%';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl_users (cost=983.56..87215.27 rows=127232 width=4) (actual time=48.651..811.842 rows=149132 loops=1)
Recheck Cond: ((tags ~~ '%,2,%'::text) OR (tags ~~ '%,8,%'::text))
Rows Removed by Index Recheck: 299658
Heap Blocks: exact=41915 lossy=33158
-> BitmapOr (cost=983.56..983.56 rows=131313 width=0) (actual time=43.554..43.554 rows=0 loops=1)
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..535.73 rows=80808 width=0) (actual time=24.923..24.923 rows=77454 loops=1)
Index Cond: (tags ~~ '%,2,%'::text)
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..384.22 rows=50505 width=0) (actual time=18.629..18.629 rows=77054 loops=1)
Index Cond: (tags ~~ '%,8,%'::text)
Planning Time: 1.496 ms
Execution Time: 814.748 ms
(11 rows)
postgres=# explain analyze select uid from tbl_users where tags like '%,2,%' or tags like '%,2696,%';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl_users (cost=958.67..64006.30 rows=90093 width=4) (actual time=75.859..900.779 rows=90084 loops=1)
Recheck Cond: ((tags ~~ '%,2,%'::text) OR (tags ~~ '%,2696,%'::text))
Rows Removed by Index Recheck: 348263
Heap Blocks: exact=39411 lossy=33155
-> BitmapOr (cost=958.67..958.67 rows=90909 width=0) (actual time=71.980..71.981 rows=0 loops=1)
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..535.73 rows=80808 width=0) (actual time=26.486..26.487 rows=77454 loops=1)
Index Cond: (tags ~~ '%,2,%'::text)
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..377.89 rows=10101 width=0) (actual time=45.492..45.492 rows=62326 loops=1)
Index Cond: (tags ~~ '%,2696,%'::text)
Planning Time: 1.479 ms
Execution Time: 902.637 ms
(11 rows)
postgres=# explain analyze select uid from tbl_users where tags like '%,2,%' or tags like '%,4356,%' or tags like '%,5064,%' or tags like '%,5711,%' or tags like '%,7363,%' or tags like '%,9417,%' or tags like '%,9444,%' ;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl_users (cost=3041.18..100880.75 rows=135482 width=4) (actual time=210.772..4047.148 rows=150218 loops=1)
Recheck Cond: ((tags ~~ '%,2,%'::text) OR (tags ~~ '%,4356,%'::text) OR (tags ~~ '%,5064,%'::text) OR (tags ~~ '%,5711,%'::text) OR (tags ~~ '%,7363,%'::text) OR (tags ~~ '%,9417,%'::text) OR (tags ~~ '%,9444,%'::text))
Rows Removed by Index Recheck: 422706
Heap Blocks: exact=56868 lossy=33226
-> BitmapOr (cost=3041.18..3041.18 rows=141614 width=0) (actual time=205.898..205.899 rows=0 loops=1)
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..535.73 rows=80808 width=0) (actual time=24.656..24.656 rows=77454 loops=1)
Index Cond: (tags ~~ '%,2,%'::text)
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..428.40 rows=20202 width=0) (actual time=45.014..45.014 rows=62615 loops=1)
Index Cond: (tags ~~ '%,4356,%'::text)
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..377.89 rows=10101 width=0) (actual time=22.680..22.680 rows=39025 loops=1)
Index Cond: (tags ~~ '%,5064,%'::text)
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..428.40 rows=20202 width=0) (actual time=28.809..28.809 rows=62697 loops=1)
Index Cond: (tags ~~ '%,5711,%'::text)
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..377.89 rows=10101 width=0) (actual time=28.646..28.646 rows=62647 loops=1)
Index Cond: (tags ~~ '%,7363,%'::text)
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..327.89 rows=100 width=0) (actual time=28.361..28.361 rows=62172 loops=1)
Index Cond: (tags ~~ '%,9417,%'::text)
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..327.89 rows=100 width=0) (actual time=27.729..27.730 rows=62821 loops=1)
Index Cond: (tags ~~ '%,9444,%'::text)
Planning Time: 4.517 ms
Execution Time: 4050.040 ms
(21 rows)
postgres=# explain analyze select uid from tbl_users where tags like '%,4356,%' or tags like '%,5064,%' or tags like '%,5711,%' or tags like '%,7363,%' or tags like '%,9417,%' or tags like '%,9444,%' ;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl_users (cost=2357.58..50755.97 rows=59480 width=4) (actual time=209.115..3689.534 rows=78827 loops=1)
Recheck Cond: ((tags ~~ '%,4356,%'::text) OR (tags ~~ '%,5064,%'::text) OR (tags ~~ '%,5711,%'::text) OR (tags ~~ '%,7363,%'::text) OR (tags ~~ '%,9417,%'::text) OR (tags ~~ '%,9444,%'::text))
Rows Removed by Index Recheck: 455241
Heap Blocks: exact=55903 lossy=33218
-> BitmapOr (cost=2357.58..2357.58 rows=60806 width=0) (actual time=204.235..204.236 rows=0 loops=1)
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..428.40 rows=20202 width=0) (actual time=57.485..57.485 rows=62615 loops=1)
Index Cond: (tags ~~ '%,4356,%'::text)
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..377.89 rows=10101 width=0) (actual time=26.156..26.157 rows=39025 loops=1)
Index Cond: (tags ~~ '%,5064,%'::text)
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..428.40 rows=20202 width=0) (actual time=33.539..33.539 rows=62697 loops=1)
Index Cond: (tags ~~ '%,5711,%'::text)
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..377.89 rows=10101 width=0) (actual time=30.136..30.136 rows=62647 loops=1)
Index Cond: (tags ~~ '%,7363,%'::text)
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..327.89 rows=100 width=0) (actual time=28.794..28.794 rows=62172 loops=1)
Index Cond: (tags ~~ '%,9417,%'::text)
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..327.89 rows=100 width=0) (actual time=28.122..28.122 rows=62821 loops=1)
Index Cond: (tags ~~ '%,9444,%'::text)
Planning Time: 3.860 ms
Execution Time: 3691.329 ms
(19 rows)
很显然你不能满足于前面的模糊查询索引带来的性能提升, 特别是当and条件非常多时, 模糊查询的索引也要被多次扫描并使用bitmap进行合并, 性能不好. (以上方法对于一个模糊查询条件性能提升是非常明显的.)
PolarDB和PostgreSQL都支持数组类型, 用数组存储标签, 支持gin索引可以加速数组的包含查询.
创建用户画像表, 使用数组存储标签字段.
drop table if exists tbl_users;
create unlogged table tbl_users ( -- 为便于加速生成测试数据, 使用unlogged table
uid int primary key, -- 用户id
tags int[] -- 该用户拥有的标签 , 使用数组类型
);
创建100万个用户, 用户被贴的标签数从32到256个, 随机产生, 其中8个为热门标签(例如性别、年龄段等都属于热门标签).
insert into tbl_users select id, get_tags_arr(ceil(24+random()*224)::int) from generate_series(1,1000000) id;
create index on tbl_users using gin (tags);
搜索包含如下标签组合的用户:
- 2
- 2,8
- 2,2696
- 2,4356,5064,5711,7363,9417,9444
- 4356,5064,5711,7363,9417,9444
数组匹配的 SQL 语句如下:
select uid from tbl_users where tags @> array[2];
select uid from tbl_users where tags @> array[2,8];
select uid from tbl_users where tags @> array[2,2696];
select uid from tbl_users where tags @> array[2,4356,5064,5711,7363,9417,9444];
select uid from tbl_users where tags @> array[4356,5064,5711,7363,9417,9444];
使用数组类型和gin索引后, 查看执行计划和耗时如下:
postgres=# explain analyze select uid from tbl_users where tags @> array[2];
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl_users (cost=437.95..53717.07 rows=76333 width=4) (actual time=24.031..69.706 rows=77641 loops=1)
Recheck Cond: (tags @> '{2}'::integer[])
Heap Blocks: exact=50231
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..418.86 rows=76333 width=0) (actual time=15.026..15.026 rows=77641 loops=1)
Index Cond: (tags @> '{2}'::integer[])
Planning Time: 1.137 ms
Execution Time: 74.015 ms
(7 rows)
postgres=# explain analyze select uid from tbl_users where tags @> array[2,8];
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl_users (cost=49.97..6172.63 rows=5847 width=4) (actual time=10.745..18.272 rows=5303 loops=1)
Recheck Cond: (tags @> '{2,8}'::integer[])
Heap Blocks: exact=5133
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..48.51 rows=5847 width=0) (actual time=10.081..10.081 rows=5303 loops=1)
Index Cond: (tags @> '{2,8}'::integer[])
Planning Time: 0.256 ms
Execution Time: 18.561 ms
(7 rows)
postgres=# explain analyze select uid from tbl_users where tags @> array[2,2696];
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl_users (cost=21.27..443.58 rows=382 width=4) (actual time=2.872..4.662 rows=1003 loops=1)
Recheck Cond: (tags @> '{2,2696}'::integer[])
Heap Blocks: exact=999
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..21.18 rows=382 width=0) (actual time=2.729..2.729 rows=1003 loops=1)
Index Cond: (tags @> '{2,2696}'::integer[])
Planning Time: 0.246 ms
Execution Time: 4.750 ms
(7 rows)
postgres=# explain analyze select uid from tbl_users where tags @> array[2,4356,5064,5711,7363,9417,9444];
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl_users (cost=64.38..65.50 rows=1 width=4) (actual time=5.476..5.478 rows=0 loops=1)
Recheck Cond: (tags @> '{2,4356,5064,5711,7363,9417,9444}'::integer[])
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..64.38 rows=1 width=0) (actual time=5.471..5.472 rows=0 loops=1)
Index Cond: (tags @> '{2,4356,5064,5711,7363,9417,9444}'::integer[])
Planning Time: 0.223 ms
Execution Time: 5.523 ms
(6 rows)
postgres=# explain analyze select uid from tbl_users where tags @> array[4356,5064,5711,7363,9417,9444];
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl_users (cost=55.36..56.47 rows=1 width=4) (actual time=4.476..4.477 rows=0 loops=1)
Recheck Cond: (tags @> '{4356,5064,5711,7363,9417,9444}'::integer[])
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..55.36 rows=1 width=0) (actual time=4.471..4.472 rows=0 loops=1)
Index Cond: (tags @> '{4356,5064,5711,7363,9417,9444}'::integer[])
Planning Time: 0.275 ms
Execution Time: 4.528 ms
(6 rows)
当我们输入一组标签, 如果想放宽圈选条件, 而不仅仅是以上精确包含, 怎么实现? 例如:
- 包含多少个以上的标签
- 有百分之多少以上的标签重合
复用上面的数据, 换上smlar插件和索引来实现以上功能.
创建smlar插件
postgres=# create extension smlar ;
CREATE EXTENSION
换上smlar索引
drop index tbl_users_tags_idx;
create index on tbl_users using gin (tags _int4_sml_ops);
smlar插件的%
操作符用来表达数组近似度过滤.
postgres=# explain analyze select count(*) from tbl_users where tags % array[1,2,3];
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1132.46..1132.47 rows=1 width=8) (actual time=75.613..75.614 rows=1 loops=1)
-> Bitmap Heap Scan on tbl_users (cost=35.25..1129.96 rows=1000 width=0) (actual time=75.609..75.610 rows=0 loops=1)
Recheck Cond: (tags % '{1,2,3}'::integer[])
Rows Removed by Index Recheck: 15059
Heap Blocks: exact=13734
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..35.00 rows=1000 width=0) (actual time=31.466..31.466 rows=15059 loops=1)
Index Cond: (tags % '{1,2,3}'::integer[])
Planning Time: 0.408 ms
Execution Time: 75.687 ms
(9 rows)
smlar插件支持的参数配置如下, 通过配置这些参数, 我们可以控制按什么算法来计算相似度, 相似度的过滤阈值是多少?
postgres=# select name,setting,enumvals,extra_desc from pg_settings where name ~ 'smlar';
name | setting | enumvals | extra_desc
------------------------+---------+------------------------+-----------------------------------------------------------------------------
smlar.idf_plus_one | off | | Calculate idf by log(1+d/df)
smlar.persistent_cache | off | | Cache of global stat is stored in transaction-independent memory
smlar.stattable | | | Named table stores global frequencies of array's elements
smlar.tf_method | n | {n,log,const} | TF method: n => number of entries, log => 1+log(n), const => constant value
smlar.threshold | 0.6 | | Array's with similarity lower than threshold are not similar by % operation
smlar.type | cosine | {cosine,tfidf,overlap} | Type of similarity formula: cosine(default), tfidf, overlap
(6 rows)
接下来我们来实现上述两种近似搜索:
- 包含多少个以上的标签
- 有百分之多少以上的标签重合
包含多少个以上的标签, smlar.type = overlap , smlar.threshold = INT
set smlar.type = overlap;
set smlar.threshold = 1; -- 精确匹配
select uid from tbl_users where tags % array[2];
set smlar.type = overlap;
set smlar.threshold = 1; -- 匹配到1个以上标签
select uid from tbl_users where tags % array[2,8];
set smlar.type = overlap;
set smlar.threshold = 2; -- 精确匹配
select uid from tbl_users where tags % array[2,2696];
set smlar.type = overlap;
set smlar.threshold = 5; -- 匹配到5个以上标签
select uid from tbl_users where tags % array[2,4356,5064,5711,7363,9417,9444];
set smlar.type = overlap;
set smlar.threshold = 6; -- 精确匹配
select uid from tbl_users where tags % array[4356,5064,5711,7363,9417,9444];
使用smlar插件, 数组类型和gin索引后, 查看执行计划和耗时如下:
postgres=# explain analyze select uid from tbl_users where tags % array[2];
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl_users (cost=17.65..1112.36 rows=1000 width=4) (actual time=38.272..306.985 rows=77129 loops=1)
Recheck Cond: (tags % '{2}'::integer[])
Heap Blocks: exact=50082
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..17.40 rows=1000 width=0) (actual time=26.498..26.498 rows=77129 loops=1)
Index Cond: (tags % '{2}'::integer[])
Planning Time: 0.414 ms
Execution Time: 309.182 ms
(7 rows)
postgres=# explain analyze select uid from tbl_users where tags % array[2,8];
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl_users (cost=26.45..1121.16 rows=1000 width=4) (actual time=33.378..790.183 rows=149118 loops=1)
Recheck Cond: (tags % '{2,8}'::integer[])
Rows Removed by Index Recheck: 351146
Heap Blocks: exact=35117 lossy=33064
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..26.20 rows=1000 width=0) (actual time=29.934..29.934 rows=149118 loops=1)
Index Cond: (tags % '{2,8}'::integer[])
Planning Time: 0.924 ms
Execution Time: 794.029 ms
(8 rows)
postgres=# explain analyze select uid from tbl_users where tags % array[2,2696];
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl_users (cost=26.45..1121.16 rows=1000 width=4) (actual time=6.287..26.042 rows=1028 loops=1)
Recheck Cond: (tags % '{2,2696}'::integer[])
Heap Blocks: exact=1019
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..26.20 rows=1000 width=0) (actual time=5.956..5.956 rows=1028 loops=1)
Index Cond: (tags % '{2,2696}'::integer[])
Planning Time: 0.439 ms
Execution Time: 26.218 ms
(7 rows)
postgres=# explain analyze select uid from tbl_users where tags % array[2,4356,5064,5711,7363,9417,9444];
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl_users (cost=70.45..1165.16 rows=1000 width=4) (actual time=13.211..13.212 rows=0 loops=1)
Recheck Cond: (tags % '{2,4356,5064,5711,7363,9417,9444}'::integer[])
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..70.20 rows=1000 width=0) (actual time=13.204..13.205 rows=0 loops=1)
Index Cond: (tags % '{2,4356,5064,5711,7363,9417,9444}'::integer[])
Planning Time: 0.204 ms
Execution Time: 13.264 ms
(6 rows)
postgres=# explain analyze select uid from tbl_users where tags % array[4356,5064,5711,7363,9417,9444];
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl_users (cost=61.65..1156.36 rows=1000 width=4) (actual time=11.364..11.366 rows=0 loops=1)
Recheck Cond: (tags % '{4356,5064,5711,7363,9417,9444}'::integer[])
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..61.40 rows=1000 width=0) (actual time=11.357..11.358 rows=0 loops=1)
Index Cond: (tags % '{4356,5064,5711,7363,9417,9444}'::integer[])
Planning Time: 0.264 ms
Execution Time: 11.447 ms
(6 rows)
有百分之多少以上的标签重合, smlar.type = cosine , smlar.threshold = FLOAT
set smlar.type = cosine;
set smlar.threshold = 1; -- 精确匹配, 目标也必须只包含2, 相当于相等
select uid from tbl_users where tags % array[2];
set smlar.type = cosine;
set smlar.threshold = 0.5; -- 两组标签的交集(重叠标签)占两组标签叠加(并集)后的50%以上
select uid from tbl_users where tags % array[2,8];
set smlar.type = cosine;
set smlar.threshold = 1; -- 精确匹配, 两组标签的交集(重叠标签)占两组标签叠加(并集)后的100%以上
select uid from tbl_users where tags % array[2,2696];
set smlar.type = cosine;
set smlar.threshold = 0.7; -- 两组标签的交集(重叠标签)占两组标签叠加(并集)后的70%以上
select uid from tbl_users where tags % array[2,4356,5064,5711,7363,9417,9444];
set smlar.type = cosine;
set smlar.threshold = 0.9; -- 两组标签的交集(重叠标签)占两组标签叠加(并集)后的90%以上
select uid from tbl_users where tags % array[4356,5064,5711,7363,9417,9444];
使用smlar插件, 数组类型和gin索引后, 查看执行计划和耗时如下:
postgres=# explain analyze select uid from tbl_users where tags % array[2];
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl_users (cost=17.65..1112.36 rows=1000 width=4) (actual time=301.094..301.094 rows=0 loops=1)
Recheck Cond: (tags % '{2}'::integer[])
Rows Removed by Index Recheck: 77129
Heap Blocks: exact=50082
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..17.40 rows=1000 width=0) (actual time=25.659..25.659 rows=77129 loops=1)
Index Cond: (tags % '{2}'::integer[])
Planning Time: 0.252 ms
Execution Time: 301.135 ms
(8 rows)
postgres=# explain analyze select uid from tbl_users where tags % array[2,8];
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl_users (cost=26.45..1121.16 rows=1000 width=4) (actual time=799.554..799.554 rows=0 loops=1)
Recheck Cond: (tags % '{2,8}'::integer[])
Rows Removed by Index Recheck: 500264
Heap Blocks: exact=35117 lossy=33064
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..26.20 rows=1000 width=0) (actual time=43.356..43.356 rows=149118 loops=1)
Index Cond: (tags % '{2,8}'::integer[])
Planning Time: 0.379 ms
Execution Time: 799.611 ms
(8 rows)
postgres=# explain analyze select uid from tbl_users where tags % array[2,2696];
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl_users (cost=26.45..1121.16 rows=1000 width=4) (actual time=26.476..26.478 rows=0 loops=1)
Recheck Cond: (tags % '{2,2696}'::integer[])
Rows Removed by Index Recheck: 1028
Heap Blocks: exact=1019
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..26.20 rows=1000 width=0) (actual time=5.242..5.242 rows=1028 loops=1)
Index Cond: (tags % '{2,2696}'::integer[])
Planning Time: 0.570 ms
Execution Time: 26.570 ms
(8 rows)
postgres=# explain analyze select uid from tbl_users where tags % array[2,4356,5064,5711,7363,9417,9444];
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl_users (cost=70.45..1165.16 rows=1000 width=4) (actual time=16.722..16.723 rows=0 loops=1)
Recheck Cond: (tags % '{2,4356,5064,5711,7363,9417,9444}'::integer[])
Rows Removed by Index Recheck: 8
Heap Blocks: exact=8
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..70.20 rows=1000 width=0) (actual time=16.586..16.587 rows=8 loops=1)
Index Cond: (tags % '{2,4356,5064,5711,7363,9417,9444}'::integer[])
Planning Time: 0.276 ms
Execution Time: 16.795 ms
(8 rows)
postgres=# explain analyze select uid from tbl_users where tags % array[4356,5064,5711,7363,9417,9444];
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl_users (cost=61.65..1156.36 rows=1000 width=4) (actual time=9.755..9.757 rows=0 loops=1)
Recheck Cond: (tags % '{4356,5064,5711,7363,9417,9444}'::integer[])
-> Bitmap Index Scan on tbl_users_tags_idx (cost=0.00..61.40 rows=1000 width=0) (actual time=9.748..9.749 rows=0 loops=1)
Index Cond: (tags % '{4356,5064,5711,7363,9417,9444}'::integer[])
Planning Time: 0.294 ms
Execution Time: 9.811 ms
(6 rows)
方法 | SQL1 耗时 ms | SQL2 耗时 ms | SQL3 耗时 ms | SQL4 耗时 ms | SQL5 耗时 ms |
---|---|---|---|---|---|
传统字符串 + 全表扫描 | 1110.267 | 2004.062 | 2060.434 | 6767.990 | 6208.191 |
传统字符串 + 模糊搜索 + gin索引加速 | 78.163 | 814.748 | 902.637 | 4050.040 | 3691.329 |
数组 + gin索引加速 | 74.015 | 18.561 | 4.750 | 5.523 | 4.528 |
数组(重叠个数)相似度搜索 + gin索引加速 | 309.182 | 794.029 | 26.218 | 13.264 | 11.447 |
数组(重叠占比)相似度搜索 + gin索引加速 | 301.135 | 799.611 | 26.570 | 16.795 | 9.811 |
1、数组类型
2、gin索引
3、smlar 插件
更多算法参考: https://github.com/jirutka/smlar
4、pg_trgm 插件
pg_trgm插件对字符串做了什么处理, 可以利用gin索引加速模糊查询加速?
smlar插件是如何通过索引快速判断两个数组的相似性达到阈值的?
为什么多个模糊匹配条件使用and条件后, 性能下降严重?
为什么使用数组类型后, 标签条件越多性能越好?
如果多个模糊匹配条件是or 条件呢? 性能会下降还是提升?
还有什么业务场景会用到数组?
还有哪些业务场景会用到字符串模糊匹配?
还有什么业务场景非常适合使用数组相似的功能?
除了使用标签匹配来圈选相似目标人群, 还可不可以使用其他方式圈选? 例如向量距离?
使用标签匹配时, 如果我们要排除某些标签, 而不是包含某些标签, 应该如何写sql, 性能又会怎么样呢?
为什么使用smlar进行相似度过滤时, 相似度越高性能越好?
SQL圈选性能和返回符合条件的用户记录数有没有关系? 是什么关系?
当使用pg_trgm进行模糊搜索加速时, 如果字符串中包含wchar(例如中文)时性能如果很差要怎么办? 如果需要模糊搜索的字符只有1个或2个字符时性能如果很差要怎么办?