PostgreSQL 相似人群圈选，人群扩选，向量相似使用实践 - cube

作者

digoal

日期

2018-10-11

背景

PostgreSQL 相似插件非常多，插件的功能以及用法如下：

《PostgreSQL 相似搜索插件介绍大汇总 (cube,rum,pg_trgm,smlar,imgsmlr,pg_similarity) (rum,gin,gist)》

相似人群分析在精准营销，推荐系统中的需求很多。

人的属性可以使用向量来表达，每个值代表一个属性的权重值，通过向量相似，可以得到一群相似的人群。

例如

create table tt (  
  uid int8 primary key,  
  att1 float4,  -- 属性1 的权重值   
  att2 float4,  -- 属性2 的权重值  
  att3 float4,  -- 属性3 的权重值  
  ...  
  attn float4   -- 属性n 的权重值  
);

使用cube表示属性

create table tt (  
  uid int8 primary key,  
  att cube  -- 属性  
);

使用cube或imgsmlr可以达到类似的目的。

a <-> b	float8	Euclidean distance between a and b.  
a <#> b	float8	Taxicab (L-1 metric) distance between a and b.  
a <=> b	float8	Chebyshev (L-inf metric) distance between a and b.

但是如果向量很大(比如属性很多)，建议使用一些方法抽象出典型的特征值，压缩向量。类似图层，图片压缩。实际上imgsmlr就是这么做的：

例如256*256的像素，压缩成4*4的像素，存储为特征值。

例子

1、创建插件

create extension cube;

2、创建测试表

create table tt (id int , c1 cube);

3、创建GIST索引

create index idx_tt_1 on tt using gist(c1);

4、创建生成随机CUBE的函数

create or replace function gen_rand_cube(int,int) returns cube as $$  
  select ('('||string_agg((random()*$2)::text, ',')||')')::cube from generate_series(1,$1);  
$$ language sql strict;

5、CUBE最多存100个维度

postgres=# \set VERBOSITY verbose  
  
postgres=# select gen_rand_cube(1000,10);  
  
ERROR:  22P02: invalid input syntax for cube  
DETAIL:  A cube cannot have more than 100 dimensions.  
CONTEXT:  SQL function "gen_rand_cube" statement 1  
LOCATION:  cube_yyparse, cubeparse.y:111

6、写入测试数据

insert into tt select id, gen_rand_cube(16, 10) from generate_series(1,10000) t(id);

7、通过单个特征值CUBE查询相似人群，以点搜群

select * from tt order by c1 <-> '(1,2,3,4,5,6,7)' limit x;  -- 个体搜群体

8、通过多个特征值CUBE查询相似人群，以群搜群

select * from tt order by c1 <-> '[(1,2,3,4,5,6,7),(1,3,4,5,6,71,3), ...]' limit x; -- 群体搜群体

postgres=# explain select * from tt order by c1 <-> '[(1,2,3),(2,3,4)]' limit 1;  
                                QUERY PLAN                                  
--------------------------------------------------------------------------  
 Limit  (cost=0.11..0.14 rows=1 width=44)  
   ->  Index Scan using idx_tt_1 on tt  (cost=0.11..0.16 rows=2 width=44)  
         Order By: (c1 <-> '(1, 2, 3),(2, 3, 4)'::cube)  
(3 rows)

9、如果需要再计算压缩前的特征值的相似性，可以使用原始值再计算一遍。

《PostgreSQL 遗传学应用 - 矩阵相似距离计算 (欧式距离,...XX距离)》

select *,   
  c1 <-> ?1,   -- c1表示压缩后的特征值浮点数向量，比如（4*4）  
  distance_udf(detail_c1,?2)   -- deatil_c1 表示原始特征值浮点数向量(比如128*128)    
from tt order by c1 <-> ?1 limit xx;