digoal
2024-05-05
PostgreSQL , PolarDB , DuckDB , vector , extension , vss
DuckDB 发布vss 向量插件, 又多了一个涉猎的场景, 项目地址如下 : https://github.com/duckdb/duckdb_vss 根据介绍与PG这个向量插件底层有些类似 : https://github.com/lanterndata/lantern https://github.com/unum-cloud/usearch
vss 向量插件有望内置在5.20发布的v0.10.3版本中. vss 向量插件介绍文档:
向量索引不是什么神秘的东东了, 但是在duckdb这种in-process olap产品中支持向量索引, 意义就非同寻常, 意味着RAG应用可能不需要再拖一个关系数据库或者专业的向量数据库服务, 因为直接加载libduckdb即可.
体验DuckDB vss, 依旧推荐我的万能镜像:
进入docker 容器后
su - postgres
cd ~
git clone --depth 1 https://github.com/duckdb/duckdb_vss
cd duckdb_vss
git clone --depth 1 -b v0.10.2 https://github.com/duckdb/duckdb
make -j4
可选(不太优雅, 后面有优雅的方法), 创建a.sql
, 生成10万条16维度随机向量:
cd ~/duckdb_vss
echo "insert into a (vec) values (gen_arr(16)),(gen_arr(16)),(gen_arr(16)),(gen_arr(16)),(gen_arr(16)),(gen_arr(16)),(gen_arr(16)),(gen_arr(16)),(gen_arr(16)),(gen_arr(16));" > ./a.sql
for ((i=1;i<10000;i++)) do
echo "insert into a (vec) values (gen_arr(16)),(gen_arr(16)),(gen_arr(16)),(gen_arr(16)),(gen_arr(16)),(gen_arr(16)),(gen_arr(16)),(gen_arr(16)),(gen_arr(16)),(gen_arr(16));" >> ./a.sql
done
准备duckdb
cd ~/duckdb_vss
./build/release/duckdb
CREATE OR REPLACE TEMP MACRO gen_arr(dims) as (
with recursive cte as ( select 0 as id, round((random()*100)::float,2) as ele
union all
select id+1 as id, round((random()*100)::float,2) as ele from cte where id < dims-1)
select array_agg(ele) from cte
);
create sequence seq;
CREATE TABLE a (id int default nextval('seq'), vec FLOAT[16]);
可选(不太优雅, 后面有优雅的方法), 导入到duckdb in memory database
begin;
.read a.sql
end;
以上导入不太优雅, 有一种更优雅的方式:
insert into a (vec) select array_agg(f) from
(select i,round(random()*100, 2) f from range (0, 16*100000) t(i)) -- 导入100000条
group by i//16; -- //表示取整的结果, 相当于 (i/16)::int ; 如果取余数使用 i%16 相当于 mod(i,16) ;
Run Time (s): real 0.135 user 0.252736 sys 0.015283
创建向量索引, 使用向量搜索:
D load 'vss';
create index idx on a using hnsw (vec) with (metric = 'cosine');
select gen_arr(16)::float[16];
[15.96, 71.92, 81.69, 46.45, 79.38, 99.68, 23.87, 29.97, 81.5, 79.39, 79.94, 14.07, 18.69, 88.09, 40.91, 91.29]
.timer on
select * from a order by array_cosine_similarity(vec, '[15.96, 71.92, 81.69, 46.45, 79.38, 99.68, 23.87, 29.97, 81.5, 79.39, 79.94, 14.07, 18.69, 88.09, 40.91, 91.29]'::float[16]) limit 10;
┌───────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ id │ vec │
│ int32 │ float[16] │
├───────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 37782 │ [8.58, 72.76, 93.36, 53.85, 82.06, 76.71, 42.34, 29.73, 89.68, 46.34, 90.53, 47.72, 10.39, 75.55, 30.47, 73.5] │
│ 66603 │ [24.7, 92.06, 47.99, 56.92, 76.43, 92.61, 24.2, 22.18, 51.49, 89.63, 66.81, 39.09, 12.69, 76.16, 68.1, 91.82] │
│ 53058 │ [13.94, 69.53, 50.63, 54.99, 74.15, 61.28, 4.07, 29.9, 91.27, 47.82, 41.77, 2.15, 7.73, 76.65, 34.69, 62.68] │
│ 90601 │ [38.14, 54.75, 70.61, 26.27, 41.76, 77.41, 24.52, 56.55, 72.44, 67.16, 74.32, 6.48, 47.41, 70.84, 38.26, 96.35] │
│ 4366 │ [33.03, 68.98, 65.19, 66.84, 84.83, 63.71, 46.52, 30.35, 86.23, 55.99, 55.4, 7.02, 34.45, 73.2, 10.68, 82.51] │
│ 50588 │ [3.3, 63.69, 72.9, 62.28, 69.63, 76.36, 39.87, 55.34, 82.18, 46.39, 91.37, 30.25, 52.2, 63.17, 35.52, 74.51] │
│ 95578 │ [18.03, 68.11, 76.52, 22.12, 44.43, 96.9, 23.78, 7.24, 96.68, 61.5, 77.03, 19.23, 40.75, 69.01, 0.44, 63.72] │
│ 40105 │ [24.75, 89.22, 75.89, 44.39, 77.42, 76.48, 65.36, 23.91, 87.82, 83.0, 66.94, 48.08, 63.37, 91.07, 55.44, 82.87] │
│ 65181 │ [14.41, 85.79, 51.18, 29.0, 90.92, 79.34, 27.27, 18.83, 74.15, 63.09, 71.19, 25.36, 33.59, 35.18, 18.81, 98.44] │
│ 11741 │ [9.9, 54.65, 87.59, 18.57, 84.12, 98.76, 3.8, 11.39, 69.95, 88.63, 96.89, 52.26, 53.76, 69.69, 31.16, 73.61] │
├───────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 10 rows 2 columns │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Run Time (s): real 0.002 user 0.001809 sys 0.000151
执行计划:
explain select * from a order by array_cosine_similarity(vec, '[15.96, 71.92, 81.69, 46.45, 79.38, 99.68, 23.87, 29.97, 81.5, 79.39, 79.94, 14.07, 18.69, 88.09, 40.91, 91.29]'::float[16]) limit 10;
┌─────────────────────────────┐
│┌───────────────────────────┐│
││ Physical Plan ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ #0 │
│ #1 │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ id │
│ vec │
│array_cosine_similarity(vec│
│ , [15.96, 71.92, 81.69, 46│
│.45, 79.38, 99.68, 23.... │
│.97, 81.5, 79.39, 79.94, 14│
│ .07, 18.69, 88.09, 40.91, │
│ 91.29]) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ HNSW_INDEX_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ a (HNSW INDEX SCAN : idx) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ id │
│ vec │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 10 │
└───────────────────────────┘
目前vss支持3种向量距离, 索引选项和距离函数对应关系如下
metric = 'cosine'
array_cosine_similarity
metric = 'l2sq'
array_distance
metric = 'ip'
array_inner_product
和PG向量插件一样, recall准确率, 索引创建耗时和几个参数有关, 有兴趣的朋友可以参考如下文章:
-- hnsw 相关参数:
select name from duckdb_settings() where name like 'hnsw';
-- 文中指出的, 实际测试没有这个参数, 等v0.10.3 release后再看吧. 会话参数, 设置SQL查询时参数, 影响最大可召回记录条数.
set hnsw_ef_search=100;
-- 设置索引其他参数
CREATE INDEX my_hnsw_cosine_index
ON a
USING HNSW (vec)
WITH (metric = 'cosine' , ef_construction = 64);
Option | Default | Description |
---|---|---|
ef_construction |
128 | The number of candidate vertices to consider during the construction of the index. A higher value will result in a more accurate index, but will also increase the time it takes to build the index. |
ef_search |
64 | The number of candidate vertices to consider during the search phase of the index. A higher value will result in a more accurate index, but will also increase the time it takes to perform a search. |
M |
16 | The maximum number of neighbors to keep for each vertex in the graph. A higher value will result in a more accurate index, but will also increase the time it takes to build the index. |
M0 |
2 * M |
The base connectivity, or the number of neighbors to keep for each vertex in the zero-th level of the graph. A higher value will result in a more accurate index, but will also increase the time it takes to build the index. |
Additionally, you can also override the ef_search parameter set at index construction time by setting the SET hnsw_ef_search = ⟨int⟩
configuration option at runtime. This can be useful if you want to trade search performance for accuracy or vice-versa on a per-connection basis. You can also unset the override by calling RESET hnsw_ef_search
.