Skip to content

Latest commit

 

History

History
215 lines (161 loc) · 11.9 KB

20240505_03.md

File metadata and controls

215 lines (161 loc) · 11.9 KB

DuckDB 发布vss 向量插件

作者

digoal

日期

2024-05-05

标签

PostgreSQL , PolarDB , DuckDB , vector , extension , vss


背景

DuckDB 发布vss 向量插件, 又多了一个涉猎的场景, 项目地址如下 : https://github.com/duckdb/duckdb_vss 根据介绍与PG这个向量插件底层有些类似 : https://github.com/lanterndata/lantern https://github.com/unum-cloud/usearch

vss 向量插件有望内置在5.20发布的v0.10.3版本中. vss 向量插件介绍文档:

向量索引不是什么神秘的东东了, 但是在duckdb这种in-process olap产品中支持向量索引, 意义就非同寻常, 意味着RAG应用可能不需要再拖一个关系数据库或者专业的向量数据库服务, 因为直接加载libduckdb即可.

体验DuckDB vss, 依旧推荐我的万能镜像:

进入docker 容器后

su - postgres  
cd ~  
git clone --depth 1 https://github.com/duckdb/duckdb_vss  
cd duckdb_vss  
git clone --depth 1 -b v0.10.2 https://github.com/duckdb/duckdb  
make -j4  

可选(不太优雅, 后面有优雅的方法), 创建a.sql, 生成10万条16维度随机向量:

cd ~/duckdb_vss  
  
echo "insert into a (vec) values (gen_arr(16)),(gen_arr(16)),(gen_arr(16)),(gen_arr(16)),(gen_arr(16)),(gen_arr(16)),(gen_arr(16)),(gen_arr(16)),(gen_arr(16)),(gen_arr(16));" > ./a.sql  
  
for ((i=1;i<10000;i++)) do  
echo "insert into a (vec) values (gen_arr(16)),(gen_arr(16)),(gen_arr(16)),(gen_arr(16)),(gen_arr(16)),(gen_arr(16)),(gen_arr(16)),(gen_arr(16)),(gen_arr(16)),(gen_arr(16));" >> ./a.sql  
done  

准备duckdb

cd ~/duckdb_vss  
./build/release/duckdb    
  
CREATE OR REPLACE TEMP MACRO gen_arr(dims) as (   
  with recursive cte as ( select 0 as id, round((random()*100)::float,2) as ele   
    union all   
    select id+1 as id, round((random()*100)::float,2) as ele from cte where id < dims-1)   
  select array_agg(ele) from cte   
);   
  
create sequence seq;  
  
CREATE TABLE a (id int default nextval('seq'), vec FLOAT[16]);   

可选(不太优雅, 后面有优雅的方法), 导入到duckdb in memory database

begin;  
.read a.sql  
end;  

以上导入不太优雅, 有一种更优雅的方式:

insert into a (vec) select array_agg(f) from
  (select i,round(random()*100, 2) f from range (0, 16*100000) t(i))  -- 导入100000条   
group by i//16;  -- //表示取整的结果, 相当于 (i/16)::int  ; 如果取余数使用 i%16 相当于 mod(i,16)  ;  

Run Time (s): real 0.135 user 0.252736 sys 0.015283  

创建向量索引, 使用向量搜索:

D load 'vss';  
  
create index idx on a using hnsw (vec) with (metric = 'cosine');   
  
select gen_arr(16)::float[16];
[15.96, 71.92, 81.69, 46.45, 79.38, 99.68, 23.87, 29.97, 81.5, 79.39, 79.94, 14.07, 18.69, 88.09, 40.91, 91.29]
  
.timer on

select * from a order by array_cosine_similarity(vec, '[15.96, 71.92, 81.69, 46.45, 79.38, 99.68, 23.87, 29.97, 81.5, 79.39, 79.94, 14.07, 18.69, 88.09, 40.91, 91.29]'::float[16]) limit 10;   

┌───────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│  id   │                                                       vec                                                       │
│ int32 │                                                    float[16]                                                    │
├───────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 37782 │ [8.58, 72.76, 93.36, 53.85, 82.06, 76.71, 42.34, 29.73, 89.68, 46.34, 90.53, 47.72, 10.39, 75.55, 30.47, 73.5]  │
│ 66603 │ [24.7, 92.06, 47.99, 56.92, 76.43, 92.61, 24.2, 22.18, 51.49, 89.63, 66.81, 39.09, 12.69, 76.16, 68.1, 91.82]   │
│ 53058 │ [13.94, 69.53, 50.63, 54.99, 74.15, 61.28, 4.07, 29.9, 91.27, 47.82, 41.77, 2.15, 7.73, 76.65, 34.69, 62.68]    │
│ 90601 │ [38.14, 54.75, 70.61, 26.27, 41.76, 77.41, 24.52, 56.55, 72.44, 67.16, 74.32, 6.48, 47.41, 70.84, 38.26, 96.35] │
│  4366 │ [33.03, 68.98, 65.19, 66.84, 84.83, 63.71, 46.52, 30.35, 86.23, 55.99, 55.4, 7.02, 34.45, 73.2, 10.68, 82.51]   │
│ 50588 │ [3.3, 63.69, 72.9, 62.28, 69.63, 76.36, 39.87, 55.34, 82.18, 46.39, 91.37, 30.25, 52.2, 63.17, 35.52, 74.51]    │
│ 95578 │ [18.03, 68.11, 76.52, 22.12, 44.43, 96.9, 23.78, 7.24, 96.68, 61.5, 77.03, 19.23, 40.75, 69.01, 0.44, 63.72]    │
│ 40105 │ [24.75, 89.22, 75.89, 44.39, 77.42, 76.48, 65.36, 23.91, 87.82, 83.0, 66.94, 48.08, 63.37, 91.07, 55.44, 82.87] │
│ 65181 │ [14.41, 85.79, 51.18, 29.0, 90.92, 79.34, 27.27, 18.83, 74.15, 63.09, 71.19, 25.36, 33.59, 35.18, 18.81, 98.44] │
│ 11741 │ [9.9, 54.65, 87.59, 18.57, 84.12, 98.76, 3.8, 11.39, 69.95, 88.63, 96.89, 52.26, 53.76, 69.69, 31.16, 73.61]    │
├───────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 10 rows                                                                                                       2 columns │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Run Time (s): real 0.002 user 0.001809 sys 0.000151

执行计划:

explain select * from a order by array_cosine_similarity(vec, '[15.96, 71.92, 81.69, 46.45, 79.38, 99.68, 23.87, 29.97, 81.5, 79.39, 79.94, 14.07, 18.69, 88.09, 40.91, 91.29]'::float[16]) limit 10; 

┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             #0            │
│             #1            │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             id            │
│            vec            │
│array_cosine_similarity(vec│
│ , [15.96, 71.92, 81.69, 46│
│.45, 79.38, 99.68, 23....  │
│.97, 81.5, 79.39, 79.94, 14│
│ .07, 18.69, 88.09, 40.91, │
│           91.29])         │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│      HNSW_INDEX_SCAN      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│ a (HNSW INDEX SCAN : idx) │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             id            │
│            vec            │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           EC: 10          │
└───────────────────────────┘           

目前vss支持3种向量距离, 索引选项和距离函数对应关系如下

metric = 'cosine'  
array_cosine_similarity  
  
metric = 'l2sq'  
array_distance  
  
metric = 'ip'  
array_inner_product  

和PG向量插件一样, recall准确率, 索引创建耗时和几个参数有关, 有兴趣的朋友可以参考如下文章:

-- hnsw 相关参数:   
select name from duckdb_settings() where name like 'hnsw';  
  
-- 文中指出的, 实际测试没有这个参数, 等v0.10.3 release后再看吧.  会话参数, 设置SQL查询时参数, 影响最大可召回记录条数.    
set hnsw_ef_search=100;   
  
-- 设置索引其他参数  
CREATE INDEX my_hnsw_cosine_index 
ON a 
USING HNSW (vec) 
WITH (metric = 'cosine' , ef_construction = 64);
Option Default Description
ef_construction 128 The number of candidate vertices to consider during the construction of the index. A higher value will result in a more accurate index, but will also increase the time it takes to build the index.
ef_search 64 The number of candidate vertices to consider during the search phase of the index. A higher value will result in a more accurate index, but will also increase the time it takes to perform a search.
M 16 The maximum number of neighbors to keep for each vertex in the graph. A higher value will result in a more accurate index, but will also increase the time it takes to build the index.
M0 2 * M The base connectivity, or the number of neighbors to keep for each vertex in the zero-th level of the graph. A higher value will result in a more accurate index, but will also increase the time it takes to build the index.

Additionally, you can also override the ef_search parameter set at index construction time by setting the SET hnsw_ef_search = ⟨int⟩ configuration option at runtime. This can be useful if you want to trade search performance for accuracy or vice-versa on a per-connection basis. You can also unset the override by calling RESET hnsw_ef_search.

digoal's wechat