digoal
2023-09-22
PostgreSQL , PolarDB , ai , 大模型 , 向量 , 向量相近 , hnsw , ivfflat , 延迟 , 吞吐 , build index , lantern , lantern_extras
lantern: 又一款PolarDB|PG 向量类型+索引+操作符 插件, 宣称性能吊打neon,embedding,hnsw,pgvector
lantern_extras: AI 功能练习插件, 内置大模型, 可从集市下载模型, 使用大模型转换文本|图像为向量, 加速向量索引build
想试一试lantern, lantern_extras? 可以使用这个docker镜像, 已内置这2个插件.
1、《PolarDB|PostgreSQL amd64 image》
# 拉取镜像, 第一次拉取一次即可.
docker pull registry.cn-hangzhou.aliyuncs.com/digoal/opensource_database:pg14_with_exts
# 启动容器
docker run --platform linux/amd64 -d -it -P --cap-add=SYS_PTRACE --cap-add SYS_ADMIN --privileged=true --name pg --shm-size=1g registry.cn-hangzhou.aliyuncs.com/digoal/opensource_database:pg14_with_exts
# 进入容器
docker exec -ti pg bash
# 连接数据库
psql
2、《PolarDB|PostgreSQL arm64 image》
# 拉取镜像, 第一次拉取一次即可.
docker pull registry.cn-hangzhou.aliyuncs.com/digoal/opensource_database:pg14_with_exts_arm64
# 启动容器
docker run -d -it -P --cap-add=SYS_PTRACE --cap-add SYS_ADMIN --privileged=true --name pg --shm-size=1g registry.cn-hangzhou.aliyuncs.com/digoal/opensource_database:pg14_with_exts_arm64
# 进入容器
docker exec -ti pg bash
# 连接数据库
psql
https://github.com/lanterndata/lantern
Lantern builds and uses usearch, a single-header state-of-the-art HNSW implementation.
lantern = 向量类型 + 索引 + 向量距离计算操作符 + 向量距离排序
lantern github中有以下3种向量操作性能benchmark数据
- 按向量距离排序并返回少量记录. 要求低延迟. 对应相似图像,近似问题等搜索.
- 按向量距离排序并返回大量记录. 要求高吞吐. 对应营销人群圈选等场景.
- build index
未来还值得测试性能的场景:
- 在有向量索引的表进行插入、更新的吞吐和延迟.
- 在大量写入后, 立即进行向量检索的性能. 类似GIN fastupdate功能, 启用后可能有大量数据还没有合并进索引, 此时性能会比较差.
- 未来新增向量互补性排序, 不想关查询等的性能.
未来向量应用的场景畅想参考信息:
用法举例:
CREATE EXTENSION lantern;
CREATE TABLE small_world (id integer, vector real[3]);
INSERT INTO small_world (id, vector) VALUES (0, '{0,0,0}'), (1, '{0,0,1}');
CREATE INDEX ON small_world USING hnsw (vector);
Customize hnsw index parameters depending on your vector data, such as the distance function (e.g., dist_l2sq_ops
), index construction parameters, and index search parameters.
CREATE INDEX ON small_world USING hnsw (vector dist_l2sq_ops)
WITH (M=2, ef_construction=10, ef=4, dim=3);
SET enable_seqscan = false;
SELECT id, l2sq_dist(vector, ARRAY[0,0,0]) AS dist
FROM small_world ORDER BY vector <-> ARRAY[0,0,0] LIMIT 1;
Note that the operator <->
is intended exclusively for use with index lookups. If you expect to not use the index in a query, just use the distance function directly (e.g. l2sq_dist(v1, v2)
)
There are four defined operator classes that can be employed during index creation:
- dist_l2sq_ops: Default for the type
real[]
- dist_vec_l2sq_ops: Default for the type
vector
- dist_cos_ops: Applicable to the type
real[]
- dist_hamming_ops: Applicable for the type
integer[]
构造索引参数. The M, ef, and ef_construction
parameters control the performance of the HNSW algorithm for your use case.
- In general, lower
M
andef_construction
speed up index creation at the cost of recall. - Lower
M
andef
improve search speed and result in fewer shared buffer hits at the cost of recall. Tuning these parameters will require experimentation for your specific use case.
特性
- Embedding generation for popular use cases (CLIP model, Hugging Face models, custom model)
- Interoperability with pgvector's data type, so anyone using pgvector can switch to Lantern
- Parallel index creation via an external indexer
- Ability to generate the index graph outside of the database server
- Support for creating the index outside of the database and inside another instance allows you to create an index without interrupting database workflows.
- See all of our helper functions to better enable your workflows
https://github.com/lanterndata/lantern_extras
特性:
- Streaming download of vector embeddings in archived and uncompressed formats
- Streaming download of various standard vector benchmark datasets
- SIFT
- GIST
- Generation of various various embeddings for data stored in Postgres tables without leaving the database
有了模型可以进行推理、结构化(如文本)、非结构化(如图像)数据转成向量的工作.
用法例子:
-- parse the first 41 vectors from the uncompressed .fvecs vector dataset on server machine
SELECT parse_fvecs('/tmp/rustftp/siftsmall/siftsmall_base.fvecs', 41);
-- load the first 10k vectors from the uncompressed vector dataset into a table named sift
SELECT * INTO sift FROM parse_fvecs('/tmp/rustftp/siftsmall/siftsmall_base.fvecs', 10000);
-- load SIFT dataset ground truth vectors into a table from an online ftp archive
SELECT query,
true_nearest INTO sift_ground
FROM get_sift_groundtruth('ftp://host/path/to/siftsmall.tar.gz');
-- generate CLIP embeddings for columns of a postgres table
SELECT abstract,
introduction,
figure1,
clip_text(abstract) AS abstract_ai,
clip_text(introduction) AS introduction_ai,
clip_image(figure1) AS figure1_ai
INTO papers_augmented
FROM papers;
generate embeddings from other models which can be extended
-- generate text embedding
SELECT text_embedding('BAAI/bge-base-en', 'My text input');
-- generate image embedding with image url
SELECT image_embedding('clip/ViT-B-32-visual', 'https://link-to-your-image');
-- generate image embedding with image path (this path should be accessible from postgres server)
SELECT image_embedding('clip/ViT-B-32-visual', '/path/to/image/in-postgres-server');
-- get available list of models
SELECT get_available_models();
从集市添加新模型(例如 从onnx下载模型)
集市:
To add new textual or visual models for generating vector embeddings you can follow this steps:
1、Find the model onnx file or convert it using optimum-cli. Example
optimum-cli export onnx --model BAAI/bge-base-en onnx/
2、Host the onnx model
3、Add model information in MODEL_INFO_MAP
under lantern_extras/src/encoder.rs
4、Add new image/text processor based on model inputs (you can check existing processors they might match the model) and then add the match
arm in process_text
or process_image
function in EncoderService
so it will run corresponding processor for model.
After this your model should be callable from SQL like
SELECT text_embedding('your/model_name', 'Your text');
lantern_extras的做法是把模型load到本地, 而还有一种是调用api的方法使用远程大模型, 例如在阿里云可以使用plpython3u来调用灵积里面的诸多大模型能力, 玩法参考:
可以在外部并行build 向量索引, 并将build好的索引导入到数据库中, 加速build索引的过程.
加速索引build例子:
Usage: lantern-create-index --uri <URI> --table <TABLE> --column <COLUMN> -m <M> --efc <EFC> --ef <EF> -d <DIMS> --metric-kind <METRIC_KIND> --out <OUT>
例如:
lantern-create-index -u "postgresql://localhost/test" -t "small_world" -c "vec" -m 16 --ef 64 --efc 128 -d 3 --metric-kind cos --out /tmp/index.usearch
依赖onnxruntime lib
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://github.com/microsoft/onnxruntime/releases/tag/v1.15.1
使用如下镜像环境进行打包
《amd64 , 使用Dockerfile+docker build制作PolarDB | PostgreSQL 开源docker镜像, 集成大量插件方便学习, 并推送到阿里云镜像服务》
《arm64 , 使用Dockerfile+docker build制作PolarDB | PostgreSQL 开源docker镜像, 集成大量插件方便学习, 并推送到阿里云镜像服务》
1、arm64 打包
# install pgrx
cd /tmp
# 查看依赖pgrx版本? Cargo.toml pgrx = "=0.11.3"
cargo install --locked --version 0.11.3 cargo-pgrx
cargo pgrx init # create PGRX_HOME 后, 立即ctrl^c 退出
cargo pgrx init --pg14=`which pg_config` # 不用管报警
# build and install lantern_extras
cd /tmp
curl -Z --connect-timeout 120 -m 36000 --retry 12000 --retry-delay 5 --retry-max-time 1200 -L https://github.com/microsoft/onnxruntime/releases/download/v1.15.1/onnxruntime-linux-aarch64-1.15.1.tgz -o onnxruntime-linux-aarch64-1.15.1.tgz
tar -zxvf onnxruntime-linux-aarch64-1.15.1.tgz
export ORT_STRATEGY=system
export ORT_LIB_LOCATION=/tmp/onnxruntime-linux-aarch64-1.15.1/lib
git clone --depth 1 https://github.com/lanterndata/lantern_extras
cd /tmp/lantern_extras
# 配置cargo config
vi ~/.cargo/config
[target.aarch64-unknown-linux-gnu]
rustflags = ["-C", "link-args=-Wl,-rpath,/tmp/onnxruntime-linux-aarch64-1.15.1/lib"]
[target.'cfg(target_os="linux")']
# Postgres symbols won't be available until runtime
rustflags = ["-Clink-arg=-Wl,-undefined,dynamic_lookup"]
# replace aarch64-unknown-linux-gnu with your architecture.
# You can get it by running rustc -vV | sed -n 's|host: ||p'
# 打包 lantern_extras
cargo pgrx package --package lantern_extras
# 安装build index命令行工具:
cd /tmp/lantern_extras
cargo install --path lantern_cli
Usage: lantern-cli create-index --uri <URI> --table <TABLE> --column <COLUMN> -m <M> --efc <EFC> --ef <EF> -d <DIMS> --metric-kind <METRIC_KIND> --out <OUT>
root@3501de034e72:/tmp/lantern_extras# /root/.cargo/bin/lantern-cli --help
Usage: lantern-cli <COMMAND>
Commands:
create-index Create external index
create-embeddings Create embeddings
show-models Show embedding models
measure-model-speed Show embedding models
start-daemon Start in daemon mode
help Print this message or the help of the given subcommand(s)
Options:
-h, --help Print help
-V, --version Print version
打包好之后, 可以拷贝到当前PG 14指定目录, 在容器中安装 lantern_extras .
cp target/release/lantern_extras-pg14/usr/share/postgresql/14/extension/lantern_extras.control /usr/share/postgresql/14/extension/
cp target/release/lantern_extras-pg14/usr/share/postgresql/14/extension/lantern_extras--0.0.3.sql /usr/share/postgresql/14/extension/
cp target/release/lantern_extras-pg14/usr/lib/postgresql/14/lib/lantern_extras.so $PGHOME/lib/
psql
create extension lantern_extras;
2、amd64 打包
# install pgrx
cd /tmp
# 查看依赖pgrx版本? Cargo.toml pgrx = "=0.11.3"
cargo install --locked --version 0.11.3 cargo-pgrx
cargo pgrx init # create PGRX_HOME 后, 立即ctrl^c 退出
cargo pgrx init --pg14=`which pg_config` # 不用管报警
# build and install lantern_extras
cd /tmp
curl -Z --connect-timeout 120 -m 36000 --retry 12000 --retry-delay 5 --retry-max-time 1200 -L https://github.com/microsoft/onnxruntime/releases/download/v1.15.1/onnxruntime-linux-x64-1.15.1.tgz -o onnxruntime-linux-x64-1.15.1.tgz
tar -zxvf onnxruntime-linux-x64-1.15.1.tgz
export ORT_STRATEGY=system
export ORT_LIB_LOCATION=/tmp/onnxruntime-linux-x64-1.15.1/lib
git clone --depth 1 https://github.com/lanterndata/lantern_extras
cd lantern_extras
# 配置cargo config
vi ~/.cargo/config
[target.x86_64-unknown-linux-gnu]
rustflags = ["-C", "link-args=-Wl,-rpath,/tmp/onnxruntime-linux-x64-1.15.1/lib"]
[target.'cfg(target_os="linux")']
# Postgres symbols won't be available until runtime
rustflags = ["-Clink-arg=-Wl,-undefined,dynamic_lookup"]
# replace x86_64-unknown-linux-gnu with your architecture.
# You can get it by running rustc -vV | sed -n 's|host: ||p'
# 打包 lantern_extras
cargo pgrx package --package lantern_extras
# 安装build index命令行工具:
cd /tmp/lantern_extras
cargo install --path lantern_cli
打包好之后, 可以拷贝到当前PG 14指定目录, 在容器中安装 lantern_extras .
cp target/release/lantern_extras-pg14/usr/share/postgresql/14/extension/lantern_extras.control /usr/share/postgresql/14/extension/
cp target/release/lantern_extras-pg14/usr/share/postgresql/14/extension/lantern_extras--0.0.3.sql /usr/share/postgresql/14/extension/
cp target/release/lantern_extras-pg14/usr/lib/postgresql/14/lib/lantern_extras.so $PGHOME/lib/
psql
create extension lantern_extras;
3、将编译好的文件拷贝到制作docker image的dir中:
《arm64 , 使用Dockerfile+docker build制作PolarDB | PostgreSQL 开源docker镜像, 集成大量插件方便学习, 并推送到阿里云镜像服务》
《arm64 , 使用Dockerfile+docker build制作PolarDB | PostgreSQL 开源docker镜像, 集成大量插件方便学习, 并推送到阿里云镜像服务》
docker cp pg:/tmp/lantern_extras/target/release/lantern_extras-pg14/usr/share/postgresql/14/extension/lantern_extras.control ~/pg14/
docker cp pg:/tmp/lantern_extras/target/release/lantern_extras-pg14/usr/share/postgresql/14/extension/lantern_extras--0.0.3.sql ~/pg14/
docker cp pg:/tmp/lantern_extras/target/release/lantern_extras-pg14/usr/lib/postgresql/14/lib/lantern_extras.so ~/pg14/
docker cp pg:/root/.cargo/bin/lantern-cli ~/pg14/