digoal
2024-02-02
PostgreSQL , PolarDB , DuckDB , vector , 稠密向量 , 稀疏向量 , 高维度 , pg_sparse , pgvector
pgvector是非常不错的稠密向量存储和索引插件. 但是维度限制在2000.
- https://github.com/pgvector/pgvector
- Vectors with up to 2,000 dimensions can be indexed.
如果向量超过2000维度怎么办?
- 使用pg_sparse: 在 Postgres 中使用 SPLADE 进行相似性搜索
- 已经集成到“宇宙第一数据库镜像”中, 欢迎体验.
pg_sparse介绍
我们很高兴推出pg_sparse
:第一个在 Postgres 中使用 HNSW
实现稀疏向量高效存储和检索的扩展。pg_sparse
稀疏向量的含义pgvector
与稠密向量的含义相同。
稀疏向量由 SPLADE
等新模型生成,可以检测精确关键字的存在,同时还可以捕获术语之间的语义相似性。与稠密向量不同,稀疏向量包含更多的条目,其中大部分为零。例如,OpenAI
的text-embedding-ada-002
模型输出具有 1536
个条目的密集向量,而 SPLADE
输出具有超过 30,000
个条目的稀疏向量。
pg_sparse
是 的一个分支pgvector
,这意味着它利用了pgvector
现有的向量存储和 HNSW 实现。它由两个主要变化组成:
- 一种新的 Postgres 数据类型,svector通过非零条目存储稀疏向量
- 对分配 Postgres 页面的方式进行修改,pgvector以支持具有可变数量的非零条目的向量
您可以通过在现有的自托管 Postgres 实例中安装或pg_sparse运行我们的 Docker 映像来轻松开始。运行以下查询即可开始:
-- Load extension
CREATE EXTENSION svector;
-- Create test data
CREATE TABLE items (id bigserial PRIMARY KEY, embedding svector(4));
INSERT INTO items (embedding) VALUES ('[1,0,3,0]'), ('[0,0,5,6]');
-- Create HNSW index for cosine similarity
CREATE INDEX ON items USING shnsw (embedding svector_cosine_ops);
-- Run query
SELECT * FROM items ORDER BY embedding <=> '[3,0,1,0]';
接下来,让我们看一下一个更复杂的示例,该示例使用 来插入、索引和搜索由 SPLADE
生成的稀疏向量pg_sparse
。要创建稀疏向量,让我们安装依赖项:
!pip install -U transformers torch datasets pandas tqdm
接下来,让我们运行 Python 代码
- 加载包含 50,000 部电影描述的示例 Huggingface 数据集
- 为每个描述生成一个 SPLADE 向量
- 将数据集保存为 CSV 文件
from transformers import AutoModelForMaskedLM, AutoTokenizer
from tqdm import tqdm
from datasets import load_dataset
import pandas as pd
import re
import torch
model_id = 'naver/splade-cocondenser-ensembledistil'
dataset_name = 'SandipPalit/Movie_Dataset'
def create_splade(text):
tokens = tokenizer(text, return_tensors='pt')
output = model(**tokens)
vec = torch.max(
torch.log(
1 + torch.relu(output.logits)
) * tokens.attention_mask.unsqueeze(-1),
dim=1)[0].squeeze()
return vec
def clean_text(text):
return re.sub(r'[\r\n\t]+', ' ', text)
# Initialize SPLADE model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
# Load dataset
dataset = load_dataset(dataset_name)
training_dataset = dataset["train"]
# Generate SPLADE vectors
# Note this will take a long time, consider reducing the size of
# training_dataset to reduce the time
texts = []
vectors = []
for example in tqdm(training_dataset, desc="Processing..."):
text = clean_text(example['Overview'])
texts.append(text)
vector = create_splade(text)
vectors.append(vector.tolist())
# Save as dataframe
df = pd.DataFrame({
'text': texts,
'splade_vector': vectors
})
# Generate another SPLADE vector for querying
query = "Space exploration"
df.at[0, 'text'] = query
df.at[0, 'splade_vector'] = create_splade(query).tolist()
# Save to CSV
df.to_csv("splade_vectors.csv", index=False)
使用像 一样的 Postgres 客户端psql
,我们现在可以加载我们的电影数据集。由于数据集的大小,这将需要几分钟的时间。
CREATE TABLE movies (description text, splade_vector svector(30522));
\copy movies FROM '/path/to/splade_vectors.csv' DELIMITER ',' CSV HEADER;
我们还可以启用计时来检查查询的性能。
\timing
现在我们准备好执行我们的第一次搜索。首先,我们将根据余弦相似度查找与第一行最相似的十行:
SELECT description
FROM movies
ORDER BY splade_vector <=> (SELECT splade_vector FROM movies LIMIT 1)
LIMIT 10;
预期反应
description
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Space exploration
A sweeping overview of humanity’s accomplishments in space, as well as our ongoing activities and future plans.
Delhi Boys Going To Space
1962. The US Genesis 1491 space mission launches with a unique objective, based upon mankind's ongoing quest to seek answers to existential questions. It is the dawn of the Golden Age of Space exploration, and the possibilities seem limitless...
Prepare for liftoff as we explore NASA's Space Shuttle program's legacy, featuring rare footage and testimonies from the people who made it all possible.
Join the StoryBots and the space travelers of the historic Inspiration4 mission as they search for answers to kids' questions about space.
A loner inventor who dreams of exploring space works in a math lab until a female astronaut goes missing and he can use a space helmet to save her.
An exploration of the Alien presence on Earth and the reality of suppressed free energy technology.
An exploration into the fate of the post-modern man.
No matter how clear the night sky is, no matter how many millions of stars are within view, looking up at the sky on a clear night still hides the halo of man-made debris around Earth that threatens the future of space exploration and endangers us all.
(10 rows)
Time: 154.361 ms
该查询执行了 Postgres 顺序扫描,花了 155 毫秒返回前十行。现在,让我们创建一个 HNSW 索引来加速搜索结果。这可能需要几分钟才能运行。
CREATE INDEX ON movies
USING shnsw (splade_vector svector_cosine_ops);
创建索引后,让我们重新运行与上面相同的搜索查询。我们得到了相同的结果,但查询时间为 2.740 毫秒 — 查询时间加快了 50 倍!
预期反应
description
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Space exploration
A sweeping overview of humanity’s accomplishments in space, as well as our ongoing activities and future plans.
Delhi Boys Going To Space
1962. The US Genesis 1491 space mission launches with a unique objective, based upon mankind's ongoing quest to seek answers to existential questions. It is the dawn of the Golden Age of Space exploration, and the possibilities seem limitless...
Prepare for liftoff as we explore NASA's Space Shuttle program's legacy, featuring rare footage and testimonies from the people who made it all possible.
Join the StoryBots and the space travelers of the historic Inspiration4 mission as they search for answers to kids' questions about space.
A loner inventor who dreams of exploring space works in a math lab until a female astronaut goes missing and he can use a space helmet to save her.
An exploration of the Alien presence on Earth and the reality of suppressed free energy technology.
An exploration into the fate of the post-modern man.
No matter how clear the night sky is, no matter how many millions of stars are within view, looking up at the sky on a clear night still hides the halo of man-made debris around Earth that threatens the future of space exploration and endangers us all.
(10 rows)
Time: 2.704 ms
我们测量了由 SPLADE 生成的 100K 个稀疏向量数据集的索引创建和查询时间,每个向量都有 30,522 维。我们将 Postgres 的maintenance_working_mem
配置设置为 512MB,以便整个 HNSW 图可以在构建过程中装入内存。
ALTER SYSTEM SET maintenance_work_mem = '512MB';
SELECT pg_reload_conf();
ef_construction
是一个参数,可以创建更高质量的图表和更准确的搜索结果,但代价是更长的索引创建时间。在默认ef_construction
值下64
,构建索引需要 200 秒(500 个向量/秒)。
ef_构造 索引时间 向量数量
32 114秒 100,000
64 200多岁 100,000
128 362秒 100,000
PS: pgvector 0.6.0 引入了并行创建hnsw索引, 所以创建索引速度可以飞速提升了.
接下来,我们比较了使用和不使用 HNSW 索引返回前 10 个结果的时间。我们设置m=16、ef_construction=64
、 和ef_search=40
。对于 HNSW,此搜索花费了 6 毫秒。如果没有 HNSW,此搜索需要 150 毫秒。