Skip to content

Latest commit

 

History

History
228 lines (170 loc) · 11.4 KB

20240202_01.md

File metadata and controls

228 lines (170 loc) · 11.4 KB

如果向量超过2000维度怎么办? pg_sparse: 在 Postgres 中使用 SPLADE 进行相似性搜索

作者

digoal

日期

2024-02-02

标签

PostgreSQL , PolarDB , DuckDB , vector , 稠密向量 , 稀疏向量 , 高维度 , pg_sparse , pgvector


背景

pgvector是非常不错的稠密向量存储和索引插件. 但是维度限制在2000.

如果向量超过2000维度怎么办?

  • 使用pg_sparse: 在 Postgres 中使用 SPLADE 进行相似性搜索
  • 已经集成到“宇宙第一数据库镜像”中, 欢迎体验.

pg_sparse介绍

我们很高兴推出pg_sparse:第一个在 Postgres 中使用 HNSW 实现稀疏向量高效存储和检索的扩展。pg_sparse稀疏向量的含义pgvector与稠密向量的含义相同。

稀疏向量由 SPLADE 等新模型生成,可以检测精确关键字的存在,同时还可以捕获术语之间的语义相似性。与稠密向量不同,稀疏向量包含更多的条目,其中大部分为零。例如,OpenAItext-embedding-ada-002模型输出具有 1536 个条目的密集向量,而 SPLADE 输出具有超过 30,000 个条目的稀疏向量。

pg_sparse是 的一个分支pgvector,这意味着它利用了pgvector现有的向量存储和 HNSW 实现。它由两个主要变化组成:

  • 一种新的 Postgres 数据类型,svector通过非零条目存储稀疏向量
  • 对分配 Postgres 页面的方式进行修改,pgvector以支持具有可变数量的非零条目的向量

您可以通过在现有的自托管 Postgres 实例中安装或pg_sparse运行我们的 Docker 映像来轻松开始。运行以下查询即可开始:

-- Load extension  
CREATE EXTENSION svector;  
-- Create test data  
CREATE TABLE items (id bigserial PRIMARY KEY, embedding svector(4));  
INSERT INTO items (embedding) VALUES ('[1,0,3,0]'), ('[0,0,5,6]');  
-- Create HNSW index for cosine similarity  
CREATE INDEX ON items USING shnsw (embedding svector_cosine_ops);  
-- Run query  
SELECT * FROM items ORDER BY embedding <=> '[3,0,1,0]';  

Postgres 内部的 SPLADE:一个示例

接下来,让我们看一下一个更复杂的示例,该示例使用 来插入、索引和搜索由 SPLADE 生成的稀疏向量pg_sparse。要创建稀疏向量,让我们安装依赖项:

!pip install -U transformers torch datasets pandas tqdm  

接下来,让我们运行 Python 代码

  • 加载包含 50,000 部电影描述的示例 Huggingface 数据集
  • 为每个描述生成一个 SPLADE 向量
  • 将数据集保存为 CSV 文件
from transformers import AutoModelForMaskedLM, AutoTokenizer  
from tqdm import tqdm  
from datasets import load_dataset  
import pandas as pd  
import re  
import torch  
  
model_id = 'naver/splade-cocondenser-ensembledistil'  
dataset_name = 'SandipPalit/Movie_Dataset'  
  
def create_splade(text):  
    tokens = tokenizer(text, return_tensors='pt')  
    output = model(**tokens)  
    vec = torch.max(  
        torch.log(  
            1 + torch.relu(output.logits)  
        ) * tokens.attention_mask.unsqueeze(-1),  
    dim=1)[0].squeeze()  
  
    return vec  
  
def clean_text(text):  
    return re.sub(r'[\r\n\t]+', ' ', text)  
  
# Initialize SPLADE model  
tokenizer = AutoTokenizer.from_pretrained(model_id)  
model = AutoModelForMaskedLM.from_pretrained(model_id)  
  
# Load dataset  
dataset = load_dataset(dataset_name)  
training_dataset = dataset["train"]  
  
# Generate SPLADE vectors  
# Note this will take a long time, consider reducing the size of  
# training_dataset to reduce the time  
texts = []  
vectors = []  
  
for example in tqdm(training_dataset, desc="Processing..."):  
    text = clean_text(example['Overview'])  
    texts.append(text)  
    vector = create_splade(text)  
    vectors.append(vector.tolist())  
  
# Save as dataframe  
df = pd.DataFrame({  
    'text': texts,  
    'splade_vector': vectors  
})  
  
# Generate another SPLADE vector for querying  
query = "Space exploration"  
df.at[0, 'text'] = query  
df.at[0, 'splade_vector'] = create_splade(query).tolist()  
  
# Save to CSV  
df.to_csv("splade_vectors.csv", index=False)  

使用像 一样的 Postgres 客户端psql,我们现在可以加载我们的电影数据集。由于数据集的大小,这将需要几分钟的时间。

CREATE TABLE movies (description text, splade_vector svector(30522));  
\copy movies FROM '/path/to/splade_vectors.csv' DELIMITER ',' CSV HEADER;  

我们还可以启用计时来检查查询的性能。

\timing  

现在我们准备好执行我们的第一次搜索。首先,我们将根据余弦相似度查找与第一行最相似的十行:

SELECT description  
FROM movies  
ORDER BY splade_vector <=> (SELECT splade_vector FROM movies LIMIT 1)  
LIMIT 10;  

预期反应

                                                                                                                         description  
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Space exploration  
 A sweeping overview of humanity’s accomplishments in space, as well as our ongoing activities and future plans.  
 Delhi Boys Going To Space  
 1962. The US Genesis 1491 space mission launches with a unique objective, based upon mankind's ongoing quest to seek answers to existential questions. It is the dawn of the Golden Age of Space exploration, and the possibilities seem limitless...  
 Prepare for liftoff as we explore NASA's Space Shuttle program's legacy, featuring rare footage and testimonies from the people who made it all possible.  
 Join the StoryBots and the space travelers of the historic Inspiration4 mission as they search for answers to kids' questions about space.  
 A loner inventor who dreams of exploring space works in a math lab until a female astronaut goes missing and he can use a space helmet to save her.  
 An exploration of the Alien presence on Earth and the reality of suppressed free energy technology.  
 An exploration into the fate of the post-modern man.  
 No matter how clear the night sky is, no matter how many millions of stars are within view, looking up at the sky on a clear night still hides the halo of man-made debris around Earth that threatens the future of space exploration and endangers us all.  
(10 rows)  
  
Time: 154.361 ms  

该查询执行了 Postgres 顺序扫描,花了 155 毫秒返回前十行。现在,让我们创建一个 HNSW 索引来加速搜索结果。这可能需要几分钟才能运行。

CREATE INDEX ON movies  
USING shnsw (splade_vector svector_cosine_ops);  

创建索引后,让我们重新运行与上面相同的搜索查询。我们得到了相同的结果,但查询时间为 2.740 毫秒 — 查询时间加快了 50 倍!

预期反应

                                                                                                                         description  
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Space exploration  
 A sweeping overview of humanity’s accomplishments in space, as well as our ongoing activities and future plans.  
 Delhi Boys Going To Space  
 1962. The US Genesis 1491 space mission launches with a unique objective, based upon mankind's ongoing quest to seek answers to existential questions. It is the dawn of the Golden Age of Space exploration, and the possibilities seem limitless...  
 Prepare for liftoff as we explore NASA's Space Shuttle program's legacy, featuring rare footage and testimonies from the people who made it all possible.  
 Join the StoryBots and the space travelers of the historic Inspiration4 mission as they search for answers to kids' questions about space.  
 A loner inventor who dreams of exploring space works in a math lab until a female astronaut goes missing and he can use a space helmet to save her.  
 An exploration of the Alien presence on Earth and the reality of suppressed free energy technology.  
 An exploration into the fate of the post-modern man.  
 No matter how clear the night sky is, no matter how many millions of stars are within view, looking up at the sky on a clear night still hides the halo of man-made debris around Earth that threatens the future of space exploration and endangers us all.  
(10 rows)  
  
Time: 2.704 ms  

基准测试结果

我们测量了由 SPLADE 生成的 100K 个稀疏向量数据集的索引创建和查询时间,每个向量都有 30,522 维。我们将 Postgres 的maintenance_working_mem配置设置为 512MB,以便整个 HNSW 图可以在构建过程中装入内存。

ALTER SYSTEM SET maintenance_work_mem = '512MB';  
SELECT pg_reload_conf();  

ef_construction是一个参数,可以创建更高质量的图表和更准确的搜索结果,但代价是更长的索引创建时间。在默认ef_construction值下64,构建索引需要 200 秒(500 个向量/秒)。

ef_构造 索引时间  向量数量  
32 114秒  100,000  
64 200多岁 100,000  
128   362秒  100,000  

PS: pgvector 0.6.0 引入了并行创建hnsw索引, 所以创建索引速度可以飞速提升了.

接下来,我们比较了使用和不使用 HNSW 索引返回前 10 个结果的时间。我们设置m=16、ef_construction=64、 和ef_search=40。对于 HNSW,此搜索花费了 6 毫秒。如果没有 HNSW,此搜索需要 150 毫秒。

digoal's wechat