Skip to content

Latest commit

 

History

History
54 lines (25 loc) · 4.4 KB

20230928_01.md

File metadata and controls

54 lines (25 loc) · 4.4 KB

TimescaleDB 发布基于DiskANN的增强向量索引

作者

digoal

日期

2023-09-28

标签

PostgreSQL , PolarDB , embedding , diskann


背景

https://www.timescale.com/blog/how-we-made-postgresql-the-best-vector-database/

Introducing Timescale Vector, PostgreSQL++ for production AI applications. Timescale Vector enhances pgvector with faster search, higher recall, and more efficient time-based filtering, making PostgreSQL your new go-to vector database. Timescale Vector is available today in early access on Timescale’s cloud data platform. Keep reading to learn why and how we built it. Then take it out for a ride: try Timescale Vector for free today, with a 90-day extended trial.

https://www.microsoft.com/en-us/research/project/project-akupara-approximate-nearest-neighbor-search-for-large-scale-semantic-search/

https://github.com/Microsoft/DiskANN

Deep Learning-based embeddings are used widely for “dense retrieval” in information retrieval, computer vision, NLP, amongst others, owing to capture diverse types of semantic information. This paradigm constructs embeddings so that semantically similar items are closer in a high dimensional metric space. The first step to enabling search and recommendation with such embeddings is to index the embeddings of the corpus and support approximate nearest-neighbor search (ANNS) a.k.a. Vector Search for query embeddings. While ANNS is a fundamental problem has been studied for decades, existing algorithms suffer from two main drawbacks: either their search accuracies are low, thereby affecting the quality of results downstream, or their memory (DRAM) footprint is enormous, making it hard to serve them at web scale.

In this project, we are designing algorithms to address the challenges of scaling ANNS for web and enterprise search and recommendation systems. Our goal is to build systems that serve trillions of points in a streaming setting cost effectively. Below is a summary of the associated research directions:

DiskANN:(opens in new tab) an ANNS algorithm which can achieve both high accuracy as well as low DRAM footprint, by suitably using auxilliary SSD storage, which is significantly more cost-effective than DRAM. Using DiskANN, we can index 5-10X more points per machine than the state-of-the-art DRAM-based solutions: e.g., DiskANN can index upto a billion vectors while achieving 95% search accuracy with 5ms latencies, while existing DRAM-based algorithms peak at 100-200M points for similar latency and accuracy.

号称可以轻松支持10亿级别向量, 索引相比pgvector hnsw占用空间小至十分之一, 性能略优于pgvector hnsw, build时间比pgvector略快. 当前仅支持timescaledb cloud版本体验.

digoal's wechat