Vector embeddings support in Pinot #10919

Open
Aravind-Suresh opened this issue Jun 15, 2023 · 14 comments

Comments

@Aravind-Suresh
Contributor

Creating this issue to initiate discussions about supporting vector embeddings in Pinot.

This write-up collates some initial thoughts. It isn't a design doc; we'll work on the design doc once we have high-level alignment.

@siddharthteotia
Contributor

Glad to see there are others thinking about this as well.

I had recently created a short internal proposal on why a case can be made for vector storage and indexing in Pinot.

I think the first thing we need to do is to get alignment/consensus within the community that it makes sense to do vector search in Pinot.

Below is the internal Description and Business Justification we created. @jasperjiaguo can add more info.

Description

Vector embeddings are numerical, coordinate-based representations in a multi-dimensional space, typically produced by training a machine learning model. For example, training an LLM on text can produce billions of vector embeddings that are the distilled representation of the text/words (training data). The goal is to build optimal storage, indexing, and query-execution capabilities for this kind of data.

Benefit / Use Case

This can be a crucial foundation for AI systems that leverage high-performance similarity indexing and analytics on vector embeddings for recommendation, image matching, pattern recognition, anomaly detection, etc.

Specifically, in the case of LLMs and the prompt-engineering pipeline, vector storage, indexing, and querying can be used to store and query domain-specific facts (created during training, e.g. neural-network learning), which can then be fed into NLP models, chatbots, conversational prompts, etc.

@siddharthteotia
Contributor

Would love to collaborate on this.

@abhioncbr
Collaborator

This is interesting. +1

@jasperjiaguo
Contributor

jasperjiaguo commented Jun 15, 2023

Recommendation systems and Language Model (LLM) applications often utilize high-dimensional vector spaces to represent complex data like user profiles or linguistic patterns. Similarity-based vector indexing/search, a crucial element of these systems, identifies 'close' vectors in this space, signifying high similarity. This is commonly achieved through calculating the cosine similarity or Euclidean distance between vectors.

For instance: (1) in recommendation systems, items similar to a user's past interests are identified and suggested. (2) In LLM applications, instead of submitting a customer's prompt directly to the model, the question is first routed to the vector database (which can be considered the memory of the LLM), which retrieves the 10 or 15 most relevant documents for that query. The vector database then bundles those supporting documents with the user's original question and submits the full package as the knowledge-context prompt to the LLM, which returns a more relevant answer. (https://mlops.community/combine-and-query-multiple-documents-with-llm/, https://gpt-index.readthedocs.io/en/latest/examples/vector_stores/MilvusIndexDemo.html)

However, given the potentially vast number of vectors, searching for the most similar ones can be computationally challenging. Therefore, Approximate Nearest Neighbor (ANN) algorithms like FAISS, Annoy, or ScaNN are employed to expedite this process by quickly finding the nearest vectors in high-dimensional spaces.
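
To make this concrete, here is a minimal sketch (with made-up data and dimensions) of a brute-force cosine-similarity top-k search in NumPy next to an approximate search using FAISS's HNSW index. The FAISS calls are the library's standard API; the corpus, query, and parameter values are purely illustrative.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d, n, k = 128, 100_000, 10                        # dims, corpus size, neighbors (illustrative)
corpus = np.random.rand(n, d).astype("float32")
query = np.random.rand(1, d).astype("float32")

# Exact search: cosine similarity is a dot product on L2-normalized vectors.
corpus_n = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
query_n = query / np.linalg.norm(query, axis=1, keepdims=True)
scores = (corpus_n @ query_n.T).ravel()           # (n,) similarities
exact_top_k = np.argsort(-scores)[:k]

# Approximate search: an HNSW graph index from FAISS (L2 distance here, not cosine).
index = faiss.IndexHNSWFlat(d, 32)                # 32 = graph connectivity parameter M
index.add(corpus)                                 # build the graph over the corpus
distances, approx_top_k = index.search(query, k)  # much cheaper than a full scan at scale
```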

https://milvus.io/docs/index.md

https://github.com/facebookresearch/faiss

https://www.datanami.com/2023/03/27/vector-databases-emerge-to-fill-critical-role-in-ai/

https://github.com/linkedin/venice#read-compute

@Aravind-Suresh
Contributor Author

Thanks for the inputs @siddharthteotia @jasperjiaguo - yes, given the high dimensionality of the embeddings (OpenAI davinci embeddings have >12k dimensions), it's practical to use approximate algorithms.

In addition to recommendation systems and vector-search-based prompts, there are also applications in semantic search and clustering (grouping related issues, text).

We recently tried powering automated Q&A via vector search (using vector-search-based prompts), and it achieves good precision on unstructured data input as well (we used langchain here: https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/chroma.html).
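
As a rough sketch of the kind of pipeline described above, using the langchain APIs of that era (module paths and class names have since moved around; the documents and question below are placeholders, and an OPENAI_API_KEY is assumed to be set):

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA

# Placeholder documents standing in for real unstructured input.
docs = [
    "Pinot segments are immutable once they are sealed.",
    "Star-tree indexes pre-aggregate metrics for fast group-by queries.",
]

# Embed the documents and store them in a local Chroma vector store.
vectorstore = Chroma.from_texts(docs, OpenAIEmbeddings())

# At query time, retrieve the most similar documents and feed them to the LLM as context.
qa = RetrievalQA.from_chain_type(llm=OpenAI(), retriever=vectorstore.as_retriever())
print(qa.run("What happens to a Pinot segment after it is sealed?"))
```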

Given that new features are being powered via embeddings (Glean's AI powered enterprise search is one recent example - https://www.glean.com/blog/unlocking-the-power-of-vector-search-in-enterprise), it would be good to evaluate how Pinot can support this in a real-time setup.

Looking forward to the collaboration here!

@kishoreg
Member

cc @KKcorps who is also thinking about it.

@jasperjiaguo
Contributor

jasperjiaguo commented Jun 15, 2023

@Aravind-Suresh Exactly. I've also been using llama_index and langchain with the ChatGPT APIs. I think one usability addition to this feature may be to integrate the Pinot vector store with these Python packages, or to provide similarly powerful Python libs. Here is the list of vector stores llama_index supports: https://gpt-index.readthedocs.io/en/latest/how_to/integrations/vector_stores.html
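
To illustrate what such an integration could look like from the user's side, here is a hypothetical sketch modeled on how llama_index wired in existing vector stores (e.g. Milvus) via a StorageContext at the time. The `PinotVectorStore` class and its constructor arguments do not exist and are purely illustrative; the surrounding llama_index calls are the mid-2023 API.

```python
from llama_index import VectorStoreIndex, SimpleDirectoryReader, StorageContext
# Hypothetical class -- does not exist today; shown only to illustrate how a
# Pinot-backed store could plug into the existing vector-store interface.
from llama_index.vector_stores import PinotVectorStore

vector_store = PinotVectorStore(controller="http://localhost:9000", table="documents")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
print(index.as_query_engine().query("What does the quarterly report say about churn?"))
```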

@xiangfu0
Contributor

cc: @kkrugler

@xiangfu0
Contributor

xiangfu0 commented Jun 15, 2023

Here are some takes from my side:
High-level principles:

  • CPU solution
  • KNN search has to be distributable
  • The minimal search space is one segment (10-100MM rows/points)
  • Pluggable index structure along with the search algorithm

Considering the doc count in one segment is usually < 10MM, I think any of the current billion-scale approaches is sufficient for us.

In terms of implementation, take SPTAG (https://github.com/microsoft/SPTAG; paper: https://arxiv.org/pdf/2111.08566.pdf) as an example. We should definitely leverage existing libraries rather than re-invent the wheel.

During the index build phase, we need to build an SPTAG index on a per-segment basis: use hierarchical balanced clustering to generate a set of regions (centroids).
We can configure the following two parameters:

  • Number of regions, i.e. the percentage of total points that become centroids. From the paper, 16% is best for search performance and memory usage.
  • Number of replicas, i.e. how many close clusters a vector is assigned to. A larger number means better recall, but search requires more resources and longer latency; from the paper, 8 is the best balance. The RNG algorithm is needed to avoid highly similar posting lists for close regions.

During the query phase:
kNN search should be configurable with the following (a rough sketch of both the build and query phases follows this list):

  • k (required): how many results to fetch.
  • t (optional): a percentage used to include more regions in the search, based on the distance to the closest centroid; this increases the recall rate while keeping resource usage low.
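
A rough, self-contained sketch of both phases under the parameters above, substituting plain k-means for SPTAG's hierarchical balanced clustering (the data, sizes, and the 16%/8 values are illustrative restatements of the paper's suggestions, not tuned numbers):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vectors = rng.random((2_000, 64), dtype=np.float32)   # one segment's worth of vectors (illustrative)

# ---- Build phase: partition the segment into regions (centroids). ----
num_regions = int(0.16 * len(vectors))   # "~16% of points become centroids" knob from the paper
replicas = 8                             # each vector is posted to its 8 closest regions
kmeans = KMeans(n_clusters=num_regions, n_init=10, random_state=0).fit(vectors)
dist_to_centroids = kmeans.transform(vectors)         # (n_vectors, n_regions) distances

# Posting lists: region id -> ids of the vectors assigned to it (with replication).
posting_lists = {r: [] for r in range(num_regions)}
for vec_id, dists in enumerate(dist_to_centroids):
    for region in np.argsort(dists)[:replicas]:
        posting_lists[int(region)].append(vec_id)

# ---- Query phase: pick regions close to the query, then scan only those. ----
def knn_search(query, k, t=0.1):
    d = np.linalg.norm(kmeans.cluster_centers_ - query, axis=1)
    regions = np.where(d <= d.min() * (1 + t))[0]     # t widens the set of searched regions
    candidates = sorted({v for r in regions for v in posting_lists[r]})
    cand_dist = np.linalg.norm(vectors[candidates] - query, axis=1)
    return [candidates[i] for i in np.argsort(cand_dist)[:k]]

print(knn_search(rng.random(64, dtype=np.float32), k=10, t=0.2))
```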

@KKcorps
Contributor

KKcorps commented Jun 19, 2023

IMO, a CPU-based solution would be too slow for vector search. The embeddings popular currently use 700- to 1536-length floating-point arrays for a single object.

Computing similarity across millions of such objects at runtime for indexing is quite compute-heavy.
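
A rough back-of-envelope supporting this, assuming one million rows of 1536-dimension float32 embeddings (illustrative numbers matching the upper end mentioned above):

```python
num_vectors = 1_000_000   # rows to compare against (illustrative)
dim = 1536                # upper end of the embedding sizes mentioned above

flops_per_query = 2 * num_vectors * dim   # one multiply + one add per element for dot products
bytes_scanned = 4 * num_vectors * dim     # float32 bytes touched by a full scan

print(f"{flops_per_query / 1e9:.1f} GFLOP, {bytes_scanned / 1e9:.1f} GB scanned per query")
# ~3.1 GFLOP and ~6.1 GB per brute-force query; on CPU such a scan is usually
# memory-bandwidth bound, which is why ANN indexes (or GPUs) come up here.
```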

@walterddr
Contributor

walterddr commented Jun 27, 2023

CPU solutions only make sense in certain scenarios IMO, and I am not sure those fit here.

  • Q: Can it perform significantly better in specific use cases, for example ANNS use cases where the GPU setup & I/O overhead outweighs the batch-performance benefit on the GPU?
  • Q: Can we use an algorithm that doesn't depend on product quantization (or on anything specifically designed to exploit the massive parallelism of GPUs but that is not so good with branch prediction)?
    • for example, graph-based approaches?
    • this also echoes back to Q1, because these branching algorithms are most likely not good for batching
  • Q: Would it be significantly cheaper while still maintaining the same level of performance? And is there a matching use case (for example, ad-hoc exploration of the dataset before massively scaling up, when a GPU is justified)?

Specifically for Pinot: I know that most vector databases leverage an "inverted index" mechanism to speed up the ANNS algorithm. I don't think that's identical to the inverted index we have in Pinot, but we should see whether the indexing framework can be used once index-spi is introduced.
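
To make the "inverted index" analogy concrete: vector databases typically mean an inverted file (IVF), i.e. per-centroid posting lists of vectors, which is structurally similar to, but not the same as, Pinot's term-based inverted index. A minimal IVF sketch with FAISS (sizes and data are illustrative):

```python
import numpy as np
import faiss  # pip install faiss-cpu

d, nlist, k = 128, 256, 10                          # dims, coarse centroids, neighbors (illustrative)
xb = np.random.rand(100_000, d).astype("float32")   # stored vectors ("segment")
xq = np.random.rand(1, d).astype("float32")         # query vector

quantizer = faiss.IndexFlatL2(d)                    # coarse quantizer over the centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)     # one posting list ("inverted list") per centroid
index.train(xb)                                     # learn the nlist centroids
index.add(xb)                                       # route each vector into its centroid's list
index.nprobe = 8                                    # search only the 8 closest posting lists
distances, ids = index.search(xq, k)
```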

@PeterCorless

PeterCorless commented Apr 15, 2024

@hpvd

hpvd commented Apr 23, 2024

The release video Apache Pinot 1.1 | Overview of Latest Features and Updates
also talks about vector index support, brought by:
Support Vector index and HNSW as the first implementation #11977

https://www.youtube.com/watch?v=wSwPtOajsGY&t=1m20s

@hpvd

hpvd commented Jun 9, 2024

Related to the open pull request:
Vector data type in Pinot
#11262
