
Set up a vector search database as an API, then in the front end #3398

Open
mlissner opened this issue Nov 22, 2023 · 14 comments

@mlissner
Member

People just keep asking for this, and it seems like something our customers would use if we had it.

One customer I just talked to wants to do it using Pinecone. Maybe that's an idea. Elastic also seems to make this possible and even has a product page for it: https://www.elastic.co/enterprise-search/vector-search

Maybe it's something we should do, but I do worry about how much memory it would use.

@mlissner
Member Author

Another option that Harvard folks are playing with: https://www.trychroma.com/

@mlissner
Member Author

I sent a mass email to a lot of our customers to see if people want to contribute money to building this. We'll see what kind of response I get.

@mlissner
Member Author

mlissner commented Dec 14, 2023

One of our clients has done this with our data and he reports the following:

Llama Index makes things a lot easier...

Beyond just getting the data into the vector db, the following optimization activities may need to be conducted:

  • Building an evaluation dataset to measure retrieval performance
  • Tune hyperparameters like chunk size and included metadata
  • Evaluate more advanced chunking strategies (e.g. SentenceWindow chunking)
  • Fine tune an open source embedding model to squeeze out additional performance
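
The sentence-window chunking mentioned above can be sketched without any framework. This is a minimal, illustrative stand-in (not LlamaIndex's actual `SentenceWindowNodeParser` API, and the sample text is made up): each sentence is embedded on its own, but the surrounding sentences are kept as metadata so the retriever returns enough context.

```python
import re

def sentence_window_chunks(text, window=1):
    """Split text into sentences; each chunk pairs one sentence (the unit
    that gets embedded) with `window` neighboring sentences on each side
    (context stored alongside it)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    for i, sent in enumerate(sentences):
        lo = max(0, i - window)
        hi = min(len(sentences), i + window + 1)
        chunks.append({
            "target": sent,                        # sentence to embed
            "window": " ".join(sentences[lo:hi]),  # surrounding context
        })
    return chunks

opinion = ("The officers searched the car. No warrant was issued. "
           "The court suppressed the evidence.")
for chunk in sentence_window_chunks(opinion):
    print(chunk["target"], "=>", chunk["window"])
```

The window size here plays the same role as the chunk-size hyperparameter in the bullets above: larger windows give the model more context per hit at the cost of noisier embeddings.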

@mlissner
Member Author

I talked to a couple of contacts today about this. A couple notes:

  • Once you chunk the data and embed it, that's about 600GB.
  • Qdrant has quantization out of the box that compresses large floats into smaller representations so they take up less memory and save money. I don't know if we can do this with Elastic.
  • LlamaIndex is apparently very helpful, and one contact recommended it over LangChain.
  • Getting the vector size correct is really hard and really important. We might need to try different things and see what performs best.
  • We need an evaluation dataset (with positive and negative hits) to know if our tweaks are working. This feels like a really good academic exercise, or it could also be a great one for FLP to lead and release as leaders in the field.
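
The quantization idea in the second bullet can be sketched in plain Python. This is an illustrative scalar-quantization scheme (not Qdrant's actual implementation): each float32 component becomes one signed byte plus a shared per-vector scale, roughly a 4x memory saving.

```python
def quantize(vec):
    """Scalar-quantize a float vector to int8: store one scale per vector;
    each float32 component becomes a single signed byte (about 4x smaller)."""
    scale = max(abs(x) for x in vec) / 127 or 1.0  # avoid zero scale
    return scale, bytes(round(x / scale) & 0xFF for x in vec)

def dequantize(scale, data):
    """Recover approximate floats from the int8 representation."""
    return [((b - 256) if b > 127 else b) * scale for b in data]

vec = [0.12, -0.98, 0.5, 0.0]
scale, packed = quantize(vec)
approx = dequantize(scale, packed)
# Each component is recovered to within one quantization step (scale),
# which is the accuracy/memory trade-off the bullet describes.
```

The same arithmetic explains why quantization matters at the scale mentioned above: a corpus that needs ~600GB as float32 would need roughly a quarter of that as int8, before any index overhead.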

@n-shamsi

Related issues: #3489, and #3490

@mlissner
Member Author

These analyses are great, thanks @n-shamsi. What would you suggest as the next step?

@n-shamsi

n-shamsi commented Dec 21, 2023

These analyses are great, thanks @n-shamsi. What would you suggest as the next step?

Thank you! I made some tasks on the issues, I think they're good for follow-ups but I am open to suggestions! Here's a summary:

#3489

  • What's the relationship between data type and search accuracy for a given DB?
  • What technical implementations optimize query performance for each DB? Are they suitable for our data?
  • What specific ML integration is used for a given DB, and is it useful for our data?

#3490

  • Select a vector DB and sample dataset

I think I'll start with the one on #3490 because it will help answer the three on #3489 more effectively.

@vonwooding

Re: Technical Implementations to Optimize Query Performance

Thank you, @n-shamsi, for your research and insights.

Exploring Hypothetical Document Embeddings (HyDE) might benefit the project. HyDE transforms a user's query into a hypothetical document, which is then compared for similarity with the existing document set, rather than directly comparing the query itself.

This approach could be particularly useful for Free Law Project users who may not always have the legal expertise to frame precise queries.

While the full applicability and scalability of HyDE in our context remain to be assessed, it could provide a promising direction for our query optimization efforts.

For more information:

Repo: https://github.com/texttron/hyde
Paper: https://arxiv.org/abs/2212.10496
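
The HyDE flow described above can be shown end to end with toy components. Here the embedding is a bag-of-words counter and `generate_hypothetical` returns a hardcoded string; in a real system both would be model calls. The point is the structure: embed the hypothetical answer, not the query, then rank documents against it.

```python
import re
from collections import Counter
from math import sqrt

def embed(text):
    """Toy bag-of-words 'embedding'; a real system would call an
    embedding model here."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def generate_hypothetical(query):
    """Placeholder for the LLM call that writes a fake answer document;
    hardcoded here purely for illustration."""
    return ("The warrantless search of the vehicle violated the Fourth "
            "Amendment, so the evidence was suppressed.")

docs = [
    "The court held the warrantless vehicle search violated the "
    "Fourth Amendment and suppressed the evidence.",
    "The contract dispute turned on the meaning of the delivery clause.",
]

query = "can police search my car without a warrant"

# HyDE: embed the hypothetical answer rather than the raw query,
# then rank real documents by similarity to that embedding.
hyde_vec = embed(generate_hypothetical(query))
best = max(docs, key=lambda d: cosine(hyde_vec, embed(d)))
```

The layperson query shares few terms with the opinion, but the hypothetical answer is written in the corpus's register, which is exactly the benefit for non-expert users described above.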

@n-shamsi

n-shamsi commented Dec 21, 2023

Exploring Hypothetical Document Embeddings (HyDE) might benefit the project.

HyDE sounds awesome. I think we just need a sample dataset to get started. I am also interested in any suggestions from other followers on the issue, I am wondering if there's a particular solution that clients might be interested in having tested? I think Llama is on the shortlist because of that.

@mlissner is there a sample dataset we can use currently, or should @vonwooding and I create one?

@vonwooding

vonwooding commented Dec 21, 2023

fwiw I might suggest SCOTUS Fourth Amendment search/seizure cases. There are probably 600-700 total opinions. Plus, it's a familiar and important issue for practitioners and the public alike.

@mlissner
Member Author

No, there's no evaluation dataset, but this came up yesterday when I talked to somebody else. I think there's an opportunity to create a really nice evaluation dataset. Seems like we should spin that off into its own issue and discuss it there? I'll invite the person I talked to yesterday to chime in.
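
An evaluation dataset of the kind discussed here is, at minimum, a set of (query, known-relevant-document-ids) pairs. Scoring a retriever against it takes a few lines; everything below (the queries, ids, and the retriever output) is made up for illustration.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of the known-relevant doc ids that appear in the
    top-k retrieved results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

# Hypothetical evaluation set: query -> ids of known-relevant opinions.
eval_set = {
    "warrantless car search": ["op-101", "op-202"],
}

# Pretend retriever output for that query, best match first.
retrieved = ["op-101", "op-999", "op-202", "op-303"]

score = recall_at_k(retrieved, eval_set["warrantless car search"], k=3)
# -> 1.0: both relevant opinions appear in the top 3.
```

Tracking a metric like this across runs is what makes the hyperparameter tweaks above (chunk size, embedding model, quantization) measurable rather than guesswork; negative examples, as noted earlier in the thread, matter too and would extend this with a precision-style metric.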

@mlissner
Member Author

I'm reliably (?) informed this leaderboard shows the best models to use for vectorizing:

https://huggingface.co/spaces/mteb/leaderboard

I don't know how true that is, but it's what I hear on the street!

@n-shamsi

Picking up this issue again, we could start with a sample from here: https://github.com/freelawproject/reporters-db/blob/main/reporters_db/data/laws.json

Are there any particular data features that should be included for evaluation?

@mlissner
Member Author

Sorry, I'm not sure what you mean, Nina. Are you suggesting we use the many laws there as the evaluation data set? Should we create a fresh issue for discussing that?
