Set up a vector search database as an API, then in the front end #3398
Comments
Another option that Harvard folks are playing with: https://www.trychroma.com/ |
I sent a mass email to a lot of our customers to see if people want to contribute money to building this. We'll see what kind of a response I get. |
One of our clients has done this with our data and he reports the following:
|
I talked to a couple of contacts today about this. A couple notes:
|
These analyses are great, thanks @n-shamsi. What would you suggest as the next step? |
Thank you! I made some tasks on the issues, I think they're good for follow-ups but I am open to suggestions! Here's a summary:
I think I'll start with the one on #3490 because it will help answer the three on #3489 more effectively. |
Re: Technical Implementations to Optimize Query Performance
Thank you, @n-shamsi, for your research and insights. Exploring Hypothetical Document Embeddings (HyDE) might benefit the project. HyDE uses a language model to turn a user's query into a hypothetical answer document, embeds that document, and then searches the existing document set for the most similar embeddings, rather than comparing the query itself directly. This approach could be particularly useful for Free Law Project users who may not always have the legal expertise to frame precise queries. The applicability and scalability of HyDE in our context remain to be assessed, but it could be a promising direction for our query optimization efforts.
For more information: Repo: https://github.com/texttron/hyde |
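To make the HyDE flow above concrete, here is a minimal sketch under loud assumptions: `fake_generate` is a hypothetical stand-in for the language model call, `embed` is a toy bag-of-words vectorizer rather than a real dense encoder, and the three-sentence corpus is made up for illustration. None of this is taken from the texttron/hyde repo; it only shows the shape of the technique.

```python
import math
import re
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: a real system would use a dense encoder)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hyde_search(query: str, corpus: list[str], generate: Callable[[str], str], top_k: int = 3):
    """Rank corpus documents against a generated hypothetical document, not the raw query."""
    hypothetical = generate(query)      # step 1: query -> hypothetical answer document
    h_vec = embed(hypothetical)         # step 2: embed the hypothetical document
    scored = [(cosine(h_vec, embed(doc)), doc) for doc in corpus]  # step 3: compare to corpus
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def fake_generate(query: str) -> str:
    """Hypothetical stub; in practice this would be an LLM prompted to answer the query."""
    return ("The Fourth Amendment prohibits unreasonable searches and seizures; police "
            "generally need a warrant supported by probable cause to search a vehicle.")


# Made-up illustrative corpus, not real opinion text.
CORPUS = [
    "The court held that the warrantless search of the vehicle violated the Fourth Amendment.",
    "The contract was void because the parties lacked consideration.",
    "The defendant's copyright claim failed for lack of originality.",
]

QUERY = "can cops look in my car without permission"
results = hyde_search(QUERY, CORPUS, fake_generate, top_k=1)
print(results[0][1])
```

In this toy setup the lay-language query shares no terms with any document, so a direct comparison scores zero everywhere, while the hypothetical document pulls in the legal vocabulary needed to reach the Fourth Amendment opinion. Whether that benefit holds with real encoders and real opinions is exactly what an evaluation data set would need to show.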
HyDE sounds awesome. I think we just need a sample dataset to get started. I am also interested in suggestions from other followers of this issue: is there a particular solution that clients would like to see tested? I think Llama is on the shortlist for that reason. @mlissner is there a sample dataset we can use currently, or should @vonwooding and I create one? |
fwiw I might suggest SCOTUS Fourth Amendment search/seizure cases. There are probably 600-700 total opinions. Plus, it's a familiar and important issue for practitioners and the public alike. |
No, there's no evaluation data set, but this came up yesterday when I talked to somebody else. I think there's an opportunity to create a really nice evaluation data set. Seems like we should spin that off into its own issue and discuss it there? I'll invite the person I talked to yesterday to chime in? |
I'm reliably (?) informed this leaderboard shows the best models to use for vectorizing: https://huggingface.co/spaces/mteb/leaderboard I don't know how true that is, but it's what I hear on the street! |
Picking up this issue again, we could start with a sample from here: https://github.com/freelawproject/reporters-db/blob/main/reporters_db/data/laws.json Are there any particular data features that should be included for evaluation? |
Sorry, I'm not sure what you mean, Nina. Are you suggesting we use the many laws there as the evaluation data set? Should we create a fresh issue for discussing that? |
People just keep asking for this, and it seems like something our customers would use if we had it.
One customer I just talked to wants to do it using Pinecone. Maybe that's an idea. Elastic also seems to make this possible and even has a product page for it: https://www.elastic.co/enterprise-search/vector-search
Maybe it's something we should do, but I do worry about how much memory it would use.
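A rough way to size the memory concern is to multiply the number of vectors by their dimension and the bytes per value. The sketch below assumes float32 values and uses purely hypothetical inputs (10 million chunks, 768 dimensions); these are not measurements of our data, and the estimate ignores index overhead such as graph structures, which varies by engine.

```python
def raw_vector_gib(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw storage for the vectors alone, in GiB (float32 by default, no index overhead)."""
    return num_vectors * dims * bytes_per_value / 2**30


# Hypothetical inputs chosen only for illustration.
estimate = raw_vector_gib(num_vectors=10_000_000, dims=768)
print(f"{estimate:.1f} GiB")

# Halving the bytes per value (e.g. 16-bit floats) halves the raw footprint.
half_precision = raw_vector_gib(num_vectors=10_000_000, dims=768, bytes_per_value=2)
print(f"{half_precision:.1f} GiB")
```

The same function makes it easy to compare options once real counts are known: fewer dimensions, smaller value types, or indexing only a subset of documents all scale the result linearly.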