
Set up a vector search database as an API, then in the front end #3398

Open
mlissner opened this issue Nov 22, 2023 · 14 comments

@mlissner
Member

People just keep asking for this, and it seems like something our customers would use if we had it.

One customer I just talked to wants to do it using Pinecone. Maybe that's an idea. Elastic also seems to make this possible and even has a product page for it: https://www.elastic.co/enterprise-search/vector-search

Maybe it's something we should do, but I do worry about how much memory it would use.

@mlissner
Member Author

Another option that Harvard folks are playing with: https://www.trychroma.com/

@mlissner
Member Author

I sent a mass email to a lot of our customers to see if people want to contribute money to building this. We'll see what kind of response I get.

@mlissner
Member Author

mlissner commented Dec 14, 2023

One of our clients has done this with our data and he reports the following:

Llama Index makes things a lot easier...

Beyond just getting the data into the vector db, the following optimization activities may need to be conducted:

  • Building an evaluation dataset to measure retrieval performance
  • Tune hyperparameters like chunk size and included metadata
  • Evaluate more advanced chunking strategies (e.g. SentenceWindow chunking)
  • Fine tune an open source embedding model to squeeze out additional performance
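
The sentence-window chunking mentioned above can be sketched without any framework. This is a minimal, illustrative stand-in (not LlamaIndex's actual `SentenceWindowNodeParser` API, and the sample text is made up): each sentence is embedded on its own, but the surrounding sentences are kept as metadata so the retriever returns enough context.

```python
import re

def sentence_window_chunks(text, window=1):
    """Split text into sentences; each chunk pairs one sentence (the unit
    that gets embedded) with `window` neighboring sentences on each side
    (context stored alongside it)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    for i, sent in enumerate(sentences):
        lo = max(0, i - window)
        hi = min(len(sentences), i + window + 1)
        chunks.append({
            "target": sent,                        # sentence to embed
            "window": " ".join(sentences[lo:hi]),  # surrounding context
        })
    return chunks

opinion = ("The officers searched the car. No warrant was issued. "
           "The court suppressed the evidence.")
for chunk in sentence_window_chunks(opinion):
    print(chunk["target"], "=>", chunk["window"])
```

The window size here plays the same role as the chunk-size hyperparameter in the bullets above: larger windows give the model more context per hit at the cost of noisier embeddings.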

@mlissner
Member Author

I talked to a couple of contacts today about this. A couple notes:

  • Once you chunk the data and embed it, that's about 600GB.
  • Qdrant has quantization out of the box that compresses large floats into smaller representations so they take up less memory and save money. I don't know if we can do this with Elastic.
  • LlamaIndex is apparently very helpful, and one contact recommended it over LangChain.
  • Getting the vector size correct is really hard and really important. We might need to try different things and see what performs best.
  • We need an evaluation dataset (with positive and negative hits) to know if our tweaks are working. This feels like a really good academic exercise, or it could also be a great one for FLP to lead and release as leaders in the field.
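
The quantization idea in the second bullet can be sketched in plain Python. This is an illustrative scalar-quantization scheme (not Qdrant's actual implementation): each float32 component becomes one signed byte plus a shared per-vector scale, roughly a 4x memory saving.

```python
def quantize(vec):
    """Scalar-quantize a float vector to int8: store one scale per vector;
    each float32 component becomes a single signed byte (about 4x smaller)."""
    scale = max(abs(x) for x in vec) / 127 or 1.0  # avoid zero scale
    return scale, bytes(round(x / scale) & 0xFF for x in vec)

def dequantize(scale, data):
    """Recover approximate floats from the int8 representation."""
    return [((b - 256) if b > 127 else b) * scale for b in data]

vec = [0.12, -0.98, 0.5, 0.0]
scale, packed = quantize(vec)
approx = dequantize(scale, packed)
# Each component is recovered to within one quantization step (scale),
# which is the accuracy/memory trade-off the bullet describes.
```

The same arithmetic explains why quantization matters at the scale mentioned above: a corpus that needs ~600GB as float32 would need roughly a quarter of that as int8, before any index overhead.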

@n-shamsi

Related issues: #3489, and #3490

@mlissner
Member Author

These analyses are great, thanks @n-shamsi. What would you suggest as the next step?

@n-shamsi

n-shamsi commented Dec 21, 2023

These analyses are great, thanks @n-shamsi. What would you suggest as the next step?

Thank you! I made some tasks on the issues, I think they're good for follow-ups but I am open to suggestions! Here's a summary:

#3489

  • What's the relationship between data type and search accuracy for a given DB?
  • What technical implementations optimize query performance for each DB? Are they suitable for our data?
  • What specific ML integration is used for a given DB, and is it useful for our data?

#3490

  • Select a vector DB and sample dataset

I think I'll start with the one on #3490 because it will help answer the three on #3489 more effectively.

@vonwooding

Re: Technical Implementations to Optimize Query Performance

Thank you, @n-shamsi, for your research and insights.

Exploring Hypothetical Document Embeddings (HyDE) might benefit the project. HyDE transforms a user's query into a hypothetical document, which is then compared for similarity with the existing document set, rather than directly comparing the query itself.

This approach could be particularly useful for Free Law Project users who may not always have the legal expertise to frame precise queries.

While the full applicability and scalability of HyDE in our context remain to be assessed, it could provide a promising direction for our query optimization efforts.

For more information:

Repo: https://github.com/texttron/hyde
Paper: https://arxiv.org/abs/2212.10496
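
The HyDE flow described above can be shown end to end with toy components. Here the embedding is a bag-of-words counter and `generate_hypothetical` returns a hardcoded string; in a real system both would be model calls. The point is the structure: embed the hypothetical answer, not the query, then rank documents against it.

```python
import re
from collections import Counter
from math import sqrt

def embed(text):
    """Toy bag-of-words 'embedding'; a real system would call an
    embedding model here."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def generate_hypothetical(query):
    """Placeholder for the LLM call that writes a fake answer document;
    hardcoded here purely for illustration."""
    return ("The warrantless search of the vehicle violated the Fourth "
            "Amendment, so the evidence was suppressed.")

docs = [
    "The court held the warrantless vehicle search violated the "
    "Fourth Amendment and suppressed the evidence.",
    "The contract dispute turned on the meaning of the delivery clause.",
]

query = "can police search my car without a warrant"

# HyDE: embed the hypothetical answer rather than the raw query,
# then rank real documents by similarity to that embedding.
hyde_vec = embed(generate_hypothetical(query))
best = max(docs, key=lambda d: cosine(hyde_vec, embed(d)))
```

The layperson query shares few terms with the opinion, but the hypothetical answer is written in the corpus's register, which is exactly the benefit for non-expert users described above.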

@n-shamsi

n-shamsi commented Dec 21, 2023

Exploring Hypothetical Document Embeddings (HyDE) might benefit the project.

HyDE sounds awesome. I think we just need a sample dataset to get started. I am also interested in any suggestions from other followers on the issue, I am wondering if there's a particular solution that clients might be interested in having tested? I think Llama is on the shortlist because of that.

@mlissner is there a sample dataset we can use currently, or should @vonwooding and I create one?

@vonwooding

vonwooding commented Dec 21, 2023

fwiw I might suggest SCOTUS Fourth Amendment search/seizure cases. There are probably 600-700 total opinions. Plus, it's a familiar and important issue for practitioners and the public alike.

@mlissner
Member Author

No, there's no evaluation dataset, but this came up yesterday when I talked to somebody else. I think there's an opportunity to create a really nice evaluation dataset. Seems like we should spin that off into its own issue and discuss it there? I'll invite the person I talked to yesterday to chime in.
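
An evaluation dataset of the kind discussed here is, at minimum, a set of (query, known-relevant-document-ids) pairs. Scoring a retriever against it takes a few lines; everything below (the queries, ids, and the retriever output) is made up for illustration.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of the known-relevant doc ids that appear in the
    top-k retrieved results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

# Hypothetical evaluation set: query -> ids of known-relevant opinions.
eval_set = {
    "warrantless car search": ["op-101", "op-202"],
}

# Pretend retriever output for that query, best match first.
retrieved = ["op-101", "op-999", "op-202", "op-303"]

score = recall_at_k(retrieved, eval_set["warrantless car search"], k=3)
# -> 1.0: both relevant opinions appear in the top 3.
```

Tracking a metric like this across runs is what makes the hyperparameter tweaks above (chunk size, embedding model, quantization) measurable rather than guesswork; negative examples, as noted earlier in the thread, matter too and would extend this with a precision-style metric.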

@mlissner
Member Author

I'm reliably (?) informed this leaderboard shows the best models to use for vectorizing:

https://huggingface.co/spaces/mteb/leaderboard

I don't know how true that is, but it's what I hear on the street!

@n-shamsi

Picking up this issue again, we could start with a sample from here: https://github.com/freelawproject/reporters-db/blob/main/reporters_db/data/laws.json

Are there any particular data features that should be included for evaluation?

@mlissner
Member Author

Sorry, I'm not sure what you mean, Nina. Are you suggesting we use the many laws there as the evaluation data set? Should we create a fresh issue for discussing that?
