-
A few of us met earlier today. Some notes:
-
For concurrent writes I engaged a little with the LanceDB folks here: lancedb/lancedb#1077. It looks like they won't support concurrent writes, but we can probably just use a …
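Since LanceDB won't handle concurrent writers, one common workaround is to funnel all writes through a single writer thread fed by a queue, while the expensive work (embedding) still runs in parallel. A minimal stdlib sketch of that pattern; `write_batch` is a stand-in of my own for whatever the real table write call ends up being:

```python
import queue
import threading

written = []

def write_batch(batch):
    # Stand-in for a real table write (e.g. something like table.add);
    # the point is that only the writer thread ever touches the table.
    written.extend(batch)

def writer(q):
    # Single consumer: serializes all writes, so workers never
    # contend for the table.
    while True:
        batch = q.get()
        if batch is None:  # sentinel: shut down
            break
        write_batch(batch)

def worker(q, items):
    # Workers do the expensive part (embedding) in parallel and
    # only enqueue results.
    q.put([f"embedded-{i}" for i in items])

q = queue.Queue()
t = threading.Thread(target=writer, args=(q,))
t.start()

workers = [threading.Thread(target=worker, args=(q, [i])) for i in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

q.put(None)  # tell the writer to stop
t.join()

print(sorted(written))  # → ['embedded-0', 'embedded-1', 'embedded-2', 'embedded-3']
```

The same shape works across processes or machines with a distributed queue; only the "exactly one writer" invariant matters.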
-
Here is a brief notebook with a function that yields pages out of arXiv PDFs: https://gist.github.com/mrocklin/39433928ba44ff7e981a2d7355688185 Next up we need to select an embedding function. Once we have that, I think I can start playing with distributed computation and storage into LanceDB. @nenb @pmeier, if either of you has suggestions for good embedding models that can be run locally, that would be welcome.
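Whatever embedding model we pick will have a context limit, so page text will need to be split into chunks before embedding. A stdlib sketch of overlapping character-window chunking; the function name and window sizes are placeholders of mine, to be tuned for the chosen model:

```python
def chunk_text(text, size=512, overlap=64):
    # Split a page's text into overlapping character windows so that
    # sentences straddling a boundary still appear whole in one chunk.
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

page = "word " * 300  # stand-in for text extracted from one PDF page
chunks = chunk_text(page, size=200, overlap=50)
print(len(chunks), all(len(c) <= 200 for c in chunks))  # → 10 True
```

Token-based chunking (using the model's own tokenizer) would be more precise, but character windows are a reasonable first pass.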
-
I can offer some signposts and personal opinions, but I'm not experienced enough with the topic to offer more than that. To get a feel for the outputs and the distributed computation parts of the problem, … More generally, the HF leaderboard is probably closer to the state of the art.

The documentation I find generally very nice for getting things working, but it's still common enough that I'll hit some problem that derails me for a while when trying something out. So that's why I like the solution that Chroma offers above.

If you play around a bit with it, you might notice some limitations, such as encoding tabular data from PDFs (which a lot of people are working on improving right now, I believe). If everything were in a nicely structured Markdown format, I would expect the results to be vastly superior. Any time you can avoid PDFs, or somehow process the information in them into a more machine-friendly format like HTML or Markdown, I would expect the answers to be a lot better. I don't think you need to do this now; it's just a comment that users might want some sort of pre-processing stage at some point.
-
Here is a brief update: https://gist.github.com/mrocklin/f7c1eeb3895a6798b233cd0e3de335ff I can build the database locally. I can also build one in S3. I can't yet get AWS machines to talk to S3 through LanceDB, though (issue here). I can do everything else at scale on cloud machines. I can also spin up GPUs to accelerate that work (I'm not yet sure which is the better choice financially). I'm still not sure how efficient parallel writing will be; that may become a bottleneck. We'll find out once I figure out how to get AWS machines to talk to S3 through LanceDB.
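If parallel writing does turn out to be the bottleneck, one cheap mitigation is batching: fewer, larger writes rather than many tiny ones. A stdlib sketch of a batching helper that could feed record batches into whatever write call we end up using (the record schema here is invented for illustration):

```python
from itertools import islice

def batched(iterable, n):
    # Group records into fixed-size batches; one write per batch
    # instead of one write per record.
    it = iter(iterable)
    while True:
        batch = list(islice(it, n))
        if not batch:
            return
        yield batch

# Hypothetical records: id plus embedding vector.
records = [{"id": i, "vector": [0.0, 1.0]} for i in range(10)]
batches = list(batched(records, 4))
print([len(b) for b in batches])  # → [4, 4, 2]
```

Batch size would need tuning against memory use and whatever write-throughput LanceDB turns out to have on S3.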
-
I've been talking to a couple of people about producing a large-scale ingestion into a vector database to support RAG. Apparently this is a commonly requested workflow, and something that we're probably well set up to help accomplish.
At Coiled we've had some marketing success recently with large-scale heroic calculations on open datasets. I was looking around for a large-scale dataset/problem to solve, and came up with the idea of ingesting the arXiv dataset of scientific pre-print articles. It's around 2,000,000 articles, around 3–4 TB in size, in a public requester-pays bucket on S3. The format is a bunch of PDFs (or source files) in tar files. Here is a notebook which processes all of them and does a trivial check on each PDF.
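Given that layout, each task will need to open one tar archive and pull out just its PDF members. A stdlib sketch, using a tiny in-memory tar as a stand-in for a real archive from the bucket (the member names are made up):

```python
import io
import tarfile

def iter_pdfs(tar_bytes):
    # Yield (name, raw bytes) for each PDF member of a tar archive,
    # skipping non-PDF source files.
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(".pdf"):
                f = tar.extractfile(member)
                if f is not None:
                    yield member.name, f.read()

# Build a tiny in-memory tar standing in for one archive from the bucket.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, data in [("a/0001.pdf", b"%PDF-1.4 fake"), ("a/notes.txt", b"skip me")]:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

pdfs = list(iter_pdfs(buf.getvalue()))
print([name for name, _ in pdfs])  # → ['a/0001.pdf']
```

In the real pipeline the bytes would come from an S3 GET (with the requester-pays flag set) rather than an in-memory buffer.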
I think that it could be fun to process this data, shove it all into some vector database, and then put some chatbot frontend on it. Some challenges that we'll likely face: