-
A few of us met earlier today. Some notes:
-
For concurrent writes I engaged a little with the LanceDB folks here: lancedb/lancedb#1077. It looks like they won't support concurrent writes, but we can probably just use a …
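Since LanceDB won't handle concurrent writers, one common workaround is to funnel all writes through a single writer thread fed by a queue, while the expensive work (embedding) still runs in parallel. A minimal stdlib sketch of that pattern; `write_batch` is a stand-in of my own for whatever the real table write call ends up being:

```python
import queue
import threading

written = []

def write_batch(batch):
    # Stand-in for a real table write (e.g. something like table.add);
    # the point is that only the writer thread ever touches the table.
    written.extend(batch)

def writer(q):
    # Single consumer: serializes all writes, so workers never
    # contend for the table.
    while True:
        batch = q.get()
        if batch is None:  # sentinel: shut down
            break
        write_batch(batch)

def worker(q, items):
    # Workers do the expensive part (embedding) in parallel and
    # only enqueue results.
    q.put([f"embedded-{i}" for i in items])

q = queue.Queue()
t = threading.Thread(target=writer, args=(q,))
t.start()

workers = [threading.Thread(target=worker, args=(q, [i])) for i in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

q.put(None)  # tell the writer to stop
t.join()

print(sorted(written))  # → ['embedded-0', 'embedded-1', 'embedded-2', 'embedded-3']
```

The same shape works across processes or machines with a distributed queue; only the "exactly one writer" invariant matters.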
-
Here is a brief notebook with a function that yields pages out of arXiv PDFs: https://gist.github.com/mrocklin/39433928ba44ff7e981a2d7355688185 Next up we need to select an embedding function. Once we have that, I think I can start playing with distributed computation and storage into LanceDB. @nenb @pmeier, if either of you has suggestions for good embedding models that can be run locally, that would be welcome.
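Whatever embedding model we pick will have a context limit, so page text will need to be split into chunks before embedding. A stdlib sketch of overlapping character-window chunking; the function name and window sizes are placeholders of mine, to be tuned for the chosen model:

```python
def chunk_text(text, size=512, overlap=64):
    # Split a page's text into overlapping character windows so that
    # sentences straddling a boundary still appear whole in one chunk.
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

page = "word " * 300  # stand-in for text extracted from one PDF page
chunks = chunk_text(page, size=200, overlap=50)
print(len(chunks), all(len(c) <= 200 for c in chunks))  # → 10 True
```

Token-based chunking (using the model's own tokenizer) would be more precise, but character windows are a reasonable first pass.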
-
I can offer some signposts and personal opinions, but I'm not experienced enough with the topic to offer more than that. To get a feel for the outputs and the distributed computation parts of the problem, … More generally, the HF leaderboard is probably closer to the state of the art.

The documentation I find generally very nice for getting things working, but it's still common enough that I'll hit some problem that derails me for a while when trying something out. So that's why I like the solution that Chroma offers above.

If you play around a bit with it, you might notice some limitations, such as encoding tabular data from PDFs (which a lot of people are working on improving right now, I believe). If everything were in a nicely structured Markdown format, I would expect the results to be vastly superior. Any time you can avoid PDFs, or somehow process the information in them into a more machine-friendly format like HTML or Markdown, I would expect the answers to be a lot better. I don't think you need to do this now; it's just a comment that users might want some sort of pre-processing stage at some point.
-
Here is a brief update: https://gist.github.com/mrocklin/f7c1eeb3895a6798b233cd0e3de335ff I can build the database locally. I can also build one in S3. I can't yet get AWS machines to talk to S3 through LanceDB, though (issue here). I can do everything else at scale on cloud machines. I can also spin up GPUs to accelerate that work (I'm not yet sure which is the better choice financially). I'm still not sure how efficient parallel writing will be; that may become a bottleneck. We'll find out once I figure out how to get AWS machines to talk to S3 through LanceDB.
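If parallel writing does turn out to be the bottleneck, one cheap mitigation is batching: fewer, larger writes rather than many tiny ones. A stdlib sketch of a batching helper that could feed record batches into whatever write call we end up using (the record schema here is invented for illustration):

```python
from itertools import islice

def batched(iterable, n):
    # Group records into fixed-size batches; one write per batch
    # instead of one write per record.
    it = iter(iterable)
    while True:
        batch = list(islice(it, n))
        if not batch:
            return
        yield batch

# Hypothetical records: id plus embedding vector.
records = [{"id": i, "vector": [0.0, 1.0]} for i in range(10)]
batches = list(batched(records, 4))
print([len(b) for b in batches])  # → [4, 4, 2]
```

Batch size would need tuning against memory use and whatever write-throughput LanceDB turns out to have on S3.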
-
I've been talking to a couple of people about producing a large-scale ingestion into a vector database to support RAG. Apparently this is a commonly requested workflow, and something that we're probably well set up to help accomplish.
At Coiled we've had some marketing success recently with large-scale heroic calculations on open datasets. I was looking around for a large-scale dataset/problem to solve, and came up with the idea of ingesting the arXiv dataset of scientific pre-print articles. It's around 2,000,000 articles, around 3–4 TB in size, in a public requester-pays bucket on S3. The format is a bunch of PDFs (or source files) in tar files. Here is a notebook which processes all of them and does a trivial check on each PDF.
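Given that layout, each task will need to open one tar archive and pull out just its PDF members. A stdlib sketch, using a tiny in-memory tar as a stand-in for a real archive from the bucket (the member names are made up):

```python
import io
import tarfile

def iter_pdfs(tar_bytes):
    # Yield (name, raw bytes) for each PDF member of a tar archive,
    # skipping non-PDF source files.
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(".pdf"):
                f = tar.extractfile(member)
                if f is not None:
                    yield member.name, f.read()

# Build a tiny in-memory tar standing in for one archive from the bucket.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, data in [("a/0001.pdf", b"%PDF-1.4 fake"), ("a/notes.txt", b"skip me")]:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

pdfs = list(iter_pdfs(buf.getvalue()))
print([name for name, _ in pdfs])  # → ['a/0001.pdf']
```

In the real pipeline the bytes would come from an S3 GET (with the requester-pays flag set) rather than an in-memory buffer.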
I think that it could be fun to process this data, shove it all into some vector database, and then put some chatbot frontend on it. Some challenges that we'll likely face: