
Slow index creation [Windows + WSL] #30

Closed
RubenAMtz opened this issue Jan 9, 2024 · 16 comments · Fixed by #56
Labels: bug, help wanted, windows

Comments

@RubenAMtz commented Jan 9, 2024

Hey all, I'm struggling to index a corpus of ~1.1 million passages (after preprocessing). I left the process running all night and it made 1% progress, which can't be right. I'm using 11/32 GB of RAM and 0/8 GB of VRAM.

Is it supposed to use a GPU during indexing? It detects the GPU but doesn't use it.
[two screenshots attached]

@bclavie (Collaborator) commented Jan 9, 2024

Hey @RubenAMtz,

Thanks for this! There are a few problems here: some due to RAGatouille, and one in your code.

1 - The way indexing works is that the documents are first embedded, then processed (this is what Iteration 17 refers to) to create clusters and ensure querying will be super fast. By default, colbert-v2.0 uses 20 k-means iterations, which creates a really strong index! I'll provide an easy way of lowering this in the future for tests, etc. In the meantime, if you'd like to lower it for your own tests, a workaround is to first load RAG normally and then set RAG.model.config.kmeans_niters = 10 (or any other value); see the sketch below.
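For illustration, a minimal sketch of that workaround (assuming the usual `RAGPretrainedModel` loading flow; the sample passages and index name are placeholders):

```python
from ragatouille import RAGPretrainedModel

# Load RAG normally first, then lower the k-means iteration count before
# indexing. The colbert-v2.0 default is 20; fewer iterations means faster
# clustering but a somewhat weaker index.
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
RAG.model.config.kmeans_niters = 10

RAG.index(
    collection=["first passage ...", "second passage ..."],  # placeholder corpus
    index_name="kmeans_test",
)
```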

2 - RAGatouille currently ships with faiss-cpu as the default install, because it supports all platforms and doesn't require a GPU. For indexing, faiss-gpu is much quicker (cc @timothepearce, this is relevant to you too). I need to figure out a way to easily change which one is installed depending on the user's platform, or add a warning at indexing time; faiss is finicky because the two variants are entirely separate packages...

In the meantime, you can manually use faiss-gpu by installing it via pip:

```sh
pip uninstall faiss-cpu
pip install faiss-gpu
```

This should massively speed up indexing! (It'll still be slow!)

In an upcoming release (soon, hopefully), I'll be adding more warnings, both in the documentation and when running .index() so the user is at least made aware more clearly!

3 - The one issue that is on your end: add_to_index should be used very sparingly! With the way ColBERT works, for large volumes of documents it's generally more efficient (especially with faiss-gpu!) to just rebuild the index. For indexing large collections, you'll need to load your data into memory and send it all to RAG.index() in one go, without creating batches (the documents will automatically be processed in batches by .index()); see the sketch below.
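A hedged sketch of that pattern (the `load_passages` helper and file path are hypothetical; the point is a single `.index()` call over the whole collection rather than an add_to_index loop):

```python
from ragatouille import RAGPretrainedModel


def load_passages(path: str) -> list[str]:
    # Hypothetical loader: assumes one preprocessed passage per line.
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Load the whole corpus into memory and hand it to .index() in one call;
# .index() batches the documents internally, so no manual batching needed.
passages = load_passages("passages.txt")
RAG.index(collection=passages, index_name="full_corpus")
```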

@RubenAMtz (Author)

@bclavie I see, that makes sense. I've implemented the changes except for the kmeans_niters parameter; however, I've been waiting for around 30 minutes on this screen:

[screenshot attached]

GPU usage is still at 0; is the long waiting time expected? Maybe I need to adjust niters as you suggested.

@fblissjr

This is pretty much what I get on WSL2: zero progress, no CPU or GPU usage. I made some very minor modifications to the upstream ColBERT code to disable distributed processing (I kept getting remote-node errors with torch), but I'm still stuck here, even on the toy wiki example in notebook 1. Running it as a .py script wrapped in a main guard gives the same result.

Anyone found a way around this?

@bclavie added the help wanted and windows labels Jan 10, 2024
@bclavie (Collaborator) commented Jan 10, 2024

Hey,

Thanks to all of you for flagging these issues! This is all quite odd; there seems to be a lot of variability in how well it runs on Windows/WSL, with some people reporting it working great and (seemingly many) others having all sorts of issues. I appreciate this is quite frustrating!

Supporting Windows is currently not something I can prioritise, but I'd greatly appreciate it if someone managed to figure out what exactly in the upstream library is causing these issues 🤔

@bclavie added the bug label Jan 10, 2024
@bclavie changed the title from "Slow index creation" to "Slow index creation [Windows + WSL]" Jan 10, 2024
@bclavie (Collaborator) commented Jan 10, 2024

In the meantime, the new .rerank() function (example here) could fare better on Windows, because it doesn't rely on multiprocessing. Sadly not a perfect substitute for full-corpus ColBERT search, but it could be worth a try! Rough sketch below.
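Something like this (the exact parameter names follow the linked example as I understand it; treat this as an assumption rather than the canonical API, and the candidate texts are placeholders):

```python
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Candidates can come from any first-stage retriever (BM25, dense, ...);
# ColBERT then re-scores them in memory: no index build, no multiprocessing.
candidates = [
    "ColBERT is a late-interaction retrieval model.",
    "WSL2 runs a Linux kernel inside Windows.",
    "RAGatouille wraps ColBERT behind a simple API.",
]
results = RAG.rerank(query="What is ColBERT?", documents=candidates, k=2)
print(results)
```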

@RubenAMtz (Author)

Thanks, @bclavie, I'll give it a try and keep an eye on this issue; hopefully someone with the time and expertise will come along to find out what's causing these problems.

@fblissjr

@bclavie Thanks for the response! Will share any details if I can nail it down.

@fblissjr

> In the meantime, the new .rerank() function (example here) could fare better on Windows, because it doesn't rely on multiprocessing. Sadly not a perfect substitute for full-corpus ColBERT search, but it could be worth a try!

This one ran quickly and painlessly on my WSL2 setup.

@fblissjr

FYI - this PR in ColBERT fixed it! Indexing ran in under 2 seconds in the 01-basic indexing notebook! It was definitely related to distributed mode on a single-GPU workstation.

stanford-futuredata/ColBERT#290

@bclavie (Collaborator) commented Jan 12, 2024

Hey, thanks for confirming! This PR should indeed fix indexing on Colab & Windows, and we (@Anmol6) are also looking at doing the same for training (once both are done, it'll also open up the way for mps support on MacBooks).

Can't thank @Anmol6 enough for taking this on!

@fblissjr commented Jan 12, 2024

> Hey, thanks for confirming! This PR should indeed fix indexing on Colab & Windows, and we (@Anmol6) are also looking at doing the same for training (once both are done, it'll also open up the way for mps support on MacBooks).
>
> Can't thank @Anmol6 enough for taking this on!

Just tried the last part of example 2 and got that same error as before. The trainer is definitely still forcing distributed torch, but the collection indexer fix sorted out indexing. Good sign!

@bclavie (Collaborator) commented Jan 12, 2024

Yeah, training is still auto-forking even on single GPUs! Changing this is the next step (but indexing felt like a bigger priority, as training on Windows is a rarer use case).

@fblissjr

> Yeah, training is still auto-forking even on single GPUs! Changing this is the next step (but indexing felt like a bigger priority, as training on Windows is a rarer use case).

Totally - seeing the indexing process work gives me the weekend to explore how it all fits together; much of this is intuitive so far. Appreciate it, and looking forward to this project as it grows!

@bclavie (Collaborator) commented Jan 14, 2024

Hey,

Multiprocessing is no longer enforced for indexing when using no GPU or a single GPU, thanks to @Anmol6's excellent upstream work on stanford-futuredata/ColBERT#290, propagated here by #51.

This is likely to fix the indexing problems on Windows (or at least one of them). Performance will likely still be worse than on Linux, but it should at least start and run properly! Let me know if this solves the issue.

@bclavie linked a pull request Jan 15, 2024 that will close this issue
@TheMcSebi

I've gotten it to work relatively quickly on WSL2 by using Python 3.10 and pinning torch to 2.0.1. I'm running CUDA 12.3 on Ubuntu 22.04. This is what I did to successfully install and run RAGatouille:

```sh
conda create -n rag python=3.10
conda activate rag
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
git clone https://github.com/bclavie/RAGatouille
cd RAGatouille/
pip install -e .
pip uninstall faiss-cpu
conda install faiss-gpu
```

To get started, I used a slightly modified version of the code included in the README.md to index my Obsidian notes, which only took about half a minute in total; a sketch of that pattern is below.
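A hypothetical version of that README-style snippet (the vault path, `.md` glob, index name, and query are my assumptions for illustration):

```python
from pathlib import Path

from ragatouille import RAGPretrainedModel

# Gather every Markdown note in the vault (path is made up for illustration).
vault = Path.home() / "obsidian-vault"
notes = [p.read_text(encoding="utf-8") for p in vault.rglob("*.md")]

# Index them all in a single .index() call, then query the resulting index.
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
RAG.index(collection=notes, index_name="obsidian_notes")

results = RAG.search("what did I write about ColBERT?", k=3)
print(results)
```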

I also successfully indexed and queried a large text corpus of roughly 1 GB for testing. It did take a very long time before it started noticeably using the GPU, but the entire process finished within roughly 2 hours.

Some metrics on running queries against an index of this size:

| Conditions | Time to response |
| --- | --- |
| First run after a cold start of WSL2 | 3 minutes until first response |
| Second run | 30 seconds until first response |
| Consecutive queries without restarting the interpreter | less than 1 second 🤯 |

I've uploaded the two scripts I'm using to index and query the database; the search script includes some code to postprocess the resulting documents using llama2 hosted by a local ollama server:

- create_index.py
- do_search.py

It generates surprisingly good and consistent results in my very limited tests.

@fblissjr

I just got myself a Mac Studio M2 Ultra, and have been running this on WSL2 + CUDA (RTX 4090) and now on the Mac. No more issues on either so far (haven't run all the example notebooks yet, just the first few). Bravo, team.
