
Distributed training #116

Open
fonspa opened this issue Apr 3, 2022 · 4 comments

Comments


fonspa commented Apr 3, 2022

Hi,
thanks to all the maintainers of this project, it's a great tool for streamlining the building and tuning of a Faiss index.

I have a quick, possibly dumb, question about training an index in distributed mode. Am I correct that the training is done on the host, i.e. non-distributed, and that only the adding/optimizing part is distributed? After a quick look at the code and docs, that seems to be the case. If so, would it be possible to train the index in a distributed fashion as well?
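For reference, I am launching the build roughly like this; the paths are placeholders and the exact parameter names may differ between autofaiss versions:

```python
# Rough sketch of the distributed build I'm running (placeholder paths;
# parameter names may differ between autofaiss versions).
from autofaiss import build_index

build_index(
    embeddings="hdfs://root/path/embeddings",        # folder of embedding files
    index_path="hdfs://root/path/knn.index",         # merged index output
    index_infos_path="hdfs://root/path/infos.json",
    distributed="pyspark",                            # adding/merging runs on Spark executors
    temporary_indices_folder="hdfs://root/path/tmp_indices",
)
```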


rom1504 commented Apr 3, 2022

hey!
glad you like autofaiss

Currently autofaiss does indeed train the index on a single node.
Usually this is not a problem because the number of points used for training is only a small fraction of the whole embedding set, typically up to 32x the number of clusters, so for example around 3M points even for a billion-scale index.
Training therefore takes at most about an hour.
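To make that concrete, here is a rough sketch of what the training step boils down to in plain faiss; the dimension, index string and sizes are illustrative, not autofaiss internals:

```python
# Rough sketch of the single-node training step (illustrative, not autofaiss internals):
# sample about 32 points per cluster and train the index on that sample only.
import faiss
import numpy as np

d = 128                     # embedding dimension (example value)
nlist = 2 ** 17             # number of IVF clusters
n_train = 32 * nlist        # a few million points, small next to a billion-scale add set

# stands in for a random sample of the real embeddings
xt = np.random.rand(n_train, d).astype("float32")

index = faiss.index_factory(d, f"IVF{nlist},PQ64")
index.train(xt)             # this is the part that currently runs on a single node
# adding the full embedding set with index.add(...) is what gets distributed afterwards
```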

However, distributing the training is technically possible. If you have a use case that requires it, I advise you to look into the pointers I put in issue #101.

What is your use case?


fonspa commented Apr 4, 2022

Well, I usually train my indexes with a number of points at the higher end of the recommended range, about 128 to 256 * nCentroids.
For about half a billion base vectors, I generally try indexes with 2^17 or 2^18 centroids, which is around 20M to 80M points to cluster (see the quick check below), hoping for the best possible coverage of the distribution of my points.
Training with this many points takes a while! Maybe that is being a bit too cautious and I could use far fewer points for training.
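Quick sanity check of those numbers:

```python
# 128-256 training points per centroid, for 2^17 and 2^18 centroids
for n_centroids in (2 ** 17, 2 ** 18):
    for factor in (128, 256):
        print(f"{factor} x 2^{n_centroids.bit_length() - 1}: "
              f"{factor * n_centroids / 1e6:.0f}M points")
```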
Another thing to factor in is that I generally can't use all the cores on the host, as it is usually under medium to heavy load, so I only get a fraction of them, while the cluster has plenty of cores that are less frequently occupied.
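For what it's worth, when I do run on the shared host I cap the threads faiss can use, something like this (the core count is just an example):

```python
import faiss

# Cap faiss's OpenMP thread count so training doesn't grab every core on the shared host
# (8 is just an example value).
faiss.omp_set_num_threads(8)
```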
Thanks a lot for your answer and your pointer to the distributed k-means, it looks very promising.


rom1504 commented Apr 4, 2022

Did you measure better knn recall by using that many training points?
I used 2^17 centroids and 64x training points for a 5B-embedding index and it works well.

In the past we ran experiments on the number of training points and didn't see a big impact from using many more.
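For reference, the kind of recall measurement I mean is just comparing the IVF index's top-k against exact search on a query set, roughly like this (synthetic data and sizes, purely to show the measurement):

```python
# Synthetic sketch of a recall@k measurement: exact search gives ground truth,
# then we count how many of the IVF index's top-k neighbours match it.
import faiss
import numpy as np

d, nb, nq, k = 128, 100_000, 1_000, 10
xb = np.random.rand(nb, d).astype("float32")   # database vectors
xq = np.random.rand(nq, d).astype("float32")   # query vectors

exact = faiss.IndexFlatL2(d)
exact.add(xb)
_, gt = exact.search(xq, k)                    # ground-truth neighbours

ivf = faiss.index_factory(d, "IVF1024,Flat")
ivf.train(xb)                                  # in practice: train on a sample
ivf.add(xb)
ivf.nprobe = 16
_, approx = ivf.search(xq, k)

recall = np.mean([len(set(a) & set(g)) / k for a, g in zip(approx, gt)])
print(f"recall@{k}: {recall:.3f}")
```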


fonspa commented Apr 4, 2022

I did notice a better speed vs. recall@[1,K] tradeoff (for some value of K that I need) when using more training points, most notably for queries drawn from the database vectors themselves.
But I might have to lower the number of training points anyway, as I can't monopolize that many cores on the host for too long. That's why I was hoping to run the training on the cluster cores instead.

Thanks for your input! It's nice and useful to hear how others are approaching the problem.
