
Distributed training #116

Open
fonspa opened this issue Apr 3, 2022 · 4 comments

Comments


fonspa commented Apr 3, 2022

Hi,
thanks to all the maintainers of this project, it's a great tool for streamlining the building and tuning of a Faiss index.

I have a quick, possibly dumb, question about training an index in distributed mode. Am I correct that the training is done on the host, i.e. non-distributed, and that only the adding/optimizing part is distributed? After a quick look at the code and docs, that seems to be the case. If so, would it be possible to train the index in a distributed fashion as well?
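For reference, I am launching the build roughly like this; the paths are placeholders and the exact parameter names may differ between autofaiss versions:

```python
# Rough sketch of the distributed build I'm running (placeholder paths;
# parameter names may differ between autofaiss versions).
from autofaiss import build_index

build_index(
    embeddings="hdfs://root/path/embeddings",        # folder of embedding files
    index_path="hdfs://root/path/knn.index",         # merged index output
    index_infos_path="hdfs://root/path/infos.json",
    distributed="pyspark",                            # adding/merging runs on Spark executors
    temporary_indices_folder="hdfs://root/path/tmp_indices",
)
```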


rom1504 commented Apr 3, 2022

hey!
glad you like autofaiss

Currently autofaiss does indeed train the index on a single node.
Usually this is not a problem because the number of points used for training is only a small fraction of the whole embedding set, typically up to 32x the number of clusters, so for example around 3M points even for a billion-scale index.
Training therefore takes at most about an hour.
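To make that concrete, here is a rough sketch of what the training step boils down to in plain faiss; the dimension, index string and sizes are illustrative, not autofaiss internals:

```python
# Rough sketch of the single-node training step (illustrative, not autofaiss internals):
# sample about 32 points per cluster and train the index on that sample only.
import faiss
import numpy as np

d = 128                     # embedding dimension (example value)
nlist = 2 ** 17             # number of IVF clusters
n_train = 32 * nlist        # a few million points, small next to a billion-scale add set

# stands in for a random sample of the real embeddings
xt = np.random.rand(n_train, d).astype("float32")

index = faiss.index_factory(d, f"IVF{nlist},PQ64")
index.train(xt)             # this is the part that currently runs on a single node
# adding the full embedding set with index.add(...) is what gets distributed afterwards
```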

However, distributing the training is technically possible. If you have a use case that requires it, I advise you to look into the pointers I put in issue #101.

What is your use case?


fonspa commented Apr 4, 2022

Well, I usually train my indexes with a number of points at the higher end of the recommended range, about 128 to 256 * nCentroids.
For about half a billion base vectors, I generally try indexes with 2^17 or 2^18 centroids, which is around 20M to 80M points to cluster (see the quick check below), hoping for the best possible coverage of the distribution of my points.
Training with this many points takes a while! Maybe that is being a bit too cautious and I could use far fewer points for training.
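Quick sanity check of those numbers:

```python
# 128-256 training points per centroid, for 2^17 and 2^18 centroids
for n_centroids in (2 ** 17, 2 ** 18):
    for factor in (128, 256):
        print(f"{factor} x 2^{n_centroids.bit_length() - 1}: "
              f"{factor * n_centroids / 1e6:.0f}M points")
```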
Another thing to factor in is that I generally can't use all the cores on the host, as it is usually under medium to heavy load, so I only get a fraction of them, while the cluster has plenty of cores that are less frequently occupied.
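For what it's worth, when I do run on the shared host I cap the threads faiss can use, something like this (the core count is just an example):

```python
import faiss

# Cap faiss's OpenMP thread count so training doesn't grab every core on the shared host
# (8 is just an example value).
faiss.omp_set_num_threads(8)
```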
Thanks a lot for your answer and your pointer to the distributed k-means, it looks very promising.


rom1504 commented Apr 4, 2022

Did you measure better knn recall by using that many training points?
I used 2^17 centroids and 64x training points for a 5B-embedding index and it works well.

In the past we ran experiments on the number of training points and didn't see a big impact from using many more.
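For reference, the kind of recall measurement I mean is just comparing the IVF index's top-k against exact search on a query set, roughly like this (synthetic data and sizes, purely to show the measurement):

```python
# Synthetic sketch of a recall@k measurement: exact search gives ground truth,
# then we count how many of the IVF index's top-k neighbours match it.
import faiss
import numpy as np

d, nb, nq, k = 128, 100_000, 1_000, 10
xb = np.random.rand(nb, d).astype("float32")   # database vectors
xq = np.random.rand(nq, d).astype("float32")   # query vectors

exact = faiss.IndexFlatL2(d)
exact.add(xb)
_, gt = exact.search(xq, k)                    # ground-truth neighbours

ivf = faiss.index_factory(d, "IVF1024,Flat")
ivf.train(xb)                                  # in practice: train on a sample
ivf.add(xb)
ivf.nprobe = 16
_, approx = ivf.search(xq, k)

recall = np.mean([len(set(a) & set(g)) / k for a, g in zip(approx, gt)])
print(f"recall@{k}: {recall:.3f}")
```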


fonspa commented Apr 4, 2022

I did notice a better speed vs. recall@[1,K] tradeoff (for some value of K that I need) when using more training points, most notably for queries drawn from the database vectors themselves.
But I might have to lower the number of training points anyway, as I can't monopolize that many cores on the host for too long. That's why I was hoping to run the training on the cluster cores instead.

Thanks for your input! It's nice and useful to hear how others are approaching the problem.
