
Querying with a 20-item array is much more than 20 times slower than querying with 1-item array. #53

Closed
suzaku opened this issue Mar 22, 2017 · 12 comments

suzaku commented Mar 22, 2017

When comparing index.search of IndexFlatL2 with sklearn's kneighbors implementation, I found that:

  1. faiss was several times faster when searching with 1-item arrays
  2. faiss was much slower when searching with many items at a time

I'm using faiss with OpenBLAS.


mdouze commented Mar 22, 2017

Hi

That seems weird. Could you post a script to reproduce?


suzaku commented Mar 22, 2017

@mdouze
I don't know if this matters, but I'm running the tests inside a Docker container, and the container runs in a 4-core virtual machine managed by Docker for Mac.

Anyway, I'm trying to create a Dockerfile for the faiss + OpenBLAS option, so that we can run the tests in the same environment (except that our machines are different).


mdouze commented Mar 22, 2017

Please just post the source that you are using.


suzaku commented Mar 22, 2017

Since what I'm interested in is exact KNN, the Index class I tried is IndexFlatL2.
I first got this result with my production data, but I can reproduce it with random data generated with numpy:

In [11]: import numpy as np

In [12]: X = np.random.random((1000000, 160)).astype('float32')

In [13]: index = faiss.IndexFlatL2(160)

In [14]: index.add(X)

In [15]: %timeit index.search(X[:20], 20)
1 loop, best of 3: 3.62 s per loop

In [16]: %timeit index.search(X[:1], 20)
10 loops, best of 3: 64.4 ms per loop

In [17]: 64.4 * 20
Out[17]: 1288.0

In [18]: %timeit [index.search(X[i:i+1], 20) for i in xrange(20)]
1 loop, best of 3: 1.27 s per loop
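For reference, the batched brute-force search that is being timed here boils down to computing all query-to-base distances at once via the expansion ||x − y||² = ||x||² + ||y||² − 2·x·y, so the dominant cost is a single matrix multiplication. A minimal NumPy sketch of batched L2 k-NN along these lines (illustrative only, not faiss's actual implementation):

```python
import numpy as np

def knn_l2_batch(queries, base, k):
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y: the dominant cost is the
    # (nq x d) @ (d x nb) matrix multiplication, i.e. one big GEMM for
    # the whole batch -- the kind of call a BLAS-backed path makes.
    q_norms = (queries ** 2).sum(axis=1, keepdims=True)  # (nq, 1)
    b_norms = (base ** 2).sum(axis=1)                    # (nb,)
    dists = q_norms + b_norms - 2.0 * queries @ base.T   # (nq, nb)
    idx = np.argsort(dists, axis=1)[:, :k]               # k smallest per row
    return np.take_along_axis(dists, idx, axis=1), idx

rng = np.random.default_rng(0)
base = rng.random((1000, 16)).astype('float32')
dists, ids = knn_l2_batch(base[:20], base, 20)
```

Since the queries are drawn from the base set, each query's nearest neighbor is itself with distance ~0 (up to float32 round-off, which can make the self-distance slightly negative in this expansion).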


mdouze commented Mar 22, 2017

This may indeed be an issue due to OpenBLAS, because 20 is the threshold where the code switches from the internal distance computation to the OpenBLAS one.

To verify this, could you set the threshold to something higher (1024) in utils.cpp line 855 and test again?

I will check on my side as well.
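For context, the switch described above can be modeled as a size-based dispatch: below the query-count threshold, distances are computed with a per-query loop; at or above it, with one large matrix multiplication handed to BLAS. A rough Python sketch of that control flow (the constant name and structure here are made up for illustration; the real code is C++ in utils.cpp):

```python
import numpy as np

# Hypothetical mirror of the switch in utils.cpp: the threshold value 20
# is the one the discussion above says triggers the slow path.
BLAS_THRESHOLD = 20

def pairwise_l2(queries, base, threshold=BLAS_THRESHOLD):
    if queries.shape[0] < threshold:
        # "internal" path: one pass over the base set per query
        return np.stack([((base - q) ** 2).sum(axis=1) for q in queries])
    # "BLAS" path: a single large matrix multiplication for all queries
    return ((queries ** 2).sum(axis=1, keepdims=True)
            + (base ** 2).sum(axis=1)
            - 2.0 * queries @ base.T)
```

Raising the threshold, as suggested above, simply keeps more batch sizes on the internal path and away from the (here, misbehaving) BLAS call.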


suzaku commented Mar 22, 2017

@mdouze Sure, I will tell you the result after trying.

BTW, when I ran similar tests with sklearn, the results were reversed:

In [19]: from sklearn.neighbors import NearestNeighbors

In [20]: nbrs = NearestNeighbors(n_neighbors=20, algorithm='brute', metric='l2').fit(X)

In [21]: %timeit nbrs.kneighbors(X[:1], n_neighbors=20)
1 loop, best of 3: 179 ms per loop

In [22]: %timeit nbrs.kneighbors(X[:20], n_neighbors=20)
1 loop, best of 3: 603 ms per loop

It ran almost 3 times slower than faiss when querying single items, but it was much faster for batch queries.


mdouze commented Mar 22, 2017

OK, I can repro the issue with OpenBLAS:

In [8]: %timeit index.search(X[:19], 20)
10 loops, best of 3: 115 ms per loop

In [9]: %timeit index.search(X[:20], 20)
1 loops, best of 3: 5.09 s per loop

Let's look for a fix now...


suzaku commented Mar 22, 2017

BTW, I used OMP_NUM_THREADS=8 when running the tests.


mdouze commented Mar 22, 2017

This does not happen with MKL BLAS:

In [1]: import numpy as np

In [2]: import faiss
Failed to load GPU Faiss: No module named swigfaiss_gpu
Faiss falling back to CPU-only.

In [3]: import time

In [4]: X = np.random.random((1000000, 160)).astype('float32')

In [5]: index = faiss.IndexFlatL2(160)

In [6]: index.add(X)

In [7]: t0 = time.time(); index.search(X[:19], 20); print time.time() - t0
0.113594055176

In [8]: t0 = time.time(); index.search(X[:20], 20); print time.time() - t0
0.107301950455


suzaku commented Mar 22, 2017

So, does this suggest that faiss should use different thresholds for different BLAS implementations?


mdouze commented Mar 22, 2017

If possible, I would suggest switching to MKL BLAS. The internal Faiss implementation of exhaustive search is very inefficient in terms of memory accesses for a large number of queries nq. The crossover in speed seems to be between nq=512 and nq=2048 queries, but at nq=512:

BLAS: 10.0 s
internal: 4.6 s

Extrapolating from nq=19, the internal implementation should take 2.6 s. Therefore, neither implementation is satisfactory.


mdouze commented Mar 22, 2017

OpenBLAS and OpenMP are known not to play well together, see https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded.

A few more comments, see session
https://gist.github.com/mdouze/9694e8ec06a8d71add0630ce4e0ea294

  • there is jitter in the performance: OpenBLAS speed seems OK on the first call, then degrades ([10] then [11]), and can become good again later.

  • setting the number of OMP threads to 1 is fast all the time ([15-17], [30-32]). Unfortunately, this will hurt performance when IndexFlatL2 is used as a coarse quantizer.

  • 40 threads seem worse than 20.

Another test, setting OMP_WAIT_POLICY, see https://gist.github.com/mdouze/9a392de0de81614ea2f39b8c724597a3

Setting the policy to PASSIVE seems to solve the problem. My guess is:

  • by default, when OpenMP is idle, it keeps its threads in a busy wait for a few ms and then puts them to sleep.

  • OpenBLAS dynamically tries to guess how many threads it can use. Since it sees the OpenMP threads in their busy wait, it scales the number of threads down.

  • in PASSIVE mode, OpenMP threads go to sleep immediately, which fixes the problem (but OpenMP parallel sections become slower to launch).
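For readers hitting the same problem, the workaround discussed above amounts to setting the environment before the process starts. A sketch (`faiss_bench.py` is a placeholder for your own script; the exact spin-wait behavior before this change depends on the OpenMP runtime):

```shell
# Make idle OpenMP threads sleep immediately instead of busy-waiting,
# so OpenBLAS does not mistake them for busy cores and throttle itself.
export OMP_WAIT_POLICY=PASSIVE
export OMP_NUM_THREADS=8
python faiss_bench.py
```

Note that these variables are read when the OpenMP runtime initializes, so they must be set before the process launches, not from inside it.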

@mdouze mdouze closed this as completed Mar 22, 2017