
Querying with a 20-item array is much more than 20 times slower than querying with 1-item array. #53

Closed
suzaku opened this issue Mar 22, 2017 · 12 comments

suzaku commented Mar 22, 2017

When comparing index.search of IndexFlatL2 with sklearn's kneighbors implementation, I found that:

  1. faiss was several times faster when searching with 1-item arrays
  2. faiss was much slower when searching with many items at a time

I'm using faiss with OpenBLAS.


mdouze commented Mar 22, 2017

Hi

That seems weird. Could you post a script to reproduce?


suzaku commented Mar 22, 2017

@mdouze
I don't know if this matters, but I'm running the tests inside a Docker container, and the container runs in a 4-core virtual machine managed by Docker for Mac.

Anyway, I'm trying to create a Dockerfile for the faiss + OpenBLAS option, so that we can run the tests in the same environment (except that our machines are different).


mdouze commented Mar 22, 2017

Please just post the source that you are using.


suzaku commented Mar 22, 2017

Since what I'm interested in is exact KNN, the Index class I tried is IndexFlatL2.
I first got this result with my production data, but I can reproduce it with random data generated with numpy:

In [11]: import numpy as np

In [12]: X = np.random.random((1000000, 160)).astype('float32')

In [13]: index = faiss.IndexFlatL2(160)

In [14]: index.add(X)

In [15]: %timeit index.search(X[:20], 20)
1 loop, best of 3: 3.62 s per loop

In [16]: %timeit index.search(X[:1], 20)
10 loops, best of 3: 64.4 ms per loop

In [17]: 64.4 * 20
Out[17]: 1288.0

In [18]: %timeit [index.search(X[i:i+1], 20) for i in xrange(20)]
1 loop, best of 3: 1.27 s per loop
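For reference, the batched brute-force search that is being timed here boils down to computing all query-to-base distances at once via the expansion ||x − y||² = ||x||² + ||y||² − 2·x·y, so the dominant cost is a single matrix multiplication. A minimal NumPy sketch of batched L2 k-NN along these lines (illustrative only, not faiss's actual implementation):

```python
import numpy as np

def knn_l2_batch(queries, base, k):
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y: the dominant cost is the
    # (nq x d) @ (d x nb) matrix multiplication, i.e. one big GEMM for
    # the whole batch -- the kind of call a BLAS-backed path makes.
    q_norms = (queries ** 2).sum(axis=1, keepdims=True)  # (nq, 1)
    b_norms = (base ** 2).sum(axis=1)                    # (nb,)
    dists = q_norms + b_norms - 2.0 * queries @ base.T   # (nq, nb)
    idx = np.argsort(dists, axis=1)[:, :k]               # k smallest per row
    return np.take_along_axis(dists, idx, axis=1), idx

rng = np.random.default_rng(0)
base = rng.random((1000, 16)).astype('float32')
dists, ids = knn_l2_batch(base[:20], base, 20)
```

Since the queries are drawn from the base set, each query's nearest neighbor is itself with distance ~0 (up to float32 round-off, which can make the self-distance slightly negative in this expansion).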


mdouze commented Mar 22, 2017

This may indeed be an issue due to OpenBLAS, because 20 is the threshold where the code switches from the internal distance computation to the OpenBLAS one.

To verify this, could you set the threshold to something higher (1024) in utils.cpp line 855 and test again?

I will check on my side as well.
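For context, the switch described above can be modeled as a size-based dispatch: below the query-count threshold, distances are computed with a per-query loop; at or above it, with one large matrix multiplication handed to BLAS. A rough Python sketch of that control flow (the constant name and structure here are made up for illustration; the real code is C++ in utils.cpp):

```python
import numpy as np

# Hypothetical mirror of the switch in utils.cpp: the threshold value 20
# is the one the discussion above says triggers the slow path.
BLAS_THRESHOLD = 20

def pairwise_l2(queries, base, threshold=BLAS_THRESHOLD):
    if queries.shape[0] < threshold:
        # "internal" path: one pass over the base set per query
        return np.stack([((base - q) ** 2).sum(axis=1) for q in queries])
    # "BLAS" path: a single large matrix multiplication for all queries
    return ((queries ** 2).sum(axis=1, keepdims=True)
            + (base ** 2).sum(axis=1)
            - 2.0 * queries @ base.T)
```

Raising the threshold, as suggested above, simply keeps more batch sizes on the internal path and away from the (here, misbehaving) BLAS call.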


suzaku commented Mar 22, 2017

@mdouze Sure, I will tell you the result after trying.

BTW, when I ran similar tests with sklearn, the results were reversed:

In [19]: from sklearn.neighbors import NearestNeighbors

In [20]: nbrs = NearestNeighbors(n_neighbors=20, algorithm='brute', metric='l2').fit(X)

In [21]: %timeit nbrs.kneighbors(X[:1], n_neighbors=20)
1 loop, best of 3: 179 ms per loop

In [22]: %timeit nbrs.kneighbors(X[:20], n_neighbors=20)
1 loop, best of 3: 603 ms per loop

It ran almost 3 times slower than faiss when querying single items, but it was much faster for batch queries.


mdouze commented Mar 22, 2017

OK, I can repro the issue with OpenBLAS:

In [8]: %timeit index.search(X[:19], 20)
10 loops, best of 3: 115 ms per loop

In [9]: %timeit index.search(X[:20], 20)
1 loops, best of 3: 5.09 s per loop

Let's look for a fix now...


suzaku commented Mar 22, 2017

BTW, I used OMP_NUM_THREADS=8 when running the tests.


mdouze commented Mar 22, 2017

This does not happen with MKL BLAS:

In [1]: import numpy as np

In [2]: import faiss
Failed to load GPU Faiss: No module named swigfaiss_gpu
Faiss falling back to CPU-only.

In [3]: import time

In [4]: X = np.random.random((1000000, 160)).astype('float32')

In [5]: index = faiss.IndexFlatL2(160)

In [6]: index.add(X)

In [7]: t0 = time.time(); index.search(X[:19], 20); print time.time() - t0
0.113594055176

In [8]: t0 = time.time(); index.search(X[:20], 20); print time.time() - t0
0.107301950455


suzaku commented Mar 22, 2017

So, does this suggest that faiss should use different thresholds for different BLAS implementations?


mdouze commented Mar 22, 2017

If possible, I would suggest switching to MKL BLAS. The internal Faiss implementation of exhaustive search is very inefficient in terms of memory accesses for a large number of queries nq. The crossover in speed seems to be between nq=512 and nq=2048 queries, but at nq=512:

BLAS: 10.0 s
internal: 4.6 s

Extrapolating from nq=19, the internal implementation should take 2.6 s. Therefore, neither implementation is satisfactory.


mdouze commented Mar 22, 2017

OpenBLAS and OpenMP are known not to play well together, see https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded.

A few more comments, see session
https://gist.github.com/mdouze/9694e8ec06a8d71add0630ce4e0ea294

  • there is jitter in the performance: OpenBLAS speed seems OK on the first call, then degrades ([10] then [11]), and can become good again later.

  • setting the number of OMP threads to 1 is fast all the time ([15-17], [30-32]). Unfortunately, this will hurt performance when IndexFlatL2 is used as a coarse quantizer.

  • 40 threads seem worse than 20.

Another test, setting OMP_WAIT_POLICY, see https://gist.github.com/mdouze/9a392de0de81614ea2f39b8c724597a3

Setting the policy to PASSIVE seems to solve the problem. My guess is:

  • by default, when OpenMP is idle, it keeps its threads in a busy wait for a few ms and then puts them to sleep.

  • OpenBLAS dynamically tries to guess how many threads it can use. Since it sees the OpenMP threads in their busy wait, it scales the number of threads down.

  • in PASSIVE mode, OpenMP threads go to sleep immediately, which fixes the problem (but OpenMP parallel sections become slower to launch).
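For readers hitting the same problem, the workaround discussed above amounts to setting the environment before the process starts. A sketch (`faiss_bench.py` is a placeholder for your own script; the exact spin-wait behavior before this change depends on the OpenMP runtime):

```shell
# Make idle OpenMP threads sleep immediately instead of busy-waiting,
# so OpenBLAS does not mistake them for busy cores and throttle itself.
export OMP_WAIT_POLICY=PASSIVE
export OMP_NUM_THREADS=8
python faiss_bench.py
```

Note that these variables are read when the OpenMP runtime initializes, so they must be set before the process launches, not from inside it.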

@mdouze mdouze closed this as completed Mar 22, 2017