negative distance returned in IndexFlatL2 search query #297

hongyi-zhang · 2017-12-28T00:33:39Z

Dear all, does anyone know why the following code could return negative entries for D? I am calculating the L2-nearest neighbors of CIFAR images, for which I assume IndexFlatL2 should return non-negative distances (and 0 for exact match).

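import faiss

# Data: float32 array of shape (n, d), one CIFAR feature vector per row
index = faiss.IndexFlatL2(d)
index.add(Data)
D, I = index.search(Data, 10)  # D holds squared L2 distances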

Some notes:

The most problematic case seems to be when some feature dimensions have significantly larger variance than the others, in which case the negative entries can be quite large (say around -10^3).
I get the same problem using either CPU or GPU.
I get correct results running the tutorial code 1-Flat.py.

Thanks!

mdouze · 2018-01-03T22:23:36Z

Hi,
Distances for large batches (Data.shape[0] > 20) are computed with d(x, y) = ||x||^2 + ||y||^2 - 2 * <x, y>.
Therefore, if the magnitudes of x and y are very different, roundoff errors may produce negative values.
If you reduce the batch size, it switches back to the explicit formulation, which necessarily outputs non-negative distances.
More generally, it is not advised to use vectors with large differences in magnitude: they cause numerical instability and distances are often not significant.
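
To make the roundoff concrete, here is a minimal sketch that evaluates both formulations with every intermediate result rounded to float32. The helper names (f32, dot32, l2sqr_expanded, l2sqr_explicit), the input values, and the order of operations are assumptions chosen for illustration; this is not Faiss's actual BLAS-based kernel.

```python
import struct

def f32(value):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]

def dot32(a, b):
    """Inner product with every intermediate result rounded to float32."""
    acc = 0.0
    for ai, bi in zip(a, b):
        acc = f32(acc + f32(ai * bi))
    return acc

def l2sqr_expanded(x, y):
    """||x||^2 + ||y||^2 - 2 * <x, y>, evaluated in float32."""
    norms = f32(dot32(x, x) + dot32(y, y))
    return f32(norms - f32(2.0 * dot32(x, y)))

def l2sqr_explicit(x, y):
    """sum_i (x_i - y_i)^2, evaluated in float32; a sum of squares cannot be negative."""
    acc = 0.0
    for xi, yi in zip(x, y):
        diff = f32(xi - yi)
        acc = f32(acc + f32(diff * diff))
    return acc

# Illustrative 1-D vectors: large norms, tiny true squared distance (0.25).
x = [4097.0]
y = [4097.5]

expanded = l2sqr_expanded(x, y)  # the three rounded terms cancel badly
explicit = l2sqr_explicit(x, y)  # the difference is formed first, so no cancellation
print("expanded:", expanded)
print("explicit:", explicit)
```

With these inputs the expanded form comes out negative while the explicit form returns the exact squared distance, because each of the three large terms is rounded separately before they are subtracted.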

mdouze · 2018-01-16T17:22:40Z

No activity. Closing.

greaber · 2018-05-31T16:34:34Z

Hi, I'm quite new to nearest neighbor techniques and just tried faiss on my data (80-dimensional log-magnitude mel spectrogram frames). I was surprised to see negative distances. @mdouze, what exactly are you recommending when you say "it is not advised to use vectors with large differences in magnitude"? A lot of datasets have big differences in magnitude between vectors, and you can't necessarily change the dataset (in my case some transformation, like undoing the log, might improve things, but I don't know of any methodology for finding appropriate transformations). Reducing the batch size to below 20 is always a possibility, but I guess it will hurt performance, and it sounds like you are saying it won't actually help accuracy?

mdouze · 2018-06-01T08:52:21Z

The problem is that if you have a query vector x and two database vectors y_1 and y_2, where ||x|| >> ||y_1|| and ||x|| >> ||y_2|| then there will be accuracy losses because computations are performed with 32-bit float precision.

For example, in 1D, float-32 => 24-bit mantissa => epsilon = 1/16M, so if there is a factor 16M between the magnitudes of x and y_i then ||x - y_1|| = ||x - y_2|| = ||x||, so y_1 and y_2 will be indistinguishable. Of course this is an extreme case, but any relative difference M in magnitude incurs a loss of precision of log2(M) bits.
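
The 1D case can be checked with a small sketch that rounds each result to float32. The helper f32 and the values of x, y_1 and y_2 are hypothetical, picked so that the magnitudes differ by more than the 16M factor.

```python
import struct

def f32(value):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]

x = 2.0 ** 24             # 16777216.0, the "16M" scale of the query
y_1, y_2 = 0.125, 0.375   # two distinct, much smaller database points

# Distances |x - y_i| as float32 would store them.
d_1 = abs(f32(x - y_1))
d_2 = abs(f32(x - y_2))

print(d_1, d_2, x)
print("indistinguishable:", d_1 == d_2 == x)
```

Both differences round back to x itself, so a float32 search cannot tell which of y_1 and y_2 is closer.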

In the current version of faiss, you can switch between the two implementations by adjusting distance_compute_blas_threshold, which is set to 20 by default.

greaber · 2018-06-04T13:33:08Z

Is it possible to set distance_compute_blas_threshold from python? If so, how exactly?

mdouze · 2018-06-04T15:23:10Z

faiss.cvar.distance_compute_blas_threshold = 40

hsiaoma · 2018-06-05T03:19:58Z

I ran into the same problem when generating simple data for testing. I used [1, 1], [2, 2], ..., [N, N] with N = 10^5 as the database, and compared the results of Flat and IVF4096, Flat. I wanted to use Flat as ground truth, but it turned out IVF4096, Flat was the accurate one. Having read this thread, I think it may be caused by the magnitude issue. It would help to mention this somewhere in the wiki so that beginners like us are less confused.

serycjon · 2019-02-27T10:22:32Z

Setting distance_compute_blas_threshold to a value bigger than both my DB size and my query size (using faiss.cvar.distance_compute_blas_threshold = 400000 just after import faiss) doesn't seem to have any effect on GpuIndexFlatL2. Does this work only for the CPU indexes, with the GPU always using the x^2 + y^2 - 2xy trick, or am I doing something wrong?

mdouze · 2019-02-27T10:39:21Z

It is only for CPU Faiss. GPU always uses cblas I believe. @wickedfoo ?

wickedfoo · 2019-03-03T02:21:01Z

GPU always does the -2xy via cublas.

However, L2 distance computation should be prevented from going negative as of the last release I believe.

mdouze · 2019-03-04T12:58:01Z

Yes indeed, for both CPU and GPU.
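
One common way to keep squared distances from going negative is to clamp them at zero after the computation. The sketch below shows that idea only; the helper name and sample values are hypothetical, and whether Faiss does exactly this internally is an assumption.

```python
def clamp_squared_distances(distances):
    """Replace negative roundoff artifacts with 0.0 (hypothetical helper)."""
    return [max(0.0, d) for d in distances]

raw = [-4.0, 0.25, 512.0]   # illustrative squared distances, one hit by roundoff
clamped = clamp_squared_distances(raw)
print(clamped)
```

Clamping only hides the sign error. It does not restore the precision lost to roundoff, so rankings among very close neighbors can still be unreliable.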

joshim5 · 2020-06-25T22:18:32Z

Perhaps this was fixed for index.search(Data, 10), but I get negative distances for index.range_search. Here's an example:

>>> lims, D, I = index.range_search(data, 555)
>>> min(D)
-3145728.0
>>> max(D)
512.0

Is this numerical overflow?

mdouze closed this as completed Jan 16, 2018

beauby mentioned this issue Apr 29, 2019

Why did not use openblas when the query vector number < 20? #806

Closed

JinHai-CN mentioned this issue Jul 21, 2019

Faiss tuning issue #897

Closed

mdouze mentioned this issue Jan 16, 2020

Problem: precision of faiss output #1086

Closed

joshim5 mentioned this issue Jun 25, 2020

negative distance returned in IndexFlatL2 range_search query #1266

Closed

mdouze mentioned this issue Jun 13, 2022

Faiss: how to count Euclidean distance #2347

Closed

kno10 mentioned this issue Mar 22, 2024

IndexFlatL2 calculations are wrong #3245

Closed
