Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

negative distance returned in IndexFlatL2 search query #297

Closed
hongyi-zhang opened this issue Dec 28, 2017 · 12 comments
Closed

negative distance returned in IndexFlatL2 search query #297

hongyi-zhang opened this issue Dec 28, 2017 · 12 comments

Comments

@hongyi-zhang
Copy link

Dear all, does anyone know why the following code could return negative entries for D? I am calculating the L2-nearest neighbors of CIFAR images, for which I assume IndexFlatL2 should return non-negative distances (and 0 for exact match).

index = faiss.IndexFlatL2(d)
index.add(Data)
D, I = index.search(Data, 10)

Some notes:

  1. The most problematic case seems to be when some feature dimensions have significantly larger variance than the others, in which case the negative entries can be quite large (say around -10^3).
  2. I get the same problem using either CPU or GPU.
  3. I get correct results running the tutorial code 1-Flat.py.

Thanks!

@mdouze
Copy link
Contributor

mdouze commented Jan 3, 2018

Hi,
Distances for large batches (Data.shape[0] > 20) are computed with d(x, y) = ||x||^2 + ||y||^2 - 2 * <x, y>
Therefore if the magnitude of x and y is very different, roundoff errors may output negative values.
If you reduce the batch size, it switches back to the explicit formulation that necessarily outputs positive distances.
More generally, it is not advised to use vectors with large differences in magnitude: they cause numerical instability and distances are often not significant.

@mdouze
Copy link
Contributor

mdouze commented Jan 16, 2018

No activity. Closing.

@mdouze mdouze closed this as completed Jan 16, 2018
@greaber
Copy link

greaber commented May 31, 2018

Hi, I'm quite new to nearest neighbor techniques and just tried faiss on my data (80 dimensional log magnitude mel spectrogram frames). I was surprised to see negative distances. @mdouze, what exactly you are recommending when you say "it is not advised to use vectors with large differences in magnitude"? A lot of datasets will have big differences in magnitude between different vectors, and you can't necessarily change the dataset (although in my case maybe some transformation, like undoing the log, would improve things, but I don't know any methodology for finding appropriate transformations). Reducing the batch size to below 20 is always a possibility, but I guess it will hurt performance, and it sounds like you are saying it won't actually help accuracy?

@mdouze
Copy link
Contributor

mdouze commented Jun 1, 2018

The problem is that if you have a query vector x and two database vectors y_1 and y_2, where ||x|| >> ||y_1|| and ||x|| >> ||y_2|| then there will be accuracy losses because computations are performed with 32-bit float precision.

For example, in 1D, float-32 => 24 bits mantissa => epsilon = 1/16M, so if there is a factor 16M between the magnitudes of x and y_i then || x - y_1|| = || x- y_2|| = ||x||, so y_1 and y_2 will be indistinguishable. Of course this is an extreme case, but any relative difference M in magnitude does incur a loss of precision that of log2(M) bits.

In the current version of faiss, you can switch between the two implementations by adjusting

distance_compute_blas_threshold

that is set to 20 by default.

@greaber
Copy link

greaber commented Jun 4, 2018

Is it possible to set distance_compute_bias_threshold from python? If so, how exactly?

@mdouze
Copy link
Contributor

mdouze commented Jun 4, 2018

faiss.cvar.distance_compute_blas_threshold = 40

@hsiaoma
Copy link

hsiaoma commented Jun 5, 2018

I met the same problem when I tried to generate simple data for testing. I used [1, 1], [2, 2], .. [N, N] where N = 10^5 as the database, and compared the result between Flat and IVF4096, Flat. I wanted to use Flat as ground truth but it turned out IVF4096, Flat was the accurate one. Now that I read the post, it may be caused by the magnitude issue. I think it may be better if you mention this in the wiki somewhere so beginners like us are less confused.

@serycjon
Copy link

setting the distance_compute_blas_threshold (to value bigger than both my DB size and my query size, using faiss.cvar.distance_compute_blas_threshold = 400000 just after import faiss) doesn't seem to be doing anything with the GpuIndexFlatL2, does this work only for the CPU indexes and GPU always uses the x^2 + y^2 - 2xy trick, or am I doing something wrong?

@mdouze
Copy link
Contributor

mdouze commented Feb 27, 2019

It is only for CPU Faiss. GPU always uses cblas I believe. @wickedfoo ?

@wickedfoo
Copy link
Contributor

GPU always does the -2xy via cublas.

However, L2 distance computation should be prevented from going negative as of the last release I believe.

@mdouze
Copy link
Contributor

mdouze commented Mar 4, 2019

Yes indeed, for both CPU and GPU.

@joshim5
Copy link

joshim5 commented Jun 25, 2020

Perhaps this was fixed for index.search(Data, 10), but I get negative distances for index.range_search. Here's an example:

lims, D, I = index.range_search(data, 555)
>>> min(D)
-3145728.0
>>> max(D)
512.0

Is this numerical overflow?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants