[Bug?] Index retrieval is not self-consistent. #176

ysig · 2023-10-24T10:44:21Z

Hi,

I had very bad scores using the autofaiss build_index method and except that the bug could have been caused from the fact that when you build_index you don't save the fname that corresponds to each index ... I also noticed something utterly strange. In particular, the index is not self-consistent. That means if I use the same set of embeddings that I used to build an index to do retrieval I don't always get the same point (yet the reconstruction error is zero when building the index).

import numpy as np
from autofaiss import build_index

data_path = 'images'
features_dir = 'features'

files = [f for f in os.listdir(data_path)]
full = [join(path, f) for f in os.listdir(data_path)]
index = build_index(embeddings=np.concatenate([np.load(f) for f in full], axis=0), nb_cores=12, save_on_disk=False)[0]

how_many = 0
for f in os.listdir(features_dir):
    query_vector = np.load(join(features_dir, f))
    distances, indices = index.search(query_vector, 1)
    distances, indices = np.squeeze(distances), np.squeeze(indices)
    how_many += int(files[indices[0]] != f)

print(which_many)

Which outputs for my data around 178 (out of 1000000 points).

Is this a bug - or do I need different input parameters to make this work properly?

I use faiss==1.7.4 and autofaiss==2.15.8.

Thank you,
ysig

The text was updated successfully, but these errors were encountered:

rom1504 · 2023-10-24T14:23:55Z

What is fname ? What do you call self-consistent ? Approximate knn index do not have a top1 accuracy of 100%. If you need that, use an exact index instead. With only 1M points it can make sense if you don't need too much speed. If you do need speed, then you'll have to live with less than 100% of accuracy and indeed you can tweak the option to pick something that works for you (read the readme for this)

…

On Tue, Oct 24, 2023 at 12:44 PM ysig ***@***.***> wrote: Hi, I had very bad scores using the autofaiss build_index method and except that the bug could have been caused from the fact that when you build_index you don't save the fname that corresponds to each index ... I also noticed something utterly strange. In particular, the index is not self-consistent. That means if I use the same set of embeddings that I used to build an index to do retrieval I don't always get the same point (yet the reconstruction error is zero when building the index). import numpy as npfrom autofaiss import build_index data_path = 'images'features_dir = 'features' files = [f for f in os.listdir(data_path)]full = [join(path, f) for f in os.listdir(data_path)]index = build_index(embeddings=np.concatenate([np.load(f) for f in full], axis=0), nb_cores=12, save_on_disk=False)[0] how_many = 0for f in os.listdir(features_dir): query_vector = np.load(join(features_dir, f)) neighbors = [] distances, indices = index.search(query_vector, 1) distances, indices = np.squeeze(distances), np.squeeze(indices) how_many += int(files[indices[0]] != f) print(which_many) Which outputs for my data around 178 (out of 1000000 points). Is this a bug - or do I need different input parameters to make this work properly? I use faiss==1.7.4 and autofaiss==2.15.8. Thank you, ysig — Reply to this email directly, view it on GitHub <#176>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAR437QJNKRVC47NJDDX2B3YA6LZDAVCNFSM6AAAAAA6NPQR66VHI2DSMVQWIX3LMV43ASLTON2WKOZRHE2TQOJYGEZTKNQ> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

ysig · 2023-10-24T15:48:03Z

Hi @rom1504 and thanks for your answer.

fname is file name.
self-consistent means top-1 accuracy = 100%

Well for nearest neighbors I don't think it makes sense to not have 100% accuracy otherwise I'm not sure it make sense to use it? I need reasonable speed, I have 5M data points. What should I use?

rom1504 · 2023-10-24T16:27:07Z

there is no need to save file names. The files are read in alphabetical order

"Well for nearest neighbors I don't think it makes sense to not have 100% accuracy" it does make a lot of sense. Almost no use case requires 100% accuracy. Usually a lower accuracy for the knn part does not even impact the overall quality of the system.

What is reasonable speed? query speed of 1min 1s 1ms ?

if you want perfect knn accuracy (note that says nothing on your system quality) you can use a faiss flat index

I encourage you to read some more on approximate knn first, there are some links in the readme

ysig · 2023-10-24T17:41:43Z

@rom1504 Thanks again for your detailed answer!

There is no need to save file names. The files are read in alphabetical order:

imagine you work on a server than deletes a file randomly (happens with Jean Zay)
In general it's good practice to at least give the user the option to save - it allows for guaranteed consistency
index should be self contained - i.e. after computing the index you can transfer it without keeping the original embeds.

Now my purpose/scope it to measure embedding knn accuracy for a baseline on a certain task.
It's for a research paper and the result needs to be scientifically precise.

Flat index will be very slow however:

I have around 500,000 test points and around 5M train points split in 5 indexes so given that my budget then is B = 5* T * 500000 if I set my budget B to one day, i.e. B = 86400 seconds then T = 86400 / (5*500000) = 0.03456 s,
i.e. I need queries that cost <= 0.03456 + ε seconds.

Could you propose something?

Thank you,

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug?] Index retrieval is not self-consistent. #176

[Bug?] Index retrieval is not self-consistent. #176

ysig commented Oct 24, 2023 •

edited

Loading

rom1504 commented Oct 24, 2023 via email

ysig commented Oct 24, 2023 •

edited

Loading

rom1504 commented Oct 24, 2023

ysig commented Oct 24, 2023 •

edited

Loading

[Bug?] Index retrieval is not self-consistent. #176

[Bug?] Index retrieval is not self-consistent. #176

Comments

ysig commented Oct 24, 2023 • edited Loading

rom1504 commented Oct 24, 2023 via email

ysig commented Oct 24, 2023 • edited Loading

rom1504 commented Oct 24, 2023

ysig commented Oct 24, 2023 • edited Loading

ysig commented Oct 24, 2023 •

edited

Loading

ysig commented Oct 24, 2023 •

edited

Loading

ysig commented Oct 24, 2023 •

edited

Loading