Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug?] Index retrieval is not self-consistent. #176

Open
ysig opened this issue Oct 24, 2023 · 4 comments
Open

[Bug?] Index retrieval is not self-consistent. #176

ysig opened this issue Oct 24, 2023 · 4 comments

Comments

@ysig
Copy link

ysig commented Oct 24, 2023

Hi,

I had very bad scores using the autofaiss build_index method and except that the bug could have been caused from the fact that when you build_index you don't save the fname that corresponds to each index ... I also noticed something utterly strange. In particular, the index is not self-consistent. That means if I use the same set of embeddings that I used to build an index to do retrieval I don't always get the same point (yet the reconstruction error is zero when building the index).

import numpy as np
from autofaiss import build_index

data_path = 'images'
features_dir = 'features'

files = [f for f in os.listdir(data_path)]
full = [join(path, f) for f in os.listdir(data_path)]
index = build_index(embeddings=np.concatenate([np.load(f) for f in full], axis=0), nb_cores=12, save_on_disk=False)[0]

how_many = 0
for f in os.listdir(features_dir):
    query_vector = np.load(join(features_dir, f))
    distances, indices = index.search(query_vector, 1)
    distances, indices = np.squeeze(distances), np.squeeze(indices)
    how_many += int(files[indices[0]] != f)

print(which_many)

Which outputs for my data around 178 (out of 1000000 points).

Is this a bug - or do I need different input parameters to make this work properly?

I use faiss==1.7.4 and autofaiss==2.15.8.

Thank you,
ysig

@rom1504
Copy link
Contributor

rom1504 commented Oct 24, 2023 via email

@ysig
Copy link
Author

ysig commented Oct 24, 2023

Hi @rom1504 and thanks for your answer.

fname is file name.
self-consistent means top-1 accuracy = 100%

Well for nearest neighbors I don't think it makes sense to not have 100% accuracy otherwise I'm not sure it make sense to use it? I need reasonable speed, I have 5M data points. What should I use?

@rom1504
Copy link
Contributor

rom1504 commented Oct 24, 2023

there is no need to save file names. The files are read in alphabetical order

"Well for nearest neighbors I don't think it makes sense to not have 100% accuracy" it does make a lot of sense. Almost no use case requires 100% accuracy. Usually a lower accuracy for the knn part does not even impact the overall quality of the system.

What is reasonable speed? query speed of 1min 1s 1ms ?

if you want perfect knn accuracy (note that says nothing on your system quality) you can use a faiss flat index

I encourage you to read some more on approximate knn first, there are some links in the readme

@ysig
Copy link
Author

ysig commented Oct 24, 2023

@rom1504 Thanks again for your detailed answer!

There is no need to save file names. The files are read in alphabetical order:

  • imagine you work on a server than deletes a file randomly (happens with Jean Zay)
  • In general it's good practice to at least give the user the option to save - it allows for guaranteed consistency
  • index should be self contained - i.e. after computing the index you can transfer it without keeping the original embeds.

Now my purpose/scope it to measure embedding knn accuracy for a baseline on a certain task.
It's for a research paper and the result needs to be scientifically precise.

Flat index will be very slow however:

I have around 500,000 test points and around 5M train points split in 5 indexes so given that my budget then is B = 5* T * 500000 if I set my budget B to one day, i.e. B = 86400 seconds then T = 86400 / (5*500000) = 0.03456 s,
i.e. I need queries that cost <= 0.03456 + ε seconds.

Could you propose something?

Thank you,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants