-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug?] Index retrieval is not self-consistent. #176
Comments
What is fname ?
What do you call self-consistent ?
Approximate knn index do not have a top1 accuracy of 100%. If you need
that, use an exact index instead. With only 1M points it can make sense if
you don't need too much speed.
If you do need speed, then you'll have to live with less than 100% of
accuracy and indeed you can tweak the option to pick something that works
for you (read the readme for this)
…On Tue, Oct 24, 2023 at 12:44 PM ysig ***@***.***> wrote:
Hi,
I had very bad scores using the autofaiss build_index method and except
that the bug could have been caused from the fact that when you
build_index you don't save the fname that corresponds to each index ... I
also noticed something utterly strange. In particular, the index is not
self-consistent. That means if I use the same set of embeddings that I used
to build an index to do retrieval I don't always get the same point (yet
the reconstruction error is zero when building the index).
import numpy as npfrom autofaiss import build_index
data_path = 'images'features_dir = 'features'
files = [f for f in os.listdir(data_path)]full = [join(path, f) for f in os.listdir(data_path)]index = build_index(embeddings=np.concatenate([np.load(f) for f in full], axis=0), nb_cores=12, save_on_disk=False)[0]
how_many = 0for f in os.listdir(features_dir):
query_vector = np.load(join(features_dir, f))
neighbors = []
distances, indices = index.search(query_vector, 1)
distances, indices = np.squeeze(distances), np.squeeze(indices)
how_many += int(files[indices[0]] != f)
print(which_many)
Which outputs for my data around 178 (out of 1000000 points).
Is this a bug - or do I need different input parameters to make this work
properly?
I use faiss==1.7.4 and autofaiss==2.15.8.
Thank you,
ysig
—
Reply to this email directly, view it on GitHub
<#176>, or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAR437QJNKRVC47NJDDX2B3YA6LZDAVCNFSM6AAAAAA6NPQR66VHI2DSMVQWIX3LMV43ASLTON2WKOZRHE2TQOJYGEZTKNQ>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
Hi @rom1504 and thanks for your answer. fname is file name. Well for nearest neighbors I don't think it makes sense to not have 100% accuracy otherwise I'm not sure it make sense to use it? I need reasonable speed, I have 5M data points. What should I use? |
there is no need to save file names. The files are read in alphabetical order "Well for nearest neighbors I don't think it makes sense to not have 100% accuracy" it does make a lot of sense. Almost no use case requires 100% accuracy. Usually a lower accuracy for the knn part does not even impact the overall quality of the system. What is reasonable speed? query speed of 1min 1s 1ms ? if you want perfect knn accuracy (note that says nothing on your system quality) you can use a faiss flat index I encourage you to read some more on approximate knn first, there are some links in the readme |
@rom1504 Thanks again for your detailed answer!
Now my purpose/scope it to measure embedding knn accuracy for a baseline on a certain task. Flat index will be very slow however: I have around 500,000 test points and around 5M train points split in 5 indexes so given that my budget then is Could you propose something? Thank you, |
Hi,
I had very bad scores using the autofaiss
build_index
method and except that the bug could have been caused from the fact that when youbuild_index
you don't save the fname that corresponds to each index ... I also noticed something utterly strange. In particular, the index is not self-consistent. That means if I use the same set of embeddings that I used to build an index to do retrieval I don't always get the same point (yet the reconstruction error is zero when building the index).Which outputs for my data around
178
(out of 1000000 points).Is this a bug - or do I need different input parameters to make this work properly?
I use
faiss==1.7.4
andautofaiss==2.15.8
.Thank you,
ysig
The text was updated successfully, but these errors were encountered: