Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong searching result #784

Closed
lucasjinreal opened this issue Apr 11, 2019 · 3 comments
Closed

Wrong searching result #784

lucasjinreal opened this issue Apr 11, 2019 · 3 comments
Labels

Comments

@lucasjinreal
Copy link

I test with a tiny example, search the closest 5 records from 10000 database of 3 to query records, the codes like this:

#include "faiss/IndexBinaryFlat.h"
#include "faiss/IndexFlat.h"
#include <cstdio>
#include <cstdlib>
#include <iostream>

using namespace std;

int main()
{

    // define serverl high dimension
    // search 10 records from 100000 database
    int n_dim = 64;
    float *xb = new float[10000 * n_dim];
    // we just find 3 vectors cloest
    float *xq = new float[3 * n_dim];

    for (int i = 0; i < 10000; i++)
    {
        for (int j = 0; j < n_dim; j++)
        {
            xb[n_dim * i + j] = drand48();
        }
        xb[n_dim * i] += i / 1000.;
    }

    // Just for logging first 4 lines
    for (int i = 0; i < 4; i++)
    {
        for (int j = 0; j < n_dim; j++)
        {
            cout << xb[n_dim * i + j] << " ";
        }
        cout << endl;
    }

    for (int i = 0; i < 3; i++)
    {
        for (int j = 0; j < n_dim; j++)
        {
            xq[n_dim * i + j] = drand48();
        }
        xq[n_dim * i] += i / 1000.;
    }

    cout << "\n\n to query data: \n";
    for (int i = 0; i < 3; i++)
    {
        for (int j = 0; j < n_dim; j++)
        {
            cout << xq[n_dim * i + j] << " ";
        }
        cout << endl;
    }
    cout << "data init done.\n";

    faiss::IndexFlatL2 index(n_dim);
    // create a Index object with 128 dimensions
    index.add(10000, xb);

    cout << "ntotal: " << index.ntotal << endl;

    const int k = 5;
    // I contains 3 to query, get frist 5
    long *I = new long[3 * k];
    float *D = new float[3 * k];
    // we will search frist 5 most closest vector of provide 3 xq vectors
    index.search(3, xq, k, D, I);

    // log out result
    cout << "I: \n";
    for (int i = 0; i < 3; i++)
    {
        for (int j = 0; j < k; j++)
        {
            // cout << I[i * k + j] << " ";
            printf("%5ld ", I[i * k + j]);
        }
        cout << endl;
    }

    cout << "Data: \n";
    for (int i = 0; i < 3; i++)
    {
        cout << to_string(i) << ": \n";
        /* code */
        for (int j = 0; j < k; j++)
        {
            /* code */
            cout << "find at " << I[i*k+j] << " row\n";
            for(int r = 0; r < n_dim; r++)
            {
                /* code */
                cout << xb[I[i*k+j] + r] << " ";
            }
            cout << endl;
            
        }
        cout << endl;
    }
}

It gives me result, 3 to query records all got top 5 searching result, but when I calculate their distance between to query and result, they are not quite well sorted:

3.0978563097278187
3.093108952977928
3.2624702213728587
3.473762431224948
2.7602607208508108


Abviously, the distance are not the closest, the distance I calculate is simply norm distance (L2 norm which is Euclidian distance).

What could be the problem?

@lucasjinreal
Copy link
Author

lucasjinreal commented Apr 11, 2019

Why the distance caluclated in faiss not same with L2 calculated in python?

to query data:
0.485543 0.0627164 0.124517 0.374388 0.923875 0.848847 0.760384 0.183709 0.37061 0.545615 0.481537 0.575421 0.280282 0.748553 0.491535 0.336497 0.847062 0.327912 0.714115 0.856108 0.535879 0.458547 0.932135 0.217623 0.00414496 0.699389 0.606141 0.742162 0.660105 0.566523 0.517314 0.872063 0.197572 0.587617 0.192859 0.175106 0.762044 0.230964 0.173476 0.0461282 0.280734 0.019257 0.858785 0.0704261 0.4729 0.356399 0.531472 0.446937 0.143957 0.295788 0.407074 0.818588 0.257566 0.482012 0.204077 0.0575501 0.627404 0.516504 0.877116 0.255546 0.731134 0.884103 0.231197 0.390369 

in database searched result data:
0.2158 0.0498956 0.209561 0.574563 0.91482 0.41561 0.654159 0.355701 0.586702 0.391946 0.0338502 0.285516 0.091604 0.444385 0.925937 0.921284 0.605928 0.861146 0.0147172 0.686701 0.844568 0.129647 0.059656 0.0130721 0.391159 0.865676 0.0150574 0.924289 0.886719 0.662069 0.654911 0.0803399 0.0929771 0.00325213 0.52667 0.789756 0.654019 0.545893 0.798389 0.726753 0.253315 0.255246 0.138354 0.482949 0.467587 0.100876 0.504214 0.858395 0.0100477 0.82157 0.964318 0.0194594 0.929619 0.0630804 0.0379622 0.190257 0.276666 0.447314 0.649382 0.151123 0.487155 0.649658 0.53428 0.356802 
distance: 6.27085

the distance will never be 6.2 since they are all less than 1...

@lucasjinreal
Copy link
Author

Even using sum of square, the result still not the same

@mdouze
Copy link
Contributor

mdouze commented Apr 11, 2019

You messed up the array indexing. When you display the result vector it should be

xb[I[i*k+j] * n_dim + r]

@mdouze mdouze added the invalid label Apr 11, 2019
@beauby beauby closed this as completed Apr 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants