
MultiProbe for L2 Similarity #73

Closed
alexklibisz opened this issue Jun 13, 2020 · 3 comments
Labels
algorithms Adding and improving ANN algorithms

Comments

alexklibisz commented Jun 13, 2020

This is the original paper: https://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf
These lectures (16 and 17) should also be useful: https://www.youtube.com/watch?v=c5DHtx5VxX8
There was an unmerged PR in the main ES repo implementing multiprobe: https://github.com/elastic/elasticsearch/pull/44374/files

@alexklibisz alexklibisz added the enhancement New feature or request label Jun 13, 2020
@alexklibisz alexklibisz added algorithms Adding and improving ANN algorithms and removed enhancement New feature or request labels Jul 29, 2020
alexklibisz commented Jul 30, 2020

Some rough ideas on how to implement this:

Keep the same L2Lsh Mapping structure. This seems to be strictly a query-time optimization.

Add a parameter probes: Int to the L2Lsh query structure, controlling how many additional hashes of length k (k is called M in the original paper) are generated for each of the L tables.

The hashing function will generate the additional probe hashes right after it generates the standard hash.
Roughly:

for l in range(L):
  # generate the original hash and do some bookkeeping for below
  for p in range(probes):
    # generate an additional probe hash

Otherwise the query looks exactly the same. It takes the generated hashes and looks them up just like any other approximate query.

Implement the naive version first (enumerate, score, and sort all possible perturbations). Make sure you can get equivalent recall on SIFT with fewer tables and, ideally but not necessarily, shorter query time. Then go back and implement the optimized version, which precomputes perturbation sets, estimates their scores, etc. Make sure the optimized version matches the naive one.

@alexklibisz alexklibisz pinned this issue Jul 31, 2020
alexklibisz commented:
Implemented the naive version in #123.

As far as I can tell, the naivety is not a bottleneck in the current benchmarking configuration (k = 2, probes = 3).
In this configuration the overwhelming bottleneck is (understandably) still the countMatches method.
The only thing I could find related to the multiprobe implementation was taking 0.1% of the runtime on search threads.

Even with k = 3, probes = 27, any footprint from the hashWithProbes method is minimal compared to countMatches.

So for now there are likely more worthwhile things to implement than the optimized version.

alexklibisz commented:

🎆 🥳 🎈

Implemented Algorithm 1 from Lv et al. in #124. This generates the perturbation sets iteratively instead of enumerating them exhaustively and sorting all of them. I'd say this is good enough for now. There doesn't seem to be a need to implement the expected-scores optimization from section 4.5, and it's also not obvious to me why/how it works.
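For reference, the iterative generation (Algorithm 1 in the multi-probe LSH paper) can be sketched with a min-heap: start from the set containing only the cheapest perturbation, repeatedly pop the lowest-score set, emit it if valid, and push its "shift" and "expand" successors. A hedged Python sketch, not the actual #124 implementation; names are hypothetical:

```python
import heapq

def probe_sequence(scores, probes):
    """Generate perturbation sets in increasing score order without
    enumerating all 3^M of them (Algorithm 1, multi-probe LSH paper).

    scores: the 2M squared boundary distances, sorted ascending.
    Each set is a sorted tuple of 0-based indices into `scores`;
    index j and index 2M-1-j are the +1/-1 perturbations of the same
    hash component, so a valid set never contains both.
    """
    m2 = len(scores)  # 2M

    def score(s):
        return sum(scores[j] for j in s)

    def valid(s):
        return all((m2 - 1 - j) not in s for j in s)

    heap = [(score((0,)), (0,))]
    out = []
    while heap and len(out) < probes:
        _, s = heapq.heappop(heap)
        maxj = s[-1]
        if maxj + 1 < m2:
            # "shift": replace the max index with the next one.
            shifted = s[:-1] + (maxj + 1,)
            heapq.heappush(heap, (score(shifted), shifted))
            # "expand": add the next index to the set.
            expanded = s + (maxj + 1,)
            heapq.heappush(heap, (score(expanded), expanded))
        if valid(s):
            out.append(s)
    return out
```

The heap never holds more than a few times `probes` candidates, so generating the top sets is cheap compared with sorting all 3^M perturbations.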

@alexklibisz alexklibisz unpinned this issue Aug 7, 2020