
MultiProbe for L2 Similarity #73

Closed
alexklibisz opened this issue Jun 13, 2020 · 3 comments
Labels
algorithms Adding and improving ANN algorithms

Comments

alexklibisz commented Jun 13, 2020

This is the original paper: https://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf
These lectures (16 and 17) should also be useful: https://www.youtube.com/watch?v=c5DHtx5VxX8
There was an unmerged PR in the main ES repo implementing multiprobe: https://github.com/elastic/elasticsearch/pull/44374/files

@alexklibisz alexklibisz added the enhancement New feature or request label Jun 13, 2020
@alexklibisz alexklibisz added algorithms Adding and improving ANN algorithms and removed enhancement New feature or request labels Jul 29, 2020
alexklibisz commented Jul 30, 2020

Some rough ideas on how to implement this:

Keep the same L2Lsh Mapping structure. This seems to be strictly a query-time optimization.

Add a parameter probes: Int to the L2Lsh query structure, controlling how many additional hashes of length k (k is called M in the original paper) are generated for each of the L tables.

The hashing function will generate the additional probe hashes right after it generates the standard hash.
Roughly:

for l in range(L):
  # generate the original hash and do some bookkeeping for below
  for p in range(probes):
    # generate an additional probe hash

Otherwise the query looks exactly the same. It takes the generated hashes and looks them up just like any other approximate query.

Implement the naive version first (enumerate, score, and sort all possible perturbations). Make sure you can get equivalent recall on SIFT with fewer tables and, ideally but not necessarily, shorter query time. Then go back and implement the optimized version, which precomputes perturbation sets, estimates their scores, etc. Make sure the optimized version matches the naive one.

@alexklibisz alexklibisz pinned this issue Jul 31, 2020
alexklibisz commented:
Implemented the naive version in #123.

As far as I can tell, the naivety is not a bottleneck in the current benchmarking configuration (k = 2, probes = 3).
In this configuration the overwhelming bottleneck is (understandably) still the countMatches method.
The only thing I could find related to the multiprobe implementation was taking 0.1% of the runtime on search threads.

Even with k = 3, probes = 27, any footprint from the hashWithProbes method is minimal compared to countMatches.

So for now there are likely more worthwhile things to implement than the optimized version.

alexklibisz commented:

🎆 🥳 🎈

Implemented Algorithm 1 from Lv et al. in #124. This generates the perturbation sets iteratively instead of enumerating them exhaustively and sorting all of them. I'd say this is good enough for now. There doesn't seem to be a need to implement the expected-scores optimization from section 4.5, and it's also not obvious to me why/how it works.
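For reference, the iterative generation (Algorithm 1 in the multi-probe LSH paper) can be sketched with a min-heap: start from the set containing only the cheapest perturbation, repeatedly pop the lowest-score set, emit it if valid, and push its "shift" and "expand" successors. A hedged Python sketch, not the actual #124 implementation; names are hypothetical:

```python
import heapq

def probe_sequence(scores, probes):
    """Generate perturbation sets in increasing score order without
    enumerating all 3^M of them (Algorithm 1, multi-probe LSH paper).

    scores: the 2M squared boundary distances, sorted ascending.
    Each set is a sorted tuple of 0-based indices into `scores`;
    index j and index 2M-1-j are the +1/-1 perturbations of the same
    hash component, so a valid set never contains both.
    """
    m2 = len(scores)  # 2M

    def score(s):
        return sum(scores[j] for j in s)

    def valid(s):
        return all((m2 - 1 - j) not in s for j in s)

    heap = [(score((0,)), (0,))]
    out = []
    while heap and len(out) < probes:
        _, s = heapq.heappop(heap)
        maxj = s[-1]
        if maxj + 1 < m2:
            # "shift": replace the max index with the next one.
            shifted = s[:-1] + (maxj + 1,)
            heapq.heappush(heap, (score(shifted), shifted))
            # "expand": add the next index to the set.
            expanded = s + (maxj + 1,)
            heapq.heappush(heap, (score(expanded), expanded))
        if valid(s):
            out.append(s)
    return out
```

The heap never holds more than a few times `probes` candidates, so generating the top sets is cheap compared with sorting all 3^M perturbations.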

@alexklibisz alexklibisz unpinned this issue Aug 7, 2020