Distributed MinHashLSH #198

ctrajan · 2023-03-10T14:32:18Z

Is there a distributed implementation of MinHashLSH? If not, is it possible to shard the base dataset across several machines?

For example, suppose my dataset has 10 billion items and cannot fit in memory. Can I shard it across 10 machines, with one MinHashLSH index per machine, each holding a different 1 billion items? When a query comes in, it would search all 10 machines and gather the results. Would the gathered result be the same as searching ONE MinHashLSH index that contains all 10 billion items?

ekzhu · 2023-03-11T05:31:54Z

It is a good idea to partition the data across multiple machines and build separate indexes. In the example you gave, the combined result from the 10 machines is going to be better than looking at only one machine, provided you merge the results and rank them by estimated or exact Jaccard similarity.
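To make the shard-and-merge idea concrete, here is a minimal sketch using only the Python standard library. It is a toy banded MinHash LSH index, not datasketch's implementation: `ShardIndex`, `query_shards`, `NUM_PERM`, `BANDS`, and the round-robin placement are all hypothetical names and choices made for illustration. It assumes every shard uses the same hash functions and banding parameters. Under that assumption an item's buckets depend only on its own signature, so the union of the per-shard candidates matches what a single index holding all the items would return.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 64              # signature length (assumed value)
BANDS = 16                 # number of LSH bands (assumed value)
ROWS = NUM_PERM // BANDS   # rows per band


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(tokens):
    """MinHash signature: the minimum hash value under each seeded hash."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(NUM_PERM))


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_PERM


class ShardIndex:
    """Toy banded LSH index standing in for the index on one machine."""

    def __init__(self):
        self.buckets = defaultdict(set)  # (band number, band values) -> keys
        self.signatures = {}             # key -> full signature

    def insert(self, key, sig):
        self.signatures[key] = sig
        for b in range(BANDS):
            self.buckets[(b, sig[b * ROWS:(b + 1) * ROWS])].add(key)

    def query(self, sig):
        """Return (key, estimated Jaccard) for items sharing at least one band."""
        candidates = set()
        for b in range(BANDS):
            candidates |= self.buckets.get((b, sig[b * ROWS:(b + 1) * ROWS]), set())
        return [(k, estimate_jaccard(sig, self.signatures[k])) for k in candidates]


def query_shards(shards, sig, top_k=10):
    """Query every shard, merge the hits, and rank by estimated Jaccard."""
    merged = [hit for shard in shards for hit in shard.query(sig)]
    merged.sort(key=lambda hit: (-hit[1], hit[0]))
    return merged[:top_k]


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy cat",
    "c": "an entirely unrelated sentence about databases",
    "d": "the quick brown fox leaps over the lazy dog",
}
signatures = {key: minhash(set(text.split())) for key, text in docs.items()}

# One index holding everything, for comparison.
single = ShardIndex()
for key, sig in signatures.items():
    single.insert(key, sig)

# Two shards with round-robin placement (a hypothetical placement rule).
shards = [ShardIndex(), ShardIndex()]
for i, (key, sig) in enumerate(signatures.items()):
    shards[i % len(shards)].insert(key, sig)

query_sig = minhash(set("the quick brown fox jumps over the lazy dog".split()))
sharded_hits = query_shards(shards, query_sig)
single_hits = query_shards([single], query_sig)
print(sharded_hits)
```

In a real deployment each `ShardIndex` would live on its own machine and `query_shards` would fan the query out over the network. With datasketch, the analogous requirement would presumably be that every per-machine `MinHashLSH` is built with the same `threshold` and `num_perm`, and that all `MinHash` objects share the same seed.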

micimize · 2023-03-14T21:28:03Z

@ekzhu would this be better than, say, using the redis storage layer connected to a redis cluster? Or is the latter just not a good idea at all?

ekzhu · 2023-03-17T09:41:50Z

I have not tried Redis Cluster. I think this depends on how the index is sharded in Redis. But I guess handling the sharding in client-side code is probably easier.


ekzhu added the question label Mar 21, 2023
