Distributed MinHashLSH #198

Open
ctrajan opened this issue Mar 10, 2023 · 3 comments

ctrajan commented Mar 10, 2023

Is there a distributed implementation of MinHashLSH? If not, is it possible to shard the base dataset across several machines?

For example, if my dataset has 10 billion items and can't fit in memory, can I shard it across 10 machines, with one MinHashLSH index per machine, each holding a different 1 billion items? When a query comes in, it would search all 10 machines and gather the results. Would the gathered result be the same as searching ONE MinHashLSH index that contains all 10 billion items?

Owner

ekzhu commented Mar 11, 2023

It is a good idea to partition the data across multiple machines and build separate indexes. In the example you gave, the result from 10 machines is going to be better than looking at only one machine, if you combine the results and rank them by estimated or exact Jaccard similarity.
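
The partition-then-merge approach described above can be sketched in plain Python. This is a toy illustration using only the standard library, not datasketch's implementation: `ShardIndex`, `query_all`, `signature`, and the `NUM_PERM` / `BANDS` values are all hypothetical names and settings chosen for the example. It assumes every shard uses the same hash seeds and the same banding parameters, so that a signature lands in the same buckets no matter which shard holds it.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 64              # signature length (assumed for this sketch)
BANDS = 16                 # number of LSH bands (assumed)
ROWS = NUM_PERM // BANDS   # rows per band


def _hash(seed, token):
    """Seeded 64-bit hash; every shard must use the same seeds."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def signature(tokens):
    """MinHash signature: the minimum hash of the token set under each seed."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(NUM_PERM))


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / NUM_PERM


class ShardIndex:
    """One banded LSH index, standing in for the index on one machine."""

    def __init__(self):
        self.buckets = defaultdict(set)
        self.signatures = {}

    def _band_keys(self, sig):
        return [(b, sig[b * ROWS:(b + 1) * ROWS]) for b in range(BANDS)]

    def insert(self, key, sig):
        self.signatures[key] = sig
        for band_key in self._band_keys(sig):
            self.buckets[band_key].add(key)

    def query(self, sig):
        """Return (key, estimated Jaccard) for every candidate in this shard."""
        candidates = set()
        for band_key in self._band_keys(sig):
            candidates |= self.buckets.get(band_key, set())
        return [(k, estimated_jaccard(sig, self.signatures[k])) for k in candidates]


def query_all(shards, sig, top_k=10):
    """Query every shard, merge the hits, and rank by estimated Jaccard."""
    merged = [hit for shard in shards for hit in shard.query(sig)]
    merged.sort(key=lambda hit: hit[1], reverse=True)
    return merged[:top_k]


# Made-up documents, split round-robin over two shards and also put in one index.
docs = {
    "doc-a": {"minhash", "lsh", "index", "shard", "query"},
    "doc-b": {"minhash", "lsh", "index", "shard", "merge"},
    "doc-c": {"redis", "cluster", "storage", "layer"},
    "doc-d": {"minhash", "lsh", "index", "query", "rank"},
}
shards = [ShardIndex() for _ in range(2)]
single = ShardIndex()
for i, (key, tokens) in enumerate(sorted(docs.items())):
    sig = signature(tokens)
    shards[i % len(shards)].insert(key, sig)
    single.insert(key, sig)

query_sig = signature(docs["doc-a"])
sharded_hits = query_all(shards, query_sig)
single_hits = query_all([single], query_sig)
print(sharded_hits)
```

In this toy version the merged candidate set from the shards matches the candidate set from the single index, because an item's buckets depend only on its own signature. Whether the same holds in a real deployment depends on all shards being built with identical parameters and hash seeds.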

@micimize
Contributor

@ekzhu would this be better than, say, using the redis storage layer connected to a redis cluster? Or is the latter just not a good idea at all?

Owner

ekzhu commented Mar 17, 2023 via email

ekzhu added the question label Mar 21, 2023