How to restore MinHashes from a MinHashLSH with the Redis backend? #51
Comments
Hi @bung87, could you give a bit more detail about your question? What's your exact use case? |
The question is: my articles may be stored in Redis. Should I store both the articles and the LSH's MinHashes in Redis? In production I also need to store a key for each article. What I'm trying to do is below:

from collections import defaultdict

def duplicates(content, debug=False):
    # accept either a list of articles or raw text to be split
    if isinstance(content, list):
        articles = content
    else:
        articles = split_articles(content)
    minhashs = create_minhashs(articles, debug=debug)
    result = defaultdict(list)
    matched = []
    lsh = create_lsh(minhashs, threshold=0.75, num_perm=128, debug=debug)
    # pop each MinHash and query it against those indexed in the LSH
    while minhashs:
        cur_index = len(minhashs) - 1
        cur = minhashs.pop()
        if not cur or cur_index in matched:
            continue
        r = lsh.query(cur)
        if r:
            for idx in r:
                if idx != cur_index and idx not in matched and idx not in result:
                    result[cur_index].append(idx)
                    matched.append(idx)
    return result |
@bung87 I'm still a bit unsure what your objective is. In general though, don't use the Redis backend unless you need to. If you can solve your problem within a single Python process, that will be the way to go! |
Hmm... actually I just want to keep **only one copy of the data (keys and hashes)**. Since datasketch only provides a query interface, to find duplicates I must first build a MinHash, which needs the text as input. So in this case I must feed the LSH with MinHashes and also store the text dataset for each query. What I need is what this project does, and it keeps just one copy of the data. In production that may not be a problem (in my case). I just want to make sure whether there is a clean way to achieve this. Thank you! |
@bung87 I for one am still not clear what you wish to accomplish. What makes you think multiple copies of data are being stored? What data exactly is being stored multiple times and where is it being stored? |
I think it only needs the keys and MinHashes to find duplicates, but for now I also need to store the text to create a MinHash, since the query function requires one. |
If you want to find all matches, you can do something like what you're saying (if I understand you). Create an empty dictionary that will map hashbands (sequences of minhashes) to input identifiers. For each input object, store a unique id and generate the minhashes. Pass a sliding window over the minhashes and combine each sequence of n minhash values into a hashband (this concatenates several minhashes into one object). Then for each hashband you get, add your input object's id to the list of ids in which that hashband occurs. This gives you a mapping from hashbands to lists of ids, where ids that occur in the same list share a hashband (i.e. are estimated to have some Jaccard similarity). I'm doing exactly this with billions of hashbands on some supercomputing clusters and it works quite nicely. Perhaps this is not what you are trying to articulate though... |
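The hashband mapping described above can be sketched in pure Python; the ids and minhash values here are toy data standing in for real MinHash outputs:

```python
from collections import defaultdict

def hashbands(minhashes, n=3):
    """Yield each window of n consecutive minhash values as a tuple."""
    for i in range(len(minhashes) - n + 1):
        yield tuple(minhashes[i:i + n])

# minhash sequences per object id (toy values for illustration)
objects = {
    "doc1": [1, 2, 3, 4, 5],
    "doc2": [1, 2, 3, 9, 9],   # shares a prefix with doc1
    "doc3": [7, 8, 9, 10, 11],
}

# map each hashband to the ids it occurs in
band_to_ids = defaultdict(list)
for obj_id, mh in objects.items():
    for band in hashbands(mh):
        band_to_ids[band].append(obj_id)

# ids appearing together in any list are candidate duplicates:
# here doc1 and doc2 both produce the band (1, 2, 3)
```

Because only the dictionary of hashbands and ids is retained, the original texts are not needed once the minhashes are generated.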
Thanks, I do exactly what you say; I put my code above and it works. I opened this issue just to make sure I am not missing a simpler, cleaner way. |
I want to find duplicates, so I pop a hash and then compare it against the remaining hashes.
Or should I make a copy of the original data in Redis?
Or should I just not use the Redis storage, wait until the dataset is fully fed in, and then use MinHashLSH?