
how to restore minhashes from MinHashLSH with redis backend? #51

Closed
bung87 opened this issue May 31, 2018 · 8 comments

Comments

bung87 commented May 31, 2018

I want to find duplicates, so I need to pop a MinHash and then compare it against the remaining hashes.

Should I make a copy of the original data in Redis? Or should I skip the Redis storage, wait until the whole dataset has been fed in, and then use MinHashLSH?

ae-foster (Contributor)

Hi @bung87 do you think you could give a bit more detail about your question? What's your exact use case?


bung87 commented Jun 5, 2018

The question is: there are two kinds of data here:

articles, which may be stored in Redis
the LSH index (create_lsh), which may use the Redis backend

Should I store both the articles and the LSH's MinHashes in Redis? In production I also need to store a key for each article.

Here is what I am trying to do:

from collections import defaultdict

def duplicates(content, debug=False):
    if isinstance(content, list):
        articles = content
    else:
        articles = split_articles(content)
    minhashs = create_minhashs(articles, debug=debug)
    result = defaultdict(list)

    matched = set()
    lsh = create_lsh(minhashs, threshold=0.75, num_perm=128, debug=debug)
    while minhashs:
        # popping from the end, so the popped item's original index is len - 1
        cur_index = len(minhashs) - 1
        cur = minhashs.pop()
        if not cur or cur_index in matched:
            continue
        for idx in lsh.query(cur):
            if idx != cur_index and idx not in matched and idx not in result:
                result[cur_index].append(idx)
                matched.add(idx)
    return result

ae-foster (Contributor)

@bung87 I'm still a bit unsure what your objective is. In general, though, don't use the Redis backend unless you need to. If you can solve your problem within a single Python process, that will be the way to go!


bung87 commented Jun 8, 2018

Hmm... actually I just want to keep **only one copy of the data (keys and hashes)**. Since datasketch only provides a query interface for finding duplicates, I first have to build a MinHash, which needs the text as input. So I have to feed the LSH with MinHashes and also store the text dataset for every query.

What I need is what this project does; it keeps just one copy of the data:
https://github.com/mattilyra/LSH/blob/master/examples/Introduction.ipynb

In production that may not be a problem (in my case). I just want to make sure there is a clear way to achieve this.

thank you!


duhaime commented Jun 8, 2018

@bung87 I for one am still not clear what you wish to accomplish. What makes you think multiple copies of data are being stored? What data exactly is being stored multiple times and where is it being stored?


bung87 commented Jun 8, 2018

I think finding duplicates should only require keys and MinHashes, but right now I also have to store the text, because creating the MinHash that the query function requires needs it.
With Redis as the backend, the index stores the hashes (one copy of the data), while each query needs a MinHash, which in turn needs the text (a second copy).
The flow I have in mind is: lsh -> insert minhashes -> find_duplicates, with no separate query step.


duhaime commented Jun 8, 2018

If you want to find all matches, you can do something like what you're saying (if I understand you).

Create an empty dictionary that will map hashbands (sequences of minhashes) to input identifiers. For each input object, store a unique id and generate the minhashes. Pass a sliding window over the minhashes and combine each sequence of n minhash values into a hashband (this concatenates several minhashes into one object). Then, for each hashband you get, add your input object's id to the list of ids in which that hashband occurs.

This gives you a mapping from hashbands to lists of ids, where ids that occur in the same list have the same hashband (ie are estimated to have some jaccard similarity). I'm doing exactly this with billions of hashbands on some supercompute clusters and it works quite nicely.

Perhaps this is not what you are trying to articulate though...
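The procedure described above can be sketched as follows (a minimal illustration: the minhash values are hard-coded toy numbers rather than computed from real documents, and the windows here are non-overlapping bands):

```python
from collections import defaultdict

def hashbands(minhashes, band_size=4):
    """Join each window of `band_size` minhash values into one hashband string."""
    for i in range(0, len(minhashes) - band_size + 1, band_size):
        yield ".".join(str(v) for v in minhashes[i:i + band_size])

# Hypothetical pre-computed minhash values for three documents.
doc_minhashes = {
    "doc1": [3, 7, 1, 9, 4, 4, 2, 8],
    "doc2": [3, 7, 1, 9, 5, 6, 2, 8],   # shares its first band with doc1
    "doc3": [9, 8, 7, 6, 5, 4, 3, 2],
}

# Map each hashband to the list of document ids that produced it.
band_to_ids = defaultdict(list)
for doc_id, mh in doc_minhashes.items():
    for band in hashbands(mh, band_size=4):
        band_to_ids[band].append(doc_id)

# Ids sharing a hashband are candidate near-duplicates.
candidates = [ids for ids in band_to_ids.values() if len(ids) > 1]
# candidates == [["doc1", "doc2"]]
```

Because the dictionary only holds hashbands and ids, no copy of the original text is required once the minhashes are computed.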


bung87 commented Jun 8, 2018

Thanks, I do exactly what you describe. I put my code above and it works; I opened this issue just to make sure I am not missing a simpler, cleaner way.

bung87 closed this as completed Jun 9, 2018