
how to restore minhashes from MinHashLSH with redis backend? #51

Closed
bung87 opened this issue May 31, 2018 · 8 comments

Comments

bung87 commented May 31, 2018

I want to find duplicates, so I need to pop a MinHash and then compare it against the remaining hashes.

Should I make a copy of the original data in Redis? Or should I skip the Redis storage, wait until the whole dataset has been fed in, and then use MinHashLSH?

ae-foster (Contributor)

Hi @bung87 do you think you could give a bit more detail about your question? What's your exact use case?


bung87 commented Jun 5, 2018

The question is: there are two kinds of data here:

articles, which may be stored in Redis
the LSH index (create_lsh), which may use the Redis backend

Should I store both the articles and the LSH's MinHashes in Redis? In production I also need to store a key for each article.

Here is what I am trying to do:

from collections import defaultdict

def duplicates(content, debug=False):
    if isinstance(content, list):
        articles = content
    else:
        articles = split_articles(content)
    minhashs = create_minhashs(articles, debug=debug)
    result = defaultdict(list)

    matched = set()
    lsh = create_lsh(minhashs, threshold=0.75, num_perm=128, debug=debug)
    while minhashs:
        # popping from the end, so the popped item's original index is len - 1
        cur_index = len(minhashs) - 1
        cur = minhashs.pop()
        if not cur or cur_index in matched:
            continue
        for idx in lsh.query(cur):
            if idx != cur_index and idx not in matched and idx not in result:
                result[cur_index].append(idx)
                matched.add(idx)
    return result

ae-foster (Contributor)

@bung87 I'm still a bit unsure what your objective is. In general, though, don't use the Redis backend unless you need to. If you can solve your problem within a single Python process, that will be the way to go!


bung87 commented Jun 8, 2018

Hmm... actually I just want to keep **only one copy of the data (keys and hashes)**. Since datasketch only provides a query interface for finding duplicates, I first have to build a MinHash, which needs the text as input. So I have to feed the LSH with MinHashes and also store the text dataset for every query.

What I need is what this project does; it keeps just one copy of the data:
https://github.com/mattilyra/LSH/blob/master/examples/Introduction.ipynb

In production that may not be a problem (in my case). I just want to make sure there is a clear way to achieve this.

thank you!


duhaime commented Jun 8, 2018

@bung87 I for one am still not clear what you wish to accomplish. What makes you think multiple copies of data are being stored? What data exactly is being stored multiple times and where is it being stored?


bung87 commented Jun 8, 2018

I think finding duplicates should only require keys and MinHashes, but right now I also have to store the text, because creating the MinHash that the query function requires needs it.
With Redis as the backend, the index stores the hashes (one copy of the data), while each query needs a MinHash, which in turn needs the text (a second copy).
The flow I have in mind is: lsh -> insert minhashes -> find_duplicates, with no separate query step.


duhaime commented Jun 8, 2018

If you want to find all matches, you can do something like what you're saying (if I understand you).

Create an empty dictionary that will map hashbands (sequences of minhashes) to input identifiers. For each input object, store a unique id and generate the minhashes. Pass a sliding window over the minhashes and combine each sequence of n minhash values into a hashband (this concatenates several minhashes into one object). Then, for each hashband you get, add your input object's id to the list of ids in which that hashband occurs.

This gives you a mapping from hashbands to lists of ids, where ids that occur in the same list have the same hashband (ie are estimated to have some jaccard similarity). I'm doing exactly this with billions of hashbands on some supercompute clusters and it works quite nicely.

Perhaps this is not what you are trying to articulate though...
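The procedure described above can be sketched as follows (a minimal illustration: the minhash values are hard-coded toy numbers rather than computed from real documents, and the windows here are non-overlapping bands):

```python
from collections import defaultdict

def hashbands(minhashes, band_size=4):
    """Join each window of `band_size` minhash values into one hashband string."""
    for i in range(0, len(minhashes) - band_size + 1, band_size):
        yield ".".join(str(v) for v in minhashes[i:i + band_size])

# Hypothetical pre-computed minhash values for three documents.
doc_minhashes = {
    "doc1": [3, 7, 1, 9, 4, 4, 2, 8],
    "doc2": [3, 7, 1, 9, 5, 6, 2, 8],   # shares its first band with doc1
    "doc3": [9, 8, 7, 6, 5, 4, 3, 2],
}

# Map each hashband to the list of document ids that produced it.
band_to_ids = defaultdict(list)
for doc_id, mh in doc_minhashes.items():
    for band in hashbands(mh, band_size=4):
        band_to_ids[band].append(doc_id)

# Ids sharing a hashband are candidate near-duplicates.
candidates = [ids for ids in band_to_ids.values() if len(ids) > 1]
# candidates == [["doc1", "doc2"]]
```

Because the dictionary only holds hashbands and ids, no copy of the original text is required once the minhashes are computed.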


bung87 commented Jun 8, 2018

Thanks, I do exactly what you describe. I put my code above and it works; I opened this issue just to make sure I am not missing a simpler, cleaner way.

bung87 closed this as completed Jun 9, 2018