# Minhash LSH in Action

Now that we've covered the main points of the algorithm, we can see it in action.
We will be using the implementation provided by the [datasketch](http://ekzhu.com/datasketch/lsh.html) package on the [Comics Goodreads Dataset](https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home?authuser=1#h.p_evDuwuTozQVZ).

We will model each comic as a *bag of words* built from its title and description, and use Minhash LSH to find near-duplicate entries.



First of all, let's read in the data:

In [1]:
import pandas as pd
data = pd.read_json("/data/comics/goodreads_books_comics_graphic.json", lines=True)[['title', 'description']]
data.head()

Unnamed: 0,title,description
0,The Switchblade Mamma,Lillian Ann Cross is forced to live the worst ...
1,Cruelle,"Florence Dupre Latour raconte comment, de son ..."
2,Captain America: Winter Soldier (The Ultimate ...,The questions plaguing Captain America's dream...
3,Bounty Hunter 4/3: My Life in Combat from Mari...,The fight for Jason Delgado's life and soul be...
4,"Superman Archives, Vol. 2",These are the stories that catapulted Superman...


Then, let's concatenate the title and description to get a single `text` field.

In [2]:
data['text'] = data.title.str.cat(data.description, sep=' \n ')
data['text'].head()

0    The Switchblade Mamma \n Lillian Ann Cross is ...
1    Cruelle \n Florence Dupre Latour raconte comme...
2    Captain America: Winter Soldier (The Ultimate ...
3    Bounty Hunter 4/3: My Life in Combat from Mari...
4    Superman Archives, Vol. 2 \n These are the sto...
Name: text, dtype: object

## How do we build minhashes from text?
For this simple exercise, we'll just tokenize the text and make the bag-of-words assumption: each document is mapped to the set of its tokens, ignoring word order and word counts.
So we have two steps:
1. Tokenize the text. I like the [fastai](https://github.com/fastai/fastai) tokenizer.
2. Build a minhash from the tokens.

In [3]:
from fastai.text.data import Tokenizer
from datasketch import MinHash, LeanMinHash
from ipywidgets import interact

tokenizer = Tokenizer()
tok_func = tokenizer.tok_func('en')

def tokenize(s:str):
    """
    Tokenizes a string and returns a list of strings
    """
    return tokenizer.process_text(s.lower(), tok_func)

def get_mh(shingles:list) -> LeanMinHash:
    """
    Builds a minhash signature from the shingles
    :param shingles - list of strings
    :return the minhash
    """
    mh = MinHash()
    for s in shingles:
        mh.update(s.encode('utf8'))
    mh = LeanMinHash(mh)
    return mh

@interact(text="hello world")
def interact_tokenize(text):
    tokens = tokenize(text)
    
    mh = get_mh(tokens)
    display(f"Tokens: {tokens}")
    display(f"Minhash: {mh.digest()}")

interactive(children=(Text(value='hello world', description='text'), Output()), _dom_classes=('widget-interact…
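
To make the signature returned by `get_mh` less of a black box, here is a minimal from-scratch sketch of the idea, using only the standard library. It assumes salted BLAKE2b hashes as stand-ins for random permutations; `datasketch` uses its own hashing scheme, so these values will not match `mh.digest()`. The names `minhash_signature` and `estimate_jaccard` are made up for this illustration.

```python
import hashlib


def minhash_signature(tokens, num_hashes=128):
    """Return a list of `num_hashes` minimum hash values over the token set."""
    token_set = set(tokens)
    if not token_set:
        raise ValueError("need at least one token")
    signature = []
    for i in range(num_hashes):
        # A different salt per position acts as a different hash function.
        salt = i.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(tok.encode("utf8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for tok in token_set
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minimums agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


sig_kitty = minhash_signature(["hello", "kitty"])
sig_world = minhash_signature(["hello", "world"])
# An estimate of the exact Jaccard similarity, which is 1/3 for these two sets.
print(estimate_jaccard(sig_kitty, sig_world))
```

Two sets agree at a given signature position exactly when the token with the smallest hash in their union belongs to both sets, which is why the fraction of matching positions estimates the Jaccard similarity.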

In [8]:
mh = get_mh(["hello", "kitty"])
mh2 = get_mh(["hello", "world"])

In [9]:
mh.jaccard(mh2)

0.3203125
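
The estimate above equals 41/128, which is consistent with the 128 permutations that `datasketch` uses by default (`num_perm=128`). As a sanity check, the sketch below computes the exact Jaccard similarity of the two token lists and the standard error one would expect from an estimator that averages 128 independent match indicators. The helper name `exact_jaccard` is made up for this example.

```python
import math


def exact_jaccard(a, b):
    """Exact Jaccard similarity of two token collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


exact = exact_jaccard(["hello", "kitty"], ["hello", "world"])
num_perm = 128  # datasketch's default number of permutations
# Each signature position matches with probability `exact`, so the estimate
# behaves like the mean of `num_perm` Bernoulli trials.
std_err = math.sqrt(exact * (1 - exact) / num_perm)

print(f"exact Jaccard: {exact:.4f}")
print(f"expected standard error: {std_err:.4f}")
print(f"minhash estimate from the cell above: {41 / 128}")
```

The minhash estimate differs from the exact value by well under one standard error.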

## Building the LSH index
Now that we can build minhashes for the documents, let's build up the MinhashLSH index.

The `datasketch` package lets us simply specify a similarity threshold for near duplicates, and it then picks settings for `b` and `r` that balance false positives against false negatives. Of course, this behaviour can be overridden.
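
The role of `b` (number of bands) and `r` (rows per band) can be made concrete with the standard LSH banding formula: a pair with Jaccard similarity `s` becomes a candidate with probability `1 - (1 - s**r)**b`. The sketch below evaluates that curve. The values `b=32, r=4` are hypothetical, chosen only so that `b * r = 128`; they are not necessarily what `datasketch` selects for `threshold=0.5`.

```python
def candidate_probability(s, b, r):
    """Probability that two sets with Jaccard similarity `s` collide in at least one band."""
    return 1.0 - (1.0 - s ** r) ** b


b, r = 32, 4  # hypothetical banding: 32 bands of 4 rows = 128 signature values
for s in (0.2, 0.4, 0.5, 0.6, 0.8):
    print(f"s={s:.1f} -> P(candidate)={candidate_probability(s, b, r):.3f}")

# Rule-of-thumb similarity around which the curve switches from low to high.
approx_threshold = (1 / b) ** (1 / r)
print(f"approximate threshold: {approx_threshold:.3f}")
```

Increasing `r` makes each band harder to match and pushes the threshold up, while increasing `b` gives a pair more chances to collide and pulls it down.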

In [10]:
from datasketch import MinHashLSH
from fastprogress import progress_bar

lsh = MinHashLSH(threshold=0.5)
minhashes = []
for idx, row in progress_bar(data.iterrows(), total=len(data)):
    text = row['text']
    shingles = tokenize(text)
    mh = get_mh(shingles)
    minhashes.append(mh)
    lsh.insert(idx, mh)

In [11]:
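from IPython.display import HTML, display

def jac_sim(s1, s2):
    """Exact Jaccard similarity of two sets."""
    return len(s1.intersection(s2))/len(s1.union(s2))

def find_similar_by_id(idx):
    """Returns the near duplicates of row `idx`, excluding the row itself."""
    # Reuse the exact text that was indexed so the query minhash matches it
    query = data.iloc[idx]['text']
    candidates = find_similar(query)
    # The row itself is normally among the candidates; ignore it if absent
    return candidates.drop(index=data.index[idx], errors='ignore')

def find_similar(query):
    """Returns the LSH candidates for `query`, sorted by estimated similarity."""
    mh = get_mh(tokenize(query))
    candidates = lsh.query(mh)  # keys are the index labels passed to lsh.insert
    candidate_mh = [minhashes[idx] for idx in candidates]
    candidate_sims = [mh.jaccard(other_mh) for other_mh in candidate_mh]
    res = data.loc[candidates].copy()
    res['similarity'] = candidate_sims
    return res.sort_values("similarity", ascending=False)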

def show(idx, cand):
    display(f"{idx}, Similarity: {cand['similarity']}")
    display(f"{cand['text']}")
    display(HTML('<hr>'))

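# Indices worth trying in the slider: 66 (near dups), 70, 81, 50988

@interact(idx=(0, len(data) - 1, 1))  # valid positions run from 0 to len(data) - 1
def interact_query(idx):
    cur = data.iloc[idx].copy()  # copy so the extra field does not touch `data`
    cur['similarity'] = 1.0
    show("Original", cur)
    candidates = find_similar_by_id(idx)
    for cand_idx, cand in candidates.iterrows():
        show(cand_idx, cand)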

interactive(children=(IntSlider(value=44706, description='idx', max=89411, min=1), Output()), _dom_classes=('w…
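
`jac_sim` is defined above but never called. One way it could be used, sketched here on toy data, is to re-check LSH candidates against the exact Jaccard similarity of their token sets, since the index only returns probable near duplicates. The helper `verify_candidates` and the toy token sets are hypothetical, and this version of `jac_sim` adds a guard for two empty sets.

```python
def jac_sim(s1, s2):
    """Exact Jaccard similarity of two sets (1.0 when both are empty)."""
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 1.0


def verify_candidates(query_tokens, candidate_tokens, threshold=0.5):
    """Keep candidates whose exact similarity to the query reaches `threshold`.

    `candidate_tokens` maps a document id to its token set.
    Returns (id, similarity) pairs, most similar first.
    """
    query = set(query_tokens)
    scored = [(doc_id, jac_sim(query, set(toks)))
              for doc_id, toks in candidate_tokens.items()]
    kept = [(doc_id, sim) for doc_id, sim in scored if sim >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for tokenized comics; the ids and tokens are made up.
toy_candidates = {
    "a": {"superman", "archives", "vol", "2"},
    "b": {"superman", "archives", "vol", "3"},
    "c": {"captain", "america", "winter", "soldier"},
}
result = verify_candidates({"superman", "archives", "vol", "2"}, toy_candidates)
print(result)
```

In the notebook, the token sets would come from `tokenize` applied to the query and to each candidate's `text` field.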

In [None]:
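import itertools
# Brute-force baseline: estimate the similarity of every pair of documents.
n_pairs = len(minhashes) * (len(minhashes) - 1) // 2  # integer pair count
for mh, mh2 in progress_bar(itertools.combinations(minhashes, 2), total=n_pairs):
    mh.jaccard(mh2)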


In [12]:
# LSH alternative: compare each document only against its candidate near duplicates.
for mh in progress_bar(minhashes):
    neardups = lsh.query(mh)
    for nd in neardups:
        mh.jaccard(minhashes[nd])
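
The two loops above differ mainly in how many signature comparisons they perform. Here is a quick back-of-the-envelope count, assuming the 89,411 rows suggested by the slider range shown earlier. The average candidate count `avg_candidates` is a made-up placeholder, not a measured value.

```python
import math

n_docs = 89_411  # assumed from the slider range shown earlier (max=89411)
brute_force_pairs = math.comb(n_docs, 2)  # every unordered pair exactly once

avg_candidates = 5  # hypothetical average number of LSH candidates per query
lsh_comparisons = n_docs * avg_candidates

print(f"brute force: {brute_force_pairs:,} comparisons")
print(f"LSH with {avg_candidates} candidates per document: {lsh_comparisons:,} comparisons")
```

The brute-force count grows quadratically with the number of documents (about 4 billion pairs here), whereas the LSH loop grows with the number of documents times the typical candidate list size.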