# Minhash LSH for Set Similarity

The Minhash-LSH algorithm is very suitable for finding duplicates in an efficient manner. This notebook will outline the main concepts and considerations related to the algorithm.

>The method relies
on the notion of locality-sensitive hash functions. Intuitively, a hash function is locality-sensitive if
its probability of collision is higher for “nearby” points than for points that are “far apart”   
>-- <A href="https://arxiv.org/abs/1509.02897"><cite>Andoni et al, Practical and Optimal LSH for Angular Distance</cite></A>

# Imports

In [193]:
import pandas as pd
import numpy as np
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))

# Step 1: Representing sets via their Characteristic matrix

For our running examples, let's use the universe of the first 15 english letters: A-P.

The characteristic matrix of a set is a binary representation where the elements of the universe are enumerated and each set has a `0` for missing elements and a `1` for present elements. E.g. the set {A,B,D} would be 

<table> 
    <tr> <td> A </td> <td> 1 </tr>
    <tr> <td> B </td> <td> 1 </tr>
    <tr> <td> C </td> <td> 0 </tr>
    <tr> <td> D </td> <td> 1 </tr>
    <tr> <td> E </td> <td> 0 </tr>
    <tr> <td> F </td> <td> ... </tr>
</table>

In [24]:
UNIVERSE = list("ABCDEFGHIJKLMNOP")
UNIVERSE

['A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 'H',
 'I',
 'J',
 'K',
 'L',
 'M',
 'N',
 'O',
 'P']

In [25]:
def sample_set():
    """
    Samples a random set from the universe.
    """
    return pd.DataFrame(np.random.randint(0, 2, size=len(UNIVERSE)), index=UNIVERSE)

In [47]:
def pp_set(s):
    """
    Pretty prints sets - highlights the present elements
    """
    def highlight_ones(x):
        return f"background-color: {'yellow' if x==1 else 'white'}"
    display(s.style.applymap(highlight_ones))
s = sample_set()
pp_set(sample_set())

Unnamed: 0,0
A,0
B,1
C,1
D,0
E,0
F,1
G,1
H,0
I,0
J,0


# Step 2: Jaccard Similarity

The Jaccard Similarity of a pair of sets is defined as:

$$ jac\_sim(A,B) = \frac{|A\cap B|}{|A \cup B|}$$

Examples:

$$ jac\_sim(\{A,B,C\}, \{A,B,D\}) = \frac{1}{2}$$
$$ jac\_sim(\{A,C\}, \{A,B,D\}) = \frac{1}{4}$$
$$ jac\_sim(\{C\}, \{A,B,D\}) = \frac{0}{4}$$
$$ jac\_sim(\{A,B,C,D\}, \{A,B,C,E\}) = \frac{3}{5}$$

# Step 3: Minhash

The operation to get the minhash of a set:
1. Permute the enumeration of the Universe.
2. The minhash of a set is the first present element under the new enumeration.

Example:

Suppose *ABCDEFGHIJKLMNOP* is permuted to *HIDCABJEFKLMNOPG*.
Then, {ABD}
<table> 
    <tr> <td> A </td> <td> 1 </tr>
    <tr> <td> B </td> <td> 1 </tr>
    <tr> <td> C </td> <td> 0 </tr>
    <tr> <td> D </td> <td> 1 </tr>
    <tr> <td> E </td> <td> 0 </tr>
    <tr> <td> F </td> <td> ... </tr>
</table>
will become 
<table> 
    <tr> <td> H </td> <td> 0 </tr>
    <tr> <td> I </td> <td> 0 </tr>
    <tr> <td style="background-color:yellow"> D </td> <td style="background-color:yellow"> 1 </tr>
    <tr> <td> C </td> <td> 0 </tr>
    <tr> <td> A </td> <td> 1 </tr>
    <tr> <td> B </td> <td> 1 </tr>
    <tr> <td> J </td> <td> ... </tr>
</table>

and so the minhash of ABD under this permutation is `D`.

In [51]:
from IPython.display import display, HTML

CSS = """
.output {
    flex-direction: row;
}
"""

HTML('<style>{}</style>'.format(CSS))

In [70]:
def permute(s):
    return s.reindex(np.random.permutation(s.index))
    
def pp_minhash(s, permute_set=True):
    p = permute(s) if permute_set else s
    def find_first(rows):
        res = ['background-color:white']  * len(rows)
        res[rows.tolist().index(1)] = 'background-color:yellow'
        return res
    display(p.style.apply(find_first, axis=0))

pp_set(s)
p = permute(s)
display('=> permute =>')
pp_set(p)
display('=> minhash =>')
pp_minhash(s, permute_set=False)

Unnamed: 0,0
A,1
B,0
C,0
D,0
E,0
F,0
G,0
H,1
I,0
J,1


'=> permute =>'

Unnamed: 0,0
D,0
O,0
J,1
E,0
H,1
B,0
M,1
N,1
C,0
K,1


'=> minhash =>'

Unnamed: 0,0
A,1
B,0
C,0
D,0
E,0
F,0
G,0
H,1
I,0
J,1


# Step 4: Minhash equality is jaccard similarity

In [77]:
pair = pd.concat([sample_set(), sample_set()], axis=1)
pair.columns=['A','B']
pp_set(pair)

Unnamed: 0,A,B
A,0,1
B,1,0
C,1,1
D,1,1
E,0,0
F,1,1
G,1,1
H,0,0
I,0,1
J,0,1


In [78]:
def pp_jaccard(df):
    def highlight_eq_rows(row):
        l = row.iloc[0]
        r = row.iloc[1]
        if l == 0 and r == 0:
            return ['']* 2
        elif l != r:
            return ['background-color: darkorange']* 2
        else:
            return ['background-color: lightgreen'] * 2
    return df.style.apply(highlight_eq_rows, axis=1)

In [79]:
pp_jaccard(pair)

Unnamed: 0,A,B
A,0,1
B,1,0
C,1,1
D,1,1
E,0,0
F,1,1
G,1,1
H,0,0
I,0,1
J,0,1


In [121]:
pp_minhash(pair)

Unnamed: 0,A,B
K,1,1
G,1,1
N,1,0
J,0,1
I,0,1
D,1,1
C,1,1
M,1,1
A,0,1
E,0,0


In [122]:
def get_mh(s):
    p = permute(s)
    return p.index[p.apply(lambda x: x.tolist().index(1), axis=0)]

In [7]:
def jac_sim(pair):
    pair['A'] 
    return (pair['A'] & pair['B']).sum() / (pair['A'] | pair['B']).sum()

In [8]:
jac_sim(pair)

NameError: name 'pair' is not defined

In [188]:
from math import sqrt
def wilson(p, n, z = 1.96):
    denominator = 1 + z**2/n
    centre_adjusted_probability = p + z*z / (2*n)
    adjusted_standard_deviation = sqrt((p*(1 - p) + z*z / (4*n)) / n)
    
    return f"{centre_adjusted_probability:.2f}+-{2*adjusted_standard_deviation:.2f}"

In [191]:
eq = []
for i in range(1000):
    mh_equality = get_mh(pair)
    eq.append(mh_equality[0] == mh_equality[1])
display(wilson(np.sum(eq)/len(eq), len(eq)))


'0.57+-0.03'

# Step 5. Locality Sensitive Hashing

So, we've come up with a way to compare sets based on their *Minhash signature*. However, we still need to do the full $\frac{n*(n-1)}{2}$ comparisons.    
This is where Locality Sensitive Hashing comes into play. Suppose we have minhash signatures of length 100. Then, we can divide each signature into `b` bands of `r` rows, such that `b*r = 100`