# PyMinHash example notebook

This example notebook shows how to use PyMinHash to find matching strings.

First, import pandas and set some display options.

In [1]:
%config Completer.use_jedi = False

import pandas as pd

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_colwidth', 200)

PyMinHash comes with a toy dataset containing various name and address combinations of Stoxx50 companies.

In [2]:
from pyminhash.datasets import load_data
df = load_data()
df.head()

Unnamed: 0,name
0,adidas ag adi dassler strasse 1 91074 germany
1,adidas ag adi dassler strasse 1 91074 herzogenaurach
2,adidas ag adi dassler strasse 1 91074 herzogenaurach germany
3,airbus se 2333 cs leiden netherlands
4,airbus se 2333 cs netherlands


We're going to match the various representations that belong to the same company. For this, we import the `MinHash` class, create a `MinHash` object, and tell it to use 10 hash tables. More hash tables mean a more accurate Jaccard similarity estimate but also require more time and memory.

In [3]:
from pyminhash.pyminhash import MinHash
myHasher = MinHash(n_hash_tables=10)
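
To build some intuition for why the number of hash tables matters, here is a minimal pure-Python sketch of the MinHash idea. It is not PyMinHash's implementation: the helper names `salted_hash`, `minhash_signature` and `estimate_jaccard`, the word-level tokenisation, and the salted MD5 hash are all assumptions made for illustration.

```python
import hashlib
from typing import List


def salted_hash(token: str, seed: int) -> int:
    """Deterministic hash of a token, salted with a seed (one seed per hash table)."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).hexdigest()
    return int(digest, 16)


def minhash_signature(text: str, n_hash_tables: int) -> List[int]:
    """For each seed, keep the smallest hash value over the words of a non-empty text."""
    tokens = set(text.split())
    return [min(salted_hash(t, seed) for t in tokens) for seed in range(n_hash_tables)]


def estimate_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of signature positions on which both texts agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Two rows from the toy dataset (assumed word-level tokens).
a = "allianz se koniginstrasse 28 munich germany"
b = "allianz se munich germany"

# The estimate is always a multiple of 1/n, so a larger n allows finer values.
for n in (10, 100, 1000):
    est = estimate_jaccard(minhash_signature(a, n), minhash_signature(b, n))
    print(n, est)
```

Under these assumptions the two strings share 4 of 6 distinct words, so the estimates should tend toward 4/6 as `n` grows, at the cost of computing and storing a longer signature per string.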

The `fit_predict` method needs the dataframe and the name of the column to which minhashing should be applied. The result is a dataframe containing all pairs that have a non-zero Jaccard similarity:

In [4]:
result = myHasher.fit_predict(df, 'name')
result.head()

Unnamed: 0,row_number_1,row_number_2,name_1,name_2,jaccard_sim
33,31,33,bayerische motoren werke aktiengesellschaft munich germany,bayerische motoren werke aktiengesellschaft petuelring 130 munich germany,1.0
611,13,14,anheuser busch inbev sa nv brouwerijplein 1 3000 leuven belgium,anheuser busch inbev sa nv brouwerijplein 1 leuven belgium,1.0
1330,77,78,koninklijke ahold delhaize n v provincialeweg 11,koninklijke ahold delhaize n v provincialeweg 11 netherlands,1.0
93,49,50,deutsche post ag platz der deutschen post 53113 germany,deutsche post ag platz der deutschen post bonn germany,1.0
750,42,43,danone s a 15 rue du helder 75439 paris france,danone s a 15 rue du helder paris france,1.0
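
Because every pair with a non-zero similarity is returned, a common next step is to keep only the pairs above some cut-off. The sketch below uses a small hand-built stand-in for `result` with the same column names as the output above; the threshold of 0.8 is an arbitrary assumption, not a recommendation from PyMinHash.

```python
import pandas as pd

# Hand-built stand-in for `result`, using rows from the toy dataset.
result = pd.DataFrame({
    "row_number_1": [31, 126, 75],
    "row_number_2": [33, 127, 129],
    "name_1": [
        "bayerische motoren werke aktiengesellschaft munich germany",
        "vinci sa 1 cours ferdinand de lesseps 92851 france",
        "kering sa 40 rue de sevres 75007 paris france",
    ],
    "name_2": [
        "bayerische motoren werke aktiengesellschaft petuelring 130 munich germany",
        "vinci sa 1 cours ferdinand de lesseps france",
        "vivendi sa 42 avenue de friedland 75380 paris france",
    ],
    "jaccard_sim": [1.0, 0.9, 0.7],
})

threshold = 0.8  # assumed cut-off; tune it for your own data

# Keep only candidate pairs whose estimated similarity reaches the threshold.
matches = result[result["jaccard_sim"] >= threshold]
print(matches[["row_number_1", "row_number_2", "jaccard_sim"]])
```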


As one can see below, for the pairs with a Jaccard similarity of 1.0, all words in the shorter string appear in the longer string. For lower Jaccard similarity values, the match is less than perfect. Note that the estimated Jaccard similarity has a granularity of `1/n_hash_tables`, which is 0.1 in this example. For a more accurate Jaccard similarity, use a larger value of `n_hash_tables`.

In [5]:
result.groupby('jaccard_sim').head(2)

Unnamed: 0,row_number_1,row_number_2,name_1,name_2,jaccard_sim
33,31,33,bayerische motoren werke aktiengesellschaft munich germany,bayerische motoren werke aktiengesellschaft petuelring 130 munich germany,1.0
611,13,14,anheuser busch inbev sa nv brouwerijplein 1 3000 leuven belgium,anheuser busch inbev sa nv brouwerijplein 1 leuven belgium,1.0
546,126,127,vinci sa 1 cours ferdinand de lesseps 92851 france,vinci sa 1 cours ferdinand de lesseps france,0.9
359,10,11,amadeus it group s a salvador de madariaga 1 28027 madrid,amadeus it group s a salvador de madariaga 1 28027 madrid spain,0.9
305,79,81,koninklijke philips n v amstelplein 2 1096 bc,koninklijke philips n v amstelplein 2 1096 bc netherlands,0.8
356,31,32,bayerische motoren werke aktiengesellschaft munich germany,bayerische motoren werke aktiengesellschaft petuelring 130 80788 munich,0.8
567,75,129,kering sa 40 rue de sevres 75007 paris france,vivendi sa 42 avenue de friedland 75380 paris france,0.7
292,7,8,allianz se koniginstrasse 28 munich germany,allianz se munich germany,0.7
570,84,129,l air liquide s a paris france,vivendi sa 42 avenue de friedland 75380 paris france,0.6
593,126,129,vinci sa 1 cours ferdinand de lesseps 92851 france,vivendi sa 42 avenue de friedland 75380 paris france,0.6
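
As a sanity check on individual pairs, the exact Jaccard similarity can be computed directly from the word sets. This is a minimal sketch that assumes whitespace-separated words as tokens; PyMinHash may tokenise differently, so these values need not coincide with `jaccard_sim`. The helper name `jaccard` is hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Exact Jaccard similarity of the word sets of two strings."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty strings
    return len(set_a & set_b) / len(union)


# Two pairs taken from the output above.
pairs = [
    ("allianz se koniginstrasse 28 munich germany",
     "allianz se munich germany"),
    ("vinci sa 1 cours ferdinand de lesseps 92851 france",
     "vinci sa 1 cours ferdinand de lesseps france"),
]

for name_1, name_2 in pairs:
    print(round(jaccard(name_1, name_2), 2))
# prints 0.67 (4 shared words out of 6), then 0.89 (8 out of 9)
```

With only 10 hash tables the estimate can take just eleven distinct values, so some deviation from the exact figure is expected.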
