# Tutorial

This tutorial shows how to use PyMinHash to find matching strings.

First, import pandas and set a few display options.

In [1]:
%config Completer.use_jedi = False

import pandas as pd

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_colwidth', 200)

PyMinHash comes with a toy dataset containing various name and address combinations of Stoxx50 companies.

In [2]:
from pyminhash.datasets import load_data
df = load_data()
df.head()

| | name |
|---|---|
| 0 | adidas ag adi dassler strasse 1 91074 germany |
| 1 | adidas ag adi dassler strasse 1 91074 herzogenaurach |
| 2 | adidas ag adi dassler strasse 1 91074 herzogenaurach germany |
| 3 | airbus se 2333 cs leiden netherlands |
| 4 | airbus se 2333 cs netherlands |


We're going to match the various representations that belong to the same company. For this, we import `MinHash`, create a `MinHash` object, and tell it to use 10 hash tables. More hash tables give a more accurate Jaccard similarity estimate but also require more time and memory.

In [3]:
from pyminhash.pyminhash import MinHash
myHasher = MinHash(n_hash_tables=10)
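
To see why the number of hash tables drives accuracy, here is a minimal pure-Python sketch of the general MinHash idea. It is not PyMinHash's actual implementation: the function names (`min_hash_signature`, `estimate_jaccard`), the salted MD5 hashes, and the whitespace tokenization are all assumptions made for illustration. Each "hash table" contributes one minimum hash value per string, and the fraction of positions where two strings share the same minimum is the similarity estimate.

```python
import hashlib


def min_hash_signature(words, n_hash_tables):
    """Return one minimum hash value per (hypothetical) hash table."""
    signature = []
    for seed in range(n_hash_tables):
        # Salting with the seed simulates a different hash function per table.
        signature.append(
            min(hashlib.md5(f"{seed}:{w}".encode("utf-8")).hexdigest() for w in words)
        )
    return signature


def estimate_jaccard(text_a, text_b, n_hash_tables=10):
    """Estimate the Jaccard similarity of the word sets of two strings."""
    # Assumption: tokens are whitespace-separated words.
    sig_a = min_hash_signature(set(text_a.split()), n_hash_tables)
    sig_b = min_hash_signature(set(text_b.split()), n_hash_tables)
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    # The estimate is always a multiple of 1 / n_hash_tables.
    return matches / n_hash_tables


print(estimate_jaccard("airbus se 2333 cs leiden netherlands",
                       "airbus se 2333 cs netherlands"))
```

Because the estimate is a count of matching positions divided by `n_hash_tables`, it can only take the values 0, 1/n, 2/n, and so on up to 1, so a larger `n_hash_tables` gives a finer-grained estimate at the cost of more hashing work.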

The `fit_predict` method needs the dataframe and the name of the column to which minhashing should be applied. The result is a dataframe containing all pairs that have a non-zero Jaccard similarity:

In [4]:
result = myHasher.fit_predict(df, 'name')
result.head()

| | row_number_1 | row_number_2 | name_1 | name_2 | jaccard_sim |
|---|---|---|---|---|---|
| 406 | 67 | 68 | industria de diseno textil s a avenida de la diputacion s n arteixo 15143 a coruna spain | industria de diseno textil s a avenida de la diputacion s n arteixo a coruna spain | 1.0 |
| 1343 | 77 | 78 | koninklijke ahold delhaize n v provincialeweg 11 | koninklijke ahold delhaize n v provincialeweg 11 netherlands | 1.0 |
| 928 | 60 | 61 | essilorluxottica 1 6 rue paul cezanne 75008 paris france | essilorluxottica 1 6 rue paul cezanne paris france | 1.0 |
| 595 | 126 | 127 | vinci sa 1 cours ferdinand de lesseps 92851 france | vinci sa 1 cours ferdinand de lesseps france | 0.9 |
| 322 | 31 | 33 | bayerische motoren werke aktiengesellschaft munich germany | bayerische motoren werke aktiengesellschaft petuelring 130 munich germany | 0.9 |


As one can see below, for a Jaccard similarity of 1.0, all words in the shorter string appear in the longer string. For lower Jaccard similarity values, the match is less than perfect. Note that the Jaccard similarity has a granularity of `1/n_hash_tables`, which is 0.1 in this example. For a more accurate Jaccard similarity, use a larger value of `n_hash_tables`.

In [5]:
result.groupby('jaccard_sim').head(2)

| | row_number_1 | row_number_2 | name_1 | name_2 | jaccard_sim |
|---|---|---|---|---|---|
| 406 | 67 | 68 | industria de diseno textil s a avenida de la diputacion s n arteixo 15143 a coruna spain | industria de diseno textil s a avenida de la diputacion s n arteixo a coruna spain | 1.0 |
| 1343 | 77 | 78 | koninklijke ahold delhaize n v provincialeweg 11 | koninklijke ahold delhaize n v provincialeweg 11 netherlands | 1.0 |
| 595 | 126 | 127 | vinci sa 1 cours ferdinand de lesseps 92851 france | vinci sa 1 cours ferdinand de lesseps france | 0.9 |
| 322 | 31 | 33 | bayerische motoren werke aktiengesellschaft munich germany | bayerische motoren werke aktiengesellschaft petuelring 130 munich germany | 0.9 |
| 146 | 62 | 64 | fresenius se co kgaa else kroner strasse 1 61352 bad homburg vor der hohe germany | fresenius se co kgaa else kroner strasse 1 bad homburg vor der hohe germany | 0.8 |
| 709 | 24 | 25 | banco santander s a 28660 madrid | banco santander s a 28660 madrid spain | 0.8 |
| 724 | 12 | 14 | anheuser busch inbev sa nv 3000 leuven belgium | anheuser busch inbev sa nv brouwerijplein 1 leuven belgium | 0.7 |
| 777 | 19 | 101 | axa sa 75008 paris france | safran sa 2 boulevard du general martial valin paris france | 0.7 |
| 852 | 98 | 103 | orange s a 75015 paris france | safran sa paris france | 0.6 |
| 861 | 18 | 130 | axa sa 75008 paris | vivendi sa 75380 paris | 0.6 |
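
For comparison with the estimated values above, the exact Jaccard similarity of two strings can be computed directly from their word sets. The sketch below assumes whitespace-separated words as set elements, which may differ from how PyMinHash tokenizes internally. The helper name `exact_jaccard` is hypothetical, and the two pairs are taken from the output above.

```python
def exact_jaccard(text_a, text_b):
    """Exact Jaccard similarity: |intersection| / |union| of the word sets."""
    # Assumption: tokens are whitespace-separated words.
    words_a = set(text_a.split())
    words_b = set(text_b.split())
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


pairs = [
    ("vinci sa 1 cours ferdinand de lesseps 92851 france",
     "vinci sa 1 cours ferdinand de lesseps france"),
    ("banco santander s a 28660 madrid",
     "banco santander s a 28660 madrid spain"),
]

for name_1, name_2 in pairs:
    print(round(exact_jaccard(name_1, name_2), 3))
# prints 0.889 (8 shared words out of 9), then 0.857 (6 shared words out of 7)
```

Under this word-level assumption, the exact values 0.889 and 0.857 sit close to the reported 0.9 and 0.8, a gap consistent with the 0.1 granularity that 10 hash tables allow.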
