Contents
========
- [Introduction]
- [Different Ways to Block Using Overlap Blocker]
   - [Blocking the Input Tables to Produce Candidate Set](#Blocking-the-Input-Tables-to-Produce-Candidate-Set)
       - [Handling Missing Values](#Handling-Missing-Values)
       - [Updating Stopwords]         
   - [Blocking a Candidate Set](#Blocking-a-Candidate-Set)
       - [Handling Missing Values](#Handling-Missing-Values)
       - [Updating Stopwords]
   - [Blocking a Tuple Pair](#Blocking-a-Tuple-Pair)

# Introduction

This IPython notebook illustrates how to perform blocking using the overlap blocker.

In [1]:
# Import py_entitymatching package
import py_entitymatching as em
import os
import pandas as pd



# Read Input Tables

In [2]:
# Get the datasets directory
datasets_dir = os.path.join(em.get_install_path(), 'datasets')

# Get the paths of the input tables
path_A = os.path.join(datasets_dir, 'person_table_A.csv')
path_B = os.path.join(datasets_dir, 'person_table_B.csv')

In [3]:
# Read the CSV files and set 'ID' as the key attribute
A = em.read_csv_metadata(path_A, key='ID')
B = em.read_csv_metadata(path_B, key='ID')

In [4]:
A.head()

Unnamed: 0,ID,name,birth_year,hourly_wage,address,zipcode
0,a1,Kevin Smith,1989,30.0,"607 From St, San Francisco",94107
1,a2,Michael Franklin,1988,27.5,"1652 Stockton St, San Francisco",94122
2,a3,William Bridge,1986,32.0,"3131 Webster St, San Francisco",94107
3,a4,Binto George,1987,32.5,"423 Powell St, San Francisco",94122
4,a5,Alphonse Kemper,1984,35.0,"1702 Post Street, San Francisco",94122


# Ways To Do Overlap Blocking

There are three different ways to do overlap blocking:

1. Block two tables to produce a `candidate set` of tuple pairs.
2. Block a `candidate set` of tuple pairs to produce a (typically) reduced candidate set of tuple pairs.
3. Block two tuples to check if a tuple pair would get blocked.

## Block Tables to Produce a Candidate Set of Tuple Pairs

In [5]:
# Instantiate overlap blocker object
ob = em.OverlapBlocker()

For the two given tables, we assume that two persons whose addresses do not overlap sufficiently do not refer to the same real-world person. So, we apply overlap blocking on `address`. Specifically, we tokenize each address into words and keep only the tuple pairs whose addresses share at least 3 tokens. In other words, every tuple pair that shares fewer than 3 tokens in `address` is blocked (dropped from the candidate set).
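To make the overlap criterion concrete, here is a minimal pure-Python sketch of the idea. It is not the library's implementation: the helper names `word_tokens` and `survives_blocking` are hypothetical, and the normalization (lowercasing and stripping punctuation) is an assumption. Stop word removal is left out for brevity.

```python
import re

def word_tokens(text):
    """Lowercase, drop punctuation, and split on whitespace (assumed normalization)."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return set(cleaned.split())

def survives_blocking(l_addr, r_addr, overlap_size=3):
    """Return True when the two addresses share at least `overlap_size` word tokens."""
    shared = word_tokens(l_addr) & word_tokens(r_addr)
    return len(shared) >= overlap_size

# Shares 'st', 'san', 'francisco': the pair is kept.
print(survives_blocking("607 From St, San Francisco", "108 Clement St, San Francisco"))
# Shares only 'san', 'francisco': the pair is blocked.
print(survives_blocking("1702 Post Street, San Francisco", "108 Clement St, San Francisco"))
```

Under these assumptions the first pair passes and the second does not, because "Street" and "St" are different word tokens.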

In [6]:
# Specify the tokenization to be 'word' level and set overlap_size to be 3.
C1 = ob.block_tables(A, B, 'address', 'address', word_level=True, overlap_size=3, 
                    l_output_attrs=['name', 'birth_year', 'address'], 
                    r_output_attrs=['name', 'birth_year', 'address'])

0%  100%
[######] | ETA: 00:00:00
Total time elapsed: 00:00:00


In [7]:
# Display first 5 tuple pairs in the candidate set.
C1.head()

Unnamed: 0,_id,ltable_ID,rtable_ID,ltable_name,ltable_birth_year,ltable_address,rtable_name,rtable_birth_year,rtable_address
0,0,a1,b1,Kevin Smith,1989,"607 From St, San Francisco",Mark Levene,1987,"108 Clement St, San Francisco"
1,1,a2,b1,Michael Franklin,1988,"1652 Stockton St, San Francisco",Mark Levene,1987,"108 Clement St, San Francisco"
2,2,a3,b1,William Bridge,1986,"3131 Webster St, San Francisco",Mark Levene,1987,"108 Clement St, San Francisco"
3,3,a4,b1,Binto George,1987,"423 Powell St, San Francisco",Mark Levene,1987,"108 Clement St, San Francisco"
4,4,a1,b2,Kevin Smith,1989,"607 From St, San Francisco",Bill Bridge,1986,"3131 Webster St, San Francisco"


### Using Q-Gram Tokenizer

The example above used a word-level tokenizer. The overlap blocker also supports a q-gram based tokenizer, which can be used as follows:

In [8]:
# Set the word_level to be False and set the value of q (using q_val)
C2 = ob.block_tables(A, B, 'address', 'address', word_level=False, q_val=3, overlap_size=3, 
                    l_output_attrs=['name', 'birth_year', 'address'], 
                    r_output_attrs=['name', 'birth_year', 'address'])

0%  100%
[######] | ETA: 00:00:00
Total time elapsed: 00:00:00


In [9]:
# Display first 5 tuple pairs
C2.head()

Unnamed: 0,_id,ltable_ID,rtable_ID,ltable_name,ltable_birth_year,ltable_address,rtable_name,rtable_birth_year,rtable_address
0,0,a1,b1,Kevin Smith,1989,"607 From St, San Francisco",Mark Levene,1987,"108 Clement St, San Francisco"
1,1,a2,b1,Michael Franklin,1988,"1652 Stockton St, San Francisco",Mark Levene,1987,"108 Clement St, San Francisco"
2,2,a3,b1,William Bridge,1986,"3131 Webster St, San Francisco",Mark Levene,1987,"108 Clement St, San Francisco"
3,3,a4,b1,Binto George,1987,"423 Powell St, San Francisco",Mark Levene,1987,"108 Clement St, San Francisco"
4,4,a5,b1,Alphonse Kemper,1984,"1702 Post Street, San Francisco",Mark Levene,1987,"108 Clement St, San Francisco"
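For intuition on why the pair (a5, b1) now survives, the following sketch builds 3-grams by hand and counts the shared ones. The helpers `qgrams` and `qgram_overlap` are hypothetical names, and the sketch assumes plain lowercasing with no padding characters; the tokenizer used inside the library may normalize or pad strings differently.

```python
def qgrams(text, q=3):
    """Return the set of all length-q substrings of the lowercased text (no padding)."""
    s = text.lower()
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def qgram_overlap(l_addr, r_addr, q=3):
    """Count the distinct q-grams the two strings have in common."""
    return len(qgrams(l_addr, q) & qgrams(r_addr, q))

a5 = "1702 Post Street, San Francisco"
b1 = "108 Clement St, San Francisco"

# The common suffix ', San Francisco' alone contributes many shared 3-grams,
# so the pair easily reaches overlap_size=3 even though it shares only two words.
print(qgram_overlap(a5, b1) >= 3)
```

Because q-grams are much finer than words, the same `overlap_size` is a far weaker filter at the q-gram level; a larger `overlap_size` is usually needed to prune a comparable number of pairs.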


### Updating the Stop Words

By default, the commands in the overlap blocker remove a set of stop words. You can disable this by setting the `rem_stop_words` parameter to False.

In [10]:
# Set the parameter to remove stop words to False
C3 = ob.block_tables(A, B, 'address', 'address', word_level=True, overlap_size=3, rem_stop_words=False,
                    l_output_attrs=['name', 'birth_year', 'address'], 
                    r_output_attrs=['name', 'birth_year', 'address'])

0%  100%
[######] | ETA: 00:00:00
Total time elapsed: 00:00:00


In [11]:
# Display first 5 tuple pairs
C3.head()

Unnamed: 0,_id,ltable_ID,rtable_ID,ltable_name,ltable_birth_year,ltable_address,rtable_name,rtable_birth_year,rtable_address
0,0,a1,b1,Kevin Smith,1989,"607 From St, San Francisco",Mark Levene,1987,"108 Clement St, San Francisco"
1,1,a2,b1,Michael Franklin,1988,"1652 Stockton St, San Francisco",Mark Levene,1987,"108 Clement St, San Francisco"
2,2,a3,b1,William Bridge,1986,"3131 Webster St, San Francisco",Mark Levene,1987,"108 Clement St, San Francisco"
3,3,a4,b1,Binto George,1987,"423 Powell St, San Francisco",Mark Levene,1987,"108 Clement St, San Francisco"
4,4,a1,b2,Kevin Smith,1989,"607 From St, San Francisco",Bill Bridge,1986,"3131 Webster St, San Francisco"


You can check what stop words are getting removed like this:

In [12]:
ob.stop_words

['a',
 'an',
 'and',
 'are',
 'as',
 'at',
 'be',
 'by',
 'for',
 'from',
 'has',
 'he',
 'in',
 'is',
 'it',
 'its',
 'on',
 'that',
 'the',
 'to',
 'was',
 'were',
 'will',
 'with']

You can update this stop word list (for example, with domain-specific stop words) and then redo the blocking.

In [13]:
# Include 'francisco' as one of the stop words
ob.stop_words.append('francisco')

In [14]:
ob.stop_words

['a',
 'an',
 'and',
 'are',
 'as',
 'at',
 'be',
 'by',
 'for',
 'from',
 'has',
 'he',
 'in',
 'is',
 'it',
 'its',
 'on',
 'that',
 'the',
 'to',
 'was',
 'were',
 'will',
 'with',
 'francisco']

In [15]:
# Block again with the word-level tokenizer; stop words (now including 'francisco') are removed by default
C4 = ob.block_tables(A, B, 'address', 'address', word_level=True, overlap_size=3, 
                    l_output_attrs=['name', 'birth_year', 'address'], 
                    r_output_attrs=['name', 'birth_year', 'address'])

0%  100%
[######] | ETA: 00:00:00
Total time elapsed: 00:00:00


In [16]:
C4.head()

Unnamed: 0,_id,ltable_ID,rtable_ID,ltable_name,ltable_birth_year,ltable_address,rtable_name,rtable_birth_year,rtable_address
0,0,a3,b2,William Bridge,1986,"3131 Webster St, San Francisco",Bill Bridge,1986,"3131 Webster St, San Francisco"
1,1,a2,b3,Michael Franklin,1988,"1652 Stockton St, San Francisco",Mike Franklin,1988,"1652 Stockton St, San Francisco"
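The drop to only two candidate pairs can be reasoned through by hand. The sketch below is a simplified stand-in, not the library's code: `word_tokens` and `shared_tokens` are hypothetical helpers, the normalization is assumed, and `STOP_WORDS` is an abbreviated subset of the list printed above.

```python
import re

# Abbreviated subset of the default stop word list shown above.
STOP_WORDS = {"a", "an", "and", "from", "in", "the"}

def word_tokens(text, stop_words):
    """Lowercase, strip punctuation, split on whitespace, and drop stop words."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return {tok for tok in cleaned.split() if tok not in stop_words}

def shared_tokens(l_addr, r_addr, stop_words):
    """Return the word tokens the two addresses have in common."""
    return word_tokens(l_addr, stop_words) & word_tokens(r_addr, stop_words)

a1 = "607 From St, San Francisco"
b1 = "108 Clement St, San Francisco"

before = shared_tokens(a1, b1, STOP_WORDS)
after = shared_tokens(a1, b1, STOP_WORDS | {"francisco"})

print(sorted(before))  # ['francisco', 'san', 'st']
print(sorted(after))   # ['san', 'st']
```

With 'francisco' treated as a stop word, the pair (a1, b1) shares only two tokens and falls below `overlap_size=3`, which is consistent with it no longer appearing in `C4`.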


### Handling Missing Values

If the input tuples have missing values in the blocking attribute, they are ignored by default. You can set `allow_missing` to True to include all possible tuple pairs with missing values.

In [17]:
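# Introduce a missing value in the first row of A
# (.loc replaces the .ix indexer and pd.np alias, both removed in recent pandas versions)
A.loc[0, 'address'] = float('nan')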

In [18]:
# Use the word-level tokenizer and set allow_missing to True to keep pairs with missing values
C5 = ob.block_tables(A, B, 'address', 'address', word_level=True, overlap_size=3, allow_missing=True,
                    l_output_attrs=['name', 'birth_year', 'address'], 
                    r_output_attrs=['name', 'birth_year', 'address'])

0%  100%
[######] | ETA: 00:00:00
Total time elapsed: 00:00:00
0%  100%
[#] | ETA: 00:00:00

Finding pairs with missing value...



Total time elapsed: 00:00:00


In [19]:
len(C5)

8

In [20]:
C5

Unnamed: 0,_id,ltable_ID,rtable_ID,ltable_name,ltable_birth_year,ltable_address,rtable_name,rtable_birth_year,rtable_address
0,0,a3,b2,William Bridge,1986,"3131 Webster St, San Francisco",Bill Bridge,1986,"3131 Webster St, San Francisco"
1,1,a2,b3,Michael Franklin,1988,"1652 Stockton St, San Francisco",Mike Franklin,1988,"1652 Stockton St, San Francisco"
0,2,a1,b1,Kevin Smith,1989,,Mark Levene,1987,"108 Clement St, San Francisco"
1,3,a1,b2,Kevin Smith,1989,,Bill Bridge,1986,"3131 Webster St, San Francisco"
2,4,a1,b3,Kevin Smith,1989,,Mike Franklin,1988,"1652 Stockton St, San Francisco"
3,5,a1,b4,Kevin Smith,1989,,Joseph Kuan,1982,"108 South Park, San Francisco"
4,6,a1,b5,Kevin Smith,1989,,Alfons Kemper,1984,"170 Post St, Apt 4, San Francisco"
5,7,a1,b6,Kevin Smith,1989,,Michael Brodie,1987,"133 Clement Street, San Francisco"
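The count of 8 can be checked with a small stand-alone sketch. It assumes the data is limited to the five rows of `A` shown by `A.head()` (with a1's address set to missing) and the six `B` addresses visible in the output above; the `block` function and its normalization are hypothetical simplifications, not the library's implementation.

```python
import re

STOP_WORDS = {"francisco"}  # only the stop word that matters for these rows

def word_tokens(text):
    """Lowercase, strip punctuation, split on whitespace, and drop stop words."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return {tok for tok in cleaned.split() if tok not in STOP_WORDS}

def block(left, right, overlap_size=3, allow_missing=False):
    """Return (l_id, r_id) pairs that survive overlap blocking."""
    pairs = []
    for l_id, l_addr in left.items():
        for r_id, r_addr in right.items():
            if l_addr is None or r_addr is None:
                # A missing value cannot be tokenized; keep the pair only on request.
                if allow_missing:
                    pairs.append((l_id, r_id))
                continue
            if len(word_tokens(l_addr) & word_tokens(r_addr)) >= overlap_size:
                pairs.append((l_id, r_id))
    return pairs

A_addr = {
    "a1": None,  # the missing value introduced above
    "a2": "1652 Stockton St, San Francisco",
    "a3": "3131 Webster St, San Francisco",
    "a4": "423 Powell St, San Francisco",
    "a5": "1702 Post Street, San Francisco",
}
B_addr = {
    "b1": "108 Clement St, San Francisco",
    "b2": "3131 Webster St, San Francisco",
    "b3": "1652 Stockton St, San Francisco",
    "b4": "108 South Park, San Francisco",
    "b5": "170 Post St, Apt 4, San Francisco",
    "b6": "133 Clement Street, San Francisco",
}

print(len(block(A_addr, B_addr)))                      # 2
print(len(block(A_addr, B_addr, allow_missing=True)))  # 8
```

The two overlapping pairs are joined by six extra pairs, one for each row of `B` paired with a1, so allowing missing values can enlarge the candidate set considerably when many rows lack the blocking attribute.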


## Block Tables to Produce a Candidate Set of Tuple Pairs