# Basic EM workflow 3 (Amazon-Walmart products data set)

# Introduction

This IPython notebook explains a basic workflow for matching two tables using *py_entitymatching*. Our goal is to come up with a workflow to match products from the Amazon and Walmart tables loaded below. Specifically, we want to maximize F1. The datasets contain information about the products, such as title, brand, model, and operating system.
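Since F1 is the quantity we want to maximize, it may help to recall how it is derived from precision and recall. The following is a minimal sketch in plain Python; the function name and the toy labels are hypothetical and are not part of *py_entitymatching*.

```python
def precision_recall_f1(true_labels, predicted_labels):
    """Compute precision, recall and F1 for binary match labels (1 = match)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical gold labels and predictions for six tuple pairs
truth = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0]
print(precision_recall_f1(truth, predicted))
```

Because F1 is a harmonic mean, it is high only when both precision and recall are high.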

First, we need to import the *py_entitymatching* package and other libraries as follows:

In [1]:
import sys
import py_entitymatching as em
import pandas as pd
import os

In [2]:
# Display the versions
print('python version: ' + sys.version )
print('pandas version: ' + pd.__version__ )
print('magellan version: ' + em.__version__ )

python version: 3.6.5 (default, Apr  1 2018, 05:46:30) 
[GCC 7.3.0]
pandas version: 0.22.0
magellan version: 0.3.0


# Read input tables

We begin by loading the input tables. For the purpose of this guide, we use the datasets that are included with the package.

In [3]:
# Get the paths
path_A = 'data/amazon_products.csv'
path_B = 'data/walmart_products.csv'

In [4]:
# Load csv files as dataframes and set the key attribute in the dataframe
A = em.read_csv_metadata(path_A, key='id')
B = em.read_csv_metadata(path_B, key='id')

In [5]:
print('Number of tuples in A    : ' + str(len(A)))
print('Number of tuples in B    : ' + str(len(B)))
print('Number of tuples in A X B: ' + str(len(A)*len(B)))

Number of tuples in A    : 2976
Number of tuples in B    : 4847
Number of tuples in A X B: 14424672


In [6]:
A.head(3)

Unnamed: 0,id,product title,brand,model,operating system,combo
0,a0,acer predator helios 300 gaming laptop 15.6 full hd intel core i7-7700hq cpu 16gb ddr4 ram 256gb...,acer,nh.q28aa.001,windows 10,acer predator helios 300 gaming laptop 15.6 full hd intel core i7-7700hq cpu 16gb ddr4 ram 256gb...
1,a1,acer aspire e 15 15.6 full hd 8th gen intel core i5-8250u geforce mx150 8gb ram memory 256gb ssd...,acer,e5-576g-5762,windows 10,acer aspire e 15 15.6 full hd 8th gen intel core i5-8250u geforce mx150 8gb ram memory 256gb ssd...
2,a2,asus vivobook f510ua fhd laptop intel core i5-8250u 8gb ram 1tb hdd usb-c nanoedge display finge...,asus,f510ua-ah51,windows 10,asus vivobook f510ua fhd laptop intel core i5-8250u 8gb ram 1tb hdd usb-c nanoedge display finge...


In [7]:
B.head(3)

Unnamed: 0,id,product title,brand,model,operating system,combo
0,w0,iview i896qw 8.95 2-in-1 32gb tablet intel atom bay trail z3735f processor windows 10,iview,,microsoft windows,iview i896qw 8.95 2-in-1 32gb tablet intel atom bay trail z3735f processor windows 10 iview micr...
1,w1,dell - inspiron 2-in-1 11.6 touch-screen laptop - intel pentium - 4gb memory - 500gb hd - red,dell,i3168-3270red,windows 10,dell - inspiron 2-in-1 11.6 touch-screen laptop - intel pentium - 4gb memory - 500gb hd - red i3...
2,w2,iview maximus ii 11.6 laptop touchscreen 2-in-1 windows 10 intel bay trail z3735f processor 2gb ...,iview,max2-bk,windows 10,iview maximus ii 11.6 laptop touchscreen 2-in-1 windows 10 intel bay trail z3735f processor 2gb ...


In [8]:
# Display the keys of the input tables
em.get_key(A), em.get_key(B)

('id', 'id')

# Block tables to get candidate set

Before we do the matching, we would like to remove the obviously non-matching tuple pairs from the input tables. This would reduce the number of tuple pairs considered for matching.
*py_entitymatching* provides four different blockers: (1) attribute equivalence, (2) overlap, (3) rule-based, and (4) black-box. The user can mix and match these blockers to form a blocking sequence applied to input tables.

For the matching problem at hand, we expect two products whose model strings have little in common to be non-matches. So we decide to apply blocking over the `model` attribute:

In [9]:
# Blocking plan

# A, B -- overlap blocker [model] --------------------|---> candidate set

In [10]:
# Create overlap blocker
ob = em.OverlapBlocker()

# Use block_tables to apply blocking over two input tables.
C1 = ob.block_tables(A,B, 'model', 'model', 
                    l_output_attrs=['id','model','combo'], 
                    r_output_attrs=['id','model','combo'],
                    overlap_size=8,
                    q_val=5,
                    word_level=False,
                    show_progress=False,
                    n_jobs=-1
                    )
len(C1)

2763

In [11]:
C1.head(3)

Unnamed: 0,_id,ltable_id,rtable_id,ltable_model,ltable_combo,rtable_model,rtable_combo
0,0,a229,w1,i3168-3271blu,dell i3168-3271blu 11.6 hd 2-in-1 laptop (intel pentium n3710 1.6ghz processor 4 gb ddr3l sdram ...,i3168-3270red,dell - inspiron 2-in-1 11.6 touch-screen laptop - intel pentium - 4gb memory - 500gb hd - red i3...
1,1,a1922,w7,i5578-5902gry,dell i5578-5902gry inspiron pro 15.6 fhd with touch laptop computer (core i5-7200u 8gb ddr4 256g...,i5578-3093gry,dell - inspiron 15 5000 2-in-1 15.6 touch-screen laptop - intel core i3 - 4gb memory - 500gb 540...
2,2,a229,w12,i3168-3271blu,dell i3168-3271blu 11.6 hd 2-in-1 laptop (intel pentium n3710 1.6ghz processor 4 gb ddr3l sdram ...,i3168-3271blu,dell - inspiron 11 3000 2-in-1 blue 11.6-inch hd intel pentium processor n3710 4gb 1600mhz ddr3l...
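To build intuition for what the overlap blocker does with `word_level=False`, `q_val=5` and `overlap_size=8`, here is a rough sketch in plain Python. It is not the library's implementation. In particular, it assumes that each string is padded with `#` at the start and `$` at the end before being split into q-grams; the function names are hypothetical.

```python
def padded_qgrams(text, q):
    """Return the set of character q-grams of text, padded with '#' and '$'."""
    padded = '#' * (q - 1) + text + '$' * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def survives_overlap_blocking(l_value, r_value, q=5, overlap_size=8):
    """Keep the pair only if the two values share at least overlap_size q-grams."""
    shared = padded_qgrams(l_value, q) & padded_qgrams(r_value, q)
    return len(shared) >= overlap_size


# Two model strings taken from the first row of the C1 output above
print(survives_overlap_blocking('i3168-3271blu', 'i3168-3270red'))
# A hypothetical pair with unrelated model strings
print(survives_overlap_blocking('i3168-3271blu', 'max2-bk'))
```

Under these assumptions the first pair shares enough 5-grams to stay in the candidate set, while the second pair is dropped.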


## Debug blocker output

The number of tuple pairs considered for matching is reduced to 2763 (from 14424672), but we want to make sure that the blocker did not drop any potential matches. We can debug the blocker output in *py_entitymatching* as follows:
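The effect of blocking can be quantified from the sizes printed by the cells above (2976 and 4847 input tuples, and `len(C1)` equal to 2763). The following short calculation is only illustrative arithmetic on those printed numbers.

```python
# Sizes reported by the cells above
num_a, num_b = 2976, 4847
candidate_pairs = 2763

# Size of the full cross product A x B
cross_product = num_a * num_b

# Fraction of all possible pairs that survive blocking
kept_fraction = candidate_pairs / cross_product

print(cross_product)
print('{:.4%} of all pairs kept'.format(kept_fraction))
```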

In [12]:
# Debug blocker output
dbg = em.debug_blocker(C1, A, B, output_size=10, 
                       attr_corres=[
                           ('product title','product title'),
                           ('brand', 'brand'), 
                           ('model','model'),
                           ('combo','combo')],
                      verbose=True)
# Display first few tuple pairs from the debug_blocker's output
dbg.head(3)

Unnamed: 0,_id,ltable_id,rtable_id,ltable_product title,ltable_brand,ltable_model,ltable_combo,rtable_product title,rtable_brand,rtable_model,rtable_combo
0,0,a777,w634,top performance dell inspiron 15.6 touchscreen laptop 7th intel core i3-7100u 2.4ghz 8 gb ddr4 r...,dell,t8tjg,top performance dell inspiron 15.6 touchscreen laptop 7th intel core i3-7100u 2.4ghz 8 gb ddr4 r...,top performance dell inspiron 15.6 touchscreen laptop 7th intel core i3-7100u 2.4ghz 8 gb ddr4 r...,dell,,top performance dell inspiron 15.6 touchscreen laptop 7th intel core i3-7100u 2.4ghz 8 gb ddr4 r...
1,1,a1116,w434,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...,michaelelectronics2,,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...,michaelelectronics2,hp-15-02395-fhd-me28,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...
2,2,a1341,w2999,asus zenbook ux430ua-dh74 thin and light ultrabook 14 fhd laptop pc (intel 8th gen i7 quad core ...,michaelelectronics2,,asus zenbook ux430ua-dh74 thin and light ultrabook 14 fhd laptop pc (intel 8th gen i7 quad core ...,asus zenbook ux430ua-dh74 thin and light ultrabook 14 fhd laptop pc (intel 8th gen i7 quad core ...,michaelelectronics2,asu-14-07054-fhd-me4,asus zenbook ux430ua-dh74 thin and light ultrabook 14 fhd laptop pc (intel 8th gen i7 quad core ...


From the debug blocker's output we observe that the current blocker drops quite a few potential matches. We would want to update the blocking sequence to avoid dropping these potential matches.
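Conceptually, debugging a blocker means looking for pairs that are *not* in the candidate set but still look similar. The sketch below illustrates that idea with a brute-force ranking by word-level Jaccard similarity. This is an assumption made for illustration only; the library's `debug_blocker` is more sophisticated and more efficient, and the function names and toy tables here are hypothetical.

```python
from itertools import product


def word_jaccard(left, right):
    """Jaccard similarity over whitespace-separated tokens."""
    l_tokens, r_tokens = set(left.split()), set(right.split())
    union = l_tokens | r_tokens
    return len(l_tokens & r_tokens) / len(union) if union else 0.0


def likely_missed_pairs(table_a, table_b, candidate_ids, output_size=10):
    """Rank pairs absent from the candidate set by descending similarity."""
    scored = []
    for (l_id, l_text), (r_id, r_text) in product(table_a.items(), table_b.items()):
        if (l_id, r_id) not in candidate_ids:
            scored.append((word_jaccard(l_text, r_text), l_id, r_id))
    scored.sort(key=lambda item: -item[0])
    return scored[:output_size]


# Hypothetical toy tables: id -> descriptive text
table_a = {'a1': 'hp envy x360 15t laptop', 'a2': 'dell inspiron 15 laptop'}
table_b = {'b1': 'hp envy x360 15t convertible laptop', 'b2': 'acer aspire e 15'}
candidates = {('a2', 'b2')}  # pairs that survived blocking

for score, l_id, r_id in likely_missed_pairs(table_a, table_b, candidates, output_size=2):
    print(l_id, r_id, round(score, 2))
```

If highly similar pairs show up at the top of such a ranking, the blocker is probably too aggressive.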

For this dataset, we expect matching products to have overlapping descriptions. We therefore apply an overlap blocker to the `combo` attribute, and then use a black-box blocker to drop candidate pairs whose product titles are too dissimilar. Finally, we union this output with the output of the overlap blocker on `model` to get a consolidated candidate set.

In [13]:
# Updated blocking sequence
# A, B -- overlap blocker [model] -------------------------------------> C1--
#                                                                           |----> C
# A, B -- overlap blocker [combo] --> C2 -- black-box blocker [title] --> C3--

In [158]:
# Create overlap blocker
ob = em.OverlapBlocker()

# Block tables using 'combo' attribute 
C2 = ob.block_tables(A, B, 'combo', 'combo', 
                    l_output_attrs=['id','model','combo'], 
                    r_output_attrs=['id','model','combo'],
                    overlap_size=30,
                    q_val=5,
                    word_level=False,
                    show_progress=True,
                    n_jobs=-1
                    )
len(C2)

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:02


717854

In [159]:
# Debug again
dbg = em.debug_blocker(C2, A, B, output_size=10)
dbg

Unnamed: 0,_id,ltable_id,rtable_id,ltable_product title,ltable_brand,ltable_model,ltable_operating system,ltable_combo,rtable_product title,rtable_brand,rtable_model,rtable_operating system,rtable_combo
0,0,a2873,w3359,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd,lenovo,20hd0057us,unknown,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd lenovo 20hd0057us unknown,lenovo thinkpad x1 carbon - 14 - core i7 5600u - 8 gb ram - 256 gb ssd,lenovo,20bs0038us,microsoft windows,lenovo thinkpad x1 carbon - 14 - core i7 5600u - 8 gb ram - 256 gb ssd 20bs0038us microsoft windows
1,1,a2873,w69,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd,lenovo,20hd0057us,unknown,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd lenovo 20hd0057us unknown,lenovo thinkpad p40 yoga - 14 - core i7 6500u - 8 gb ram - 256 gb ssd,lenovo,20gq000bus,windows 10,lenovo thinkpad p40 yoga - 14 - core i7 6500u - 8 gb ram - 256 gb ssd 20gq000bus windows 10
2,2,a2873,w3068,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd,lenovo,20hd0057us,unknown,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd lenovo 20hd0057us unknown,lenovo thinkpad x1 carbon - 14 - core i7 6500u - 8 gb ram - 256 gb ssd,lenovo,20k4002rus,windows 7,lenovo thinkpad x1 carbon - 14 - core i7 6500u - 8 gb ram - 256 gb ssd 20k4002rus windows
3,3,a2873,w830,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd,lenovo,20hd0057us,unknown,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd lenovo 20hd0057us unknown,lenovo thinkpad t570 - 15.6 - core i7 7600u - 8 gb ram - 256 gb ssd,lenovo,20h9000tus,windows 10 pro (english),lenovo thinkpad t570 - 15.6 - core i7 7600u - 8 gb ram - 256 gb ssd 20h9000tus windows 10 pro (e...
4,4,a2873,w493,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd,lenovo,20hd0057us,unknown,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd lenovo 20hd0057us unknown,lenovo thinkpad x1 yoga - 14 - core i7 7600u - 8 gb ram - 256 gb ssd,lenovo,20jd000qus,windows 10 pro (english),lenovo thinkpad x1 yoga - 14 - core i7 7600u - 8 gb ram - 256 gb ssd 20jd000qus windows 10 pro (...
5,5,a2873,w2996,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd,lenovo,20hd0057us,unknown,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd lenovo 20hd0057us unknown,lenovo thinkpad t460s - 14 - core i7 6600u - 8 gb ram - 256 gb ssd,lenovo,20f9003cus,,lenovo thinkpad t460s - 14 - core i7 6600u - 8 gb ram - 256 gb ssd 20f9003cus
6,6,a2873,w2517,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd,lenovo,20hd0057us,unknown,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd lenovo 20hd0057us unknown,lenovo thinkpad t470 - 14 - core i7 6500u - 8 gb ram - 256 gb ssd,lenovo,20jm000bus,windows 10,lenovo thinkpad t470 - 14 - core i7 6500u - 8 gb ram - 256 gb ssd 20jm000bus windows 10
7,7,a2873,w3408,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd,lenovo,20hd0057us,unknown,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd lenovo 20hd0057us unknown,lenovo thinkpad x1 yoga - 14 - core i7 7600u - 8 gb ram - 512 gb ssd,lenovo,20jd000vus,microsoft windows windows 10 pro (english),lenovo thinkpad x1 yoga - 14 - core i7 7600u - 8 gb ram - 512 gb ssd 20jd000vus microsoft window...
8,8,a763,w375,hp elitebook 840 g4 - 14 - core i5 7200u - 8 gb ram - 256 gb ssd - 1ge40ut#aba,hp,1ge40ut#aba,windows 10 pro,hp elitebook 840 g4 - 14 - core i5 7200u - 8 gb ram - 256 gb ssd - 1ge40ut#aba windows 10 pro,hp elitebook x360 1030 g2 - 13.3 - core i5 7200u - 8 gb ram - 128 gb ssd - us,hp,1nm36ut#aba,windows 10,hp elitebook x360 1030 g2 - 13.3 - core i5 7200u - 8 gb ram - 128 gb ssd - us 1nm36ut#aba windows
9,9,a2873,w3220,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd,lenovo,20hd0057us,unknown,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd lenovo 20hd0057us unknown,lenovo thinkpad x1 carbon - 14 - core i7 7500u - 8 gb ram - 256 gb ssd,lenovo,20hr0053us,windows 10 pro (english),lenovo thinkpad x1 carbon - 14 - core i7 7500u - 8 gb ram - 256 gb ssd 20hr0053us windows 10 pro...


In [132]:
def match_combo(attribute='product title', q_val=3, threshold=.5, debug=False):
    """Build a black-box blocking function based on q-gram Jaccard similarity.

    The returned function yields True (i.e. drop the pair) when the Jaccard
    similarity of the two tuples' `attribute` values is below `threshold`.
    """
    def jaccard_matcher(ltuple, rtuple, attribute=attribute, q_val=q_val,
                        threshold=threshold, debug=debug):
        # Pad both strings so that leading and trailing characters form full q-grams
        buffer = '#' * (q_val - 1)
        l_text = buffer + ltuple[attribute] + buffer
        r_text = buffer + rtuple[attribute] + buffer
        # Create the sets of q-grams
        l_grams = {l_text[i:i + q_val] for i in range(len(l_text) - (q_val - 1))}
        r_grams = {r_text[i:i + q_val] for i in range(len(r_text) - (q_val - 1))}

        # Compute the Jaccard similarity
        intersection = l_grams & r_grams
        union = l_grams | r_grams
        similarity = len(intersection) / len(union)
        if debug:
            print(list(union))
            print(list(intersection))
            print(similarity)
        # True means "block (drop) this pair"
        return similarity < threshold

    return jaccard_matcher


In [149]:
# match_combo(debug=True)(A.iloc[1044],B.iloc[915])

In [150]:
# match_combo(debug=True)(A.iloc[1047],B.iloc[1550])

In [160]:
bb = em.BlackBoxBlocker()
bb.set_black_box_function(match_combo())
C3 = bb.block_candset(C2,  
                    show_progress=True,
                    n_jobs=-1
                    )
len(C3)

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:21


4308

In [161]:
C3.head(3)

Unnamed: 0,_id,ltable_id,rtable_id,ltable_model,ltable_combo,rtable_model,rtable_combo
152,152,a2842,w1,7568,dell - inspiron 2-in-1 15.6 touch-screen laptop (intel core i5 8gb memory 500gb hard drive black...,i3168-3270red,dell - inspiron 2-in-1 11.6 touch-screen laptop - intel pentium - 4gb memory - 500gb hd - red i3...
1943,1943,a2842,w7,7568,dell - inspiron 2-in-1 15.6 touch-screen laptop (intel core i5 8gb memory 500gb hard drive black...,i5578-3093gry,dell - inspiron 15 5000 2-in-1 15.6 touch-screen laptop - intel core i3 - 4gb memory - 500gb 540...
5764,5764,a322,w32,flex 5-1470,lenovo flex 5 2-in-1 laptop: core i5-8250u 8gb ram 128gb ssd 14-inch full hd touch display windo...,81c9000cus,lenovo flex 5 2-in-1 laptop: core i5-8250u 8gb ram 128gb ssd 14-inch full hd touch display windo...


Next, we take the union of the two blocker outputs, C1 and C3, to form the consolidated candidate set.

In [162]:
# Combine blocker outputs
C = em.combine_blocker_outputs_via_union([C1,C3])
len(C)

6905
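Combining via union means that a tuple pair is kept if it appears in at least one of the blocker outputs. The sketch below illustrates this on hypothetical miniature candidate sets, represented simply as sets of `(ltable_id, rtable_id)` pairs; the function name is an assumption and the real function also manages table metadata. The last lines apply inclusion-exclusion to the sizes printed above, assuming neither C1 nor C3 contains duplicate pairs.

```python
def union_candidate_sets(*candidate_sets):
    """Union candidate sets given as iterables of (ltable_id, rtable_id) pairs."""
    combined = set()
    for pairs in candidate_sets:
        combined.update(pairs)
    return combined


# Hypothetical miniature candidate sets
c1 = {('a229', 'w1'), ('a1922', 'w7'), ('a229', 'w12')}
c3 = {('a2842', 'w1'), ('a229', 'w1')}
c = union_candidate_sets(c1, c3)
print(len(c))

# Inclusion-exclusion on the sizes printed above:
# pairs in both C1 and C3 = len(C1) + len(C3) - len(C)
shared_pairs = 2763 + 4308 - 6905
print(shared_pairs)
```

Under that assumption, only a small number of pairs are produced by both blocking paths, which suggests the two paths are largely complementary.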

In [163]:
C.head(3)

Unnamed: 0,_id,ltable_id,rtable_id,ltable_model,ltable_combo,rtable_model,rtable_combo
0,0,a0,w1389,nh.q28aa.001,acer predator helios 300 gaming laptop 15.6 full hd intel core i7-7700hq cpu 16gb ddr4 ram 256gb...,nh.q1taa.001,refurbished acer 17.3 intel core i7 2.8 ghz 16 gb ram 1 tb hdd + 256 gb ssd windows 10 home nh.q...
1,1,a0,w1396,nh.q28aa.001,acer predator helios 300 gaming laptop 15.6 full hd intel core i7-7700hq cpu 16gb ddr4 ram 256gb...,nh.q1aaa.001,manufacturer refurbished acer predator g9-793-79pe 17.3 intel core i7-6700hq 2.6 ghz 16gb ram 1t...
2,2,a0,w1504,nh.q28aa.001,acer predator helios 300 gaming laptop 15.6 full hd intel core i7-7700hq cpu 16gb ddr4 ram 256gb...,nh.q1faa.001,refurbished acer 17.3 intel core i7 2.9 ghz 32 gb ram 1 tb hdd + 512 gb ssd windows 10 home nh.q...


We observe that the number of tuple pairs considered for matching has increased to 6905 (from 2763). Now let us debug the blocker output again to check whether the current blocker sequence is dropping any potential matches.

In [164]:
# Debug again
dbg = em.debug_blocker(C, A, B, output_size=10)
dbg

Unnamed: 0,_id,ltable_id,rtable_id,ltable_product title,ltable_brand,ltable_model,ltable_operating system,ltable_combo,rtable_product title,rtable_brand,rtable_model,rtable_operating system,rtable_combo
0,0,a2873,w3068,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd,lenovo,20hd0057us,unknown,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd lenovo 20hd0057us unknown,lenovo thinkpad x1 carbon - 14 - core i7 6500u - 8 gb ram - 256 gb ssd,lenovo,20k4002rus,windows 7,lenovo thinkpad x1 carbon - 14 - core i7 6500u - 8 gb ram - 256 gb ssd 20k4002rus windows
1,1,a2873,w69,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd,lenovo,20hd0057us,unknown,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd lenovo 20hd0057us unknown,lenovo thinkpad p40 yoga - 14 - core i7 6500u - 8 gb ram - 256 gb ssd,lenovo,20gq000bus,windows 10,lenovo thinkpad p40 yoga - 14 - core i7 6500u - 8 gb ram - 256 gb ssd 20gq000bus windows 10
2,2,a1513,w3435,lenovo thinkpad x1 yoga 20jf000dus 14 touchscreen lcd 2560 x 1440 - ips - intel core i7 (7th gen...,,191200416057,unknown,lenovo thinkpad x1 yoga 20jf000dus 14 touchscreen lcd 2560 x 1440 - ips - intel core i7 (7th gen...,lenovo thinkpad x1 carbon 20hrs0fh00 14 lcd ultrabook - intel core i7 (7th gen) i7-7600u dual-co...,lenovo,20hrs0fh00,windows 10 pro (english),lenovo thinkpad x1 carbon 20hrs0fh00 14 lcd ultrabook - intel core i7 (7th gen) i7-7600u dual-co...
3,3,a2241,w2126,mini laptopgoodlife623 gpd pocket 7 touch screen aluminum shell umpc windows 10 system cpu x7-z8...,gpd,,windows 10,mini laptopgoodlife623 gpd pocket 7 touch screen aluminum shell umpc windows 10 system cpu x7-z8...,gpd pocket 8gb/128gb 7 inch aluminum shell mini laptop touch screen umpc windows 10 system,gpd,,windows 10,gpd pocket 8gb/128gb 7 inch aluminum shell mini laptop touch screen umpc windows 10 system
4,4,a2257,w2148,newest hp flagship high performance 15.6 inch hd touchscreen laptop pc intel core i3-7100u dual-...,hp,,windows 10,newest hp flagship high performance 15.6 inch hd touchscreen laptop pc intel core i3-7100u dual-...,hp 17.3 inch hd flagship high performance laptop pc intel core i7-7500u 2.7ghz dual-core 8gb ddr...,hp,,windows 10,hp 17.3 inch hd flagship high performance laptop pc intel core i7-7500u 2.7ghz dual-core 8gb ddr...
5,5,a2681,w1195,dell inspiron i5559-7081slv 15.6 inch touchscreen laptop (intel core i7 8 gb ram 1 tb hdd silver...,dell,i5559-7081slv,windows 10,dell inspiron i5559-7081slv 15.6 inch touchscreen laptop (intel core i7 8 gb ram 1 tb hdd silver...,refurbished dell inspiron i5559 15.6 inch laptop (intel core i7 8 gb ram 1 tb hdd silver matte),dell,,microsoft windows,refurbished dell inspiron i5559 15.6 inch laptop (intel core i7 8 gb ram 1 tb hdd silver matte) ...
6,6,a2873,w493,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd,lenovo,20hd0057us,unknown,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd lenovo 20hd0057us unknown,lenovo thinkpad x1 yoga - 14 - core i7 7600u - 8 gb ram - 256 gb ssd,lenovo,20jd000qus,windows 10 pro (english),lenovo thinkpad x1 yoga - 14 - core i7 7600u - 8 gb ram - 256 gb ssd 20jd000qus windows 10 pro (...
7,7,a2873,w2996,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd,lenovo,20hd0057us,unknown,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd lenovo 20hd0057us unknown,lenovo thinkpad t460s - 14 - core i7 6600u - 8 gb ram - 256 gb ssd,lenovo,20f9003cus,,lenovo thinkpad t460s - 14 - core i7 6600u - 8 gb ram - 256 gb ssd 20f9003cus
8,8,a2873,w2517,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd,lenovo,20hd0057us,unknown,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd lenovo 20hd0057us unknown,lenovo thinkpad t470 - 14 - core i7 6500u - 8 gb ram - 256 gb ssd,lenovo,20jm000bus,windows 10,lenovo thinkpad t470 - 14 - core i7 6500u - 8 gb ram - 256 gb ssd 20jm000bus windows 10
9,9,a2873,w3408,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd,lenovo,20hd0057us,unknown,thinkpad t470 - 14 - core i7 7600u - 8 gb ram - 256 gb ssd lenovo 20hd0057us unknown,lenovo thinkpad x1 yoga - 14 - core i7 7600u - 8 gb ram - 512 gb ssd,lenovo,20jd000vus,microsoft windows windows 10 pro (english),lenovo thinkpad x1 yoga - 14 - core i7 7600u - 8 gb ram - 512 gb ssd 20jd000vus microsoft window...


In [None]:
# Debug again
dbg = em.debug_blocker(C, A, B, output_size=100)
dbg.head()

In [157]:
dbg

Unnamed: 0,_id,ltable_id,rtable_id,ltable_product title,ltable_brand,ltable_model,ltable_operating system,ltable_combo,rtable_product title,rtable_brand,rtable_model,rtable_operating system,rtable_combo
0,0,a763,w375,hp elitebook 840 g4 - 14 - core i5 7200u - 8 gb ram - 256 gb ssd - 1ge40ut#aba,hp,1ge40ut#aba,windows 10 pro,hp elitebook 840 g4 - 14 - core i5 7200u - 8 gb ram - 256 gb ssd - 1ge40ut#aba windows 10 pro,hp elitebook x360 1030 g2 - 13.3 - core i5 7200u - 8 gb ram - 128 gb ssd - us,hp,1nm36ut#aba,windows 10,hp elitebook x360 1030 g2 - 13.3 - core i5 7200u - 8 gb ram - 128 gb ssd - us 1nm36ut#aba windows
1,1,a1364,w2749,asus zenbook 3 deluxe ux490ua-xh74-bl 14 ultrabook (intel 8th gen i7 quad core 16gb ram 512gb ss...,michaelelectronics2,,windows 10,asus zenbook 3 deluxe ux490ua-xh74-bl 14 ultrabook (intel 8th gen i7 quad core 16gb ram 512gb ss...,asus zenbook ux430ua-dh74 thin and light ultrabook 14 fhd laptop pc (intel 8th gen i7 quad core ...,michaelelectronics2,asu-14-07054-fhd-me1,microsoft windows,asus zenbook ux430ua-dh74 thin and light ultrabook 14 fhd laptop pc (intel 8th gen i7 quad core ...
2,2,a763,w188,hp elitebook 840 g4 - 14 - core i5 7200u - 8 gb ram - 256 gb ssd - 1ge40ut#aba,hp,1ge40ut#aba,windows 10 pro,hp elitebook 840 g4 - 14 - core i5 7200u - 8 gb ram - 256 gb ssd - 1ge40ut#aba windows 10 pro,hp elitebook x360 1030 g2 - 13.3 - core i5 7300u - 8 gb ram - 256 gb ssd - us,hp,1nm38ut#aba,windows 10,hp elitebook x360 1030 g2 - 13.3 - core i5 7300u - 8 gb ram - 256 gb ssd - us 1nm38ut#aba windows
3,3,a1729,w844,hp envy x360 15t convertible 2 in 1 laptop / tablet pc (intel 8th gen i7 quad core 20gb ram 1tb ...,michaelelectronics2,,windows 10,hp envy x360 15t convertible 2 in 1 laptop / tablet pc (intel 8th gen i7 quad core 20gb ram 1tb ...,hp envy x360 15t convertible 2 in 1 high performance laptop pc (intel 8th gen i7 quad core 16gb ...,michaelelectronics2,hp-15-02374-fhd-me1,microsoft windows,hp envy x360 15t convertible 2 in 1 high performance laptop pc (intel 8th gen i7 quad core 16gb ...
4,4,a1364,w2999,asus zenbook 3 deluxe ux490ua-xh74-bl 14 ultrabook (intel 8th gen i7 quad core 16gb ram 512gb ss...,michaelelectronics2,,windows 10,asus zenbook 3 deluxe ux490ua-xh74-bl 14 ultrabook (intel 8th gen i7 quad core 16gb ram 512gb ss...,asus zenbook ux430ua-dh74 thin and light ultrabook 14 fhd laptop pc (intel 8th gen i7 quad core ...,michaelelectronics2,asu-14-07054-fhd-me4,microsoft windows,asus zenbook ux430ua-dh74 thin and light ultrabook 14 fhd laptop pc (intel 8th gen i7 quad core ...
5,5,a1775,w1750,acer chromebook 14 cb3-431-c7vz - 14 - celeron n3160 - 4 gb ram - 32 gb ss silver,acer,nx.gc2aa.010;cb3-431-c7vz,chrome,acer chromebook 14 cb3-431-c7vz - 14 - celeron n3160 - 4 gb ram - 32 gb ss silver nx.gc2aa.010;c...,hp chromebook 14 g4 - 14 - celeron n2840 - 4 gb ram - 32 gb ssd,hp,t4m33ut#aba,chrome,hp chromebook 14 g4 - 14 - celeron n2840 - 4 gb ram - 32 gb ssd t4m33ut#aba
6,6,a1792,w194,hp pavilion x2 10.1-inch detachable 2 in 1 laptop (32gb) (includes office 365 personal for 1-year),hp,k3n12ua#aba,windows 8,hp pavilion x2 10.1-inch detachable 2 in 1 laptop (32gb) (includes office 365 personal for 1-yea...,refurbished hp pavilion x2 10.1-inch detachable 2 in 1 laptop (32gb),hp,,microsoft windows,refurbished hp pavilion x2 10.1-inch detachable 2 in 1 laptop (32gb) microsoft windows
7,7,a2145,w2241,dell 4c99r latitude 5480 laptop 14 hd intel core i5-7300u 8gb ddr4 256gb solid state drive windo...,dell,4c99r,windows 10 pro,dell 4c99r latitude 5480 laptop 14 hd intel core i5-7300u 8gb ddr4 256gb solid state drive windo...,refurbished dell e7450 14 laptop windows 10 pro intel core i5-5300u processor 8gb ram 256gb soli...,latitude,wa5-30768,windows 10,refurbished dell e7450 14 laptop windows 10 pro intel core i5-5300u processor 8gb ram 256gb soli...
8,8,a2706,w2241,dell v4jhf latitude 7480 laptop 14 fhd intel core i7-7600u 8gb ddr4 256gb solid state drive wind...,dell,v4jhf,windows 10 pro,dell v4jhf latitude 7480 laptop 14 fhd intel core i7-7600u 8gb ddr4 256gb solid state drive wind...,refurbished dell e7450 14 laptop windows 10 pro intel core i5-5300u processor 8gb ram 256gb soli...,latitude,wa5-30768,windows 10,refurbished dell e7450 14 laptop windows 10 pro intel core i5-5300u processor 8gb ram 256gb soli...
9,9,a2706,w2064,dell v4jhf latitude 7480 laptop 14 fhd intel core i7-7600u 8gb ddr4 256gb solid state drive wind...,dell,v4jhf,windows 10 pro,dell v4jhf latitude 7480 laptop 14 fhd intel core i7-7600u 8gb ddr4 256gb solid state drive wind...,refurbished dell e7440 14 laptop windows 10 pro intel core i7-4600u processor 8gb ram 256gb soli...,latitude,e7440,windows 10,refurbished dell e7440 14 laptop windows 10 pro intel core i7-4600u processor 8gb ram 256gb soli...


We observe that the current blocker sequence does not drop obvious potential matches, so we can proceed with the matching step. A subtle point to note here is that debugging the blocker output provides a practical stopping criterion for modifying the blocker sequence.


# Matching tuple pairs in the candidate set

In this step, we want to match the tuple pairs in the candidate set. Specifically, we use a learning-based method for matching.
This typically involves the following four steps:
1. Sampling and labeling the candidate set
2. Splitting the labeled data into development and evaluation set
3. Selecting the best learning based matcher using the development set
4. Evaluating the selected matcher using the evaluation set

## Sampling and labeling the candidate set

First, we randomly sample 500 tuple pairs from the candidate set for labeling purposes.

In [19]:
# Sample  candidate set
# S = em.sample_table(C, 500)
# em.to_csv_metadata(S, 'data/new.csv')
# S.head()

For the purposes of this guide, we will load a pre-labeled dataset (of 300 tuple pairs) included in this package.

In [20]:
# Load the pre-labeled data
path_G = 'data/labeled.csv'
G = em.read_csv_metadata(path_G, 
                         key='_id',
                         ltable=A, rtable=B, 
                         fk_ltable='ltable_id', fk_rtable='rtable_id')
len(G)

300

## Splitting the labeled data into development and evaluation set

In this step, we split the labeled data into two sets: development (I) and evaluation (J). Specifically, the development set is used to come up with the best learning-based matcher, and the evaluation set is used to evaluate the selected matcher on unseen data.

In [21]:
# Split G into development set (I) and evaluation set (J)
IJ = em.split_train_test(G, train_proportion=0.7, random_state=0)
I = IJ['train']
J = IJ['test']
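To make the split concrete, here is a minimal sketch of a seeded 70/30 split in plain Python. It only illustrates the idea; the function name is hypothetical, and the library's own shuffling will generally assign different rows to each set.

```python
import random


def split_labeled_pairs(rows, train_proportion=0.7, seed=0):
    """Shuffle rows reproducibly and split them into two disjoint lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_proportion))
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the 300 labeled tuple pairs
labeled_ids = list(range(300))
development, evaluation = split_labeled_pairs(labeled_ids)
print(len(development), len(evaluation))
```

With 300 labeled pairs and a training proportion of 0.7, this gives 210 pairs for development and 90 for evaluation.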

In [22]:
G.head()


Unnamed: 0,_id,ltable_id,rtable_id,ltable_model,ltable_combo,rtable_model,rtable_combo,match
0,27,a1011,w685,t8tjg,2018 flagship lenovo thinkpad 11e (4h generation) 11.6 notebook intel i3-7100u 128gb m.2 ssd 4gb...,t8tjg,dell chromebook 11 3189 - 11.6 - celeron n3060 - 4 gb ram - 64 gb ssd dell t8tjg chrome os,0
1,63,a1044,w915,80xb000bus,lenovo flex 5 15.6 full hd touch multimode 2-in-1 notebook computer intel core i7-7500u 16gb ram...,80ys000bus,topseller n23 n3060 1.60g 4gb 32gb 11.6in hd display chrome lenovo 80ys000bus chrome os,0
2,99,a1047,w1550,n850hp6,prostar clevo gaming laptop n850hp6 15.6” fhd ips (1920x1080) led matte type display intel core ...,n850hk1,prostar clevo gaming laptop n850hk1 15.6” fhd ips (1920x1080) matte type display intel core i7-7...,1
3,100,a1047,w1553,n850hp6,prostar clevo gaming laptop n850hp6 15.6” fhd ips (1920x1080) led matte type display intel core ...,n850hp6,prostar clevo gaming laptop n850hp6 15.6” fhd ips (1920x1080) led matte type display intel core ...,1
4,134,a1047,w1623,n850hp6,prostar clevo gaming laptop n850hp6 15.6” fhd ips (1920x1080) led matte type display intel core ...,p650hp6-g,prostar clevo p650hp6-g vr ready gaming 15.6 fhd/ips/matte display with g-sync intel core i7-770...,1


## Selecting the best learning-based matcher 

Selecting the best learning-based matcher typically involves the following steps:

1. Creating a set of learning-based matchers
2. Creating features
3. Converting the development set into feature vectors
4. Selecting the best learning-based matcher using k-fold cross validation

### Creating a set of learning-based matchers

In [23]:
# Create a set of ML-matchers
dt = em.DTMatcher(name='DecisionTree', random_state=0)
svm = em.SVMMatcher(name='SVM', random_state=0)
rf = em.RFMatcher(name='RF', random_state=0)
lg = em.LogRegMatcher(name='LogReg', random_state=0)
ln = em.LinRegMatcher(name='LinReg')
nb = em.NBMatcher(name='NaiveBayes')

### Creating features

Next, we need to create a set of features for the development set. *py_entitymatching* provides a way to automatically generate features based on the attributes in the input tables. For the purposes of this guide, we use the automatically generated features.

In [24]:
match_t = em.get_tokenizers_for_matching([2,3,4,5,6,7,8,9])
match_s = em.get_sim_funs_for_matching()
atypes1 = em.get_attr_types(A)
atypes2 = em.get_attr_types(B)
match_c = em.get_attr_corres(A, B)
feature_table = em.get_features(A, B, atypes1, atypes2, match_c, match_t, match_s)

The following helper generates a feature function that returns 1 if both tuples, or neither tuple, contain any of the given values in the chosen attribute, and 0 otherwise.

In [25]:
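def generateContainsValueFeature(values, name=None, attribute='combo'):
    # Accept a single keyword as well as a list of keywords
    if isinstance(values, str):
        values = [values]
    def containsValueFeature(a, b):
        # 1 if both tuples, or neither tuple, mention any of the keywords; 0 otherwise
        return int(any(value.lower() in a[attribute].lower() for value in values)
                   == any(value.lower() in b[attribute].lower() for value in values))
    # The feature name defaults to the first keyword
    return containsValueFeature, (name if name else values[0])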

Use this to generate many new features.

In [26]:
brands = ['toshiba','hp','dell','lenovo','prostar','acer','samsung','apple','asus','panasonic','msi']
models = ['zenbook','flex','ativ','predator','x360','stealth','omen','xps','carbon','x1','yoga','envy','thinkpad','latitude','inspiron','elitebook','clevo','spectre','macbook','pavilion','ideapad','legion']
thinkpads = [['p50','p51','p71'],['460','470','560','570']]
asus = ['swift','aspire','spin']
sizes = ['13','15','17']
operating_systems = ['chrome','windows','mac']
cpus = ['i3','i5','i7','celeron','pentium']
miscellaneous = ['2-in-1','gtx','touch']
keywords = brands + models + thinkpads + sizes + operating_systems + cpus + miscellaneous
new_features = [generateContainsValueFeature(value) for value in keywords]

for feature in new_features:
    em.add_blackbox_feature(feature_table, feature[1], feature[0])
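To see how such keyword features behave, the sketch below restates the helper in a self-contained form and applies it to two hypothetical tuples, represented as plain dicts. The snake_case names and the sample strings are assumptions made for illustration.

```python
def generate_contains_value_feature(values, name=None, attribute='combo'):
    """Restatement of the helper above so that this sketch runs on its own."""
    if isinstance(values, str):
        values = [values]

    def contains_value_feature(a, b):
        in_a = any(value.lower() in a[attribute].lower() for value in values)
        in_b = any(value.lower() in b[attribute].lower() for value in values)
        return int(in_a == in_b)

    return contains_value_feature, (name if name else values[0])


# Hypothetical tuples
left = {'combo': 'Lenovo ThinkPad T470 14 core i7'}
right = {'combo': 'lenovo thinkpad x1 carbon 14 core i7'}

thinkpad_feature, thinkpad_name = generate_contains_value_feature('thinkpad')
t_series_feature, t_series_name = generate_contains_value_feature(['460', '470', '560', '570'])
chrome_feature, chrome_name = generate_contains_value_feature('chrome')

print(thinkpad_name, thinkpad_feature(left, right))
print(t_series_name, t_series_feature(left, right))
print(chrome_name, chrome_feature(left, right))
```

Note that a feature is also 1 when *neither* tuple contains the keyword, so a value of 1 signals agreement rather than presence.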

In [27]:
# List the names of the features generated
feature_table['feature_name']

0                                            id_id_lev_dist
1                                             id_id_lev_sim
2                                                 id_id_jar
3                                                 id_id_jwn
4                                                 id_id_exm
5                                     id_id_jac_qgm_3_qgm_3
6               product_title_product_title_jac_qgm_3_qgm_3
7           product_title_product_title_cos_dlm_dc0_dlm_dc0
8                               brand_brand_jac_qgm_3_qgm_3
9                           brand_brand_cos_dlm_dc0_dlm_dc0
10                          brand_brand_jac_dlm_dc0_dlm_dc0
11                                          brand_brand_mel
12                                     brand_brand_lev_dist
13                                      brand_brand_lev_sim
14                                          brand_brand_nmw
15                                           brand_brand_sw
16                              model_mo

### Converting the development set to feature vectors

In [28]:
# Convert I into a set of feature vectors using the feature table
H = em.extract_feature_vecs(I, 
                            feature_table=feature_table, 
                            attrs_after='match',
                            show_progress=False)
H.fillna(0, inplace=True)
H.head()

Unnamed: 0,_id,ltable_id,rtable_id,id_id_lev_dist,id_id_lev_sim,id_id_jar,id_id_jwn,id_id_exm,id_id_jac_qgm_3_qgm_3,product_title_product_title_jac_qgm_3_qgm_3,...,mac,i3,i5,i7,celeron,pentium,2-in-1,gtx,touch,match
282,7036,a888,w3248,4,0.2,0.483333,0.483333,0,0.083333,0.504237,...,1,1,1,1,1,1,1,1,1,0
44,1158,a1347,w3193,5,0.0,0.433333,0.433333,0,0.0,0.607656,...,1,1,1,1,1,1,1,1,0,1
272,6861,a816,w3647,5,0.0,0.483333,0.483333,0,0.0,0.266667,...,1,1,0,1,1,1,1,1,1,1
189,4762,a2426,w3223,4,0.2,0.6,0.6,0,0.0,0.515152,...,1,1,0,0,1,1,1,1,1,1
152,3872,a2104,w3121,4,0.2,0.466667,0.466667,0,0.0,0.835106,...,1,1,0,0,1,1,1,1,1,1


### Selecting the best matcher using cross-validation

Now, we select the best matcher using k-fold cross-validation. For the purposes of this guide, we use five-fold cross-validation and the 'precision' metric to select the best matcher.
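
As a reminder of what k-fold cross-validation does, the sketch below partitions row positions into five disjoint folds and uses each fold once as the test set. It is a simplified illustration without shuffling or stratification; how the package splits the data internally is not shown in this notebook, and the helper `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_rows, k=5):
    """Split row positions 0..n_rows-1 into k nearly equal, disjoint folds."""
    base, extra = divmod(n_rows, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # first `extra` folds get one more row
        folds.append(list(range(start, start + size)))
        start += size
    return folds


folds = k_fold_indices(12, k=5)
for i, test_rows in enumerate(folds):
    # Train on every fold except the one held out for testing
    train_rows = [r for j, fold in enumerate(folds) if j != i for r in fold]
    print(i, test_rows, len(train_rows))
```

Each matcher is trained and scored once per fold, and the per-fold scores are then averaged.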

In [29]:
# Select the best ML matcher using CV
result = em.select_matcher(
        matchers=[dt, rf, svm, ln, lg, nb], 
        table=H, 
        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'match'],
        k=5,
        target_attr='match', 
        metric_to_select_matcher='precision', 
        random_state=0)
result['cv_stats']

Unnamed: 0,Matcher,Average precision,Average recall,Average f1
0,DecisionTree,0.939552,0.952699,0.945567
1,RF,0.920646,0.941168,0.928877
2,SVM,0.792807,0.936075,0.855203
3,LinReg,0.936647,0.925582,0.929576
4,LogReg,0.928572,0.978449,0.952272
5,NaiveBayes,0.891493,0.971497,0.929445
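
Because the goal is to maximize F1 while the selection metric is precision, it helps to rank the matchers by each metric. The sketch below does this in plain Python with the values copied from the table above; the helper `best_by` is hypothetical.

```python
# Cross-validation statistics copied from the table above
cv_stats = {
    'DecisionTree': {'precision': 0.939552, 'recall': 0.952699, 'f1': 0.945567},
    'RF':           {'precision': 0.920646, 'recall': 0.941168, 'f1': 0.928877},
    'SVM':          {'precision': 0.792807, 'recall': 0.936075, 'f1': 0.855203},
    'LinReg':       {'precision': 0.936647, 'recall': 0.925582, 'f1': 0.929576},
    'LogReg':       {'precision': 0.928572, 'recall': 0.978449, 'f1': 0.952272},
    'NaiveBayes':   {'precision': 0.891493, 'recall': 0.971497, 'f1': 0.929445},
}


def best_by(metric):
    # Name of the matcher with the highest average value of `metric`
    return max(cv_stats, key=lambda matcher: cv_stats[matcher][metric])


best = {metric: best_by(metric) for metric in ('precision', 'recall', 'f1')}
print(best)
```

With these numbers, DecisionTree has the highest average precision, whereas LogReg has the highest average recall and F1.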


### Debugging matcher

We observe that the matcher with the highest precision (DecisionTree) is not the one that maximizes F1 (LogReg). We debug the matcher to see what might be wrong.
To do this, we first split the feature vectors into train and test sets.

In [30]:
# Split feature vectors into train and test sets
UV = em.split_train_test(H, train_proportion=0.5)
U = UV['train']
V = UV['test']
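
The split above is made without a fixed seed, so the train and test sets can differ between runs. The sketch below shows a seeded 50/50 split in plain Python; the helper `split_rows` and its `seed` parameter are hypothetical and are not the package's implementation. If the installed version of `em.split_train_test` accepts a `random_state` argument (check its docstring), passing one should have a similar effect.

```python
import random


def split_rows(row_ids, train_proportion=0.5, seed=0):
    # Shuffle with a private, seeded generator so the split is reproducible
    rng = random.Random(seed)
    shuffled = list(row_ids)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_proportion)
    return {'train': shuffled[:cut], 'test': shuffled[cut:]}


parts = split_rows(range(10), train_proportion=0.5, seed=0)
again = split_rows(range(10), train_proportion=0.5, seed=0)
print(len(parts['train']), len(parts['test']), parts == again)
```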

Next, we debug the matcher using the GUI. For the purposes of this guide, we debug the random forest matcher.

In [31]:
# Debug rf using GUI
# em.vis_debug_rf(rf, U, V, 
#         exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'match'],
#         target_attr='match')

## Evaluating the matching output

From the GUI, we observe that phone numbers seem to be an important attribute, but they are in different formats. The current features do not capture this, and adding a feature that accounts for the difference in format can potentially improve the F1 numbers.

Now, we repeat the earlier steps: we extract the feature vectors (this time with the updated feature table), impute the missing values, and select the best matcher again using cross-validation.

In [32]:
H = em.extract_feature_vecs(I, feature_table=feature_table, attrs_after='match', show_progress=False)
H.fillna(0, inplace=True)

In [33]:
# Select the best ML matcher using CV
result = em.select_matcher(
        matchers=[dt, rf, svm, ln, lg, nb], 
        table=H, 
        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'match'],
        k=5,
        target_attr='match', 
        metric_to_select_matcher='precision', 
        random_state=0)
result['cv_stats']

Unnamed: 0,Matcher,Average precision,Average recall,Average f1
0,DecisionTree,0.939552,0.952699,0.945567
1,RF,0.920646,0.941168,0.928877
2,SVM,0.792807,0.936075,0.855203
3,LinReg,0.936647,0.925582,0.929576
4,LogReg,0.928572,0.978449,0.952272
5,NaiveBayes,0.891493,0.971497,0.929445


Evaluating the matching outputs for the evaluation set typically involves the following four steps:
1. Converting the evaluation set to feature vectors
2. Training matcher using the feature vectors extracted from the development set
3. Predicting the evaluation set using the trained matcher
4. Evaluating the predicted matches

### Converting the evaluation set to feature vectors

As before, we convert the evaluation set into feature vectors using the feature table.

In [34]:
# Convert J into a set of feature vectors using feature table
L = em.extract_feature_vecs(
        J, 
        feature_table=feature_table,
        attrs_after='match', 
        show_progress=False)
L.fillna(0, inplace=True)
L.head()

Unnamed: 0,_id,ltable_id,rtable_id,id_id_lev_dist,id_id_lev_sim,id_id_jar,id_id_jwn,id_id_exm,id_id_jac_qgm_3_qgm_3,product_title_product_title_jac_qgm_3_qgm_3,...,mac,i3,i5,i7,celeron,pentium,2-in-1,gtx,touch,match
208,5189,a2500,w2256,4,0.2,0.6,0.6,0,0.0,0.010753,...,1,1,0,1,1,1,1,1,1,0
188,4747,a2426,w3192,5,0.0,0.466667,0.466667,0,0.0,0.442688,...,1,1,0,0,1,1,1,1,1,0
12,437,a1138,w3417,5,0.0,0.466667,0.466667,0,0.0,0.583333,...,1,1,1,1,1,1,1,1,0,1
221,5444,a2671,w2981,3,0.4,0.6,0.6,0,0.076923,0.363248,...,1,1,1,1,1,1,1,1,1,0
239,5908,a2944,w1770,5,0.0,0.0,0.0,0,0.0,0.0,...,1,1,0,1,0,1,1,1,1,0


### Training the selected matcher

Now, we train the matcher using all of the feature vectors from the development set. For the purposes of this guide, we use random forest as the selected matcher.

In [35]:
# Train using feature vectors from I 
rf.fit(table=H, 
       exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'match'], 
       target_attr='match')

### Predicting the matches

Next, we predict the matches for the evaluation set (using the feature vectors extracted from it).

In [36]:
# Predict on L 
predictions = rf.predict(
        table=L, 
        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'match'], 
        append=True, 
        target_attr='predicted', 
        inplace=False)

### Evaluating the predictions

Finally, we evaluate the predicted matches against the labels.

In [37]:
# Evaluate the predictions
eval_result = em.eval_matches(predictions, 'match', 'predicted')
em.print_eval_summary(eval_result)

Precision : 96.36% (53/55)
Recall : 92.98% (53/57)
F1 : 94.64%
False positives : 2 (out of 55 positive predictions)
False negatives : 4 (out of 35 negative predictions)
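
The summary figures follow directly from the counts it reports. As a quick check, the sketch below recomputes precision, recall, and F1 from those counts; the variable names are illustrative.

```python
# Counts taken from the evaluation summary above
true_positives = 53
predicted_positives = 55  # true positives + false positives
actual_positives = 57     # true positives + false negatives

precision = true_positives / predicted_positives
recall = true_positives / actual_positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

false_positives = predicted_positives - true_positives
false_negatives = actual_positives - true_positives

print(f'Precision: {precision:.2%}')
print(f'Recall:    {recall:.2%}')
print(f'F1:        {f1:.2%}')
print(false_positives, false_negatives)
```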
