# Entity matching (EM) for Laptops

This Jupyter notebook contains the commands used for each step of the entity matching process for laptop products from Amazon and Walmart.
We used the Basic EM workflow 3 as our guide for this process.

# Step 1: Reading in the input tables A, B.

In [1]:
import sys
import py_entitymatching as em
import pandas as pd
import os

path_A = 'data/amazon_products.csv'
path_B = 'data/walmart_products.csv'

# Load the csv files as dataframes and set the key attribute in the dataframe
A = em.read_csv_metadata(path_A, key='id')
B = em.read_csv_metadata(path_B, key='id')
print('len(A):' + str(len(A)))
print('len(B):' + str(len(B)))
print('len (A X B):' + str(len(A)*len(B)))

len(A):3000
len(B):4847
len (A X B):14541000


# Step 2: Block tables to get candidate set

In this step, we apply a blocking sequence on the input tables to get the candidate set C.

In [2]:
# Create an initial blocker
ob = em.OverlapBlocker()

# Use block_tables to apply blocking over two input tables.
C1 = ob.block_tables(A,B, 'model', 'model', 
                     l_output_attrs=['id','model','extended_title'], 
                     r_output_attrs=['id','model','extended_title'],
                     overlap_size=8,
                     q_val=5,
                     word_level=False,
                     show_progress=False,
                     n_jobs=-1
                     )
len(C1)

2597
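
To make the roles of q_val and overlap_size concrete, here is a minimal pure-Python sketch of overlap blocking on character q-grams. It is not the py_entitymatching implementation: the helper names are hypothetical, and treating the q-grams as an unpadded set is an assumption, since the notebook does not show how the library tokenizes internally. The second model number is made up for illustration.

```python
def qgrams(text, q):
    """Return the set of character q-grams of `text` (no padding: an assumption)."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def survives_overlap_blocking(left, right, q=5, overlap_size=8):
    """Keep a pair only if the two strings share at least `overlap_size` q-grams."""
    shared = qgrams(left, q) & qgrams(right, q)
    return len(shared) >= overlap_size


# 'nx.gr7aa.007' appears in the data; 'nx.gr7aa.008' is a hypothetical variant.
same = survives_overlap_blocking("nx.gr7aa.007", "nx.gr7aa.007")
near = survives_overlap_blocking("nx.gr7aa.007", "nx.gr7aa.008")
print(same, near)
```

Under these assumptions a 12-character model number has exactly eight 5-grams, so a single differing character already pushes a pair below overlap_size=8. This may help explain why C1 is so small.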

## Debugging blocker output

In [3]:
# Debug blocker output
dbg = em.debug_blocker(C1, A, B, 
                        attr_corres=[
                            ('product_title','product_title'),
                            ('brand', 'brand'), 
                            ('model','model'),
                            ('extended_title','extended_title')],
                            verbose=True)
### Display first few tuple pairs from the debug_blocker's output
dbg

Unnamed: 0,_id,ltable_id,rtable_id,ltable_product_title,ltable_brand,ltable_model,ltable_extended_title,rtable_product_title,rtable_brand,rtable_model,rtable_extended_title
0,0,a888,w3286,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 1tb ssd 16gb 14 wqhd ips 2560x1440 in...,lenovo,,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 1tb ssd 16gb 14 wqhd ips 2560x1440 in...,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 1tb nvme ssd 16gb 14 fhd ips 1920x108...,lenovo,,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 1tb nvme ssd 16gb 14 fhd ips 1920x108...
1,1,a389,w3226,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 512gb ssd 16gb 14 wqhd ips 2560x1440 ...,lenovo,,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 512gb ssd 16gb 14 wqhd ips 2560x1440 ...,lenovo thinkpad x1 carbon 4 windows 7 intel core i7 6600u 256gb nvme ssd 16gb 14 wqhd ips 2560x1...,lenovo,,lenovo thinkpad x1 carbon 4 windows 7 intel core i7 6600u 256gb nvme ssd 16gb 14 wqhd ips 2560x1...
2,2,a1997,w3302,lenovo thinkpad t460s windows 10 intel core i7 6600u 8gb 256gb ssd 14 ips fhd 1920x1080 ac wifi,lenovo,,lenovo thinkpad t460s windows 10 intel core i7 6600u 8gb 256gb ssd 14 ips fhd 1920x1080,lenovo thinkpad t460s windows 10 intel core i7 6600u 12gb 1tb ssd 14 ips fhd 1920x1080 ac wifi,lenovo,,lenovo thinkpad t460s windows 10 intel core i7 6600u 12gb 1tb ssd 14 ips fhd 1920x1080
3,3,a1356,w3150,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 1tb nvme ssd 16gb 14 wqhd ips 2560x14...,lenovo,,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 1tb nvme ssd 16gb 14 wqhd ips 2560x14...,lenovo thinkpad x1 carbon 4 windows 8.1 intel core i7 6600u 128gb ssd 16gb 14 wqhd ips 2560x1440...,lenovo,,lenovo thinkpad x1 carbon 4 windows 8.1 intel core i7 6600u 128gb ssd 16gb 14 wqhd ips 2560x1440...
4,4,a888,w3438,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 1tb ssd 16gb 14 wqhd ips 2560x1440 in...,lenovo,,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 1tb ssd 16gb 14 wqhd ips 2560x1440 in...,lenovo thinkpad x1 carbon 4 windows 8.1 intel core i7 6600u 2tb nvme ssd 16gb 14 wqhd ips 2560x1...,lenovo,,lenovo thinkpad x1 carbon 4 windows 8.1 intel core i7 6600u 2tb nvme ssd 16gb 14 wqhd ips 2560x1...
5,5,a1356,w3180,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 1tb nvme ssd 16gb 14 wqhd ips 2560x14...,lenovo,,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 1tb nvme ssd 16gb 14 wqhd ips 2560x14...,lenovo thinkpad x1 carbon 4 windows 8.1 intel core i7 6600u 256gb ssd 16gb 14 wqhd ips 2560x1440...,lenovo,,lenovo thinkpad x1 carbon 4 windows 8.1 intel core i7 6600u 256gb ssd 16gb 14 wqhd ips 2560x1440...
6,6,a1356,w3117,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 1tb nvme ssd 16gb 14 wqhd ips 2560x14...,lenovo,,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 1tb nvme ssd 16gb 14 wqhd ips 2560x14...,lenovo thinkpad x1 carbon 4 windows 7 intel core i7 6600u 128gb ssd 16gb 14 wqhd ips 2560x1440 i...,lenovo,,lenovo thinkpad x1 carbon 4 windows 7 intel core i7 6600u 128gb ssd 16gb 14 wqhd ips 2560x1440 i...
7,7,a888,w3221,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 1tb ssd 16gb 14 wqhd ips 2560x1440 in...,lenovo,,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 1tb ssd 16gb 14 wqhd ips 2560x1440 in...,lenovo thinkpad x1 carbon 4 windows 8.1 intel core i7 6600u 256gb nvme ssd 16gb 14 wqhd ips 2560...,lenovo,,lenovo thinkpad x1 carbon 4 windows 8.1 intel core i7 6600u 256gb nvme ssd 16gb 14 wqhd ips 2560...
8,8,a389,w3337,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 512gb ssd 16gb 14 wqhd ips 2560x1440 ...,lenovo,,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 512gb ssd 16gb 14 wqhd ips 2560x1440 ...,lenovo thinkpad x1 carbon 4 windows 7 intel core i7 6600u 1tb nvme ssd 16gb 14 wqhd ips 2560x144...,lenovo,,lenovo thinkpad x1 carbon 4 windows 7 intel core i7 6600u 1tb nvme ssd 16gb 14 wqhd ips 2560x144...
9,9,a389,w3221,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 512gb ssd 16gb 14 wqhd ips 2560x1440 ...,lenovo,,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 512gb ssd 16gb 14 wqhd ips 2560x1440 ...,lenovo thinkpad x1 carbon 4 windows 8.1 intel core i7 6600u 256gb nvme ssd 16gb 14 wqhd ips 2560...,lenovo,,lenovo thinkpad x1 carbon 4 windows 8.1 intel core i7 6600u 256gb nvme ssd 16gb 14 wqhd ips 2560...


Looking at the debug blocker's output, we observe that the initial blocker drops many potential matches.
Blocking on the model column alone is too restrictive, so we add a second blocker on extended_title.

In [4]:
# Create overlap blocker
ob = em.OverlapBlocker()

# Block tables using 'extended_title' attribute
C2 = ob.block_tables(A, B, 'extended_title', 'extended_title', 
                     l_output_attrs=['id','model','extended_title'], 
                     r_output_attrs=['id','model','extended_title'],
                     overlap_size=25,
                     q_val=5,
                     word_level=False,
                     show_progress=False,
                     n_jobs=-1
                    )
# Updated blocking sequence
# A, B ------ overlap blocker [model] ------------------> C1--|
#                                                             |----> C
# A, B ------ overlap blocker [extended_title] ---------> C2--|

C = em.combine_blocker_outputs_via_union([C1,C2])
print ('Reduction on AxB by applying the blocker:' + str(int((len(A)*len(B))/len(C))) + 'x')

Reduction on AxB by applying the blocker:42x
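
The union and the reduction factor used in the cell above can be sketched without the library. This is an illustration only: it assumes a candidate set can be represented as a set of (ltable_id, rtable_id) pairs, and the helper names and sample pairs are hypothetical.

```python
def union_candidates(*candidate_sets):
    """Union candidate sets given as iterables of (ltable_id, rtable_id) pairs."""
    combined = set()
    for pairs in candidate_sets:
        combined.update(pairs)
    return combined


def reduction_factor(n_left, n_right, n_candidates):
    """Integer factor by which blocking shrinks the cross product A x B."""
    return (n_left * n_right) // n_candidates


c1 = {("a1", "w1"), ("a2", "w7")}          # hypothetical pairs
c2 = {("a2", "w7"), ("a3", "w4")}          # hypothetical pairs
c = union_candidates(c1, c2)
print(len(c))                              # 3: the shared pair is counted once
print(reduction_factor(3, 10, len(c)))     # 30 // 3 == 10
```

A pair found by both blockers appears only once in the union, so the size of C is at most the sum of the sizes of C1 and C2.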


In [5]:
# Debug blocker output
dbg = em.debug_blocker(C, A, B, 
                        attr_corres=[
                            ('product_title','product_title'),
                            ('brand', 'brand'), 
                            ('model','model'),
                            ('extended_title','extended_title')],
                            verbose=True)
### Display first few tuple pairs from the debug_blocker's output
dbg

Unnamed: 0,_id,ltable_id,rtable_id,ltable_product_title,ltable_brand,ltable_model,ltable_extended_title,rtable_product_title,rtable_brand,rtable_model,rtable_extended_title
0,0,a1166,w4520,acer intel core i5 1.60 ghz 8 gb 256 gb ssd windows 10,acer,nx.gr7aa.007,acer intel core i5 1.60 ghz 8 gb 256 gb ssd windows 10 nx.gr7aa.007,refurbished acer amd fx series 3 ghz 8 gb 256 gb ssd windows 10 home,acer,nh.q2uaa.003,refurbished acer amd fx series 3 ghz 8 gb 256 gb ssd windows 10 nh.q2uaa.003
1,1,a2024,w2957,lenovo thinkpad t570 20h9000nus core i7 7600u 2.8 ghz win 10 8 gb 256 gb ssd 15.6 ips fhd,lenovo,20h9000nus,lenovo thinkpad t570 20h9000nus core i7 7600u 2.8 ghz win 10 8 gb 256 gb ssd 15.6 ips fhd unknown,lenovo thinkpad p51s 15.6 core i7 7500u 8 gb 256 gb ssd,lenovo,20hb000aus,lenovo thinkpad p51s 15.6 core i7 7500u 8 gb 256 gb ssd 20hb000aus windows 10
2,2,a2209,w644,hp 3fb86ut zbook x2 g4 tablet core i7 7600u 2.8 ghz win 10 16 gb 512 gb ssd,,3fb86ut,hp 3fb86ut zbook x2 g4 tablet core i7 7600u 2.8 ghz win 10 16 gb 512 gb ssd unknown,hp zbook 15u g4 15.6 core i7 7500u 16 gb 512 gb ssd,hp,1bs35ut#aba,hp zbook 15u g4 15.6 core i7 7500u 16 gb 512 gb ssd 1bs35ut#aba windows 10
3,3,a1731,w809,newest hp 15.6 hd intel pentium 8gb 1tb hdd writerhd windows 10,hp laptop,,newest hp 15.6 hd intel pentium 8gb 1tb hdd writerhd windows 10 laptop,2017 hp touchscreen 15.6 hd intel i3 7100u dual core 8gb 1tb hdd hdmi windows 10,hp,2017,2017 hp touchscreen 15.6 hd intel i3 7100u dual core 8gb 1tb hdd hdmi windows 10
4,4,a410,w2530,2017 hp pc 15.6 hd intel pentium quad core 8gb 500gb hdd hdmi windows 10,hp,t8tjg,2017 hp pc 15.6 hd intel pentium quad core 8gb 500gb hdd hdmi windows 10 t8tjg,2017 hp 17.3 hd wled backlight pc intel core i5 8gb ddr4 1tb hdd usb 3.1 hdmi rw windows 10,hp,,2017 hp 17.3 hd wled backlight pc intel core i5 8gb ddr4 1tb hdd usb 3.1 hdmi rw windows 10
5,5,a2544,w32,2017 lenovo flex 4 2 1 14 full hd touchscreen pc intel core i7 7500u 8gb 512gb ssd windows 10,lenovo,,2017 lenovo flex 4 2 1 14 full hd touchscreen pc intel core i7 7500u 8gb 512gb ssd windows 10,lenovo flex 5 2 1 core i5 8250u 8gb 128gb ssd 14 full hd touch windows 10,lenovo,81c9000cus,lenovo flex 5 2 1 core i5 8250u 8gb 128gb ssd 14 full hd touch windows 10 81c9000cus microsoft
6,6,a2144,w2300,acer aspire v nitro vn7 15.6 fhd touchscreen intel core i7 6500u 2.50 ghz nvidia geforce gtx 950...,acer,,acer aspire v nitro vn7 15.6 fhd touchscreen intel core i7 6500u 2.50 ghz nvidia geforce gtx 950...,acer 15.6 intel core i7 2.6ghz 16 gb 1 tb hdd 256 gb ssd windows 10 g9 593 72vt,acer,nh.q1caa.001,acer 15.6 intel core i7 2.6ghz 16 gb 1 tb hdd 256 gb ssd windows 10 g9 593 72vt nh.q1caa.001
7,7,a2144,w2632,acer aspire v nitro vn7 15.6 fhd touchscreen intel core i7 6500u 2.50 ghz nvidia geforce gtx 950...,acer,,acer aspire v nitro vn7 15.6 fhd touchscreen intel core i7 6500u 2.50 ghz nvidia geforce gtx 950...,refurbished acer 17.3 intel core i7 2.8 ghz 16 gb 1 tb hdd 256 gb ssd windows 10 home,acer,nh.q26aa.002,refurbished acer 17.3 intel core i7 2.8 ghz 16 gb 1 tb hdd 256 gb ssd windows 10 nh.q26aa.002
8,8,a2144,w1389,acer aspire v nitro vn7 15.6 fhd touchscreen intel core i7 6500u 2.50 ghz nvidia geforce gtx 950...,acer,,acer aspire v nitro vn7 15.6 fhd touchscreen intel core i7 6500u 2.50 ghz nvidia geforce gtx 950...,refurbished acer 17.3 intel core i7 2.8 ghz 16 gb 1 tb hdd 256 gb ssd windows 10 home,acer,nh.q1taa.001,refurbished acer 17.3 intel core i7 2.8 ghz 16 gb 1 tb hdd 256 gb ssd windows 10 nh.q1taa.001
9,9,a2841,w22,2018 hp envy x360 2 1 convertible 15.6 full hd touchscreen pc intel core i7 8550u quad core 12gb...,hp,,2018 hp envy x360 2 1 convertible 15.6 full hd touchscreen pc intel core i7 8550u quad core 12gb...,hp envy x360 15 aq165nr 15.6 touchscreen 2 1 windows 10 intel core i7 7500u dual core 8gb 1tb drive,hp,w2k50ua#aba,hp envy x360 15 aq165nr 15.6 touchscreen 2 1 windows 10 intel core i7 7500u dual core 8gb 1tb w2...


The blocker seems to have reduced the input size by 42x. However, on debugging the blocker we observe that it is still dropping potential matches, so we debug it further.

In [6]:
def match_extended_title(attribute='extended_title', q_val=3, threshold=.5, debug=False):
    # Returns a black-box blocking function; it returns True when a pair should be dropped
    def jaccard_matcher(ltuple, rtuple, attribute=attribute, q_val=q_val, threshold=threshold, debug=debug):
        # pad both strings so that leading and trailing characters form full q-grams
        buffer = '#' * (q_val-1)
        l_attribute = buffer + ltuple[attribute] + buffer
        r_attribute = buffer + rtuple[attribute] + buffer
        l_grams = set()
        r_grams = set()
        # create sets of grams (loop variable renamed so it no longer shadows `attribute`)
        for text, grams in [(l_attribute, l_grams), (r_attribute, r_grams)]:
            for i in range(0, len(text)-(q_val-1)):
                grams.add(text[i:i+q_val])

        # compute jaccard
        intersection = l_grams & r_grams
        union = l_grams | r_grams
        jaccard = len(intersection) / len(union)
        if debug:
            print(union)
            print(intersection)
            print(jaccard)
        # drop the pair when the similarity is below the threshold
        return jaccard < threshold

    return jaccard_matcher

bb = em.BlackBoxBlocker()
bb.set_black_box_function(match_extended_title())
C3 = bb.block_candset(C2,  
                     show_progress=False,
                     n_jobs=-1
                     )
# Updated blocking sequence
# A, B --- overlap blocker [model] ---------------------> C1--------------------------------|
#                                                                                     union |---> C
# A, B --- overlap blocker [extended_title] ---> C2---> jaccard blocker [extended_title] ---|
C = em.combine_blocker_outputs_via_union([C1,C3])
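
As a small worked example of the padded q-gram Jaccard score used by the black-box blocker, the following standalone sketch recomputes it for two toy strings. The function names are hypothetical restatements of the logic above, not library calls.

```python
def padded_qgrams(text, q):
    """Pad with q-1 '#' characters on each side, then collect q-grams."""
    pad = "#" * (q - 1)
    padded = pad + text + pad
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def qgram_jaccard(left, right, q=3):
    """Jaccard similarity of the two padded q-gram sets."""
    l_grams, r_grams = padded_qgrams(left, q), padded_qgrams(right, q)
    return len(l_grams & r_grams) / len(l_grams | r_grams)


score = qgram_jaccard("abc", "abd")   # 2 shared 3-grams out of 8 distinct
blocked = score < 0.5                 # same comparison as the blocker above
print(score, blocked)                 # 0.25 True
```

With q=3, "abc" and "abd" share only "##a" and "#ab", so the pair falls below the 0.5 threshold and would be dropped.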

In [7]:
# Debug again
dbg = em.debug_blocker(C, A, B)
dbg

Unnamed: 0,_id,ltable_id,rtable_id,ltable_product_title,ltable_brand,ltable_model,ltable_operating_system,ltable_extended_title,rtable_product_title,rtable_brand,rtable_model,rtable_operating_system,rtable_extended_title
0,0,a298,w3394,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,lenovo,20fb002lus,windows 7 professional,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 8gb 1tb nvme ssd 14 ips wqhd 2560x1440 ...,lenovo,,windows 7,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 8gb 1tb nvme ssd 14 ips wqhd 2560x1440
1,1,a298,w3458,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,lenovo,20fb002lus,windows 7 professional,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 20gb 2tb nvme ssd 14 ips wqhd 2560x1440...,lenovo,,windows 7,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 20gb 2tb nvme ssd 14 ips wqhd 2560x1440
2,2,a298,w3439,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,lenovo,20fb002lus,windows 7 professional,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 8gb 2tb nvme ssd 14 ips wqhd 2560x1440 ...,lenovo,,windows 7,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 8gb 2tb nvme ssd 14 ips wqhd 2560x1440
3,3,a298,w3327,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,lenovo,20fb002lus,windows 7 professional,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 20gb 256gb nvme ssd 14 ips wqhd 2560x14...,lenovo,,windows 7,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 20gb 256gb nvme ssd 14 ips wqhd 2560x1440
4,4,a298,w3400,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,lenovo,20fb002lus,windows 7 professional,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 12gb 1tb nvme ssd 14 ips wqhd 2560x1440...,lenovo,,windows 7,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 12gb 1tb nvme ssd 14 ips wqhd 2560x1440
5,5,a298,w3422,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,lenovo,20fb002lus,windows 7 professional,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 20gb 1tb nvme ssd 14 ips wqhd 2560x1440...,lenovo,,windows 7,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 20gb 1tb nvme ssd 14 ips wqhd 2560x1440
6,6,a298,w3443,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,lenovo,20fb002lus,windows 7 professional,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 12gb 2tb nvme ssd 14 ips wqhd 2560x1440...,lenovo,,windows 7,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 12gb 2tb nvme ssd 14 ips wqhd 2560x1440
7,7,a298,w3250,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,lenovo,20fb002lus,windows 7 professional,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 12gb 256gb nvme ssd 14 ips wqhd 2560x14...,lenovo,,windows 7,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 12gb 256gb nvme ssd 14 ips wqhd 2560x1440
8,8,a298,w3015,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,lenovo,20fb002lus,windows 7 professional,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,2017 lenovo thinkpad x1 carbon 5th gen windows 7 intel core i7 7600u 180gb ssd 8gb 14 wqhd ips 2...,lenovo,2017,windows 7,2017 lenovo thinkpad x1 carbon 5th gen windows 7 intel core i7 7600u 180gb ssd 8gb 14 wqhd ips 2...
9,9,a298,w3014,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,lenovo,20fb002lus,windows 7 professional,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,2017 lenovo thinkpad x1 carbon 5th gen windows 7 intel core i7 7600u 180gb ssd 8gb 14 wqhd ips 2...,lenovo,2017,windows 7,2017 lenovo thinkpad x1 carbon 5th gen windows 7 intel core i7 7600u 180gb ssd 8gb 14 wqhd ips 2...


Looking at the debug output, we observe that the current blocking sequence does not seem to drop many potential matches and has also reduced the candidate set to a good extent. Thus, we stop the blocking step and proceed with the matching step.

# Step 3: Reading in the labeled sample

We randomly sample 450 tuple pairs for labeling and write the sample to a CSV file. We then labeled that file manually and use the labeled version from here on.

In [8]:
# Sample the candidate set
S = em.sample_table(C, 450)
# em.to_csv_metadata(S, 'data/labeled.csv')

In [9]:
# Load the saved labeled data
path_S = 'data/labeled.csv'
S = em.read_csv_metadata(path_S, 
                         key='_id',
                         ltable=A, rtable=B, 
                         fk_ltable='ltable_id', fk_rtable='rtable_id')

#showing some examples of labeled sample
S.head(5)

Unnamed: 0,_id,ltable_id,rtable_id,ltable_model,ltable_extended_title,rtable_model,rtable_extended_title,match
0,5,a0,w2300,nh.q28aa.001,acer predator helios 300 15.6 full hd intel core i7 7700hq 16gb ddr4 256gb ssd geforce gtx 1060 ...,nh.q1caa.001,acer 15.6 intel core i7 2.6ghz 16 gb 1 tb hdd 256 gb ssd windows 10 g9 593 72vt nh.q1caa.001,1
1,26,a1011,w685,t8tjg,2018 lenovo thinkpad 11e 4h 11.6 intel i3 7100u 128gb m.2 ssd 4gb ddr4 802.11ac win10 t8tjg windows,t8tjg,dell chromebook 11 3189 11.6 celeron n3060 4 gb 64 gb ssd t8tjg os,0
2,30,a102,w1746,c300sa dh02,asus chromebook c300sa compact 13.3 intel celeron 4gb 16gb emmc asus c300sa dh02,c300sa dh02,c300sa dh02 cel n3060 4gb 16gb 13.3in chrome asus chrome,1
3,44,a1031,w4581,a515 51 596k,acer 15.6 intel core i5 3.40ghz 8gb 256gb ssd windows 10 a515 51 596k home,nx.gnpaa.016,refurbished acer aspire 3 intel core i5 2.5 ghz 8gb 256 gb ssd windows 10 nx.gnpaa.016,1
4,65,a1047,w1400,n850hp6,prostar clevo n850hp6 15.6” fhd ips 1920x1080 intel core i7 7700hq 16gb ddr4 gtx 1060 120gb ssd ...,n855hj,prostar clevo n855hj 15.6” full hd 1920x1080 intel core i7 7700hq 8gb ddr4 gtx 1050 1tb hdd wind...,1


# Step 4: Splitting labeled data into development and evaluation set

In this step, we split the labeled data set S into a development set I and an evaluation set J.

In [10]:
IJ = em.split_train_test(S, train_proportion=0.7, random_state=0)
I = IJ['train']
J = IJ['test']
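
The 70/30 split can be illustrated with the standard library alone. This is a sketch under assumptions: the helper name is hypothetical, the integer IDs stand in for the 450 labeled pairs, and the shuffle is not claimed to reproduce the exact rows that em.split_train_test selects.

```python
import random


def split_dev_eval(rows, train_proportion=0.7, seed=0):
    """Shuffle a copy of `rows` and cut it at the requested proportion."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_proportion)
    return shuffled[:cut], shuffled[cut:]


labeled_ids = list(range(450))            # stand-in for the 450 labeled pairs
dev, evaluation = split_dev_eval(labeled_ids)
print(len(dev), len(evaluation))          # 315 135
```

The 135 evaluation pairs agree with the counts reported in Step 7, where each matcher's positive and negative predictions sum to 135.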

# Step 5: Creating a set of ML-matchers.

In [11]:
# Create a set of ML-matchers
dt = em.DTMatcher(name='DecisionTree', random_state=0)
svm = em.SVMMatcher(name='SVM', random_state=0)
rf = em.RFMatcher(name='RF', random_state=0)
lg = em.LogRegMatcher(name='LogReg', random_state=0)
ln = em.LinRegMatcher(name='LinReg')
nb = em.NBMatcher(name='NaiveBayes')

# Step 6: Selecting the best matcher using I.

This step includes:
1. Creating a feature table F.
2. Converting I into a set H of feature vectors (using the features in F).
3. Selecting the best matcher in the first iteration using cross-validation.
4. Debugging the matcher.
5. Selecting the best matcher again using cross-validation.

Steps 4 and 5 are repeated for each debug iteration.

## Initial iteration:
### a. Creating a feature table F.

In [12]:
match_t = em.get_tokenizers_for_matching([2,3,5])
match_s = em.get_sim_funs_for_matching()
atypes1 = em.get_attr_types(A)
atypes2 = em.get_attr_types(B)
match_c = em.get_attr_corres(A, B)
feature_table = em.get_features(A, B, atypes1, atypes2, match_c, match_t, match_s)

### b. Converting I into a set H of feature vectors (using the features in F).

In [13]:
# Convert the I into a set of feature vectors using F
H = em.extract_feature_vecs(I, 
                            feature_table=feature_table, 
                            attrs_after='match',
                            show_progress=False)
#Filling in the missing values if any in H.
H.fillna(0, inplace=True)

### c. Selecting the best matcher in the first iteration using cross-validation

In [14]:
# Select the best ML matcher using CV
result = em.select_matcher(
        matchers=[dt, rf, svm, ln, lg, nb], 
        table=H, 
        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'match'],
        k=5,
        target_attr='match', 
        metric_to_select_matcher='precision', 
        random_state=0)
# result
result['cv_stats']

|   | Matcher | Average precision | Average recall | Average f1 |
|---|---|---|---|---|
| 0 | DecisionTree | 0.931906 | 0.849995 | 0.888477 |
| 1 | RF | 0.923791 | 0.940841 | 0.931533 |
| 2 | SVM | 0.872856 | 0.970963 | 0.91878 |
| 3 | LinReg | 0.894443 | 0.935375 | 0.913853 |
| 4 | LogReg | 0.888296 | 0.932618 | 0.908653 |
| 5 | NaiveBayes | 0.919928 | 0.773327 | 0.839341 |
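
The choice of metric_to_select_matcher determines which matcher is picked. The sketch below replays the cross-validation averages reported above and selects the top matcher for each metric. It only illustrates the selection rule and is not how em.select_matcher is implemented; the helper and dictionary names are hypothetical.

```python
cv_stats = {
    # matcher: (average precision, average recall, average F1), copied from the table above
    "DecisionTree": (0.931906, 0.849995, 0.888477),
    "RF":           (0.923791, 0.940841, 0.931533),
    "SVM":          (0.872856, 0.970963, 0.918780),
    "LinReg":       (0.894443, 0.935375, 0.913853),
    "LogReg":       (0.888296, 0.932618, 0.908653),
    "NaiveBayes":   (0.919928, 0.773327, 0.839341),
}
METRIC_INDEX = {"precision": 0, "recall": 1, "f1": 2}


def best_matcher(stats, metric):
    """Return the matcher name with the highest average score for `metric`."""
    idx = METRIC_INDEX[metric]
    return max(stats, key=lambda name: stats[name][idx])


print(best_matcher(cv_stats, "precision"))  # DecisionTree
print(best_matcher(cv_stats, "f1"))         # RF
print(best_matcher(cv_stats, "recall"))     # SVM
```

Selecting on precision favors the decision tree in this iteration even though its recall is among the lowest, which is worth keeping in mind when reading the later iterations.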


### d. Debugging matcher

In [15]:
#  Split feature vectors into train and test
UV = em.split_train_test(H, train_proportion=0.5)
U = UV['train']
V = UV['test']

In [16]:
# Debug the selected matcher (the decision tree, per the cross-validation table above) using the GUI
#em.vis_debug_dt(result['selected_matcher'], U, V, 
#                exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'match'],
#                target_attr='match')

Generating features from non-attribute columns such as the row IDs resulted in false positive matches, because IDs can be highly similar even when the products differ. Some dirty attributes also had erroneous values, which resulted in false negatives.

## Iteration 2:
### a. Creating a feature table F.

In [17]:
# Drop the attributes that produced bad auto-generated features and try different tokenizers for matching
AA = A.drop(['id','product_title','model','operating_system'],axis=1)
BB = B.drop(['id','product_title','model','operating_system'],axis=1)

match_t = em.get_tokenizers_for_matching([3,5,10])
match_s = em.get_sim_funs_for_matching()
atypes1 = em.get_attr_types(AA)
atypes2 = em.get_attr_types(BB)
match_c = em.get_attr_corres(AA, BB)
feature_table = em.get_features(AA, BB, atypes1, atypes2, match_c, match_t, match_s)

### b. Converting I into a set H of feature vectors (using the features in F).

In [18]:
# Convert the I into a set of feature vectors using F
H = em.extract_feature_vecs(I, 
                            feature_table=feature_table, 
                            attrs_after='match',
                            show_progress=False)
#Filling in the missing values if any in H.
H.fillna(0, inplace=True)

### c. Selecting the best matcher using cross-validation

In [19]:
# Select the best ML matcher using CV
result = em.select_matcher(
        matchers=[dt, rf, svm, ln, lg, nb], 
        table=H, 
        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'match'],
        k=5,
        target_attr='match', 
        metric_to_select_matcher='precision', 
        random_state=0)
# result
result['cv_stats']

|   | Matcher | Average precision | Average recall | Average f1 |
|---|---|---|---|---|
| 0 | DecisionTree | 0.910556 | 0.912291 | 0.911003 |
| 1 | RF | 0.908812 | 0.929106 | 0.917972 |
| 2 | SVM | 0.827146 | 0.988857 | 0.89962 |
| 3 | LinReg | 0.898446 | 0.973204 | 0.933452 |
| 4 | LogReg | 0.88414 | 0.988857 | 0.932342 |
| 5 | NaiveBayes | 0.91127 | 0.927804 | 0.918882 |


### d. Debugging matcher

In [20]:
#  Split feature vectors into train and test
UV = em.split_train_test(H, train_proportion=0.5)
U = UV['train']
V = UV['test']
result['drill_down_cv_stats']['precision']['Matcher'][0]

<py_entitymatching.matcher.dtmatcher.DTMatcher at 0x7fb0eca32be0>

In [21]:
# Debug the decision tree matcher using the GUI
#em.vis_debug_dt(result['drill_down_cv_stats']['precision']['Matcher'][0], U, V, 
#                exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'match'],
#                target_attr='match')

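Compared with the initial iteration, this iteration improved the recall and F1 score of the decision tree matcher (recall from 0.85 to 0.91, F1 from 0.89 to 0.91), so we debug further.
A lot of useful information still seems to be lost because the data is dirty.
We therefore try adding more features using custom functions.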

## Iteration 3:
### a. Creating a feature table F.

In [22]:
# Drop the attributes that produced bad auto-generated features and try different tokenizers for matching
AA = A.drop(['id','product_title','model','operating_system'],axis=1)
BB = B.drop(['id','product_title','model','operating_system'],axis=1)
match_t = em.get_tokenizers_for_matching([3,5,10])
match_s = em.get_sim_funs_for_matching()
atypes1 = em.get_attr_types(AA)
atypes2 = em.get_attr_types(BB)
match_c = em.get_attr_corres(AA, BB)
feature_table = em.get_features(AA, BB, atypes1, atypes2, match_c, match_t, match_s)

The function below builds a feature function that returns 1 if both tuples, or neither of them, contain any of the given values in the chosen attribute, and 0 otherwise. It returns the feature function together with a feature name.

In [23]:
def generateContainsValueFeature(values, name=None, attribute='extended_title'):
    if type(values) is str:
        values = [values]
    def containsValueFeature(a,b):
        return int(any([value.lower() in a[attribute].lower() for value in values]) 
                   == any([value.lower() in b[attribute].lower() for value in values]))
    return containsValueFeature, name if name else values[0]
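
To show how such a feature behaves on concrete rows, here is a self-contained restatement with a short usage example. The snake_case name and the three sample rows are hypothetical; plain dictionaries stand in for the tuples that py_entitymatching passes to a black-box feature.

```python
def make_contains_feature(values, attribute="extended_title"):
    """Return a feature: 1 if both tuples or neither contain any of `values`."""
    if isinstance(values, str):
        values = [values]
    lowered = [v.lower() for v in values]

    def feature(ltuple, rtuple):
        in_left = any(v in ltuple[attribute].lower() for v in lowered)
        in_right = any(v in rtuple[attribute].lower() for v in lowered)
        return int(in_left == in_right)

    return feature


has_touch = make_contains_feature("touch")
a = {"extended_title": "2017 lenovo flex 4 14 full hd touchscreen"}   # hypothetical rows
b = {"extended_title": "lenovo flex 5 14 full hd touch windows 10"}
c = {"extended_title": "acer aspire 15.6 intel core i5"}
print(has_touch(a, b), has_touch(a, c), has_touch(c, c))   # 1 0 1
```

Because the check is a substring test, "touch" also matches "touchscreen". The same property means that a short keyword such as "hp" matches inside a longer token like the model number "n850hp6", which can make the feature noisier than intended.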

Use this to generate many new features.

In [24]:
brands = ['lg','toshiba','hp','dell','lenovo','prostar','acer','samsung','apple','asus','panasonic','msi']
models = ['zephyrus','zenbook','flex','omen','xps','x1','carbon','yoga',
          'latitude','inspiron','elitebook','clevo','spectre',
          'macbook','pavilion','ideapad','legion']
thinkpads = [['p40','p50','p51','p71'],['430','460','470','560','570']]
acer = ['swift','aspire','spin']  # Acer product lines
models = models + thinkpads + acer
sizes = [' 13',' 14',' 15',' 17']
operating_systems = ['chrome','windows','mac']
cpus = [['i3','i5','i7'],['celeron','pentium'],['m3','m5'],'amd']
miscellaneous = ['2-in-1','gtx','touch']
keywords = models \
    + sizes  \
    + operating_systems \
    + cpus \
    + miscellaneous
new_features = [generateContainsValueFeature(value) for value in keywords]

for feature in new_features:
    em.add_blackbox_feature(feature_table, feature[1], feature[0])

The function below builds a feature function that returns 1 if at least one value from the given value set occurs in both tuples, and 0 otherwise.

In [25]:
def generateContainsValueFromValuesetFeature(valueset, name=None, attribute='extended_title'):
    # valueset is a (name, values) pair; each entry of values is a string or a list of strings
    def sharesValue(a,b,values):
        if type(values) is str:
            values = [values]
        return int(any([value.lower() in a[attribute].lower() and value.lower() in b[attribute].lower() for value in values]))
    def containsValueFromValueset(a,b):
        # cast to int so the feature is 1/0 rather than True/False
        return int(any([sharesValue(a,b,values) for values in valueset[1]]))
    return containsValueFromValueset, valueset[0]
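
A short usage example may clarify what "the same value" means here, in particular for grouped entries such as the CPU list used in this notebook. The function below is a hypothetical standalone restatement of the logic above, and the three rows are made up.

```python
def make_shared_value_feature(groups, attribute="extended_title"):
    """Return a feature: 1 if some single value occurs in both tuples, else 0."""
    def feature(ltuple, rtuple):
        left = ltuple[attribute].lower()
        right = rtuple[attribute].lower()
        for group in groups:
            values = [group] if isinstance(group, str) else group
            if any(v.lower() in left and v.lower() in right for v in values):
                return 1
        return 0

    return feature


cpus = [["i3", "i5", "i7"], ["celeron", "pentium"], ["m3", "m5"], "amd"]
same_cpu = make_shared_value_feature(cpus)
x = {"extended_title": "hp 15.6 hd intel core i5 8gb"}        # hypothetical rows
y = {"extended_title": "hp 17.3 hd intel core i5 16gb"}
z = {"extended_title": "hp 15.6 hd intel core i7 8gb"}
print(same_cpu(x, y), same_cpu(x, z))   # 1 0
```

Note that two titles from the same group (i5 and i7) do not count as sharing a value: the feature fires only when the identical keyword appears on both sides.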

In [26]:
valuesets = [('brands',brands), 
             ('models', models), 
             ('sizes', sizes), 
             ('cpus', cpus), 
             ('operating_systems',operating_systems)]

new_features = [generateContainsValueFromValuesetFeature(valueset) for valueset in valuesets]

for feature in new_features:
    em.add_blackbox_feature(feature_table, feature[1], feature[0])

### b. Converting I into a set H of feature vectors (using the features in F).

In [27]:
# Convert the I into a set of feature vectors using F
H = em.extract_feature_vecs(I, 
                            feature_table=feature_table, 
                            attrs_after='match',
                            show_progress=False)
#Filling in the missing values if any in H.
H.fillna(0, inplace=True)

### c. Selecting the best matcher using cross-validation

In [28]:
# Select the best ML matcher using CV
result = em.select_matcher(
        matchers=[dt, rf, svm, ln, lg, nb], 
        table=H, 
        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'match'],
        k=5,
        target_attr='match', 
        metric_to_select_matcher='precision', 
        random_state=0)
# result
result['cv_stats']

|   | Matcher | Average precision | Average recall | Average f1 |
|---|---|---|---|---|
| 0 | DecisionTree | 0.954048 | 0.93798 | 0.945281 |
| 1 | RF | 0.963564 | 0.972 | 0.967336 |
| 2 | SVM | 0.893285 | 0.992 | 0.939676 |
| 3 | LinReg | 0.962487 | 0.952694 | 0.957178 |
| 4 | LogReg | 0.913782 | 0.96 | 0.935088 |
| 5 | NaiveBayes | 0.946487 | 0.94223 | 0.943907 |


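Compared with iteration 2, this iteration improved precision, recall, and F1 for the tree-based matchers (for RF: precision from 0.91 to 0.96, recall from 0.93 to 0.97, F1 from 0.92 to 0.97). We stop debugging here.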

# Step 7: Evaluating the best matcher Y using J. 

This step includes:
1. Converting J into a set L of feature vectors.
2. Filling in the missing values if any in L.
3. Training the best matcher Y using I.
4. Computing the precision, recall, and F1 of Y on J.


## a. Converting J into a set L of feature vectors.

As before, we convert to the feature vectors (using the feature table and the evaluation set)

In [29]:
# Convert J into a set of feature vectors using feature table
L = em.extract_feature_vecs(
        J, 
        feature_table=feature_table,
        attrs_after='match', 
        show_progress=False)

## b. Filling in the missing values if any in L.

In [30]:
#Filling in the missing values if any in L.
L.fillna(0, inplace=True)

## c. Predicting the matches

### For each of the six learning methods, we train the matcher based on that method on I, and then report its precision/recall/F-1 on J. 

In [31]:
for i in range (0,6):
    learning_method = result['drill_down_cv_stats']['precision']['Matcher'][i]
    
    # we train the matcher based on that method on I
    learning_method.fit(table=H, 
           exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'match'], 
           target_attr='match')

    # we then report its precision/recall/F-1 on J.
    predictions = learning_method.predict(
            table=L, 
            exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'match'], 
            append=True, 
            target_attr='predicted', 
            inplace=False)

    # Evaluate the predictions
    eval_result = em.eval_matches(predictions, 'match', 'predicted')
    print('Learning method:' + result['drill_down_cv_stats']['precision']['Name'][i])
    em.print_eval_summary(eval_result)
    print('')

Learning method:DecisionTree
Precision : 93.62% (88/94)
Recall : 94.62% (88/93)
F1 : 94.12%
False positives : 6 (out of 94 positive predictions)
False negatives : 5 (out of 41 negative predictions)

Learning method:RF
Precision : 94.51% (86/91)
Recall : 92.47% (86/93)
F1 : 93.48%
False positives : 5 (out of 91 positive predictions)
False negatives : 7 (out of 44 negative predictions)

Learning method:SVM
Precision : 80.56% (87/108)
Recall : 93.55% (87/93)
F1 : 86.57%
False positives : 21 (out of 108 positive predictions)
False negatives : 6 (out of 27 negative predictions)

Learning method:LinReg
Precision : 93.68% (89/95)
Recall : 95.7% (89/93)
F1 : 94.68%
False positives : 6 (out of 95 positive predictions)
False negatives : 4 (out of 40 negative predictions)

Learning method:LogReg
Precision : 88.78% (87/98)
Recall : 93.55% (87/93)
F1 : 91.1%
False positives : 11 (out of 98 positive predictions)
False negatives : 6 (out of 37 negative predictions)

Learning method:NaiveBayes
Precisi

### For the final best matcher Y selected, train it on I, then report its precision/recall/F-1 on J.

In [32]:
# Train best matcher on I 
print ('Final best matcher Y:')
result['selected_matcher'].fit(table=H, 
       exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'match'], 
       target_attr='match')

# Predict on L 
predictions = result['selected_matcher'].predict(
        table=L, 
        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'match'], 
        append=True, 
        target_attr='predicted', 
        inplace=False)

# Evaluate the predictions
eval_result = em.eval_matches(predictions, 'match', 'predicted')
em.print_eval_summary(eval_result)

Final best matcher Y:
Precision : 94.51% (86/91)
Recall : 92.47% (86/93)
F1 : 93.48%
False positives : 5 (out of 91 positive predictions)
False negatives : 7 (out of 44 negative predictions)
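
The reported percentages follow directly from the confusion counts. As a check, this small sketch recomputes them for the final matcher from the numbers printed above (86 true positives, 5 false positives, 7 false negatives); the helper name is hypothetical.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts reported above for the final matcher.
p, r, f1 = precision_recall_f1(86, 5, 7)
print(f"{p:.2%} {r:.2%} {f1:.2%}")   # 94.51% 92.47% 93.48%
```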
