# Entity matching (EM) for Laptops

This Jupyter notebook contains the commands used for each step of the entity matching process for laptop products from Amazon and Walmart.
We used the Basic EM workflow 3 as our guide for this process.

# Step 1: Reading in the input tables A, B.

In [4]:
import sys
import py_entitymatching as em
import pandas as pd
import os

path_A = 'data/amazon_products.csv'
path_B = 'data/walmart_products.csv'

# Load the csv files as dataframes and set the key attribute in the dataframe
A = em.read_csv_metadata(path_A, key='id')
B = em.read_csv_metadata(path_B, key='id')
print('len(A):' + str(len(A)))
print('len(B):' + str(len(B)))
print('len (A X B):' + str(len(A)*len(B)))

len(A):3000
len(B):4847
len (A X B):14541000


# Step 2: Block tables to get candidate set

In this step, we apply a blocking sequence on the input tables to get the candidate set C.

In [5]:
# Create an initial blocker
ob = em.OverlapBlocker()

# Use block_tables to apply blocking over two input tables.
C1 = ob.block_tables(A,B, 'model', 'model', 
                     l_output_attrs=['id','model','extended_title'], 
                     r_output_attrs=['id','model','extended_title'],
                     overlap_size=8,
                     q_val=5,
                     word_level=False,
                     show_progress=False,
                     n_jobs=-1
                     )
len(C1)

2594
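To make the blocking criterion concrete, here is a minimal pure-Python sketch of q-gram overlap blocking. It is not the library's implementation: the helper names `qgrams` and `survives_overlap_blocking` are made up for illustration, and the sketch ignores details such as padding or stop-word handling that `OverlapBlocker` may apply. With `q_val=5` and `overlap_size=8`, a pair is kept only if the two strings share at least eight distinct 5-grams.

```python
def qgrams(text, q):
    """Return the set of character q-grams of `text` (no padding; an assumption)."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def survives_overlap_blocking(left, right, q=5, overlap_size=8):
    """Keep a pair only if the strings share at least `overlap_size` q-grams."""
    shared = qgrams(left, q) & qgrams(right, q)
    return len(shared) >= overlap_size


# Model strings taken from the tables in this notebook
print(survives_overlap_blocking('nx.gr7aa.007', 'nx.gr7aa.007'))  # True
print(survives_overlap_blocking('nx.gr7aa.007', 'nh.q2uaa.003'))  # False
print(survives_overlap_blocking('t8tjg', 't8tjg'))                # False
```

In this unpadded sketch a short model string such as `t8tjg` yields a single 5-gram, so it can never reach an overlap of 8, even against itself.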

## Debugging blocker output

In [49]:
# Debug blocker output
dbg = em.debug_blocker(C1, A, B, 
                        attr_corres=[
                            ('product_title','product_title'),
                            ('brand', 'brand'), 
                            ('model','model'),
                            ('extended_title','extended_title')],
                            verbose=True)
### Display first few tuple pairs from the debug_blocker's output
dbg

Unnamed: 0,_id,ltable_id,rtable_id,ltable_product_title,ltable_brand,ltable_model,ltable_extended_title,rtable_product_title,rtable_brand,rtable_model,rtable_extended_title
0,0,a389,w3337,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 512gb ssd 16gb 14 wqhd ips 2560x1440 ...,lenovo,,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 512gb ssd 16gb 14 wqhd ips 2560x1440 ...,lenovo thinkpad x1 carbon 4 windows 7 intel core i7 6600u 1tb nvme ssd 16gb 14 wqhd ips 2560x144...,lenovo,,lenovo thinkpad x1 carbon 4 windows 7 intel core i7 6600u 1tb nvme ssd 16gb 14 wqhd ips 2560x144...
1,1,a389,w3221,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 512gb ssd 16gb 14 wqhd ips 2560x1440 ...,lenovo,,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 512gb ssd 16gb 14 wqhd ips 2560x1440 ...,lenovo thinkpad x1 carbon 4 windows 8.1 intel core i7 6600u 256gb nvme ssd 16gb 14 wqhd ips 2560...,lenovo,,lenovo thinkpad x1 carbon 4 windows 8.1 intel core i7 6600u 256gb nvme ssd 16gb 14 wqhd ips 2560...
2,2,a389,w3226,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 512gb ssd 16gb 14 wqhd ips 2560x1440 ...,lenovo,,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 512gb ssd 16gb 14 wqhd ips 2560x1440 ...,lenovo thinkpad x1 carbon 4 windows 7 intel core i7 6600u 256gb nvme ssd 16gb 14 wqhd ips 2560x1...,lenovo,,lenovo thinkpad x1 carbon 4 windows 7 intel core i7 6600u 256gb nvme ssd 16gb 14 wqhd ips 2560x1...
3,3,a1838,w3122,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 1tb ssd 16gb 14 fhd ips 1920x1080 int...,lenovo,,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 1tb ssd 16gb 14 fhd ips 1920x1080 int...,lenovo thinkpad x1 carbon 4 windows 8.1 intel core i7 6600u 512gb nvme ssd 16gb 14 fhd ips 1920x...,lenovo,,lenovo thinkpad x1 carbon 4 windows 8.1 intel core i7 6600u 512gb nvme ssd 16gb 14 fhd ips 1920x...
4,4,a1356,w3178,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 1tb nvme ssd 16gb 14 wqhd ips 2560x14...,lenovo,,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 1tb nvme ssd 16gb 14 wqhd ips 2560x14...,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 1tb ssd 16gb 14 fhd ips 1920x1080 int...,lenovo,,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 1tb ssd 16gb 14 fhd ips 1920x1080 int...
5,5,a1356,w3150,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 1tb nvme ssd 16gb 14 wqhd ips 2560x14...,lenovo,,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 1tb nvme ssd 16gb 14 wqhd ips 2560x14...,lenovo thinkpad x1 carbon 4 windows 8.1 intel core i7 6600u 128gb ssd 16gb 14 wqhd ips 2560x1440...,lenovo,,lenovo thinkpad x1 carbon 4 windows 8.1 intel core i7 6600u 128gb ssd 16gb 14 wqhd ips 2560x1440...
6,6,a389,w3335,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 512gb ssd 16gb 14 wqhd ips 2560x1440 ...,lenovo,,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 512gb ssd 16gb 14 wqhd ips 2560x1440 ...,lenovo thinkpad x1 carbon 4 windows 8.1 intel core i7 6600u 1tb nvme ssd 16gb 14 wqhd ips 2560x1...,lenovo,,lenovo thinkpad x1 carbon 4 windows 8.1 intel core i7 6600u 1tb nvme ssd 16gb 14 wqhd ips 2560x1...
7,7,a1838,w3078,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 1tb ssd 16gb 14 fhd ips 1920x1080 int...,lenovo,,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 1tb ssd 16gb 14 fhd ips 1920x1080 int...,lenovo thinkpad x1 carbon 4 windows 8.1 intel core i7 6600u 256gb nvme ssd 16gb 14 fhd ips 1920x...,lenovo,,lenovo thinkpad x1 carbon 4 windows 8.1 intel core i7 6600u 256gb nvme ssd 16gb 14 fhd ips 1920x...
8,8,a1356,w3117,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 1tb nvme ssd 16gb 14 wqhd ips 2560x14...,lenovo,,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 1tb nvme ssd 16gb 14 wqhd ips 2560x14...,lenovo thinkpad x1 carbon 4 windows 7 intel core i7 6600u 128gb ssd 16gb 14 wqhd ips 2560x1440 i...,lenovo,,lenovo thinkpad x1 carbon 4 windows 7 intel core i7 6600u 128gb ssd 16gb 14 wqhd ips 2560x1440 i...
9,9,a1356,w3179,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 1tb nvme ssd 16gb 14 wqhd ips 2560x14...,lenovo,,lenovo thinkpad x1 carbon 4 windows 10 intel core i7 6600u 1tb nvme ssd 16gb 14 wqhd ips 2560x14...,lenovo thinkpad x1 carbon 4 windows 7 intel core i7 6600u 256gb ssd 16gb 14 wqhd ips 2560x1440 i...,lenovo,,lenovo thinkpad x1 carbon 4 windows 7 intel core i7 6600u 256gb ssd 16gb 14 wqhd ips 2560x1440 i...


Looking at the debug blocker's output, we observe that the initial blocker is dropping a lot of potential matches. Many of the dropped pairs have an empty model value on both sides.
Blocking on the model column alone is therefore not sufficient.

In [50]:
# Create overlap blocker
ob = em.OverlapBlocker()

# Block tables using 'extended_title' attribute
C2 = ob.block_tables(A, B, 'extended_title', 'extended_title', 
                     l_output_attrs=['id','model','extended_title'], 
                     r_output_attrs=['id','model','extended_title'],
                     overlap_size=25,
                     q_val=5,
                     word_level=False,
                     show_progress=False,
                     n_jobs=-1
                    )
# Updated blocking sequence
# A, B ------ overlap blocker [model] ------------------> C1--|
#                                                             |----> C
# A, B ------ overlap blocker [extended_title] ---------> C2--|

C = em.combine_blocker_outputs_via_union([C1,C2])
print ('Reduction on AxB by applying the blocker:' + str(int((len(A)*len(B))/len(C))) + 'x')

Reduction on AxB by applying the blocker:45x
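As a rough illustration of what the union and the reduction factor mean, the sketch below treats each candidate set as a set of `(ltable_id, rtable_id)` pairs. The helper names and the toy ids are hypothetical and are not part of *py_entitymatching*; the real function works on DataFrames with metadata.

```python
def union_candidates(*candidate_sets):
    """Union several collections of (ltable_id, rtable_id) pairs, dropping duplicates."""
    combined = set()
    for pairs in candidate_sets:
        combined |= set(pairs)
    return combined


def reduction_factor(n_left, n_right, n_candidates):
    """How many times smaller the candidate set is than the full cross product."""
    return (n_left * n_right) // n_candidates


# Toy candidate sets (made-up ids, for illustration only)
c1 = [('a1', 'w1'), ('a2', 'w2')]
c2 = [('a2', 'w2'), ('a3', 'w3')]
c = union_candidates(c1, c2)

print(len(c))                            # 3: the shared pair is counted once
print(reduction_factor(10, 20, len(c)))  # 66
```

A pair survives the combined blocker if at least one of the individual blockers keeps it, so the union can only add candidates and never removes any.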


In [51]:
# Debug blocker output
dbg = em.debug_blocker(C, A, B, 
                        attr_corres=[
                            ('product_title','product_title'),
                            ('brand', 'brand'), 
                            ('model','model'),
                            ('extended_title','extended_title')],
                            verbose=True)
### Display first few tuple pairs from the debug_blocker's output
dbg

Unnamed: 0,_id,ltable_id,rtable_id,ltable_product_title,ltable_brand,ltable_model,ltable_extended_title,rtable_product_title,rtable_brand,rtable_model,rtable_extended_title
0,0,a166,w1725,hp chromebook t4m32ut#aba 14 intel celeron 4 gb 16 gb ssd chrome black,hp,t4m32ut#aba,hp chromebook t4m32ut#aba 14 intel celeron 4 gb 16 gb ssd chrome,hp chromebook 11 g5 11.6 celeron n3060 4 gb 16 gb ssd,hp,x9u02ut#aba,hp chromebook 11 g5 11.6 celeron n3060 4 gb 16 gb ssd x9u02ut#aba os
1,1,a820,w4318,hp 15 amd a9 9420 apu 8gb 1tb windows 10 15 bw030nr,hp,15 bw030nr,hp 15 amd a9 9420 apu 8gb 1tb windows 10 15 bw030nr,manufacturer hp 15 bw070nr 15.6 amd a9 9420 3.0ghz 4gb 1tb windows 10,hp,15 bw070nr rf,manufacturer hp 15 bw070nr 15.6 amd a9 9420 3.0ghz 4gb 1tb windows 10 15 bw070nr rf microsoft
2,2,a1166,w4520,acer intel core i5 1.60 ghz 8 gb 256 gb ssd windows 10,acer,nx.gr7aa.007,acer intel core i5 1.60 ghz 8 gb 256 gb ssd windows 10 nx.gr7aa.007,refurbished acer amd fx series 3 ghz 8 gb 256 gb ssd windows 10 home,acer,nh.q2uaa.003,refurbished acer amd fx series 3 ghz 8 gb 256 gb ssd windows 10 nh.q2uaa.003
3,3,a1739,w526,lenovo thinkpad p71 17.3 core i7 7700hq 8 gb 256 gb ssd,lenovo,20hk001jus,lenovo thinkpad p71 17.3 core i7 7700hq 8 gb 256 gb ssd lenovo 20hk001jus not applicable,lenovo thinkpad x1 tablet 12 core i7 7y75 8 gb 256 gb ssd,lenovo,20jb002nus,lenovo thinkpad x1 tablet 12 core i7 7y75 8 gb 256 gb ssd 20jb002nus windows 10
4,4,a1144,w953,hp 250 g5 15.6 intel core i3 5005u windows 10 8gb 1tb drive,hp,x9u74ut#aba,hp 250 g5 15.6 intel core i3 5005u windows 10 8gb 1tb x9u74ut#aba,hp pavilion pc intel core i3 7100u 8gb ddr4 1tb hdd 15.6 hd touchscreen windows 10,hp,,hp pavilion pc intel core i3 7100u 8gb ddr4 1tb hdd 15.6 hd touchscreen windows 10
5,5,a1739,w425,lenovo thinkpad p71 17.3 core i7 7700hq 8 gb 256 gb ssd,lenovo,20hk001jus,lenovo thinkpad p71 17.3 core i7 7700hq 8 gb 256 gb ssd lenovo 20hk001jus not applicable,lenovo thinkpad x1 tablet 12 core i7 7y75 8 gb 256 gb ssd,lenovo,20jb002lus,lenovo thinkpad x1 tablet 12 core i7 7y75 8 gb 256 gb ssd 20jb002lus windows 10
6,6,a1731,w809,newest hp 15.6 hd intel pentium 8gb 1tb hdd writerhd windows 10,hp laptop,,newest hp 15.6 hd intel pentium 8gb 1tb hdd writerhd windows 10 laptop,2017 hp touchscreen 15.6 hd intel i3 7100u dual core 8gb 1tb hdd hdmi windows 10,hp,2017,2017 hp touchscreen 15.6 hd intel i3 7100u dual core 8gb 1tb hdd hdmi windows 10
7,7,a1144,w1177,hp 250 g5 15.6 intel core i3 5005u windows 10 8gb 1tb drive,hp,x9u74ut#aba,hp 250 g5 15.6 intel core i3 5005u windows 10 8gb 1tb x9u74ut#aba,hp pavilion pc intel core i3 7100u 8gb ddr4 1tb hdd 15.6 hd touchscreen windows 10,hp,,hp pavilion pc intel core i3 7100u 8gb ddr4 1tb hdd 15.6 hd touchscreen windows 10
8,8,a551,w2500,hp pavilion 15.6 full hd pc intel core i5 6200u 8gb 1tb hdd windows 10,hp,,hp pavilion 15.6 full hd pc intel core i5 6200u 8gb 1tb hdd windows 10,hp 15.6 15 ac143wm pc intel core i5 5200u 6gb 1tb windows 10 home,hp,n5z06ua#aba,hp 15.6 15 ac143wm pc intel core i5 5200u 6gb 1tb windows 10 n5z06ua#aba
9,9,a1876,w1652,msi pe70 7rd 027 17.3 work intel core i7 7700hq gtx 1050 16gb ddr4 128gb ssd 1tb windows 10 pro,msi,pe70 7rd 027,msi pe70 7rd 027 17.3 work intel core i7 7700hq gtx 1050 16gb ddr4 128gb ssd 1tb windows 10,msi pe62 7rd 1095 15.6 intel core i7 7700hq 2.8ghz 32gb ddr4 1tb hdd 512gb ssd gtx 1050 usb3.0 w...,msi,,msi pe62 7rd 1095 15.6 intel core i7 7700hq 2.8ghz 32gb ddr4 1tb hdd 512gb ssd gtx 1050 usb3.0 w...


The blocker seems to have reduced the input size by 45x. However, on debugging the blocker we observe that it is still dropping potential matches. So, we are going to debug the blocker further.

In [52]:
def match_extended_title(attribute='extended_title', q_val=3, threshold=.5, debug=False):
    """Build a black-box blocking function based on q-gram Jaccard similarity."""
    def jaccard_matcher(ltuple, rtuple, attribute=attribute, q_val=q_val, threshold=threshold, debug=debug):
        # Pad both strings so that leading and trailing characters form full q-grams
        buffer = '#' * (q_val - 1)
        l_text = buffer + ltuple[attribute] + buffer
        r_text = buffer + rtuple[attribute] + buffer
        l_grams = set()
        r_grams = set()
        # Create the sets of q-grams
        for text, grams in [(l_text, l_grams), (r_text, r_grams)]:
            for i in range(len(text) - (q_val - 1)):
                grams.add(text[i:i + q_val])

        # Compute the Jaccard similarity
        intersection = l_grams & r_grams
        union = l_grams | r_grams
        if debug:
            print(union)
            print(intersection)
            print(len(intersection) / len(union))
        # Return True (drop the pair) when the similarity is below the threshold
        return len(intersection) / len(union) < threshold

    return jaccard_matcher

bb = em.BlackBoxBlocker()
bb.set_black_box_function(match_extended_title())
C3 = bb.block_candset(C2,  
                     show_progress=False,
                     n_jobs=-1
                     )
# Updated blocking sequence
# A, B --- overlap blocker [model] ---------------------> C1--------------------------------|
#                                                                                     union |---> C
# A, B --- overlap blocker [extended_title] ---> C2---> jaccard blocker [extended_title] ---|
C = em.combine_blocker_outputs_via_union([C1,C3])
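A small worked example may help to see how the Jaccard rule decides which pairs to drop. The sketch below restates the same computation on plain strings; the names `padded_qgrams`, `qgram_jaccard` and `should_drop` and the two toy strings are introduced here for illustration only.

```python
def padded_qgrams(text, q=3):
    """Set of q-grams of `text` after padding both ends with q-1 '#' characters."""
    pad = '#' * (q - 1)
    padded = pad + text + pad
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def qgram_jaccard(left, right, q=3):
    """Jaccard similarity: shared q-grams divided by all distinct q-grams."""
    l_grams, r_grams = padded_qgrams(left, q), padded_qgrams(right, q)
    return len(l_grams & r_grams) / len(l_grams | r_grams)


def should_drop(left, right, q=3, threshold=0.5):
    """Mirror the blocker's rule: drop the pair when similarity is below threshold."""
    return qgram_jaccard(left, right, q) < threshold


# 'hp 15' and 'hp 17' each have 7 padded 3-grams and share 4 of them,
# so the similarity is 4 / 10 = 0.4 and the pair is dropped at threshold 0.5.
print(qgram_jaccard('hp 15', 'hp 17'))  # 0.4
print(should_drop('hp 15', 'hp 17'))    # True
print(should_drop('hp 15', 'hp 15'))    # False
```

On the long `extended_title` strings a single differing token changes only a small share of the q-grams, which is why a threshold of 0.5 is much less aggressive there than on these short toy strings.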

In [53]:
# Debug again
dbg = em.debug_blocker(C, A, B)
dbg

Unnamed: 0,_id,ltable_id,rtable_id,ltable_product_title,ltable_brand,ltable_model,ltable_operating_system,ltable_extended_title,rtable_product_title,rtable_brand,rtable_model,rtable_operating_system,rtable_extended_title
0,0,a298,w3443,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,lenovo,20fb002lus,windows 7 professional,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 12gb 2tb nvme ssd 14 ips wqhd 2560x1440...,lenovo,,windows 7,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 12gb 2tb nvme ssd 14 ips wqhd 2560x1440
1,1,a298,w3400,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,lenovo,20fb002lus,windows 7 professional,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 12gb 1tb nvme ssd 14 ips wqhd 2560x1440...,lenovo,,windows 7,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 12gb 1tb nvme ssd 14 ips wqhd 2560x1440
2,2,a298,w3422,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,lenovo,20fb002lus,windows 7 professional,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 20gb 1tb nvme ssd 14 ips wqhd 2560x1440...,lenovo,,windows 7,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 20gb 1tb nvme ssd 14 ips wqhd 2560x1440
3,3,a298,w3439,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,lenovo,20fb002lus,windows 7 professional,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 8gb 2tb nvme ssd 14 ips wqhd 2560x1440 ...,lenovo,,windows 7,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 8gb 2tb nvme ssd 14 ips wqhd 2560x1440
4,4,a298,w3458,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,lenovo,20fb002lus,windows 7 professional,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 20gb 2tb nvme ssd 14 ips wqhd 2560x1440...,lenovo,,windows 7,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 20gb 2tb nvme ssd 14 ips wqhd 2560x1440
5,5,a298,w3394,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,lenovo,20fb002lus,windows 7 professional,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 8gb 1tb nvme ssd 14 ips wqhd 2560x1440 ...,lenovo,,windows 7,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 8gb 1tb nvme ssd 14 ips wqhd 2560x1440
6,6,a298,w3250,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,lenovo,20fb002lus,windows 7 professional,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 12gb 256gb nvme ssd 14 ips wqhd 2560x14...,lenovo,,windows 7,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 12gb 256gb nvme ssd 14 ips wqhd 2560x1440
7,7,a298,w3327,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,lenovo,20fb002lus,windows 7 professional,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 20gb 256gb nvme ssd 14 ips wqhd 2560x14...,lenovo,,windows 7,pcie lenovo thinkpad t460s windows 7 intel core i7 6600u 20gb 256gb nvme ssd 14 ips wqhd 2560x1440
8,8,a298,w3014,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,lenovo,20fb002lus,windows 7 professional,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,2017 lenovo thinkpad x1 carbon 5th gen windows 7 intel core i7 7600u 180gb ssd 8gb 14 wqhd ips 2...,lenovo,2017,windows 7,2017 lenovo thinkpad x1 carbon 5th gen windows 7 intel core i7 7600u 180gb ssd 8gb 14 wqhd ips 2...
9,9,a298,w3015,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,lenovo,20fb002lus,windows 7 professional,lenovo thinkpad x1 carbon 20fb002lus 14 wqhd 2560x1440 ips intel core i7 6600u 512gb ssd 16gb wi...,2017 lenovo thinkpad x1 carbon 5th gen windows 7 intel core i7 7600u 180gb ssd 8gb 14 wqhd ips 2...,lenovo,2017,windows 7,2017 lenovo thinkpad x1 carbon 5th gen windows 7 intel core i7 7600u 180gb ssd 8gb 14 wqhd ips 2...


In [54]:
len(C)

7326

Looking at the debug output, we observe that the current blocking sequence does not seem to drop a lot of potential matches and has also reduced the candidate set to a good extent. Thus, we stop the blocking step and proceed with the matching step.

# Step 3: Reading in the labeled sample

We randomly sample 450 tuple pairs from the candidate set and write the sample to a CSV file. We then label the pairs in that file by hand and use the labeled CSV from then on.

In [55]:
# Sample the candidate set
S = em.sample_table(C, 450)
# em.to_csv_metadata(S, 'data/labeled.csv')

In [56]:
# Load the labeled data
path_S = 'data/labeled.csv'
S = em.read_csv_metadata(path_S, 
                         key='_id',
                         ltable=A, rtable=B, 
                         fk_ltable='ltable_id', fk_rtable='rtable_id')

# Show some examples from the labeled sample
S.head(5)

Unnamed: 0,_id,ltable_id,rtable_id,ltable_model,ltable_extended_title,rtable_model,rtable_extended_title,match
0,5,a0,w2300,nh.q28aa.001,acer predator helios 300 15.6 full hd intel core i7 7700hq 16gb ddr4 256gb ssd geforce gtx 1060 ...,nh.q1caa.001,acer 15.6 intel core i7 2.6ghz 16 gb 1 tb hdd 256 gb ssd windows 10 g9 593 72vt nh.q1caa.001,1
1,26,a1011,w685,t8tjg,2018 lenovo thinkpad 11e 4h 11.6 intel i3 7100u 128gb m.2 ssd 4gb ddr4 802.11ac win10 t8tjg windows,t8tjg,dell chromebook 11 3189 11.6 celeron n3060 4 gb 64 gb ssd t8tjg os,0
2,30,a102,w1746,c300sa dh02,asus chromebook c300sa compact 13.3 intel celeron 4gb 16gb emmc asus c300sa dh02,c300sa dh02,c300sa dh02 cel n3060 4gb 16gb 13.3in chrome asus chrome,1
3,44,a1031,w4581,a515 51 596k,acer 15.6 intel core i5 3.40ghz 8gb 256gb ssd windows 10 a515 51 596k home,nx.gnpaa.016,refurbished acer aspire 3 intel core i5 2.5 ghz 8gb 256 gb ssd windows 10 nx.gnpaa.016,1
4,65,a1047,w1400,n850hp6,prostar clevo n850hp6 15.6” fhd ips 1920x1080 intel core i7 7700hq 16gb ddr4 gtx 1060 120gb ssd ...,n855hj,prostar clevo n855hj 15.6” full hd 1920x1080 intel core i7 7700hq 8gb ddr4 gtx 1050 1tb hdd wind...,1


# Step 4: Splitting labeled data into development and evaluation set

In this step, we split the labeled data set S into a development set I and an evaluation set J.

In [57]:
IJ = em.split_train_test(S, train_proportion=0.7, random_state=0)
I = IJ['train']
J = IJ['test']
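The sizes of I and J follow directly from the 70/30 proportion. The sketch below is a plain-Python stand-in for the split, assuming 450 labeled pairs; `split_rows` is a hypothetical helper, and the library's own shuffling and rounding may differ in detail.

```python
import random


def split_rows(rows, train_proportion=0.7, seed=0):
    """Shuffle rows reproducibly and cut them into a train and a test part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_proportion)
    return shuffled[:n_train], shuffled[n_train:]


# Row indices stand in for the 450 labeled tuple pairs
train, test = split_rows(range(450))
print(len(train), len(test))  # 315 135
```

The 135 evaluation pairs agree with the evaluation summary at the end of this notebook, which reports 91 positive and 44 negative predictions.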

# Step 5: Creating a set of ML-matchers.

In [58]:
# Create a set of ML-matchers
dt = em.DTMatcher(name='DecisionTree', random_state=0)
svm = em.SVMMatcher(name='SVM', random_state=0)
rf = em.RFMatcher(name='RF', random_state=0)
lg = em.LogRegMatcher(name='LogReg', random_state=0)
ln = em.LinRegMatcher(name='LinReg')
nb = em.NBMatcher(name='NaiveBayes')

### Creating features

Next, we need to create a set of features for the development set. *py_entitymatching* provides a way to automatically generate features based on the attributes in the input tables. We start from the automatically generated features and then add custom black-box features to the same feature table.

In [27]:
# Drop the attributes that should not be used for automatic feature generation
AA = A.drop(['id','product_title','model','operating_system'],axis=1)
BB = B.drop(['id','product_title','model','operating_system'],axis=1)

AA.keys(), BB.keys()

(Index(['brand', 'extended_title'], dtype='object'),
 Index(['brand', 'extended_title'], dtype='object'))

In [28]:
match_t = em.get_tokenizers_for_matching([3,5,10])
match_s = em.get_sim_funs_for_matching()
atypes1 = em.get_attr_types(AA)
atypes2 = em.get_attr_types(BB)
match_c = em.get_attr_corres(AA, BB)
feature_table = em.get_features(AA, BB, atypes1, atypes2, match_c, match_t, match_s)

In [29]:
feature_table['feature_name']

0                          brand_brand_jac_qgm_3_qgm_3
1                      brand_brand_cos_dlm_dc0_dlm_dc0
2                      brand_brand_jac_dlm_dc0_dlm_dc0
3                                      brand_brand_mel
4                                 brand_brand_lev_dist
5                                  brand_brand_lev_sim
6                                      brand_brand_nmw
7                                       brand_brand_sw
8        extended_title_extended_title_jac_qgm_3_qgm_3
9    extended_title_extended_title_cos_dlm_dc0_dlm_dc0
Name: feature_name, dtype: object

This function generates a feature function that returns 1 if both tuples, or neither of them, contain any of the passed-in values in the given attribute, and 0 otherwise.

In [30]:
def generateContainsValueFeature(values, name=None, attribute='extended_title'):
    if type(values) is str:
        values = [values]
    def containsValueFeature(a,b):
        return int(any([value.lower() in a[attribute].lower() for value in values]) 
                   == any([value.lower() in b[attribute].lower() for value in values]))
    return containsValueFeature, name if name else values[0]
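To see how such a feature behaves, here is a compact restatement applied to three toy tuples. The function name `make_contains_value_feature` and the example titles are hypothetical and only meant to illustrate the "both or neither" logic.

```python
def make_contains_value_feature(values, attribute='extended_title'):
    """Return a function that is 1 when both tuples, or neither, mention a value."""
    if isinstance(values, str):
        values = [values]

    def feature(a, b):
        in_a = any(v.lower() in a[attribute].lower() for v in values)
        in_b = any(v.lower() in b[attribute].lower() for v in values)
        return int(in_a == in_b)

    return feature


gtx = make_contains_value_feature('gtx')

# Toy tuples (made up for illustration)
gaming = {'extended_title': 'msi pe70 17.3 intel core i7 gtx 1050'}
office = {'extended_title': 'hp pavilion 15.6 intel core i5'}
budget = {'extended_title': 'acer aspire 15.6 intel core i5'}

print(gtx(gaming, office))  # 0: only one title mentions gtx
print(gtx(office, budget))  # 1: neither title mentions gtx
print(gtx(gaming, gaming))  # 1: both titles mention gtx
```

Because "neither contains the keyword" also yields 1, most keyword features are 1 for most pairs, and a 0 is the informative signal that the two titles disagree.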

We use this to generate many new features, one per keyword or keyword group.

In [31]:
brands = ['lg','toshiba','hp','dell','lenovo','prostar','acer','samsung','apple','asus','panasonic','msi']
models = ['zephyrus','zenbook','flex','omen','xps','x1','carbon','yoga',
          'latitude','inspiron','elitebook','clevo','spectre',
          'macbook','pavilion','ideapad','legion']
thinkpads = [['p40','p50','p51','p71'],['430','460','470','560','570']]
# swift, aspire and spin are Acer product lines
acer = ['swift','aspire','spin']
models = models + thinkpads + acer
sizes = [' 13',' 14',' 15',' 17']
operating_systems = ['chrome','windows','mac']
cpus = [['i3','i5','i7'],['celeron','pentium'],['m3','m5'],'amd']
miscellaneous = ['2-in-1','gtx','touch']
keywords = models \
    + sizes  \
    + operating_systems \
    + cpus \
    + miscellaneous
new_features = [generateContainsValueFeature(value) for value in keywords]

for feature in new_features:
    em.add_blackbox_feature(feature_table, feature[1], feature[0])

This function generates a feature function that returns True if at least one of the passed-in values appears in both tuples, and False otherwise.

In [32]:
def generateContainsValueFromValuesetFeature(valueset, name=None, attribute='extended_title'):
    def sharesValue(a,b,values):
        if type(values) is str:
            values = [values]
        return int(any([value.lower() in a[attribute].lower() and value.lower() in b[attribute].lower() for value in values]))
    def containsValueFromValueset(a,b):
        return any([sharesValue(a,b,values) for values in valueset[1]])
    return containsValueFromValueset, valueset[0]
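The following sketch restates this second kind of feature on toy data to show when it fires. The helper name `make_shared_value_feature` and the example titles are assumptions made for illustration; the `cpus` list is the one used in this notebook.

```python
def make_shared_value_feature(groups, attribute='extended_title'):
    """Return a function that is True when some value occurs in both tuples."""
    def shares_any(a, b, values):
        if isinstance(values, str):
            values = [values]
        return any(v.lower() in a[attribute].lower() and
                   v.lower() in b[attribute].lower() for v in values)

    def feature(a, b):
        return any(shares_any(a, b, values) for values in groups)

    return feature


cpus = [['i3', 'i5', 'i7'], ['celeron', 'pentium'], ['m3', 'm5'], 'amd']
same_cpu = make_shared_value_feature(cpus)

# Toy tuples (made up for illustration)
t1 = {'extended_title': 'hp 15 intel core i5 8gb 1tb'}
t2 = {'extended_title': 'hp 15 intel core i7 8gb 1tb'}
t3 = {'extended_title': 'hp pavilion intel core i5 6gb'}

print(same_cpu(t1, t2))  # False: i5 and i7 are different values
print(same_cpu(t1, t3))  # True: both titles contain i5
```

Note that the nested groups do not act as families here: a pair counts only when the same individual value appears on both sides, and the check is a plain substring test.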

In [33]:
valuesets = [('brands',brands), 
             ('models', models), 
             ('sizes', sizes), 
             ('cpus', cpus), 
             ('operating_systems',operating_systems)]

new_features = [generateContainsValueFromValuesetFeature(valueset) for valueset in valuesets]

for feature in new_features:
    em.add_blackbox_feature(feature_table, feature[1], feature[0])

In [34]:
# List the names of the features generated
feature_table['feature_name']

0                           brand_brand_jac_qgm_3_qgm_3
1                       brand_brand_cos_dlm_dc0_dlm_dc0
2                       brand_brand_jac_dlm_dc0_dlm_dc0
3                                       brand_brand_mel
4                                  brand_brand_lev_dist
5                                   brand_brand_lev_sim
6                                       brand_brand_nmw
7                                        brand_brand_sw
8         extended_title_extended_title_jac_qgm_3_qgm_3
9     extended_title_extended_title_cos_dlm_dc0_dlm_dc0
10                                             zephyrus
11                                              zenbook
12                                                 flex
13                                                 omen
14                                                  xps
15                                                   x1
16                                               carbon
17                                              

### Converting the development set to feature vectors

In [35]:
# Convert I into a set of feature vectors using the feature table
H = em.extract_feature_vecs(I, 
                            feature_table=feature_table, 
                            attrs_after='match',
                            show_progress=False)
H.fillna(0, inplace=True)
H.head()

Unnamed: 0,_id,ltable_id,rtable_id,brand_brand_jac_qgm_3_qgm_3,brand_brand_cos_dlm_dc0_dlm_dc0,brand_brand_jac_dlm_dc0_dlm_dc0,brand_brand_mel,brand_brand_lev_dist,brand_brand_lev_sim,brand_brand_nmw,...,amd,2-in-1,gtx,touch,brands,models,sizes,cpus,operating_systems,match
221,3372,a1997,w2968,1.0,1.0,1.0,1.0,0.0,1.0,6.0,...,1,1,1,0,True,True,True,True,True,1
439,6763,a891,w2599,0.04,0.0,0.0,0.60625,13.0,0.1875,-7.0,...,1,1,1,1,True,False,False,True,True,0
191,2875,a1919,w3250,1.0,1.0,1.0,1.0,0.0,1.0,6.0,...,1,1,1,1,True,True,True,True,True,1
239,3626,a1997,w3468,1.0,1.0,1.0,1.0,0.0,1.0,6.0,...,1,1,1,1,True,True,True,True,True,1
433,6651,a878,w2294,1.0,1.0,1.0,1.0,0.0,1.0,6.0,...,1,1,1,1,True,True,True,True,True,1


### Selecting the best matcher using cross-validation

Now, we select the best matcher using k-fold cross-validation. For the purposes of this guide, we use five-fold cross-validation and the 'precision' metric to select the best matcher.

In [36]:
# Select the best ML matcher using CV
result = em.select_matcher(
        matchers=[dt, rf, svm, ln, lg, nb], 
        table=H, 
        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'match'],
        k=5,
        target_attr='match', 
        metric_to_select_matcher='precision', 
        random_state=0)
result['cv_stats']
# result

Unnamed: 0,Matcher,Average precision,Average recall,Average f1
0,DecisionTree,0.954048,0.93798,0.945281
1,RF,0.963564,0.972,0.967336
2,SVM,0.893285,0.992,0.939676
3,LinReg,0.962487,0.952694,0.957178
4,LogReg,0.913782,0.96,0.935088
5,NaiveBayes,0.946487,0.94223,0.943907
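The selection rule can be checked by hand against the table above. The sketch below copies the reported averages into a dictionary and picks the matcher with the highest value for a given metric; `best_matcher` is a hypothetical helper, not the library's selection code.

```python
# Average cross-validation scores copied from the table above
cv_stats = {
    'DecisionTree': {'precision': 0.954048, 'recall': 0.93798, 'f1': 0.945281},
    'RF': {'precision': 0.963564, 'recall': 0.972, 'f1': 0.967336},
    'SVM': {'precision': 0.893285, 'recall': 0.992, 'f1': 0.939676},
    'LinReg': {'precision': 0.962487, 'recall': 0.952694, 'f1': 0.957178},
    'LogReg': {'precision': 0.913782, 'recall': 0.96, 'f1': 0.935088},
    'NaiveBayes': {'precision': 0.946487, 'recall': 0.94223, 'f1': 0.943907},
}


def best_matcher(stats, metric):
    """Name of the matcher with the highest average value of `metric`."""
    return max(stats, key=lambda name: stats[name][metric])


print(best_matcher(cv_stats, 'precision'))  # RF
print(best_matcher(cv_stats, 'recall'))     # SVM
print(best_matcher(cv_stats, 'f1'))         # RF
```

Random forest leads on both precision and F1, while SVM reaches the highest recall at the cost of the lowest precision.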


### Debugging matcher

Random forest (RF) has the highest average precision and F1 in the table above, but it still misclassifies some pairs. We debug the matcher to see what might be wrong.
To do this, we first split the feature vectors into train and test sets.

In [37]:
# Split feature vectors into train and test
UV = em.split_train_test(H, train_proportion=0.5)
U = UV['train']
V = UV['test']

Next, we debug the matcher using the GUI. For the purposes of this guide, we debug the random forest matcher.

In [38]:
# Debug rf using the GUI (the selected matcher is a random forest, so use vis_debug_rf)
# em.vis_debug_rf(result['selected_matcher'], U, V, 
#         exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'match'],
#         target_attr='match')

## Evaluating the matching output

From the GUI, we observe that phone numbers seem to be an important attribute, but they are in different formats. The current features do not capture this, and adding a feature that accounts for the difference in format could improve the F1 score.

Now, we repeat extracting the feature vectors (this time with the updated feature table), imputing missing values, and selecting the best matcher again using cross-validation.

In [39]:
# Select the best ML matcher using CV
result = em.select_matcher(
        [dt, rf, svm, ln, lg, nb], 
        table=H, 
        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'match'],
        k=5,
        target_attr='match', 
        metric_to_select_matcher='precision', 
        random_state=0)
result['cv_stats']

Unnamed: 0,Matcher,Average precision,Average recall,Average f1
0,DecisionTree,0.954048,0.93798,0.945281
1,RF,0.963564,0.972,0.967336
2,SVM,0.893285,0.992,0.939676
3,LinReg,0.962487,0.952694,0.957178
4,LogReg,0.913782,0.96,0.935088
5,NaiveBayes,0.946487,0.94223,0.943907


Evaluating the matching outputs for the evaluation set typically involves the following four steps:
1. Converting the evaluation set to feature vectors
2. Training matcher using the feature vectors extracted from the development set
3. Predicting the evaluation set using the trained matcher
4. Evaluating the predicted matches

### Converting the evaluation set to feature vectors

As before, we convert the evaluation set to feature vectors using the feature table.

In [40]:
# Convert J into a set of feature vectors using feature table
L = em.extract_feature_vecs(
        J, 
        feature_table=feature_table,
        attrs_after='match', 
        show_progress=False)
L.fillna(0, inplace=True)
L.head()

Unnamed: 0,_id,ltable_id,rtable_id,brand_brand_jac_qgm_3_qgm_3,brand_brand_cos_dlm_dc0_dlm_dc0,brand_brand_jac_dlm_dc0_dlm_dc0,brand_brand_mel,brand_brand_lev_dist,brand_brand_lev_sim,brand_brand_nmw,...,amd,2-in-1,gtx,touch,brands,models,sizes,cpus,operating_systems,match
124,1764,a1561,w15,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1,1,1,1,True,True,False,True,True,0
54,705,a1269,w3092,1.0,1.0,1.0,1.0,0.0,1.0,6.0,...,1,1,1,0,True,True,True,True,True,1
268,4200,a2294,w1544,1.0,1.0,1.0,1.0,0.0,1.0,7.0,...,1,1,1,1,True,True,True,True,True,1
293,4614,a2439,w3188,1.0,1.0,1.0,1.0,0.0,1.0,6.0,...,1,1,1,1,True,True,True,True,True,1
230,3469,a1997,w3210,1.0,1.0,1.0,1.0,0.0,1.0,6.0,...,1,1,1,1,True,True,True,True,True,1


### Training the selected matcher

Now, we train the matcher using all of the feature vectors from the development set. For the purposes of this guide we use random forest as the selected matcher.

In [41]:
# Train using feature vectors from I 
result['selected_matcher'].fit(table=H, 
       exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'match'], 
       target_attr='match')

### Predicting the matches

Next, we predict the matches for the evaluation set (using the feature vectors extracted from it).

In [42]:
# Predict on L 
predictions = result['selected_matcher'].predict(
        table=L, 
        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'match'], 
        append=True, 
        target_attr='predicted', 
        inplace=False)

### Evaluating the predictions

Finally, we evaluate the accuracy of the predicted outputs.

In [43]:
# Evaluate the predictions
eval_result = em.eval_matches(predictions, 'match', 'predicted')
em.print_eval_summary(eval_result)

Precision : 94.51% (86/91)
Recall : 92.47% (86/93)
F1 : 93.48%
False positives : 5 (out of 91 positive predictions)
False negatives : 7 (out of 44 negative predictions)
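The summary above can be reproduced from the three counts it reports: 86 true positives, 5 false positives and 7 false negatives. The sketch below uses the standard definitions of precision, recall and F1; `precision_recall_f1` is a helper written for this illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1 from true/false positive and negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the evaluation summary above
p, r, f1 = precision_recall_f1(tp=86, fp=5, fn=7)
print(f'Precision : {p:.2%}')  # 94.51%
print(f'Recall : {r:.2%}')     # 92.47%
print(f'F1 : {f1:.2%}')        # 93.48%
```

The 91 positive and 44 negative predictions add up to 135 pairs, which is 30% of the 450 labeled pairs.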


In [44]:
# em.vis_debug_dt(dt, 
#                 U, V, 
#                 exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'match'],
#                 target_attr='match')