# Basic EM workflow 3 (Amazon and Walmart products data set)

# Introduction

This IPython notebook explains a basic workflow for matching two tables using *py_entitymatching*. Our goal is to come up with a workflow to match product listings from Amazon and Walmart. Specifically, we want to maximize F1. The datasets contain information about the products (title, brand, and model).
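Since F1 is the target metric, it may help to recall how it is computed. The sketch below is a minimal illustration, not part of *py_entitymatching*: `precision_recall_f1` is a hypothetical helper and the pair ids are made up.

```python
def precision_recall_f1(predicted, actual):
    """Compute precision, recall and F1 over sets of (left_id, right_id) pairs."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up pairs, for illustration only
predicted = {("a1", "w1"), ("a2", "w2"), ("a3", "w9")}
actual = {("a1", "w1"), ("a2", "w2"), ("a4", "w4"), ("a5", "w5")}
p, r, f1 = precision_recall_f1(predicted, actual)
print(round(p, 3), round(r, 3), round(f1, 3))
```

Because F1 is a harmonic mean, a workflow that scores well on only one of precision or recall still gets a low F1.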

First, we need to import the *py_entitymatching* package and other libraries as follows:

In [1]:
import sys
#sys.path.append('/Users/pradap/Documents/Research/Python-Package/anhaid/py_entitymatching/')

import py_entitymatching as em
import pandas as pd
import os

In [2]:
# Display the versions
print('python version: ' + sys.version )
print('pandas version: ' + pd.__version__ )
print('magellan version: ' + em.__version__ )

python version: 3.6.5 (default, Apr  1 2018, 05:46:30) 
[GCC 7.3.0]
pandas version: 0.22.0
magellan version: 0.3.0


Matching two tables typically consists of the following three steps:

**1. Reading the input tables**

**2. Blocking the input tables to get a candidate set**

**3. Matching the tuple pairs in the candidate set**

# Read input tables

We begin by loading the input tables. For the purpose of this guide, we use two CSV files of product listings crawled from Amazon and Walmart.

In [3]:
# Get the paths
path_A = '/home/liang/Workspace/cs839-project/stage-2/crawlers/data/amazon_products.csv'
path_B = '/home/liang/Workspace/cs839-project/stage-2/crawlers/data/walmart_products.csv'

In [4]:
# Load csv files as dataframes and set the key attribute in the dataframe
A = em.read_csv_metadata(path_A, key='id')
B = em.read_csv_metadata(path_B, key='id')

Metadata file is not present in the given path; proceeding to read the csv file.
Metadata file is not present in the given path; proceeding to read the csv file.


In [5]:
print('Number of tuples in A: ' + str(len(A)))
print('Number of tuples in B: ' + str(len(B)))
print('Number of tuples in A X B (i.e the cartesian product): ' + str(len(A)*len(B)))

Number of tuples in A: 2976
Number of tuples in B: 4847
Number of tuples in A X B (i.e the cartesian product): 14424672
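Comparing every pair in this cartesian product is what blocking is meant to avoid. As a rough sketch, the fraction of pairs a blocker removes (often called the reduction ratio) can be computed as below. `reduction_ratio` is a hypothetical helper, and the candidate-set size of 10,000 is a made-up figure used only for illustration.

```python
def reduction_ratio(num_left, num_right, num_candidates):
    """Fraction of the cartesian product that blocking removes."""
    total_pairs = num_left * num_right
    return 1.0 - num_candidates / total_pairs


# Table sizes taken from the output above; the candidate count is hypothetical
total = 2976 * 4847
rr = reduction_ratio(2976, 4847, 10_000)
print(total, round(rr, 6))
```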


In [6]:
A.head(2)

Unnamed: 0,id,product title,brand,model,combo
0,a0,acer predator helios 300 gaming laptop 15.6 full hd intel core i7-7700hq cpu 16gb ddr4 ram 256gb...,acer,nh.q28aa.001,acer predator helios 300 gaming laptop 15.6 full hd intel core i7-7700hq cpu 16gb ddr4 ram 256gb...
1,a1,acer aspire e 15 15.6 full hd 8th gen intel core i5-8250u geforce mx150 8gb ram memory 256gb ssd...,acer,e5-576g-5762,acer aspire e 15 15.6 full hd 8th gen intel core i5-8250u geforce mx150 8gb ram memory 256gb ssd...


In [7]:
B.head(2)

Unnamed: 0,id,product title,brand,model,combo
0,w0,iview i896qw 8.95 2-in-1 32gb tablet intel atom bay trail z3735f processor windows 10,iview,,iview i896qw 8.95 2-in-1 32gb tablet intel atom bay trail z3735f processor windows 10 iview
1,w1,dell - inspiron 2-in-1 11.6 touch-screen laptop - intel pentium - 4gb memory - 500gb hd - red,dell,i3168-3270red,dell - inspiron 2-in-1 11.6 touch-screen laptop - intel pentium - 4gb memory - 500gb hd - red de...


In [8]:
# Display the keys of the input tables
em.get_key(A), em.get_key(B)

('id', 'id')

In [9]:
# If the tables are large we can downsample the tables like this
A1, B1 = em.down_sample(A, B, 200, 1, show_progress=True)
len(A1), len(B1)

# But for the purposes of this notebook, we will use the entire tables A and B

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


(146, 200)

# Block tables to get candidate set

Before we do the matching, we would like to remove the obviously non-matching tuple pairs from the input tables. This would reduce the number of tuple pairs considered for matching.
*py_entitymatching* provides four different blockers: (1) attribute equivalence, (2) overlap, (3) rule-based, and (4) black-box. The user can mix and match these blockers to form a blocking sequence applied to input tables.

For the matching problem at hand, we expect two listings of the same product to have largely overlapping titles. So we decide to apply an overlap blocker over the product titles:

In [10]:
# Blocking plan

# A, B -- overlap blocker [product title] --------------------|---> candidate set

# block on: brand, ram, OS?, model? 

In [11]:
# Create overlap blocker
ob = em.OverlapBlocker()

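# Block tables on the 'product title' attribute using character 5-grams;
# a pair is kept only if the two titles share at least 75 q-grams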
C1 = ob.block_tables(A, B, 'product title', 'product title', 
                    l_output_attrs=['product title'], 
                    r_output_attrs=['product title'],
                    overlap_size=75,
                    q_val=5,
                    word_level=False,
                    show_progress=True,
                    n_jobs=-1,
                    )
len(C1)

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:04


6513
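To make the `q_val`, `word_level` and `overlap_size` parameters more concrete, here is a simplified pure-Python sketch of the overlap test. It is an illustration under stated assumptions, not the library's implementation: `qgrams`, `word_tokens` and `survives_blocking` are hypothetical helpers, the q-grams are unpadded sets, and the library's tokenizer may differ in details such as padding. The titles are shortened versions of rows shown in this notebook.

```python
def qgrams(text, q):
    """Return the set of character q-grams of text (no padding)."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def word_tokens(text):
    """Return the set of whitespace-separated words of text."""
    return set(text.split())


def survives_blocking(left, right, tokenize, overlap_size):
    """Keep a pair only if the two strings share at least overlap_size tokens."""
    return len(tokenize(left) & tokenize(right)) >= overlap_size


left = "dell inspiron 13 2-in-1 laptop core i7-8550u 256gb ssd 8gb ram"
right = "dell inspiron 13 2-in-1 laptop core i7-8550u 256gb ssd"
other = "apple macbook air 13.3-inch laptop"

# Character-level comparison (analogous to word_level=False, q_val=5)
print(survives_blocking(left, right, lambda s: qgrams(s, 5), 20))
print(survives_blocking(left, other, lambda s: qgrams(s, 5), 20))

# Word-level comparison (analogous to word_level=True)
print(survives_blocking(left, right, word_tokens, 5))
print(survives_blocking(left, other, word_tokens, 5))
```

Character q-grams tolerate small spelling differences inside a word, whereas word-level tokens require whole words to agree, so the two settings drop different pairs.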

In [12]:
C1

Unnamed: 0,_id,ltable_id,rtable_id,ltable_product title,rtable_product title
0,0,a2000,w6,hp stream with an ultra-portable design laptop 14 screen intel celeron n3060 processor 4gb ram 3...,11.6 convertible touchscreen laptop windows 10 home office 365 personal 1-year subscription incl...
1,1,a322,w32,lenovo flex 5 2-in-1 laptop: core i5-8250u 8gb ram 128gb ssd 14-inch full hd touch display windo...,lenovo flex 5 2-in-1 laptop: core i5-8250u 8gb ram 128gb ssd 14-inch full hd touch display windo...
2,2,a138,w41,asus vivobook flip 14 thin and light 2-in-1 hd touchscreen laptop intel 2.2ghz processor 4gb ram...,asus vivobook tp200sa-dh01t-bl 11.6 inch display thin and lightweight 2-in-1 hd touchscreen lapt...
3,3,a174,w41,asus vivobook e403na-us04 thin and lightweight 14” fhd laptop intel celeron n3350 processor 4gb ...,asus vivobook tp200sa-dh01t-bl 11.6 inch display thin and lightweight 2-in-1 hd touchscreen lapt...
4,4,a255,w41,asus vivobook flip tp200sa-dh01t 11.6 inch display thin and lightweight 2-in-1 hd touchscreen la...,asus vivobook tp200sa-dh01t-bl 11.6 inch display thin and lightweight 2-in-1 hd touchscreen lapt...
5,5,a621,w41,asus vivobook flip 14 tp401ca-dhm6t 14” thin and lightweight 2-in-1 hd touchscreen laptop intel ...,asus vivobook tp200sa-dh01t-bl 11.6 inch display thin and lightweight 2-in-1 hd touchscreen lapt...
6,6,a1167,w41,asus vivobook e203na ultra-thin and light laptop intel celeron n3350 processor 4gb lpddr3 ram 32...,asus vivobook tp200sa-dh01t-bl 11.6 inch display thin and lightweight 2-in-1 hd touchscreen lapt...
7,7,a648,w41,asus transformer book tp200sa-dh01t-bl 11.6 inch display thin and lightweight 2-in-1 full hd tou...,asus vivobook tp200sa-dh01t-bl 11.6 inch display thin and lightweight 2-in-1 hd touchscreen lapt...
8,8,a432,w72,dell inspiron 13 2-in-1 laptop: core i7-8550u 256gb ssd 8gb ram 13.3 full hd touch display windo...,dell inspiron 13 2-in-1 laptop: core i7-8550u 256gb ssd 8gb ram 13.3 full hd touch display
9,9,a1161,w76,lenovo yoga y720 13 premium thin light convertible 2 in 1 laptop pc (intel 8th gen i7 quad core ...,hp spectre x360 13t premium ultra light convertible 2-in-1 laptop/tablet (intel 8th gen quad cor...


## Debug blocker output

The number of tuple pairs considered for matching is reduced to 6513 (from 14424672), but we would want to make sure that the blocker did not drop any potential matches. We could debug the blocker output in *py_entitymatching* as follows:

In [13]:
# Debug blocker output
# dbg = em.debug_blocker(C1, A, B, output_size=200, 
#                        attr_corres=[
#                            ('product title','product title'),
#                            ('brand', 'brand'), 
#                            ('model','model'),
#                            ('combo','combo')],
#                       verbose=True)
#### Display first few tuple pairs from the debug_blocker's output
# dbg

From the debug blocker's output we observe that the current blocker drops quite a few potential matches. We would want to update the blocking sequence to avoid dropping these potential matches.

For the considered dataset, we know that for two products to match, their descriptions must overlap. We could apply the overlap blocker to the `combo` attribute (which, judging from the rows shown above, appears to be the product title followed by the brand and model) for this purpose. Finally, we would want to union the outputs of the overlap blockers to get a consolidated candidate set.

In [15]:
# Updated blocking sequence
# A, B ------ overlap blocker [product title] -----> C1--
#                                                        |----> C
# A, B ------ overlap blocker [combo] -------------> C2--

In [16]:
# Create overlap blocker
ob = em.OverlapBlocker()

# Block tables using the 'combo' attribute (character 5-grams, overlap of at least 125)
C2 = ob.block_tables(A, B, 'combo', 'combo', 
                    l_output_attrs=['product title', 'combo'], 
                    r_output_attrs=['product title', 'combo'],
                    overlap_size=125,
                    q_val=5,
                    word_level=False,
                    show_progress=True,
                    n_jobs=-1
                    )
len(C2)

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:04


727

In [17]:
# Display the candidate set C2
C2

Unnamed: 0,_id,ltable_id,rtable_id,ltable_product title,ltable_combo,rtable_product title,rtable_combo
0,0,a255,w41,asus vivobook flip tp200sa-dh01t 11.6 inch display thin and lightweight 2-in-1 hd touchscreen la...,asus vivobook flip tp200sa-dh01t 11.6 inch display thin and lightweight 2-in-1 hd touchscreen la...,asus vivobook tp200sa-dh01t-bl 11.6 inch display thin and lightweight 2-in-1 hd touchscreen lapt...,asus vivobook tp200sa-dh01t-bl 11.6 inch display thin and lightweight 2-in-1 hd touchscreen lapt...
1,1,a648,w41,asus transformer book tp200sa-dh01t-bl 11.6 inch display thin and lightweight 2-in-1 full hd tou...,asus transformer book tp200sa-dh01t-bl 11.6 inch display thin and lightweight 2-in-1 full hd tou...,asus vivobook tp200sa-dh01t-bl 11.6 inch display thin and lightweight 2-in-1 hd touchscreen lapt...,asus vivobook tp200sa-dh01t-bl 11.6 inch display thin and lightweight 2-in-1 hd touchscreen lapt...
2,2,a1116,w99,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...
3,3,a1791,w99,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...
4,4,a1594,w108,hp envy x360 15t 2-in-1 convertible laptop pc (amd 7th gen quad-core fx 9800p apu amd radeon r7 ...,hp envy x360 15t 2-in-1 convertible laptop pc (amd 7th gen quad-core fx 9800p apu amd radeon r7 ...,hp envy x360 15t 2-in-1 convertible laptop pc (amd 7th gen quad-core fx 9800p apu amd radeon r7 ...,hp envy x360 15t 2-in-1 convertible laptop pc (amd 7th gen quad-core fx 9800p apu amd radeon r7 ...
5,5,a2490,w108,hp envy x360 15t 2-in-1 convertible laptop pc (amd 7th gen quad-core fx 9800p apu amd radeon r7 ...,hp envy x360 15t 2-in-1 convertible laptop pc (amd 7th gen quad-core fx 9800p apu amd radeon r7 ...,hp envy x360 15t 2-in-1 convertible laptop pc (amd 7th gen quad-core fx 9800p apu amd radeon r7 ...,hp envy x360 15t 2-in-1 convertible laptop pc (amd 7th gen quad-core fx 9800p apu amd radeon r7 ...
6,6,a1594,w110,hp envy x360 15t 2-in-1 convertible laptop pc (amd 7th gen quad-core fx 9800p apu amd radeon r7 ...,hp envy x360 15t 2-in-1 convertible laptop pc (amd 7th gen quad-core fx 9800p apu amd radeon r7 ...,hp envy x360 15t 2-in-1 convertible laptop pc (amd 7th gen quad-core fx 9800p apu amd radeon r7 ...,hp envy x360 15t 2-in-1 convertible laptop pc (amd 7th gen quad-core fx 9800p apu amd radeon r7 ...
7,7,a2490,w110,hp envy x360 15t 2-in-1 convertible laptop pc (amd 7th gen quad-core fx 9800p apu amd radeon r7 ...,hp envy x360 15t 2-in-1 convertible laptop pc (amd 7th gen quad-core fx 9800p apu amd radeon r7 ...,hp envy x360 15t 2-in-1 convertible laptop pc (amd 7th gen quad-core fx 9800p apu amd radeon r7 ...,hp envy x360 15t 2-in-1 convertible laptop pc (amd 7th gen quad-core fx 9800p apu amd radeon r7 ...
8,8,a116,w113,2018 newest hp envy x360 convertible 2-in-1 full hd ips 15.6 touchscreen notebook intel quad cor...,2018 newest hp envy x360 convertible 2-in-1 full hd ips 15.6 touchscreen notebook intel quad cor...,hp envy x360 convertible 2-in-1 full hd ips 15.6 touchscreen notebook intel core i7-8550u proces...,hp envy x360 convertible 2-in-1 full hd ips 15.6 touchscreen notebook intel core i7-8550u proces...
9,9,a1116,w114,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...


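We add two more overlap blockers, this time with the default word-level tokenization and an overlap size of 20: one on the product title and one on `combo`.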

In [18]:
# Create overlap blocker
ob = em.OverlapBlocker()

# Block tables using the 'product title' attribute (word-level tokens, overlap of at least 20)
C3 = ob.block_tables(A, B, 'product title', 'product title', 
                    l_output_attrs=['product title'], 
                    r_output_attrs=['product title'],
                    overlap_size=20,
                    show_progress=True,
                    n_jobs=-1,
                    )
len(C3)

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:01


2066

In [19]:
C3

Unnamed: 0,_id,ltable_id,rtable_id,ltable_product title,rtable_product title
0,0,a629,w36,dell i5368-1692gry 13.3 fhd 2-in-1 laptop (intel core i3-6100u 2.3ghz processor 4 gb ram 1 tb hd...,refurbished dell i5368-1692gry 13.3 fhd 2-in-1 laptop (intel core i3-6100u 2.3ghz processor 4 gb...
1,1,a255,w41,asus vivobook flip tp200sa-dh01t 11.6 inch display thin and lightweight 2-in-1 hd touchscreen la...,asus vivobook tp200sa-dh01t-bl 11.6 inch display thin and lightweight 2-in-1 hd touchscreen lapt...
2,2,a648,w41,asus transformer book tp200sa-dh01t-bl 11.6 inch display thin and lightweight 2-in-1 full hd tou...,asus vivobook tp200sa-dh01t-bl 11.6 inch display thin and lightweight 2-in-1 hd touchscreen lapt...
3,3,a555,w86,2018 latest hp spectre x360 13t touchscreen yoga style 2-in-1 windows 10 pro laptop & tablet - i...,hp spectre x360 13t premium ultra light convertible 2-in-1 laptop (intel 8th gen i7-8550u quad c...
4,4,a1116,w99,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...
5,5,a1437,w99,cuk hp envy x360 15z convertible touch notebook (amd ryzen 5 2500u + amd radeon vega 8 16gb ram ...,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...
6,6,a1791,w99,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...
7,7,a116,w107,2018 newest hp envy x360 convertible 2-in-1 full hd ips 15.6 touchscreen notebook intel quad cor...,hp envy x360 convertible 2-in-1 full hd ips 15.6 touchscreen notebook intel core i7-8550u proces...
8,8,a1594,w108,hp envy x360 15t 2-in-1 convertible laptop pc (amd 7th gen quad-core fx 9800p apu amd radeon r7 ...,hp envy x360 15t 2-in-1 convertible laptop pc (amd 7th gen quad-core fx 9800p apu amd radeon r7 ...
9,9,a2490,w108,hp envy x360 15t 2-in-1 convertible laptop pc (amd 7th gen quad-core fx 9800p apu amd radeon r7 ...,hp envy x360 15t 2-in-1 convertible laptop pc (amd 7th gen quad-core fx 9800p apu amd radeon r7 ...


In [20]:
# Create overlap blocker
ob = em.OverlapBlocker()

# Block tables using the 'combo' attribute (word-level tokens, overlap of at least 20)
C4 = ob.block_tables(A, B, 'combo', 'combo', 
                    l_output_attrs=['product title', 'combo'], 
                    r_output_attrs=['product title', 'combo'],
                    overlap_size=20,
                    show_progress=True,
                    n_jobs=-1
                    )
len(C4)

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:01


2158

In [21]:
C4

Unnamed: 0,_id,ltable_id,rtable_id,ltable_product title,ltable_combo,rtable_product title,rtable_combo
0,0,a629,w36,dell i5368-1692gry 13.3 fhd 2-in-1 laptop (intel core i3-6100u 2.3ghz processor 4 gb ram 1 tb hd...,dell i5368-1692gry 13.3 fhd 2-in-1 laptop (intel core i3-6100u 2.3ghz processor 4 gb ram 1 tb hd...,refurbished dell i5368-1692gry 13.3 fhd 2-in-1 laptop (intel core i3-6100u 2.3ghz processor 4 gb...,refurbished dell i5368-1692gry 13.3 fhd 2-in-1 laptop (intel core i3-6100u 2.3ghz processor 4 gb...
1,1,a255,w41,asus vivobook flip tp200sa-dh01t 11.6 inch display thin and lightweight 2-in-1 hd touchscreen la...,asus vivobook flip tp200sa-dh01t 11.6 inch display thin and lightweight 2-in-1 hd touchscreen la...,asus vivobook tp200sa-dh01t-bl 11.6 inch display thin and lightweight 2-in-1 hd touchscreen lapt...,asus vivobook tp200sa-dh01t-bl 11.6 inch display thin and lightweight 2-in-1 hd touchscreen lapt...
2,2,a648,w41,asus transformer book tp200sa-dh01t-bl 11.6 inch display thin and lightweight 2-in-1 full hd tou...,asus transformer book tp200sa-dh01t-bl 11.6 inch display thin and lightweight 2-in-1 full hd tou...,asus vivobook tp200sa-dh01t-bl 11.6 inch display thin and lightweight 2-in-1 hd touchscreen lapt...,asus vivobook tp200sa-dh01t-bl 11.6 inch display thin and lightweight 2-in-1 hd touchscreen lapt...
3,3,a555,w86,2018 latest hp spectre x360 13t touchscreen yoga style 2-in-1 windows 10 pro laptop & tablet - i...,2018 latest hp spectre x360 13t touchscreen yoga style 2-in-1 windows 10 pro laptop & tablet - i...,hp spectre x360 13t premium ultra light convertible 2-in-1 laptop (intel 8th gen i7-8550u quad c...,hp spectre x360 13t premium ultra light convertible 2-in-1 laptop (intel 8th gen i7-8550u quad c...
4,4,a1116,w99,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...
5,5,a1437,w99,cuk hp envy x360 15z convertible touch notebook (amd ryzen 5 2500u + amd radeon vega 8 16gb ram ...,cuk hp envy x360 15z convertible touch notebook (amd ryzen 5 2500u + amd radeon vega 8 16gb ram ...,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...
6,6,a1791,w99,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...,hp envy x360 15z premium yoga style 2-in-1 convertible laptop (amd ryzen 5 quad core apu radeon ...
7,7,a116,w107,2018 newest hp envy x360 convertible 2-in-1 full hd ips 15.6 touchscreen notebook intel quad cor...,2018 newest hp envy x360 convertible 2-in-1 full hd ips 15.6 touchscreen notebook intel quad cor...,hp envy x360 convertible 2-in-1 full hd ips 15.6 touchscreen notebook intel core i7-8550u proces...,hp envy x360 convertible 2-in-1 full hd ips 15.6 touchscreen notebook intel core i7-8550u proces...
8,8,a1594,w108,hp envy x360 15t 2-in-1 convertible laptop pc (amd 7th gen quad-core fx 9800p apu amd radeon r7 ...,hp envy x360 15t 2-in-1 convertible laptop pc (amd 7th gen quad-core fx 9800p apu amd radeon r7 ...,hp envy x360 15t 2-in-1 convertible laptop pc (amd 7th gen quad-core fx 9800p apu amd radeon r7 ...,hp envy x360 15t 2-in-1 convertible laptop pc (amd 7th gen quad-core fx 9800p apu amd radeon r7 ...
9,9,a2490,w108,hp envy x360 15t 2-in-1 convertible laptop pc (amd 7th gen quad-core fx 9800p apu amd radeon r7 ...,hp envy x360 15t 2-in-1 convertible laptop pc (amd 7th gen quad-core fx 9800p apu amd radeon r7 ...,hp envy x360 15t 2-in-1 convertible laptop pc (amd 7th gen quad-core fx 9800p apu amd radeon r7 ...,hp envy x360 15t 2-in-1 convertible laptop pc (amd 7th gen quad-core fx 9800p apu amd radeon r7 ...


In [22]:
# Combine blocker outputs
C = em.combine_blocker_outputs_via_union([C1, C2, C3, C4])

In [23]:
len(C)

6713
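The four blocker outputs contain 6513 + 727 + 2066 + 2158 = 11464 rows in total, yet the union has only 6713, because many pairs are produced by more than one blocker. The sketch below illustrates that deduplicating behaviour on plain Python sets. `union_candidate_sets` is a hypothetical helper, not the library function, and the pair ids are a few of those shown in the outputs above.

```python
def union_candidate_sets(*candidate_sets):
    """Union several collections of (ltable_id, rtable_id) pairs, dropping duplicates."""
    combined = set()
    for pairs in candidate_sets:
        combined.update(pairs)
    return sorted(combined)


c1 = [("a255", "w41"), ("a648", "w41"), ("a432", "w72")]
c2 = [("a255", "w41"), ("a1116", "w99")]
c = union_candidate_sets(c1, c2)
print(len(c1) + len(c2), len(c))
```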

We observe that the number of tuple pairs considered for matching is increased to 6713 (from 6513). Now let us debug the blocker output again to check if the current blocker sequence is dropping any potential matches.

In [24]:
# Debug again
dbg = em.debug_blocker(C, A, B, output_size=200)

In [25]:
# Display first few rows from the debugger output
dbg

Unnamed: 0,_id,ltable_id,rtable_id,ltable_product title,ltable_brand,ltable_model,ltable_combo,rtable_product title,rtable_brand,rtable_model,rtable_combo
0,0,a826,w2550,acer swift 3 sf314-52-517z 14 laptop computer - silver intel core i5-8250u processor 1.6ghz 8gb ...,acer,nx.gqgaa.002,acer swift 3 sf314-52-517z 14 laptop computer - silver intel core i5-8250u processor 1.6ghz 8gb ...,acer sf314-51-52w2 swift 3 14 laptop windows 10 intel core i5-6200u processor 8gb ram 256gb soli...,acer,nx.gkbaa.002,acer sf314-51-52w2 swift 3 14 laptop windows 10 intel core i5-6200u processor 8gb ram 256gb soli...
1,1,a229,w4803,dell i3168-3271blu 11.6 hd 2-in-1 laptop (intel pentium n3710 1.6ghz processor 4 gb ddr3l sdram ...,dell,i3168-3271blu,dell i3168-3271blu 11.6 hd 2-in-1 laptop (intel pentium n3710 1.6ghz processor 4 gb ddr3l sdram ...,refurbished dell i3552-3240blk 15.6 hd laptop (intel pentium n3700 1.6ghz processor 4 gb ddr3l s...,dell,i3552-3240blk,refurbished dell i3552-3240blk 15.6 hd laptop (intel pentium n3700 1.6ghz processor 4 gb ddr3l s...
2,2,a229,w2131,dell i3168-3271blu 11.6 hd 2-in-1 laptop (intel pentium n3710 1.6ghz processor 4 gb ddr3l sdram ...,dell,i3168-3271blu,dell i3168-3271blu 11.6 hd 2-in-1 laptop (intel pentium n3710 1.6ghz processor 4 gb ddr3l sdram ...,new dell i3552-3240blk 15.6 hd laptop (intel pentium n3700 1.6ghz processor 4 gb ddr3l sdram 500...,dell,i3552-3240blk,new dell i3552-3240blk 15.6 hd laptop (intel pentium n3700 1.6ghz processor 4 gb ddr3l sdram 500...
3,3,a1463,w2343,acer predator helios 15.6 laptop intel core i7 2.80ghz 16gb ram 1tb hdd 256gb ssd (certified ref...,acer,g3-572-72yf,acer predator helios 15.6 laptop intel core i7 2.80ghz 16gb ram 1tb hdd 256gb ssd (certified ref...,refurbished acer predator 17 laptop intel i7 2.80ghz 16gb ram 1tb hdd 256gb ssd win10home,acer,nh.q29aa.002,refurbished acer predator 17 laptop intel i7 2.80ghz 16gb ram 1tb hdd 256gb ssd win10home acer n...
4,4,a2110,w743,lg gram intel i7 16gb ram 15.6 touchscreen laptop dark silver (15z970-a.aas7u1) with software + ...,lg,,lg gram intel i7 16gb ram 15.6 touchscreen laptop dark silver (15z970-a.aas7u1) with software + ...,lg 15z970-a.aas7u1 gram intel i7 16gb ram 15.6 touchscreen laptop dark silver,lg,15z970-a.aas7u1,lg 15z970-a.aas7u1 gram intel i7 16gb ram 15.6 touchscreen laptop dark silver lg 15z970-a.aas7u1
5,5,a1202,w4097,apple macbook air md760ll/a 13.3-inch laptop (old version),apple,md760ll/a,apple macbook air md760ll/a 13.3-inch laptop (old version) apple md760ll/a,apple macbook air md231ll/a 13.3-inch laptop (old version) (scratched and dented),apple,,apple macbook air md231ll/a 13.3-inch laptop (old version) (scratched and dented) apple
6,6,a136,w2310,dell inspiron 15 gaming laptop: core i7-7700hq 16gb ram 128gb ssd and 1tb hdd gtx 1060 6gb 15.6-...,dell,i7577-7722blk,dell inspiron 15 gaming laptop: core i7-7700hq 16gb ram 128gb ssd and 1tb hdd gtx 1060 6gb 15.6-...,dell inspiron 15 gaming laptop: core i5-7300hq gtx 1060 6gb 128gb ssd+1tb hdd 8gb ram 15.6 full ...,dell,i7577-5258blk-pus,dell inspiron 15 gaming laptop: core i5-7300hq gtx 1060 6gb 128gb ssd+1tb hdd 8gb ram 15.6 full ...
7,7,a854,w585,hp pavilion 15-inch laptop intel core i5-7200u 8gb ram 1tb hard drive windows 10 (15-cc010nr sil...,hp,15-cc010nr,hp pavilion 15-inch laptop intel core i5-7200u 8gb ram 1tb hard drive windows 10 (15-cc010nr sil...,hp 15-bs030nr 15.6 laptop touchscreen windows 10 intel core i5-7200u processor 8gb ram 1tb hard ...,hp,1kv01ua#aba,hp 15-bs030nr 15.6 laptop touchscreen windows 10 intel core i5-7200u processor 8gb ram 1tb hard ...
8,8,a951,w2265,dell 4m5j9 latitude 3480 14 hd laptop (intel core i5-7200u 8gb ddr4 128gb solid state drive wind...,dell,4m5j9,dell 4m5j9 latitude 3480 14 hd laptop (intel core i5-7200u 8gb ddr4 128gb solid state drive wind...,dell latitude 3580 15.6 hd laptop intel core i5-7200u 8gb ddr4 128gb solid state drive windows 1...,dell,lati35802t78m,dell latitude 3580 15.6 hd laptop intel core i5-7200u 8gb ddr4 128gb solid state drive windows 1...
9,9,a1467,w2265,dell 4g86p latitude 5580 laptop 15.6 fhd intel core i7-7820hq 8gb ddr4 256gb solid state drive w...,dell,4g86p,dell 4g86p latitude 5580 laptop 15.6 fhd intel core i7-7820hq 8gb ddr4 256gb solid state drive w...,dell latitude 3580 15.6 hd laptop intel core i5-7200u 8gb ddr4 128gb solid state drive windows 1...,dell,lati35802t78m,dell latitude 3580 15.6 hd laptop intel core i5-7200u 8gb ddr4 128gb solid state drive windows 1...


We observe that the current blocker sequence does not drop obvious potential matches, and we can proceed with the matching step now. A subtle point to note here is that debugging the blocker output effectively provides a stopping criterion for modifying the blocker sequence.
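The idea behind debugging a blocker can be sketched in a few lines: look at pairs that are not in the candidate set and rank them by how similar they are, so that likely missed matches surface first. The code below is only a brute-force illustration of that idea on made-up data. It is not how `em.debug_blocker` is implemented, and `jaccard` and `likely_missed_pairs` are hypothetical helpers.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard similarity between the word sets of two strings."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def likely_missed_pairs(left, right, candidates, top_k):
    """Return the top_k most similar pairs that are NOT in the candidate set.

    left and right map record ids to text; candidates is a set of (lid, rid).
    This enumerates every pair, so it is only suitable for tiny inputs.
    """
    scored = [
        (jaccard(ltext, rtext), lid, rid)
        for (lid, ltext), (rid, rtext) in product(left.items(), right.items())
        if (lid, rid) not in candidates
    ]
    # Highest similarity first; ids break ties deterministically
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return scored[:top_k]


# Made-up records, for illustration only
left = {"a1": "acer swift 3 14 laptop", "a2": "apple macbook air 13.3-inch laptop"}
right = {"w1": "acer swift 3 14 laptop silver", "w2": "apple macbook air 13.3-inch laptop"}
candidates = {("a1", "w1")}
missed = likely_missed_pairs(left, right, candidates, top_k=2)
print(missed[0])
```

If the highest-ranked excluded pairs are clearly non-matches, the blocker is probably not dropping much, which is the stopping signal described above.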


# Matching tuple pairs in the candidate set

In this step, we would want to match the tuple pairs in the candidate set. Specifically, we use a learning-based method for matching purposes.
This typically involves the following four steps:
1. Sampling and labeling the candidate set
2. Splitting the labeled data into development and evaluation sets
3. Selecting the best learning-based matcher using the development set
4. Evaluating the selected matcher using the evaluation set
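As a rough sketch of step 2, labeled pairs can be shuffled and cut into a development part and an evaluation part. This is a plain-Python illustration under assumptions: `split_labeled_pairs` is a hypothetical helper, the 70/30 proportion and the seed are arbitrary choices, and the labeled pairs are made up.

```python
import random


def split_labeled_pairs(labeled, dev_fraction=0.7, seed=0):
    """Shuffle labeled pairs and split them into development and evaluation lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(labeled)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * dev_fraction))
    return shuffled[:cut], shuffled[cut:]


# Made-up labeled pairs: ((ltable_id, rtable_id), label)
labeled = [(("a%d" % i, "w%d" % i), i % 2) for i in range(10)]
dev, evaluation = split_labeled_pairs(labeled)
print(len(dev), len(evaluation))
```

Keeping the evaluation part untouched until step 4 is what makes the final F1 estimate meaningful.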

## Sampling and labeling the candidate set

First, we randomly sample 10 tuple pairs for labeling purposes.

In [26]:
# Sample the candidate set
S = em.sample_table(C, 10)
S

Unnamed: 0,_id,ltable_id,rtable_id,ltable_product title,ltable_combo,rtable_product title,rtable_combo
972,972,a1347,w3127,lenovo thinkpad t470s windows 10 pro laptop - intel core i7-7600u 20gb ram 512gb ssd 14 ips fhd ...,lenovo thinkpad t470s windows 10 pro laptop - intel core i7-7600u 20gb ram 512gb ssd 14 ips fhd ...,pcie lenovo thinkpad t460s business windows 10 pro laptop - intel core i7-6600u 8gb ram 256gb pc...,pcie lenovo thinkpad t460s business windows 10 pro laptop - intel core i7-6600u 8gb ram 256gb pc...
1621,1621,a1513,w3312,lenovo thinkpad x1 yoga 20jf000dus 14 touchscreen lcd 2560 x 1440 - ips - intel core i7 (7th gen...,lenovo thinkpad x1 yoga 20jf000dus 14 touchscreen lcd 2560 x 1440 - ips - intel core i7 (7th gen...,refurbished lenovo thinkpad x1 yoga 20jf000hus 14 touchscreen lcd 2 in 1 ultrabook - intel core ...,refurbished lenovo thinkpad x1 yoga 20jf000hus 14 touchscreen lcd 2 in 1 ultrabook - intel core ...
1749,1749,a1602,w1283,hp - 17z laptop 17.3 with hd+ touchscreen ( amd dual-core a6-9220 apu amd radeon r4 graphics 16g...,hp - 17z laptop 17.3 with hd+ touchscreen ( amd dual-core a6-9220 apu amd radeon r4 graphics 16g...,hp - 17z laptop 17.3 with hd + touchscreen (amd dual-core a6-9220 apu amd radeon r4 graphics 8gb...,hp - 17z laptop 17.3 with hd + touchscreen (amd dual-core a6-9220 apu amd radeon r4 graphics 8gb...
1862,1862,a1716,w3182,lenovo thinkpad p51 mobile workstation laptop - windows 10 pro - intel quad-core i7-7820hq 64gb ...,lenovo thinkpad p51 mobile workstation laptop - windows 10 pro - intel quad-core i7-7820hq 64gb ...,lenovo thinkpad t460s business performance windows 10 pro laptop - intel core i7-6600u 20gb ram ...,lenovo thinkpad t460s business performance windows 10 pro laptop - intel core i7-6600u 20gb ram ...
4543,4543,a2418,w3380,lenovo thinkpad p50 mobile workstation laptop - windows 7 pro - intel i7-6700hq 64gb ram 2tb ssd...,lenovo thinkpad p50 mobile workstation laptop - windows 7 pro - intel i7-6700hq 64gb ram 2tb ssd...,lenovo thinkpad t460s business performance windows 8.1 pro laptop - intel core i7-6600u 20gb ram...,lenovo thinkpad t460s business performance windows 8.1 pro laptop - intel core i7-6600u 20gb ram...
4942,4942,a2439,w3377,lenovo thinkpad t470s windows 10 pro laptop - intel core i7-7600u 12gb ram 512gb ssd 14 ips fhd ...,lenovo thinkpad t470s windows 10 pro laptop - intel core i7-7600u 12gb ram 512gb ssd 14 ips fhd ...,pcie lenovo thinkpad t460s business windows 10 pro laptop - intel core i7-6600u 12gb ram 1tb pci...,pcie lenovo thinkpad t460s business windows 10 pro laptop - intel core i7-6600u 12gb ram 1tb pci...
5057,5057,a2534,w3025,lenovo thinkpad p51 15.6 4k uhd mobile workstation laptop pc (intel i7 quad core 32gb ram 1tb hd...,lenovo thinkpad p51 15.6 4k uhd mobile workstation laptop pc (intel i7 quad core 32gb ram 1tb hd...,lenovo thinkpad p51s 15.6'' premium mobile workstation ultrabook laptop (intel i7 processor 32gb...,lenovo thinkpad p51s 15.6'' premium mobile workstation ultrabook laptop (intel i7 processor 32gb...
5159,5159,a2636,w1630,hp envy 15t high performance laptop pc with full hd touchscreen ( intel i7 processor 16gb ram 1t...,hp envy 15t high performance laptop pc with full hd touchscreen ( intel i7 processor 16gb ram 1t...,hp pavilion 15t gaming laptop with full hd touchscreen ( intel i7 quad core 32gb nvidia geforce ...,hp pavilion 15t gaming laptop with full hd touchscreen ( intel i7 quad core 32gb nvidia geforce ...
6352,6352,a566,w714,2018 newest premium hp 15.6 business flagship laptop hd+ wled-backlit touchscreen display intel ...,2018 newest premium hp 15.6 business flagship laptop hd+ wled-backlit touchscreen display intel ...,2018 newest premium hp 15.6? touchscreen hd laptop intel dual core i3-7100u processor 2.40ghz 8g...,2018 newest premium hp 15.6? touchscreen hd laptop intel dual core i3-7100u processor 2.40ghz 8g...
6588,6588,a888,w3336,lenovo thinkpad x1 carbon 4 business ultrabook - windows 10 pro - intel core i7-6600u 1tb ssd 16...,lenovo thinkpad x1 carbon 4 business ultrabook - windows 10 pro - intel core i7-6600u 1tb ssd 16...,lenovo thinkpad x1 carbon 4 business ultrabook - windows 10 pro - intel core i7-6600u 1tb nvme-p...,lenovo thinkpad x1 carbon 4 business ultrabook - windows 10 pro - intel core i7-6600u 1tb nvme-p...


Next, we label the sampled candidate set. Specifically, we enter 1 for a match and 0 for a non-match.

In [27]:
# Label S
G = em.label_table(S, 'match')

Column name (match) is not present in dataframe
  table.set_value(idxv[i], cols[j], val)


For the purposes of this guide, we display the labeled sample G below; the code for loading a pre-labeled dataset is left commented out.

In [28]:
# # Load the pre-labeled data
# path_G = em.get_install_path() + os.sep + 'datasets' + os.sep + 'end-to-end' + os.sep + 'restaurants/lbl_restnt_wf1.csv'
# G = em.read_csv_metadata(path_G, 
#                          key='_id',
#                          ltable=A, rtable=B, 
#                          fk_ltable='ltable_id', fk_rtable='rtable_id')
# len(G)
G

Unnamed: 0,_id,ltable_id,rtable_id,ltable_product title,ltable_combo,rtable_product title,rtable_combo,match
972,972,a1347,w3127,lenovo thinkpad t470s windows 10 pro laptop - intel core i7-7600u 20gb ram 512gb ssd 14 ips fhd ...,lenovo thinkpad t470s windows 10 pro laptop - intel core i7-7600u 20gb ram 512gb ssd 14 ips fhd ...,pcie lenovo thinkpad t460s business windows 10 pro laptop - intel core i7-6600u 8gb ram 256gb pc...,pcie lenovo thinkpad t460s business windows 10 pro laptop - intel core i7-6600u 8gb ram 256gb pc...,0
1621,1621,a1513,w3312,lenovo thinkpad x1 yoga 20jf000dus 14 touchscreen lcd 2560 x 1440 - ips - intel core i7 (7th gen...,lenovo thinkpad x1 yoga 20jf000dus 14 touchscreen lcd 2560 x 1440 - ips - intel core i7 (7th gen...,refurbished lenovo thinkpad x1 yoga 20jf000hus 14 touchscreen lcd 2 in 1 ultrabook - intel core ...,refurbished lenovo thinkpad x1 yoga 20jf000hus 14 touchscreen lcd 2 in 1 ultrabook - intel core ...,0
1749,1749,a1602,w1283,hp - 17z laptop 17.3 with hd+ touchscreen ( amd dual-core a6-9220 apu amd radeon r4 graphics 16g...,hp - 17z laptop 17.3 with hd+ touchscreen ( amd dual-core a6-9220 apu amd radeon r4 graphics 16g...,hp - 17z laptop 17.3 with hd + touchscreen (amd dual-core a6-9220 apu amd radeon r4 graphics 8gb...,hp - 17z laptop 17.3 with hd + touchscreen (amd dual-core a6-9220 apu amd radeon r4 graphics 8gb...,0
1862,1862,a1716,w3182,lenovo thinkpad p51 mobile workstation laptop - windows 10 pro - intel quad-core i7-7820hq 64gb ...,lenovo thinkpad p51 mobile workstation laptop - windows 10 pro - intel quad-core i7-7820hq 64gb ...,lenovo thinkpad t460s business performance windows 10 pro laptop - intel core i7-6600u 20gb ram ...,lenovo thinkpad t460s business performance windows 10 pro laptop - intel core i7-6600u 20gb ram ...,0
4543,4543,a2418,w3380,lenovo thinkpad p50 mobile workstation laptop - windows 7 pro - intel i7-6700hq 64gb ram 2tb ssd...,lenovo thinkpad p50 mobile workstation laptop - windows 7 pro - intel i7-6700hq 64gb ram 2tb ssd...,lenovo thinkpad t460s business performance windows 8.1 pro laptop - intel core i7-6600u 20gb ram...,lenovo thinkpad t460s business performance windows 8.1 pro laptop - intel core i7-6600u 20gb ram...,0
4942,4942,a2439,w3377,lenovo thinkpad t470s windows 10 pro laptop - intel core i7-7600u 12gb ram 512gb ssd 14 ips fhd ...,lenovo thinkpad t470s windows 10 pro laptop - intel core i7-7600u 12gb ram 512gb ssd 14 ips fhd ...,pcie lenovo thinkpad t460s business windows 10 pro laptop - intel core i7-6600u 12gb ram 1tb pci...,pcie lenovo thinkpad t460s business windows 10 pro laptop - intel core i7-6600u 12gb ram 1tb pci...,0
5057,5057,a2534,w3025,lenovo thinkpad p51 15.6 4k uhd mobile workstation laptop pc (intel i7 quad core 32gb ram 1tb hd...,lenovo thinkpad p51 15.6 4k uhd mobile workstation laptop pc (intel i7 quad core 32gb ram 1tb hd...,lenovo thinkpad p51s 15.6'' premium mobile workstation ultrabook laptop (intel i7 processor 32gb...,lenovo thinkpad p51s 15.6'' premium mobile workstation ultrabook laptop (intel i7 processor 32gb...,1
5159,5159,a2636,w1630,hp envy 15t high performance laptop pc with full hd touchscreen ( intel i7 processor 16gb ram 1t...,hp envy 15t high performance laptop pc with full hd touchscreen ( intel i7 processor 16gb ram 1t...,hp pavilion 15t gaming laptop with full hd touchscreen ( intel i7 quad core 32gb nvidia geforce ...,hp pavilion 15t gaming laptop with full hd touchscreen ( intel i7 quad core 32gb nvidia geforce ...,1
6352,6352,a566,w714,2018 newest premium hp 15.6 business flagship laptop hd+ wled-backlit touchscreen display intel ...,2018 newest premium hp 15.6 business flagship laptop hd+ wled-backlit touchscreen display intel ...,2018 newest premium hp 15.6? touchscreen hd laptop intel dual core i3-7100u processor 2.40ghz 8g...,2018 newest premium hp 15.6? touchscreen hd laptop intel dual core i3-7100u processor 2.40ghz 8g...,1
6588,6588,a888,w3336,lenovo thinkpad x1 carbon 4 business ultrabook - windows 10 pro - intel core i7-6600u 1tb ssd 16...,lenovo thinkpad x1 carbon 4 business ultrabook - windows 10 pro - intel core i7-6600u 1tb ssd 16...,lenovo thinkpad x1 carbon 4 business ultrabook - windows 10 pro - intel core i7-6600u 1tb nvme-p...,lenovo thinkpad x1 carbon 4 business ultrabook - windows 10 pro - intel core i7-6600u 1tb nvme-p...,1


## Splitting the labeled data into development and evaluation sets

In this step, we split the labeled data into two sets: development (I) and evaluation (J). The development set is used to select the best learning-based matcher, and the evaluation set is used to evaluate the selected matcher on unseen data.

In [29]:
# Split G into development set (I) and evaluation set (J)
IJ = em.split_train_test(G, train_proportion=0.7, random_state=0)
I = IJ['train']
J = IJ['test']
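
As a rough illustration of what a 70/30 split does, the following standard-library sketch shuffles a list of labeled pair ids with a fixed seed and cuts it at 70%. The function name `split_labeled_pairs` and the sample ids are hypothetical; this is not how `em.split_train_test` is implemented, only the general idea behind it.

```python
import random


def split_labeled_pairs(pair_ids, train_proportion=0.7, seed=0):
    """Shuffle pair ids reproducibly and cut them into (development, evaluation)."""
    shuffled = list(pair_ids)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed gives a repeatable split
    cut = int(round(len(shuffled) * train_proportion))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical ids standing in for the rows of the labeled table
pair_ids = list(range(10))
dev_ids, eval_ids = split_labeled_pairs(pair_ids)
print(len(dev_ids), len(eval_ids))
```

Fixing the seed matters because the evaluation set must stay unseen: a split that changes between runs would leak evaluation pairs into development.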

## Selecting the best learning-based matcher 

Selecting the best learning-based matcher typically involves the following steps:

1. Creating a set of learning-based matchers
2. Creating features
3. Converting the development set into feature vectors
4. Selecting the best learning-based matcher using k-fold cross validation

### Creating a set of learning-based matchers

In [30]:
# Create a set of ML-matchers
dt = em.DTMatcher(name='DecisionTree', random_state=0)
svm = em.SVMMatcher(name='SVM', random_state=0)
rf = em.RFMatcher(name='RF', random_state=0)
lg = em.LogRegMatcher(name='LogReg', random_state=0)
ln = em.LinRegMatcher(name='LinReg')
nb = em.NBMatcher(name='NaiveBayes')

### Creating features

Next, we need to create a set of features for the development set. *py_entitymatching* provides a way to automatically generate features based on the attributes in the input tables. For the purposes of this guide, we use the automatically generated features.

In [41]:
# Generate features
feature_table = em.get_features_for_matching(A, B, validate_inferred_attr_types=True)

The table shows the corresponding attributes along with their respective types.
Please confirm that the information  has been correctly inferred.
If you would like to skip this validation process in the future,
please set the flag validate_inferred_attr_types equal to false.


Unnamed: 0,Left Attribute,Right Attribute,Left Attribute Type,Right Attribute Type,Example Features
0,id,id,short string (1 word),short string (1 word),Levenshtein Distance; Levenshtein Similarity
1,product title,product title,short string (1 word),short string (1 word),"Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter]"
2,brand,brand,short string (1 word to 5 words),short string (1 word to 5 words),"Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter]"
3,model,model,short string (1 word to 5 words),short string (1 word to 5 words),"Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter]"
4,combo,combo,short string (1 word),short string (1 word),"Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter]"


Do you want to proceed? (y/n):n

If the attribute correspondences or types have been inferred incorrectly,
use the get_features() function with your  own correspondences and attribute
types to get the correct features for your data


In [32]:
# List the names of the features generated
feature_table['feature_name']

0                                      id_id_lev_dist
1                                       id_id_lev_sim
2                                           id_id_jar
3                                           id_id_jwn
4                                           id_id_exm
5                               id_id_jac_qgm_3_qgm_3
6         product_title_product_title_jac_qgm_3_qgm_3
7     product_title_product_title_cos_dlm_dc0_dlm_dc0
8                         brand_brand_jac_qgm_3_qgm_3
9                     brand_brand_cos_dlm_dc0_dlm_dc0
10                    brand_brand_jac_dlm_dc0_dlm_dc0
11                                    brand_brand_mel
12                               brand_brand_lev_dist
13                                brand_brand_lev_sim
14                                    brand_brand_nmw
15                                     brand_brand_sw
16                        model_model_jac_qgm_3_qgm_3
17                    model_model_cos_dlm_dc0_dlm_dc0
18                    model_
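
The feature names encode the attribute pair, the similarity function, and the tokenizers; for example, the attribute table above describes `jac_qgm_3_qgm_3` as Jaccard similarity over 3-grams on both sides. The sketch below shows that computation in plain Python. The helper names `qgrams` and `jaccard` are hypothetical, and the tokenizer here is simplified (no padding characters), so the values need not match what the package computes.

```python
def qgrams(text, q=3):
    """Return the set of overlapping character q-grams of text (no padding)."""
    if len(text) < q:
        # Assumption: a string shorter than q is treated as a single token
        return {text} if text else set()
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def jaccard(left, right, q=3):
    """Jaccard similarity |A & B| / |A | B| over the q-gram sets of two strings."""
    a, b = qgrams(left, q), qgrams(right, q)
    union = a | b
    if not union:
        return 0.0  # Assumption: two empty strings score 0.0
    return len(a & b) / len(union)


# Two product-title fragments of the kind seen in the tables above
score = jaccard("thinkpad t470s", "thinkpad t460s")
print(round(score, 3))
```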

### Converting the development set to feature vectors

In [34]:
# Convert I into a set of feature vectors using feature_table
H = em.extract_feature_vecs(I, 
                            feature_table=feature_table, 
                            attrs_after='match',
                            show_progress=False)

In [35]:
# Display first few rows
H.head(3)

Unnamed: 0,_id,ltable_id,rtable_id,id_id_lev_dist,id_id_lev_sim,id_id_jar,id_id_jwn,id_id_exm,id_id_jac_qgm_3_qgm_3,product_title_product_title_jac_qgm_3_qgm_3,...,model_model_cos_dlm_dc0_dlm_dc0,model_model_jac_dlm_dc0_dlm_dc0,model_model_mel,model_model_lev_dist,model_model_lev_sim,model_model_nmw,model_model_sw,combo_combo_jac_qgm_3_qgm_3,combo_combo_cos_dlm_dc0_dlm_dc0,match
6588,6588,a888,w3336,5,0.0,0.0,0.0,0,0.0,0.888298,...,,,,,,,,0.890052,0.949289,1
1621,1621,a1513,w3312,4,0.2,0.466667,0.466667,0,0.0,0.615,...,,,,,,,,0.567442,0.743484,0
5057,5057,a2534,w3025,5,0.0,0.0,0.0,0,0.0,0.497872,...,,,,,,,,0.490909,0.612903,1
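
In the preview above, the `model_*` feature columns are empty (NaN) for every row shown. Many scikit-learn estimators of this era refuse NaN inputs, so missing values are usually filled in before training. The package documents an `impute_table` helper for this purpose; consult its documentation for the exact parameters. The sketch below only illustrates the idea of column-mean imputation in plain Python. The function name `impute_column_means`, the `fallback` value for columns with no observed values, and the sample numbers are all assumptions.

```python
import math


def impute_column_means(rows, fallback=0.0):
    """Return a copy of rows with NaN replaced by the mean of each column.

    A column with no observed values gets `fallback` (an assumption here).
    """
    filled = [list(row) for row in rows]
    for col in range(len(rows[0])):
        observed = [row[col] for row in rows if not math.isnan(row[col])]
        mean = sum(observed) / len(observed) if observed else fallback
        for row in filled:
            if math.isnan(row[col]):
                row[col] = mean
    return filled


nan = float("nan")
# Hypothetical feature vectors: column 1 is entirely missing, column 2 partly
vectors = [
    [0.8, nan, 1.0],
    [0.6, nan, nan],
    [0.4, nan, 0.5],
]
filled = impute_column_means(vectors)
print(filled)
```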


### Selecting the best matcher using cross-validation

Now, we select the best matcher using k-fold cross-validation. For the purposes of this guide, we use five-fold cross-validation and the 'f1' metric to select the best matcher.
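
The selection cell in this step passes `k=5` and `metric_to_select_matcher='f1'`. As a reminder of what those settings mean, the plain-Python sketch below computes precision, recall, and F1 from gold and predicted labels, and partitions row indices into k contiguous folds. The names `precision_recall_f1` and `kfold_indices` and the sample labels are hypothetical, and the package may shuffle or stratify folds differently.

```python
def precision_recall_f1(gold, predicted):
    """Compute (precision, recall, f1) for binary labels where 1 means match."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def kfold_indices(n, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds over n rows."""
    start = 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)  # spread the remainder evenly
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


# Hypothetical labels: 2 true positives, 1 false positive, 1 false negative
gold = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(gold, predicted)
folds = list(kfold_indices(10, k=5))
print(precision, recall, f1, len(folds))
```

F1 is the harmonic mean of precision and recall, so a matcher cannot score well by maximizing one at the expense of the other.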

In [37]:
# Select the best ML matcher using CV
result = em.select_matcher([dt, rf, svm, ln, lg, nb], table=H, 
        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'match'],
        k=5,
        target_attr='match', metric_to_select_matcher='f1', random_state=0)
result['cv_stats']

JoblibValueError: JoblibValueError
___________________________________________________________________________
Multiprocessing exception:
...........................................................................
/usr/lib/python3.6/runpy.py in _run_module_as_main(mod_name='ipykernel_launcher', alter_argv=1)
    188         sys.exit(msg)
    189     main_globals = sys.modules["__main__"].__dict__
    190     if alter_argv:
    191         sys.argv[0] = mod_spec.origin
    192     return _run_code(code, main_globals, None,
--> 193                      "__main__", mod_spec)
        mod_spec = ModuleSpec(name='ipykernel_launcher', loader=<_f...b/python3.6/dist-packages/ipykernel_launcher.py')
    194 
    195 def run_module(mod_name, init_globals=None,
    196                run_name=None, alter_sys=False):
    197     """Execute a module's code without importing it

...........................................................................
/usr/lib/python3.6/runpy.py in _run_code(code=<code object <module> at 0x7f3eea012300, file "/...3.6/dist-packages/ipykernel_launcher.py", line 5>, run_globals={'__annotations__': {}, '__builtins__': <module 'builtins' (built-in)>, '__cached__': '/usr/local/lib/python3.6/dist-packages/__pycache__/ipykernel_launcher.cpython-36.pyc', '__doc__': 'Entry point for launching an IPython kernel.\n\nTh...orts until\nafter removing the cwd from sys.path.\n', '__file__': '/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py', '__loader__': <_frozen_importlib_external.SourceFileLoader object>, '__name__': '__main__', '__package__': '', '__spec__': ModuleSpec(name='ipykernel_launcher', loader=<_f...b/python3.6/dist-packages/ipykernel_launcher.py'), 'app': <module 'ipykernel.kernelapp' from '/usr/local/lib/python3.6/dist-packages/ipykernel/kernelapp.py'>, ...}, init_globals=None, mod_name='__main__', mod_spec=ModuleSpec(name='ipykernel_launcher', loader=<_f...b/python3.6/dist-packages/ipykernel_launcher.py'), pkg_name='', script_name=None)
     80                        __cached__ = cached,
     81                        __doc__ = None,
     82                        __loader__ = loader,
     83                        __package__ = pkg_name,
     84                        __spec__ = mod_spec)
---> 85     exec(code, run_globals)
        code = <code object <module> at 0x7f3eea012300, file "/...3.6/dist-packages/ipykernel_launcher.py", line 5>
        run_globals = {'__annotations__': {}, '__builtins__': <module 'builtins' (built-in)>, '__cached__': '/usr/local/lib/python3.6/dist-packages/__pycache__/ipykernel_launcher.cpython-36.pyc', '__doc__': 'Entry point for launching an IPython kernel.\n\nTh...orts until\nafter removing the cwd from sys.path.\n', '__file__': '/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py', '__loader__': <_frozen_importlib_external.SourceFileLoader object>, '__name__': '__main__', '__package__': '', '__spec__': ModuleSpec(name='ipykernel_launcher', loader=<_f...b/python3.6/dist-packages/ipykernel_launcher.py'), 'app': <module 'ipykernel.kernelapp' from '/usr/local/lib/python3.6/dist-packages/ipykernel/kernelapp.py'>, ...}
     86     return run_globals
     87 
     88 def _run_module_code(code, init_globals=None,
     89                     mod_name=None, mod_spec=None,

...........................................................................
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py in <module>()
     11     # This is added back by InteractiveShellApp.init_path()
     12     if sys.path[0] == '':
     13         del sys.path[0]
     14 
     15     from ipykernel import kernelapp as app
---> 16     app.launch_new_instance()

...........................................................................
/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py in launch_instance(cls=<class 'ipykernel.kernelapp.IPKernelApp'>, argv=None, **kwargs={})
    653 
    654         If a global instance already exists, this reinitializes and starts it
    655         """
    656         app = cls.instance(**kwargs)
    657         app.initialize(argv)
--> 658         app.start()
        app.start = <bound method IPKernelApp.start of <ipykernel.kernelapp.IPKernelApp object>>
    659 
    660 #-----------------------------------------------------------------------------
    661 # utility functions, for convenience
    662 #-----------------------------------------------------------------------------

...........................................................................
/usr/local/lib/python3.6/dist-packages/ipykernel/kernelapp.py in start(self=<ipykernel.kernelapp.IPKernelApp object>)
    481         if self.poller is not None:
    482             self.poller.start()
    483         self.kernel.start()
    484         self.io_loop = ioloop.IOLoop.current()
    485         try:
--> 486             self.io_loop.start()
        self.io_loop.start = <bound method BaseAsyncIOLoop.start of <tornado.platform.asyncio.AsyncIOMainLoop object>>
    487         except KeyboardInterrupt:
    488             pass
    489 
    490 launch_new_instance = IPKernelApp.launch_instance

...........................................................................
/usr/local/lib/python3.6/dist-packages/tornado/platform/asyncio.py in start(self=<tornado.platform.asyncio.AsyncIOMainLoop object>)
    122         except (RuntimeError, AssertionError):
    123             old_loop = None
    124         try:
    125             self._setup_logging()
    126             asyncio.set_event_loop(self.asyncio_loop)
--> 127             self.asyncio_loop.run_forever()
        self.asyncio_loop.run_forever = <bound method BaseEventLoop.run_forever of <_Uni...EventLoop running=True closed=False debug=False>>
    128         finally:
    129             asyncio.set_event_loop(old_loop)
    130 
    131     def stop(self):

...........................................................................
/usr/lib/python3.6/asyncio/base_events.py in run_forever(self=<_UnixSelectorEventLoop running=True closed=False debug=False>)
    417             sys.set_asyncgen_hooks(firstiter=self._asyncgen_firstiter_hook,
    418                                    finalizer=self._asyncgen_finalizer_hook)
    419         try:
    420             events._set_running_loop(self)
    421             while True:
--> 422                 self._run_once()
        self._run_once = <bound method BaseEventLoop._run_once of <_UnixS...EventLoop running=True closed=False debug=False>>
    423                 if self._stopping:
    424                     break
    425         finally:
    426             self._stopping = False

...........................................................................
/usr/lib/python3.6/asyncio/base_events.py in _run_once(self=<_UnixSelectorEventLoop running=True closed=False debug=False>)
   1427                         logger.warning('Executing %s took %.3f seconds',
   1428                                        _format_handle(handle), dt)
   1429                 finally:
   1430                     self._current_handle = None
   1431             else:
-> 1432                 handle._run()
        handle._run = <bound method Handle._run of <Handle BaseAsyncIOLoop._handle_events(13, 1)>>
   1433         handle = None  # Needed to break cycles when an exception occurs.
   1434 
   1435     def _set_coroutine_wrapper(self, enabled):
   1436         try:

...........................................................................
/usr/lib/python3.6/asyncio/events.py in _run(self=<Handle BaseAsyncIOLoop._handle_events(13, 1)>)
    140             self._callback = None
    141             self._args = None
    142 
    143     def _run(self):
    144         try:
--> 145             self._callback(*self._args)
        self._callback = <bound method BaseAsyncIOLoop._handle_events of <tornado.platform.asyncio.AsyncIOMainLoop object>>
        self._args = (13, 1)
    146         except Exception as exc:
    147             cb = _format_callback_source(self._callback, self._args)
    148             msg = 'Exception in callback {}'.format(cb)
    149             context = {

...........................................................................
/usr/local/lib/python3.6/dist-packages/tornado/platform/asyncio.py in _handle_events(self=<tornado.platform.asyncio.AsyncIOMainLoop object>, fd=13, events=1)
    112             self.writers.remove(fd)
    113         del self.handlers[fd]
    114 
    115     def _handle_events(self, fd, events):
    116         fileobj, handler_func = self.handlers[fd]
--> 117         handler_func(fileobj, events)
        handler_func = <function wrap.<locals>.null_wrapper>
        fileobj = <zmq.sugar.socket.Socket object>
        events = 1
    118 
    119     def start(self):
    120         try:
    121             old_loop = asyncio.get_event_loop()

...........................................................................
/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py in null_wrapper(*args=(<zmq.sugar.socket.Socket object>, 1), **kwargs={})
    271         # Fast path when there are no active contexts.
    272         def null_wrapper(*args, **kwargs):
    273             try:
    274                 current_state = _state.contexts
    275                 _state.contexts = cap_contexts[0]
--> 276                 return fn(*args, **kwargs)
        args = (<zmq.sugar.socket.Socket object>, 1)
        kwargs = {}
    277             finally:
    278                 _state.contexts = current_state
    279         null_wrapper._wrapped = True
    280         return null_wrapper

...........................................................................
/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py in _handle_events(self=<zmq.eventloop.zmqstream.ZMQStream object>, fd=<zmq.sugar.socket.Socket object>, events=1)
    445             return
    446         zmq_events = self.socket.EVENTS
    447         try:
    448             # dispatch events:
    449             if zmq_events & zmq.POLLIN and self.receiving():
--> 450                 self._handle_recv()
        self._handle_recv = <bound method ZMQStream._handle_recv of <zmq.eventloop.zmqstream.ZMQStream object>>
    451                 if not self.socket:
    452                     return
    453             if zmq_events & zmq.POLLOUT and self.sending():
    454                 self._handle_send()

...........................................................................
/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py in _handle_recv(self=<zmq.eventloop.zmqstream.ZMQStream object>)
    475             else:
    476                 raise
    477         else:
    478             if self._recv_callback:
    479                 callback = self._recv_callback
--> 480                 self._run_callback(callback, msg)
        self._run_callback = <bound method ZMQStream._run_callback of <zmq.eventloop.zmqstream.ZMQStream object>>
        callback = <function wrap.<locals>.null_wrapper>
        msg = [<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>]
    481         
    482 
    483     def _handle_send(self):
    484         """Handle a send event."""

...........................................................................
/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py in _run_callback(self=<zmq.eventloop.zmqstream.ZMQStream object>, callback=<function wrap.<locals>.null_wrapper>, *args=([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],), **kwargs={})
    427         close our socket."""
    428         try:
    429             # Use a NullContext to ensure that all StackContexts are run
    430             # inside our blanket exception handler rather than outside.
    431             with stack_context.NullContext():
--> 432                 callback(*args, **kwargs)
        callback = <function wrap.<locals>.null_wrapper>
        args = ([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],)
        kwargs = {}
    433         except:
    434             gen_log.error("Uncaught exception in ZMQStream callback",
    435                           exc_info=True)
    436             # Re-raise the exception so that IOLoop.handle_callback_exception

...........................................................................
/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py in null_wrapper(*args=([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],), **kwargs={})
    271         # Fast path when there are no active contexts.
    272         def null_wrapper(*args, **kwargs):
    273             try:
    274                 current_state = _state.contexts
    275                 _state.contexts = cap_contexts[0]
--> 276                 return fn(*args, **kwargs)
        args = ([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],)
        kwargs = {}
    277             finally:
    278                 _state.contexts = current_state
    279         null_wrapper._wrapped = True
    280         return null_wrapper

...........................................................................
/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py in dispatcher(msg=[<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>])
    278         if self.control_stream:
    279             self.control_stream.on_recv(self.dispatch_control, copy=False)
    280 
    281         def make_dispatcher(stream):
    282             def dispatcher(msg):
--> 283                 return self.dispatch_shell(stream, msg)
        msg = [<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>]
    284             return dispatcher
    285 
    286         for s in self.shell_streams:
    287             s.on_recv(make_dispatcher(s), copy=False)

...........................................................................
/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py in dispatch_shell(self=<ipykernel.ipkernel.IPythonKernel object>, stream=<zmq.eventloop.zmqstream.ZMQStream object>, msg={'buffers': [], 'content': {'allow_stdin': True, 'code': "# Select the best ML matcher using CV\nresult = e..._matcher='f1', random_state=0)\nresult['cv_stats']", 'silent': False, 'stop_on_error': True, 'store_history': True, 'user_expressions': {}}, 'header': {'date': datetime.datetime(2018, 4, 9, 22, 55, 57, 877960, tzinfo=tzutc()), 'msg_id': 'd4f05b125aa74f1488e59f34f480e163', 'msg_type': 'execute_request', 'session': 'fe885e0981af4e288f1b98856a899cbb', 'username': 'username', 'version': '5.2'}, 'metadata': {}, 'msg_id': 'd4f05b125aa74f1488e59f34f480e163', 'msg_type': 'execute_request', 'parent_header': {}})
    228             self.log.warn("Unknown message type: %r", msg_type)
    229         else:
    230             self.log.debug("%s: %s", msg_type, msg)
    231             self.pre_handler_hook()
    232             try:
--> 233                 handler(stream, idents, msg)
        handler = <bound method Kernel.execute_request of <ipykernel.ipkernel.IPythonKernel object>>
        stream = <zmq.eventloop.zmqstream.ZMQStream object>
        idents = [b'fe885e0981af4e288f1b98856a899cbb']
        msg = {'buffers': [], 'content': {'allow_stdin': True, 'code': "# Select the best ML matcher using CV\nresult = e..._matcher='f1', random_state=0)\nresult['cv_stats']", 'silent': False, 'stop_on_error': True, 'store_history': True, 'user_expressions': {}}, 'header': {'date': datetime.datetime(2018, 4, 9, 22, 55, 57, 877960, tzinfo=tzutc()), 'msg_id': 'd4f05b125aa74f1488e59f34f480e163', 'msg_type': 'execute_request', 'session': 'fe885e0981af4e288f1b98856a899cbb', 'username': 'username', 'version': '5.2'}, 'metadata': {}, 'msg_id': 'd4f05b125aa74f1488e59f34f480e163', 'msg_type': 'execute_request', 'parent_header': {}}
    234             except Exception:
    235                 self.log.error("Exception in message handler:", exc_info=True)
    236             finally:
    237                 self.post_handler_hook()

...........................................................................
/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py in execute_request(self=<ipykernel.ipkernel.IPythonKernel object>, stream=<zmq.eventloop.zmqstream.ZMQStream object>, ident=[b'fe885e0981af4e288f1b98856a899cbb'], parent={'buffers': [], 'content': {'allow_stdin': True, 'code': "# Select the best ML matcher using CV\nresult = e..._matcher='f1', random_state=0)\nresult['cv_stats']", 'silent': False, 'stop_on_error': True, 'store_history': True, 'user_expressions': {}}, 'header': {'date': datetime.datetime(2018, 4, 9, 22, 55, 57, 877960, tzinfo=tzutc()), 'msg_id': 'd4f05b125aa74f1488e59f34f480e163', 'msg_type': 'execute_request', 'session': 'fe885e0981af4e288f1b98856a899cbb', 'username': 'username', 'version': '5.2'}, 'metadata': {}, 'msg_id': 'd4f05b125aa74f1488e59f34f480e163', 'msg_type': 'execute_request', 'parent_header': {}})
    394         if not silent:
    395             self.execution_count += 1
    396             self._publish_execute_input(code, parent, self.execution_count)
    397 
    398         reply_content = self.do_execute(code, silent, store_history,
--> 399                                         user_expressions, allow_stdin)
        user_expressions = {}
        allow_stdin = True
    400 
    401         # Flush output before sending the reply.
    402         sys.stdout.flush()
    403         sys.stderr.flush()

...........................................................................
/usr/local/lib/python3.6/dist-packages/ipykernel/ipkernel.py in do_execute(self=<ipykernel.ipkernel.IPythonKernel object>, code="# Select the best ML matcher using CV\nresult = e..._matcher='f1', random_state=0)\nresult['cv_stats']", silent=False, store_history=True, user_expressions={}, allow_stdin=True)
    203 
    204         self._forward_input(allow_stdin)
    205 
    206         reply_content = {}
    207         try:
--> 208             res = shell.run_cell(code, store_history=store_history, silent=silent)
        res = undefined
        shell.run_cell = <bound method ZMQInteractiveShell.run_cell of <ipykernel.zmqshell.ZMQInteractiveShell object>>
        code = "# Select the best ML matcher using CV\nresult = e..._matcher='f1', random_state=0)\nresult['cv_stats']"
        store_history = True
        silent = False
    209         finally:
    210             self._restore_input()
    211 
    212         if res.error_before_exec is not None:

...........................................................................
/usr/local/lib/python3.6/dist-packages/ipykernel/zmqshell.py in run_cell(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, *args=("# Select the best ML matcher using CV\nresult = e..._matcher='f1', random_state=0)\nresult['cv_stats']",), **kwargs={'silent': False, 'store_history': True})
    532             )
    533         self.payload_manager.write_payload(payload)
    534 
    535     def run_cell(self, *args, **kwargs):
    536         self._last_traceback = None
--> 537         return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
        self.run_cell = <bound method ZMQInteractiveShell.run_cell of <ipykernel.zmqshell.ZMQInteractiveShell object>>
        args = ("# Select the best ML matcher using CV\nresult = e..._matcher='f1', random_state=0)\nresult['cv_stats']",)
        kwargs = {'silent': False, 'store_history': True}
    538 
    539     def _showtraceback(self, etype, evalue, stb):
    540         # try to preserve ordering of tracebacks and print statements
    541         sys.stdout.flush()

...........................................................................
/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py in run_cell(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, raw_cell="# Select the best ML matcher using CV\nresult = e..._matcher='f1', random_state=0)\nresult['cv_stats']", store_history=True, silent=False, shell_futures=True)
   2657         -------
   2658         result : :class:`ExecutionResult`
   2659         """
   2660         try:
   2661             result = self._run_cell(
-> 2662                 raw_cell, store_history, silent, shell_futures)
        raw_cell = "# Select the best ML matcher using CV\nresult = e..._matcher='f1', random_state=0)\nresult['cv_stats']"
        store_history = True
        silent = False
        shell_futures = True
   2663         finally:
   2664             self.events.trigger('post_execute')
   2665             if not silent:
   2666                 self.events.trigger('post_run_cell', result)

...........................................................................
/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py in _run_cell(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, raw_cell="# Select the best ML matcher using CV\nresult = e..._matcher='f1', random_state=0)\nresult['cv_stats']", store_history=True, silent=False, shell_futures=True)
   2780                 self.displayhook.exec_result = result
   2781 
   2782                 # Execute the user code
   2783                 interactivity = 'none' if silent else self.ast_node_interactivity
   2784                 has_raised = self.run_ast_nodes(code_ast.body, cell_name,
-> 2785                    interactivity=interactivity, compiler=compiler, result=result)
        interactivity = 'last_expr'
        compiler = <IPython.core.compilerop.CachingCompiler object>
   2786                 
   2787                 self.last_execution_succeeded = not has_raised
   2788                 self.last_execution_result = result
   2789 

...........................................................................
/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py in run_ast_nodes(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, nodelist=[<_ast.Assign object>, <_ast.Expr object>], cell_name='<ipython-input-37-b203e7457d37>', interactivity='last', compiler=<IPython.core.compilerop.CachingCompiler object>, result=<ExecutionResult object at 7f3e8949aba8, executi...rue silent=False shell_futures=True> result=None>)
   2898 
   2899         try:
   2900             for i, node in enumerate(to_run_exec):
   2901                 mod = ast.Module([node])
   2902                 code = compiler(mod, cell_name, "exec")
-> 2903                 if self.run_code(code, result):
        self.run_code = <bound method InteractiveShell.run_code of <ipykernel.zmqshell.ZMQInteractiveShell object>>
        code = <code object <module> at 0x7f3e894c2a50, file "<ipython-input-37-b203e7457d37>", line 2>
        result = <ExecutionResult object at 7f3e8949aba8, executi...rue silent=False shell_futures=True> result=None>
   2904                     return True
   2905 
   2906             for i, node in enumerate(to_run_interactive):
   2907                 mod = ast.Interactive([node])

...........................................................................
/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py in run_code(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, code_obj=<code object <module> at 0x7f3e894c2a50, file "<ipython-input-37-b203e7457d37>", line 2>, result=<ExecutionResult object at 7f3e8949aba8, executi...rue silent=False shell_futures=True> result=None>)
   2958         outflag = True  # happens in more places, so it's easier as default
   2959         try:
   2960             try:
   2961                 self.hooks.pre_run_code_hook()
   2962                 #rprint('Running code', repr(code_obj)) # dbg
-> 2963                 exec(code_obj, self.user_global_ns, self.user_ns)
        code_obj = <code object <module> at 0x7f3e894c2a50, file "<ipython-input-37-b203e7457d37>", line 2>
        self.user_global_ns = {'A':          id  \
0        a0   
1        a1   
2  ...i7-6700hq 32gb ram ...  

[2976 rows x 5 columns], 'A1':          id  \
2561  a2561   
1796  a1796   
256...00tus lenovo 20hd000tus  

[146 rows x 5 columns], 'B':          id  \
0        w0   
1        w1   
2  ...64 laptop b camera hp   

[4847 rows x 5 columns], 'B1':          id  \
2011  w2011   
2349  w2349   
226...d42ll/a apple mqd42ll/a  

[200 rows x 5 columns], 'C':        _id ltable_id rtable_id  \
0        0    ...am 1tb hdd + 128gb ...  

[6713 rows x 7 columns], 'C1':        _id ltable_id rtable_id  \
0        0    ...gb ssd windows 10 home  

[6513 rows x 5 columns], 'C2':      _id ltable_id rtable_id  \
0      0      a2...el i7 processor 32gb...  

[727 rows x 7 columns], 'C3':       _id ltable_id rtable_id  \
0       0      ...hdd nvidia geforce ...  

[2066 rows x 5 columns], 'C4':       _id ltable_id rtable_id  \
0       0      ...hdd nvidia geforce ...  

[2158 rows x 7 columns], 'G':        _id ltable_id rtable_id  \
972    972    ...    1  
5159      1  
6352      1  
6588      1  , ...}
        self.user_ns = {'A':          id  \
0        a0   
1        a1   
2  ...i7-6700hq 32gb ram ...  

[2976 rows x 5 columns], 'A1':          id  \
2561  a2561   
1796  a1796   
256...00tus lenovo 20hd000tus  

[146 rows x 5 columns], 'B':          id  \
0        w0   
1        w1   
2  ...64 laptop b camera hp   

[4847 rows x 5 columns], 'B1':          id  \
2011  w2011   
2349  w2349   
226...d42ll/a apple mqd42ll/a  

[200 rows x 5 columns], 'C':        _id ltable_id rtable_id  \
0        0    ...am 1tb hdd + 128gb ...  

[6713 rows x 7 columns], 'C1':        _id ltable_id rtable_id  \
0        0    ...gb ssd windows 10 home  

[6513 rows x 5 columns], 'C2':      _id ltable_id rtable_id  \
0      0      a2...el i7 processor 32gb...  

[727 rows x 7 columns], 'C3':       _id ltable_id rtable_id  \
0       0      ...hdd nvidia geforce ...  

[2066 rows x 5 columns], 'C4':       _id ltable_id rtable_id  \
0       0      ...hdd nvidia geforce ...  

[2158 rows x 7 columns], 'G':        _id ltable_id rtable_id  \
972    972    ...    1  
5159      1  
6352      1  
6588      1  , ...}
   2964             finally:
   2965                 # Reset our crash handler in place
   2966                 sys.excepthook = old_excepthook
   2967         except SystemExit as e:

...........................................................................
/home/liang/Workspace/cs839-project/stage-3/<ipython-input-37-b203e7457d37> in <module>()
      1 # Select the best ML matcher using CV
      2 result = em.select_matcher([dt, rf, svm, ln, lg, nb], table=H, 
      3         exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'match'],
      4         k=5,
----> 5         target_attr='match', metric_to_select_matcher='f1', random_state=0)
      6 result['cv_stats']

...........................................................................
/usr/local/lib/python3.6/dist-packages/py_entitymatching/matcherselector/mlmatcherselection.py in select_matcher(matchers=[<py_entitymatching.matcher.dtmatcher.DTMatcher object>, <py_entitymatching.matcher.rfmatcher.RFMatcher object>, <py_entitymatching.matcher.svmmatcher.SVMMatcher object>, <py_entitymatching.matcher.linregmatcher.LinRegMatcher object>, <py_entitymatching.matcher.logregmatcher.LogRegMatcher object>, <py_entitymatching.matcher.nbmatcher.NBMatcher object>], x=array([[ 5.        ,  0.        ,  0.        ,  ...         nan,  0.62264151,
         0.72739297]]), y=array([1, 0, 1, 1, 0, 0, 0]), table=       _id ltable_id rtable_id  id_id_lev_dist  ...         0.727393      0  

[7 rows x 30 columns], exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'match'], target_attr='match', metric_to_select_matcher='f1', metrics_to_display=['precision', 'recall', 'f1'], k=5, n_jobs=-1, random_state=0)
    119         mean_score_list = []
    120         # Run the cross validation for each matcher
    121         for m in matchers:
    122             # Use scikit learn's cross validation to get the matcher and the list
    123             #  of scores (one for each fold).
--> 124             matcher, scores = cross_validation(m, x, y, met, k, random_state, n_jobs)
        matcher = undefined
        scores = undefined
        m = <py_entitymatching.matcher.dtmatcher.DTMatcher object>
        x = array([[ 5.        ,  0.        ,  0.        ,  ...         nan,  0.62264151,
         0.72739297]])
        y = array([1, 0, 1, 1, 0, 0, 0])
        met = 'precision'
        k = 5
        random_state = 0
        n_jobs = -1
    125             # Fill a dictionary based on the matcher and the scores.
    126             val_list = [matcher.get_name(), matcher, k]
    127             val_list.extend(scores)
    128             val_list.append(pd.np.mean(scores))

...........................................................................
/usr/local/lib/python3.6/dist-packages/py_entitymatching/matcherselector/mlmatcherselection.py in cross_validation(matcher=<py_entitymatching.matcher.dtmatcher.DTMatcher object>, x=array([[ 5.        ,  0.        ,  0.        ,  ...         nan,  0.62264151,
         0.72739297]]), y=array([1, 0, 1, 1, 0, 0, 0]), metric='precision', k=5, random_state=0, n_jobs=-1)
    157     # Use KFold function from scikit learn to create a ms object that can be
    158     # used for cross_val_score function.
    159     cv = KFold(k, shuffle=True, random_state=random_state)
    160     # Call the scikit-learn's cross_val_score function
    161     scores = cross_val_score(matcher.clf, x, y, scoring=metric, cv=cv,
--> 162                              n_jobs=n_jobs)
        n_jobs = -1
    163     # Finally, return the matcher along with the scores.
    164     return matcher, scores
    165 
    166 

...........................................................................
/usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_validation.py in cross_val_score(estimator=DecisionTreeClassifier(class_weight=None, criter...lse, random_state=0,
            splitter='best'), X=array([[ 5.        ,  0.        ,  0.        ,  ...         nan,  0.62264151,
         0.72739297]]), y=array([1, 0, 1, 1, 0, 0, 0]), groups=None, scoring='precision', cv=KFold(n_splits=5, random_state=0, shuffle=True), n_jobs=-1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs')
    337     cv_results = cross_validate(estimator=estimator, X=X, y=y, groups=groups,
    338                                 scoring={'score': scorer}, cv=cv,
    339                                 return_train_score=False,
    340                                 n_jobs=n_jobs, verbose=verbose,
    341                                 fit_params=fit_params,
--> 342                                 pre_dispatch=pre_dispatch)
        pre_dispatch = '2*n_jobs'
    343     return cv_results['test_score']
    344 
    345 
    346 def _fit_and_score(estimator, X, y, scorer, train, test, verbose,

...........................................................................
/usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_validation.py in cross_validate(estimator=DecisionTreeClassifier(class_weight=None, criter...lse, random_state=0,
            splitter='best'), X=array([[ 5.        ,  0.        ,  0.        ,  ...         nan,  0.62264151,
         0.72739297]]), y=array([1, 0, 1, 1, 0, 0, 0]), groups=None, scoring={'score': make_scorer(precision_score)}, cv=KFold(n_splits=5, random_state=0, shuffle=True), n_jobs=-1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', return_train_score=False)
    201     scores = parallel(
    202         delayed(_fit_and_score)(
    203             clone(estimator), X, y, scorers, train, test, verbose, None,
    204             fit_params, return_train_score=return_train_score,
    205             return_times=True)
--> 206         for train, test in cv.split(X, y, groups))
        cv.split = <bound method _BaseKFold.split of KFold(n_splits=5, random_state=0, shuffle=True)>
        X = array([[ 5.        ,  0.        ,  0.        ,  ...         nan,  0.62264151,
         0.72739297]])
        y = array([1, 0, 1, 1, 0, 0, 0])
        groups = None
    207 
    208     if return_train_score:
    209         train_scores, test_scores, fit_times, score_times = zip(*scores)
    210         train_scores = _aggregate_score_dicts(train_scores)

...........................................................................
/usr/local/lib/python3.6/dist-packages/sklearn/externals/joblib/parallel.py in __call__(self=Parallel(n_jobs=-1), iterable=<generator object cross_validate.<locals>.<genexpr>>)
    784             if pre_dispatch == "all" or n_jobs == 1:
    785                 # The iterable was consumed all at once by the above for loop.
    786                 # No need to wait for async callbacks to trigger to
    787                 # consumption.
    788                 self._iterating = False
--> 789             self.retrieve()
        self.retrieve = <bound method Parallel.retrieve of Parallel(n_jobs=-1)>
    790             # Make sure that we get a last message telling us we are done
    791             elapsed_time = time.time() - self._start_time
    792             self._print('Done %3i out of %3i | elapsed: %s finished',
    793                         (len(self._output), len(self._output),

---------------------------------------------------------------------------
Sub-process traceback:
---------------------------------------------------------------------------
ValueError                                         Mon Apr  9 17:55:57 2018
PID: 6523                                    Python 3.6.5: /usr/bin/python3
...........................................................................
/usr/local/lib/python3.6/dist-packages/sklearn/externals/joblib/parallel.py in __call__(self=<sklearn.externals.joblib.parallel.BatchedCalls object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        self.items = [(<function _fit_and_score>, (DecisionTreeClassifier(class_weight=None, criter...lse, random_state=0,
            splitter='best'), array([[ 5.        ,  0.        ,  0.        ,  ...         nan,  0.62264151,
         0.72739297]]), array([1, 0, 1, 1, 0, 0, 0]), {'score': make_scorer(precision_score)}, array([0, 1, 3, 4, 5]), array([2, 6]), 0, None, None), {'return_times': True, 'return_train_score': False})]
    132 
    133     def __len__(self):
    134         return self._size
    135 

...........................................................................
/usr/local/lib/python3.6/dist-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0=<list_iterator object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        func = <function _fit_and_score>
        args = (DecisionTreeClassifier(class_weight=None, criter...lse, random_state=0,
            splitter='best'), array([[ 5.        ,  0.        ,  0.        ,  ...         nan,  0.62264151,
         0.72739297]]), array([1, 0, 1, 1, 0, 0, 0]), {'score': make_scorer(precision_score)}, array([0, 1, 3, 4, 5]), array([2, 6]), 0, None, None)
        kwargs = {'return_times': True, 'return_train_score': False}
    132 
    133     def __len__(self):
    134         return self._size
    135 

...........................................................................
/usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator=DecisionTreeClassifier(class_weight=None, criter...lse, random_state=0,
            splitter='best'), X=array([[ 5.        ,  0.        ,  0.        ,  ...         nan,  0.62264151,
         0.72739297]]), y=array([1, 0, 1, 1, 0, 0, 0]), scorer={'score': make_scorer(precision_score)}, train=array([0, 1, 3, 4, 5]), test=array([2, 6]), verbose=0, parameters=None, fit_params={}, return_train_score=False, return_parameters=False, return_n_test_samples=False, return_times=True, error_score='raise')
    453 
    454     try:
    455         if y_train is None:
    456             estimator.fit(X_train, **fit_params)
    457         else:
--> 458             estimator.fit(X_train, y_train, **fit_params)
        estimator.fit = <bound method DecisionTreeClassifier.fit of Deci...se, random_state=0,
            splitter='best')>
        X_train = array([[ 5.        ,  0.        ,  0.        ,  ...         nan,  0.57534247,
         0.69102332]])
        y_train = array([1, 0, 1, 0, 0])
        fit_params = {}
    459 
    460     except Exception as e:
    461         # Note fit time as time until error
    462         fit_time = time.time() - start_time

...........................................................................
/usr/local/lib/python3.6/dist-packages/sklearn/tree/tree.py in fit(self=DecisionTreeClassifier(class_weight=None, criter...lse, random_state=0,
            splitter='best'), X=array([[ 5.        ,  0.        ,  0.        ,  ...         nan,  0.57534247,
         0.69102332]]), y=array([1, 0, 1, 0, 0]), sample_weight=None, check_input=True, X_idx_sorted=None)
    785 
    786         super(DecisionTreeClassifier, self).fit(
    787             X, y,
    788             sample_weight=sample_weight,
    789             check_input=check_input,
--> 790             X_idx_sorted=X_idx_sorted)
        X_idx_sorted = None
    791         return self
    792 
    793     def predict_proba(self, X, check_input=True):
    794         """Predict class probabilities of the input samples X.

...........................................................................
/usr/local/lib/python3.6/dist-packages/sklearn/tree/tree.py in fit(self=DecisionTreeClassifier(class_weight=None, criter...lse, random_state=0,
            splitter='best'), X=array([[ 5.        ,  0.        ,  0.        ,  ...         nan,  0.57534247,
         0.69102332]]), y=array([1, 0, 1, 0, 0]), sample_weight=None, check_input=True, X_idx_sorted=None)
    111     def fit(self, X, y, sample_weight=None, check_input=True,
    112             X_idx_sorted=None):
    113 
    114         random_state = check_random_state(self.random_state)
    115         if check_input:
--> 116             X = check_array(X, dtype=DTYPE, accept_sparse="csc")
        X = array([[ 5.        ,  0.        ,  0.        ,  ...         nan,  0.57534247,
         0.69102332]])
    117             y = check_array(y, ensure_2d=False, dtype=None)
    118             if issparse(X):
    119                 X.sort_indices()
    120 

...........................................................................
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in check_array(array=array([[ 5.        ,  0.        ,  0.        ,  ...0.5753425 ,
         0.6910233 ]], dtype=float32), accept_sparse='csc', dtype=<class 'numpy.float32'>, order=None, copy=False, force_all_finite=True, ensure_2d=True, allow_nd=False, ensure_min_samples=1, ensure_min_features=1, warn_on_dtype=False, estimator=None)
    448             array = array.astype(np.float64)
    449         if not allow_nd and array.ndim >= 3:
    450             raise ValueError("Found array with dim %d. %s expected <= 2."
    451                              % (array.ndim, estimator_name))
    452         if force_all_finite:
--> 453             _assert_all_finite(array)
        array = array([[ 5.        ,  0.        ,  0.        ,  ...0.5753425 ,
         0.6910233 ]], dtype=float32)
    454 
    455     shape_repr = _shape_repr(array.shape)
    456     if ensure_min_samples > 0:
    457         n_samples = _num_samples(array)

...........................................................................
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in _assert_all_finite(X=array([[ 5.        ,  0.        ,  0.        ,  ...0.5753425 ,
         0.6910233 ]], dtype=float32))
     39     # everything is finite; fall back to O(n) space np.isfinite to prevent
     40     # false positives from overflow in sum method.
     41     if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())
     42             and not np.isfinite(X).all()):
     43         raise ValueError("Input contains NaN, infinity"
---> 44                          " or a value too large for %r." % X.dtype)
        X.dtype = dtype('float32')
     45 
     46 
     47 def assert_all_finite(X):
     48     """Throw a ValueError if X contains NaN or infinity.

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
___________________________________________________________________________
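
The traceback above ends in `ValueError: Input contains NaN, infinity or a value too large for dtype('float32')`, which suggests that some feature vectors in `H` hold missing values that the scikit-learn estimators reject. One common remedy is to impute each feature column before running cross-validation. The sketch below illustrates mean imputation in plain Python; `impute_column_means` and the sample rows are hypothetical and are not part of *py_entitymatching*.

```python
import math

def impute_column_means(rows):
    """Return a copy of rows with NaN entries replaced by their column mean.

    Hypothetical helper for illustration only; a column that has no
    non-NaN value at all is filled with 0.0.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        present = [r[j] for r in rows if not math.isnan(r[j])]
        means.append(sum(present) / len(present) if present else 0.0)
    return [[means[j] if math.isnan(v) else v for j, v in enumerate(r)]
            for r in rows]

# Made-up feature vectors with missing similarity scores
features = [
    [5.0, float('nan'), 0.5],
    [3.0, 0.2, float('nan')],
    [1.0, 0.4, 0.7],
]
imputed = impute_column_means(features)
print(imputed)
```

In the notebook itself, the same effect would typically be obtained by imputing `H` before calling `em.select_matcher`. *py_entitymatching* is understood to offer `em.impute_table` for this purpose; check the documentation of the installed version for its exact parameters.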

### Debugging matcher

The best matcher is not yet maximizing F1, so we debug it to see what might be wrong.
To do this, we first split the feature vectors into training and test sets.

In [38]:
#  Split feature vectors into train and test
UV = em.split_train_test(H, train_proportion=0.5)
U = UV['train']
V = UV['test']
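
As a rough picture of what a 50/50 split does, the sketch below shuffles a list of rows and cuts it at the requested proportion. It assumes a random shuffle with a fixed seed; `split_rows` is a hypothetical stand-in and may differ from how `em.split_train_test` is implemented.

```python
import random

def split_rows(rows, train_proportion=0.5, seed=0):
    """Shuffle rows and split them into train and test lists (hypothetical helper)."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible across runs
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_proportion)
    return {'train': shuffled[:cut], 'test': shuffled[cut:]}

# Ten made-up row identifiers
parts = split_rows(range(10), train_proportion=0.5)
print(len(parts['train']), len(parts['test']))
```

Every row lands in exactly one of the two parts, so the matcher is never tested on rows it was trained on.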

Next, we debug the matcher using the GUI. For the purposes of this guide, we debug the random forest matcher.

In [40]:
# Debug the random forest matcher using the GUI
em.vis_debug_rf(rf, U, V, 
        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'match'],
        target_attr='match')

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

From the GUI, we observe that phone numbers seem to be an important attribute, but they appear in different formats. The current features do not capture this difference, so adding a feature that accounts for it can potentially improve the F1 numbers.

In [None]:
def phone_phone_feature(ltuple, rtuple):
    # Normalize both phone numbers: use '-' as the separator and drop spaces
    p1 = ltuple.phone.replace('/', '-').replace(' ', '')
    p2 = rtuple.phone.replace('/', '-').replace(' ', '')
    # Return 1.0 if the normalized numbers are identical, 0.0 otherwise
    return 1.0 if p1 == p2 else 0.0
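
Because missing values already caused the `ValueError` seen earlier, it may be worth guarding the phone feature against missing phone numbers, since calling `.replace` on a non-string value would raise an error. The variant below is a sketch under that assumption: `normalize_phone`, `phone_phone_feature_safe`, the `Row` tuple, and the sample phone numbers are all hypothetical, and returning 0.0 for a missing value is one possible design choice rather than the notebook's own behavior.

```python
from collections import namedtuple

def normalize_phone(value):
    """Map '/' to '-' and remove spaces; return None for missing values."""
    if not isinstance(value, str):
        return None
    return value.replace('/', '-').replace(' ', '')

def phone_phone_feature_safe(ltuple, rtuple):
    p1 = normalize_phone(ltuple.phone)
    p2 = normalize_phone(rtuple.phone)
    # Treat a missing phone number on either side as a non-match
    if p1 is None or p2 is None:
        return 0.0
    return 1.0 if p1 == p2 else 0.0

# Illustrative tuples; real tuples come from tables A and B
Row = namedtuple('Row', ['phone'])
same = phone_phone_feature_safe(Row('310/246-1501'), Row('310-246-1501'))
missing = phone_phone_feature_safe(Row('310/246-1501'), Row(float('nan')))
print(same, missing)
```

The first pair differs only in its separator, so it counts as a match after normalization; the second pair has a missing value and is scored as a non-match instead of raising an exception.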

In [None]:
feature_table = em.get_features_for_matching(A, B)
em.add_blackbox_feature(feature_table, 'phone_phone_feature', phone_phone_feature)

Now we repeat the earlier steps with the updated feature table: extracting the feature vectors, imputing the table, and selecting the best matcher using cross-validation.

In [None]:
H = em.extract_feature_vecs(I, feature_table=feature_table, attrs_after='gold', show_progress=False)

In [None]:
# Select the best ML matcher using CV
result = em.select_matcher([dt, rf, svm, ln, lg, nb], table=H, 
        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold'],
        k=5,
        target_attr='gold', metric_to_select_matcher='f1', random_state=0)
result['cv_stats']

Now we observe that the best matcher achieves a better F1. Let us stop here and proceed to evaluating the best matcher on unseen data (the evaluation set).
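
The selection step can be pictured as averaging each matcher's per-fold scores and keeping the matcher with the highest mean. The sketch below shows that idea in plain Python; `pick_best_matcher`, the matcher names, and the scores are made up for illustration and are not output from this notebook.

```python
from statistics import mean

def pick_best_matcher(fold_scores):
    """Return the name with the highest mean score, plus all mean scores.

    Hypothetical helper: fold_scores maps a matcher name to its list of
    per-fold scores (for example, F1 on each of k=5 folds).
    """
    averages = {name: mean(scores) for name, scores in fold_scores.items()}
    best = max(averages, key=averages.get)
    return best, averages

# Made-up per-fold F1 scores for two placeholder matchers
fold_f1 = {
    'matcher_a': [0.90, 0.85, 0.95, 0.80, 0.90],
    'matcher_b': [0.95, 0.90, 1.00, 0.90, 0.95],
}
best, averages = pick_best_matcher(fold_f1)
print(best, averages)
```

Averaging over folds reduces the chance that a matcher is selected because of one favorable split.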

## Evaluating the matching output

Evaluating the matching outputs for the evaluation set typically involves the following four steps:
1. Converting the evaluation set to feature vectors
2. Training matcher using the feature vectors extracted from the development set
3. Predicting the evaluation set using the trained matcher
4. Evaluating the predicted matches

### Converting the evaluation set to feature vectors

As before, we convert the evaluation set into feature vectors using the feature table.

In [None]:
# Convert J into a set of feature vectors using feature table
L = em.extract_feature_vecs(J, feature_table=feature_table,
                            attrs_after='gold', show_progress=False)

### Training the selected matcher

Now we train the matcher using all of the feature vectors from the development set. For the purposes of this guide, we use random forest as the selected matcher.

In [None]:
# Train using feature vectors from I 
rf.fit(table=H, 
       exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold'], 
       target_attr='gold')

### Predicting the matches

Next, we predict the matches for the evaluation set (using the feature vectors extracted from it).

In [None]:
# Predict on L 
predictions = rf.predict(table=L, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold'], 
              append=True, target_attr='predicted', inplace=False)

### Evaluating the predictions

Finally, we evaluate the accuracy of the predicted outputs.

In [None]:
# Evaluate the predictions
eval_result = em.eval_matches(predictions, 'gold', 'predicted')
em.print_eval_summary(eval_result)
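
For reference, the evaluation summary is usually expressed as precision, recall, and F1 computed from the gold and predicted labels. The sketch below spells out those formulas in plain Python; `precision_recall_f1` and the label lists are hypothetical and are not results from this notebook.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall and F1 for binary labels (1 = match)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # true positives
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # false positives
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Made-up labels for seven tuple pairs
gold = [1, 0, 1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(gold, predicted)
print(precision, recall, f1)
```

F1 is the harmonic mean of precision and recall, so it stays low unless both are reasonably high, which is why it serves as the target metric in this workflow.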