In [1]:
%load_ext autotime
import sys
sys.path.append('/Users/pradap/Documents/Research/Python-Package/anhaid/magellan/')

This demo notebook illustrates how to match two tables using Magellan. Our goal is to come up with a workflow to match the DBLP and ACM datasets. Specifically, we want to achieve precision greater than 95% and get recall as high as possible. The datasets contain information about papers published in top database conferences.

First, we need to import the Magellan package as follows:

In [2]:
# Import libraries
import magellan as mg
import pandas as pd
import py_stringsimjoin as ssj
import os, sys
import qgrid

time: 2.12 s


Matching two tables typically consists of the following four steps:
1. Loading the input tables
2. Exploring and cleaning the input tables
3. Blocking the input tables to get a candidate set
4. Matching the tuple pairs in the candidate set

# 1. Loading the input tables

We begin by loading the input tables.

In [3]:
dblp_dataset_path = os.sep.join(['..', 'DBLP_ACM', 'DBLP_cleaned.csv'])
acm_dataset_path = os.sep.join(['..', 'DBLP_ACM', 'ACM_cleaned.csv'])

time: 1.15 ms


In [4]:
# Load csv files as dataframes and set the key attribute in the dataframe
A = mg.read_csv_metadata(dblp_dataset_path, key='id')
B = mg.read_csv_metadata(acm_dataset_path, key='id')

time: 24.3 ms


In [5]:
print('Number of tuples in A: ' + str(len(A)))
print('Number of tuples in B: ' + str(len(B)))
print('Number of tuples in A X B (i.e the cartesian product): ' + str(len(A)*len(B)))

Number of tuples in A: 2616
Number of tuples in B: 2294
Number of tuples in A X B (i.e the cartesian product): 6001104
time: 1.55 ms


In [6]:
A.head(2)

Unnamed: 0,id,title,authors,venue,year
0,journals/sigmod/Mackay99,Semantic Integration of Environmental Models for Application to Global Information Systems and D...,D. Scott Mackay,SIGMOD Record,1999
1,conf/vldb/PoosalaI96,Estimation of Query-Result Distribution and its Application in Parallel-Join Load Balancing,"Viswanath Poosala, Yannis E. Ioannidis",VLDB,1996


time: 17.4 ms


In [7]:
B.head(2)

Unnamed: 0,id,title,authors,venue,year
0,304586,The WASA2 object-oriented workflow management system,"Gottfried Vossen, Mathias Weske",International Conference on Management of Data,1999
1,304587,A user-centered interface for querying distributed multimedia databases,"Isabel F. Cruz, Kimberly M. James",International Conference on Management of Data,1999


time: 7.51 ms


In [8]:
# Display the key attributes of table A and B.
mg.get_key(A), mg.get_key(B)

('id', 'id')

time: 2.46 ms


In [9]:
# If the tables are large we can downsample the tables like this
A1, B1 = mg.down_sample(A, B, 500, 1)
# But for the demo, we will use the entire table A and B

0%                          100%
[##############################] | ETA: 00:00:00

time: 1.11 s



Total time elapsed: 00:00:01


In [10]:
len(A1), len(B1)

(483, 500)

time: 2.59 ms


# 2. Exploring and cleaning the input tables

In the next step, we explore, understand and clean the input tables.

In [11]:
# Profile the input DBLP dataset
ssj.profile_table_for_join(A)

Attribute,Unique values,Missing values,Comments
id,2616 (100.0%),0 (0.0%),This attribute can be used as a key attribute.
title,2521 (96.37%),0 (0.0%),
authors,2316 (88.53%),0 (0.0%),
venue,5 (0.19%),0 (0.0%),
year,10 (0.38%),0 (0.0%),


time: 30.4 ms


In [12]:
# Profile the input ACM dataset
ssj.profile_table_for_join(B)

Attribute,Unique values,Missing values,Comments
id,2294 (100.0%),0 (0.0%),This attribute can be used as a key attribute.
title,2230 (97.21%),0 (0.0%),
authors,2009 (87.58%),14 (0.61%),Joining on this attribute will ignore 14 (0.61%) rows.
venue,5 (0.22%),0 (0.0%),
year,10 (0.44%),0 (0.0%),


time: 32 ms


In [13]:
# Explore the input DBLP dataset
#qgrid.show_grid(A.sample(30, random_state=0))

time: 638 µs


In [14]:
# Explore the input ACM dataset
#qgrid.show_grid(B.sample(30, random_state=0))

time: 516 µs


We observe that the 'authors' column in the DBLP dataset uses '?' to represent missing values. We will replace it with NaN so that missing values are represented uniformly.

In [15]:
# The 'authors' column uses '?' for missing values; replace it with NaN
A.replace({'authors': {'?': float('nan')}}, inplace=True)

time: 1.82 ms


Now, let us check the number of papers published across different conferences.

We observe that the 'venue' column has different names representing the same conference. We will normalize the names in the 'venue' column across the two datasets.

In [16]:
# Normalize attr. values
B.replace({'venue':{
            'The VLDB Journal — The International Journal on Very Large Data Bases':'VLDB J.',
            'Very Large Data Bases': 'VLDB',
            'ACM SIGMOD Record': 'SIGMOD Record'
        }}, inplace=True)

time: 3.27 ms


# 2. Blocking to create candidate tuple pairs

Before we do the matching, we would like to remove the obviously non-matching tuple pairs from the input tables. This would reduce the number of tuple pairs considered for matching. 

Magellan provides four different blockers: (1) attribute equivalence, (2) overlap, (3) rule-based, and (4) black-box. Refer to [api reference] for more details. The user can mix and match these blockers to form a blocking sequence that is applied to the input tables.

For the matching problem at hand, we know that two conference papers published in different years cannot match. So, we decide to apply an attribute equivalence blocker on the 'year' attribute.
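To make the idea concrete, here is a minimal pure-Python sketch of attribute equivalence blocking. It is an illustration only: the function name `attr_equiv_block` and the toy rows are our own assumptions, and it does not reproduce Magellan's implementation.

```python
from collections import defaultdict

def attr_equiv_block(left_rows, right_rows, attr):
    """Return (left_id, right_id) pairs whose `attr` values are equal."""
    # Index the right table by the blocking attribute.
    right_index = defaultdict(list)
    for row in right_rows:
        right_index[row[attr]].append(row['id'])
    # Probe the index with each left row; only equal values survive.
    pairs = []
    for row in left_rows:
        for right_id in right_index.get(row[attr], []):
            pairs.append((row['id'], right_id))
    return pairs

# Toy rows (hypothetical, not taken from DBLP or ACM).
left = [{'id': 'a1', 'year': 1999}, {'id': 'a2', 'year': 2001}]
right = [{'id': 'b1', 'year': 1999}, {'id': 'b2', 'year': 1999},
         {'id': 'b3', 'year': 2003}]

candidates = attr_equiv_block(left, right, 'year')
print(candidates)  # [('a1', 'b1'), ('a1', 'b2')]
```

Only 2 of the 6 possible pairs survive in this toy case, which is the same kind of reduction we expect on the real tables.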

In [17]:
# Plan
# A, B ------ attribute equivalence [year] -----> C1

time: 776 µs


In [18]:
# Create attribute equivalence blocker
ab = mg.AttrEquivalenceBlocker()
# Block tables using the 'year' attribute: pairs with the same year are included in the candidate set
C1 = ab.block_tables(A, B, 'year', 'year', 
                   l_output_attrs=['title', 'authors', 'year'],
                   r_output_attrs=['title', 'authors', 'year']
                   )

time: 238 ms


In [19]:
# Check the number of rows in C1
len(C1)

601284

time: 2.1 ms


In [20]:
# Display first two rows from C1
C1.head(2)

Unnamed: 0,_id,ltable_id,rtable_id,ltable_title,ltable_authors,ltable_year,rtable_title,rtable_authors,rtable_year
0,0,journals/sigmod/Mackay99,304586,Semantic Integration of Environmental Models for Application to Global Information Systems and D...,D. Scott Mackay,1999,The WASA2 object-oriented workflow management system,"Gottfried Vossen, Mathias Weske",1999
1,1,journals/sigmod/Mackay99,304587,Semantic Integration of Environmental Models for Application to Global Information Systems and D...,D. Scott Mackay,1999,A user-centered interface for querying distributed multimedia databases,"Isabel F. Cruz, Kimberly M. James",1999


time: 10.5 ms


The number of tuple pairs considered for matching is reduced to 601284 (from 6001104), but we would want to make sure that the blocker did not drop any potential matches. We could debug the blocker output in Magellan as follows:
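To give a feel for what such a debugging step looks for, here is a small pure-Python sketch that ranks the tuple pairs a blocker dropped by how similar they look. The word-level Jaccard measure, the brute-force comparison, and the name `likely_missed_pairs` are assumptions made for illustration; this is not a description of how `mg.debug_blocker` works internally.

```python
def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def likely_missed_pairs(candidates, left_rows, right_rows, attr, top_k=3):
    """Rank pairs that are NOT in the candidate set by similarity on attr."""
    kept = set(candidates)
    missed = []
    for l in left_rows:
        for r in right_rows:
            if (l['id'], r['id']) in kept:
                continue
            sim = jaccard(l[attr].lower().split(), r[attr].lower().split())
            missed.append((sim, l['id'], r['id']))
    # Highest similarity first; ids break ties deterministically.
    missed.sort(key=lambda t: (-t[0], t[1], t[2]))
    return missed[:top_k]

# Toy rows (hypothetical). A year blocker would keep only (a1, b2).
left = [{'id': 'a1', 'title': 'Book Review Column', 'year': 2002}]
right = [{'id': 'b1', 'title': 'Book review column', 'year': 2003},
         {'id': 'b2', 'title': 'Query optimization', 'year': 2002}]
year_candidates = [('a1', 'b2')]

report = likely_missed_pairs(year_candidates, left, right, 'title')
print(report)  # [(1.0, 'a1', 'b1')]
```

A dropped pair with similarity 1.0 is a strong hint that the blocking sequence is too aggressive.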

In [21]:
# Debug blocker output
dbg = mg.debug_blocker(C1, A, B, output_size=200)

time: 1.69 s


In [22]:
# Display first few tuple pairs from the debug_blocker's output
dbg.head()

Unnamed: 0,_id,similarity,ltable_id,rtable_id,ltable_title,ltable_authors,ltable_venue,rtable_title,rtable_authors,rtable_venue
0,0,1.0,journals/sigmod/Hammer01,601859,Treasurer's Message,Joachim Hammer,SIGMOD Record,Treasurer's message,Joachim Hammer,SIGMOD Record
1,1,1.0,journals/sigmod/Aberer02,776994,Book Review Column,Karl Aberer,SIGMOD Record,Book review column,Karl Aberer,SIGMOD Record
2,2,1.0,journals/sigmod/Aberer02a,776994,Book Review Column,Karl Aberer,SIGMOD Record,Book review column,Karl Aberer,SIGMOD Record
3,3,1.0,journals/sigmod/Aberer02b,604274,Book Review Column,Karl Aberer,SIGMOD Record,Book review column,Karl Aberer,SIGMOD Record
4,4,1.0,journals/sigmod/Aberer03b,601865,Book review column,Karl Aberer,SIGMOD Record,Book review column,Karl Aberer,SIGMOD Record


time: 12.8 ms


From the debug blocker's output we observe that the current blocker drops quite a few potential matches. We would want to update the blocking sequence to avoid dropping these potential matches.

For this dataset, we know that for two conference papers to match, their author names must overlap. We can use an overlap blocker for this purpose. Finally, we take the union of the outputs from the attribute equivalence blocker and the overlap blocker to get a consolidated candidate set.
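The following pure-Python sketch shows the two ideas in this plan: keeping pairs whose author tokens overlap, and taking the union of two candidate sets. The helper name `overlap_block`, the `min_overlap` parameter, and the toy rows are hypothetical; Magellan's overlap blocker has its own tokenization options that are not modeled here.

```python
import re

def word_tokens(text):
    """Lowercased alphanumeric tokens, so 'Aberer,' and 'aberer' agree."""
    return set(re.findall(r'\w+', text.lower()))

def overlap_block(left_rows, right_rows, attr, min_overlap=1):
    """Keep pairs sharing at least `min_overlap` tokens on `attr`."""
    pairs = []
    for l in left_rows:
        l_tokens = word_tokens(l[attr])
        for r in right_rows:
            if len(l_tokens & word_tokens(r[attr])) >= min_overlap:
                pairs.append((l['id'], r['id']))
    return pairs

# Toy rows (hypothetical, no missing values).
left = [{'id': 'a1', 'authors': 'Karl Aberer'},
        {'id': 'a2', 'authors': 'Joachim Hammer'}]
right = [{'id': 'b1', 'authors': 'karl aberer'},
         {'id': 'b2', 'authors': 'Gottfried Vossen'}]

c1 = [('a2', 'b2')]                       # stand-in for the year blocker output
c2 = overlap_block(left, right, 'authors')
c = sorted(set(c1) | set(c2))             # union removes duplicate pairs
print(c)  # [('a1', 'b1'), ('a2', 'b2')]
```

Because the union is taken over pair identifiers, a pair produced by both blockers appears only once in the consolidated set.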

In [23]:
# Updated blocking sequence
# A, B ------ attribute equivalence [year] -----> C1--
#                                                     |----> C
# A, B ------ overlap blocker [authors] --------> C2--

time: 933 µs


In [24]:
# Create an overlap blocker
ob = mg.OverlapBlocker()
# Apply overlap blocker on 'authors' attribute
C2 = ob.block_tables(A, B, 'authors', 'authors', 
                   l_output_attrs=['title', 'authors', 'year'],
                   r_output_attrs=['title', 'authors', 'year']
                   )

0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:00:01


time: 1.74 s


In [25]:
# Check the number of rows in C2
len(C2)

287414

time: 2.22 ms


In [26]:
# Display first two rows from C2
C2.head(2)

Unnamed: 0,_id,ltable_id,rtable_id,ltable_title,ltable_authors,ltable_year,rtable_title,rtable_authors,rtable_year
0,0,journals/vldb/LaurentLSV01,304586,Monotonic complements for independent data warehouses,gottfried laurent dominique nicolas vossen lechtenbrger spyratos jens,2001,The WASA2 object-oriented workflow management system,weske mathias vossen gottfried,1999
1,1,journals/sigmod/McClatcheyV97,304586,Workshop on Workflow Management in Scientific and Engineering Applications - Report,mcclatchey vossen richard gottfried,1997,The WASA2 object-oriented workflow management system,weske mathias vossen gottfried,1999


time: 10.4 ms


In [27]:
# Combine blocker outputs
C = mg.combine_blocker_outputs_via_union([C1, C2])

time: 1.27 s


In [28]:
# Check the number of rows in the consolidated candidate set.
len(C)

857777

time: 2.27 ms


We observe that the number of tuple pairs considered for matching increases to 857777 (from 601284). Now let us debug the blocker output again to check whether the current blocking sequence drops any potential matches.

In [29]:
# Debug again
dbg = mg.debug_blocker(C, A, B)

time: 3.07 s


In [30]:
# Display first few rows from the debugger output
dbg.head(3)

Unnamed: 0,_id,similarity,ltable_id,rtable_id,ltable_title,ltable_authors,ltable_venue,rtable_title,rtable_authors,rtable_venue
0,0,0.555556,journals/sigmod/Dogac02,945727,Guest Editor's Introduction,Asuman Dogac,SIGMOD Record,Guest editor's introduction,Karl Aberer,SIGMOD Record
1,1,0.555556,journals/sigmod/Dogac98,945727,Guest Editor's Introduction,Asuman Dogac,SIGMOD Record,Guest editor's introduction,Karl Aberer,SIGMOD Record
2,2,0.5,journals/sigmod/Snodgrass98a,641001,Reminiscences on Influential Papers,Richard T. Snodgrass,SIGMOD Record,Reminiscences on influential papers,Kenneth A. Ross,SIGMOD Record


time: 17.7 ms


We observe that the current blocking sequence does not drop obvious potential matches, so we can proceed with the matching step. A subtle point to note here is that debugging the blocker output provides a practical stopping criterion for modifying the blocking sequence.

As a note, it is easy to patch the blocking sequence with a user-defined blocker. As an example, if we want to include the tuple pairs that match on the first two digits of the year, we can write a blocker that takes in a function, executes it over an attribute in the input tables, and then applies attribute equivalence blocking on the output of the function.
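As a rough illustration of this idea, the sketch below applies a user-supplied function to the blocking attribute on both sides and then keeps the pairs whose transformed values are equal. It is written in plain Python with hypothetical names (`function_equiv_block`, `first_two_digits`) and toy rows; it is not the code in the demo blocker script, which is not shown here.

```python
def function_equiv_block(left_rows, right_rows, attr, fn):
    """Apply fn to attr on both sides, then keep pairs with equal results."""
    right_index = {}
    for r in right_rows:
        right_index.setdefault(fn(r[attr]), []).append(r['id'])
    return [(l['id'], right_id)
            for l in left_rows
            for right_id in right_index.get(fn(l[attr]), [])]

def first_two_digits(year):
    """Map a four-digit year to its first two digits, e.g. 1999 -> 19."""
    return year // 100

# Toy rows (hypothetical).
left = [{'id': 'a1', 'year': 1999}, {'id': 'a2', 'year': 2001}]
right = [{'id': 'b1', 'year': 1996}, {'id': 'b2', 'year': 2003}]

pairs = function_equiv_block(left, right, 'year', first_two_digits)
print(pairs)  # [('a1', 'b1'), ('a2', 'b2')]
```

Note that this blocker is looser than plain equivalence on 'year': 1999 and 1996 now fall into the same block.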

In [31]:
%run demo_blocker.py

time: 3.18 ms


In [32]:
myab = MyBlocker()

def fn(x):
    return x//100

myab.set_blackbox_function(fn)

time: 2.36 ms


In [33]:
C3 = myab.block_tables(A, B, 'year', 'year')

time: 1.1 s


In [34]:
C3.head()

Unnamed: 0,_id,ltable_id,rtable_id,ltable_year,rtable_year
0,0,journals/sigmod/Mackay99,304586,1999,1999
1,1,journals/sigmod/Mackay99,304587,1999,1999
2,2,journals/sigmod/Mackay99,304589,1999,1999
3,3,journals/sigmod/Mackay99,304590,1999,1999
4,4,journals/sigmod/Mackay99,304582,1999,1999


time: 11.9 ms


# 3. Matching tuple pairs in the candidate set

In this step, we match the tuple pairs in the candidate set. Specifically, we use a learning-based method for matching.

This typically involves the following four steps:

1. Sampling and labeling the candidate set
2. Splitting the labeled data into development and evaluation sets
3. Selecting the best learning-based matcher using the development set
4. Evaluating the selected matcher using the evaluation set

## 3.1 Sampling and labeling the candidate set

First, we randomly sample 450 tuple pairs for labeling purposes.

In [35]:
# Sample candidate set
S = mg.sample_table(C, 450)

time: 309 ms


In [36]:
# Label S and specify the attribute name for the label column
# L = mg.label_table(S, 'gold')

time: 794 µs


In [54]:
# Load the pre-labeled data
L = mg.read_csv_metadata('../DBLP_ACM/dblp_acm_demo_labels_clean.csv', ltable=A, rtable=B)
# Display the number of rows in the labeled data set
len(L)

415

time: 8.39 ms


## 3.2 Splitting the labeled data into development and evaluation set

In this step, we split the labeled data into two sets: development and evaluation. Specifically, the development set is used to come up with the best learning-based matcher, and the evaluation set is used to evaluate the selected matcher on unseen data.
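The split itself is simple. A minimal sketch, assuming a random shuffle followed by a cut at the requested proportion (the helper `split_rows` and the fixed seed are our own choices for illustration, not Magellan's API):

```python
import random

def split_rows(rows, train_proportion=0.7, seed=0):
    """Shuffle a copy of rows and cut it into 'train' and 'test' parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_proportion * len(shuffled)))
    return {'train': shuffled[:cut], 'test': shuffled[cut:]}

labeled = list(range(10))          # stand-in for ten labeled tuple pairs
parts = split_rows(labeled)
print(len(parts['train']), len(parts['test']))  # 7 3
```

The two parts are disjoint and together cover all labeled pairs, so the evaluation part stays unseen while the matcher is being developed.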

In [55]:
# Split the labeled data into development and evaluation set
development_evaluation = mg.split_train_test(L, train_proportion=0.7)
development =  development_evaluation['train']
evaluation = development_evaluation['test']

time: 8.11 ms


## 3.3 Select the best learning-based matcher

Selecting the best learning-based matcher typically involves the following steps:
1. Creating a set of learning-based matchers
2. Creating features
3. Extracting feature vectors
4. Selecting the best learning-based matcher using k-fold cross validation
5. Debugging the matcher (and possibly repeat the above steps)

### 3.3.1 Creating a set of learning-based matchers

First, we need to create a set of learning-based matchers. The following matchers are supported in Magellan: (1) decision tree, (2) random forest, (3) naive Bayes, (4) SVM, (5) logistic regression, and (6) linear regression.

In [56]:
# Create a set of ML-matchers
dt = mg.DTMatcher(name='DecisionTree')
svm = mg.SVMMatcher(name='SVM')
rf = mg.RFMatcher(name='RF')
nb = mg.NBMatcher(name='NB')
lg = mg.LogRegMatcher(name='LogReg')
ln = mg.LinRegMatcher(name='LinReg')

time: 2.61 ms


### 3.3.2 Creating features

Next, we need to create a set of features for the development set. Magellan provides a way to automatically generate features based on the attributes in the input tables. For the purposes of this guide, we use the automatically generated features.
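To illustrate what a feature is in this setting, the sketch below computes a few similarity scores for one tuple pair: Jaccard similarity over 3-grams, a normalized Levenshtein similarity, and an exact-match flag. The feature names, the unpadded 3-gram tokenization, and the normalization by the longer string are assumptions made for this example; the automatically generated Magellan features may be defined differently.

```python
def qgrams(s, q=3):
    """Set of lowercase character q-grams (no padding)."""
    s = s.lower()
    return {s[i:i + q] for i in range(len(s) - q + 1)} if len(s) >= q else {s}

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def lev_dist(a, b):
    """Levenshtein edit distance via the classic row-by-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def lev_sim(a, b):
    """1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 - lev_dist(a, b) / longest if longest else 1.0

def feature_vector(l, r):
    return {
        'title_jac_3gram': jaccard(qgrams(l['title']), qgrams(r['title'])),
        'title_lev_sim': lev_sim(l['title'].lower(), r['title'].lower()),
        'year_exact': int(l['year'] == r['year']),
    }

# Toy tuple pair (hypothetical values).
l = {'title': 'Book Review Column', 'year': 2002}
r = {'title': 'Book review column', 'year': 2002}
fv = feature_vector(l, r)
print(fv)
```

Each labeled tuple pair is turned into one such vector, and the matchers are trained on these vectors rather than on the raw attribute values.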

In [57]:
# Generate features
feature_table = mg.get_features_for_matching(A, B)

time: 123 ms


In [58]:
# List the names of the features generated
feature_table['feature_name']

0             title_title_jac_qgm_3_qgm_3
1         title_title_cos_dlm_dc0_dlm_dc0
2                         title_title_mel
3                    title_title_lev_dist
4                     title_title_lev_sim
5         authors_authors_jac_qgm_3_qgm_3
6     authors_authors_cos_dlm_dc0_dlm_dc0
7                     authors_authors_mel
8                authors_authors_lev_dist
9                 authors_authors_lev_sim
10            venue_venue_jac_qgm_3_qgm_3
11        venue_venue_cos_dlm_dc0_dlm_dc0
12        venue_venue_jac_dlm_dc0_dlm_dc0
13                        venue_venue_mel
14                   venue_venue_lev_dist
15                    venue_venue_lev_sim
16                        venue_venue_nmw
17                         venue_venue_sw
18                          year_year_exm
19                          year_year_anm
20                     year_year_lev_dist
21                      year_year_lev_sim
Name: feature_name, dtype: object

time: 3.58 ms


We observe that 22 features were generated. As a first step, let's say that we decide to use only the 'year'-related features.

In [59]:
# Select the year related features
feature_subset_iter1 = feature_table[18:22]

time: 1.46 ms


In [60]:
# List the names of the features selected
feature_subset_iter1['feature_name']

18         year_year_exm
19         year_year_anm
20    year_year_lev_dist
21     year_year_lev_sim
Name: feature_name, dtype: object

time: 3.28 ms


### 3.3.3 Extracting feature vectors

In this step, we extract feature vectors using the development set and the created features.

In [61]:
# Extract feature vectors
feature_vectors_dev = mg.extract_feature_vecs(development, 
                            feature_table=feature_subset_iter1, 
                            attrs_after='gold') 

0%                          100%
[##############################] | ETA: 00:00:00

time: 271 ms



Total time elapsed: 00:00:00


In [62]:
# Display first few rows
feature_vectors_dev.head(3)

Unnamed: 0,_id,ltable_id,rtable_id,year_year_exm,year_year_anm,year_year_lev_dist,year_year_lev_sim,gold
2,2,conf/sigmod/AcharyaAFZ95,223816,1,1.0,0.0,1.0,1
35,35,conf/sigmod/ChaudhuriDN01,375694,1,1.0,0.0,1.0,1
247,247,conf/vldb/VieiraM03,640997,1,1.0,0.0,1.0,0


time: 11 ms


Next, we might have to impute the feature vectors, as they might contain missing values. First, let us check whether there are any missing values in the extracted feature vectors.
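For reference, mean imputation can be sketched in a few lines of plain Python. The helper `impute_mean` and the toy rows are hypothetical and only illustrate the strategy; they are not Magellan's imputation routine.

```python
import math

def impute_mean(rows, feature_names):
    """Return copies of rows with NaN feature values replaced by the column mean."""
    means = {}
    for name in feature_names:
        present = [r[name] for r in rows if not math.isnan(r[name])]
        means[name] = sum(present) / len(present) if present else 0.0
    # Only the listed feature columns are touched; id columns pass through.
    return [{k: (means[k] if k in means and math.isnan(v) else v)
             for k, v in r.items()} for r in rows]

# Toy feature vectors (hypothetical) with one missing value.
rows = [{'_id': 0, 'sim': 1.0},
        {'_id': 1, 'sim': float('nan')},
        {'_id': 2, 'sim': 0.5}]

has_missing = any(math.isnan(r['sim']) for r in rows)
imputed = impute_mean(rows, ['sim'])
print(has_missing, imputed[1]['sim'])  # True 0.75
```

Identifier and label columns are excluded from imputation by simply not listing them in `feature_names`.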

In [81]:
# Check if the feature vectors contain missing values
# A return value of True means that there are missing values
feature_vectors_dev.isnull().values.any()

False

time: 3.09 ms


### 3.3.4 Selecting the best matcher using cross-validation

Now, we select the best matcher using k-fold cross-validation. For the purposes of this demo, we use five-fold cross-validation and the 'precision' metric to select the best matcher.
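The mechanics of k-fold cross-validation with a precision score can be sketched as follows. This is a simplified illustration under our own assumptions: folds are contiguous and unshuffled, and the "matcher" is a hypothetical one-feature threshold rule rather than any of the Magellan matchers.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def precision(gold, predicted):
    """True positives divided by all positive predictions."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    positives = sum(predicted)
    return tp / positives if positives else 0.0

def fit_threshold(xs, ys):
    """Toy matcher: midpoint between the mean match and non-match score."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def cross_val_precision(xs, ys, k=5):
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        threshold = fit_threshold(train_x, train_y)
        preds = [int(xs[i] >= threshold) for i in fold]
        scores.append(precision([ys[i] for i in fold], preds))
    return scores

# Toy similarity scores and gold labels (hypothetical).
xs = [0.9, 0.1, 0.8, 0.2, 0.95, 0.3, 0.85, 0.15, 0.7, 0.25]
ys = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
scores = cross_val_precision(xs, ys, k=5)
print(scores, sum(scores) / len(scores))
```

The per-fold scores and their mean correspond to the "Fold 1" through "Fold 5" and "Mean score" columns reported in the cross-validation statistics.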

In [82]:
# Select the best ML matcher using CV
result = mg.select_matcher([dt, rf, svm, nb, lg, ln], table=feature_vectors_dev, 
        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold'],
        k=5,
        target_attr='gold', metric='precision')

time: 241 ms


In [83]:
# Check the cross validation statistics
result['cv_stats']

Unnamed: 0,Name,Matcher,Num folds,Fold 1,Fold 2,Fold 3,Fold 4,Fold 5,Mean score
0,DecisionTree,<magellan.matcher.dtmatcher.DTMatcher object at 0x11c7c3978>,5,0.711111,0.871795,0.692308,0.697674,0.772727,0.749123
1,RF,<magellan.matcher.rfmatcher.RFMatcher object at 0x11c7c3908>,5,0.894737,0.675676,0.840909,0.644444,0.695652,0.750284
2,SVM,<magellan.matcher.svmmatcher.SVMMatcher object at 0x11c7c3ac8>,5,0.74359,0.75,0.714286,0.785714,0.744681,0.747654
3,NB,<magellan.matcher.nbmatcher.NBMatcher object at 0x11c7c33c8>,5,0.7,0.666667,0.682927,0.777778,0.888889,0.743252
4,LogReg,<magellan.matcher.logregmatcher.LogRegMatcher object at 0x11c7c3668>,5,0.833333,0.666667,0.771429,0.787234,0.682927,0.748318
5,LinReg,<magellan.matcher.linregmatcher.LinRegMatcher object at 0x11c7c36a0>,5,0.682927,0.727273,0.833333,0.684211,0.8,0.745549


time: 13.1 ms


### 3.3.5 Debugging matcher

We observe that the best matcher is not getting us to the precision that we expect (i.e., > 95%). We debug the matcher to see what might be wrong.

To do this, first we split the feature vectors into train and test.

In [84]:
## Split feature vectors into train and test
train_test = mg.split_train_test(feature_vectors_dev, train_proportion=0.5)
train = train_test['train']
test = train_test['test']

time: 11.4 ms


Next, we debug the matcher using the GUI. For the purposes of this demo, we debug the random forest matcher.

In [85]:
# Debug the random forest matcher using the GUI
mg.vis_debug_rf(rf, train, test, 
        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold'],
        target_attr='gold')

time: 7.06 s


From the GUI, we observe that using only the 'year'-related features results in a lot of false positives. So we decide to use all the features in the feature table (title, authors, venue, and year). For venue, one could additionally write a custom feature using an external package such as 'fuzzywuzzy'.

In [86]:
# Select all features from the feature table
feature_subset_iter2 = feature_table

time: 776 µs


Now, we repeat the earlier steps with the updated feature table: extracting the feature vectors, imputing the table, and selecting the best matcher using cross-validation.

In [87]:
# Get new set of features
feature_vectors_dev = mg.extract_feature_vecs(development, feature_table=feature_subset_iter2, attrs_after='gold')

0%                          100%
[##############################] | ETA: 00:00:00

time: 1.24 s



Total time elapsed: 00:00:01


In [88]:
# Check if the feature vectors contain missing values
# A return value of True means that there are missing values
feature_vectors_dev.isnull().values.any()

False

time: 4.28 ms


In [96]:
# Impute the feature vectors from the development set
feature_vectors_dev = mg.impute_table(feature_vectors_dev,
                exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold'],
                strategy='mean')

time: 21.3 ms


In [89]:
# Apply cross validation to find if there is a better matcher
result = mg.select_matcher([dt, rf, svm, nb, lg, ln], table=feature_vectors_dev, 
        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold'],
        target_attr='gold', metric='f1') 

time: 171 ms


In [90]:
result['cv_stats']

Unnamed: 0,Name,Matcher,Num folds,Fold 1,Fold 2,Fold 3,Fold 4,Fold 5,Mean score
0,DecisionTree,<magellan.matcher.dtmatcher.DTMatcher object at 0x11c7c3978>,5,1.0,1.0,0.969697,0.965517,0.966667,0.980376
1,RF,<magellan.matcher.rfmatcher.RFMatcher object at 0x11c7c3908>,5,1.0,1.0,0.969697,0.969697,0.96875,0.981629
2,SVM,<magellan.matcher.svmmatcher.SVMMatcher object at 0x11c7c3ac8>,5,0.851064,0.984127,0.862069,0.77551,0.925373,0.879629
3,NB,<magellan.matcher.nbmatcher.NBMatcher object at 0x11c7c33c8>,5,0.984127,1.0,0.984615,1.0,1.0,0.993748
4,LogReg,<magellan.matcher.logregmatcher.LogRegMatcher object at 0x11c7c3668>,5,0.986301,0.983607,0.964286,1.0,0.983051,0.983449
5,LinReg,<magellan.matcher.linregmatcher.LinRegMatcher object at 0x11c7c36a0>,5,1.0,0.962963,0.985915,1.0,0.984615,0.986699


time: 12.1 ms


We now observe that the best matcher reaches a mean score above 0.99 under cross-validation (this run selects on the 'f1' metric rather than 'precision'), so we can proceed to evaluating the matcher on unseen data (the evaluation set).

## 3.4 Evaluating the matching output

Evaluating the matching outputs for the evaluation set typically involves the following four steps:
1. Extracting the feature vectors
2. Training matcher using the feature vectors extracted from the development set
3. Predicting the evaluation set using the trained matcher
4. Evaluating the predicted matches

### 3.4.1 Extracting the feature vectors

As before, we extract the feature vectors (using the updated feature table and the evaluation set) and impute them (if necessary).

In [93]:
# Get new set of features
feature_vectors_eval = mg.extract_feature_vecs(evaluation, 
                                               feature_table=feature_subset_iter2, 
                                              attrs_after='gold')

0%                          100%
[##############################] | ETA: 00:00:00

time: 580 ms



Total time elapsed: 00:00:00


In [94]:
# Check if the feature vectors contain missing values
# A return value of True means that there are missing values
feature_vectors_eval.isnull().values.any()

False

time: 3.39 ms


In [95]:
# Impute feature vectors
feature_vectors_eval = mg.impute_table(feature_vectors_eval, 
                exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold'],
                strategy='mean')

time: 13.4 ms


### 3.4.2 Training the matcher

Now, we train the matcher using all of the feature vectors from the development set. For the purposes of this guide we use random forest as the selected matcher.

In [97]:
# Train using feature vectors from the development set
rf.fit(table=feature_vectors_dev, 
       exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold'], 
       target_attr='gold')


time: 13.5 ms


### 3.4.3 Predicting the matches
Next, we predict the matches for the evaluation set (using the feature vectors extracted from it).

In [99]:
# Predict M 
predictions = rf.predict(table=feature_vectors_eval, 
                         exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold'], 
                         append=True, 
                         target_attr='predicted', 
                         inplace=False)

time: 6.06 ms


### 3.4.4 Evaluating the matching output

Finally, we evaluate the predicted outputs.

In [102]:
# Evaluate the result
eval_result = mg.eval_matches(predictions, 'gold', 'predicted')
mg.print_eval_summary(eval_result)

Precision : 100.0% (157/157)
Recall : 100.0% (157/157)
F1 : 100.0%
False positives : 0 (out of 157 positive predictions)
False negatives : 0 (out of 133 negative predictions)
time: 13.7 ms
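For reference, the quantities in the summary above can be computed from the gold and predicted labels as in the following sketch. The function name `precision_recall_f1` and the toy labels are hypothetical; they are not the labels from this run.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall, and F1 from 0/1 gold and predicted labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {'precision': precision, 'recall': recall, 'f1': f1,
            'false_pos': fp, 'false_neg': fn}

# Toy labels (hypothetical): 3 true positives, 1 false positive, 1 false negative.
gold = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 1, 0, 1, 0]

summary = precision_recall_f1(gold, predicted)
print(summary)
```

Precision is measured against the positive predictions and recall against the gold matches, which is why the two denominators in the summary can differ in general.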
