# Introduction

This IPython notebook illustrates how to train a learning-based matcher and evaluate its predictions on held-out labeled data. First, we import the py_entitymatching package and the other libraries we need:

In [1]:
# Import py_entitymatching package
import py_entitymatching as em
import os
import pandas as pd

# Set the seed value 
seed = 0



In [2]:
# Get the datasets directory
datasets_dir = os.path.join(em.get_install_path(), 'datasets')

path_A = os.path.join(datasets_dir, 'dblp_demo.csv')
path_B = os.path.join(datasets_dir, 'acm_demo.csv')
path_labeled_data = os.path.join(datasets_dir, 'labeled_data_demo.csv')

In [3]:
# Load the two input tables, using 'id' as the key attribute
A = em.read_csv_metadata(path_A, key='id')
B = em.read_csv_metadata(path_B, key='id')
# Load the pre-labeled data
S = em.read_csv_metadata(path_labeled_data, 
                         key='_id',
                         ltable=A, rtable=B, 
                         fk_ltable='ltable_id', fk_rtable='rtable_id')



In [4]:
# Split S into I (training set) and J (test set)
IJ = em.split_train_test(S, train_proportion=0.5, random_state=0)
I = IJ['train']
J = IJ['test']

In [5]:
# Generate a set of features
F = em.get_features_for_matching(A, B, validate_inferred_attr_types=False)

In [6]:
# Convert I into feature vectors using F
H = em.extract_feature_vecs(I, 
                            feature_table=F, 
                            attrs_after='label',
                            show_progress=False)

# Compute accuracy of X (Logistic Regression) on J

Evaluating X involves the following steps:

1. Train X using H
2. Convert J into a set of feature vectors (L)
3. Predict on L using X
4. Evaluate the predictions

In [7]:
# Instantiate the matcher to evaluate.
lg = em.LogRegMatcher(name='LogReg', random_state=0)

In [16]:
# Train using feature vectors from I 
lg.fit(table=H, 
       exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'], 
       target_attr='label')

# Convert J into a set of feature vectors using F
L = em.extract_feature_vecs(J, feature_table=F,
                            attrs_after='label', show_progress=False)

# Predict on L. With return_probs=True, the predicted probabilities are returned
# as well; probs_attr names the attribute that stores them in the returned DataFrame.
predictions = lg.predict(table=L,
                         exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
                         append=True, target_attr='predicted', inplace=False,
                         return_probs=True, probs_attr='proba')

In [17]:
predictions[['_id', 'ltable_id', 'rtable_id', 'label', 'predicted', 'proba']].head(10)

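| | _id | ltable_id | rtable_id | label | predicted | proba |
|---|---|---|---|---|---|---|
| 124 | 124 | l1647 | r366 | 0 | 0 | 0.060258 |
| 54 | 54 | l332 | r1463 | 0 | 0 | 0.034182 |
| 268 | 268 | l1499 | r1725 | 0 | 0 | 0.006461 |
| 293 | 293 | l759 | r1749 | 1 | 1 | 0.914074 |
| 230 | 230 | l1580 | r1711 | 1 | 1 | 0.982784 |
| 134 | 134 | l77 | r1283 | 1 | 1 | 0.993171 |
| 12 | 12 | l1657 | r110 | 0 | 0 | 0.004027 |
| 423 | 423 | l942 | r1473 | 1 | 1 | 0.96594 |
| 272 | 272 | l1011 | r1620 | 0 | 0 | 0.017932 |
| 76 | 76 | l1318 | r806 | 0 | 0 | 0.01386 |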
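
In the ten rows shown above, every pair with `proba` above 0.5 is predicted as a match and every pair below it as a non-match. The small check below makes that relationship explicit. It is only a sketch on the displayed rows: it assumes `proba` is the estimated probability of the match class, and the 0.5 cutoff is an assumption inferred from this output rather than a statement about how the library computes `predicted`.

```python
# (proba, predicted) pairs copied from the ten rows displayed above.
rows = [
    (0.060258, 0), (0.034182, 0), (0.006461, 0), (0.914074, 1),
    (0.982784, 1), (0.993171, 1), (0.004027, 0), (0.96594, 1),
    (0.017932, 0), (0.01386, 0),
]

# Assumed cutoff (hypothetical; inferred from the rows, not from the library).
threshold = 0.5

# Label a pair as a match (1) when its probability reaches the cutoff.
thresholded = [int(proba >= threshold) for proba, _ in rows]
predicted = [label for _, label in rows]

# True when thresholding reproduces the 'predicted' column for these rows.
labels_agree = thresholded == predicted
print(labels_agree)
```

Keeping the probabilities alongside the labels is useful when you want to inspect borderline pairs, that is, those whose `proba` lies close to the cutoff.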


In [10]:
# Evaluate the predictions
eval_result = em.eval_matches(predictions, 'label', 'predicted')
em.print_eval_summary(eval_result)

Precision : 100.0% (69/69)
Recall : 94.52% (69/73)
F1 : 97.18%
False positives : 0 (out of 69 positive predictions)
False negatives : 4 (out of 156 negative predictions)
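
The percentages in the summary follow directly from the reported counts. As a quick sanity check, the sketch below recomputes them from the 69 true positives, 0 false positives, and 4 false negatives shown above, using the standard definitions of precision, recall, and F1. The variable names are illustrative and not part of py_entitymatching.

```python
# Counts read off the evaluation summary above.
true_positives = 69.0
false_positives = 0.0
false_negatives = 4.0

# Standard definitions of the three metrics.
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)

print("Precision: {:.2%}".format(precision))
print("Recall:    {:.2%}".format(recall))
print("F1:        {:.2%}".format(f1))
```

Because F1 is the harmonic mean of precision and recall, it falls between the two values, and here it is pulled below the perfect precision by the four missed matches.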
