# Machine Learning aided Record Linkage (and Data Fusion?) - a comparison between different ML methods

## Group Components
* **Francesco Porto**
* **Francesco Stranieri**
* **Mattia Vincenzi**

## Abstract
Record Linkage is the process of finding records, in one or more datasets, that refer to the same entity. Traditionally, it is done by applying comparison rules to pairs of attributes from each dataset. In this project we investigate some possible Machine Learning approaches to Record Linkage and compare them with this standard rule-based approach.

## Python Record Linkage Toolkit
Throughout the project, we make use of a Python library called "Python Record Linkage Toolkit", which provides a simple framework to facilitate the process of Record Linkage. In the context of this library, the Record Linkage process is divided into five steps:

* Preprocessing
* Indexing
* Comparison
* Classification
* Evaluation

Please refer to the documentation available at the following link for further information:

https://recordlinkage.readthedocs.io/en/latest/index.html

In [125]:
!pip install recordlinkage
import recordlinkage as rl



## Dataset description
We use the FEBRL (Freely Extensible Biomedical Record Linkage) dataset because it provides the "golden links", i.e. the true matches needed to evaluate Record Linkage. This dataset contains 10000 records (5000 originals and 5000 duplicates, with one duplicate per original). The originals have been split from the duplicates into dataset4a.csv (containing the 5000 original records) and dataset4b.csv (containing the 5000 duplicate records).

In [156]:
from recordlinkage.datasets import load_febrl4

In [157]:
# set logging
rl.logging.set_verbosity(rl.logging.INFO)

In [158]:
# load datasets
print('Loading data...')
dfA, dfB, true_links = load_febrl4(return_links=True)
print(len(dfA), 'records in dataset A')
print(len(dfB), 'records in dataset B')
print(len(true_links), 'links between dataset A and B')

Loading data...
5000 records in dataset A
5000 records in dataset B
5000 links between dataset A and B


In [133]:
dfA
dfA.dtypes

given_name       object
surname          object
street_number    object
address_1        object
address_2        object
suburb           object
postcode         object
state            object
date_of_birth    object
soc_sec_id       object
dtype: object

Records that share the same numeric id represent the same entity: for example, `rec-0-org` in dataset A and `rec-0-dup-0` in dataset B.

In [135]:
dfA.loc['rec-0-org']

given_name               rachael
surname                     dent
street_number                  1
address_1            knox street
address_2        lakewood estate
suburb                    byford
postcode                    4129
state                        vic
date_of_birth           19280722
soc_sec_id               1683994
Name: rec-0-org, dtype: object

In [136]:
dfB.loc['rec-0-dup-0']

given_name               rachael
surname                     dent
street_number                  4
address_1            knox street
address_2        lakewood estate
suburb                    byford
postcode                    4129
state                        vic
date_of_birth           19280722
soc_sec_id               1683994
Name: rec-0-dup-0, dtype: object

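We split each dataset into **training** and **testing** sets for our ML models. The split is based on the numeric id of the records, so an original and its duplicate always end up on the same side.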

In [163]:
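def split_FEBRL_dataset(dataset, n):
    """Split a FEBRL dataframe: the first n records by numeric id go to training."""
    # 'rec-1070-org' -> 1070; the Series stays indexed by the full record id
    indexes = dataset.index.to_series().str.rsplit('-').str[1].astype(int).sort_values()
    training_indexes = indexes[:n]
    testing_indexes = indexes[n:]
    # select the rows whose record id belongs to each group
    training = dataset[(dataset.index).isin(training_indexes.index)]
    testing = dataset[(dataset.index).isin(testing_indexes.index)]
    return training, testing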
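
The idea behind the split can be sketched without pandas. The snippet below is only an illustration, not the function used in this notebook: `numeric_id` and `split_by_id` are hypothetical helper names, and the record ids are a small made-up sample that follows the FEBRL naming scheme (`rec-<id>-org`, `rec-<id>-dup-0`).

```python
def numeric_id(rec_id):
    """Extract the numeric part of a FEBRL-style id, e.g. 'rec-12-dup-0' -> 12."""
    return int(rec_id.split('-')[1])


def split_by_id(rec_ids, n):
    """Put the n records with the lowest numeric ids in training, the rest in testing."""
    ordered = sorted(rec_ids, key=numeric_id)
    return ordered[:n], ordered[n:]


# Hypothetical sample ids (not taken from the dataset)
ids_a = ['rec-3-org', 'rec-0-org', 'rec-2-org', 'rec-1-org']
ids_b = ['rec-1-dup-0', 'rec-3-dup-0', 'rec-0-dup-0', 'rec-2-dup-0']

train_a, test_a = split_by_id(ids_a, 3)
train_b, test_b = split_by_id(ids_b, 3)

# An original and its duplicate land on the same side of the split
same_side = {numeric_id(r) for r in train_a} == {numeric_id(r) for r in train_b}
print(train_a, test_a, same_side)
```

Because both datasets are cut at the same numeric id, every true link stays entirely inside either the training or the testing portion.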

In [164]:
training_A, testing_A = split_FEBRL_dataset(dfA, 4000)
training_A

rec_id,given_name,surname,street_number,address_1,address_2,suburb,postcode,state,date_of_birth,soc_sec_id
rec-1070-org,michaela,neumann,8,stanley street,miami,winston hills,4223,nsw,19151111,5304218
rec-1016-org,courtney,painter,12,pinkerton circuit,bega flats,richlands,4560,vic,19161214,4066625
rec-1288-org,vanessa,parr,905,macquoid place,broadbridge manor,south grafton,2135,sa,19951119,9239102
rec-3585-org,mikayla,malloney,37,randwick road,avalind,hoppers crossing,4552,vic,19860208,7207688
rec-298-org,blake,howie,1,cutlack street,belmont park belted galloway stud,budgewoi,6017,vic,19250301,5180548
...,...,...,...,...,...,...,...,...,...,...
rec-1622-org,bethanie,menzies,120,archibald street,krismark,belmont,2287,nsw,19871019,8046929
rec-2153-org,annabel,grierson,97,mclachlan crescent,lantana lodge,broome,2480,nsw,19840224,7676186
rec-1604-org,sienna,musolino,22,smeaton circuit,pangani,mckinnon,2700,nsw,19890525,4971506
rec-1003-org,bradley,matthews,2,jondol place,horseshoe ck,jacobs well,7018,sa,19481122,8927667


In [165]:
training_B, testing_B = split_FEBRL_dataset(dfB, 4000)
training_B

rec_id,given_name,surname,street_number,address_1,address_2,suburb,postcode,state,date_of_birth,soc_sec_id
rec-561-dup-0,elton,,3,light setreet,pinehill,windermere,3212,vic,19651013,1551941
rec-2642-dup-0,mitchell,maxon,47,edkins street,lochaoair,north ryde,3355,nsw,19390212,8859999
rec-608-dup-0,,white,72,lambrigg street,kelgoola,broadbeach waters,3159,vic,19620216,9731855
rec-3239-dup-0,elk i,menzies,1,lyster place,,northwood,2585,vic,19980624,4970481
rec-2886-dup-0,,garanggar,,may maxwell crescent,springettst arcade,forest hill,2342,vic,19921016,1366884
...,...,...,...,...,...,...,...,...,...,...
rec-3152-dup-0,ethan,reuter,,rivers street,haven caravn park,balcllyn,4571,nsw,19391123,3818774
rec-3363-dup-0,patrick,wevaer,100,allambee street,corcooan,preston,2681,nsw,19770725,5276236
rec-3131-dup-0,samuel,crofs,613,banjine street,kurrajong vlge,pengzin,2230,qld,19410531,4467228
rec-3815-dup-0,saah,beattih,60,kay's place,oldershaw court,ashfield,2047,vic,19500712,9435148


## Indexing
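Indexing is the process of generating the candidate links (pairs of records) between the two datasets. Comparing every record of one dataset with every record of the other would be too expensive, so in this example we use a technique called **Blocking**, which only keeps the pairs that agree on a given attribute. We add three blocks (surname, date of birth and social security id), and the resulting candidate set is their union: a pair is kept if it agrees on AT LEAST one of the specified attributes. When a dataset is linked against itself (deduplication), the library is also capable of returning each link only once (and not twice) by only looking at the upper triangular matrix of matches.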
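
To make the union of blocks concrete, here is a minimal pure-Python sketch. It is not how the toolkit implements `Block`; `block_pairs` is a hypothetical helper, and skipping records with a missing key is an assumption of this sketch. The three `dfB` records and the `dfA` record use values shown elsewhere in this notebook.

```python
from collections import defaultdict

# Blocking keys only; values copied from records shown in this notebook
records_a = {
    'rec-0-org': {'surname': 'dent', 'date_of_birth': '19280722', 'soc_sec_id': '1683994'},
}
records_b = {
    'rec-0-dup-0': {'surname': 'dent', 'date_of_birth': '19280722', 'soc_sec_id': '1683994'},
    'rec-1505-dup-0': {'surname': 'dent', 'date_of_birth': '19960112', 'soc_sec_id': '9836985'},
    'rec-561-dup-0': {'surname': None, 'date_of_birth': '19651013', 'soc_sec_id': '1551941'},
}


def block_pairs(left, right, key):
    """Return the pairs (left id, right id) whose records share the same value for key."""
    buckets = defaultdict(list)
    for rec_id, rec in right.items():
        if rec[key]:  # assumption: missing values never form a block
            buckets[rec[key]].append(rec_id)
    return {(l_id, r_id)
            for l_id, rec in left.items() if rec[key]
            for r_id in buckets.get(rec[key], [])}


# Union of the three blocks: agreeing on at least one key is enough
candidates = set()
for key in ('surname', 'date_of_birth', 'soc_sec_id'):
    candidates |= block_pairs(records_a, records_b, key)

print(sorted(candidates))
```

`rec-1505-dup-0` becomes a candidate only because of the shared surname, while `rec-561-dup-0` agrees on none of the keys and is never compared.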

In [137]:
from recordlinkage.index import Block
# start indexing
print('Build index...')
indexer = rl.Index()
indexer.add(Block('surname'))
# OR
indexer.add(Block('date_of_birth'))
# OR
indexer.add(Block('soc_sec_id'))
candidate_links_training = indexer.index(training_A, training_B)
print(len(candidate_links_training), 'candidate links between dataset A and B')

Build index...
INFO:recordlinkage:indexing - initialize Index class
INFO:recordlinkage:indexing [1/?] - time: 0.50s - pairs: 58048/16000000 - rr: 0.99637
INFO:recordlinkage:indexing [1/?] - time: 0.50s - pairs_total: 58048/16000000 - rr_total: 0.99637
58048 candidate links between dataset A and B


In [138]:
candidate_links_training

MultiIndex([(  'rec-0-org',    'rec-0-dup-0'),
            (  'rec-0-org', 'rec-1505-dup-0'),
            (  'rec-0-org', 'rec-1636-dup-0'),
            (  'rec-0-org', 'rec-2074-dup-0'),
            (  'rec-0-org', 'rec-2683-dup-0'),
            (  'rec-0-org', 'rec-2724-dup-0'),
            (  'rec-0-org', 'rec-2894-dup-0'),
            (  'rec-1-org',    'rec-1-dup-0'),
            (  'rec-1-org', 'rec-1052-dup-0'),
            (  'rec-1-org', 'rec-2552-dup-0'),
            ...
            ('rec-999-org', 'rec-3681-dup-0'),
            ('rec-999-org', 'rec-3685-dup-0'),
            ('rec-999-org',  'rec-370-dup-0'),
            ('rec-999-org', 'rec-3766-dup-0'),
            ('rec-999-org', 'rec-3862-dup-0'),
            ('rec-999-org', 'rec-3913-dup-0'),
            ('rec-999-org', 'rec-3940-dup-0'),
            ('rec-999-org',  'rec-859-dup-0'),
            ('rec-999-org',  'rec-911-dup-0'),
            ('rec-999-org',  'rec-999-dup-0')],
           names=['rec_id_1', 'rec_id_2'], 

In [49]:
dfA.loc[candidate_links_training[0][0]]

given_name               rachael
surname                     dent
street_number                  1
address_1            knox street
address_2        lakewood estate
suburb                    byford
postcode                    4129
state                        vic
date_of_birth           19280722
soc_sec_id               1683994
Name: rec-0-org, dtype: object

In [50]:
dfB.loc[candidate_links_training[0][1]]

given_name               rachael
surname                     dent
street_number                  4
address_1            knox street
address_2        lakewood estate
suburb                    byford
postcode                    4129
state                        vic
date_of_birth           19280722
soc_sec_id               1683994
Name: rec-0-dup-0, dtype: object

In [51]:
dfB.loc[candidate_links_training[1][1]]

given_name                  emiily
surname                       dent
street_number                   27
address_1        gungurra crescent
address_2                 redlands
suburb                     whyalla
postcode                      3775
state                          nsw
date_of_birth             19960112
soc_sec_id                 9836985
Name: rec-1505-dup-0, dtype: object

## Comparison (aka "the classic approach")
**Comparison** refers to the process of evaluating every candidate link, attribute by attribute, in order to figure out which ones are actual matches. In order to compare attributes, we need to specify (for each attribute):
* A **metric** to be used
* A **threshold** to decide when the metric returns 1 (= the attribute matches) or 0 (= it does not match)

We decided not to require exact matches on strings: input errors (e.g. typos made by an employee) are common, so exact matching would be too strict.
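
As an illustration of "metric plus threshold", the sketch below implements the textbook Jaro-Winkler similarity in plain Python and turns it into a 0/1 score with the 0.85 threshold used in this notebook. The toolkit relies on its own string-similarity backend, so small numerical differences are possible; the function names here are hypothetical, and the `martha`/`marhta` pair is a standard textbook example rather than data from FEBRL.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted for a common prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


def binary_score(s1, s2, threshold=0.85):
    """1.0 if the similarity reaches the threshold, else 0.0."""
    return 1.0 if jaro_winkler(s1, s2) >= threshold else 0.0


print(round(jaro_winkler('martha', 'marhta'), 4))
print(binary_score('rachael', 'rachael'), binary_score('dent', 'neumann'))
```

A single transposed pair of letters still clears the 0.85 threshold, whereas two unrelated surnames do not.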

In [140]:
from recordlinkage.compare import Exact, String
# start comparing
print('Start comparing...')
comparer = rl.Compare()
comparer.add(String('given_name', 'given_name', method='jarowinkler',
                    threshold=0.85, label='given_name'))
comparer.add(String('surname', 'surname', method='jarowinkler',
                    threshold=0.85, label='surname'))
comparer.add(String('date_of_birth', 'date_of_birth', method='jarowinkler',
                    threshold=0.85, label='date_of_birth'))
comparer.add(String('suburb', 'suburb', method='jarowinkler', 
                    threshold=0.85, label='suburb'))
comparer.add(String('state', 'state', method='jarowinkler', 
                    threshold=0.85, label='state'))
# no method given for the address fields: String uses its default metric (Levenshtein)
comparer.add(String('address_1', 'address_1', 
                    threshold=0.85, label='address_1'))
comparer.add(String('address_2', 'address_2', 
                    threshold=0.85, label='address_2'))
features_training = comparer.compute(candidate_links_training, training_A, training_B)

print('feature shape', features_training.shape)

Start comparing...
INFO:recordlinkage:comparing - initialize Compare class
INFO:recordlinkage:comparing [1/?] - time: 21.84s - pairs: 58048
INFO:recordlinkage:comparing [1/?] - time: 21.84s - pairs_total: 58048
feature shape (58048, 7)


In [141]:
features_training.shape

(58048, 7)

In [54]:
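# Sum the comparison results.
# `features` (and `candidate_links` below) refer to the candidate pairs of the
# full datasets dfA and dfB (87132 pairs), not to `features_training`.
features.sum(axis=1).value_counts().sort_index(ascending=False)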

7.0     1725
6.0     1890
5.0     1004
4.0      310
3.0      995
2.0    20206
1.0    61002
dtype: int64

In [55]:
# no ML
matches = features[features.sum(axis=1) > 3]
print(len(matches))

4929
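
The rule applied in the cell above ("more than 3 agreeing attributes") can be reproduced directly from the value counts printed earlier. This is a small sketch in plain Python; `count_matches` is a hypothetical helper name.

```python
# Number of agreeing attributes -> number of candidate pairs
# (copied from the value counts printed above)
agreement_counts = {7: 1725, 6: 1890, 5: 1004, 4: 310, 3: 995, 2: 20206, 1: 61002}


def count_matches(counts, min_agreements):
    """Number of pairs with strictly more than min_agreements agreeing attributes."""
    return sum(n for score, n in counts.items() if score > min_agreements)


total_pairs = sum(agreement_counts.values())
print(total_pairs, count_matches(agreement_counts, 3))

# Moving the cut-off shows how sensitive the rule is to this choice
for cutoff in (2, 3, 4):
    print(cutoff, count_matches(agreement_counts, cutoff))
```

With a cut-off of 3 the rule selects 4929 pairs, the same number reported above; a cut-off of 2 would select more pairs than the 5000 true links.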


In [56]:
matches

rec_id_1,rec_id_2,given_name,surname,date_of_birth,suburb,state,address_1,address_2
rec-0-org,rec-0-dup-0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
rec-1-org,rec-1-dup-0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
rec-10-org,rec-10-dup-0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
rec-100-org,rec-100-dup-0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
rec-1000-org,rec-1000-dup-0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...
rec-995-org,rec-995-dup-0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
rec-996-org,rec-996-dup-0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
rec-997-org,rec-997-dup-0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
rec-998-org,rec-998-dup-0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [57]:
# return the confusion matrix
conf_noml = rl.confusion_matrix(true_links, matches, len(candidate_links))
print('confusion matrix')
print(conf_noml)

confusion matrix
[[ 4918    82]
 [   11 82121]]


In [58]:
# compute the F-score for this classification
fscore = rl.fscore(conf_noml)
print('fscore', fscore)
recall = rl.recall(true_links, matches)
print('recall', recall)
precision = rl.precision(true_links, matches)
print('precision', precision)

fscore 0.9906334978346258
recall 0.9836
precision 0.9977683100020288
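
For reference, the three scores can be recomputed by hand from the confusion matrix printed above. The sketch assumes the layout `[[TP, FN], [FP, TN]]`, which is consistent with the reported recall and precision; `precision_recall_fscore` is a hypothetical helper name.

```python
def precision_recall_fscore(cm):
    """Compute precision, recall and F-score from [[TP, FN], [FP, TN]]."""
    (tp, fn), (fp, _tn) = cm
    precision = tp / (tp + fp)   # share of predicted links that are true links
    recall = tp / (tp + fn)      # share of true links that were found
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore


# Confusion matrix of the threshold-based approach (values from the output above)
conf_noml_values = [[4918, 82], [11, 82121]]

precision, recall, fscore = precision_recall_fscore(conf_noml_values)
print('precision', precision)
print('recall', recall)
print('fscore', fscore)
```

The large number of true negatives (82121) does not enter any of the three scores, which is why they are preferred over plain accuracy on such an unbalanced problem.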


## Logistic Regression Classifier

In [59]:
# use the Logistic Regression Classifier
# with hand-set coefficients and intercept (no training), this classifier
# behaves like a deterministic, weighted record linkage rule
intercept = -11.0
coefficients = [1.5, 1.5, 8.0, 6.0, 2.5, 6.5, 5.0]

print('Deterministic classifier')
print('intercept', intercept)
print('coefficients', coefficients)

Deterministic classifier
intercept -11.0
coefficients [1.5, 1.5, 8.0, 6.0, 2.5, 6.5, 5.0]


In [60]:
logreg = rl.LogisticRegressionClassifier(
    coefficients=coefficients, intercept=intercept)
links = logreg.predict(features)

print(len(links), 'matches')

INFO:recordlinkage:Classification - predict matches and non-matches
5735 matches


In [61]:
# return the confusion matrix
conf_logreg = rl.confusion_matrix(true_links, links, len(candidate_links))
print('confusion matrix')
print(conf_logreg)

confusion matrix
[[ 4973    27]
 [  762 81370]]


In [62]:
# compute the F-score for this classification
fscore = rl.fscore(conf_logreg)
print('fscore', fscore)
recall = rl.recall(true_links, links)
print('recall', recall)
precision = rl.precision(true_links, links)
print('precision', precision)

fscore 0.9265020959478342
recall 0.9946
precision 0.8671316477768091


In [63]:
# Predict the match probability for each pair in the dataset.
probs = logreg.prob(features)
print(probs)

INFO:recordlinkage:Classification - compute probabilities

rec_id_1     rec_id_2      
rec-0-org    rec-0-dup-0       1.000000
             rec-1505-dup-0    0.000075
             rec-1636-dup-0    0.000075
             rec-2074-dup-0    0.000911
             rec-2683-dup-0    0.000075
                                 ...   
rec-999-org  rec-3940-dup-0    0.000075
             rec-4941-dup-0    0.000075
             rec-859-dup-0     0.000075
             rec-911-dup-0     0.000911
             rec-999-dup-0     1.000000
Length: 87132, dtype: float64
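
The probabilities above can be understood with a few lines of plain Python: a logistic model applies the sigmoid function to the intercept plus the weighted sum of the comparison vector. The sketch below reuses the intercept and coefficients defined earlier. The three comparison vectors are assumptions chosen for illustration (only the surname agrees, surname and state agree, everything agrees); they happen to reproduce the values 0.000075, 0.000911 and 1.000000 printed above.

```python
import math

intercept = -11.0
# Order: given_name, surname, date_of_birth, suburb, state, address_1, address_2
coefficients = [1.5, 1.5, 8.0, 6.0, 2.5, 6.5, 5.0]


def match_probability(x):
    """Sigmoid of intercept + sum_i coefficient_i * x_i."""
    z = intercept + sum(w * xi for w, xi in zip(coefficients, x))
    return 1.0 / (1.0 + math.exp(-z))


surname_only = [0, 1, 0, 0, 0, 0, 0]        # assumed vector
surname_and_state = [0, 1, 0, 0, 1, 0, 0]   # assumed vector
all_agree = [1, 1, 1, 1, 1, 1, 1]

for x in (surname_only, surname_and_state, all_agree):
    print(x, round(match_probability(x), 6))
```

With these weights a pair is scored above 0.5 as soon as the weighted sum exceeds 11, so for instance agreement on `date_of_birth` and `suburb` alone (8.0 + 6.0) is already enough, which may help explain the lower precision of this classifier.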


## Naive Bayes

In [145]:
# Initialise the NaiveBayesClassifier.
cl = rl.NaiveBayesClassifier()
true_links_training = true_links[:4000]
true_links_testing = true_links[4000:]
cl.fit(features_training, true_links_training)

INFO:recordlinkage:Classification - start training NaiveBayesClassifier
INFO:recordlinkage:Classification - training computation time: ~0.26s


In [146]:
# Print the parameters that are trained (m, u and p). Note that the estimates
# are very good.
print("p probability P(Match):", cl.p)
print("m probabilities P(x_i=1|Match):", cl.m_probs)
print("u probabilities P(x_i=1|Non-Match):", cl.u_probs)
print("log weights of features:", cl.log_weights)
print("weights of features:", cl.weights)

p probability P(Match): 0.06878789966923929
m probabilities P(x_i=1|Match): {'given_name': {0.0: 0.202855011126721, 1.0: 0.7971449888732783}, 'surname': {0.0: 0.1477585951535889, 1.0: 0.8522414048464103}, 'date_of_birth': {0.0: 0.07713500740621045, 1.0: 0.9228649925937881}, 'suburb': {0.0: 0.07463062486197715, 1.0: 0.9253693751380213}, 'state': {0.0: 0.057851261815614195, 1.0: 0.9421487381843849}, 'address_1': {0.0: 0.13849237973992579, 1.0: 0.8615076202600742}, 'address_2': {0.0: 0.3205609907056852, 1.0: 0.6794390092943143}}
u probabilities P(x_i=1|Non-Match): {'given_name': {0.0: 0.9943945962653057, 1.0: 0.005605403734694655}, 'surname': {0.0: 0.007399872324854794, 1.0: 0.9926001276751463}, 'date_of_birth': {0.0: 0.9511793526919657, 1.0: 0.048820647308035744}, 'suburb': {0.0: 0.9974100434838232, 1.0: 0.002589956516178129}, 'state': {0.0: 0.7839607796357014, 1.0: 0.2160392203642986}, 'address_1': {0.0: 0.9994265081882288, 1.0: 0.0005734918117713746}, 'address_2': {0.0: 0.9994820072167
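
To clarify what the three groups of parameters mean, here is a minimal sketch of how they could be estimated from labelled binary comparison vectors by simple counting. It is not the toolkit's implementation (which may, for example, apply smoothing); `estimate_parameters` is a hypothetical helper and the data is a made-up toy sample.

```python
def estimate_parameters(vectors, labels):
    """Estimate p = P(Match), m_i = P(x_i=1|Match), u_i = P(x_i=1|Non-Match)."""
    matches = [v for v, y in zip(vectors, labels) if y == 1]
    non_matches = [v for v, y in zip(vectors, labels) if y == 0]
    p = len(matches) / len(vectors)
    # zip(*rows) iterates over the columns, i.e. one feature at a time
    m = [sum(col) / len(matches) for col in zip(*matches)]
    u = [sum(col) / len(non_matches) for col in zip(*non_matches)]
    return p, m, u


# Toy comparison vectors (3 features) with their true labels
toy_vectors = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1],
               [0, 1, 0], [0, 1, 0], [0, 0, 0], [0, 1, 0]]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

p_hat, m_hat, u_hat = estimate_parameters(toy_vectors, toy_labels)
print('p', p_hat)
print('m', m_hat)
print('u', u_hat)
```

In the toy data the second feature agrees for most non-matches as well, so its u value is high and it carries little evidence; the same effect is visible for `surname` in the output above, since blocking on the surname makes many non-matching pairs agree on it.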

In [153]:
# evaluate the model

print('Build index...')
indexer = rl.Index()
indexer.add(Block('surname'))
# OR
indexer.add(Block('date_of_birth'))
# OR
indexer.add(Block('soc_sec_id'))
candidate_links_testing = indexer.index(testing_A, testing_B)
print(len(candidate_links_testing))

# start comparing
print('Start comparing...')
comparer = rl.Compare()
comparer.add(String('given_name', 'given_name', method='jarowinkler',
                    threshold=0.85, label='given_name'))
comparer.add(String('surname', 'surname', method='jarowinkler',
                    threshold=0.85, label='surname'))
comparer.add(String('date_of_birth', 'date_of_birth', method='jarowinkler',
                    threshold=0.85, label='date_of_birth'))
comparer.add(String('suburb', 'suburb', method='jarowinkler', 
                    threshold=0.85, label='suburb'))
comparer.add(String('state', 'state', method='jarowinkler', 
                    threshold=0.85, label='state'))
comparer.add(String('address_1', 'address_1', 
                    threshold=0.85, label='address_1'))
comparer.add(String('address_2', 'address_2', 
                    threshold=0.85, label='address_2'))
features_testing = comparer.compute(candidate_links_testing, testing_A, testing_B)

testing_links = cl.predict(features_testing)
print("Predicted number of links:", len(testing_links))

Build index...
INFO:recordlinkage:indexing - initialize Index class
INFO:recordlinkage:indexing [1/?] - time: 0.08s - pairs: 3985/1000000 - rr: 0.99601
INFO:recordlinkage:indexing [1/?] - time: 0.08s - pairs_total: 3985/1000000 - rr_total: 0.99601
3985
Start comparing...
INFO:recordlinkage:comparing - initialize Compare class
INFO:recordlinkage:comparing [1/?] - time: 1.50s - pairs: 3985
INFO:recordlinkage:comparing [1/?] - time: 1.50s - pairs_total: 3985
INFO:recordlinkage:Classification - predict matches and non-matches
Predicted number of links: 996


In [155]:
cm = rl.confusion_matrix(true_links_testing, testing_links, len(candidate_links_testing))
print("Confusion matrix:\n", cm)
print(sum(sum(cm)))

Confusion matrix:
 [[ 995    5]
 [   1 2984]]
3985


In [150]:
# compute the F-score for this classification
fscore = rl.fscore(cm)
print('fscore', fscore)
# score the Naive Bayes predictions on the testing set (testing_links)
recall = rl.recall(true_links_testing, testing_links)
print('recall', recall)
precision = rl.precision(true_links_testing, testing_links)
print('precision', precision)

fscore 0.9969939879759518
recall 0.995
precision 0.998995983935743


In [31]:
# Predict the match probability for each pair in the dataset.
probs = cl.prob(features)
probs

INFO:recordlinkage:Classification - compute probabilities


rec_id_1     rec_id_2      
rec-0-org    rec-0-dup-0       1.000000e+00
             rec-1505-dup-0    2.257802e-07
             rec-1636-dup-0    2.257802e-07
             rec-2074-dup-0    1.297394e-05
             rec-2683-dup-0    2.257802e-07
                                   ...     
rec-999-org  rec-3940-dup-0    2.257802e-07
             rec-4941-dup-0    2.257802e-07
             rec-859-dup-0     2.257802e-07
             rec-911-dup-0     1.297394e-05
             rec-999-dup-0     1.000000e+00
Length: 87132, dtype: float64

## Expectation-Conditional Maximisation

In [32]:
'''
Example: Unsupervised learning with the ECM algorithm.
Training data is often hard to collect in record linkage or data matching
problems. The Expectation-Conditional Maximisation (ECM) algorithm is the
best-known algorithm for unsupervised data matching. The algorithm performs
relatively well compared to supervised methods.
'''
import numpy as np
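
The idea of learning without labels can be sketched with a plain Expectation-Maximisation loop for a two-class mixture of independent binary features. This is a simplified stand-in, not the toolkit's `ECMClassifier`: the function names, the starting values and the toy data are all assumptions made for illustration.

```python
def likelihood(x, probs):
    """P(x | class) for independent binary features with P(x_i = 1) = probs[i]."""
    result = 1.0
    for xi, pi in zip(x, probs):
        result *= pi if xi else 1.0 - pi
    return result


def fit_em(vectors, n_iter=50, eps=1e-6):
    """Estimate p, m and u without labels; the starting values are arbitrary guesses."""
    n, k = len(vectors), len(vectors[0])
    p, m, u = 0.5, [0.9] * k, [0.1] * k
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match
        g = []
        for x in vectors:
            pm = p * likelihood(x, m)
            pu = (1.0 - p) * likelihood(x, u)
            g.append(pm / (pm + pu))
        # M-step: re-estimate the parameters from the soft assignments,
        # clipping to [eps, 1 - eps] to avoid probabilities of exactly 0 or 1
        total = sum(g)
        p = total / n
        m = [min(max(sum(gi * x[i] for gi, x in zip(g, vectors)) / total, eps), 1 - eps)
             for i in range(k)]
        u = [min(max(sum((1 - gi) * x[i] for gi, x in zip(g, vectors)) / (n - total), eps), 1 - eps)
             for i in range(k)]
    return p, m, u


# Toy data: 25 pairs that look like matches, 75 that look like non-matches
toy = [[1, 1, 1]] * 20 + [[1, 1, 0]] * 5 + [[0, 0, 0]] * 70 + [[0, 0, 1]] * 5

p_em, m_em, u_em = fit_em(toy)
print('p', round(p_em, 3))
print('m', [round(v, 3) for v in m_em])
print('u', [round(v, 3) for v in u_em])
```

On this toy sample the loop recovers a match proportion close to 0.25 without ever seeing a label, because the two groups of comparison vectors are well separated.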

In [33]:
# Initialise the Expectation-Conditional Maximisation classifier.
cl = rl.ECMClassifier()
cl.fit(features)

INFO:recordlinkage:Classification - start training ECMClassifier
INFO:recordlinkage:Classification - training computation time: ~0.33s


In [34]:
# Print the parameters that are trained (m, u and p). Note that the estimates
# are very good.
print("p probability P(Match):", cl.p)
print("m probabilities P(x_i=1|Match):", cl.m_probs)
print("u probabilities P(x_i=1|Non-Match):", cl.u_probs)
print("log weights of features:", cl.log_weights)
print("weights of features:", cl.weights)

p probability P(Match): 0.05746886445320084
m probabilities P(x_i=1|Match): {'given_name': {0.0: 0.2067585108551402, 1.0: 0.7932414891448585}, 'surname': {0.0: 0.1483757226825982, 1.0: 0.8516242773174004}, 'date_of_birth': {0.0: 0.07938649755546, 1.0: 0.9206135024445387}, 'suburb': {0.0: 0.07765753195188282, 1.0: 0.9223424680481159}, 'state': {0.0: 0.05977818622152173, 1.0: 0.9402218137784779}, 'address_1': {0.0: 0.1439501610395507, 1.0: 0.8560498389604491}, 'address_2': {0.0: 0.3164559283945158, 1.0: 0.6835440716054837}}
u probabilities P(x_i=1|Non-Match): {'given_name': {0.0: 0.9941437693417281, 1.0: 0.005856230658272183}, 'surname': {0.0: 0.007574181559987423, 1.0: 0.9924258184400123}, 'date_of_birth': {0.0: 0.950804753928978, 1.0: 0.049195246071021416}, 'suburb': {0.0: 0.9978632054162465, 1.0: 0.002136794583753248}, 'state': {0.0: 0.7819660634016198, 1.0: 0.21803393659837952}, 'address_1': {0.0: 0.9994102177845318, 1.0: 0.0005897822154671389}, 'address_2': {0.0: 0.9995344018720016,

In [35]:
# evaluate the model
links = cl.predict(features)
print("Predicted number of links:", len(links))

INFO:recordlinkage:Classification - predict matches and non-matches


Predicted number of links: 4991


In [36]:
cm = rl.confusion_matrix(true_links, links, len(candidate_links))
print("Confusion matrix:\n", cm)

Confusion matrix:
 [[ 4976    24]
 [   15 82117]]


In [37]:
# compute the F-score for this classification
fscore = rl.fscore(cm)
print('fscore', fscore)
recall = rl.recall(true_links, links)
print('recall', recall)
precision = rl.precision(true_links, links)
print('precision', precision)

fscore 0.9960964868381544
recall 0.9952
precision 0.9969945902624725


In [38]:
# Predict the match probability for each pair in the dataset.
probs = cl.prob(features)
print(probs)

INFO:recordlinkage:Classification - compute probabilities


rec_id_1     rec_id_2      
rec-0-org    rec-0-dup-0       1.000000e+00
             rec-1505-dup-0    2.464947e-07
             rec-1636-dup-0    2.464947e-07
             rec-2074-dup-0    1.390442e-05
             rec-2683-dup-0    2.464947e-07
                                   ...     
rec-999-org  rec-3940-dup-0    2.464947e-07
             rec-4941-dup-0    2.464947e-07
             rec-859-dup-0     2.464947e-07
             rec-911-dup-0     1.390442e-05
             rec-999-dup-0     1.000000e+00
Length: 87132, dtype: float64
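
The match probabilities above follow from the trained parameters through Bayes' rule, under the assumption that the features are conditionally independent. The sketch below plugs in the p, m and u values printed for the ECM classifier; `posterior` is a hypothetical helper, and the two comparison vectors are assumptions (only the surname agrees, surname and state agree) that reproduce the printed values 2.464947e-07 and 1.390442e-05.

```python
FEATURES = ['given_name', 'surname', 'date_of_birth', 'suburb',
            'state', 'address_1', 'address_2']

p = 0.05746886445320084
# P(x_i = value | Match), copied from the ECM output above
m = {'given_name': {0: 0.2067585108551402, 1: 0.7932414891448585},
     'surname': {0: 0.1483757226825982, 1: 0.8516242773174004},
     'date_of_birth': {0: 0.07938649755546, 1: 0.9206135024445387},
     'suburb': {0: 0.07765753195188282, 1: 0.9223424680481159},
     'state': {0: 0.05977818622152173, 1: 0.9402218137784779},
     'address_1': {0: 0.1439501610395507, 1: 0.8560498389604491},
     'address_2': {0: 0.3164559283945158, 1: 0.6835440716054837}}
# P(x_i = value | Non-Match); the entry for address_2 = 1 is cut off in the
# printed output and is not needed for the two vectors used here
u = {'given_name': {0: 0.9941437693417281, 1: 0.005856230658272183},
     'surname': {0: 0.007574181559987423, 1: 0.9924258184400123},
     'date_of_birth': {0: 0.950804753928978, 1: 0.049195246071021416},
     'suburb': {0: 0.9978632054162465, 1: 0.002136794583753248},
     'state': {0: 0.7819660634016198, 1: 0.21803393659837952},
     'address_1': {0: 0.9994102177845318, 1: 0.0005897822154671389},
     'address_2': {0: 0.9995344018720016}}


def posterior(x):
    """P(Match | x) = p * prod(m) / (p * prod(m) + (1 - p) * prod(u))."""
    like_m = like_u = 1.0
    for name, value in zip(FEATURES, x):
        like_m *= m[name][value]
        like_u *= u[name][value]
    return p * like_m / (p * like_m + (1.0 - p) * like_u)


surname_only = [0, 1, 0, 0, 0, 0, 0]        # assumed vector
surname_and_state = [0, 1, 0, 0, 1, 0, 0]   # assumed vector
print(posterior(surname_only))
print(posterior(surname_and_state))
```

Agreement on the surname alone barely moves the probability, because almost every non-matching candidate pair also agrees on it (u is about 0.99 for `surname`).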
