# Machine Learning aided Record Linkage (and Data Fusion?) - a comparison between different ML methods

## Group Components
* **Francesco Porto**
* **Francesco Stranieri**
* **Mattia Vincenzi**

## Abstract
Record Linkage is the process of identifying records, within one dataset or across several data sources, that refer to the same entity. Traditionally, it is done by applying comparison rules to pairs of attributes from each dataset. In this project we investigate some Machine Learning approaches to Record Linkage and compare them with the standard rule-based approach.

## Python Record Linkage Toolkit
Throughout the project, we make use of a Python library called "Python Record Linkage Toolkit", which provides a simple framework to facilitate the process of Record Linkage. In the context of this library, the Record Linkage process is divided into five steps:

* Preprocessing
* Indexing
* Comparison
* Classification
* Evaluation

Please refer to the documentation available at the following link for further information:

https://recordlinkage.readthedocs.io/en/latest/index.html

In [104]:
!pip install recordlinkage
import recordlinkage as rl



## Dataset description
We use the FEBRL (Freely Extensible Biomedical Record Linkage) dataset since it provides the "golden links" for optimal Record Linkage. This dataset contains 10000 records (5000 originals and 5000 duplicates, with one duplicate per original); the originals have been split from the duplicates into dataset4a.csv (containing the 5000 original records) and dataset4b.csv (containing the 5000 duplicate records).

In [105]:
from recordlinkage.datasets import load_febrl4

In [106]:
# set logging
rl.logging.set_verbosity(rl.logging.INFO)

In [107]:
# load datasets
print('Loading data...')
dfA, dfB, true_links = load_febrl4(return_links=True)
print(len(dfA), 'records in dataset A')
print(len(dfB), 'records in dataset B')
print(len(true_links), 'links between dataset A and B')

Loading data...
5000 records in dataset A
5000 records in dataset B
5000 links between dataset A and B


In [108]:
dfA
dfA.dtypes

given_name       object
surname          object
street_number    object
address_1        object
address_2        object
suburb           object
postcode         object
state            object
date_of_birth    object
soc_sec_id       object
dtype: object

Records that share the same numeric id (e.g. `rec-0-org` and `rec-0-dup-0`) represent the same entity.

In [109]:
dfA.loc['rec-0-org']

given_name               rachael
surname                     dent
street_number                  1
address_1            knox street
address_2        lakewood estate
suburb                    byford
postcode                    4129
state                        vic
date_of_birth           19280722
soc_sec_id               1683994
Name: rec-0-org, dtype: object

In [110]:
dfB.loc['rec-0-dup-0']

given_name               rachael
surname                     dent
street_number                  4
address_1            knox street
address_2        lakewood estate
suburb                    byford
postcode                    4129
state                        vic
date_of_birth           19280722
soc_sec_id               1683994
Name: rec-0-dup-0, dtype: object

We split each dataset into a **training** set (the records with the 4000 smallest numeric ids) and a **testing** set (the remaining 1000 records) for our models.

In [111]:
def split_FEBRL_dataset(dataset,n):
    indexes = dataset.index.to_series().str.rsplit('-').str[1].astype(int).sort_values()
    training_indexes = indexes[:n]
    testing_indexes = indexes[n:]
    training = dataset[(dataset.index).isin(training_indexes.index)]
    testing = dataset[(dataset.index).isin(testing_indexes.index)]
    return training, testing

In [112]:
def split_true_links(true_links,n):
    training_true_links = true_links[:n]
    testing_true_links = true_links[n:]
    return training_true_links, testing_true_links
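The split relies on the numeric part of each record id rather than on row order. The following is a minimal pure-Python sketch of the same idea, using a few hypothetical ids in the FEBRL format; `numeric_id` and `split_ids` are illustrative names and are not part of the notebook.

```python
def numeric_id(rec_id):
    """Extract the numeric part of an id such as 'rec-12-org' or 'rec-12-dup-0'."""
    return int(rec_id.split('-')[1])


def split_ids(rec_ids, n):
    """Return (training, testing): the ids with the n smallest numeric ids, and the rest.

    The original order of rec_ids is preserved inside each part.
    """
    ordered = sorted(rec_ids, key=numeric_id)
    training_set = set(ordered[:n])
    training = [r for r in rec_ids if r in training_set]
    testing = [r for r in rec_ids if r not in training_set]
    return training, testing


# Hypothetical ids, deliberately shuffled
ids_a = ['rec-10-org', 'rec-2-org', 'rec-0-org', 'rec-7-org']
ids_b = ['rec-7-dup-0', 'rec-0-dup-0', 'rec-10-dup-0', 'rec-2-dup-0']

train_a, test_a = split_ids(ids_a, 2)
train_b, test_b = split_ids(ids_b, 2)
print(train_a, test_a)
print(train_b, test_b)
```

Casting to `int` matters: a lexicographic sort would place `rec-10-org` before `rec-2-org`. Note also that `split_true_links` slices by position, so it assumes that `true_links` is already ordered by numeric id.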

In [113]:
training_A, testing_A = split_FEBRL_dataset(dfA, 4000)
training_A

rec_id,given_name,surname,street_number,address_1,address_2,suburb,postcode,state,date_of_birth,soc_sec_id
rec-1070-org,michaela,neumann,8,stanley street,miami,winston hills,4223,nsw,19151111,5304218
rec-1016-org,courtney,painter,12,pinkerton circuit,bega flats,richlands,4560,vic,19161214,4066625
rec-1288-org,vanessa,parr,905,macquoid place,broadbridge manor,south grafton,2135,sa,19951119,9239102
rec-3585-org,mikayla,malloney,37,randwick road,avalind,hoppers crossing,4552,vic,19860208,7207688
rec-298-org,blake,howie,1,cutlack street,belmont park belted galloway stud,budgewoi,6017,vic,19250301,5180548
...,...,...,...,...,...,...,...,...,...,...
rec-1622-org,bethanie,menzies,120,archibald street,krismark,belmont,2287,nsw,19871019,8046929
rec-2153-org,annabel,grierson,97,mclachlan crescent,lantana lodge,broome,2480,nsw,19840224,7676186
rec-1604-org,sienna,musolino,22,smeaton circuit,pangani,mckinnon,2700,nsw,19890525,4971506
rec-1003-org,bradley,matthews,2,jondol place,horseshoe ck,jacobs well,7018,sa,19481122,8927667


In [114]:
training_B, testing_B = split_FEBRL_dataset(dfB, 4000)
training_B

rec_id,given_name,surname,street_number,address_1,address_2,suburb,postcode,state,date_of_birth,soc_sec_id
rec-561-dup-0,elton,,3,light setreet,pinehill,windermere,3212,vic,19651013,1551941
rec-2642-dup-0,mitchell,maxon,47,edkins street,lochaoair,north ryde,3355,nsw,19390212,8859999
rec-608-dup-0,,white,72,lambrigg street,kelgoola,broadbeach waters,3159,vic,19620216,9731855
rec-3239-dup-0,elk i,menzies,1,lyster place,,northwood,2585,vic,19980624,4970481
rec-2886-dup-0,,garanggar,,may maxwell crescent,springettst arcade,forest hill,2342,vic,19921016,1366884
...,...,...,...,...,...,...,...,...,...,...
rec-3152-dup-0,ethan,reuter,,rivers street,haven caravn park,balcllyn,4571,nsw,19391123,3818774
rec-3363-dup-0,patrick,wevaer,100,allambee street,corcooan,preston,2681,nsw,19770725,5276236
rec-3131-dup-0,samuel,crofs,613,banjine street,kurrajong vlge,pengzin,2230,qld,19410531,4467228
rec-3815-dup-0,saah,beattih,60,kay's place,oldershaw court,ashfield,2047,vic,19500712,9435148


In [115]:
training_true_links, testing_true_links = split_true_links(true_links, 4000)
training_true_links

MultiIndex([(   'rec-0-org',    'rec-0-dup-0'),
            (   'rec-1-org',    'rec-1-dup-0'),
            (   'rec-2-org',    'rec-2-dup-0'),
            (   'rec-3-org',    'rec-3-dup-0'),
            (   'rec-4-org',    'rec-4-dup-0'),
            (   'rec-5-org',    'rec-5-dup-0'),
            (   'rec-6-org',    'rec-6-dup-0'),
            (   'rec-7-org',    'rec-7-dup-0'),
            (   'rec-8-org',    'rec-8-dup-0'),
            (   'rec-9-org',    'rec-9-dup-0'),
            ...
            ('rec-3990-org', 'rec-3990-dup-0'),
            ('rec-3991-org', 'rec-3991-dup-0'),
            ('rec-3992-org', 'rec-3992-dup-0'),
            ('rec-3993-org', 'rec-3993-dup-0'),
            ('rec-3994-org', 'rec-3994-dup-0'),
            ('rec-3995-org', 'rec-3995-dup-0'),
            ('rec-3996-org', 'rec-3996-dup-0'),
            ('rec-3997-org', 'rec-3997-dup-0'),
            ('rec-3998-org', 'rec-3998-dup-0'),
            ('rec-3999-org', 'rec-3999-dup-0')],
           length=4000)

## Indexing
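Indexing is the process of generating the candidate links (record pairs) between the two datasets, so that not every record has to be compared with every other one. In this specific example, we use a technique called **Blocking**, which pairs only records that agree exactly on a blocking attribute. Since our wrapper adds one `Block` per attribute, the resulting candidate set contains all the pairs that agree on AT LEAST one of the specified attributes. When a single dataset is deduplicated, the indexer is also capable of returning each link only once (and not twice) by only looking at one triangular half of the matrix of matches. We have defined a wrapper function to facilitate the indexing process: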

In [116]:
from recordlinkage.index import Block

def blocking_dataset(dataset1, dataset2, parameters):
    indexer = rl.Index()
    for parameter in parameters:
        indexer.add(Block(parameter))
    candidate_links = indexer.index(dataset1, dataset2)
    return candidate_links
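To make the blocking semantics concrete, here is a minimal pure-Python sketch of the "agree on at least one attribute" rule, together with the reduction ratio (`rr`) reported in the indexing logs. It is an illustration under simplifying assumptions, not the library's implementation; `block_pairs`, `reduction_ratio`, and the toy records are hypothetical.

```python
from collections import defaultdict


def block_pairs(records_a, records_b, keys):
    """Return the pairs (id_a, id_b) that agree exactly on at least one key."""
    pairs = set()
    for key in keys:
        # Group the ids of dataset B by their value for this blocking key.
        buckets = defaultdict(list)
        for id_b, rec in records_b.items():
            if rec.get(key) is not None:
                buckets[rec[key]].append(id_b)
        # Pair each record of A with the B records sharing its value.
        for id_a, rec in records_a.items():
            for id_b in buckets.get(rec.get(key), []):
                pairs.add((id_a, id_b))
    return pairs


def reduction_ratio(n_pairs, n_a, n_b):
    """Fraction of the full cross product that blocking avoids comparing."""
    return 1 - n_pairs / (n_a * n_b)


# Toy records (hypothetical values); None marks a missing attribute.
records_a = {
    'a1': {'surname': 'dent', 'date_of_birth': '19280722'},
    'a2': {'surname': 'white', 'date_of_birth': '19620216'},
}
records_b = {
    'b1': {'surname': 'dent', 'date_of_birth': '19280722'},
    'b2': {'surname': None, 'date_of_birth': '19620216'},
    'b3': {'surname': 'dent', 'date_of_birth': '19990101'},
}

pairs = block_pairs(records_a, records_b, ['surname', 'date_of_birth'])
print(sorted(pairs))
print(round(reduction_ratio(58048, 4000, 4000), 5))
```

With the pair count from the training run (58048 candidate pairs out of 4000 x 4000), the ratio agrees with the `rr: 0.99637` value in the log. Blocking on several attributes lets a true link survive even when one of them is missing or misspelled, as for `a2`/`b2` in the toy data.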

In [117]:
candidate_links_training = blocking_dataset(training_A, training_B, ['surname','date_of_birth', 'soc_sec_id'])
candidate_links_training

INFO:recordlinkage:indexing - initialize Index class
INFO:recordlinkage:indexing [1/?] - time: 0.74s - pairs: 58048/16000000 - rr: 0.99637
INFO:recordlinkage:indexing [1/?] - time: 0.74s - pairs_total: 58048/16000000 - rr_total: 0.99637

MultiIndex([(  'rec-0-org',    'rec-0-dup-0'),
            (  'rec-0-org', 'rec-1505-dup-0'),
            (  'rec-0-org', 'rec-1636-dup-0'),
            (  'rec-0-org', 'rec-2074-dup-0'),
            (  'rec-0-org', 'rec-2683-dup-0'),
            (  'rec-0-org', 'rec-2724-dup-0'),
            (  'rec-0-org', 'rec-2894-dup-0'),
            (  'rec-1-org',    'rec-1-dup-0'),
            (  'rec-1-org', 'rec-1052-dup-0'),
            (  'rec-1-org', 'rec-2552-dup-0'),
            ...
            ('rec-999-org', 'rec-3681-dup-0'),
            ('rec-999-org', 'rec-3685-dup-0'),
            ('rec-999-org',  'rec-370-dup-0'),
            ('rec-999-org', 'rec-3766-dup-0'),
            ('rec-999-org', 'rec-3862-dup-0'),
            ('rec-999-org', 'rec-3913-dup-0'),
            ('rec-999-org', 'rec-3940-dup-0'),
            ('rec-999-org',  'rec-859-dup-0'),
            ('rec-999-org',  'rec-911-dup-0'),
            ('rec-999-org',  'rec-999-dup-0')],
           names=['rec_id_1', 'rec_id_2'], 

In [118]:
candidate_links_testing = blocking_dataset(testing_A, testing_B, ['surname','date_of_birth', 'soc_sec_id'])
candidate_links_testing

INFO:recordlinkage:indexing - initialize Index class
INFO:recordlinkage:indexing [1/?] - time: 0.12s - pairs: 3985/1000000 - rr: 0.99601
INFO:recordlinkage:indexing [1/?] - time: 0.12s - pairs_total: 3985/1000000 - rr_total: 0.99601

MultiIndex([('rec-4000-org', 'rec-4000-dup-0'),
            ('rec-4001-org', 'rec-4001-dup-0'),
            ('rec-4002-org', 'rec-4002-dup-0'),
            ('rec-4002-org', 'rec-4130-dup-0'),
            ('rec-4002-org', 'rec-4490-dup-0'),
            ('rec-4002-org', 'rec-4552-dup-0'),
            ('rec-4002-org', 'rec-4731-dup-0'),
            ('rec-4002-org', 'rec-4938-dup-0'),
            ('rec-4003-org', 'rec-4003-dup-0'),
            ('rec-4003-org', 'rec-4642-dup-0'),
            ...
            ('rec-4994-org', 'rec-4994-dup-0'),
            ('rec-4995-org', 'rec-4181-dup-0'),
            ('rec-4995-org', 'rec-4995-dup-0'),
            ('rec-4996-org', 'rec-4996-dup-0'),
            ('rec-4997-org', 'rec-4099-dup-0'),
            ('rec-4997-org', 'rec-4266-dup-0'),
            ('rec-4997-org', 'rec-4997-dup-0'),
            ('rec-4998-org', 'rec-4998-dup-0'),
            ('rec-4999-org', 'rec-4860-dup-0'),
            ('rec-4999-org', 'rec-4999-dup-0')],
           names=['rec_

Verify that the two records of a candidate link agree on at least one of the blocking attributes:

In [119]:
training_A.loc[candidate_links_training[0][0]]

given_name               rachael
surname                     dent
street_number                  1
address_1            knox street
address_2        lakewood estate
suburb                    byford
postcode                    4129
state                        vic
date_of_birth           19280722
soc_sec_id               1683994
Name: rec-0-org, dtype: object

In [120]:
training_B.loc[candidate_links_training[0][1]]

given_name               rachael
surname                     dent
street_number                  4
address_1            knox street
address_2        lakewood estate
suburb                    byford
postcode                    4129
state                        vic
date_of_birth           19280722
soc_sec_id               1683994
Name: rec-0-dup-0, dtype: object

## Comparison
**Comparison** refers to the process of computing, for every candidate link, a vector of agreement features (one per attribute), on which the classification step is then based. In order to compare attributes, we need to specify (for each attribute):
* A **metric** to be used
* A **threshold** at or above which the metric returns 1 (= a match) and below which it returns 0 (= not a match)

We decided not to use exact matches on strings: input errors (e.g. typos made by an employee) might be common, so exact matching would be too strict a restriction.
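To illustrate how a metric and a threshold turn two strings into a binary feature, the sketch below implements the Jaro-Winkler similarity in pure Python and thresholds it at 0.85. It follows the textbook definition and may differ in small details from the implementation the library relies on; the string pairs are illustrative, and "croft" is only a hypothetical correct spelling of "crofs".

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and not too far apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted by the length of the common prefix (up to 4)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def binary_feature(s1, s2, threshold=0.85):
    """1.0 if the similarity reaches the threshold, else 0.0."""
    return 1.0 if jaro_winkler(s1, s2) >= threshold else 0.0


for left, right in [('martha', 'marhta'), ('crofs', 'croft'), ('dent', 'white')]:
    print(left, right, round(jaro_winkler(left, right), 4), binary_feature(left, right))
```

A single typo keeps the similarity above the threshold, whereas two unrelated surnames fall well below it; an exact comparison would reject both cases alike.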

In [121]:
from recordlinkage.compare import Exact, String

# Only works for datasets that have the same column names.
# string_comparison maps each attribute to [method, threshold];
# exact_match is a list of attributes to be compared exactly.
def compare_records(candidate_links, dataset1, dataset2, string_comparison, exact_match):
    comparer = rl.Compare()

    # Thresholded string similarity, one feature per attribute
    for attribute, (method_comp, threshold_comp) in string_comparison.items():
        comparer.add(String(attribute, attribute, method=method_comp, threshold=threshold_comp, label=attribute))

    # Exact comparison, labelled with the attribute name like the string features
    for attribute in exact_match:
        comparer.add(Exact(attribute, attribute, label=attribute))

    features = comparer.compute(candidate_links, dataset1, dataset2)
    return features

In [122]:
comparison = {'given_name':['jarowinkler',0.85],'surname':['jarowinkler',0.85],'date_of_birth':['jarowinkler',0.85],
              'suburb':['jarowinkler',0.85], 'state':['jarowinkler',0.85], 'address_1':['jarowinkler',0.85], 
              'address_2':['jarowinkler',0.85] }
exact = []
training_features = compare_records(candidate_links_training, training_A, training_B, comparison, exact)
training_features

INFO:recordlinkage:comparing - initialize Compare class
INFO:recordlinkage:comparing [1/?] - time: 10.31s - pairs: 58048
INFO:recordlinkage:comparing [1/?] - time: 10.31s - pairs_total: 58048

rec_id_1,rec_id_2,given_name,surname,date_of_birth,suburb,state,address_1,address_2
rec-0-org,rec-0-dup-0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
rec-0-org,rec-1505-dup-0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
rec-0-org,rec-1636-dup-0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
rec-0-org,rec-2074-dup-0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
rec-0-org,rec-2683-dup-0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...
rec-999-org,rec-3913-dup-0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
rec-999-org,rec-3940-dup-0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
rec-999-org,rec-859-dup-0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
rec-999-org,rec-911-dup-0,0.0,1.0,0.0,0.0,1.0,0.0,0.0


In [123]:
testing_features = compare_records(candidate_links_testing, testing_A, testing_B, comparison, exact)
testing_features

INFO:recordlinkage:comparing - initialize Compare class
INFO:recordlinkage:comparing [1/?] - time: 0.80s - pairs: 3985
INFO:recordlinkage:comparing [1/?] - time: 0.80s - pairs_total: 3985

rec_id_1,rec_id_2,given_name,surname,date_of_birth,suburb,state,address_1,address_2
rec-4000-org,rec-4000-dup-0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
rec-4001-org,rec-4001-dup-0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
rec-4002-org,rec-4002-dup-0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
rec-4002-org,rec-4130-dup-0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
rec-4002-org,rec-4490-dup-0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...
rec-4997-org,rec-4266-dup-0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
rec-4997-org,rec-4997-dup-0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
rec-4998-org,rec-4998-dup-0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
rec-4999-org,rec-4860-dup-0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


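## Trivial approach
As a baseline, we sum the binary comparison results of each candidate link and classify the link as a match when more than 3 of the 7 attributes agree.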

In [124]:
# Sum the comparison results.
training_features.sum(axis=1).value_counts().sort_index(ascending=False)

7.0     1640
6.0     1373
5.0      724
4.0      221
3.0      665
2.0    13307
1.0    40118
dtype: int64
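The score distribution above has two groups: most pairs agree on one or two attributes only, while a smaller group agrees on almost all of them. A simple sum-and-threshold rule separates the two groups. The sketch below applies that rule to a few rows copied from the training features and to the histogram above; `classify_by_sum` is an illustrative helper, not a library function.

```python
def classify_by_sum(feature_rows, threshold=3):
    """Label a pair as a match when more than `threshold` features agree."""
    return {pair: sum(values) > threshold for pair, values in feature_rows.items()}


# Rows copied from the training features shown earlier
rows = {
    ('rec-0-org', 'rec-0-dup-0'): [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    ('rec-0-org', 'rec-1505-dup-0'): [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    ('rec-0-org', 'rec-2074-dup-0'): [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
}
labels = classify_by_sum(rows)

# Score histogram reported above for the training candidate links
score_counts = {7: 1640, 6: 1373, 5: 724, 4: 221, 3: 665, 2: 13307, 1: 40118}
predicted_matches = sum(count for score, count in score_counts.items() if score > 3)

print(labels)
print(predicted_matches, 'of', sum(score_counts.values()), 'pairs exceed the threshold')
```

The rule weighs every attribute equally, which is the main limitation that the classifiers in the following sections try to overcome.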

In [125]:
training_matches = training_features[training_features.sum(axis=1) > 3]
training_matches

rec_id_1,rec_id_2,given_name,surname,date_of_birth,suburb,state,address_1,address_2
rec-0-org,rec-0-dup-0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
rec-1-org,rec-1-dup-0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
rec-10-org,rec-10-dup-0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
rec-100-org,rec-100-dup-0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
rec-1000-org,rec-1000-dup-0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...
rec-995-org,rec-995-dup-0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
rec-996-org,rec-996-dup-0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
rec-997-org,rec-997-dup-0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
rec-998-org,rec-998-dup-0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [126]:
testing_features.sum(axis=1).value_counts().sort_index(ascending=False)

7.0     386
6.0     401
5.0     156
4.0      41
3.0      52
2.0     774
1.0    2175
dtype: int64

In [127]:
testing_matches = testing_features[testing_features.sum(axis=1) > 3]
testing_matches

rec_id_1,rec_id_2,given_name,surname,date_of_birth,suburb,state,address_1,address_2
rec-4000-org,rec-4000-dup-0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
rec-4001-org,rec-4001-dup-0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
rec-4002-org,rec-4002-dup-0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
rec-4003-org,rec-4003-dup-0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
rec-4004-org,rec-4004-dup-0,1.0,1.0,1.0,1.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...
rec-4995-org,rec-4995-dup-0,1.0,1.0,1.0,1.0,1.0,0.0,0.0
rec-4996-org,rec-4996-dup-0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
rec-4997-org,rec-4997-dup-0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
rec-4998-org,rec-4998-dup-0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [128]:
# return the confusion matrix
conf_noml = rl.confusion_matrix(testing_true_links, testing_matches, len(candidate_links_testing))
print('confusion matrix')
print(conf_noml)

confusion matrix
[[ 984   16]
 [   0 2985]]


In [129]:
# compute the F-score for this classification
fscore = rl.fscore(conf_noml)
print('fscore', fscore)
recall = rl.recall(testing_true_links, testing_matches)
print('recall', recall)
precision = rl.precision(testing_true_links, testing_matches)
print('precision', precision)

fscore 0.9919354838709677
recall 0.984
precision 1.0
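The three scores can be recomputed by hand from the confusion matrix. The sketch below assumes the layout `[[TP, FN], [FP, TN]]`, which is consistent with the numbers reported in this notebook; `metrics_from_confusion` is an illustrative helper.

```python
def metrics_from_confusion(cm):
    """Compute (precision, recall, F-score) from [[TP, FN], [FP, TN]]."""
    (tp, fn), (fp, tn) = cm
    precision = tp / (tp + fp)   # share of predicted links that are true links
    recall = tp / (tp + fn)      # share of true links that were found
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore


precision, recall, fscore = metrics_from_confusion([[984, 16], [0, 2985]])
print('precision', precision)
print('recall', recall)
print('fscore', fscore)
```

Since precision and recall both ignore the true negatives, the large number of correctly rejected pairs (2985) does not inflate any of the three scores.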


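## Logistic Regression Classifier
Here the intercept and the coefficients (one per compared attribute) are set by hand rather than learned from the training data.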

In [130]:
intercept = -11.0
coefficients = [1.5, 1.5, 8.0, 6.0, 2.5, 6.5, 5.0]

print('Deterministic classifier')
print('intercept', intercept)
print('coefficients', coefficients)

Deterministic classifier
intercept -11.0
coefficients [1.5, 1.5, 8.0, 6.0, 2.5, 6.5, 5.0]


In [131]:
logreg = rl.LogisticRegressionClassifier(
    coefficients=coefficients, intercept=intercept)
links_logreg = logreg.predict(testing_features)

print(len(links_logreg), 'matches')

INFO:recordlinkage:Classification - predict matches and non-matches


1028 matches


In [132]:
# return the confusion matrix
conf_logreg = rl.confusion_matrix(testing_true_links, links_logreg, len(candidate_links_testing))
print('confusion matrix')
print(conf_logreg)

confusion matrix
[[ 997    3]
 [  31 2954]]


In [133]:
# compute the F-score for this classification
fscore = rl.fscore(conf_logreg)
print('fscore', fscore)
# evaluate against the testing links only: the predictions cover only the testing set
recall = rl.recall(testing_true_links, links_logreg)
print('recall', recall)
precision = rl.precision(testing_true_links, links_logreg)
print('precision', precision)

fscore 0.9832347140039448
recall 0.997
precision 0.9698443579766537


In [135]:
# Predict the match probability for each pair in the dataset.
probs = logreg.prob(testing_features)
print(probs)

INFO:recordlinkage:Classification - compute probabilities


rec_id_1      rec_id_2      
rec-4000-org  rec-4000-dup-0    1.000000
rec-4001-org  rec-4001-dup-0    0.999994
rec-4002-org  rec-4002-dup-0    1.000000
              rec-4130-dup-0    0.000075
              rec-4490-dup-0    0.000911
                                  ...   
rec-4997-org  rec-4266-dup-0    0.000075
              rec-4997-dup-0    1.000000
rec-4998-org  rec-4998-dup-0    1.000000
rec-4999-org  rec-4860-dup-0    0.000075
              rec-4999-dup-0    0.999994
Length: 3985, dtype: float64
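The probabilities above are consistent with the standard logistic model, in which the match probability is the sigmoid of the intercept plus the weighted sum of the features. The sketch below reproduces three of the printed values under that assumption; `match_probability` is an illustrative helper, while the feature vectors are copied from the testing features.

```python
import math

INTERCEPT = -11.0
COEFFICIENTS = [1.5, 1.5, 8.0, 6.0, 2.5, 6.5, 5.0]


def match_probability(features):
    """Sigmoid of intercept + sum(coefficient * feature)."""
    z = INTERCEPT + sum(w * x for w, x in zip(COEFFICIENTS, features))
    return 1.0 / (1.0 + math.exp(-z))


# Feature order: given_name, surname, date_of_birth, suburb, state, address_1, address_2
surname_only = [0, 1, 0, 0, 0, 0, 0]        # rec-4002-org / rec-4130-dup-0
surname_and_state = [0, 1, 0, 0, 1, 0, 0]   # rec-4002-org / rec-4490-dup-0
all_but_birth_date = [1, 1, 0, 1, 1, 1, 1]  # rec-4001-org / rec-4001-dup-0

for name, vector in [('surname_only', surname_only),
                     ('surname_and_state', surname_and_state),
                     ('all_but_birth_date', all_but_birth_date)]:
    print(name, round(match_probability(vector), 6))
```

A pair is predicted as a match when the weighted sum exceeds zero, i.e. when the agreeing attributes contribute more than 11.0 in total. Unlike the trivial approach, agreement on `date_of_birth` (8.0) counts far more than agreement on `surname` (1.5).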


## Naive Bayes

In [136]:
# Initialise the NaiveBayesClassifier.
cl = rl.NaiveBayesClassifier()
cl.fit(training_features, training_true_links)

INFO:recordlinkage:Classification - start training NaiveBayesClassifier
INFO:recordlinkage:Classification - training computation time: ~0.38s


In [137]:
# Print the parameters that are trained (m, u and p).
print("p probability P(Match):", cl.p)
print("m probabilities P(x_i=1|Match):", cl.m_probs)
print("u probabilities P(x_i=1|Non-Match):", cl.u_probs)
print("log weights of features:", cl.log_weights)
print("weights of features:", cl.weights)

p probability P(Match): 0.06878789966923929
m probabilities P(x_i=1|Match): {'given_name': {0.0: 0.202855011126721, 1.0: 0.7971449888732783}, 'surname': {0.0: 0.1477585951535889, 1.0: 0.8522414048464103}, 'date_of_birth': {0.0: 0.07713500740621045, 1.0: 0.9228649925937881}, 'suburb': {0.0: 0.07463062486197715, 1.0: 0.9253693751380213}, 'state': {0.0: 0.057851261815614195, 1.0: 0.9421487381843849}, 'address_1': {0.0: 0.11169548651662971, 1.0: 0.8883045134833688}, 'address_2': {0.0: 0.23891811976368016, 1.0: 0.7610818802363194}}
u probabilities P(x_i=1|Non-Match): {'given_name': {0.0: 0.9943945962653057, 1.0: 0.005605403734694655}, 'surname': {0.0: 0.007399872324854794, 1.0: 0.9926001276751463}, 'date_of_birth': {0.0: 0.9511793526919657, 1.0: 0.048820647308035744}, 'suburb': {0.0: 0.9974100434838232, 1.0: 0.002589956516178129}, 'state': {0.0: 0.7839607796357014, 1.0: 0.2160392203642986}, 'address_1': {0.0: 0.9970955489886397, 1.0: 0.0029044510113608356}, 'address_2': {0.0: 0.998834518550

In [138]:
# Evaluate the model
links_bayes = cl.predict(testing_features)
print("Predicted number of links:", len(links_bayes))

INFO:recordlinkage:Classification - predict matches and non-matches


Predicted number of links: 998


In [139]:
cm = rl.confusion_matrix(testing_true_links, links_bayes, len(candidate_links_testing))
print("Confusion matrix:\n", cm)
print(sum(sum(cm)))

Confusion matrix:
 [[ 996    4]
 [   2 2983]]
3985


In [140]:
# compute the F-score for this classification
fscore = rl.fscore(cm)
print('fscore', fscore)
recall = rl.recall(testing_true_links, links_bayes)
print('recall', recall)
precision = rl.precision(testing_true_links, links_bayes)
print('precision', precision)

fscore 0.9969969969969971
recall 0.996
precision 0.9979959919839679


In [141]:
# Predict the match probability for each pair in the dataset.
probs = cl.prob(testing_features)
probs

INFO:recordlinkage:Classification - compute probabilities


rec_id_1      rec_id_2      
rec-4000-org  rec-4000-dup-0    1.000000e+00
rec-4001-org  rec-4001-dup-0    1.000000e+00
rec-4002-org  rec-4002-dup-0    1.000000e+00
              rec-4130-dup-0    1.552330e-07
              rec-4490-dup-0    9.173783e-06
                                    ...     
rec-4997-org  rec-4266-dup-0    1.552330e-07
              rec-4997-dup-0    9.999999e-01
rec-4998-org  rec-4998-dup-0    1.000000e+00
rec-4999-org  rec-4860-dup-0    1.552330e-07
              rec-4999-dup-0    1.000000e+00
Length: 3985, dtype: float64
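The posterior probabilities above can be reproduced from the trained parameters with Bayes' rule, treating the features as conditionally independent given the match status. The sketch below uses the printed `p`, `m` and `u` values rounded to six decimals; the `u` value for `address_2` is derived from the truncated output (1 - 0.998834...), so it is an approximation. `posterior` is an illustrative helper, not a library function.

```python
# Trained parameters copied from the output above (rounded to six decimals)
P_MATCH = 0.068788
M_AGREE = {  # P(x_i = 1 | Match)
    'given_name': 0.797145, 'surname': 0.852241, 'date_of_birth': 0.922865,
    'suburb': 0.925369, 'state': 0.942149, 'address_1': 0.888305,
    'address_2': 0.761082,
}
U_AGREE = {  # P(x_i = 1 | Non-Match); address_2 derived from truncated output
    'given_name': 0.005605, 'surname': 0.992600, 'date_of_birth': 0.048821,
    'suburb': 0.002590, 'state': 0.216039, 'address_1': 0.002904,
    'address_2': 0.001165,
}


def posterior(agreeing):
    """P(Match | features) for the set of attribute names that agree."""
    like_match, like_non_match = P_MATCH, 1.0 - P_MATCH
    for name in M_AGREE:
        agrees = name in agreeing
        like_match *= M_AGREE[name] if agrees else 1.0 - M_AGREE[name]
        like_non_match *= U_AGREE[name] if agrees else 1.0 - U_AGREE[name]
    return like_match / (like_match + like_non_match)


print(posterior({'surname'}))            # compare with rec-4002-org / rec-4130-dup-0
print(posterior({'surname', 'state'}))   # compare with rec-4002-org / rec-4490-dup-0
```

Agreement on `surname` carries almost no evidence here, because nearly all non-matching candidate links also agree on it: most of them entered the candidate set through the surname block.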

## Expectation-Conditional Maximisation

Unsupervised learning with the ECM algorithm.
Training data is often hard to collect in record linkage or data matching
problems. The Expectation-Conditional Maximisation (ECM) algorithm is the
best-known algorithm for unsupervised data matching. The algorithm performs
relatively well compared to supervised methods.
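To give an intuition of how the parameters can be learned without labels, the sketch below runs a plain Expectation-Maximisation loop on a two-class mixture of independent binary features. It is a simplified illustration on synthetic data, not the library's ECM implementation; `fit_em`, the initial values, and the toy vectors are all assumptions made for this example.

```python
def fit_em(vectors, n_iter=50, p=0.5, m_init=0.9, u_init=0.1, eps=1e-6):
    """Estimate p = P(Match), m_i = P(x_i=1|Match), u_i = P(x_i=1|Non-Match)."""
    n_features = len(vectors[0])
    m = [m_init] * n_features
    u = [u_init] * n_features
    for _ in range(n_iter):
        # E-step: posterior match probability of every vector
        g = []
        for x in vectors:
            like_m, like_u = p, 1.0 - p
            for xi, mi, ui in zip(x, m, u):
                like_m *= mi if xi else 1.0 - mi
                like_u *= ui if xi else 1.0 - ui
            g.append(like_m / (like_m + like_u))
        # M-step: re-estimate the parameters from the soft assignments
        total_m = sum(g)
        total_u = len(vectors) - total_m
        p = total_m / len(vectors)
        for i in range(n_features):
            agree_m = sum(gj for gj, x in zip(g, vectors) if x[i])
            agree_u = sum(1.0 - gj for gj, x in zip(g, vectors) if x[i])
            # Keep the probabilities strictly inside (0, 1)
            m[i] = min(max(agree_m / total_m, eps), 1.0 - eps)
            u[i] = min(max(agree_u / total_u, eps), 1.0 - eps)
    return p, m, u


# Synthetic comparison vectors: 20 "match-like" and 80 "non-match-like" pairs
vectors = ([(1, 1, 1)] * 18 + [(1, 1, 0)] * 2
           + [(0, 0, 0)] * 75 + [(0, 0, 1)] * 5)

p, m, u = fit_em(vectors)
print('p', round(p, 3))
print('m', [round(v, 3) for v in m])
print('u', [round(v, 3) for v in u])
```

Starting with `m_init` greater than `u_init` makes the first class play the role of the matches; with the opposite initialisation the two classes would simply swap.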

In [142]:
# Initialise the Expectation-Conditional Maximisation classifier.
cl = rl.ECMClassifier()
cl.fit(training_features)

INFO:recordlinkage:Classification - start training ECMClassifier
INFO:recordlinkage:Classification - training computation time: ~0.28s


In [143]:
# Print the parameters that are trained (m, u and p). Note that the estimates
# are close to those of the supervised Naive Bayes classifier above.
print("p probability P(Match):", cl.p)
print("m probabilities P(x_i=1|Match):", cl.m_probs)
print("u probabilities P(x_i=1|Non-Match):", cl.u_probs)
print("log weights of features:", cl.log_weights)
print("weights of features:", cl.weights)

p probability P(Match): 0.06889150268366431
m probabilities P(x_i=1|Match): {'given_name': {0.0: 0.20512406601999528, 1.0: 0.7948759339800049}, 'surname': {0.0: 0.14935524667978686, 1.0: 0.8506447533202137}, 'date_of_birth': {0.0: 0.07601967331481775, 1.0: 0.9239803266851828}, 'suburb': {0.0: 0.07639393282192558, 1.0: 0.9236060671780747}, 'state': {0.0: 0.05833475347035745, 1.0: 0.9416652465296432}, 'address_1': {0.0: 0.11211970450034767, 1.0: 0.8878802954996524}, 'address_2': {0.0: 0.23953167427874456, 1.0: 0.7604683257212559}}
u probabilities P(x_i=1|Non-Match): {'given_name': {0.0: 0.9943147859802327, 1.0: 0.005685214019766486}, 'surname': {0.0: 0.0072661175359290155, 1.0: 0.9927338824640696}, 'date_of_birth': {0.0: 0.9513591284844728, 1.0: 0.048640871515525326}, 'suburb': {0.0: 0.9973822551460484, 1.0: 0.0026177448539499246}, 'state': {0.0: 0.7840057992261295, 1.0: 0.2159942007738693}, 'address_1': {0.0: 0.9971626791631462, 1.0: 0.0028373208368517224}, 'address_2': {0.0: 0.99887367

In [144]:
# evaluate the model
links_ecm = cl.predict(testing_features)
print("Predicted number of links:", len(links_ecm))

INFO:recordlinkage:Classification - predict matches and non-matches


Predicted number of links: 998


In [145]:
cm = rl.confusion_matrix(testing_true_links, links_ecm, len(candidate_links_testing))
print("Confusion matrix:\n", cm)

Confusion matrix:
 [[ 996    4]
 [   2 2983]]


In [146]:
# compute the F-score for this classification
fscore = rl.fscore(cm)
print('fscore', fscore)
recall = rl.recall(testing_true_links, links_ecm)
print('recall', recall)
precision = rl.precision(testing_true_links, links_ecm)
print('precision', precision)

fscore 0.9969969969969971
recall 0.996
precision 0.9979959919839679


In [147]:
# Predict the match probability for each pair in the dataset.
probs = cl.prob(testing_features)
print(probs)

INFO:recordlinkage:Classification - compute probabilities


rec_id_1      rec_id_2      
rec-4000-org  rec-4000-dup-0    1.000000e+00
rec-4001-org  rec-4001-dup-0    1.000000e+00
rec-4002-org  rec-4002-dup-0    1.000000e+00
              rec-4130-dup-0    1.605937e-07
              rec-4490-dup-0    9.409594e-06
                                    ...     
rec-4997-org  rec-4266-dup-0    1.605937e-07
              rec-4997-dup-0    9.999999e-01
rec-4998-org  rec-4998-dup-0    1.000000e+00
rec-4999-org  rec-4860-dup-0    1.605937e-07
              rec-4999-dup-0    1.000000e+00
Length: 3985, dtype: float64


TODO:
* Create functions for cleaning, evaluation and classification (if possible)
* Try another FEBRL dataset
* Data fusion, if there is time