# Record Linkage / Entity Resolution / Deduplication (using Python!)
## A Practical Guide by Abby for DSSG 2022

## Real World Data is A MESS

* Inputted by humans
* Often not reviewed
* Not properly normalized by the input system

## Real World Data is A MESS

* Lack of a **unique identifier** like a social security number to join datasets! 

## Real World Data is A MESS

The solution is **record linkage**-- joining records in a fuzzy way by calculating similarity of different attributes of entities!

We use names, addresses, and more!

In [1]:
import warnings; warnings.simplefilter('ignore')
import logging; logging.disable(level=logging.INFO)


In [2]:
import pandas as pd

data = [
    ("Chin's","3200 Las Vegas Boulevard","New York"),
    ("Chin Bistro","3200 Las Vegas Blvd.","New York"),
    ("Bistro","3400 Las Vegas Blvd.","New York City"),
    ("Bistro","3400 Las Vegas B.","NYC"),
]

df = pd.DataFrame(data, columns=['restaurant', 'address', 'city'])

In [3]:
df

Unnamed: 0,restaurant,address,city
0,Chin's,3200 Las Vegas Boulevard,New York
1,Chin Bistro,3200 Las Vegas Blvd.,New York
2,Bistro,3400 Las Vegas Blvd.,New York City
3,Bistro,3400 Las Vegas B.,NYC


A good deduplication on the data above would find that:

* `(0, 1)` are duplicates
* `(2, 3)` are duplicates
* but `(0, 1)` and `(2, 3)` are different, despite being similar

The process for deduplicating a dataset is as follows:

![er_total](old_er.png)

We add an additional step of preprocessing because I feel like that's particularly useful for people.

## Step 1: Preprocessing

Without unique identifiers, we need to match records by fuzzy data like:

* Names
* Addresses
* Phone Numbers
* Dates

So it's important to clean them for matching.

## Tools for Preprocessing: cleaning Names

* Regex for cleaning
* My personal favorite `probablepeople`-- uses NLP to separate out some fields

In [4]:
import probablepeople as pp
pp.parse('Mr. Paul van de Boor')

[('Mr.', 'PrefixMarital'),
 ('Paul', 'GivenName'),
 ('van', 'Surname'),
 ('de', 'Surname'),
 ('Boor', 'Surname')]

But this isn't that good for non-English names! So you can try `nameparser`: 


In [5]:
import nameparser
from nameparser import HumanName
from IPython.display import display

display(pp.parse('Núria deAdell Raventós'))
display(HumanName('Núria deAdell Raventós'))

[('Núria', 'CorporationName'),
 ('deAdell', 'CorporationName'),
 ('Raventós', 'CorporationNameOrganization')]

<HumanName : [
	title: '' 
	first: 'Núria' 
	middle: 'deAdell' 
	last: 'Raventós' 
	suffix: ''
	nickname: ''
]>
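For the regex route mentioned above, here is a minimal sketch using only the standard library. The `clean_name` helper and the `TITLES` set are hypothetical names made up for this example, and stripping accents is a lossy choice that may not suit every dataset.

```python
import re
import unicodedata

# Hypothetical list of titles to drop; extend it for your own data.
TITLES = {"mr", "mrs", "ms", "dr", "prof"}

def clean_name(raw):
    """Lowercase, strip accents and punctuation, and drop common titles."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", raw)
    no_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace anything that is not a letter or whitespace with a space.
    letters_only = re.sub(r"[^a-z\s]", " ", no_accents.lower())
    tokens = [tok for tok in letters_only.split() if tok not in TITLES]
    return " ".join(tokens)

print(clean_name("Mr. Paul van de Boor"))
print(clean_name("Núria deAdell Raventós"))
```

This only normalizes the string; it does not split it into given name and surname the way the parsers above try to.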

## Tools for Preprocessing: cleaning Addresses using Geocoding! 


Google Geocoder generally does a good job, but the API only allows a limited number of queries. Google is able to:
* Ignore lower/upper case difference
* Ignore the difference between Suite 300 vs 330 (building entrance may be the same)
* Consider 'Lupertino' as 'Cupertino'
* Expand abbreviations, like 'STE' vs 'Suite'


In [61]:
import requests
import geocoder

full_addresses = [
    "2066 Crist Drive, 94024, Los Altos, CALIFORNIA, US",
    "2066 Crist Dr, 94024, LOS ALTOS, CALIFORNIA, US",
    "20863 Stevens Creek Blvd., Suite 300, 95015, CUPERTINO, CALIFORNIA, US",
    "20863 STEVENS CREEK BLVD STE 330, 95014, Lupertino, CALIFORNIA, US",
    "10260 Bandley Drive, 95014, Cupertino, CALIFORNIA, US",
    "10260 Bandley Dr., 95014, Cupertino, CALIFORNIA, US",
    "20525 MARIANI AVENUE, 95014, CUPERTINO, CALIFORNIA, US",
    "20525 Mariani Ave, 95014, CUPERTINO, CALIFORNIA, US",
    "1 Infinite Loop, 95014, Cupertino, CALIFORNIA, US",
    "One Infinite Loop,, 95014, Cupertino, CALIFORNIA, US",
    "One Apple Park Way, 95014, Cupertino, CALIFORNIA, US",
    "1 Apple Park Way, 95014, Cupertino, CALIFORNIA, US",
]

full_addresses_latlng = []
with requests.Session() as session:
    for a in full_addresses:
        a_geocoded = geocoder.google(a, session=session)
        full_addresses_latlng.append(a_geocoded.latlng)

address_latlng = list(zip(full_addresses, full_addresses_latlng))

In [62]:
print(address_latlng[2], address_latlng[3])

('20863 Stevens Creek Blvd., Suite 300, 95015, CUPERTINO, CALIFORNIA, US', None) ('20863 STEVENS CREEK BLVD STE 330, 95014, Lupertino, CALIFORNIA, US', None)


Pretty good, but not perfect.

If geocoding isn't working, we can try: 
* `pypostal` or `usaddress` parsers
* a ton of other custom options

## Tools for Preprocessing: cleaning Phone Numbers 

Use `phonenumbers` package

In [8]:
import phonenumbers

print("Phone number cleaning:")
phone = "(443) 745-7187"
print(phone, '->')
print(
    phonenumbers.format_number(
        phonenumbers.parse(phone, 'US'),
        phonenumbers.PhoneNumberFormat.E164)
)

Phone number cleaning:
(443) 745-7187 ->
+14437457187


## Tools for Preprocessing: cleaning Dates
`dateparser` can guess date formats and parse them into datetime objects. It can even pick DD/MM or MM/DD based on the language of the text.


In [6]:
import dateparser

print(dateparser.parse("at 10/5/1994 3:40pm"))
print(dateparser.parse("às 10/5/1994 15:40"))

1994-10-05 15:40:00
1994-05-10 15:40:00


## Let's use a real world dataset to walk through the rest of the steps!

This `restaurant` dataset is a combination of restaurants from the Zagat guide and a few other guides.

It has 881 rows, with 150 duplicates. The `cluster` column indicates the true grouping of real-world entities.



In [9]:
df_with_truth = pd.read_csv('restaurant.csv')
df_with_truth.head(12)

Unnamed: 0,name,addr,city,phone,type,cluster
0,arnie morton's of chicago,435 s. la cienega blv.,los angeles,310/246-1501,american,0
1,arnie morton's of chicago,435 s. la cienega blvd.,los angeles,310-246-1501,steakhouses,0
2,arnie morton,435 s. la cienega boulevard,los angeles,310-246-1501,steakhouses,0
3,art's delicatessen,12224 ventura blvd.,studio city,818/762-1221,american,1
4,art's deli,12224 ventura blvd.,studio city,818-762-1221,delis,1
5,art's deli,12224 ventura blvd.,los angeles,818-762-1221,delis,1
6,hotel bel-air,701 stone canyon rd.,bel air,310/472-1211,californian,2
7,bel-air hotel,701 stone canyon rd.,bel air,310-472-1211,californian,2
8,bel-air,701 stone canyon road,bel air,(310) 472-1211,american,2
9,cafe bizou,14016 ventura blvd.,sherman oaks,818/788-3536,french,3


We make things a little harder by deleting the phone + truth label (`cluster`).

In [10]:
df = df_with_truth.drop(columns=['cluster', 'phone'])
df.head(12)

Unnamed: 0,name,addr,city,type
0,arnie morton's of chicago,435 s. la cienega blv.,los angeles,american
1,arnie morton's of chicago,435 s. la cienega blvd.,los angeles,steakhouses
2,arnie morton,435 s. la cienega boulevard,los angeles,steakhouses
3,art's delicatessen,12224 ventura blvd.,studio city,american
4,art's deli,12224 ventura blvd.,studio city,delis
5,art's deli,12224 ventura blvd.,los angeles,delis
6,hotel bel-air,701 stone canyon rd.,bel air,californian
7,bel-air hotel,701 stone canyon rd.,bel air,californian
8,bel-air,701 stone canyon road,bel air,american
9,cafe bizou,14016 ventura blvd.,sherman oaks,french


In [12]:
import numpy as np
all_addresses = df['addr'].str.cat(df['city'], sep=', ').values
unique_addresses = np.unique(all_addresses)
print(len(all_addresses), len(unique_addresses))

881 819


In [13]:
import os.path
import json

geocoding_filename = 'address_to_geocoding.json'

def geocode_addresses(address_to_geocoding):
    remaining_addresses = (
        set(unique_addresses) -
        set(k for k, v in address_to_geocoding.items() if v is not None))
    
    with requests.Session() as session:
        for i, address in enumerate(remaining_addresses):
            print(f"Geocoding {i + 1}/{len(remaining_addresses)}")
            geocode_result = geocoder.google(address, session=session)
            address_to_geocoding[address] = geocode_result.json

        with open(geocoding_filename, 'w') as f:
            json.dump(address_to_geocoding, f, indent=4)

if not os.path.exists(geocoding_filename):
    address_to_geocoding = {}
    geocode_addresses(address_to_geocoding)
else:
    with open(geocoding_filename) as f:
        address_to_geocoding = json.load(f)
    geocode_addresses(address_to_geocoding)
 
address_to_postal = {
    k: v['postal']
    for k, v in address_to_geocoding.items()
    if v is not None and 'postal' in v
}
address_to_latlng = {
    k: (v['lat'], v['lng'])
    for k, v in address_to_geocoding.items()
    if v is not None
}
print(f"Failed to get postal from {len(address_to_geocoding) - len(address_to_postal)}")
print(f"Failed to get latlng from {len(address_to_geocoding) - len(address_to_latlng)}")

Geocoding 1/1
Failed to get postal from 11
Failed to get latlng from 1


In [14]:
def assign_postal_lat_lng(df):
    addresses = df['addr'].str.cat(df['city'], sep=', ')
    addresses_to_postal = [address_to_postal.get(a) for a in addresses]
    addresses_to_lat = [address_to_latlng[a][0] if a in address_to_latlng else None for a in addresses]
    addresses_to_lng = [address_to_latlng[a][1] if a in address_to_latlng else None for a in addresses]

    return df.assign(postal=addresses_to_postal, lat=addresses_to_lat, lng=addresses_to_lng)

df = assign_postal_lat_lng(df)
df.head(6)

Unnamed: 0,name,addr,city,type,postal,lat,lng
0,arnie morton's of chicago,435 s. la cienega blv.,los angeles,american,90048,34.070708,-118.376563
1,arnie morton's of chicago,435 s. la cienega blvd.,los angeles,steakhouses,90048,34.070708,-118.376563
2,arnie morton,435 s. la cienega boulevard,los angeles,steakhouses,90048,34.070708,-118.376563
3,art's delicatessen,12224 ventura blvd.,studio city,american,91604,34.142963,-118.399465
4,art's deli,12224 ventura blvd.,studio city,delis,91604,34.142963,-118.399465
5,art's deli,12224 ventura blvd.,los angeles,delis,91604,34.142963,-118.399465


## Step 2: Blocking

Now that we have cleaned records, we need the pairs we want to compare to find matches. To produce the pairs, we could build a "full" index, i.e., compare all records against all records. The number of pairs is then:
`len(df) * (len(df) - 1) / 2 == 387640`
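To see where that number comes from, here is a small standard-library sketch. `count_full_pairs` is a helper name invented for this example; it enumerates the pairs for a 4-record toy case and then prints how the count grows with the number of records.

```python
from itertools import combinations

def count_full_pairs(n_records):
    """Number of unordered record pairs in a full (all-vs-all) index."""
    return n_records * (n_records - 1) // 2

# Enumerate the pairs explicitly for a 4-record toy example.
toy_pairs = list(combinations(range(4), 2))
print(toy_pairs)

# The count grows quadratically with the number of records.
for n in (4, 881, 10_000, 1_000_000):
    print(f"{n:>9,} records -> {count_full_pairs(n):>15,} pairs")
```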

![er_total](old_er.png)

In [15]:
import recordlinkage as rl
from recordlinkage.index import Full

full_indexer = Full()
pairs = full_indexer.index(df)

print(f"Complete blocking: {len(df)} records, {len(pairs)} pairs")

Complete blocking: 881 records, 387640 pairs


The number of pairs grows too fast as the number of records grows: it grows quadratically. That's why we need **blocking**. We need to produce only pairs that are good candidates of being duplicates to avoid wasting too much time.

Typically, we block on attributes we're more confident on. 

A blocking criterion could be: only consider records as candidate pairs **IF they match on zip code.**

OR another blocking criterion could be: only consider records as candidate pairs **IF they match on zip code OR first initial of last name.**
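The two criteria above can be sketched in plain Python as follows. The records and the `block_pairs` helper are hypothetical and only illustrate the idea: group records by a key, pair up records inside each group, and treat "OR" as a union of candidate sets.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: (id, zip code, last name).
records = [
    (0, "90048", "morton"),
    (1, "90048", "morton"),
    (2, "91604", "miller"),
    (3, "91604", "art"),
    (4, "10019", "nick"),
    (5, "10011", "nolan"),
]

def block_pairs(records, key):
    """Return candidate pairs of record ids that share the same blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key(record)].append(record[0])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

zip_pairs = block_pairs(records, key=lambda r: r[1])
initial_pairs = block_pairs(records, key=lambda r: r[2][0])
# "zip code OR first initial" is the union of the two candidate sets.
either_pairs = zip_pairs | initial_pairs

print(sorted(zip_pairs))
print(sorted(either_pairs))
```

Adding criteria with OR can only grow the candidate set, so it trades extra comparisons for a lower chance of missing a true match.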

In [17]:
from recordlinkage.index import Block

postal_indexer = Block('postal')
pairs = postal_indexer.index(df)

print(f"Postal index: {len(pairs)} pairs")

Postal index: 6462 pairs


## Step 3: Similarity Calculation (Comparison)! 

Now we want to run comparisons on the blocked pairs to produce a comparison vector for each pair. A comparison vector represents the similarity between 2 records by holding a similarity value between 0 and 1 for each column.

![er_total](old_er.png)

In [18]:
pd.DataFrame([[0.5, 0.8, 0.9, 1]],
             columns=['name', 'addr', 'postal', 'latlng'],
             index=pd.MultiIndex.from_arrays([[100], [200]]))

Unnamed: 0,Unnamed: 1,name,addr,postal,latlng
100,200,0.5,0.8,0.9,1


So this tells us that the pair of records `(100, 200)` has:

* Low similarity on names
* Some similarity on addresses
* High similarity on postals
* Equal latitude, longitude!

## A few more similarity calculation/comparison details: 

* **Strings**: Jaro-Winkler and Levenshtein distance are common.
* **Numbers** (like latitude/longitude or age): Python's `recordlinkage` has built-in comparison methods, R doesn't!
* **Categorical attributes** (like gender): you could use exact matching.
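To make the string case concrete, here is a pure-Python sketch of Levenshtein distance turned into a 0-to-1 similarity. Dividing by the longer string's length is an assumption made for this example, and library implementations may normalize differently.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

def levenshtein_similarity(a, b):
    """Scale the distance into [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest

print(levenshtein_similarity("art's delicatessen", "art's deli"))
print(levenshtein_similarity("arnie morton", "cafe bizou"))
```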


In [19]:
comp = rl.Compare()
comp.string('name', 'name', method='jarowinkler', label='name')
comp.string('addr', 'addr', method='jarowinkler', label='addr')
comp.string('postal', 'postal', method='jarowinkler', label='postal')
comp.geo('lat', 'lng', 'lat', 'lng', method='exp', scale=0.1, offset=0.01, label='latlng');

comparison_vectors = comp.compute(pairs, df)
comparison_vectors.head(5)

Unnamed: 0,Unnamed: 1,name,addr,postal,latlng
1,0,1.0,0.985507,1.0,1.0
2,0,0.896,0.910774,1.0,1.0
2,1,0.896,0.923779,1.0,1.0
32,0,0.520128,0.580927,1.0,2.5e-05
32,1,0.520128,0.574339,1.0,2.5e-05


## Step 4: Classification

We're going to talk about a few different ways to classify comparison vectors.

![er_total](old_er.png)




## Threshold-Based Classification
A simple way to classify comparison vectors as matches or nonmatches is to compute a weighted average over the vectors to get a score.


In [20]:
scores = np.average(
    comparison_vectors.values,
    axis=1,
    weights=[50, 30, 10, 20])
scored_comparison_vectors = comparison_vectors.assign(score=scores)
scored_comparison_vectors.head(5)

Unnamed: 0,Unnamed: 1,name,addr,postal,latlng,score
1,0,1.0,0.985507,1.0,1.0,0.996047
2,0,0.896,0.910774,1.0,1.0,0.928393
2,1,0.896,0.923779,1.0,1.0,0.93194
32,0,0.520128,0.580927,1.0,2.5e-05,0.48577
32,1,0.520128,0.574339,1.0,2.5e-05,0.483974


## But Abby: how are the weights computed? 

* honestly in practice I feel like people just guess #SAD #CRY
* The often-used probabilistic way, Fellegi-Sunter (1969): each field (name, address, etc.) is assigned an agreement weight and a disagreement weight.
* These **weights are log likelihood ratios** based on the ability of field values to discriminate between records and the probability that the values contain errors. Sex has poor discrimination because there are very few options. Last name has high discrimination.

$$\log\left(\frac{\text{matching prob of field}}{\text{non-matching prob of field}}\right)$$

* This helps us account for the fact that two records are more likely to be the same person if they match on a less common last name like 'Vajiac' instead of 'Smith'
* the expectation-maximization (EM) algorithm is used to estimate field weights (Winkler 1990) better!
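As a rough illustration of the weights described above, the sketch below computes an agreement and a disagreement weight per field. The `m` and `u` values are made up for this example (in practice they would be estimated, e.g. with EM), and the natural log is an arbitrary choice of base.

```python
import math

def field_weights(m, u):
    """Agreement and disagreement weights for one field.

    m: P(field agrees | records are a true match)
    u: P(field agrees | records are not a match)
    """
    agreement = math.log(m / u)
    disagreement = math.log((1 - m) / (1 - u))
    return agreement, disagreement

# Hypothetical m and u values, chosen only to illustrate the contrast.
fields = {
    "last_name": (0.95, 0.01),  # rarely agrees by chance
    "sex": (0.98, 0.50),        # agrees by chance about half the time
}

for name, (m, u) in fields.items():
    agree, disagree = field_weights(m, u)
    print(f"{name:>9}: agree {agree:+.2f}, disagree {disagree:+.2f}")
```

Agreement on last name earns a much larger positive weight than agreement on sex, which matches the intuition about discriminating power.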




## Dumb Assumptions that never hold in practice

* Matching probabilities are independent from one another 
* Comparison vectors between pairs are independent from one another

Let's take a quick look at records in rows `0` to `5`.

In [65]:
df.head(6)

Unnamed: 0,name,addr,city,type,postal,lat,lng
0,arnie morton's of chicago,435 s. la cienega blv.,los angeles,american,90048,34.070708,-118.376563
1,arnie morton's of chicago,435 s. la cienega blvd.,los angeles,steakhouses,90048,34.070708,-118.376563
2,arnie morton,435 s. la cienega boulevard,los angeles,steakhouses,90048,34.070708,-118.376563
3,art's delicatessen,12224 ventura blvd.,studio city,american,91604,34.142963,-118.399465
4,art's deli,12224 ventura blvd.,studio city,delis,91604,34.142963,-118.399465
5,art's deli,12224 ventura blvd.,los angeles,delis,91604,34.142963,-118.399465


In [27]:
matches = scored_comparison_vectors[
    scored_comparison_vectors['score'] >= 0.9]
matches.head(10)


Unnamed: 0,Unnamed: 1,name,addr,postal,latlng,score
1,0,1.0,0.985507,1.0,1.0,0.996047
2,0,0.896,0.910774,1.0,1.0,0.928393
2,1,0.896,0.923779,1.0,1.0,0.93194
33,32,0.896296,1.0,1.0,1.0,0.952862
36,35,1.0,0.809804,1.0,1.0,0.948128
4,3,0.911111,1.0,1.0,1.0,0.959596
5,3,0.911111,1.0,1.0,1.0,0.959596
5,4,1.0,1.0,1.0,1.0,1.0
49,48,1.0,1.0,1.0,1.0,1.0
50,48,1.0,0.909992,1.0,1.0,0.975452



![test](threshold_based.png)

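In Fellegi-Sunter, classify a pair as a match if:
$$\log\left(\frac{m}{u}\right) > t$$

where $m$ and $u$ are the probabilities of observing the pair's agreement pattern among true matches and among nonmatches, and $t$ is a chosen threshold.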


In [31]:
golden_pairs = Block('cluster').index(df_with_truth)
print("Golden pairs:", len(golden_pairs))

found_pairs_set = set(matches.index)

golden_pairs_set = set(golden_pairs)

true_positives = golden_pairs_set & found_pairs_set
false_positives = found_pairs_set - golden_pairs_set
false_negatives = golden_pairs_set - found_pairs_set

print('true_positives total:', len(true_positives))
print('false_positives total:', len(false_positives))
print('false_negatives total:', len(false_negatives))

Golden pairs: 150
true_positives total: 127
false_positives total: 2
false_negatives total: 23


Just a couple of false positives:

In [29]:
print(f"False positives:")
for false_positive_pair in false_positives:
    display(df.loc[list(false_positive_pair)][['name', 'addr', 'postal', 'lat', 'lng']])

False positives:


Unnamed: 0,name,addr,postal,lat,lng
198,ritz-carlton dining room (buckhead),3434 peachtree rd. ne,30326,33.850488,-84.363631
196,ritz-carlton cafe (buckhead),3434 peachtree rd. ne,30326,33.850488,-84.363631


Unnamed: 0,name,addr,postal,lat,lng
839,ritz-carlton cafe (atlanta),181 peachtree st.,30303,33.758572,-84.387181
200,ritz-carlton restaurant,181 peachtree st.,30303,33.758572,-84.387181


But lots of false negatives!!! 

In [32]:
print(f"False negatives (sample 10 of {len(false_negatives)}):")
for false_negative_pair in list(false_negatives)[:10]:
    display(df.loc[list(false_negative_pair)][['name', 'addr', 'postal', 'lat', 'lng']])

False negatives (sample 10 of 23):


Unnamed: 0,name,addr,postal,lat,lng
165,abruzzi,2355 peachtree rd. ne,30305,33.819166,-84.387518
164,abruzzi,2355 peachtree rd. peachtree battle shopping ...,30305,33.819818,-84.386643


Unnamed: 0,name,addr,postal,lat,lng
35,locanda veneta,8638 w. third st.,90048.0,34.073421,-118.381098
34,locanda veneta,3rd st.,,34.009031,-118.488174


Unnamed: 0,name,addr,postal,lat,lng
29,katsu,1972 hillhurst ave.,90027,34.107403,-118.287172
28,restaurant katsu,1972 n. hillhurst ave.,90027,34.107403,-118.287172


Unnamed: 0,name,addr,postal,lat,lng
37,locanda,w. third st.,,34.068947,-118.322599
36,locanda veneta,8638 w 3rd,90048.0,34.073421,-118.381098


Unnamed: 0,name,addr,postal,lat,lng
183,heera of india,595 piedmont ave.,30308,33.770298,-84.381122
182,heera of india,595 piedmont ave. rio shopping mall,30324,33.798336,-84.371044


Unnamed: 0,name,addr,postal,lat,lng
41,palm the (los angeles),9001 santa monica blvd.,90069,34.083064,-118.387282
40,the palm,9001 santa monica blvd.,90069,34.083473,-118.387373


Unnamed: 0,name,addr,postal,lat,lng
137,shun lee palace,155 e. 55th st.,10022,40.759435,-73.969072
136,shun lee west,43 w. 65th st.,10023,40.7729,-73.981348


Unnamed: 0,name,addr,postal,lat,lng
36,locanda veneta,8638 w 3rd,90048.0,34.073421,-118.381098
34,locanda veneta,3rd st.,,34.009031,-118.488174


Unnamed: 0,name,addr,postal,lat,lng
145,uncle nick's,747 ninth ave.,10019,40.763884,-73.989002
144,uncle nick's,747 9th ave. between 50th and 51st sts.,10019,40.763835,-73.988912


Unnamed: 0,name,addr,postal,lat,lng
111,mesa grill,102 fifth ave.,10011,40.737045,-73.993119
110,mesa grill,102 5th ave. between 15th and 16th sts.,10001,40.748441,-73.985664


## Supervised Classification


Instead of trying to guess weights and thresholds, we can train a classifier to learn how to classify matches and nonmatches based on some training data we provide. Remember that the training data has to be preprocessed as well!

For a first classifier pass, I would use an SVM. SVMs:
* Are (more) resilient to noise
* Can handle correlated features (like `postal` and `latlng`)
* Are robust to imbalanced training sets (a huge problem w/ record linkage/entity resolution problems!)

In [54]:

df_training = pd.read_csv('restaurant-training.csv', skip_blank_lines=True)
df_training = df_training.drop(columns=['phone'])
df_training.head()


Unnamed: 0,name,addr,city,type,cluster
0,locanda veneta,3rd st.,los angeles,italian,13
1,locanda veneta,8638 w. third st.,los angeles,italian,13
2,locanda veneta,8638 w 3rd,st los angeles,italian,13
3,cafe lalo,201 w. 83rd st.,new york,coffee bar,26
4,cafe lalo,201 w. 83rd st.,new york city,coffeehouses,26


In [53]:
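import numpy as np
import re

irrelevant_regex = re.compile(r'[^a-z0-9\s]')
multispace_regex = re.compile(r'\s\s+')

# NOTE: assign_no_symbols_name is called below but was never defined in this cell.
# This definition is an assumption based on the otherwise unused irrelevant_regex.
def assign_no_symbols_name(df):
    # Replace symbols with spaces, then collapse repeated whitespace.
    return df.assign(
        name=df['name']
             .str.replace(irrelevant_regex, ' ', regex=True)
             .str.replace(multispace_regex, ' ', regex=True))

def assign_cleaned_name(df):
    # Drop common restaurant stopwords, then tidy up whitespace.
    restaurant_stopwords = {
        's', 'the', 'la', 'le', 'of', 'and', 'on', 'l'}
    restaurant_stopwords_regex = r'\b(?:{})\b'.format(
        '|'.join(restaurant_stopwords))
    # regex=True is required: recent pandas versions default to literal matching.
    return df.assign(
        name=df['name']
             .str.replace(restaurant_stopwords_regex, '', regex=True)
             .str.replace(multispace_regex, ' ', regex=True)
             .str.strip())

df_training = assign_no_symbols_name(df_training)
df_training = assign_cleaned_name(df_training)
df_training = assign_postal_lat_lng(df_training)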

In [49]:
all_training_pairs = Full().index(df_training)
matches_training_pairs = Block('cluster').index(df_training)

training_vectors = comp.compute(all_training_pairs, df_training)

svm = rl.SVMClassifier()
svm.fit(training_vectors, matches_training_pairs);

svm_pairs = svm.predict(comparison_vectors)
svm_found_pairs_set = set(svm_pairs)

svm_true_positives = golden_pairs_set & svm_found_pairs_set
svm_false_positives = svm_found_pairs_set - golden_pairs_set
svm_false_negatives = golden_pairs_set - svm_found_pairs_set

print('true_positives total:', len(true_positives))
print('false_positives total:', len(false_positives))
print('false_negatives total:', len(false_negatives))

print('svm_true_positives total:', len(svm_true_positives))
print('svm_false_positives total:', len(svm_false_positives))
print('svm_false_negatives total:', len(svm_false_negatives))

true_positives total: 127
false_positives total: 2
false_negatives total: 23

svm_true_positives total: 132
svm_false_positives total: 2
svm_false_negatives total: 18


We got better results! The SVM finds 5 more true positives (132 vs. 127) with the same number of false positives (2). The cell below lists any false positives that the SVM produced and the threshold method did not; in this run it prints none:


In [52]:
print("(SVM false positives) - (Threshold false positives):")
for svm_false_positive in (svm_false_positives - false_positives):
    display(df.loc[list(svm_false_positive)][['name', 'addr', 'city', 'type', 'lat', 'lng']])

(SVM false positives) - (Threshold false positives):


There are other classifiers from recordlinkage library we could try, but the truth is:

* It's very **difficult to build a good training set** that takes into account all important cases of matches/nonmatches
* It's possible to tune classifier parameters to get better results, but it's **difficult to decide the right parameters** that will generalize well for future predictions
* And we're not even sure **if the blocking rules we used are really sane**: we could be dropping true matches that are never blocked together (false negatives), or letting in nonmatching pairs that are blocked together but that our classifier fails to reject (false positives)

## If I had more time I'd talk about: Clustering

We can cluster potentially co-referent records. This helps us deal with the problem of **transitive closure** and with some of the wrong assumptions in Fellegi-Sunter. Basically, when

* record `A` matches record `B`
* and record `B` matches record `C`
* we enforce that record `A` matches record `C`
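One simple way to enforce the rule above is to treat matched pairs as edges of a graph and take connected components. The sketch below does this with a small union-find; `cluster_matches` is a name made up for this example, and real clustering methods are usually more careful about weak links between records.

```python
def cluster_matches(pairs):
    """Group record ids into clusters by taking the transitive closure of pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for record in parent:
        clusters.setdefault(find(record), set()).add(record)
    return sorted(clusters.values(), key=min)

# A few matched pairs from the threshold output above, with (2, 1) left out.
matched_pairs = [(1, 0), (2, 0), (4, 3), (5, 3), (5, 4), (49, 48)]
print(cluster_matches(matched_pairs))
```

Even though `(2, 1)` is not in the input, records 1 and 2 land in the same cluster because both match record 0.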




## Step 5: Evaluation

* Accuracy is not appropriate because nonmatching pairs vastly outnumber matching ones; look at precision/recall/F-measure
* BUT, if you use clustering-- don't just look at pairwise precision/recall, look at the whole host of clustering evaluation tools available (V-measure, homogeneity, completeness)
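As a worked example of the pairwise metrics, the sketch below plugs in the true positive, false positive, and false negative counts reported earlier for the threshold and SVM classifiers. The helper name `precision_recall_f1` is invented for this example. The last lines show why accuracy misleads: labelling every one of the 387,640 pairs a nonmatch would still score very high.

```python
def precision_recall_f1(tp, fp, fn):
    """Pairwise precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# (true positives, false positives, false negatives) reported earlier.
results = {
    "threshold": (127, 2, 23),
    "svm": (132, 2, 18),
}

for name, counts in results.items():
    p, r, f1 = precision_recall_f1(*counts)
    print(f"{name:>9}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")

# Accuracy of a useless classifier that calls every pair a nonmatch.
total_pairs, true_matches = 387640, 150
all_nonmatch_accuracy = (total_pairs - true_matches) / total_pairs
print(f"all-nonmatch accuracy: {all_nonmatch_accuracy:.4%}")
```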



![er_total](old_er.png)

## Worth Looking at: 

* **Active Learning methods**: identify training examples that "lead to maximal accuracy improvements" to train both: optimal classifier weights  AND optimal blocking rules. Check out the `Dedupe` library!! 
* Privacy protected record linkage!
* Don't wanna write code? Lots of product versions are available, including for the `Dedupe` library

![dedupe](dedupe.jpg)

## Last but not least

* How you deduplicate is important for any downstream task you do! Linear regression (Lahiri 2004), logistic regression (diConsiglio 2018)
* **YOU SHOULD BE REPORTING YOUR DEDUPLICATION CRITERIA**
* my research! what to do when you don't have complete ground truth data!
* my research! Bringing in the network science

![network](preandpostFS.png)