This Colab notebook applies active learning to the FEBRL people dataset.
Feedback from subject matter experts is gathered with `dedupe`.

By Bharath Gunasekaran


# Use Active Learning to Link FEBRL People Data

## Downloading dependencies

In [1]:

import requests

tutorial_functions_url = "https://raw.githubusercontent.com/rachhouse/intro-to-data-linking/main/tutorial_notebooks/linking_tutorial_functions.py"
r = requests.get(tutorial_functions_url)
r.raise_for_status()  # Fail early if the download did not succeed.

with open("linking_tutorial_functions.py", "w") as fh:
    fh.write(r.text)

!pip install -q altair dedupe dedupe-variable-name jellyfish recordlinkage 

In [4]:
pip install numpy --upgrade

Collecting numpy
  Downloading numpy-1.21.4-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.19.5
    Uninstalling numpy-1.19.5:
      Successfully uninstalled numpy-1.19.5
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
yellowbrick 1.3.post1 requires numpy<1.20,>=1.16.0, but you have numpy 1.21.4 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.
Successfully installed numpy-1.21.4


In [2]:
import datetime
import itertools
import os
import pathlib
import re
from typing import Any, Dict, Optional

import dedupe
import pandas as pd

import linking_tutorial_functions as tutorial

INFO:root:Generating grammar tables from /usr/lib/python3.7/lib2to3/Grammar.txt
INFO:root:Generating grammar tables from /usr/lib/python3.7/lib2to3/PatternGrammar.txt


## Define Working Filepaths

For convenience, we'll define a `pathlib.Path` to reference our current working directory.

In [3]:
WORKING_DIR = pathlib.Path(os.path.abspath(''))
WORKING_DIR

PosixPath('/content')

## Load Training Dataset and Ground Truth Labels

In [5]:
df_A, df_B, df_ground_truth = tutorial.load_febrl_training_data(True)

Let's take a quick look at our training dataset to refresh on the columns, formats, and data.

In [6]:
df_A.head()

person_id_A,first_name,surname,street_number,address_1,address_2,suburb,postcode,state,date_of_birth,age,phone_number,soc_sec_id
fbc4143d-15f9-4f27-b5f0-dedbadce6616,matilda,struck,8,ballard place,,west perth,2470,qld,19611002,32.0,03 05903135,8276847
48a56cad-7ba6-45e1-97cd-517ba65bdab5,lachlan,eglinton,36,kambalda crescent,villa 427,auburn,5109,,19260108,27.0,,9937958
b1792d21-e4be-4b86-8dea-454ffa5194c5,mikayla,asher,588,britten-jones drive,,miami,4218,nsw,19251102,32.0,03 33770501,7017310
96653d73-bebc-4459-94f3-c3f0a8c514d4,grace,bristow,7,,wandella park snowy,cardiff,6163,nsw,19400120,,07 37864073,3535974
41f038b8-77c0-45a5-9e1f-e62b8637ffd1,wilson,bishop,11,chisholm street,,bronte,2490,nsw,19210305,27.0,04 15209769,5573522


## Data Augmentation

Before passing the data to `dedupe`, we do some light preprocessing: we convert the date of birth from YYYYMMDD to mm/dd/yy and strip leading and trailing whitespace from every text field. Additionally, `dedupe` requires input data to be in dictionaries, using the record id as the key and the record metadata as the value. So, we'll convert our dataframes to this format.

In [7]:
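def format_dob(dob: str) -> Optional[str]:
    """ Transform date of birth format from YYYYMMDD to mm/dd/yy.
        If DOB cannot be transformed, return None.
    """
    try:
        # Require exactly eight digits before attempting to parse.
        if re.fullmatch(r"\d{8}", dob):
            return (datetime.datetime.strptime(dob, "%Y%m%d")).strftime("%m/%d/%y")
    except (TypeError, ValueError):
        # TypeError: dob is None. ValueError: the digits are not a valid date.
        pass

    return None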

def strip_and_null(x: Any) -> Optional[str]:
    """ Stringify incoming variable, remove trailing/leading whitespace
        and return resulting string. Return None if resulting string is empty.
    """
    x = str(x).strip()
    
    if x == "":
        return None
    else:
        return x
    
def convert_df_to_dict(df: pd.DataFrame) -> Dict[str, Dict]:
    """ Convert pandas DataFrame to dict keyed by record id.
        Convert all fields to strings or Nones to satisfy dedupe.
        Transform date format of date_of_birth field.
    """
    # Work on a copy so the caller's DataFrame is not modified in place.
    df = df.copy()

    for col in df.columns:
        df[col] = df[col].apply(strip_and_null)

    df["date_of_birth"] = df["date_of_birth"].apply(format_dob)

    return df.to_dict("index")

In [8]:
records_A = convert_df_to_dict(df_A)
records_B = convert_df_to_dict(df_B)
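
As a quick illustration of the target structure, the sketch below builds the same kind of dictionary by hand for two made-up records. The ids and values are invented for this example (they are not FEBRL rows), and `to_mmddyy` is a hypothetical stand-in for the helper above. It assumes only what the text states: one entry per record id, string-or-`None` field values, and dates rewritten as mm/dd/yy.

```python
import datetime
from typing import Dict, Optional


def to_mmddyy(dob: Optional[str]) -> Optional[str]:
    """Convert a YYYYMMDD string to mm/dd/yy, or return None if that fails."""
    try:
        return datetime.datetime.strptime(dob, "%Y%m%d").strftime("%m/%d/%y")
    except (TypeError, ValueError):
        return None


# Hypothetical raw rows keyed by record id (values invented for illustration).
raw_rows = {
    "id-001": {"first_name": " matilda ", "date_of_birth": "19611002"},
    "id-002": {"first_name": "", "date_of_birth": "1961"},
}

records: Dict[str, Dict[str, Optional[str]]] = {}
for record_id, row in raw_rows.items():
    first_name = row["first_name"].strip() or None  # empty string becomes None
    records[record_id] = {
        "first_name": first_name,
        "date_of_birth": to_mmddyy(row["date_of_birth"]),
    }

print(records["id-001"])
print(records["id-002"])
```

One thing to keep in mind with this date format: the two-digit year drops the century, so births in 1905 and 2005 would both render as `05`.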

## Prepare Training

When we linked our data via SimSum and supervised learning, we defined our blockers and comparators manually with `recordlinkage`. The `dedupe` library takes an active learning approach to blocking and classification and will use our feedback gathered during the labeling session to learn blocking rules and train a classifier. 

To prepare our `dedupe.RecordLink` object for training, first we'll define the fields that we think `dedupe` should pay attention to when matching records - these definitions will serve as the comparators. The `field` contains the name of the attribute to use for comparison, and the `type` defines the comparison type.

In [10]:
%%time

fields = [
    { "field" : "first_name", "type" : "Name" },
    { "field" : "surname", "type" : "Name" },
    { "field" : "address_1", "type" : "ShortString" },
    { "field" : "address_2", "type" : "ShortString" },
    { "field" : "suburb", "type" : "ShortString" },
    { "field" : "postcode", "type" : "Exact" },
    { "field" : "state", "type" : "Exact" },
    { "field" : "date_of_birth", "type" : "DateTime" },
    { "field" : "soc_sec_id", "type" : "Exact" },
]

linker = dedupe.RecordLink(fields)
linker.prepare_training(records_A, records_B)

INFO:dedupe.canopy_index:Removing stop word re
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (commonThreeTokens, address_1)


CPU times: user 58.4 s, sys: 1.07 s, total: 59.4 s
Wall time: 58.8 s
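
Several fields in our records can be `None`, so it may be worth telling `dedupe` which ones. The sketch below is a variation on the definitions above, not what this notebook ran. It assumes the optional `"has missing"` key described in the dedupe 2.x variable-definition documentation; its effect on this dataset is untested here, and other `dedupe` versions may define variables differently.

```python
# Same comparators as above; optional fields are flagged as possibly missing.
# Assumption: the "has missing" key follows the dedupe 2.x variable-definition docs.
fields_with_missing = [
    {"field": "first_name", "type": "Name"},
    {"field": "surname", "type": "Name"},
    {"field": "address_1", "type": "ShortString", "has missing": True},
    {"field": "address_2", "type": "ShortString", "has missing": True},
    {"field": "suburb", "type": "ShortString", "has missing": True},
    {"field": "postcode", "type": "Exact"},
    {"field": "state", "type": "Exact", "has missing": True},
    {"field": "date_of_birth", "type": "DateTime", "has missing": True},
    {"field": "soc_sec_id", "type": "Exact"},
]

# List the fields we flagged, as a quick sanity check.
flagged = [f["field"] for f in fields_with_missing if f.get("has missing")]
print(flagged)
```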


## Active Learning Labeling Session!

At this point, we're ready to provide feedback to `dedupe` via an active learning labeling session. For this, `dedupe` supplies a convenience method to iterate through pairs it is uncertain about. As you provide feedback for each pair, dedupe learns blocking rules and recalculates its linking model weights.

You can use `y` (yes, match), `n` (no, not match), and `u` (unsure) to provide feedback on candidate links. When you're ready to exit the labeling session, use `f`.
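
To give a feel for why the labeler tends to ask about borderline pairs, here is a conceptual sketch of uncertainty sampling, one common active learning strategy. It is not `dedupe`'s actual selection code: the pair ids and scores are invented, and the rule "pick the score closest to 0.5" is an assumption made for illustration.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]


def most_uncertain_pair(scores: Dict[Pair, float]) -> Pair:
    """Return the pair whose match probability is closest to 0.5."""
    return min(scores, key=lambda pair: abs(scores[pair] - 0.5))


# Hypothetical match probabilities from a partially trained model.
candidate_scores = {
    ("a1", "b7"): 0.97,  # confident match
    ("a2", "b3"): 0.52,  # borderline: a label here is most informative
    ("a4", "b9"): 0.08,  # confident non-match
}

print(most_uncertain_pair(candidate_scores))
```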

In [11]:
dedupe.console_label(linker)

first_name : ella
surname : turtur
address_1 : None
address_2 : clarkwood
suburb : crookwell
postcode : 4551
state : sa
date_of_birth : 06/27/37
soc_sec_id : 8312467

first_name : nan
surname : tuftmq
address_1 : None
address_2 : clarkwood
suburb : crookwell
postcode : 4551
state : sa
date_of_birth : 06/27/37
soc_sec_id : 8312467

0/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


n


first_name : louis
surname : traforti
address_1 : barnett close
address_2 : tilburoo
suburb : pacific paradise
postcode : 2518
state : vic
date_of_birth : 05/12/38
soc_sec_id : 1913191

first_name : louis
surname : trafcrati
address_1 : None
address_2 : tilburoo
suburb : None
postcode : 2518
state : vic
date_of_birth : 05/12/38
soc_sec_id : 1913191

0/10 positive, 1/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : kailey
surname : hazell
address_1 : veale street
address_2 : troyvic
suburb : matraville
postcode : 6111
state : vic
date_of_birth : 02/08/28
soc_sec_id : 8130347

first_name : nathan
surname : jesser
address_1 : clisby close
address_2 : None
suburb : 2213
postcode : fraser
state : vic
date_of_birth : None
soc_sec_id : 5290222

1/10 positive, 1/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (commonThreeTokens, address_1)
INFO:dedupe.training:PartialPredicate: (commonFourGram, surname, CorporationName)
first_name : jai
surname : plumb
address_1 : gadali crescent
address_2 : moondani
suburb : rosebud
postcode : 2024
state : wa
date_of_birth : None
soc_sec_id : 6878563

first_name : chloe
surname : hare
address_1 : la trobme close
address_2 : None
suburb : None
postcode : 3195
state : qld
date_of_birth : None
soc_sec_id : 9880358

1/10 positive, 2/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


first_name : mattheo
surname : bullock
address_1 : murchison street
address_2 : None
suburb : rye
postcode : 3041
state : nsw
date_of_birth : 12/12/05
soc_sec_id : 2296205

first_name : matteo
surname : bulloxk
address_1 : None
address_2 : None
suburb : rye
postcode : 3041
state : nsw
date_of_birth : 12/12/05
soc_sec_id : 2296205

1/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : toby
surname : dent
address_1 : studley street
address_2 : sunnydale cottage
suburb : bowral
postcode : 2450
state : wa
date_of_birth : 04/02/81
soc_sec_id : 9402107

first_name : toby
surname : de nd
address_1 : None
address_2 : None
suburb : bowral
postcode : 2450
state : wa
date_of_birth : 04/02/81
soc_sec_id : 9402107

2/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, date_of_birth)
first_name : kiandra
surname : dunstone
address_1 : None
address_2 : None
suburb : oaklands park
postcode : 6163
state : wa
date_of_birth : 10/29/11
soc_sec_id : 5277244

first_name : kiandra
surname : dunstone
address_1 : None
address_2 : None
suburb : oaklands park
postcode : 6163
state : wa
date_of_birth : None
soc_sec_id : 5277244

3/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : sarah
surname : humphreys
address_1 : girrahween street
address_2 : southern wood
suburb : corowa
postcode : 5065
state : qld
date_of_birth : None
soc_sec_id : 7393243

first_name : zara
surname : humpcys
address_1 : None
address_2 : southern wood
suburb : corowa
postcode : 5095
state : qld
date_of_birth : None
soc_sec_id : 7393243

4/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, soc_sec_id)
first_name : david
surname : sznajder
address_1 : None
address_2 : bowtells caravn park
suburb : eaton
postcode : 2298
state : vic
date_of_birth : 09/15/78
soc_sec_id : 8971940

first_name : david
surname : sznadkr
address_1 : None
address_2 : bowtells caravn park
suburb : eaton
postcode : 2298
state : vic
date_of_birth : 09/15/78
soc_sec_id : 6770683

4/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : william
surname : wilkins
address_1 : None
address_2 : None
suburb : currumbin valley
postcode : 3166
state : nsw
date_of_birth : 11/12/56
soc_sec_id : 3252646

first_name : william
surname : wiljish
address_1 : None
address_2 : None
suburb : currumbinuvalley
postcode : 3166
state : nsw
date_of_birth : 11/12/56
soc_sec_id : 3252666

5/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, soc_sec_id)
INFO:dedupe.training:SimplePredicate: (commonThreeTokens, address_2)
first_name : madison
surname : david
address_1 : thornley place
address_2 : esk farm
suburb : boorowa
postcode : 2486
state : nsw
date_of_birth : 03/06/25
soc_sec_id : 1762143

first_name : madison
surname : david
address_1 : thornley place
address_2 : None
suburb : boorowa
postcode : 2486
state : nsw
date_of_birth : 03/06/25
soc_sec_id : 1762843

6/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, date_of_birth)
INFO:dedupe.training:SimplePredicate: (commonTwoTokens, suburb)
first_name : joshua
surname : ryan
address_1 : None
address_2 : country club
suburb : craigie
postcode : 4165
state : qld
date_of_birth : 08/15/28
soc_sec_id : 4871225

first_name : joshua
surname : ryax
address_1 : None
address_2 : country club
suburb : craigie
postcode : 4165
state : qld
date_of_birth : 08/25/28
soc_sec_id : 4871225

7/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : michael
surname : mason
address_1 : lightfoot crescent
address_2 : loughmore
suburb : gunalda
postcode : 4872
state : sa
date_of_birth : 10/18/08
soc_sec_id : 1017175

first_name : michael
surname : maon
address_1 : lightfoot crescent
address_2 : loughmore
suburb : gunalda
postcode : 4872
state : sa
date_of_birth : 10/08/08
soc_sec_id : 1017175

8/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, date_of_birth)
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, soc_sec_id)
first_name : alexandra
surname : van rensburg
address_1 : astelia place
address_2 : None
suburb : woodcroft
postcode : 2756
state : nsw
date_of_birth : 01/31/22
soc_sec_id : 3123032

first_name : alexandra
surname : van rensburg
address_1 : None
address_2 : None
suburb : woodcroft
postcode : 2756
state : nsw
date_of_birth : None
soc_sec_id : 3123034

9/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : samara
surname : shelley
address_1 : bland place
address_2 : None
suburb : emu plains
postcode : 2533
state : nsw
date_of_birth : None
soc_sec_id : 5902410

first_name : samara
surname : shelley
address_1 : bland place
address_2 : None
suburb : None
postcode : 2533
state : nsw
date_of_birth : None
soc_sec_id : 5902401

10/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, date_of_birth)
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, soc_sec_id)
INFO:dedupe.training:SimplePredicate: (firstTwoTokensPredicate, surname)
first_name : angelica
surname : noack
address_1 : ellerston avenue
address_2 : None
suburb : tuart hill
postcode : 3515
state : None
date_of_birth : 08/25/32
soc_sec_id : 2615059

first_name : angelica
surname : noack
address_1 : ellerston avenue
address_2 : None
suburb : tuart gill
postcode : 3515
state : None
date_of_birth : 09/25/32
soc_sec_id : 2610549

11/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, date_of_birth)
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, soc_sec_id)
INFO:dedupe.training:SimplePredicate: (firstTwoTokensPredicate, surname)
INFO:dedupe.training:SimplePredicate: (firstTwoTokensPredicate, address_1)
first_name : georgia
surname : nguyen
address_1 : None
address_2 : brentwood vlge
suburb : sefton
postcode : 3101
state : wa
date_of_birth : None
soc_sec_id : 4084643

first_name : georgia
surname : nguyen
address_1 : None
address_2 : brentwoom vlge
suburb : sefton
postcode : 3101
state : wa
date_of_birth : None
soc_sec_id : 2139336

12/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


f


Finished labeling


We can now train our linker, based on the labeling session feedback.

In [12]:
%%time
linker.train()

INFO:rlr.crossvalidation:using cross validation to find optimum alpha...
  * (true_distinct + false_distinct)))
INFO:rlr.crossvalidation:optimum alpha: 0.010000, score 0.31593767422978325
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (wholeFieldPredicate, postcode), SimplePredicate: (suffixArray, first_name), PartialIndexLevenshteinSearchPredicate: (1, surname, Surname))
INFO:dedupe.training:(SimplePredicate: (monthPredicate, date_of_birth), SimplePredicate: (sameSevenCharStartPredicate, suburb), SimplePredicate: (commonFourGram, first_name))


CPU times: user 5.86 s, sys: 670 ms, total: 6.53 s
Wall time: 5.99 s


Let's persist our training data (captured during the labeling session), as well as the learned model weights.

In [13]:
ACTIVE_LEARNING_DIR = WORKING_DIR / "dedupe_active_learning"
ACTIVE_LEARNING_DIR.mkdir(parents=True, exist_ok=True)

SETTINGS_FILE = ACTIVE_LEARNING_DIR / "dedupe_learned_settings"
TRAINING_FILE = ACTIVE_LEARNING_DIR / "dedupe_training.json"

with open(TRAINING_FILE, "w") as fh:
    linker.write_training(fh)
    
with open(SETTINGS_FILE, "wb") as sf:
    linker.write_settings(sf)
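
If you want to check what was saved, a small helper like the one below can count the labeled pairs in the training file. This is a sketch under one assumption: that the JSON written by `write_training` has top-level `"match"` and `"distinct"` lists of record pairs. The helper name `count_labeled_pairs` and the demo file are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def count_labeled_pairs(training_path: Path) -> Dict[str, int]:
    """Count labeled pairs per class in a dedupe-style training JSON file."""
    with open(training_path, "r") as fh:
        training = json.load(fh)
    # Assumed layout: {"match": [[rec_a, rec_b], ...], "distinct": [...]}.
    return {label: len(training.get(label, [])) for label in ("match", "distinct")}


# Demo on a throwaway file with one invented matching pair.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "demo_training.json"
    demo_pairs = {"match": [[{"surname": "dent"}, {"surname": "de nd"}]], "distinct": []}
    demo_file.write_text(json.dumps(demo_pairs))
    print(count_labeled_pairs(demo_file))
```

If the layout assumption holds, calling `count_labeled_pairs(TRAINING_FILE)` after the session above should line up with the 12 positive and 4 negative labels shown by the labeler.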

## Examine Learned Blockers

Now, let's take a look at the predicates (blockers) that `dedupe` learned during our active learning labeling session. Note that `dedupe` can learn composite predicates/blockers, i.e. individual predicates can be combined with logical operators.

In [14]:
linker.predicates

((SimplePredicate: (wholeFieldPredicate, postcode),
  SimplePredicate: (suffixArray, first_name),
  PartialIndexLevenshteinSearchPredicate: (1, surname, Surname)),
 (SimplePredicate: (monthPredicate, date_of_birth),
  SimplePredicate: (sameSevenCharStartPredicate, suburb),
  SimplePredicate: (commonFourGram, first_name)))
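
To make the idea of a composite blocker concrete, here is a toy sketch. It reads each inner tuple above as a rule whose parts must all agree, and treats two records as candidates when they share a key under that rule. This is not `dedupe`'s internal code: `whole_field`, `first_chars`, and `compound_keys` are hypothetical helpers, and the records are invented.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Set

Record = Dict[str, Optional[str]]
Predicate = Callable[[Record], Set[str]]


def whole_field(field: str) -> Predicate:
    """Block key is the entire field value (no key if the field is missing)."""
    def predicate(record: Record) -> Set[str]:
        value = record.get(field)
        return {value} if value else set()
    return predicate


def first_chars(field: str, n: int) -> Predicate:
    """Block key is the first n characters of the field value."""
    def predicate(record: Record) -> Set[str]:
        value = record.get(field)
        return {value[:n]} if value else set()
    return predicate


def compound_keys(record: Record, predicates: List[Predicate]) -> Set[str]:
    """AND the predicates together: one key per combination of their keys."""
    key_sets = [predicate(record) for predicate in predicates]
    return {"|".join(combo) for combo in product(*key_sets)}


rule = [whole_field("postcode"), first_chars("surname", 3)]
rec_a = {"surname": "struck", "postcode": "2470"}
rec_b = {"surname": "strucl", "postcode": "2470"}
rec_c = {"surname": "strucl", "postcode": "2407"}

# Records become a candidate pair only if they share a compound key.
shares_block_ab = bool(compound_keys(rec_a, rule) & compound_keys(rec_b, rule))
shares_block_ac = bool(compound_keys(rec_a, rule) & compound_keys(rec_c, rule))
print(shares_block_ab, shares_block_ac)
```

The stricter a rule is, the fewer pairs it produces, but the more easily a single typo (like the transposed postcode in `rec_c`) knocks a true link out of the candidate set.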

Next, let's examine the resulting candidate pairs and look at our blocking efficiency. The `.pairs` method will give us all candidate record pairs that are generated by blocking with the learned blockers.

In [15]:
candidate_pairs = [x for x in linker.pairs(records_A, records_B)]
print(f"{len(candidate_pairs):,} candidate pairs generated from blocking.")

1,395 candidate pairs generated from blocking.


You'll notice that, in contrast to `recordlinkage`, our post-blocking candidate pairs contain both the record ids as well as the record metadata.

In [16]:
candidate_pairs[0]

(('fbc4143d-15f9-4f27-b5f0-dedbadce6616',
  {'address_1': 'ballard place',
   'address_2': None,
   'age': '32',
   'date_of_birth': '10/02/61',
   'first_name': 'matilda',
   'phone_number': '03 05903135',
   'postcode': '2470',
   'soc_sec_id': '8276847',
   'state': 'qld',
   'street_number': '8',
   'suburb': 'west perth',
   'surname': 'struck'}),
 ('a9f5a761-83d6-452e-9f27-a452b3d06a4e',
  {'address_1': 'ballard place',
   'address_2': None,
   'age': '32',
   'date_of_birth': '10/02/61',
   'first_name': 'matikda',
   'phone_number': '03 05903135',
   'postcode': '2407',
   'soc_sec_id': '8276847',
   'state': 'qld',
   'street_number': '0',
   'suburb': 'west perth',
   'surname': 'strucl'}))

We can assemble our candidate pair ids into an indexed pandas dataframe for easier comparison with our known true links.

In [17]:
df_candidate_links = pd.DataFrame(
    [(x[0][0], x[1][0]) for x in candidate_pairs]
).rename(columns={0 : "person_id_A", 1 : "person_id_B"}).set_index(["person_id_A", "person_id_B"])

df_candidate_links.head()

person_id_A,person_id_B
fbc4143d-15f9-4f27-b5f0-dedbadce6616,a9f5a761-83d6-452e-9f27-a452b3d06a4e
48a56cad-7ba6-45e1-97cd-517ba65bdab5,c77c2c04-4415-4c4d-b248-18dc28fd63d0
b1792d21-e4be-4b86-8dea-454ffa5194c5,043d063f-3f72-46ca-bb66-e7f610d4c2cd
41f038b8-77c0-45a5-9e1f-e62b8637ffd1,337aa0c5-4a0a-4bcd-89db-6fa998fa783c
7264bfb0-bbcb-4f68-b9bf-03619237cfb2,8e5d98b8-9611-480e-8c65-b0e56520307b


Now, let's take a look at our learned blocker performance.

In [18]:
max_candidate_pairs = df_A.shape[0]*df_B.shape[0]

print(f"{max_candidate_pairs:,} total possible pairs.")

# Calculate search space reduction as a percentage.
search_space_reduction = round((1 - len(candidate_pairs)/max_candidate_pairs) * 100, 4)
print(f"\n{len(candidate_pairs):,} pairs after full blocking: {search_space_reduction}% search space reduction.")

# Calculate retained true links percentage.
total_true_links = df_ground_truth.shape[0]
true_links_after_blocking = pd.merge(
    df_ground_truth,
    df_candidate_links,
    left_index=True,
    right_index=True,
    how="inner"
).shape[0]

retained_true_link_percent = round((true_links_after_blocking/total_true_links) * 100, 2)
print(f"{retained_true_link_percent}% true links retained after blocking.")

10,562,500 total possible pairs.

1,395 pairs after full blocking: 99.9868% search space reduction.
46.43% true links retained after blocking.
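
The two blocking metrics can also be written as small standalone functions. The sketch below uses plain Python sets instead of dataframes; the function names follow common record linkage terminology (reduction ratio, pair completeness) rather than anything in the tutorial module, and the toy pairs are invented.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]


def reduction_ratio(n_candidates: int, n_total_pairs: int) -> float:
    """Share of all possible pairs removed by blocking, in percent."""
    return (1 - n_candidates / n_total_pairs) * 100


def pair_completeness(candidates: Set[Pair], true_links: Set[Pair]) -> float:
    """Share of true links that survive blocking, in percent."""
    return len(candidates & true_links) / len(true_links) * 100


# Counts taken from the cell output above.
print(round(reduction_ratio(1395, 10_562_500), 4))

# Toy example: one of two true links survives blocking.
toy_candidates = {("a1", "b1"), ("a2", "b9")}
toy_truth = {("a1", "b1"), ("a2", "b2")}
print(pair_completeness(toy_candidates, toy_truth))
```

Looking at the two numbers together matters: a blocker can shrink the search space dramatically while still discarding many true links.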


## Score Pairs and Examine Learned Classifier

After `dedupe` has trained blockers and a classification model based on our labeling session, we can link the records in our training dataset via the `.join` method.

In [19]:
%%time
linked_records = linker.join(records_A, records_B, threshold=0.0, constraint="one-to-one")

CPU times: user 1.02 s, sys: 104 ms, total: 1.12 s
Wall time: 2.17 s


`linker.join` will return the links, along with a model confidence.

In [20]:
linked_records[0:3]

[(('fe7dcaf6-3a4d-456d-8ac1-2b81875158bc',
   'eb827aae-7870-4a7a-8dc4-f5e755ae19b6'),
  1.0),
 (('fc7ce04c-b9de-428b-83c2-80219ad8c4d3',
   'd0e57003-6a0f-4e51-9cff-0e60927e0611'),
  1.0),
 (('fc041887-b5b8-427b-ac8b-bd867d0c813f',
   'f93ba54b-0c5c-400d-89e8-1fe6746bf928'),
  1.0)]
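
Since each result is an `((id_A, id_B), score)` tuple, applying a score threshold afterwards takes one list comprehension. The sketch below assumes that tuple shape, as shown in the output above; the ids and the `links_above` helper are invented for illustration.

```python
from typing import List, Tuple

ScoredLink = Tuple[Tuple[str, str], float]


def links_above(scored: List[ScoredLink], threshold: float) -> List[Tuple[str, str]]:
    """Keep the id pairs whose model score meets the threshold."""
    return [pair for pair, score in scored if score >= threshold]


# Hypothetical scored links in the same shape as `linked_records`.
toy_links = [
    (("a1", "b1"), 1.0),
    (("a2", "b2"), 0.278125),
    (("a3", "b3"), 0.05364),
]

print(links_above(toy_links, 0.25))
```

Joining with `threshold=0.0`, as done above, keeps every scored pair, so the cut-off can be chosen later from the score distribution.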

We'll format the `dedupe` linker predictions into a format that we can use with our existing evaluation functions.

In [21]:
df_predictions = pd.DataFrame(
    [ {"person_id_A" : x[0][0], "person_id_B" : x[0][1], "model_score" : x[1]} for x in linked_records]
)

df_predictions = df_predictions.set_index(["person_id_A", "person_id_B"])

df_predictions = pd.merge(
    df_predictions,
    df_ground_truth,
    left_index=True,
    right_index=True,
    how="left",
)

df_predictions["ground_truth"] = df_predictions["ground_truth"].fillna(False).astype(bool)
df_predictions

person_id_A,person_id_B,model_score,ground_truth
fe7dcaf6-3a4d-456d-8ac1-2b81875158bc,eb827aae-7870-4a7a-8dc4-f5e755ae19b6,1.000000,True
fc7ce04c-b9de-428b-83c2-80219ad8c4d3,d0e57003-6a0f-4e51-9cff-0e60927e0611,1.000000,True
fc041887-b5b8-427b-ac8b-bd867d0c813f,f93ba54b-0c5c-400d-89e8-1fe6746bf928,1.000000,True
faa1d9cd-3f7d-436d-b4f9-6f589a64735d,62824acf-42e5-40c1-bd53-12f0b33a7d2a,1.000000,True
ef330b40-c045-42de-8ef0-4be9fe1aa929,61e0179f-21fa-493e-81d1-e4282ace99f2,1.000000,True
...,...,...,...
ce337571-8ded-408e-81d0-4294aec0ce17,0e150356-f577-4617-82c7-f336e84e83f1,0.278125,True
c5deaa95-d521-4b12-868f-1c4d30d1f754,8f5c5a4e-9732-40cd-8605-9aa48a7d501d,0.246426,True
3b72f294-e4fa-4bba-b512-0030c5a4b3fc,d7bc4558-ce38-4c4d-877e-68f90b7ecc81,0.198989,True
fdcf5238-384c-44f3-99a4-3e8f049ff693,fe7fae1a-c83e-4ba9-b60b-9499965b6c0d,0.053640,True


## Choosing a Linking Model Score Threshold

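The `dedupe` `.join` method that we used to score our training data directly incorporates the learned blockers. Thus, the scored pairs appearing in the distribution below are only the pairs that survived blocking, and our blockers *significantly* reduced the candidate pair search space. Keep in mind that true links dropped during blocking (more than half of them, per the retention figure above) are never scored, so precision and recall computed from these scores describe the blocked pairs only.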

### Model Score Distribution

In [22]:
df_predictions["ground_truth"].value_counts()

True     1392
False       1
Name: ground_truth, dtype: int64

In [23]:
tutorial.plot_model_score_distribution(df_predictions)

INFO:numexpr.utils:NumExpr defaulting to 2 threads.


### Precision and Recall vs. Model Score

In [24]:
df_eval = tutorial.evaluate_linking(
    df=df_predictions
)

In [25]:
df_eval.head()

,threshold,tp,fp,tn,fn,precision,recall,f1
0,0.0,1392,1,0,0,0.999282,1.0,0.999641
1,0.020408,1391,1,0,1,0.999282,0.999282,0.999282
2,0.040816,1391,1,0,1,0.999282,0.999282,0.999282
3,0.061224,1390,1,0,2,0.999281,0.998563,0.998922
4,0.081633,1390,1,0,2,0.999281,0.998563,0.998922
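
The metric columns can be reproduced from the `tp`, `fp`, and `fn` counts with the standard definitions. The sketch below assumes `tutorial.evaluate_linking` uses those definitions (its source is not shown here); the rounded results agree with the first and fourth rows of the table.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Counts from the threshold 0.0 row above.
p, r, f1 = precision_recall_f1(tp=1392, fp=1, fn=0)
print(round(p, 6), round(r, 6), round(f1, 6))
```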


In [26]:
tutorial.plot_precision_recall_vs_threshold(df_eval)
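
One simple way to turn the evaluation table into a decision is to pick the threshold with the highest F1. The sketch below does this over plain dictionaries and uses only the five rows printed by `df_eval.head()`; the helper `best_f1_threshold` is hypothetical, and maximizing F1 is just one possible criterion (a use case that cares more about precision would choose differently).

```python
from typing import Dict, List


def best_f1_threshold(rows: List[Dict[str, float]]) -> float:
    """Return the threshold of the row with the highest F1 (first one on ties)."""
    return max(rows, key=lambda row: row["f1"])["threshold"]


# The five rows printed by df_eval.head() above.
head_rows = [
    {"threshold": 0.0, "f1": 0.999641},
    {"threshold": 0.020408, "f1": 0.999282},
    {"threshold": 0.040816, "f1": 0.999282},
    {"threshold": 0.061224, "f1": 0.998922},
    {"threshold": 0.081633, "f1": 0.998922},
]

print(best_f1_threshold(head_rows))
```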

## Iterating with Active Learning

When using active learning, we iterate on our linking solution and incorporate progressively more labeled training data. Perhaps we're not satisfied with the current performance of the blockers or classifier, and we'd like to create more labeled examples for `dedupe` to train on.

Recall that earlier, we saved our existing training data from the first labeling session. We can load this persisted data into a new `dedupe` linker and kick off another labeling session. Perhaps, after investigating the data during our first cycle, we don't think that `dedupe` should include `address_1` and `address_2` in its comparators.

### Tweak the Linker and Use Existing Training Data

In [27]:
%%time

fields = [
    { "field" : "first_name", "type" : "Name" },
    { "field" : "surname", "type" : "Name" },
    { "field" : "suburb", "type" : "ShortString" },
    { "field" : "postcode", "type" : "Exact" },
    { "field" : "state", "type" : "Exact" },
    { "field" : "date_of_birth", "type" : "DateTime" },
    { "field" : "soc_sec_id", "type" : "Exact" },
]

linker2 = dedupe.RecordLink(fields)

with open(TRAINING_FILE, "r") as fh:
    linker2.prepare_training(records_A, records_B, training_file=fh)

INFO:dedupe.api:reading training from file
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, date_of_birth)
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, postcode)


CPU times: user 41 s, sys: 799 ms, total: 41.8 s
Wall time: 41.2 s


Now, we can kick off a second active learning/labeling session.

In [28]:
dedupe.console_label(linker2)

first_name : sophie
surname : nan
suburb : wahroonga
postcode : 3189
state : nsw
date_of_birth : 07/08/31
soc_sec_id : 7381703

first_name : sophie
surname : nan
suburb : wahroonga
postcode : 3198
state : nsw
date_of_birth : 07/08/31
soc_sec_id : 7381703

12/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


 y


(y)es / (n)o / (u)nsure / (f)inished


y


first_name : breana
surname : wilson-haffenden
suburb : warnbro
postcode : 2303
state : vic
date_of_birth : 10/26/35
soc_sec_id : 6126870

first_name : breana
surname : wilson-haffenden
suburb : warnbro
postcode : 2013
state : vic
date_of_birth : 10/26/35
soc_sec_id : 6126870

13/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, postcode)
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, date_of_birth)
first_name : chelsea
surname : stanley
suburb : kingaroy
postcode : 5008
state : vic
date_of_birth : 06/04/80
soc_sec_id : 4846538

first_name : chelsea
surname : stanley
suburb : kingaroy
postcode : 5080
state : vic
date_of_birth : None
soc_sec_id : 4840348

14/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : joshua
surname : stanley
suburb : hastings
postcode : 4171
state : nsw
date_of_birth : None
soc_sec_id : 8326533

first_name : joshua
surname : stanley
suburb : hastings
postcode : 4129
state : nsw
date_of_birth : None
soc_sec_id : 8326533

15/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, postcode)
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, suburb)
first_name : chloe
surname : goess
suburb : ballarat
postcode : 2315
state : vic
date_of_birth : 10/01/27
soc_sec_id : 4357749

first_name : chloe
surname : goess
suburb : ballarst
postcode : 2316
state : vic
date_of_birth : 10/01/27
soc_sec_id : 4357749

16/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


f


Finished labeling


### Retrain the Linker and Examine Blocking Performance

Now, let's retrain, and examine blocker performance. Ideally, we see an improved true link retention following our second labeling session.

In [29]:
%%time
linker2.train()

INFO:rlr.crossvalidation:using cross validation to find optimum alpha...
  * (true_distinct + false_distinct)))
INFO:rlr.crossvalidation:optimum alpha: 0.000100, score 0.3868986130697193
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (sameFiveCharStartPredicate, suburb), SimplePredicate: (metaphoneToken, first_name), SimplePredicate: (yearPredicate, date_of_birth))
INFO:dedupe.training:(SimplePredicate: (commonSixGram, surname), SimplePredicate: (commonSixGram, first_name), PartialIndexLevenshteinSearchPredicate: (4, surname, Surname))
INFO:dedupe.training:(SimplePredicate: (commonTwoTokens, surname), LevenshteinSearchPredicate: (4, suburb))
INFO:dedupe.training:(PartialPredicate: (sameThreeCharStartPredicate, surname, CorporationName), SimplePredicate: (monthPredicate, date_of_birth), SimplePredicate: (sameSevenCharStartPredicate, first_name))


CPU times: user 5.92 s, sys: 563 ms, total: 6.48 s
Wall time: 6.1 s


In [30]:
candidate_pairs = [x for x in linker2.pairs(records_A, records_B)]
print(f"{len(candidate_pairs):,} candidate pairs generated from blocking.")

df_candidate_links = pd.DataFrame(
    [(x[0][0], x[1][0]) for x in candidate_pairs]
).rename(columns={0 : "person_id_A", 1 : "person_id_B"}).set_index(["person_id_A", "person_id_B"])

max_candidate_pairs = df_A.shape[0]*df_B.shape[0]

print(f"{max_candidate_pairs:,} total possible pairs.")

# Calculate search space reduction as a percentage.
search_space_reduction = round((1 - len(candidate_pairs)/max_candidate_pairs) * 100, 4)
print(f"\n{len(candidate_pairs):,} pairs after full blocking: {search_space_reduction}% search space reduction.")

# Calculate retained true links percentage.
total_true_links = df_ground_truth.shape[0]
true_links_after_blocking = pd.merge(
    df_ground_truth,
    df_candidate_links,
    left_index=True,
    right_index=True,
    how="inner"
).shape[0]

retained_true_link_percent = round((true_links_after_blocking/total_true_links) * 100, 2)
print(f"{retained_true_link_percent}% true links retained after blocking.")

1,394 candidate pairs generated from blocking.
10,562,500 total possible pairs.

1,394 pairs after full blocking: 99.9868% search space reduction.
45.67% true links retained after blocking.


### Evaluate Classification Performance

In [31]:
%%time
linked_records = linker2.join(records_A, records_B, threshold=0.0, constraint="one-to-one")

CPU times: user 7.59 s, sys: 234 ms, total: 7.83 s
Wall time: 9.08 s


In [32]:
df_predictions = pd.DataFrame(
    [ {"person_id_A" : x[0][0], "person_id_B" : x[0][1], "model_score" : x[1]} for x in linked_records]
)

df_predictions = df_predictions.set_index(["person_id_A", "person_id_B"])

df_predictions = pd.merge(
    df_predictions,
    df_ground_truth,
    left_index=True,
    right_index=True,
    how="left",
)

df_predictions["ground_truth"] = df_predictions["ground_truth"].fillna(False).astype(bool)
df_predictions

person_id_A,person_id_B,model_score,ground_truth
fff044ab-8dca-4946-bfa4-1675ee7d56b5,99060a0c-e1bf-4869-bf08-2e15389193b6,1.000000,True
fcb0aaf9-24b0-4831-9086-2c2449d75b3b,548bdbbb-2528-4705-8b2e-76e99d3def77,1.000000,True
ef309234-8a00-44b1-b557-edab66f06d6b,aea39e9c-be88-401a-9547-ec3fa9435dba,1.000000,True
eb2933fc-4fec-4f00-8a6b-b7f6c8152086,d19b7654-1f80-4527-b73c-56e6232e69cb,1.000000,True
e4d6ec3f-5182-47c5-ad5f-05144b736e02,cef12e2c-57c1-4b6b-9103-faee51892589,1.000000,True
...,...,...,...
d596b917-7f8f-4c6b-b8b2-326a71c46ad4,83b083d5-928c-4524-b198-67d426546b5e,0.139982,True
62393125-2205-4842-a165-e4f924c0fcd1,6146bafc-5745-4355-a562-161e8e7cf5b3,0.057441,True
d3910de0-bf8b-49dc-9835-9e5ed936548e,cd1066e8-0fdd-4d90-873e-e237e79cdb96,0.050284,True
5f6363c3-56e5-467d-aafe-516a6a6238f2,f5c23104-6c17-4796-8b2e-0d0a249d5fa6,0.045375,True


In [33]:
df_predictions["ground_truth"].value_counts()

True     1369
False       3
Name: ground_truth, dtype: int64

In [34]:
tutorial.plot_model_score_distribution(df_predictions)

In [35]:
df_eval = tutorial.evaluate_linking(
    df=df_predictions
)

tutorial.plot_precision_recall_vs_threshold(df_eval)