<a href="https://colab.research.google.com/github/arunt-sjsu/advanced_deep_learning/blob/main/assignment_7_Catchup_1/FEBRL_Data_with_Active_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Use Active Learning to Link FEBRL People Data
This notebook replicates the active learning example using the FEBRL dataset.

We use the `dedupe` library to perform an active learning exercise and apply SimSum classification.

<a href="https://colab.research.google.com/github/rachhouse/intro-to-data-linking/blob/main/tutorial_notebooks/03_Link_FEBRL_Data_with_Active_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>

In [None]:
import requests

tutorial_functions_url = "https://raw.githubusercontent.com/rachhouse/intro-to-data-linking/main/tutorial_notebooks/linking_tutorial_functions.py"
r = requests.get(tutorial_functions_url)
r.raise_for_status()  # Fail early if the download did not succeed.

# Save the helper module next to the notebook so it can be imported.
with open("linking_tutorial_functions.py", "w") as fh:
    fh.write(r.text)

!pip install -q altair dedupe dedupe-variable-name jellyfish recordlinkage
!pip install -U numpy


In [1]:
import datetime
import itertools
import os
import pathlib
import re
from typing import Any, Dict, Optional

import dedupe
import pandas as pd

import linking_tutorial_functions as tutorial

INFO:root:Generating grammar tables from /usr/lib/python3.7/lib2to3/Grammar.txt
INFO:root:Generating grammar tables from /usr/lib/python3.7/lib2to3/PatternGrammar.txt


## Define Working Filepaths

For convenience, we'll define a `pathlib.Path` to reference our current working directory.

In [2]:
WORKING_DIR = pathlib.Path(os.path.abspath(''))
WORKING_DIR

PosixPath('/content')

## Load Training Dataset and Ground Truth Labels

In [3]:
df_A, df_B, df_ground_truth = tutorial.load_febrl_training_data(True)

Let's take a quick look at our training dataset to refresh on the columns, formats, and data.

In [4]:
df_A.head()

person_id_A,first_name,surname,street_number,address_1,address_2,suburb,postcode,state,date_of_birth,age,phone_number,soc_sec_id
fbc4143d-15f9-4f27-b5f0-dedbadce6616,matilda,struck,8,ballard place,,west perth,2470,qld,19611002,32.0,03 05903135,8276847
48a56cad-7ba6-45e1-97cd-517ba65bdab5,lachlan,eglinton,36,kambalda crescent,villa 427,auburn,5109,,19260108,27.0,,9937958
b1792d21-e4be-4b86-8dea-454ffa5194c5,mikayla,asher,588,britten-jones drive,,miami,4218,nsw,19251102,32.0,03 33770501,7017310
96653d73-bebc-4459-94f3-c3f0a8c514d4,grace,bristow,7,,wandella park snowy,cardiff,6163,nsw,19400120,,07 37864073,3535974
41f038b8-77c0-45a5-9e1f-e62b8637ffd1,wilson,bishop,11,chisholm street,,bronte,2490,nsw,19210305,27.0,04 15209769,5573522


## Data Augmentation

We'll do minimal data augmentation before feeding our training data to `dedupe`; we just want to format the date of birth data as `mm/dd/yy`, and ensure all columns are in string format and stripped of trailing/leading whitespace. Additionally, `dedupe` requires input data to be in dictionaries, using the record id as the key and the record metadata as the value. So, we'll convert our dataframes to this format.

In [5]:
def format_dob(dob: Optional[str]) -> Optional[str]:
    """ Transform date of birth format from YYYYMMDD to mm/dd/yy.
        If DOB cannot be transformed, return None.
    """
    try:
        # Require exactly eight digits before attempting to parse.
        if re.fullmatch(r"\d{8}", dob):
            return datetime.datetime.strptime(dob, "%Y%m%d").strftime("%m/%d/%y")
    except (TypeError, ValueError):
        # TypeError: dob is None. ValueError: not a valid calendar date.
        pass

    return None

def strip_and_null(x: Any) -> Optional[str]:
    """ Stringify incoming variable, remove trailing/leading whitespace
        and return resulting string. Return None if resulting string is empty.
    """
    x = str(x).strip()

    if x == "":
        return None
    else:
        return x

def convert_df_to_dict(df: pd.DataFrame) -> Dict[str, Dict]:
    """ Convert pandas DataFrame to dict keyed by record id.
        Convert all fields to strings or Nones to satisfy dedupe.
        Transform date format of date_of_birth field.
        Note: the columns of `df` are modified in place.
    """

    for col in df.columns:
        df[col] = df[col].apply(strip_and_null)

    df["date_of_birth"] = df["date_of_birth"].apply(format_dob)

    return df.to_dict("index")

In [6]:
records_A = convert_df_to_dict(df_A)
records_B = convert_df_to_dict(df_B)

We can examine a small sample of the resulting transformed records:

In [8]:
[records_A[k] for k in list(records_A.keys())[0:2]]

[{'address_1': 'ballard place',
  'address_2': None,
  'age': '32',
  'date_of_birth': '10/02/61',
  'first_name': 'matilda',
  'phone_number': '03 05903135',
  'postcode': '2470',
  'soc_sec_id': '8276847',
  'state': 'qld',
  'street_number': '8',
  'suburb': 'west perth',
  'surname': 'struck'},
 {'address_1': 'kambalda crescent',
  'address_2': 'villa 427',
  'age': '27',
  'date_of_birth': '01/08/26',
  'first_name': 'lachlan',
  'phone_number': None,
  'postcode': '5109',
  'soc_sec_id': '9937958',
  'state': None,
  'street_number': '36',
  'suburb': 'auburn',
  'surname': 'eglinton'}]
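
One caveat of the `mm/dd/yy` format is that the two-digit year drops the century. The sketch below is an illustration only and is not part of the original pipeline; the helper name `round_trip` is hypothetical. It parses the reformatted string back with Python's `%y` directive, which maps 00-68 to 2000-2068 and 69-99 to 1969-1999.

```python
import datetime

def round_trip(dob: str) -> datetime.date:
    """Format YYYYMMDD as mm/dd/yy, then parse the short form back."""
    short = datetime.datetime.strptime(dob, "%Y%m%d").strftime("%m/%d/%y")
    return datetime.datetime.strptime(short, "%m/%d/%y").date()

print(round_trip("19260108"))  # 2026-01-08: the century is not recovered
print(round_trip("19760501"))  # 1976-05-01
```

Whether this matters for the `DateTime` comparison depends on how that comparator parses the string, which is not examined here.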

## Prepare Training

When we linked our data via SimSum and supervised learning, we defined our blockers and comparators manually with `recordlinkage`. The `dedupe` library takes an active learning approach to blocking and classification and will use our feedback gathered during the labeling session to learn blocking rules and train a classifier. 

To prepare our `dedupe.RecordLink` object for training, first we'll define the fields that we think `dedupe` should pay attention to when matching records - these definitions will serve as the comparators. The `field` contains the name of the attribute to use for comparison, and the `type` defines the comparison type.

In [15]:
%%time

fields = [
    { "field" : "first_name", "type" : "Name" },
    { "field" : "surname", "type" : "Name" },
    { "field" : "address_1", "type" : "ShortString" },
    { "field" : "address_2", "type" : "ShortString" },
    { "field" : "suburb", "type" : "ShortString" },
    { "field" : "postcode", "type" : "Exact" },
    { "field" : "state", "type" : "Exact" },
    { "field" : "date_of_birth", "type" : "DateTime" },
    { "field" : "soc_sec_id", "type" : "Exact" },
]

linker = dedupe.RecordLink(fields)
linker.prepare_training(records_A, records_B)

INFO:dedupe.canopy_index:Removing stop word re
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (dayPredicate, date_of_birth)


CPU times: user 58.8 s, sys: 735 ms, total: 59.6 s
Wall time: 59.1 s
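
To build intuition for what a comparator produces, here is a toy sketch that turns a pair of records into one similarity value per field. It is not how `dedupe` computes its features: the comparator functions, `TOY_COMPARATORS`, and `compare_records` are hypothetical, and `difflib.SequenceMatcher` stands in for the library's own string distances.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional

def exact(a: Optional[str], b: Optional[str]) -> Optional[float]:
    """1.0 if both values are equal, 0.0 otherwise, None if either is missing."""
    if a is None or b is None:
        return None
    return 1.0 if a == b else 0.0

def fuzzy(a: Optional[str], b: Optional[str]) -> Optional[float]:
    """Similarity ratio in [0, 1], None if either value is missing."""
    if a is None or b is None:
        return None
    return SequenceMatcher(None, a, b).ratio()

# Hypothetical mapping from field type to a stand-in comparison function.
TOY_COMPARATORS = {"Exact": exact, "Name": fuzzy, "ShortString": fuzzy}

def compare_records(rec_a: Dict, rec_b: Dict, fields: List[Dict]) -> Dict:
    """Apply the comparator chosen by each field's type."""
    return {
        f["field"]: TOY_COMPARATORS[f["type"]](rec_a.get(f["field"]), rec_b.get(f["field"]))
        for f in fields
    }

toy_fields = [
    {"field": "first_name", "type": "Name"},
    {"field": "postcode", "type": "Exact"},
    {"field": "soc_sec_id", "type": "Exact"},
]
rec_a = {"first_name": "matilda", "postcode": "2470", "soc_sec_id": "8276847"}
rec_b = {"first_name": "matikda", "postcode": "2407", "soc_sec_id": "8276847"}

features = compare_records(rec_a, rec_b, toy_fields)
print(features)
```

A classifier then learns how much weight each of these per-field values deserves.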


## Active Learning Labeling Session!

At this point, we're ready to provide feedback to `dedupe` via an active learning labeling session. For this, `dedupe` supplies a convenience method to iterate through pairs it is uncertain about. As you provide feedback for each pair, dedupe learns blocking rules and recalculates its linking model weights.

You can use `y` (yes, match), `n` (no, not match), and `u` (unsure) to provide feedback on candidate links. When you're ready to exit the labeling session, use `f`.
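
The idea of asking about the pairs the model is least sure of can be sketched as uncertainty sampling. This is a simplified illustration with hypothetical names and made-up scores; it does not reproduce how `dedupe` actually selects pairs.

```python
from typing import List, Tuple

ScoredPair = Tuple[Tuple[str, str], float]

def most_uncertain(scored_pairs: List[ScoredPair]) -> ScoredPair:
    """Return the pair whose match probability is closest to 0.5."""
    return min(scored_pairs, key=lambda item: abs(item[1] - 0.5))

# Made-up match probabilities for three candidate pairs.
toy_scores = [
    (("a1", "b1"), 0.97),  # almost certainly a match
    (("a2", "b2"), 0.52),  # the model cannot decide
    (("a3", "b3"), 0.04),  # almost certainly distinct
]

next_pair = most_uncertain(toy_scores)
print(next_pair)  # (('a2', 'b2'), 0.52)
```

Labeling the undecided pair gives the model more information than confirming a pair it already scores near 0 or 1.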

In [16]:
dedupe.console_label(linker)

first_name : fergus
surname : yialas
address_1 : dutton street
address_2 : None
suburb : lalor
postcode : 2350
state : tas
date_of_birth : 05/06/41
soc_sec_id : 5690786

first_name : joshu
surname : webb
address_1 : None
address_2 : None
suburb : lalor
postcode : 5500
state : None
date_of_birth : 07/16/04
soc_sec_id : 9699819

0/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


n


first_name : kazuki
surname : rook
address_1 : None
address_2 : parklands village
suburb : allambie
postcode : 4340
state : qld
date_of_birth : 11/20/60
soc_sec_id : 4589956

first_name : kazu ji
surname : rook
address_1 : None
address_2 : parklands village
suburb : allambie
postcode : 4340
state : qc
date_of_birth : 12/10/60
soc_sec_id : 4589956

0/10 positive, 1/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : niamh
surname : nan
address_1 : phillip avenue
address_2 : parlour mountain
suburb : None
postcode : 3939
state : vic
date_of_birth : 11/17/99
soc_sec_id : 8913923

first_name : hannah
surname : nan
address_1 : None
address_2 : gumnut cottage
suburb : ryde
postcode : 2220
state : vig
date_of_birth : None
soc_sec_id : 2342535

1/10 positive, 1/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, soc_sec_id)
first_name : bodhi
surname : conaghty
address_1 : henty street
address_2 : mendip lodge
suburb : pymble
postcode : 7315
state : nsw
date_of_birth : 12/20/76
soc_sec_id : 6959141

first_name : bodhl
surname : conaghty
address_1 : henty street
address_2 : mendip lodge
suburb : pymble
postcode : 7315
state : nsw
date_of_birth : 12/20/76
soc_sec_id : 9087838

1/10 positive, 2/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : barnaby
surname : krix
address_1 : leslie street
address_2 : mirani
suburb : bargara
postcode : 4500
state : qld
date_of_birth : 06/30/59
soc_sec_id : 1474782

first_name : bafsh
surname : krix
address_1 : leslie atreet
address_2 : mirani
suburb : bargara
postcode : 4500
state : qld
date_of_birth : 06/30/59
soc_sec_id : 1474982

2/10 positive, 2/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, soc_sec_id)
INFO:dedupe.training:PartialPredicate: (tokenFieldPredicate, surname, CorporationName)
first_name : jazz
surname : clarke
address_1 : maccallum circuit
address_2 : None
suburb : bayview
postcode : 3152
state : vic
date_of_birth : 01/11/96
soc_sec_id : 4459305

first_name : jazz
surname : clarje
address_1 : maccallum circuit
address_2 : None
suburb : bayview
postcode : 3152
state : vic
date_of_birth : 01/11/96
soc_sec_id : 4453905

3/10 positive, 2/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (dayPredicate, date_of_birth)
INFO:dedupe.training:SimplePredicate: (firstTwoTokensPredicate, address_2)
first_name : emiily
surname : lowe
address_1 : pollock street
address_2 : beechworth
suburb : madora
postcode : 5161
state : vic
date_of_birth : 06/03/88
soc_sec_id : 6055911

first_name : philipp
surname : lowe
address_1 : pollock street
address_2 : beechworth
suburb : madora
postcode : 5161
state : vic
date_of_birth : None
soc_sec_id : 6055911

4/10 positive, 2/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


first_name : lachlan
surname : warrior
address_1 : sheaffe street
address_2 : None
suburb : wendouree
postcode : 3199
state : sa
date_of_birth : None
soc_sec_id : 4453394

first_name : laclhal
surname : warrior
address_1 : sheaffe street
address_2 : None
suburb : wendouree
postcode : 3199
state : nse
date_of_birth : None
soc_sec_id : 4453394

4/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : michael
surname : bishop
address_1 : mueller street
address_2 : None
suburb : scarborough
postcode : 3121
state : None
date_of_birth : 07/25/08
soc_sec_id : 6087264

first_name : mirah
surname : bishp
address_1 : mueller street
address_2 : None
suburb : scarborough
postcode : 3121
state : None
date_of_birth : 07/17/08
soc_sec_id : 6087264

5/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (dayPredicate, date_of_birth)
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, soc_sec_id)
first_name : jordyn
surname : millar
address_1 : baracchi crescent
address_2 : wayamba
suburb : carine
postcode : 4012
state : vic
date_of_birth : 02/14/92
soc_sec_id : 6169602

first_name : jordyn
surname : millar
address_1 : baracchi crescent
address_2 : None
suburb : carine
postcode : 4012
state : None
date_of_birth : 03/14/92
soc_sec_id : 6169621

6/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : claudia
surname : moscatt
address_1 : emu bank street
address_2 : brookfield
suburb : cundletown
postcode : 2500
state : None
date_of_birth : 12/21/19
soc_sec_id : 2577386

first_name : claudis
surname : moscatt
address_1 : emu bank street
address_2 : brookfield
suburb : cundletown
postcode : 2500
state : None
date_of_birth : 11/21/19
soc_sec_id : 2577368

7/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, address_1)
INFO:dedupe.training:SimplePredicate: (twoGramFingerprint, address_2)
first_name : jarryd
surname : white
address_1 : sturgeon street
address_2 : None
suburb : mount colah
postcode : 2800
state : qld
date_of_birth : 05/21/29
soc_sec_id : 9569060

first_name : nan
surname : white
address_1 : None
address_2 : None
suburb : mount colah
postcode : 2800
state : qldo
date_of_birth : 05/21/29
soc_sec_id : 9569068

8/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


u


first_name : toby
surname : dent
address_1 : studley street
address_2 : sunnydale cottage
suburb : bowral
postcode : 2450
state : wa
date_of_birth : 04/02/81
soc_sec_id : 9402107

first_name : toby
surname : de nd
address_1 : None
address_2 : None
suburb : bowral
postcode : 2450
state : wa
date_of_birth : 04/02/81
soc_sec_id : 9402107

8/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : annalise
surname : crlik
address_1 : None
address_2 : the crater
suburb : melville
postcode : 3460
state : nsw
date_of_birth : 10/05/45
soc_sec_id : 3039405

first_name : annalise
surname : crlih
address_1 : None
address_2 : None
suburb : melville
postcode : 3460
state : None
date_of_birth : 10/05/45
soc_sec_id : 3039405

9/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, address_1)
INFO:dedupe.training:SimplePredicate: (dayPredicate, date_of_birth)
INFO:dedupe.training:SimplePredicate: (firstTwoTokensPredicate, address_2)
first_name : chelsea
surname : lyden
address_1 : corroboree park
address_2 : None
suburb : broken hill
postcode : 6232
state : nsw
date_of_birth : 08/17/87
soc_sec_id : 1169181

first_name : chelsea
surname : lyden
address_1 : corroboiee park
address_2 : None
suburb : broken hill
postcode : 6232
state : ns
date_of_birth : 08/27/87
soc_sec_id : 1169181

10/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


f


Finished labeling


We can now train our linker, based on the labeling session feedback.

In [17]:
%%time
linker.train()

INFO:rlr.crossvalidation:using cross validation to find optimum alpha...
INFO:rlr.crossvalidation:optimum alpha: 0.000100, score 0.01041666666666667
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (monthPredicate, date_of_birth), SimplePredicate: (sameThreeCharStartPredicate, surname), SimplePredicate: (fingerprint, suburb))
INFO:dedupe.training:(TfidfNGramSearchPredicate: (0.8, suburb), SimplePredicate: (commonSixGram, surname), SimplePredicate: (exclusiveDayPredicate, date_of_birth))
INFO:dedupe.training:(SimplePredicate: (wholeFieldPredicate, soc_sec_id), PartialPredicate: (sameThreeCharStartPredicate, first_name, Surname), PartialPredicate: (suffixArray, surname, Surname))
INFO:dedupe.training:(SimplePredicate: (firstTwoTokensPredicate, address_2), PartialPredicate: (tokenFieldPredicate, surname, Surname), SimplePredicate: (yearPredicate, date_of_birth))


CPU times: user 7.68 s, sys: 920 ms, total: 8.6 s
Wall time: 7.7 s


Let's persist our training data (captured during the labeling session), as well as the learned model weights.

In [18]:
ACTIVE_LEARNING_DIR = WORKING_DIR / "dedupe_active_learning"
ACTIVE_LEARNING_DIR.mkdir(parents=True, exist_ok=True)

SETTINGS_FILE = ACTIVE_LEARNING_DIR / "dedupe_learned_settings"
TRAINING_FILE = ACTIVE_LEARNING_DIR / "dedupe_training.json"

with open(TRAINING_FILE, "w") as fh:
    linker.write_training(fh)
    
with open(SETTINGS_FILE, "wb") as sf:
    linker.write_settings(sf)
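
To check what was persisted, the training file can be inspected as JSON. The sketch below assumes the file has top-level `match` and `distinct` keys holding the labeled pairs; that layout is an assumption about the library's output, so verify it against your own file. A small stand-in file is written here so the example runs on its own, and `count_labels` is a hypothetical helper.

```python
import json
import pathlib
import tempfile

def count_labels(training_path: pathlib.Path) -> dict:
    """Count labeled pairs per class in a training JSON file."""
    with open(training_path, "r") as fh:
        training = json.load(fh)
    # Assumed layout: {"match": [...pairs...], "distinct": [...pairs...]}
    return {label: len(training.get(label, [])) for label in ("match", "distinct")}

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in file with one labeled match and no distinct pairs.
    demo_file = pathlib.Path(tmp) / "demo_training.json"
    demo_file.write_text(json.dumps({"match": [["rec_a1", "rec_b1"]], "distinct": []}))
    counts = count_labels(demo_file)

print(counts)  # {'match': 1, 'distinct': 0}
```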

## Examine Learned Blockers

Now, let's take a look at the predicates (blockers) that `dedupe` learned during our active learning labeling session. Note that `dedupe` can learn composite predicates/blockers, i.e. individual predicates can be combined with logical operators.

In [19]:
linker.predicates

((SimplePredicate: (monthPredicate, date_of_birth),
  SimplePredicate: (sameThreeCharStartPredicate, surname),
  SimplePredicate: (fingerprint, suburb)),
 (TfidfNGramSearchPredicate: (0.8, suburb),
  SimplePredicate: (commonSixGram, surname),
  SimplePredicate: (exclusiveDayPredicate, date_of_birth)),
 (SimplePredicate: (wholeFieldPredicate, soc_sec_id),
  PartialPredicate: (sameThreeCharStartPredicate, first_name, Surname),
  PartialPredicate: (suffixArray, surname, Surname)),
 (SimplePredicate: (firstTwoTokensPredicate, address_2),
  PartialPredicate: (tokenFieldPredicate, surname, Surname),
  SimplePredicate: (yearPredicate, date_of_birth)))
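
As a rough sketch of how composite blockers generate candidates, the toy code below assumes that each inner tuple acts as a conjunction (all of its predicates must produce the same key) and that the outer collection acts as a union. The predicate functions are simplified, hypothetical stand-ins and are not the library's implementations.

```python
from typing import Callable, Dict, Optional, Set, Tuple

def first_three_chars(value: Optional[str]) -> Optional[str]:
    """Simplified stand-in for a 'same three char start' predicate."""
    return value[:3] if value else None

def whole_field(value: Optional[str]) -> Optional[str]:
    """Simplified stand-in for a 'whole field' predicate."""
    return value or None

def block_key(record: Dict, compound: Tuple[Tuple[Callable, str], ...]):
    """Build one key from all predicates in a compound; None if any part is missing."""
    parts = []
    for func, field in compound:
        key = func(record.get(field))
        if key is None:
            return None
        parts.append(key)
    return tuple(parts)

def candidate_pairs(records_a: Dict, records_b: Dict, predicates) -> Set[Tuple[str, str]]:
    """Union of the pairs that share a block key under any compound predicate."""
    pairs = set()
    for compound in predicates:
        # Index the B records by block key, then look up each A record.
        index = {}
        for id_b, rec_b in records_b.items():
            key = block_key(rec_b, compound)
            if key is not None:
                index.setdefault(key, []).append(id_b)
        for id_a, rec_a in records_a.items():
            for id_b in index.get(block_key(rec_a, compound), []):
                pairs.add((id_a, id_b))
    return pairs

toy_A = {
    "a1": {"surname": "struck", "postcode": "2470", "soc_sec_id": "8276847"},
    "a2": {"surname": "eglinton", "postcode": "5109", "soc_sec_id": "9937958"},
}
toy_B = {
    "b1": {"surname": "strucl", "postcode": "2407", "soc_sec_id": "8276847"},
    "b2": {"surname": "asher", "postcode": "4218", "soc_sec_id": "7017310"},
}
toy_predicates = [
    ((whole_field, "soc_sec_id"),),
    ((first_three_chars, "surname"), (whole_field, "postcode")),
]

print(candidate_pairs(toy_A, toy_B, toy_predicates))  # {('a1', 'b1')}
```

Here `a1` and `b1` are paired by the first blocker only: the second requires both a shared surname prefix and an identical postcode, and the postcodes differ.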

Next, let's examine the resulting candidate pairs and look at our blocking efficiency. The `.pairs` method will give us all candidate record pairs that are generated by blocking with the learned blockers.

In [20]:
candidate_pairs = [x for x in linker.pairs(records_A, records_B)]
print(f"{len(candidate_pairs):,} candidate pairs generated from blocking.")

1,190 candidate pairs generated from blocking.


You'll notice that, in contrast to `recordlinkage`, our post-blocking candidate pairs contain both the record ids as well as the record metadata.

In [21]:
candidate_pairs[0]

(('fbc4143d-15f9-4f27-b5f0-dedbadce6616',
  {'address_1': 'ballard place',
   'address_2': None,
   'age': '32',
   'date_of_birth': '10/02/61',
   'first_name': 'matilda',
   'phone_number': '03 05903135',
   'postcode': '2470',
   'soc_sec_id': '8276847',
   'state': 'qld',
   'street_number': '8',
   'suburb': 'west perth',
   'surname': 'struck'}),
 ('a9f5a761-83d6-452e-9f27-a452b3d06a4e',
  {'address_1': 'ballard place',
   'address_2': None,
   'age': '32',
   'date_of_birth': '10/02/61',
   'first_name': 'matikda',
   'phone_number': '03 05903135',
   'postcode': '2407',
   'soc_sec_id': '8276847',
   'state': 'qld',
   'street_number': '0',
   'suburb': 'west perth',
   'surname': 'strucl'}))

We can assemble our candidate pair ids into an indexed pandas dataframe for easier comparison with our known true links.

In [22]:
df_candidate_links = pd.DataFrame(
    [(x[0][0], x[1][0]) for x in candidate_pairs]
).rename(columns={0 : "person_id_A", 1 : "person_id_B"}).set_index(["person_id_A", "person_id_B"])

df_candidate_links.head()

person_id_A,person_id_B
fbc4143d-15f9-4f27-b5f0-dedbadce6616,a9f5a761-83d6-452e-9f27-a452b3d06a4e
48a56cad-7ba6-45e1-97cd-517ba65bdab5,c77c2c04-4415-4c4d-b248-18dc28fd63d0
41f038b8-77c0-45a5-9e1f-e62b8637ffd1,337aa0c5-4a0a-4bcd-89db-6fa998fa783c
b4e3efc2-9c8f-4e3e-8b98-9bfa842094f9,e63f19ca-3f5b-4021-ac1e-05fc7495bd48
050a4ce1-8fc9-410d-bae1-65a70a518e34,e36dc4e4-c33c-4021-9dba-ceed3a4956d7


Now, let's take a look at our learned blocker performance.

In [23]:
max_candidate_pairs = df_A.shape[0]*df_B.shape[0]

print(f"{max_candidate_pairs:,} total possible pairs.")

# Calculate search space reduction (a fraction between 0 and 1).
search_space_reduction = 1 - len(candidate_pairs)/max_candidate_pairs
# The `%` format spec multiplies by 100, so the printed value is a true percentage.
print(f"\n{len(candidate_pairs):,} pairs after full blocking: {search_space_reduction:.4%} search space reduction.")

# Calculate retained true links percentage.
total_true_links = df_ground_truth.shape[0]
true_links_after_blocking = pd.merge(
    df_ground_truth,
    df_candidate_links,
    left_index=True,
    right_index=True,
    how="inner"
).shape[0]

retained_true_link_percent = round((true_links_after_blocking/total_true_links) * 100, 2)
print(f"{retained_true_link_percent}% true links retained after blocking.")

10,562,500 total possible pairs.

1,190 pairs after full blocking: 99.9887% search space reduction.
39.67% true links retained after blocking.
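
The two blocking metrics reported above are commonly called reduction ratio and pairs completeness. As a pure-Python sketch (the function names are hypothetical and the toy id pairs are made up), they can be computed from pair counts and sets of id pairs:

```python
from typing import Set, Tuple

def reduction_ratio(n_candidates: int, n_possible: int) -> float:
    """Fraction of the full cross product that blocking removed."""
    return 1 - n_candidates / n_possible

def pairs_completeness(candidates: Set[Tuple[str, str]], true_links: Set[Tuple[str, str]]) -> float:
    """Fraction of true links that survive blocking."""
    return len(candidates & true_links) / len(true_links)

# Counts taken from the blocking output in this notebook.
rr = reduction_ratio(1_190, 10_562_500)
print(f"{rr:.4%}")  # 99.9887%

# Made-up id pairs to illustrate pairs completeness.
toy_true = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}
toy_candidates = {("a1", "b1"), ("a4", "b9")}
pc = pairs_completeness(toy_candidates, toy_true)
print(f"{pc:.2%}")  # 33.33%
```

A high reduction ratio is only useful if pairs completeness stays high as well, since true links dropped by blocking can never be recovered by the classifier.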


## Score Pairs and Examine Learned Classifier

After `dedupe` has trained blockers and a classification model based on our labeling session, we can link the records in our training dataset via the `.join` method.

In [24]:
%%time
linked_records = linker.join(records_A, records_B, threshold=0.0, constraint="one-to-one")

CPU times: user 2.17 s, sys: 104 ms, total: 2.27 s
Wall time: 3.41 s


`linker.join` will return the links, along with a model confidence.

In [25]:
linked_records[0:3]

[(('fff044ab-8dca-4946-bfa4-1675ee7d56b5',
   '99060a0c-e1bf-4869-bf08-2e15389193b6'),
  1.0),
 (('ffd668ac-2f63-4c05-a6a3-58ebcf1f4a80',
   '3e8c4b67-3611-4a08-84c8-b082b627bb21'),
  1.0),
 (('ff86e492-166d-4652-bf5b-61b9eef60e51',
   '0e4b371a-3c2b-4e2d-bae3-1e37058855b7'),
  1.0)]
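
Because each element has the shape `((id_A, id_B), score)`, applying a score threshold afterwards is a simple filter. The sketch below uses made-up ids and scores, and `links_above` is a hypothetical helper name.

```python
from typing import List, Tuple

Link = Tuple[Tuple[str, str], float]

def links_above(linked: List[Link], threshold: float) -> List[Link]:
    """Keep only links whose model score is at least `threshold`."""
    return [(ids, score) for ids, score in linked if score >= threshold]

toy_linked = [
    (("a1", "b1"), 1.0),
    (("a2", "b2"), 0.62),
    (("a3", "b3"), 0.04),
]

confident = links_above(toy_linked, threshold=0.5)
print(confident)  # [(('a1', 'b1'), 1.0), (('a2', 'b2'), 0.62)]
```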

We'll format the `dedupe` linker predictions into a format that we can use with our existing evaluation functions.

In [26]:
df_predictions = pd.DataFrame(
    [ {"person_id_A" : x[0][0], "person_id_B" : x[0][1], "model_score" : x[1]} for x in linked_records]
)

df_predictions = df_predictions.set_index(["person_id_A", "person_id_B"])

df_predictions = pd.merge(
    df_predictions,
    df_ground_truth,
    left_index=True,
    right_index=True,
    how="left",
)

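# Pairs absent from the ground truth are not true links; assign the result
# back rather than relying on in-place modification of a column selection.
df_predictions["ground_truth"] = df_predictions["ground_truth"].fillna(False)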
df_predictions

person_id_A,person_id_B,model_score,ground_truth
fff044ab-8dca-4946-bfa4-1675ee7d56b5,99060a0c-e1bf-4869-bf08-2e15389193b6,1.000000,True
ffd668ac-2f63-4c05-a6a3-58ebcf1f4a80,3e8c4b67-3611-4a08-84c8-b082b627bb21,1.000000,True
ff86e492-166d-4652-bf5b-61b9eef60e51,0e4b371a-3c2b-4e2d-bae3-1e37058855b7,1.000000,True
ff64e7b6-7a23-45d2-abfd-df84b7dfe02a,e44731a3-6d4d-4d1e-8466-57aef723dcdc,1.000000,True
fed0c57c-c844-47f9-a338-3777f231729b,b908bc05-6230-4b65-91d4-b7aa8f61b6b0,1.000000,True
...,...,...,...
62d4aa2a-864d-4646-87d2-f7117a1a4e0f,11bbd921-7258-421e-814c-2cd50347429a,0.207074,True
c720fe31-603b-483b-a65e-0aec2f5a1fd0,90e57a5c-a1a0-4ecb-9f2e-793aea34b056,0.202214,True
23afe634-5cbe-4a97-a33a-a611cb8c702c,efc1f4fa-fb86-424c-89ac-ed3be8a75519,0.159320,True
88211d39-93f1-404b-b2be-594a7a0a24dd,b143acb5-b7ad-49b0-aaef-6848b98ecb15,0.044695,True


## Choosing a Linking Model Score Threshold

The `dedupe` `.join` method that we used to score our training data directly incorporates the learned blockers. The scored pairs in the distribution therefore represent only pairs that survived blocking, and our blockers *significantly* reduced the candidate pair search space.

### Model Score Distribution

In [27]:
df_predictions["ground_truth"].value_counts()

True    1190
Name: ground_truth, dtype: int64

In [28]:
tutorial.plot_model_score_distribution(df_predictions)

INFO:numexpr.utils:NumExpr defaulting to 2 threads.


### Precision and Recall vs. Model Score

In [29]:
df_eval = tutorial.evaluate_linking(
    df=df_predictions
)

In [None]:
df_eval.head()

In [31]:
tutorial.plot_precision_recall_vs_threshold(df_eval)
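
For reference, precision and recall at a single threshold can be computed by hand. The sketch below is a hypothetical stand-in with made-up scores and does not reproduce `tutorial.evaluate_linking`. It divides recall by the total number of true links, including any that were lost during blocking and therefore never scored.

```python
from typing import List, Optional, Tuple

def precision_recall_at(
    scored: List[Tuple[float, bool]], threshold: float, total_true_links: int
) -> Tuple[Optional[float], float]:
    """Return (precision, recall) for links scored at or above `threshold`.

    Precision is None when nothing is predicted at this threshold.
    """
    predicted = [is_true for score, is_true in scored if score >= threshold]
    true_positives = sum(predicted)
    precision = true_positives / len(predicted) if predicted else None
    recall = true_positives / total_true_links
    return precision, recall

# Made-up (model_score, ground_truth) rows; one more true link was never scored.
toy_scored = [(1.0, True), (0.9, True), (0.6, False), (0.2, True)]

precision, recall = precision_recall_at(toy_scored, threshold=0.5, total_true_links=4)
print(precision, recall)
```

With this definition, recall can never exceed the share of true links retained after blocking, no matter which threshold is chosen.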

## Iterating with Active Learning

When using active learning, we iterate on our linking solution, and incorporate progressively more labeled training data. Perhaps we're not satisfied with the current performance of the blockers or classifier, and we'd like to create more labeled examples for dedupe to train on.

Recall that earlier, we saved our existing training data from the first labeling session. We can load this persisted data into a `dedupe` linker, and kick off another labeling session. Perhaps, after investigating the data during our first cycle, we don't think that `dedupe` should include `address_1` and `address_2` in its comparators.

### Tweak the Linker and Use Existing Training Data

In [32]:
%%time

fields = [
    { "field" : "first_name", "type" : "Name" },
    { "field" : "surname", "type" : "Name" },
    { "field" : "suburb", "type" : "ShortString" },
    { "field" : "postcode", "type" : "Exact" },
    { "field" : "state", "type" : "Exact" },
    { "field" : "date_of_birth", "type" : "DateTime" },
    { "field" : "soc_sec_id", "type" : "Exact" },
]

linker2 = dedupe.RecordLink(fields)

with open(TRAINING_FILE, "r") as fh:
    linker2.prepare_training(records_A, records_B, training_file=fh)

INFO:dedupe.api:reading training from file
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (dayPredicate, date_of_birth)
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, suburb)


CPU times: user 40.1 s, sys: 550 ms, total: 40.7 s
Wall time: 41.9 s


Now, we can kick off a second active learning/labeling session.

In [33]:
dedupe.console_label(linker2)

first_name : louis
surname : traforti
suburb : pacific paradise
postcode : 2518
state : vic
date_of_birth : 05/12/38
soc_sec_id : 1913191

first_name : louis
surname : trafcrati
suburb : None
postcode : 2518
state : vic
date_of_birth : 05/12/38
soc_sec_id : 1913191

10/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


y


first_name : callie
surname : campbell
suburb : burwood
postcode : 2480
state : nsw
date_of_birth : 02/01/97
soc_sec_id : 8116328

first_name : callie
surname : campbell
suburb : None
postcode : 2480
state : None
date_of_birth : 02/01/97
soc_sec_id : 8116328

11/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, suburb)
INFO:dedupe.training:PartialPredicate: (sameThreeCharStartPredicate, surname, CorporationName)
first_name : jaiden
surname : gaugg
suburb : None
postcode : 3850
state : nsw
date_of_birth : 05/01/76
soc_sec_id : 7820255

first_name : jaiden
surname : gaugg
suburb : None
postcode : 3850
state : None
date_of_birth : 05/01/76
soc_sec_id : 7820255

12/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, suburb)
INFO:dedupe.training:SimplePredicate: (dayPredicate, date_of_birth)
first_name : nathan
surname : hanna
suburb : None
postcode : 2526
state : vic
date_of_birth : 11/06/54
soc_sec_id : 1862673

first_name : harrisbn
surname : bishp
suburb : 2400
postcode : toolleen
state : qld
date_of_birth : None
soc_sec_id : 5782301

13/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : sophie
surname : campbell
suburb : None
postcode : 3799
state : vic
date_of_birth : 09/20/19
soc_sec_id : 7746745

first_name : callum
surname : newfkry
suburb : miller
postcode : 5700
state : qls
date_of_birth : None
soc_sec_id : 3078894

14/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:PartialIndexLevenshteinSearchPredicate: (2, first_name, CorporationName)
first_name : jared
surname : ryan
suburb : campbellfield
postcode : 4680
state : wa
date_of_birth : 03/25/31
soc_sec_id : 6128736

first_name : jare
surname : rysn
suburb : campbellfield
postcode : 4680
state : wy
date_of_birth : 03/25/31
soc_sec_id : 6128736

14/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


f


Finished labeling


### Retrain the Linker and Examine Blocking Performance

Now, let's retrain, and examine blocker performance. Ideally, we see an improved true link retention following our second labeling session.

In [34]:
%%time
linker2.train()

INFO:rlr.crossvalidation:using cross validation to find optimum alpha...
INFO:rlr.crossvalidation:optimum alpha: 0.001000, score 0.10163888579390475
INFO:dedupe.training:Final predicate set:


CPU times: user 9.43 s, sys: 1.55 s, total: 11 s
Wall time: 9.23 s


In [36]:
df_candidate_links

person_id_A,person_id_B
fbc4143d-15f9-4f27-b5f0-dedbadce6616,a9f5a761-83d6-452e-9f27-a452b3d06a4e
48a56cad-7ba6-45e1-97cd-517ba65bdab5,c77c2c04-4415-4c4d-b248-18dc28fd63d0
41f038b8-77c0-45a5-9e1f-e62b8637ffd1,337aa0c5-4a0a-4bcd-89db-6fa998fa783c
b4e3efc2-9c8f-4e3e-8b98-9bfa842094f9,e63f19ca-3f5b-4021-ac1e-05fc7495bd48
050a4ce1-8fc9-410d-bae1-65a70a518e34,e36dc4e4-c33c-4021-9dba-ceed3a4956d7
...,...
88211d39-93f1-404b-b2be-594a7a0a24dd,b143acb5-b7ad-49b0-aaef-6848b98ecb15
a63ef7fb-b231-492e-9e89-5a1b36871dab,fd5b5d7d-97f4-4ac7-a94d-6897540622e0
12923ed1-3ec4-4f05-a0ee-e2f50de15d02,95714e93-ddad-4ac3-ad72-c274429411aa
601bed72-e640-4894-b6f9-85b724133b1d,d6b87055-3867-4060-a21b-737313c74bfe


In [47]:
candidate_pairs = [x for x in linker2.pairs(records_A, records_B)]
print(f"{len(candidate_pairs):,} candidate pairs generated from blocking.")

# Only evaluate blocking when the learned blockers produced candidates.
if candidate_pairs:
    # Assign the indexed frame; set_index returns a new DataFrame.
    df_candidate_links = pd.DataFrame(
        [(x[0][0], x[1][0]) for x in candidate_pairs]
    ).rename(columns={0 : "person_id_A", 1 : "person_id_B"}).set_index(["person_id_A", "person_id_B"])

    max_candidate_pairs = df_A.shape[0]*df_B.shape[0]

    print(f"{max_candidate_pairs:,} total possible pairs.")

    # Calculate search space reduction (a fraction, printed as a percentage).
    search_space_reduction = 1 - len(candidate_pairs)/max_candidate_pairs
    print(f"\n{len(candidate_pairs):,} pairs after full blocking: {search_space_reduction:.4%} search space reduction.")

    # Calculate retained true links percentage.
    total_true_links = df_ground_truth.shape[0]
    true_links_after_blocking = pd.merge(
        df_ground_truth,
        df_candidate_links,
        left_index=True,
        right_index=True,
        how="inner"
    ).shape[0]

    retained_true_link_percent = round((true_links_after_blocking/total_true_links) * 100, 2)
    print(f"{retained_true_link_percent}% true links retained after blocking.")

0 candidate pairs generated from blocking.


### Evaluate Classification Performance

In [43]:
# Note: `linked_records` still holds the output of the first linker's `.join`
# call; it was not regenerated with `linker2`.
df_predictions = pd.DataFrame(
    [ {"person_id_A" : x[0][0], "person_id_B" : x[0][1], "model_score" : x[1]} for x in linked_records]
)

df_predictions = df_predictions.set_index(["person_id_A", "person_id_B"])

df_predictions = pd.merge(
    df_predictions,
    df_ground_truth,
    left_index=True,
    right_index=True,
    how="left",
)

df_predictions["ground_truth"] = df_predictions["ground_truth"].fillna(False)
df_predictions

person_id_A,person_id_B,model_score,ground_truth
fff044ab-8dca-4946-bfa4-1675ee7d56b5,99060a0c-e1bf-4869-bf08-2e15389193b6,1.000000,True
ffd668ac-2f63-4c05-a6a3-58ebcf1f4a80,3e8c4b67-3611-4a08-84c8-b082b627bb21,1.000000,True
ff86e492-166d-4652-bf5b-61b9eef60e51,0e4b371a-3c2b-4e2d-bae3-1e37058855b7,1.000000,True
ff64e7b6-7a23-45d2-abfd-df84b7dfe02a,e44731a3-6d4d-4d1e-8466-57aef723dcdc,1.000000,True
fed0c57c-c844-47f9-a338-3777f231729b,b908bc05-6230-4b65-91d4-b7aa8f61b6b0,1.000000,True
...,...,...,...
62d4aa2a-864d-4646-87d2-f7117a1a4e0f,11bbd921-7258-421e-814c-2cd50347429a,0.207074,True
c720fe31-603b-483b-a65e-0aec2f5a1fd0,90e57a5c-a1a0-4ecb-9f2e-793aea34b056,0.202214,True
23afe634-5cbe-4a97-a33a-a611cb8c702c,efc1f4fa-fb86-424c-89ac-ed3be8a75519,0.159320,True
88211d39-93f1-404b-b2be-594a7a0a24dd,b143acb5-b7ad-49b0-aaef-6848b98ecb15,0.044695,True


In [44]:
df_predictions["ground_truth"].value_counts()

True    1190
Name: ground_truth, dtype: int64

In [45]:
tutorial.plot_model_score_distribution(df_predictions)

In [46]:
df_eval = tutorial.evaluate_linking(
    df=df_predictions
)

tutorial.plot_precision_recall_vs_threshold(df_eval)