# Multi-party Record Linkage with Blocking

In this tutorial, we demonstrate the CLI tools for multi-party record linkage with blocking.


In [None]:
# NBVAL_IGNORE_OUTPUT
!pip install -U anonlink-client anonlink

In [1]:
import io
import math
import json
import pandas as pd

import anonlink
import clkhash
from clkhash import clk

from anonlinkclient.utils import combine_clks_blocks, deserialize_filters

%run util.py

Suppose we want to find records that appear in at least two of the three parties' datasets.

## Generate CLKs and Candidate Blocks

First, we have a look at the dataset:

In [2]:
# NBVAL_IGNORE_OUTPUT
corruption_rate = 20
file_template = 'data/ncvr_numrec_5000_modrec_2_ocp_' + str(corruption_rate) + '_myp_{}_nump_10.csv'
df1 = pd.read_csv(file_template.format(0))
df1.head()

Unnamed: 0,recid,givenname,surname,suburb,postcode
0,1503359,pauline,camkbell,lilescille,28091
1,1972058,deborah,galyen,ennike,286z3
2,889525,charle5,mitrhell,roaring river,28669
3,4371845,petehr,werts,swannanoa,28478
4,1187991,katpy,silbiger,duyham,27705


A linkage schema instructs clkhash how to treat each column for generating CLKs. A detailed description of the linkage schema can be found in the <a href='https://clkhash.readthedocs.io/en/stable/schema.html'>api docs</a>. We will ignore the column ‘recid’ for CLK generation.

In [3]:
# NBVAL_IGNORE_OUTPUT
with open("novt_schema.json") as f:
    print(f.read())


{
  "version": 3,
  "clkConfig": {
    "l": 1024,
    "kdf": {
      "type": "HKDF",
      "hash": "SHA256",
      "salt": "SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==",
      "info": "c2NoZW1hX2V4YW1wbGU=",
      "keySize": 64
    }
  },
  "features": [
        {
      "identifier": "recid",
      "ignored": true
    },
    {
      "identifier": "givenname",
      "format": {
        "type": "string",
        "encoding": "utf-8",
        "maxLength": 30,
        "case": "lower"
      },
      "hashing": {
        "comparison": {"type":  "ngram", "n":  2},
        "strategy": {"bitsPerFeature":  100},
        "hash": {"type": "blakeHash"},
        "missingValue": {
          "sentinel": ".",
          "replaceWith": ""
        }
      }
    },
    {
      "identifier": "surname",
      "format": {
        "type": "string",
        "encoding": "utf-8",
        "maxLength": 30,
        "case": "lower"
      },
      "hashing": {
        "comp

### Validate the schema
The command line tool can check that the linkage schema is valid:

In [4]:
# NBVAL_IGNORE_OUTPUT
!anonlink validate-schema "novt_schema.json"

schema is valid


### Hash data
We can now hash the Personally Identifiable Information (PII) from the CSV file using our linkage schema. We must provide a secret key to this command, and every party hashing data has to use the same key. For this toy example we will use the secret ‘secret’. For real data, make sure that the secret contains enough entropy, as knowledge of this secret is sufficient to reconstruct the PII from a CLK!

In [5]:
secret = 'secret'
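
For real data, one way to obtain a secret with enough entropy is Python's standard `secrets` module. The snippet below is a minimal sketch: the variable name `strong_secret` and the choice of 32 random bytes are assumptions made for illustration, not requirements of `anonlink`.

```python
import secrets

# Draw 32 bytes (256 bits) from the operating system's secure random source
# and encode them as URL-safe text, so the value is easy to pass to a CLI.
# The name `strong_secret` and the 32-byte size are illustrative choices.
strong_secret = secrets.token_urlsafe(32)

print(len(strong_secret))
```

All parties must agree on the same value through a secure channel before hashing their data.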

In [6]:
# NBVAL_IGNORE_OUTPUT
!anonlink hash 'data/ncvr_numrec_5000_modrec_2_ocp_20_myp_0_nump_10.csv' secret 'novt_schema.json' 'novt_clk_0.json'

CLK data written to novt_clk_0.json


Let's hash the data for parties B and C:

In [7]:
# NBVAL_IGNORE_OUTPUT
!anonlink hash 'data/ncvr_numrec_5000_modrec_2_ocp_20_myp_1_nump_10.csv' secret 'novt_schema.json' 'novt_clk_1.json'

CLK data written to novt_clk_1.json


In [8]:
# NBVAL_IGNORE_OUTPUT
!anonlink hash 'data/ncvr_numrec_5000_modrec_2_ocp_20_myp_2_nump_10.csv' secret 'novt_schema.json' 'novt_clk_2.json'

CLK data written to novt_clk_2.json


`anonlink` provides a `describe` command to inspect the hashing results, i.e. the Cryptographic Longterm Keys (CLKs). We normally expect the popcount distribution to be roughly symmetric, with a mean that is moderate relative to the Bloom filter length.

In [9]:
# NBVAL_IGNORE_OUTPUT
!anonlink describe 'novt_clk_0.json'

[Histogram of CLK popcounts (terminal plot output omitted)]

In this case, the popcount mean is not very large compared to the Bloom filter length (1024). If the popcount mean is large, you can reduce it by modifying the schema. For more details, please have a look at this <a href='https://clkhash.readthedocs.io/en/stable/tutorial_cli.html'>tutorial</a>.
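
To make the notion of popcount concrete, the sketch below counts the set bits of base64-encoded bit arrays and relates the mean to the filter length. It assumes that a CLK is stored as a base64 string; the two example filters are made up, and `popcount` is a hypothetical helper, not part of `anonlink` or `clkhash`.

```python
import base64
import statistics


def popcount(encoded_clk: str) -> int:
    """Count the set bits in one base64-encoded bit array."""
    raw = base64.b64decode(encoded_clk)
    return sum(bin(byte).count("1") for byte in raw)


# Two made-up 128-byte (1024-bit) filters stand in for real CLKs.
example_clks = [
    base64.b64encode(bytes([0b00001111] * 128)).decode("ascii"),  # 4 bits set per byte
    base64.b64encode(bytes([0b00000011] * 128)).decode("ascii"),  # 2 bits set per byte
]

counts = [popcount(c) for c in example_clks]
mean_fill = statistics.mean(counts) / 1024  # mean popcount relative to filter length
print(counts, mean_fill)
```

A mean fill close to 1 would mean that most bits are set in every CLK, which makes different records harder to tell apart.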

### Block dataset


Blocking is a technique that makes record linkage scalable. It works by partitioning the datasets into groups, called blocks, and only comparing records in corresponding blocks. This reduces the number of comparisons needed to find which records should be linked.
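
The effect can be illustrated with a toy two-party example. The block keys, record indices and the helper `candidate_pairs` below are invented for illustration and do not reflect the file format used by `anonlink`.

```python
from itertools import product

# Hypothetical block assignments: block key -> record indices, one dict per party.
blocks_a = {"x": [0, 1], "y": [2], "z": [3]}
blocks_b = {"x": [0], "y": [1, 2], "w": [3]}


def candidate_pairs(blocks_left, blocks_right):
    """Return the cross-party record pairs that share at least one block."""
    pairs = set()
    for key in blocks_left.keys() & blocks_right.keys():
        pairs.update(product(blocks_left[key], blocks_right[key]))
    return pairs


with_blocking = candidate_pairs(blocks_a, blocks_b)
without_blocking = 4 * 4  # every record of A against every record of B
print(len(with_blocking), without_blocking)
```

Only pairs from blocks that exist on both sides are compared, so records in blocks `z` and `w` are never compared with anything.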

Similar to the hashing above, the blocking is configured with a schema. For this linkage we chose 'lambda-fold' as the blocking technique. This blocking method was proposed in the paper <a href='https://ieeexplore.ieee.org/abstract/document/6880802'>*An LSH-Based Blocking Approach with a Homomorphic Matching Technique for Privacy-Preserving Record Linkage*</a>. We also provide a detailed explanation of how this blocking method works in this <a href='https://github.com/data61/blocklib/blob/master/docs/tutorial/tutorial_blocking.ipynb'>tutorial</a>.

In [10]:
# NBVAL_IGNORE_OUTPUT
with open("blocking_schema.json") as f:
    print(f.read())


{
    "type": "lambda-fold",
    "version": 1,
    "config": {
        "blocking-features": [6],
        "Lambda": 3,
        "bf-len": 64,
        "num-hash-funcs": 3,
        "K": 5,
        "input-clks": true,
        "random_state": 0
    }
}    
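
As a rough intuition for the `Lambda` and `K` settings, the sketch below builds redundant block keys by sampling bit positions of a Bloom filter. It is a simplified illustration written for this tutorial, not the blocklib implementation: the function `lambda_fold_keys`, its parameter names and the key format are all assumptions.

```python
import random


def lambda_fold_keys(bits, num_tables=3, bits_per_key=5, seed=0):
    """Return one block key per table for a Bloom filter given as a list of 0/1.

    num_tables plays the role of "Lambda" and bits_per_key the role of "K".
    """
    rng = random.Random(seed)  # a shared seed makes every party sample the same positions
    keys = []
    for table in range(num_tables):
        positions = rng.sample(range(len(bits)), bits_per_key)
        signature = "".join(str(bits[p]) for p in positions)
        keys.append(f"{table}_{signature}")
    return keys


filter_a = [1, 0, 1, 1, 0, 0, 1, 0] * 8  # a made-up 64-bit filter
filter_b = list(filter_a)
filter_b[10] ^= 1                        # the same record with one flipped bit

keys_a = lambda_fold_keys(filter_a)
keys_b = lambda_fold_keys(filter_b)
shared = set(keys_a) & set(keys_b)
print(keys_a, keys_b, shared)
```

In this sketch a flipped bit changes the key only in tables that happened to sample that position, which is why using several tables makes it more likely that slightly different encodings of the same entity still share a block.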
    


**Party A Blocks its Data**

In [11]:
# NBVAL_IGNORE_OUTPUT
!anonlink block 'novt_clk_0.json' 'blocking_schema.json' 'novt_blocks_0.json'

**Party B Blocks its Data**

In [12]:
# NBVAL_IGNORE_OUTPUT
!anonlink block 'novt_clk_1.json' 'blocking_schema.json' 'novt_blocks_1.json'

**Party C Blocks its Data**

In [13]:
# NBVAL_IGNORE_OUTPUT
!anonlink block 'novt_clk_2.json' 'blocking_schema.json' 'novt_blocks_2.json'

## Get Ground Truth

In [14]:
#NBVAL_IGNORE_OUTPUT
truth = []

for party in [0, 1, 2]:
    df = pd.read_csv('data/ncvr_numrec_5000_modrec_2_ocp_20_myp_{}_nump_10.csv'.format(party))
    truth.append(pd.DataFrame({'id{}'.format(party): df.index, 'recid': df['recid']}))
    
dfj = truth[0].merge(truth[1], on='recid', how='outer')
for df in truth[2:]:
    dfj = dfj.merge(df, on='recid', how='outer')

dfj = dfj.drop(columns=['recid'])
true_matches = set()
for row in dfj.itertuples(index=False):
    cand = [(i, int(x)) for i, x in enumerate(row) if not math.isnan(x)]
    if len(cand) > 1:
        true_matches.add(tuple(cand))

print(f'we have {len(true_matches)} true matches')
e = iter(true_matches)
for i in range(10):
    print(next(e))

we have 1649 true matches
((1, 3428), (2, 822))
((0, 111), (1, 1339))
((0, 355), (1, 4570), (2, 2530))
((1, 1416), (2, 4199))
((0, 4933), (1, 2652), (2, 4928))
((0, 3443), (2, 3456))
((0, 1987), (1, 92), (2, 1989))
((0, 2924), (2, 3908))
((0, 668), (1, 2285), (2, 4149))
((0, 3339), (1, 819), (2, 826))


## Solve with Anonlink

In [15]:
# NBVAL_IGNORE_OUTPUT
clk_files = ['novt_clk_{}.json'.format(x) for x in range(3)]
block_files = ['novt_blocks_{}.json'.format(x) for x in range(3)]

clk_blocks = []

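for i, (clk_f, block_f) in enumerate(zip(clk_files, block_files)):
    print('Combining CLKs and Blocks for Party {}'.format(i))
    # Close both input files once the combined stream has been parsed
    with open(clk_f, 'rb') as clk_stream, open(block_f, 'rb') as block_stream:
        combined = json.load(combine_clks_blocks(clk_stream, block_stream))
    clk_blocks.append(combined['clknblocks'])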
    
    
clk_groups = []
rec_to_blocks = {}

for i, clk_blk in enumerate(clk_blocks):
    clk_groups.append(deserialize_filters([r[0] for r in clk_blk]))
    rec_to_blocks[i] = {rind: clk_blk[rind][1:] for rind in range(len(clk_blk))}


Combining CLKs and Blocks for Party 0
Combining CLKs and Blocks for Party 1
Combining CLKs and Blocks for Party 2


## Assess Linkage Quality

We can assess the linkage quality by precision and recall.

* Precision is the proportion of found record groups that are true matches.

* Recall is the proportion of true matching groups that were found.
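
The two measures can be sketched as follows. This is a minimal illustration that counts a found group as correct only if it equals a true group exactly; the `evaluate` function from `util.py` used below may count matches differently, and the helper name and toy groups here are made up.

```python
def precision_recall(found_groups, true_groups):
    """Compare two collections of groups, ignoring record order within a group."""
    found = {frozenset(group) for group in found_groups}
    truth = {frozenset(group) for group in true_groups}
    correct = len(found & truth)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# Made-up groups of (party, record index) pairs.
toy_truth = [((0, 1), (1, 7)), ((0, 2), (1, 3), (2, 5)), ((1, 4), (2, 9))]
toy_found = [((1, 7), (0, 1)), ((0, 2), (1, 3)), ((1, 4), (2, 9)), ((0, 8), (2, 8))]

toy_precision, toy_recall = precision_recall(toy_found, toy_truth)
print(toy_precision, toy_recall)
```

Two of the four found groups are exact matches, and two of the three true groups were recovered.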


In [16]:
#NBVAL_IGNORE_OUTPUT
threshold = 0.87

# matching with blocking
found_groups = solve(clk_groups, rec_to_blocks, threshold)
print("Example found groups: ")
for i in range(10):
    print(found_groups[i])
precision, recall = evaluate(found_groups, true_matches)
print('\n\nWith blocking: ')
print(f'precision: {precision}, recall: {recall}')

# matching without blocking
found_groups = naive_solve(clk_groups, threshold)
precision, recall = evaluate(found_groups, true_matches)
print('Without blocking: ')
print(f'precision: {precision}, recall: {recall}')

Example found groups: 
((0, 764), (2, 2877))
((0, 2492), (1, 3757), (2, 43))
((1, 914), (2, 911), (0, 190))
((0, 4895), (2, 228))
((0, 2085), (1, 3800), (2, 3841))
((1, 1195), (2, 1194))
((1, 24), (2, 4682))
((0, 2623), (1, 4274), (2, 2647))
((0, 1281), (1, 1271), (2, 4081))
((0, 4479), (2, 1767))


With blocking: 
precision: 0.7802981205443941, recall: 0.730139478471801
Without blocking: 
precision: 0.7808661926308985, recall: 0.7325651910248635


## Assess Blocking

**Reduction Ratio**

Reduction ratio measures the proportion of comparisons avoided by using blocking. If we have three data providers, each with $N$ records, then

$$\text{reduction ratio}= 1 - \frac{\text{number of comparisons after blocking}}{N^3}$$


**Set Completeness**

Set completeness (also known as pair completeness in the two-party scenario) measures how many true matches are retained after blocking. It is evaluated as

$$\text{set completeness}= \frac{\text{number of true matches after blocking}}{\text{number of all true matches}}$$
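
One way to read this formula is sketched below, under the assumption that a true match survives blocking when at least `k` of its records share a block. The helper `toy_set_completeness` and its data are hypothetical; the `set_completeness` function from `util.py` may define this differently.

```python
def toy_set_completeness(record_blocks, true_groups, k=2):
    """Fraction of true groups in which at least k records share a block.

    record_blocks maps (party, record index) to the set of block keys of that record.
    """
    kept = 0
    for group in true_groups:
        tally = {}
        for record in group:
            for key in record_blocks.get(record, ()):
                tally[key] = tally.get(key, 0) + 1
        if any(count >= k for count in tally.values()):
            kept += 1
    return kept / len(true_groups)


record_blocks = {
    (0, 1): {"a", "b"}, (1, 7): {"b"},            # share block "b"
    (0, 2): {"c"}, (1, 3): {"d"}, (2, 5): {"c"},  # two of three share block "c"
    (1, 4): {"e"}, (2, 9): {"f"},                 # no common block
}
toy_groups = [((0, 1), (1, 7)), ((0, 2), (1, 3), (2, 5)), ((1, 4), (2, 9))]
toy_sc = toy_set_completeness(record_blocks, toy_groups)
print(toy_sc)
```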



In [17]:
# NBVAL_IGNORE_OUTPUT
block_a = json.load(open('novt_blocks_0.json'))['blocks']
block_b = json.load(open('novt_blocks_1.json'))['blocks']
block_c = json.load(open('novt_blocks_2.json'))['blocks']

filtered_reverse_indices = [block_a, block_b, block_c]
# filtered_reverse_indices[0]
data = []
for party in [0, 1, 2]:
    dfa = pd.read_csv('data/ncvr_numrec_5000_modrec_2_ocp_20_myp_{}_nump_10.csv'.format(party))
    recid = dfa['recid'].values
    data.append(recid)

rr, reduced_num_comparison, naive_num_comparison = reduction_ratio(filtered_reverse_indices, data, K=2)
print('\nWith blocking, we reduced {:,} comparisons to {:,} comparisons i.e. the reduction ratio={}'
      .format(naive_num_comparison, reduced_num_comparison, rr))


With blocking, we reduced 125,000,000,000 comparisons to 3,548,164,581 comparisons i.e. the reduction ratio=0.971614683352
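
The reported ratio can be checked by hand from the two comparison counts. The snippet below simply redoes the arithmetic with the numbers printed above; the variable names are illustrative.

```python
n_records = 5000                # records per party in this tutorial
naive = n_records ** 3          # every triple of records across the three parties
after_blocking = 3_548_164_581  # count reported in the output above

ratio = 1 - after_blocking / naive
print(f"{naive:,} -> {after_blocking:,}, reduction ratio = {ratio:.12f}")
```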


In [18]:
# NBVAL_IGNORE_OUTPUT
sc = set_completeness(filtered_reverse_indices, true_matches, K=2)
print('Set completeness = {}'.format(sc))

Set completeness = 0.9727107337780473
