# Multi-party Record Linkage with Blocking

In this tutorial, we demonstrate the CLI tools for multi-party record linkage with blocking techniques.

In [16]:
import io
import os
import math
from bitarray import bitarray
import base64
import time
from IPython import display

import anonlink
import pandas as pd
import clkhash
from clkhash import clk
from clkhash.field_formats import *
from clkhash.schema import Schema
from clkhash.comparators import NgramComparison, NumericComparison, ExactComparison

Suppose we want to find the records that appear in at least two of three parties' datasets.

## Generate CLKs and Candidate Blocks

First, we have a look at the dataset:

In [2]:

df1 = pd.read_csv('data/ncvr_numrec_5000_modrec_2_ocp_0_myp_0_nump_10.csv')
df1.head()

Unnamed: 0,recid,firstname,lastname,suburb,postcode
0,1705996,joseph,amaral,red springs,28377
1,6994349,nizi,azeez,morrisville,27560
2,4871772,barbara,richmond,salisbury,28144
3,3329049,shannon,boulware,raleigh,27615
4,6101030,latwyella,maxwell,bostic,28018


A hashing schema instructs clkhash how to treat each column when generating CLKs. A detailed description of the hashing schema can be found in the API docs. We will ignore the column `recid` for CLK generation.
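
To make the role of the `ignored` flag concrete, here is a minimal sketch that splits the features of a schema into hashed and ignored columns. The schema text below is a shortened, hypothetical fragment in the same shape as the `features` list of `novt_schema.json`, and `split_features` is a helper written for this illustration only; it is not part of clkhash.

```python
import json

# Hypothetical, shortened fragment shaped like the "features" list of
# novt_schema.json (only two features are shown).
schema_text = """
{
  "features": [
    {"identifier": "recid", "ignored": true},
    {"identifier": "firstname",
     "format": {"type": "string", "encoding": "utf-8"},
     "hashing": {"comparison": {"type": "ngram", "n": 2},
                 "strategy": {"bitsPerFeature": 100}}}
  ]
}
"""


def split_features(schema):
    """Return (hashed, ignored) lists of feature identifiers."""
    hashed, ignored = [], []
    for feature in schema["features"]:
        # A feature without an "ignored" key is assumed to be hashed.
        if feature.get("ignored", False):
            ignored.append(feature["identifier"])
        else:
            hashed.append(feature["identifier"])
    return hashed, ignored


hashed, ignored = split_features(json.loads(schema_text))
print("hashed:", hashed)
print("ignored:", ignored)
```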

In [3]:
with open("novt_schema.json") as f:
    print(f.read())


{
  "version": 3,
  "clkConfig": {
    "l": 1024,
    "kdf": {
      "type": "HKDF",
      "hash": "SHA256",
      "salt": "SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==",
      "info": "c2NoZW1hX2V4YW1wbGU=",
      "keySize": 64
    }
  },
  "features": [
        {
      "identifier": "recid",
      "ignored": true
    },
    {
      "identifier": "firstname",
      "format": {
        "type": "string",
        "encoding": "utf-8",
        "maxLength": 30,
        "case": "lower"
      },
      "hashing": {
        "comparison": {"type":  "ngram", "n":  2},
        "strategy": {"bitsPerFeature":  100},
        "hash": {"type": "blakeHash"},
        "missingValue": {
          "sentinel": ".",
          "replaceWith": ""
        }
      }
    },
    {
      "identifier": "lastname",
      "format": {
        "type": "string",
        "encoding": "utf-8",
        "maxLength": 30,
        "case": "lower"
      },
      "hashing": {
        "com

### Validate the schema
The command line tool can check that the linkage schema is valid:

In [4]:
!anonlink validate-schema "novt_schema.json"

schema is valid


### Hash data
We can now hash our Personally Identifiable Information (PII) from the CSV file using our defined linkage schema. We must provide a secret key to this command - the same key has to be used by all parties hashing data. For this toy example we will use the secret 'secret'. For real data, make sure that the secret contains enough entropy, as knowledge of this secret is sufficient to reconstruct the PII from a CLK!
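
To illustrate why every party has to use the same secret, here is a toy sketch of keyed Bloom filter encoding. It is not clkhash's actual algorithm (the schema above uses HKDF key derivation and `blakeHash`); the function names, the 64-bit filter length, and the use of HMAC-SHA256 are assumptions made purely for illustration.

```python
import hashlib
import hmac


def bigrams(value):
    """Split a string into overlapping 2-grams, padded with one space each side."""
    padded = f" {value} "
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def toy_encode(value, secret, num_bits=64, hashes_per_gram=2):
    """Set bits in a small Bloom filter using HMAC-SHA256 keyed by `secret`."""
    bits = [0] * num_bits
    key = secret.encode("utf-8")
    for gram in bigrams(value):
        for k in range(hashes_per_gram):
            msg = f"{k}:{gram}".encode("utf-8")
            digest = hmac.new(key, msg, hashlib.sha256).digest()
            # Map the first 8 bytes of the digest to a bit position.
            bits[int.from_bytes(digest[:8], "big") % num_bits] = 1
    return bits


same_a = toy_encode("joseph", "secret")
same_b = toy_encode("joseph", "secret")
other_key = toy_encode("joseph", "another secret")

# The same value and the same secret always give the same filter.
print(same_a == same_b)
# With a different secret the bit positions change, so the filters
# are very unlikely to be comparable.
print(same_a == other_key)
```

Encodings are only comparable across parties when they are produced with the same key, which is why the `anonlink hash` commands in this tutorial all use the same secret.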

In [5]:
!anonlink hash 'data/ncvr_numrec_5000_modrec_2_ocp_0_myp_0_nump_10.csv' secret 'novt_schema.json' 'novt_clk_0.json'

CLK data written to novt_clk_0.json


Let's hash the data for parties B and C:

In [6]:
!anonlink hash 'data/ncvr_numrec_5000_modrec_2_ocp_0_myp_1_nump_10.csv' secret 'novt_schema.json' 'novt_clk_1.json'

CLK data written to novt_clk_1.json


In [7]:
!anonlink hash 'data/ncvr_numrec_5000_modrec_2_ocp_0_myp_2_nump_10.csv' secret 'novt_schema.json' 'novt_clk_2.json'

CLK data written to novt_clk_2.json


In [8]:
!anonlink describe 'novt_clk_0.json'

[Figure: histogram of CLK popcounts]

### Block dataset


In [9]:
with open("blocking_schema.json") as f:
    print(f.read())

{
    "type": "lambda-fold",
    "version": 1,
    "config": {
        "blocking-features": [1, 2],
        "Lambda": 5,
        "bf-len": 2048,
        "num-hash-funcs": 10,
        "K": 80,
        "input-clks": false,
        "random_state": 0
    }
}


In [10]:
!anonlink block 'data/ncvr_numrec_5000_modrec_2_ocp_0_myp_0_nump_10.csv' 'blocking_schema.json' 'novt_blocks_0.json'

Number of Blocks:   20584
Minimum Block Size: 1
Maximum Block Size: 44
Average Block Size: 1
Median Block Size:  1
Standard Deviation of Block Size:  0.834023740486585


In [11]:
!anonlink block 'data/ncvr_numrec_5000_modrec_2_ocp_0_myp_1_nump_10.csv' 'blocking_schema.json' 'novt_blocks_1.json'

Number of Blocks:   20525
Minimum Block Size: 1
Maximum Block Size: 38
Average Block Size: 1
Median Block Size:  1
Standard Deviation of Block Size:  0.8738307514471261


In [12]:
!anonlink block 'data/ncvr_numrec_5000_modrec_2_ocp_0_myp_2_nump_10.csv' 'blocking_schema.json' 'novt_blocks_2.json'

Number of Blocks:   20559
Minimum Block Size: 1
Maximum Block Size: 35
Average Block Size: 1
Median Block Size:  1
Standard Deviation of Block Size:  0.8066635755263701
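
The block statistics printed above can be approximated directly from a blocks file. The sketch below assumes the structure used later in this tutorial, where the `blocks` entry maps each block key to a list of record ids. `block_size_summary` is a hypothetical helper, and the choice of the population standard deviation is an assumption; the CLI may compute or round these values differently.

```python
import statistics


def block_size_summary(blocks):
    """Summarise block sizes for a mapping of block key -> list of record ids."""
    sizes = [len(rec_ids) for rec_ids in blocks.values()]
    return {
        "number_of_blocks": len(sizes),
        "min": min(sizes),
        "max": max(sizes),
        "mean": statistics.mean(sizes),
        "median": statistics.median(sizes),
        # Population standard deviation (an assumption, see the note above).
        "stdev": statistics.pstdev(sizes),
    }


# Toy data in the same shape as json.load(f)['blocks'].
toy_blocks = {"k1": [0, 1, 2], "k2": [1], "k3": [3], "k4": [0, 4]}
summary = block_size_summary(toy_blocks)
for name, value in summary.items():
    print(f"{name}: {value}")
```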


## Upload CLKs and Blocks to Server

## Get Ground Truth

In [41]:
truth = []

for party in [0, 1, 2]:
    df = pd.read_csv('data/ncvr_numrec_5000_modrec_2_ocp_0_myp_{}_nump_10.csv'.format(party))
    truth.append(pd.DataFrame({'id{}'.format(party): df.index, 'recid': df['recid']}))
    
dfj = truth[0].merge(truth[1], on='recid', how='outer')
for df in truth[2:]:
    dfj = dfj.merge(df, on='recid', how='outer')

dfj = dfj.drop(columns=['recid'])
true_matches = set()
for row in dfj.itertuples(index=False):
    cand = [(i, int(x)) for i, x in enumerate(row) if not math.isnan(x)]
    if len(cand) > 1:
        true_matches.add(tuple(cand))

print(f'we have {len(true_matches)} true matches')

we have 1649 true matches
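
The same grouping idea can be shown without pandas on a toy input. This is a minimal sketch, assuming that each `recid` occurs at most once per party; `true_groups` is a hypothetical helper and the record ids are made up.

```python
from collections import defaultdict


def true_groups(party_recids):
    """Group (party index, row index) pairs that share a recid across parties."""
    by_recid = defaultdict(list)
    for party, recids in enumerate(party_recids):
        for row, recid in enumerate(recids):
            by_recid[recid].append((party, row))
    # Keep only entities that were seen in two or more parties.
    return {tuple(members) for members in by_recid.values() if len(members) > 1}


toy = [
    [101, 102, 103],  # party 0
    [103, 104],       # party 1
    [102, 103, 105],  # party 2
]
groups = true_groups(toy)
print(sorted(groups))
```

Here record 102 is shared by parties 0 and 2, and record 103 by all three parties, so two groups are returned; records 101, 104 and 105 occur in one party only and are dropped.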


## Solve with Anonlink

In [52]:
import json
import anonlink


def combine_clks_blocks(clk_f, block_f):
    """Combine CLKs and blocks into a list with one entry per record: [clk, block_id, ...]."""
    try:
        blocks = json.load(block_f)['blocks']
        clks = json.load(clk_f)['clks']
    except ValueError as e:
        msg = 'Invalid CLKs or Blocks'
        raise ValueError(msg) from e

    output = [[clk] for clk in clks]

    for block_key, rec_ids in blocks.items():
        for rid in rec_ids:
            output[rid].append(block_key)
    return output


def deserialize_bitarray(bytes_data):
    ba = bitarray(endian='big')
    data_as_bytes = base64.decodebytes(bytes_data.encode())
    ba.frombytes(data_as_bytes)
    return ba


def deserialize_filters(filters):
    # Deserialize every base64-encoded filter into a bitarray
    return [deserialize_bitarray(f) for f in filters]


def solve(encodings, rec_to_blocks, threshold: float = 0.8):
    """ entity resolution, baby

    calls anonlink to do the heavy lifting.

    :param encodings: a sequence of lists of Bloom filters (bitarray). One for each data provider
    :param rec_to_blocks: a sequence of dictionaries, mapping a record id to the list of blocks it is part of. Again,
                          one per data provider, same order as encodings.
    :param threshold: similarity threshold for solving
    :return: same as the anonlink solver.
             A sequence of groups. Each group is a sequence of
             records. Two records are in the same group iff they represent
             the same entity. Here, a record is a two-tuple of dataset index
             and record index.
    """
    def my_blocking_f(ds_idx, rec_idx, _):
        return rec_to_blocks[ds_idx][rec_idx]

    candidate_pairs = anonlink.candidate_generation.find_candidate_pairs(
        encodings,
        anonlink.similarities.dice_coefficient,
        threshold=threshold,
        blocking_f=my_blocking_f)
    # Use the probabilistic greedy solver so that duplicates can be removed. This is not configurable
    # with the native greedy solver.
    return anonlink.solving.probabilistic_greedy_solve(candidate_pairs, merge_threshold=1.0)
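
The similarity function passed to the candidate generation is the Dice coefficient, which for two bit vectors A and B is 2|A ∩ B| / (|A| + |B|). As a minimal sketch of that formula (not anonlink's implementation, which operates on bitarrays), the hypothetical function below represents each filter as a Python integer:

```python
def dice(a, b):
    """Dice coefficient of two bit vectors given as Python ints."""
    pop_a = bin(a).count("1")
    pop_b = bin(b).count("1")
    if pop_a + pop_b == 0:
        # Two empty filters: define the similarity as 0 to avoid dividing by zero.
        return 0.0
    common = bin(a & b).count("1")
    return 2 * common / (pop_a + pop_b)


x = 0b11110000
y = 0b11010001
print(dice(x, y))  # 0.75: three shared bits, four bits set in each filter
print(dice(x, x))  # 1.0
```

A pair with a score of 0.75 would be discarded at the threshold of 0.9 used in this tutorial.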

In [53]:
clk_files = ['novt_clk_{}.json'.format(x) for x in range(3)]
block_files = ['novt_blocks_{}.json'.format(x) for x in range(3)]

clk_blocks = []

for i, (clk_f, block_f) in enumerate(zip(clk_files, block_files)):
    print('Combining CLKs and Blocks for Party {}'.format(i))
    # Close both files once they have been read
    with open(clk_f, 'rb') as clk_fh, open(block_f, 'rb') as block_fh:
        clk_blocks.append(combine_clks_blocks(clk_fh, block_fh))
    
    
clk_groups = []
rec_to_blocks = {}

for i, clk_blk in enumerate(clk_blocks):
    clk_groups.append(deserialize_filters([r[0] for r in clk_blk]))
    rec_to_blocks[i] = {rind: clk_blk[rind][1:] for rind in range(len(clk_blk))}

    

Combining CLKs and Blocks for Party 0
Combining CLKs and Blocks for Party 1
Combining CLKs and Blocks for Party 2


In [54]:
threshold = 0.9
found_groups = solve(clk_groups, rec_to_blocks, threshold)
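
To see what blocking contributes to this step, the sketch below enumerates the candidate pairs between two parties when only records sharing at least one block key are compared. It illustrates the general idea under the `rec_to_blocks` layout used above (record index mapped to a list of block keys); `candidate_pairs` is a hypothetical helper and not how anonlink generates candidates internally.

```python
def candidate_pairs(rec_to_blocks_a, rec_to_blocks_b):
    """Pairs (i, j) of records from two parties that share at least one block key."""
    # Invert party B: block key -> set of record indices.
    index_b = {}
    for j, keys in rec_to_blocks_b.items():
        for key in keys:
            index_b.setdefault(key, set()).add(j)

    pairs = set()
    for i, keys in rec_to_blocks_a.items():
        for key in keys:
            for j in index_b.get(key, ()):
                pairs.add((i, j))
    return pairs


party_a = {0: ["b1", "b2"], 1: ["b3"], 2: ["b4"]}
party_b = {0: ["b2"], 1: ["b3", "b5"], 2: ["b6"]}
pairs = candidate_pairs(party_a, party_b)
print(sorted(pairs))  # [(0, 0), (1, 1)]
print(len(pairs), "of", len(party_a) * len(party_b), "possible pairs")
```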

## Assess Precision and Recall

In [55]:
tp = len([x for x in found_groups if x in true_matches])
fp = len([x for x in found_groups if x not in true_matches])
fn = len([x for x in true_matches if x not in found_groups])

precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f'precision: {precision}, recall: {recall}')

precision: 0.9993939393939394, recall: 1.0
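
The same counts can be computed with set operations, which avoids scanning a list for every membership test. This is a minimal sketch, assuming that groups are hashable tuples with their records in the same order on both sides; `precision_recall` is a hypothetical helper and the groups below are made up.

```python
def precision_recall(found, truth):
    """Precision and recall over exactly matching groups."""
    found, truth = set(found), set(truth)
    tp = len(found & truth)
    fp = len(found - truth)
    fn = len(truth - found)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


toy_truth = {((0, 1), (1, 0)), ((0, 2), (1, 3), (2, 1))}
toy_found = [((0, 1), (1, 0)), ((0, 2), (1, 3), (2, 1)), ((0, 5), (2, 7))]
p, r = precision_recall(toy_found, toy_truth)
print(f"precision: {p:.3f}, recall: {r:.3f}")  # precision: 0.667, recall: 1.000
```

Note that a group counts as a true positive only if it matches a true group exactly; a group that is partially correct contributes one false positive and one false negative.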


## Assess Reduction Ratio

In [56]:
import json
import pandas as pd
from blocklib import assess_blocks_2party

block_a = json.load(open('novt_blocks_0.json'))['blocks']
block_b = json.load(open('novt_blocks_1.json'))['blocks']
block_c = json.load(open('novt_blocks_2.json'))['blocks']

In [57]:
data = []
for party in [0, 1, 2]:
    dfa = pd.read_csv('data/ncvr_numrec_5000_modrec_2_ocp_0_myp_{}_nump_10.csv'.format(party))
    recid = dfa['recid'].values
    data.append(recid)

In [58]:
from collections import defaultdict

def reduction_ratio(filtered_reverse_indices, data, K):
    """Assess reduction ratio for multiple parties."""
    naive_num_comparison = 1
    for d in data:
        naive_num_comparison *= len(d)

    block_keys = defaultdict(int)  # type: Dict[Any, int]
    for reversed_index in filtered_reverse_indices:
        for key in reversed_index:
            block_keys[key] += 1
    final_block_keys = [key for key, count in block_keys.items() if count >= K]

    reduced_num_comparison = 0
    for key in final_block_keys:
        num_comparison = 1
        for reversed_index in filtered_reverse_indices:
            index = reversed_index.get(key, [0])
            num_comparison *= len(index)
        reduced_num_comparison += num_comparison
    rr = 1 - reduced_num_comparison / naive_num_comparison
    return rr, reduced_num_comparison, naive_num_comparison

rr, reduced_num_comparison, naive_num_comparison = reduction_ratio([block_a, block_b, block_c], data, K=3)

print('\nWith blocking, we reduced {:,} comparisons to {:,} comparisons i.e. the reduction ratio={}'
      .format(naive_num_comparison, reduced_num_comparison, rr))


With blocking, we reduced 125,000,000,000 comparisons to 189,271 comparisons i.e. the reduction ratio=0.999998485832
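
As a quick check of the arithmetic: each of the three parties holds 5,000 records, so the naive number of three-way comparisons is 5000³, and the reduction ratio is one minus the fraction of comparisons that remain after blocking.

```python
records_per_party = 5000
num_parties = 3

naive = records_per_party ** num_parties
reduced = 189_271  # number of comparisons reported above

rr = 1 - reduced / naive
print(f"{naive:,}")   # 125,000,000,000
print(round(rr, 12))  # 0.999998485832
```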
