# Tutorial: Privacy Preserving Record Linkage


This notebook demonstrates the full process of carrying out federated privacy preserving record linkage.

The sections are usually run by different organisations, but for illustration all is carried out in this one file.

The server carrying out the record linkage:

In [71]:
server = "http://localhost:8851"
#server = "https://testing.es.data61.xyz"

In [72]:
!python -m clkhash status --server={server}

{"project_count": 32, "status": "ok", "rate": 3454831}


Now we need some entity information to match. For testing purposes this tool can generate fake data:

In [17]:
# Generate some fake PII data
!clkutil generate 2000 raw_pii_2k.csv

# Split the fake PII data into somewhat overlapping alice and bob sets
!head -n 1 raw_pii_2k.csv > alice.txt
!tail -n 1500 raw_pii_2k.csv >> alice.txt
!head -n 1000 raw_pii_2k.csv > bob.txt

Now we need to define a linkage schema.

In [18]:
!head -n 2 raw_pii_2k.csv

INDEX,NAME freetext,DOB YYYY/MM/DD,GENDER M or F
0,Damian Calendine,2009/01/12,M


The `clkhash` command line tool can output the schema for this PII (as it generated them):


In [22]:
!python -m clkhash generate-default-schema schema.json

## Integration Authority

Creates a project and distributes upload credentials to the data providers.

In [77]:
!python -m clkhash create-project -v --server={server} --schema=schema.json --output credentials.json --name "tutorial project"

[31mEntity Matching Server: http://localhost:8851[0m
[31mProject created[0m


In [78]:
import json
with open('credentials.json','r') as f:
    credentials = json.load(f)
    
!cat credentials.json

{"project_id": "54794c8871d2e44d4ddc78a2935cb6d1a5f40429f6e90271", "update_tokens": ["dc4ce36600795200be91fc6045ee4896d047fcba1e273d21", "37962227b1a9294abe09d813cf5abc4fa154e8ded5a71898"], "result_token": "651c3cf9747743e07ef3bc6d7cf19df477743b1e410d0ec0"}


In [74]:
!tail -n 2 bob.txt

997,Johny Suchocki,1963/07/24,M
998,Anita Watah,1987/04/25,F


We have generated *raw* identiy information. Looking at the help for the `upload` command we see that we first have to hash the raw entity information.

In [75]:
!python -m clkhash upload --help

Usage: __main__.py upload [OPTIONS] INPUT

  Upload CLK data to entity matching server.

  Given a json file containing hashed clk data as INPUT, upload to the
  entity resolution service.

  Use "-" to read from stdin.

Options:
  --project TEXT         Project identifier
  --apikey TEXT          Authentication API key for the server.
  --server TEXT          Server address including protocol
  -o, --output FILENAME
  -v, --verbose          Script is more talkative
  --help                 Show this message and exit.


In [27]:
# Hash the data using the secret keys, which the linkage authority doesn't know
!clkutil hash alice.txt horse staple schema.json alice-hashed.json
!clkutil hash bob.txt horse staple schema.json bob-hashed.json

generating CLKs: 100%|█| 1.50K/1.50K [00:00<00:00, 1.70Kclk/s, mean=406, std=22.4]
[31mCLK data written to alice-hashed.json[0m
generating CLKs: 100%|███| 999/999 [00:00<00:00, 2.27Kclk/s, mean=406, std=22.2]
[31mCLK data written to bob-hashed.json[0m


In [28]:
!ls -lsh alice*

260K -rwxrwxrwx 1 root root 258K Jul  4 14:46 alice-hashed.json
 52K -rwxrwxrwx 1 root root  50K Jul  4 10:36 alice.txt


In [81]:
out = !python -m clkhash upload -v --server="http://localhost:8851" --project="{credentials['project_id']}" --apikey="{credentials['update_tokens'][0]}" alice-hashed.json

Every upload gets a receipt token. In some operating modes this receipt is required to access the results. For ease of use lets save this so we can use it later.

In [25]:
print(out)
mid = "9f2ae583c5aa379f7ad41c2abbbbcdeae63940f4a269af34"
alice_receipt_token = "e7595ad375669a7cd7a199ebaf0569a263dcaca940f5371d"

['Uploading CLK data from bob-hashed.json', 'To Entity Matching Server: https://es.data61.xyz', 'Mapping ID: 9f2ae583c5aa379f7ad41c2abbbbcdeae63940f4a269af34', 'Checking server status', 'Status: ok', 'Uploading CLK data to the server', '{', '    "receipt-token": "b5af03da6f3cb3a83816901272351f7d2fe81cce81c74815",', '    "message": "Updated"', '}', '']


In [12]:
# Upload Bob's data
out = !clkutil upload \
    --mapping="{credentials['resource_id']}" \
    --apikey="{credentials['update_tokens'][1]}" \
    bob-hashed.json
    
out

['Uploading CLK data from bob-hashed.json',
 'To Entity Matching Server: https://es.data61.xyz',
 'Mapping ID: 9f2ae583c5aa379f7ad41c2abbbbcdeae63940f4a269af34',
 'Checking server status',
 'Status: ok',
 'Uploading CLK data to the server',
 '{',
 '    "receipt-token": "b5af03da6f3cb3a83816901272351f7d2fe81cce81c74815",',
 '    "message": "Updated"',
 '}',
 '']

In [26]:
bob_receipt_token = "b5af03da6f3cb3a83816901272351f7d2fe81cce81c74815"

In [27]:
# Now after some delay (depending on the size) we can fetch the mask
!clkutil results \
    --mapping="{credentials['resource_id']}" \
    --apikey="{credentials['result_token']}" --output results.txt

[31mChecking server status[0m
[31mStatus: ok[0m
[31mResponse code: 200[0m
[31mReceived result[0m


In [28]:
import json
with open('results.txt','r') as f:
    mask = json.load(f)['mask']

In [38]:
print(mask[:10])

[1, 1, 1, 1, 1, 0, 1, 0, 1, 0]


In [29]:
import requests
url = 'https://es.data61.xyz/api/v1'

alice_res = requests.get('{}/mappings/{}'.format(url, mid), headers={'Authorization': alice_receipt_token}).json()
bob_res = requests.get('{}/mappings/{}'.format(url, mid), headers={'Authorization': bob_receipt_token}).json()

Now Alice and Bob both have a new permutation&mdash;a new ordering for their data.

In [31]:
alice_permutation = alice_res['permutation']
alice_permutation[:10]

[648, 147, 262, 916, 36, 189, 274, 89, 67, 0]

In [32]:
bob_permutation = bob_res['permutation']
bob_permutation[:10]

[847, 767, 749, 485, 513, 286, 141, 782, 545, 762]

In [33]:
def reorder(items, order):
    """
    Assume order is a list of new index
    """
    neworder = items.copy()
    for item, newpos in zip(items, order):
        neworder[newpos] = item
    
    return neworder

In [34]:
with open('alice.txt', 'r') as f:
    alice_raw = f.readlines()
    alice_reordered = reorder(alice_raw, alice_permutation)

with open('bob.txt', 'r') as f:
    bob_raw = f.readlines()
    bob_reordered = reorder(bob_raw, bob_permutation)

In [35]:
alice_reordered[:10]

['508,Alysha Lesly,1920/04/10,F\n',
 '772,Addilynn Kasprowicz,1946/05/03,F\n',
 '536,Murry Cothran,1967/10/03,M\n',
 '715,Wallace Hillier,1950/08/22,M\n',
 '861,Marcie Obierne,1992/08/10,F\n',
 '1591,Donte Nuth,1971/11/20,M\n',
 '694,Joana Nesselrodt,1980/05/03,F\n',
 '1169,Woodson Clum,1938/03/20,M\n',
 '686,Allie Fludd,1920/07/04,F\n',
 '1317,Mel Neveu,1968/01/05,M\n']

In [36]:
bob_reordered[:10]

['508,Alysha Lesly,1920/04/10,F\n',
 '772,Addilynn Kasprowicz,1946/05/03,F\n',
 '536,Murry Cothran,1967/10/03,M\n',
 '715,Wallace Hillier,1950/08/22,M\n',
 '861,Marcie Obierne,1992/08/10,F\n',
 '495,Lynsey Boyda,2004/06/15,F\n',
 '694,Joana Nesselrodt,1980/05/03,F\n',
 '214,Amari Ruland,1922/04/01,M\n',
 '686,Allie Fludd,1920/07/04,F\n',
 '347,Gaynell Seedborg,1922/11/16,F\n']

In [37]:
for i, m in enumerate(mask[:20]):
    if m:
        print(alice_reordered[i].strip(), alice_reordered[i] == bob_reordered[i])

508,Alysha Lesly,1920/04/10,F True
772,Addilynn Kasprowicz,1946/05/03,F True
536,Murry Cothran,1967/10/03,M True
715,Wallace Hillier,1950/08/22,M True
861,Marcie Obierne,1992/08/10,F True
694,Joana Nesselrodt,1980/05/03,F True
686,Allie Fludd,1920/07/04,F True
842,Shona Kalathas,1999/06/27,F True
953,Bessie Moderski,1944/03/14,M True
602,Andon Quicksey,1997/12/09,M True
851,Camryn Greenstreet,1959/05/02,F True
