# Entity Service Permutation Output

This tutorial demonstrates the workflow for private record linkage using the entity service. Two parties, _Alice_ and _Bob_, each hold a dataset of personally identifiable information (PII) about several entities. They want to learn the linkage of corresponding entities between their respective datasets with the help of the entity service and an independent party, the _Analyst_.

The chosen output type is `permutations`, which consists of two permutations and one mask.


### Who learns what?

After the linkage has been carried out Alice and Bob will be able to retrieve a `permutation` - a reordering of their respective data sets such that shared entities line up.

The Analyst - who creates the linkage project - learns the `mask`. The mask is a binary vector that indicates which rows in the permuted data sets are aligned. Note this reveals how many entities are shared.
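
To make the roles of the permutations and the mask concrete, here is a minimal sketch on toy data. The names, permutations and mask are made up for illustration; the only assumption carried over from this tutorial is that entry `i` of a permutation gives the new position of row `i`.

```python
# Toy data: each party holds three records; "bob" and "eve" are shared.
alice_rows = ["ann", "bob", "eve"]
bob_rows = ["eve", "zed", "bob"]

# Hypothetical outputs of a linkage run (not produced by the entity service).
alice_perm = [2, 0, 1]   # Alice's row 0 moves to position 2, row 1 to 0, ...
bob_perm = [1, 2, 0]     # Bob's row 0 moves to position 1, row 1 to 2, ...
mask = [1, 1, 0]         # only the Analyst sees this


def apply_permutation(rows, perm):
    """Move rows[i] to position perm[i] and return the new list."""
    out = [None] * len(rows)
    for row, new_pos in zip(rows, perm):
        out[new_pos] = row
    return out


alice_aligned = apply_permutation(alice_rows, alice_perm)
bob_aligned = apply_permutation(bob_rows, bob_perm)

# Rows at the same position line up wherever the mask is 1.
for a, b, m in zip(alice_aligned, bob_aligned, mask):
    print(a, b, "match" if m else "no match")
```

Neither data provider can tell from their own permutation which rows matched, and the Analyst sees only the mask, never the rows.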


### Steps
These steps are usually run by different companies, but for illustration everything is carried out in this one file. The participants providing data are _Alice_ and *Bob*, with the *Analyst* acting as the integration authority.

* [Check connection to Entity Service](#check_con)
* [Data preparation](#data_prep)
  * Write CSV files with PII
  * [Create a Linkage Schema](#schema_prep)
* [Create Linkage Project](#create_pro)
* [Generate CLKs from PII](#hash_n_up)
* [Upload the CLKs](#hash_n_up)
* [Create a run](#create_run)
* [Retrieve and analyse results](#results)

<a id="check_con"></a>
## Check Connection

If you are connecting to a custom entity service, change the address here.

In [1]:
url = 'https://testing.es.data61.xyz'

In [2]:
!clkutil status --server "{url}"

Connecting to Entity Matching Server: https://testing.es.data61.xyz
Response: 200
Status: ok
{"project_count": 1659, "rate": 532061, "status": "ok"}


<a id="data_prep"></a>
## Data preparation

Following the [clkhash tutorial](http://clkhash.readthedocs.io/en/latest/tutorial_cli.html) we will use a dataset from the `recordlinkage` library. We will just write both datasets out to temporary CSV files.

If you are following along yourself you may have to adjust the file names in all the `!clkutil` commands.

In [3]:
from tempfile import NamedTemporaryFile
from recordlinkage.datasets import load_febrl4

In [4]:
dfA, dfB = load_febrl4()

a_csv = NamedTemporaryFile('w')
a_clks = NamedTemporaryFile('w', suffix='.json')
dfA.to_csv(a_csv)
a_csv.seek(0)

b_csv = NamedTemporaryFile('w')
b_clks = NamedTemporaryFile('w', suffix='.json')
dfB.to_csv(b_csv)
b_csv.seek(0)

print("Datasets written to {} and {}".format(a_csv.name, b_csv.name))

Datasets written to /var/folders/mw/21b9jb5d1c9_3_z0dq7hpx1m00j_0b/T/tmpsys8qk24 and /var/folders/mw/21b9jb5d1c9_3_z0dq7hpx1m00j_0b/T/tmpx9a3sf8_


In [5]:
dfA.head()

rec_id,given_name,surname,street_number,address_1,address_2,suburb,postcode,state,date_of_birth,soc_sec_id
rec-1070-org,michaela,neumann,8,stanley street,miami,winston hills,4223,nsw,19151111,5304218
rec-1016-org,courtney,painter,12,pinkerton circuit,bega flats,richlands,4560,vic,19161214,4066625
rec-4405-org,charles,green,38,salkauskas crescent,kela,dapto,4566,nsw,19480930,4365168
rec-1288-org,vanessa,parr,905,macquoid place,broadbridge manor,south grafton,2135,sa,19951119,9239102
rec-3585-org,mikayla,malloney,37,randwick road,avalind,hoppers crossing,4552,vic,19860208,7207688


<a id="schema_prep"></a>
## Schema Preparation

The linkage schema must be agreed on by the two parties. A hashing schema tells clkhash how to treat each column when generating CLKs. A detailed description of the hashing schema can be found in the [API docs](http://clkhash.readthedocs.io/en/latest/schema.html). We will ignore the columns `rec_id` and `soc_sec_id` for CLK generation.

In [6]:
schema = NamedTemporaryFile('wt')

In [30]:
%%writefile {schema.name}
{
  "version": 1,
  "clkConfig": {
    "l": 1024,
    "k": 30,
    "hash": {
      "type": "doubleHash"
    },
    "kdf": {
      "type": "HKDF",
      "hash": "SHA256",
      "info": "c2NoZW1hX2V4YW1wbGU=",
      "salt": "SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==",
      "keySize": 64
    }
  },
  "features": [
    {
      "identifier": "rec_id",
      "ignored": true
    },
    {
      "identifier": "given_name",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 1 }
    },
    {
      "identifier": "surname",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 1 }
    },
    {
      "identifier": "street_number",
      "format": { "type": "integer" },
      "hashing": { "ngram": 1, "positional": true, "weight": 0.5, "missingValue": {"sentinel": ""} }
    },
    {
      "identifier": "address_1",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 0.5 }
    },
    {
      "identifier": "address_2",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 0.5 }
    },
    {
      "identifier": "suburb",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 0.5 }
    },
    {
      "identifier": "postcode",
      "format": { "type": "integer", "minimum": 100, "maximum": 9999 },
      "hashing": { "ngram": 1, "positional": true, "weight": 0.5 }
    },
    {
      "identifier": "state",
      "format": { "type": "string", "encoding": "utf-8", "maxLength": 3 },
      "hashing": { "ngram": 2, "weight": 1 }
    },
    {
      "identifier": "date_of_birth",
      "format": { "type": "integer" },
      "hashing": { "ngram": 1, "positional": true, "weight": 1, "missingValue": {"sentinel": ""} }
    },
    {
      "identifier": "soc_sec_id",
      "ignored": true
    }
  ]
}

Overwriting /var/folders/mw/21b9jb5d1c9_3_z0dq7hpx1m00j_0b/T/tmpyrcurwsb
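
Before hashing, it can be worth double-checking which columns the schema will actually use. The sketch below runs on an abbreviated, hypothetical excerpt of a `features` list rather than the full schema file, and relies only on the `identifier` and `ignored` keys shown above.

```python
import json

# Abbreviated, hypothetical excerpt of a schema's "features" list.
schema_text = """
{
  "features": [
    {"identifier": "rec_id", "ignored": true},
    {"identifier": "given_name", "hashing": {"ngram": 2, "weight": 1}},
    {"identifier": "street_number", "hashing": {"ngram": 1, "weight": 0.5}},
    {"identifier": "soc_sec_id", "ignored": true}
  ]
}
"""

features = json.loads(schema_text)["features"]

# A feature contributes to the CLK unless it is explicitly marked as ignored.
hashed = [f["identifier"] for f in features if not f.get("ignored", False)]
ignored = [f["identifier"] for f in features if f.get("ignored", False)]

print("hashed: ", hashed)
print("ignored:", ignored)
```

The same check can be pointed at the real schema by replacing `json.loads(schema_text)` with `json.load` on the open schema file.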


<a id="create_pro"></a>
## Create Linkage Project

The analyst carrying out the linkage starts by creating a linkage project of the desired output type with the Entity Service.


In [31]:
creds = NamedTemporaryFile('wt')
print("Credentials will be saved in", creds.name)

!clkutil create-project --schema "{schema.name}" --output "{creds.name}" --type "permutations" --server "{url}"
creds.seek(0)

import json
with open(creds.name, 'r') as f:
    credentials = json.load(f)

project_id = credentials['project_id']
credentials

Credentials will be saved in /var/folders/mw/21b9jb5d1c9_3_z0dq7hpx1m00j_0b/T/tmpblu34r48
Project created


{'project_id': 'dd7dd905e7092ff4d3e41b0dbfba727bdc1b13a7f221e0e0',
 'result_token': '8ea8acb1739d8bf9cb30e63bd655cc83db25725bc9f0268e',
 'update_tokens': ['69cf6bd636072842a9a6d622d68c1fc2a2250d2bad06778e',
  '24d9e7a9f0e7b3e301b0f9cff22853eb8edfa1abf7c6f181']}

**Note:** the analyst will need to pass on the `project_id` (the id of the linkage project) and one of the two `update_tokens` to each data provider.
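
One way to picture this hand-over is to split the credentials into one bundle per data provider, while the analyst keeps the `result_token`. This is a minimal sketch with placeholder values; `split_credentials` is a hypothetical helper, not part of `clkutil` or clkhash.

```python
# Placeholder credentials with the same keys as the dictionary shown above.
credentials_example = {
    "project_id": "example-project-id",
    "result_token": "example-result-token",
    "update_tokens": ["example-token-alice", "example-token-bob"],
}


def split_credentials(creds):
    """Return the analyst's result token and one bundle per data provider.

    Each bundle holds only what a data provider needs to upload:
    the project id and a single update token.
    """
    bundles = [
        {"project_id": creds["project_id"], "update_token": token}
        for token in creds["update_tokens"]
    ]
    return creds["result_token"], bundles


analyst_token, (alice_bundle, bob_bundle) = split_credentials(credentials_example)
print(alice_bundle)
print(bob_bundle)
```

Keeping the `result_token` out of the bundles matters: it is the token that gives access to the mask.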

<a id="hash_n_up"></a>
## Hash and Upload

At the moment both data providers have *raw* personally identifiable information. We first have to generate CLKs from the raw entity information. We need:
- the *clkhash* library
- the linkage schema from above
- two secret passwords known only to Alice and Bob (here: `horse` and `staple`)

Please see [clkhash](https://clkhash.readthedocs.io/) documentation for further details on this.
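
For intuition about what a CLK is, the following is a deliberately simplified sketch of a keyed Bloom filter built from bigrams. It is not clkhash's implementation: it skips key derivation, per-field weights and positional n-grams, and `toy_clk` and `bigrams` are hypothetical names. The defaults `l=1024` and `k=30` merely echo the schema above.

```python
import hashlib
import hmac


def bigrams(value):
    """Split a string into overlapping 2-grams, padded with one space each side."""
    padded = " {} ".format(value)
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def toy_clk(fields, secret, l=1024, k=30):
    """Return the set of bit positions set in a toy Bloom filter of length l.

    Each bigram sets up to k bits, chosen by double hashing with two keyed
    hashes. Assumption: one shared secret string is used directly as the key.
    """
    key = secret.encode("utf-8")
    bits = set()
    for field in fields:
        for gram in bigrams(field):
            msg = gram.encode("utf-8")
            h1 = int.from_bytes(hmac.new(key, msg, hashlib.sha1).digest(), "big")
            h2 = int.from_bytes(hmac.new(key, msg, hashlib.md5).digest(), "big")
            for i in range(k):
                bits.add((h1 + i * h2) % l)
    return bits


a = toy_clk(["belinda", "hilton"], "horse staple")
b = toy_clk(["belnda", "hilton"], "horse staple")   # typo in the given name
c = toy_clk(["belinda", "hilton"], "another secret")

print("bits shared by a and b:", len(a & b), "of", len(a), "and", len(b))
print("same record, different secret, identical filter?", a == c)
```

Records with similar field values share many set bits as long as both parties use the same secret, which is why Alice and Bob must agree on the passwords and the schema beforehand.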

In [32]:
!clkutil hash "{a_csv.name}" horse staple "{schema.name}" "{a_clks.name}"
!clkutil hash "{b_csv.name}" horse staple "{schema.name}" "{b_clks.name}"

generating CLKs: 100%|█| 5.00k/5.00k [00:05<00:00, 786clk/s, mean=765, std=37.1]
CLK data written to /var/folders/mw/21b9jb5d1c9_3_z0dq7hpx1m00j_0b/T/tmpo9zkshw4.json
generating CLKs: 100%|█| 5.00k/5.00k [00:04<00:00, 863clk/s, mean=756, std=43.3]
CLK data written to /var/folders/mw/21b9jb5d1c9_3_z0dq7hpx1m00j_0b/T/tmpx0o91git.json


Now the two clients can upload their data, each providing their own *update token* and the *project_id*. As with all `clkutil` commands, we can output help:

In [33]:
!clkutil upload --help

Usage: clkutil upload [OPTIONS] INPUT

  Upload CLK data to entity matching server.

  Given a json file containing hashed clk data as INPUT, upload to the
  entity resolution service.

  Use "-" to read from stdin.

Options:
  --project TEXT         Project identifier
  --apikey TEXT          Authentication API key for the server.
  --server TEXT          Server address including protocol
  -o, --output FILENAME
  -v, --verbose          Script is more talkative
  --help                 Show this message and exit.


### Alice uploads her data

In [34]:
with NamedTemporaryFile('wt') as f:
    !clkutil upload \
        --project="{project_id}" \
        --apikey="{credentials['update_tokens'][0]}" \
        --server "{url}" \
        --output "{f.name}" \
        "{a_clks.name}"
    res = json.load(open(f.name))
    alice_receipt_token = res['receipt_token']

Every upload returns a receipt token. The data provider needs this token later to retrieve their permutation.

### Bob uploads his data

In [35]:
with NamedTemporaryFile('wt') as f:
    !clkutil upload \
        --project="{project_id}" \
        --apikey="{credentials['update_tokens'][1]}" \
        --server "{url}" \
        --output "{f.name}" \
        "{b_clks.name}"
    
    bob_receipt_token = json.load(open(f.name))['receipt_token']

<a id="create_run"></a>
## Create a run

Now that the project has been created and the CLK data has been uploaded, we can carry out some privacy-preserving record linkage. Try with a few different threshold values:

In [52]:
with NamedTemporaryFile('wt') as f:
    !clkutil create \
        --project="{project_id}" \
        --apikey="{credentials['result_token']}" \
        --server "{url}" \
        --threshold 0.85 \
        --output "{f.name}"
    
    run_id = json.load(open(f.name))['run_id']
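
To get a feel for what the threshold does, the sketch below scores pairs of toy bit sets and keeps those at or above 0.85. It assumes the similarity measure is the Sørensen-Dice coefficient of the two bit arrays; treat that as an assumption about the service and check its documentation. The bit sets are made up.

```python
def dice_similarity(bits_a, bits_b):
    """Sørensen-Dice coefficient of two sets of set-bit positions."""
    if not bits_a and not bits_b:
        return 0.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


# Toy "CLKs", each represented by the positions of its set bits.
clk_a = {1, 4, 7, 9, 12, 15, 20, 23}
clk_b = {1, 4, 7, 9, 12, 15, 20, 30}   # differs from clk_a in one bit
clk_c = {2, 4, 8, 9, 11, 16, 20, 31}   # shares only three bits with clk_a

threshold = 0.85
for name, other in [("b", clk_b), ("c", clk_c)]:
    score = dice_similarity(clk_a, other)
    verdict = "candidate match" if score >= threshold else "rejected"
    print("a vs {}: {:.3f} -> {}".format(name, score, verdict))
```

A higher threshold generally trades recall for precision, so it is worth comparing a few runs.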

<a id="results"></a>
## Results

Now after some delay (depending on the size) we can fetch the mask.
This can be done with clkutil:

    !clkutil results --server "{url}" \
        --project="{credentials['project_id']}" \
        --apikey="{credentials['result_token']}" --output results.txt
        
However, for this tutorial we are going to use the Python `requests` library:

In [53]:
import requests
import clkhash.rest_client
import json
import time
from IPython.display import display, clear_output

In [54]:
for update in clkhash.rest_client.watch_run_status(url, project_id, run_id, credentials['result_token'], timeout=300):
    clear_output(wait=True)
    print(clkhash.rest_client.format_run_status(update))

State: completed
Stage (3/3): compute output


In [55]:
results = requests.get('{}/api/v1/projects/{}/runs/{}/result'.format(url, project_id, run_id), headers={'Authorization': credentials['result_token']}).json()

In [56]:
mask = results['mask']

This mask is a boolean array that specifies where rows of permuted data line up.

In [57]:
print(mask[:10])

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


The number of 1s in the mask will tell us how many matches were found.

In [58]:
sum([1 for m in mask if m == 1])

4856

We also use `requests` to fetch the permutations for each data provider:

In [59]:
alice_res = requests.get('{}/api/v1/projects/{}/runs/{}/result'.format(url, project_id, run_id), headers={'Authorization': alice_receipt_token}).json()
bob_res = requests.get('{}/api/v1/projects/{}/runs/{}/result'.format(url, project_id, run_id), headers={'Authorization': bob_receipt_token}).json()
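
All three result requests in this tutorial hit the same endpoint and differ only in the `Authorization` token: the analyst's `result_token` returns the mask, and each receipt token returns that provider's permutation. The sketch below factors out the request construction. It uses the standard library's `urllib.request` instead of `requests` so it needs no extra packages, and it only builds the request without sending it. `build_result_request` is a hypothetical helper and the argument values are placeholders.

```python
from urllib.request import Request


def build_result_request(base_url, project_id, run_id, token):
    """Build (but do not send) a GET request for a run's result."""
    endpoint = "{}/api/v1/projects/{}/runs/{}/result".format(
        base_url, project_id, run_id
    )
    return Request(endpoint, headers={"Authorization": token})


req = build_result_request(
    "https://testing.es.data61.xyz", "example-project", "example-run", "example-token"
)
print(req.get_method(), req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` and parsing the body as JSON would then play the role of `requests.get(...).json()` above.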

Now Alice and Bob both have a new permutation - a new ordering for their data.

In [60]:
alice_permutation = alice_res['permutation']
alice_permutation[:10]

[1116, 4282, 602, 2490, 2784, 4288, 4363, 4330, 3495, 1684]

This permutation says the first row of Alice's data should be moved to position 1116.

In [61]:
bob_permutation = bob_res['permutation']
bob_permutation[:10]

[4554, 570, 279, 2055, 2726, 3830, 324, 4013, 1948, 3383]

In [62]:
def reorder(items, order):
    """
    Return a reordered copy of items, where order[i] is the
    new position of items[i].
    """
    neworder = items.copy()
    for item, newpos in zip(items, order):
        neworder[newpos] = item
    
    return neworder
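
Because `reorder` starts from a copy of `items`, an `order` that is not a valid permutation would silently leave stale rows in place. A slightly more defensive variant, sketched here with the hypothetical names `is_permutation` and `reorder_checked`, validates its input first.

```python
def is_permutation(order):
    """True if order contains each index 0..len(order)-1 exactly once."""
    return sorted(order) == list(range(len(order)))


def reorder_checked(items, order):
    """Like reorder, but refuse input that is not a valid permutation."""
    if len(items) != len(order) or not is_permutation(order):
        raise ValueError("order must be a permutation of the item indices")
    result = [None] * len(items)
    for item, new_pos in zip(items, order):
        result[new_pos] = item
    return result


print(reorder_checked(["a", "b", "c"], [2, 0, 1]))

try:
    reorder_checked(["a", "b", "c"], [0, 0, 1])   # position 0 used twice
except ValueError as err:
    print("rejected:", err)
```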

In [63]:
with open(a_csv.name, 'r') as f:
    alice_raw = f.readlines()[1:]
    alice_reordered = reorder(alice_raw, alice_permutation)

with open(b_csv.name, 'r') as f:
    bob_raw = f.readlines()[1:]
    bob_reordered = reorder(bob_raw, bob_permutation)

Now that the two data sets have been permuted, the mask reveals where the rows line up, and where they don't.

In [64]:
alice_reordered[:10]

['rec-1060-org,belinda,hilton,38,beltana road,villa 2,botany,2171,vic,19640816,6397676\n',
 'rec-1529-org,zachariah,campbell,32,gellibrand street,carowood,keswick,3148,vic,19271210,2544494\n',
 'rec-2541-org,rebecca,boothroyd,1,gatton street,rosedale,wiseleigh,3690,nsw,19001103,3436597\n',
 'rec-1792-org,layla,ban,17,howitt street,oaklane,kirwan,4210,sa,19201015,4834950\n',
 'rec-4496-org,hari,warnock,8,jansz crescent,brentwood vlge,broadmeadows,2486,qld,19350219,7539077\n',
 'rec-1200-org,sara,bibrowicz,46,kingscote crescent,rowethorpe,smiths beach,2026,nsw,19120602,4090488\n',
 'rec-1516-org,trey,goetze,4,petterd street,bonny doon,brighton,7054,qld,19520118,1744778\n',
 'rec-2072-org,kelsey,bacskai,7,northbourne avenue,kooyong,como,2541,qld,19700110,3814852\n',
 'rec-3463-org,,webb,15,honyong crescent,wurtulla shopping vlge,narooma,4114,qld,19430207,5897606\n',
 'rec-1909-org,kayla,braithwaite,19,forwood street,braemar vlge,chewton,2261,nsw,19640906,6775714\n']

In [65]:
bob_reordered[:10]

['rec-1060-dup-0,belnda,hilton,3,beltana road,villa 2,botany,2171,vix,,6397676\n',
 'rec-1529-dup-0,ebonie,campbell,32,gellibrand street,carowood,kessick,3148,vic,19271210,2544494\n',
 'rec-2541-dup-0,rebecca,boothoryd,1,gatto s treet,rosedae,wiseleigh,3690,nsw,19002103,3436597\n',
 'rec-1792-dup-0,layla,,71,howitt street,oaklan,kirwhan,4210,wa,19201015,4834950\n',
 'rec-4496-dup-0,warnock,hark,,jansz cr escent,brentwood vlge,broadmeadows,2486,,19350219,7539077\n',
 'rec-1200-dup-0,sara,bibrlwicz,46,kingscote crescent,rowetho rpe,smithsxbeach,2026,nsw,19120602,4090488\n',
 'rec-1516-dup-0,trey,goetdze,4,petterd street,bonnu doon,brighton,7054,vic,19250118,1744778\n',
 'rec-2072-dup-0,kelsey,bacskai,7,northbourne avenue,kooyong,como,5241,qld,19700110,3814852\n',
 'rec-3463-dup-0,,georhe,15,,wurtulla shooping vlge,naroomj,4114,qld,19430207,5897606\n',
 'rec-1909-dup-0,kayla,braitahite,25,forwoods treet,braemar vlge,chewton,2261,nsw,19640906,6775714\n']

## Accuracy

To compute how well the matching went, we will use the record identifier in the first column (`rec_id`) as our reference.

For example, `rec-1396-org` is the original record, which has a match in `rec-1396-dup-0`. To satisfy ourselves we can preview the first few supposed matches:

In [66]:
for i, m in enumerate(mask[:10]):
    if m:
        entity_a = alice_reordered[i].split(',')
        entity_b = bob_reordered[i].split(',')
        name_a = ' '.join(entity_a[1:3]).title()
        name_b = ' '.join(entity_b[1:3]).title()
        
        print("{} ({})".format(name_a, entity_a[0]), '=?', "{} ({})".format(name_b, entity_b[0]))

Belinda Hilton (rec-1060-org) =? Belnda Hilton (rec-1060-dup-0)
Zachariah Campbell (rec-1529-org) =? Ebonie Campbell (rec-1529-dup-0)
Rebecca Boothroyd (rec-2541-org) =? Rebecca Boothoryd (rec-2541-dup-0)
Layla Ban (rec-1792-org) =? Layla  (rec-1792-dup-0)
Hari Warnock (rec-4496-org) =? Warnock Hark (rec-4496-dup-0)
Sara Bibrowicz (rec-1200-org) =? Sara Bibrlwicz (rec-1200-dup-0)
Trey Goetze (rec-1516-org) =? Trey Goetdze (rec-1516-dup-0)
Kelsey Bacskai (rec-2072-org) =? Kelsey Bacskai (rec-2072-dup-0)
 Webb (rec-3463-org) =?  Georhe (rec-3463-dup-0)
Kayla Braithwaite (rec-1909-org) =? Kayla Braitahite (rec-1909-dup-0)


### Metrics
If you know the ground truth (the correct mapping between the two datasets), you can compute performance metrics of the linkage.

**Precision**: The percentage of actual matches out of all found matches. (`tp/(tp+fp)`)

**Recall**: How many of the actual matches have we found? (`tp/(tp+fn)`)

In [67]:
tp = 0
fp = 0

for i, m in enumerate(mask):
    if m:
        entity_a = alice_reordered[i].split(',')
        entity_b = bob_reordered[i].split(',')
        if entity_a[0].split('-')[1] == entity_b[0].split('-')[1]:
            tp += 1
        else:
            fp += 1
            #print('False positive:',' '.join(entity_a[1:3]).title(), '?', ' '.join(entity_b[1:3]).title(), entity_a[-1] == entity_b[-1])

print("Found {} correct matches out of 5000. Incorrectly linked {} matches.".format(tp, fp))
precision = tp/(tp+fp)
recall = tp/5000

print("Precision: {:.1f}%".format(100*precision))
print("Recall: {:.1f}%".format(100*recall))

Found 4846 correct matches out of 5000. Incorrectly linked 10 matches.
Precision: 99.8%
Recall: 96.9%
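
The recall above divides by 5000 because every one of the 5000 records has exactly one true match in this dataset, so the number of false negatives is `5000 - tp`. The small helper below, a sketch with the hypothetical name `precision_recall`, makes the two formulas explicit and reproduces the figures from the counts reported above.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall); either is 0.0 if its denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Counts taken from the run above: 4846 correct links, 10 incorrect links,
# and 5000 true matches in total.
p, r = precision_recall(tp=4846, fp=10, fn=5000 - 4846)

print("Precision: {:.1f}%".format(100 * p))
print("Recall: {:.1f}%".format(100 * r))
```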
