# Entity Service Permutation Output

This tutorial demonstrates generating CLKs from PII, creating a new project on the entity service, and retrieving the results.
The output type is permutation and mask.

The sections are usually run by different companies, but for illustration everything is carried out in this one file. The participants providing data are *Alice* and *Bob*, and the analyst acts as the integration authority.

### Who learns what?

Alice and Bob will both generate and upload their CLKs. After the linkage has been carried out they will be able to retrieve a `permutation` - a reordering of their respective data sets such that shared entities line up.

The analyst - who creates the linkage project - learns the `mask`. The mask is a binary vector that indicates which rows in the permuted data sets are aligned. Note this reveals how many entities are shared.
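
As a toy illustration of how the two outputs fit together, the sketch below pairs up rows that the mask marks as aligned. All rows and the mask are made up for this example; none of it comes from the entity service.

```python
# Hypothetical, already-permuted rows held by each data provider.
alice_permuted = ["alice-row-7", "alice-row-2", "alice-row-5", "alice-row-1"]
bob_permuted = ["bob-row-3", "bob-row-9", "bob-row-4", "bob-row-8"]

# Hypothetical mask held by the analyst: 1 means the rows at that
# position refer to the same entity, 0 means they do not.
toy_mask = [1, 1, 0, 0]

# Pair up only the positions that the mask marks as aligned.
linked = [
    (a, b)
    for a, b, m in zip(alice_permuted, bob_permuted, toy_mask)
    if m == 1
]

print(linked)
print("shared entities:", sum(toy_mask))
```

Neither data provider sees the mask, and the analyst never sees the rows; only when the three pieces are combined do the linked pairs become visible.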

### Steps

* Check connection to Entity Service
* Data preparation
  * Write CSV files with PII
  * Create a Linkage Schema
* Create Linkage Project
* Generate CLKs from PII
* Upload the CLKs
* Create a run
* Retrieve and analyse results

## Check Connection

If you are connecting to a custom entity service, change the address here.

In [1]:
url = 'https://testing.es.data61.xyz'

In [2]:
!clkutil status --server "{url}"

{"project_count": 1412, "rate": 53129, "status": "ok"}
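
Outside a notebook it can be handy to check the reported status programmatically. The following is a minimal sketch, assuming the status output is a JSON object like the one printed above; `service_is_ok` is a hypothetical helper and not part of `clkhash`.

```python
import json


def service_is_ok(status_text):
    """Return True if the status JSON reports "status": "ok"."""
    status = json.loads(status_text)
    return status.get("status") == "ok"


# Sample response copied from the clkutil output above.
sample = '{"project_count": 1412, "rate": 53129, "status": "ok"}'
print(service_is_ok(sample))
```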


## Data preparation

Following the [clkhash tutorial](http://clkhash.readthedocs.io/en/latest/tutorial_cli.html) we will use a dataset from the `recordlinkage` library. We will just write both datasets out to temporary CSV files.

If you are following along yourself you may have to adjust the file names in all the `!clkutil` commands.

In [3]:
from tempfile import NamedTemporaryFile
from recordlinkage.datasets import load_febrl4

In [4]:
dfA, dfB = load_febrl4()

a_csv = NamedTemporaryFile('w')
a_clks = NamedTemporaryFile('w', suffix='.json')
dfA.to_csv(a_csv)
a_csv.seek(0)

b_csv = NamedTemporaryFile('w')
b_clks = NamedTemporaryFile('w', suffix='.json')
dfB.to_csv(b_csv)
b_csv.seek(0)

print("Datasets written to {} and {}".format(a_csv.name, b_csv.name))

Datasets written to /tmp/tmp3kx41lxs and /tmp/tmpfiwx8pis


In [5]:
dfA.head()

| rec_id | given_name | surname | street_number | address_1 | address_2 | suburb | postcode | state | date_of_birth | soc_sec_id |
|---|---|---|---|---|---|---|---|---|---|---|
| rec-1070-org | michaela | neumann | 8 | stanley street | miami | winston hills | 4223 | nsw | 19151111 | 5304218 |
| rec-1016-org | courtney | painter | 12 | pinkerton circuit | bega flats | richlands | 4560 | vic | 19161214 | 4066625 |
| rec-4405-org | charles | green | 38 | salkauskas crescent | kela | dapto | 4566 | nsw | 19480930 | 4365168 |
| rec-1288-org | vanessa | parr | 905 | macquoid place | broadbridge manor | south grafton | 2135 | sa | 19951119 | 9239102 |
| rec-3585-org | mikayla | malloney | 37 | randwick road | avalind | hoppers crossing | 4552 | vic | 19860208 | 7207688 |


## Schema Preparation

The linkage schema must be agreed on by the two parties. A hashing schema instructs clkhash how to treat each column when generating CLKs. A detailed description of the hashing schema can be found in the API docs. We will ignore the columns `rec_id` and `soc_sec_id` for CLK generation.

In [6]:
schema = NamedTemporaryFile('wt')

In [7]:
%%writefile {schema.name}
{
  "version": 1,
  "clkConfig": {
    "l": 1024,
    "k": 30,
    "hash": {
      "type": "doubleHash"
    },
    "kdf": {
      "type": "HKDF",
      "hash": "SHA256",
      "info": "c2NoZW1hX2V4YW1wbGU=",
      "salt": "SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==",
      "keySize": 64
    }
  },
  "features": [
    {
      "identifier": "rec_id",
      "ignored": true
    },
    {
      "identifier": "given_name",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 1 }
    },
    {
      "identifier": "surname",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 1 }
    },
    {
      "identifier": "street_number",
      "format": { "type": "integer" },
      "hashing": { "ngram": 1, "positional": true, "weight": 1, "missingValue": {"sentinel": ""} }
    },
    {
      "identifier": "address_1",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 1 }
    },
    {
      "identifier": "address_2",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 1 }
    },
    {
      "identifier": "suburb",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 1 }
    },
    {
      "identifier": "postcode",
      "format": { "type": "integer", "minimum": 100, "maximum": 9999 },
      "hashing": { "ngram": 1, "positional": true, "weight": 1 }
    },
    {
      "identifier": "state",
      "format": { "type": "string", "encoding": "utf-8", "maxLength": 3 },
      "hashing": { "ngram": 2, "weight": 1 }
    },
    {
      "identifier": "date_of_birth",
      "format": { "type": "integer" },
      "hashing": { "ngram": 1, "positional": true, "weight": 1, "missingValue": {"sentinel": ""} }
    },
    {
      "identifier": "soc_sec_id",
      "ignored": true
    }
  ]
}

Overwriting /tmp/tmp8kzi4os5
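
To double-check which columns a schema hashes and which it ignores, the feature list can be inspected with plain `json`. The sketch below uses a trimmed copy of the schema above, reduced to four features for brevity; it reads the JSON directly and does not use any clkhash API.

```python
import json

# A trimmed copy of the schema's feature list, for illustration only.
schema_text = """
{
  "features": [
    {"identifier": "rec_id", "ignored": true},
    {"identifier": "given_name",
     "format": {"type": "string", "encoding": "utf-8"},
     "hashing": {"ngram": 2, "weight": 1}},
    {"identifier": "postcode",
     "format": {"type": "integer", "minimum": 100, "maximum": 9999},
     "hashing": {"ngram": 1, "positional": true, "weight": 1}},
    {"identifier": "soc_sec_id", "ignored": true}
  ]
}
"""

features = json.loads(schema_text)["features"]

# Features without an "ignored" key are treated as hashed.
ignored = [feat["identifier"] for feat in features if feat.get("ignored", False)]
hashed = [feat["identifier"] for feat in features if not feat.get("ignored", False)]

print("ignored:", ignored)
print("hashed:", hashed)
```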


## Create Linkage Project

The analyst carrying out the linkage starts by creating a linkage project of the desired output type with the Entity Service.


In [8]:
creds = NamedTemporaryFile('wt')
print("Credentials will be saved in", creds.name)

!clkutil create-project --schema "{schema.name}" --output "{creds.name}" --type "permutations" --server "{url}"
creds.seek(0)

import json
with open(creds.name, 'r') as f:
    credentials = json.load(f)

project_id = credentials['project_id']
credentials

Credentials will be saved in /tmp/tmpp2qwgwlz
Project created


{'project_id': '432e0100559b4341aa474d62339a3a8f3a7d2b4db816e316',
 'result_token': 'da272f18ae832a21df064b8815d47f74e28da5e4b776c7d8',
 'update_tokens': ['a042146986db4d57824db09ad9bf1019093037f72d7395cf',
  '9a3896d90b57cd873b5e3408ef3774fc19dceb15a8b879aa']}

**Note:** the analyst will need to pass on the `project_id` (the id of the linkage project) and one of the two `update_tokens` to each data provider.
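
One way to prepare what each data provider receives is sketched below. The credentials are made-up placeholders with the same shape as the output above, and `provider_handout` is a hypothetical helper, not part of `clkhash`.

```python
# Made-up credentials with the same shape as the output above.
example_credentials = {
    "project_id": "example-project-id",
    "result_token": "example-result-token",
    "update_tokens": ["example-token-alice", "example-token-bob"],
}


def provider_handout(creds, index):
    """Return only what data provider number `index` needs to upload."""
    return {
        "project_id": creds["project_id"],
        "update_token": creds["update_tokens"][index],
    }


alice_handout = provider_handout(example_credentials, 0)
bob_handout = provider_handout(example_credentials, 1)

print(alice_handout)
print(bob_handout)
```

The `result_token` is deliberately left out of both handouts, since it stays with the analyst.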

## Hash and Upload

At the moment both data providers have *raw* personally identifiable information. We first have to generate CLKs from the raw entity information. Please see the [clkhash](https://clkhash.readthedocs.io/) documentation for further details on this.

In [9]:
!clkutil hash "{a_csv.name}" horse staple "{schema.name}" "{a_clks.name}"
!clkutil hash "{b_csv.name}" horse staple "{schema.name}" "{b_clks.name}"

generating CLKs: 100%|█| 5.00k/5.00k [00:03<00:00, 1.22kclk/s, mean=883, std=33.6]
CLK data written to /tmp/tmpgufu8og6.json
generating CLKs: 100%|█| 5.00k/5.00k [00:02<00:00, 524clk/s, mean=875, std=39.7]
CLK data written to /tmp/tmpjx0otna1.json


Now the two clients can upload their data, each providing the appropriate update token (one of the `update_tokens` created with the project). As with all commands in `clkhash`, we can print the help text:

In [10]:
!clkutil upload --help

Usage: clkutil upload [OPTIONS] INPUT

  Upload CLK data to entity matching server.

  Given a json file containing hashed clk data as INPUT, upload to the
  entity resolution service.

  Use "-" to read from stdin.

Options:
  --project TEXT         Project identifier
  --apikey TEXT          Authentication API key for the server.
  --server TEXT          Server address including protocol
  -o, --output FILENAME
  -v, --verbose          Script is more talkative
  --help                 Show this message and exit.


### Alice uploads her data

In [11]:
with NamedTemporaryFile('wt') as f:
    !clkutil upload \
        --project="{project_id}" \
        --apikey="{credentials['update_tokens'][0]}" \
        --server "{url}" \
        --output "{f.name}" \
        "{a_clks.name}"
    res = json.load(open(f.name))
    alice_receipt_token = res['receipt_token']

Every upload gets a receipt token. In some operating modes this receipt is required to access the results.

### Bob uploads his data

In [12]:
with NamedTemporaryFile('wt') as f:
    !clkutil upload \
        --project="{project_id}" \
        --apikey="{credentials['update_tokens'][1]}" \
        --server "{url}" \
        --output "{f.name}" \
        "{b_clks.name}"
    
    bob_receipt_token = json.load(open(f.name))['receipt_token']

## Create a run

Now that the project has been created and the CLK data has been uploaded, we can carry out some privacy-preserving record linkage. Try a few different threshold values:

In [43]:
with NamedTemporaryFile('wt') as f:
    !clkutil create \
        --project="{project_id}" \
        --apikey="{credentials['result_token']}" \
        --server "{url}" \
        --threshold 0.9 \
        --output "{f.name}"
    
    run_id = json.load(open(f.name))['run_id']
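
To build some intuition for what the threshold does, the sketch below scores one pair of tiny bit vectors and compares the score against two thresholds. It assumes that candidate pairs are scored with the Sørensen-Dice coefficient of their CLKs and that pairs scoring below the threshold are discarded; this tutorial does not state the similarity measure, so treat both as assumptions. The 8-bit vectors are made up (real CLKs here have 1024 bits).

```python
def dice_coefficient(bits_a, bits_b):
    """Sørensen-Dice coefficient of two equal-length 0/1 sequences."""
    common = sum(a & b for a, b in zip(bits_a, bits_b))
    total = sum(bits_a) + sum(bits_b)
    return 2 * common / total if total else 0.0


# Made-up 8-bit stand-ins for two CLKs that differ in a single bit.
clk_a = [1, 1, 0, 1, 0, 1, 1, 0]
clk_b = [1, 1, 0, 1, 0, 1, 0, 0]

score = dice_coefficient(clk_a, clk_b)
print("score: {:.3f}".format(score))

for threshold in (0.8, 0.9):
    print("threshold", threshold, "accepted:", score >= threshold)
```

In this toy case the pair is accepted at 0.8 but rejected at 0.9, which shows how a higher threshold trades recall for precision.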

## Results

After some delay (depending on the size of the data sets) we can fetch the mask.
This can be done with `clkutil`:

    !clkutil results --server "{url}" \
        --project="{credentials['project_id']}" \
        --apikey="{credentials['result_token']}" --output results.txt
        
However, for this tutorial we are going to use the Python `requests` library:

In [44]:
import requests
import clkhash.rest_client
from IPython.display import clear_output

In [45]:
for update in clkhash.rest_client.watch_run_status(url, project_id, run_id, credentials['result_token'], timeout=300):
    clear_output(wait=True)
    print(clkhash.rest_client.format_run_status(update))

State: completed
Stage (3/3): compute output


In [47]:
results = requests.get('{}/api/v1/projects/{}/runs/{}/result'.format(url, project_id, run_id), headers={'Authorization': credentials['result_token']}).json()

In [48]:
mask = results['mask']

This mask is a boolean array that specifies where rows of permuted data line up.

In [49]:
print(mask[:10])

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


The number of 1s in the mask will tell us how many matches were found.

In [50]:
sum(mask)

4830

We also use `requests` to fetch the permutations for each data provider:

In [51]:
alice_res = requests.get('{}/api/v1/projects/{}/runs/{}/result'.format(url, project_id, run_id), headers={'Authorization': alice_receipt_token}).json()
bob_res = requests.get('{}/api/v1/projects/{}/runs/{}/result'.format(url, project_id, run_id), headers={'Authorization': bob_receipt_token}).json()

Now Alice and Bob both have a new permutation - a new ordering for their data.

In [52]:
alice_permutation = alice_res['permutation']
alice_permutation[:10]

[3074, 2978, 827, 4760, 4501, 2399, 2546, 235, 4126, 2618]

This permutation says the first row of Alice's data should be moved to position 3074.

In [53]:
bob_permutation = bob_res['permutation']
bob_permutation[:10]

[2046, 4958, 662, 4681, 4129, 3674, 4663, 3117, 2148, 4336]

In [54]:
def reorder(items, order):
    """
    Return a reordered copy of `items`, where `order[i]` is the
    new index of `items[i]`.
    """
    neworder = items.copy()
    for item, newpos in zip(items, order):
        neworder[newpos] = item
    
    return neworder
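
To see the convention on a small input, here is a self-contained variant with a sanity check added. `apply_permutation` is a hypothetical name used only in this sketch; it follows the same rule as `reorder` (item `i` moves to position `order[i]`) and additionally rejects an `order` that is not a valid permutation.

```python
def apply_permutation(items, order):
    """Place items[i] at position order[i], after validating order."""
    if sorted(order) != list(range(len(items))):
        raise ValueError("order must contain each index 0..n-1 exactly once")
    result = [None] * len(items)
    for item, new_position in zip(items, order):
        result[new_position] = item
    return result


rows = ["a", "b", "c", "d"]
order = [2, 0, 3, 1]

# "a" moves to index 2, "b" to index 0, "c" to index 3, "d" to index 1.
print(apply_permutation(rows, order))  # ['b', 'd', 'a', 'c']
```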

In [55]:
with open(a_csv.name, 'r') as f:
    alice_raw = f.readlines()[1:]
    alice_reordered = reorder(alice_raw, alice_permutation)

with open(b_csv.name, 'r') as f:
    bob_raw = f.readlines()[1:]
    bob_reordered = reorder(bob_raw, bob_permutation)

Now that the two data sets have been permuted, the mask reveals where the rows line up, and where they don't.

In [56]:
alice_reordered[:10]

['rec-2015-org,arabella,sorsa,29,corringle close,mari-ma farm,mulgrave east,4871,qld,19450119,2313601\n',
 'rec-4927-org,daniel,glass,8,roe street,rowethorpe,williamstown,2026,tas,19170309,4569679\n',
 'rec-3783-org,alannah,newport,7,wakefield avenue,glenmore,moree,4662,qld,19401219,4895884\n',
 'rec-4812-org,anthony,roche,11,waller crescent,gungarlin,inglewood,3103,act,19930109,7209435\n',
 "rec-1448-org,teal,donaldson,,o'sullivan street,willow lodge,springwood,7030,nsw,19241229,8492921\n",
 'rec-489-org,brandon,strangway,177,robert campbell road,dryrock gulch,toowoomba,4121,nsw,19220610,4691109\n',
 'rec-4403-org,harriet,paterson,5,smiths road,jaybees,ringwood east,6012,nsw,19321027,3012880\n',
 'rec-328-org,ronan,zatorski,13,lott place,bobblegigbie,ringwood east,5109,qld,19870107,5925683\n',
 'rec-2518-org,makayla,thoonen,119,tullaroop street,rosetta village,byford,3977,vic,19750818,1423820\n',
 'rec-2323-org,rachael,yallop,15,fullagar crescent,rsde 668,beenleigh,4860,qld,19760214,1

In [57]:
bob_reordered[:10]

['rec-2015-dup-0,arabella,sors,92,corringle close,mari-ma farm,mulgrave east,4871,qld,19450119,2313601\n',
 'rec-4927-dup-0,daniel,glad,,roe street,rowethorpe,tungamull,2026,tas,19170309,4569679\n',
 'rec-3783-dup-0,katelyn,newport,7,wakefield avenue,glenmore,moree,4662,qld,19401219,4895884\n',
 'rec-4812-dup-0,anthony,roche,11,waller ceescent,gungarlin,st andrews,3103,act,19930109,7209435\n',
 "rec-1448-dup-0,tea,donaklsdon,,o'sullivan street,willow lodge,springwood,7030,nsw,19891118,8492921\n",
 'rec-489-dup-0,brandon,strangway,177,everard lace,dryroc k gulch,hinchinbrook,4121,nsw,19220610,4691109\n',
 'rec-4403-dup-0,harriet,paterson,5,smiths road,jaybees,ringwood past,6012,nsw,19321027,3012880\n',
 'rec-328-dup-0,ronan,zatorski,13,lott place,bobblegigbie,ringwoo east,5109,qld,19870107,5925683\n',
 'rec-2518-dup-0,thoonen,makayla,119,tullaroop street,rosetta village,byford,3977,vic,19750818,1423820\n',
 'rec-2323-dup-0,rachael,yallop,5,fullagar crescent,rsde 668,been leigh,4860,qld,

## Accuracy

To compute how well the matching went we will use the record id in the first column as our reference.

For example, `rec-1396-org` is the original record, which has a match in `rec-1396-dup-0`. To satisfy ourselves we can preview the first few supposed matches:

In [58]:
for i, m in enumerate(mask[:10]):
    if m:
        entity_a = alice_reordered[i].split(',')
        entity_b = bob_reordered[i].split(',')
        name_a = ' '.join(entity_a[1:3]).title()
        name_b = ' '.join(entity_b[1:3]).title()
        
        print("{} ({})".format(name_a, entity_a[0]), '=?', "{} ({})".format(name_b, entity_b[0]))

Arabella Sorsa (rec-2015-org) =? Arabella Sors (rec-2015-dup-0)
Daniel Glass (rec-4927-org) =? Daniel Glad (rec-4927-dup-0)
Alannah Newport (rec-3783-org) =? Katelyn Newport (rec-3783-dup-0)
Anthony Roche (rec-4812-org) =? Anthony Roche (rec-4812-dup-0)
Teal Donaldson (rec-1448-org) =? Tea Donaklsdon (rec-1448-dup-0)
Brandon Strangway (rec-489-org) =? Brandon Strangway (rec-489-dup-0)
Harriet Paterson (rec-4403-org) =? Harriet Paterson (rec-4403-dup-0)
Ronan Zatorski (rec-328-org) =? Ronan Zatorski (rec-328-dup-0)
Makayla Thoonen (rec-2518-org) =? Thoonen Makayla (rec-2518-dup-0)
Rachael Yallop (rec-2323-org) =? Rachael Yallop (rec-2323-dup-0)
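
The ground-truth comparison relies only on the numeric part of the record id. A minimal sketch of that check, where `entity_number` is a hypothetical helper name and the ids are assumed to follow the `rec-<number>-...` pattern seen above:

```python
def entity_number(rec_id):
    """Extract the entity number from ids such as 'rec-1396-org'."""
    return rec_id.split("-")[1]


# An original record and its duplicate share the same entity number.
print(entity_number("rec-1396-org") == entity_number("rec-1396-dup-0"))

# Records of different entities do not.
print(entity_number("rec-1396-org") == entity_number("rec-2015-dup-0"))
```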


### Metrics

**Precision**: The percentage of actual matches out of all found matches. (`tp/(tp+fp)`)

**Recall**: How many of the actual matches have we found? (`tp/(tp+fn)`)

In [59]:
tp = 0
fp = 0

for i, m in enumerate(mask):
    if m:
        entity_a = alice_reordered[i].split(',')
        entity_b = bob_reordered[i].split(',')
        if entity_a[0].split('-')[1] == entity_b[0].split('-')[1]:
            tp += 1
        else:
            fp += 1
            #print('False positive:',' '.join(entity_a[1:3]).title(), '?', ' '.join(entity_b[1:3]).title(), entity_a[-1] == entity_b[-1])

print("Found {} correct matches out of 5000. Incorrectly linked {} matches.".format(tp, fp))
precision = tp/(tp+fp)
recall = tp/5000

print("Precision: {:.1f}%".format(100*precision))
print("Recall: {:.1f}%".format(100*recall))

Found 4447 correct matches out of 5000. Incorrectly linked 383 matches.
Precision: 92.1%
Recall: 88.9%
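
The reported figures can be reproduced from the counts alone. The sketch below assumes, as the cell above does, that there are 5000 true matches in total, so the number of false negatives is `5000 - tp`; `precision_recall` is a hypothetical helper written for this example.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# Counts reported in the output above.
tp_reported = 4447
fp_reported = 383
fn_reported = 5000 - tp_reported  # true matches that were not found

p, r = precision_recall(tp_reported, fp_reported, fn_reported)
print("Precision: {:.1f}%".format(100 * p))
print("Recall: {:.1f}%".format(100 * r))
```

Note that `tp + fp` equals 4830, the number of 1s counted in the mask earlier.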
