# Tutorial: Getting Started with the Entity Service


This notebook demonstrates creating a new mapping on the entity service, and how to retrieve the results. The output type is an unencrypted permutation and mask.

The sections are usually run on different companies - but for illustration all is carried out in this one file.


In [95]:
url = 'https://es.data61.xyz/api/v1'

## Data preparation

Following the clkhash tutorial we will use a dataset from the `recordlinkage` library. We will just write both datasets out to temporary CSV files.

If you are following along yourself you may have to adjust the file names in all the `!clkutil` commands.

In [96]:
from tempfile import NamedTemporaryFile
from recordlinkage.datasets import load_febrl4

In [98]:
dfA, dfB = load_febrl4()

a_csv = NamedTemporaryFile('w')
a_clks = NamedTemporaryFile('w', suffix='.json')
dfA.to_csv(a_csv)
a_csv.seek(0)

b_csv = NamedTemporaryFile('w')
b_clks = NamedTemporaryFile('w', suffix='.json')
dfB.to_csv(b_csv)
b_csv.seek(0)

dfA.head()
print("Datasets written to {} and {}".format(a_csv.name, b_csv.name))

Datasets written to /tmp/tmp6yb5k2g1 and /tmp/tmp2i9artbn


## Schema Preparation

The linkage schema must be agreed on by the two parties.

In [100]:
column_metadata = [
    'INDEX',
    'NAME Surname',
    'NAME First Name',
    'ADDRESS House Number',
    'ADDRESS Place Name',
    'ADDRESS Place Name',
    'ADDRESS Place Name',
    'ADDRESS POSTCODE',
    'ADDRESS Place Name',
    'DOB YYYY/MM/DD',
    'INDEX'
]

schema = NamedTemporaryFile("wt", suffix='.yaml')
for col in column_metadata:
    print('- identifier: "{}"'.format(col), file=schema)

schema.seek(0)
print("Schema written to", schema.name)

Schema written to /tmp/tmpf4vyc2qk.yaml


## Integration Authority

The analyst carrying out the linkage starts by creating a mapping with the Entity Service and stores the credentials which are returned.

In [101]:
creds = NamedTemporaryFile('wt')
print("Credentials will be saved in", creds.name)

Credentials will be saved in /tmp/tmp51q_lx05


In [102]:
!clkutil create --schema "{schema.name}" --output "{creds.name}" --type "permutation_unencrypted_mask" --server "{url}"
creds.seek(0)

[31mEntity Matching Server: https://es.data61.xyz[0m
[31mChecking server status[0m
[31mServer Status: ok[0m
[31mSchema: [{"identifier": "INDEX"}, {"identifier": "NAME Surname"}, {"identifier": "NAME First Name"}, {"identifier": "ADDRESS House Number"}, {"identifier": "ADDRESS Place Name"}, {"identifier": "ADDRESS Place Name"}, {"identifier": "ADDRESS Place Name"}, {"identifier": "ADDRESS POSTCODE"}, {"identifier": "ADDRESS Place Name"}, {"identifier": "DOB YYYY/MM/DD"}, {"identifier": "INDEX"}][0m
[31mType: permutation_unencrypted_mask[0m
[31mCreating new mapping[0m
[31mMapping created[0m


0

In [103]:
import json
with open(creds.name, 'r') as f:
    credentials = json.load(f)
    
credentials

{'resource_id': '6a864f764009c77fb47044c5fd3cc1b9eb65c19790e7d502',
 'result_token': 'bef2c6c76d7458edac9ad4512457ad4512732c999a852327',
 'update_tokens': ['3845f9e4b741e4bbf26780cafac6039cc6472f222ed02e97',
  '3d43092dea9d38c1cf1454a92fd70be3097e00e2aa17611d']}

Now the data providers will need to be told the `resource_id` and each need to be given one of the two `update_tokens`.

## Hash and Upload

At the moment both data providers have *raw* personally identiy information. We first have to hash the raw entity information. Please see [clkhash](https://clkhash.readthedocs.io/) documentation for further details on this.

In [104]:
!clkutil hash --schema "{schema.name}" "{a_csv.name}" horse staple "{a_clks.name}"
!clkutil hash --schema "{schema.name}" "{b_csv.name}" horse staple "{b_clks.name}"

generating CLKs: 100%|█| 5.00K/5.00K [00:02<00:00, 921clk/s, mean=912, std=27.8]
[31mCLK data written to /tmp/tmpo1m109bt.json[0m
generating CLKs: 100%|█| 5.00K/5.00K [00:02<00:00, 900clk/s, mean=905, std=32.7]
[31mCLK data written to /tmp/tmpmobcy7op.json[0m


Now the two clients can upload their data providing the appropriate *upload tokens*.

In [6]:
!clkutil upload --help

Usage: clkutil upload [OPTIONS] INPUT

  Upload CLK data to entity matching server.

  Given a json file containing hashed clk data as INPUT, upload to the
  entity resolution service.

  Use "-" to read from stdin.

Options:
  --mapping TEXT         Server identifier of the mapping
  --apikey TEXT          Authentication API key for the server.
  --server TEXT          Server address including protocol
  -o, --output FILENAME
  -v, --verbose          Script is more talkative
  --help                 Show this message and exit.


In [105]:
# Upload Alice's data
!clkutil upload \
    --mapping="{credentials['resource_id']}" \
    --apikey="{credentials['update_tokens'][0]}" \
    "{a_clks.name}"

Every upload gets a receipt token. In some operating modes this receipt is required to access the results. For ease of use we will **manually** save this receipt token so we can use it later.

In [107]:
mid = credentials['resource_id']
alice_receipt_token = "9a9e3c5471af13169387029d6f5efa0bf84880549f5d42d9"

In [108]:
# Upload Bob's data
!clkutil upload \
    --mapping="{credentials['resource_id']}" \
    --apikey="{credentials['update_tokens'][1]}" \
    "{b_clks.name}"


[31mUploading CLK data from /tmp/tmpmobcy7op.json[0m
[31mTo Entity Matching Server: https://es.data61.xyz[0m
[31mMapping ID: 6a864f764009c77fb47044c5fd3cc1b9eb65c19790e7d502[0m
[31mChecking server status[0m
[31mStatus: ok[0m
[31mUploading CLK data to the server[0m
{"message": "Updated", "receipt-token": "f9c394a4e7687f5ecd263d4a4c800f743677fac9667a2900"}



In [109]:
bob_receipt_token = "f9c394a4e7687f5ecd263d4a4c800f743677fac9667a2900"

## Results

Now after some delay (depending on the size) we can fetch the mask.
This can be done with clkutil:

    !clkutil results \
        --mapping="{credentials['resource_id']}" \
        --apikey="{credentials['result_token']}" --output results.txt
        
But we are going to use the Python `requests` library:

In [110]:
import requests
import json

In [111]:
mask = requests.get('{}/mappings/{}'.format(url, mid), headers={'Authorization': credentials['result_token']}).json()['mask']

This mask is a boolean array that specifies where the permuted data lines up

In [112]:
print(mask[:10])

[1, 1, 0, 1, 1, 1, 1, 0, 1, 1]


We also use `requests` to fetch the permutations for each data provider:

In [113]:
alice_res = requests.get('{}/mappings/{}'.format(url, mid), headers={'Authorization': alice_receipt_token}).json()
bob_res = requests.get('{}/mappings/{}'.format(url, mid), headers={'Authorization': bob_receipt_token}).json()

Now Alice and Bob both have a new permutation - a new ordering for their data.

In [114]:
alice_permutation = alice_res['permutation']
alice_permutation[:10]

[503, 4123, 2722, 2223, 581, 1026, 1307, 4690, 4838, 3668]

In [115]:
bob_permutation = bob_res['permutation']
bob_permutation[:10]

[671, 154, 2544, 1387, 4495, 4871, 2111, 3035, 2100, 4518]

In [116]:
def reorder(items, order):
    """
    Assume order is a list of new index
    """
    neworder = items.copy()
    for item, newpos in zip(items, order):
        neworder[newpos] = item
    
    return neworder

In [117]:
with open(a_csv.name, 'r') as f:
    alice_raw = f.readlines()
    alice_reordered = reorder(alice_raw, alice_permutation)

with open(b_csv.name, 'r') as f:
    bob_raw = f.readlines()
    bob_reordered = reorder(bob_raw, bob_permutation)

In [118]:
alice_reordered[:10]

['rec-1321-org,tyler,painter,74,michael holt crescent,three rivers tourist park,kangaroo flat,4858,nsw,19291017,6615450\n',
 'rec-245-org,taylah,meaney,60,lumholtz place,cordelia state,torquay,2282,vic,19470106,1744252\n',
 'rec-491-org,courtney,petersen,9,abercrombie circuit,myross,canterbury,2200,vic,19331227,9627876\n',
 'rec-3297-org,alicia,lamonaca,15,vickers crescent,ballawinna stud,deakin,6701,nsw,19780805,4465146\n',
 'rec-265-org,hamish,teague,46,middleton circuit,colooli village,burwood,4128,nsw,19731030,9222441\n',
 'rec-3198-org,kristen,white,5,sherlock street,morayfield exchange,wangaratta,5082,wa,19761201,8300732\n',
 'rec-3777-org,lachlan,ricard,40,maribyrnong avenue,tintagel,bowral,4158,sa,19471123,6012074\n',
 'rec-686-org,zarlia,hage,69,catchpole street,tryphinia view,casino,3180,vic,19130311,5080759\n',
 'rec-1095-org,tiarna,croker,3,roope close,westport,preston west,3318,,19460426,9509846\n',
 'rec-3779-org,ajay,chetwyn,13,cope place,c/-the student village,,2164,nsw

In [119]:
bob_reordered[:10]

['rec-4604-dup-0,caitlin,morrisln,8,augustus close,,clarkson,4170,nsw,19130704,9362649\n',
 'rec-610-dup-0,lachlan,rau,68,loxton place,montrose,dedearng,3377,wa,19260502,1638351\n',
 'rec-1108-dup-0,lachln,murton,191,barraclough crescent,medcl cntr (cnr queen s street,east hills,2707,nsw,19770527,2068877\n',
 'rec-819-dup-0,jessiac,bastiaans,130,beazley crescent,,ingleburn,3029,qld,19270912,7965000\n',
 'rec-2166-dup-0,isabelle,geraghty,16,hawdonstreet,council accomm,theodore,4812,nsw,19861012,4201009\n',
 'rec-871-dup-0,limbert,benjamin,201,blackman crescent,karinyah,woodbine,2228,nsw,19020429,5534369\n',
 'rec-3479-dup-0,makenzie,michelmore,29,marcus clarke street,leitrim,ivanhoe,2199,vic,19081127,5210533\n',
 'rec-4931-dup-0,rebecca,spzrk,10,walker c rescent,,casino,4121,nsw,19990319,3829447\n',
 'rec-584-dup-0,renee,renfrey,24,serpentine street,retmntvvlge,winchelsea,2176,wa,19690305,5153203\n',
 'rec-1174-dup-0,lilfy,browne,1,spring rsne road,,someret,2024,qld,19990812,1727298\n']

## Accuracy

To compute how well the matching went we will use the social security number as our reference.

In [123]:
for i, m in enumerate(mask[:20]):
    if m:
        print(alice_reordered[i].strip(), alice_reordered[i][-1] == bob_reordered[i][-1])

rec-1321-org,tyler,painter,74,michael holt crescent,three rivers tourist park,kangaroo flat,4858,nsw,19291017,6615450 True
rec-245-org,taylah,meaney,60,lumholtz place,cordelia state,torquay,2282,vic,19470106,1744252 True
rec-3297-org,alicia,lamonaca,15,vickers crescent,ballawinna stud,deakin,6701,nsw,19780805,4465146 True
rec-265-org,hamish,teague,46,middleton circuit,colooli village,burwood,4128,nsw,19731030,9222441 True
rec-3198-org,kristen,white,5,sherlock street,morayfield exchange,wangaratta,5082,wa,19761201,8300732 True
rec-3777-org,lachlan,ricard,40,maribyrnong avenue,tintagel,bowral,4158,sa,19471123,6012074 True
rec-1095-org,tiarna,croker,3,roope close,westport,preston west,3318,,19460426,9509846 True
rec-3779-org,ajay,chetwyn,13,cope place,c/-the student village,,2164,nsw,19290222,6258884 True
rec-3739-org,jacob,,1,barada crescent,challicum south,bacchus marsh,3206,nsw,19370320,8873242 True
rec-3508-org,jacob,campain,7,logan street,bimbadoon,toowoomba,4551,qld,19520416,1591935

In [124]:
correct_matches = 0
incorrect_matches = 0


for i, m in enumerate(mask):
    if m:
        if alice_reordered[i][-1] == bob_reordered[i][-1]:
            correct_matches += 1

print(correct_matches)

4047
