# Entity Service Mapping Output

This tutorial demonstrates generating CLKs from PII, creating a new mapping on the entity service, and retrieving the results. The output type is a simple mapping.

Note that these sections are usually run at different companies, but for illustration everything is carried out in this one file. The participants providing data are *Alice* and *Bob*, with the analyst acting as the integration authority.

### Who learns what?

Alice and Bob will both generate and upload their CLKs.

The analyst, who creates the linkage project, learns the `mapping`. The mapping lines up rows from Alice's data set with rows from Bob's.

### Steps

* Data preparation
  * Write CSV files with PII
  * Create a Linkage Schema
  
* Create Linkage Project
* Upload Data
* Wait for completion
* Analyse Results

In [3]:
url = 'https://testing.es.data61.xyz'

In [4]:
!clkutil status --server "{url}"

{"rate": 20973, "project_count": 190, "status": "ok"}


## Data preparation

Following the clkhash tutorial we will use a dataset from the `recordlinkage` library. We will just write both datasets out to temporary CSV files.

If you are following along yourself you may have to adjust the file names in all the `!clkutil` commands.

In [5]:
from tempfile import NamedTemporaryFile
from recordlinkage.datasets import load_febrl4

In [6]:
dfA, dfB = load_febrl4()

a_csv = NamedTemporaryFile('w')
a_clks = NamedTemporaryFile('w', suffix='.json')
dfA.to_csv(a_csv)
a_csv.seek(0)

b_csv = NamedTemporaryFile('w')
b_clks = NamedTemporaryFile('w', suffix='.json')
dfB.to_csv(b_csv)
b_csv.seek(0)

dfA.head()
print("Datasets written to {} and {}".format(a_csv.name, b_csv.name))

Datasets written to /tmp/tmpf1bzhc9o and /tmp/tmp4g9wibnm


## Schema Preparation

The linkage schema must be agreed on by the two parties. The schema used in this tutorial covers only the surname, first name and date of birth. All columns marked as `INDEX` will be ignored by clkhash when creating CLKs.

In [17]:
column_metadata = [
    'INDEX',
    'NAME Surname',
    'NAME First Name',
    'INDEX',
    'INDEX',
    'INDEX',
    'INDEX',
    'INDEX',
    'INDEX',
    'DOB YYYY/MM/DD',
    'INDEX'
]

schema = NamedTemporaryFile("wt", suffix='.yaml')
# TODO WRITE NEW SCHEMA HERE
schema.seek(0)
print("Schema written to", schema.name)

Schema written to /tmp/tmp78du25xm.yaml
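
The labels in `column_metadata` are matched to the CSV columns by position. As a small sketch of what that implies, the snippet below pairs the labels with the header row of Alice's CSV (copied from the raw data shown later in this tutorial) and lists the columns that are not marked `INDEX`. It is illustration only and does not use clkhash.

```python
# Header row of the CSV files written earlier (copied from the raw file contents).
header = [
    'rec_id', 'given_name', 'surname', 'street_number', 'address_1',
    'address_2', 'suburb', 'postcode', 'state', 'date_of_birth', 'soc_sec_id',
]

# Same labels as in the schema cell above.
column_metadata = [
    'INDEX', 'NAME Surname', 'NAME First Name', 'INDEX', 'INDEX', 'INDEX',
    'INDEX', 'INDEX', 'INDEX', 'DOB YYYY/MM/DD', 'INDEX',
]

# Pair each CSV column with its label and keep the ones that would be hashed.
hashed = [(name, label) for name, label in zip(header, column_metadata)
          if label != 'INDEX']

for name, label in hashed:
    print('{:15} {}'.format(name, label))
```

Because the pairing is positional, the `NAME Surname` label lands on the `given_name` column and `NAME First Name` on `surname`. The same three columns are selected either way.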


## Create Linkage Project

The analyst carrying out the linkage starts by creating a linkage project of the desired output type with the Entity Service.
In this case we are going to use a very high threshold, which should prevent false matches from showing up, at the expense of missing some true matches.

In [16]:
creds = NamedTemporaryFile('wt')
print("Credentials will be saved in", creds.name)

!python -m clkhash create-project -v --schema "{schema.name}" --output "{creds.name}" --type "mapping" --server "{url}"
creds.seek(0)

import json
with open(creds.name, 'r') as f:
    print(f.read())
    #credentials = json.load(f)

#pid = credentials['project_id']
#credentials

Credentials will be saved in /tmp/tmpplwtc0wr
Entity Matching Server: https://testing.es.data61.xyz
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/testenv/lib/python3.5/site-packages/clkhash/__main__.py", line 4, in <module>
    cli()
  File "/root/testenv/lib/python3.5/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/root/testenv/lib/python3.5/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/root/testenv/lib/python3.5/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/root/testenv/lib/python3.5/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/root/testenv/lib/python3.5/site-packages

**Note:** the analyst will need to pass on the `resource_id` (the id of the linkage project) and one of the two `update_tokens` to each data provider.
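
To illustrate the threshold trade-off mentioned for project creation, here is a minimal sketch. It assumes the similarity between two CLKs is the Sørensen-Dice coefficient of their set bits; the bit patterns and the threshold values are made up for illustration and are not taken from this tutorial.

```python
def dice_coefficient(a: int, b: int) -> float:
    """Sorensen-Dice similarity of two bit arrays stored as Python ints."""
    popcount_a = bin(a).count('1')
    popcount_b = bin(b).count('1')
    if popcount_a + popcount_b == 0:
        return 0.0
    common = bin(a & b).count('1')
    return 2 * common / (popcount_a + popcount_b)


def is_match(a: int, b: int, threshold: float) -> bool:
    """A pair is only reported when its similarity reaches the threshold."""
    return dice_coefficient(a, b) >= threshold


# Two hypothetical 8-bit encodings that share three of their four set bits.
alice_bits = 0b11110000
bob_bits = 0b11100001

print(dice_coefficient(alice_bits, bob_bits))          # 0.75
print(is_match(alice_bits, bob_bits, threshold=0.7))   # True
print(is_match(alice_bits, bob_bits, threshold=0.9))   # False
```

A pair that is probably the same entity but contains a typo scores below a strict threshold and is dropped, which is the cost of excluding false matches.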

## Hash and Upload

At the moment both data providers have *raw* personally identifiable information. We first have to generate CLKs from the raw entity information. Please see the [clkhash](https://clkhash.readthedocs.io/) documentation for further details on this.

In [7]:
!clkutil hash --schema "{schema.name}" "{a_csv.name}" horse staple "{a_clks.name}"
!clkutil hash --schema "{schema.name}" "{b_csv.name}" horse staple "{b_clks.name}"

generating CLKs: 100%|█| 5.00K/5.00K [00:01<00:00, 1.26Kclk/s, mean=495, std=46.1]
CLK data written to /tmp/tmpqoj38cx5.json
generating CLKs: 100%|█| 5.00K/5.00K [00:01<00:00, 1.26Kclk/s, mean=495, std=46.1]
CLK data written to /tmp/tmp3uawaaok.json
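
For intuition about what the hashing step produces, the sketch below builds a toy Bloom-filter encoding: each field is split into bigrams and every bigram sets a few bit positions chosen by a keyed hash. This is a simplified illustration, not clkhash's actual encoding; the function names, the filter size, the number of hashes and the use of HMAC-SHA256 are all assumptions.

```python
import hashlib
import hmac


def bigrams(value):
    """Split a string into overlapping 2-grams, padded with spaces."""
    padded = ' {} '.format(value)
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def toy_clk(fields, key, size=1024, hashes_per_gram=4):
    """Return the set of bit positions set by the bigrams of all fields."""
    bits = set()
    for field in fields:
        for gram in bigrams(field):
            for k in range(hashes_per_gram):
                message = '{}:{}'.format(k, gram).encode('utf-8')
                digest = hmac.new(key, message, hashlib.sha256).digest()
                bits.add(int.from_bytes(digest[:8], 'big') % size)
    return bits


# Both parties must use the same secret key to get comparable encodings.
secret = b'horse staple'
encoded = toy_clk(['courtney', 'painter', '19161214'], secret)
print(len(encoded), 'bits set out of 1024')
```

Records with similar field values share most of their bigrams and therefore most of their set bits, which is what allows the server to compare encodings without seeing the raw values.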


Now the two clients can upload their data providing the appropriate *upload tokens*. As with all commands in `clkhash` we can output help:

In [8]:
!clkutil upload --help

Usage: clkutil upload [OPTIONS] INPUT

  Upload CLK data to entity matching server.

  Given a json file containing hashed clk data as INPUT, upload to the
  entity resolution service.

  Use "-" to read from stdin.

Options:
  --mapping TEXT         Server identifier of the mapping
  --apikey TEXT          Authentication API key for the server.
  --server TEXT          Server address including protocol
  -o, --output FILENAME
  -v, --verbose          Script is more talkative
  --help                 Show this message and exit.


### Alice uploads her data

In [9]:
with NamedTemporaryFile('wt') as f:
    !clkutil upload \
        --mapping="{credentials['resource_id']}" \
        --apikey="{credentials['update_tokens'][0]}" \
        --server "{url}" \
        --output "{f.name}" \
        "{a_clks.name}"
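    # Read the upload receipt written by clkutil and close the file afterwards.
    with open(f.name) as receipt_file:
        res = json.load(receipt_file)
    alice_receipt_token = res['receipt-token']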

Uploading CLK data from /tmp/tmpqoj38cx5.json
To Entity Matching Server: https://linkage.data61.xyz
Mapping ID: 8e993a2fd2036d6a4739d5369803f3c3fae8a13ee9dc5e19
Checking server status
Status: ok
Uploading CLK data to the server


Every upload gets a receipt token. In some operating modes this receipt is required to access parts of the results.

### Bob uploads his data

In [10]:
with NamedTemporaryFile('wt') as f:
    !clkutil upload \
        --mapping="{credentials['resource_id']}" \
        --apikey="{credentials['update_tokens'][1]}" \
        --server "{url}" \
        --output "{f.name}" \
        "{b_clks.name}"
    with open(f.name) as receipt_file:
        bob_receipt_token = json.load(receipt_file)['receipt-token']

Uploading CLK data from /tmp/tmp3uawaaok.json
To Entity Matching Server: https://linkage.data61.xyz
Mapping ID: 8e993a2fd2036d6a4739d5369803f3c3fae8a13ee9dc5e19
Checking server status
Status: ok
Uploading CLK data to the server


## Results

After some delay (depending on the size of the data sets) we can fetch the mapping.
This can be done with clkutil:

    !clkutil results \
        --mapping="{credentials['resource_id']}" \
        --apikey="{credentials['result_token']}" --output results.txt
        
However for this tutorial we are going to use the Python `requests` library:

In [11]:
import requests
import json
import time

In [12]:
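# The mapping id is the `resource_id` returned when the project was created.
mapping_url = '{}/api/v1/mappings/{}'.format(url, credentials['resource_id'])
auth_headers = {'Authorization': credentials['result_token']}

# Poll until the server answers with HTTP 200, printing the progress report each time.
result = requests.get(mapping_url, headers=auth_headers)
while result.status_code != 200:
    print(result.json())
    time.sleep(0.5)
    result = requests.get(mapping_url, headers=auth_headers)

results = result.json()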

{'current': '0', 'progress': 0.0, 'message': "Mapping isn't ready.", 'total': '25000000', 'elapsed': 0.0}
{'current': '0', 'progress': 0.0, 'message': "Mapping isn't ready.", 'total': '25000000', 'elapsed': 0.0}
{'current': '0', 'progress': 0.0, 'message': "Mapping isn't ready.", 'total': '25000000', 'elapsed': 0.0}
{'current': '0', 'progress': 0.0, 'message': "Mapping isn't ready.", 'total': '25000000', 'elapsed': 0.0}
{'current': '0', 'progress': 0.0, 'message': "Mapping isn't ready.", 'total': '25000000', 'elapsed': 0.0}
{'current': '0', 'progress': 0.0, 'message': "Mapping isn't ready.", 'total': '25000000', 'elapsed': 0.0}
{'current': '0', 'progress': 0.0, 'message': "Mapping isn't ready.", 'total': '25000000', 'elapsed': 0.039098}
{'current': '0', 'progress': 0.0, 'message': "Mapping isn't ready.", 'total': '25000000', 'elapsed': 0.185084}
{'current': '0', 'progress': 0.0, 'message': "Mapping isn't ready.", 'total': '25000000', 'elapsed': 0.340995}
{'current': '0', 'progress': 0.
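
The loop above polls until the server is done, however long that takes. A hedged sketch of the same idea with a timeout is shown below. The helper `poll_until_ready` and its parameters are hypothetical, and the fake responses (including the 503 status code) only stand in for the server so the example runs without network access.

```python
import time


def poll_until_ready(fetch, interval=0.5, timeout=60.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until it returns status 200 or the timeout expires.

    fetch must return a (status_code, body) tuple.
    """
    deadline = clock() + timeout
    while True:
        status_code, body = fetch()
        if status_code == 200:
            return body
        if clock() >= deadline:
            raise TimeoutError(
                'mapping not ready after {} seconds'.format(timeout))
        sleep(interval)


# Hypothetical stand-in for the server: not ready twice, then a result.
fake_responses = iter([
    (503, {'message': "Mapping isn't ready."}),
    (503, {'message': "Mapping isn't ready."}),
    (200, {'mapping': {'1': '2750'}}),
])

results_sketch = poll_until_ready(lambda: next(fake_responses),
                                  sleep=lambda seconds: None)
print(results_sketch)
```

With `requests`, `fetch` could be a small function that performs the GET request and returns `(response.status_code, response.json())`.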

The results from the Entity Service are a mapping of str -> str. To make it easier to work with, we will create a dict of int -> int:

In [13]:
mapping = {int(k): int(results['mapping'][k]) for k in results['mapping']}
len(mapping)

2273

This mapping links rows in Alice's data set to rows in Bob's. Let's check that now: for the first few lines of Alice's data, let's see whether a corresponding entity has been found.

In [14]:
for i in range(10):
    if i in mapping:
        print(i, mapping[i])

1 2750
2 4656
3 4119
4 3306
5 2305
6 3944
7 992
8 4612


Now let's look at the raw data from Alice and Bob to see if the names match up.

In [15]:
with open(a_csv.name, 'r') as f:
    alice_raw = f.readlines()

with open(b_csv.name, 'r') as f:
    bob_raw = f.readlines()

In [16]:
alice_raw[:3]

['rec_id,given_name,surname,street_number,address_1,address_2,suburb,postcode,state,date_of_birth,soc_sec_id\n',
 'rec-1070-org,michaela,neumann,8,stanley street,miami,winston hills,4223,nsw,19151111,5304218\n',
 'rec-1016-org,courtney,painter,12,pinkerton circuit,bega flats,richlands,4560,vic,19161214,4066625\n']

Note the file includes a header row so we will add an offset of `+1` to our file lookups.

In [17]:
for alice_i in range(10):
    if alice_i in mapping:
        bob_i = mapping[alice_i]
        alice_row = alice_raw[alice_i+1].split(',')
        bob_row = bob_raw[bob_i+1].split(',')
        print(alice_i, bob_i, alice_row[0], bob_row[0], ' '.join(alice_row[1:3]).title(), ' '.join(bob_row[1:3]).title())

1 2750 rec-1016-org rec-1016-dup-0 Courtney Painter Courtney Painter
2 4656 rec-4405-org rec-4405-dup-0 Charles Green Charles Green
3 4119 rec-1288-org rec-1288-dup-0 Vanessa Parr Vanessa Parr
4 3306 rec-3585-org rec-3585-dup-0 Mikayla Malloney Mikayla Malloney
5 2305 rec-298-org rec-298-dup-0 Blake Howie Blake Howie
6 3944 rec-1985-org rec-1985-dup-0  Lund  Lund
7 992 rec-2404-org rec-2404-dup-0 Blakeston Broadby Blakeston Broadby
8 4612 rec-1473-org rec-1473-dup-0  Leslie  Leslie
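
Splitting each line on `','` works for this dataset, but it would break on quoted fields that contain commas. As an alternative sketch, the standard `csv` module can parse the rows, and skipping the header up front removes the need for the `+1` offset. The sample text below is copied from the first lines of Alice's file shown above; the helper `read_rows` is hypothetical.

```python
import csv
import io

# First three lines of Alice's CSV file, as displayed earlier.
alice_text = (
    'rec_id,given_name,surname,street_number,address_1,address_2,'
    'suburb,postcode,state,date_of_birth,soc_sec_id\n'
    'rec-1070-org,michaela,neumann,8,stanley street,miami,'
    'winston hills,4223,nsw,19151111,5304218\n'
    'rec-1016-org,courtney,painter,12,pinkerton circuit,bega flats,'
    'richlands,4560,vic,19161214,4066625\n'
)


def read_rows(text):
    """Parse CSV text and return the data rows without the header."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    return list(reader)


alice_rows = read_rows(alice_text)

# Index 1 now refers directly to the second data row, with no offset.
row = alice_rows[1]
print(row[0], ' '.join(row[1:3]).title())
```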


## Accuracy

To compute how well the matching went we will use the original index as our reference.

In [18]:
number_correct = 0
for alice_i in mapping:
    bob_i = mapping[alice_i]
    alice_row = alice_raw[alice_i+1].split(',')
    bob_row = bob_raw[bob_i+1].split(',')
    
    if alice_row[0].split('-')[1] == bob_row[0].split('-')[1]:
        number_correct += 1
    else:
        print("A false match!", alice_row[0], bob_row[0], alice_row[-1], bob_row[-1], ' '.join(alice_row[1:3]).title(), ' '.join(bob_row[1:3]).title())

In [19]:
print(number_correct)

2273
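
All 2273 reported pairs are correct, so no false matches were found. To put the missed matches into numbers as well, the sketch below computes precision and recall. It assumes that each of the 5000 records in Alice's data set has exactly one true match in Bob's, which is how the FEBRL4 dataset is documented; the helper function is hypothetical.

```python
def precision_recall(number_found, number_correct, number_true_matches):
    """Return (precision, recall) for a set of reported matches."""
    precision = number_correct / number_found if number_found else 0.0
    recall = number_correct / number_true_matches if number_true_matches else 0.0
    return precision, recall


# 2273 pairs were reported and all of them were correct.
# 5000 true matches is an assumption about the FEBRL4 dataset.
precision, recall = precision_recall(
    number_found=2273, number_correct=2273, number_true_matches=5000)

print('precision: {:.4f}'.format(precision))
print('recall:    {:.4f}'.format(recall))
```

Under that assumption the precision is 1.0 and the recall is about 0.45, which reflects the very high threshold chosen when the project was created.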
