# Tutorial: N1 Analytics hash utility


This notebook demonstrates creating a new mapping on the Entity Service and retrieving the results.
In practice the sections are run by different companies, but for illustration everything is carried out in this one file.


## Integration Authority

Creates a mapping and is given credentials.

In [2]:
!clkutil create -v --output credentials.json

Entity Matching Server: https://es.data61.xyz
Checking server status
Server Status: ok
Schema: NOT PROVIDED
Type: permutation_unencrypted_mask
Creating new mapping
Mapping created

The generated tokens can be used to upload hashed data and
fetch the resulting linkage table from the service.

To upload using the cli tool for entity A:

    clkutil hash a_people.csv A_HASHED_FILE.json
    clkutil upload --mapping="9f2ae583c5aa379f7ad41c2abbbbcdeae63940f4a269af34" --apikey="187a938e664473e93950fe60083e52f1f4a3fdd45d67d1ff"  A_HASHED_FILE.json

To upload using the cli tool for entity B:

    clkutil hash b_people.csv B_HASHED_FILE.json
    clkutil upload --mapping="9f2ae583c5aa379f7ad41c2abbbbcdeae63940f4a269af34" --apikey="12fad3c831a1ff09b83098ef401e1641081e4cfe3624a7a6" B_HASHED_FILE.json

After both users have uploaded their data one can watch for and retrieve the results:

    clkutil results -w --mapping="9f2ae583c5a

In [3]:
import json
with open('credentials.json','r') as f:
    credentials = json.load(f)
    
!cat credentials.json

{
    "result_token": "338058888d38cb3aaf071557c1507a8438fe563551563bd9",
    "resource_id": "9f2ae583c5aa379f7ad41c2abbbbcdeae63940f4a269af34",
    "update_tokens": [
        "187a938e664473e93950fe60083e52f1f4a3fdd45d67d1ff",
        "12fad3c831a1ff09b83098ef401e1641081e4cfe3624a7a6"
    ]
}
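
The credentials file maps directly onto the upload commands printed earlier: `resource_id` is the mapping ID, and each entry of `update_tokens` is one party's API key. A minimal sketch of that mapping, assuming the JSON layout shown in the output above; `upload_command` is a hypothetical helper for illustration, not part of `clkutil`:

```python
import json

# Credentials as written by `clkutil create` in the output above.
credentials = json.loads("""
{
    "result_token": "338058888d38cb3aaf071557c1507a8438fe563551563bd9",
    "resource_id": "9f2ae583c5aa379f7ad41c2abbbbcdeae63940f4a269af34",
    "update_tokens": [
        "187a938e664473e93950fe60083e52f1f4a3fdd45d67d1ff",
        "12fad3c831a1ff09b83098ef401e1641081e4cfe3624a7a6"
    ]
}
""")


def upload_command(credentials, party_index, hashed_file):
    """Hypothetical helper: build the argument list for `clkutil upload`."""
    return [
        "clkutil", "upload",
        "--mapping=" + credentials["resource_id"],
        "--apikey=" + credentials["update_tokens"][party_index],
        hashed_file,
    ]


# One command per party; each party uses its own update token.
for index, name in enumerate(["alice-hashed.json", "bob-hashed.json"]):
    print(" ".join(upload_command(credentials, index, name)))
```

Note that the `result_token` is kept separate: it is only used later, when fetching the results.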



Now we need some entity information to match. For testing purposes this tool can generate fake data:

In [4]:
# Generate some fake PII data
!clkutil generate 2000 raw_pii_2k.csv

# Split the fake PII data into somewhat overlapping alice and bob sets
!head -n 1 raw_pii_2k.csv > alice.txt
!tail -n 1500 raw_pii_2k.csv >> alice.txt
!head -n 1000 raw_pii_2k.csv > bob.txt

In [5]:
!tail -n 2 bob.txt

997,Gustav Henkhaus,1963/10/14,M
998,Scotty Mahaxay,1939/09/07,M
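
To see how much the two sets overlap, the `head`/`tail` split can be mirrored in plain Python. This is a sketch on stand-in rows, assuming the generated file has one header line followed by 2000 records with consecutive ids starting at 0 (which is consistent with the `tail` output above); `split_rows` is a hypothetical helper.

```python
def split_rows(lines, alice_tail=1500, bob_head=1000):
    """Mirror the shell split: Alice gets the header plus the last
    `alice_tail` lines; Bob gets the first `bob_head` lines (header included)."""
    alice = lines[:1] + lines[-alice_tail:]
    bob = lines[:bob_head]
    return alice, bob


# Stand-in for raw_pii_2k.csv: a header line plus 2000 records (assumed layout).
lines = ["header"] + ["{},person-{}".format(i, i) for i in range(2000)]
alice, bob = split_rows(lines)

# Records present in both files, ignoring the shared header line.
shared = set(alice[1:]) & set(bob[1:])
print(len(alice) - 1, len(bob) - 1, len(shared))
```

Under these assumptions Bob holds 999 records rather than 1000, because `head -n 1000` counts the header line.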


We have generated *raw* identity information. The help for the `upload` command shows that we first have to hash the raw entity information.

In [6]:
!clkutil upload --help

Usage: clkutil upload [OPTIONS] INPUT

  Upload CLK data to entity matching server.

  Given a json file containing hashed clk data as INPUT, upload to the
  entity resolution service.

  Use "-" to read from stdin.

Options:
  --mapping TEXT         Server identifier of the mapping
  --apikey TEXT          Authentication API key for the server.
  --server TEXT          Server address including protocol
  -o, --output FILENAME
  -v, --verbose          Script is more talkative
  --help                 Show this message and exit.


In [7]:
%%time
# Hash the data using the secret keys, which the linkage authority doesn't know
!clkutil hash alice.txt horse staple  alice-hashed.json
!clkutil hash bob.txt horse staple  bob-hashed.json

CLK data written to alice-hashed.json
CLK data written to bob-hashed.json
CPU times: user 68 ms, sys: 8 ms, total: 76 ms
Wall time: 2.98 s


In [8]:
!head -n 1 alice-hashed.json

{"clks": ["FrLb/v6P6dD5rEwj6qg/vVj3X2ZlzKEOsULAXrns/0Bu64YAd/+TzxJPy+gwZ/ZMbgisByjf7plB\nvHzYFl9WPOviEI7aQFq8plxOflNbcevNqrDxpC4viL9155apUD48wi0+M/HSvPZMqNjtLyxxT9Ea\nJ3iPF8YBp7Hp+7p2bUI=", "S+sh6Zay63SrPDrnPpjAqJgSDmQd6LY+Q0aFpihsD6iGS38xsKqarmr3pasLw+bkjqDpta6uVrV/\nsIKvXm5e63e6t35LxCqiZrc+an9s8qaCL9s+6DGra6YAPPI0LoYkjbw2M2GCLKpuaIg2Vxippqqe\nTCqAk+9rMKni5rsubDo=", "IraYJ1Pm4B2yORFGd/hfLEkXmyZ3xKEKKmbgAny0O0y7rEo3cBIyltaTqbdULLMUhxSvy1DPWgQE\nYDqnUzyQLPJxkJ5coJw5FlyS/HiOMvIRzmw08qhXzT8XI47AQBYm9mI9imEmsF+aTVqw4kVl4rGq\nenXKB14F+bC8YerLVQI=", "nrNQP68t5Tv+DZiy6xhjPdv1+mZF5LsPRG/sVxiuz4G7tbYk8023x6KHCq+QJ69MJR4wb6jXReVp\nvXtXC3opnuent4dPT1E8s/u+vHXve6NLKmPn5DVnuEdsR+6ow/c0ojqXlxleuUcw6DlnN2JrTdcO\nLrBvUncFFrm9+esfjcM=", "BpyCYuLwobmBs9I6sTybDGhS7W9yTJFd1X/Yv3wv7mFA+e4RsuOvoyaPt/dE0+pH6TG+QbGj5hnb\nY0r3xu/mJ/Y/V/3YY2g1g0NXa9eJ4avWM4Ij7C/qqfXIwtaq8hk3hiwkS6WGKVbJYHjHV/3XA4pK\nvnCg1sBCrhCw7itP7Sg=", "inaQdEMsuca/qc6CSVjmp3jNazDIUCg/nBYstlQlYc6eZMxANLAyYyuHD2IKajtADcS/RWqn/H1b\naU12Gyu0weB

In [9]:
!ls -lsh alice*

264K -rw-r--r-- 1 jovyan users 264K Aug  9 03:57 alice-hashed.json
 52K -rw-r--r-- 1 jovyan users  50K Aug  9 03:57 alice.txt


In [10]:
# Upload Alice's data
out = !clkutil upload \
    --mapping="{credentials['resource_id']}" \
    --apikey="{credentials['update_tokens'][0]}" \
    alice-hashed.json

Every upload gets a receipt token. In some operating modes this receipt is required to access the results. For ease of use, let's save it so we can use it later.

In [25]:
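# Note: judging by the execution counter, this cell appears to have been run
# after Bob's upload, so the printed `out` shows Bob's upload output.
print(out)
# Mapping ID (same value as credentials['resource_id'])
mid = "9f2ae583c5aa379f7ad41c2abbbbcdeae63940f4a269af34"
# Receipt token returned by Alice's upload
alice_receipt_token = "e7595ad375669a7cd7a199ebaf0569a263dcaca940f5371d"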

['Uploading CLK data from bob-hashed.json', 'To Entity Matching Server: https://es.data61.xyz', 'Mapping ID: 9f2ae583c5aa379f7ad41c2abbbbcdeae63940f4a269af34', 'Checking server status', 'Status: ok', 'Uploading CLK data to the server', '{', '    "receipt-token": "b5af03da6f3cb3a83816901272351f7d2fe81cce81c74815",', '    "message": "Updated"', '}', '']


In [12]:
# Upload Bob's data
out = !clkutil upload \
    --mapping="{credentials['resource_id']}" \
    --apikey="{credentials['update_tokens'][1]}" \
    bob-hashed.json
    
out

['Uploading CLK data from bob-hashed.json',
 'To Entity Matching Server: https://es.data61.xyz',
 'Mapping ID: 9f2ae583c5aa379f7ad41c2abbbbcdeae63940f4a269af34',
 'Checking server status',
 'Status: ok',
 'Uploading CLK data to the server',
 '{',
 '    "receipt-token": "b5af03da6f3cb3a83816901272351f7d2fe81cce81c74815",',
 '    "message": "Updated"',
 '}',
 '']
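
Rather than copying the receipt token by hand, it can be parsed from the captured output. A minimal sketch, assuming the output always ends with a JSON object shaped like the one shown above; `extract_receipt_token` is a hypothetical helper, not part of `clkutil`.

```python
import json


def extract_receipt_token(output_lines):
    """Parse the trailing JSON object from captured upload output."""
    start = output_lines.index("{")  # first line of the JSON object
    payload = json.loads("\n".join(output_lines[start:]))
    return payload["receipt-token"]


# Captured output in the shape shown above (leading status lines shortened).
out = [
    "Uploading CLK data to the server",
    "{",
    '    "receipt-token": "b5af03da6f3cb3a83816901272351f7d2fe81cce81c74815",',
    '    "message": "Updated"',
    "}",
    "",
]
print(extract_receipt_token(out))
```

Parsing the token immediately after each upload also avoids mixing up receipts when cells are re-run out of order.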

In [26]:
bob_receipt_token = "b5af03da6f3cb3a83816901272351f7d2fe81cce81c74815"

In [27]:
# Now after some delay (depending on the size) we can fetch the mask
!clkutil results \
    --mapping="{credentials['resource_id']}" \
    --apikey="{credentials['result_token']}" --output results.txt

Checking server status
Status: ok
Response code: 200
Received result


In [28]:
import json
with open('results.txt','r') as f:
    mask = json.load(f)['mask']

In [38]:
print(mask[:10])

[1, 1, 1, 1, 1, 0, 1, 0, 1, 0]
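
The mask holds one flag per position in the reordered data; a truthy entry marks a position where the two parties' records are linked, as the comparison at the end of this notebook shows. A short sketch of summarising it, using the ten values printed above as sample data:

```python
# First ten mask entries, copied from the output above.
mask_head = [1, 1, 1, 1, 1, 0, 1, 0, 1, 0]

# Positions flagged as matches.
matched_positions = [i for i, flag in enumerate(mask_head) if flag]
print(len(matched_positions), "of", len(mask_head), "positions flagged as matches")
print(matched_positions)
```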


In [29]:
import requests
url = 'https://es.data61.xyz/api/v1'

# Each party requests the mapping result, authenticating with its own receipt token
alice_res = requests.get('{}/mappings/{}'.format(url, mid), headers={'Authorization': alice_receipt_token}).json()
bob_res = requests.get('{}/mappings/{}'.format(url, mid), headers={'Authorization': bob_receipt_token}).json()

Now Alice and Bob both have a new permutation: a new ordering for their data.

In [31]:
alice_permutation = alice_res['permutation']
alice_permutation[:10]

[648, 147, 262, 916, 36, 189, 274, 89, 67, 0]

In [32]:
bob_permutation = bob_res['permutation']
bob_permutation[:10]

[847, 767, 749, 485, 513, 286, 141, 782, 545, 762]

In [33]:
def reorder(items, order):
    """
    Return a reordered copy of items.

    order[i] is the new index for items[i].
    """
    neworder = items.copy()
    for item, newpos in zip(items, order):
        neworder[newpos] = item

    return neworder
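
A small worked example makes the convention concrete: each entry of `order` says where the corresponding item ends up. The function is repeated here so the snippet runs on its own; the toy data is made up for illustration.

```python
def reorder(items, order):
    """Return a copy of items where items[i] is placed at index order[i]."""
    neworder = items.copy()
    for item, newpos in zip(items, order):
        neworder[newpos] = item
    return neworder


items = ["a", "b", "c", "d"]
order = [2, 0, 3, 1]  # "a" moves to index 2, "b" to 0, "c" to 3, "d" to 1

result = reorder(items, order)
print(result)  # ['b', 'd', 'a', 'c']
```

This only yields a sensible result when `order` is a true permutation of the indices; a repeated index would silently overwrite an item.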

In [34]:
with open('alice.txt', 'r') as f:
    alice_raw = f.readlines()
    alice_reordered = reorder(alice_raw, alice_permutation)

with open('bob.txt', 'r') as f:
    bob_raw = f.readlines()
    bob_reordered = reorder(bob_raw, bob_permutation)

In [35]:
alice_reordered[:10]

['508,Alysha Lesly,1920/04/10,F\n',
 '772,Addilynn Kasprowicz,1946/05/03,F\n',
 '536,Murry Cothran,1967/10/03,M\n',
 '715,Wallace Hillier,1950/08/22,M\n',
 '861,Marcie Obierne,1992/08/10,F\n',
 '1591,Donte Nuth,1971/11/20,M\n',
 '694,Joana Nesselrodt,1980/05/03,F\n',
 '1169,Woodson Clum,1938/03/20,M\n',
 '686,Allie Fludd,1920/07/04,F\n',
 '1317,Mel Neveu,1968/01/05,M\n']

In [36]:
bob_reordered[:10]

['508,Alysha Lesly,1920/04/10,F\n',
 '772,Addilynn Kasprowicz,1946/05/03,F\n',
 '536,Murry Cothran,1967/10/03,M\n',
 '715,Wallace Hillier,1950/08/22,M\n',
 '861,Marcie Obierne,1992/08/10,F\n',
 '495,Lynsey Boyda,2004/06/15,F\n',
 '694,Joana Nesselrodt,1980/05/03,F\n',
 '214,Amari Ruland,1922/04/01,M\n',
 '686,Allie Fludd,1920/07/04,F\n',
 '347,Gaynell Seedborg,1922/11/16,F\n']

In [37]:
for i, m in enumerate(mask[:20]):
    if m:
        print(alice_reordered[i].strip(), alice_reordered[i] == bob_reordered[i])

508,Alysha Lesly,1920/04/10,F True
772,Addilynn Kasprowicz,1946/05/03,F True
536,Murry Cothran,1967/10/03,M True
715,Wallace Hillier,1950/08/22,M True
861,Marcie Obierne,1992/08/10,F True
694,Joana Nesselrodt,1980/05/03,F True
686,Allie Fludd,1920/07/04,F True
842,Shona Kalathas,1999/06/27,F True
953,Bessie Moderski,1944/03/14,M True
602,Andon Quicksey,1997/12/09,M True
851,Camryn Greenstreet,1959/05/02,F True
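
The same idea extends to collecting every linked pair. A minimal sketch, assuming the two reordered lists and the mask are aligned position by position as above; `linked_pairs` is a hypothetical helper, and the sample rows are a shortened stand-in taken from positions 4 to 6 of the outputs above.

```python
def linked_pairs(alice_rows, bob_rows, mask):
    """Pair up rows at every position the mask flags as a match."""
    return [
        (a.strip(), b.strip())
        for a, b, flag in zip(alice_rows, bob_rows, mask)
        if flag
    ]


alice_rows = [
    "861,Marcie Obierne,1992/08/10,F\n",
    "1591,Donte Nuth,1971/11/20,M\n",
    "694,Joana Nesselrodt,1980/05/03,F\n",
]
bob_rows = [
    "861,Marcie Obierne,1992/08/10,F\n",
    "495,Lynsey Boyda,2004/06/15,F\n",
    "694,Joana Nesselrodt,1980/05/03,F\n",
]
mask = [1, 0, 1]

pairs = linked_pairs(alice_rows, bob_rows, mask)
for a, b in pairs:
    print(a, "<->", b)
```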
