# Tutorial: N1 Analytics hash utility

## Integration Authority

This notebook demonstrates how to create a new mapping on the entity service and how to retrieve the results.

In [1]:
!clkutil --version

clkutil, version 0.4.5


In [2]:
!clkutil create -v --output credentials.json

Entity Matching Server: https://es.data61.xyz
Checking server status
Server Status: ok
Schema: NOT PROVIDED
Type: permutation_unencrypted_mask
Creating new mapping
Mapping created

The generated tokens can be used to upload hashed data and
fetch the resulting linkage table from the service.

To upload using the cli tool for entity A:

    clkutil hash a_people.csv A_HASHED_FILE.json
    clkutil upload --mapping="784ddf405aff0671e89290879e7720760efa4e4c5020cf1a" --apikey="5428b33624db25188ab9bf9366046561915657083d815c06"  A_HASHED_FILE.json

To upload using the cli tool for entity B:

    clkutil hash b_people.csv B_HASHED_FILE.json
    clkutil upload --mapping="784ddf405aff0671e89290879e7720760efa4e4c5020cf1a" --apikey="101e2b26140a071e0c84a4529d70cd1ff43a991c6fa94683" B_HASHED_FILE.json

After both users have uploaded their data one can watch for and retrieve the results:

    clkutil results -w --mapping="784ddf405af

In [3]:
import json
with open('credentials.json','r') as f:
    credentials = json.load(f)
    
!cat credentials.json

{
    "resource_id": "784ddf405aff0671e89290879e7720760efa4e4c5020cf1a",
    "result_token": "a2ff08a76123c2d2fb0bec6f6a7adad641cb1fc8d1694a41",
    "update_tokens": [
        "5428b33624db25188ab9bf9366046561915657083d815c06",
        "101e2b26140a071e0c84a4529d70cd1ff43a991c6fa94683"
    ]
}
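
The fields in `credentials.json` line up with the options in the CLI help text above: `resource_id` is the value passed to `--mapping`, and the two `update_tokens` are the `--apikey` values for entity A and entity B respectively. The sketch below pulls them out in Python using the values shown above. The variable names are illustrative, and treating `result_token` as the key for fetching results is an assumption, since the help text above is cut off before that option.

```python
import json

# Copy of the credentials.json content shown above
credentials = json.loads("""
{
    "resource_id": "784ddf405aff0671e89290879e7720760efa4e4c5020cf1a",
    "result_token": "a2ff08a76123c2d2fb0bec6f6a7adad641cb1fc8d1694a41",
    "update_tokens": [
        "5428b33624db25188ab9bf9366046561915657083d815c06",
        "101e2b26140a071e0c84a4529d70cd1ff43a991c6fa94683"
    ]
}
""")

mapping_id = credentials["resource_id"]               # used as --mapping
alice_token, bob_token = credentials["update_tokens"]  # one upload key per party
result_token = credentials["result_token"]             # assumed: key for fetching results

print("mapping:", mapping_id)
print("alice upload key:", alice_token)
print("bob upload key:", bob_token)
```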



Now we need some entity information to match. For testing purposes the tool can generate fake data:

In [4]:
# Generate some fake PII data
!clkutil generate 2000 raw_pii_2k.csv

# Split the fake PII data into somewhat overlapping alice and bob sets
!head -n 1 raw_pii_2k.csv > alice.txt
!tail -n 1500 raw_pii_2k.csv >> alice.txt
!head -n 1000 raw_pii_2k.csv > bob.txt

In [5]:
!tail -n 2 bob.txt

997,Giovanni Longsdorf,1938/09/06,M
998,Monica Bisignano,2000/02/04,F
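
The `head`/`tail` commands above amount to simple list slicing. The sketch below reproduces the split on synthetic rows to show how much the two sets overlap. It assumes the generated file has one header line followed by 2000 records indexed from 0, which is consistent with the `tail` output above but is not confirmed here. The row contents and the `split_overlapping` helper are hypothetical.

```python
def split_overlapping(lines, alice_tail=1500, bob_head=1000):
    """Mimic the shell split: Alice gets the header plus the last rows, Bob the first rows."""
    header = lines[0]
    alice = [header] + lines[-alice_tail:]   # head -n 1 ... ; tail -n 1500 ...
    bob = lines[:bob_head]                   # head -n 1000 ... (header included)
    return alice, bob


# Synthetic stand-in for raw_pii_2k.csv: a header line plus 2000 records
lines = ["header"] + ["{},person{}".format(i, i) for i in range(2000)]

alice, bob = split_overlapping(lines)
overlap = set(alice[1:]) & set(bob[1:])

print(len(alice), len(bob), len(overlap))  # 1501 1000 499
```

Under these assumptions Alice holds records 500 to 1999 and Bob holds records 0 to 998, so 499 records appear in both sets.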


We have generated *raw* identity information. Looking at the help for the `upload` command, we see that we first have to hash the raw entity information.

In [6]:
!clkutil upload --help

Usage: clkutil upload [OPTIONS] INPUT

  Upload CLK data to entity matching server.

  Given a json file containing hashed clk data as INPUT, upload to the
  entity resolution service.

  Use "-" to read from stdin.

Options:
  --mapping TEXT         Server identifier of the mapping
  --apikey TEXT          Authentication API key for the server.
  --server TEXT          Server address including protocol
  -o, --output FILENAME
  -v, --verbose          Script is more talkative
  --help                 Show this message and exit.


In [7]:
%%time
# Hash the data using the secret keys that the linkage authority doesn't know
!clkutil hash --keys horse staple alice.txt alice-hashed.json
!clkutil hash --keys horse staple bob.txt bob-hashed.json

Error: no such option: --keys
Error: no such option: --keys
CPU times: user 20 ms, sys: 8 ms, total: 28 ms
Wall time: 1.36 s


In [16]:
!head -n 1 alice-hashed.json

["vZ7+8z1Y/efDe9fz79aeOFdhqnb2H21H7t/zTzVnKci1pJXVIejj9vMef5y9/YnMZvfbpfun/+L+\nhbt9dpvrp5ZZfd7fvvnpJVIl2aqw5OU9xlcLbuc2r06P2zd67nn/t9Lq/ezhM2V3rcT2by1++vfn\n99/9wvcXxeY408p+7vc=", "07CasVQGj05YUvCNrb7zHP4zJ/2bn4YGvtlqVknUSmvnXjFP5ph0ywM7tqpyU5LHeiV/1uAhklaz\nh/VqUgiWh0eud4k2Ij+4FspzEYPvARMBGw8TSsF7kSQWDltqU+My8dSOMxr72ftYArVLU+Rq55HQ\nCwefcnk+trsXiRIvOdQ=", "6qnVr8nXDdjnda4eBU0fhX+qaFPTqJz1RC3vxYKy9+0l+EO6/tcH8X548YCf/R5I6QcK3UaBh0rz\n64Pzi0nOph9d3Y1GxPQ73NwC0Wd6q+/cg71FDQ54n8nvTyqsLVy4tHZvaqntUeT9wIwf94vrDXle\n7iV6vvB/bvq98yHuRW4=", "6r4jFmkm51q90PMJvwbb+bJooZV5rLvWPX3vAi9VYm85Kj+y+Sh1iWM9sZU5WN8brzX6XOqXkzbz\ng2dksu+qDTcVJb6l+J+xdJ8xnyjyO+PrmZtDU6Po20yuVVtsLSm6dtJ8eBnzdfQXmvT7w5prrLvU\n542/sTl2B9M/s1pv/WU=", "WpFf7UfXC+qvnm0GHiZLe/ckcFdp6qFEG21uIiZj900jwfJT9vBWonYIoOI/9QrM8G6M38Qktoev\ngwty8/7z9QU75JKnn7v65n4kqmpoqksi+w1v4il9n3mk7i6Lm4Mqt+QfCOK94t9YV6Fq7YKeJl1r\nZ7V+ZnR6IP0fo3Yu9zY=", "1u133G2GF9ktGH8ElffPH32dwZQbRe8WSxhLFPuRf2sB/iOPZQCdTv7E3HRXeFrC82AM2kzjoQ+D\nB/d+CAamxBeuTDc7SBe2
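
Each entry in the hashed file is a base64 string that stands in for one record. To give a feel for how a keyed hash can turn identifying fields into such a string, here is a toy Bloom-filter encoding over character bigrams. This is a sketch of the general idea only, not `clkutil`'s actual implementation: the function names, the choice of HMAC-SHA1 and HMAC-SHA256, the 1024-bit filter size, and the 30 hash positions per bigram are all assumptions made for illustration.

```python
import base64
import hashlib
import hmac


def bigrams(value):
    """Split a string into overlapping two-character tokens, padded with spaces."""
    padded = " " + value + " "
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def toy_clk(fields, keys, num_bits=1024, num_hashes=30):
    """Encode a record as a base64 Bloom filter using two secret keys."""
    key1, key2 = (k.encode("utf-8") for k in keys)
    bits = bytearray(num_bits // 8)
    for field in fields:
        for gram in bigrams(field):
            data = gram.encode("utf-8")
            h1 = int.from_bytes(hmac.new(key1, data, hashlib.sha1).digest(), "big")
            h2 = int.from_bytes(hmac.new(key2, data, hashlib.sha256).digest(), "big")
            # Double hashing: derive num_hashes bit positions from the two digests
            for i in range(num_hashes):
                pos = (h1 + i * h2) % num_bits
                bits[pos // 8] |= 1 << (pos % 8)
    return base64.b64encode(bytes(bits)).decode("ascii")


record = ["Giovanni Longsdorf", "1938/09/06", "M"]
encoded = toy_clk(record, keys=("horse", "staple"))
print(len(base64.b64decode(encoded)))  # 128 bytes, i.e. 1024 bits
```

Because the bit positions depend on the secret keys, a party without the keys cannot recompute the encoding from guessed names, while two parties sharing the keys produce identical encodings for identical records.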

In [12]:
!ls -lsh alice*

 18M -rw-r--r-- 1 brian brian  18M Jan  6 15:48 alice-1M.txt
 86M -rw-r--r-- 1 brian brian  86M Jan  6 15:58 alice-hashed-1M.json
264K -rw-r--r-- 1 brian brian 264K Feb 19 20:18 alice-hashed.json
 52K -rw-r--r-- 1 brian brian  49K Feb 19 20:18 alice.txt


In [44]:
# Upload Alice's data
out = !clkutil upload \
    --mapping="0f01a73e75f5cb37b7062a7a7aa5ac06829b0ab7f8d1d333" \
    --apikey="b48513d89f55066fd15263e635ef10a3cb557c4647f6c5eb" \
    alice-hashed.json

Uploading CLK data from alice-hashed.json
To Entity Matching Server: http://es.data61.xyz
Mapping ID: 0f01a73e75f5cb37b7062a7a7aa5ac06829b0ab7f8d1d333
Checking server status
Status: ok
Uploading CLK data to the server
<html>
  <head>
    <title>Internal Server Error</title>
  </head>
  <body>
    <h1><p>Internal Server Error</p></h1>
    
  </body>
</html>


Every upload gets a receipt token. In some operating modes this receipt is required to access the results. For ease of use, let's save it so we can use it later.

In [31]:
# `out` is an IPython SList; `.s` joins the matching lines into one string
out.grep("receipt-token").s.strip().split()

#alice_receipt_token: "50d0dd8ebbce76d65bc55573f2ff8a7a4181eb4b949be695"

In [34]:
# Upload Bob's data
out = !clkutil upload \
    --mapping="0f01a73e75f5cb37b7062a7a7aa5ac06829b0ab7f8d1d333" \
    --apikey="90a6ddc17091febb86e7fb196760fd04feef137541c32e1a" \
    bob-hashed.json

['Uploading CLK data from bob-hashed.json',
 'To Entity Matching Server: http://es.data61.xyz',
 'Mapping ID: 0f01a73e75f5cb37b7062a7a7aa5ac06829b0ab7f8d1d333',
 'Checking server status',
 'Status: ok',
 'Uploading CLK data to the server',
 '<html>',
 '  <head>',
 '    <title>Internal Server Error</title>',
 '  </head>',
 '  <body>',
 '    <h1><p>Internal Server Error</p></h1>',
 '    ',
 '  </body>',
 '</html>',
 '']

In [22]:
# Now after some delay (depending on the size) we can fetch the mask
!clkutil results -w \
    --mapping="0f01a73e75f5cb37b7062a7a7aa5ac06829b0ab7f8d1d333" \
    --apikey="a3e753909b718a440cf934a1d6c7a6c61926c083c6615407" --output results.txt

Checking server status
Status: ok
Response code: 200
Received result


In [24]:
!head results.txt

{
    "mask": [
        1,
        0,
        1,
        0,
        0,
        1,
        0,
        1,
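
The result file is plain JSON with a single `mask` list, so it can be inspected with the standard library. The sketch below uses only the first eight mask values visible in the truncated output above; the real file is longer and would be read with `json.load` from `results.txt`.

```python
import json

# First eight mask values from the output above; the real file is longer
results = json.loads('{"mask": [1, 0, 1, 0, 0, 1, 0, 1]}')
mask = results["mask"]

# Positions where the mask is set mark rows flagged as matching
matched_rows = [i for i, bit in enumerate(mask) if bit]

print(matched_rows)          # [0, 2, 5, 7]
print(sum(mask), len(mask))  # 4 8
```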


In [19]:
# Assumes `url`, `id`, and both upload responses were defined earlier in the session
import requests
alice_res = requests.get('{}/mappings/{}'.format(url, id), headers={'Authorization': alice_upload_resp.json()['receipt-token']}).json()
bob_res = requests.get('{}/mappings/{}'.format(url, id), headers={'Authorization': bob_upload_resp.json()['receipt-token']}).json()

Now Alice and Bob both have a new permutation - a new ordering for their data.
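
To make the relationship between the two permutations and the mask concrete, here is one way such an output could be constructed from a list of matched row pairs. This is a hedged illustration, not the entity service's algorithm: `build_permutations` is a hypothetical helper, and it assumes both parties have the same number of rows and that each row appears in at most one match.

```python
import random


def build_permutations(n, matches, seed=0):
    """Build two permutations and a mask so that matched rows share a position.

    matches is a list of (alice_index, bob_index) pairs.
    """
    rng = random.Random(seed)
    positions = list(range(n))
    rng.shuffle(positions)
    shared, rest = positions[:len(matches)], positions[len(matches):]

    perm_a, perm_b, mask = [None] * n, [None] * n, [0] * n
    # Each matched pair is sent to the same new position, flagged in the mask
    for (ia, ib), pos in zip(matches, shared):
        perm_a[ia] = pos
        perm_b[ib] = pos
        mask[pos] = 1

    # Unmatched rows fill the remaining positions, shuffled independently
    rest_a, rest_b = rest[:], rest[:]
    rng.shuffle(rest_a)
    rng.shuffle(rest_b)
    unmatched_a = [i for i in range(n) if perm_a[i] is None]
    unmatched_b = [i for i in range(n) if perm_b[i] is None]
    for i, pos in zip(unmatched_a, rest_a):
        perm_a[i] = pos
    for i, pos in zip(unmatched_b, rest_b):
        perm_b[i] = pos
    return perm_a, perm_b, mask


perm_a, perm_b, mask = build_permutations(6, matches=[(0, 4), (3, 1)])
print(perm_a[0] == perm_b[4], perm_a[3] == perm_b[1], sum(mask))  # True True 2
```

Neither permutation on its own reveals which rows matched; only the mask, applied after reordering, identifies the aligned positions.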

In [20]:
alice_permutation = alice_res['permutation']
alice_permutation[:10]

[140, 645, 47, 687, 591, 435, 60, 880, 239, 569]

In [21]:
bob_permutation = bob_res['permutation']
bob_permutation[:10]

[140, 80, 391, 172, 90, 922, 274, 819, 816, 13]

In [22]:
def reorder(items, order):
    """
    Return a copy of items in which items[i] has been moved to index order[i].
    Assumes order is a permutation of range(len(items)).
    """
    neworder = items.copy()
    for item, newpos in zip(items, order):
        neworder[newpos] = item
    
    return neworder
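
A small worked example shows what `reorder` does. The function is repeated so the snippet runs on its own; the four-element list and the permutation are made up for illustration.

```python
def reorder(items, order):
    """Return a copy of items in which items[i] has been moved to index order[i]."""
    neworder = items.copy()
    for item, newpos in zip(items, order):
        neworder[newpos] = item
    return neworder


rows = ["a", "b", "c", "d"]
order = [2, 0, 3, 1]   # "a" goes to index 2, "b" to index 0, and so on

shuffled = reorder(rows, order)
print(shuffled)  # ['b', 'd', 'a', 'c']

# Reading the shuffled list back through the same permutation recovers the input
recovered = [shuffled[pos] for pos in order]
print(recovered)  # ['a', 'b', 'c', 'd']
```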

In [23]:
with open('alice.txt', 'r') as f:
    alice_raw = f.readlines()
    alice_reordered = reorder(alice_raw, alice_permutation)

with open('bob.txt', 'r') as f:
    bob_raw = f.readlines()
    bob_reordered = reorder(bob_raw, bob_permutation)

In [24]:
alice_reordered[:10]

['1491,Lia Lewandoski,1992/04/14,F\n',
 '1698,Meryl Wiese,1979/04/24,F\n',
 '1875,Meghan Easterling,1999/01/04,F\n',
 '983,Lyndsay Matsushima,1942/03/05,F\n',
 '535,Leone Baumgarten,1919/10/05,F\n',
 '1321,Leilani Tinkham,1976/07/17,F\n',
 '1461,Jacque Kendricks,1989/06/15,F\n',
 '1414,Shannon Mankus,2010/05/30,F\n',
 '882,Brittney Arant,1996/01/21,F\n',
 '1775,Pamella Hunsaker,1966/09/30,F\n']

In [25]:
bob_reordered[:10]

['268,Scott Bigney,1994/05/14,F\n',
 '432,Faustino Wisnosky,2011/06/05,M\n',
 '384,Sophia Viteo,1934/06/26,F\n',
 '983,Lyndsay Matsushima,1942/03/05,F\n',
 '535,Leone Baumgarten,1919/10/05,F\n',
 '93,Joette Swails,1955/06/18,F\n',
 '38,Simon Fiscalini,1922/07/09,M\n',
 '398,Cosmo Corza,1991/11/30,M\n',
 '882,Brittney Arant,1996/01/21,F\n',
 '169,Kiersten Oniell,1983/08/26,F\n']

In [26]:
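# Load the mask that `clkutil results` saved to results.txt earlier
with open('results.txt', 'r') as f:
    mask = json.load(f)['mask']
for i, m in enumerate(mask[:20]):
    if m:
        print(alice_reordered[i].strip(), alice_reordered[i] == bob_reordered[i])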

983,Lyndsay Matsushima,1942/03/05,F True
535,Leone Baumgarten,1919/10/05,F True
882,Brittney Arant,1996/01/21,F True
701,Triston Bustios,1989/01/15,M True
676,Indiana Cheaney,1979/09/22,F True
947,Elmo Vanelderen,1942/11/02,M True
570,Collier Cusack,1991/02/24,M True
772,Derwin Sigers,1921/09/29,M True
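
The check above can be wrapped in a small helper that returns the aligned pairs instead of printing them. The sketch below uses the first five reordered rows shown earlier for Alice and Bob; the mask values are inferred from the printed output (the first two flagged rows are at positions 3 and 4), and `matched_pairs` is a hypothetical helper name.

```python
def matched_pairs(mask, alice_rows, bob_rows):
    """Return (alice_row, bob_row) for every position flagged in the mask."""
    return [
        (a.strip(), b.strip())
        for bit, a, b in zip(mask, alice_rows, bob_rows)
        if bit
    ]


alice_rows = [
    '1491,Lia Lewandoski,1992/04/14,F\n',
    '1698,Meryl Wiese,1979/04/24,F\n',
    '1875,Meghan Easterling,1999/01/04,F\n',
    '983,Lyndsay Matsushima,1942/03/05,F\n',
    '535,Leone Baumgarten,1919/10/05,F\n',
]
bob_rows = [
    '268,Scott Bigney,1994/05/14,F\n',
    '432,Faustino Wisnosky,2011/06/05,M\n',
    '384,Sophia Viteo,1934/06/26,F\n',
    '983,Lyndsay Matsushima,1942/03/05,F\n',
    '535,Leone Baumgarten,1919/10/05,F\n',
]
mask = [0, 0, 0, 1, 1]  # inferred from the printed matches above

pairs = matched_pairs(mask, alice_rows, bob_rows)
agreement = sum(a == b for a, b in pairs) / len(pairs)

print(len(pairs), agreement)  # 2 1.0
```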
