# Tutorial: N1 Analytics hash utility

## Integration Authority

This notebook demonstrates how to create a new permutation mapping on the entity service with our command line tool, and how to retrieve the resulting mask.

In [1]:
!clkutil --version

clkutil, version 0.4.5


In [2]:
!clkutil create -v --output credentials.json

Entity Matching Server: https://es.data61.xyz
Checking server status
Server Status: ok
Schema: NOT PROVIDED
Type: permutation_unencrypted_mask
Creating new mapping
Mapping created

The generated tokens can be used to upload hashed data and
fetch the resulting linkage table from the service.

To upload using the cli tool for entity A:

    clkutil hash a_people.csv A_HASHED_FILE.json
    clkutil upload --mapping="9f942ffdf20a999bf7255a2111095c0d5aabe6a34d0a11e8" --apikey="92df0b3a4799c1bd4b17e77975ddcd140e9de3004ae12061"  A_HASHED_FILE.json

To upload using the cli tool for entity B:

    clkutil hash b_people.csv B_HASHED_FILE.json
    clkutil upload --mapping="9f942ffdf20a999bf7255a2111095c0d5aabe6a34d0a11e8" --apikey="9491202b7528fc75b2d066bf3cdc35998abf1dcea5abe8a9" B_HASHED_FILE.json

After both users have uploaded their data one can watch for and retrieve the results:

    clkutil results -w --mapping="9f942ffdf20

In [3]:
import json
with open('credentials.json','r') as f:
    credentials = json.load(f)
    
!cat credentials.json

{
    "resource_id": "9f942ffdf20a999bf7255a2111095c0d5aabe6a34d0a11e8",
    "result_token": "c3108dfde89890e7bf23cbd21ffc153eebc441c88fcbfa44",
    "update_tokens": [
        "92df0b3a4799c1bd4b17e77975ddcd140e9de3004ae12061",
        "9491202b7528fc75b2d066bf3cdc35998abf1dcea5abe8a9"
    ]
}
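
Before handing any tokens out, the structure of `credentials.json` can be sanity-checked. The following is a minimal sketch: `check_credentials` is a hypothetical helper, not part of clkutil, and it assumes exactly two update tokens, as in this two-party demo.

```python
import json

def check_credentials(credentials):
    """Return (resource_id, result_token, update_tokens) or raise ValueError."""
    for key in ("resource_id", "result_token", "update_tokens"):
        if key not in credentials:
            raise ValueError("missing key: " + key)
    tokens = credentials["update_tokens"]
    # Assumption: one update token per data provider, two providers in total.
    if len(tokens) != 2:
        raise ValueError("expected 2 update tokens, got %d" % len(tokens))
    return credentials["resource_id"], credentials["result_token"], tokens

# The file contents printed above, reused as an inline example.
example = json.loads("""
{
    "resource_id": "9f942ffdf20a999bf7255a2111095c0d5aabe6a34d0a11e8",
    "result_token": "c3108dfde89890e7bf23cbd21ffc153eebc441c88fcbfa44",
    "update_tokens": [
        "92df0b3a4799c1bd4b17e77975ddcd140e9de3004ae12061",
        "9491202b7528fc75b2d066bf3cdc35998abf1dcea5abe8a9"
    ]
}
""")

resource_id, result_token, update_tokens = check_credentials(example)
```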



For this demo we generate some fake PII (with overlapping records) for Alice and Bob.

In [4]:
# Generate some fake PII data
!clkutil generate 2000 raw_pii_2k.csv

# Split the fake PII data into somewhat overlapping alice and bob sets
!head -n 1 raw_pii_2k.csv > alice.txt
!tail -n 1500 raw_pii_2k.csv >> alice.txt
!head -n 1000 raw_pii_2k.csv > bob.txt

!rm raw_pii_2k.csv

In [5]:
!tail -n 2 bob.txt

997,Shenna Gaitor,2011/03/29,F
998,Roscoe Zielinski,1947/03/14,M
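
The `head`/`tail` split above determines how many records Alice and Bob share. As a rough check, the sketch below reproduces the split on row indices only. It assumes `raw_pii_2k.csv` holds one header line followed by 2000 rows indexed 0 to 1999, which is consistent with Bob's last row being 998.

```python
def split_rows(n_rows=2000, alice_tail=1500, bob_head=1000):
    """Mimic the head/tail commands on a header line plus n_rows row indices."""
    lines = ["header"] + list(range(n_rows))
    alice = lines[:1] + lines[-alice_tail:]  # head -n 1, then tail -n 1500
    bob = lines[:bob_head]                   # head -n 1000: header + 999 rows
    # Drop the header so that only row indices remain.
    return alice[1:], bob[1:]

alice_rows, bob_rows = split_rows()
overlap = set(alice_rows) & set(bob_rows)
print(len(alice_rows), len(bob_rows), len(overlap))  # prints: 1500 999 499
```

Under these assumptions the shared records are rows 500 to 998, so roughly half of Bob's data should also appear in Alice's.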


Now let's create a credentials file for each of Alice and Bob so that they can upload hashed data to the server.

In [6]:
alice_credentials = credentials['resource_id'] + ' ' + credentials['update_tokens'][0]
bob_credentials = credentials['resource_id'] + ' ' + credentials['update_tokens'][1]

with open('alice-credentials.txt','wt') as f:
    f.write(alice_credentials)
    
with open('bob-credentials.txt','wt') as f:
    f.write(bob_credentials)

In [7]:
!cat alice-credentials.txt

9f942ffdf20a999bf7255a2111095c0d5aabe6a34d0a11e8 92df0b3a4799c1bd4b17e77975ddcd140e9de3004ae12061

This is all the information we share with Alice; Bob receives the equivalent file containing his own token. Alice and Bob will have privately agreed on a secret for hashing their data.
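
Each credentials file is simply the mapping ID and one upload token separated by a space. A data provider could split it back into its parts as sketched here; `parse_credentials` is a hypothetical helper for illustration, not a clkutil function.

```python
def parse_credentials(text):
    """Split 'MAPPING_ID TOKEN' into a dict with 'mapping' and 'apikey' keys."""
    parts = text.split()
    if len(parts) != 2:
        raise ValueError("expected '<mapping id> <token>', got %d fields" % len(parts))
    mapping_id, apikey = parts
    return {"mapping": mapping_id, "apikey": apikey}

# Contents of alice-credentials.txt as printed above.
alice = parse_credentials(
    "9f942ffdf20a999bf7255a2111095c0d5aabe6a34d0a11e8 "
    "92df0b3a4799c1bd4b17e77975ddcd140e9de3004ae12061"
)
print(alice["mapping"])
print(alice["apikey"])
```

The two values correspond to the `--mapping` and `--apikey` options in the upload command printed by `clkutil create`.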

In [8]:
# We can check to see if there is a result (which there won't be)
mid = credentials['resource_id']
token = credentials['result_token']

!clkutil results --mapping="$mid" --apikey="$token"

Checking server status
Status: ok
Response code: 503
No result yet
{
    "current": "0",
    "elapsed": 0.0,
    "message": "Mapping isn't ready.",
    "progress": 0.0,
    "total": "NA"
}
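
A 503 response with the message "Mapping isn't ready." means the result has to be requested again later (the `clkutil create` output also mentions a `-w` option for watching for results). The loop below sketches that retry logic in Python. The `fetch` callable is a hypothetical stand-in that returns a `(status_code, body)` pair; it is not a clkutil or entity service API, and the status codes handled are simply those seen in this notebook.

```python
import time

def wait_for_result(fetch, attempts=10, delay=1.0, sleep=time.sleep):
    """Call fetch() until it reports 200, retrying while it reports 503."""
    for attempt in range(attempts):
        status, body = fetch()
        if status == 200:
            return body
        if status != 503:
            raise RuntimeError("unexpected response code: %d" % status)
        if attempt < attempts - 1:
            sleep(delay)  # wait before asking again
    raise TimeoutError("no result after %d attempts" % attempts)

# Fake server replies that mimic the outputs in this notebook:
# two "not ready" answers followed by a result.
responses = iter([
    (503, {"message": "Mapping isn't ready."}),
    (503, {"message": "Mapping isn't ready."}),
    (200, {"mask": [0, 0, 0, 1, 1, 1, 0, 0]}),
])

result = wait_for_result(lambda: next(responses), delay=0, sleep=lambda s: None)
print(result["mask"])
```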


## Switch to Data Providers (Alice & Bob)

Once both participants have uploaded their data and the server has had time to compute the result (how long depends on the size of the data), we can fetch the mask:

In [12]:

!clkutil results --mapping="$mid" --apikey="$token" --output results.txt


Checking server status
Status: ok
Response code: 200
Received result


In [13]:
!head results.txt

{
    "mask": [
        0,
        0,
        0,
        1,
        1,
        1,
        0,
        0,


In [14]:
import json

# Load the mask, closing the file once it has been read
with open('results.txt') as f:
    mask = json.load(f)['mask']

# Fin

Now, just to check our results, let's break the illusion, bring everything back together, and see whether the records line up.

Alice and Bob each have a new permutation, that is, a new ordering of their data.

In [15]:
with open('alice-reordered.txt', 'rt') as f:
    alice_reordered = f.readlines()
alice_reordered[:10]

['1865,Durward Iverslie,2007/01/15,M\n',
 '1985,Mark Bedson,1966/03/10,F\n',
 '1886,Brantlee Gislason,1995/08/29,M\n',
 '867,Braulio Peinado,1950/06/12,M\n',
 '767,Bernice Cabellero,1930/06/30,F\n',
 '806,Milo Durling,1920/07/11,M\n',
 '1649,Kya Candill,1960/05/27,F\n',
 '1822,Mardell Becknell,1918/03/27,F\n',
 '572,Blair Roewe,1969/03/29,F\n',
 '979,Todd Torian,1917/01/14,M\n']

In [16]:
with open('bob-reordered.txt', 'rt') as f:
    bob_reordered = f.readlines()
bob_reordered[:10]

['155,Azariah Serasio,1921/11/16,M\n',
 '492,Deidra Minniti,2015/01/16,F\n',
 '52,Alida Frankl,2002/08/04,F\n',
 '867,Braulio Peinado,1950/06/12,M\n',
 '767,Bernice Cabellero,1930/06/30,F\n',
 '806,Milo Durling,1920/07/11,M\n',
 '370,Rhoda Shotwell,1987/10/25,F\n',
 '72,Cassandra Shufford,1945/09/03,F\n',
 '572,Blair Roewe,1969/03/29,F\n',
 '979,Todd Torian,1917/01/14,M\n']

The mask is required to reveal where the entities line up:

In [17]:
for i, m in enumerate(mask[:30]):
    if m:
        print(alice_reordered[i].strip(), alice_reordered[i] == bob_reordered[i])

867,Braulio Peinado,1950/06/12,M True
767,Bernice Cabellero,1930/06/30,F True
806,Milo Durling,1920/07/11,M True
572,Blair Roewe,1969/03/29,F True
979,Todd Torian,1917/01/14,M True
578,Jodi Kazmi,1936/09/28,F True
592,Edith Ratel,1932/03/11,M True
958,Trevin Sininger,1995/06/20,M True
945,Gustavo Fusha,1954/10/13,M True
726,Ariella Bergami,1955/11/06,F True
718,Rheta Cassara,1938/08/30,F True
567,Joaquin Eguia,1923/09/09,M True
935,Babyboy Moyler,1931/08/08,M True
787,Claudius Traux,1983/02/27,M True
872,Myrtie Mcteer,2010/12/23,F True
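
The check in the last cell can be wrapped in a small function that returns the aligned pairs, which makes it easy to flag any masked position where the two rows differ. This sketch uses the first ten reordered rows shown above. The mask values for positions 8 and 9 are inferred from the printed matches rather than read from `results.txt`, and `aligned_pairs` is a hypothetical helper.

```python
def aligned_pairs(mask, a_rows, b_rows):
    """Yield (index, a_row, b_row) for every position where the mask is set."""
    for i, (m, a, b) in enumerate(zip(mask, a_rows, b_rows)):
        if m:
            yield i, a.strip(), b.strip()

alice_head = [
    "1865,Durward Iverslie,2007/01/15,M", "1985,Mark Bedson,1966/03/10,F",
    "1886,Brantlee Gislason,1995/08/29,M", "867,Braulio Peinado,1950/06/12,M",
    "767,Bernice Cabellero,1930/06/30,F", "806,Milo Durling,1920/07/11,M",
    "1649,Kya Candill,1960/05/27,F", "1822,Mardell Becknell,1918/03/27,F",
    "572,Blair Roewe,1969/03/29,F", "979,Todd Torian,1917/01/14,M",
]
bob_head = [
    "155,Azariah Serasio,1921/11/16,M", "492,Deidra Minniti,2015/01/16,F",
    "52,Alida Frankl,2002/08/04,F", "867,Braulio Peinado,1950/06/12,M",
    "767,Bernice Cabellero,1930/06/30,F", "806,Milo Durling,1920/07/11,M",
    "370,Rhoda Shotwell,1987/10/25,F", "72,Cassandra Shufford,1945/09/03,F",
    "572,Blair Roewe,1969/03/29,F", "979,Todd Torian,1917/01/14,M",
]
# First eight values as printed by `head results.txt`; last two inferred.
mask_head = [0, 0, 0, 1, 1, 1, 0, 0, 1, 1]

pairs = list(aligned_pairs(mask_head, alice_head, bob_head))
mismatches = [i for i, a, b in pairs if a != b]
print(len(pairs), mismatches)  # prints: 5 []
```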
