# Tutorial: N1 Analytics hash utility

## First data provider (Alice)

This notebook demonstrates local hashing of personally identifiable information (PII), upload to the entity service, and retrieval of the results.

In [1]:
!clkutil --version

clkutil, version 0.4.5


In [2]:
# Our data is already in our local directory...
!ls alice*

alice-credentials.txt  alice-hashed.json  alice.txt


In [3]:
!head alice.txt

INDEX,NAME freetext,DOB YYYY/MM/DD,GENDER M or F
500,Loran Urik,1949/09/03,M
501,Arch Kaanana,1919/07/09,M
502,Ewald Ronda,1972/02/12,M
503,Tavian George,1958/04/25,M
504,Elliott Palmieri,1958/06/05,M
505,Lyda Chesson,1979/03/12,F
506,King Saran,1990/08/21,M
507,Lacy Motz,1985/11/06,F
508,Syreeta Mieszala,2012/10/05,F


## Step 1 - Locally hash PII data

First we need to hash the PII file. To do that, the two data providers must agree on a shared secret, which the data linkage authority must never learn. Two words will do; here I'll use the name of a fish, `"Smooth Oreo"`.

<img src="http://www.foxtrade.lv/assets/Uploads/_resampled/SetRatioSize350350-zeus-faber-sw.png"/>
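To see why the shared secret matters, here is a toy sketch of keyed Bloom filter hashing. It is an illustration only and is not the encoding that `clkutil` uses: the helper names `bigrams` and `toy_clk`, the 64-bit filter size and the two hash rounds are all assumptions made for this example. The idea it shows is that the same name and the same secret always set the same bits, while someone without the secret cannot recompute them.

```python
import hashlib
import hmac


def bigrams(text):
    """Split a padded, lower-cased string into overlapping character pairs."""
    padded = f" {text.lower()} "
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def toy_clk(text, secret, size=64, num_hashes=2):
    """Toy keyed Bloom filter (hypothetical, NOT clkutil's real algorithm).

    Each bigram is hashed `num_hashes` times with HMAC-SHA256 keyed by the
    shared secret, and the matching bit positions are switched on.
    """
    bits = [0] * size
    for gram in bigrams(text):
        for k in range(num_hashes):
            digest = hmac.new(
                secret.encode(), f"{k}:{gram}".encode(), hashlib.sha256
            ).digest()
            position = int.from_bytes(digest[:4], "big") % size
            bits[position] = 1
    return bits


alice_bits = toy_clk("Loran Urik", "smooth oreo")
bob_bits = toy_clk("Loran Urik", "smooth oreo")
print("same name, same secret ->", alice_bits == bob_bits)
print("bits set:", sum(alice_bits), "of", len(alice_bits))
```

Because both providers use the same keys, matching records produce comparable filters, which is what lets the linkage authority compare them without seeing the underlying names.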

In [4]:
!clkutil hash --help

Usage: clkutil hash [OPTIONS] INPUT KEYS... OUTPUT

  Process data to create CLKs

  Given a file containing csv data as INPUT, and optionally a json document
  defining the expected schema, verify the schema, then hash the data to
  create CLKs writing to OUTPUT. Note the CSV file should contain a header
  row - however this row is not used by this tool.

  It is important that the keys are only known by the two data providers.
  Two words should be provided. For example:

  $clkutil hash input.txt horse staple output.txt

  Use "-" to output to stdout.

Options:
  -s, --schema FILENAME
  --help                 Show this message and exit.


In [5]:
%%time
# Hash the data using the secret keys, which the linkage authority doesn't know
!clkutil hash alice.txt smooth oreo alice-hashed.json

Assuming default schema
Hashing data
Header Row: INDEX,NAME freetext,DOB YYYY/MM/DD,GENDER M or F
CLK data written to alice-hashed.json
CPU times: user 32 ms, sys: 4 ms, total: 36 ms
Wall time: 1.58 s


Let's take a sneak peek at the hashed data to convince ourselves that the created file isn't obviously full of PII.

In [6]:
open('alice-hashed.json').read(200)

'{"clks": ["9yBXbGFLdoFeMMMjexDiucYPZpngbHAV4QVMvgXSbxsn4NVjPNJPrCEk8YCFMfQKZsleJJcg8RTQ\\nfdFRdFBxYVFzxEnpREpGlKtkUBJpQqSh5ks3YynDGCg3WJYLVnNGI5RlZxBE8YetCnoqSRR0KBQ7\\nTwY0AUCuXJhk7FKkyCA=", "zytfDFAjU'

These "clks" are the *cryptographic longterm keys* (CLKs), sometimes referred to as Bloom filter hashes.
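Judging by the peek above, each entry looks like a base64 string with embedded newlines. Assuming that is the case, a minimal sketch for inspecting one CLK is to decode it and count how many bits are set. The helper name `clk_bit_stats` is made up for this example, and the sample value is built locally rather than taken from `alice-hashed.json`.

```python
import base64


def clk_bit_stats(encoded_clk):
    """Return (set_bits, total_bits) for one base64-encoded bit array."""
    # b64decode skips characters outside the base64 alphabet by default,
    # so the embedded newlines do not need to be stripped first.
    raw = base64.b64decode(encoded_clk)
    set_bits = sum(bin(byte).count("1") for byte in raw)
    return set_bits, len(raw) * 8


# A small stand-in value: three bytes with 2, 8 and 0 bits set.
sample = base64.encodebytes(bytes([0b10100000, 0xFF, 0x00])).decode()
print(clk_bit_stats(sample))  # -> (10, 24)
```

To try it on the real file, something like `clk_bit_stats(json.load(open('alice-hashed.json'))['clks'][0])` should work under the same assumption about the encoding.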


## Step 2 - Upload
Next we can upload this hashed data to the entity linkage service:

In [42]:
!clkutil upload --help

Usage: clkutil upload [OPTIONS] INPUT

  Upload CLK data to entity matching server.

  Given a json file containing hashed clk data as INPUT, upload to the
  entity resolution service.

  Use "-" to read from stdin.

Options:
  --mapping TEXT  Server identifier of the mapping
  --apikey TEXT   Authentication API key for the server.
  --server TEXT   Server address including protocol
  --help          Show this message and exit.


Looks like we need some authentication information from the linkage authority.

In [7]:
# Securely provided by the data linkage authority:
with open('alice-credentials.txt','r') as f:
    linkage_id, provider_token = f.read().split()

linkage_id, provider_token

('9f942ffdf20a999bf7255a2111095c0d5aabe6a34d0a11e8',
 '92df0b3a4799c1bd4b17e77975ddcd140e9de3004ae12061')

In [8]:
# Upload the data
out = !clkutil upload \
    --mapping="$linkage_id" \
    --apikey="$provider_token" \
    alice-hashed.json

Every upload gets a receipt token. In some operating modes this receipt is required to access the results. For ease of use let's save this so we can use it later.

In [9]:
# Pull out the receipt token
receipt_token = out.grep("receipt-token")[0].strip().split('"receipt-token": ')[1].strip('"')

In [10]:
receipt_token

'ce81211a0706f23628c0470a466ac38ef3698123962bf713'
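The chained `grep`/`split`/`strip` call works but is easy to break. As a sketch of a slightly sturdier alternative, the token can be pulled out with a regular expression. The helper name `extract_receipt_token` is hypothetical, and the sample lines below only assume that the upload output contains a line of the form `"receipt-token": "<value>"`, which is what the cell above searches for.

```python
import re


def extract_receipt_token(lines):
    """Find the first "receipt-token" value in a list of output lines."""
    pattern = re.compile(r'"receipt-token":\s*"([^"]+)"')
    for line in lines:
        match = pattern.search(line)
        if match:
            return match.group(1)
    raise ValueError("no receipt-token found in the upload output")


# Assumed layout of the relevant part of the upload output (illustrative).
sample_output = [
    "{",
    '    "receipt-token": "ce81211a0706f23628c0470a466ac38ef3698123962bf713"',
    "}",
]
print(extract_receipt_token(sample_output))
```

In the notebook this would be called as `extract_receipt_token(out)`, and it raises a clear error instead of an `IndexError` if the upload failed.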

Now we can check whether the results are ready. They won't be yet, because Bob hasn't uploaded his data.

In [11]:
!clkutil results \
    --mapping="$linkage_id" \
    --apikey="$receipt_token"

Checking server status
Status: ok
Response code: 503
No result yet
{
    "current": "0",
    "elapsed": 0.0,
    "message": "Mapping isn't ready.",
    "progress": 0.0,
    "total": "NA"
}


Now Bob has to do his part too! Afterwards we can come back to look at the results.
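Rather than re-running the cell by hand, the check could be wrapped in a small polling loop. The sketch below is generic: `wait_for_result` and its `fetch` argument are hypothetical names, and `fetch` stands for any function that returns a `(status_code, body)` pair. It relies only on what the output above shows, namely that 503 means "not ready yet" and 200 means the result has arrived.

```python
import time


def wait_for_result(fetch, interval=5.0, timeout=60.0, sleep=time.sleep):
    """Call fetch() until it reports status 200, then return the body.

    A 503 response is treated as "not ready yet"; any other status is an
    error. Gives up with TimeoutError after roughly `timeout` seconds.
    """
    waited = 0.0
    while True:
        status, body = fetch()
        if status == 200:
            return body
        if status != 503:
            raise RuntimeError(f"unexpected response code {status}")
        if waited >= timeout:
            raise TimeoutError("result was not ready in time")
        sleep(interval)
        waited += interval


# Simulated server: not ready twice, then a (made-up) result.
responses = iter([(503, None), (503, None), (200, {"permutation": [1, 0]})])
result = wait_for_result(lambda: next(responses), sleep=lambda seconds: None)
print(result)
```

In practice `fetch` would wrap whatever call retrieves the results, for example a request to the entity service or a subprocess running `clkutil results`.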

In [12]:
!clkutil results \
    --mapping="$linkage_id" \
    --apikey="$receipt_token" --output="alice-results.txt"

Checking server status
Status: ok
Response code: 200
Received result


In [13]:
import json
with open('alice-results.txt','r') as f:
    alice_res = json.load(f)

The result is a permutation: a new ordering for our data.

In [14]:
alice_permutation = alice_res['permutation']
alice_permutation[:10]

[625, 698, 390, 743, 671, 385, 288, 525, 579, 379]

We can reorder our local data with this new permutation.

In [15]:
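def reorder(items, order):
    """Return a copy of items where items[i] has moved to position order[i]."""
    neworder = items.copy()
    for item, newpos in zip(items, order):
        # order[i] is the destination index of the i-th item
        neworder[newpos] = item

    return neworder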
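A tiny worked example may make the direction of the permutation clearer. The function is repeated here so the snippet runs on its own, and the four-element list and permutation are made up for illustration.

```python
def reorder(items, order):
    """Move items[i] to position order[i] in a new list."""
    neworder = list(items)
    for item, newpos in zip(items, order):
        neworder[newpos] = item
    return neworder


rows = ["a", "b", "c", "d"]
permutation = [2, 0, 3, 1]  # "a" goes to slot 2, "b" to slot 0, and so on

shuffled = reorder(rows, permutation)
print(shuffled)  # -> ['b', 'd', 'a', 'c']

# Every original row can be found again at the index the permutation names.
print(all(shuffled[permutation[i]] == rows[i] for i in range(len(rows))))  # -> True
```

Note that the permutation gives destinations, not sources: `permutation[i]` says where row `i` ends up, not which row lands in slot `i`.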

In [16]:
with open('alice.txt', 'r') as f:
    alice_raw = f.readlines()

alice_reordered = reorder(alice_raw, alice_permutation)

with open('alice-reordered.txt', 'wt') as f:
    f.writelines(alice_reordered)

In [17]:
alice_reordered[:10]

['1865,Durward Iverslie,2007/01/15,M\n',
 '1985,Mark Bedson,1966/03/10,F\n',
 '1886,Brantlee Gislason,1995/08/29,M\n',
 '867,Braulio Peinado,1950/06/12,M\n',
 '767,Bernice Cabellero,1930/06/30,F\n',
 '806,Milo Durling,1920/07/11,M\n',
 '1649,Kya Candill,1960/05/27,F\n',
 '1822,Mardell Becknell,1918/03/27,F\n',
 '572,Blair Roewe,1969/03/29,F\n',
 '979,Todd Torian,1917/01/14,M\n']

Note that neither Alice nor Bob knows which of these reordered rows line up with the other party's entities, because the mask is held by the linkage authority.