In [1]:
import csv
import json

import pandas as pd

SERVER = 'http://testing.es.data61.xyz'
KEY1 = 'correct'
KEY2 = 'horse'

# Scenario

We have three parties: Alice, Bob, and Charlie. Each holds a dataset of about 3200 records, and the datasets overlap only partially. The common features are given name, surname, date of birth, and phone number.

Alice has a person's gender, Bob has their city, and Charlie has their income. They wish to create a table for analysis: each row has a gender, city, and income, but they don't need any other information. They can use Anonlink to do this in a privacy-preserving way (without revealing given names, surnames, dates of birth, and phone numbers).

## Alice, Bob, and Charlie: agree on keys and a schema

They keep the keys to themselves, but the schema may be revealed to the analyst.

In [2]:
print(f'keys: {KEY1}, {KEY2}')

keys: correct, horse


In [3]:
with open('data/schema.json') as f:
    print(f.read())


{
  "version": 1,
  "clkConfig": {
    "l": 1024,
    "k": 15,
    "hash": {
      "type": "doubleHash"
    },
    "kdf": {
      "type": "HKDF",
      "hash": "SHA256",
      "salt": "SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==",
      "info": "c2NoZW1hX2V4YW1wbGU=",
      "keySize": 64
    }
  },
  "features": [
    {
      "identifier": "id",
      "format": {
        "type": "string",
        "encoding": "utf-8"
      },
      "hashing": {
        "ngram": 1,
        "weight": 0
      }
    },
    {
      "identifier": "givenname",
      "format": {
        "type": "string",
        "encoding": "utf-8"
      },
      "hashing": {
        "ngram": 2,
        "positional": false,
        "weight": 1
      }
    },
    {
      "identifier": "surname",
      "format": {
        "type": "string",
        "encoding": "utf-8"
      },
      "hashing": {
        "ngram": 2,
        "positional": false,
        "weight": 1
      }
    },
    {
 

# Sneak peek at input data
### Alice

In [4]:
pd.read_csv('data/dataset-alice.csv').head()

Unnamed: 0,id,givenname,surname,dob,phone number,gender
0,0,tara,hilton,27-08-1941,08 2210 0298,male
1,3,saJi,vernre,22-12-2972,02 1090 1906,mals
2,7,sliver,paciorek,,,mals
3,9,ruby,george,09-05-1939,07 4698 6255,male
4,10,eyrinm,campbell,29-1q-1983,08 299y 1535,male


### Bob

In [5]:
pd.read_csv('data/dataset-bob.csv').head()

Unnamed: 0,id,givenname,surname,dob,phone number,city
0,3,zali,verner,22-12-1972,02 1090 1906,perth
1,4,samuel,tremellen,21-12-1923,03 3605 9336,melbourne
2,5,amy,lodge,16-01-1958,07 8286 9372,canberra
3,7,oIji,pacioerk,10-02-1959,04 4220 5949,sydney
4,10,erin,kampgell,29-12-1983,08 2996 1445,perth


### Charlie

In [6]:
pd.read_csv('data/dataset-charlie.csv').head()

Unnamed: 0,id,givenname,surname,dob,phone number,income
0,1,joshua,arkwright,16-02-1903,04 8511 9580,70189.446
1,3,zal:,verner,22-12-1972,02 1090 1906,50194.118
2,7,oliyer,paciorwk,10-02-1959,04 4210 5949,31750.993
3,8,nacoya,ranson,17-08-1925,07 6033 4580,102446.131
4,10,erih,campbell,29-12-1i83,08 299t 1435,331476.599


## Analyst: create the project

The analyst keeps the result token to themselves. The three update tokens go to Alice, Bob and Charlie. The project ID is known by everyone.
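The token distribution described above can be sketched as a small helper. This is only an illustration: it assumes the credentials file has the `project_id`, `result_token`, and `update_tokens` keys that the following cell reads, and the function name `split_credentials` and the party labels are hypothetical, not part of `clkutil`.

```python
def split_credentials(credentials, data_providers=('alice', 'bob', 'charlie')):
    """Return what each party should receive (hypothetical helper)."""
    project_id = credentials['project_id']
    # The analyst is the only party that sees the result token.
    views = {
        'analyst': {'project_id': project_id,
                    'result_token': credentials['result_token']},
    }
    # Each data provider gets the shared project ID and one update token.
    for name, token in zip(data_providers, credentials['update_tokens']):
        views[name] = {'project_id': project_id, 'update_token': token}
    return views


# Made-up placeholder values, not real tokens.
example = {'project_id': 'p', 'result_token': 'r', 'update_tokens': ['a', 'b', 'c']}
views = split_credentials(example)
```

Keeping the views separate makes it harder to hand a data provider the result token by accident.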

In [7]:
!clkutil create-project --server $SERVER --type groups --schema data/schema.json --parties 3 --output credentials.json

with open('credentials.json') as f:
    credentials = json.load(f)
    project_id = credentials['project_id']
    result_token = credentials['result_token']
    update_token_alice = credentials['update_tokens'][0]
    update_token_bob = credentials['update_tokens'][1]
    update_token_charlie = credentials['update_tokens'][2]

Project created


## Alice: hash the data and upload it to the server
The data is hashed according to the schema and the keys. Alice's update token is needed to upload the hashed data. No PII is uploaded to the service, only the hashes.
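To give a feel for what "hashed according to the schema and the keys" means, here is a deliberately simplified sketch of a keyed Bloom filter encoding with bigrams and double hashing. It is not the code `clkutil` runs: the function names are hypothetical, the space padding and the HMAC-SHA1/HMAC-MD5 pairing are assumptions, and the key derivation (the `kdf` section of the schema) and per-feature weights are omitted. Only `l=1024`, `k=15`, and the 2-gram size come from the schema above.

```python
import hashlib
import hmac


def bigrams(value):
    """Split a string into overlapping 2-grams, padding both ends with a space."""
    padded = f' {value} '
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def toy_bloom_encode(values, key1, key2, l=1024, k=15):
    """Return the set of bit positions set for the given field values.

    Each bigram sets k positions via double hashing: (h1 + i * h2) mod l.
    """
    bits = set()
    for value in values:
        for gram in bigrams(value):
            data = gram.encode('utf-8')
            h1 = int.from_bytes(hmac.new(key1, data, hashlib.sha1).digest(), 'big')
            h2 = int.from_bytes(hmac.new(key2, data, hashlib.md5).digest(), 'big')
            for i in range(k):
                bits.add((h1 + i * h2) % l)
    return bits


# Two spellings of the same person, taken from the sample rows above.
alice_bits = toy_bloom_encode(['erin', 'campbell'], b'correct', b'horse')
charlie_bits = toy_bloom_encode(['erih', 'campbell'], b'correct', b'horse')
```

Because both records share most of their bigrams, their encodings share most of their set bits, which is what lets the service match records despite typos without seeing the names.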

In [8]:
!clkutil hash data/dataset-alice.csv $KEY1 $KEY2 data/schema.json dataset-alice-hashed.json --check-header false

generating CLKs: 100%|█| 3.23k/3.23k [00:00<00:00, 5.47kclk/s, mean=373, std=34.9]
CLK data written to dataset-alice-hashed.json


In [9]:
!clkutil upload --server $SERVER --apikey $update_token_alice --project $project_id dataset-alice-hashed.json

{"message": "Updated", "receipt_token": "cb50b0244d6d20879be4108fb1e8914821ebf508f8629ca1"}

## Bob: hash the data and upload it to the server

In [10]:
!clkutil hash data/dataset-bob.csv $KEY1 $KEY2 data/schema.json dataset-bob-hashed.json --check-header false

generating CLKs: 100%|█| 3.24k/3.24k [00:00<00:00, 5.31kclk/s, mean=373, std=35.6]
CLK data written to dataset-bob-hashed.json


In [11]:
!clkutil upload --server $SERVER --apikey $update_token_bob --project $project_id dataset-bob-hashed.json

{"message": "Updated", "receipt_token": "39325e8be2147cdb99e2167716b512f2a6600b6a807e06d5"}

## Charlie: hash the data and upload it to the server

In [12]:
!clkutil hash data/dataset-charlie.csv $KEY1 $KEY2 data/schema.json dataset-charlie-hashed.json --check-header false

generating CLKs: 100%|█| 3.26k/3.26k [00:00<00:00, 5.26kclk/s, mean=374, std=34.8]
CLK data written to dataset-charlie-hashed.json


In [13]:
!clkutil upload --server $SERVER --apikey $update_token_charlie --project $project_id dataset-charlie-hashed.json

{"message": "Updated", "receipt_token": "834aeb097f1865fc549056248d36d20afcb506a25bf0d0f2"}

## Analyst: start the linkage run

This starts the linkage computation with a similarity threshold of 0.68. We wait a little and then retrieve the results.

In [14]:
!clkutil create --server $SERVER --project $project_id --apikey $result_token --threshold 0.68 --output=run-credentials.json

with open('run-credentials.json') as f:
    run_credentials = json.load(f)
    run_id = run_credentials['run_id']
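The `--threshold 0.68` argument sets how similar two encodings must be before they are considered a candidate match. As a rough sketch, assuming the service scores pairs with the Sørensen-Dice coefficient over set bits, the comparison looks like the following. The function name and the tiny bit sets are made up for illustration, and the real multi-party grouping involves more than pairwise thresholding.

```python
def dice_similarity(bits_a, bits_b):
    """Sørensen-Dice coefficient of two sets of bit positions."""
    if not bits_a and not bits_b:
        return 0.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


THRESHOLD = 0.68  # same value as passed to --threshold above

# Made-up bit sets: three of four bits shared, then two of three.
close = dice_similarity({1, 2, 3, 4}, {2, 3, 4, 5})  # 2 * 3 / 8 = 0.75
far = dice_similarity({1, 2, 3}, {2, 3, 4})          # 2 * 2 / 6, about 0.667

is_candidate_close = close >= THRESHOLD
is_candidate_far = far >= THRESHOLD
```

A lower threshold admits more typo-ridden matches at the cost of more false links; a higher one does the opposite.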

## Analyst: retrieve the results

In [15]:
!clkutil results --server $SERVER --project $project_id --apikey $result_token --run $run_id --output linkage-output.json

State: completed
Stage (3/3): compute output
Downloading result
Received result


In [16]:
with open('linkage-output.json') as f:
    linkage_output = json.load(f)
    linkage_groups = linkage_output['groups']

## Everyone: make table of interesting information

We use the linkage result to make a table of genders, cities, and incomes without revealing any other PII.

In [17]:
with open('data/dataset-alice.csv') as f:
    r = csv.reader(f)
    next(r)  # Skip header
    genders = tuple(row[-1] for row in r)
    
with open('data/dataset-bob.csv') as f:
    r = csv.reader(f)
    next(r)  # Skip header
    cities = tuple(row[-1] for row in r)
    
with open('data/dataset-charlie.csv') as f:
    r = csv.reader(f)
    next(r)  # Skip header
    incomes = tuple(row[-1] for row in r)

In [18]:
table = []
for group in linkage_groups:
    row = [''] * 3
    for i, j in group:
        row[i] = [genders, cities, incomes][i][j]
    if sum(map(bool, row)) > 1:
        table.append(row)
pd.DataFrame(table, columns=['gender', 'city', 'income']).head(10)

Unnamed: 0,gender,city,income
0,,sydney,178071.246
1,,canberra,170081.321
2,female,syfdney,125874.591
3,male,melbourne,
4,,melbourne,68548.966
5,male,melbovrne,157723.128
6,,brisbane,73787.731
7,male,canbrrra,
8,female,sydney,
9,femaoe,canberra,
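The table above keeps any group that links at least two datasets, so some cells are empty. If the analysis needs all three attributes in every row, one option is to keep only groups that contain a record from every dataset. The helper below is a sketch of that idea; `complete_rows` is a hypothetical name, and the toy columns are made-up stand-ins for `genders`, `cities`, and `incomes`.

```python
def complete_rows(groups, columns):
    """Build one row per group that has a record from every dataset."""
    rows = []
    for group in groups:
        by_dataset = dict(group)  # {dataset_index: row_index}
        if len(by_dataset) == len(columns):
            rows.append([columns[i][by_dataset[i]] for i in range(len(columns))])
    return rows


# Made-up stand-ins for the genders, cities and incomes tuples.
toy_columns = (('male', 'female'), ('perth', 'sydney'), ('1.0', '2.0'))
toy_groups = [
    [[0, 0], [1, 1]],          # only two datasets: dropped
    [[1, 0], [2, 1], [0, 1]],  # all three datasets: kept
]
toy_table = complete_rows(toy_groups, toy_columns)
```

With the notebook's variables, the call would be `complete_rows(linkage_groups, (genders, cities, incomes))`.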


In [19]:
linkage_groups[:15]

[[[1, 798], [2, 834]],
 [[0, 1465], [2, 1430]],
 [[0, 2409], [2, 2429]],
 [[1, 597], [2, 598]],
 [[1, 84], [2, 2718]],
 [[1, 1478], [2, 1499]],
 [[1, 834], [2, 879], [0, 856]],
 [[0, 804], [1, 946]],
 [[1, 981], [2, 1027]],
 [[0, 1561], [1, 1506], [2, 1528]],
 [[1, 1637], [2, 1659]],
 [[0, 1922], [1, 1895]],
 [[0, 149], [1, 146]],
 [[0, 1538], [1, 1482]],
 [[0, 809], [1, 788]]]
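Each group is a list of `[dataset_index, row_index]` pairs, where index 0 is Alice, 1 is Bob, and 2 is Charlie, matching the order used when building the table. A quick way to see how many groups span two or three parties is to count them; the sketch below does this for a few groups copied from the output above (the variable names are illustrative).

```python
from collections import Counter

# A few groups copied from the output above.
sample_groups = [
    [[1, 798], [2, 834]],
    [[0, 1465], [2, 1430]],
    [[1, 834], [2, 879], [0, 856]],
    [[0, 1561], [1, 1506], [2, 1528]],
]

PARTIES = ('alice', 'bob', 'charlie')  # dataset indices 0, 1, 2

# How many groups have two members, three members, and so on.
size_counts = Counter(len(group) for group in sample_groups)

# Which combinations of parties appear together.
party_combos = Counter(
    tuple(PARTIES[i] for i, _ in sorted(group)) for group in sample_groups
)
```

Replacing `sample_groups` with `linkage_groups` gives the same summary for the full result.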

# Sneak peek at the result

We obviously can't do this in a real-world setting, but let's view the linkage using the PII. If the IDs within a group match, then the linkage for that group is correct.

In [20]:
with open('data/dataset-alice.csv') as f:
    r = csv.reader(f)
    next(r)  # Skip header
    dataset_alice = tuple(r)
    
with open('data/dataset-bob.csv') as f:
    r = csv.reader(f)
    next(r)  # Skip header
    dataset_bob = tuple(r)
    
with open('data/dataset-charlie.csv') as f:
    r = csv.reader(f)
    next(r)  # Skip header
    dataset_charlie = tuple(r)

In [21]:
table = []
for group in linkage_groups:
    for i, j in sorted(group):
        table.append([dataset_alice, dataset_bob, dataset_charlie][i][j])
    table.append([''] * 6)
    
pd.DataFrame(table, columns=['id', 'given name', 'surname', 'dob', 'phone number', 'non-linking']).head(30)

Unnamed: 0,id,given name,surname,dob,phone number,non-linking
0,3255.0,tay1a,clarke,08-94-2003,04 1350 7153,
1,3255.0,dayla,clarke,08-04-2093,04 1350 6153,160504.960
2,,,,,,
3,4950.0,eli2a,parr,06-04-1911,,
4,4950.0,eliza,parr,06-04-1921,04 1525 2602,155179.877
5,,,,,,
6,7653.0,jrd,bengwr,03-12-1963,08 9970 9475,male
7,7653.0,jed,benjyr,93-12-1963,08 9070 9475,
8,,,,,,
9,2635.0,fltn17,reex,01-05-2946,07 9331 w748,sydney
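Rather than checking the IDs by eye, the same comparison can be automated. The sketch below counts the fraction of groups whose members all carry the same ID. It assumes the ID is the first column of each row, as in the datasets loaded above; the function names and the toy records are hypothetical.

```python
def group_is_correct(group, datasets, id_column=0):
    """True if every record in the group has the same ID."""
    ids = {datasets[i][j][id_column] for i, j in group}
    return len(ids) == 1


def fraction_correct(groups, datasets):
    """Fraction of groups whose records all share one ID."""
    if not groups:
        return 0.0
    return sum(group_is_correct(g, datasets) for g in groups) / len(groups)


# Made-up miniature datasets: each row is [id, givenname].
toy_datasets = (
    [['3', 'zali'], ['7', 'oliver']],  # dataset 0
    [['3', 'zali'], ['10', 'erin']],   # dataset 1
    [['7', 'oliyer']],                 # dataset 2
)
toy_groups = [
    [[0, 0], [1, 0]],  # ids 3 and 3: correct
    [[0, 1], [2, 0]],  # ids 7 and 7: correct
    [[0, 1], [1, 1]],  # ids 7 and 10: wrong
]
toy_score = fraction_correct(toy_groups, toy_datasets)
```

With the notebook's variables, the call would be `fraction_correct(linkage_groups, (dataset_alice, dataset_bob, dataset_charlie))`. This measures only how many reported groups are right, not how many true matches were missed.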
