In [1]:
import csv
import itertools

import requests

# Entity Service: Multiparty linkage demo
This notebook is a demonstration of the multiparty linkage capability that has been implemented in the Entity Service.

We show how five parties may upload their hashed data to the Entity Service to obtain a multiparty linkage result. This result identifies each entity across all datasets in which they are included.

## Check the status of the Entity Service
Ensure that it is running and that we have the correct version.

In [2]:
PREFIX = "http://testing.es.data61.xyz/api/v1"
print(requests.get(f"{PREFIX}/status").json())
print(requests.get(f"{PREFIX}/version").json())

{'project_count': 609, 'rate': 379192754, 'status': 'ok'}
{'anonlink': '0.11.2', 'entityservice': 'v1.11.0', 'python': '3.6.8'}


## Make a new project
Make a new project, specifying the number of parties (five in this example) and the output type (currently, only the `groups` output type supports multiparty linkage). Retain the `project_id` so that we can find the project later. Also retain the `result_token` so that we can retrieve the results (careful: anyone with this token has access to the results). Finally, the `update_tokens` identify the five data providers and permit them to upload CLKs.

In [3]:
project_info = requests.post(
    f"{PREFIX}/projects",
    json={
        "schema": {},
        "result_type": "groups",
        "number_parties": 5,
        "name": "example project"
    }
).json()
project_id = project_info["project_id"]
result_token = project_info["result_token"]
update_tokens = project_info["update_tokens"]

print("project_id:", project_id)
print()
print("result_token:", result_token)
print()
print("update_tokens:", update_tokens)

project_id: 2fe4dc0235f7e6fa80e738c393bfa4fea6797283086ef184

result_token: 6dc4019577848c7b9e17f36b1895cea1410ae21714ee6315

update_tokens: ['8989ef2645c9e4c3346187818267a701b7c899620e150459', 'dd0158f2d53a91600ebb0745e509fb73f3df9e440278fd12', '0090a2734ad3c1111c917348ec5fd04ccad1826763a35b83', '890df3e84907e371ca818c074c14533d2daaf50b5e289186', '748ef806e231f792d4850f8c8fd05936a2881ebbf0e76577']


## Upload the hashed data
This is where each party uploads their CLKs into the service. Here, we do the work of all five data providers inside this for loop. In a deployment scenario, each data provider would be uploading their own CLKs using their own update token.

These CLKs are already hashed using clkhash, so for each data provider, we just need to upload their corresponding hash file.

In [4]:
for i, token in enumerate(update_tokens, start=1):
    with open(f"data/clks-{i}.json") as f:
        r = requests.post(
            f"{PREFIX}/projects/{project_id}/clks",
            data=f,
            headers={
                "Authorization": token,
                "content-type": "application/json"
            }
        )
    print(f"Data provider {i}: {r.text}")

Data provider 1: {
  "message": "Updated",
  "receipt_token": "e0396d304ab247d128d406b2968dafa033e1888b56986961"
}

Data provider 2: {
  "message": "Updated",
  "receipt_token": "10218532f725c552983581e3b76b8aa8d5cb086e859da8ec"
}

Data provider 3: {
  "message": "Updated",
  "receipt_token": "c04598565b600d9b2f50dfe7d8644a711a396f46e9a3eb2e"
}

Data provider 4: {
  "message": "Updated",
  "receipt_token": "8e60a39480c01fcf283767405e515c8cdae436927c44688f"
}

Data provider 5: {
  "message": "Updated",
  "receipt_token": "ffe32cfb14eb2f106301603e75ebd5e2641406dfd5a280f7"
}



## Begin a run
The data providers have uploaded their CLKs, so we may begin the computation. This computation may be repeated multiple times, each time with different parameters. Each such repetition is called a run. The most important parameter to vary between runs is the similarity threshold. Two records whose similarity is above this threshold will be considered to describe the same entity.

Here, we perform one run. We (somewhat arbitrarily) choose the threshold to be 0.8.
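To make the role of the threshold concrete, here is a minimal sketch of a similarity comparison between two bit vectors. It assumes the similarity measure is the Sørensen-Dice coefficient (the measure anonlink uses to compare CLKs); the function name `dice_coefficient` and the tiny 12-bit example values are illustrative only, since real CLKs are much longer.

```python
def dice_coefficient(a: int, b: int) -> float:
    """Sørensen-Dice coefficient of two bit vectors stored as Python ints."""
    popcount_a = bin(a).count("1")
    popcount_b = bin(b).count("1")
    if popcount_a + popcount_b == 0:
        return 0.0  # convention chosen here for two empty vectors
    common = bin(a & b).count("1")
    return 2 * common / (popcount_a + popcount_b)


THRESHOLD = 0.8  # same value as the run in this notebook

# Hypothetical toy "CLKs"; real ones are far longer bit arrays.
clk_a = 0b1111_0000_1111
clk_b = 0b1111_0000_1101  # differs from clk_a in a single bit
clk_c = 0b0000_1111_0001  # shares a single set bit with clk_a

for name, other in [("clk_b", clk_b), ("clk_c", clk_c)]:
    score = dice_coefficient(clk_a, other)
    verdict = "match" if score > THRESHOLD else "no match"
    print(f"clk_a vs {name}: {score:.3f} -> {verdict}")
```

Raising the threshold accepts fewer pairs, which tends to lose true matches; lowering it accepts more pairs, which tends to admit false ones.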

In [5]:
r = requests.post(
    f"{PREFIX}/projects/{project_id}/runs",
    headers={
        "Authorization": result_token
    },
    json={
        "threshold": 0.8
    }
)
run_id = r.json()["run_id"]

## Check the status
Let's see whether the run has finished!

In [6]:
r = requests.get(
    f"{PREFIX}/projects/{project_id}/runs/{run_id}/status",
    headers={
        "Authorization": result_token
    }
)
r.json()

{'current_stage': {'description': 'compute output', 'number': 3},
 'stages': 3,
 'state': 'completed',
 'time_added': '2019-06-03T02:12:13.555112+00:00',
 'time_completed': '2019-06-03T02:12:15.152790+00:00',
 'time_started': '2019-06-03T02:12:13.607466+00:00'}
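
In this demo the run had already completed by the time we asked. For larger datasets it may still be in progress, so a small polling helper can be useful. The sketch below is one possible approach, not part of the Entity Service API: `wait_for_run`, `fetch_status`, and their parameters are hypothetical names, and the only state value taken from the output above is `'completed'`.

```python
import time


def wait_for_run(fetch_status, poll_interval=1.0, timeout=60.0,
                 terminal_states=("completed",)):
    """Call fetch_status() repeatedly until the run reaches a terminal state.

    fetch_status is any zero-argument callable returning the status dict.
    Only "completed" is known from this notebook's output; pass further
    terminal states explicitly if the service reports others.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status["state"] in terminal_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still in state {status['state']!r}")
        time.sleep(poll_interval)
```

With the variables defined earlier, it could be called as `wait_for_run(lambda: requests.get(f"{PREFIX}/projects/{project_id}/runs/{run_id}/status", headers={"Authorization": result_token}).json())`.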

## Retrieve the results
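We retrieve the results of the linkage. As we selected earlier, the result is a list of groups of records. Every record in such a group belongs to the same entity. Each record is a pair `[i, j]`, where `i` is the zero-based index of the data provider's dataset and `j` is the index of the row within that dataset.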

To sanity check, we pick the first 20 groups and print their records' corresponding PII.

In [7]:
r = requests.get(
    f"{PREFIX}/projects/{project_id}/runs/{run_id}/result",
    headers={
        "Authorization": result_token
    }
)
groups = r.json()
groups

{'groups': [[[0, 2503], [1, 2687]],
  [[3, 712], [4, 753]],
  [[1, 2225], [2, 301]],
  [[2, 2480], [3, 149]],
  [[2, 2118], [4, 272]],
  [[1, 1892], [4, 1931]],
  [[0, 1409], [3, 1224]],
  [[3, 840], [4, 841]],
  [[0, 634], [2, 1003]],
  [[2, 2871], [4, 615]],
  [[0, 741], [3, 752]],
  [[1, 595], [2, 595]],
  [[0, 2989], [3, 2721]],
  [[0, 2078], [2, 2076]],
  [[0, 880], [4, 874]],
  [[2, 2544], [4, 2513]],
  [[2, 332], [3, 310]],
  [[0, 1585], [3, 1547]],
  [[4, 2507], [2, 2537]],
  [[1, 1519], [3, 1782]],
  [[1, 2085], [2, 2237]],
  [[0, 2548], [1, 1046]],
  [[0, 1632], [2, 971]],
  [[0, 3186], [4, 3210]],
  [[0, 266], [4, 2632]],
  [[1, 1527], [4, 1585]],
  [[0, 2820], [3, 2832]],
  [[2, 2271], [3, 2257]],
  [[0, 2547], [1, 2566]],
  [[2, 2277], [4, 2256]],
  [[2, 1732], [4, 1846]],
  [[2, 328], [3, 307]],
  [[1, 2375], [2, 2388]],
  [[2, 445], [3, 2710]],
  [[1, 2539], [3, 2534]],
  [[0, 1990], [3, 506]],
  [[1, 3205], [3, 3195]],
  [[0, 1840], [4, 2674]],
  [[0, 3057], [2, 831]],


In [8]:
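def load_dataset(i):
    """Load dataset `i` as a list of rows, skipping the header and each row's first column."""
    with open(f"data/dataset-{i}.csv", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # ignore header
        return [row[1:] for row in reader]

# Files are numbered 1-5 on disk; the result refers to them by index 0-4.
datasets = list(map(load_dataset, range(1, 6)))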

In [9]:
for group in itertools.islice(groups["groups"], 20):
    for i, j in group:
        print(i, datasets[i][j])
    print()

0 ['destynii', 'white', '11-03-1995', 'female', 'sydney', '66408.876', '04 3899 1604']
1 ['destynii', 'white', '29-10-1919', 'female', 'sydney', '172171.250', '02 1530 6604']

3 ['christopher', 'yu', '09-12-1932', 'male', 'melbourne', '169159.457', '07 5359 4974']
4 ['christopher', 'mcgregor', '20-12-1939', 'male', 'melbourne', '159633.469', '04 6203 7278']

1 ['lachlan', 'white', '22-01-1900', 'female', 'brisbane', '58477.105', '02 0756 6419']
2 ['lachlan', 'snajdar', '22-01-1990', 'female', 'brisbane', '80302.601', '07 0429 5504']

2 ['charlotte', 'burford', '13-06-1933', 'female', 'melbourne', '1387144.898', '04 4666 7874']
3 ['chloe', 'longhurst', '13-05-1953', 'female', 'melbourne', '1387111.531', '02 7623 5968']

2 ['mitchell', 'excell', '22-07-1916', 'female', 'brisbane', '48745.968', '03 1715 8277']
4 ['mitchell', 'har', '04-07-1957', 'female', 'brisbane', '46706.931', '07 3987 4146']

1 ['olcky', 'reis', '13-91-1949', 'male', 'canberra', '61718.921', '03 5754 5443']
4 ['ollh',

This matching appears impressively accurate for such noisy data. However, there is clearly still some room for improvement. We may be able to improve on these results by fine-tuning the hashing schema or by changing the threshold.

## Delete the project

In [10]:
r = requests.delete(
    f"{PREFIX}/projects/{project_id}",
    headers={
        "Authorization": result_token
    }
)
print(r.status_code)

204
