In [1]:
import csv
import itertools
import os

import requests

# Entity Service: Multiparty linkage demo
This notebook demonstrates the multiparty linkage capability of the Entity Service.

We show how five parties may upload their hashed data to the Entity Service to obtain a multiparty linkage result. This result identifies each entity across all datasets in which it appears.

## Check the status of the Entity Service
Ensure that it is running and that we have the correct version. Multiparty support was introduced in version 1.11.0.

In [2]:
SERVER = os.getenv("SERVER", "https://testing.es.data61.xyz")
PREFIX = f"{SERVER}/api/v1"
print(requests.get(f"{PREFIX}/status").json())
print(requests.get(f"{PREFIX}/version").json())

{'project_count': 5944, 'rate': 2260983, 'status': 'ok'}
{'anonlink': '0.12.5', 'entityservice': 'v1.13.0-alpha', 'python': '3.7.5'}
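
The version check can also be done programmatically rather than by eye. The following is a minimal sketch, not part of the Entity Service API: the helpers `parse_version` and `supports_multiparty` are hypothetical names introduced here, and the parsing assumes version strings shaped like the `entityservice` value above (an optional `v`, three dot-separated numbers, and an optional suffix, which is ignored).

```python
import re

# Minimum version with multiparty support, as stated in the text above.
MIN_MULTIPARTY_VERSION = (1, 11, 0)


def parse_version(version: str) -> tuple:
    """Extract (major, minor, patch) from strings such as 'v1.13.0-alpha'."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def supports_multiparty(version: str) -> bool:
    """Compare numerically; any pre-release suffix such as '-alpha' is ignored."""
    return parse_version(version) >= MIN_MULTIPARTY_VERSION


# The version string reported by the server above.
print(supports_multiparty("v1.13.0-alpha"))
```

Comparing tuples of integers avoids the pitfall of comparing version strings lexicographically, where `"1.9.0"` would sort after `"1.11.0"`.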


## Create a new project
We create a new multiparty project for five parties by specifying the number of parties and the output type (currently only the `groups` output type supports multiparty linkage). Retain the `project_id` so we can find the project later, and the `result_token` so we can retrieve the results (careful: anyone with this token has access to the results). Finally, the `update_tokens` identify the five data providers and permit them to upload CLKs.

In [3]:
project_info = requests.post(
    f"{PREFIX}/projects",
    json={
        "schema": {},
        "result_type": "groups",
        "number_parties": 5,
        "name": "example project"
    }
).json()
project_id = project_info["project_id"]
result_token = project_info["result_token"]
update_tokens = project_info["update_tokens"]

print("project_id:", project_id)
print()
print("result_token:", result_token)
print()
print("update_tokens:", update_tokens)

project_id: 21d8916332764c00c0861f1dda132c633c731c377fd89696

result_token: 4b8c53796161aad56414631fd553d5905256ea5cba0476e8

update_tokens: ['f3dafb72996cbc0f453f2acde9dd0e037066039d492c96ee', '28c6cb8b3f85bb528574d51c1f67953af7bb9b835b119451', '028b0b1c05b1e669c7b5bf13caf3a53022481d867c3c0fb9', '105c8d242b51f30388f6f8b0bd4d32189127ea760d22377e', '36955c914e3e0d1aed86a5af32027dfb8a8169532ba4125e']


## Upload the hashed data
This is where each party uploads their CLKs to the service. Here, we do the work of all five data providers inside a single `for` loop. In a deployment scenario, each data provider would upload their own CLKs using their own update token.

These CLKs are already hashed using [clkhash](https://github.com/data61/clkhash), so for each data provider, we just need to upload their corresponding hash file.

In [4]:
for i, token in enumerate(update_tokens, start=1):
    # Binary mode lets requests stream the file body without re-encoding it.
    with open(f"data/clks-{i}.json", "rb") as f:
        r = requests.post(
            f"{PREFIX}/projects/{project_id}/clks",
            data=f,
            headers={
                "Authorization": token,
                "content-type": "application/json"
            }
        )
    print(f"Data provider {i}: {r.text}")

Data provider 1: {
  "message": "Updated",
  "receipt_token": "3e102ce587ae97feb18aebf7596aee5ba3ba5b6a41d5bedf"
}

Data provider 2: {
  "message": "Updated",
  "receipt_token": "ab758b30126ddc083bf65749773fc5856719b4273adc0703"
}

Data provider 3: {
  "message": "Updated",
  "receipt_token": "e013c252746cbc5ceb00b4009500769ceb63389de886137c"
}

Data provider 4: {
  "message": "Updated",
  "receipt_token": "f2f38a3206197dd46b53c4c6da079527552d7c6e24b9b63e"
}

Data provider 5: {
  "message": "Updated",
  "receipt_token": "e489cf14d65b211dd6c8b98b1a902f04e3b09c0e3da21a44"
}



## Begin a run
The data providers have uploaded their CLKs, so we may begin the computation. This computation may be repeated multiple times, each time with different parameters. Each such repetition is called a run. The most important parameter to vary between runs is the similarity threshold. Two records whose similarity is above this threshold will be considered to describe the same entity.

Here, we perform one run. We (somewhat arbitrarily) choose the threshold to be 0.8.

In [5]:
r = requests.post(
    f"{PREFIX}/projects/{project_id}/runs",
    headers={
        "Authorization": result_token
    },
    json={
        "threshold": 0.8
    }
)
run_id = r.json()["run_id"]
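
To make the role of the threshold concrete, the sketch below scores two pairs of toy bit vectors with the Sørensen-Dice coefficient, a similarity measure commonly used for comparing CLKs. This is an illustration only: `dice_similarity` and the 16-bit values are assumptions made for this example, and the service computes its own scores internally on the uploaded CLKs.

```python
def dice_similarity(a: int, b: int) -> float:
    """Sørensen-Dice coefficient of two bit vectors stored as non-negative ints."""
    ones_a = bin(a).count("1")
    ones_b = bin(b).count("1")
    if ones_a + ones_b == 0:
        return 0.0  # two empty vectors: define the similarity as 0
    common = bin(a & b).count("1")
    return 2 * common / (ones_a + ones_b)


THRESHOLD = 0.8  # same value as in the run above

# Toy 16-bit "CLKs" (hypothetical values, far shorter than real CLKs).
clk_a = 0b1011011100101101
clk_b = 0b1011011100101001  # clk_a with one set bit cleared
clk_c = 0b0100100011010010  # bitwise complement of clk_a

for name, other in [("clk_b", clk_b), ("clk_c", clk_c)]:
    score = dice_similarity(clk_a, other)
    print(name, round(score, 3), "match" if score >= THRESHOLD else "no match")
```

Raising the threshold makes matching stricter (fewer false matches, more missed ones); lowering it does the opposite.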

## Check the status
Let's check whether the run has finished, i.e. whether `state` is `completed`.

In [6]:
r = requests.get(
    f"{PREFIX}/projects/{project_id}/runs/{run_id}/status",
    headers={
        "Authorization": result_token
    }
)
r.json()

{'current_stage': {'description': 'compute similarity scores',
  'number': 2,
  'progress': {'absolute': 31440720,
   'description': 'number of already computed similarity scores',
   'relative': 0.2984721650891483}},
 'stages': 3,
 'state': 'running',
 'time_added': '2019-11-18T02:52:30.352381+00:00',
 'time_started': '2019-11-18T02:52:30.373760+00:00'}

After some delay (depending on the size of the data) we can fetch the results. We could wait for completion by polling the REST API directly with `requests`; for simplicity, we instead use the `watch_run_status` method of `RestClient` from `clkhash.rest_client`.

In [7]:
from IPython.display import clear_output
from clkhash.rest_client import RestClient
from clkhash.rest_client import format_run_status
rest_client = RestClient(SERVER)
for update in rest_client.watch_run_status(project_id, run_id, result_token, timeout=300):
    clear_output(wait=True)
    print(format_run_status(update))

State: completed
Stage (3/3): compute output
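
For comparison, polling the status endpoint directly could look roughly like the sketch below. It is not how `watch_run_status` is implemented: `wait_for_run` is a hypothetical helper, the `"error"` state is an assumption (only `running` and `completed` appear in this notebook), and the demonstration uses canned responses so that it runs without a server.

```python
import time


def wait_for_run(fetch_status, poll_interval=5.0, timeout=300.0):
    """Call fetch_status() until the run reaches a terminal state or time runs out.

    fetch_status is any zero-argument callable returning the status dict.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        # "completed" is shown in this notebook; "error" is an assumed failure state.
        if status.get("state") in ("completed", "error"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still in state {status.get('state')!r}")
        time.sleep(poll_interval)


# Offline demonstration: canned responses stand in for HTTP calls.
canned = iter([{"state": "running"}, {"state": "running"}, {"state": "completed"}])
final = wait_for_run(lambda: next(canned), poll_interval=0.0)
print(final["state"])
```

Against a live server, `fetch_status` could wrap the status request from cell 6, for example `lambda: requests.get(f"{PREFIX}/projects/{project_id}/runs/{run_id}/status", headers={"Authorization": result_token}).json()`.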


## Retrieve the results
We retrieve the results of the linkage. As we selected earlier, the result is a list of groups of records. Every record in such a group belongs to the same entity and consists of two values, the party id and the row index.

The last 20 groups look like this.

In [8]:
r = requests.get(
    f"{PREFIX}/projects/{project_id}/runs/{run_id}/result",
    headers={
        "Authorization": result_token
    }
)
groups = r.json()
groups['groups'][-20:]

[[[0, 287], [2, 293], [4, 277]],
 [[0, 2387], [1, 2386]],
 [[0, 264], [3, 252], [1, 272]],
 [[0, 2496], [4, 2498]],
 [[3, 147], [4, 147]],
 [[3, 815], [4, 812]],
 [[3, 1302], [4, 1343]],
 [[0, 1691], [3, 1674]],
 [[0, 3085], [3, 3117]],
 [[1, 2559], [4, 2545]],
 [[0, 574], [3, 576], [4, 554]],
 [[0, 424], [4, 387]],
 [[1, 1087], [2, 1140]],
 [[1, 468], [2, 489], [3, 482], [4, 469]],
 [[3, 2102], [4, 2115]],
 [[1, 981], [3, 1007]],
 [[0, 696], [3, 704]],
 [[0, 2475], [2, 2501], [1, 2485]],
 [[1, 1034], [2, 1090]],
 [[0, 2785], [4, 2797]]]
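
Because each group is simply a list of `[party_id, row_index]` pairs, the result is easy to post-process. The sketch below uses three groups copied from the output above; `sample_groups`, `size_histogram` and `record_to_group` are names introduced here for illustration.

```python
from collections import Counter

# A few groups copied from the output above; each record is [party_id, row_index].
sample_groups = [
    [[0, 287], [2, 293], [4, 277]],
    [[0, 2387], [1, 2386]],
    [[1, 468], [2, 489], [3, 482], [4, 469]],
]

# How many groups contain 2, 3, 4, ... records?
size_histogram = Counter(len(group) for group in sample_groups)

# Reverse lookup: which group does a given (party_id, row_index) record belong to?
record_to_group = {
    (party_id, row_index): group_index
    for group_index, group in enumerate(sample_groups)
    for party_id, row_index in group
}

print(dict(size_histogram))
print(record_to_group[(2, 489)])
```

With the full result, `sample_groups` would be replaced by `groups['groups']`.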

As a sanity check, we print the PII corresponding to each record:

In [9]:
def load_dataset(i):
    dataset = []
    with open(f"data/dataset-{i}.csv") as f:
        reader = csv.reader(f)
        next(reader)  # ignore header
        for row in reader:
            dataset.append(row[1:])
    return dataset

datasets = list(map(load_dataset, range(1, 6)))

for group in itertools.islice(groups["groups"][-20:], 20):
    for (i, j) in group:
        print(i, datasets[i][j])
    print()

0 ['mackenzie', 'tremellen', '11-01-2947', 'maoe', 'melbourne', '79469.112', '']
2 ['mackenzie', 'dremellen', '11-01-2937', 'mals', 'mceloburne', '70469.122', '07 5988 5208']
4 ['macckenzie', 'tremellen', '', 'malr', 'melbovrne', '70469.122', '07 5988 5208']

0 ['sophi', 'couljon', '12-03-1841', 'female', 'sydney', '80972.256', '04 3854 3784']
1 ['sophie', 'coulson', '12-03-1941', 'female', 'sydney', '80972.356', '04 3854 3784']

0 ['jasmine', 'clarke', '04-00-2009', 'maje', 'melb0urme', '99853.100', '02 1507 1520']
3 ['jasmine', 'clarke', '04-09-2009', 'male', 'melbourne', '99853.200', '02 1507 1520']
1 ['jasminr', 'klarle', '04-99-2009', 'male', 'melbourne', '99863.200', '02 1507 1520']

0 ['zoel', 'ev', '06-09-1990', 'gemale', 'ysdnvvy', '183366.696', '02 5578 4520']
4 ['joel', 'everett', '06-09-1990', 'female', 'sydney', '183366.696', '02 5578 4520']

3 ['katelyn', 'matthets', '23-07-1977', '', 'melbourne', '118010.996', '07 9265 9238']
4 ['kateyln', 'matth4ws', '23-07-1978', 'male

Despite the large amount of noise in the data, the Entity Service was able to produce a fairly accurate matching. However, Isabella George and Mia/Talia Galbraith are most likely not an actual match.

We may be able to improve on these results by fine-tuning the hashing schema or by changing the threshold.
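
Since a project can have several runs, trying other thresholds does not require uploading the CLKs again. A minimal sketch, assuming the project has not yet been deleted: `start_runs` is a hypothetical helper that repeats the run-creation request from cell 5 once per threshold, and the HTTP function is passed in as `post` so the helper itself has no network dependency.

```python
def start_runs(post, prefix, project_id, result_token, thresholds):
    """Start one run per threshold and return a {threshold: run_id} mapping.

    post is a callable with the same interface as requests.post.
    """
    run_ids = {}
    for threshold in thresholds:
        response = post(
            f"{prefix}/projects/{project_id}/runs",
            headers={"Authorization": result_token},
            json={"threshold": threshold},
        )
        run_ids[threshold] = response.json()["run_id"]
    return run_ids
```

In this notebook the call might be `start_runs(requests.post, PREFIX, project_id, result_token, [0.75, 0.8, 0.85])`, where the threshold values are arbitrary examples. Each returned `run_id` can then be polled and its result retrieved exactly as shown for the single run.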

## Delete the project

In [10]:
r = requests.delete(
    f"{PREFIX}/projects/{project_id}",
    headers={
        "Authorization": result_token
    }
)
print(r.status_code)

204
