# Anonlink Entity Service API

This tutorial demonstrates how to interact with the Anonlink Entity Service directly through its REST API. The main alternative is
a library or command-line tool such as [`clkhash`](http://clkhash.readthedocs.io/), which handles communication with the service for you.

### Dependencies

In this tutorial we interact with the REST API using the `requests` Python library. We also use the `clkhash` Python library to define the linkage schema and to encode the PII. The synthetic dataset comes from the `recordlinkage` package. All the dependencies can be installed with pip:

```
pip install requests clkhash recordlinkage
```


### Steps

* Check connection to Anonlink Entity Service
* Generate and encode synthetic data
* Create a new linkage project
* Upload the encodings
* Create a run
* Retrieve and analyse results

In [1]:
import json
import os
import time
import requests

from IPython.display import clear_output

## Check Connection

If you are connecting to a custom entity service, change the address here.

In [2]:
server = os.getenv("SERVER", "https://testing.es.data61.xyz")
url = server + "/api/v1/"
print(f'Testing anonlink-entity-service hosted at {url}')

Testing anonlink-entity-service hosted at https://testing.es.data61.xyz/api/v1/


In [3]:
requests.get(url + 'status').json()

{'project_count': 7871, 'rate': 301990, 'status': 'ok'}

## Data preparation

This section won't be explained in great detail as it directly follows the 
[clkhash tutorials](http://clkhash.readthedocs.io/en/latest/).

We encode a synthetic dataset from the `recordlinkage` library using `clkhash`.

In [4]:
from tempfile import NamedTemporaryFile
from recordlinkage.datasets import load_febrl4

In [5]:
dfA, dfB = load_febrl4()

In [6]:
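# Write with explicit '\n' line endings. pandas >= 1.5 spells the keyword
# `lineterminator`; older releases used `line_terminator`.
with open('a.csv', 'w') as a_csv:
    dfA.to_csv(a_csv, lineterminator='\n')

with open('b.csv', 'w') as b_csv:
    dfB.to_csv(b_csv, lineterminator='\n')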

## Schema Preparation

The linkage schema must be agreed on by the two parties. A hashing schema instructs `clkhash` how to treat each column for encoding PII into CLKs. A detailed description of the hashing schema can be found in the [clkhash documentation](https://clkhash.readthedocs.io/en/latest/schema.html).

A linkage schema can either be defined as Python code as shown here, or as a JSON file (shown in other tutorials). The importance of each field is controlled by the `k` parameter in the `FieldHashingProperties`.
We ignore the record id and social security id fields so they won't be incorporated into the encoding.

In [7]:
import clkhash
from clkhash.field_formats import *
schema = clkhash.randomnames.NameList.SCHEMA
_missing = MissingValueSpec(sentinel='')
schema.fields = [
    Ignore('rec_id'),
    StringSpec('given_name', 
               FieldHashingProperties(ngram=2, k=15)),
    StringSpec('surname', 
               FieldHashingProperties(ngram=2, k=15)),
    IntegerSpec('street_number', 
                FieldHashingProperties(ngram=1, 
                                       positional=True, 
                                       k=15, 
                                       missing_value=_missing)),
    StringSpec('address_1', 
               FieldHashingProperties(ngram=2, k=15)),
    StringSpec('address_2', 
               FieldHashingProperties(ngram=2, k=15)),
    StringSpec('suburb', 
               FieldHashingProperties(ngram=2, k=15)),
    IntegerSpec('postcode', 
                FieldHashingProperties(ngram=1, positional=True, k=15)),
    StringSpec('state', 
               FieldHashingProperties(ngram=2, k=15)),
    IntegerSpec('date_of_birth', 
                FieldHashingProperties(ngram=1, positional=True, k=15, missing_value=_missing)),
    Ignore('soc_sec_id')
]
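
To see why `k` controls a field's weight, the sketch below counts how many hash insertions a single value contributes to an encoding: every n-gram token is hashed `k` times, so a field's share grows with both its token count and its `k`. The tokenizer is a simplified stand-in. Padding the value with one space on each side is an assumption about how bigrams are formed, and `bigrams` and `insertions` are hypothetical helper names, not part of `clkhash`.

```python
def bigrams(value):
    """Split a string into overlapping 2-character tokens.

    Assumption: the value is padded with one space on each side, so
    'john' yields ' j', 'jo', 'oh', 'hn', 'n '.
    """
    padded = ' {} '.format(value)
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def insertions(value, k):
    """Number of hash insertions a value makes when each token is hashed k times."""
    return len(bigrams(value)) * k


# 'john' has 5 bigrams, so with k=15 it makes 5 * 15 = 75 insertions.
# Halving k for a less reliable field would halve that count.
print(bigrams('john'))
print(insertions('john', k=15))
```

The count is an upper bound on the bits actually set, since different tokens can hash to the same position.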

## Encoding

We now transform the *raw* personally identifiable information into CLK encodings according to the schema defined above. See the [clkhash](https://clkhash.readthedocs.io/) documentation for further details.

In [8]:
from clkhash import clk
with open('a.csv') as a_pii:
    hashed_data_a = clk.generate_clk_from_csv(a_pii, ('key1',), schema, validate=False)
    
with open('b.csv') as b_pii:
    hashed_data_b = clk.generate_clk_from_csv(b_pii, ('key1',), schema, validate=False)

generating CLKs: 100%|██████████| 5.00k/5.00k [00:00<00:00, 9.71kclk/s, mean=644, std=45.4]
generating CLKs: 100%|██████████| 5.00k/5.00k [00:00<00:00, 12.4kclk/s, mean=632, std=53]


## Create Linkage Project

The analyst carrying out the linkage starts by creating a linkage project of the desired output type with the Entity Service.


In [9]:
project_spec = {
  "schema": {},
  "result_type": "groups",
  "number_parties": 2,
  "name": "API Tutorial Test"
}
credentials = requests.post(url + 'projects', json=project_spec).json()

project_id = credentials['project_id']
a_token, b_token = credentials['update_tokens']
credentials

{'project_id': '0989ebe812ab245b3639c2ffae0ac82a9e6efb97d32f3a1f',
 'result_token': '29687113baf41a7947d049e52f7804ca77fbba0abd243931',
 'update_tokens': ['0d2321c208a03af044074ac4131ebc795ecb5516749b34d1',
  '55cb5f601fb982ae33dbe4ba27c625fd32f37b08ae410dff']}

**Note:** the analyst will need to pass on the `project_id` (the 
id of the linkage project) and one of the two `update_tokens` to 
each data provider.
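
As a small illustration of that hand-off, the sketch below splits a credentials dictionary of the shape shown above into one hand-out per data provider. `split_credentials` is a hypothetical helper, and the values in `example` are placeholders rather than real tokens.

```python
def split_credentials(credentials):
    """Return one hand-out per data provider: the project id plus one update token."""
    return [
        {'project_id': credentials['project_id'], 'update_token': token}
        for token in credentials['update_tokens']
    ]


# Placeholder values with the same keys as the service response.
example = {
    'project_id': 'project-123',
    'result_token': 'analyst-only',
    'update_tokens': ['token-a', 'token-b'],
}
for party, handout in zip('AB', split_credentials(example)):
    print(party, handout)
```

The `result_token` is deliberately left out of the hand-outs, so it stays with the analyst.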

The `result_token` can also be used to carry out project API requests:

In [10]:
requests.get(url + 'projects/{}'.format(project_id), 
             headers={"Authorization": credentials['result_token']}).json()

{'error': False,
 'name': 'API Tutorial Test',
 'notes': '',
 'number_parties': 2,
 'parties_contributed': 0,
 'project_id': '0989ebe812ab245b3639c2ffae0ac82a9e6efb97d32f3a1f',
 'result_type': 'groups',
 'schema': {}}

Now the two data providers can upload their encodings, each authenticating with its own *update token*.

## CLK Upload

In [11]:
a_response = requests.post(
    '{}projects/{}/clks'.format(url, project_id),
    json={'clks': hashed_data_a},
    headers={"Authorization": a_token}
).json()

In [12]:
b_response = requests.post(
    '{}projects/{}/clks'.format(url, project_id),
    json={'clks': hashed_data_b},
    headers={"Authorization": b_token}
).json()

Every upload gets a receipt token. In some operating modes this receipt is required to access the results.
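
The upload calls above discard HTTP errors and leave the receipt buried in the response. A minimal sketch of a wrapper that checks the status code and returns the receipt follows. `upload_clks` is a hypothetical helper, and the response key `receipt_token` is an assumption about the service's reply, so check it against the actual response before relying on it.

```python
def upload_clks(session, url, project_id, update_token, clks):
    """POST encodings to a project and return the receipt token.

    `session` is anything with a requests-style `post` method, for
    example the `requests` module itself or a `requests.Session`.
    """
    response = session.post(
        '{}projects/{}/clks'.format(url, project_id),
        json={'clks': clks},
        headers={'Authorization': update_token},
    )
    # Fail loudly on 4xx/5xx instead of parsing an error body as a receipt.
    response.raise_for_status()
    # Assumption: the receipt is returned under the key 'receipt_token'.
    return response.json()['receipt_token']
```

Used in place of the two calls above, this would read `a_receipt = upload_clks(requests, url, project_id, a_token, hashed_data_a)`, and likewise for the second party.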

## Create a run

Now that the project has been created and the CLK data uploaded, we can carry out some privacy-preserving record linkage. Try a few different threshold values:

In [13]:
run_response = requests.post(
    "{}projects/{}/runs".format(url, project_id),
    headers={"Authorization": credentials['result_token']},
    json={
        'threshold': 0.80,
        'name': "Tutorial Run #1"
    }
).json()

In [14]:
run_id = run_response['run_id']
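
To build some intuition for `threshold`, the sketch below computes a similarity score for two toy encodings. It assumes the score is the Sørensen-Dice coefficient of the two bit arrays and models each encoding as a Python set of the positions whose bit is set. `dice_coefficient` is a hypothetical helper and the toy sets are made up.

```python
def dice_coefficient(bits_a, bits_b):
    """Sørensen-Dice coefficient of two sets of set-bit positions."""
    if not bits_a and not bits_b:
        return 0.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


a = {1, 4, 7, 9, 12}
b = {1, 4, 7, 9, 15}
score = dice_coefficient(a, b)   # 2 * 4 / (5 + 5)
print(score)
```

Pairs scoring below the threshold are not considered matches, so raising the threshold trades recall for precision.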

## Run Status

In [15]:
requests.get(
        '{}projects/{}/runs/{}/status'.format(url, project_id, run_id),
        headers={"Authorization": credentials['result_token']}
    ).json()

{'current_stage': {'description': 'compute output', 'number': 3},
 'stages': 3,
 'state': 'completed',
 'time_added': '2019-11-01T02:39:19.310376+00:00',
 'time_completed': '2019-11-01T02:39:20.389791+00:00',
 'time_started': '2019-11-01T02:39:19.336674+00:00'}
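
The status request above can be repeated until the run finishes. The following is a minimal polling sketch. `wait_for_run` is a hypothetical helper that takes the status request as a callable, and treating `'completed'` and `'error'` as the only terminal states is an assumption about the service.

```python
import time


def wait_for_run(get_status, poll_interval=1.0, timeout=300.0):
    """Call `get_status()` until the run reaches a terminal state.

    `get_status` is a zero-argument callable returning the status dict.
    Assumption: 'completed' and 'error' are the terminal values of 'state'.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status['state'] in ('completed', 'error'):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError('run did not finish within {} s'.format(timeout))
        time.sleep(poll_interval)


# Usage with the variables defined in this tutorial:
# status = wait_for_run(lambda: requests.get(
#     '{}projects/{}/runs/{}/status'.format(url, project_id, run_id),
#     headers={'Authorization': credentials['result_token']}).json())
```

Passing the request in as a callable keeps the loop independent of the HTTP library and easy to exercise with canned responses.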

## Results

After some delay (depending on the size of the uploaded data) we can fetch the results. This could be done by polling the REST API directly with `requests`, but for simplicity we use the `watch_run_status` function provided by `clkhash.rest_client`.

> Note that `server` is passed here rather than `url`.

In [16]:
import clkhash.rest_client
for update in clkhash.rest_client.watch_run_status(server, project_id, run_id, credentials['result_token'], timeout=300):
    clear_output(wait=True)
    print(clkhash.rest_client.format_run_status(update))


State: completed
Stage (3/3): compute output


In [17]:
data = json.loads(clkhash.rest_client.run_get_result_text(
    server, 
    project_id, 
    run_id, 
    credentials['result_token']))

This result is the 1-1 mapping between rows that were more similar than the given threshold.

In [21]:
for i in range(10):
    ((_, a_index), (_, b_index)) = sorted(data['groups'][i])
    print("a[{}] maps to b[{}]".format(a_index, b_index))
print("...")

a[1859] maps to b[3906]
a[950] maps to b[3115]
a[3466] maps to b[3210]
a[1006] maps to b[3452]
a[2325] maps to b[3248]
a[2291] maps to b[687]
a[2144] maps to b[1101]
a[1768] maps to b[3890]
a[1307] maps to b[2441]
a[2932] maps to b[3006]
...
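
For further analysis it is often handier to hold the result as a dictionary than to unpack each group by hand. The sketch below converts groups of the shape used above, lists of `[party_index, row_index]` pairs, into a mapping from rows of the first party to rows of the second. `groups_to_mapping` is a hypothetical helper, and the sample reuses two of the pairs printed above.

```python
def groups_to_mapping(groups):
    """Turn two-party groups into a dict {row in A: row in B}.

    Each group is a list of [party_index, row_index] pairs. Party 0 is
    assumed to be the first uploader (A) and party 1 the second (B).
    """
    mapping = {}
    for group in groups:
        rows = dict(group)              # {party_index: row_index}
        if 0 in rows and 1 in rows:     # skip groups missing either party
            mapping[rows[0]] = rows[1]
    return mapping


sample = [[[0, 1859], [1, 3906]], [[1, 3115], [0, 950]]]
print(groups_to_mapping(sample))
```

With the tutorial's data this would be called as `groups_to_mapping(data['groups'])`.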


In this dataset there are 5000 records in common. With the chosen threshold and schema we currently retrieve:

In [19]:
len(data['groups'])

4851
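
The number of groups says nothing about how many of them are correct. A rough way to check is sketched below, under the assumption that the dataset's record ids have the form `rec-<n>-org` in the first file and `rec-<n>-dup-0` in the second, so that two records match when `<n>` agrees. `entity_number` and `precision_recall` are hypothetical helpers, and the ids in the example are made up.

```python
def entity_number(rec_id):
    """Extract the entity number from an id such as 'rec-1070-org'."""
    return rec_id.split('-')[1]


def precision_recall(mapping, ids_a, ids_b, total_true_matches):
    """Score a {row in A: row in B} mapping against the ids' ground truth."""
    correct = sum(
        entity_number(ids_a[a]) == entity_number(ids_b[b])
        for a, b in mapping.items()
    )
    precision = correct / len(mapping) if mapping else 0.0
    recall = correct / total_true_matches
    return precision, recall


ids_a = ['rec-0-org', 'rec-1-org', 'rec-2-org']
ids_b = ['rec-2-dup-0', 'rec-0-dup-0', 'rec-1-dup-0']
# Row 0 of A is correctly paired with row 1 of B; row 1 of A is not.
print(precision_recall({0: 1, 1: 0}, ids_a, ids_b, total_true_matches=3))
```

For the tutorial's data, the mapping could be built as `{a: b for (_, a), (_, b) in map(sorted, data['groups'])}`, with `ids_a = list(dfA.index)`, `ids_b = list(dfB.index)` and `total_true_matches=5000`.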

## Cleanup

If you want, you can delete the run and project from the anonlink-entity-service.

In [20]:
requests.delete(
    "{}projects/{}".format(url, project_id),  # url already ends with '/'
    headers={"Authorization": credentials['result_token']})

<Response [204]>