# Anonlink Entity Service API

This tutorial demonstrates how to interact with the Anonlink Entity Service directly via its REST API. The main
alternative is to use a library or tool such as `clkhash`, which handles the communication for you.

### Steps

* Check connection to Anonlink Entity Service
* Generate and encode synthetic data
* Create a new linkage project
* Upload the encodings
* Create a run
* Retrieve and analyse results

In [1]:
import json
import os
import time
import requests

from IPython.display import clear_output

## Check Connection

If you are connecting to a custom entity service, change the address here.

In [2]:
server = os.getenv("SERVER", "https://testing.es.data61.xyz")
url = server + "/api/v1/"
print(f'Testing anonlink-entity-service hosted at {url}')

Testing anonlink-entity-service hosted at https://testing.es.data61.xyz/api/v1/


In [3]:
requests.get(url + 'status').json()

{'project_count': 2278, 'rate': 3863861, 'status': 'ok'}
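
It can help to fail fast when the service is unhealthy. The following is a minimal sketch that inspects a status payload like the one above; `service_is_healthy` is a hypothetical helper name, and treating `'ok'` as the only healthy value is an assumption based on the sample response.

```python
def service_is_healthy(status: dict) -> bool:
    """Return True when a status payload reports the service as 'ok'."""
    # Assumption: 'ok' is the only healthy value (it is the one shown above).
    return status.get('status') == 'ok'


# Sample payload copied from the output above.
sample_status = {'project_count': 2278, 'rate': 3863861, 'status': 'ok'}
print(service_is_healthy(sample_status))
```

In the notebook this could be called as `service_is_healthy(requests.get(url + 'status', timeout=10).json())`, where the `timeout` keeps the cell from hanging on an unreachable server.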

## Data preparation

This section won't be explained in great detail as it directly follows the 
[clkhash tutorials](http://clkhash.readthedocs.io/en/latest/).

We will encode a synthetic dataset from the `recordlinkage` library using `clkhash`.


In [4]:
from tempfile import NamedTemporaryFile
from recordlinkage.datasets import load_febrl4

In [5]:
dfA, dfB = load_febrl4()

In [6]:
# `lineterminator` replaces the deprecated `line_terminator` keyword (pandas >= 1.5).
with open('a.csv', 'w') as a_csv:
    dfA.to_csv(a_csv, lineterminator='\n')

with open('b.csv', 'w') as b_csv:
    dfB.to_csv(b_csv, lineterminator='\n')

## Schema Preparation

The linkage schema must be agreed on by the two parties.
A hashing schema tells clkhash how to treat each column when
generating CLKs. A detailed description of the hashing schema can
be found in the [clkhash documentation](https://clkhash.readthedocs.io/en/latest/schema.html).


In [7]:
import clkhash
from clkhash.field_formats import (
    FieldHashingProperties, Ignore, IntegerSpec, MissingValueSpec, StringSpec)
schema = clkhash.randomnames.NameList.SCHEMA

schema.fields = [
    Ignore('rec_id'),
    StringSpec('given_name', FieldHashingProperties(ngram=2, k=15)),
    StringSpec('surname', FieldHashingProperties(ngram=2, k=15)),
    IntegerSpec('street_number', FieldHashingProperties(ngram=1, positional=True, k=15, missing_value=MissingValueSpec(sentinel=''))),
    StringSpec('address_1', FieldHashingProperties(ngram=2, k=15)),
    StringSpec('address_2', FieldHashingProperties(ngram=2, k=15)),
    StringSpec('suburb', FieldHashingProperties(ngram=2, k=15)),
    IntegerSpec('postcode', FieldHashingProperties(ngram=1, positional=True, k=15)),
    StringSpec('state', FieldHashingProperties(ngram=2, k=15)),
    IntegerSpec('date_of_birth', FieldHashingProperties(ngram=1, positional=True, k=15, missing_value=MissingValueSpec(sentinel=''))),
    Ignore('soc_sec_id')
]
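
To give a feel for what the `ngram` and `positional` settings mean, here is a simplified illustration of n-gram tokenisation. It is a sketch, not clkhash's implementation: the function name `ngrams` and the choice to pad with a single space on each side are assumptions made for illustration. Each resulting token is later inserted into the Bloom filter, with `k` governing how many bits a token sets.

```python
def ngrams(value: str, n: int = 2, positional: bool = False) -> list:
    """Split a field value into overlapping n-grams (illustrative only)."""
    # Pad with a space on each side for n > 1 so that the first and last
    # characters also produce boundary tokens (an illustrative choice).
    padded = ' {} '.format(value) if n > 1 else value
    tokens = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    if positional:
        # Prefix each token with its 1-based position, so the same digit
        # in a different place yields a different token.
        tokens = ['{} {}'.format(pos, tok)
                  for pos, tok in enumerate(tokens, start=1)]
    return tokens


print(ngrams('john'))                        # bigrams of a name
print(ngrams('2617', n=1, positional=True))  # positional unigrams of a postcode
```

Positional unigrams suit numeric fields such as `postcode`, where the position of a digit matters, while plain bigrams tolerate small spelling differences in names.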

## Encoding

The *raw* personally identifiable information is now transformed into CLK encodings according to the schema defined above. See the [clkhash](https://clkhash.readthedocs.io/) documentation for further details.

In [8]:
from clkhash import clk
with open('a.csv') as a_pii:
    hashed_data_a = clk.generate_clk_from_csv(a_pii, ('key1',), schema, validate=False)
    
with open('b.csv') as b_pii:
    hashed_data_b = clk.generate_clk_from_csv(b_pii, ('key1',), schema, validate=False)

generating CLKs: 100%|██████████| 5.00k/5.00k [00:02<00:00, 1.78kclk/s, mean=645, std=43.8]
generating CLKs: 100%|██████████| 5.00k/5.00k [00:02<00:00, 1.35kclk/s, mean=634, std=50.3]


## Create Linkage Project

The analyst carrying out the linkage starts by creating a linkage project of the desired output type with the Entity Service.


In [9]:
project_spec = {
  "schema": {},
  "result_type": "mapping",
  "number_parties": 2,
  "name": "API Tutorial Test"
}
credentials = requests.post(url + 'projects', json=project_spec).json()

project_id = credentials['project_id']
a_token, b_token = credentials['update_tokens']
credentials

{'project_id': 'e98ababc1a02a4057a13b39c846e9f219acf71bd0a4143c7',
 'result_token': '693c423c0c021f92a9f7b1658ef8f19beaa7b9c1b27ea22c',
 'update_tokens': ['57401d6c0edfa78abf3bd4a87936159f8c974f93dc352d21',
  '8c44139db950ca88f58f18d18e219f001fa105543a7b25e6']}

**Note:** the analyst will need to pass on the `project_id` (the 
id of the linkage project) and one of the two `update_tokens` to 
each data provider.
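
One way to prepare what each data provider receives is sketched below. `provider_credentials` is a hypothetical helper name; the sample values are copied from the output above. The `result_token` is deliberately left out, since only the analyst needs it.

```python
def provider_credentials(credentials: dict) -> list:
    """Pair the shared project id with one update token per data provider."""
    return [
        {'project_id': credentials['project_id'], 'update_token': token}
        for token in credentials['update_tokens']
    ]


# Sample credentials copied from the output above.
sample_credentials = {
    'project_id': 'e98ababc1a02a4057a13b39c846e9f219acf71bd0a4143c7',
    'result_token': '693c423c0c021f92a9f7b1658ef8f19beaa7b9c1b27ea22c',
    'update_tokens': ['57401d6c0edfa78abf3bd4a87936159f8c974f93dc352d21',
                      '8c44139db950ca88f58f18d18e219f001fa105543a7b25e6'],
}

for party in provider_credentials(sample_credentials):
    print(party)
```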

The `result_token` can also be used to carry out project API requests:

In [10]:
requests.get(url + 'projects/{}'.format(project_id), 
             headers={"Authorization": credentials['result_token']}).json()

{'error': False,
 'name': 'API Tutorial Test',
 'notes': '',
 'number_parties': 2,
 'parties_contributed': 0,
 'project_id': 'e98ababc1a02a4057a13b39c846e9f219acf71bd0a4143c7',
 'result_type': 'mapping',
 'schema': {}}

Now the two data providers can upload their data, each supplying its own *update token*.

## CLK Upload

In [12]:
a_response = requests.post(
    '{}projects/{}/clks'.format(url, project_id),
    json={'clks': hashed_data_a},
    headers={"Authorization": a_token}
).json()

In [13]:
b_response = requests.post(
    '{}projects/{}/clks'.format(url, project_id),
    json={'clks': hashed_data_b},
    headers={"Authorization": b_token}
).json()

Every upload gets a receipt token. In some operating modes this receipt is required to access the results.
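
The receipt can be read from the upload response. The sketch below assumes the response is a JSON object carrying the receipt under a `receipt_token` key. That key name, the helper name `receipt_from_upload`, and the placeholder payload are assumptions rather than something shown in this tutorial, so inspect an actual response before relying on them.

```python
def receipt_from_upload(upload_response: dict) -> str:
    """Return the receipt token from an upload response, or raise ValueError."""
    # Assumption: the receipt is stored under the key 'receipt_token'.
    try:
        return upload_response['receipt_token']
    except KeyError:
        raise ValueError(
            'upload response has no receipt token: {!r}'.format(upload_response)
        ) from None


# Placeholder payload for illustration only, not real service output.
placeholder_response = {'message': 'Updated', 'receipt_token': 'PLACEHOLDER-RECEIPT'}
print(receipt_from_upload(placeholder_response))
```

In the notebook, `receipt_from_upload(a_response)` and `receipt_from_upload(b_response)` would be the corresponding calls.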

## Create a run

Now that the project has been created and the CLK data uploaded, we can carry out privacy-preserving record linkage. Try a few different threshold values:

In [21]:
run_response = requests.post(
    "{}projects/{}/runs".format(url, project_id),
    headers={"Authorization": credentials['result_token']},
    json={
        'threshold': 0.80,
        'name': "Tutorial Run #1"
    }
).json()

In [22]:
run_id = run_response['run_id']
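
To compare several thresholds, one run can be created per value. The sketch below only builds the request payloads; `run_specs` and the naming scheme are hypothetical choices for illustration.

```python
def run_specs(thresholds, name_prefix='Tutorial Run'):
    """Build one run payload per threshold, named so runs are easy to tell apart."""
    return [
        {'threshold': t, 'name': '{} (threshold={:.2f})'.format(name_prefix, t)}
        for t in thresholds
    ]


for spec in run_specs([0.75, 0.80, 0.85]):
    print(spec)
```

Each payload could then be posted to the same `runs` endpoint used above by passing it as `json=spec`, keeping the returned `run_id` for every threshold.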

## Run Status

In [23]:
requests.get(
    '{}projects/{}/runs/{}/status'.format(url, project_id, run_id),
    headers={"Authorization": credentials['result_token']}
).json()

{'current_stage': {'description': 'compute similarity scores',
  'number': 2,
  'progress': {'absolute': 25000000,
   'description': 'number of already computed similarity scores',
   'relative': 1.0}},
 'stages': 3,
 'state': 'running',
 'time_added': '2019-04-30T12:18:44.633541+00:00',
 'time_started': '2019-04-30T12:18:44.778142+00:00'}
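
A status payload like the one above can be condensed into a single line for logging. This is a minimal sketch: `summarise_status` is a hypothetical helper, and treating `progress` as optional is an assumption, since not every stage necessarily reports it.

```python
def summarise_status(status: dict) -> str:
    """Render a run status payload as a short, human-readable line."""
    stage = status['current_stage']
    line = '{}: stage {}/{} ({})'.format(
        status['state'], stage['number'], status['stages'], stage['description'])
    # Assumption: 'progress' may be absent for some stages.
    progress = stage.get('progress')
    if progress is not None:
        line += ', {:.0%} done'.format(progress['relative'])
    return line


# Sample payload copied from the output above (timestamps omitted).
sample_run_status = {
    'current_stage': {
        'description': 'compute similarity scores',
        'number': 2,
        'progress': {'absolute': 25000000,
                     'description': 'number of already computed similarity scores',
                     'relative': 1.0}},
    'stages': 3,
    'state': 'running',
}
print(summarise_status(sample_run_status))
```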

## Results

After some delay (depending on the size of the data) we can fetch the results. This could be done by polling the REST API directly with `requests`; for simplicity we use the `watch_run_status` function provided by `clkhash.rest_client`.

> Note that `server` is passed rather than `url`.

In [24]:
import clkhash.rest_client
for update in clkhash.rest_client.watch_run_status(server, project_id, run_id, credentials['result_token'], timeout=300):
    clear_output(wait=True)
    print(clkhash.rest_client.format_run_status(update))


State: completed
Stage (3/3): compute output
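
For comparison, polling the status endpoint directly might look like the sketch below. `wait_for_run` is a hypothetical helper: it takes any zero-argument callable that returns a status dictionary, so the loop can be tried without a server. Only the `'running'` and `'completed'` states appear in this tutorial; treating `'error'` as a second terminal state is an assumption.

```python
import time


def wait_for_run(fetch_status, interval=1.0, timeout=300.0):
    """Call fetch_status() until the run reaches a terminal state or times out."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        # Assumption: 'completed' and 'error' are the terminal states.
        if status['state'] in ('completed', 'error'):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError('run did not finish within {} seconds'.format(timeout))
        time.sleep(interval)


# Stand-in for the REST call: two 'running' replies, then 'completed'.
_fake_replies = iter([{'state': 'running'}, {'state': 'running'}, {'state': 'completed'}])
final_status = wait_for_run(lambda: next(_fake_replies), interval=0.0)
print(final_status['state'])
```

In the notebook, `fetch_status` could be a small function that wraps the `requests.get(...).json()` status call from the Run Status section.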


In [25]:
data = json.loads(clkhash.rest_client.run_get_result_text(
    server, 
    project_id, 
    run_id, 
    credentials['result_token']))

The result is a one-to-one mapping between rows of the two datasets that were more similar than the given threshold.

In [30]:
for i in range(10):
    print("a[{}] maps to b[{}]".format(i, data['mapping'][str(i)]))
print("...")

a[0] maps to b[1449]
a[1] maps to b[2750]
a[2] maps to b[4656]
a[3] maps to b[4119]
a[4] maps to b[3306]
a[5] maps to b[2305]
a[6] maps to b[3944]
a[7] maps to b[992]
a[8] maps to b[4612]
a[9] maps to b[3629]
...


In this dataset there are 5000 records in common. With the chosen threshold and schema we currently retrieve:

In [31]:
len(data['mapping'])

4853
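
Retrieving 4853 of 5000 pairs caps recall at 4853 / 5000 ≈ 97.1%, and only if every returned pair is correct. Precision and recall can be measured with the record ids, as sketched below. The sketch assumes the FEBRL4 id convention (`rec-<n>-org` in the first dataset, `rec-<n>-dup-0` in the second) and that mapping indices are row positions in the uploaded files; `entity_number` and `evaluate_mapping` are hypothetical helper names, and the toy data is made up.

```python
import re


def entity_number(rec_id: str) -> str:
    """Extract the entity number from ids such as 'rec-12-org' or 'rec-12-dup-0'."""
    match = re.match(r'rec-(\d+)-', rec_id)
    if match is None:
        raise ValueError('unexpected record id: {!r}'.format(rec_id))
    return match.group(1)


def evaluate_mapping(mapping: dict, ids_a: list, ids_b: list) -> tuple:
    """Return (precision, recall) of a row-index mapping against the record ids."""
    correct = sum(
        1 for a_idx, b_idx in mapping.items()
        if entity_number(ids_a[int(a_idx)]) == entity_number(ids_b[int(b_idx)])
    )
    # Entities present in both datasets are the pairs that could be found.
    common = len({entity_number(r) for r in ids_a} & {entity_number(r) for r in ids_b})
    precision = correct / len(mapping) if mapping else 0.0
    recall = correct / common if common else 0.0
    return precision, recall


# Toy data: a[0] is matched correctly, a[1] is matched to the wrong record.
toy_ids_a = ['rec-1-org', 'rec-2-org', 'rec-3-org']
toy_ids_b = ['rec-3-dup-0', 'rec-1-dup-0', 'rec-2-dup-0']
toy_mapping = {'0': 1, '1': 0}
print(evaluate_mapping(toy_mapping, toy_ids_a, toy_ids_b))
```

With the notebook's variables, the call would be `evaluate_mapping(data['mapping'], list(dfA.index), list(dfB.index))`.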

## Cleanup

If you want, you can delete the run and project from the Anonlink Entity Service.

In [44]:
# `url` already ends with a slash, so no extra separator is needed.
requests.delete(
    "{}projects/{}".format(url, project_id),
    headers={"Authorization": credentials['result_token']})

<Response [403]>
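
Because `url` already ends with `/`, building paths by string formatting makes it easy to end up with a doubled or missing slash. A small helper can normalise the separators. This is a sketch: `endpoint` is a hypothetical name and `abc123` is a placeholder project id.

```python
def endpoint(base: str, *parts: str) -> str:
    """Join a base URL and path segments with exactly one slash between them."""
    cleaned = [base.rstrip('/')] + [str(part).strip('/') for part in parts]
    return '/'.join(cleaned)


print(endpoint('https://testing.es.data61.xyz/api/v1/', 'projects', 'abc123'))
```

The same helper gives an identical result whether or not the base URL or the segments carry leading or trailing slashes.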