# Entity Service Similarity Score Output

This example shows how to retrieve the similarity scores from the Entity Service. Note that running this notebook saves several files into your working directory.

First we create a new mapping with the type set to `similarity_scores`.

In [1]:
!clkutil create \
    --type similarity_scores \
    --threshold 0.95 \
    --output credentials.json \
    --server https://es.data61.xyz

Entity Matching Server: https://es.data61.xyz
Checking server status
Server Status: ok
Schema: NOT PROVIDED
Type: similarity_scores
Creating new mapping
Mapping created


We load the credentials saved by the command line tool into a Python dict:

In [2]:
import json
import requests

with open('credentials.json','r') as f:
    credentials = json.load(f)
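
The later cells read three keys from this dict: `resource_id`, `update_tokens` (one token per data provider) and `result_token`. A minimal sketch of a sanity check, assuming the file has exactly the shape those cells rely on. The helper name `check_credentials` and the placeholder values are hypothetical and not part of `clkhash`:

```python
def check_credentials(credentials):
    """Return the names of expected entries that are missing (hypothetical helper)."""
    expected = ('resource_id', 'result_token', 'update_tokens')
    missing = [key for key in expected if key not in credentials]
    # Two data providers (Alice and Bob) each need their own update token.
    if 'update_tokens' not in missing and len(credentials['update_tokens']) < 2:
        missing.append('update_tokens[1]')
    return missing


# Made-up placeholder values, only to show the call.
example = {'resource_id': 'abc', 'result_token': 'r', 'update_tokens': ['u1', 'u2']}
print(check_credentials(example))
```

An empty list means every entry used later in the notebook is present.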

Now we need some entity information to match. For testing purposes the `clkhash` tool can generate fake data:

In [3]:
!clkutil generate 2000 raw_pii_2k.csv

Split the fake PII data into somewhat overlapping sets.
Alice will have 1500 entities, Bob will have 999 (the header takes one of his 1000 lines), and 499 entities overlap.

In [4]:
!head -n 1 raw_pii_2k.csv > alice.txt
!tail -n 1500 raw_pii_2k.csv >> alice.txt
!head -n 1000 raw_pii_2k.csv > bob.txt
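
To see how much the two files overlap, the same split can be reproduced on row indices in Python. This is only a sketch: it assumes the generated file has one header line followed by 2000 data rows, and `split_rows` is a hypothetical helper, not part of `clkhash`.

```python
def split_rows(rows, n_alice=1500, n_bob_lines=1000):
    """Mimic the head/tail split on the data rows (header excluded)."""
    alice = rows[-n_alice:]          # tail -n 1500: the last 1500 data rows
    bob = rows[:n_bob_lines - 1]     # head -n 1000: header plus the first 999 data rows
    return alice, bob


rows = list(range(2000))             # stand-ins for the 2000 generated entities
alice, bob = split_rows(rows)
overlap = set(alice) & set(bob)
print(len(alice), len(bob), len(overlap))
```

Under that assumption the counts are 1500, 999 and 499, which agrees with the 999 CLKs reported for Bob and the 499 scores returned later in the notebook.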

The generated data is very simple fake PII with the columns `ID, Name, YOB, Gender`:

In [5]:
!tail -n 2 bob.txt

997,Vivian Modi,1933/03/31,F
998,Latonia Shumpert,1958/01/15,F


We have generated *raw* identity information, which must be hashed before it is uploaded:

In [6]:
# Hash the data using the secret keys that the linkage authority doesn't know
!clkutil hash alice.txt horse staple  alice-hashed.json
!clkutil hash bob.txt horse staple  bob-hashed.json

generating CLKs: 100%|█| 1.50K/1.50K [00:00<00:00, 1.11Kclk/s, mean=521, std=36.4]
CLK data written to alice-hashed.json
generating CLKs: 100%|███| 999/999 [00:00<00:00, 4.86Kclk/s, mean=521, std=36.4]
CLK data written to bob-hashed.json


In [7]:
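def upload_data(mapping, apikey, server, data):
    """Upload hashed data to a mapping and return the decoded JSON response."""
    response = requests.put(
        '{}/api/v1/mappings/{}'.format(server, mapping),
        data=data,
        headers={
            # Each data provider authenticates with its own update token.
            "Authorization": apikey,
            'content-type': 'application/json'
        }
    )
    return response.json()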

In [8]:
# Open the file in a with-block so the handle is closed after the upload
with open('alice-hashed.json', 'r') as f:
    alice_upload_response = upload_data(
        mapping=credentials['resource_id'],
        apikey=credentials['update_tokens'][0],
        server="https://es.data61.xyz",
        data=f
    )

In [9]:
# Open the file in a with-block so the handle is closed after the upload
with open('bob-hashed.json', 'r') as f:
    bob_upload_response = upload_data(
        mapping=credentials['resource_id'],
        apikey=credentials['update_tokens'][1],
        server="https://es.data61.xyz",
        data=f
    )

Every upload gets a receipt token. In some operating modes this receipt is required to access the results.

In [10]:
mid = credentials['resource_id']
alice_receipt_token = alice_upload_response['receipt-token']
bob_receipt_token = bob_upload_response['receipt-token']

In [11]:
# Now after some delay (depending on the size) we can fetch the resulting sparse matrix
!clkutil results \
    --mapping="{credentials['resource_id']}" \
    --server https://es.data61.xyz \
    --apikey="{credentials['result_token']}" --output results.txt

Checking server status
Status: ok
Response code: 200
Received result
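
Because the result is only available after some delay, a script would usually retry the fetch rather than call it once. The following is a generic sketch of such a retry loop. It assumes a caller-supplied `fetch` function that returns `None` while the result is not ready. Both `wait_for_result` and `fake_fetch` are hypothetical names, and how the Entity Service actually signals an unfinished mapping is not shown in this notebook.

```python
import time


def wait_for_result(fetch, attempts=10, delay=1.0):
    """Call fetch() until it returns something other than None, or give up."""
    for attempt in range(attempts):
        result = fetch()
        if result is not None:
            return result
        if attempt < attempts - 1:
            time.sleep(delay)        # wait before the next try
    raise TimeoutError('result not ready after {} attempts'.format(attempts))


# Stand-in for a real request: reports "not ready" twice, then returns a result.
calls = {'n': 0}

def fake_fetch():
    calls['n'] += 1
    return {'similarity_scores': []} if calls['n'] >= 3 else None


print(wait_for_result(fake_fetch, delay=0))
```

In real use, `fetch` would wrap whatever request retrieves the result, and `delay` would be scaled to the size of the uploaded data.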


In [12]:
import json
with open('results.txt','r') as f:
    sparse_scores = json.load(f)['similarity_scores']

In [13]:
print(len(sparse_scores))

499


In [14]:
for i, (index_a, index_b, dice_score) in enumerate(sparse_scores):
    print(index_a, index_b, dice_score)
    
    if i > 20:
        break

500 0 1.0
501 1 1.0
502 2 1.0
503 3 1.0
504 4 1.0
505 5 1.0
506 6 1.0
507 7 1.0
508 8 1.0
509 9 1.0
510 10 1.0
511 11 1.0
512 12 1.0
513 13 1.0
514 14 1.0
515 15 1.0
516 16 1.0
517 17 1.0
518 18 1.0
519 19 1.0
520 20 1.0
521 21 1.0
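
Each entry in the list is a triple of two record indices and a score. When a record appears in several triples, a common follow-up step is to keep only its best-scoring partner. A minimal sketch of that step, assuming the triple layout printed above; `best_matches` is a hypothetical helper and the sample triples are made up for illustration:

```python
def best_matches(sparse_scores, threshold=0.95):
    """Map each first index to its highest-scoring (second index, score) pair."""
    best = {}
    for index_a, index_b, score in sparse_scores:
        if score < threshold:
            continue                 # ignore pairs below the chosen threshold
        if index_a not in best or score > best[index_a][1]:
            best[index_a] = (index_b, score)
    return best


# Made-up triples: index 501 has two candidates, only the better one is kept.
sample = [[500, 0, 1.0], [501, 1, 1.0], [501, 7, 0.96]]
print(best_matches(sample))
```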


In this case the data hasn't been perturbed, so all the scores are 1.0.
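
The variable name `dice_score` suggests the scores are Sørensen-Dice coefficients, which equal 1.0 only when the two hashes have exactly the same bits set. The sketch below illustrates that property on small made-up sets of bit positions. It is an illustration of the coefficient itself, not the Entity Service's implementation, and the function name `dice_coefficient` is hypothetical.

```python
def dice_coefficient(bits_a, bits_b):
    """Sørensen-Dice coefficient of two sets of set-bit positions."""
    if not bits_a and not bits_b:
        return 0.0                   # convention chosen for this sketch
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


original = {1, 4, 9, 16, 25, 36}
perturbed = {1, 4, 9, 16, 25, 49}    # one set bit differs from the original

print(dice_coefficient(original, original))    # identical sets
print(dice_coefficient(original, perturbed))   # one differing bit lowers the score
```

Identical sets score 1.0, while a single differing bit already pulls the score below 1.0, which is why unperturbed duplicates all come back with a perfect score.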