# Tutorial for Python API

In this tutorial we are going to process a data set for private linkage with clkhash using the Python API. Note that you can also use the command line tool.

The Python package `recordlinkage` has a [tutorial](http://recordlinkage.readthedocs.io/en/latest/notebooks/link_two_dataframes.html) linking data sets in the clear. We will try to duplicate that in a privacy-preserving setting.

First install clkhash, recordlinkage and a few data science tools (pandas and numpy).

In [None]:
!pip install -U clkhash recordlinkage numpy pandas

In [1]:
import io
import numpy as np
import pandas as pd

In [2]:
import clkhash
from clkhash.field_formats import *
import recordlinkage
from recordlinkage.datasets import load_febrl4

## Data Exploration

First we have a look at the dataset.

In [3]:
dfA, dfB = load_febrl4()

dfA.head()

rec_id,given_name,surname,street_number,address_1,address_2,suburb,postcode,state,date_of_birth,soc_sec_id
rec-1070-org,michaela,neumann,8,stanley street,miami,winston hills,4223,nsw,19151111,5304218
rec-1016-org,courtney,painter,12,pinkerton circuit,bega flats,richlands,4560,vic,19161214,4066625
rec-4405-org,charles,green,38,salkauskas crescent,kela,dapto,4566,nsw,19480930,4365168
rec-1288-org,vanessa,parr,905,macquoid place,broadbridge manor,south grafton,2135,sa,19951119,9239102
rec-3585-org,mikayla,malloney,37,randwick road,avalind,hoppers crossing,4552,vic,19860208,7207688


For this linkage we will **not** use the social security id column.

In [4]:
dfA.columns

Index(['given_name', 'surname', 'street_number', 'address_1', 'address_2',
       'suburb', 'postcode', 'state', 'date_of_birth', 'soc_sec_id'],
      dtype='object')

In [5]:
a_csv = io.StringIO()
dfA.to_csv(a_csv)
a_csv.seek(0)

0

## Hashing Schema Definition

A hashing schema instructs clkhash how to treat each column when generating CLKs. A detailed description of the hashing schema can be found in the [API docs](http://clkhash.readthedocs.io/en/latest/schema.html). We will ignore the columns 'rec_id' and 'soc_sec_id' for CLK generation.
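
To make the `ngram` and `positional` settings used in the schema more concrete, here is a minimal sketch of n-gram tokenization. The helper `ngrams` is hypothetical and written for illustration only. It assumes that values are padded with spaces and that positional tokens are prefixed with their 1-based index, which may differ from what clkhash does internally.

```python
def ngrams(value, n=2, positional=False):
    """Split a string into n-grams (hypothetical helper, not part of clkhash).

    Assumption: the value is padded with n-1 spaces on each side, and
    positional tokens are prefixed with their 1-based position.
    """
    padded = " " * (n - 1) + value + " " * (n - 1)
    tokens = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    if positional:
        tokens = ["{} {}".format(i, t) for i, t in enumerate(tokens, start=1)]
    return tokens


# Bigrams of a name: small typos only change a few tokens.
print(ngrams("anna"))                       # [' a', 'an', 'nn', 'na', 'a ']

# Plain unigrams cannot tell two postcodes with swapped digits apart ...
print(set(ngrams("4223", n=1)) == set(ngrams("2423", n=1)))   # True

# ... but positional unigrams can.
print(ngrams("4223", n=1, positional=True))  # ['1 4', '2 2', '3 2', '4 3']
```

This is one way to see why the schema uses positional unigrams for numeric columns such as the postcode: the order of the digits carries meaning, while names benefit from tolerant bigram matching.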



In [6]:
schema = clkhash.randomnames.NameList.SCHEMA
schema.fields = [
    Ignore('rec_id'),
    StringSpec('given_name', FieldHashingProperties(ngram=2, weight=1)),
    StringSpec('surname', FieldHashingProperties(ngram=2, weight=1)),
    IntegerSpec('street_number', FieldHashingProperties(ngram=1, positional=True, weight=1)),
    StringSpec('address_1', FieldHashingProperties(ngram=2, weight=1)),
    StringSpec('address_2', FieldHashingProperties(ngram=2, weight=1)),
    StringSpec('suburb', FieldHashingProperties(ngram=2, weight=1)),
    IntegerSpec('postcode', FieldHashingProperties(ngram=1, positional=True, weight=1)),
    StringSpec('state', FieldHashingProperties(ngram=2, weight=1)),
    IntegerSpec('date_of_birth', FieldHashingProperties(ngram=1, positional=True, weight=1)),
    Ignore('soc_sec_id')
    ]

## Hash the data

We can now hash our PII from the CSV file using the schema we defined. We must provide two *secret keys* to this command; both parties hashing data have to use the same keys. For this toy example we will use the keys _'key1'_ and _'key2'_. For real data, make sure that the keys contain enough entropy, as knowledge of these keys is sufficient to reconstruct the PII from a CLK! Also, **do not share these keys with anyone, except the other participating party.**
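
To illustrate why both parties need the same two keys, here is a minimal sketch of keyed double hashing in the style commonly described for CLKs. The function `bit_positions` is hypothetical, and the choice of HMAC-SHA1 plus HMAC-MD5 is an assumption for illustration. It is not meant to reproduce clkhash's actual encoding.

```python
import hashlib
import hmac


def bit_positions(token, keys, k=3, l=1024):
    """Return k bit positions for one n-gram (hypothetical helper).

    Assumption: double hashing, position_i = (h1 + i * h2) mod l, where
    h1 and h2 are keyed hashes of the token under the two secret keys.
    """
    h1 = int.from_bytes(
        hmac.new(keys[0].encode(), token.encode(), hashlib.sha1).digest(), "big")
    h2 = int.from_bytes(
        hmac.new(keys[1].encode(), token.encode(), hashlib.md5).digest(), "big")
    return [(h1 + i * h2) % l for i in range(k)]


# The same token hashed with the same keys always sets the same bits,
# which is what makes CLKs from two parties comparable.
same_keys_a = bit_positions("an", ("key1", "key2"))
same_keys_b = bit_positions("an", ("key1", "key2"))
print(same_keys_a == same_keys_b)  # True

# With different keys the positions are, in all likelihood, unrelated.
print(bit_positions("an", ("other1", "other2")))
```

Without the keys, an attacker cannot recompute which bits a given n-gram sets, which is why the keys must stay secret and have enough entropy.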

In [7]:
from clkhash import clk
hashed_data_a = clk.generate_clk_from_csv(a_csv, ('key1', 'key2'), schema, validate=False)

generating CLKs: 100%|██████████| 5.00k/5.00k [00:05<00:00, 685clk/s, mean=885, std=33.4]


## Inspect the output

clkhash has hashed the PII, creating a Cryptographic Longterm Key (CLK) for each entity. The output of `generate_clk_from_csv` shows that the mean popcount is quite high (885 out of 1024), which can affect accuracy.

There are two ways to control the popcount:
- You can change the _'k'_ value in the hashConfig section of the schema. It controls the number of entries in the CLK for each n-gram.
- Or you can modify the individual _'weight'_ values for the different fields. This lets you tune the contribution of a column to the CLK, and can be used to de-emphasise columns which are less suitable for linkage (e.g. information that changes frequently).
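
The effect of both knobs can be sketched with a toy Bloom filter. Everything here is hypothetical: `positions` and `toy_popcount` are illustration helpers, unkeyed SHA-256 stands in for the real hashing, and the sketch assumes that a field's weight simply scales the number of bits set per n-gram.

```python
import hashlib


def positions(token, k, l=1024):
    """k bit positions for one token (toy, unkeyed hashing for illustration)."""
    return {
        int.from_bytes(hashlib.sha256("{}:{}".format(i, token).encode()).digest(), "big") % l
        for i in range(k)
    }


def toy_popcount(fields, k, l=1024):
    """Number of set bits after inserting all tokens of all fields.

    fields: list of (tokens, weight) pairs.
    Assumption: each n-gram of a field sets round(k * weight) bits.
    """
    bits = set()
    for tokens, weight in fields:
        k_eff = int(round(k * weight))
        for token in tokens:
            bits |= positions(token, k_eff, l)
    return len(bits)


def bigrams(s):
    return [s[i:i + 2] for i in range(len(s) - 1)]


name = bigrams("michaela neumann")
address = bigrams("8 stanley street winston hills")

# Fewer bits per n-gram gives a lower popcount.
print(toy_popcount([(name, 1), (address, 1)], k=30))
print(toy_popcount([(name, 1), (address, 1)], k=15))

# Halving the weight of the address field also lowers the popcount,
# while leaving the contribution of the name untouched.
print(toy_popcount([(name, 1), (address, 0.5)], k=30))
```

The exact numbers depend on the hash function, so they are not shown here; the point is only the direction of the change.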

First, we will change the value of *k* from 30 to 15.

In [8]:
schema.hashing_globals.k = 15
a_csv.seek(0)
hashed_data_a = clk.generate_clk_from_csv(a_csv, ('key1', 'key2'), schema, validate=False)

generating CLKs: 100%|██████████| 5.00k/5.00k [00:04<00:00, 934clk/s, mean=648, std=44.1]


Now we will modify the weights to de-emphasise the contribution of the address-related columns. We also set *k* to 20.

In [9]:
schema.hashing_globals.k = 20
schema.fields = [
    Ignore('rec_id'),
    StringSpec('given_name', FieldHashingProperties(ngram=2, weight=1)),
    StringSpec('surname', FieldHashingProperties(ngram=2, weight=1)),
    IntegerSpec('street_number', FieldHashingProperties(ngram=1, positional=True, weight=0.5)),
    StringSpec('address_1', FieldHashingProperties(ngram=2, weight=0.5)),
    StringSpec('address_2', FieldHashingProperties(ngram=2, weight=0.5)),
    StringSpec('suburb', FieldHashingProperties(ngram=2, weight=0.5)),
    IntegerSpec('postcode', FieldHashingProperties(ngram=1, positional=True, weight=0.5)),
    StringSpec('state', FieldHashingProperties(ngram=2, weight=0.5)),
    IntegerSpec('date_of_birth', FieldHashingProperties(ngram=1, positional=True, weight=1)),
    Ignore('soc_sec_id')
    ]
a_csv.seek(0)
hashed_data_a = clk.generate_clk_from_csv(a_csv, ('key1', 'key2'), schema, validate=False)

generating CLKs: 100%|██████████| 5.00k/5.00k [00:05<00:00, 894clk/s, mean=602, std=39.8]


Each CLK is serialized in a JSON friendly base64 format:

In [10]:
hashed_data_a[0]

'BD8JWW7DzwP82PjV5/jbN40+bT3V4z7V+QBtHYcdF32WpPvDvHUdLXCX3tuV1/4rv+23v9R1fKmJcmoNi7OvoecRLMnHzqv9J5SfT15VXe7KPht9d49zRt73+l3Tfs+Web8kx32vSdo+SfnlHqKbn11V6w9zFm3kb07e67MX7tw='
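
As a quick sanity check, a serialized CLK can be decoded and its popcount computed with the standard library alone. This is a minimal sketch; `popcount_b64` is a hypothetical helper, and it assumes the string is plain base64 of the raw filter bytes.

```python
import base64


def popcount_b64(clk_b64):
    """Return (number of bits, number of set bits) of a base64-encoded CLK."""
    raw = base64.b64decode(clk_b64)
    set_bits = sum(bin(byte).count("1") for byte in raw)
    return len(raw) * 8, set_bits


# Tiny stand-in value: three bytes 0xFF, 0x0F, 0x00 have 8 + 4 + 0 set bits.
example = base64.b64encode(bytes([0xFF, 0x0F, 0x00])).decode()
print(popcount_b64(example))  # (24, 12)
```

Applied to `hashed_data_a[0]`, the first number should match the filter length from the schema and the second should be in the vicinity of the mean reported by the progress bar.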

## Hash data set B

Now we hash the second dataset using the same keys and same schema.

In [11]:
b_csv = io.StringIO()
dfB.to_csv(b_csv)
b_csv.seek(0)
hashed_data_b = clkhash.clk.generate_clk_from_csv(b_csv, ('key1', 'key2'), schema, validate=False)

generating CLKs: 100%|██████████| 5.00k/5.00k [00:05<00:00, 866clk/s, mean=592, std=45.5]


In [12]:
len(hashed_data_b)

5000

## Find matches between the two sets of CLKs

We have generated two sets of CLKs which represent entity information in a privacy-preserving way. The more similar two CLKs are, the more likely it is that they represent the same entity.
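
A common way to quantify how similar two CLKs are is the Sørensen-Dice coefficient of their set bits. The sketch below assumes this is the similarity measure in use; the `dice` function is a hypothetical pure-Python helper for illustration, not the optimised routine of any library.

```python
def dice(a, b):
    """Sørensen-Dice coefficient of two equally long byte strings.

    2 * |bits set in both| / (|bits set in a| + |bits set in b|)
    """
    x = int.from_bytes(a, "big")
    y = int.from_bytes(b, "big")
    common = bin(x & y).count("1")
    total = bin(x).count("1") + bin(y).count("1")
    return 2 * common / total if total else 0.0


clk_1 = bytes([0b11110000])
clk_2 = bytes([0b11000000])

print(dice(clk_1, clk_1))  # 1.0, identical filters
print(dice(clk_1, clk_2))  # 2 * 2 / (4 + 2), roughly 0.667
```

A score of 1.0 means the two filters are identical, and the score drops as the underlying records diverge.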

For this task we will use [anonlink](https://github.com/n1analytics/anonlink), a Python (and optimised C++) implementation of anonymous linkage using CLKs.

In [None]:
!pip install -U anonlink

In [13]:
from anonlink.entitymatch import calculate_mapping_greedy
from bitarray import bitarray
import base64

def deserialize_bitarray(bytes_data):
    ba = bitarray(endian='big')
    data_as_bytes = base64.decodebytes(bytes_data.encode())
    ba.frombytes(data_as_bytes)
    return ba

def deserialize_filters(filters):
    res = []
    for i, f in enumerate(filters):
        ba = deserialize_bitarray(f)
        res.append((ba, i, ba.count()))
    return res

clks_a = deserialize_filters(hashed_data_a)
clks_b = deserialize_filters(hashed_data_b)

mapping = calculate_mapping_greedy(clks_a, clks_b, threshold=0.9, k=5000)
print('found {} matches'.format(len(mapping)))

found 3635 matches


Let's investigate some of those matches and the overall matching quality.

In [14]:
a_csv.seek(0)
b_csv.seek(0)
a_raw = a_csv.readlines()
b_raw = b_csv.readlines()

num_entities = len(b_raw) - 1

print('idx_a, idx_b, rec_id_a, rec_id_b')
print('--------------------------------')
for a_i in range(10):
    if a_i in mapping:
        a_data = a_raw[a_i + 1].split(',')
        b_data = b_raw[mapping[a_i] + 1].split(',')
        print('{}, {}, {}, {}'.format(a_i+1, mapping[a_i]+1, a_data[0], b_data[0]))

TP = 0; FP = 0; TN = 0; FN = 0
for a_i in range(num_entities):
    if a_i in mapping:
        if a_raw[a_i + 1].split(',')[0].split('-')[1] == b_raw[mapping[a_i] + 1].split(',')[0].split('-')[1]:
            TP += 1
        else:
            FP += 1
            FN += 1 # as we only report one mapping for each element in PII_a, then a wrong mapping is not only a false positive, but also a false negative, as we won't report the true mapping.
    else:
        FN += 1 # every element in PII_a has a partner in PII_b

print('--------------------------------')
print('Precision: {}, Recall: {}, Accuracy: {}'.format(TP/(TP+FP), TP/(TP+FN), (TP+TN)/(TP+TN+FP+FN)))

idx_a, idx_b, rec_id_a, rec_id_b
--------------------------------
2, 2751, rec-1016-org, rec-1016-dup-0
3, 4657, rec-4405-org, rec-4405-dup-0
4, 4120, rec-1288-org, rec-1288-dup-0
5, 3307, rec-3585-org, rec-3585-dup-0
7, 3945, rec-1985-org, rec-1985-dup-0
8, 993, rec-2404-org, rec-2404-dup-0
9, 4613, rec-1473-org, rec-1473-dup-0
10, 3630, rec-453-org, rec-453-dup-0
--------------------------------
Precision: 1.0, Recall: 0.727, Accuracy: 0.727


Precision tells us how many of the found matches are actual matches. The score of 1.0 means that we did perfectly in this respect. However, recall, the measure of how many of the actual matches were correctly identified, is quite low at only 73%.
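
The reported figures can be reproduced from the counts alone. The sketch below assumes the printed precision of 1.0 is exact, so that all 3635 found matches are true positives, and uses the fact that each of the 5000 entities in the first data set has a partner in the second.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positives, false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


found = 3635       # matches reported at threshold 0.9
entities = 5000    # every entity has exactly one true partner

tp = found             # assumption: no wrong matches (precision printed as 1.0)
fp = found - tp        # 0
fn = entities - tp     # 1365 true matches were not found

print(precision_recall(tp, fp, fn))  # (1.0, 0.727)
```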

Let's go back to the mapping calculation (`calculate_mapping_greedy`) and reduce the value for `threshold` to `0.8`.

With this threshold value we get a precision of 100% and a recall of 95.3%.

The explanation is that when the information about an entity differs slightly between the two datasets (e.g. spelling errors, abbreviations, missing values, ...), the corresponding CLKs will differ in some number of bits as well. For the datasets in this tutorial the perturbations are such that only 72.7% of the derived CLK pairs overlap more than 90%, whereas almost all matching pairs overlap more than 80%.

If we keep reducing the threshold value, we will start to observe mistakes in the found matches, so the precision decreases. At the same time the recall will keep increasing for a while, as a lower threshold allows more of the actual matches to be found. For example, for threshold 0.72 we get precision 0.997 and recall 0.992. However, reducing the threshold further will eventually lead to a decrease in both precision and recall: for threshold 0.65, precision is 0.983 and recall is 0.980. Thus it is important to choose a threshold that is appropriate for the amount of perturbation present in the data.
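
The trade-off can be reproduced on a toy example. The sketch below is a simplified greedy one-to-one matcher written for illustration; `greedy_match`, `evaluate` and the similarity scores are all made up, and none of it is meant to mirror the internals of `calculate_mapping_greedy`.

```python
def greedy_match(scores, threshold):
    """Greedy one-to-one matching: accept the most similar pairs first.

    scores: dict mapping (index_a, index_b) to a similarity in [0, 1].
    """
    mapping, used_b = {}, set()
    for (a, b), s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if s < threshold:
            break  # all remaining pairs are below the threshold
        if a not in mapping and b not in used_b:
            mapping[a] = b
            used_b.add(b)
    return mapping


def evaluate(mapping, truth):
    """Precision and recall, counting every unmatched true pair as a false negative."""
    tp = sum(1 for a, b in mapping.items() if truth.get(a) == b)
    precision = tp / len(mapping) if mapping else 0.0
    recall = tp / len(truth)
    return precision, recall


# Made-up scores: entity i in A truly belongs to entity i in B, but the
# wrong pair (2, 3) happens to score higher than the correct pair (2, 2).
truth = {0: 0, 1: 1, 2: 2, 3: 3}
scores = {(0, 0): 0.95, (1, 1): 0.85, (2, 3): 0.78, (2, 2): 0.75, (3, 3): 0.60}

for threshold in (0.9, 0.8, 0.7, 0.5):
    print(threshold, evaluate(greedy_match(scores, threshold), truth))
```

In this toy data, lowering the threshold from 0.9 to 0.8 raises recall without hurting precision, while going down to 0.7 admits the wrong pair, which lowers precision and also blocks the correct partners of entities 2 and 3 from ever being matched.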

This concludes the tutorial. Feel free to go back to the CLK generation and experiment with how different settings affect the matching quality.