<a href="https://colab.research.google.com/github/izzyizzy44/data_science/blob/main/secure_record_linkage.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Privacy-preserving record linkage

This notebook demonstrates two basic privacy-preserving record-matching techniques: Bloom filters and cryptographic hashing.

We use mock data to simulate how these methods can find common elements between datasets without revealing the underlying data.

It also includes a basic example of how the approach could be applied to an rdmf id, where we provide a different id for the same record in each project.



In [2]:
import pandas as pd
from google.cloud import bigquery
from google.cloud import storage
import io
import hashlib

Part 1. Bloom filters are probabilistic data structures used to test whether an element is a member of a set. They are space-efficient, but they can return false positives (an item reported as present when it is not). They never return false negatives.

Here we create a simple Bloom filter for each dataset and then use one of them to find elements common to the two sets of data.

In [3]:
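def create_simple_bloom_filter(items, filter_size=100):
    # Simplified filter: one hash position per item (real Bloom filters use several).
    bloom_filter = [False] * filter_size
    for item in items:
        # Note: built-in hash() is randomised per process for strings, so these
        # filters are only comparable within a single Python session.
        index = hash(item) % filter_size
        bloom_filter[index] = True
    return bloom_filter

def is_in_bloom_filter(item, bloom_filter, filter_size=100):
    # True means "possibly present" (false positives can occur); False means "definitely absent".
    # filter_size must match the size used to build the filter.
    index = hash(item) % filter_size
    return bloom_filter[index]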

# Mock data
names_hospital_a = ["Alice", "Bob", "Charlie", "David","izzy","test"]
names_hospital_b = ["Eva", "Frank", "Charlie", "Grace","izzy","test"]

# Creating Bloom Filters
bloom_filter_a = create_simple_bloom_filter(names_hospital_a)
bloom_filter_b = create_simple_bloom_filter(names_hospital_b)

# Checking for common elements
common_names = [name for name in names_hospital_b
                if is_in_bloom_filter(name, bloom_filter_a)]

common_names


['Charlie', 'izzy', 'test']
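
The single-hash filter above relies on Python's built-in `hash()`, which is randomised per process for strings, so two parties running separate processes would not produce comparable filters. A minimal sketch of a more conventional variant follows, assuming each item sets `k` positions derived from SHA-256 digests. The names `StableBloomFilter`, `size`, and `num_hashes` are illustrative and not part of the original notebook.

```python
import hashlib

class StableBloomFilter:
    """Bloom filter with k positions per item derived from SHA-256 (illustrative sketch)."""

    def __init__(self, size=1000, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _indexes(self, item):
        # Derive k positions by prefixing the item with a counter before hashing.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx] = True

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[idx] for idx in self._indexes(item))

filter_a = StableBloomFilter()
for name in ["Alice", "Bob", "Charlie", "David", "izzy", "test"]:
    filter_a.add(name)

hospital_b = ["Eva", "Frank", "Charlie", "Grace", "izzy", "test"]
possible_matches = [name for name in hospital_b if name in filter_a]
print(possible_matches)
```

Because the positions come from SHA-256 rather than `hash()`, the same inputs give the same filter in any process. The result can still contain false positives, so matches should be read as "possibly common", not "certainly common".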

## Cryptographic Hashing Basics 

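Cryptographic hashing creates a fixed-size string (hash) for each record. The hash is not guaranteed to be unique, but collisions are so unlikely with a function such as SHA-256 that equal hashes can be treated as equal inputs. By comparing hashes, we can identify matching records without exchanging the actual data.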

In [4]:
# Example string
my_string = "izzy"

# Creating a SHA256 hash of the string
hash_object = hashlib.sha256(my_string.encode())
hash_hex = hash_object.hexdigest()

print(hash_hex)


7c7b6b57db3880fc0b4157c962100103f388191c24cf816c726854d17a2af696


In [5]:
def create_hashed_set(items):
    return {hashlib.sha256(item.encode()).hexdigest() for item in items}

# Hashing the data
hashed_names_a = create_hashed_set(names_hospital_a)
hashed_names_b = create_hashed_set(names_hospital_b)

# Finding common hashes
common_hashes = hashed_names_a.intersection(hashed_names_b)

common_hashes


{'6e81b1255ad51bb201a2b8afa9b66653297ae0217f833b14b39b5231228bf968',
 '7c7b6b57db3880fc0b4157c962100103f388191c24cf816c726854d17a2af696',
 '9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08'}
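
Exact hash comparison only matches values that are byte-for-byte identical, so differences in case or stray whitespace prevent a match. A minimal sketch of normalising values before hashing follows. The normalisation rules (Unicode NFKC, trimmed whitespace, case folding) and the function names `normalise` and `hash_normalised` are assumptions chosen for illustration, not rules from the original notebook.

```python
import hashlib
import unicodedata

def normalise(value):
    # Assumed normalisation rules: Unicode NFKC, trimmed whitespace, case-folded.
    return unicodedata.normalize("NFKC", value).strip().casefold()

def hash_normalised(value):
    return hashlib.sha256(normalise(value).encode("utf-8")).hexdigest()

# Without normalisation, "Izzy " and "izzy" hash to different values.
raw_match = (hashlib.sha256("Izzy ".encode()).hexdigest()
             == hashlib.sha256("izzy".encode()).hexdigest())

# With normalisation, both values hash identically.
normalised_match = hash_normalised("Izzy ") == hash_normalised("izzy")

print(raw_match, normalised_match)  # False True
```

Both parties would need to agree on the same normalisation rules in advance for their hashes to be comparable.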

In [6]:
# Read the CSV from Cloud Storage into a DataFrame
# (pandas reads gs:// paths through the gcsfs package; this client is not used by read_csv)
client = storage.Client()

bucket_name = "business_data_synthetic99"
blob_name = "business_data.csv"

df = pd.read_csv(f"gs://{bucket_name}/{blob_name}")
df.head()


Unnamed: 0,business_name,industry,data_change,original_creation_date,business_id
0,Business 1,Technology,Yes,2023-01-01,1234567890123456
1,Business 2,Finance,No,2023-02-01,1234567890123457
2,Business 3,Healthcare,Yes,2023-03-01,1234567890123458
3,Business 4,Retail,No,2023-04-01,1234567890123459
4,Business 5,Education,Yes,2023-05-01,1234567890123460


## Cryptographic Hashing - different ids for different customers
Cryptographic hashing creates a fixed-size string (hash) for each record, so matching records can be identified by comparing hashes rather than the actual data. The basic method has two steps:
1. Create a hashed set for each dataset by applying a cryptographic hash function to the elements.
2. Find common hashes between the datasets by intersecting the hashed sets.

Adding a project- or customer-specific salt before hashing changes the result: the same record receives a different id in each project, so ids cannot be matched across projects by comparing them directly.

Both techniques support privacy-preserving record linkage by identifying common elements between datasets without exchanging the raw values. Real-world applications would need more sophisticated and secure methods to handle larger datasets and mitigate privacy risks.


In [7]:

def generate_hash_key(project_name, input_data):
    """
    Generates a hash key for secure record linkage.

    Parameters:
    project_name (str): The name of the project.
    input_data (str): The input data to be hashed.

    Returns:
    str: The generated hash key.
    """
    salt = project_name.encode()
    input_data_with_salt = input_data.encode() + salt
    hash_key = hashlib.sha256(input_data_with_salt).hexdigest()
    return hash_key

# Example usage
project_name = "MyProject"
input_data = "secret_data"

hash_key = generate_hash_key(project_name, input_data)
print(hash_key)


54545415ad4742d84fa4acf087c9ce0a97c8b170b88250c816a7c79f9e3154d7
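
A project name is usually not secret, so anyone who knows it and can guess the input could recompute a salted hash. One common alternative is a keyed hash (HMAC) with a secret held by the data owner. The sketch below assumes such a secret exists. `SECRET_KEY` and `generate_keyed_id` are hypothetical names, and the key value is a placeholder.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be randomly generated and stored securely.
SECRET_KEY = b"replace-with-a-randomly-generated-secret"

def generate_keyed_id(project_name, input_data, secret_key=SECRET_KEY):
    # Derive a per-project key from the secret, then MAC the identifier with it.
    project_key = hmac.new(secret_key, project_name.encode(), hashlib.sha256).digest()
    return hmac.new(project_key, input_data.encode(), hashlib.sha256).hexdigest()

id_a = generate_keyed_id("ProjectA", "1234567890123456")
id_b = generate_keyed_id("ProjectB", "1234567890123456")

print(id_a != id_b)  # the same record gets a different id in each project
print(id_a == generate_keyed_id("ProjectA", "1234567890123456"))  # stable within a project
```

Without the secret key, knowing the project name and guessing the input is not enough to reproduce an id.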


In [8]:

# Create the first dataframe, with ids salted for customer 1
df_customer1 = df.copy()
df_customer1['business_id'] = df_customer1['business_id'].apply(
    lambda x: generate_hash_key('salt_customer1', str(x)))

# Create the second dataframe, with ids salted for customer 2
df_customer2 = df.copy()
df_customer2['business_id'] = df_customer2['business_id'].apply(
    lambda x: generate_hash_key('salt_customer2', str(x)))


In [9]:
df_customer1

Unnamed: 0,business_name,industry,data_change,original_creation_date,business_id
0,Business 1,Technology,Yes,2023-01-01,32488c724473b54b0b1cac22766670c06e3126c86c4398...
1,Business 2,Finance,No,2023-02-01,382d7039f5bf377fb07b55cac8b7e8da2f849c496923d4...
2,Business 3,Healthcare,Yes,2023-03-01,bd8cbf6227c0aeed6a804eca05959ff44efb37c02e5cac...
3,Business 4,Retail,No,2023-04-01,57e837fb8ccbdfde4f3a2f6dc38d40c5bd20723b30bbe8...
4,Business 5,Education,Yes,2023-05-01,a7129a8ddcb13e6cc10f326e557ca8b9b6f0cbe54c49d4...
5,Business 6,Manufacturing,No,2023-06-01,579239b07af008c987ce2b54d4d87f2ada9e235382b9a1...
6,Business 7,Construction,Yes,2023-07-01,2afcb51da27dcaf041837d00553aebdca60d9b10a881a5...
7,Business 8,Hospitality,No,2023-08-01,8c4520975c559e33dd652ad2fec70ff7ac07710cdd2acc...
8,Business 9,Transportation,Yes,2023-09-01,d6f6527b893507144cd38862e6a7592c0e15a877d684cc...
9,Business 10,Utilities,No,2023-10-01,92ade10db845a857752c3ac0e91a59ebc8827de69af65b...


In [10]:
df_customer2

Unnamed: 0,business_name,industry,data_change,original_creation_date,business_id
0,Business 1,Technology,Yes,2023-01-01,be2ce3956f1c02aa517f5cd7b3c398a21396d954750ac0...
1,Business 2,Finance,No,2023-02-01,f3375fd0bdc50c7bf7a5d9e3ead6dae7b1958054d6ccf0...
2,Business 3,Healthcare,Yes,2023-03-01,12ad37a30ce8330c8c2666a3ee5c41a0530d69ae7f953f...
3,Business 4,Retail,No,2023-04-01,464089169b244ba4a5c1fc5504b078514e5899f1848322...
4,Business 5,Education,Yes,2023-05-01,2ae235709b62e65ff4bf3e9fcba2b36a7795bcf7c5993f...
5,Business 6,Manufacturing,No,2023-06-01,896aac8ebd375dd27dd80b6b30a2b653a34b1c5a7b2559...
6,Business 7,Construction,Yes,2023-07-01,5272d325734e2236b5a79cc7e70322af3515590e10f533...
7,Business 8,Hospitality,No,2023-08-01,460d2374210f8430ec69d7b2c813cea6ea8be251801cbc...
8,Business 9,Transportation,Yes,2023-09-01,a571b2e879671e7407cd302db031a0b85c213f25fd7bab...
9,Business 10,Utilities,No,2023-10-01,be5c5170c8e7fdbefe80b3d6387af5c2014e204ef8b150...
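
The two tables share no `business_id` values, so the customers cannot join their extracts on that column. Whoever holds the original ids and both salts can still rebuild the link. A minimal sketch, using the first three ids from the `df.head()` output and the same salt strings as the cells above (`salted_id` and `crosswalk` are illustrative names):

```python
import hashlib

def salted_id(business_id, salt):
    # Same construction as the cells above: SHA-256 over str(id) + salt.
    return hashlib.sha256((str(business_id) + salt).encode()).hexdigest()

original_ids = [1234567890123456, 1234567890123457, 1234567890123458]

ids_customer1 = {salted_id(i, "salt_customer1") for i in original_ids}
ids_customer2 = {salted_id(i, "salt_customer2") for i in original_ids}

# The customers cannot join their extracts: no identifier is shared.
print(ids_customer1 & ids_customer2)

# The data owner, who holds the original ids and both salts, can map one id to the other.
crosswalk = {
    salted_id(i, "salt_customer1"): salted_id(i, "salt_customer2")
    for i in original_ids
}
print(len(crosswalk))
```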


This notebook provides a basic illustration of using Bloom filters and cryptographic hashing for privacy-preserving record linkage. Real-world applications would need more sophisticated and secure methods, especially to handle larger datasets and to mitigate privacy risks such as false positives in Bloom filters and potential vulnerabilities in hash functions.

# Differential Privacy

Differential privacy protects the privacy of individuals while still allowing analysis of aggregate data. It provides a mathematical framework for quantifying the privacy guarantee of a data analysis algorithm.

The main idea is to add random noise to the output of a query or computation, so that the result is almost equally likely whether or not any one individual's record is in the dataset. The output therefore reveals very little about any specific individual.

With differential privacy, organisations can analyse data and share aggregated results while limiting what can be learned about individuals. This matters most where the data is sensitive, such as healthcare records or personal financial data.

Differential privacy offers a rigorous way to balance data utility against privacy protection. The privacy budget, usually written ε (epsilon), quantifies the privacy loss allowed for a given analysis: a smaller ε means more noise and stronger privacy, and a larger ε means less noise and more accurate results.
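
For reference, the standard formal statement can be written as follows. A randomised mechanism $M$ is $\varepsilon$-differentially private if, for every pair of datasets $D$ and $D'$ that differ in one record and every set of outputs $S$:

$$
\Pr[M(D) \in S] \le e^{\varepsilon} \, \Pr[M(D') \in S]
$$

The Laplace mechanism achieves this for a numeric query $f$ by adding noise scaled to the query's sensitivity $\Delta f$, which matches the `sensitivity / epsilon` scale used in this notebook's code:

$$
M(D) = f(D) + \mathrm{Lap}\!\left(\frac{\Delta f}{\varepsilon}\right),
\qquad
\Delta f = \max_{D, D'} \lvert f(D) - f(D') \rvert
$$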


In [6]:
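import numpy as np

# Original dataset
data = [1, 2, 3, 4, 5]

# Sensitivity: the most the released value can change when one record changes.
# Set to 1 here for illustration only.
sensitivity = 1

# Privacy parameter (smaller epsilon = more noise = stronger privacy)
epsilon = 0.5

# Add Laplace noise with scale sensitivity / epsilon to each value
noisy_data = [x + np.random.laplace(0, sensitivity / epsilon) for x in data]

noisy_data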


[-0.10203108810440686,
 1.0350905733006912,
 5.452744928487135,
 3.702675202209278,
 8.201822873853132]
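
Differential privacy is more commonly applied to an aggregate query than to each raw value. The sketch below adds Laplace noise to a counting query, whose sensitivity is 1 because adding or removing one record changes the count by at most 1. It uses only the standard library so that it runs without NumPy. The names `laplace_noise` and `noisy_count`, and the fixed seed, are assumptions for illustration.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two independent Exponential(1) draws is Laplace(0, 1).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def noisy_count(records, predicate, epsilon, rng):
    # A counting query changes by at most 1 when one record is added or removed.
    sensitivity = 1
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
data = [1, 2, 3, 4, 5]

# The true count of values >= 3 is 3; smaller epsilon gives noisier answers.
for epsilon in (0.1, 0.5, 5.0):
    print(epsilon, noisy_count(data, lambda x: x >= 3, epsilon, rng))
```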

In [7]:
from collections import Counter

import numpy as np

# Original dataset
data = ['A', 'B', 'C', 'D', 'E']

# Sensitivity of a count: adding or removing one record changes a count by at most 1
sensitivity = 1

# Privacy parameter
epsilon = 0.5

# Count the occurrences of each category in the dataset
category_counts = Counter(data)

# Add Laplace noise to the count of each category.
# Noisy counts can be negative; clamp or round them afterwards if needed.
noisy_category_counts = {
    category: count + np.random.laplace(0, sensitivity / epsilon)
    for category, count in category_counts.items()
}

noisy_category_counts


{'A': 0.9748259299708666,
 'B': 1.0502957833191793,
 'C': 1.4047421083368967,
 'D': -4.069501849148417,
 'E': -4.736197124500159}

# Homomorphic Encryption 


Note: this cell does not run here because the `seal` package is tricky to install, but I'm leaving it in as an idea.

In [2]:
import seal

# Create a SEAL context
context = seal.EncryptionParameters(seal.scheme_type.BFV)
context.set_poly_modulus_degree(4096)
context.set_coeff_modulus(seal.CoeffModulus.BFVDefault(4096))
context.set_plain_modulus(1 << 8)

# Generate keys
keygen = seal.KeyGenerator(context)
public_key = keygen.public_key()
secret_key = keygen.secret_key()

# Encrypt data
encryptor = seal.Encryptor(context, public_key)
encrypted_data = seal.Ciphertext()
encryptor.encrypt(seal.Plaintext("42"), encrypted_data)

# Perform computation on encrypted data
evaluator = seal.Evaluator(context)
encrypted_result = seal.Ciphertext()
evaluator.square(encrypted_data, encrypted_result)

# Decrypt the result
decryptor = seal.Decryptor(context, secret_key)
plain_result = seal.Plaintext()
decryptor.decrypt(encrypted_result, plain_result)

# Print the decrypted result
print(plain_result.to_string())


ModuleNotFoundError: No module named 'seal'
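
As a dependency-free illustration of the same idea (computing on encrypted data), here is a toy sketch of the Paillier cryptosystem in plain Python. This is a different scheme from the BFV scheme in the SEAL cell: Paillier supports adding encrypted values, not squaring them, so the example adds 20 and 22 instead. The primes are deliberately small known Mersenne primes and give no real security. All function names are illustrative.

```python
import math
import secrets

# Toy parameters: two known Mersenne primes. Far too small for real security.
P = 2**31 - 1
Q = 2**61 - 1

def generate_keys(p=P, q=Q):
    n = p * q
    phi = (p - 1) * (q - 1)
    mu = pow(phi, -1, n)   # modular inverse; requires Python 3.8+
    return n, (phi, mu)    # public key n (with g = n + 1), private key (phi, mu)

def encrypt(n, message):
    n_sq = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    # With g = n + 1, g^m mod n^2 simplifies to 1 + m*n.
    return ((1 + message * n) * pow(r, n, n_sq)) % n_sq

def decrypt(n, private_key, ciphertext):
    phi, mu = private_key
    x = pow(ciphertext, phi, n * n)
    return ((x - 1) // n * mu) % n

def add_encrypted(n, c1, c2):
    # Multiplying ciphertexts adds the underlying plaintexts (mod n).
    return (c1 * c2) % (n * n)

n, private_key = generate_keys()
c_sum = add_encrypted(n, encrypt(n, 20), encrypt(n, 22))
print(decrypt(n, private_key, c_sum))  # 42
```

The party performing `add_encrypted` needs only the public value `n` and never sees 20, 22, or 42 in the clear.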