T3 Sprint

PTFI Suspect with HE


This notebook demonstrates record matching across customer and suspect datasets using three techniques: plain-text matching on ID, probabilistic (fuzzy) matching on names, and weighted probabilistic matching over encrypted fields. The sections below explain the code and its components.

Key Components
Libraries and Setup:

Faker: Used to generate realistic synthetic customer and suspect data (e.g., names, dates of birth, addresses, IDs).
FuzzyWuzzy: Provides a similarity ratio between strings for probabilistic matching.
Pyfhel: Implements Homomorphic Encryption (HE) for securely processing and comparing encrypted fields.

Homomorphic Encryption Initialization:

The Pyfhel library is initialized with the BFV scheme for integer encryption. This enables secure, privacy-preserving matching by encrypting the data before performing comparisons.
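
The plaintext side of this setup can be sketched without Pyfhel. Before encrypting a field, the notebook reduces its SHA-256 digest modulo the plaintext modulus t; the helper name `field_to_int` below is hypothetical and only mirrors that step. It is a minimal sketch showing that identical fields map to the same integer and that every value fits in the range [0, t).

```python
import hashlib

T = 65537  # plaintext modulus passed to contextGen in this notebook

def field_to_int(field, t=T):
    """Hash a field with SHA-256 and reduce the digest modulo t (hypothetical helper)."""
    digest = hashlib.sha256(str(field).encode()).hexdigest()
    return int(digest, 16) % t

same = field_to_int("1990-01-31") - field_to_int("1990-01-31")
print(same)  # 0: identical fields give a zero difference
print(0 <= field_to_int("Jane Doe") < T)  # True: every value fits the plaintext space
```

Because only 65537 distinct values exist, two unrelated fields can occasionally collide, so a zero difference indicates equality with high probability rather than with certainty.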

In [None]:
!pip install numpy pandas
!pip install faker
!pip install Pyfhel
!pip install fuzzywuzzy
# Import libraries
import pandas as pd
import numpy as np
import hashlib
from faker import Faker
from fuzzywuzzy import fuzz
from Pyfhel import Pyfhel, PyCtxt


In [None]:
# Setup faker for data generation
fake = Faker()

# Initialize Pyfhel for homomorphic encryption
HE = Pyfhel()
HE.contextGen(scheme='bfv', n=2**14, t=65537)
HE.keyGen()

In [None]:
# Function to hash and encrypt a string
def encrypt_field(field):
    """
    Hashes the input field, converts it to an integer, and encrypts it.
    """
    # Ensure the field is a string before hashing
    field_str = str(field)
    field_hash = int(hashlib.sha256(field_str.encode()).hexdigest(), 16) % HE.t
    # Convert field_hash to a NumPy array of type int64 before encryption
    return HE.encryptInt(np.array([field_hash], dtype=np.int64))

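# Function to calculate similarity between two encrypted fields
def encrypted_similarity(enc_field1, enc_field2):
    """
    Score two encrypted fields from the decrypted difference of their hashes.
    Returns 1 when the hashes are equal, otherwise 1 - |difference| / t.
    Note: SHA-256 does not preserve string similarity, so a score below 1
    reflects hash distance only, not how alike the original fields are.
    """
    diff = enc_field1 - enc_field2  # Homomorphic subtraction on the ciphertexts
    decrypted_diff = abs(HE.decryptInt(diff))  # Decryption requires the secret key
    # Check if all elements in the array are 0
    return 1 if (decrypted_diff == 0).all() else max(0, 1 - decrypted_diff[0] / HE.t)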

# Enhanced Probabilistic Matching Across Multiple Fields
def probabilistic_match_encrypted(customer, suspect, weights, threshold=0.85):
    """
    Perform weighted probabilistic matching on encrypted fields.
    Args:
        customer, suspect: Dictionary-like objects with fields to compare.
        weights: A dictionary specifying the weights for each field.
        threshold: Similarity threshold for a match.
    Returns:
        Boolean indicating whether the records match.
    """
    similarities = []
    for field, weight in weights.items():
        enc_customer = encrypt_field(customer[field])  # Encrypt customer field
        enc_suspect = encrypt_field(suspect[field])    # Encrypt suspect field
        similarity = encrypted_similarity(enc_customer, enc_suspect)
        similarities.append(similarity * weight)

    # Calculate weighted similarity score
    total_similarity = sum(similarities) / sum(weights.values())
    return total_similarity >= threshold


In [None]:
# Function to perform matching across all methods
def match_records(customer_data, suspect_data, weights, threshold=0.85):
    matches = {
        "plain": [],
        "probabilistic": [],
        "encrypted_probabilistic": []
    }

    for _, customer in customer_data.iterrows():
        for _, suspect in suspect_data.iterrows():
            # Plain Text Matching
            if customer["ID"] == suspect["ID"]:
                matches["plain"].append(f"{customer['Name']} matches {suspect['Name']} by ID")

            # Probabilistic Matching (Plaintext)
            name_similarity = fuzz.ratio(customer["Name"].lower(), suspect["Name"].lower())
            if name_similarity >= threshold * 100:  # Convert threshold to percentage
                matches["probabilistic"].append(f"{customer['Name']} probabilistically matches {suspect['Name']}")

            # Probabilistic Matching (Encrypted Fields)
            if probabilistic_match_encrypted(customer, suspect, weights, threshold):
                matches["encrypted_probabilistic"].append(f"{customer['Name']} matches {suspect['Name']} using encrypted probabilistic matching")

    return matches

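# Generate realistic data
def create_person_data(num_people, shared_ids=None):
    """Build a customer DataFrame; the first rows reuse the IDs in shared_ids."""
    # Copy the list so the caller's data is not mutated (and avoid a mutable default)
    remaining_ids = list(shared_ids) if shared_ids else []
    data = []
    for _ in range(num_people):
        # Use a shared ID while any remain, otherwise draw a random 7-digit ID
        person_id = remaining_ids.pop(0) if remaining_ids else np.random.randint(1000000, 9999999)
        person = {
            "Name": fake.name(),
            "DOB": fake.date_of_birth(minimum_age=16, maximum_age=120).strftime('%Y-%m-%d'),
            "ID": person_id,
            "Address": fake.address()
        }
        data.append(person)
    return pd.DataFrame(data)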

def create_suspect_data(num_suspects, shared_ids):
    data = []
    for i in range(num_suspects):
        person_id = shared_ids[i] if i < len(shared_ids) else np.random.randint(1000000, 9999999)
        person = {
            "Name": fake.name(),
            "DOB": fake.date_of_birth(minimum_age=16, maximum_age=120).strftime('%Y-%m-%d'),
            "ID": person_id,
            "Address": fake.address()
        }
        data.append(person)
    return pd.DataFrame(data)

# Generate datasets
num_people = 100
num_suspects = 10
shared_ids = [np.random.randint(1000000, 9999999) for _ in range(int(0.1 * num_suspects))]
customer_data = create_person_data(num_people, shared_ids.copy())
suspect_data = create_suspect_data(num_suspects, shared_ids)

# Define field weights and threshold
weights = {"Name": 0.5, "DOB": 0.3, "Address": 0.2}
threshold = 0.85

# Perform matching
matches = match_records(customer_data, suspect_data, weights, threshold)

# Print results
print("Plain Matches:")
print(matches["plain"])

print("\nProbabilistic Matches:")
print(matches["probabilistic"])

print("\nEncrypted Probabilistic Matches:")
print(matches["encrypted_probabilistic"])

Plain Matches:
['Henry Swanson matches Katelyn Marquez by ID']

Probabilistic Matches:
[]

Encrypted Probabilistic Matches:
['Christopher Kirby matches Katelyn Marquez using encrypted probabilistic matching', 'Christopher Kirby matches Courtney Coleman using encrypted probabilistic matching', 'Jasmine Rodriguez matches Katelyn Marquez using encrypted probabilistic matching', 'Jasmine Rodriguez matches Renee Duncan using encrypted probabilistic matching', 'Jasmine Rodriguez matches James Malone using encrypted probabilistic matching', 'Jasmine Rodriguez matches Matthew Briggs using encrypted probabilistic matching', 'Jasmine Rodriguez matches Ms. Nicole Cross using encrypted probabilistic matching', 'Jasmine Rodriguez matches Courtney Coleman using encrypted probabilistic matching', 'Krista Rodriguez matches Matthew Briggs using encrypted probabilistic matching', 'Krista Rodriguez matches Ms. Nicole Cross using encrypted probabilistic matching', 'Krista Rodriguez matches Amanda Baldwin 

Advantages
Privacy-Preserving Matching: Encrypted probabilistic matching ensures that sensitive fields (e.g., Name, DOB) remain private while enabling comparisons.

Flexibility: Probabilistic matching allows for fuzzy matches, making it robust to slight variations in data (e.g., typos or formatting differences).

Weighted Matching: Assigning weights to fields enables customized matching based on the importance of attributes.
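
The robustness to typos and formatting differences can be illustrated with a small sketch. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio`, so scores may differ slightly from FuzzyWuzzy's; the helper name `similarity_percent` and the sample names are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def similarity_percent(a, b):
    """Return a 0-100 similarity score for two strings, ignoring case (hypothetical helper)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

threshold = 0.85  # same threshold value the notebook uses

pairs = [
    ("John Smith", "Jon Smith"),     # one dropped letter
    ("John Smith", "JOHN SMITH"),    # case difference only
    ("John Smith", "Maria Garcia"),  # unrelated name
]
for left, right in pairs:
    score = similarity_percent(left, right)
    print(left, "|", right, "|", score, "| match:", score >= threshold * 100)
```

The first two pairs clear the threshold despite the typo and the case change, while the unrelated pair does not.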

Potential Use Cases

Fraud Detection: Securely matching customer and transaction records to detect anomalies or duplicate entries.

Suspect Identification: Privacy-preserving suspect matching for law enforcement or compliance checks.

Data Merging: Merging datasets from different sources while ensuring data privacy and accuracy.


Improvements Based on Current Advantages and Use Cases

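Enhanced Privacy-Preserving Techniques: Explore other homomorphic encryption schemes and encodings (BFV, used here, is itself generally classed as a Fully Homomorphic Encryption (FHE) scheme; CKKS, which Pyfhel also supports, handles approximate arithmetic on real numbers) to improve the accuracy and scalability of encrypted probabilistic matching.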
Optimize encryption and decryption workflows to handle larger datasets with complex attributes efficiently.

Improved Flexibility in Matching: Integrate more advanced fuzzy matching libraries, such as RapidFuzz, which offer higher performance and precision.
Incorporate additional similarity metrics (e.g., Jaro-Winkler or Levenshtein distance) for nuanced probabilistic matching, improving robustness to complex variations.
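
As a rough illustration of one such metric, the sketch below implements Levenshtein distance in pure Python and scales it to a 0 to 1 similarity. The function names and the normalization by the longer string are assumptions made for this example; a library such as RapidFuzz would normally be used instead.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def levenshtein_similarity(a, b):
    """Scale the distance to 0-1 by the longer length (one possible normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

print(levenshtein("kitten", "sitting"))  # 3
```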

Dynamic Weight Adjustment: Develop an adaptive weighting system that learns the relative importance of fields based on historical matching success rates or domain-specific requirements.
Incorporate machine learning models to dynamically assign weights to fields, improving matching outcomes for diverse datasets.
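
One possible starting point, sketched below, borrows the agreement weights used in Fellegi-Sunter style record linkage: a field earns more weight the more often it agrees on true matches relative to non-matches. This is not the notebook's method; the helper `learn_weights`, the smoothing constant, and the tiny labeled history are all hypothetical.

```python
import math

def learn_weights(labeled_pairs, fields, smoothing=1.0):
    """
    Estimate normalized field weights from labeled pairs (hypothetical helper).
    labeled_pairs: list of (agreements, is_match), where agreements maps field -> bool.
    """
    match_total = sum(1 for _, is_match in labeled_pairs if is_match)
    non_total = len(labeled_pairs) - match_total
    raw = {}
    for field in fields:
        match_agree = sum(1 for agree, is_match in labeled_pairs if is_match and agree[field])
        non_agree = sum(1 for agree, is_match in labeled_pairs if not is_match and agree[field])
        # m: P(field agrees | true match), u: P(field agrees | non-match), both smoothed
        m = (match_agree + smoothing) / (match_total + 2 * smoothing)
        u = (non_agree + smoothing) / (non_total + 2 * smoothing)
        raw[field] = max(0.0, math.log2(m / u))  # clip fields that carry no evidence
    total = sum(raw.values())
    if total == 0:
        return {field: 1 / len(fields) for field in fields}
    return {field: value / total for field, value in raw.items()}

# Made-up review history: which fields agreed, and whether the pair was a true match
history = [
    ({"Name": True, "DOB": True, "Address": True}, True),
    ({"Name": True, "DOB": True, "Address": False}, True),
    ({"Name": False, "DOB": True, "Address": False}, False),
    ({"Name": False, "DOB": False, "Address": True}, False),
]
learned = learn_weights(history, ["Name", "DOB", "Address"])
print(learned)
```

The resulting dictionary sums to 1 and has the same shape as the `weights` argument used in the notebook, so it could be passed in place of the hand-set values.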

Real-Time Fraud Detection: Implement streaming analytics to enable real-time matching and anomaly detection in high-velocity transaction data.
Integrate fraud detection workflows into financial platforms for real-time decision-making while preserving privacy.

Scalable Suspect Identification: Expand the suspect matching framework to support larger and more complex datasets, including multi-field relationships (e.g., address hierarchies, location-based matching).
Incorporate additional data sources, such as social network analysis, for enhanced suspect identification.

Automated Data Integration: Develop automated pipelines for merging and reconciling datasets from multiple sources, using probabilistic matching to identify and resolve conflicts.
Integrate ontology-based matching to account for semantic differences between datasets (e.g., differing field names or formats).

Explainable Matching Results: Introduce interpretability features to explain how matches were determined, helping users understand why certain records were flagged.
Provide confidence scores for each match to assist in decision-making and prioritize further investigation.
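
A minimal sketch of what such a result could look like follows. It applies the same weighted average the notebook uses but returns the score and each field's contribution instead of a bare boolean. The helper `explain_match` and the sample similarity values are hypothetical.

```python
def explain_match(similarities, weights, threshold=0.85):
    """
    Return the weighted score with a per-field breakdown (hypothetical helper).
    similarities: field -> similarity in [0, 1]; weights: field -> weight.
    """
    total_weight = sum(weights.values())
    contributions = {
        field: similarities[field] * weight / total_weight
        for field, weight in weights.items()
    }
    score = sum(contributions.values())
    return {
        "score": round(score, 4),
        "is_match": score >= threshold,
        "contributions": {f: round(c, 4) for f, c in contributions.items()},
    }

weights = {"Name": 0.5, "DOB": 0.3, "Address": 0.2}  # weights from the notebook
report = explain_match({"Name": 0.95, "DOB": 1.0, "Address": 0.4}, weights)
print(report)
```

The `contributions` entry shows which fields drove the decision, and `score` can serve as the confidence value used to prioritize further investigation.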

Cloud-Based Deployment: Deploy the solution on scalable cloud platforms (e.g., AWS, GCP, Azure) to handle larger datasets and provide seamless integration for businesses.
Incorporate secure APIs for external systems to use privacy-preserving matching as a service.