# Multi-method Pseudonymization

This notebook demonstrates several ways to pseudonymize sensitive data:

1. **Tokenization** – Replace identifiers with random tokens, managed in dictionaries (mapping tables).
2. **Hashing** – Use a one-way cryptographic hash function (SHA-256).
3. **Encryption** – Use symmetric encryption (Fernet) for reversible pseudonymization.
4. **Hybrid** – Combine partial hashing and random tokens for custom use cases.

We'll explore the trade-offs in re-identification risk, key management, and data utility.

## Step 1: Install Necessary Libraries (if needed)

Make sure you have the following installed:
- [pandas](https://pandas.pydata.org/): For handling tabular data
- [cryptography](https://cryptography.io/en/latest/): For encryption (Fernet)

```bash
pip install pandas cryptography
```

<hr>

In [None]:
#pip install pandas cryptography

## Step 2: Imports & Setup

We'll import the libraries we need, then create a simple dataset to work with.
<hr>

In [None]:
import pandas as pd
import random
import string
import hashlib
from cryptography.fernet import Fernet

# Create a small synthetic dataset
data = {
    'Name': [
        'Alice Johnson',
        'Bob Smith',
        'Charlie Li',
        'Alice Johnson',
        'Dana White'
    ],
    'Email': [
        'alice.j@example.com',
        'bob.s@example.com',
        'charlie.l@example.com',
        'alice.j@example.com',
        'dana.w@example.com'
    ],
    'Phone': [
        '123-456-7890',
        '555-123-4567',
        '987-654-3210',
        '123-456-7890',
        '555-987-6543'
    ],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Houston'],
    'Age': [29, 34, 22, 29, 45]
}

df = pd.DataFrame(data)
print('Original DataFrame:')
print(df)


## Step 3: Tokenization

### Description
- We assign random tokens to each unique value in a column.
- A dictionary (mapping table) tracks which real values map to which tokens.
- If the same real value appears multiple times, it maps to the same token.

**Pros**:
- Can still recognize repeated occurrences of the same entity.
- Re-identification is possible only if you have the mapping dictionary.

**Cons**:
- Completely reversible if the mapping (token dictionary) is compromised.

<hr>

In [None]:
# 3) TOKENIZATION

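# Helper function to generate random alphanumeric tokens.
# `secrets` draws from the OS CSPRNG; the `random` module is predictable
# and should not be used for tokens that protect identifiers.
import secrets

def generate_random_token(length=10):
    chars = string.ascii_letters + string.digits
    return ''.join(secrets.choice(chars) for _ in range(length))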

# Dictionaries for storing the real-value-to-token mappings
name_token_map = {}
email_token_map = {}
phone_token_map = {}

def get_token(mapping_dict, real_value):
    """Return the token for real_value, generating one if needed."""
    if real_value not in mapping_dict:
        mapping_dict[real_value] = generate_random_token()
    return mapping_dict[real_value]

df['Name_TOKEN'] = df['Name'].apply(lambda x: get_token(name_token_map, x))
df['Email_TOKEN'] = df['Email'].apply(lambda x: get_token(email_token_map, x))
df['Phone_TOKEN'] = df['Phone'].apply(lambda x: get_token(phone_token_map, x))

print('\n--- After Tokenization ---')
print(df[['Name', 'Name_TOKEN', 'Email', 'Email_TOKEN', 'Phone', 'Phone_TOKEN', 'City', 'Age']])
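
Re-identification with a mapping table is a reverse lookup. The following is a minimal sketch, not part of the original notebook: `build_reverse_map` is a hypothetical helper, the sample names are reused from the dataset, and the token generator is restated (using `secrets`) so the block runs on its own.

```python
import secrets
import string

def generate_random_token(length=10):
    """Return a random alphanumeric token drawn from the OS CSPRNG."""
    chars = string.ascii_letters + string.digits
    return ''.join(secrets.choice(chars) for _ in range(length))

def get_token(mapping_dict, real_value):
    """Return the token for real_value, generating one if needed."""
    if real_value not in mapping_dict:
        mapping_dict[real_value] = generate_random_token()
    return mapping_dict[real_value]

def build_reverse_map(mapping_dict):
    """Invert a real-value -> token map (hypothetical helper)."""
    reverse = {token: real for real, token in mapping_dict.items()}
    # Two real values sharing one token would make reversal ambiguous.
    if len(reverse) != len(mapping_dict):
        raise ValueError("token collision: two values share a token")
    return reverse

demo_map = {}
names = ['Alice Johnson', 'Bob Smith', 'Alice Johnson']
tokens = [get_token(demo_map, n) for n in names]

reverse_map = build_reverse_map(demo_map)
restored = [reverse_map[t] for t in tokens]
print(restored)
```

Anyone holding `demo_map` can perform the same reversal, which is why the mapping needs at least the same protection as the original data.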


## Step 4: Hashing (SHA-256)

### Description
- Convert each identifier into a **one-way** hash.
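- Here, we prepend a salt (a constant string) to reduce the effectiveness of precomputed dictionary or rainbow table attacks. Because the salt is a single hard-coded constant, anyone who learns it can rebuild those tables.
- Even with a salt, if the underlying value space is small (e.g., phone numbers), brute force is possible.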

**Pros**:
- Irreversible (assuming large, unpredictable inputs).
- No key to manage: nobody, including you, can recover the original value except by guessing it and comparing hashes.

**Cons**:
- If an attacker can guess or brute force the original values, they can confirm matches.
- Does not allow re-identification for legitimate business needs (like emailing the user).

<hr>

In [None]:
# 4) HASHING

def hash_value(value, salt='RANDOM_SALT_123'):
    """Return the SHA-256 hash of (salt + value)."""
    to_hash = (salt + value).encode('utf-8')
    return hashlib.sha256(to_hash).hexdigest()

df['Name_HASH'] = df['Name'].apply(hash_value)
df['Email_HASH'] = df['Email'].apply(hash_value)
df['Phone_HASH'] = df['Phone'].apply(hash_value)

print('\n--- After Hashing ---')
print(df[['Name', 'Name_HASH', 'Email', 'Email_HASH', 'Phone', 'Phone_HASH', 'City', 'Age']])
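
To make the brute-force concern concrete, here is a small sketch of how a salted hash of a phone number could be reversed. It assumes the attacker knows the salt and the first six digits. `brute_force_phone` is a hypothetical helper, and the search is limited to the last four digits to keep it fast.

```python
import hashlib

def hash_value(value, salt='RANDOM_SALT_123'):
    """Return the SHA-256 hash of (salt + value), as in the cell above."""
    return hashlib.sha256((salt + value).encode('utf-8')).hexdigest()

def brute_force_phone(target_hash, prefix, salt='RANDOM_SALT_123'):
    """Try every 4-digit ending after a known prefix (hypothetical helper)."""
    for n in range(10_000):
        candidate = f"{prefix}{n:04d}"
        if hash_value(candidate, salt) == target_hash:
            return candidate
    return None

# Assumption: the attacker has obtained this hash and knows the salt.
leaked_hash = hash_value('555-123-4567')
recovered = brute_force_phone(leaked_hash, prefix='555-123-')
print(recovered)  # 555-123-4567
```

Dropping the known-prefix assumption raises the search space to ten billion ten-digit candidates. That is more work, but likely still within reach of commodity hardware for a fast hash like SHA-256.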


## Step 5: Encryption (Fernet)

### Description
- Use a **symmetric** encryption key to encrypt identifiers.
- Fernet generates a random IV (and embeds a timestamp) on every call, so encrypting the same plaintext twice produces different ciphertexts.
- Fernet has no deterministic mode. If you need the same input to always yield the same output (e.g., to join records), use a separate approach alongside it, such as a keyed hash of the value.

**Pros**:
- *Reversible* if you have the key, so you can restore original values.
- Good for data that occasionally needs to be re-identified (e.g., customer service).

**Cons**:
- Requires **key management** (if the key is exposed, data is compromised).
- If deterministic encryption is desired for matching records, additional steps are needed.
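
One possible "additional step" for matching records is to store a keyed hash next to each ciphertext and join on that. The sketch below uses HMAC-SHA256 from the standard library. `blind_index` and `index_key` are hypothetical names, the normalization rule is an assumption, and the key is generated in place only for the demo.

```python
import hashlib
import hmac
import secrets

# Assumption: in practice this key comes from secure storage and is kept
# separate from the encryption key; it is generated here only for the demo.
index_key = secrets.token_bytes(32)

def blind_index(value, key=index_key):
    """Return a deterministic HMAC-SHA256 tag for value (hypothetical helper)."""
    # Normalize so that trivially different spellings match (an assumption).
    normalized = value.strip().lower()
    return hmac.new(key, normalized.encode('utf-8'), hashlib.sha256).hexdigest()

emails = ['alice.j@example.com', 'bob.s@example.com', 'alice.j@example.com']
tags = [blind_index(e) for e in emails]

# Equal inputs produce equal tags, so rows can be joined on the tag column
# while the reversible ciphertext stays non-deterministic.
print(tags[0] == tags[2], tags[0] == tags[1])  # True False
```

The tag is not reversible on its own, but it does reveal which rows share a value, so it trades some privacy for linkability.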

<hr>

In [None]:
# 5) ENCRYPTION

# Generate or load a key
key = Fernet.generate_key()
fernet = Fernet(key)

def encrypt_value(value):
    """Encrypt a string using Fernet."""
    token = fernet.encrypt(value.encode('utf-8'))
    return token.decode('utf-8')

df['Name_ENCRYPT'] = df['Name'].apply(encrypt_value)
df['Email_ENCRYPT'] = df['Email'].apply(encrypt_value)
df['Phone_ENCRYPT'] = df['Phone'].apply(encrypt_value)

print('\n--- After Encryption ---')
print(df[['Name', 'Name_ENCRYPT', 'Email', 'Email_ENCRYPT', 'Phone', 'Phone_ENCRYPT', 'City', 'Age']])

# For demonstration only: never print or log the key in a real environment.
print(f"\nEncryption Key (store securely!): {key.decode('utf-8')}")
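
The cell above only encrypts. A minimal round-trip sketch follows, showing that two encryptions of the same value differ yet both decrypt to the original, and that a different key is rejected. `decrypt_value`, `demo_key`, and `demo_fernet` are hypothetical names introduced here; the key is generated in place purely for illustration.

```python
from cryptography.fernet import Fernet, InvalidToken

demo_key = Fernet.generate_key()  # demo only; load a stored key in practice
demo_fernet = Fernet(demo_key)

def encrypt_value(value):
    """Encrypt a string using Fernet and return the token as text."""
    return demo_fernet.encrypt(value.encode('utf-8')).decode('utf-8')

def decrypt_value(token):
    """Decrypt a Fernet token back to the original string (hypothetical helper)."""
    return demo_fernet.decrypt(token.encode('utf-8')).decode('utf-8')

first = encrypt_value('alice.j@example.com')
second = encrypt_value('alice.j@example.com')

# Same plaintext, different ciphertexts, identical result after decryption.
print(first != second)
print(decrypt_value(first) == decrypt_value(second))

# A token cannot be decrypted with a different key.
other_fernet = Fernet(Fernet.generate_key())
try:
    other_fernet.decrypt(first.encode('utf-8'))
    wrong_key_rejected = False
except InvalidToken:
    wrong_key_rejected = True
print(wrong_key_rejected)
```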

## Step 6: Hybrid Approach

### Description
- Combine partial hashing, partial tokenization, or partial encryption.
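- Example: **Hash** the first half of the string, keep the second half in plaintext, then append a random suffix.
- This can preserve partial readability (like the trailing digits of a phone number) while still limiting re-identification. Note that the random suffix makes the output differ on every call, so repeated values no longer map to the same pseudonym, and the plaintext half may itself be identifying (for `Alice Johnson`, it is the surname).

There are many possible variations. The point is to *balance* data utility and privacy.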

<hr>

In [None]:
# 6) HYBRID APPROACH

def hybrid_pseudonymize(value):
    """
    Example: Hash the first half of the string, keep second half as is, append random token.
    """
    mid = len(value) // 2
    part_to_hash = value[:mid]
    part_remaining = value[mid:]

    hashed_part = hash_value(part_to_hash)[0:10]  # Shorten the hash for demonstration
    random_suffix = generate_random_token(length=5)

    # Combine hashed half, the un-hashed half, and a random suffix
    return f"{hashed_part}_{part_remaining}_{random_suffix}"

df['Name_HYBRID'] = df['Name'].apply(hybrid_pseudonymize)
df['Email_HYBRID'] = df['Email'].apply(hybrid_pseudonymize)
df['Phone_HYBRID'] = df['Phone'].apply(hybrid_pseudonymize)

print('\n--- After Hybrid Approach ---')
print(df[['Name', 'Name_HYBRID', 'Email', 'Email_HYBRID', 'Phone', 'Phone_HYBRID', 'City', 'Age']])
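
As one of the many possible variations, here is a sketch of a deterministic hybrid for phone numbers. It keeps the last four digits readable and replaces the rest with a keyed tag, so repeated numbers still match. `mask_phone` and `DEMO_KEY` are hypothetical, and the choice of HMAC-SHA256 and a 10-character tag is an assumption for illustration.

```python
import hashlib
import hmac

# Assumption: a secret key loaded from secure storage; hard-coded for the demo only.
DEMO_KEY = b'demo-key-do-not-use-in-production'

def mask_phone(phone, visible=4, key=DEMO_KEY):
    """Keep the last `visible` digits; replace the rest with a keyed tag."""
    digits = ''.join(ch for ch in phone if ch.isdigit())
    if not 0 < visible < len(digits):
        raise ValueError("visible must be between 1 and len(digits) - 1")
    # Tag the full number so that only identical numbers share a pseudonym.
    tag = hmac.new(key, digits.encode('utf-8'), hashlib.sha256).hexdigest()[:10]
    return f"{tag}_{digits[-visible:]}"

a = mask_phone('123-456-7890')
b = mask_phone('123-456-7890')
c = mask_phone('555-987-7890')
print(a == b, a == c)  # True False
```

Unlike the random-suffix version, this output is stable across calls, which helps with joins but also means equal values are visibly equal.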


## Step 7: Consolidated View

We can look at our DataFrame with **all** pseudonymized columns side by side.
<hr>

In [None]:
all_cols = [
    'Name', 'Name_TOKEN', 'Name_HASH', 'Name_ENCRYPT', 'Name_HYBRID',
    'Email', 'Email_TOKEN', 'Email_HASH', 'Email_ENCRYPT', 'Email_HYBRID',
    'Phone', 'Phone_TOKEN', 'Phone_HASH', 'Phone_ENCRYPT', 'Phone_HYBRID',
    'City', 'Age'
]

print('\n--- Full Comparison ---')
print(df[all_cols])


## Step 8: Mapping Tables & Key Management

For demonstration, let's print out our mapping dictionaries. In a real-world scenario:
- Store mappings in a separate **secure** data store.
- Limit read/write access to only the systems that need to re-identify (if applicable).
- **Encrypt** or otherwise protect these mappings at rest (e.g., with AWS KMS or HashiCorp Vault).

<hr>

In [None]:
print('\n--- TOKEN MAPPINGS ---')
print('Name Token Map:')
for real_name, token in name_token_map.items():
    print(f"  {real_name} -> {token}")

print('\nEmail Token Map:')
for real_email, token in email_token_map.items():
    print(f"  {real_email} -> {token}")

print('\nPhone Token Map:')
for real_phone, token in phone_token_map.items():
    print(f"  {real_phone} -> {token}")
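
Rather than printing the mappings, a real pipeline would persist them somewhere access-controlled. The sketch below is a simple stand-in that writes a mapping to a JSON file created with owner-only permissions. `save_mapping` and `load_mapping` are hypothetical helpers, the tokens are made-up sample values, and file permissions are no substitute for an encrypted store (the mode is also largely ignored on Windows).

```python
import json
import os
import tempfile

def save_mapping(mapping_dict, path):
    """Write a token map as JSON to a file created with mode 0o600."""
    # The mode applies when the file is created and is further limited by umask.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, 'w', encoding='utf-8') as fh:
        json.dump(mapping_dict, fh, indent=2)

def load_mapping(path):
    """Read a token map previously written by save_mapping."""
    with open(path, encoding='utf-8') as fh:
        return json.load(fh)

# Made-up sample tokens, not output from the notebook.
demo_map = {'Alice Johnson': 'a1B2c3D4e5', 'Bob Smith': 'Z9y8X7w6V5'}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'name_token_map.json')
    save_mapping(demo_map, path)
    restored = load_mapping(path)

print(restored == demo_map)  # True
```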


## Discussion

**Tokenization**:
- Useful if you need to track repeated occurrences of the same individual.
- Reversible by design (if the mapping dictionary is available).

**Hashing**:
- One-way: irreversible except by brute force or guessing.
- May be vulnerable for small/guessable input spaces.

**Encryption** (Fernet):
- Reversible if you have the key.
- Strong security but requires strict key management.

**Hybrid**:
- Highly customizable.
- Be careful not to reveal too much partial information.

In production:
- Use a **key management service or secrets manager** (e.g., AWS KMS, HashiCorp Vault) to store encryption keys.
- Restrict who can access token dictionaries or decryption services.
- Consider your threat model, regulatory environment, and data usage needs.
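
As a small illustration of keeping keys out of the notebook, the sketch below reads a Fernet-style key from an environment variable, which is one common way a secrets manager hands a key to a process. The variable name `PSEUDONYM_FERNET_KEY` and the helper `load_key_from_env` are hypothetical. The demo sets the variable itself only so the block runs on its own.

```python
import base64
import os

KEY_ENV_VAR = 'PSEUDONYM_FERNET_KEY'  # hypothetical variable name

def load_key_from_env(var_name=KEY_ENV_VAR):
    """Return the key from the environment, failing loudly if absent or malformed."""
    raw = os.environ.get(var_name)
    if not raw:
        raise RuntimeError(f"{var_name} is not set; refusing to generate an ad hoc key")
    # A Fernet key is 32 random bytes, URL-safe base64 encoded.
    if len(base64.urlsafe_b64decode(raw)) != 32:
        raise ValueError(f"{var_name} does not decode to 32 bytes")
    return raw.encode('ascii')

# Demo only: in practice the deployment environment sets this variable.
os.environ[KEY_ENV_VAR] = base64.urlsafe_b64encode(os.urandom(32)).decode('ascii')
key = load_key_from_env()
print(len(key))  # 44
```

Failing when the variable is missing, instead of silently generating a new key, avoids producing ciphertexts that nobody can later decrypt.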

This completes the multi-method pseudonymization example!