# Data Pseudonymization with PostgreSQL
This notebook demonstrates how to implement **pseudonymization** techniques in **PostgreSQL** to protect sensitive data while maintaining usability.

Pseudonymization replaces personally identifiable information (PII) with artificial identifiers, enhancing **data security** and **compliance with privacy regulations** such as GDPR and HIPAA.


## 1️⃣ Setting Up the Database
First, we need to establish a connection to PostgreSQL and ensure that our database is created.

In [None]:
from sqlalchemy import create_engine
import psycopg2

# PostgreSQL connection settings
PG_ADDR = 'localhost'  # Server address
PG_PORT = '5432'       # PostgreSQL port
PG_USER = 'postgres'   # Admin user
PG_PASW = 'secure_password'  # Placeholder; do not hard-code real credentials
PG_DBNA = 'privacy_db'  # Database name

# Connect to the default PostgreSQL database to check and create if needed
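default_engine = create_engine(
    f'postgresql://{PG_USER}:{PG_PASW}@{PG_ADDR}:{PG_PORT}/postgres',
    isolation_level='AUTOCOMMIT',  # CREATE DATABASE cannot run inside a transaction
)

with default_engine.connect() as conn:
    # exec_driver_sql hands the raw string to psycopg2, so %s placeholders work
    result = conn.exec_driver_sql("SELECT 1 FROM pg_database WHERE datname = %s;", (PG_DBNA,))
    exists = result.scalar()

    if not exists:
        # Identifiers cannot be bound as parameters; PG_DBNA is a trusted constant here
        conn.exec_driver_sql(f"CREATE DATABASE {PG_DBNA};")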
        print(f"Database '{PG_DBNA}' created successfully!")

# Create an engine for the target database (no connection is opened until first use)
engine = create_engine(f'postgresql://{PG_USER}:{PG_PASW}@{PG_ADDR}:{PG_PORT}/{PG_DBNA}')
print(f"Engine configured for database '{PG_DBNA}'.")
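The connection URL above is built by plain string interpolation, which breaks if the password contains characters such as `@` or `/`. A minimal sketch of one way to handle this, using only the standard library; `build_pg_url` and the `PG_PASSWORD` environment variable are hypothetical names introduced for illustration, not part of the original notebook:

```python
import os
from urllib.parse import quote

def build_pg_url(user: str, password: str, host: str, port: str, dbname: str) -> str:
    """Build a postgresql:// URL, percent-encoding the user name and password."""
    safe_user = quote(user, safe="")
    safe_pass = quote(password, safe="")
    return f"postgresql://{safe_user}:{safe_pass}@{host}:{port}/{dbname}"

# PG_PASSWORD is an assumed variable name; the fallback is the notebook's placeholder.
password = os.environ.get("PG_PASSWORD", "secure_password")
url = build_pg_url("postgres", password, "localhost", "5432", "privacy_db")

# Special characters are escaped so they cannot be mistaken for URL delimiters.
print(build_pg_url("postgres", "p@ss/word", "localhost", "5432", "privacy_db"))
```

The resulting string can be passed to `create_engine` in place of the f-string used above.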

## 2️⃣ Creating and Populating the Patients Table
Now, we will create a **patients** table containing sensitive data and insert some sample records.

In [None]:
with engine.begin() as conn:  # begin() commits on success and rolls back on error
    conn.exec_driver_sql('''DROP TABLE IF EXISTS patients;''')
    conn.exec_driver_sql('''
        CREATE TABLE patients (
            id SERIAL PRIMARY KEY,
            full_name VARCHAR(100),
            birth_date DATE,
            country VARCHAR(50),
            medical_condition VARCHAR(100),
            insurance_amount DECIMAL(10,2)
        );
    ''')
print("Table 'patients' created successfully!")

In [None]:
# Sample records, passed as bound parameters instead of inlined SQL literals
rows = [
    ('Lisa Crane', '1949-09-12', 'Estonia', 'Diabetes', 6299.37),
    ('James Miller', '1988-06-11', 'Syrian Arab Republic', 'Diabetes', 6947.08),
    ('Alyssa Scott', '1952-01-09', 'Barbados', 'Heart Condition', 4893.43),
    ('John Bennett', '1989-11-06', 'Guinea-Bissau', 'Kidney Failure', 4968.43),
    ('Jennifer Pearson', '2006-04-01', 'Trinidad and Tobago', 'Kidney Failure', 4032.72),
    ('Isaac Peterson', '1982-10-20', 'Israel', 'Diabetes', 1130.04),
    ('Christina Watkins', '1956-11-05', 'Cyprus', 'Heart Condition', 5324.68),
    ('Stanley Hodges', '1943-04-11', 'Ghana', 'Diabetes', 1692.54),
    ('William Chapman', '1965-03-03', 'Monaco', 'Kidney Failure', 3411.38),
    ('Stacey Arroyo', '1993-04-27', 'Svalbard & Jan Mayen Islands', 'Heart Condition', 1977.34),
]
insert_sql = (
    "INSERT INTO patients (full_name, birth_date, country, medical_condition, insurance_amount) "
    "VALUES (%s, %s, %s, %s, %s);"
)
with engine.begin() as conn:
    conn.exec_driver_sql(insert_sql, rows)  # a list of tuples runs as executemany
print("Sample patient data inserted successfully!")

## 3️⃣ Implementing Pseudonymization
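We will add a **pseudonym** column derived by hashing each patient's **full name** and birth date, leaving the other fields unchanged. The original `full_name` column stays in the table, so downstream queries should select only the pseudonymized columns.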

In [None]:
with engine.begin() as conn:
    conn.exec_driver_sql("ALTER TABLE patients ADD COLUMN IF NOT EXISTS pseudonym VARCHAR(64);")
    # md5() is unkeyed: anyone who knows a name and birth date can recompute the pseudonym
    conn.exec_driver_sql("UPDATE patients SET pseudonym = md5(full_name || birth_date::TEXT);")
print("Pseudonymization applied successfully!")
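An unkeyed hash of name plus birth date can be reversed by anyone able to enumerate plausible name and date pairs. One common hardening is a keyed hash (HMAC). The sketch below is not the notebook's method; it swaps MD5 for HMAC-SHA256 computed in Python, and `SECRET_KEY` and `pseudonymize` are hypothetical names with a placeholder key:

```python
import hashlib
import hmac

# Hypothetical placeholder key; in practice load it from a secrets manager.
SECRET_KEY = b"replace-with-a-long-random-key"

def pseudonymize(full_name: str, birth_date: str, key: bytes = SECRET_KEY) -> str:
    """Return a 64-character hex pseudonym from HMAC-SHA256 over name and birth date."""
    # The separator avoids ambiguity between inputs such as ("ab", "c") and ("a", "bc").
    message = f"{full_name}|{birth_date}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()

print(pseudonymize("Lisa Crane", "1949-09-12"))
```

The 64 hex characters produced by SHA-256 fit the `VARCHAR(64)` pseudonym column. Without the key, the pseudonym cannot be recomputed from a guessed name and birth date.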

## 4️⃣ Verifying Pseudonymized Data
Let's check how the pseudonymized patient data looks.

In [None]:
import pandas as pd

with engine.connect() as conn:
    df = pd.read_sql("SELECT id, pseudonym, birth_date, country, medical_condition FROM patients;", conn)
print(df.head())
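As a sanity check, the stored pseudonym can be recomputed outside the database. The sketch below mirrors `md5(full_name || birth_date::TEXT)` in Python, assuming the database uses UTF-8 encoding and the default ISO `DateStyle` (so the date casts to `YYYY-MM-DD`); `md5_pseudonym` is a hypothetical helper name:

```python
import hashlib

def md5_pseudonym(full_name: str, birth_date: str) -> str:
    """Mirror md5(full_name || birth_date::TEXT) for an ISO-formatted date string."""
    return hashlib.md5((full_name + birth_date).encode("utf-8")).hexdigest()

# Under the assumptions above, this should match the pseudonym stored for Lisa Crane.
print(md5_pseudonym("Lisa Crane", "1949-09-12"))
```

The same check also illustrates the limitation of an unkeyed hash: anyone who knows a patient's name and birth date can compute the pseudonym and look up the matching row.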

## 5️⃣ Key Takeaways and Insights
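- **Pseudonymization reduces exposure of direct identifiers** while preserving analytical usability.
- **Full names are hidden behind hashed pseudonyms** in query results, but the `full_name` column still exists in the table and must be dropped or access-restricted before the data is shared.
- **Unkeyed MD5 is a weak pseudonym**: anyone who can guess a name and birth date can recompute the hash, so a keyed hash or encryption is preferable in practice.
- **This method supports, but does not by itself achieve, compliance** with privacy laws like GDPR and HIPAA; under GDPR, pseudonymized data is still personal data.
- **Pseudonymized data can still be used** for statistical analysis and research.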

🚀 Next Steps:
- Implement **reversible pseudonymization** using encryption keys.
- Explore **differential privacy** techniques for stronger anonymization.
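
To give a feel for the differential privacy step, here is a minimal sketch of the Laplace mechanism applied to a count, such as the number of patients with a given condition. It is not part of the original notebook; `laplace_noise` and `dp_count` are hypothetical names, and a production system would use a vetted library rather than this floating-point sampler:

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise; a counting query has sensitivity 1."""
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)  # seeded only to make the demo repeatable
# 4 is the number of 'Diabetes' rows in the sample data above.
print(dp_count(4, epsilon=1.0, rng=rng))
```

Smaller values of `epsilon` add more noise and give stronger privacy; larger values keep the released count closer to the true one.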
