# Data Perturbation in PostgreSQL
This notebook demonstrates **data perturbation** in **PostgreSQL**: adding controlled random noise to a numeric column so that individual values are obscured.

Perturbation protects **sensitive numerical values** while approximately preserving aggregate statistics such as the mean, which makes it useful for **privacy-preserving data analysis** and can support compliance with **GDPR, HIPAA, and other privacy regulations**.


## 1️⃣ Setting Up the Database
First, we connect to the PostgreSQL server and create the database if it does not already exist.

In [None]:
from sqlalchemy import create_engine
import psycopg2

# PostgreSQL connection settings
PG_ADDR = 'localhost'  # Server address
PG_PORT = '5432'       # PostgreSQL port
PG_USER = 'postgres'   # Admin user
PG_PASW = 'secure_password'  # Secure password
PG_DBNA = 'perturbation_db'  # Database name

# Connect to the default PostgreSQL database to check and create if needed
default_engine = create_engine(f'postgresql://{PG_USER}:{PG_PASW}@{PG_ADDR}:{PG_PORT}/postgres')

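# CREATE DATABASE cannot run inside a transaction block, so use driver-level autocommit.
# exec_driver_sql (SQLAlchemy 1.4+) passes the raw SQL string straight to psycopg2.
with default_engine.connect().execution_options(isolation_level="AUTOCOMMIT") as conn:
    result = conn.exec_driver_sql("SELECT 1 FROM pg_database WHERE datname = %s;", (PG_DBNA,))
    exists = result.scalar()

    if not exists:
        # PG_DBNA is a trusted constant; identifiers cannot be passed as bind parameters
        conn.exec_driver_sql(f"CREATE DATABASE {PG_DBNA};")
        print(f"Database '{PG_DBNA}' created successfully!")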

# Create an engine for the target database (connections are opened lazily on first use)
engine = create_engine(f'postgresql://{PG_USER}:{PG_PASW}@{PG_ADDR}:{PG_PORT}/{PG_DBNA}')
print(f"Engine ready for database '{PG_DBNA}'.")
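
The connection URLs above are built with an f-string, which breaks if the password contains characters such as `@` or `/`. One way to avoid that is sketched below, assuming the credentials come from environment variables. The helper `build_pg_url` and the environment variable names are hypothetical and not part of the original notebook; the defaults mirror the values used above.

```python
import os
from urllib.parse import quote_plus

def build_pg_url(user, password, host, port, dbname):
    """Build a PostgreSQL URL, percent-encoding the user name and password."""
    return f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{dbname}"

# Hypothetical environment variable names; fall back to the notebook's values.
url = build_pg_url(
    os.environ.get("PG_USER", "postgres"),
    os.environ.get("PG_PASW", "secure_password"),
    os.environ.get("PG_ADDR", "localhost"),
    os.environ.get("PG_PORT", "5432"),
    os.environ.get("PG_DBNA", "perturbation_db"),
)
# The resulting string can be passed to create_engine(url).
```

Percent-encoding only matters when the credentials contain reserved characters, but it costs nothing to apply it unconditionally.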

## 2️⃣ Creating and Populating the Patients Table
Now, we will create a **patients** table containing numerical data and insert some sample records before applying perturbation techniques.

In [None]:
# engine.begin() opens a transaction and commits it when the block exits
with engine.begin() as conn:
    conn.exec_driver_sql('''DROP TABLE IF EXISTS patients;''')
    conn.exec_driver_sql('''
        CREATE TABLE patients (
            id SERIAL PRIMARY KEY,
            full_name VARCHAR(100),
            birth_date DATE,
            country VARCHAR(50),
            medical_condition VARCHAR(100),
            medical_expense DECIMAL(10,2)
        );
    ''')
print("Table 'patients' created successfully!")

In [None]:
# Sample rows: (full_name, birth_date, country, medical_condition, medical_expense)
rows = [
    ('Holly Hess', '1984-03-05', 'Honduras', 'Diabetes', 6357.33),
    ('Tina Quinn', '1955-08-24', 'Malta', 'Heart Condition', 1784.26),
    ('Eric Doyle', '1982-07-17', 'Namibia', 'Respiratory Issues', 5204.99),
    ('Lisa Ford', '2006-09-11', 'Suriname', 'Kidney Disease', 2389.86),
    ('Madison Martinez', '2001-07-10', 'Svalbard & Jan Mayen Islands', 'Heart Condition', 6792.99),
    ('Barbara Lawrence', '1969-02-27', 'Israel', 'Respiratory Issues', 1175.56),
    ('Nathan Thornton', '1951-07-17', 'Colombia', 'Diabetes', 7734.64),
    ('Tina Burns', '1958-05-29', 'Guadeloupe', 'Heart Condition', 1032.71),
    ('Margaret Skinner DVM', '1989-10-19', 'Brunei Darussalam', 'Heart Condition', 8953.92),
    ('Vanessa Weber', '1943-08-18', 'Jamaica', 'Kidney Disease', 6620.94),
]
insert_sql = (
    "INSERT INTO patients (full_name, birth_date, country, medical_condition, medical_expense) "
    "VALUES (%s, %s, %s, %s, %s);"
)

# Passing a list of tuples runs the parameterized INSERT once per row in a single transaction
with engine.begin() as conn:
    conn.exec_driver_sql(insert_sql, rows)
print("Sample patient data inserted successfully!")

## 3️⃣ Implementing Data Perturbation
We will add uniform random noise in the range ±100 to **medical expenses**, so that the data remains useful for analysis while exact amounts can no longer be used to re-identify a patient.

In [None]:
import random

def perturb_expense(expense):
    """Python reference for the SQL below: add uniform noise in [-100, 100], clamp at 0."""
    noise = random.uniform(-100, 100)  # Noise within a controlled range
    return max(0, round(expense + noise, 2))

# Apply the same perturbation in SQL, storing the result in a new column
with engine.begin() as conn:
    conn.exec_driver_sql("ALTER TABLE patients ADD COLUMN IF NOT EXISTS perturbed_expense DECIMAL(10,2);")
    # RANDOM() is uniform on [0, 1), so RANDOM() * 200 - 100 lies in [-100, 100).
    # GREATEST clamps at 0; the DECIMAL(10,2) column rounds to two decimals on assignment.
    conn.exec_driver_sql("UPDATE patients SET perturbed_expense = GREATEST(0, medical_expense + (RANDOM() * 200 - 100));")
print("Perturbation applied successfully!")
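
To see why this kind of noise keeps aggregates usable, note that uniform noise on [-100, 100] has mean 0 and variance 200²/12 ≈ 3333, so averages shift only slightly while each individual value moves by up to 100. The sketch below illustrates this in plain Python on the sample expenses, without touching the database. The `rng` and `spread` parameters and the fixed seed are assumptions added for reproducibility; they are not part of the original function.

```python
import random
import statistics

# Original medical_expense values from the sample rows inserted earlier.
expenses = [6357.33, 1784.26, 5204.99, 2389.86, 6792.99,
            1175.56, 7734.64, 1032.71, 8953.92, 6620.94]

def perturb_expense(expense, rng, spread=100.0):
    """Add uniform noise in [-spread, spread], clamp at 0, round to cents."""
    return max(0.0, round(expense + rng.uniform(-spread, spread), 2))

rng = random.Random(42)  # fixed seed only so the sketch is reproducible
perturbed = [perturb_expense(x, rng) for x in expenses]

print("original mean: ", round(statistics.mean(expenses), 2))
print("perturbed mean:", round(statistics.mean(perturbed), 2))

# Each value moves by at most `spread`, so the mean cannot move by more than that.
max_shift = max(abs(p - x) for p, x in zip(perturbed, expenses))
print("largest single change:", round(max_shift, 2))
```

With only ten rows the perturbed mean can still differ noticeably from the original; the gap shrinks as the number of rows grows, because the noise terms average out.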

## 4️⃣ Verifying Perturbed Data
Let's check how the perturbed medical expenses compare to the original values.

In [None]:
import pandas as pd

with engine.connect() as conn:
    df = pd.read_sql("SELECT id, full_name, country, medical_condition, medical_expense, perturbed_expense FROM patients;", conn)
print(df.head())
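
Beyond eyeballing the first rows, the comparison can be summarized numerically. The sketch below uses a hypothetical helper, `summarize_perturbation`, that takes (original, perturbed) pairs and reports the largest change, the average change, and whether every change stays within the ±100 bound. The pairs shown are illustrative values, not real query output; with the DataFrame above, the pairs could instead come from `df[["medical_expense", "perturbed_expense"]].itertuples(index=False)`.

```python
def summarize_perturbation(pairs, spread=100.0):
    """Summarize (original, perturbed) pairs: largest change, mean change, bound check."""
    # float() also handles Decimal values returned for DECIMAL columns.
    diffs = [float(p) - float(o) for o, p in pairs]
    return {
        "max_abs_diff": max(abs(d) for d in diffs),
        "mean_diff": sum(diffs) / len(diffs),
        # Small tolerance allows for rounding to two decimal places.
        "within_bounds": all(abs(d) <= spread + 0.01 for d in diffs),
    }

# Illustrative pairs only; the perturbed values here are made up for the example.
sample_pairs = [(6357.33, 6401.12), (1784.26, 1730.50), (5204.99, 5290.00)]
summary = summarize_perturbation(sample_pairs)
print(summary)
```

A `within_bounds` value of `False` would indicate that the stored values were not produced by the ±100 perturbation, for example because a row was clamped at 0 or updated separately.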

## 5️⃣ Key Takeaways and Insights
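- **Perturbation enhances data privacy** by adding small, bounded random changes (here up to ±100) to numerical values.
- **Medical expenses are slightly altered**, so an exact expense amount can no longer be used to link a record to an individual.
- **This method supports compliance** with privacy laws while maintaining data usability, as long as the original `medical_expense` column is not shared alongside the perturbed one.
- **Perturbed data remains valuable** for aggregate analysis and machine learning models, because zero-mean noise leaves averages approximately unchanged.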

🚀 Next Steps:
- Implement **differential privacy** for more robust protection.
- Explore **synthetic data generation** as an alternative to perturbation.
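
As a starting point for the differential privacy step, the sketch below shows the Laplace mechanism applied to a sum query. It is an illustration under stated assumptions, not a production implementation: the helper names `laplace_noise` and `dp_sum`, the clipping bound `upper=10000.0`, and the privacy budget `epsilon=1.0` are all hypothetical choices. Values are clipped to [0, `upper`] so that adding or removing one patient changes the sum by at most `upper`, which is the sensitivity used to scale the noise.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_sum(values, upper, epsilon, rng):
    """Noisy sum of values clipped to [0, upper], with Laplace noise of scale upper / epsilon."""
    clipped = [min(max(v, 0.0), upper) for v in values]
    return sum(clipped) + laplace_noise(upper / epsilon, rng)

# Sample expenses from the patients table; upper and epsilon are illustrative choices.
expenses = [6357.33, 1784.26, 5204.99, 2389.86, 6792.99,
            1175.56, 7734.64, 1032.71, 8953.92, 6620.94]
rng = random.Random()
noisy_total = dp_sum(expenses, upper=10000.0, epsilon=1.0, rng=rng)
print("noisy total expense:", round(noisy_total, 2))
```

Unlike the per-row perturbation above, the noise here is added once to the query result, and a smaller `epsilon` gives stronger protection at the cost of a noisier answer.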
