# Anonymization Techniques: K-Anonymity, L-Diversity, and Data Perturbation


## Introduction

Datasets that describe individuals can leak sensitive information even after names are removed, so anonymization is an important step before analysis or release.
This notebook explores three anonymization techniques: **K-Anonymity, L-Diversity, and Data Perturbation**.
We generate a synthetic dataset and demonstrate how each technique protects sensitive data while retaining analytical utility.

### Topics Covered
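1. **K-Anonymity** - Ensuring each record is indistinguishable from at least k-1 others on its quasi-identifiers (here, age and income).
2. **L-Diversity** - Strengthening K-Anonymity by requiring at least l distinct sensitive values within each group of indistinguishable records.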
3. **Data Perturbation** - Adding noise to protect numerical values while preserving usability.


## Importing Libraries

In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random


## Generating a Synthetic Dataset

In [None]:

# Set seed for reproducibility
random.seed(42)
np.random.seed(42)

# Creating a synthetic dataset
num_samples = 1000
ages = np.random.randint(18, 80, num_samples)
incomes = np.random.randint(20000, 100000, num_samples)
conditions = np.random.choice(["Diabetes", "Hypertension", "None"], num_samples, p=[0.3, 0.3, 0.4])

df = pd.DataFrame({
    "id": range(num_samples),
    "age": ages,
    "income": incomes,
    "condition": conditions
})
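
Before anonymizing, it can help to measure how identifying the raw quasi-identifiers are. The sketch below is an illustration only: the `toy` DataFrame and the helper `equivalence_class_sizes` are hypothetical names, not part of the notebook. It counts how many rows share each exact (age, income) pair.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mirroring the notebook's columns (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(18, 80, 200),
    "income": rng.integers(20000, 100000, 200),
})

def equivalence_class_sizes(frame, quasi_identifiers):
    """Return the number of rows sharing each combination of quasi-identifier values."""
    return frame.groupby(quasi_identifiers).size()

sizes = equivalence_class_sizes(toy, ["age", "income"])
# Fraction of rows whose (age, income) pair appears exactly once.
unique_fraction = (sizes == 1).sum() / len(toy)
print(f"Smallest class: {sizes.min()}, unique rows: {unique_fraction:.1%}")
```

With exact ages and incomes, most rows are likely to be unique, which is the motivation for generalizing both attributes into bins.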


## K-Anonymity Implementation

In [None]:

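# Generalize the quasi-identifiers (age, income) into bins, then suppress
# any group smaller than k so that every remaining record shares its
# (age_group, income_group) pair with at least k-1 others.
def apply_k_anonymity(df, k=5):
    df = df.copy()  # Avoid mutating the caller's DataFrame
    df["age_group"] = (df["age"] // 5) * 5  # Group ages into bins of 5 years
    df["income_group"] = (df["income"] // 10000) * 10000  # Group income in bins of 10k
    # Drop the direct identifier and the raw quasi-identifier values
    df = df.drop(columns=["id", "age", "income"])
    group_sizes = df.groupby(["age_group", "income_group"])["condition"].transform("size")
    return df[group_sizes >= k].reset_index(drop=True)

df_k_anonymous = apply_k_anonymity(df)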
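
Generalizing attributes into bins does not by itself guarantee k-anonymity, so it is worth checking the result. The following is a minimal sketch of such a check; `is_k_anonymous` and the `toy` table are hypothetical and only illustrate the definition.

```python
import pandas as pd

def is_k_anonymous(frame, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    return bool(frame.groupby(quasi_identifiers).size().min() >= k)

# Hypothetical generalized table: one group of 3 rows and one group of 2 rows.
toy = pd.DataFrame({
    "age_group": [20, 20, 20, 25, 25],
    "income_group": [30000, 30000, 30000, 40000, 40000],
})

two_anonymous = is_k_anonymous(toy, ["age_group", "income_group"], k=2)
three_anonymous = is_k_anonymous(toy, ["age_group", "income_group"], k=3)
print(two_anonymous, three_anonymous)
```

The smallest group here has two rows, so the table is 2-anonymous but not 3-anonymous. The same call could be applied to `df_k_anonymous` with `["age_group", "income_group"]`.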


## L-Diversity Implementation

In [None]:

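# Check distinct l-diversity: every (age_group, income_group) group must
# contain at least l different values of the sensitive attribute.
def check_l_diversity(df, l=2):
    diverse = df.groupby(["age_group", "income_group"])['condition'].nunique()
    return bool((diverse >= l).all())

l_diverse = check_l_diversity(df_k_anonymous)
print(f"Satisfies 2-diversity: {l_diverse}")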
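
A single boolean does not say which groups fall short. As a sketch, assuming the same grouping columns, the hypothetical helper `l_diversity_violations` below returns the groups with fewer than l distinct sensitive values.

```python
import pandas as pd

def l_diversity_violations(frame, quasi_identifiers, sensitive, l=2):
    """Return the groups whose number of distinct sensitive values is below l."""
    counts = frame.groupby(quasi_identifiers)[sensitive].nunique()
    return counts[counts < l]

# Hypothetical table: the second group contains only one distinct condition.
toy = pd.DataFrame({
    "age_group": [20, 20, 25, 25],
    "income_group": [30000, 30000, 40000, 40000],
    "condition": ["Diabetes", "None", "Hypertension", "Hypertension"],
})

violations = l_diversity_violations(toy, ["age_group", "income_group"], "condition", l=2)
print(violations)
```

In this toy table only the (25, 40000) group is reported, because both of its rows carry the same condition. Such groups could then be merged with neighbors or suppressed.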


## Data Perturbation Implementation

In [None]:

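# Add uniform random noise drawn from [-noise_magnitude, noise_magnitude] to a value
def perturb_income(value, noise_magnitude):
    noise = random.uniform(-noise_magnitude, noise_magnitude)
    return round(value + noise, 2)

# The raw income column was dropped during generalization, so the noise is
# applied to the generalized income group rather than to the original income.
df_k_anonymous["perturbed_income"] = df_k_anonymous["income_group"].apply(lambda x: perturb_income(x, 5000))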
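
Applying noise row by row works, but the same uniform perturbation can also be written as one vectorized operation. This is an alternative sketch, not the notebook's method: `perturb_series` is a hypothetical helper, and it uses NumPy's `default_rng` instead of the `random` module.

```python
import numpy as np
import pandas as pd

def perturb_series(values, noise_magnitude, seed=None):
    """Add uniform noise in [-noise_magnitude, noise_magnitude) to every element."""
    rng = np.random.default_rng(seed)
    noise = rng.uniform(-noise_magnitude, noise_magnitude, size=len(values))
    return (values + noise).round(2)

# Hypothetical generalized incomes.
toy_income = pd.Series([20000, 30000, 40000, 50000])
perturbed = perturb_series(toy_income, 5000, seed=42)
print(perturbed)
```

Passing a fixed `seed` keeps the output reproducible, and every perturbed value stays within 5000 of its input.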


## Evaluating the Impact

In [None]:

# Sum of squared differences between the original and perturbed values
def calculate_error(original, perturbed):
    return np.sum((original - perturbed) ** 2)

error = calculate_error(df_k_anonymous["income_group"], df_k_anonymous["perturbed_income"])
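
The sum of squared errors grows with the number of rows, which makes it hard to interpret on its own. A root-mean-square error (RMSE) is expressed in the same units as income and can be compared with theory: noise drawn uniformly from [-a, a] has variance a²/3, so the expected RMSE is about a/√3 (roughly 2887 for a = 5000). The sketch below uses a hypothetical `rmse` helper and simulated data to illustrate this.

```python
import numpy as np

def rmse(original, perturbed):
    """Root-mean-square error between two equally sized arrays."""
    original = np.asarray(original, dtype=float)
    perturbed = np.asarray(perturbed, dtype=float)
    return float(np.sqrt(np.mean((original - perturbed) ** 2)))

noise_magnitude = 5000
rng = np.random.default_rng(0)

# Simulated values: a constant income plus uniform noise.
original = np.full(100_000, 50_000.0)
perturbed = original + rng.uniform(-noise_magnitude, noise_magnitude, size=original.size)

empirical = rmse(original, perturbed)
theoretical = noise_magnitude / np.sqrt(3)
print(f"Empirical RMSE:   {empirical:.1f}")
print(f"Theoretical RMSE: {theoretical:.1f}")
```

The two numbers should agree closely for a large sample, which gives a quick sanity check on the perturbation step.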


## Visualizing the Results

In [None]:

# Plot impact of perturbation
plt.figure(figsize=(8, 5))
plt.hist(df_k_anonymous["income_group"], alpha=0.5, label="Generalized Income (10k bins)")
plt.hist(df_k_anonymous["perturbed_income"], alpha=0.5, label="Perturbed Income")
plt.xlabel("Income")
plt.ylabel("Frequency")
plt.legend()
plt.title("Income Distribution Before and After Perturbation")
plt.show()



## Conclusion

- **K-Anonymity** generalizes quasi-identifiers so that each record shares its values with at least k-1 others, preventing unique identification.
- **L-Diversity** extends K-Anonymity by requiring each such group to contain at least l distinct sensitive values.
- **Data Perturbation** adds noise to numerical attributes, trading some accuracy for additional privacy.

Used together, these techniques reduce re-identification risk while keeping the data useful for aggregate analysis.
