# K-Anonymity, L-Diversity, T-Closeness Demonstration

This notebook shows how to check three privacy properties of a tabular dataset:

- **k-Anonymity**: Each group of rows sharing the same quasi-identifier values has *at least* k rows.
- **l-Diversity**: Each such group contains *at least* l distinct sensitive values.
- **t-Closeness**: Each group's sensitive-attribute distribution stays within distance t of the overall distribution.

We will:
1. Create or load a synthetic DataFrame.
2. Define generalization functions (for Age and ZipCode).
3. Check k-anonymity.
4. Check l-diversity.
5. Sketch a t-closeness check.
6. Summarize the process.

## Step 1: Create or Load Synthetic Data

We create a simple dataset with three columns:
- **Age** (numeric)
- **ZipCode** (string)
- **Disease** (string, treated as the sensitive attribute)


In [None]:
import pandas as pd
from collections import Counter

data = {
    'Age': [25, 27, 27, 34, 35, 36, 36, 45, 46, 52, 52, 52],
    'ZipCode': ['12345','12345','12344','12345','12349','12349','12349','12350','12350','12350','12351','12351'],
    'Disease': ['Flu','Cold','Flu','Diabetes','Cold','Flu','Cold','Diabetes','Flu','Flu','Cold','Flu']
}

df = pd.DataFrame(data)
print("Original DataFrame:\n", df)


## Step 2: Generalization Functions

We define functions to **bin** the Age values into decade-based ranges (e.g., 20–29) and to **truncate** the ZipCode to the first 3 characters. These transformations reduce granularity, helping to form larger groups.

In [None]:
def bin_age(age):
    start = (age // 10) * 10
    return f"{start}-{start+9}"

def truncate_zip(zipcode, keep=3):
    return zipcode[:keep]

df['Age_Bin'] = df['Age'].apply(bin_age)
df['Zip_Trunc'] = df['ZipCode'].apply(lambda z: truncate_zip(z, 3))
print("\nDataFrame after basic generalization:\n", df)
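
To make the effect of truncation concrete, the following sketch counts how many rows fall into each ZIP group at different truncation levels. It reuses the notebook's `truncate_zip` and ZIP codes; the `smallest_group` dictionary and the choice of `keep` values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def truncate_zip(zipcode, keep=3):
    # Same helper as in the notebook: keep only the leading characters.
    return zipcode[:keep]

zips = pd.Series(['12345', '12345', '12344', '12345', '12349', '12349',
                  '12349', '12350', '12350', '12350', '12351', '12351'])

# Hypothetical bookkeeping: smallest group size at each truncation level.
smallest_group = {}
for keep in (5, 4, 3):
    counts = zips.map(lambda z: truncate_zip(z, keep)).value_counts()
    smallest_group[keep] = int(counts.min())
    print(f"keep={keep}: {len(counts)} group(s), smallest has {counts.min()} row(s)")
```

Keeping fewer characters merges neighbouring ZIP codes, so the smallest group grows, at the cost of coarser location information.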


## Step 3: k-Anonymity

To check **k-Anonymity**, we group the DataFrame by the quasi-identifiers (`Age_Bin`, `Zip_Trunc`) and verify whether each group has at least `k` records. If a group has fewer than `k` rows, it fails.

You could then remove (suppress) or further generalize those groups in a real-world scenario, but we'll only identify them here.

In [None]:
def check_k_anonymity(df, quasi_identifiers, k):
    """Return the sizes of quasi-identifier groups with fewer than k rows."""
    grouped = df.groupby(quasi_identifiers).size()
    failed_groups = grouped[grouped < k]
    return failed_groups

k = 3
qi_cols = ['Age_Bin', 'Zip_Trunc']
failed_k = check_k_anonymity(df, qi_cols, k)
print(f"\nChecking for k-anonymity with k={k}...")
if failed_k.empty:
    print("All groups satisfy k-anonymity!")
else:
    print("Groups that fail k-anonymity:")
    print(failed_k)
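
The notebook itself only identifies failing groups. As a minimal sketch of the suppression option mentioned above, the rows of undersized groups could be dropped before release. The helper name `suppress_small_groups` and the `demo` frame (the notebook's data after generalization) are assumptions made for illustration.

```python
import pandas as pd

def suppress_small_groups(df, quasi_identifiers, k):
    # Keep only rows whose quasi-identifier group has at least k members.
    kept = df.groupby(quasi_identifiers).filter(lambda g: len(g) >= k)
    return kept.reset_index(drop=True)

# The notebook's data after generalization, rebuilt here so the sketch runs alone.
demo = pd.DataFrame({
    "Age_Bin": ["20-29"] * 3 + ["30-39"] * 4 + ["40-49"] * 2 + ["50-59"] * 3,
    "Zip_Trunc": ["123"] * 12,
    "Disease": ["Flu", "Cold", "Flu", "Diabetes", "Cold", "Flu", "Cold",
                "Diabetes", "Flu", "Flu", "Cold", "Flu"],
})

released = suppress_small_groups(demo, ["Age_Bin", "Zip_Trunc"], k=3)
print(f"{len(demo) - len(released)} row(s) suppressed")
```

Suppression is simple but discards data; further generalization keeps the rows and loses precision instead.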


## Step 4: l-Diversity

To check **l-Diversity**, we verify that within each group of quasi-identifiers, the **sensitive attribute** column (here `Disease`) contains *at least* `l` different values. If all records in a group share the same disease, for example, that group fails l-diversity.

We'll use `l = 2` for demonstration. You could also explore entropy- or recursive-based l-diversity, but we'll keep it simple here.

In [None]:
def check_l_diversity(df, quasi_identifiers, sensitive_col, l):
    """Return groups whose number of distinct sensitive values is below l."""
    group_data = df.groupby(quasi_identifiers)[sensitive_col].apply(lambda x: len(set(x)))
    failed_groups = group_data[group_data < l]
    return failed_groups

l = 2
failed_l = check_l_diversity(df, qi_cols, 'Disease', l)
print(f"\nChecking for l-diversity with l={l}...")
if failed_l.empty:
    print("All groups satisfy l-diversity!")
else:
    print("Groups that fail l-diversity (distinct count < l):")
    print(failed_l)
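
For comparison, here is a hedged sketch of the entropy variant mentioned above: a group is entropy l-diverse when the entropy of its sensitive-value distribution is at least log(l). The function names, the `demo` frame, and the small floating-point tolerance are assumptions for illustration.

```python
import math
from collections import Counter

import pandas as pd

def group_entropy(values):
    # Shannon entropy (natural log) of the sensitive values in one group.
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def check_entropy_l_diversity(df, quasi_identifiers, sensitive_col, l, tol=1e-12):
    # A group fails when its entropy is below log(l); tol absorbs rounding error.
    entropies = df.groupby(quasi_identifiers)[sensitive_col].apply(group_entropy)
    return entropies[entropies < math.log(l) - tol]

# The notebook's data after generalization, rebuilt here so the sketch runs alone.
demo = pd.DataFrame({
    "Age_Bin": ["20-29"] * 3 + ["30-39"] * 4 + ["40-49"] * 2 + ["50-59"] * 3,
    "Zip_Trunc": ["123"] * 12,
    "Disease": ["Flu", "Cold", "Flu", "Diabetes", "Cold", "Flu", "Cold",
                "Diabetes", "Flu", "Flu", "Cold", "Flu"],
})

failed_entropy = check_entropy_l_diversity(demo, ["Age_Bin", "Zip_Trunc"], "Disease", l=2)
print(failed_entropy)
```

Entropy l-diversity is stricter than counting distinct values: a group with two diseases split 2:1 has two distinct values but entropy below log(2), so it is flagged here.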


## Step 5: t-Closeness (Brief Sketch)

We compare each group's distribution of `Disease` to the **global** distribution of `Disease`. If the difference (using Total Variation Distance here) is above a threshold `t`, the group fails t-closeness.

Total Variation Distance is a simple choice for a categorical attribute. The original t-closeness definition uses Earth Mover's Distance, which can also account for how far apart ordered or hierarchical values are.

In [None]:
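def compute_distribution(series):
    """Map each value in the series to its relative frequency."""
    counts = Counter(series)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

def total_variation_distance(dist1, dist2):
    """Total variation distance: half the sum of absolute probability differences."""
    all_keys = set(dist1.keys()) | set(dist2.keys())
    return 0.5 * sum(abs(dist1.get(key, 0) - dist2.get(key, 0)) for key in all_keys)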

global_dist = compute_distribution(df['Disease'])
t = 0.2
print(f"\nChecking t-closeness with t={t}...")
grouped = df.groupby(qi_cols)['Disease']
violations = []
for group_vals, diseases in grouped:
    gd = compute_distribution(diseases)
    distance = total_variation_distance(gd, global_dist)
    if distance > t:
        violations.append((group_vals, distance))

if not violations:
    print("All groups satisfy t-closeness!")
else:
    print("Groups that fail t-closeness (TVD > t):")
    for group_vals, dist_val in violations:
        print(f"  {group_vals}: TVD={dist_val:.2f}")
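
To see why a group passes or fails, it can help to lay the per-group distributions next to the distance. The sketch below does this with `pd.crosstab`, using the standard definition of Total Variation Distance (half the sum of absolute differences). It groups by `Age_Bin` only, since every ZIP code in this dataset truncates to the same prefix; the `demo` frame and the `report` layout are illustrative assumptions.

```python
import pandas as pd

# The notebook's data after generalization, rebuilt here so the sketch runs alone.
demo = pd.DataFrame({
    "Age_Bin": ["20-29"] * 3 + ["30-39"] * 4 + ["40-49"] * 2 + ["50-59"] * 3,
    "Disease": ["Flu", "Cold", "Flu", "Diabetes", "Cold", "Flu", "Cold",
                "Diabetes", "Flu", "Flu", "Cold", "Flu"],
})

# Rows: groups; columns: diseases; each row sums to 1.
group_dist = pd.crosstab(demo["Age_Bin"], demo["Disease"], normalize="index")
global_dist = demo["Disease"].value_counts(normalize=True)

# Subtract the global distribution from every row, then take half the L1 norm.
tvd = 0.5 * group_dist.sub(global_dist, axis="columns").abs().sum(axis=1)

report = group_dist.assign(TVD=tvd)
print(report.round(3))
```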


## Step 6: Conclusion

1. **k-Anonymity** checks the **size** of each group.
2. **l-Diversity** checks the **variety** of sensitive values in each group.
3. **t-Closeness** checks how closely each group's **distribution** of sensitive values matches the overall distribution.

In practice, you would iteratively **generalize or suppress** records until these thresholds are met, balancing data **utility** against **privacy**. Stronger variants (such as **entropy l-diversity**, **recursive l-diversity**, or t-closeness measured with **Earth Mover's Distance**) further limit what an attacker can infer about the sensitive attribute.
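
The iterative process can be sketched as a walk along a fixed ladder of generalization levels that stops at the first level satisfying k-anonymity. The `LEVELS` ladder, the helper names, and the use of each bin's lower bound as the `Age_Bin` label are assumptions for illustration. This is a greedy search over a hand-picked ladder, not an optimal anonymization algorithm.

```python
import pandas as pd

# Hypothetical ladder: (age bin width, ZIP characters kept), finest level first.
LEVELS = [(5, 5), (10, 5), (10, 4), (10, 3), (20, 3)]

def generalize(df, age_width, zip_keep):
    # Age_Bin holds the lower bound of each bin; Zip_Trunc keeps a ZIP prefix.
    return pd.DataFrame({
        "Age_Bin": (df["Age"] // age_width) * age_width,
        "Zip_Trunc": df["ZipCode"].str[:zip_keep],
        "Disease": df["Disease"],
    })

def first_level_meeting_k(df, k, levels=LEVELS):
    # Return the first (finest) level whose smallest group has at least k rows.
    for age_width, zip_keep in levels:
        candidate = generalize(df, age_width, zip_keep)
        if candidate.groupby(["Age_Bin", "Zip_Trunc"]).size().min() >= k:
            return (age_width, zip_keep), candidate
    return None, None

df = pd.DataFrame({
    "Age": [25, 27, 27, 34, 35, 36, 36, 45, 46, 52, 52, 52],
    "ZipCode": ['12345', '12345', '12344', '12345', '12349', '12349',
                '12349', '12350', '12350', '12350', '12351', '12351'],
    "Disease": ["Flu", "Cold", "Flu", "Diabetes", "Cold", "Flu", "Cold",
                "Diabetes", "Flu", "Flu", "Cold", "Flu"],
})

level, released = first_level_meeting_k(df, k=3)
print("Chosen level (age width, zip chars kept):", level)
```

The same loop could also test l-diversity and t-closeness at each level before accepting it.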