## k-Anonymity

### Imports

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Dataset and descriptive statistics

In [6]:
adult = pd.read_csv("adult_with_pii.csv")
adult.head()

Unnamed: 0,Name,DOB,SSN,Zip,Workclass,Education,Education-Num,Marital Status,Occupation,Relationship,Race,Sex,Hours per week,Country,Target,Age,Capital Gain,Capital Loss
0,Karrie Trusslove,9/7/1967,732-14-6110,64152,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,40,United-States,<=50K,56,2174,0
1,Brandise Tripony,6/7/1988,150-19-2766,61523,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,13,United-States,<=50K,35,0,0
2,Brenn McNeely,8/6/1991,725-59-9860,95668,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,40,United-States,<=50K,32,0,0
3,Dorry Poter,4/6/2009,659-57-4974,25503,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,40,United-States,<=50K,14,0,0
4,Dick Honnan,9/16/1951,220-93-3811,75387,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,40,Cuba,<=50K,72,0,0


How can we de-identify data? The most obvious approach is to drop the columns that identify people directly, such as names and Social Security numbers.

In [9]:
adult_data = adult.copy().drop(columns=['Name', 'SSN'])
adult_pii = adult[['Name', 'SSN', 'DOB', 'Zip']]
adult_data.head(1)

Unnamed: 0,DOB,Zip,Workclass,Education,Education-Num,Marital Status,Occupation,Relationship,Race,Sex,Hours per week,Country,Target,Age,Capital Gain,Capital Loss
0,9/7/1967,64152,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,40,United-States,<=50K,56,2174,0


Suppose we want to determine the income of a friend. Names have been removed from the data, but we know some auxiliary information about the friend: the name "Dick Honnan", the birth date, and the zip code. We can then perform a linkage attack by joining the two tables on the columns they share (here DOB and Zip).


In [12]:
# Look up the friend's known attributes, then join on the shared quasi-identifiers.
dicks_row = adult_pii[adult_pii['Name'] == 'Dick Honnan']
pd.merge(dicks_row, adult_data, on=['DOB', 'Zip'])

Unnamed: 0,Name,SSN,DOB,Zip,Workclass,Education,Education-Num,Marital Status,Occupation,Relationship,Race,Sex,Hours per week,Country,Target,Age,Capital Gain,Capital Loss
0,Dick Honnan,220-93-3811,9/16/1951,75387,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,40,Cuba,<=50K,72,0,0
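
To gauge how exposed a table is to this kind of join, one option is to count how many records are unique on the quasi-identifiers. The following is a minimal sketch on a small made-up table; the `toy` frame and all of its values are illustrative assumptions, not rows from `adult_with_pii.csv`.

```python
import pandas as pd

# Hypothetical toy table; the values are invented for illustration only.
toy = pd.DataFrame({
    "DOB": ["1/1/1980", "1/1/1980", "2/2/1990", "3/3/1975"],
    "Zip": [10001, 10001, 20002, 30003],
    "Target": ["<=50K", ">50K", "<=50K", ">50K"],
})

# keep=False marks every row whose (DOB, Zip) pair appears more than once.
shared = toy.duplicated(subset=["DOB", "Zip"], keep=False)

# Rows that are NOT shared have a unique (DOB, Zip) pair, so anyone who
# knows those two values can single out that record.
unique_rows = toy[~shared]
print(len(unique_rows), "of", len(toy), "rows are unique on (DOB, Zip)")
```

In this toy table, the two rows that share a birth date and zip code hide behind each other, while the other two can be linked back to one person each.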


### Checking k-Anonymity

In [15]:
raw_data = {
    'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
    'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'], 
    'age': [42, 52, 36, 24, 73], 
    'preTestScore': [4, 24, 31, 2, 3],
    'postTestScore': [25, 94, 57, 62, 70]}
#df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df = pd.DataFrame(raw_data, columns = ['age', 'preTestScore', 'postTestScore'])
df

Unnamed: 0,age,preTestScore,postTestScore
0,42,4,25
1,52,24,94
2,36,31,57
3,24,2,62
4,73,3,70


In [17]:
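def isKAnonymized(df, k):
    """Return True if every row's combination of values occurs at least k times in df."""
    for index, row in df.iterrows():
        # Build a query such as "age==42 & preTestScore==4 & postTestScore==25".
        # Assumes numeric values and column names that are valid Python identifiers.
        query = ' & '.join([f'{col}=={row[col]}' for col in df.columns])
        rows = df.query(query)
        # Fewer than k matching rows means this row is too distinctive.
        if rows.shape[0] < k:
            return False
    return True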

In [19]:
isKAnonymized(df, 1)

True

In [21]:
isKAnonymized(df, 2)

False
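
The row-by-row check above runs one query per row, which becomes slow on larger tables. A possible alternative, sketched here under the assumption that the frame is non-empty and has no missing values, is to group on all columns and look at the smallest group. The name `is_k_anonymized_groupby` is hypothetical and not part of the notebook.

```python
import pandas as pd

def is_k_anonymized_groupby(df, k):
    """Return True if every combination of column values occurs at least k times.

    Sketch only: assumes df is non-empty and contains no NaN values
    (groupby drops NaN keys by default).
    """
    # One count per distinct combination of values across all columns.
    group_sizes = df.groupby(list(df.columns)).size()
    return bool(group_sizes.min() >= k)

# Same toy values as in the notebook's small test-score example.
toy = pd.DataFrame({
    "age": [42, 52, 36, 24, 73],
    "preTestScore": [4, 24, 31, 2, 3],
    "postTestScore": [25, 94, 57, 62, 70],
})
result_k1 = is_k_anonymized_groupby(toy, 1)
result_k2 = is_k_anonymized_groupby(toy, 2)
print(result_k1, result_k2)
```

Grouping touches the data once instead of once per row, and it does not depend on how column names or values are rendered into a query string.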

### Generalizing data to satisfy k-Anonymity

In [23]:
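def generalize(df, depths):
    # Truncate each value in column x to a multiple of 10**depths[x.name],
    # e.g. depth 1 maps 42 to 40 and depth 2 maps 42 to 0.
    return df.apply(lambda x: x.apply(lambda y: int(int(y/(10**depths[x.name]))*(10**depths[x.name]))))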

In [25]:
depths = {
    'age': 1, 
    'preTestScore': 1,
    'postTestScore': 1
}
df2 = generalize(df, depths)
df2

Unnamed: 0,age,preTestScore,postTestScore
0,40,0,20
1,50,20,90
2,30,30,50
3,20,0,60
4,70,0,70


In [28]:
isKAnonymized(df2, 2)

False

In [30]:
depths = {
    'age': 2,
    'preTestScore': 2,
    'postTestScore': 2
}
df3 = generalize(df, depths)

In [32]:
isKAnonymized(df3, 2)

True

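The DataFrame now satisfies k-Anonymity for k=2, but only because every value has been rounded down to a multiple of 100. Since all ages and scores are below 100, every cell is 0 and all information has been lost.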
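
To see what each depth does to the values, the generalization step can be written with integer floor division. This is a sketch, not the notebook's `generalize`: the name `generalize_floor` is hypothetical, and floor division only agrees with truncation for non-negative values, which is assumed here.

```python
import pandas as pd

def generalize_floor(df, depths):
    """Round each listed column down to a multiple of 10**depth.

    Sketch only: assumes non-negative integer columns.
    """
    out = df.copy()
    for col, depth in depths.items():
        step = 10 ** depth
        out[col] = (out[col] // step) * step
    return out

# Same toy values as in the notebook's small test-score example.
scores = pd.DataFrame({
    "age": [42, 52, 36, 24, 73],
    "preTestScore": [4, 24, 31, 2, 3],
    "postTestScore": [25, 94, 57, 62, 70],
})

depth1 = generalize_floor(scores, {"age": 1, "preTestScore": 1, "postTestScore": 1})
depth2 = generalize_floor(scores, {"age": 2, "preTestScore": 2, "postTestScore": 2})
print(depth1)
print(depth2)
```

At depth 1 the values fall into buckets of ten; at depth 2 the bucket width is 100, which is wider than the whole range of the data, so every row collapses to the same all-zero record.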

### More data?

In [36]:
df = adult_data[['Age', 'Education-Num']]
df.columns = ['age', 'edu']
isKAnonymized(df.head(100), 1)

True

In [38]:
isKAnonymized(df.head(100), 2)

False

In [40]:
# Outliers are a serious issue: even after generalization,
# some (age, edu) combinations still occur only once.
depths = {
    'age': 1,
    'edu': 1
}

df2 = generalize(df.head(100), depths)
isKAnonymized(df2, 2)

False
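
To find out which records keep the check at `False`, one can list the groups that have fewer than k members. The sketch below uses a made-up generalized table (the `toy` values are assumptions, not taken from the adult dataset) and also shows one possible remedy, suppression, which simply drops the offending rows. Suppression is a swapped-in illustration and is not something the notebook itself applies.

```python
import pandas as pd

# Hypothetical, already-generalized table; values are invented for illustration.
toy = pd.DataFrame({
    "age": [30, 30, 30, 40, 40, 90],
    "edu": [10, 10, 10, 10, 10, 0],
})

k = 2
quasi_ids = list(toy.columns)

# Count how many rows share each (age, edu) combination.
sizes = toy.groupby(quasi_ids).size()
violations = sizes[sizes < k]
print(violations)  # combinations that break k-anonymity

# Suppression: keep only rows whose group has at least k members.
suppressed = toy.groupby(quasi_ids).filter(lambda g: len(g) >= k)
print(len(toy) - len(suppressed), "row(s) suppressed")
```

Here a single outlier (age 90) is enough to break 2-anonymity for the whole table; removing that one row restores it at the cost of losing the record.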