# CC5215: Privacidad de Datos
## Lab 1

In [1]:
# Load the data and libraries
import pandas as pd
import numpy as np

adult = pd.read_csv('https://users.dcc.uchile.cl/~mtoro/cursos/cc5215/adult_with_pii.csv')

# Dataset with private information
adult_pii = adult[['Name', 'DOB', 'SSN', 'Zip', 'Age']]
# De-identified dataset (direct identifiers removed)
adult_deid = adult.drop(columns=['Name', 'SSN'])

## Question 1 (5 points)

Using the dataframes `adult_pii` and `adult_deid`, write code to conduct a linking attack to recover the names of as many individuals in `adult_deid` as possible. Your solution should be parameterized by the set of columns to use in the attack.
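
As a minimal sketch of the idea, the toy example below joins a de-identified table with an identified one on their shared quasi-identifiers. The frames `toy_pii` and `toy_deid` and the helper `toy_linking_attack` are hypothetical illustrations, not part of the lab's data. Note that an ambiguous key (two people sharing a Zip) produces extra rows, one per candidate match.

```python
import pandas as pd

# Hypothetical toy data, not taken from the lab's dataset
toy_pii = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Cal'],
    'Zip': [10001, 10002, 10002],
    'Age': [30, 41, 52],
})
toy_deid = pd.DataFrame({
    'Zip': [10001, 10002, 10002],
    'Age': [30, 41, 52],
    'Income': [50, 60, 70],
})

def toy_linking_attack(deid, pii, cols):
    # Join on the quasi-identifiers only, keeping every de-identified row
    cols = list(cols)
    return deid.merge(pii[cols + ['Name']], on=cols, how='left')

# Zip alone is ambiguous: each 10002 row matches both Ben and Cal
by_zip = toy_linking_attack(toy_deid, toy_pii, ['Zip'])
# Zip plus Age pins down exactly one name per row
by_zip_age = toy_linking_attack(toy_deid, toy_pii, ['Zip', 'Age'])
```

With these toy frames, `by_zip` has more rows than `toy_deid` because of the ambiguous matches, while `by_zip_age` has exactly one row per individual.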

In [2]:
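def linking_attack(cols):
    # Copy the argument so the caller's list is not mutated
    cols = list(cols)
    # Join on the chosen quasi-identifiers and pull in the matching names
    return adult_deid.merge(adult_pii[cols + ['Name']], on=cols, how='left')['Name']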

In [3]:
# TEST CASES for Question 1

assert len(linking_attack(['Zip'])) == 43191
assert len(linking_attack(['Zip', 'DOB'])) == 32563
assert len(linking_attack(['Zip', 'Age'])) == 32755

## Question 2 (5 points)

How many individuals in this dataset are uniquely identified by their Zip code? How many are uniquely identified by their date of birth (`DOB`)?

Hint: note that the number of *unique ZIP codes* is **different** from the number of *individuals uniquely identified by ZIP code*.

Hint: you can use the `value_counts` method (and its `subset` parameter) to count the number of occurrences of each value in a series.
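
To illustrate the difference the first hint describes, here is a small sketch on a hypothetical four-row frame (`toy` is made up for this example). A value identifies an individual uniquely only when its count is exactly 1.

```python
import pandas as pd

# Hypothetical toy data: Zip 10002 is shared by two individuals
toy = pd.DataFrame({'Zip': [10001, 10002, 10002, 10003]})

# Number of rows per distinct Zip value
counts = toy['Zip'].value_counts()

# Distinct Zip values present in the data
n_unique_values = len(counts)

# Individuals whose Zip appears exactly once
n_uniquely_identified = int((counts == 1).sum())
```

Here there are three distinct Zip codes, but only two individuals are uniquely identified by theirs.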

In [4]:
def unique_zipcode():
    return len(adult_deid[~adult_deid['Zip'].duplicated(keep=False)])

def unique_dob():
    return len(adult_deid[~adult_deid['DOB'].duplicated(keep=False)])

In [5]:
# TEST CASES for Question 2

assert unique_zipcode() == 23513
assert unique_dob() == 7845

## Question 3 (10 points)

Write code to determine the `Education-Num` of the individual named Ardyce Golby by performing a differencing attack. Your code should *only* use aggregate data to find Ardyce's education number.
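
A differencing attack subtracts two aggregates whose underlying populations differ by exactly one person. The sketch below shows the arithmetic on hypothetical toy data (the names and values are invented for illustration).

```python
import pandas as pd

# Hypothetical toy data, not taken from the lab's dataset
toy = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Cal'],
    'Education-Num': [9, 13, 10],
})

# Aggregate over everyone
total = toy['Education-Num'].sum()
# Aggregate over everyone except the target
total_without_ben = toy.loc[toy['Name'] != 'Ben', 'Education-Num'].sum()

# The difference of the two sums is the target's individual value
recovered = total - total_without_ben
```

Neither sum reveals an individual value on its own; only their difference does.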

In [6]:
def ardyce_education():
    adult_join = adult_deid.merge(adult_pii, how='left')
    totalSum = adult_join['Education-Num'].sum()
    sumWithoutArdyce = adult_join[adult_join['Name']!='Ardyce Golby']['Education-Num'].sum()
    return totalSum - sumWithoutArdyce

In [7]:
# TEST CASE for Question 3
assert ardyce_education() == 12

## Question 4 (15 points)

Implement a more efficient version of `is_k_anonymous`. The inefficient implementation, taken from the textbook, appears below.

**Hint**: use the [`value_counts`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) or `groupby` functions, and make sure no count is less than $k$.
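
As a related sketch of the grouping idea in the hint, the size of the smallest group of rows sharing the same quasi-identifier values is the largest $k$ for which a dataframe is $k$-anonymous. The helper name `max_k` and the toy frame are assumptions made for this illustration.

```python
import pandas as pd

def max_k(qis, df):
    # Size of each group of rows that share the same quasi-identifier values;
    # the smallest group bounds the k that the dataframe can satisfy
    return int(df.groupby(qis).size().min())

# Hypothetical toy data
toy = pd.DataFrame({
    'Age': [20, 20, 30, 30, 30],
    'Zip': [1, 1, 1, 1, 2],
})

k_age = max_k(['Age'], toy)             # groups: Age 20 (2 rows), Age 30 (3 rows)
k_age_zip = max_k(['Age', 'Zip'], toy)  # the (30, 2) group has a single row
```

Adding a quasi-identifier can only split groups further, so the achievable $k$ never increases.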

In [8]:
# Checking for k-Anonymity, taken from the textbook
# def is_k_anonymous(k, qis, df):
#     for index, row in df.iterrows():
#         query = ' & '.join([f'{col} == {row[col]}' for col in qis])
#         rows = df.query(query)
#         if (rows.shape[0] < k):
#             return False
#     return True

In [9]:
# Checking for k-anonymity more efficiently
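def is_k_anonymous(k, qis, df):
    # Number of rows in each group sharing the same quasi-identifier values
    group_sizes = df.groupby(qis).size()
    # k-anonymous only if no group has fewer than k rows
    return bool((group_sizes >= k).all())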

In [10]:
# TEST CASES for question 4

assert not is_k_anonymous(2, ['Age'], adult)
assert is_k_anonymous(1, ['Age'], adult)
assert is_k_anonymous(1, ['Age', 'Occupation'], adult)

## Question 5 (10 points)

Implement a `generalize` function that takes a dataframe `df` and a dictionary `depths` that describes how much to generalize each column of `df`. Generalizing a column to a depth of $n$ replaces the $n$ least-significant digits of each number in that column by zeroes. For example, we could generalize column `A` by zeroing its 2 least-significant digits and column `B` by zeroing its least-significant digit with the following depth specification:

In [11]:
depths = {
    'A': 2,
    'B': 1
}

In [12]:
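def generalize(df, depths):
    # Work on a copy so the input dataframe is left untouched
    dfcopy = df.copy()
    for col, depth in depths.items():
        factor = 10 ** depth
        # Integer division drops the low digits; multiplying back zero-fills them
        dfcopy[col] = (dfcopy[col] // factor) * factor
    return dfcopy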

The result of generalizing the age to a depth of 1 should satisfy $k$-anonymity for $k = 20$:

In [13]:
# Test case for question 5

def generalize_adult_age():
    depths = {
        'Age': 1,
    }

    return generalize(adult[['Age']], depths)

assert is_k_anonymous(20, ['Age'], generalize_adult_age())
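
To make the digit arithmetic concrete, here is a small worked sketch on a hypothetical frame with columns `A` and `B` (the values are invented). Integer division by $10^n$ discards the $n$ least-significant digits, and multiplying back by $10^n$ replaces them with zeroes.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'A': [1234, 5678], 'B': [37, 42]})
toy_depths = {'A': 2, 'B': 1}

generalized = toy.copy()
for col, depth in toy_depths.items():
    factor = 10 ** depth
    # e.g. 1234 // 100 = 12, then 12 * 100 = 1200
    generalized[col] = (generalized[col] // factor) * factor
```

With these depths, column `A` becomes 1200 and 5600, and column `B` becomes 30 and 40.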

## Question 6 (5 points)

Using the `generalize` function, generalize the `Age` and `Zip` columns of the `adult` dataset in order to achieve $k$-Anonymity for $k=5$. Your result should contain only these two columns; drop all others.

In [14]:
def generalize_adult_age_zip():
    depths = {
        'Age': 2,
        'Zip': 2,
    }

    return generalize(adult[['Age', 'Zip']], depths)

assert is_k_anonymous(5, ['Age', 'Zip'], generalize_adult_age_zip())

In [15]:
def generalize_adult_age_zip():
    depths = {
        'Age': 1,
        'Zip': 5,
    }

    return generalize(adult[['Age', 'Zip']], depths)

assert is_k_anonymous(5, ['Age', 'Zip'], generalize_adult_age_zip())
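
Rather than picking depths by trial and error, one could search for them. The sketch below is one possible approach under stated assumptions: it tries depth combinations in order of increasing total depth and returns the first that reaches the requested $k$. The helpers `toy_generalize`, `toy_is_k_anonymous`, and `find_depths` are hypothetical names redefined here so the block is self-contained, and the data is invented.

```python
import itertools
import pandas as pd

def toy_generalize(df, depths):
    # Zero out the `depth` least-significant digits of each listed column
    out = df.copy()
    for col, depth in depths.items():
        factor = 10 ** depth
        out[col] = (out[col] // factor) * factor
    return out

def toy_is_k_anonymous(k, qis, df):
    # Every group of identical quasi-identifier values must have >= k rows
    return bool((df.groupby(qis).size() >= k).all())

def find_depths(df, qis, k, max_depth):
    # Enumerate all depth combinations, least total generalization first
    combos = sorted(
        itertools.product(range(max_depth + 1), repeat=len(qis)),
        key=sum,
    )
    for combo in combos:
        depths = dict(zip(qis, combo))
        if toy_is_k_anonymous(k, qis, toy_generalize(df, depths)):
            return depths
    return None  # no combination within max_depth reaches k

# Hypothetical toy data: ages are all distinct, Zip is constant
toy = pd.DataFrame({'Age': [21, 22, 23, 24], 'Zip': [10, 10, 10, 10]})
found = find_depths(toy, ['Age', 'Zip'], k=2, max_depth=2)
```

Total depth is only a rough proxy for information loss, so a different ordering of the candidates may be preferable depending on which column matters more.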

## Question 7 (10 points)

In 1-4 sentences each, answer the following:

1. How much generalization was required to achieve $k=5$ in question 6?
2. Does this level of generalization significantly impact the utility of the $k$-Anonymized data? Why or why not?
3. Is there another approach, in addition to our simple generalization method, that might work better?
4. What is a simple method for generalizing the `Occupation` column?

Your answer here:

1. I found two generalizations that achieve this. One generalizes `Age` to depth 1 and `Zip` to depth 5; the other uses depth 2 for both `Age` and `Zip`. With smaller depths, the data was only $k$-anonymous for $k=1$.
2. These generalizations severely reduce the information each attribute carries. With the first generalization above, the `Zip` information is lost entirely, and with the second the same happens to `Age`. This much generalization is needed because, at smaller depths, a few rows can still be singled out.
3. One could simply remove the records that keep the data at 1-anonymity. For example, with `Age` at depth 1 and `Zip` at depth 4, only 2 rows of the dataframe (outliers) can be singled out. Removing them would preserve some information about `Zip`, since not all of its values would be zero.
4. One could use a taxonomy tree to classify occupations by area. Where necessary, each branch could be split into further branches, as long as they still allow generalization (that is, without creating more branches than needed).
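
The two ideas in answers 3 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `suppress_small_groups` and `generalize_occupation`, the one-level taxonomy `OCCUPATION_AREA`, its category names, and the toy data are all hypothetical.

```python
import pandas as pd

def suppress_small_groups(df, qis, k):
    # For every row, the size of the quasi-identifier group it belongs to
    sizes = df.groupby(qis)[qis[0]].transform('size')
    # Keep only rows whose group already has at least k members
    return df[sizes >= k]

# Hypothetical one-level taxonomy; the area names are illustrative assumptions
OCCUPATION_AREA = {
    'Tech-support': 'Technical',
    'Prof-specialty': 'Technical',
    'Adm-clerical': 'Office',
    'Exec-managerial': 'Office',
}

def generalize_occupation(series, taxonomy, default='Other'):
    # Replace each occupation by its parent node; unmapped values get a catch-all
    return series.map(taxonomy).fillna(default)

# Hypothetical toy data
toy = pd.DataFrame({
    'Age': [30, 30, 30, 40],
    'Occupation': ['Tech-support', 'Prof-specialty',
                   'Adm-clerical', 'Farming-fishing'],
})

kept = suppress_small_groups(toy, ['Age'], k=2)
areas = generalize_occupation(toy['Occupation'], OCCUPATION_AREA)
```

Suppression trades a few dropped rows for less generalization of the remaining ones, while the taxonomy gives a categorical column the same kind of coarsening that digit-zeroing gives a numeric one.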