# CS211: Data Privacy
## Homework 2

In [1]:
# Load the data and libraries
import pandas as pd
import numpy as np

adult = pd.read_csv('https://github.com/jnear/cs211-data-privacy/raw/master/homework/adult_with_pii.csv')
adult = adult.dropna()

## Question 1 (20 points)

Implement a more efficient version of `is_k_anonymous`. The inefficient implementation, taken from the textbook, appears below.

**Hint**: use the [`value_counts`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) function, and make sure no count is less than $k$.

In [None]:
# Checking for k-Anonymity, taken from the textbook
# def is_k_anonymous(k, qis, df):
#     for index, row in df.iterrows():
#         query = ' & '.join([f'`{col}` == "{row[col]}"' for col in qis])
#         rows = df.query(query)
#         if (rows.shape[0] < k):
#             return False
#     return True

In [2]:
# Checking for k-anonymity more efficiently
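def is_k_anonymous(k, qis, df):
    """Returns True if df satisfies k-Anonymity for the quasi-identifiers
    qis. Returns False otherwise."""
    # Count the rows in each group that shares the same *combination* of
    # quasi-identifier values; every such group needs at least k rows.
    # Counting each column separately would miss groups that only become
    # small once the columns are combined.
    counts = df[qis].value_counts()
    return bool((counts >= k).all())
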
print(is_k_anonymous(2, ['Age'], adult))
print(is_k_anonymous(1, ['Age'], adult))
print(is_k_anonymous(1, ['Age', 'Occupation'], adult))

False
True
True


In [3]:
# TEST CASES for question 1

assert not is_k_anonymous(2, ['Age'], adult)
assert is_k_anonymous(1, ['Age'], adult)
assert is_k_anonymous(1, ['Age', 'Occupation'], adult)
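
k-Anonymity is defined over groups of rows that agree on every quasi-identifier at once, so with more than one quasi-identifier the counts have to be taken over value combinations. A minimal sketch on made-up data (the `toy` frame and the variable names are illustrative assumptions, not part of the assignment) shows how per-column counts can look fine while the combined groups are smaller:

```python
import pandas as pd

# Made-up data: each value of A and of B appears twice on its own,
# but every (A, B) combination appears only once.
toy = pd.DataFrame({'A': [1, 1, 2, 2], 'B': ['x', 'y', 'x', 'y']})
qis = ['A', 'B']

# Smallest count when each column is inspected separately.
per_column_min = min(toy[col].value_counts().min() for col in qis)

# Smallest count over combinations of the quasi-identifiers.
joint_min = toy.groupby(qis).size().min()

print(per_column_min, joint_min)  # prints: 2 1
```

On this data a per-column check would accept $k=2$, while the combined groups only support $k=1$.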

## Question 2 (10 points)

Consider the definition of `generalize` below, taken from the textbook. The function takes a dataframe `df` and a dictionary `depths` that describes how much to generalize each column of `df`. Generalizing a column to a depth of $n$ replaces the $n$ least-significant digits of each number in that column by zeroes. For example, we could generalize column `A` by making its least-significant digit a 0 and column `B` by doing the same for 2 digits with the following depth specification:

In [4]:
depths = {
    'A': 1,
    'B': 2
}

In [5]:
def generalize(df, depths):
    return df.apply(lambda x: x.apply(lambda y: int(int(y/(10**depths[x.name]))*(10**depths[x.name]))))
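
As a quick check of the depth semantics, here is a small sketch on made-up data. The `toy` frame is an illustrative assumption; `generalize` and `depths` are repeated from above so the snippet runs on its own.

```python
import pandas as pd

def generalize(df, depths):
    # Same definition as above: truncate the n least-significant digits to zero.
    return df.apply(lambda x: x.apply(lambda y: int(int(y/(10**depths[x.name]))*(10**depths[x.name]))))

depths = {'A': 1, 'B': 2}
toy = pd.DataFrame({'A': [123, 456], 'B': [1234, 5678]})

result = generalize(toy, depths)
print(result)
# Column A becomes [120, 450]; column B becomes [1200, 5600].
```

Note that the digits are truncated rather than rounded, so 456 maps to 450 and 5678 maps to 5600.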

Using the `generalize` function, generalize the `Age` column of the `adult` dataset to a depth of 1. Drop the other columns of the dataset. Your result should achieve $k$-Anonymity for $k=20$.

In [6]:
def generalize_adult_age():
    depths = {
        'Age': 1
    }
            
    return generalize(adult[['Age']], depths)

In [7]:
assert is_k_anonymous(20, ['Age'], generalize_adult_age())

In [8]:
generalize_adult_age()

|       | Age |
|-------|-----|
| 0     | 30  |
| 1     | 50  |
| 2     | 30  |
| 3     | 50  |
| 4     | 20  |
| ...   | ... |
| 32556 | 20  |
| 32557 | 40  |
| 32558 | 50  |
| 32559 | 20  |


## Question 3 (10 points)

Using the `generalize` function, generalize the `Age` and `Zip` columns of the `adult` dataset in order to achieve $k$-Anonymity for $k=5$. Your result should drop other columns besides these two.

In [9]:
def generalize_adult_age_zip():
    depths = {
        'Age': 1,
        'Zip': 2
    }

    return generalize(adult[['Age', 'Zip']], depths)

In [10]:
assert is_k_anonymous(5, ['Age', 'Zip'], generalize_adult_age_zip())
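
Whether a given pair of depths reaches $k=5$ depends on how the rows spread across the combined (`Age`, `Zip`) groups, so one option is to search for suitable depths instead of guessing. The sketch below is only an illustration under stated assumptions: `smallest_group` and `search_depths` are hypothetical helpers, the data is synthetic rather than the `adult` dataset, and "total depth" is a crude stand-in for information loss.

```python
import itertools
import numpy as np
import pandas as pd

def generalize(df, depths):
    # Same truncation rule as the textbook definition above.
    return df.apply(lambda x: x.apply(lambda y: int(int(y/(10**depths[x.name]))*(10**depths[x.name]))))

def smallest_group(qis, df):
    # Size of the smallest group of rows sharing all quasi-identifier values.
    return df.groupby(qis).size().min()

def search_depths(df, qis, k, max_depths):
    """Hypothetical helper: try depth combinations in order of total depth and
    return the first one whose generalized data has no group smaller than k."""
    ranges = [range(max_depths[q] + 1) for q in qis]
    for combo in sorted(itertools.product(*ranges), key=sum):
        depths = dict(zip(qis, combo))
        if smallest_group(qis, generalize(df[qis], depths)) >= k:
            return depths
    return None  # no combination within max_depths reaches k

# Synthetic stand-in data (an assumption, not the adult dataset).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Age': rng.integers(17, 91, size=2000),
    'Zip': rng.integers(0, 100000, size=2000),
})

found = search_depths(toy, ['Age', 'Zip'], k=5, max_depths={'Age': 2, 'Zip': 5})
print(found)
```

The same search could be pointed at `adult[['Age', 'Zip']]`, although the nested `apply` in `generalize` makes each candidate fairly slow on the full dataset.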

In [11]:
generalize_adult_age_zip()

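|       | Age | Zip   |
|-------|-----|-------|
| 0     | 30  | 64100 |
| 1     | 50  | 61500 |
| 2     | 30  | 95600 |
| 3     | 50  | 25500 |
| 4     | 20  | 75300 |
| ...   | ... | ...   |
| 32556 | 20  | 41300 |
| 32557 | 40  | 94700 |
| 32558 | 50  | 49600 |
| 32559 | 20  | 8200  |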


## Question 4 (30 points)

In 1-4 sentences each, answer the following:

1. How much generalization was required to achieve $k=5$ in question 3?
2. Does this level of generalization significantly impact the utility of the $k$-Anonymized data? Why or why not?
3. Why is generalizing the `adult` dataset so challenging? (**Hint**: consider outliers)
4. Is there another approach, in addition to our simple generalization method, that might work better?
5. What is a simple method for generalizing the `Occupation` column?

1. The solution to question 3 generalizes `Age` to a depth of 1 (the last digit is zeroed, giving decade brackets) and `Zip` to a depth of 2 (the last two digits are zeroed).

2. This level of generalization moderately reduces the utility of the data. The age bracket of each person is still visible, but each `Zip` value now covers a broader region, which could make the data less useful for analyses that depend on precise location.

3. Generalizing this dataset is challenging because many of the columns contain outliers. Generalizing far enough to hide each outlier among at least $k$ similar rows coarsens every other row as well, to the point where the data contains hardly any information.

4. Removing the outliers before generalizing would work better, because the remaining rows would not need to be generalized as deeply and would retain more information. Dropping outliers also costs little, since they tend to skew the data in ways that do not represent the overall trend. Another option is to generalize to better-suited values, for example rounding to the nearest multiple instead of truncating digits to zero.

5. To generalize the `Occupation` column, one could map each occupation to a broader umbrella category, so that the values are less specific and reveal less identifying information.
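
The last two answers can be sketched together. The umbrella categories below are an illustrative assumption (the grouping is not taken from the textbook), and `generalize_occupation` and `suppress_small_groups` are hypothetical helper names. Suppressing small groups is shown as one possible way to act on the outlier idea.

```python
import pandas as pd

# Hypothetical umbrella categories; this particular grouping is an assumption.
OCCUPATION_GROUPS = {
    'Exec-managerial': 'White-collar',
    'Prof-specialty': 'White-collar',
    'Adm-clerical': 'White-collar',
    'Sales': 'White-collar',
    'Tech-support': 'White-collar',
    'Craft-repair': 'Blue-collar',
    'Machine-op-inspct': 'Blue-collar',
    'Transport-moving': 'Blue-collar',
    'Handlers-cleaners': 'Blue-collar',
    'Farming-fishing': 'Blue-collar',
    'Other-service': 'Service',
    'Protective-serv': 'Service',
    'Priv-house-serv': 'Service',
}

def generalize_occupation(occupations):
    # Occupations missing from the mapping fall back to a catch-all category.
    return occupations.map(OCCUPATION_GROUPS).fillna('Other')

def suppress_small_groups(df, qis, k):
    # Keep only rows whose quasi-identifier combination occurs at least k times.
    # Counting a key column per group gives the group size, because key
    # values are never missing inside a group.
    sizes = df.groupby(qis)[qis[0]].transform('count')
    return df[sizes >= k]

toy = pd.DataFrame({'Occupation': ['Sales', 'Tech-support', 'Craft-repair',
                                   'Handlers-cleaners', 'Armed-Forces']})
toy['Occupation'] = generalize_occupation(toy['Occupation'])
kept = suppress_small_groups(toy, ['Occupation'], k=2)
print(kept['Occupation'].value_counts())
```

In this toy example the single `Other` row is suppressed, and the remaining four rows fall into two groups of two.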