# CSE 146 Lab 5: Privacy
### ASSIGNED: November 17, 2020
### DUE: December 1, 2020
### 95 Points Total

The purpose of this assignment is to familiarize yourself with concepts in privacy. We will cover $k$-anonymity, linking attacks, and differential privacy, as well as current internet practices.

# Instructions 
Prior to beginning the assignment, in addition to the readings, this short Medium post may be useful:
- [Differential Privacy](https://medium.com/georgian-impact-blog/a-brief-introduction-to-differential-privacy-eacf8722283b)

Just like in previous labs, these guidelines remain consistent:
- All cells where code is required are marked with a "# YOUR CODE HERE" comment. All cells where a written answer is required are marked with "Please type your answers here!". The point values for each code block are written in the header for the associated subsection.
- For each question, you should write Python code that computes the answer and displays it in a readable way according to the specifications of the question. You may only use the packages provided in the Background and Setup code. We will not be installing any packages during grading, and code that does not run will negatively affect your grade.
- This assignment can be done collaboratively, but please be sure to list the student(s) you worked with in the space provided below. Please reach out to each other if you have any questions or difficulties.
- Be sure to rename this lab notebook (replacing the [YOUR NAME HERE] placeholder) so that it includes your name.

### List any students you talked with about this assignment here:
1. Margarita Fernandez
2. Zachary Zulanas
3. etc.

# Setup

The dataset is based on census data. The columns `Name`, `DOB`, `SSN`, and `Zip` represent personally identifiable information (PII). The values in these columns are made up.

In [2]:
import numpy as np
import pandas as pd

def your_code_here():
    return 0

adult = pd.read_csv("adult_with_pii.csv")
adult

Unnamed: 0,Name,DOB,SSN,Zip,Age,Workclass,fnlwgt,Education,Education-Num,Martial Status,Occupation,Relationship,Race,Sex,Capital Gain,Capital Loss,Hours per week,Country,Salary
0,Karrie Trusslove,9/7/1967,732-14-6110,64152,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,Brandise Tripony,6/7/1988,150-19-2766,61523,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,Brenn McNeely,8/6/1991,725-59-9860,95668,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,Dorry Poter,4/6/2009,659-57-4974,25503,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,Dick Honnan,9/16/1951,220-93-3811,75387,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,Ardyce Golby,10/29/1961,212-61-8338,41328,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,Jean O'Connor,6/28/1952,737-32-2919,94735,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,Reuben Skrzynski,8/9/1966,314-48-0219,49628,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,Caye Biddle,5/19/1978,647-75-3550,8213,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [3]:
# Remove the direct identifiers (Name, SSN); DOB and Zip stay in as quasi-identifiers
anonymized_adult = adult.drop(columns=['Name', 'SSN'])
anonymized_adult

Unnamed: 0,DOB,Zip,Age,Workclass,fnlwgt,Education,Education-Num,Martial Status,Occupation,Relationship,Race,Sex,Capital Gain,Capital Loss,Hours per week,Country,Salary
0,9/7/1967,64152,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,6/7/1988,61523,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,8/6/1991,95668,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,4/6/2009,25503,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,9/16/1951,75387,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,10/29/1961,41328,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,6/28/1952,94735,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,8/9/1966,49628,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,5/19/1978,8213,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [4]:
# Personally identifiable information with DOB and Zip
personally_identifiable_info = adult[['Name', 'DOB', 'SSN', 'Zip']]
personally_identifiable_info

Unnamed: 0,Name,DOB,SSN,Zip
0,Karrie Trusslove,9/7/1967,732-14-6110,64152
1,Brandise Tripony,6/7/1988,150-19-2766,61523
2,Brenn McNeely,8/6/1991,725-59-9860,95668
3,Dorry Poter,4/6/2009,659-57-4974,25503
4,Dick Honnan,9/16/1951,220-93-3811,75387
...,...,...,...,...
32556,Ardyce Golby,10/29/1961,212-61-8338,41328
32557,Jean O'Connor,6/28/1952,737-32-2919,94735
32558,Reuben Skrzynski,8/9/1966,314-48-0219,49628
32559,Caye Biddle,5/19/1978,647-75-3550,8213


# Problems

### Problem 1: Linking Attacks (55 points)

A linking attack is a process in which two or more datasets are joined together, usually with one dataset containing personally identifiable information and the others containing sensitive data. In practice, linking attacks employ techniques such as entity resolution and graph matching to link records across databases.
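
As a minimal sketch of the idea, assuming two small made-up tables (`medical` and `voters` are hypothetical names, not part of the lab data), an inner join on the shared quasi-identifiers is enough to re-attach names to sensitive rows:

```python
import pandas as pd

# Hypothetical "de-identified" table that still carries DOB and Zip.
medical = pd.DataFrame({
    "DOB": ["1/2/1980", "3/4/1975", "5/6/1990"],
    "Zip": [12345, 23456, 34567],
    "Diagnosis": ["flu", "asthma", "diabetes"],
})

# Hypothetical public table with names and the same quasi-identifiers.
voters = pd.DataFrame({
    "Name": ["Ann Lee", "Bob Ray", "Cy Dunn"],
    "DOB": ["1/2/1980", "3/4/1975", "7/8/1985"],
    "Zip": [12345, 23456, 34567],
})

# An inner join on the shared columns links names to sensitive attributes.
reidentified = medical.merge(voters, on=["DOB", "Zip"], how="inner")
print(reidentified[["Name", "Diagnosis"]])
```

Only rows whose `(DOB, Zip)` pair appears in both tables survive the join, so the third record in each table stays unlinked.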

(a) How many unique pairs of `(DOB, Zip)` are there? Your code should display the answer. (5 points)

In [5]:
# your code here
pairs = adult.loc[:, ['DOB','Zip']]
pairs.drop_duplicates().shape[0]

32560

(b) Write the function `link` which takes two dataframes and links them on `DOB` and `Zip`. The result should be a list where each element has the form `[DOB, Zip,[elements of df1],[elements of df2]]`. (25 points)

In [6]:
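# your code here

def link(df1, df2):
    # Take the (DOB, Zip) keys from df1 instead of the global `pairs`, so the
    # function depends only on its arguments. Keys are not deduplicated:
    # a pair shared by several rows yields one entry per row.
    keys = df1[['DOB', 'Zip']]
    linked = []
    for _, row in keys.iterrows():
        DOB = row['DOB']
        Zip = row['Zip']
        # All rows of each dataframe that share this (DOB, Zip) pair
        e_df1 = df1.loc[(df1['DOB'] == DOB) & (df1['Zip'] == Zip)]
        e_df2 = df2.loc[(df2['DOB'] == DOB) & (df2['Zip'] == Zip)]
        linked.append([DOB, Zip, e_df1, e_df2])
    return linked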

In [7]:
linked = link(anonymized_adult, personally_identifiable_info) # note: this may take some time!
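
The row-by-row scan above compares every key against every row, which is why it is slow. One possible alternative, sketched here under the assumption that one entry per distinct `(DOB, Zip)` pair is acceptable, groups each dataframe once. `link_grouped` is a hypothetical helper, not part of the lab. Note that pandas `groupby` skips rows whose key is missing by default.

```python
import pandas as pd

def link_grouped(df1, df2, keys=("DOB", "Zip")):
    """Hypothetical variant of `link`: one entry per distinct key pair."""
    keys = list(keys)
    # Index df2's rows by key so each lookup is a dictionary access.
    groups2 = {key: grp for key, grp in df2.groupby(keys)}
    empty2 = df2.iloc[0:0]  # empty frame with df2's columns, for unmatched keys
    linked = []
    for key, grp1 in df1.groupby(keys):
        linked.append([key[0], key[1], grp1, groups2.get(key, empty2)])
    return linked

# Toy data (made up) with one pair shared by two rows.
left = pd.DataFrame({
    "DOB": ["1/1/1990", "1/1/1990", "2/2/1985"],
    "Zip": [11111, 11111, 22222],
    "Salary": ["<=50K", ">50K", "<=50K"],
})
right = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "DOB": ["1/1/1990", "1/1/1990", "2/2/1985"],
    "Zip": [11111, 11111, 22222],
})
result = link_grouped(left, right)
print(len(result))
```

Unlike the loop-based version, this returns entries sorted by key and never repeats a pair, so its length equals the number of distinct pairs rather than the number of rows.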

(c) How many records were you able to uniquely identify? Your code should display the answer. (10 points)

In [8]:
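# your code here
# A record is uniquely identified when its (DOB, Zip) pair matches exactly
# one row in each dataframe; len(linked) alone would count every record.
sum(1 for entry in linked if len(entry[2]) == 1 and len(entry[3]) == 1)

32559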

(d) What was the `(DOB,Zip)` pair that had the most records associated with it? How many? Your code should display the answer. (10 points)

In [9]:
# your code here
most_rec = max(linked, key=lambda x: len(x[2]))
print((most_rec[0],most_rec[1]))
print(len(most_rec[2]))

('10/15/1974', 78808)
2
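
As a cross-check that does not need the linked list at all, group sizes over `(DOB, Zip)` give the number of distinct pairs, the number of unique pairs, and the most frequent pair in one pass. The sketch below uses a made-up frame `toy`; on the lab data the same calls would be applied to `adult`.

```python
import pandas as pd

# Made-up data: one (DOB, Zip) pair occurs twice, the others once.
toy = pd.DataFrame({
    "DOB": ["1/1/1990", "1/1/1990", "2/2/1985", "3/3/1970"],
    "Zip": [11111, 11111, 22222, 33333],
})

# Number of rows that share each (DOB, Zip) pair.
sizes = toy.groupby(["DOB", "Zip"]).size()

distinct_pairs = len(sizes)               # how many different pairs exist
unique_records = int((sizes == 1).sum())  # pairs that single out one record
most_common_pair = sizes.idxmax()         # pair with the most records
print(distinct_pairs, unique_records, most_common_pair, sizes.max())
```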


(e) Given this [graph](https://thedatamap.org/mobile2014/index.php#Demonstration), how likely do you think it is for linking attacks to be performed on your data? Are there any edges that you found surprising? (5 points)

Answer: Since it is so easy to access user data from all sorts of apps, it is definitely easy to perform a linking attack on our data, because records are easy to link once other information about a person is available. What surprised me most is how MapQuest and Map My Walk share immense amounts of data, literally tracking our movement. The major question is how that data is being used and to whom it is being sold.

### Problem 2: $k$-Anonymity (40 points)

A dataset is $k$-anonymous if, for each record in the dataset, there are at least $k$ records (including that record itself) that are indistinguishable from one another. A common way to achieve $k$-anonymity is by dropping columns.
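
To make the definition concrete, here is a small sketch on made-up data. `anonymity_level` is a hypothetical helper that reports the largest $k$ for which a frame is $k$-anonymous over the chosen columns, which is the size of the smallest group of identical rows.

```python
import pandas as pd

def anonymity_level(df, columns):
    """Largest k such that df is k-anonymous over `columns` (hypothetical helper)."""
    # Size of the smallest group of rows that agree on every listed column.
    return int(df.groupby(list(columns)).size().min())

toy = pd.DataFrame({
    "Age": [30, 30, 30, 41, 41],
    "Sex": ["F", "F", "F", "M", "M"],
    "Salary": ["<=50K", ">50K", "<=50K", ">50K", ">50K"],
})

print(anonymity_level(toy, ["Age", "Sex"]))            # groups of size 3 and 2
print(anonymity_level(toy, ["Age", "Sex", "Salary"]))  # one group has a single row
```

Over `Age` and `Sex` the smallest group has two rows, so the toy frame is 2-anonymous. Adding `Salary` isolates one row and the level falls to 1.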

(a) Implement a function (`is_k_anonymous`) that checks, for a given `k` and a set of columns to be dropped `cols`, whether a given dataframe `df` satisfies $k$-anonymity. A fast method for counting duplicate rows can be found [here](https://stackoverflow.com/questions/35584085/how-to-count-duplicate-rows-in-pandas-dataframe). (15 points)

In [4]:
# your code here

def is_k_anonymous(k, cols, df):
    # Drop the requested columns, then count how often each remaining row occurs
    dropped = df.drop(columns=cols)
    counts = dropped.groupby(dropped.columns.tolist()).size()
    # k-anonymous iff every distinct row occurs at least k times
    return bool((counts >= k).all())

In [5]:
display(is_k_anonymous(2, ['Name', 'SSN', 'DOB', 'Zip'], adult)) # this should return false if your function works
display(is_k_anonymous(1, ['Name', 'SSN', 'DOB', 'Zip'], adult)) # this should return true if your function works

False

True

There is often a tension between $k$-anonymity and data utility. For example, let $|D|$ be the size of dataset $D$. If we drop every column of a dataset, then this new dataset is automatically $|D|$-anonymous, but it also contains no information.
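
The trade-off can be seen on a tiny made-up frame. In this sketch, `k_after_dropping` is a hypothetical helper; it treats the all-columns-dropped case separately, since every row is then identical and the level equals $|D|$.

```python
import pandas as pd

toy = pd.DataFrame({
    "Age": [30, 30, 31, 31],
    "Sex": ["F", "F", "F", "F"],
    "Zip": [11111, 22222, 33333, 44444],
})

def k_after_dropping(df, cols):
    """Anonymity level left after dropping `cols` (hypothetical helper)."""
    kept = df.drop(columns=cols)
    if kept.shape[1] == 0:
        # No columns left: all rows are indistinguishable, so k = |D|.
        return len(df)
    return int(kept.groupby(kept.columns.tolist()).size().min())

levels = []
for cols in ([], ["Zip"], ["Zip", "Age"], ["Zip", "Age", "Sex"]):
    levels.append(k_after_dropping(toy, cols))
    print(cols, levels[-1])
```

Each extra dropped column can only keep the level the same or raise it, while the remaining table answers fewer questions.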

(b) Determine a set of columns of `adult` to drop that contains `['Name', 'SSN', 'DOB', 'Zip']` and leaves the resulting dataset $3$-anonymous. Drop as few as you can! (15 points)

(hint: work backwards from dropping all of the columns)
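
One way to read the hint, sketched here as an assumption rather than the intended solution, is a greedy pass: start with every column dropped and restore columns one at a time as long as the data stays $k$-anonymous. `min_group_size` and `greedy_keep` are hypothetical helpers. The result depends on column order and is not guaranteed to drop the fewest columns.

```python
import pandas as pd

def min_group_size(df):
    """Smallest number of identical rows; len(df) when no columns remain."""
    if df.shape[1] == 0:
        return len(df)
    return int(df.groupby(df.columns.tolist()).size().min())

def greedy_keep(df, k):
    """Restore columns greedily while the kept columns stay k-anonymous."""
    kept = []
    for col in df.columns:
        # Tentatively restore `col`; keep it only if k-anonymity still holds.
        if min_group_size(df[kept + [col]]) >= k:
            kept.append(col)
    dropped = [c for c in df.columns if c not in kept]
    return kept, dropped

# Made-up data: `Id` is unique per row, so it can never be restored for k = 2.
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Sex": ["F", "F", "M", "M"],
    "Salary": ["<=50K", "<=50K", ">50K", ">50K"],
})
kept, dropped = greedy_keep(toy, 2)
print(kept, dropped)
```

This needs only one anonymity check per column, in contrast to an exhaustive search over column subsets, at the cost of possibly dropping more than necessary.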

In [22]:
# your solution here
import itertools

def get_cols(data, k):
    # Try drop sets of increasing size, so the first hit drops the fewest columns
    columns = data.columns.tolist()
    for i in range(1, data.shape[1]):
        for comb in itertools.combinations(columns, i):
            comb = list(comb)
            if is_k_anonymous(k, comb, data):
                return comb
    # Fall back to dropping every column
    return columns

# The PII columns must always be dropped; search over the remaining ones
cols = ['Name', 'SSN', 'DOB', 'Zip']
dropped = adult.drop(columns=cols)

cols.extend(get_cols(dropped, 3))

In [24]:
# Sanity check: the chosen drop set satisfies even 6-anonymity, which implies 3-anonymity
if is_k_anonymous(6, cols, adult):
    display(cols)

['Name',
 'SSN',
 'DOB',
 'Zip',
 'Age',
 'Workclass',
 'fnlwgt',
 'Education',
 'Education-Num',
 'Martial Status',
 'Occupation',
 'Relationship',
 'Capital Gain',
 'Capital Loss',
 'Hours per week',
 'Country']

(c) Which columns could you leave in the datasets? In your experiments were there any surprises about what you had to drop? (10 points)

Solution: I had to drop 16 columns in total (including the personally identifiable ones) to achieve 3-anonymity, leaving only `Race`, `Sex`, and `Salary`. The surprising part is how many columns had to be dropped, which dramatically reduces our ability to analyze the data. Another interesting finding is that the same number of columns must be dropped for every level from 2-anonymity to 6-anonymity.