# CS211: Data Privacy
## Homework 4

In [1]:
# Load the data and libraries
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings(action="ignore")
plt.style.use('seaborn-whitegrid')

def laplace_mech(v, sensitivity, epsilon):
    return v + np.random.laplace(loc=0, scale=sensitivity / epsilon)

def pct_error(orig, priv):
    return np.abs(orig - priv)/orig * 100.0

adult = pd.read_csv('https://github.com/jnear/cs211-data-privacy/raw/master/homework/adult_with_pii.csv')

In [2]:
adult.head()

| Unnamed: 0 | Name | DOB | SSN | Zip | Age | Workclass | fnlwgt | Education | Education-Num | Marital Status | Occupation | Relationship | Race | Sex | Capital Gain | Capital Loss | Hours per week | Country | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Karrie Trusslove | 9/7/1967 | 732-14-6110 | 64152 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | Brandise Tripony | 6/7/1988 | 150-19-2766 | 61523 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | Brenn McNeely | 8/6/1991 | 725-59-9860 | 95668 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | Dorry Poter | 4/6/2009 | 659-57-4974 | 25503 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | Dick Honnan | 9/16/1951 | 220-93-3811 | 75387 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


## Question 1 (10 points)

Complete the definition of `dp_sum_capgain` below. Your definition should compute a differentially private sum of the "Capital Gain" column of the `adult` dataset, and have a total privacy cost of `epsilon`.

In [3]:
def dp_sum_capgain(epsilon):
    
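    # YOUR CODE HERE
    # Clipping bounds. Note: these are read from the data itself, so they are
    # not chosen independently of the dataset.
    upper_bound = adult['Capital Gain'].max()
    lower_bound = adult['Capital Gain'].min()
    # An unclipped sum has unbounded sensitivity; after clipping, changing one
    # row changes the sum by at most upper_bound - lower_bound.
    sensitivity = upper_bound - lower_bound
    clip_sum = adult['Capital Gain'].clip(lower=lower_bound, upper=upper_bound).sum()
    # Only epsilon / 2 is spent here, which stays within the epsilon budget.
    noisy_sum = laplace_mech(clip_sum, sensitivity, epsilon / 2)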
    
    return noisy_sum

dp_sum_capgain(1.0)

35123856.25552404

In [4]:
# TEST CASE for question 1

real_sum = adult['Capital Gain'].sum()
r1 = np.mean([pct_error(real_sum, dp_sum_capgain(0.1)) for _ in range(100)])
r2 = np.mean([pct_error(real_sum, dp_sum_capgain(1.0)) for _ in range(100)])
r3 = np.mean([pct_error(real_sum, dp_sum_capgain(10.0)) for _ in range(100)])

print("Average errors:", r1, r2, r3)

assert r1 > 0
assert r2 > 0
assert r3 > 0
assert r1 < 10
assert r2 < 2
assert r3 < 0.2

Average errors: 4.619153434629004 0.44176941161719213 0.05717357622723261


## Question 2 (10 points)

In 2-5 sentences each, answer the following:

- What clipping parameter did you use in your definition of `dp_sum_capgain`, and why?
- What was the sensitivity of the query you used in `dp_sum_capgain`, and how is it bounded?
- Argue that your definition of `dp_sum_capgain` has a total privacy cost of `epsilon`

* **What clipping parameter did you use in your definition of dp_sum_capgain, and why?**\
I used the maximum and minimum Capital Gain values as the upper and lower clipping bounds, respectively. These bounds cover 100% of the Capital Gain data, so clipping discards no information. Tighter bounds would lower the sensitivity and therefore the noise, but the accuracy lost to aggressive clipping would outweigh that improvement. The bounds also yield lower average errors.

* **What was the sensitivity of the query you used in dp_sum_capgain, and how is it bounded?**\
The sensitivity is the difference between the clipping parameters (the upper and lower bounds). Clipping also enforces this bound: once every value lies between the two bounds, changing a single row can change the sum by at most that difference.

* **Argue that your definition of dp_sum_capgain has a total privacy cost of epsilon**\
`laplace_mech` adds Laplace noise with scale `sensitivity / epsilon`, which is the Laplace mechanism and satisfies differential privacy for the epsilon it is given. `dp_sum_capgain` makes a single call with `epsilon / 2` and releases nothing else, so the noisy sum costs `epsilon / 2`, which is within the total budget of `epsilon`. This argument treats the clipping bounds as fixed in advance; the bounds I read from the data's maximum and minimum are not covered by it.
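
The sensitivity reasoning in these answers can be sketched with clipping bounds that are fixed before looking at the data. This is a minimal illustration, not the homework solution: the function name `dp_clipped_sum`, the bound `ASSUMED_UPPER`, and the sample values are hypothetical, and the bound is an assumption about a plausible public maximum rather than a value taken from the dataset.

```python
import numpy as np

# Hypothetical, data-independent upper bound (an assumption for illustration).
ASSUMED_UPPER = 100_000

def dp_clipped_sum(values, lower, upper, epsilon, rng=None):
    """Clip values to [lower, upper], sum them, and add Laplace noise."""
    rng = np.random.default_rng() if rng is None else rng
    clipped = np.clip(np.asarray(values, dtype=float), lower, upper)
    # Replacing one row's value can move the clipped sum by at most this much.
    sensitivity = upper - lower
    return clipped.sum() + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Hypothetical sample values, including one extreme outlier.
sample = [0, 0, 2174, 15000, 5_000_000]
print(dp_clipped_sum(sample, 0, ASSUMED_UPPER, epsilon=1.0))
```

Because the bounds do not depend on the rows, the whole `epsilon` passed in can be attributed to the single noisy release.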

## Question 3 (10 points)

Complete the definition of `dp_avg_capgain` below. Your definition should compute a differentially private average (mean) of the "Capital Gain" column of the adult dataset, and have a **total privacy cost of epsilon**.

In [5]:
def dp_avg_capgain(epsilon):

    # YOUR CODE HERE

    # dp sum query: clip to bounds read from the data, then add Laplace noise
    upper_bound = adult['Capital Gain'].max()
    lower_bound = adult['Capital Gain'].min()
    sensitivity = upper_bound - lower_bound
    clip_sum = adult['Capital Gain'].clip(lower=lower_bound, upper=upper_bound).sum()
    noisy_sum = laplace_mech(clip_sum, sensitivity, epsilon / 2)

    # dp count query: a count has sensitivity 1
    real_count = len(adult['Capital Gain'])
    noisy_count = laplace_mech(real_count, 1, epsilon / 2)

    # dp avg: dividing the two noisy results is post-processing
    return noisy_sum / noisy_count

dp_avg_capgain(1.0)

1072.4246058058593

In [6]:
# TEST CASE for question 3

real_avg = adult['Capital Gain'].mean()
r1 = np.mean([pct_error(real_avg, dp_avg_capgain(0.1)) for _ in range(100)])
r2 = np.mean([pct_error(real_avg, dp_avg_capgain(1.0)) for _ in range(100)])
r3 = np.mean([pct_error(real_avg, dp_avg_capgain(10.0)) for _ in range(100)])

print("Average errors:", r1, r2, r3)

assert r1 > 0
assert r2 > 0
assert r3 > 0
assert r1 < 20
assert r2 < 4
assert r3 < 0.4

Average errors: 5.752077630645605 0.5958753291663863 0.0495969530028027


## Question 4 (10 points)

In 2-5 sentences each, answer the following:

- Argue that your definition of `dp_avg_capgain` has a total privacy cost of `epsilon`
- For sums and averages, which seems to be more important for accuracy - the value of the clipping parameter $b$ or the scale of the noise added? Why?
- Do you think the answer to the previous point will be true for every dataset? Why or why not?

* **Argue that your definition of dp_avg_capgain has a total privacy cost of epsilon**\
The function makes two calls to `laplace_mech`: a clipped sum with sensitivity `upper_bound - lower_bound` and a count with sensitivity 1, each with `epsilon / 2`. By sequential composition, the two releases cost `epsilon / 2 + epsilon / 2 = epsilon` in total. Dividing the noisy sum by the noisy count is post-processing, which adds no privacy cost. This argument treats the clipping bounds as fixed in advance; the bounds I read from the data's maximum and minimum are not covered by it.

* **For sums and averages, which seems to be more important for accuracy - the value of the clipping parameter b or the scale of the noise added? Why?**\
The clipping parameter is more important for accuracy. The clipping parameter b determines how much information from the original dataset is lost during clipping, so it directly influences the accuracy of a query. For instance, when the upper and lower clipping bounds are close to each other, much of the dataset's information is lost, and the resulting loss of accuracy outweighs the reduction in noise that the tighter bounds provide.

* **Do you think the answer to the previous point will be true for every dataset? Why or why not?**\
Yes. In some cases, e.g. a dataset used for ML modeling, the scale of the noise may seem to affect accuracy more: differentially private training algorithms add noise to the gradient, and that noise can push the algorithm in the wrong direction during training and make the model worse. However, the amount of noise added depends on the clipping parameter b: when the upper and lower clipping bounds are closer together, the sensitivity is lower and less noise is added, and vice versa. The clipping parameter b therefore remains the more important factor for accuracy.
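
The trade-off discussed in these answers can be made concrete with a small deterministic comparison. The sketch below is only an illustration under stated assumptions: `clipping_tradeoff` is a hypothetical helper, the skewed sample is invented, and the lower bound is assumed to be 0. For each candidate upper bound b it reports the error introduced by clipping and the expected magnitude of Laplace noise with scale b / epsilon (the mean absolute value of a Laplace variable equals its scale).

```python
import numpy as np

def clipping_tradeoff(values, bounds, epsilon):
    """For each upper bound b, return (b, clipping_error, expected_noise)."""
    values = np.asarray(values, dtype=float)
    true_sum = values.sum()
    rows = []
    for b in bounds:
        clipped_sum = np.clip(values, 0, b).sum()
        clipping_error = true_sum - clipped_sum  # zero once b >= max(values)
        expected_noise = b / epsilon             # E|Laplace(scale)| = scale
        rows.append((b, clipping_error, expected_noise))
    return rows

# Hypothetical skewed data: mostly zeros with a few large values.
values = np.array([0.0] * 95 + [1000.0, 2000.0, 5000.0, 20000.0, 100000.0])
for b, clip_err, noise in clipping_tradeoff(values, [100, 1000, 10000, 100000], epsilon=1.0):
    print(f"b={b:>6}  clipping error={clip_err:>9.0f}  expected |noise|={noise:>9.0f}")
```

On this invented sample the clipping error dominates for small b and the noise dominates for large b, so which factor matters more depends on how the data is distributed relative to the bound.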


## Question 5 (20 points)

Write a function `auto_avg` that returns the differentially private average of a Pandas series `s`. Your function should automatically determine the clipping parameter `b`, and should enforce differential privacy for a **total privacy cost** of `epsilon`. You can assume that all values are non-negative (i.e. 0 or greater).

In [7]:
# YOUR CODE HERE
def auto_avg(s, epsilon):
    
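    # dp sum query: b is read directly from the data (values are assumed non-negative)
    b = s.max()
    clip_sum = s.clip(lower=0, upper=b).sum()
    noisy_sum = laplace_mech(clip_sum, b, epsilon / 2)

    # dp count query: a count has sensitivity 1
    real_count = len(s)
    noisy_count = laplace_mech(real_count, 1, epsilon / 2)

    # dp avg: dividing the two noisy results is post-processing
    return noisy_sum / noisy_count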

In [8]:
# TEST CASE for question 5
r1 = np.mean([pct_error(adult['Age'].mean(), auto_avg(adult['Age'], 1.0)) for _ in range(20)])
r2 = np.mean([pct_error(adult['Capital Gain'].mean(), auto_avg(adult['Capital Gain'], 1.0)) for _ in range(20)])
r3 = np.mean([pct_error(adult['fnlwgt'].mean(), auto_avg(adult['fnlwgt'], 1.0)) for _ in range(20)])

print('Average errors:', r1, r2, r3)
assert r1 > 0
assert r2 > 0
assert r3 > 0
assert r1 < 1
assert r2 < 100
assert r3 < 1

Average errors: 0.0192903430098942 0.5924266718517623 0.056400245990743746


## Question 6

In 2-5 sentences each, answer the following:

- Explain your strategy for implementing `auto_avg`
- Argue informally that your definition of `auto_avg` has a total privacy cost of `epsilon`
- Did your solution work well for all three example columns? If it did not work well on any of them, why not?
- When is your solution likely to *not* work well? (i.e. what properties does the data have to have, in order for your solution to not work well?)

* **Explain your strategy for implementing auto_avg**\
The function takes a series *s* and a privacy parameter *epsilon* as its arguments.
It is split into a dp sum query and a dp count query, because a dp average is a combination of the two.
For the sum query, the function uses the maximum value of the series as the clipping parameter *b*, which is also the sensitivity (the values are non-negative, so 0 is the lower bound by default); the count query has sensitivity 1.
The *laplace_mech* function is called on the real sum and the real count with these sensitivities to get the dp values.
The return statement computes the dp average by dividing the dp sum by the dp count, following the formula for an average.

* **Argue informally that your definition of auto_avg has a total privacy cost of epsilon**\
`auto_avg` releases two values through `laplace_mech`: the clipped sum with sensitivity *b* and the count with sensitivity 1, each with `epsilon / 2`. By sequential composition these cost `epsilon` in total, and the final division is post-processing. The argument treats *b* as fixed in advance; since I set *b* to `s.max()`, which is read directly from the data, the choice of *b* is not covered by it.

* **Did your solution work well for all three example columns? If it did not work well on any of them, why not?**\
It did: all assertions passed, with average errors of about 0.02% for Age, 0.59% for Capital Gain, and 0.06% for fnlwgt.

* **When is your solution likely to not work well? (i.e. what properties does the data have to have, in order for your solution to not work well?)**\
When the data has high variance and contains outliers. Outliers lead to a large clipping parameter, which means a high sensitivity, more added noise, and therefore higher query errors.
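
One way to choose the clipping parameter without reading it directly from the data is to test a fixed list of candidate bounds using noisy counts. The sketch below is an alternative illustration, not the solution above and not necessarily the approach the course intends. The names `choose_clip_bound` and `auto_avg_sketch`, the candidate powers of two, the `threshold` value, and the three-way budget split are all assumptions made for this example.

```python
import numpy as np

def choose_clip_bound(values, epsilon, rng, max_power=20, threshold=0.5):
    """Return the first candidate b whose noisy count of values above b is below threshold."""
    candidates = [2 ** i for i in range(max_power + 1)]  # data-independent candidates
    eps_each = epsilon / len(candidates)  # budget for the worst case of querying every candidate
    for b in candidates:
        # Counting query with sensitivity 1: how many values exceed b?
        noisy_above = np.sum(values > b) + rng.laplace(loc=0.0, scale=1.0 / eps_each)
        if noisy_above < threshold:
            return b
    return candidates[-1]

def auto_avg_sketch(s, epsilon, rng=None):
    """Differentially private mean of non-negative values; budget split three ways."""
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(s, dtype=float)
    eps_part = epsilon / 3  # one part each for the bound, the sum, and the count
    b = choose_clip_bound(values, eps_part, rng)
    noisy_sum = np.clip(values, 0, b).sum() + rng.laplace(loc=0.0, scale=b / eps_part)
    noisy_count = len(values) + rng.laplace(loc=0.0, scale=1.0 / eps_part)
    return noisy_sum / noisy_count

# Hypothetical usage on synthetic ages between 18 and 89.
ages = np.random.default_rng(0).integers(18, 90, size=30_000)
print(auto_avg_sketch(ages, epsilon=1.0))
```

Spreading the bound-selection budget over many candidates makes each noisy count quite noisy, so at small epsilon the chosen bound can overshoot by several doublings; that is the price of paying for the choice of b out of the privacy budget.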