# CS 3110/5990: Data Privacy
## Homework 4

In [1]:
# Load the data and libraries
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def laplace_mech(v, sensitivity, epsilon):
    return v + np.random.laplace(loc=0, scale=sensitivity / epsilon)

def pct_error(orig, priv):
    return np.abs(orig - priv)/orig * 100.0

adult = pd.read_csv('https://github.com/jnear/cs3110-data-privacy/raw/main/homework/adult_with_pii.csv')

## Question 1 (10 points)

Complete the definition of `dp_sum_capgain` below. Your definition should compute a differentially private sum of the "Capital Gain" column of the `adult` dataset, and have a total privacy cost of `epsilon`.

In [18]:
def dp_sum_capgain(epsilon):
    b = 99900
    capgain_sum = adult['Capital Gain'].clip(upper=b).sum()

    dp_sum = laplace_mech(capgain_sum, sensitivity=b, epsilon=epsilon)
    return dp_sum

dp_sum_capgain(1.0)

35146604.40296236

In [17]:
# TEST CASE for question 1

real_sum = adult['Capital Gain'].sum()
r1 = np.mean([pct_error(real_sum, dp_sum_capgain(0.1)) for _ in range(100)])
r2 = np.mean([pct_error(real_sum, dp_sum_capgain(1.0)) for _ in range(100)])
r3 = np.mean([pct_error(real_sum, dp_sum_capgain(10.0)) for _ in range(100)])

print("Average errors:", r1, r2, r3)

assert r1 > 0
assert r2 > 0
assert r3 > 0
assert r1 < 10
assert r2 < 2
assert r3 < 0.2

Average errors: 3.4709486841231296 0.3128700565395841 0.05395681963006834


## Question 2 (10 points)

In 2-5 sentences each, answer the following:

- What clipping parameter did you use in your definition of `dp_sum_capital`, and why?
- What was the sensitivity of the query you used in `dp_sum_capital`, and how is it bounded?
- Argue that your definition of `dp_sum_capital` has a total privacy cost of `epsilon`

1. Clipping Parameter:
- The clipping parameter I used is b = 99900. This was chosen because it represents a reasonable upper bound on the `Capital Gain` column, limiting the impact of outliers on the sum. Without clipping, a few extreme values could disproportionately affect the sum, leading to larger noise additions to maintain privacy. Clipping reduces the noise needed by bounding the possible contribution of any individual.
2. Sensitivity:
- The sensitivity of the query is b = 99900. Sensitivity is defined as the maximum change that a single individual’s data can have on the result of the query. Since we clipped the `Capital Gain` column at 99900, the maximum contribution of a single individual to the sum is bounded by this value, ensuring the sensitivity remains constant.
3. Privacy Cost:
- The total privacy cost of the `dp_sum_capital` query is `epsilon`, as specified by the Laplace mechanism. The Laplace mechanism ensures differential privacy by adding noise proportional to the sensitivity of the query divided by epsilon. Since the sensitivity is bounded by b and the noise added is scaled by epsilon, the privacy guarantee directly depends on the epsilon value passed into the function.

## Question 3 (10 points)

Complete the definition of `dp_avg_capgain` below. Your definition should compute a differentially private average (mean) of the "Capital Gain" column of the adult dataset, and have a **total privacy cost of epsilon**.

In [44]:
def dp_avg_capgain(epsilon):
    dp_sum = dp_sum_capgain(epsilon/2)
    dp_count = laplace_mech(len(adult['Capital Gain']), sensitivity=1, epsilon=epsilon/2)

    return dp_sum / dp_count

dp_avg_capgain(1.0)

1073.3723009169937

In [43]:
# TEST CASE for question 3

real_avg = adult['Capital Gain'].mean()
r1 = np.mean([pct_error(real_avg, dp_avg_capgain(0.1)) for _ in range(100)])
r2 = np.mean([pct_error(real_avg, dp_avg_capgain(1.0)) for _ in range(100)])
r3 = np.mean([pct_error(real_avg, dp_avg_capgain(10.0)) for _ in range(100)])

print("Average errors:", r1, r2, r3)

assert r1 > 0
assert r2 > 0
assert r3 > 0
assert r1 < 20
assert r2 < 4
assert r3 < 0.4

Average errors: 5.70165634865113 0.6479165804554817 0.07809996484925241


## Question 4 (10 points)

In 2-5 sentences each, answer the following:

- Argue that your definition of `dp_avg_capgain` has a total privacy cost of `epsilon`
- For sums and averages, which seems to be more important for accuracy - the value of the clipping parameter $b$ or the scale of the noise added? Why?
- Do you think the answer to the previous point will be true for every dataset? Why or why not?

1. Privacy Cost of dp_avg_capgain:
- The total privacy cost of dp_avg_capgain is epsilon. I split the privacy budget equally between two operations: computing the differentially private sum and the count of individuals in the dataset. Since each operation consumes a privacy budget of epsilon/2, the total privacy cost of the function remains bounded by epsilon, ensuring that the function adheres to differential privacy guarantees.
2. Importance of Clipping Parameter vs. Noise Scale:
- For both sums and averages, the value of the clipping parameter b is generally more important for accuracy. Clipping limits the influence of extreme values (outliers), which reduces the sensitivity of the query and, in turn, the amount of noise added. A well-chosen clipping parameter significantly improves accuracy because it directly controls the impact of noise by bounding the range of data. If the clipping parameter is too large, the noise added may become excessive, affecting the accuracy of the result.
3. Applicability to All Datasets:
- The importance of the clipping parameter may not hold for all datasets. For datasets with a fairly uniform or narrow distribution, the scale of the noise added might be more critical for accuracy than clipping, as extreme values are less of a concern. However, in datasets with outliers or highly skewed distributions, clipping becomes essential to prevent large deviations in the results due to a few data points. Therefore, the significance of clipping vs. noise scale is context-dependent, and both need to be considered based on the dataset’s characteristics.

## Question 5 (20 points)

Write a function `auto_avg` that returns the differentially private average of a Pandas series `s`. Your function should automatically determine the clipping parameter `b`, and should enforce differential privacy for a **total privacy cost** of `epsilon`. You can assume that all values are non-negative (i.e. 0 or greater).

In [58]:
def auto_avg(s, epsilon):
    b = np.percentile(s, 99)

    dp_sum = laplace_mech(s.clip(upper=b).sum(), sensitivity=b, epsilon=epsilon/2)
    dp_count = laplace_mech(len(s), sensitivity=1, epsilon=epsilon/2)

    return dp_sum / dp_count

dp_avg = auto_avg(adult['Capital Gain'], 1.0)

In [59]:
# TEST CASE for question 5
r1 = np.mean([pct_error(adult['Age'].mean(), auto_avg(adult['Age'], 1.0)) for _ in range(20)])
r2 = np.mean([pct_error(adult['Capital Gain'].mean(), auto_avg(adult['Capital Gain'], 1.0)) for _ in range(20)])
r3 = np.mean([pct_error(adult['fnlwgt'].mean(), auto_avg(adult['fnlwgt'], 1.0)) for _ in range(20)])

print('Average errors:', r1, r2, r3)
assert r1 > 0
assert r2 > 0
assert r3 > 0
assert r1 < 1
assert r2 < 100
assert r3 < 1

Average errors: 0.13318139201478063 41.208930211535936 0.6664883605196914


## Question 6

In 2-5 sentences each, answer the following:

- Explain your strategy for implementing `auto_avg`
- Argue informally that your definition of `auto_avg` has a total privacy cost of `epsilon`
- Did your solution work well for all three example columns? If it did not work well on any of them, why not?
- When is your solution likely to *not* work well? (i.e. what properties does the data have to have, in order for your solution to not work well?)

1. Strategy for Implementing auto_avg:
- The strategy for implementing auto_avg involved first automatically determining a clipping parameter, b, based on the 99th percentile of the data. Clipping limits the influence of extreme values. I then split the privacy budget epsilon equally between the sum and count queries, using the Laplace mechanism to add noise. Finally, I calculated the differentially private average by dividing the noisy sum by the noisy count.
2. Privacy Cost Argument:
- The total privacy cost of auto_avg is bounded by epsilon. I split the privacy budget evenly between the sum and the count operations, allocating epsilon/2 to each. Since both operations adhere to the Laplace mechanism’s guarantees, the combined privacy cost of the function is the sum of the two, which equals epsilon.
3. Solution Performance:
- The solution worked well for all three example columns after adjusting the clipping parameter to the 99th percentile. Initially, it didn’t work well for the fnlwgt column because the values were much larger and more skewed, requiring more aggressive clipping. After this adjustment, the solution handled all columns adequately.
4. When the Solution May Not Work Well:
- The solution might not work well with datasets that have extremely skewed or heavy-tailed distributions, where even the 99th percentile doesn’t effectively limit outliers. Additionally, if the dataset is very small or has a high proportion of outliers, the added noise could disproportionately affect the result, leading to less accurate differentially private averages.