# CS 3110/5990: Data Privacy
## Homework 4

In [2]:
# Load the data and libraries
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def laplace_mech(v, sensitivity, epsilon):
    return v + np.random.laplace(loc=0, scale=sensitivity / epsilon)

def pct_error(orig, priv):
    return np.abs(orig - priv)/orig * 100.0

adult = pd.read_csv('https://github.com/jnear/cs3110-data-privacy/raw/main/homework/adult_with_pii.csv')

## Question 1 (10 points)

Complete the definition of `dp_sum_capgain` below. Your definition should compute a differentially private sum of the "Capital Gain" column of the `adult` dataset, and have a total privacy cost of `epsilon`.

In [3]:
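def dp_sum_capgain(epsilon):
    capgain_sum = adult["Capital Gain"].sum()
    # Sensitivity is taken from the observed maximum. Note that this bound is
    # computed from the data itself rather than fixed in advance.
    sensitivity = adult["Capital Gain"].max()

    # One Laplace release with scale sensitivity / epsilon.
    dp_sum = laplace_mech(capgain_sum, sensitivity=sensitivity, epsilon=epsilon)
    return dp_sum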

dp_sum_capgain(1.0)

35048384.163850926

In [4]:
# TEST CASE for question 1

real_sum = adult['Capital Gain'].sum()
r1 = np.mean([pct_error(real_sum, dp_sum_capgain(0.1)) for _ in range(100)])
r2 = np.mean([pct_error(real_sum, dp_sum_capgain(1.0)) for _ in range(100)])
r3 = np.mean([pct_error(real_sum, dp_sum_capgain(10.0)) for _ in range(100)])

print("Average errors:", r1, r2, r3)

assert r1 > 0
assert r2 > 0
assert r3 > 0
assert r1 < 10
assert r2 < 2
assert r3 < 0.2

Average errors: 1.9499730772877641 0.29039905950370454 0.026207578820291125


## Question 2 (10 points)

In 2-5 sentences each, answer the following:

- What clipping parameter did you use in your definition of `dp_sum_capgain`, and why?
- What was the sensitivity of the query you used in `dp_sum_capgain`, and how is it bounded?
- Argue that your definition of `dp_sum_capgain` has a total privacy cost of `epsilon`

1. What clipping parameter did you use in your definition of `dp_sum_capgain`, and why?
- I did not use an explicit clipping parameter in `dp_sum_capgain`. Instead, I used the maximum value in the “Capital Gain” column as the sensitivity. Since no value exceeds that maximum, this is equivalent to clipping at $b$ = max, which leaves every value unchanged and introduces no clipping error. One caveat is that the maximum is read from the data, so the bound is data-dependent; a bound fixed in advance would avoid this.
2. What was the sensitivity of the query you used in `dp_sum_capgain`, and how is it bounded?
- The sensitivity of the sum query is the largest amount the sum can change when one individual is added or removed. For non-negative values, that change is at most the largest individual “Capital Gain,” so I used the column maximum as the bound. This bounds the sensitivity for the values present in the dataset, which are finite.
3. Argue that your definition of `dp_sum_capgain` has a total privacy cost of `epsilon`.
- `dp_sum_capgain` makes a single call to the Laplace mechanism with noise scale sensitivity / epsilon. With noise at that scale, adding or removing one individual changes the probability of any particular output by at most a factor of exp(epsilon), which is the definition of epsilon-differential privacy. Because there is only one noisy release, the total privacy cost is epsilon.
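
For comparison, a sum that clips to a bound chosen before looking at the data can be sketched as follows. The helper name `dp_sum_clipped`, the bound `b = 100000`, and the synthetic series are assumptions for illustration; they are not part of the assignment, and the snippet avoids downloading `adult` so that it runs on its own.

```python
import numpy as np
import pandas as pd

def laplace_mech(v, sensitivity, epsilon):
    # Same Laplace mechanism as defined at the top of the notebook.
    return v + np.random.laplace(loc=0, scale=sensitivity / epsilon)

def dp_sum_clipped(s, b, epsilon):
    # Hypothetical helper: clip every value into [0, b] so that adding or
    # removing one row changes the sum by at most b, whatever the data is.
    clipped = s.clip(lower=0, upper=b)
    return laplace_mech(clipped.sum(), sensitivity=b, epsilon=epsilon)

# Synthetic stand-in for a non-negative column (assumed values).
example = pd.Series([0, 0, 0, 2500, 0, 14000, 0, 99999, 0, 300])
noisy_total = dp_sum_clipped(example, b=100000, epsilon=1.0)
print(noisy_total)
```

With a fixed `b`, the sensitivity no longer depends on which rows happen to be present, at the cost of clipping error whenever real values exceed `b`.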

## Question 3 (10 points)

Complete the definition of `dp_avg_capgain` below. Your definition should compute a differentially private average (mean) of the "Capital Gain" column of the adult dataset, and have a **total privacy cost of epsilon**.

In [5]:
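def dp_avg_capgain(epsilon):
    capgain_sum = adult["Capital Gain"].sum()
    capgain_count = len(adult["Capital Gain"])

    # Sum sensitivity uses the observed maximum (a data-dependent bound);
    # adding or removing one row changes the count by exactly 1.
    sum_sensitivity = adult["Capital Gain"].max()
    count_sensitivity = 1

    # Split the budget evenly: epsilon/2 for the sum, epsilon/2 for the count.
    noisy_sum = laplace_mech(capgain_sum, sum_sensitivity, epsilon/2)
    noisy_count = laplace_mech(capgain_count, count_sensitivity, epsilon/2)

    # Dividing two noisy values is post-processing and costs no extra budget.
    dp_avg = noisy_sum / noisy_count

    return dp_avg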

dp_avg_capgain(1.0)

1076.964727297736

In [6]:
# TEST CASE for question 3

real_avg = adult['Capital Gain'].mean()
r1 = np.mean([pct_error(real_avg, dp_avg_capgain(0.1)) for _ in range(100)])
r2 = np.mean([pct_error(real_avg, dp_avg_capgain(1.0)) for _ in range(100)])
r3 = np.mean([pct_error(real_avg, dp_avg_capgain(10.0)) for _ in range(100)])

print("Average errors:", r1, r2, r3)

assert r1 > 0
assert r2 > 0
assert r3 > 0
assert r1 < 20
assert r2 < 4
assert r3 < 0.4

Average errors: 6.229336943600658 0.5495139105254281 0.05916032350313569


## Question 4 (10 points)

In 2-5 sentences each, answer the following:

- Argue that your definition of `dp_avg_capgain` has a total privacy cost of `epsilon`
- For sums and averages, which seems to be more important for accuracy - the value of the clipping parameter $b$ or the scale of the noise added? Why?
- Do you think the answer to the previous point will be true for every dataset? Why or why not?

1. Argue that your definition of `dp_avg_capgain` has a total privacy cost of `epsilon`.
- `dp_avg_capgain` splits the privacy budget between the sum and the count, assigning epsilon/2 to each. The Laplace mechanism makes each of those two releases (epsilon/2)-differentially private. By sequential composition, the costs add, giving epsilon/2 + epsilon/2 = epsilon. The final division only combines the two noisy values, so it is post-processing and adds no further cost.
2. For sums and averages, which seems to be more important for accuracy - the value of the clipping parameter $b$ or the scale of the noise added? Why?
- The scale of the noise added seems to be more important for accuracy, especially when the noise is large relative to the magnitude of the sum or average. Larger noise (resulting from smaller values of $\epsilon$) can greatly distort the result, whereas the clipping parameter $b$ mainly controls how much the largest values contribute. The two are linked, since the noise scale for a clipped sum is $b / \epsilon$, but in many cases the noise overwhelms the effect of clipping if the sensitivity is not extreme.
3. Do you think the answer to the previous point will be true for every dataset? Why or why not?
- This answer may not hold for every dataset. In datasets with extreme outliers or a wide range of values, the clipping parameter $b$ plays a more critical role. If $b$ is too large, the noise scale grows with it and can swamp the result; if $b$ is too small, large values are truncated and the result is biased low. In datasets with more uniform values, the scale of the noise is the primary factor. The relative importance of noise and clipping therefore depends on the distribution of the data.
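
The trade-off described in these answers can be made concrete with a small sketch. For each candidate bound it reports the clipping bias of a sum and the Laplace scale `b / epsilon`, which equals the expected absolute value of the added noise. The function name `clipping_tradeoff`, the candidate bounds, and the synthetic column are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def clipping_tradeoff(s, bounds, epsilon):
    # Hypothetical helper. For each candidate bound b, report the two error
    # sources of a clipped sum:
    #   clipping_bias = true sum minus clipped sum (lost to clipping)
    #   noise_scale   = b / epsilon, the Laplace scale, which equals the
    #                   expected absolute value of the added noise
    rows = []
    true_sum = s.sum()
    for b in bounds:
        bias = true_sum - s.clip(lower=0, upper=b).sum()
        rows.append({"b": b, "clipping_bias": bias, "noise_scale": b / epsilon})
    return pd.DataFrame(rows)

# Synthetic skewed column (assumed values): mostly zeros, a few large entries.
skewed = pd.Series([0] * 95 + [1000, 2000, 5000, 20000, 100000])
table = clipping_tradeoff(skewed, bounds=[100, 1000, 10000, 100000], epsilon=1.0)
print(table)
```

On a skewed column like this one, small bounds give little noise but a large bias, while the largest bound removes the bias and maximizes the noise scale. Note that this table is computed without noise, so it is a diagnostic for understanding the trade-off rather than something to release.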

## Question 5 (20 points)

Write a function `auto_avg` that returns the differentially private average of a Pandas series `s`. Your function should automatically determine the clipping parameter `b`, and should enforce differential privacy for a **total privacy cost** of `epsilon`. You can assume that all values are non-negative (i.e. 0 or greater).

In [16]:
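def auto_avg(s, epsilon):
    # Pick the clipping bound from a percentile of the data: the 95th for
    # small-range columns (max below 100), the 99th otherwise.
    # Note: the percentile is read directly from the data, without noise.
    if s.max() < 100:
        b = np.percentile(s, 95)
    else:
        b = np.percentile(s, 99)

    # Clip into [0, b] so that one row changes the sum by at most b.
    s_clipped = np.clip(s, 0, b)
    sum_clipped = s_clipped.sum()
    count_clipped = len(s_clipped)

    sum_sensitivity = b
    count_sensitivity = 1

    # Split the budget evenly between the sum and the count.
    noisy_sum = laplace_mech(sum_clipped, sum_sensitivity, epsilon/2)
    noisy_count = laplace_mech(count_clipped, count_sensitivity, epsilon/2)

    # Guard against a zero or negative noisy count before dividing.
    noisy_count = max(noisy_count, 1e-3)
    dp_avg = noisy_sum / noisy_count

    return dp_avg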

auto_avg(adult['Capital Gain'], 1.0)

635.2078500554668

In [19]:
# TEST CASE for question 5
r1 = np.mean([pct_error(adult['Age'].mean(), auto_avg(adult['Age'], 1.0)) for _ in range(20)])
r2 = np.mean([pct_error(adult['Capital Gain'].mean(), auto_avg(adult['Capital Gain'], 1.0)) for _ in range(20)])
r3 = np.mean([pct_error(adult['fnlwgt'].mean(), auto_avg(adult['fnlwgt'], 1.0)) for _ in range(20)])

print('Average errors:', r1, r2, r3)
assert r1 > 0
assert r2 > 0
assert r3 > 0
assert r1 < 1
assert r2 < 100
assert r3 < 1

Average errors: 0.8541702889354614 41.18474165789031 0.6573477104725857


## Question 6

In 2-5 sentences each, answer the following:

- Explain your strategy for implementing `auto_avg`
- Argue informally that your definition of `auto_avg` has a total privacy cost of `epsilon`
- Did your solution work well for all three example columns? If it did not work well on any of them, why not?
- When is your solution likely to *not* work well? (i.e. what properties does the data have to have, in order for your solution to not work well?)

1. Strategy for implementing `auto_avg`:
- The strategy is to clip the values in the Pandas series `s` at a percentile of its own distribution to limit the influence of extreme outliers: the 95th percentile when the maximum is below 100, and the 99th percentile otherwise. After clipping, I applied the Laplace mechanism to both the sum and the count of the clipped values, and enforced a small positive lower bound on the noisy count so the division cannot blow up or change sign.
2. Total privacy cost of `epsilon`:
- The function splits `epsilon` into two equal parts: epsilon/2 for the noisy sum and epsilon/2 for the noisy count. The Laplace mechanism makes each release (epsilon/2)-differentially private, so by sequential composition the two noisy releases together cost epsilon, and the division is post-processing. This argument covers the two noisy releases only; the percentile used for `b` is computed directly from the data and is not itself noised.
3. Solution performance for all columns:
- The solution worked reasonably well for the ‘Age’ and ‘fnlwgt’ columns, with average errors below 1% in the test above. For ‘Capital Gain,’ which is heavily skewed with extreme outliers, the average error was about 41%. The mean of that column is driven by a small number of very large values, so clipping at the 99th percentile removes much of their contribution and biases the average low.
4. When the solution might not work well:
- The solution is likely to fail on datasets with heavily skewed distributions, particularly when there are large outliers or long tails. In such cases, the chosen clipping threshold might be either too aggressive (cutting off values that matter for the mean) or too loose (adding more noise than necessary). Additionally, datasets with very few rows are a problem, because the noise added to the count is then large relative to the count itself.
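
One possible alternative, sketched here under stated assumptions, is to choose the clipping bound from noisy clipped sums instead of a raw percentile. Candidate bounds are powers of two fixed in advance, each noisy sum gets an equal share of the search budget, and the search stops at the first bound where doubling raises the noisy sum by less than a threshold. The names `choose_bound_dp` and `auto_avg_private_bound`, the three-way budget split, `max_exponent=20`, and `min_gain=0.01` are all hypothetical choices, not the method used above.

```python
import numpy as np
import pandas as pd

def laplace_mech(v, sensitivity, epsilon):
    # Same Laplace mechanism as defined at the top of the notebook.
    return v + np.random.laplace(loc=0, scale=sensitivity / epsilon)

def choose_bound_dp(s, epsilon, max_exponent=20, min_gain=0.01):
    # Candidate bounds 2^0 ... 2^max_exponent do not depend on the data.
    bounds = [2 ** i for i in range(max_exponent + 1)]
    # Sequential composition: each noisy clipped sum gets an equal share.
    eps_each = epsilon / len(bounds)
    noisy_sums = [
        laplace_mech(s.clip(lower=0, upper=b).sum(), sensitivity=b, epsilon=eps_each)
        for b in bounds
    ]
    # Post-processing on the noisy sums only: stop at the first bound where
    # doubling it raises the noisy sum by less than min_gain (relative).
    for i in range(len(bounds) - 1):
        gain = noisy_sums[i + 1] - noisy_sums[i]
        if gain < min_gain * abs(noisy_sums[i]):
            return bounds[i]
    return bounds[-1]

def auto_avg_private_bound(s, epsilon):
    # Assumed split: one third each for the bound search, the sum, the count.
    b = choose_bound_dp(s, epsilon / 3)
    noisy_sum = laplace_mech(s.clip(lower=0, upper=b).sum(), sensitivity=b, epsilon=epsilon / 3)
    noisy_count = laplace_mech(len(s), sensitivity=1, epsilon=epsilon / 3)
    # Guard against a tiny or negative noisy count before dividing.
    return noisy_sum / max(noisy_count, 1.0)

# Synthetic age-like column (assumed values) to exercise the sketch.
ages = pd.Series(np.random.randint(18, 90, size=5000))
print(auto_avg_private_bound(ages, 1.0))
```

Spending a third of the budget on the search leaves less for the sum and the count, so this sketch trades some accuracy for a bound that is derived only from noisy releases. The stopping rule is itself noisy and can stop too early or too late, especially at small `epsilon`.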