# CS 3110/5110: Data Privacy
## Homework 6

In [3]:
# Load the data and libraries
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def laplace_mech(v, sensitivity, epsilon):
    return v + np.random.laplace(loc=0, scale=sensitivity / epsilon)

def gaussian_mech(v, sensitivity, epsilon, delta):
    return v + np.random.normal(loc=0, scale=sensitivity * np.sqrt(2*np.log(1.25/delta)) / epsilon)

def pct_error(orig, priv):
    return np.abs(orig - priv)/orig * 100.0

adult = pd.read_csv('https://github.com/jnear/cs3110-data-privacy/raw/main/homework/adult_with_pii.csv')

## Question 1 (10 points)

(Reference [Chapter 7](https://uvm-plaid.github.io/programming-dp/notebooks/ch7.html) of the textbook)

Consider the following minimum query:

In [2]:
## Cache the sorted ages, because we will use them a lot.
age_lower = 0
age_upper = 100
sorted_ages = adult['Age'].clip(lower=age_lower, upper=age_upper).sort_values()

def min_age():
    clipped_ages = adult['Age'].clip(lower=age_lower, upper=age_upper)
    return clipped_ages.min()

def ls_min():
    return max(sorted_ages.iloc[0] - age_lower, sorted_ages.iloc[1] - sorted_ages.iloc[0])

print('Actual minimum age:', min_age())
print('Local sensitivity of the minimum:', ls_min())

Actual minimum age: 17
Local sensitivity of the minimum: 17


Implement `ls_min_at_distance`, an upper bound on the local sensitivity of the `min_age` query at distance $k$, and `dist_to_high_ls_min`, an upper bound on the distance from the true dataset to one with local sensitivity greater than or equal to $s_p$.

In [4]:
def ls_min_at_distance(k):
    return max(sorted_ages.iloc[k] - age_lower, sorted_ages.iloc[1] - sorted_ages.iloc[0])

def dist_to_high_ls_min(s_p):
    k = 0
    while ls_min_at_distance(k) < s_p:
        k += 1
    return k

In [5]:
# TEST CASE
assert dist_to_high_ls_min(18) == 395
assert dist_to_high_ls_min(20) == 1657
assert dist_to_high_ls_min(25) == 5570
assert dist_to_high_ls_min(30) == 9711
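
To see how the distance search behaves, here is a minimal sketch on a toy sorted list rather than the adult dataset. The names `toy_ages`, `toy_ls_min_at_distance`, and `toy_dist_to_high_ls_min` are hypothetical. The bound mirrors the one used in `ls_min_at_distance` above, and the extra length check is an assumption added so the loop cannot run past the end of the list.

```python
# Toy data (hypothetical), already sorted; not taken from the adult dataset.
toy_ages = [17, 17, 17, 18, 18, 19, 20, 25]
toy_lower = 0

def toy_ls_min_at_distance(k):
    # Same bound as the notebook: after changing k rows, the minimum can be
    # as large as the value at sorted position k, and one more change can
    # pull it down to the lower clipping bound.
    gap = toy_ages[1] - toy_ages[0]
    return max(toy_ages[k] - toy_lower, gap)

def toy_dist_to_high_ls_min(s_p):
    # Smallest k whose bound reaches s_p; len(toy_ages) if it never does.
    k = 0
    while k < len(toy_ages) and toy_ls_min_at_distance(k) < s_p:
        k += 1
    return k

for s_p in (17, 18, 20, 30):
    print(s_p, toy_dist_to_high_ls_min(s_p))
```

The distance grows with the proposed sensitivity: a larger $s_p$ is farther from being violated, which is what lets the propose-test-release check pass.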

## Question 2 (10 points)

Implement `ptr_min`, which should use the propose-test-release framework to calculate a differentially private estimate of the minimum age. If the test fails, return `None`.

In [21]:
def ptr_min(s_p, epsilon, delta):
    # Distance from the true dataset to one whose local sensitivity reaches s_p
    dist = dist_to_high_ls_min(s_p)
    noisy_dist = laplace_mech(dist, sensitivity=1, epsilon=epsilon)
    threshold = np.log(2/delta)/(2*epsilon)

    # Test failed: the noisy distance is too small to trust s_p
    if noisy_dist < threshold:
        return None
    # Release the scalar minimum (not the whole sorted series), with noise scaled to s_p
    return laplace_mech(min_age(), sensitivity=s_p, epsilon=epsilon)

# proposed sensitivity: 20
# epsilon, delta = (1.0, 10^-5)
ptr_min(20, 1.0, 1e-5)

In [22]:
# TEST CASE
true_min = min_age()
trials = [ptr_min(20, 0.1, 1e-5) for _ in range(20)]
errors = [pct_error(true_min, t) for t in trials]
print(np.mean(errors))
assert np.mean(errors) < 2000
assert np.mean(errors) > 500

assert ptr_min(0.0001, 0.1, 1e-5) is None

1535.345940050963


## Question 3 (5 points)

In 2-5 sentences, answer the following:

- Can `ptr_min` give a useful answer for the minimum age?
- If so, what is a good proposed sensitivity $s_p$ for the analyst to use? If not, why not?

- **Yes, `ptr_min` can release a differentially private estimate of the minimum age, but its usefulness is limited.** The test passes only when the proposed sensitivity $s_p$ exceeds the local sensitivity of the dataset, which is 17 here: adding one person at the lower clipping bound of 0 would move the minimum from 17 to 0. The Laplace noise therefore has scale at least $17/\epsilon$, which is as large as the answer itself; the test case above reports an average error of about 1535% at $\epsilon = 0.1$.
- **A good proposed sensitivity $s_p$ is equal to or slightly greater than the true local sensitivity of the `min_age` query.** With a local sensitivity of 17, $s_p = 18$ or $s_p = 20$ works: the distance to a dataset with that local sensitivity is 395 or 1657 respectively, far above the threshold $\ln(2/\delta)/(2\epsilon)$, so the test passes with high probability. Any larger $s_p$ only adds noise.
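
The arithmetic behind this answer can be checked with a short sketch. It uses only the formulas already present in `ptr_min` and `laplace_mech`; the helper names `ptr_threshold` and `laplace_scale` are hypothetical, and the expected-error line relies on the fact that the mean absolute value of Laplace noise equals its scale.

```python
import math

def ptr_threshold(epsilon, delta):
    # The noisy distance must reach this value for the PTR test to pass.
    return math.log(2 / delta) / (2 * epsilon)

def laplace_scale(s_p, epsilon):
    # Scale of the Laplace noise added to the released minimum.
    return s_p / epsilon

true_min = 17  # minimum age printed earlier in the notebook
for epsilon in (0.1, 1.0):
    threshold = ptr_threshold(epsilon, 1e-5)
    scale = laplace_scale(20, epsilon)
    # E|Laplace(0, b)| = b, so the expected absolute error equals the scale.
    expected_pct_error = scale / true_min * 100
    print(f"epsilon={epsilon}: threshold={threshold:.1f}, "
          f"noise scale={scale:.0f}, expected error ~{expected_pct_error:.0f}%")
```

Both distances from Question 1 (395 and 1657) sit well above the threshold at either $\epsilon$, so the test itself is not the obstacle; the size of the noise relative to a true minimum of 17 is.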

## Question 4 (20 points)

Use the sample-and-aggregate framework to release the average capital gain in the adult dataset. Reference [Chapter 7](https://uvm-plaid.github.io/programming-dp/notebooks/ch7.html).

In [18]:
def f(chunk):
    return chunk.mean()

def saa_avg_capgain(k, epsilon):
    data = adult['Capital Gain']
    chunks = np.array_split(data, k)
    answers = [f(chunk) for chunk in chunks]
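    # Clip on both sides so the sensitivity bound below actually holds
    answers_clipped = pd.Series(answers).clip(lower=0, upper=10000)
    answers_clipped_avg = answers_clipped.mean()
    # Each person lands in one chunk, so the clipped mean has sensitivity (10000 - 0) / k
    noisy_answers_clipped_avg = laplace_mech(answers_clipped_avg, sensitivity=10000/k, epsilon=epsilon)

    return noisy_answers_clipped_avg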

saa_avg_capgain(500, 1.0)

1052.407142183598

In [14]:
# TEST CASE
true_avg = adult['Capital Gain'].mean()
trials = [saa_avg_capgain(500, 1.0) for _ in range(20)]
errors = [pct_error(true_avg, t) for t in trials]
print('Average error:', np.mean(errors))
assert np.mean(errors) > 0
assert np.mean(errors) < 5

Average error: 1.6978131448810516


## Question 5 (20 points)

Use the sample-and-aggregate framework to release the minimum age in the adult dataset. Reference [Chapter 7](https://uvm-plaid.github.io/programming-dp/notebooks/ch7.html).

In [15]:
def f(chunk):
    return chunk.min()

def saa_min_age(k, epsilon):
    data = adult['Age']
    chunks = np.array_split(data, k)
    answers = [f(chunk) for chunk in chunks]
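    # Clip on both sides so the sensitivity bound below actually holds
    answers_clipped = pd.Series(answers).clip(lower=0, upper=50)
    answers_clipped_avg = answers_clipped.mean()
    # Each person lands in one chunk, so the clipped mean has sensitivity (50 - 0) / k
    noisy_answers_clipped_avg = laplace_mech(answers_clipped_avg, sensitivity=50/k, epsilon=epsilon)

    return noisy_answers_clipped_avg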

saa_min_age(500, 1.0)

17.79190828695387

In [17]:
# TEST CASE
true_min = adult['Age'].min()
trials = [saa_min_age(500, 1.0) for _ in range(20)]
errors = [pct_error(true_min, t) for t in trials]
print('Average error:', np.mean(errors))
assert np.mean(errors) > 0
assert np.mean(errors) < 10

Average error: 3.4537333500633176


## Question 6 (10 points)

In 5-6 sentences, answer the following:

- What clipping values did you choose for clipping the query outputs on each chunk? How did you pick them? Does the best choice differ between questions 4 and 5?
- Is 500 a good value for the number of chunks $k$? How does making $k$ larger or smaller change the results? Does the best choice differ between questions 4 and 5?
- How does the sample-and-aggregate approach compare to propose-test-release or global sensitivity for the minimum?

- **Clipping Values:**
    - **For Question 4 (Average Capital Gain):** I clipped each chunk's mean capital gain to the range 0 to 10,000. The upper bound is well above the overall average of roughly 1,050 released above, so it leaves room for chunks with unusually large gains while capping the sensitivity at $10000/k$.
    - **For Question 5 (Minimum Age):** I clipped each chunk's minimum age to the range 0 to 50. Chunk minimums stay close to the true minimum of 17, so this tighter bound loses nothing and gives a sensitivity of $50/k$.
    - **Difference:** The best bounds differ because the two queries produce outputs on different scales. The bounds apply to the per-chunk answers, not to the raw values, so they should be just wide enough to contain the chunk means in Q4 and the chunk minimums in Q5.
- **Number of Chunks $k=500$:**
    - **Rationale:** $k = 500$ gives about 65 rows per chunk, which balances the accuracy of each chunk's answer against the noise added to their average.
    - **Effect of Larger $k$:** The sensitivity $(u - l)/k$ shrinks, so less Laplace noise is added for the same $\epsilon$. However, each chunk is smaller, so each chunk's answer is a worse estimate of the true answer.
    - **Effect of Smaller $k$:** Each chunk's answer is more accurate, but the sensitivity, and therefore the noise, grows. The privacy guarantee is set by $\epsilon$ and does not change with $k$.
    - **Difference Between Q4 and Q5:** $k = 500$ works for both questions, but the best choice differs. For the mean, the average of the chunk means stays close to the overall mean even with small chunks, so a large $k$ costs little. For the minimum, a small chunk may contain no 17-year-old, so the average of the chunk minimums drifts above the true minimum as $k$ grows (about 17.8 released versus 17 true at $k = 500$), which favors a smaller $k$ than in Q4.
- **Comparison of SAA to PTR and Global Sensitivity:**
    - **Robustness:** Sample-and-aggregate (SAA) bounds the sensitivity by construction. Each person falls in exactly one chunk, and clipping limits how far that chunk's answer can move, so the clipped mean of the chunk answers has sensitivity $(u - l)/k$ regardless of the data.
    - **Flexibility:** Propose-test-release (PTR) needs a good proposed sensitivity and returns nothing if the test fails. SAA always releases an answer and works for any per-chunk query `f`, including the minimum.
    - **Global Sensitivity:** With ages clipped to between 0 and 100, the global sensitivity of the minimum is 100, so the Laplace noise has scale $100/\epsilon$. PTR with $s_p = 20$ lowers this to $20/\epsilon$, which is still larger than the true minimum of 17. SAA with $k = 500$ and an upper bound of 50 has noise scale $0.1/\epsilon$, which is why its error is about 3.5% at $\epsilon = 1.0$. The price is the upward bias of the chunk minimums; the privacy guarantee is the same $\epsilon$ in each case.
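
The trade-off in $k$ for the minimum can be illustrated with a small sketch on synthetic data. Everything here is an assumption for illustration: the uniformly distributed `ages` are a hypothetical stand-in for `adult['Age']`, `chunk_min_average` is a hypothetical helper, and the chunks are formed by striding rather than with `np.array_split` so the block needs only the standard library. No noise is sampled; the sketch only reports the Laplace scale $(u - l)/(k\epsilon)$ next to the pre-noise answer.

```python
import random

random.seed(0)
# Synthetic ages (hypothetical stand-in for adult['Age']).
ages = [random.randint(17, 90) for _ in range(32000)]
lower, upper, epsilon = 0, 50, 1.0

def chunk_min_average(data, k):
    # Split into exactly k disjoint chunks by striding, take each chunk's
    # minimum, clip it to [lower, upper], and average the clipped answers.
    chunks = [data[i::k] for i in range(k)]
    clipped = [min(max(min(c), lower), upper) for c in chunks]
    return sum(clipped) / len(clipped)

for k in (50, 500, 5000):
    avg = chunk_min_average(ages, k)
    noise_scale = (upper - lower) / (k * epsilon)
    print(f"k={k}: average chunk minimum={avg:.2f}, Laplace scale={noise_scale:.3f}")
```

The average of the chunk minimums can never fall below the true minimum, so any error it introduces is an upward bias, while the Laplace scale falls in proportion to $1/k$. Choosing $k$ means trading one against the other.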
