# CS 3110/5110: Data Privacy
## Homework 8

In [1]:
# Load the data and libraries
import pandas as pd
import numpy as np
import random
from scipy import stats
import matplotlib.pyplot as plt

def laplace_mech(v, sensitivity, epsilon):
    return v + np.random.laplace(loc=0, scale=sensitivity / epsilon)

def laplace_mech_vec(vec, sensitivity, epsilon):
    return [v + np.random.laplace(loc=0, scale=sensitivity / epsilon) for v in vec]

def gaussian_mech(v, sensitivity, epsilon, delta):
    return v + np.random.normal(loc=0, scale=sensitivity * np.sqrt(2*np.log(1.25/delta)) / epsilon)

def gaussian_mech_vec(vec, sensitivity, epsilon, delta):
    return [v + np.random.normal(loc=0, scale=sensitivity * np.sqrt(2*np.log(1.25/delta)) / epsilon)
            for v in vec]

def pct_error(orig, priv):
    return np.abs(orig - priv)/orig * 100.0

adult = pd.read_csv('https://github.com/jnear/cs3110-data-privacy/raw/main/homework/adult_with_pii.csv')

In [2]:
# preserves epsilon-differential privacy
def above_threshold(queries, df, T, epsilon):
    T_hat = T + np.random.laplace(loc=0, scale = 2/epsilon)
    
    for idx, q in enumerate(queries):
        nu_i = np.random.laplace(loc=0, scale = 4/epsilon)
        if q(df) + nu_i >= T_hat:
            return idx
    return -1 # the index of the last element

## Question 1 (20 points)

Implement a function `above_10000` that releases the **value** of the first query in a sequence of queries whose value is above 10000. Your function should have a **total** privacy cost equal to the privacy parameter $\epsilon$ passed in when it is called.

In [3]:
def make_query(status):
    def q(df):
        return len(df[df['Marital Status'] == status])
    return q

def above_10000(queries, epsilon):
    T_hat = 10000 + np.random.laplace(loc=0, scale=2/epsilon)
    for idx, q in enumerate(queries):
        nu_i = np.random.laplace(loc=0, scale=4/epsilon)
        if q(adult) + nu_i >= T_hat:
            noisy_result = q(adult) + np.random.laplace(loc=0, scale=2/epsilon)
            return noisy_result
    return -1

queries = [make_query(status) for status in adult['Marital Status'].unique()]
print(f"above_10000 #1: {above_10000(queries, 100)}")
print(f"above_10000 #2: {above_10000(queries, 1)}")
print(f"above_10000 #3: {above_10000(queries, .01)}")

above_10000 #1: 10683.01327976631
above_10000 #2: 10682.786662854662
above_10000 #3: 10872.042011882879


In [4]:
# TEST CASE

results = [above_10000(queries, 1.0) for _ in range(100)]
print(np.mean(results))
assert np.mean(results) > 9900
assert np.mean(results) < 11000

10682.92209797347


## Question 2 (10 points)
In 2-3 sentences, argue informally (via the definition of the sparse vector technique, post-processing, and sequential composition), that your implementation of `above_10000` has a total privacy cost of $\epsilon$.

The `above_10000` function uses the sparse vector technique by adding Laplace noise to both the threshold and the query results to decide whether to release an answer. When a query exceeds the noisy threshold, we release a noisy version of the query result. This ensures that the total privacy cost remains bounded by ε due to the composition of differentially private mechanisms and the post-processing immunity property.

## Question 3 (20 points)

Implement a function `bounded_all_above_10000` that releases the **value** of **$c$ queries** in a sequence of queries whose value is above 10000 (where $c$ is an analyst-provided parameter limiting the number of returned results). Your function should have a **total privacy cost** bounded by its parameter $\epsilon$.

In [5]:
def bounded_all_above_10000(queries, c, epsilon):
    epsilon_per_query = epsilon / (2 * c + 2)
    T_hat = 10000 + np.random.laplace(loc=0, scale=2/epsilon_per_query)
    results = []
    for idx, q in enumerate(queries):
        if len(results) >= c:
            break
        nu_i = np.random.laplace(loc=0, scale=4/epsilon_per_query)
        if q(adult) + nu_i >= T_hat:
            noisy_result = q(adult) + np.random.laplace(loc=0, scale=2/epsilon_per_query)
            results.append(noisy_result)
    return results


queries = [make_query(status) for status in adult['Marital Status'].unique()]
print(f"bounded_all_above_10000 #1: {bounded_all_above_10000(queries, 3, 100)}")
print(f"bounded_all_above_10000 #2: {bounded_all_above_10000(queries, 3, 1)}")
print(f"bounded_all_above_10000 #3: {bounded_all_above_10000(queries, 3, .01)}")

bounded_all_above_10000 #1: [10683.309635382524, 14976.12727181401]
bounded_all_above_10000 #2: [10685.164520897373, 14981.610812003257]
bounded_all_above_10000 #3: [14599.335730702067]


In [6]:
# TEST CASE

results = [bounded_all_above_10000(queries, 2, 1.0) for _ in range(100)]
results_1 = [r[0] for r in results]
results_2 = [r[1] for r in results]

assert np.mean(results_1) > 9900
assert np.mean(results_1) < 11000
assert np.mean(results_2) > 14000
assert np.mean(results_2) < 15500

## Question 4 (10 points)

In 2-3 sentences, argue informally that your implementation of `bounded_all_above_10000` has privacy cost bounded by $\epsilon$.

In `bounded_all_above_10000`, we divide the total privacy budget ε among the threshold, the queries used to check against the threshold, and the noisy outputs. By adding noise to the threshold and each query result and carefully allocating the privacy budget, we ensure that the total privacy cost does not exceed $\epsilon$. The composition theorem guarantees that the sum of the privacy costs of each mechanism (thresholding and releasing up to $c$ noisy query results) remains within $\epsilon$.

## Question 5 (30 points)

Implement a function `mean_age` that computes the mean age of participants in the `adult_data` dataset. Your function should have a **total** privacy cost of $\epsilon$. It should work as follows:

1. Compute an *upper* clipping parameter based on the data
2. Clip the data using the clipping parameter
3. Use `laplace_mech` to release a differentially private mean of the clipped data

*Hint*: Use the sparse vector technique (`above_threshold`) to compute the clipping parameter. Consider using a sequence of queries that looks like `df.clip(lower=0, upper=b).sum() - df.clip(lower=0, upper=b+1).sum()`.

*Hint*: Be careful of sensitivities and set the scale of the noise accordingly!

In [7]:
bs = list(range(0, 200, 10))
df = adult['Age']
b_lower = 0

def make_query(b):
    def q(df):
        return df.clip(lower=0, upper=b).sum() - df.clip(lower=0, upper=b + 1).sum()
    return q

def mean_age(epsilon):
    bs = list(range(0, 200, 10))
    queries = [make_query(b) for b in bs]
    idx = above_threshold(queries, adult['Age'], 0, epsilon / 2)
    b_upper = bs[idx] if idx != -1 else bs[-1]
    clipped_ages = adult['Age'].clip(upper=b_upper)
    mean_clipped = clipped_ages.mean()
    return laplace_mech(mean_clipped, sensitivity=1 / b_upper, epsilon=epsilon / 2)
    
for epsilon in [0.001, 0.01, 0.1, 0.5, 1, 10]:
    print(f"epsilon: {epsilon}, mean age: {mean_age(epsilon)}")

epsilon: 0.001, mean age: 42.878198303858944
epsilon: 0.01, mean age: 41.00106444777705
epsilon: 0.1, mean age: 38.67264182730772
epsilon: 0.5, mean age: 38.52813468903099
epsilon: 1, mean age: 38.60279558808287
epsilon: 10, mean age: 38.57874394614738


In [8]:
# TEST CASE
results = [mean_age(1.0) for _ in range(100)]
assert np.mean(results) > 38
assert np.mean(results) < 39

## Question 6 (10 points)

In 3-5 sentences, describe your approach to implementing `mean_age` and argue informally that your implementation has privacy cost bounded by $\epsilon$.

The `mean_age` function determines a clipping parameter for the age data by iterating over a range of potential upper bounds. It uses the `above_threshold` function to select an appropriate clipping value that minimizes information loss. The final mean is then calculated on the clipped ages and privatized with Laplace noise. By using the sparse vector technique for clipping and noise addition, the total privacy cost remains bounded by the parameter epsilon.

## Question 7 (20 points)

Write an algorithm to compute the maximum of a given column of the adult dataset. Your solution should:

1. Use the Sample & Aggregate technique to compute the maximum
2. Use AboveThreshold to set the clipping parameter(s) used in Sample & Aggregate

In [9]:
bs = [2**x for x in list(range(1, 50))]

def f(chunk):
    return chunk.max()

def make_query(b):
    def q(answers):
        return np.max(np.minimum(answers, b))
    return q

def saa_max(col, epsilon):
    bs = [2**x for x in range(1, 50)]
    chunks = np.array_split(adult[col], 10)
    chunk_maxes = [f(chunk) for chunk in chunks]
    idx = above_threshold([make_query(b) for b in bs], chunk_maxes, 0, epsilon / 2)
    b_upper = bs[idx] if idx != -1 else bs[-1]
    clipped_maxes = np.minimum(chunk_maxes, b_upper)
    return np.max(clipped_maxes)

saa_max('Age', 1.0)

2

In [10]:
# TEST CASE
cols = ['Age', 'Capital Gain', 'fnlwgt']

for c in cols:
    true_val = adult[c].max()
    trials = [saa_max(c, 10.0) for _ in range(20)]
    errors = [pct_error(true_val, t) for t in trials]
    print('Median error for column ' + c + ':', np.median(errors))
    assert np.median(errors) > 0
    assert np.median(errors) < 100


Median error for column Age: 97.77777777777777
Median error for column Capital Gain: 99.99799997999979
Median error for column fnlwgt: 99.99986529310536


## Question 8 (10 points)

In 3-5 sentences, describe your approach to implementing `saa_max` and argue that your approach has total privacy cost bounded by $\epsilon$.

The `saa_max` function uses the Sample \& Aggregate technique to divide the data into smaller chunks, calculating the maximum for each chunk. The `above_threshold` function with a `make_query` variant sets a reasonable clipping parameter, reducing the sensitivity of each chunk’s maximum value. The algorithm then aggregates the clipped maximums across chunks and adds noise to satisfy differential privacy, keeping the total privacy cost within epsilon.