<a href="https://colab.research.google.com/github/elisa-luo/dp-simulations/blob/main/garmin_insights.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Package installs + imports

In [None]:
!pip install python-dp



In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import pydp as dp
from pydp.algorithms.laplacian import BoundedSum, BoundedMean, Count, Max
import statistics

# Generate Synthetic Dataset
- 10,000 users
- columns: user ID (UID) and average sleep duration (in hours)
- sleep durations are drawn from a normal distribution
  - mean and standard deviation of sleep duration are based on a [CDC study](https://www.cdc.gov/mmwr/volumes/65/wr/mm6506a1.htm?s_cid=mm6506a1_w)



In [None]:
mean, std = 7, 1.5
np.random.normal(mean, std)

7.066493370714232

In [None]:
uid = range(0, 10000)
sleep_times = [np.random.normal(mean, std) for i in range(0,10000)]
assert len(uid) == len(sleep_times) == 10000

In [None]:
original_df = pd.DataFrame({"UID": uid, "sleep_duration": sleep_times})

In [None]:
original_df.head(5)

|   | UID | sleep_duration |
|---|-----|----------------|
| 0 | 0 | 8.691973 |
| 1 | 1 | 9.846144 |
| 2 | 2 | 4.84151 |
| 3 | 3 | 8.474597 |
| 4 | 4 | 3.163695 |


# A Simple Membership Inference Attack Demo
- show how using DP can protect a user against this attack

First, let's redact one entry from the original dataset: the record of the user with UID 0.
- For the Membership Inference Attack, we compare the sum of sleep_duration over all users with the same sum over a dataset that has exactly one fewer record. The difference between the two sums is the missing user's sleep duration, so an attacker who obtains both sums learns that user's value exactly.
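
In symbols, writing $x_i$ for the sleep duration of the user with UID $i$ and $n$ for the number of users (notation introduced here only for illustration), the two sums differ by exactly the redacted value:

$$
\sum_{i=0}^{n-1} x_i \;-\; \sum_{i=1}^{n-1} x_i \;=\; x_0
$$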



In [None]:
redact_df = original_df.copy()
redact_df = redact_df[1:]

In [None]:
redact_df.head(5)

|   | UID | sleep_duration |
|---|-----|----------------|
| 1 | 1 | 9.846144 |
| 2 | 2 | 4.84151 |
| 3 | 3 | 8.474597 |
| 4 | 4 | 3.163695 |
| 5 | 5 | 7.095451 |


We should see that the difference between the sums of the original and the redacted dataset is exactly the redacted user's data (up to floating-point rounding):

In [None]:
print("sleep duration of redacted user: {}".format(original_df.iloc[0, 1]))
print("sum of original dataset: {}".format(original_df['sleep_duration'].sum()))
print("sum of redacted dataset: {}".format(redact_df['sleep_duration'].sum()))
print("difference between dataset: {}".format(original_df['sleep_duration'].sum()-redact_df['sleep_duration'].sum()))

sleep duration of redacted user: 8.691972765491704
sum of original dataset: 70122.8203478781
sum of redacted dataset: 70114.12837511262
difference between dataset: 8.691972765489481


# The Solution: Differential Privacy
- we add Laplace noise to the sum computed over each dataset and show that the attack no longer works.
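
As a rough sketch of what a Laplace-noised bounded sum does, the snippet below clamps each value to fixed bounds, sums the result, and adds Laplace noise with scale sensitivity / epsilon. It uses only the standard library and is not PyDP's implementation. The helper names (`laplace_noise`, `noisy_bounded_sum`), the bounds of 0 and 12 hours, the sample values, and the seed are all assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_bounded_sum(values, lower, upper, epsilon, rng):
    """Clamp each value to [lower, upper], sum, then add Laplace noise."""
    clamped = [min(max(v, lower), upper) for v in values]
    # Adding or removing one record changes the clamped sum by at most this much.
    sensitivity = max(abs(lower), abs(upper))
    return sum(clamped) + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)  # hypothetical seed, only for reproducibility
values = [8.7, 9.8, 4.8, 8.5, 3.2]  # made-up sleep durations in hours
print(noisy_bounded_sum(values, lower=0.0, upper=12.0, epsilon=1.0, rng=rng))
```

A smaller epsilon gives a larger noise scale and therefore stronger privacy at the cost of accuracy.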

In [None]:
# Build a differentially private sum over the original dataset.
# Note: the bounds here are taken from the data itself, which leaks information;
# in practice, choose them from domain knowledge (e.g. 0 to 24 hours).
dp_sum_original_dataset = BoundedSum(
    epsilon=1,
    lower_bound=original_df['sleep_duration'].min(),
    upper_bound=original_df['sleep_duration'].max(),
    dtype="float")
dp_sum_original_dataset.add_entries(original_df["sleep_duration"].to_list())  # feed the data to the DP algorithm

In [None]:
# compute the noisy (DP) sum of the original dataset
dp_sum_og = round(dp_sum_original_dataset.result(), 2)
print(dp_sum_og)

70111.29


In [None]:
# Build a differentially private sum over the redacted dataset, using the same bounds as for the original.
dp_redact_dataset = BoundedSum(epsilon=1, lower_bound=original_df['sleep_duration'].min(), upper_bound=original_df['sleep_duration'].max(), dtype="float")
dp_redact_dataset.add_entries(redact_df["sleep_duration"].to_list())

In [None]:
# calculate sum of the redacted dataset
dp_sum_redact = round(dp_redact_dataset.result(), 2)
print(dp_sum_redact)

70109.71


With DP, the difference between the two noisy sums no longer equals the sleep duration of the redacted user:

In [None]:
print("sleep duration of redacted user: {}".format(original_df.iloc[0, 1]))
print("difference in sum using DP: {}".format(round(dp_sum_og - dp_sum_redact, 2)))
print("difference in sum without DP: {}".format(round(original_df['sleep_duration'].sum() - redact_df['sleep_duration'].sum(), 2)))

sleep duration of redacted user: 8.691972765491704
difference in sum using DP: 1.58
difference in sum without DP: 8.69
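
A single pair of noisy sums is only one draw. To get a feel for how widely the noisy difference can vary, the sketch below simulates many independent releases using only the standard library rather than PyDP. The noise scale of 12 (an assumed sensitivity of 12 hours divided by epsilon = 1), the seed, and the helper `laplace_noise` are assumptions for illustration. In a real deployment each release would consume privacy budget, so the repetition here is purely a simulation aid.

```python
import random
import statistics


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


rng = random.Random(0)   # hypothetical seed
true_difference = 8.69   # redacted user's sleep duration, rounded, from the output above
scale = 12.0 / 1.0       # assumed sensitivity (12 hours) divided by epsilon (1)

# Each trial adds independent noise to both sums and then subtracts them.
diffs = [
    true_difference + laplace_noise(scale, rng) - laplace_noise(scale, rng)
    for _ in range(10_000)
]

within_one_hour = sum(abs(d - true_difference) < 1 for d in diffs) / len(diffs)
print("std of noisy difference:", round(statistics.stdev(diffs), 1))
print("share of trials within 1 hour of the true value:", round(within_one_hour, 3))
```

Under these assumptions the spread of the noisy difference is much larger than the value the attacker is trying to recover, which is why one observed difference says little about the redacted user.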
