import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

## 1. Differential Privacy Mechanisms : Introduction

![Differential Privacy](./images/streamlinehq-protect-privacy-4-users-200.PNG)

Differential privacy is a set of mechanisms for publicly sharing information about a dataset by describing the patterns of groups within the dataset while limiting the disclosure of information about individuals in the dataset. The goal is to enable gaining insights about a population while protecting the privacy of individuals.

- Differential Privacy offers mathematical guarantees about the level of privacy afforded an individual.  These mathematical guarantees make the mechanisms attractive as a means of protection. 

- Differential Privacy mechanisms act as a filter of the true responses from queries to a database.  These filters perturb the true answers just enough to keep the contributions of individual records private. These types of queries are typically referred to as "aggregation queries".

- In practice, if implemented appropriately, differentially private queries will return answers that a) are useful and b) keep an individual's membership in the database private.

- In theory, differential privacy results can provide mathematical guarantees on the level of privacy afforded.  This is what makes them most attractive.

- All differential privacy mechanisms work by injecting a perturbation in the output of a query.  What? That's right, the answer returned by the query potentially differs from the true answer!

- The perturbation comes in the form of either adding "noise" to the true answer (adding or subtracting a random amount from the true answer), as is the case with **Laplacian Mechanisms** or **Gaussian Mechanisms**, or potentially returning the wrong answer sometimes, as is the case with **Exponential Mechanisms**.

- Now that we understand what Differential Privacy tries to guarantee on an intuitive level, we want to add a "knob" that determines the level of that protection.  We can turn the knob (or knobs) to increase or decrease the level of privacy. The first knob comes in the form of the parameter $\epsilon$. The second knob comes in the form of the parameter $\Delta$.

- **Think lower $\epsilon$ is higher privacy!** 

- Envision $\epsilon$ as a knob.  

![Knob](./images/volume.PNG)

- The higher you turn the knob, the lower the privacy.

- Practical implementations require the careful choice of $\epsilon$ and $\Delta$ to produce privacy in practice and useful results from queries.
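
To make the idea of "adding noise to the true answer" concrete, here is a minimal sketch of a Laplacian mechanism applied to a counting query. The function name `laplace_count`, the `ages` records, and the seed are illustrative assumptions, not part of any particular library. A counting query changes by at most 1 when a single record is added or removed, so the sketch draws noise with scale 1 / $\epsilon$.

```python
import numpy as np

def laplace_count(data, predicate, epsilon, rng):
    """Return a noisy count of the records that satisfy `predicate`.

    Assumes a counting query, whose answer changes by at most 1 when one
    record is added or removed, so the Laplace noise scale is 1 / epsilon.
    """
    true_count = sum(1 for x in data if predicate(x))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

rng = np.random.default_rng(seed=0)      # seeded only so the sketch is repeatable
ages = [23, 35, 41, 29, 52, 67, 38, 45]  # hypothetical records

# Lower epsilon means larger noise and therefore more privacy.
for eps in (0.1, 1.0, 4.0):
    noisy = laplace_count(ages, lambda a: a >= 40, eps, rng)
    print(f"epsilon={eps}: noisy count of ages >= 40 is {noisy:.2f}")
```

Each call returns a different answer, which is exactly the point: the person asking the query never sees the exact count.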

Next, let's explore how to set the parameters for our differential privacy mechanisms in practice.

## 2. Setting the Parameter Epsilon

The two important parameters in differential privacy mechanisms are **$\epsilon$** and **$\Delta$**.  Here we first consider the "privacy" parameter $\epsilon$. The symbol may not be that familiar, as it never made it as a big-time COVID-19 variant!

![Epsilon](./images/epsilon.png)

- $\epsilon$ is the parameter that sets the level of the privacy guarantee and affects the level of utility of the results returned from queries to the database.
 
- Recall that by adding probabilistic noise to the outcome of the mechanism, just enough uncertainty is added to mask the membership of any one record in the database. $\epsilon$ is the parameter that determines how much noise is added to the outcome.

- Changes in $\epsilon$  will impact both the privacy guarantees afforded to the members of the database and the utility of the results.
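
For a Laplacian mechanism, the link between $\epsilon$ and the amount of noise can be written down directly: the noise scale is the query's sensitivity divided by $\epsilon$, and the standard deviation of the noise is $\sqrt{2}$ times that scale. The sketch below tabulates this for a few values of $\epsilon$. The helper name `laplace_scale` and the sensitivity of 1 (a counting query) are assumptions made for illustration.

```python
import math

def laplace_scale(sensitivity, epsilon):
    """Scale parameter b of the Laplace noise for a given sensitivity and epsilon."""
    return sensitivity / epsilon

ASSUMED_SENSITIVITY = 1.0  # hypothetical: a counting query

for eps in (0.1, 0.5, 1.0, 2.0, 4.0):
    b = laplace_scale(ASSUMED_SENSITIVITY, eps)
    std = math.sqrt(2) * b  # standard deviation of a Laplace(0, b) variable
    print(f"epsilon={eps:<4} scale={b:6.2f} std={std:6.2f}")
```

Halving $\epsilon$ doubles the noise scale, so very small values of $\epsilon$ quickly swamp the true answer.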

### 2.1 Selecting Epsilon for the Right Balance of Privacy and Utility

At this point you might be wondering what happens if you set $\epsilon$ as low as possible. **The information gained from the query result will eventually become worthless if you lower epsilon too much!** The answer to a query will become so noisy, so skewed from the actual answer, that no insights can be gained. **So balancing privacy with utility is the key!**

### 2.2 Some Practical Facts About Epsilon and Simple Differential Privacy

- Simply put, $\epsilon$ is a measure of **privacy loss or leakage**.

- Thus, higher epsilon values translate to lower privacy.  

- A ‘good’ $\epsilon$ value should be low and at the same time maintain a certain level of utility of the answers provided by the DP mechanism. 

- $\epsilon$ values higher than 4 or 5 typically offer such diminished privacy guarantees that they really translate to no privacy (although the second parameter $\Delta$ does affect this claim).

- $\epsilon$ values below 1 tend to diminish utility to the point of being impractical (so much noise in the answer that it's not helpful).

- Simple Differential Privacy (SDP) offers $\epsilon$ values between .1 and 4, a reasonably practical range.

- SDP allows $\epsilon$ to be set in .1 increments.

- $\epsilon$ is a setting like a water faucet: the higher you turn the knob, the more privacy leaks (or gushes) out.


### 2.3 The Mathematical Relationship Between $\epsilon$ and Privacy

- Very generally, privacy guarantees are met by bounding the differences in the probabilities of outcomes between queries from two databases that differ by only one record.

- Bounding this difference protects individual records.

- Very generally, turning up $\epsilon$ by one whole number, say from 1 to 2, weakens the privacy guarantee by a factor of e, or about 2.718 (all other parameters held equal).

- The mathematics of differential privacy are such that with $\epsilon$ = 5 the probability of an outcome being returned by a Laplace mechanism can differ by a factor of $e^5 \approx 148$ between two databases that differ by one record, compared with a factor of about 2.718 at $\epsilon$ = 1.  A bound that loose is nearly vacuous: an outcome that is less than 1% likely on one database could be close to certain on the other (not private at all).

- If you are interested in the mathematical aspects of differential privacy, subsequent notebooks take a deeper dive and present functions for experimentation with parameters and use cases.
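
The bound described in this section is usually stated as $\Pr[M(D) \in S] \le e^{\epsilon} \Pr[M(D') \in S]$ for any two databases $D$ and $D'$ that differ by one record. The following sketch checks this numerically for Laplace noise: it compares the output densities for two true answers that differ by the sensitivity and reports the largest ratio found on a grid. The helper names `laplace_pdf` and `max_ratio`, the grid, and the sensitivity of 1 are assumptions chosen for illustration.

```python
import math

def laplace_pdf(x, loc, scale):
    """Density of a Laplace(loc, scale) distribution at x."""
    return math.exp(-abs(x - loc) / scale) / (2.0 * scale)

def max_ratio(epsilon, sensitivity=1.0):
    """Largest ratio of output densities for two neighboring true answers.

    The two true answers are 0 and `sensitivity`; the ratio is evaluated on
    a grid of possible outputs from -20 to 20 in steps of 0.1.
    """
    scale = sensitivity / epsilon
    grid = [i / 10.0 for i in range(-200, 201)]
    return max(
        laplace_pdf(x, 0.0, scale) / laplace_pdf(x, sensitivity, scale)
        for x in grid
    )

for eps in (1.0, 2.0, 5.0):
    print(f"epsilon={eps}: e^epsilon={math.exp(eps):8.3f}  "
          f"largest density ratio={max_ratio(eps):8.3f}")
```

The largest ratio matches $e^{\epsilon}$, which is why each step up in $\epsilon$ loosens the guarantee by a factor of e.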

### 2.4 Putting the Choice of $\epsilon$  Into Practice

As we mentioned earlier, the trade-off between utility and privacy is an **inverse** relationship: as privacy goes up, utility goes down. At a high level, the general concept is depicted by the following illustration.

![Epsilon](./images/Trade-off-between-privacy-and-utility-3.png)

- The idea is to find the point in the trade-off that is right for each database and use case.
- There is no exact right answer! The choice depends on the application.
- However, examples from practical research and existing implementations can guide the choice.
- The key is to find the point where the acceptable privacy and acceptable utility meet.
- Utility depends on acceptable accuracy for a particular use case.
- But acceptable ranges of epsilon tend to be centered around $\epsilon$ = 1
- Below is an example of an actual experiment with varying values of $\epsilon$ and the resulting accuracy of a Bayesian classifier model.

![Epsilon](./images/Trade-off-between-privacy-and-utility-2.png)

- This actual use case illustrates a common accuracy trade-off with increasing $\epsilon$
- Notice there is very little accuracy gain for $\epsilon$ > 4
- Notice there is very little accuracy loss for $\epsilon$ < .1

## 3. A Simple Example: Oski (Go Bears!) and the Balding Brown Bears

![Oski](./images/oski.png)

Suppose we are studying the impact of dietary habits of the brown bear and the implications of diet on the loss of fur of our salmon-eating friends. The database contains a list of bears, all of whom are losing their fur.  They are balding.  No self-respecting bear wants such a sensitive fact made public. Disclosure of a bear's membership in this database means breaching the privacy of the bear's sensitive attribute: they are going bald.

Consider the facts around the balding bears database.

- When the salmon are running, large, talented, and strong brown bears can catch in excess of 30 salmon per day
- Less talented, "normal" bears tend to collect and eat 10 - 20 salmon per day


![CalSal](./images/California_Salmon.jpeg)

Suppose we have a database of ten bears, all somewhat average brown bears, who are all balding.  Initially we have 10 bears in the database.

1. Avalanche (Kutztown)
2. Bananas (U of Maine)
3. Benny (Morgan State)
4. Boomer (Lake Forest)
5. Grizz (Oakland)
6. Kody (Cascadia)
7. Monte (U Montana)
8. Nanook (Bowdoin)
9. Ranger (Drew)
10. Scott Highlander (UC Riverside)

On a particular day, July 30, when the salmon are running, the database records the following number of fish captured by this group.

[10, 13, 14, 12, 18, 14, 18, 17, 16, 12]

**Note: the eating habits of this entire group fall within a reasonable range of "normal" brown bear daily catches!**


What if we want to understand some statistics about this group of bears without divulging the actual catches of a particular brown bear?

The true average is 14.4

Using a Laplacian mechanism with **$\epsilon$ = 1** (**reasonably high privacy**), we ask: what is the differentially private average catch on July 30th?

With 10 runs of the Laplacian Differential Privacy Mechanism, the noisy average outputs are:
 
 - 15.05
 - 10.56
 - 17.80
 - 12.74
 - 12.01
 - 17.15
 - 22.76
 - 11.34
 - 12.28
 - 15.89
 
Here the MSE or mean square error of the 10 query results from the true average is **12.78**. That is a lot of noise relative to the true average!

Using a Laplacian mechanism with **$\epsilon$ = 4** (**reasonably low privacy**), we ask what is the average catch on July 30th?

With 10 runs we see a much tighter band around the true answer. With higher $\epsilon$ comes lower protection of privacy.

- 14.51
- 14.80
- 14.75
- 13.45
- 15.63
- 13.36
- 16.20
- 12.93
- 12.97
- 13.87

Here the Mean Square Error of the 10 query results is **.11**.  The actual query results are close to the true answer.
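
An experiment of this kind can be sketched in a few lines. The code below is not the implementation that produced the runs above. It is a minimal illustration under stated assumptions: the number of bears is treated as public, each catch is clipped to a hypothetical range of 0 to 50 salmon, and the sensitivity of the average is taken as that range divided by the number of bears. The names `dp_mean` and `mse` and the seed are illustrative, and because the bounds are assumed, the noise level will not match the figures quoted above.

```python
import numpy as np

def dp_mean(values, epsilon, lower, upper, rng):
    """Laplace-noised mean, assuming a public record count and clipped values."""
    clipped = np.clip(values, lower, upper)
    # Changing one record moves the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clipped)
    return clipped.mean() + rng.laplace(0.0, sensitivity / epsilon)

def mse(results, truth):
    """Mean square error of the query results relative to the true answer."""
    return float(np.mean((np.asarray(results) - truth) ** 2))

catches = np.array([10, 13, 14, 12, 18, 14, 18, 17, 16, 12])
rng = np.random.default_rng(seed=42)  # seeded only so the sketch is repeatable

for eps in (1.0, 4.0):
    runs = [dp_mean(catches, eps, 0, 50, rng) for _ in range(10)]
    print(f"epsilon={eps}: MSE over 10 runs = {mse(runs, catches.mean()):.2f}")
```

With only 10 runs the measured MSE varies a lot from one experiment to the next, so it is worth repeating the loop with different seeds before drawing conclusions.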

What conclusions can we draw from this example?

- The choice of $\epsilon$ boils down to tolerable error in the results for a given dataset
- One important factor is that we considered the utility-privacy trade-off only for a pretty **homogeneous** group of bears. All bears ate an amount of salmon in the range of 10-18.
- With this homogeneous group it is pretty hard to figure out, **under either $\epsilon$ value**, whether any particular bear is in the data set.

- This distribution of data, fairly homogeneous in nature, motivates a lower privacy guarantee and more utility!

**What if we needed to protect the privacy of an outlier bear as well, one such as Oski?**
What if the database, in general, had to protect a wider range of dietary habits of our salmon eaters? Not only is the protection important, but perhaps the outliers hold the key to uncovering a cure for the balding bears.

It is well known that **Oski the Cal Bear** is a giant among bears. What if we have auxiliary information about Oski?  Namely, true to his school population, he is talented, competitive, clever, and innovative, and he may be able to catch **40-50** salmon a day when they are running!

**Let's add Oski to the database. To keep the fact that Oski is in this database private, it will take more work from $\epsilon$!**

Suppose Oski records 48 salmon caught on July 30th. Look at the results with $\epsilon$ = 4. The true average with Oski is $17.45$.

- 17.96
- 17.24
- 21.27
- 16.39
- 18.07
- 17.54
- 17.35
- 17.48
- 18.05
- 17.17

Compare these results with those where Oski is not in the database.  The noisy average with $\epsilon$ = 4 reveals with high probability a bear who is a giant eater is in the database.  We know Oski is a skilled, voracious hunter.  Thus we can guess with better odds that Oski is probably in the database.

This violates our differential privacy goal! Outliers like Oski in a database need lower values of $\epsilon$ to mask their membership and thus protect their sensitive attributes.
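
One way to see why Oski stands out is to compare how far he moves the true average with how much noise the mechanism adds. The sketch below computes the shift in the average and divides it by the Laplace noise scale for a few values of $\epsilon$. The sensitivity used here (catches clipped to 0 through 50, spread over 11 bears) is a hypothetical choice for illustration, not the setting used for the runs above.

```python
import numpy as np

catches = np.array([10, 13, 14, 12, 18, 14, 18, 17, 16, 12])
with_oski = np.append(catches, 48)  # Oski's catch on July 30th

shift = with_oski.mean() - catches.mean()
print(f"mean without Oski:    {catches.mean():.2f}")
print(f"mean with Oski:       {with_oski.mean():.2f}")
print(f"shift caused by Oski: {shift:.2f}")

# Hypothetical sensitivity: catches clipped to [0, 50], 11 bears in the database.
ASSUMED_SENSITIVITY = 50 / 11

for eps in (4.0, 1.0, 0.5):
    scale = ASSUMED_SENSITIVITY / eps
    print(f"epsilon={eps}: noise scale={scale:.2f}, shift / scale={shift / scale:.2f}")
```

When the shift is several times larger than the noise scale, the noisy averages with and without Oski barely overlap. When the shift is a small fraction of the scale, the noise hides him.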

What if we want to effectively mask Oski's membership in the database? What $\epsilon$ value would we need to accomplish this?

Look what happens if we choose $\epsilon$ = .5 and run the Laplace Mechanism 10 times before and after Oski is in the database.

Recall the true average without Oski is 14.4 and with Oski is 17.45.


Before:

- 4.0
- 28.53
- 16.04
- 14.11
- 35.73
- 32.66
- 11.89
- 10.09
- 10.58
- 38.85

After:

- 15.01
- 15.59
- 14.66
- 14.74
- 12.18
- 47.22
- 21.04
- 32.38
- 11.68
- 20.72

With $\epsilon$ = .5, the MSE is 103 when Oski is in the database and 110 when he is not.
The values are so noisy that learning of his presence in the database becomes highly unlikely!

The price is heavy noise in the average values.  Still, this $\epsilon$ makes Oski's membership virtually impossible to ascertain!
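
How hard it is to detect Oski can be estimated with a small simulation. The sketch below imagines a simple attacker who sees one noisy average and guesses "Oski is in the database" whenever that average lands above the midpoint of the two true averages. Everything here is an illustrative assumption: the attacker's strategy, the function name `guess_accuracy`, the sensitivity of 50 / 11, and the simplification that the same noise scale applies with and without Oski.

```python
import numpy as np

def guess_accuracy(epsilon, sensitivity, mean_without, mean_with, trials, rng):
    """Fraction of correct guesses by a threshold attacker who sees one noisy mean."""
    threshold = (mean_without + mean_with) / 2.0
    scale = sensitivity / epsilon
    noisy_without = mean_without + rng.laplace(0.0, scale, size=trials)
    noisy_with = mean_with + rng.laplace(0.0, scale, size=trials)
    # The attacker is right when a "without" output falls below the threshold
    # or a "with" output falls at or above it.
    correct = np.sum(noisy_without < threshold) + np.sum(noisy_with >= threshold)
    return correct / (2 * trials)

rng = np.random.default_rng(seed=7)  # seeded only so the sketch is repeatable
for eps in (0.5, 1.0, 4.0):
    acc = guess_accuracy(eps, 50 / 11, 14.4, 192 / 11, 20_000, rng)
    print(f"epsilon={eps}: attacker guesses correctly about {acc:.0%} of the time")
```

An accuracy near 50% means the attacker does no better than a coin flip, which is the situation a low $\epsilon$ is meant to create.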

**What did we learn from protecting Oski?**

- The value for $\epsilon$ depends on the level of privacy needed for a particular use case

- Both the variability of the data and the level of protection desired drive the choice.
- If the range of the data is not highly variable, $\epsilon$ can be higher, imparting more accuracy.
- If the use case needs less protection for outliers, because they may not even be possible in the data, a higher value of $\epsilon$ can be used.
- However, if protection must extend to extreme cases in the database, $\epsilon$ must be much lower, typically between 0 and 1.
- There is a hefty loss of utility as $\epsilon$ drops from 4 to 1.  We saw that at .5 the utility of the average was highly impaired, possibly to the point of being useless. Tolerable utility degradation is specific to the use case and the data.
- Parameter tuning and privacy decisions must be made carefully, with study and experimentation on the database itself.