# Differential Privacy

_Differential Privacy_ and _Quantitative Information Flow_ can be seen as having essentially the same goal, namely to control the leakage of sensitive information. In this notebook, we explore the similarities and differences between the two approaches.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from func import *
try:
    from qif import *
except ImportError:  # install qif if not available (for running in colab, etc)
    import IPython; IPython.get_ipython().run_line_magic('pip', 'install qif')
    from qif import *

## An example scenario

Assume we are interested in the eye color of a certain population $\cal{I} = \{Alice,Bob,Charlie\}$. Let the possible values for each person in $\cal{I}$ be defined by the set $\cal{V}=\{a,b,g\}$, where $a$ stands for _absent_ (i.e. the person is not in this specific database), $b$ stands for _black_ and $g$ for _green_. Each dataset is a tuple $x_0x_1x_2 \in \cal{V}^3$ where $x_0$ represents the eye color of _Alice_, $x_1$ of _Bob_ and $x_2$ of _Charlie_.

So we have

```
X = { 
    aaa, aab, aag, 
    aba, abb, abg,
    aga, agb, agg,
    baa, bab, bag,
         ...
    gga, ggb, ggg
    }
```

Consider now a counting query in the form of

```
SELECT COUNT(*)
FROM X
WHERE eye_color = 'b';
```

Its possible output values are

```
Y = {0, 1, 2, 3}
```

We can model this query as a channel $f$ as below.

$$
\begin{array}{|c|c|c|c|c|}
\hline
f & \texttt{0} & \texttt{1} & \texttt{2} & \texttt{3}  \\ \hline
\texttt{aaa} & 1 & 0 & 0 & 0  \\ \hline
\texttt{aab} & 0 & 1 & 0 & 0  \\ \hline 
\texttt{aag} & 1 & 0 & 0 & 0  \\ \hline 
\texttt{aba} & 0 & 1 & 0 & 0  \\ \hline 
... & ... & ... & ... & ...\\ \hline
\texttt{ggg} & 1 & 0 & 0 & 0  \\ \hline 
\end{array}
$$

Now instead of reporting the true answer $y$, we process it further by passing it through a noise channel $H$ and report a slightly different answer $z$.

Here we are going to use the following mechanism for $H$.

$$
\begin{array}{|c|c|c|c|c|}
\hline
H & \texttt{0} & \texttt{1} & \texttt{2} & \texttt{3}  \\ \hline
\texttt{0} & \frac{3}{4} & \frac{1}{6} & \frac{1}{18} & \frac{1}{36} \\ \hline
\texttt{1} & \frac{1}{4} & \frac{1}{2} & \frac{1}{6} & \frac{1}{12} \\ \hline
\texttt{2} & \frac{1}{12} & \frac{1}{6} & \frac{1}{2} & \frac{1}{4} \\ \hline
\texttt{3} & \frac{1}{36} & \frac{1}{18} & \frac{1}{6} & \frac{3}{4} \\ \hline
\end{array}
$$

What $H$ does is basically add noise to the true answer of $f$ using a specific logic which is not of importance for this example. The thing to notice is that the true answer of $f$ has the highest probability within its row.

So the whole channel $C = f \cdot H$, from $X$ to $Z$, i.e. from the database to the fuzzy query answer, can be depicted as below.

$$
\begin{array}{|c|c|c|c|c|}
\hline
C & \texttt{0} & \texttt{1} & \texttt{2} & \texttt{3}  \\ \hline
\texttt{aaa} & \frac{3}{4} & \frac{1}{6} & \frac{1}{18} & \frac{1}{36} \\ \hline
\texttt{aab} & \frac{1}{4} & \frac{1}{2} & \frac{1}{6} & \frac{1}{12} \\ \hline
\texttt{aag} & \frac{3}{4} & \frac{1}{6} & \frac{1}{18} & \frac{1}{36} \\ \hline
\texttt{aba} & \frac{1}{4} & \frac{1}{2} & \frac{1}{6} & \frac{1}{12} \\ \hline
... & ... & ... & ... & ...\\ \hline
\texttt{ggg} & \frac{3}{4} & \frac{1}{6} & \frac{1}{18} & \frac{1}{36} \\ \hline 
\end{array}
$$
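The cascade of $f$ and $H$ can also be checked numerically. The following is a minimal sketch, not the notebook's own `get_C`: it assumes the databases are enumerated in lexicographic order (`aaa, aab, aag, ...`), and the names `databases`, `f` and `H` are chosen here for illustration only.

```python
import itertools
import numpy as np

values = ['a', 'b', 'g']   # absent, black, green
num_persons = 3
query_value = 'b'

# All databases in lexicographic order: aaa, aab, aag, aba, ...
databases = [''.join(t) for t in itertools.product(values, repeat=num_persons)]

# Deterministic query channel f: one row per database, with a single 1 in
# the column equal to the number of persons whose value is query_value.
f = np.zeros((len(databases), num_persons + 1))
for row, db in enumerate(databases):
    f[row, db.count(query_value)] = 1

# Noise channel H, copied from the table above.
H = np.array([
    [3/4,  1/6,  1/18, 1/36],
    [1/4,  1/2,  1/6,  1/12],
    [1/12, 1/6,  1/2,  1/4],
    [1/36, 1/18, 1/6,  3/4],
])

# Cascade: first apply f, then H.
C = f @ H

print(C.shape)
print(C[databases.index('abb')])  # row of C for the database abb
```

Because $f$ is deterministic, each row of $C$ is simply the row of $H$ selected by the true count for that database.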

In [2]:
num_persons = 3
values = ['a', 'b', 'g']
num_values = len(values)
query_value = 'b'

In [3]:
C = get_C(num_persons, values, query_value)
print(C)

[[0.75       0.16666667 0.05555556 0.02777778]
 [0.25       0.5        0.16666667 0.08333333]
 [0.75       0.16666667 0.05555556 0.02777778]
 [0.25       0.5        0.16666667 0.08333333]
 [0.08333333 0.16666667 0.5        0.25      ]
 [0.25       0.5        0.16666667 0.08333333]
 [0.75       0.16666667 0.05555556 0.02777778]
 [0.25       0.5        0.16666667 0.08333333]
 [0.75       0.16666667 0.05555556 0.02777778]
 [0.25       0.5        0.16666667 0.08333333]
 [0.08333333 0.16666667 0.5        0.25      ]
 [0.25       0.5        0.16666667 0.08333333]
 [0.08333333 0.16666667 0.5        0.25      ]
 [0.02777778 0.05555556 0.16666667 0.75      ]
 [0.08333333 0.16666667 0.5        0.25      ]
 [0.25       0.5        0.16666667 0.08333333]
 [0.08333333 0.16666667 0.5        0.25      ]
 [0.25       0.5        0.16666667 0.08333333]
 [0.75       0.16666667 0.05555556 0.02777778]
 [0.25       0.5        0.16666667 0.08333333]
 [0.75       0.16666667 0.05555556 0.02777778]
 [0.25       

The following image graphically depicts the whole setting (ignore the notions of leakage and utility for now).

![img1](./img1.jpg)

## Assessing the information leakage through _QIF_

To measure the leakage with _QIF_ we must first define the prior distribution over $X$. If we don't have any particular knowledge about it we use the uniform distribution. 

In [4]:
pi = probab.uniform(num_values ** num_persons)  # one entry per database: |V|^|I|
print(pi)

[0.03703704 0.03703704 0.03703704 0.03703704 0.03703704 0.03703704
 0.03703704 0.03703704 0.03703704 0.03703704 0.03703704 0.03703704
 0.03703704 0.03703704 0.03703704 0.03703704 0.03703704 0.03703704
 0.03703704 0.03703704 0.03703704 0.03703704 0.03703704 0.03703704
 0.03703704 0.03703704 0.03703704]


Next, we compute the hyper distribution, which depends both on $C$ and $\pi$.

In [5]:
from print_hyper import print_hyper
print_hyper(C, pi)

---------------------------------------
|    0.35    0.31    0.21    0.13 |
---------------------------------------
|    0.08    0.02    0.01    0.01 |
|    0.03    0.06    0.03    0.02 |
|    0.08    0.02    0.01    0.01 |
|    0.03    0.06    0.03    0.02 |
|    0.01    0.02    0.09    0.07 |
|    0.03    0.06    0.03    0.02 |
|    0.08    0.02    0.01    0.01 |
|    0.03    0.06    0.03    0.02 |
|    0.08    0.02    0.01    0.01 |
|    0.03    0.06    0.03    0.02 |
|    0.01    0.02    0.09    0.07 |
|    0.03    0.06    0.03    0.02 |
|    0.01    0.02    0.09    0.07 |
|    0.00    0.01    0.03    0.22 |
|    0.01    0.02    0.09    0.07 |
|    0.03    0.06    0.03    0.02 |
|    0.01    0.02    0.09    0.07 |
|    0.03    0.06    0.03    0.02 |
|    0.08    0.02    0.01    0.01 |
|    0.03    0.06    0.03    0.02 |
|    0.08    0.02    0.01    0.01 |
|    0.03    0.06    0.03    0.02 |
|    0.01    0.02    0.09    0.07 |
|    0.03    0.06    0.03    0.02 |
|    0.08    0.02   

Now for each column of the matrix above, i.e. each possible outcome $z$, __QIF__ (here with Bayes vulnerability) models the threat as the $x$ with the highest probability within that column. So we pick the maximum probability of each column and then **we weigh** them with the outer probabilities, i.e. the probability of each $z$ occurring. The result is the posterior vulnerability of $C$.

In [6]:
print("QIF posterior vulnerability:", measure.bayes_vuln.posterior(pi, C))

QIF posterior vulnerability: 0.09259259259259259
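The same number can be reproduced by hand with plain NumPy. This is a sketch under stated assumptions rather than the `qif` implementation: the channel is rebuilt from $H$ assuming lexicographic ordering of the databases, and the names `joint`, `outer` and `posteriors` are illustrative.

```python
import itertools
import numpy as np

# Noise channel H from the table above.
H = np.array([
    [3/4,  1/6,  1/18, 1/36],
    [1/4,  1/2,  1/6,  1/12],
    [1/12, 1/6,  1/2,  1/4],
    [1/36, 1/18, 1/6,  3/4],
])

# True count of 'b' for every database, in lexicographic order.
counts = [t.count('b') for t in itertools.product('abg', repeat=3)]
C = H[counts]                                  # 27 x 4 channel matrix
pi = np.full(len(counts), 1 / len(counts))     # uniform prior

joint = pi[:, None] * C      # joint[x, z] = pi[x] * C[x, z]
outer = joint.sum(axis=0)    # outer[z] = probability of observing z
posteriors = joint / outer   # column z is the posterior distribution given z

# Best guess per column, weighted by how likely that column is.
vuln = np.sum(outer * posteriors.max(axis=0))
print(vuln)  # approximately 0.0926, i.e. 2.5 / 27
```

For comparison, the prior Bayes vulnerability under the uniform prior is $1/27 \approx 0.037$, so observing $z$ multiplies the adversary's chance of a correct guess by $2.5$.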


## Assessing the information leakage through _Differential Privacy_

Differential privacy works a bit differently. 

First of all, it uses the notion of _adjacent_ or _neighbor_ databases. This means two databases that differ in the presence of, or in the value associated with, exactly one individual. We use $x_1 \sim x_2$ to indicate that $x_1$ and $x_2$ are adjacent. For example $bbg \sim bag$ and $aba \sim bba$.

Now, for each column of $C$, i.e. each possible outcome $z$, __Differential Privacy__ models the threat as the biggest difference between the probabilities of any two _adjacent_ databases and expresses the gap between them as below.

$$
C_{x_1z} \cdot e^{\epsilon} = C_{x_2z} 
$$

Here $x_1$ and $x_2$ are the pair of _adjacent_ databases that realize the biggest gap within that column, with $C_{x_1z}$ being the smaller probability and $C_{x_2z}$ the larger one. Equivalently, $\epsilon = \ln(C_{x_2z} / C_{x_1z})$.

To combine the $\epsilon$ values for all $z$, we keep the biggest one, which represents the biggest gap between the probabilities of any two _adjacent_ elements.

In [7]:
# The following function may overestimate the real value of epsilon,
# so the value it returns is an upper bound for it.
print("Differential Privacy epsilon:", get_worst_epsilon(C))

Differential Privacy epsilon: 3.295836866004329


Let's verify that indeed that is the worst-case $\epsilon$ by observing the $\epsilon$ values for each column of $C$.

In [8]:
for i in range(num_persons + 1):  # possible outputs are 0..num_persons
    print("epsilon for column", i, "=", get_worst_epsilon(C, i))

epsilon for column 0 = 3.295836866004329
epsilon for column 1 = 2.1972245773362196
epsilon for column 2 = 2.1972245773362196
epsilon for column 3 = 3.295836866004329
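To see how loose such an upper bound can be, one can restrict the comparison to _adjacent_ databases only. The sketch below is an assumption-laden illustration and does not reproduce `get_worst_epsilon`, whose source is not shown here: it rebuilds $C$ from $H$ with lexicographically ordered databases, and `adjacent` and `worst_log_ratio` are hypothetical helpers.

```python
import itertools
import math
import numpy as np

# Noise channel H from the table above.
H = np.array([
    [3/4,  1/6,  1/18, 1/36],
    [1/4,  1/2,  1/6,  1/12],
    [1/12, 1/6,  1/2,  1/4],
    [1/36, 1/18, 1/6,  3/4],
])

databases = list(itertools.product('abg', repeat=3))
C = H[[db.count('b') for db in databases]]   # 27 x 4 channel matrix

def adjacent(x1, x2):
    """True if the two databases differ in exactly one person's value."""
    return sum(u != v for u, v in zip(x1, x2)) == 1

def worst_log_ratio(C, pairs):
    """Largest ln(C[i, z] / C[j, z]) over the given row pairs and all z."""
    return max(np.log(C[i] / C[j]).max() for i, j in pairs)

all_pairs = list(itertools.permutations(range(len(databases)), 2))
adjacent_pairs = [(i, j) for i, j in all_pairs
                  if adjacent(databases[i], databases[j])]

eps_all = worst_log_ratio(C, all_pairs)            # every pair of rows
eps_adjacent = worst_log_ratio(C, adjacent_pairs)  # adjacent rows only

print(eps_all, math.log(27))      # both approximately 3.2958
print(eps_adjacent, math.log(3))  # both approximately 1.0986
```

Comparing all pairs of rows gives $\ln 27$, which coincides with the value printed above, while restricting to adjacent databases gives $\ln 3$. Adjacent databases change the true count by at most one, and neighboring rows of $H$ differ by a factor of at most $3$ in every column.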


The idea behind **Differential Privacy** is that the presence or absence of any individual in a database, or changing the data of any individual, does not significantly affect the probability of obtaining any specific answer for a certain query.

## Comparing the two approaches

One difference is that QIF is sensitive to the prior distribution of $X$ whereas differential privacy is not.

For example, consider the uniform and point distributions as below.

In [9]:
pi1 = probab.uniform(num_values ** num_persons)
print("pi1\n", pi1, "\n")
pi2 = probab.point(num_values ** num_persons)
print("pi2\n", pi2)

pi1
 [0.03703704 0.03703704 0.03703704 0.03703704 0.03703704 0.03703704
 0.03703704 0.03703704 0.03703704 0.03703704 0.03703704 0.03703704
 0.03703704 0.03703704 0.03703704 0.03703704 0.03703704 0.03703704
 0.03703704 0.03703704 0.03703704 0.03703704 0.03703704 0.03703704
 0.03703704 0.03703704 0.03703704] 

pi2
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0.]


If we measure the information leakage through QIF for both cases we get:

In [10]:
print("QIF posterior vulnerability for p1:", measure.bayes_vuln.posterior(pi1, C))
print("QIF posterior vulnerability for p2:", measure.bayes_vuln.posterior(pi2, C))

QIF posterior vulnerability for p1: 0.09259259259259259
QIF posterior vulnerability for p2: 1.0


But differential privacy does not consider the prior distribution of $X$, so we get:

In [11]:
print("Differential Privacy epsilon for pi1:", get_worst_epsilon(C))
print("Differential Privacy epsilon for pi2:", get_worst_epsilon(C))

Differential Privacy epsilon for pi1: 3.295836866004329
Differential Privacy epsilon for pi2: 3.295836866004329


This is also evident from the fact that the `get_worst_epsilon` function does not take a `pi` parameter.

Another difference between the two is that QIF vulnerability is defined as the result of averaging the contribution of all the columns to the vulnerability, while differential privacy represents the worst case (i.e. the maximum $\epsilon$ over all $z$).

Hence there could be a column with a very high $\epsilon$ value which does not contribute very much to the average (typically because the corresponding output has very low probability). In that case, the QIF vulnerability could be very small, and still $\epsilon$-differential privacy would have a really big $\epsilon$ value.
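A toy channel makes this concrete. The sketch below uses a made-up $2 \times 2$ channel (`C_toy` and `pi_toy` are illustrative and not part of the eye-color example) and treats its two inputs as adjacent.

```python
import numpy as np

# Two inputs, two outputs. The second output is very unlikely for both
# inputs, but its probability differs by a factor of 1000 between them.
C_toy = np.array([
    [0.999,    0.001],
    [0.999999, 0.000001],
])
pi_toy = np.array([0.5, 0.5])   # uniform prior over the two inputs

# Worst-case epsilon: largest log-ratio within any column.
eps_per_column = np.log(C_toy.max(axis=0) / C_toy.min(axis=0))
eps = eps_per_column.max()

# Bayes vulnerability before and after observing the output.
prior_vuln = pi_toy.max()
post_vuln = (pi_toy[:, None] * C_toy).max(axis=0).sum()

print("epsilon per column:", eps_per_column)
print("worst-case epsilon:", eps)          # approximately ln(1000) = 6.91
print("prior vulnerability:", prior_vuln)  # 0.5
print("posterior vulnerability:", post_vuln)
```

Here the worst-case $\epsilon$ is about $6.9$, driven entirely by the second column, yet the posterior vulnerability is only about $0.5005$ against a prior vulnerability of $0.5$, because that column is observed with probability of roughly $0.0005$.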

## Incorporating ideas from QIF into Differential Privacy

It is interesting to notice that, since differential privacy does not take the prior distribution into account, we could use a uniform prior, which assumes no special knowledge about $X$, and then compute the epsilon parameter from the matrix of posterior distributions as shown below.

In [12]:
pi = probab.uniform(num_values ** num_persons)
posteriors = channel.hyper(C, pi)[1]
print("Differential Privacy epsilon:", get_worst_epsilon(posteriors))

Differential Privacy epsilon: 3.2958368660043296


With the uniform prior, the channel matrix $C$ and the matrix of the posterior distributions give the same epsilon values for all columns. Compare them with the results we got before.

In [13]:
for i in range(num_persons + 1):  # possible outputs are 0..num_persons
    print("epsilon for column", i, "=", get_worst_epsilon(posteriors, column=i))

epsilon for column 0 = 3.2958368660043296
epsilon for column 1 = 2.1972245773362196
epsilon for column 2 = 2.1972245773362196
epsilon for column 3 = 3.2958368660043296
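The agreement between the two sets of values follows from Bayes' rule. Writing $\delta^{z}$ for the posterior distribution given output $z$ (a symbol introduced here for illustration only), we have:

$$
\delta^{z}_{x} = \frac{\pi_x \, C_{xz}}{\sum_{x'} \pi_{x'} \, C_{x'z}}
\quad\Longrightarrow\quad
\frac{\delta^{z}_{x_2}}{\delta^{z}_{x_1}} = \frac{\pi_{x_2}}{\pi_{x_1}} \cdot \frac{C_{x_2z}}{C_{x_1z}}
$$

With a uniform prior the factor $\pi_{x_2} / \pi_{x_1}$ equals $1$, so the ratios within each column, and hence the $\epsilon$ values, are identical for $C$ and for the posteriors (up to floating-point rounding). For a non-uniform prior the factor no longer cancels and the two values can differ.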


So if for some reason there is a need to take the prior distribution into account, we could define `pi` and `C` and compute the epsilon value from the matrix of the posterior distributions.

To address the second difference mentioned above, instead of taking the worst-case $\epsilon$ over all columns we could weigh the per-column $\epsilon$ values with the outer probabilities (i.e. the probability of each $z$ occurring). This way, a low-probability output with a high $\epsilon$ would not dominate the overall leakage of our channel.

That is, we could compute $\epsilon$ as follows.

$$
\epsilon = p_Z(z_0) \cdot \epsilon_0 + p_Z(z_1) \cdot \epsilon_1 + p_Z(z_2) \cdot \epsilon_2 + p_Z(z_3) \cdot \epsilon_3
$$

Again, if we don't want to incorporate any special knowledge about the prior distribution we could assume the uniform distribution.

So the new epsilon for our example would be:

In [14]:
outers = channel.hyper(C, pi)[0]
e = 0

for i in range(num_persons + 1):  # possible outputs are 0..num_persons
    e += outers[i] * get_worst_epsilon(posteriors, column=i)

print("Differential Privacy epsilon:", e)

Differential Privacy epsilon: 2.7261860496579025


As expected, this is lower than the usual $\epsilon$, since that one represents the worst case.