# Differential Privacy

_Differential Privacy_ and _Quantitative Information Flow_ can be seen as having essentially the same goal, namely to control the leakage of sensitive information. In this notebook, we explore the similarities and differences between the two approaches.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from func import *
try:
    from qif import *
except ImportError:  # install qif if not available (e.g. when running in Colab)
    import IPython; IPython.get_ipython().run_line_magic('pip', 'install qif')
    from qif import *

### An example scenario

Assume we are interested in the eye color of a certain population $\cal{I} = \{Alice,Bob,Charlie\}$. Let the possible values for each person in $\cal{I}$ be defined by the set $\cal{V}=\{a,b,g\}$, where $a$ stands for _absent_ (i.e. the person is not in this specific database), $b$ stands for _black_ and $g$ for _green_. Each dataset is a tuple $x_0x_1x_2 \in \cal{V}^3$ where $x_0$ represents the eye color of _Alice_, $x_1$ of _Bob_ and $x_2$ of _Charlie_.

So we have

```
X = { 
    aaa, aab, aag, 
    aba, abb, abg,
    aga, agb, agg,
    baa, bab, bag,
         ...
    gga, ggb, ggg
    }
```

Consider now a counting query of the form

```
SELECT COUNT(*)
FROM X
WHERE eye_color = 'b';
```

Its possible output values are

```
Y = {0, 1, 2, 3}
```

We can model this query as a channel $f$ as below.

$$
\begin{array}{|c|c|c|c|c|}
\hline
f & \texttt{0} & \texttt{1} & \texttt{2} & \texttt{3}  \\ \hline
\texttt{aaa} & 1 & 0 & 0 & 0  \\ \hline
\texttt{aab} & 0 & 1 & 0 & 0  \\ \hline 
\texttt{aag} & 1 & 0 & 0 & 0  \\ \hline 
\texttt{aba} & 0 & 1 & 0 & 0  \\ \hline 
... \\ \hline
\texttt{ggg} & 1 & 0 & 0 & 0  \\ \hline 
\end{array}
$$
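The table above can be rebuilt programmatically. The following is a minimal sketch, assuming the databases are enumerated in lexicographic order (`aaa`, `aab`, `aag`, ...); the names `databases` and `f` are chosen here for illustration and are not part of the `func` or `qif` modules.

```python
import itertools
import numpy as np

values = ['a', 'b', 'g']
num_persons = 3
query_value = 'b'

# All 27 databases, in lexicographic order: aaa, aab, aag, aba, ...
databases = [''.join(t) for t in itertools.product(values, repeat=num_persons)]

# f is deterministic: each row has a single 1 in the column equal to
# the number of persons whose eye color matches the query value.
f = np.zeros((len(databases), num_persons + 1))
for row, db in enumerate(databases):
    f[row, db.count(query_value)] = 1.0

print(databases[:4])
print(f[:4])
```

Since every row contains exactly one 1, observing $y$ reveals the exact count of black-eyed persons in the database.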

Now instead of reporting the true answer $y$, we process it further by passing it through a noise channel $H$ and report a possibly different answer $z$. The domain of $z$ is the same as $Y$, i.e. $Z = Y$.

Here we are going to use the following mechanism for $H$.

$$
\begin{array}{|c|c|c|c|c|}
\hline
H & \texttt{0} & \texttt{1} & \texttt{2} & \texttt{3}  \\ \hline
\texttt{0} & \frac{3}{4} & \frac{1}{6} & \frac{1}{18} & \frac{1}{36} \\ \hline
\texttt{1} & \frac{1}{4} & \frac{1}{2} & \frac{1}{6} & \frac{1}{12} \\ \hline
\texttt{2} & \frac{1}{12} & \frac{1}{6} & \frac{1}{2} & \frac{1}{4} \\ \hline
\texttt{3} & \frac{1}{36} & \frac{1}{18} & \frac{1}{6} & \frac{3}{4} \\ \hline
\end{array}
$$

$H$ adds noise to the true answer of $f$; the specific logic it uses is not important for this example. The thing to notice is that, in each row, the true answer of $f$ has the highest probability.

So the whole channel from $X$ to $Z$, i.e. from the database to the noisy query answer, is the cascade $C = fH$ and can be depicted as below.

$$
\begin{array}{|c|c|c|c|c|}
\hline
C & \texttt{0} & \texttt{1} & \texttt{2} & \texttt{3}  \\ \hline
\texttt{aaa} & \frac{3}{4} & \frac{1}{6} & \frac{1}{18} & \frac{1}{36} \\ \hline
\texttt{aab} & \frac{1}{4} & \frac{1}{2} & \frac{1}{6} & \frac{1}{12} \\ \hline
\texttt{aag} & \frac{3}{4} & \frac{1}{6} & \frac{1}{18} & \frac{1}{36} \\ \hline
\texttt{aba} & \frac{1}{4} & \frac{1}{2} & \frac{1}{6} & \frac{1}{12} \\ \hline
...\\ \hline
\texttt{ggg} & \frac{3}{4} & \frac{1}{6} & \frac{1}{18} & \frac{1}{36} \\ \hline 
\end{array}
$$
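The composition of the query with the noise mechanism is an ordinary matrix product. The sketch below is one way to reproduce the table with plain NumPy; it assumes lexicographic ordering of the databases, and the names `f_sketch`, `H_sketch` and `C_sketch` are hypothetical, chosen so as not to be confused with the helpers from `func`.

```python
import itertools
import numpy as np

values = ['a', 'b', 'g']
databases = [''.join(t) for t in itertools.product(values, repeat=3)]

# True query answer (number of 'b' entries) for every database.
counts = [db.count('b') for db in databases]

# One-hot encoding of the counts gives the deterministic channel f.
f_sketch = np.eye(4)[counts]

# Noise channel H, copied from the table above.
H_sketch = np.array([
    [3/4,  1/6,  1/18, 1/36],
    [1/4,  1/2,  1/6,  1/12],
    [1/12, 1/6,  1/2,  1/4 ],
    [1/36, 1/18, 1/6,  3/4 ],
])

# Sanity checks: H is row-stochastic and its diagonal dominates each row.
assert np.allclose(H_sketch.sum(axis=1), 1.0)
assert (H_sketch.argmax(axis=1) == np.arange(4)).all()

# Cascade: each row of C is the row of H selected by the true answer.
C_sketch = f_sketch @ H_sketch
print(C_sketch[:4])
```

Because $f$ is deterministic, the product simply copies row $f(x)$ of $H$ into row $x$ of $C$, which is why only four distinct rows appear.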

In [2]:
num_persons = 3
values = ['a', 'b', 'g']
num_values = len(values)
query_value = 'b'

In [3]:
C = get_C(num_persons, values, query_value)
print(C)

[[0.75       0.16666667 0.05555556 0.02777778]
 [0.25       0.5        0.16666667 0.08333333]
 [0.75       0.16666667 0.05555556 0.02777778]
 [0.25       0.5        0.16666667 0.08333333]
 [0.08333333 0.16666667 0.5        0.25      ]
 [0.25       0.5        0.16666667 0.08333333]
 [0.75       0.16666667 0.05555556 0.02777778]
 [0.25       0.5        0.16666667 0.08333333]
 [0.75       0.16666667 0.05555556 0.02777778]
 [0.25       0.5        0.16666667 0.08333333]
 [0.08333333 0.16666667 0.5        0.25      ]
 [0.25       0.5        0.16666667 0.08333333]
 [0.08333333 0.16666667 0.5        0.25      ]
 [0.02777778 0.05555556 0.16666667 0.75      ]
 [0.08333333 0.16666667 0.5        0.25      ]
 [0.25       0.5        0.16666667 0.08333333]
 [0.08333333 0.16666667 0.5        0.25      ]
 [0.25       0.5        0.16666667 0.08333333]
 [0.75       0.16666667 0.05555556 0.02777778]
 [0.25       0.5        0.16666667 0.08333333]
 [0.75       0.16666667 0.05555556 0.02777778]
 [0.25       

The following image graphically depicts what we just described (ignore the notions of leakage and utility for now).

![img1](./img1.jpg)

### Assessing the information leakage through _QIF_

To measure the leakage with _QIF_ we must first define the prior distribution over $X$. If we don't have any particular knowledge about it, we use the uniform distribution.

In [4]:
pi = probab.uniform(num_values ** num_persons)  # one entry per database: |V|^|I| = 27
print(pi)

[0.03703704 0.03703704 0.03703704 0.03703704 0.03703704 0.03703704
 0.03703704 0.03703704 0.03703704 0.03703704 0.03703704 0.03703704
 0.03703704 0.03703704 0.03703704 0.03703704 0.03703704 0.03703704
 0.03703704 0.03703704 0.03703704 0.03703704 0.03703704 0.03703704
 0.03703704 0.03703704 0.03703704]


Next, we compute the hyper distribution, which depends on both $C$ and $\pi$.

In [5]:
from print_hyper import print_hyper
print_hyper(C, pi)

---------------------------------------
|    0.35    0.31    0.21    0.13 |
---------------------------------------
|    0.08    0.02    0.01    0.01 |
|    0.03    0.06    0.03    0.02 |
|    0.08    0.02    0.01    0.01 |
|    0.03    0.06    0.03    0.02 |
|    0.01    0.02    0.09    0.07 |
|    0.03    0.06    0.03    0.02 |
|    0.08    0.02    0.01    0.01 |
|    0.03    0.06    0.03    0.02 |
|    0.08    0.02    0.01    0.01 |
|    0.03    0.06    0.03    0.02 |
|    0.01    0.02    0.09    0.07 |
|    0.03    0.06    0.03    0.02 |
|    0.01    0.02    0.09    0.07 |
|    0.00    0.01    0.03    0.22 |
|    0.01    0.02    0.09    0.07 |
|    0.03    0.06    0.03    0.02 |
|    0.01    0.02    0.09    0.07 |
|    0.03    0.06    0.03    0.02 |
|    0.08    0.02    0.01    0.01 |
|    0.03    0.06    0.03    0.02 |
|    0.08    0.02    0.01    0.01 |
|    0.03    0.06    0.03    0.02 |
|    0.01    0.02    0.09    0.07 |
|    0.03    0.06    0.03    0.02 |
|    0.08    0.02   

Now for each column, i.e. each possible outcome $z$, we model the threat as the adversary guessing the $x$ with the highest posterior probability within that column. So we pick the maximum probability of each column and then **we weigh** it by the corresponding outer probability, i.e. the probability of that $z$ occurring. The sum of these weighted maxima is the posterior Bayes vulnerability of $C$ under $\pi$.

### Assessing the information leakage through _Differential Privacy_
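The computation just described can be sketched in a few lines of NumPy. This is an illustrative version under the uniform prior used above, not necessarily how the `qif` library computes it; the function name `posterior_vulnerability` and the rebuilt channel `C_np` are assumptions introduced here.

```python
import itertools
import numpy as np

def posterior_vulnerability(pi, C):
    """Weighted sum of the column maxima of the hyper distribution."""
    joint = pi[:, None] * C           # joint[x, z] = pi[x] * C[x, z]
    outer = joint.sum(axis=0)         # probability of each outcome z
    posteriors = joint / outer        # one posterior over X per column
    return (outer * posteriors.max(axis=0)).sum()

# Rebuild the channel of the running example (lexicographic row order assumed).
H = np.array([
    [3/4,  1/6,  1/18, 1/36],
    [1/4,  1/2,  1/6,  1/12],
    [1/12, 1/6,  1/2,  1/4 ],
    [1/36, 1/18, 1/6,  3/4 ],
])
databases = [''.join(t) for t in itertools.product('abg', repeat=3)]
C_np = H[[db.count('b') for db in databases]]

pi_np = np.full(len(databases), 1 / len(databases))

prior_vuln = pi_np.max()
post_vuln = posterior_vulnerability(pi_np, C_np)
print(prior_vuln, post_vuln, post_vuln / prior_vuln)
```

With a uniform prior the result reduces to the sum of the column maxima of $C$ divided by 27, i.e. $(3/4 + 1/2 + 1/2 + 3/4)/27$, so observing $z$ multiplies the adversary's chance of guessing the database by 2.5.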

Differential privacy works a bit differently. 

For each possible outcome $z$, i.e. each column of $C$, it models the threat as the largest multiplicative gap between two elements of that column. That is, it compares the probabilities that two databases assign to $z$ and computes the smallest $\epsilon$ which satisfies the following inequality for every $z$.

$$
C_{x_1 z} \leq e^{\epsilon} \cdot C_{x_2z}
$$

where $x_1$ and $x_2$ are _adjacent_ or _neighbor_ databases, meaning that they differ in the presence of, or in the value associated with, exactly one individual. 
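As a rough illustration, the smallest such $\epsilon$ for the running example can be found by brute force over all adjacent pairs. The sketch below treats two databases as adjacent when their tuples differ in exactly one position (which covers both a changed value and a changed presence, since absence is encoded as `a`); the helper `adjacent` and the rebuilt channel `C_dp` are names assumed here for illustration.

```python
import itertools
import numpy as np

H = np.array([
    [3/4,  1/6,  1/18, 1/36],
    [1/4,  1/2,  1/6,  1/12],
    [1/12, 1/6,  1/2,  1/4 ],
    [1/36, 1/18, 1/6,  3/4 ],
])
databases = [''.join(t) for t in itertools.product('abg', repeat=3)]
C_dp = H[[db.count('b') for db in databases]]

def adjacent(x1, x2):
    """True when the two databases differ in exactly one individual's entry."""
    return sum(u != v for u, v in zip(x1, x2)) == 1

# Smallest epsilon such that C[x1, z] <= exp(epsilon) * C[x2, z]
# for all outcomes z and all ordered pairs of adjacent databases.
# Every entry of C is strictly positive, so the ratios are well defined.
epsilon = max(
    np.log(C_dp[i] / C_dp[j]).max()
    for i, x1 in enumerate(databases)
    for j, x2 in enumerate(databases)
    if adjacent(x1, x2)
)
print(epsilon, np.exp(epsilon))
```

Adjacent databases change the true count by at most one, and neighboring rows of $H$ differ by a factor of exactly 3 in every column, so the bound works out to $\epsilon = \ln 3$.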