# Node Privacy
Reference implementations of some of the techniques described in the paper
[Analyzing Graphs with Node Differential Privacy](https://privacytools.seas.harvard.edu/files/privacytools/files/chp3a10.10072f978-3-642-36594-2_26.pdf)
by Shiva Prasad Kasiviswanathan, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith.

## Library Imports
We use [networkx](https://networkx.org) to perform the required graph computations, [scipy](https://www.scipy.org) (in particular the optimisation algorithms provided by
[scipy.optimize](https://docs.scipy.org/doc/scipy/reference/optimize.html))
to solve the required optimisation problems, and the implementation of the Laplace mechanism provided by [RelM](https://github.com/anusii/RelM) to release the differentially private query responses.

In [1]:
import numpy as np
import networkx as nx

import scipy.optimize
import scipy.interpolate
import scipy.special

from relm.mechanisms import LaplaceMechanism

<hr style="border:2px solid gray"> </hr>

## Convex Optimisation
The authors describe an algorithm for computing a low-sensitivity approximation to queries which are linear in the degree distribution of a graph $G$. This algorithm involves solving a convex optimisation problem defined in terms of a flow graph derived from $G$. As such, we will call this algorithm the quasiflow approximation algorithm. The process consists of three steps:
  1. Derive the appropriate flow graph from $G$,
  2. Solve the required convex optimisation problem,
  3. Add noise to the approximate query response scaled according to its sensitivity.

#### Generate a random graph

In [2]:
n = 2**7
p = 2 ** -6
G = nx.random_graphs.gnp_random_graph(n, p)

### Compute the exact query responses
The algorithm in question works for queries which are linear in the degree distribution of a graph. That is, queries which can be written as $$f(G) = \sum_{v \in G} h(\text{deg}_v(G))$$ for some concave function $h$. Three examples of such queries are the number of edges in a graph $\left(h(i) = i/2\right)$, the number of nodes in a graph $\left(h(i) = 1\right)$, and the number of $k$-stars in a graph $\left(h(i) = \binom{i}{k}\right)$.
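
To make the sum formula concrete, here is a minimal sketch that evaluates it directly on a toy graph. The path graph `P` and the choice $k = 2$ are assumptions made purely for illustration; they are not part of the experiments in this notebook.

```python
import networkx as nx
from scipy.special import comb

# Toy graph (an assumption for illustration): the path 0 - 1 - 2 - 3.
P = nx.path_graph(4)
degrees = [d for _, d in P.degree()]  # the two end nodes have degree 1, the inner nodes degree 2

# Evaluate f(G) = sum_v h(deg_v(G)) for three choices of h.
edge_count = sum(d / 2 for d in degrees)                       # h(i) = i / 2
node_count = sum(1 for _ in degrees)                           # h(i) = 1
two_star_count = sum(comb(d, 2, exact=True) for d in degrees)  # h(i) = C(i, 2)

print(edge_count, node_count, two_star_count)
```

Only the two inner nodes of the path are centres of a 2-star, so the last sum counts exactly two of them.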

In [3]:
def exact_count(G, h):
    """
    Compute the exact value of a query which is linear in the degree distribution of G
    
    Parameters:
        G: An undirected graph
        h: An array describing a query which is linear in the degree distribution of G
        
    Returns:
        The exact value of the query evaluated on G.
    """
    x = np.arange(len(h))
    h_fun = scipy.interpolate.interp1d(x, h)
    degree_histogram = np.array(nx.degree_histogram(G))
    exact_count = degree_histogram @ h_fun(np.arange(len(degree_histogram)))
    return exact_count

#### Compute the exact counts

In [4]:
# Edge count
h_edge = np.arange(n+1) / 2.0
edge_count = exact_count(G, h_edge)

# Node count
h_node = np.ones(n+1)
node_count = exact_count(G, h_node)

# kstar count
k = 3
h_kstar = scipy.special.comb(np.arange(n+1), k)
kstar_count = exact_count(G, h_kstar)

### Build the flow graph
The flow graph used in the quasiflow approximation algorithm is constructed by taking two copies of each node in $G$, one labeled "left" and one labeled "right". Each "left" copy is connected to a source node $s$ via an edge with capacity $D$. Each "right" copy is connected to a sink node $t$ via an edge with capacity $D$. Finally, for each edge $\{u, v\}$ of $G$, the "left" copy of $u$ is connected to the "right" copy of $v$, and the "left" copy of $v$ to the "right" copy of $u$, via edges with capacity 1.
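
As a small sanity check of this construction, the sketch below builds the flow graph for a star with three leaves and computes its maximum flow for two values of $D$. The helper `toy_flow_graph` and the star graph `S` are hypothetical names introduced only for this example.

```python
import networkx as nx


def toy_flow_graph(G, D):
    """Hypothetical helper mirroring the construction described above."""
    F = nx.DiGraph()
    for v in G.nodes():
        F.add_edge("s", ("left", v), capacity=D)
        F.add_edge(("right", v), "t", capacity=D)
    for u, v in G.edges():
        # Each undirected edge contributes one unit-capacity arc per direction.
        F.add_edge(("left", u), ("right", v), capacity=1)
        F.add_edge(("left", v), ("right", u), capacity=1)
    return F


S = nx.star_graph(3)  # centre 0 joined to leaves 1, 2, 3

# D = 3 is at least the maximum degree, so every unit-capacity arc can be saturated.
flow_full = nx.maximum_flow_value(toy_flow_graph(S, 3), "s", "t")
# D = 2 is below the degree of the centre, so the centre's flow is truncated.
flow_truncated = nx.maximum_flow_value(toy_flow_graph(S, 2), "s", "t")
print(flow_full, flow_truncated)
```

With $D = 3$ the maximum flow equals twice the number of edges (6), while with $D = 2$ the capacity on the edges at the centre limits it to 4.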

In [5]:
def build_flow_graph(G, D):
    """
    Build a flow graph for G
    
    Parameters:
        G: An undirected graph
        D: The capacity for edges between nodes of G and
           the source/sink nodes in the flow graph
           
    Returns:
        A flow graph whose max flow yields an approximate query response
    """
    V_left = list(zip(["left"] * len(G), G.nodes()))
    V_right = list(zip(["right"] * len(G), G.nodes()))
    F = nx.DiGraph()
    F.add_nodes_from(V_left)
    F.add_nodes_from(V_right)
    F.add_nodes_from("st")
    F.add_weighted_edges_from([("s", vl, D) for vl in V_left], weight="capacity")
    F.add_weighted_edges_from([(vr, "t", D) for vr in V_right], weight="capacity")
    F.add_weighted_edges_from(
        [(("left", u), ("right", v), 1) for u, v in G.edges()], weight="capacity"
    )
    F.add_weighted_edges_from(
        [(("left", v), ("right", u), 1) for u, v in G.edges()], weight="capacity"
    )
    return F

In [6]:
D = 2**3
F = build_flow_graph(G, D)

### Solve the required convex optimisation problem
The quasiflow approximation is defined as the maximal value, over all flows on the flow graph described above, of the objective function $\text{obj}_h = \sum_{v \in V} h(\text{Fl}(v))$, where $\text{Fl}(v)$ is the number of units of flow passing from $s$ to the "left" copy of $v$ in the flow $\text{Fl}$.
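
As a short check on this definition (a sketch, assuming $h$ is non-decreasing and every node of $G$ has degree at most $D$): sending one unit along every "left"-to-"right" edge gives a feasible flow $\text{Fl}^{*}$, and no feasible flow can push more than $\text{deg}_v(G)$ units through the "left" copy of $v$, so the maximum recovers the exact query:

$$
\begin{aligned}
\text{Fl}^{*}(v) &= \text{deg}_v(G) \leq D && \text{for every } v \in V, \\
\text{Fl}(v) &\leq \text{deg}_v(G) && \text{for every feasible flow } \text{Fl}, \\
\max_{\text{Fl}} \sum_{v \in V} h(\text{Fl}(v)) &= \sum_{v \in V} h(\text{deg}_v(G)) = f(G).
\end{aligned}
$$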

In [7]:
def bounded_degree_quasiflow(F, h, D):
    """
    Compute the bounded-degree quasiflow approximation to a query

    Parameters:
        F: A flow graph as returned by build_flow_graph
        h: An array describing a query which is linear in the degree distribution of G
        D: A bound on the capacities in the flow graph derived from G

    Returns:
        The maximal value of sum_v h(Fl(v)) over all flows Fl on F
    """
    nodes = list(F.nodes())
    edges = list(F.edges())

    # Node-edge incidence matrix: -1 where an edge leaves a node, +1 where it enters
    node_index = {node: i for i, node in enumerate(nodes)}
    incidence = np.zeros((len(nodes), len(edges)))
    for j, (u, v) in enumerate(edges):
        incidence[node_index[u], j] = -1
        incidence[node_index[v], j] = 1

    capacities = np.array([F.edges[e]["capacity"] for e in edges])
    x0 = np.random.random(capacities.size) * capacities
    # Select the edges leaving the source; their flows are the values Fl(v)
    mask = np.array([edge[0] == "s" for edge in edges])
    bounds = [(0, capacity) for capacity in capacities]
    # Conserve flow at every node except "s" and "t", which were added last
    constraint = scipy.optimize.LinearConstraint(incidence[:-2], 0, 0)

    x = np.arange(D + 1)
    h_fun = scipy.interpolate.interp1d(x, h[:D+1])
    # minimize() minimises, so negate the objective in order to maximise it
    f = lambda flow: -np.sum(h_fun(flow[mask]))
    res = scipy.optimize.minimize(
        fun=f, x0=x0, bounds=bounds, constraints=[constraint]
    )
    return -res.fun

In [8]:
# Edge count
bd_quasiflow_edge = bounded_degree_quasiflow(F, h_edge, D)

# Node count
bd_quasiflow_node = bounded_degree_quasiflow(F, h_node, D)

# kstar count
bd_quasiflow_kstar = bounded_degree_quasiflow(F, h_kstar, D)

### Add noise scaled according to the sensitivity of the bounded-degree quasiflow
We create a differentially private release mechanism by adding Laplace noise, scaled according to its sensitivity, to the bounded-degree quasiflow computed above. The sensitivity of the bounded-degree quasiflow is given by $\lVert h \rVert_{\infty} + D\lVert h^{\prime} \rVert_{\infty}$ where $\lVert h \rVert_{\infty} = \max_{0 \leq x \leq D} h(x)$ and $\lVert h^{\prime} \rVert_{\infty} = \max_{0 \leq x < D} |h(x+1) - h(x)|$ is the Lipschitz coefficient of $h$ on $[0, D]$.
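
To make the formula concrete, the sketch below evaluates it for the three example queries with $D = 8$. The dictionary names are illustrative, and NumPy's Laplace sampler is used as a stand-in for the RelM mechanism so that the example stays self-contained; that substitution is an assumption, not how the notebook releases its results.

```python
import numpy as np
from scipy.special import comb

D = 8
epsilon = 1.0
i = np.arange(D + 1)

# The three example queries, restricted to degrees 0, ..., D.
queries = {
    "edge": i / 2.0,
    "node": np.ones(D + 1),
    "3-star": comb(i, 3),
}

sensitivities = {}
for name, h in queries.items():
    h_sup = np.max(h)                         # max of h on [0, D]
    h_lipschitz = np.max(np.abs(np.diff(h)))  # largest one-step change of h
    sensitivities[name] = h_sup + D * h_lipschitz

# Stand-in for the Laplace mechanism: noise with scale sensitivity / epsilon.
rng = np.random.default_rng(0)
noise = {name: rng.laplace(scale=s / epsilon) for name, s in sensitivities.items()}
print(sensitivities)
```

The 3-star query has by far the largest sensitivity of the three (224, against 8 for edges and 1 for nodes), so at the same $\epsilon$ its release is much noisier.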

Because the Laplace distributed random variables are real-valued, the differentially private query response will be real-valued despite the exact query response being integer-valued.


In [9]:
# Create a differentially private release mechanism
epsilon = 1.0

# The sensitivity formula above multiplies the Lipschitz coefficient of h by D

# Edge count
sensitivity_edge = np.max(h_edge[:(D+1)]) + D * np.max(np.abs(h_edge[1:(D+1)] - h_edge[:D]))
mechanism_edge = LaplaceMechanism(epsilon=epsilon, sensitivity=sensitivity_edge)
dp_edge_count = mechanism_edge.release(np.array([bd_quasiflow_edge]))[0]

# Node count
sensitivity_node = np.max(h_node[:(D+1)]) + D * np.max(np.abs(h_node[1:(D+1)] - h_node[:D]))
mechanism_node = LaplaceMechanism(epsilon=epsilon, sensitivity=sensitivity_node)
dp_node_count = mechanism_node.release(np.array([bd_quasiflow_node]))[0]

# kstar count
# Note: The sensitivity of the kstar count is much greater than that of the edge and node counts
sensitivity_kstar = np.max(h_kstar[:(D+1)]) + D * np.max(np.abs(h_kstar[1:(D+1)] - h_kstar[:D]))
mechanism_kstar = LaplaceMechanism(epsilon=epsilon, sensitivity=sensitivity_kstar)
dp_kstar_count = mechanism_kstar.release(np.array([bd_quasiflow_kstar]))[0]

#### Display results

In [10]:
# Edge count
print("Exact edge count = %f" % edge_count)
print("Approximate edge count = %f" % bd_quasiflow_edge)
print("Differentially private edge count = %f\n" % dp_edge_count)

# Node count
print("Exact node count = %f" % node_count)
print("Approximate node count = %f" % bd_quasiflow_node)
print("Differentially private node count = %f\n" % dp_node_count)

# kstar count
print("Exact %i-star count = %f" % (k, kstar_count))
print("Approximate %i-star count = %f" % (k, bd_quasiflow_kstar))
print("Differentially private %i-star count = %f\n" % (k, dp_kstar_count))

Exact edge count = 119.000000
Approximate edge count = 119.000000
Differentially private edge count = 117.148070

Exact node count = 128.000000
Approximate node count = 128.000000
Differentially private node count = 127.806005

Exact 3-star count = 125.000000
Approximate 3-star count = 123.000000
Differentially private 3-star count = 117.834289



<hr style="border:2px solid gray"> </hr>

## Linear Programming
The authors describe a second approximation algorithm for the number of copies of a small template graph $H$ that are contained in a graph $G$. This algorithm involves solving a linear programming problem. The process consists of three steps:
  1. Formulate the linear programming problem,
  2. Solve the linear programming problem,
  3. Add noise to the approximate query response scaled according to its sensitivity.

#### Generate a random graph

In [11]:
n = 2 ** 7
p = 2 ** -3
G = nx.random_graphs.gnp_random_graph(n, p)

### Compute the exact query responses
This algorithm returns an approximation to the number of copies of some $k$-node subgraph $H$ in $G$.  We give one example of such a query with $k = 3$. 

In [12]:
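# Enumerate the triangles of G; the exact triangle count is len(triangles).
# Triad type "300" is the complete triad, i.e. a triangle in the undirected graph.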
k = 3
triangles = nx.triads_by_type(nx.DiGraph(G))["300"]

### Formulate linear programming problem
To formulate the linear programming problem, we need a set of variables, an objective function, and a set of constraints. Let $C$ be the set of copies of $H$ in $G$ and for every $c \in C$ let $V(c)$ be the vertex set of $c$. Finally, let $\Delta_D f$ be the sensitivity of the query on $D$-bounded graphs.

The variables in our linear program are:
  - $\{X_c\}_{c \in C}$.

The objective function, which is to be maximised, is:
  - $\sum_{c \in C} X_c$.

Subject to the constraints:
  - $0 \leq X_c \leq 1$ for all $c \in C$,
  - $\sum_{c: v \in V(c)} X_c \leq \Delta_D f$ for all $v \in V(G)$.
  
Observe that $\Delta_D f \leq k D (D-1)^{k-2}$.
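
The formulation above can be sketched on a toy instance as follows. The complete graph `K` on six nodes and the deliberately small degree bound $D = 2$ are assumptions chosen so that the constraints actually bind; the names `copies`, `cap` and `lp_value` are illustrative.

```python
from itertools import combinations

import networkx as nx
import numpy as np
import scipy.optimize

K = nx.complete_graph(6)
k, D = 3, 2  # D is deliberately below the true degree (5) to show truncation

# Enumerate the copies of H (triangles) as node triples.
copies = [c for c in combinations(K.nodes(), 3)
          if all(K.has_edge(u, v) for u, v in combinations(c, 2))]

# One row per node, one column per copy: A_ub[v, j] = 1 if node v lies in copy j.
A_ub = np.zeros((K.number_of_nodes(), len(copies)))
for j, c in enumerate(copies):
    A_ub[list(c), j] = 1

cap = k * D * (D - 1) ** (k - 2)  # upper bound on Delta_D f, here 6
res = scipy.optimize.linprog(
    -np.ones(len(copies)),  # linprog minimises, so negate to maximise
    A_ub=A_ub,
    b_ub=np.full(K.number_of_nodes(), cap),
    bounds=(0.0, 1.0),
)
lp_value = -res.fun
print(len(copies), lp_value)
```

Each node of `K` lies in 10 of the 20 triangles but may carry a total weight of at most 6. Summing the six node constraints bounds the objective by $6 \cdot 6 / 3 = 12$, which is attained by setting every $X_c = 0.6$.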

In [13]:
D = 2**2

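# Compute bounded-degree triangle count using linear programming
# linprog minimises, so negate the objective to maximise the sum of the X_c
c = -np.ones(len(triangles))
# A_ub[v, i] = 1 if node v belongs to the i-th triangle
A_ub = np.zeros((n, len(triangles)))
for i, t in enumerate(triangles):
    for node in t.nodes():
        A_ub[node, i] = 1

# Upper bound on the sensitivity of the query on D-bounded graphs
sensitivity = k * D * (D - 1) ** (k - 2)
b_ub = np.ones(n) * sensitivity
bounds = (0.0, 1.0)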

### Solve linear programming problem
We use `scipy.optimize.linprog` to solve the linear program.

In [14]:
res = scipy.optimize.linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)

### Add noise scaled according to the sensitivity of the linear program solution
We create a differentially private release mechanism by adding Laplace noise, scaled according to its sensitivity, to the value of the linear program computed above.

Because the Laplace distributed random variables are real-valued, the differentially private query response will be real-valued despite the exact query response being integer-valued.

In [15]:
# Create a differentially private release mechanism
epsilon = 1.0
mechanism = LaplaceMechanism(epsilon=epsilon, sensitivity=sensitivity)

# Compute the differentially private query response
dp_triangle_count = mechanism.release(np.array([-res.fun]))[0]

#### Display results

In [16]:
print("Exact triangle count = %i" % len(triangles))
print("Bounded-degree triangle count = %f" % -res.fun)
print("Differentially private triangle count = %f\n" % dp_triangle_count)

Exact triangle count = 713
Bounded-degree triangle count = 707.999999
Differentially private triangle count = 719.161563

