Maintaining Data Privacy in Association Rule Mining Using the MASK Algorithm

By
Salah Ismail, student ID: 5239380
Amir Mashmool, student ID: 5245307
Kozy-Korpesh Tolep, student ID: 5302354

## Background

This paper discusses the challenge of obtaining accurate input data for data mining services while addressing privacy concerns. It explores whether users can be motivated to provide correct information by guaranteeing privacy protection during the mining process, and it proposes a scheme that distorts user data probabilistically, ensuring a high level of privacy while still yielding accurate mining results. It is this distorted information that is eventually supplied to the data miner, along with a description of the distortion procedure. The performance of the scheme is validated using real and synthetic datasets.


## Dataset 

The paper assumes a database model where each customer's data is represented as a tuple consisting of a fixed-length sequence of 1's and 0's. This model is commonly used for market-basket databases, where columns represent items sold by a supermarket, and each row represents a customer's purchases (1 indicating a purchase and 0 indicating no purchase).

The number of 1's (purchases) in the database is assumed to be significantly smaller than the number of 0's (non-purchases). In short, the database is modeled as a large, disk-resident, two-dimensional sparse boolean matrix.
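
As a minimal sketch of this data model, the snippet below encodes a few made-up baskets as a 0/1 matrix with NumPy. The item catalogue, the baskets, and the helper name `to_boolean_matrix` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def to_boolean_matrix(baskets, items):
    """Encode baskets (lists of item names) as a 0/1 matrix.

    Rows correspond to customers, columns to items.
    """
    column = {item: j for j, item in enumerate(items)}
    T = np.zeros((len(baskets), len(items)), dtype=int)
    for i, basket in enumerate(baskets):
        for item in basket:
            T[i, column[item]] = 1
    return T

# Hypothetical catalogue and baskets, for illustration only.
items = ["Bread", "Milk", "Diapers", "Beer", "Eggs", "Cola"]
baskets = [["Bread", "Milk"], ["Bread", "Milk", "Diapers"], ["Beer"]]

T = to_boolean_matrix(baskets, items)
density = T.mean()  # fraction of 1's in the matrix
```

Real market-basket data would typically be far sparser than this toy matrix, which is why the paper treats the database as a sparse matrix.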

## Mining Objectives

The mining objective is to efficiently discover all frequent itemsets in the database, which correspond to statistically significant and strong association rules. Each rule has a support factor indicating its frequency and a confidence factor representing its strength. The goal is to find the interesting rules: a rule is said to be “interesting” if its support and confidence are greater than the user-defined thresholds sup_min and con_min, respectively.

Example of association rules over two transactions:

Transaction 1: {Bread, Milk}
Transaction 2: {Bread, Milk, Diapers}

Taking the support of a rule X → Y as the fraction of transactions that contain every item of X and Y, and its confidence as the fraction of transactions containing X that also contain Y, these two transactions give:
Bread → Milk (Support: 100%, Confidence: 100%)
   This rule indicates that if a customer buys bread, they are likely to buy milk as well.
Bread, Milk → Diapers (Support: 50%, Confidence: 50%)
    This rule states that a customer who buys both bread and milk also buys diapers in half of the cases.
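
Support and confidence can be computed directly from a list of transactions. The sketch below uses the standard definitions (support of an itemset is the fraction of transactions containing it; confidence of X → Y is the support of X and Y together divided by the support of X) and applies them to just the two transactions listed above. The function names are illustrative.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, divided by support of antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

transactions = [{"Bread", "Milk"}, {"Bread", "Milk", "Diapers"}]

# (support, confidence) for Bread -> Milk
rule1 = (support(transactions, {"Bread", "Milk"}),
         confidence(transactions, {"Bread"}, {"Milk"}))

# (support, confidence) for Bread, Milk -> Diapers
rule2 = (support(transactions, {"Bread", "Milk", "Diapers"}),
         confidence(transactions, {"Bread", "Milk"}, {"Diapers"}))
```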



## The privacy metric

As stated in the paper, the privacy metric is “with what probability can a given 1 or 0 in the true matrix be reconstructed”.
In other words, it measures the probability with which an entry of the true user data can be reconstructed from its distorted version. It focuses on individual entries within customer tuples. For many applications, customers may want more privacy for their 1's (purchases) than for their 0's (non-purchases).

## Quantifying MASK’s Privacy

In this section, we present the distortion procedure used by the MASK scheme and quantify the privacy provided by the procedure, as per the above privacy metric.

### Distortion Procedure

A customer tuple can be considered to be a random vector $X = \{X_i\}$ such that $X_i \in \{0, 1\}$.
We generate the distorted vector from this customer tuple by computing $Y = \mathrm{distort}(X)$, where $Y_i = X_i \oplus \overline{r_i}$ ($\oplus$ denotes XOR) and $\overline{r_i}$ is the complement of $r_i$, a random variable with probability mass function $f(r) = \mathrm{Bernoulli}(p)$, $0 \leq p \leq 1$. That is, $r_i$ takes the value 1 with probability $p$ and 0 with probability $1 - p$.

The net effect of the above computation is that the $i^{th}$ element of $X$ is kept the same with probability $p$ and is flipped with probability $1 - p$. All the customer tuples are distorted in this fashion and make up the database supplied to the miner. In effect, the miner receives a probabilistic function of the true customer database.
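
The procedure can be sketched in NumPy as follows. This is an illustration in the notation of this section, where $p$ is the probability of keeping an entry, and not the authors' implementation. The function name `distort`, the `rng` argument, and the synthetic input are assumptions.

```python
import numpy as np

def distort(X, p, rng):
    """Return Y with Y_i = X_i XOR (NOT r_i), where r_i ~ Bernoulli(p)."""
    X = np.asarray(X, dtype=bool)
    r = rng.random(X.shape) < p  # r_i = 1 with probability p
    return np.logical_xor(X, ~r).astype(int)

rng = np.random.default_rng(0)

# Synthetic sparse tuples (about 10% ones), for illustration only.
X = (rng.random((20000, 5)) < 0.1).astype(int)
Y = distort(X, p=0.9, rng=rng)

# Empirical fraction of entries left unchanged; this should be close to p.
kept = (X == Y).mean()
```

With $p = 1$ the tuples pass through unchanged, and with $p = 0$ every entry is flipped.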

### Reconstruction Probability of a 1
$\mathcal{R}_1(p)$ is the probability with which a ‘1’ can be reconstructed from the distorted entry, and is given as follows:

$\mathcal{R}_1(p)=\frac{s_0 \times p^2}{s_0 \times p+\left(1-s_0\right) \times(1-p)}+\frac{s_0 \times(1-p)^2}{s_0 \times(1-p)+\left(1-s_0\right) \times p}$

where $s_0$ is the average support of an item in the database.
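
The two terms can be read as a Bayes-rule argument. As a sketch, assume that a true entry is 1 with prior probability $s_0$ and that each entry is kept with probability $p$, so that

$P(Y_i = 1 \mid X_i = 1) = p, \qquad P(Y_i = 0 \mid X_i = 1) = 1 - p$

$P(X_i = 1 \mid Y_i = 1) = \frac{s_0 \times p}{s_0 \times p + (1 - s_0) \times (1 - p)}, \qquad P(X_i = 1 \mid Y_i = 0) = \frac{s_0 \times (1 - p)}{s_0 \times (1 - p) + (1 - s_0) \times p}$

Summing over the two possible distorted values of a true 1 then gives

$\mathcal{R}_1(p) = P(Y_i = 1 \mid X_i = 1) \, P(X_i = 1 \mid Y_i = 1) + P(Y_i = 0 \mid X_i = 1) \, P(X_i = 1 \mid Y_i = 0)$

which expands to the expression for $\mathcal{R}_1(p)$ stated above.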

### The General Reconstruction Equation
This section gives the relationship between $p$ and the reconstruction probability for the general case where the customer may wish to protect both her 1's and her 0's, but her concern to keep the 1's private is greater than that for the 0's.

$\mathcal{R}_0(p)$ is the probability with which a ‘0’ can be reconstructed from the distorted entry, and is given as follows:

$\mathcal{R}_0(p)=\frac{\left(1-s_0\right) \times p^2}{\left(1-s_0\right) \times p+s_0 \times(1-p)}+\frac{\left(1-s_0\right) \times(1-p)^2}{s_0 \times p+\left(1-s_0\right) \times(1-p)}$

Our aim is to minimize a weighted average of $\mathcal{R}_1(p)$ and $\mathcal{R}_0(p)$, which corresponds to minimizing the probability of reconstruction of both 1's and 0's. The $\textbf{total reconstruction probability}$, $\mathcal{R}(p)$, is then given as:

$\mathcal{R}(p)=a \mathcal{R}_1(p)+(1-a) \mathcal{R}_0(p)$

where $a$ is the weight given to 1's over 0's.


## Privacy Measure
Armed with the ability to compute the total reconstruction probability, we now simply define $\textbf{user privacy}$, $\mathcal{P}(p)$, as the following percentage:

$\mathcal{P}(p)=(1-\mathcal{R}(p)) \times 100$
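
To get a feel for how the privacy measure responds to $p$, the sketch below evaluates the formulas above for a few values of $p$. The settings ($s_0 = 0.01$, $a = 1$) and the function names are hypothetical choices made for illustration; they are not results from the paper.

```python
def reconstruction_probabilities(s_0, p):
    """Return (R_1(p), R_0(p)) for average item support s_0, with 0 < s_0 < 1."""
    r1 = (s_0 * p**2 / (s_0 * p + (1 - s_0) * (1 - p))
          + s_0 * (1 - p)**2 / (s_0 * (1 - p) + (1 - s_0) * p))
    r0 = ((1 - s_0) * p**2 / ((1 - s_0) * p + s_0 * (1 - p))
          + (1 - s_0) * (1 - p)**2 / (s_0 * p + (1 - s_0) * (1 - p)))
    return r1, r0

def privacy(s_0, a, p):
    """User privacy P(p) in percent, with weight a on the 1's."""
    r1, r0 = reconstruction_probabilities(s_0, p)
    return (1 - (a * r1 + (1 - a) * r0)) * 100

# Hypothetical settings: 1% average support, all weight on the 1's.
s_0, a = 0.01, 1.0
table = {p: privacy(s_0, a, p) for p in (0.5, 0.7, 0.9, 1.0)}
```

Among the values tried, privacy is highest at $p = 0.5$, where a distorted entry carries no information about the true one, and falls to 0 at $p = 1$, where nothing is distorted. The formulas are also symmetric in $p$ and $1 - p$.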



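The code that follows uses these names:

distortion_probability = the probability with which each entry is flipped (this is $1 - p$ in the notation of the distortion procedure, where $p$ is the probability of keeping an entry)

distort_dataset() = a function that distorts the dataset according to distortion_probability

T = the original (true) matrix dataset

D = the distorted matrix dataset obtained with the given distortion probability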

In [None]:
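import numpy as np

def distort_dataset(T, distortion_probability):
    # Flip each entry independently with probability distortion_probability
    # (that is, 1 - p, where p is the probability of keeping an entry).
    T = np.asarray(T).astype(bool)
    mask = np.random.random(T.shape) < distortion_probability
    # XOR flips exactly the masked entries; return the result as 0/1 integers.
    D = np.logical_xor(T, mask).astype(int)
    return D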


$s_0$ is the average support of an item in the database.

$a$ is the weight given to 1's over 0's.

$p$ is the probability with which an entry is kept unchanged.

$\mathcal{P}(p)$ is the user privacy metric, returned as a percentage.

In [None]:
def user_privacy(s_0, a, p):
    # Reconstruction probability of a 1 (R1_p) and of a 0 (R0_p)
    R1_p = ((s_0 * p ** 2) / (s_0 * p + (1 - s_0) * (1 - p))
            + (s_0 * (1 - p) ** 2) / (s_0 * (1 - p) + (1 - s_0) * p))
    R0_p = (((1 - s_0) * p ** 2) / ((1 - s_0) * p + s_0 * (1 - p))
            + ((1 - s_0) * (1 - p) ** 2) / (s_0 * p + (1 - s_0) * (1 - p)))

    # Total reconstruction probability: weight a on the 1's, (1 - a) on the 0's
    R_p = a * R1_p + (1 - a) * R0_p

    # User privacy as a percentage
    P_p = (1 - R_p) * 100

    return P_p
