Maintaining Data Privacy in Association Rule Mining using MASK algorithm

By

Kozy-Korpesh Tolep, matricola: 5302354  

Salah Ismail , matricola: 5239380

Amir Mashmool, matricola: 5245307

## Background

This paper discusses the challenge of obtaining accurate input data for data mining services while addressing privacy concerns. It explores whether users can be motivated to provide correct information by guaranteeing privacy protection during the mining process. proposing a scheme that distorts user data probabilistically, ensuring a high level of privacy while maintaining accurate mining results. It is this distorted information that is eventually supplied to the data miner, along with a description of the distortion procedure. The performance of the scheme is validated using real and synthetic datasets.


## Dataset

The paper assumes a database model where each customer's data is represented as a tuple consisting of a fixed-length sequence of 1's and 0's. This model is commonly used for market-basket databases, where columns represent items sold by a supermarket, and each row represents a customer's purchases (1 indicating a purchase and 0 indicating no purchase).

The assumption is that the number of 1's (purchases) in the database is significantly smaller than the number of 0's (non-purchases) ==> In short, the database is modeled as a large disk-resident two-dimensional sparse boolean matrix.

## Mining Objectives

The mining objective is to efficiently discover all frequent itemsets in the database, which correspond to statistically significant and strong association rules. These rules have a support factor indicating their frequency and a confidence factor representing their strength. The goal is to find interesting rules that surpass user-defined thresholds for support and confidence. A rule is said to be “interesting” if its support and confidence are greater than user-defined thresh-olds $sup_{min}$ and $con_{min}$, respectively.

example of a association rule:

Transaction 1: {Bread, Milk}
Transaction 2: {Bread, Milk, Diapers}

Using these measures, we can discover association rules:


Bread → Milk (Support: 40%, Confidence: 66.6%)
   This rule indicates that if a customer buys bread, they are likely to buy milk as well.

   
Bread, Milk → Diapers (Support: 40%, Confidence: 50%)
    This rule states that if a customer buys both bread and milk, they are likely to buy diapers as well.



## The privacy metric
the mechanism adopted in this paper for achieving privacy is to distort the user data before  it is subject to the mining process.
As stated in the paper the metric is “with what probability can a given 1 or 0 in the true matrix be reconstructed”.
It measures the probability of reconstructing distorted user data. It focuses on individual entries within customer tuples. For many applications, customers may prioritize more privacy for their 1's (purchased actions) compared to their 0's (non-purchased options).

## Quantifying MASK’s Privacy

In this section, we present the distortion procedure used by the MASK scheme and quantify the privacy provided by the procedure, as per the above privacy metric.

### Distortion Procedure

A customer tuple can be considered to be a random vector $ X = \{X_i\}$ , such that $X_i = 0 \ or \ 1$ .
We generate the distorted vector from this customer tuple by computing $Y = distort(X)$ where $Y_i=X_i \ XOR \ \overline r_i$ and $\overline r_i$ is the complement of $r_i$, a random variable with a density function $f(r) = bernoulli(p) \ (0 \leq p \leq 1)$. That is, $r_i$ takes a value 1 with probability $p$ and 0 with probability $1 - p$.

The net effect of the above computation is that the identity of the $i^{th}$ element in $X$ is kept the same with probability $p$ and is flipped with probability $(1 — p)$. All the customer tuples are distorted in this fashion and make up the database supplied to the miner in effect, the miner receives a $probabilistic \ function$ of the true customer database.

### Reconstruction Probability of a 1
${R}_1(p)$ is the probability with which a ‘1’ can be reconstructed from the distorted entry as follow:

$\mathcal{R}_1(p)=\frac{s_0 \times p^2}{s_0 \times p+\left(1-s_0\right) \times(1-p)}+\frac{s_0 \times(1-p)^2}{s_0 \times(1-p)+\left(1-s_0\right) \times p}$

while $s_0$ is the average support of an item in the database.

### The General Reconstruction Equation
the relationship between $p$ and the reconstruction probability for the general case where the customer may wish to protect both his $1’s$ and $0’s$, but his concern to keep the $1’s$ private is more than that for the $0’s$.

${R}_0(p)$ is the probability with which a ‘0’ can be reconstructed from the distorted entry as follow:

$\mathcal{R}_0(\mathrm{p})=\frac{\left(1-s_0\right) \times p^2}{\left(1-s_0\right) \times p+s_0 \times(1-p)}+\frac{\left(1-s_0\right) \times(1-p)^2}{s_0 \times p+\left(1-s_0\right) \times(1-p)}$

Our aim is to minimize a weighted average of ${R}_1(p)$ and ${R}_0(p)$. This corresponds to minimizing the probability of reconstruction of both $1’s$ and $0’s$. The $\textbf {total reconstruction probability}$, ${R}(p)$, is then given as:

$\mathcal{R}(p)=a \mathcal{R}_1(p)+(1-a) \mathcal{R}_0(p)$

where $a$ is the weight given to $1’s$ over $0’s$.


## Privacy Measure
Armed with the ability to compute the total reconstruction probability $\mathcal{R}(p)$, we now simply define $\textbf {user privacy}$ ${P}(p)$ as the following percentage:

$\mathcal{P}(p)=(1-\mathcal{R}(p)) * 100$



distortion_probability = the distortion probability value given to distortion function

distort_dataset() = a function used to distort the dataset based on distortion_probability

T = the original true matrix dataset

D = the distorted matrix dataset obtained with a distortion probability

In [122]:
import warnings
warnings.filterwarnings('ignore')

In [123]:
import numpy as np


def distort_dataset(T, distortion_probability):
    mask = np.random.random(T.shape) > (1 - distortion_probability)
    D = (T + mask) % 2 # %2 is to ensures that all values are either 0 or 1.
    return D


$s_0$ is the average support of an item in the database.

$a$ is the weight given to $1’s$ over $0’s$.

${P}(p)$ is user privacy metric

###### Computing ${R}_1(p)$ and ${R}_0(p)$
###### Computing the total reconstruction probability ${R}(p)$
###### Computing the privacy metric ${P}(p)$

In [124]:
def user_privacy(s_0, a, p):
    r1_p = ((s_0 * (p ** 2)) / ((s_0 * p) + ((1 - s_0) * (1 - p)))) + ((s_0 * ((1 - p) ** 2)) / ((s_0 * (1 - p)) + ((1 - s_0) * p)))
    r0_p = (((1 - s_0) * (p ** 2)) / (((1 - s_0) * p) + (s_0 * (1 - p)))) + ((s_0 * ((1 - p) ** 2)) / ((s_0 * p) + ((1 - s_0) * (1 - p))))

    r_p = (a * r1_p) + ((1 - a) * r0_p)

    p_p = (1 - r_p) * 100

    return p_p


## Mining the Distorted Database (MASK Algorithm)

MASK’s technique is for estimating the true (accurate) supports of itemsets from a distorted database.
We first show how to estimate the supports of $1-itemsets$ (i.e. singletons) and then present the general $n-itemset$ support estimation procedure.

It is important to keep in mind that the miner is provided with both the distorted matrix as well as the distortion probability, that is, it knows the value of $p$ that was used in distorting the true matrix.

### Estimating $1-itemsets$ Supports
Now we have two matrices, first is denoted as $T$ which is original true matrix and second one is the distorted matrix, obtained with a distortion probability of $p$, denoted as $D$.

Lets find out formula for finding support of a 1-itemset. Lets take any column $i$ of $T$, it has some amount of $1’s$ and $0’s$ which are $c_1^T$ and $c_0^T$, while same column $i$ has values $c_1^D$ and $c_0^D$ in the matrix $D$.
With this notation, we calculate the support of $i$ in $T$ using the equation below:

$\mathbf{C}^T=\mathbf{M}^{-1} \mathbf{C}^{\mathrm{D}}$

Where

$M=\left[\begin{array}{cc}p & 1-p \\ 1-p & p\end{array}\right] \  C^D=\left[\begin{array}{c}c_1^D \\ c_0^D\end{array}\right] C^T=\left[\begin{array}{c}c_1^T \\ c_0^T\end{array}\right]$

The M matrix in the above equation incorporates the observation that by our method of distortion, if a column had $n \ 1’s$ in $T$, these $1’s$ will generate approximately $pn \ 1’s$ and $(1 — p)n \ 0’s$ for the same column in $D$. Similarly for the $0’s$ of this column in $T$. Therefore, given $c_1^D$ and $c_0^D$, it is possible to estimate the value of $c_1^T$, that is, the true support of item $i$.


### Estimating $n-itemsets$ Supports
If we extend formula for 1-itemsets we can compute the support for an arbitrary n-itemset. For this general case, we define the matrices as:

$C^D=\left[\begin{array}{c}c_{2^n-1}^D \\ \cdot \\ \cdot \\ \cdot \\ c_1^D \\ c_0^D\end{array}\right] \quad \ C^T=\left[\begin{array}{c}c_{2^n-1}^T \\ \cdot \\ \cdot \\ \cdot \\ c_1^T \\ c_0^T\end{array}\right]$

Here $c_k^T$ should be interpreted as the count of the tuples in $T$ that have the binary form of $k$ (in $n$ digits) for the given itemset (that is, for a 2 -itemset, $c_2^T$ refers to the count of 10's in the columns of $T$ corresponding to that itemset, $c_3^T$ to the count of 11 's, and so on). Similarly, $c_k^D$ is defined for the distorted matrix $D$. Finally, the matrix $\mathbf{M}$ is defined as:

$m_{i, j}=$ The probability that a tuple of the form corresponding to $c_j^T$ in $T$ goes to a tuple of the form corresponding to $c_i^D$ in $D$.

For example, $m_{1, 2}$ for a $2-itemset$ is the probability that a $10$ tuple distorts to a $01$ tuple. Accordingly, $m_{1, 2}=$ $(1-p)(1-p)$. The basis for this formulation lies in the fact that in our distortion procedure, the component columns of an $n-itemset$ are distorted $idependently$. Therefore, we can use the product of the probability terms. As a result matrix $\mathbf{M}$ for a $2-itemset$ is equal to:
$M=\left[\begin{array}{cc}p^2 & p(1-p) & p(1-p) & (1-p)^2 \\ p(1-p) & p^2 & (1-p)^2 & p(1-p) \\ p(1-p) & (1-p)^2 & p^2 & p(1-p) \\ (1-p)^2 & p(1-p) & p(1-p) &p^2  \end{array}\right]$

### The Full Mining Process
After we found $c_i^D$, we can find $c_{2^n-1}^T$ for an $n-itemset$. However, first we need to find $c_i^D$ values themselves. For this purpose, there was proposed implementation of the system which is based on the classical Apriori algorithm.

The Apriori algorithm helps to find out frequent itemsets by gradually increasing number of the itemsets. It uses algorithm called AprioriGen to create candidate itemsets for every pass by using previos frequent itemsets. With help of this algrorithm we can efficiently search frequent itemset in large datasets, by looking for a subsets that likely to be frequent.

Algorithm proposed in the paper, named MASK is based on Apriori, but it has critical difference. For instance, lets consider we are countring 2-itemsets. Apriori will count only one binary form $'11'$ for each tuple in the candidate 2-itemset, which means that both items in the subset has positive value in the dataset. However, Mask goes further and counts different combinations of the 2-itemsets: $'00', '01', '10'$, and $'11'$. Thus, we can keep track of the co-occurrences too. Another important difference is that true support is calculated after each pass, not tuple-by-tuple basis. Furthermore, if the same value of p is used for all columns matrix M is same for all candidates and needs to be calculate only once at the end of the pass. Finally, the size of the Matrix is $O(2^n)$ as it depends on n, of the n-itemset.

## MASK Mining Optimizations

### Linear Number of Counters
First, we need to focus on reducing number of counters. As it can be quite compute-intensive to find square matrix $O(2^n)$ for each itemset. If we closely look on how true support is calculated, we can find some ways for optimization. M matrix is inverted and multiplied to all combination of the n-itemset. As a result, reconstructed support is the weighted sum of the counts of different combinations of $2^n$ componentes of the distorted database. However, we can denote that there are only n+1 distinct weights for $2^n$ weights. For instance, for a 2-itemset estimated reconstructed support can be found by formula:

$s_{est}=a_{1}C_{00}^{D}+a_{1}C_{01}^{D}+a_{1}C_{02}^{D}+a_{1}C_{03}^{D}$

where $C_{00}^{D}$ is the count of xy tuples in the distorted database and $a_{i}$ are the associated weights. Here, weights of $a_2$ and $a_3$ are equal because probability that $'11'$ distorts to $'10'$ is equal to the probability that $'11'$ distorts to $'01'$ both are equal to $p(1-p)$. Thus, reverse component weights are also equal. It means that we need to calculate weights of only three combinations $'00', '11', '01'$ or $'10'$. The above observation can be generalized to an n-itemset. Number of counters can be reduced from $2^n$ to $n+1$. As a result counters of items that has unique number of $1$'s will have distinct number, for example, for a 3-itemset we need to find counter of 4 combinations: one for $'000'$ and $'111'$, one for one occurence of '1' in the triplet $(001, 010, 100)$ and one for two occurences of $'1'$ $(011, 110, 101)$.   
### Reducing Amount of Counting
We can also make reduction in amount of counting with help of simple algebraic properties. Based on the fact that $C_{00} + C_{01} + C_{10} + C_{11}$ must be equal to the cardinality of the database, we can choose to to count of these components.

Counting of only $'11'$ in the 2-itemset will take $O(tlen^2)$ operations, where tlen is the transaction length. It can be reduced using following technique. Initially, for a item-list, we need to keep track of all identifiers that have 1's in the transaction, from this list we remove all singleton itemset identifiers that were calculated in the previos pass. The next stage is to create complement-list, which consists of previousy estimated all frequent 1-itemsets that do not appear in the current transaction. Let the item-list and complement-list be of lenght $m_1$ and $m_2$, so $|m_1+m_2|=F_1$, where $F_1$ is total number of frequent 1-itemsets. If we now, use this technique to calculate 11's, 01's and 10's, it will take $O(m_1^2)+O(m_1m_2)$ operations.


In [125]:
from itertools import combinations
import string
import numpy as np
import time
import pandas as pd

def MASK(dataset, dist_P, min_support, verbose=False):
    all_frequent_itemsets = [] #frequent itemset of all passes
    all_support_values = [] #estimated support for each n-itemset

    columns_list, columns_index_map = generate_items_and_column_map(dataset)

    passes = dataset.shape[1] #number of passes
    start_time = time.time()

    for p in range(passes):
        n = p + 1 #each pass generate n-itemset tuple (path = 0, 1-itemset and etc.)
        print("\nPASS: ", n, "( ",n, "-itemsets)")
        column_name_combinations = generate_column_name_combinations(columns_list, n) #generating all possible combinations of column names for each n-itemset
        if verbose:
          print("column_name_combinations",column_name_combinations)
        #Optimization N2 (Reducing amount of counting)
        if n>1:
          #frequent_itemsets array to keep track n-combination with support LARGER than min_support
          #non_frequent_n_itemsets array to keep track n-combination with support SMALLER than min_support
          column_name_combinations = filter_non_frequent_n_itemsets(column_name_combinations, non_frequent_n_itemsets)
        #we don't need to calculate number of 0's as it can be found by subtracting number of 1's from number of transactions
        possible_ones = np.array(range(1, n + 1)) #quantity of ones that can be in the combination depends of number of n - n=1 possible ones = [1], n=2 possible ones = [1,2]...
        weights = compute_weights(dist_P, n, possible_ones) #weights is array which is computed from formula (#) weights = [weight corresoponding to each possible value of one in the tuple]

        frequent_n_itemsets, support_n_itemsets, non_frequent_n_itemsets = compute_support(n, dataset, columns_index_map, column_name_combinations,
                                                                  weights, dist_P, min_support)

        all_frequent_itemsets.extend(frequent_n_itemsets)
        all_support_values.extend(support_n_itemsets)

        if verbose:
            print("Frequent itemsets for this pass are: ", frequent_n_itemsets)
            print("The estimated support values of them are: ", support_n_itemsets)
            print("Non-frequent itemsets that should not be part of the combinations of the next pass are: ",
                  non_frequent_n_itemsets)

        if np.size(frequent_n_itemsets) == 0:
            break

    print("The algorithm stopped because there are no more frequent itemsets!")

    end_time = time.time()
    execution_time = end_time - start_time
    execution_time = str(round(execution_time, 2))
    if verbose:
        print("Processing time is:", execution_time, "sec.")

    return np.array(all_frequent_itemsets), np.array(all_support_values)


def generate_items_and_column_map(dataset):
    alphabet = string.ascii_uppercase
    columns_list = list(alphabet)[:dataset.shape[1]] #converting column names to list of Alphabet strings [A, B, C, D ...]
    columns_index_map = {item: column_index for column_index, item in enumerate(columns_list)} #creating set where key is the string from columns_list and value is their index {'A':0, 'B':1,'C':2, ...}
    return columns_list, columns_index_map


def generate_column_name_combinations(items, n):
    column_name_combinations = [''.join(c) for c in combinations(items, n)]
    return np.array(column_name_combinations)


def filter_non_frequent_n_itemsets(column_name_combinations, non_frequent_n_itemsets):
    #non_freq_singletons = ['A', 'B',...] non frequent singletons from non_frequent_n_itemsets
    non_freq_singletons = list(set([c for comb in non_frequent_n_itemsets for c in comb]))
    #new column_name_combinations without non_freq_singletons
    column_name_combinations = [comb for comb in column_name_combinations if not any(c in comb for c in non_freq_singletons)]
    return np.array(column_name_combinations)


def compute_weights(dist_P, n, possible_ones):
    #according to formula calculating weights corresponding to possible number of ones
    weights = (dist_P ** (n - possible_ones)) * ((1 - dist_P) ** possible_ones)
    return weights


def compute_support(n, dataset, item_column_map, column_name_combinations, weights, dist_P, min_support):
    frequent_n_itemsets = []
    support_n_itemsets = []
    non_frequent_n_itemsets = []

    for column_name_combination in column_name_combinations:
        columns = [item_column_map[item] for item in column_name_combination]
        # print("columns inside compute support", columns)
        column_values = dataset[:, columns]
        # print("column_values", column_values)
        list_column_sum = column_values.sum(axis=1)
        # print("list_column_sum",list_column_sum)
        # Optimization N2 we have calculated weights for each number of possible "1"s now we need to calculate number of "1"s corresponding to that weight
        number_of_ones = np.array([np.count_nonzero(list_column_sum == i + 1) for i in range(n)])
        number_of_zeros = dataset.shape[0] - np.sum(number_of_ones) #according to optimization N2, we don't need to count number of tuples with 0 only just need to subtract number of cols - cols with 1

        weight_row_zeros = dist_P ** n #weight of tuple consisting only zeros
        estimated_support = (weight_row_zeros * number_of_zeros) + np.sum(weights * number_of_ones) #estimated support according to formula #

        if estimated_support >= min_support:
            frequent_n_itemsets.append(column_name_combination)
            support_n_itemsets.append(estimated_support)
        else:
            non_frequent_n_itemsets.append(column_name_combination)

    return np.array(frequent_n_itemsets), np.array(support_n_itemsets), np.array(non_frequent_n_itemsets)


## Performance Framework
Due to the probabilistic nature of MASK, the reconstructed support values are not expected to match the actual supports exactly. This can lead to errors where the reported support values are either larger or smaller than the actual supports.

Errors in support estimation can have a significant impact, the probabilistic evaluation of supports may incorrectly classify "border-line" itemsets as either frequent or rare. This results in both false positives (itemsets wrongly reported as frequent) and false negatives (itemsets wrongly reported as rare).

We evaluate the mining process under two conditions. The first condition uses the $sup_{min}$ value provided by the user, while the second condition uses a slightly lower $sup_{min}$ value.

We prioritize coverage over precision in this evaluation. Specifically, we consider a 10% reduction in the $sup_{min}$ value (r = 10%).
This means lowering the threshold for what is considered significant, allowing for greater inclusion of items but potentially reducing precision or increasing false positives.

To quantify the errors made, We compare the mining outputs obtained from MASK with those derived from Apriori running on the true (undistorted) database. We compare the results using both the $sup_{min}$ value provided by the user and the lowered $sup_{min}$ value. This allows to evaluate the impact of the lowered $sup_{min}$ value on the mining process and assess the differences between MASK and Apriori.

### Error Metrics

1.Support Error  $\rho$.

This metric reflects the (percentage) average relative error in the reconstructed support values for those itemsets that are correctly  identified to be frequent. Denoting the reconstructed support by rec_sup and the actual support by act_sup, the support error is computed over all frequent itemsets as

$\rho=\frac{1}{|f|} \Sigma_f \frac{\left|r e c_{-} s u p_f-a c t_{-} s u p_f\right|}{a c t_{\_} s u p_f} * 100$

2.Identity Error.

This metric reflects the (percentage) error in identifying frequent itemsets and has two components: $\sigma^{+}$, indicating the percentage of false positives, and $\sigma^{-}$ indicating the percentage of false negatives. Denoting the reconstructed set of frequent itemsets with $R$ and the correct set of frequent itemsets with $F$, these metrics are computed as:

$\sigma^{+}=\frac{|R-F|}{|F|} * 100 \quad \sigma^{-}=\frac{|F-R|}{|F|} * 100$

In [126]:
import numpy as np

def calculate_support_error(mask_frequent, mask_f_support, apriori_f_support):
    """
    Calculates the support error metric.

    Args:
    mask_frequent (list): List of frequent itemsets obtained from MASK.
    mask_f_support (list): List of estimated supports of frequent itemsets from MASK.
    apriori_f_support (list): List of actual supports of frequent itemsets from Apriori.

    Returns:
    float: Support error.
    """
    total_itemsets = len(mask_frequent)

    # Calculate the relative support error for each itemset and sum them up
    relative_errors = [abs(rec_sup - act_sup) / act_sup for rec_sup, act_sup in zip(mask_f_support, apriori_f_support)]
    sum_of_errors = sum(relative_errors)

    # Calculate the average relative support error
    average_error = sum_of_errors / total_itemsets

    support_error = average_error
    return support_error

def calculate_identity_error(mask_frequent, apriori_frequent):
    """
    Calculates the identity error metrics.

    Args:
    mask_frequent (list): List of frequent itemsets obtained from MASK.
    apriori_frequent (list): List of frequent itemsets obtained from Apriori.

    Returns:
    tuple: Identity error (false positives, false negatives).
    """
    reconstructed_itemsets = set(mask_frequent)
    print(apriori_frequent)
    actual_itemsets = set(apriori_frequent)
    false_positives = len(reconstructed_itemsets - actual_itemsets) / len(actual_itemsets)
    false_negatives = len(actual_itemsets - reconstructed_itemsets) / len(actual_itemsets)
    return false_positives, false_negatives

def support_identity_errors(mask_frequent, mask_f_support, apriori_frequent, apriori_f_support):
    nbr_m_f = np.size(mask_frequent)
    max_levels = len(mask_frequent[nbr_m_f-1])

    levels = []
    support_errors = []
    false_positive_errors = []
    false_negative_errors = []

    for l in range(max_levels):
        level = l+1
        level_mask_frequent = []
        level_mask_f_support = []
        for i in range(np.size(mask_frequent)):
            if len(mask_frequent[i]) == level:
                level_mask_frequent.append(mask_frequent[i])
                level_mask_f_support.append(mask_f_support[i])

        support_error = calculate_support_error(level_mask_frequent, level_mask_f_support, apriori_f_support)

        false_positive, false_negative = calculate_identity_error(level_mask_frequent, apriori_frequent)

        levels.append(level)
        support_errors.append(support_error)
        false_positive_errors.append(false_positive)
        false_negative_errors.append(false_negative)

    return levels, support_errors, false_positive_errors, false_negative_errors


## Experimental Results


### Applying The MASK on a synthetic dataset

#### Distortion Procedure

In [127]:
# creating True matrix T_dataset
T_dataset = np.array([
[1, 0, 1, 1, 0, 1, 0, 0, 1, 1],
[0, 1, 0, 0, 1, 1, 0, 0, 0, 0],
[1, 0, 0, 0, 1, 1, 1, 0, 1, 0],
[1, 0, 1, 1, 0, 1, 0, 0, 1, 1],
[0, 1, 0, 0, 1, 1, 0, 0, 0, 0],
[1, 0, 0, 0, 1, 1, 1, 0, 1, 0],
[1, 0, 1, 1, 0, 1, 0, 0, 1, 1],
[0, 1, 0, 0, 1, 1, 0, 0, 0, 0],
[1, 0, 0, 0, 1, 1, 1, 0, 1, 0],
[1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
])

distortion_probability = 0.9

# building the distorted matrix D_dataset by distorting T_dataset
# D_dataset = distort_dataset(T_dataset, distortion_probability)
D_dataset = T_dataset

s_0 = 0.01
a = 0.75
privacy = user_privacy(s_0, a, distortion_probability)
print("True matrix: ", T_dataset)
print("Distorted matrix: ", D_dataset)
print("User Privacy attained: ", privacy, "%")

True matrix:  [[1 0 1 1 0 1 0 0 1 1]
 [0 1 0 0 1 1 0 0 0 0]
 [1 0 0 0 1 1 1 0 1 0]
 [1 0 1 1 0 1 0 0 1 1]
 [0 1 0 0 1 1 0 0 0 0]
 [1 0 0 0 1 1 1 0 1 0]
 [1 0 1 1 0 1 0 0 1 1]
 [0 1 0 0 1 1 0 0 0 0]
 [1 0 0 0 1 1 1 0 1 0]
 [1 0 1 1 0 1 0 0 1 1]]
Distorted matrix:  [[1 0 1 1 0 1 0 0 1 1]
 [0 1 0 0 1 1 0 0 0 0]
 [1 0 0 0 1 1 1 0 1 0]
 [1 0 1 1 0 1 0 0 1 1]
 [0 1 0 0 1 1 0 0 0 0]
 [1 0 0 0 1 1 1 0 1 0]
 [1 0 1 1 0 1 0 0 1 1]
 [0 1 0 0 1 1 0 0 0 0]
 [1 0 0 0 1 1 1 0 1 0]
 [1 0 1 1 0 1 0 0 1 1]]
User Privacy attained:  71.86866799534961 %


#### Mining Procedure

MASK() will provide us with
1. mask_frequent which is a list of all frequent itemsets and
2. mask_f_support which is a list of their corresponding estimated supports

In [128]:
min_support = 0.25

mask_frequent, mask_f_support = MASK(D_dataset, distortion_probability, min_support, True)
print(mask_frequent)
print(mask_f_support)


PASS:  1 (  1 -itemsets)
column_name_combinations ['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J']
Frequent itemsets for this pass are:  ['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J']
The estimated support values of them are:  [3.4 6.6 5.8 5.8 4.2 1.  6.6 9.  3.4 5.8]
Non-frequent itemsets that should not be part of the combinations of the next pass are:  []

PASS:  2 (  2 -itemsets)
column_name_combinations ['AB' 'AC' 'AD' 'AE' 'AF' 'AG' 'AH' 'AI' 'AJ' 'BC' 'BD' 'BE' 'BF' 'BG'
 'BH' 'BI' 'BJ' 'CD' 'CE' 'CF' 'CG' 'CH' 'CI' 'CJ' 'DE' 'DF' 'DG' 'DH'
 'DI' 'DJ' 'EF' 'EG' 'EH' 'EI' 'EJ' 'FG' 'FH' 'FI' 'FJ' 'GH' 'GI' 'GJ'
 'HI' 'HJ' 'IJ']
Frequent itemsets for this pass are:  ['AB' 'AC' 'AD' 'AE' 'AF' 'AG' 'AH' 'AI' 'AJ' 'BC' 'BD' 'BE' 'BF' 'BG'
 'BH' 'BI' 'BJ' 'CD' 'CE' 'CF' 'CG' 'CH' 'CI' 'CJ' 'DE' 'DF' 'DG' 'DH'
 'DI' 'DJ' 'EF' 'EG' 'EH' 'EI' 'EJ' 'FG' 'FH' 'FI' 'FJ' 'GH' 'GI' 'GJ'
 'HI' 'HJ' 'IJ']
The estimated support values of them are:  [0.9  2.74 2.74 0.66 0.34 2.82 3.06 2.5  2.74 3.06 3.06 3.

#### Apriori Algorithm from mlxtend.frequent_patterns package

In [129]:
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

T_dataset_df = pd.DataFrame(T_dataset) # providing the data in the required format for the Apriori algorithm
print(T_dataset_df)
apriori_frequent_itemsets = apriori(T_dataset_df, min_support = 0.25, use_colnames = True)
print(apriori_frequent_itemsets)

   0  1  2  3  4  5  6  7  8  9
0  1  0  1  1  0  1  0  0  1  1
1  0  1  0  0  1  1  0  0  0  0
2  1  0  0  0  1  1  1  0  1  0
3  1  0  1  1  0  1  0  0  1  1
4  0  1  0  0  1  1  0  0  0  0
5  1  0  0  0  1  1  1  0  1  0
6  1  0  1  1  0  1  0  0  1  1
7  0  1  0  0  1  1  0  0  0  0
8  1  0  0  0  1  1  1  0  1  0
9  1  0  1  1  0  1  0  0  1  1
    support            itemsets
0       0.7                 (0)
1       0.3                 (1)
2       0.4                 (2)
3       0.4                 (3)
4       0.6                 (4)
..      ...                 ...
86      0.4     (0, 2, 5, 8, 9)
87      0.4     (0, 3, 5, 8, 9)
88      0.3     (0, 4, 5, 6, 8)
89      0.4     (2, 3, 5, 8, 9)
90      0.4  (0, 2, 3, 5, 8, 9)

[91 rows x 2 columns]


In [130]:
import string
import numpy as np

def alphabetical_transform(dataset, apriori_frequent_itemsets):
    # Generate the required letters from the alphabet
    alphabet = string.ascii_uppercase[:dataset.shape[1]]

    # Create a mapping between column index and corresponding letter
    number_items_map = {index: letter for index, letter in enumerate(alphabet)}

    # Transform the itemsets using the mapping and extract as NumPy array
    transformed_itemsets = apriori_frequent_itemsets["itemsets"].apply(
                                lambda itemset: ''.join(number_items_map[index] for index in itemset)
                            ).values

    # Extract the support values as NumPy array
    support_values = apriori_frequent_itemsets["support"].values

    return transformed_itemsets, support_values


In [131]:
print(T_dataset)
print(apriori_frequent_itemsets)
apriori_frequent, apriori_f_support = alphabetical_transform(T_dataset, apriori_frequent_itemsets)
print(apriori_frequent)
print(apriori_f_support)

[[1 0 1 1 0 1 0 0 1 1]
 [0 1 0 0 1 1 0 0 0 0]
 [1 0 0 0 1 1 1 0 1 0]
 [1 0 1 1 0 1 0 0 1 1]
 [0 1 0 0 1 1 0 0 0 0]
 [1 0 0 0 1 1 1 0 1 0]
 [1 0 1 1 0 1 0 0 1 1]
 [0 1 0 0 1 1 0 0 0 0]
 [1 0 0 0 1 1 1 0 1 0]
 [1 0 1 1 0 1 0 0 1 1]]
    support            itemsets
0       0.7                 (0)
1       0.3                 (1)
2       0.4                 (2)
3       0.4                 (3)
4       0.6                 (4)
..      ...                 ...
86      0.4     (0, 2, 5, 8, 9)
87      0.4     (0, 3, 5, 8, 9)
88      0.3     (0, 4, 5, 6, 8)
89      0.4     (2, 3, 5, 8, 9)
90      0.4  (0, 2, 3, 5, 8, 9)

[91 rows x 2 columns]
['A' 'B' 'C' 'D' 'E' 'F' 'G' 'I' 'J' 'AC' 'AD' 'AE' 'AF' 'AG' 'AI' 'AJ'
 'BE' 'BF' 'CD' 'CF' 'IC' 'JC' 'DF' 'ID' 'JD' 'EF' 'EG' 'IE' 'FG' 'IF'
 'JF' 'IG' 'IJ' 'ACD' 'ACF' 'AIC' 'AJC' 'ADF' 'AID' 'AJD' 'AEF' 'AEG'
 'AIE' 'AFG' 'AIF' 'AJF' 'AIG' 'AIJ' 'BEF' 'CDF' 'ICD' 'JCD' 'ICF' 'JCF'
 'IJC' 'IDF' 'JDF' 'IJD' 'EFG' 'IEF' 'IEG' 'IFG' 'IJF' 'ACDF' 'AICD'
 'AJCD'

#### Error Metrics with support_identity_errors()

In [132]:
Levels, Support_Errors, False_Positive, False_Negative = support_identity_errors(mask_frequent, mask_f_support, apriori_frequent, apriori_f_support)

data = {
    'Level': Levels,
    'Support Error': Support_Errors,
    'False Positive': False_Positive,
    'False Negative': False_Negative
}

performance = pd.DataFrame(data)
performance.head()

['A' 'B' 'C' 'D' 'E' 'F' 'G' 'I' 'J' 'AC' 'AD' 'AE' 'AF' 'AG' 'AI' 'AJ'
 'BE' 'BF' 'CD' 'CF' 'IC' 'JC' 'DF' 'ID' 'JD' 'EF' 'EG' 'IE' 'FG' 'IF'
 'JF' 'IG' 'IJ' 'ACD' 'ACF' 'AIC' 'AJC' 'ADF' 'AID' 'AJD' 'AEF' 'AEG'
 'AIE' 'AFG' 'AIF' 'AJF' 'AIG' 'AIJ' 'BEF' 'CDF' 'ICD' 'JCD' 'ICF' 'JCF'
 'IJC' 'IDF' 'JDF' 'IJD' 'EFG' 'IEF' 'IEG' 'IFG' 'IJF' 'ACDF' 'AICD'
 'AJCD' 'AICF' 'AJCF' 'AICJ' 'AIDF' 'AJDF' 'AIDJ' 'AEFG' 'AIEF' 'AIEG'
 'AIFG' 'AIFJ' 'ICDF' 'JCDF' 'IJCD' 'IJCF' 'IJDF' 'IEFG' 'ACDFI' 'ACDFJ'
 'ACDIJ' 'ACFIJ' 'ADFIJ' 'AEFGI' 'CDFIJ' 'ACDFIJ']
['A' 'B' 'C' 'D' 'E' 'F' 'G' 'I' 'J' 'AC' 'AD' 'AE' 'AF' 'AG' 'AI' 'AJ'
 'BE' 'BF' 'CD' 'CF' 'IC' 'JC' 'DF' 'ID' 'JD' 'EF' 'EG' 'IE' 'FG' 'IF'
 'JF' 'IG' 'IJ' 'ACD' 'ACF' 'AIC' 'AJC' 'ADF' 'AID' 'AJD' 'AEF' 'AEG'
 'AIE' 'AFG' 'AIF' 'AJF' 'AIG' 'AIJ' 'BEF' 'CDF' 'ICD' 'JCD' 'ICF' 'JCF'
 'IJC' 'IDF' 'JDF' 'IJD' 'EFG' 'IEF' 'IEG' 'IFG' 'IJF' 'ACDF' 'AICD'
 'AJCD' 'AICF' 'AJCF' 'AICJ' 'AIDF' 'AJDF' 'AIDJ' 'AEFG' 'AIEF' 'AIEG'
 'AIFG' 'AIFJ' 'ICDF' 'J

Unnamed: 0,Level,Support Error,False Positive,False Negative
0,1,11.171429,0.010989,0.901099
1,2,5.911132,0.318681,0.824176
2,3,2.263114,1.131868,0.901099


### Applying The MASK on a real dataset

#### Distortion Procedure

In [133]:
# Preparing Dataset
import csv
import pandas as pd
import numpy as np

dataset = []
with open('market_basket2.csv', 'r') as fd:
    reader = csv.reader(fd)
    for row in reader:
        dataset.append(row)
print(pd.DataFrame(dataset))

# Convert the dataset it into an multidimensional array
T_dataset = np.array(dataset)
T_dataset = T_dataset[1:,:]
T_dataset = T_dataset.astype("int")
print(T_dataset)

# Distortion Procedure and privacy user privacy
distortion_probability = 0.9
s_0 = 0.01
a = 0.75
D_dataset = distort_dataset(T_dataset, distortion_probability)
privacy = user_privacy(s_0, a, distortion_probability)
print( "The distorted dataset: ", D_dataset)
print("Privacy attained: ", privacy, "%")

         0       1       2      3       4     5       6       7     8      9
0    Apple  Banana  Orange  Mango  Grapes  Eggs  Yogurt  Cheese  salt  sugar
1        0       1       0      0       1     1       0       1     0      0
2        0       0       0      0       0     0       0       0     0      1
3        1       0       1      0       0     1       0       1     0      1
4        0       0       1      1       0     1       0       0     0      1
..     ...     ...     ...    ...     ...   ...     ...     ...   ...    ...
995      0       1       0      0       0     0       1       0     0      0
996      1       0       0      0       1     0       0       0     1      1
997      1       0       0      0       1     1       0       0     0      0
998      0       0       1      1       1     0       1       1     1      0
999      0       0       0      0       0     0       0       0     0      1

[1000 rows x 10 columns]
[[0 1 0 ... 1 0 0]
 [0 0 0 ... 0 0 1]
 [1 0 1 ... 

#### Mining Procedure

MASK() will provide us with
1. mask_frequent which is a list of all frequent itemsets and
2. mask_f_support which is a list of their corresponding estimated supports

In [134]:
min_support = 0.25
mask_frequent, mask_f_support = MASK(D_dataset, distortion_probability, min_support)
print(mask_frequent)
print(mask_f_support)


PASS:  1 (  1 -itemsets)

PASS:  2 (  2 -itemsets)

PASS:  3 (  3 -itemsets)

PASS:  4 (  4 -itemsets)

PASS:  5 (  5 -itemsets)

PASS:  6 (  6 -itemsets)

PASS:  7 (  7 -itemsets)

PASS:  8 (  8 -itemsets)

PASS:  9 (  9 -itemsets)

PASS:  10 (  10 -itemsets)
The algorithm stopped because there are no more frequent itemsets!
['A' 'B' 'C' ... 'ACDEFGHIJ' 'BCDEFGHIJ' 'ABCDEFGHIJ']
[423.9        427.9        451.9        ...   1.29982657   1.15598815
   0.67288153]


#### Apriori Algorithm from mlxtend.frequent_patterns package

In [135]:
T_dataset_df = pd.DataFrame(T_dataset)
print(T_dataset_df)
apriori_frequent_itemsets = apriori(T_dataset_df, min_support = 0.25, use_colnames = True)
print(apriori_frequent_itemsets)

     0  1  2  3  4  5  6  7  8  9
0    0  1  0  0  1  1  0  1  0  0
1    0  0  0  0  0  0  0  0  0  1
2    1  0  1  0  0  1  0  1  0  1
3    0  0  1  1  0  1  0  0  0  1
4    1  1  0  0  0  0  0  0  0  0
..  .. .. .. .. .. .. .. .. .. ..
994  0  1  0  0  0  0  1  0  0  0
995  1  0  0  0  1  0  0  0  1  1
996  1  0  0  0  1  1  0  0  0  0
997  0  0  1  1  1  0  1  1  1  0
998  0  0  0  0  0  0  0  0  0  1

[999 rows x 10 columns]
    support itemsets
0  0.383383      (0)
1  0.384384      (1)
2  0.420420      (2)
3  0.404404      (3)
4  0.407407      (4)
5  0.398398      (5)
6  0.384384      (6)
7  0.410410      (7)
8  0.408408      (8)
9  0.405405      (9)


In [136]:
print(T_dataset)
print(apriori_frequent_itemsets)
apriori_frequent, apriori_f_support = alphabetical_transform(T_dataset, apriori_frequent_itemsets)
print(apriori_frequent)
print(apriori_f_support)

[[0 1 0 ... 1 0 0]
 [0 0 0 ... 0 0 1]
 [1 0 1 ... 1 0 1]
 ...
 [1 0 0 ... 0 0 0]
 [0 0 1 ... 1 1 0]
 [0 0 0 ... 0 0 1]]
    support itemsets
0  0.383383      (0)
1  0.384384      (1)
2  0.420420      (2)
3  0.404404      (3)
4  0.407407      (4)
5  0.398398      (5)
6  0.384384      (6)
7  0.410410      (7)
8  0.408408      (8)
9  0.405405      (9)
['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J']
[0.38338338 0.38438438 0.42042042 0.4044044  0.40740741 0.3983984
 0.38438438 0.41041041 0.40840841 0.40540541]


#### Error Metrics with support_identity_errors()

In [137]:
print(apriori_frequent)
Levels, Support_Errors, False_Positive, False_Negative = support_identity_errors(mask_frequent, mask_f_support, apriori_frequent, apriori_f_support)

data = {
    'Level': Levels,
    'Support Error': Support_Errors,
    'False Positive': False_Positive,
    'False Negative': False_Negative
}

performance = pd.DataFrame(data)
performance.head()


['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J']
['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J']
['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J']
['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J']
['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J']
['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J']
['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J']
['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J']
['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J']
['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J']
['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J']


Unnamed: 0,Level,Support Error,False Positive,False Negative
0,1,1093.59362,0.0,0.0
1,2,108.338631,4.5,1.0
2,3,18.789662,12.0,1.0
3,4,5.16754,21.0,1.0
4,5,2.150676,25.2,1.0
