## Exercise: A-Priori Algorithm

Consider the following programming exercise. Given the frequent singletons ($L_1$) and the frequent pairs ($L_2$) computed with our previous implementation of A-Priori for k=2, implement Spark functions that compute the **confidence** and **interest** of all the binary rules that can be built from the set $L_2$. To test your code, use the dataset available at:

https://www.kaggle.com/shazadudwadia/supermarket?select=GroceryStoreDataSet.csv
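
The exercise statement does not spell out the two measures. Assuming the usual textbook definitions, and taking $support(X,T)$ to be the fraction of transactions in $T$ that contain every item of $X$ (the same relative support that is compared against $\theta$ below), a binary rule $A \rightarrow B$ would be scored as:

$$
confidence(A \rightarrow B) = \frac{support(\{A,B\},T)}{support(\{A\},T)}, \qquad
interest(A \rightarrow B) = confidence(A \rightarrow B) - support(\{B\},T)
$$

Under these assumptions, each pair $\{A,B\} \in L_2$ gives two rules, $A \rightarrow B$ and $B \rightarrow A$, which generally have different confidence and interest.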

## High-level pseudo-code of the A-Priori algorithm

>$L_1$ := Find frequent elements (T,$\theta$)  
>k=2  
>While ($L_{k-1}$ is not empty) do:  
>>$C_k = \{ P \ | \ |P|=k, \forall S_j \subseteq P, |S_j|=k\!-\!1 \rightarrow S_j \in L_{k-1}\}$  
>>$L_k = \{ P \ | \ P \in C_k, support(P,T) \geq \theta \}$  
>>k=k+1
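
The pseudo-code above can be sketched in plain Python as follows. This is only an illustrative, non-distributed version: the function name `apriori` and the dictionary-of-levels return value are assumptions made for this sketch, not part of the Spark implementation used in the rest of the notebook.

```python
from itertools import combinations

def apriori(transactions, theta):
    """Return {k: {itemset: count}} with the frequent itemsets of each size k."""
    n = len(transactions)
    baskets = [frozenset(t) for t in transactions]

    # L_1: single items whose relative support reaches theta
    counts = {}
    for basket in baskets:
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c / n >= theta}

    levels = {}
    k = 2
    while current:
        levels[k - 1] = current
        items = sorted(set().union(*current))
        # C_k: k-sets whose (k-1)-subsets all belong to L_{k-1}
        candidates = [
            frozenset(p) for p in combinations(items, k)
            if all(frozenset(s) in current for s in combinations(p, k - 1))
        ]
        # L_k: candidates whose relative support reaches theta
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        current = {c: v for c, v in counts.items() if v / n >= theta}
        k += 1
    return levels

# Toy data, hypothetical and unrelated to the grocery dataset
toy = [['a', 'b', 'c'], ['a', 'b'], ['a', 'c'], ['b', 'c'], ['a', 'b', 'c']]
levels = apriori(toy, 0.5)
```

The sketch enumerates candidates by brute force over the surviving items, which is fine for a handful of items but would not scale; the Spark version below instead generates candidate pairs per transaction.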


In [1]:
# Imports
import pyspark
import os
import math
import random
import sys

# Variables and constants
K=2
theta=0.01

In [2]:
# Init Spark Context
sc = pyspark.SparkContext('local[*]','A-Priori Algorithm')
sc

### **Phase 1**: Compute $L_1$ and $T_{L_1}$

In [3]:
# Read CSV and parse
transactions = sc.textFile("../data/GroceryStoreDataSet.csv").map(lambda line: line.replace("\"",'').split(","))
number_transactions = transactions.count()
print("Nº Transactions: %s\n\n%s" % (number_transactions,transactions.collect()))


Nº Transactions: 20

[['MILK', 'BREAD', 'BISCUIT'], ['BREAD', 'MILK', 'BISCUIT', 'CORNFLAKES'], ['BREAD', 'TEA', 'BOURNVITA'], ['JAM', 'MAGGI', 'BREAD', 'MILK'], ['MAGGI', 'TEA', 'BISCUIT'], ['BREAD', 'TEA', 'BOURNVITA'], ['MAGGI', 'TEA', 'CORNFLAKES'], ['MAGGI', 'BREAD', 'TEA', 'BISCUIT'], ['JAM', 'MAGGI', 'BREAD', 'TEA'], ['BREAD', 'MILK'], ['COFFEE', 'COCK', 'BISCUIT', 'CORNFLAKES'], ['COFFEE', 'COCK', 'BISCUIT', 'CORNFLAKES'], ['COFFEE', 'SUGER', 'BOURNVITA'], ['BREAD', 'COFFEE', 'COCK'], ['BREAD', 'SUGER', 'BISCUIT'], ['COFFEE', 'SUGER', 'CORNFLAKES'], ['BREAD', 'SUGER', 'BOURNVITA'], ['BREAD', 'COFFEE', 'SUGER'], ['BREAD', 'COFFEE', 'SUGER'], ['TEA', 'MILK', 'COFFEE', 'CORNFLAKES']]


In [4]:
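# Compute the RDD with frequent singleton sets (L_1)
def computeL1( rddT, numtrans, theta ):
    # Count item occurrences; this assumes no item is repeated within a transaction
    rddtemp = rddT.flatMap( lambda t : [ (it,1) for it in t ] ).reduceByKey( lambda a,b : a+b )
    # Keep only the items whose relative support reaches theta
    return rddtemp.filter( lambda x : (float(x[1])/numtrans) >= theta )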

aux_L1 = computeL1(transactions,number_transactions,theta)
print(aux_L1.collect())

L1 = aux_L1.keys().collect()
print("L1 Items: %s"% L1)

[('MILK', 5), ('BREAD', 13), ('BISCUIT', 7), ('CORNFLAKES', 6), ('TEA', 7), ('MAGGI', 5), ('COFFEE', 8), ('COCK', 3), ('SUGER', 6), ('BOURNVITA', 4), ('JAM', 2)]
L1 Items: ['MILK', 'BREAD', 'BISCUIT', 'CORNFLAKES', 'TEA', 'MAGGI', 'COFFEE', 'COCK', 'SUGER', 'BOURNVITA', 'JAM']


In [5]:
# Map each transaction to its version without the elements that are not in L1
# L1 must be a Python list, not an RDD
def computeTfilteredByL1( seqOfT, L1 ):
    for t in seqOfT:
        yield [ it for it in t if (it in L1) ]
    
TL1 = transactions.mapPartitions( lambda seqOfT : computeTfilteredByL1( seqOfT, L1 ))
print("Transactions with only frequent elements %s" % TL1.collect())

Transactions with only frequent elements [['MILK', 'BREAD', 'BISCUIT'], ['BREAD', 'MILK', 'BISCUIT', 'CORNFLAKES'], ['BREAD', 'TEA', 'BOURNVITA'], ['JAM', 'MAGGI', 'BREAD', 'MILK'], ['MAGGI', 'TEA', 'BISCUIT'], ['BREAD', 'TEA', 'BOURNVITA'], ['MAGGI', 'TEA', 'CORNFLAKES'], ['MAGGI', 'BREAD', 'TEA', 'BISCUIT'], ['JAM', 'MAGGI', 'BREAD', 'TEA'], ['BREAD', 'MILK'], ['COFFEE', 'COCK', 'BISCUIT', 'CORNFLAKES'], ['COFFEE', 'COCK', 'BISCUIT', 'CORNFLAKES'], ['COFFEE', 'SUGER', 'BOURNVITA'], ['BREAD', 'COFFEE', 'COCK'], ['BREAD', 'SUGER', 'BISCUIT'], ['COFFEE', 'SUGER', 'CORNFLAKES'], ['BREAD', 'SUGER', 'BOURNVITA'], ['BREAD', 'COFFEE', 'SUGER'], ['BREAD', 'COFFEE', 'SUGER'], ['TEA', 'MILK', 'COFFEE', 'CORNFLAKES']]


### **Phase 2**: Compute $C_2(T)$ from $T_{L_1}$

In [6]:
# For each t in seqofFilteredT (they come from T_{L_1}), compute the pairs (a,b) from t that belong to C_2
def generateC2( seqofFilteredT ):
    for t in seqofFilteredT:
        cpairslist = []
        # Enumerate every unordered pair of items in t
        for (a,b) in [ (a,b) for i,a in enumerate(t[:-1]) for b in t[i+1:] ]:
            # Store each pair in lexicographic order so that (a,b) and (b,a) share one key
            cpairslist.append( ((a,b),1) if (a <= b) else ((b,a),1) )
        yield cpairslist
    
rddC2T = TL1.mapPartitions( lambda seqOfFilteredT : generateC2( seqOfFilteredT ) )
rddC2TFlat = rddC2T.flatMap( lambda x : x )
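
The pair generation inside `generateC2` can also be sketched with `itertools.combinations`. The helper name `canonical_pairs` is hypothetical. Unlike the loop above, this variant removes repeated items from a transaction before pairing, so the two only agree when transactions contain no duplicates.

```python
from itertools import combinations

def canonical_pairs(transaction):
    """Return every unordered pair of distinct items as a sorted tuple."""
    # Sorting first guarantees a < b in every emitted pair;
    # set() drops items repeated within the transaction
    return list(combinations(sorted(set(transaction)), 2))

# First transaction of the dataset shown above
pairs = canonical_pairs(['MILK', 'BREAD', 'BISCUIT'])
print(pairs)
```

For this transaction the sketch yields the same three pairs that open the flattened output of `rddC2TFlat`, only in a different order.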

In [7]:
print( "flattened C2T: ", rddC2TFlat.collect())

flattened C2T:  [(('BREAD', 'MILK'), 1), (('BISCUIT', 'MILK'), 1), (('BISCUIT', 'BREAD'), 1), (('BREAD', 'MILK'), 1), (('BISCUIT', 'BREAD'), 1), (('BREAD', 'CORNFLAKES'), 1), (('BISCUIT', 'MILK'), 1), (('CORNFLAKES', 'MILK'), 1), (('BISCUIT', 'CORNFLAKES'), 1), (('BREAD', 'TEA'), 1), (('BOURNVITA', 'BREAD'), 1), (('BOURNVITA', 'TEA'), 1), (('JAM', 'MAGGI'), 1), (('BREAD', 'JAM'), 1), (('JAM', 'MILK'), 1), (('BREAD', 'MAGGI'), 1), (('MAGGI', 'MILK'), 1), (('BREAD', 'MILK'), 1), (('MAGGI', 'TEA'), 1), (('BISCUIT', 'MAGGI'), 1), (('BISCUIT', 'TEA'), 1), (('BREAD', 'TEA'), 1), (('BOURNVITA', 'BREAD'), 1), (('BOURNVITA', 'TEA'), 1), (('MAGGI', 'TEA'), 1), (('CORNFLAKES', 'MAGGI'), 1), (('CORNFLAKES', 'TEA'), 1), (('BREAD', 'MAGGI'), 1), (('MAGGI', 'TEA'), 1), (('BISCUIT', 'MAGGI'), 1), (('BREAD', 'TEA'), 1), (('BISCUIT', 'BREAD'), 1), (('BISCUIT', 'TEA'), 1), (('JAM', 'MAGGI'), 1), (('BREAD', 'JAM'), 1), (('JAM', 'TEA'), 1), (('BREAD', 'MAGGI'), 1), (('MAGGI', 'TEA'), 1), (('BREAD', 'TEA'),

### **Phase 3**: Compute $L_2$ from $C_2(T)$

In [8]:
def computeL2( rddC2T, numtrans, theta ):
    pairsCountedrdd = rddC2T.reduceByKey( lambda v1,v2 : v1+v2 )
    # Finally, filter out from the previous rdd those pairs with frequency below theta
    return pairsCountedrdd.filter( lambda x : (float(x[1])/numtrans) >= theta )

rddL2 = computeL2( rddC2TFlat, number_transactions, theta )

rddL2 = rddL2.sortBy(lambda a: -a[1])
for it in rddL2.toLocalIterator():
    print(it)

(('BREAD', 'MILK'), 4)
(('BISCUIT', 'BREAD'), 4)
(('BREAD', 'TEA'), 4)
(('MAGGI', 'TEA'), 4)
(('COFFEE', 'CORNFLAKES'), 4)
(('COFFEE', 'SUGER'), 4)
(('BREAD', 'SUGER'), 4)
(('BISCUIT', 'CORNFLAKES'), 3)
(('BREAD', 'MAGGI'), 3)
(('COCK', 'COFFEE'), 3)
(('BREAD', 'COFFEE'), 3)
(('BOURNVITA', 'BREAD'), 3)
(('BISCUIT', 'MILK'), 2)
(('CORNFLAKES', 'MILK'), 2)
(('BISCUIT', 'MAGGI'), 2)
(('BISCUIT', 'TEA'), 2)
(('CORNFLAKES', 'TEA'), 2)
(('BISCUIT', 'COFFEE'), 2)
(('BISCUIT', 'COCK'), 2)
(('COCK', 'CORNFLAKES'), 2)
(('BOURNVITA', 'TEA'), 2)
(('JAM', 'MAGGI'), 2)
(('BREAD', 'JAM'), 2)
(('BOURNVITA', 'SUGER'), 2)
(('BREAD', 'CORNFLAKES'), 1)
(('MAGGI', 'MILK'), 1)
(('CORNFLAKES', 'MAGGI'), 1)
(('BREAD', 'COCK'), 1)
(('BISCUIT', 'SUGER'), 1)
(('CORNFLAKES', 'SUGER'), 1)
(('MILK', 'TEA'), 1)
(('COFFEE', 'TEA'), 1)
(('COFFEE', 'MILK'), 1)
(('JAM', 'MILK'), 1)
(('JAM', 'TEA'), 1)
(('BOURNVITA', 'COFFEE'), 1)
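
One possible starting point for the exercise is sketched below in plain Python, working on the collected counts rather than on RDDs. It assumes the usual definitions: confidence of a rule a -> b is count({a,b}) / count({a}), and interest is that confidence minus the fraction of transactions containing b. The function name `binary_rules` and its dictionary inputs are assumptions made for this sketch; a full solution would still need to express the same computation with Spark operations.

```python
def binary_rules(l1_counts, l2_counts, num_transactions):
    """Return {(antecedent, consequent): (confidence, interest)} for all rules from L_2."""
    rules = {}
    for (a, b), pair_count in l2_counts.items():
        # Each frequent pair {a, b} yields two rules: a -> b and b -> a
        for antecedent, consequent in ((a, b), (b, a)):
            # confidence = count({a,b}) / count({antecedent})
            confidence = pair_count / l1_counts[antecedent]
            # interest = confidence - relative support of the consequent
            interest = confidence - l1_counts[consequent] / num_transactions
            rules[(antecedent, consequent)] = (confidence, interest)
    return rules

# Counts copied from the L_1 and L_2 outputs above (a subset, for illustration)
l1_counts = {'MILK': 5, 'BREAD': 13, 'MAGGI': 5, 'TEA': 7}
l2_counts = {('BREAD', 'MILK'): 4, ('MAGGI', 'TEA'): 4}
rules = binary_rules(l1_counts, l2_counts, 20)

for (x, y), (conf, inter) in sorted(rules.items()):
    print("%s -> %s: confidence=%.3f interest=%.3f" % (x, y, conf, inter))
```

With these counts, MILK -> BREAD has confidence 4/5 = 0.8 and interest 0.8 - 13/20 = 0.15, while BREAD -> MILK has confidence 4/13 and a much smaller interest, which illustrates why both directions of each pair need to be evaluated. In the notebook, the two dictionaries could presumably be obtained with `aux_L1.collectAsMap()` and `rddL2.collectAsMap()`.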
