# MarketBasket
### Overview
In market basket analysis, a **basket** is the collection of items a customer buys in a single purchase trip. It is not a physical basket but the data recording which products were bought together. Baskets are represented as **sets** of indices, each index referring to a specific product (e.g. `{112, 41, 1020}` is a basket).

By identifying frequent itemsets (groups of items that are often purchased together), we can reveal associations between products. Retailers can then see which items customers tend to buy together, knowledge that drives strategies such as targeted promotions.
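As a minimal sketch of what "frequency of an itemset" means, the snippet below counts how many baskets contain a given itemset. The toy baskets and the helper name `count_support` are illustrative assumptions, not part of the dataset used in this notebook.

```python
# Toy dataset: each basket is a set of product indices (made-up values).
baskets = [
    {1, 2, 3},
    {1, 2},
    {2, 3},
    {1, 2, 4},
]

def count_support(itemset, baskets):
    """Number of baskets that contain every item of `itemset` (hypothetical helper)."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)

pair_count = count_support({1, 2}, baskets)       # baskets containing both 1 and 2
triple_count = count_support({1, 2, 3}, baskets)  # baskets containing 1, 2 and 3
print(pair_count, triple_count)
```

An itemset is then called frequent when this count exceeds a chosen threshold.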



In [1]:
# First we need a dataset.
# Let us generate a random one in Python.
import random

def random_basket(items, max_basket_size=15):
    # Draw up to `max_basket_size` item ids; duplicate draws collapse in the set,
    # so a basket may end up smaller than `basket_size`.
    basket_size = random.randint(1, max_basket_size)
    return {random.randint(0, items - 1) for _ in range(basket_size)}

dataset_size, items = 100000, 100
dataset = [random_basket(items, 10) for _ in range(dataset_size)]
print(dataset[:3])

[{1, 3, 58, 37, 38, 74, 14, 86, 90, 93}, {6}, {10, 15, 52, 87, 60, 61, 95}]


In [3]:
# Let's initialize the Spark session; the data is parallelized in the next cell
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Apriori").getOrCreate()



In [4]:
# Let's load the data into Spark as an RDD
rdd = spark.sparkContext.parallelize(dataset)

## The A-Priori Algorithm
This algorithm leverages a key property of itemsets: if an itemset is frequent, all of its subsets must also be frequent. For example, if `{12,54,22,92,69,4}` is frequent, then subsets such as `{12,54,22}` and `{12,69,4}` are frequent too. An itemset is considered frequent (or interesting) when its count exceeds a threshold parameter called the **support**.

> The Role of Support: The Apriori algorithm uses a minimum support threshold, which defines how frequent an itemset needs to be to count as "interesting" for further analysis. Items or itemsets that appear less often than the threshold are discarded. It is therefore crucial to choose a support that filters out most of the data (to keep the algorithm light) without discarding interesting associations.
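The subset property can be checked by hand on a small example. The sketch below uses made-up baskets and a hypothetical helper `count_support`; it verifies that every proper subset of an itemset occurs at least as often as the itemset itself.

```python
from itertools import combinations

# Made-up baskets, only for illustration.
baskets = [{1, 2, 3}, {1, 2, 3, 4}, {1, 2}, {2, 3}, {1, 3, 4}]

def count_support(itemset, baskets):
    """Number of baskets that contain every item of `itemset` (hypothetical helper)."""
    itemset = set(itemset)
    return sum(1 for b in baskets if itemset <= b)

itemset = (1, 2, 3)
itemset_count = count_support(itemset, baskets)

# Count every proper, non-empty subset of the itemset.
subset_counts = {
    subset: count_support(subset, baskets)
    for size in range(1, len(itemset))
    for subset in combinations(itemset, size)
}

# A basket containing the full itemset also contains each subset,
# so no subset can have a smaller count.
monotone = all(c >= itemset_count for c in subset_counts.values())
print(itemset_count, subset_counts, monotone)
```

Read in the other direction, this is what lets Apriori prune: an itemset with an infrequent subset cannot be frequent, so it never needs to be counted.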

- The first step of the Apriori algorithm is to count the occurrences of each item

In [10]:
support = 310
first_pass = rdd.flatMap(lambda basket:[(e,1) for e in basket]) \
                .reduceByKey(lambda x,y:x+y) \
                .filter(lambda x:x[1]>support)

print("remaining singleton", first_pass.count())
print("5 random singleton", first_pass.take(5))

remaining singleton 100
5 random singleton [(0, 5334), (8, 5307), (80, 5231), (40, 5209), (72, 5349)]


- Now we need to count all pairs composed of frequent singletons

In [13]:
from itertools import combinations

frequent_singletons = set(first_pass.map(lambda x:x[0]).collect())
second_pass = rdd.flatMap(lambda basket:[(e,1) for e in combinations(sorted(basket),2)]) \
                 .filter(lambda x: x[0][0] in frequent_singletons) \
                 .filter(lambda x: x[0][1] in frequent_singletons) \
                 .reduceByKey(lambda x,y: x+y) \
                 .filter(lambda x:x[1]>support)
print(second_pass.count())

2314

- Now we need to count all the triplets

In [14]:
frequent_pairs = set(second_pass.map(lambda x:x[0]).collect())
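# sort each basket so that triplets (and their pairs) match the sorted tuples in frequent_pairs
third_pass = rdd.flatMap(lambda basket:[(e,1) for e in combinations(sorted(basket),3)]) \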
                 .filter(lambda x: (x[0][0],x[0][1]) in frequent_pairs) \
                 .filter(lambda x: (x[0][1],x[0][2]) in frequent_pairs) \
                 .filter(lambda x: (x[0][0],x[0][2]) in frequent_pairs) \
                 .reduceByKey(lambda x,y: x+y) \
                 .filter(lambda x:x[1]>support)
print(third_pass.count())

0

You get the point: we simply repeat these steps until no frequent itemsets remain.

In [16]:

support = 3

frdd = rdd.flatMap(lambda basket:[(e,1) for e in basket]) \
          .reduceByKey(lambda x,y:x+y) \
          .filter(lambda x:x[1] > support)
# collect the frequent singletons as 1-tuples, from the RDD filtered with the current support
frequent = set(frdd.map(lambda x:(x[0],)).collect())

print(f"remaining: {len(frequent)}, frdd {frdd.take(5)}")
    
k = 2
while frdd.count() != 0:
    frdd = rdd.flatMap(lambda basket: [(x,1) for x in combinations(sorted(basket),k)])\
              .filter(lambda x: all([y in frequent for y in combinations(x[0],len(x[0])-1)])) \
              .reduceByKey(lambda x,y:x+y) \
              .filter(lambda x:x[1] > support)
    
    frequent = set(frdd.map(lambda x:x[0]).collect())
    print(k, len(frequent), frdd.take(5))
    k += 1

remaining: 100, frdd [(0, 5334), (8, 5307), (80, 5231), (40, 5209), (72, 5349)]

2 4950 [((1, 37), 307), ((1, 93), 316), ((8, 62), 300), ((5, 25), 317), ((5, 41), 309)]

3 161698 [((1, 14, 38), 20), ((1, 14, 86), 15), ((1, 38, 86), 15), ((1, 58, 74), 19), ((1, 58, 90), 14)]

4 73732 [((3, 37, 38, 74), 5), ((15, 39, 52, 98), 4), ((3, 58, 68, 75), 5), ((19, 30, 63, 66), 6), ((1, 45, 69, 81), 5)]

5 12 [((50, 54, 64, 65, 84), 4), ((0, 56, 62, 83, 92), 4), ((12, 15, 29, 46, 68), 4), ((10, 30, 46, 50, 74), 4), ((25, 56, 66, 74, 91), 4)]

6 0 []
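The same loop can be sketched without Spark on a dataset small enough to verify by hand. This is only an illustration under assumptions: the function name `apriori`, the toy baskets, and the use of `collections.Counter` are not part of the notebook, and the "count > support" convention simply mirrors the cells above.

```python
from collections import Counter
from itertools import combinations

def apriori(baskets, support):
    """Return {itemset_tuple: count} for all itemsets with count > support."""
    # Pass 1: frequent singletons, stored as 1-tuples.
    counts = Counter((item,) for basket in baskets for item in basket)
    frequent = {s: c for s, c in counts.items() if c > support}
    result = dict(frequent)

    k = 2
    while frequent:
        counts = Counter()
        for basket in baskets:
            for candidate in combinations(sorted(basket), k):
                # Count a candidate only if all of its (k-1)-subsets were frequent.
                if all(sub in frequent for sub in combinations(candidate, k - 1)):
                    counts[candidate] += 1
        frequent = {s: c for s, c in counts.items() if c > support}
        result.update(frequent)
        k += 1
    return result

# Made-up baskets: item 4 appears once, so it is pruned in the first pass.
toy = [{1, 2, 3}, {1, 2, 3}, {1, 2}, {2, 3}, {4}]
found = apriori(toy, support=1)
print(found)
```

Sorting each basket before calling `combinations` keeps every itemset in one canonical tuple form, which is what makes the membership test against `frequent` reliable.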


(⭐⭐⭐) repeat this algorithm with the data in `data.txt`

## PCY Algorithm
The PCY (Park-Chen-Yu) algorithm is a data mining technique for finding frequent itemsets in large datasets. It improves on the Apriori algorithm.

In the Apriori algorithm, we count an itemset only if all of its subsets are frequent. For instance, a pair `(i,j)` is counted only if both `i` and `j` are frequent. PCY adds a second filter that can rule out an itemset before it is counted: the **bucket**.

A bucket is a counter shared by a set of itemsets. Let $X=\{\text{itemset}_1,\dots,\text{itemset}_m\}$ be the itemsets associated with the same bucket. The bucket's counter holds the total number of occurrences of all itemsets in $X$; call this total $k$. Consequently, no itemset in $X$ can occur more than $k$ times (otherwise the bucket count would be greater than $k$).

To illustrate, consider the itemset collection $X=\{\{11,3\},\{1,12\},\{13,2\},\{4,5\}\}$ and assume all itemsets in $X$ are linked to the same bucket. If the itemsets $\{11,3\},\{1,12\},\{13,2\},\{4,5\}$ occur 1, 2, 5, and 2 times among the baskets, respectively, the associated bucket count is $1+2+5+2=10$, so no itemset in $X$ appears more than 10 times. If our support threshold were $15$, we could conclude from the bucket alone that none of the itemsets in $X$ is frequent. Conversely, if the threshold were $3$, the bucket count exceeds it, so $X$ *may* contain frequent itemsets and each one must be counted exactly (here only $\{13,2\}$, with 5 occurrences, is actually frequent).
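The arithmetic of this example can be reproduced in a few lines. The sketch below assumes, as in the text, that the four pairs share one bucket; the helper name `bucket_may_hold_frequent` is hypothetical, and "count > support" follows the convention used in the code cells.

```python
# Counts from the worked example; all four pairs are assumed to share one bucket.
pair_counts = {(3, 11): 1, (1, 12): 2, (2, 13): 5, (4, 5): 2}
bucket_count = sum(pair_counts.values())

def bucket_may_hold_frequent(bucket_count, support):
    """A bucket can contain a frequent itemset only if its own count exceeds the support."""
    return bucket_count > support

for support in (3, 15):
    keep = bucket_may_hold_frequent(bucket_count, support)
    # Exact counts are shown only to compare with the bucket-level decision.
    truly_frequent = [p for p, c in pair_counts.items() if c > support]
    print(support, keep, truly_frequent)
```

The bucket test is one-sided: a bucket below the threshold rules out all of its itemsets, while a bucket above it only says that exact counting is still needed.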

We can map itemsets to buckets simply by using a hash function.

In [None]:
print(hash((1,2)))
# to get a bucket index, take the hash modulo the number of buckets (10 here)
print(hash((1,2))%10)

In the first step of the algorithm we compute the counts for each singleton (as in the Apriori algorithm) and also the bucket counts for the pairs.
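On a single machine, both counts can be gathered in one scan of the baskets. The following is a minimal sketch under assumptions: the function name `pcy_first_pass`, the toy baskets, and the bucket count of 8 are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def pcy_first_pass(baskets, support, n_buckets):
    """One scan: count single items and hash every pair into a bucket counter."""
    item_counts = Counter()
    bucket_counts = [0] * n_buckets
    for basket in baskets:
        items = sorted(basket)
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[hash(pair) % n_buckets] += 1
    frequent_items = {i for i, c in item_counts.items() if c > support}
    # Compress the counters to one bit per bucket.
    bitmap = [int(c > support) for c in bucket_counts]
    return frequent_items, bitmap

# Made-up baskets, only for illustration.
toy = [{1, 2, 3}, {1, 2}, {1, 2, 4}, {3, 4}]
frequent_items, bitmap = pcy_first_pass(toy, support=1, n_buckets=8)
print(frequent_items, bitmap)
```

Only the bitmap is kept for the next pass, which is why the bucket counters cost little memory compared with counting every pair exactly.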

In [18]:
def from_buckets_to_bitmap(buckets, nobuckets):
    # buckets: RDD of (bucket_index, 0/1 flag); buckets that were never hit stay 0
    bitmap = [0] * nobuckets
    for index, flag in buckets.collect():
        bitmap[index] = flag
    return bitmap

In [23]:
support = 310
buckets = 10000
first_pass = rdd.flatMap(lambda basket:[(e,1) for e in basket]) \
                .reduceByKey(lambda x,y:x+y) \
                .filter(lambda x:x[1]>support)

first_buckets = rdd.flatMap(lambda basket:[(hash(e)%buckets,1) for e in combinations(sorted(basket),2)]) \
                .reduceByKey(lambda x,y:x+y) \
                .map(lambda x:(x[0],int(x[1] > support)))
first_bitmap = from_buckets_to_bitmap(first_buckets, buckets)


frequent_singletons = set(first_pass.map(lambda x:x[0]).collect())
print(f"it:1, frequents:{len(frequent_singletons)}, bitmap:{sum(first_bitmap)}, samples:{first_pass.take(3)}")

it:1, frequents:100, bitmap:2357, samples:[(0, 5334), (8, 5307), (80, 5231)]


In the next step, we compute the frequency of the pairs for which
- the respective bucket is set to `1`;
- both singletons are frequent (exceed the support).

Additionally, we also compute the buckets for the triplets.

In [20]:

second_pass = rdd.flatMap(lambda basket:[(e,1) for e in combinations(sorted(basket),2) if  
                                         e[0] in frequent_singletons and 
                                         e[1] in frequent_singletons]) \
                 .filter(lambda x:first_bitmap[hash(x[0])%buckets]) \
                 .reduceByKey(lambda x,y: x+y) \
                 .filter(lambda x:x[1]>support)

second_buckets = rdd.flatMap(lambda basket:[(hash(e)%buckets,1) for e in combinations(sorted(basket),3)]) \
                .reduceByKey(lambda x,y:x+y) \
                .map(lambda x:(x[0],int(x[1] > support)))

second_bitmap = from_buckets_to_bitmap(second_buckets, buckets)

print(second_pass.take(5))
print(f"frequent buckets:{sum(second_bitmap)}, bitmap:", "".join(map(str,second_bitmap[:100])))

                                                                                

[((1, 93), 316), ((5, 25), 317), ((61, 81), 314), ((44, 58), 322), ((4, 18), 313)]
frequent buckets:3800, bitmap: 1000000110001101110011001101000101000000100010000001110011101100100001100000011000100000011001100000

In the next step, we need to compute the frequency of the triplets for which
- the respective bucket is set to `1`;
- all their pairs are frequent (exceed the support).

Additionally, we also need to compute the buckets for the quadruplets, and so on.
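The two conditions above can be written as a single predicate that works for any itemset size. This is a sketch under assumptions: `is_candidate`, the set of frequent pairs, and the hand-built bitmap are hypothetical and not taken from the notebook's data.

```python
from itertools import combinations

def is_candidate(itemset, frequent, bitmap, n_buckets):
    """PCY test for a sorted k-tuple: frequent bucket and all (k-1)-subsets frequent."""
    in_frequent_bucket = bitmap[hash(itemset) % n_buckets] == 1
    subsets_frequent = all(
        sub in frequent for sub in combinations(itemset, len(itemset) - 1)
    )
    return in_frequent_bucket and subsets_frequent

# Hypothetical state after the pair pass.
n_buckets = 16
frequent_pairs = {(1, 2), (1, 3), (2, 3), (2, 4)}
bitmap = [0] * n_buckets
bitmap[hash((1, 2, 3)) % n_buckets] = 1  # mark the bucket of (1, 2, 3) as frequent

print(is_candidate((1, 2, 3), frequent_pairs, bitmap, n_buckets))  # True
print(is_candidate((1, 2, 4), frequent_pairs, bitmap, n_buckets))  # False: (1, 4) is not frequent
```

Only itemsets that pass this test need an exact count in the following pass.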

Let us instead focus on the iterative algorithm.

In [None]:
support = 3
buckets = 1000000

k = 1
# let's compute the frequent singletons
frdd = rdd.flatMap(lambda basket:[(e,1) for e in basket]) \
          .reduceByKey(lambda x,y:x+y) \
          .filter(lambda x:x[1] > support)

# let's compute the bucket counters for all the pairs in the baskets
fbuckets = rdd.flatMap(lambda basket:[(hash(e)%buckets,1) for e in combinations(sorted(basket),2)]) \
            .reduceByKey(lambda x,y:x+y) \
            .map(lambda x:(x[0],int(x[1] > support)))

# from the buckets counters, we get the bitmap (1 for frequent bucket)
bitmap = from_buckets_to_bitmap(fbuckets, buckets)

# the frequent elements (same as in Apriori), taken from the RDD filtered with the current support
frequent = set(frdd.map(lambda x:(x[0],)).collect())

print(f"it:{k}, frequents:{len(frequent)}, bitmap:{sum(bitmap)}, samples:{frdd.take(3)}")

k = 2
while frdd.count() != 0:

    # we use the frequent elements and the bitmap to filter the itemsets
    frdd = (rdd.flatMap(lambda basket: [(x,1) for x in combinations(sorted(basket),k)]) # compute all itemsets of size k from each basket
              .filter(lambda x: all([y in frequent for y in combinations(x[0],len(x[0])-1)])) # drop itemsets that have a non-frequent sub-itemset
              .filter(lambda x:bitmap[hash(x[0])%buckets]) # drop itemsets whose bucket has bitmap value 0
              .reduceByKey(lambda x,y:x+y) # count the remaining itemsets
              .filter(lambda x:x[1] > support)) # keep only those that are frequent

    # now we compute the buckets and the bitmap for the next iteration
    fbuckets = (rdd.flatMap(lambda basket:[(hash(e)%buckets,1) for e in combinations(sorted(basket),k+1)]) # hash each itemset on its bucket 
            .reduceByKey(lambda x,y:x+y) # count buckets
            .map(lambda x:(x[0],int(x[1] > support)))) # map to 1 those that are frequent 
    bitmap = from_buckets_to_bitmap(fbuckets, buckets) # compute the bitmap
    
    # get the frequent elements for the next step
    frequent = set(frdd.map(lambda x:x[0]).collect())
    
    print(f"it:{k}, frequents:{len(frequent)}, bitmap:{sum(bitmap)}, samples:{frdd.take(3)}")
    k += 1

it:1, frequents:100, bitmap:4950, samples:[(0, 5334), (8, 5307), (80, 5231)]

it:2, frequents:4950, bitmap:153048, samples:[((1, 37), 307), ((1, 93), 316), ((8, 62), 300)]

it:3, frequents:161698, bitmap:501079, samples:[((1, 14, 38), 20), ((1, 14, 86), 15), ((1, 38, 86), 15)]

(⭐⭐⭐) repeat this algorithm with the data in `data.txt`