# MarketBasket
### Overview
In market basket analysis, a **basket** is the collection of items a customer buys in a single purchase trip. It is not a physical basket, but the data recording which products were bought together. Baskets are represented as **sets** of indices, each index referring to a specific product (e.g. `{112, 41, 1020}` is a basket).

By identifying frequent itemsets (groups of items that are often purchased together), we can reveal associations between products. Retailers can use this knowledge for strategies such as targeted promotions.
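
To make the idea of "how often an itemset is bought together" concrete, here is a minimal plain-Python sketch. The toy baskets and the helper name `support_count` are made up for illustration and are not part of the dataset used below.

```python
# Toy baskets (made-up product indices, for illustration only).
baskets = [
    {112, 41, 1020},
    {41, 1020},
    {112, 41},
    {7, 1020},
]

def support_count(itemset, baskets):
    """Number of baskets that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)

print(support_count({41}, baskets))        # 3
print(support_count({41, 1020}, baskets))  # 2
print(support_count({112, 7}, baskets))    # 0
```

An itemset counts once per basket that contains all of its items, which is why the set-inclusion test `itemset <= basket` is enough.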



In [None]:
# First we need a dataset.
# Let us generate a synthetic one in Python.
import random

def random_basket(items, max_basket_size=15):
    """Return a random basket: a set of item indices drawn from [0, items)."""
    basket_size = random.randint(1, max_basket_size)
    # Duplicate draws collapse in the set, so a basket may hold fewer than basket_size items
    return {random.randint(0, items - 1) for _ in range(basket_size)}

dataset_size, items = 100000, 100
dataset = [random_basket(items, 10) for _ in range(dataset_size)]
print(dataset[:3])

In [None]:
# Let's initialize the Spark session (the data is parallelized in the next cell)
import os

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Apriori").getOrCreate()

In [None]:
# Let's load the data into Spark as an RDD
rdd = spark.sparkContext.parallelize(dataset)

## The A-Priori Algorithm
This algorithm leverages a key property of itemsets: if an itemset is frequent, all of its subsets must also be frequent (e.g. if `{12,54,22,92,69,4}` is frequent, then subsets such as `{12,54,22}` and `{12,69,4}` are frequent too). Conversely, an itemset that has an infrequent subset cannot be frequent, so it never needs to be counted. Itemsets are considered frequent (or interesting) when their frequency exceeds a threshold parameter called **support**.

> The Role of Support: The A-Priori algorithm uses a minimum support threshold. This threshold defines how frequent an itemset needs to be in order to be considered "interesting" for further analysis. Items or itemsets that appear less frequently than the threshold are discarded. Therefore, it is crucial to select a support that filters out most of the data (to keep the algorithm light) while not discarding interesting connections.
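
The subset property can be checked directly on a small example. The sketch below uses hypothetical toy baskets and a hypothetical helper `count`; it only illustrates that a subset is never rarer than the itemset that contains it.

```python
from itertools import combinations

# Hypothetical toy data, for illustration only.
baskets = [{1, 2, 3}, {1, 2}, {1, 2, 3}, {2, 3}, {1, 3}]

def count(itemset):
    """Number of baskets containing every item of `itemset`."""
    return sum(1 for b in baskets if set(itemset) <= b)

big = (1, 2, 3)
print(big, count(big))  # (1, 2, 3) 2
for size in (1, 2):
    for sub in combinations(big, size):
        # Every basket that contains `big` also contains `sub`,
        # so the count of `sub` can never be smaller.
        assert count(sub) >= count(big)
        print(sub, count(sub))
```

Here every singleton appears 4 times and every pair 3 times, while the full triplet appears only twice. This is why the algorithm can discard a candidate as soon as one of its subsets falls below the threshold.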

- The first step of the A-Priori algorithm is to count the occurrences of each item

In [None]:
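support = 310
# Emit (item, 1) for every item of every basket, sum the counts per item,
# and keep only the items seen more than `support` times
first_pass = rdd.flatMap(lambda basket:[(e,1) for e in basket]) \
                .reduceByKey(lambda x,y:x+y) \
                .filter(lambda x:x[1]>support)

print("remaining singletons", first_pass.count())
print("first 5 singletons", first_pass.take(5))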

- Now we need to count all pairs composed of frequent singletons

In [None]:
frequent_singletons = set(first_pass.map(lambda x:x[0]).collect())
# e0 < e1 emits each unordered pair exactly once, always in the same order
second_pass = rdd.flatMap(lambda basket:[((e0,e1),1) for e0 in basket for e1 in basket if e0 < e1 and e0 in frequent_singletons and e1 in frequent_singletons]) \
                 .reduceByKey(lambda x,y: x+y) \
                 .filter(lambda x:x[1]>support)
print(second_pass.count())

- Now we need to count all the triplets whose pairs are all frequent

In [None]:
frequent_pairs = set(second_pass.map(lambda x:x[0]).collect())
# e0 < e1 < e2 emits each triplet exactly once; a triplet is a candidate
# only if all three of its pairs are frequent
third_pass = rdd.flatMap(lambda basket:[((e0,e1,e2),1) for e0 in basket for e1 in basket for e2 in basket if \
                                         e0 < e1 < e2 and (e0,e1) in frequent_pairs and (e1,e2) in frequent_pairs and (e0,e2) in frequent_pairs]) \
                 .reduceByKey(lambda x,y: x+y) \
                 .filter(lambda x:x[1]>support)
print(third_pass.count())

You get the point: we simply repeat these steps, growing the itemset size by one each time, until no frequent itemsets remain.

In [None]:
from itertools import combinations

support = 3

# Pass 1: frequent singletons
frdd = rdd.flatMap(lambda basket:[(e,1) for e in basket]) \
          .reduceByKey(lambda x,y:x+y) \
          .filter(lambda x:x[1] > support)
# Store singletons as 1-tuples so they match the output of combinations(x, 1);
# build the set from frdd (not first_pass, which used a different support)
frequent = set(frdd.map(lambda x:(x[0],)).collect())

print(f"remaining: {len(frequent)}, frdd {frdd.take(5)}")

k = 2
while frequent:
    # sorted(basket) gives every itemset a single canonical tuple;
    # a k-itemset is counted only if all of its (k-1)-subsets are frequent
    frdd = rdd.flatMap(lambda basket: [(x,1) for x in combinations(sorted(basket),k) if all(y in frequent for y in combinations(x,k-1))]) \
              .reduceByKey(lambda x,y:x+y) \
              .filter(lambda x:x[1] > support)

    frequent = set(frdd.map(lambda x:x[0]).collect())
    print(k, len(frequent), frdd.take(5))
    k += 1
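
For sanity-checking on small inputs, the same level-wise loop can be sketched in plain Python without Spark. This is only an illustrative sketch: the function name `apriori_local` and the toy baskets are hypothetical, and it keeps the same convention as the cells above (an itemset is frequent when its count is strictly greater than `support`).

```python
from collections import Counter
from itertools import combinations

def apriori_local(baskets, support):
    """Return {itemset_tuple: count} for every itemset with count > support."""
    result = {}
    # Pass 1: frequent singletons, stored as 1-tuples.
    counts = Counter((item,) for basket in baskets for item in basket)
    frequent = {s: c for s, c in counts.items() if c > support}
    k = 2
    while frequent:
        result.update(frequent)
        counts = Counter(
            cand
            for basket in baskets
            # Sorting gives every itemset one canonical tuple.
            for cand in combinations(sorted(basket), k)
            # Keep a candidate only if all of its (k-1)-subsets are frequent.
            if all(sub in frequent for sub in combinations(cand, k - 1))
        )
        frequent = {s: c for s, c in counts.items() if c > support}
        k += 1
    return result

# Hypothetical toy baskets, for illustration only.
toy = [{1, 2, 3}, {1, 2}, {1, 2, 3}, {2, 3}, {1, 3}]
for itemset, n in apriori_local(toy, support=1).items():
    print(itemset, n)
```

On a tiny dataset like this, the result can be compared by hand against the counts, which makes it a convenient cross-check before running the distributed version on a larger file.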

(⭐⭐⭐) Repeat this algorithm with the data in `data.txt`.