<a href="https://colab.research.google.com/github/elhamod/BA820/blob/main/Association_Rules/Basic_Association_Rules.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Course: BA820 - Unsupervised and Unstructured ML**

**Notebook created by: Mohannad Elhamod**

# Analyzing Trends at a Grocery Store

In this notebook, we will use `mlxtend` to mine association rules from grocery transactions.

In [43]:
# Suppress some unimportant warnings
import warnings
warnings.filterwarnings(action="ignore", message=r"datetime.datetime.utcnow")

In [44]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from mlxtend.preprocessing import TransactionEncoder

## Load the data

In [45]:
# Install dependencies as needed:
# pip install kagglehub[pandas-datasets]
import kagglehub
from kagglehub import KaggleDatasetAdapter

# Set the path to the file you'd like to load
file_path = "GroceryStoreDataSet.csv"

# Load the latest version
df = kagglehub.load_dataset(
  KaggleDatasetAdapter.PANDAS,
  "simgeerek/grocerystoredataset",
  file_path,
  # Provide any additional arguments like
  # sql_query or pandas_kwargs. See the
  # documentation for more information:
  # https://github.com/Kaggle/kagglehub/blob/main/README.md#kaggledatasetadapterpandas
)

# Fix misspelled item names: "COCK" -> "COKE" and "SUGER" -> "SUGAR"
df = df.apply(lambda x: x.str.replace("COCK", "COKE"))
df = df.apply(lambda x: x.str.replace("SUGER", "SUGAR"))

display(df)

Using Colab cache for faster access to the 'grocerystoredataset' dataset.


Unnamed: 0,"MILK,BREAD,BISCUIT"
0,"BREAD,MILK,BISCUIT,CORNFLAKES"
1,"BREAD,TEA,BOURNVITA"
2,"JAM,MAGGI,BREAD,MILK"
3,"MAGGI,TEA,BISCUIT"
4,"BREAD,TEA,BOURNVITA"
5,"MAGGI,TEA,CORNFLAKES"
6,"MAGGI,BREAD,TEA,BISCUIT"
7,"JAM,MAGGI,BREAD,TEA"
8,"BREAD,MILK"
9,"COFFEE,COKE,BISCUIT,CORNFLAKES"


## Preprocessing

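Convert each comma-separated transaction string into a list of items. Note that the first line of the file (`MILK,BREAD,BISCUIT`) was read as the column name, so it is not among the 19 transactions below.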

In [46]:
data_column = df.iloc[:, 0]
data = list(data_column.apply(lambda x: x.split(',')))
data

[['BREAD', 'MILK', 'BISCUIT', 'CORNFLAKES'],
 ['BREAD', 'TEA', 'BOURNVITA'],
 ['JAM', 'MAGGI', 'BREAD', 'MILK'],
 ['MAGGI', 'TEA', 'BISCUIT'],
 ['BREAD', 'TEA', 'BOURNVITA'],
 ['MAGGI', 'TEA', 'CORNFLAKES'],
 ['MAGGI', 'BREAD', 'TEA', 'BISCUIT'],
 ['JAM', 'MAGGI', 'BREAD', 'TEA'],
 ['BREAD', 'MILK'],
 ['COFFEE', 'COKE', 'BISCUIT', 'CORNFLAKES'],
 ['COFFEE', 'COKE', 'BISCUIT', 'CORNFLAKES'],
 ['COFFEE', 'SUGAR', 'BOURNVITA'],
 ['BREAD', 'COFFEE', 'COKE'],
 ['BREAD', 'SUGAR', 'BISCUIT'],
 ['COFFEE', 'SUGAR', 'CORNFLAKES'],
 ['BREAD', 'SUGAR', 'BOURNVITA'],
 ['BREAD', 'COFFEE', 'SUGAR'],
 ['BREAD', 'COFFEE', 'SUGAR'],
 ['TEA', 'MILK', 'COFFEE', 'CORNFLAKES']]

Next, we one-hot encode the data: one row per transaction and one Boolean column per item.

In [47]:
# Transform data
te = TransactionEncoder()
transactions = te.fit(data).transform(data) # or fit_transform(data)

# Create a dataframe from the data
df_encoded = pd.DataFrame(transactions, columns=te.columns_)
df_encoded

Unnamed: 0,BISCUIT,BOURNVITA,BREAD,COFFEE,COKE,CORNFLAKES,JAM,MAGGI,MILK,SUGAR,TEA
0,True,False,True,False,False,True,False,False,True,False,False
1,False,True,True,False,False,False,False,False,False,False,True
2,False,False,True,False,False,False,True,True,True,False,False
3,True,False,False,False,False,False,False,True,False,False,True
4,False,True,True,False,False,False,False,False,False,False,True
5,False,False,False,False,False,True,False,True,False,False,True
6,True,False,True,False,False,False,False,True,False,False,True
7,False,False,True,False,False,False,True,True,False,False,True
8,False,False,True,False,False,False,False,False,True,False,False
9,True,False,False,True,True,True,False,False,False,False,False
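
To make the encoding step concrete, here is a minimal pure-Python sketch of the same idea on the first three transactions. It illustrates the concept only and is not how `mlxtend` implements `TransactionEncoder`; the names `sample`, `columns`, and `encoded` are made up for this example.

```python
# First three transactions from the list above (illustrative subset)
sample = [
    ["BREAD", "MILK", "BISCUIT", "CORNFLAKES"],
    ["BREAD", "TEA", "BOURNVITA"],
    ["JAM", "MAGGI", "BREAD", "MILK"],
]

# Every distinct item becomes one column; sorting gives a stable column order
columns = sorted({item for transaction in sample for item in transaction})

# One row per transaction: True where the item is in the basket, else False
encoded = [
    [item in set(transaction) for item in columns]
    for transaction in sample
]

print(columns)
for row in encoded:
    print(row)
```

Each row has exactly as many `True` values as the transaction has distinct items, which is a quick sanity check on any encoded table.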


## Association Rule Extraction

Let's find the frequent itemsets. With `min_support` set this low, every itemset that appears in at least one transaction is kept.

In [48]:
frequent_itemsets = apriori(df_encoded, min_support=0.000001, use_colnames=True)
frequent_itemsets.sort_values(by="support")

Unnamed: 0,support,itemsets
24,0.052632,"(BREAD, COKE)"
20,0.052632,"(COFFEE, BOURNVITA)"
17,0.052632,"(BISCUIT, SUGAR)"
25,0.052632,"(CORNFLAKES, BREAD)"
16,0.052632,"(BISCUIT, MILK)"
...,...,...
0,0.315789,(BISCUIT)
9,0.315789,(SUGAR)
10,0.368421,(TEA)
3,0.421053,(COFFEE)
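
The support of an itemset is the fraction of transactions that contain every item in it. As a rough cross-check of the table above, the sketch below recomputes a few support values directly from the 19 transactions. The helper name `support` is hypothetical and not part of `mlxtend`.

```python
# The 19 transactions listed in the preprocessing step
transactions = [
    ["BREAD", "MILK", "BISCUIT", "CORNFLAKES"],
    ["BREAD", "TEA", "BOURNVITA"],
    ["JAM", "MAGGI", "BREAD", "MILK"],
    ["MAGGI", "TEA", "BISCUIT"],
    ["BREAD", "TEA", "BOURNVITA"],
    ["MAGGI", "TEA", "CORNFLAKES"],
    ["MAGGI", "BREAD", "TEA", "BISCUIT"],
    ["JAM", "MAGGI", "BREAD", "TEA"],
    ["BREAD", "MILK"],
    ["COFFEE", "COKE", "BISCUIT", "CORNFLAKES"],
    ["COFFEE", "COKE", "BISCUIT", "CORNFLAKES"],
    ["COFFEE", "SUGAR", "BOURNVITA"],
    ["BREAD", "COFFEE", "COKE"],
    ["BREAD", "SUGAR", "BISCUIT"],
    ["COFFEE", "SUGAR", "CORNFLAKES"],
    ["BREAD", "SUGAR", "BOURNVITA"],
    ["BREAD", "COFFEE", "SUGAR"],
    ["BREAD", "COFFEE", "SUGAR"],
    ["TEA", "MILK", "COFFEE", "CORNFLAKES"],
]


def support(itemset, transactions):
    """Fraction of transactions that contain all items in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


for itemset in [{"COFFEE"}, {"TEA"}, {"BISCUIT"}, {"BREAD", "COKE"}]:
    print(sorted(itemset), round(support(itemset, transactions), 6))
```

An itemset bought only once has support 1/19, which is why the smallest value in the table is 0.052632.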


Let's extract the rules that reach a minimum confidence of 0.6.

In [49]:
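# num_itemsets is documented by mlxtend as the number of transactions in the original data
rules = association_rules(frequent_itemsets,
                          num_itemsets=len(df_encoded),
                          metric="confidence", min_threshold=0.6)
# Alternative criterion: metric="support", min_threshold=0.05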
rules.sort_values(by=["support", "confidence"])

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
12,"(CORNFLAKES, BREAD)",(BISCUIT),0.052632,0.315789,0.052632,1.000000,3.166667,1.0,0.036011,inf,0.722222,0.166667,1.000000,0.583333
13,"(BISCUIT, MILK)",(BREAD),0.052632,0.631579,0.052632,1.000000,1.583333,1.0,0.019391,inf,0.388889,0.083333,1.000000,0.541667
14,"(BISCUIT, SUGAR)",(BREAD),0.052632,0.631579,0.052632,1.000000,1.583333,1.0,0.019391,inf,0.388889,0.083333,1.000000,0.541667
25,"(BISCUIT, MILK)",(CORNFLAKES),0.052632,0.315789,0.052632,1.000000,3.166667,1.0,0.036011,inf,0.722222,0.166667,1.000000,0.583333
30,"(COFFEE, BOURNVITA)",(SUGAR),0.052632,0.315789,0.052632,1.000000,3.166667,1.0,0.036011,inf,0.722222,0.166667,1.000000,0.583333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6,(COKE),(COFFEE),0.157895,0.421053,0.157895,1.000000,2.375000,1.0,0.091413,inf,0.687500,0.375000,1.000000,0.687500
5,(SUGAR),(BREAD),0.315789,0.631579,0.210526,0.666667,1.055556,1.0,0.011080,1.105263,0.076923,0.285714,0.095238,0.500000
7,(CORNFLAKES),(COFFEE),0.315789,0.421053,0.210526,0.666667,1.583333,1.0,0.077562,1.736842,0.538462,0.400000,0.424242,0.583333
8,(SUGAR),(COFFEE),0.315789,0.421053,0.210526,0.666667,1.583333,1.0,0.077562,1.736842,0.538462,0.400000,0.424242,0.583333
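
To see where these numbers come from, the following sketch recomputes confidence, lift, and leverage for the rule COKE → COFFEE using the standard definitions. The counts are read off the 19 transactions listed earlier (COKE appears in 3, COFFEE in 8, both together in 3); the variable names are illustrative only.

```python
# Counts taken from the transaction list in the preprocessing step
n_transactions = 19
n_coke = 3      # transactions containing COKE (antecedent)
n_coffee = 8    # transactions containing COFFEE (consequent)
n_both = 3      # transactions containing both COKE and COFFEE

support_antecedent = n_coke / n_transactions
support_consequent = n_coffee / n_transactions
support_rule = n_both / n_transactions

# confidence(A -> C) = support(A and C) / support(A)
confidence = support_rule / support_antecedent

# lift(A -> C) = confidence(A -> C) / support(C)
lift = confidence / support_consequent

# leverage(A -> C) = support(A and C) - support(A) * support(C)
leverage = support_rule - support_antecedent * support_consequent

print(round(confidence, 6), round(lift, 6), round(leverage, 6))
```

The results should agree with the `(COKE), (COFFEE)` row above: confidence 1.0, lift 2.375, and leverage 0.091413. A lift above 1 means the two items co-occur more often than they would if they were bought independently.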


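Let's filter the rules further, keeping only those with a lift of at least 3. The confidence condition is already satisfied by the 0.6 threshold used above.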

In [50]:
rules_filtered = rules[(rules['confidence'] > 0.5) & (rules['lift'] >= 3)]
rules_filtered.sort_values(by=["confidence", "lift"])

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
44,(COKE),"(CORNFLAKES, COFFEE)",0.157895,0.210526,0.105263,0.666667,3.166667,1.0,0.072022,2.368421,0.8125,0.4,0.577778,0.583333
21,"(CORNFLAKES, BISCUIT)",(COKE),0.157895,0.157895,0.105263,0.666667,4.222222,1.0,0.080332,2.526316,0.90625,0.5,0.604167,0.666667
24,(COKE),"(CORNFLAKES, BISCUIT)",0.157895,0.157895,0.105263,0.666667,4.222222,1.0,0.080332,2.526316,0.90625,0.5,0.604167,0.666667
68,"(CORNFLAKES, BISCUIT)","(COFFEE, COKE)",0.157895,0.157895,0.105263,0.666667,4.222222,1.0,0.080332,2.526316,0.90625,0.5,0.604167,0.666667
72,"(COFFEE, COKE)","(CORNFLAKES, BISCUIT)",0.157895,0.157895,0.105263,0.666667,4.222222,1.0,0.080332,2.526316,0.90625,0.5,0.604167,0.666667
18,(COKE),"(BISCUIT, COFFEE)",0.157895,0.105263,0.105263,0.666667,6.333333,1.0,0.088643,2.684211,1.0,0.666667,0.627451,0.833333
34,"(MAGGI, BREAD)",(JAM),0.157895,0.105263,0.105263,0.666667,6.333333,1.0,0.088643,2.684211,1.0,0.666667,0.627451,0.833333
73,(COKE),"(CORNFLAKES, BISCUIT, COFFEE)",0.157895,0.105263,0.105263,0.666667,6.333333,1.0,0.088643,2.684211,1.0,0.666667,0.627451,0.833333
12,"(CORNFLAKES, BREAD)",(BISCUIT),0.052632,0.315789,0.052632,1.0,3.166667,1.0,0.036011,inf,0.722222,0.166667,1.0,0.583333
20,"(BISCUIT, COFFEE)",(CORNFLAKES),0.105263,0.315789,0.105263,1.0,3.166667,1.0,0.072022,inf,0.764706,0.333333,1.0,0.666667


## Questions:

* Compute basic descriptive statistics for the dataset, such as the average number of items per transaction and the 10 most frequent itemsets.

* The store's online platform aims to recommend a single additional item once a customer has added three items to their cart. Display and interpret the association rules that support such recommendations.

* The store has an excess inventory of tea that is approaching its expiration date and wants to sell it quickly. Based on your analysis, which customer segments should tea be promoted to, and why?

* Empirically demonstrate how the computational complexity (i.e., runtime) of your code changes as a function of:
  1. the number of transactions, and/or
  2. the filtering criteria (e.g., minimum support or confidence thresholds).
You may use tools such as `%%timeit` to support your analysis.
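
Outside a notebook, the standard-library `timeit` module plays the same role as `%%timeit`. The sketch below shows only the measurement pattern: `count_pairs` is a hypothetical stand-in workload (a naive pair counter), not `apriori`, and the three-transaction `base` list is a toy input. Swap in your own function and data.

```python
import timeit
from itertools import combinations


def count_pairs(transactions):
    """Stand-in workload: count how often each pair of items co-occurs."""
    counts = {}
    for t in transactions:
        for pair in combinations(sorted(set(t)), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts


# Toy input, replicated to simulate a growing number of transactions
base = [
    ["BREAD", "MILK"],
    ["BREAD", "TEA", "BOURNVITA"],
    ["COFFEE", "COKE", "BISCUIT", "CORNFLAKES"],
]

timings = {}
for factor in (1, 10, 100):
    data = base * factor
    # Take the best of 3 repeats, each running the workload 20 times
    best = min(timeit.repeat(lambda: count_pairs(data), number=20, repeat=3))
    timings[len(data)] = best

for n_rows, seconds in timings.items():
    print(f"{n_rows:>4} transactions: {seconds:.6f} s for 20 runs")
```

Taking the minimum over several repeats reduces noise from other processes; plotting the timings against the number of transactions (or against the support threshold) makes the trend easier to read.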

