<a href="https://colab.research.google.com/github/elhamod/BA820/blob/main/Hands-on/01-association-rules/association_rules.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Course: BA820 - Unsupervised and Unstructured ML**

**Notebook created by: Mohannad Elhamod**

# Analyzing Trends at a Grocery Store

In this notebook, we will analyze the transactions at a grocery store and extract customer purchasing patterns that are of interest to us.

In [24]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from mlxtend.preprocessing import TransactionEncoder



## Load the data

In [25]:
import pandas as pd

url = "https://raw.githubusercontent.com/elhamod/BA820/main/Hands-on/01-association-rules/Kaggle_GroceryStoreDataSet_modified.csv"

df = pd.read_csv(url, header=None)
# If there are too many rows to load, limit them with nrows, e.g. pd.read_csv(url, header=None, nrows=100)

df.head()



Unnamed: 0,0
0,"MILK,BREAD,BISCUIT"
1,"BREAD,MILK,BISCUIT,CORNFLAKES"
2,"BREAD,TEA,BOURNVITA"
3,"JAM,MAGGI,BREAD,MILK"
4,"BREAD,TEA,BOURNVITA"


## Analyze data

Convert the comma-separated text in each row of the table into a list of items.

In [26]:
data_column = df.iloc[:, 0]
data = list(data_column.apply(lambda x: x.split(',')))
data



[['MILK', 'BREAD', 'BISCUIT'],
 ['BREAD', 'MILK', 'BISCUIT', 'CORNFLAKES'],
 ['BREAD', 'TEA', 'BOURNVITA'],
 ['JAM', 'MAGGI', 'BREAD', 'MILK'],
 ['BREAD', 'TEA', 'BOURNVITA'],
 ['MAGGI', 'TEA', 'BISCUIT'],
 ['MAGGI', 'TEA', 'CORNFLAKES'],
 ['MAGGI', 'BREAD', 'TEA', 'BISCUIT'],
 ['JAM', 'MAGGI', 'BREAD', 'TEA'],
 ['', 'BREAD', 'MILK'],
 ['CAFE', 'COKE', 'BISCUIT', 'CORNFLAKES'],
 ['COFFEE', 'COCACOLA', 'BISCUIT', 'CORNFLAKES'],
 ['COFFEE', 'SUGER', 'BOURNVITA'],
 ['BREAD', 'CAFE', 'COKE'],
 ['BREAD', 'SUGER', 'BISCUIT'],
 ['COFFEE', 'SUGER', 'CORNFLAKES'],
 ['BREAD', 'SUGER', 'BOURNVITA'],
 ['BREAD', 'CAFE', 'SUGER'],
 ['BREAD', 'CAFE', 'SUGER'],
 ['TEA', 'MILK', 'COFFEE', 'CORNFLAKES']]

Out of curiosity, what are the unique values?

In [27]:
flat_list = []
for lst in data:
  flat_list = flat_list + lst
set(flat_list)



{'',
 'BISCUIT',
 'BOURNVITA',
 'BREAD',
 'CAFE',
 'COCACOLA',
 'COFFEE',
 'COKE',
 'CORNFLAKES',
 'JAM',
 'MAGGI',
 'MILK',
 'SUGER',
 'TEA'}

It seems we have a few things to attend to:


1.   An empty string.
2.   COFFEE and CAFE are the same item.
3.   COKE and COCACOLA are the same product.

Let's clean up the data.

In [28]:
for indx, lst in enumerate(data):
  # Remove empty strings:
  lst = [i for i in lst if i]

  # Replace CAFE with COFFEE
  lst = ['COFFEE' if i == 'CAFE' else i for i in lst]

  # Replace COCACOLA with COKE
  lst = ['COKE' if i == 'COCACOLA' else i for i in lst]

  # Update data
  data[indx] = lst

data



[['MILK', 'BREAD', 'BISCUIT'],
 ['BREAD', 'MILK', 'BISCUIT', 'CORNFLAKES'],
 ['BREAD', 'TEA', 'BOURNVITA'],
 ['JAM', 'MAGGI', 'BREAD', 'MILK'],
 ['BREAD', 'TEA', 'BOURNVITA'],
 ['MAGGI', 'TEA', 'BISCUIT'],
 ['MAGGI', 'TEA', 'CORNFLAKES'],
 ['MAGGI', 'BREAD', 'TEA', 'BISCUIT'],
 ['JAM', 'MAGGI', 'BREAD', 'TEA'],
 ['BREAD', 'MILK'],
 ['COFFEE', 'COKE', 'BISCUIT', 'CORNFLAKES'],
 ['COFFEE', 'COKE', 'BISCUIT', 'CORNFLAKES'],
 ['COFFEE', 'SUGER', 'BOURNVITA'],
 ['BREAD', 'COFFEE', 'COKE'],
 ['BREAD', 'SUGER', 'BISCUIT'],
 ['COFFEE', 'SUGER', 'CORNFLAKES'],
 ['BREAD', 'SUGER', 'BOURNVITA'],
 ['BREAD', 'COFFEE', 'SUGER'],
 ['BREAD', 'COFFEE', 'SUGER'],
 ['TEA', 'MILK', 'COFFEE', 'CORNFLAKES']]

## Market Basket Analysis

We now have the data in a basket format (one list of items per transaction). We need to convert it into a one-hot encoded format, with one boolean column per item.

In [29]:
# Transform data
te = TransactionEncoder()
te_data = te.fit(data).transform(data)

# Create a dataframe from the data
df_encoded = pd.DataFrame(te_data, columns=te.columns_)
df_encoded



Unnamed: 0,BISCUIT,BOURNVITA,BREAD,COFFEE,COKE,CORNFLAKES,JAM,MAGGI,MILK,SUGER,TEA
0,True,False,True,False,False,False,False,False,True,False,False
1,True,False,True,False,False,True,False,False,True,False,False
2,False,True,True,False,False,False,False,False,False,False,True
3,False,False,True,False,False,False,True,True,True,False,False
4,False,True,True,False,False,False,False,False,False,False,True
5,True,False,False,False,False,False,False,True,False,False,True
6,False,False,False,False,False,True,False,True,False,False,True
7,True,False,True,False,False,False,False,True,False,False,True
8,False,False,True,False,False,False,True,True,False,False,True
9,False,False,True,False,False,False,False,False,True,False,False
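
To make the encoding step concrete, here is a minimal hand-rolled sketch of one-hot encoding using only the standard library. It is an illustration of the idea, not how `TransactionEncoder` is implemented; the names `baskets`, `columns`, and `encoded` are made up for this example, and the three baskets are taken from the cleaned data above.

```python
# Illustrative sketch only: a hand-rolled one-hot encoding of a few baskets.
baskets = [
    ["MILK", "BREAD", "BISCUIT"],
    ["BREAD", "TEA", "BOURNVITA"],
    ["MAGGI", "TEA", "BISCUIT"],
]

# Every distinct item, sorted, becomes one column.
columns = sorted({item for basket in baskets for item in basket})

# One row per basket, one boolean per column: True if the basket contains the item.
encoded = [[item in basket for item in columns] for basket in baskets]

print(columns)
for row in encoded:
    print(row)
```

Each row is as wide as the full item vocabulary, which is why this format is convenient for counting co-occurrences but can become sparse when the store sells many products.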


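Let's find the frequent itemsets. With only 20 transactions, the smallest possible non-zero support is 0.05, so a `min_support` this low keeps every itemset that appears at least once.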

In [30]:
frequent_itemsets = apriori(df_encoded, min_support=0.00001, use_colnames=True)
frequent_itemsets



Unnamed: 0,support,itemsets
0,0.35,(BISCUIT)
1,0.20,(BOURNVITA)
2,0.65,(BREAD)
3,0.40,(COFFEE)
4,0.15,(COKE)
...,...,...
78,0.05,"(BISCUIT, TEA, MAGGI, BREAD)"
79,0.10,"(BISCUIT, CORNFLAKES, COKE, COFFEE)"
80,0.05,"(JAM, MAGGI, MILK, BREAD)"
81,0.05,"(TEA, JAM, MAGGI, BREAD)"
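
The `support` column is simply the fraction of transactions that contain every item in the itemset. The sketch below recomputes it by hand on the cleaned baskets from earlier; the helper name `support` is hypothetical and this is a brute-force check, not the Apriori algorithm itself.

```python
# Illustrative sketch only: support computed by direct counting.
data = [
    ["MILK", "BREAD", "BISCUIT"],
    ["BREAD", "MILK", "BISCUIT", "CORNFLAKES"],
    ["BREAD", "TEA", "BOURNVITA"],
    ["JAM", "MAGGI", "BREAD", "MILK"],
    ["BREAD", "TEA", "BOURNVITA"],
    ["MAGGI", "TEA", "BISCUIT"],
    ["MAGGI", "TEA", "CORNFLAKES"],
    ["MAGGI", "BREAD", "TEA", "BISCUIT"],
    ["JAM", "MAGGI", "BREAD", "TEA"],
    ["BREAD", "MILK"],
    ["COFFEE", "COKE", "BISCUIT", "CORNFLAKES"],
    ["COFFEE", "COKE", "BISCUIT", "CORNFLAKES"],
    ["COFFEE", "SUGER", "BOURNVITA"],
    ["BREAD", "COFFEE", "COKE"],
    ["BREAD", "SUGER", "BISCUIT"],
    ["COFFEE", "SUGER", "CORNFLAKES"],
    ["BREAD", "SUGER", "BOURNVITA"],
    ["BREAD", "COFFEE", "SUGER"],
    ["BREAD", "COFFEE", "SUGER"],
    ["TEA", "MILK", "COFFEE", "CORNFLAKES"],
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


print(support({"BREAD"}, data))          # 0.65, matching the (BREAD) row
print(support({"MILK", "BREAD"}, data))  # 0.2
```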


Let's find the rules of interest.

In [31]:
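# num_itemsets is documented as the number of transactions in the original data, so pass the row count of df_encoded.
rules = association_rules(frequent_itemsets, num_itemsets=len(df_encoded), metric="support", min_threshold=0.05)
# Alternative filter: metric="confidence", min_threshold=0.6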
rules.sort_values(by="support")



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
167,(BREAD),"(CORNFLAKES, MILK)",0.65,0.10,0.05,0.076923,0.769231,1.0,-0.0150,0.975000,-0.461538,0.071429,-0.025641,0.288462
225,(TEA),"(MILK, COFFEE)",0.35,0.05,0.05,0.142857,2.857143,1.0,0.0325,1.108333,1.000000,0.142857,0.097744,0.571429
224,"(MILK, COFFEE)",(TEA),0.05,0.35,0.05,1.000000,2.857143,1.0,0.0325,inf,0.684211,0.142857,1.000000,0.571429
223,"(TEA, COFFEE)",(MILK),0.05,0.25,0.05,1.000000,4.000000,1.0,0.0375,inf,0.789474,0.200000,1.000000,0.600000
222,"(TEA, MILK)",(COFFEE),0.05,0.40,0.05,1.000000,2.500000,1.0,0.0300,inf,0.631579,0.125000,1.000000,0.562500
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
47,(COFFEE),(SUGER),0.40,0.30,0.20,0.500000,1.666667,1.0,0.0800,1.400000,0.666667,0.400000,0.285714,0.583333
69,(MAGGI),(TEA),0.25,0.35,0.20,0.800000,2.285714,1.0,0.1125,3.250000,0.750000,0.500000,0.692308,0.685714
68,(TEA),(MAGGI),0.35,0.25,0.20,0.571429,2.285714,1.0,0.1125,1.750000,0.865385,0.500000,0.428571,0.685714
34,(MILK),(BREAD),0.25,0.65,0.20,0.800000,1.230769,1.0,0.0375,1.750000,0.250000,0.285714,0.428571,0.553846
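
Several of the columns above follow directly from the three support values of a rule. As a sanity check, the sketch below recomputes confidence, lift, leverage, and conviction for the rule (MILK) → (BREAD) using the standard textbook definitions and the supports shown in its row (0.25, 0.65, 0.20). The function name `rule_metrics` is hypothetical, and the remaining columns are not reproduced here.

```python
# Illustrative sketch only: standard rule metrics from three support values.
def rule_metrics(support_a, support_c, support_ac):
    """support_a: antecedent, support_c: consequent, support_ac: both together."""
    confidence = support_ac / support_a           # P(C | A)
    lift = confidence / support_c                 # > 1 means A and C co-occur more than chance
    leverage = support_ac - support_a * support_c # difference from independence
    # Conviction is infinite when the rule never fails (confidence == 1).
    conviction = float("inf") if confidence == 1 else (1 - support_c) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


metrics = rule_metrics(support_a=0.25, support_c=0.65, support_ac=0.20)
for name, value in metrics.items():
    print(name, round(value, 6))
```

A lift only slightly above 1, as for this rule, suggests that buying milk barely changes the chance of buying bread: bread is popular with almost everyone.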


Let's filter the rules further.

In [32]:
rules_filtered = rules[(rules['confidence'] > 0.5) & (rules['lift'] >= 1)]
rules_filtered.sort_values(by="confidence")



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
68,(TEA),(MAGGI),0.35,0.25,0.20,0.571429,2.285714,1.0,0.1125,1.75,0.865385,0.500000,0.428571,0.685714
5,(COKE),(BISCUIT),0.15,0.35,0.10,0.666667,1.904762,1.0,0.0475,1.95,0.558824,0.250000,0.487179,0.476190
292,(COKE),"(BISCUIT, CORNFLAKES, COFFEE)",0.15,0.10,0.10,0.666667,6.666667,1.0,0.0850,2.70,1.000000,0.666667,0.629630,0.833333
289,"(COKE, COFFEE)","(BISCUIT, CORNFLAKES)",0.15,0.15,0.10,0.666667,4.444444,1.0,0.0775,2.55,0.911765,0.500000,0.607843,0.666667
284,"(BISCUIT, CORNFLAKES)","(COKE, COFFEE)",0.15,0.15,0.10,0.666667,4.444444,1.0,0.0775,2.55,0.911765,0.500000,0.607843,0.666667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
223,"(TEA, COFFEE)",(MILK),0.05,0.25,0.05,1.000000,4.000000,1.0,0.0375,inf,0.789474,0.200000,1.000000,0.600000
224,"(MILK, COFFEE)",(TEA),0.05,0.35,0.05,1.000000,2.857143,1.0,0.0325,inf,0.684211,0.142857,1.000000,0.571429
102,"(BISCUIT, COKE)",(COFFEE),0.10,0.40,0.10,1.000000,2.500000,1.0,0.0600,inf,0.666667,0.250000,1.000000,0.625000
241,"(JAM, MILK)",(MAGGI),0.05,0.25,0.05,1.000000,4.000000,1.0,0.0375,inf,0.789474,0.200000,1.000000,0.600000


## Let's do some analysis through visualizations.

**Visually identifying the rules that apply most commonly to my customers.**

First, some necessary string manipulation to beautify the visualization labels.

In [33]:
# Copy first so the assignments below do not trigger SettingWithCopyWarning; sort items so labels are stable.
rules_filtered = rules_filtered.copy()
rules_filtered['antecedents'] = rules_filtered['antecedents'].apply(lambda a: ','.join(sorted(a)))
rules_filtered['consequents'] = rules_filtered['consequents'].apply(lambda a: ','.join(sorted(a)))


Then, I want to see the support for each antecedent-consequent pair.

In [43]:
support_table = rules_filtered.pivot(index='antecedents', columns='consequents', values='support')

import plotly.express as px
fig = px.imshow(support_table, x=support_table.columns, y=support_table.index)
fig.update_layout(width=1500,height=1500)
fig.show()





## Questions worth investigating


*   Get some stats on the dataset, such as the average number of items per transaction and the 10 most frequent transactions.
*   Filter the rules down to those that are useful for promoting items to customers once they have 3 items in their cart.
*   If you have an excess of tea that is expiring soon in your stock and you want to sell it out quickly, which customer base would you promote it to?
*   How does the running time of the algorithm change as we change the number of transactions and/or our filtering criteria (e.g., `min_support`)? One could use the `%%timeit` cell magic to measure it.
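
As a possible starting point for the first question, the sketch below computes the average basket size and the most frequent transactions with the standard library. The helper names are hypothetical, and the five sample baskets are just the first rows of the cleaned data; in the notebook you would pass the full `data` list instead.

```python
# Illustrative sketch only: basic basket statistics.
from collections import Counter


def average_basket_size(transactions):
    """Mean number of items per transaction."""
    return sum(len(t) for t in transactions) / len(transactions)


def most_frequent_transactions(transactions, n=10):
    """Top-n baskets, treating each basket as an unordered set of items."""
    counts = Counter(frozenset(t) for t in transactions)
    return counts.most_common(n)


sample = [
    ["MILK", "BREAD", "BISCUIT"],
    ["BREAD", "MILK", "BISCUIT", "CORNFLAKES"],
    ["BREAD", "TEA", "BOURNVITA"],
    ["JAM", "MAGGI", "BREAD", "MILK"],
    ["BREAD", "TEA", "BOURNVITA"],
]

avg = average_basket_size(sample)
top = most_frequent_transactions(sample, n=3)
print(avg)
for basket, count in top:
    print(sorted(basket), count)
```

Using `frozenset` means that two baskets with the same items in a different order count as the same transaction.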

