# Example - Market basket analysis at the grocery outlet

### Introduction

**Market basket analysis** tells us which products tend to be purchased together and which are most amenable to promotion. This information is actionable: it can suggest new store layouts, determine which articles to put on special, indicate when to issue coupons, and so on. When these data can be tied to individual customers through a loyalty card or website registration, they become even more valuable. The application of **association rules** to market basket analysis is a classic of data mining. 

In this example, which I use to illustrate the extraction of association rules from **transaction data**, a Chicago-based marketing analyst focusing on the retail industry explores different approaches for modeling consumer behavior using data on **point-of-sale transactions** in small stores of the Chicago metropolitan area. She starts with a market basket analysis of data from a typical local grocery outlet, where she intends to identify **joint occurrence** of products in shopping baskets.

### The data set

The file `groceries.csv` covers one month of point-of-sale data. It contains 9,835 transactions, with the items aggregated into 169 categories. The data come as a **transaction/item matrix**: an entry equal to 1 at the intersection of row `i` and column `j` indicates that transaction `i` includes item `j`.
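
To make this layout concrete, here is a minimal sketch that builds such a matrix from a list of baskets. The three baskets and the name `toy` are made up for illustration; they are not taken from `groceries.csv`.

```python
import pandas as pd

# Three hypothetical baskets, each one a list of item names
baskets = [
    ['whole_milk', 'yogurt'],
    ['soda'],
    ['whole_milk', 'rolls_buns', 'soda'],
]

# One column per distinct item, in alphabetical order
items = sorted({item for basket in baskets for item in basket})

# Entry (i, j) is 1 if basket i includes item j, and 0 otherwise
toy = pd.DataFrame(
    [[int(item in basket) for item in items] for basket in baskets],
    columns=items,
)
print(toy)
```

Each row is a transaction and each column an item, which is the same shape of data that is loaded below.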

I start by loading the data in the usual way.

In [1]:
import pandas as pd

In [2]:
url1 = 'https://raw.githubusercontent.com/cinnData/DataSci/main/'
url2 = '11.%20Association%20rules/groceries.csv'
url = url1 + url2
df = pd.read_csv(url)

Next, I check the size of the file:

In [3]:
df.shape

(9835, 169)

Now, I display the head of the first ten columns:

In [4]:
df[df.columns[:10]].head()

Unnamed: 0,frankfurter,sausage,liver_loaf,ham,meat,finished_products,organic_sausage,chicken,turkey,pork
0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0


These data look very **sparse**. Let me explore this. The total number of entries is:

In [5]:
df.count().sum()

1662115

The total number of nonzero entries is:

In [6]:
df.sum().sum()

43367

So, there are 43,367 nonzero entries out of the 1,662,115 entries of this matrix (2.6%). This is an inefficient way of storing the data, used in this example only to keep things simple. You'll never manage such files in business.
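
The same arithmetic gives a second reading of the sparsity: the average number of item categories per transaction. The sketch below only reuses the figures reported above; the variable names are mine.

```python
# Figures reported above for the groceries data
n_transactions, n_items = 9835, 169
n_nonzero = 43367

# Total number of cells in the transaction/item matrix
n_entries = n_transactions * n_items

# Share of cells equal to 1
density = n_nonzero / n_entries

# Average number of distinct item categories per basket
avg_basket_size = n_nonzero / n_transactions

print(n_entries, round(density, 3), round(avg_basket_size, 2))
```

A typical basket therefore holds between four and five of the 169 categories, which is why almost all entries are zero.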

### The package mlxtend

The tools used in the analysis of this example are taken from the Python package `mlxtend`, which contains miscellaneous tools for various jobs. I use here two functions from the subpackage `frequent_patterns`.

Association rule mining comes in `mlxtend` in two steps: (a) extracting the most frequent itemsets, and (b) selecting association rules by support and confidence. These two steps are not always separated in data science software as they are in this package.

### Mining itemsets

To capture the frequent itemsets, we import the function `apriori`.

In [7]:
from mlxtend.frequent_patterns import apriori

What counts as a frequent itemset depends on the particular data set. I use the **support**, which is the proportion of transactions that contain an itemset, to evaluate how frequent itemsets are. In this example, after some exploration (not reported), I set the **minimum support** to 0.01, which leaves enough itemsets to examine.

In [8]:
freq_itemsets = apriori(df, min_support=0.01, use_colnames=True)

`apriori` returns a data frame with two columns, the support and the itemset: 

In [9]:
freq_itemsets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333 entries, 0 to 332
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   support   333 non-null    float64
 1   itemsets  333 non-null    object 
dtypes: float64(1), object(1)
memory usage: 5.3+ KB
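
Support is easy to check by hand. The following sketch computes it on four made-up baskets (the data and the helper `support` are hypothetical, not part of `mlxtend`). It also illustrates why adding items to an itemset can never increase its support.

```python
# Four hypothetical baskets, each one a set of item names
transactions = [
    {'whole_milk', 'yogurt'},
    {'soda'},
    {'whole_milk', 'rolls_buns', 'soda'},
    {'whole_milk', 'yogurt', 'curd'},
]

def support(itemset, transactions):
    """Proportion of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    # `itemset <= t` is True when itemset is a subset of transaction t
    hits = sum(itemset <= t for t in transactions)
    return hits / len(transactions)

s_milk = support({'whole_milk'}, transactions)
s_milk_yogurt = support({'whole_milk', 'yogurt'}, transactions)
s_soda_yogurt = support({'soda', 'yogurt'}, transactions)

print(s_milk, s_milk_yogurt, s_soda_yogurt)
```

Since every basket containing both whole milk and yogurt also contains whole milk, the support of the pair cannot exceed the support of either item alone.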


The itemsets with highest and lowest support in this selection are:

In [10]:
freq_itemsets.sort_values('support', ascending=False).head()

Unnamed: 0,support,itemsets
18,0.255516,(whole_milk)
16,0.193493,(other_vegetables)
40,0.183935,(rolls_buns)
60,0.174377,(soda)
23,0.139502,(yogurt)


In [11]:
freq_itemsets.sort_values('support', ascending=False).tail()

Unnamed: 0,support,itemsets
151,0.010066,"(napkins, tropical_fruit)"
219,0.010066,"(whole_milk, hard_cheese)"
107,0.010066,"(sausage, fruit_vegetable_juice)"
329,0.010066,"(curd, yogurt, whole_milk)"
248,0.010066,"(rolls_buns, curd)"


This quick examination suggests that lowering the support threshold to this level was not a mistake. The entries of the column `itemsets` are frozen sets. A **frozen set** is an immutable version of a Python set: while elements can be added to or removed from a set at any time, a frozen set stays the same after creation.

In [12]:
freq_itemsets.itemsets[0]

frozenset({'frankfurter'})
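
A short sketch of what immutability means in practice, using made-up itemsets: a frozen set cannot be changed after creation, but in exchange it is hashable, so it can be used as a dictionary key or stored in another set.

```python
pair = frozenset({'curd', 'yogurt'})

# Order does not matter: frozen sets with the same elements are equal
same_pair = frozenset(['yogurt', 'curd'])
print(pair == same_pair)

# Frozen sets are hashable, so they can serve as dictionary keys
counts = {pair: 99}
print(counts[same_pair])

# Frozen sets have no add or remove methods, so this raises AttributeError
try:
    pair.add('whole_milk')
    changed = True
except AttributeError:
    changed = False
print(changed)

# Set operations still work, but they return a new frozen set
triple = pair | {'whole_milk'}
print(len(pair), len(triple))
```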

I add the length of the itemsets, which allows me to filter itemsets by length and get a clearer picture. `apply` is used here to apply the function `len`, which returns the number of elements of a set, to every entry of the column `itemsets`.

In [13]:
freq_itemsets['length'] = freq_itemsets['itemsets'].apply(len)

You can get the top-selling items by picking the itemsets of size 1 and sorting them by support (you would get the same with `value_counts`, but, instead of a proportion, you would get a count). We do not need these itemsets for describing the association rules, but they help us to understand the concepts. 

In [14]:
item1 = freq_itemsets[freq_itemsets['length'] == 1]
item1.sort_values('support', ascending=False).head(10)

Unnamed: 0,support,itemsets,length
18,0.255516,(whole_milk),1
16,0.193493,(other_vegetables),1
40,0.183935,(rolls_buns),1
60,0.174377,(soda),1
23,0.139502,(yogurt),1
59,0.110524,(bottled_water),1
13,0.108998,(root_vegetables),1
9,0.104931,(tropical_fruit),1
87,0.098526,(shopping_bags),1
1,0.09395,(sausage),1
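
The supports of single items can also be read directly off a transaction/item matrix, as column means. A minimal sketch on a made-up matrix (the data frame `toy` is hypothetical, not the groceries data):

```python
import pandas as pd

# Hypothetical transaction/item matrix with four baskets and three items
toy = pd.DataFrame({
    'whole_milk': [1, 0, 1, 1],
    'yogurt':     [1, 0, 0, 1],
    'soda':       [0, 1, 1, 0],
})

# The mean of a 0/1 column is the proportion of baskets containing the item
item_support = toy.mean().sort_values(ascending=False)
print(item_support)

# For a pair, both columns must equal 1 in the same row
pair_support = ((toy['whole_milk'] == 1) & (toy['yogurt'] == 1)).mean()
print(pair_support)
```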


Setting the length to 2, you can pick a second part of the collection of frequent itemsets.

In [15]:

item2 = freq_itemsets[freq_itemsets['length'] == 2]
item2.sort_values('support', ascending=False).head(10)

Unnamed: 0,support,itemsets,length
184,0.074835,"(whole_milk, other_vegetables)",2
223,0.056634,"(rolls_buns, whole_milk)",2
216,0.056024,"(yogurt, whole_milk)",2
166,0.048907,"(root_vegetables, whole_milk)",2
165,0.047382,"(root_vegetables, other_vegetables)",2
189,0.043416,"(yogurt, other_vegetables)",2
194,0.042603,"(rolls_buns, other_vegetables)",2
140,0.042298,"(whole_milk, tropical_fruit)",2
232,0.040061,"(whole_milk, soda)",2
273,0.038332,"(rolls_buns, soda)",2


Finally, setting the length to 3, you get a third list.

In [16]:
item3 = freq_itemsets[freq_itemsets['length'] == 3]
item3.sort_values('support', ascending=False).head(10)

Unnamed: 0,support,itemsets,length
313,0.023183,"(root_vegetables, whole_milk, other_vegetables)",3
319,0.022267,"(yogurt, whole_milk, other_vegetables)",3
322,0.017895,"(rolls_buns, whole_milk, other_vegetables)",3
308,0.017082,"(whole_milk, tropical_fruit, other_vegetables)",3
331,0.015557,"(rolls_buns, yogurt, whole_milk)",3
310,0.01515,"(yogurt, whole_milk, tropical_fruit)",3
320,0.014642,"(whipped_sour_cream, whole_milk, other_vegetab...",3
316,0.01454,"(yogurt, root_vegetables, whole_milk)",3
325,0.01393,"(whole_milk, other_vegetables, soda)",3
312,0.013523,"(whole_milk, other_vegetables, pip_fruit)",3


For mining association rules, you can apply the function `association_rules`. Here, I use the **confidence** to select the most relevant rules, setting the threshold to 0.4. You may find in the literature examples with much higher thresholds, but we cannot be so strict in this case.

In [17]:
from mlxtend.frequent_patterns import association_rules
rules = association_rules(freq_itemsets, metric="confidence", min_threshold=0.4)

Finally, I arrange things so my presentation of the rules looks nicer.

In [18]:
rules = rules.sort_values('confidence', ascending=False).head(10)
rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']]

Unnamed: 0,antecedents,consequents,support,confidence,lift
27,"(root_vegetables, citrus_fruit)",(other_vegetables),0.010371,0.586207,3.029608
31,"(root_vegetables, tropical_fruit)",(other_vegetables),0.012303,0.584541,3.020999
59,"(curd, yogurt)",(whole_milk),0.010066,0.582353,2.279125
47,"(butter, other_vegetables)",(whole_milk),0.01149,0.573604,2.244885
32,"(root_vegetables, tropical_fruit)",(whole_milk),0.011998,0.570048,2.230969
44,"(yogurt, root_vegetables)",(whole_milk),0.01454,0.562992,2.203354
52,"(domestic_eggs, other_vegetables)",(whole_milk),0.012303,0.552511,2.162336
60,"(whipped_sour_cream, yogurt)",(whole_milk),0.01088,0.52451,2.052747
45,"(rolls_buns, root_vegetables)",(whole_milk),0.01271,0.523013,2.046888
39,"(other_vegetables, pip_fruit)",(whole_milk),0.013523,0.51751,2.025351


What do these figures mean? Take the first rule in the above table. The confidence of 58.6% is read as *the probability that a transaction containing root vegetables and citrus fruit also contains other vegetables*. The lift of 3.03 is read as: *a transaction containing root vegetables and citrus fruit is about three times as likely to contain other vegetables as a transaction taken at random*.
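
These two measures can be recomputed from the supports. For a rule *A* => *B*, the confidence is the support of *A* and *B* together divided by the support of *A*, and the lift is the confidence divided by the support of *B*. The sketch below checks this on the first rule, using only figures from the tables above; the variable names are mine.

```python
# First rule: (root_vegetables, citrus_fruit) => (other_vegetables)
support_rule = 0.010371        # support of antecedent and consequent together
confidence = 0.586207          # as reported in the rules table
support_consequent = 0.193493  # support of (other_vegetables) alone

# confidence = support_rule / support_antecedent, so the antecedent
# support can be recovered from the two reported figures
support_antecedent = support_rule / confidence

# lift = confidence / support_consequent
lift = confidence / support_consequent

print(round(support_antecedent, 4), round(lift, 2))
```

The result agrees with the lift reported in the table up to rounding of the inputs. A lift of 1 would mean that the antecedent tells us nothing about the consequent.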

### Homework

Rewrite the code for the extraction of the association rules in such a way that you capture the rules with the highest lift.

1. How do you interpret the results?

2. Why do you get the rules in pairs *A* => *B* and  *B* => *A*, with the same lift but different confidence?