# 🛍 **Market Basket Analysis**

**What is Market Basket Analysis?**

Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items. It works by looking for combinations of items that occur together frequently in transactions. To put it another way, it allows retailers to identify relationships between the items that people buy.

**How Does Market Basket Analysis Work?**

To uncovers associations between items, Market Basket Analysis use one rules called **Association Rule**. Association Rules are widely used to analyze retail basket or transaction data and are intended to identify strong rules discovered in transaction data using measures of interestingness, based on the concept of strong rules. In this case will use one Algorithm called **Apriori Algorithm**

### **Read Dataset**
📋 ***Kaggel***: [Groceries Dataset](https://www.kaggle.com/datasets/heeraldedhia/groceries-dataset/data)

In [None]:
#!pip install --upgrade ipykernel

In [21]:
import pandas as pd

from warnings import filterwarnings
filterwarnings("ignore")

In [3]:
df = pd.read_csv('Groceries_dataset.csv')
df.head()

Unnamed: 0,Member_number,Date,itemDescription
0,1808,21-07-2015,tropical fruit
1,2552,05-01-2015,whole milk
2,2300,19-09-2015,pip fruit
3,1187,12-12-2015,other vegetables
4,3037,01-02-2015,whole milk


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38765 entries, 0 to 38764
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Member_number    38765 non-null  int64 
 1   Date             38765 non-null  object
 2   itemDescription  38765 non-null  object
dtypes: int64(1), object(2)
memory usage: 908.7+ KB


The dataset contains information on purchase made at grocery store, include a Member/Customer ID, Transaction date, and itemDescription.

In [5]:
df.isnull().sum()

Member_number      0
Date               0
itemDescription    0
dtype: int64

There are no missing value in the dataset.

In [6]:
df['Member_number'].nunique()

3898

In [7]:
df['itemDescription'].nunique()

167

### **Data Preparation**

To make it easy, we need to convert the data into a format that can easily to use into the **Apriori algorithm**.

We can set `Member_Number` or `Date` as an index for our dataset. But this time we will create one column called `singleTransaction` that combine `Member_Number` and `Date`, and set it as an Index of the Dataset

In [8]:
df['singleTransaction'] = df['Member_number'].astype(str) + '_' + df['Date'].astype(str)

df.head()

Unnamed: 0,Member_number,Date,itemDescription,singleTransaction
0,1808,21-07-2015,tropical fruit,1808_21-07-2015
1,2552,05-01-2015,whole milk,2552_05-01-2015
2,2300,19-09-2015,pip fruit,2300_19-09-2015
3,1187,12-12-2015,other vegetables,1187_12-12-2015
4,3037,01-02-2015,whole milk,3037_01-02-2015


**`singleTransaction`** represent for item purchased in one receipt, it contains member_number and date.

Group our data based on **`singleTransaction`** and **`itemDescription`**, then we will use **`.unstack()`** to make a pivot table to convert the **`itemDescription**` into columns, and transaction to rows. Then, set **`singleTransaction**` as an index

There are 2 ways to make it happen either group it, or use a function from pandas called **`crosstab()`**. ***Use one of them*** because *it will return same format.*

In [9]:
# the First way
basket = (df.groupby(['singleTransaction', 'itemDescription'])['Date'].count()
    .unstack()
    .reset_index()
    .fillna(0)
    .set_index('singleTransaction'))

basket.head(10)

itemDescription,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,bags,baking powder,bathroom cleaner,beef,berries,...,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
singleTransaction,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1000_15-03-2015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
1000_24-06-2014,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1000_24-07-2015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1000_25-11-2015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1000_27-05-2015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1001_02-05-2015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1001_07-02-2014,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1001_12-12-2014,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1001_14-04-2015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1001_20-01-2015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


----- # 2

In [22]:
# the Second way
#basket2 = pd.crosstab(df['singleTransaction'], df['itemDescription'])
#basket2.head()

itemDescription,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,bags,baking powder,bathroom cleaner,beef,berries,...,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
singleTransaction,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1000_15-03-2015,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
1000_24-06-2014,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1000_24-07-2015,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1000_25-11-2015,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1000_27-05-2015,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Normalizing
To knowing what items are in one basket, it is necessary to **normalize** the data. 1 represent that item inside basket and 0 otherwise.

In [11]:
# function to normalize the data
def encode_units(item_frequency):
    res = 0
    if item_frequency > 0:
        res = 1
    return res

In [12]:
basket_encode = basket.applymap(encode_units)

basket_encode

itemDescription,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,bags,baking powder,bathroom cleaner,beef,berries,...,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
singleTransaction,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1000_15-03-2015,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
1000_24-06-2014,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1000_24-07-2015,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1000_25-11-2015,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1000_27-05-2015,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4999_24-01-2015,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
4999_26-12-2015,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5000_09-03-2014,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5000_10-02-2015,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### **Apriori Algorithm**
Apriori is an algorithm for *frequent itemset* mining and *association rule learning* over relational databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database.

In this case we will use Apriori algorithm from **`mlxtend`** python package and use it to discover frequently-bought-together item combinations.

In [13]:
# if there are need to install apriori from mlxtend
#!pip install mlxtend

#or from apyroti
#!pip install apyori
#from apyori import apriori

The parameter inside we will set:
- **`min_support=0.001`**
- association rules **`metrics='lift'`**

In [14]:
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

frequent_itemset = apriori(basket_encode, min_support=0.001, use_colnames=True)
rules = association_rules(frequent_itemset, metric='lift')

rules.head()



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(UHT-milk),(bottled water),0.021386,0.060683,0.001069,0.05,0.823954,-0.000228,0.988755,-0.179204
1,(bottled water),(UHT-milk),0.060683,0.021386,0.001069,0.017621,0.823954,-0.000228,0.996168,-0.185312
2,(UHT-milk),(other vegetables),0.021386,0.122101,0.002139,0.1,0.818993,-0.000473,0.975443,-0.184234
3,(other vegetables),(UHT-milk),0.122101,0.021386,0.002139,0.017515,0.818993,-0.000473,0.99606,-0.201119
4,(UHT-milk),(sausage),0.021386,0.060349,0.001136,0.053125,0.880298,-0.000154,0.992371,-0.121998


There are 10 columns, but we can try to focus on several columns. `antecedents`, `consequents`, `support`, `confidence`, and `lift`.

In [15]:
new_rules = rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']]

new_rules.head()

Unnamed: 0,antecedents,consequents,support,confidence,lift
0,(UHT-milk),(bottled water),0.001069,0.05,0.823954
1,(bottled water),(UHT-milk),0.001069,0.017621,0.823954
2,(UHT-milk),(other vegetables),0.002139,0.1,0.818993
3,(other vegetables),(UHT-milk),0.002139,0.017515,0.818993
4,(UHT-milk),(sausage),0.001136,0.053125,0.880298


In [16]:
# sort values based on lift metrics
new_rules.sort_values('lift', ascending=False).head(10)

Unnamed: 0,antecedents,consequents,support,confidence,lift
732,(sausage),"(yogurt, whole milk)",0.00147,0.024363,2.182917
729,"(yogurt, whole milk)",(sausage),0.00147,0.131737,2.182917
730,"(sausage, whole milk)",(yogurt),0.00147,0.164179,1.91176
731,(yogurt),"(sausage, whole milk)",0.00147,0.017121,1.91176
247,(specialty chocolate),(citrus fruit),0.001403,0.087866,1.653762
246,(citrus fruit),(specialty chocolate),0.001403,0.026415,1.653762
728,"(yogurt, sausage)",(whole milk),0.00147,0.255814,1.619866
733,(whole milk),"(yogurt, sausage)",0.00147,0.00931,1.619866
330,(tropical fruit),(flour),0.001069,0.015779,1.617141
331,(flour),(tropical fruit),0.001069,0.109589,1.617141


If we just focus only on **`lift`** metrics and ignoring the other. We will get the most popular product combinations that are frequently bought together are:
- Sausage and Yoghurt
- Specialty Chocolate and Citrus Fruit
- Whole Milk and Yoghurt
- Tropical Fruit and Flour

But we don't really want that, we will consider the other metrics **`support`** and **`confidence`**

In [17]:
# sort values based on support, confidence, and lift metrics
new_rules.sort_values(['support', 'confidence', 'lift'], ascending=False).head(10)

Unnamed: 0,antecedents,consequents,support,confidence,lift
622,(rolls/buns),(whole milk),0.013968,0.126974,0.804028
623,(whole milk),(rolls/buns),0.013968,0.088447,0.804028
694,(yogurt),(whole milk),0.011161,0.129961,0.82294
695,(whole milk),(yogurt),0.011161,0.070673,0.82294
551,(soda),(other vegetables),0.009691,0.099794,0.817302
550,(other vegetables),(soda),0.009691,0.079365,0.817302
648,(sausage),(whole milk),0.008955,0.148394,0.939663
649,(whole milk),(sausage),0.008955,0.056708,0.939663
625,(yogurt),(rolls/buns),0.007819,0.091051,0.827697
624,(rolls/buns),(yogurt),0.007819,0.071081,0.827697


The resulting table shows that the **five most popular product combinations** that are frequently bought together are:
- Rolls and Milk
- Yoghurt and Milk
- Soda and Vegetables
- Sausage and Milk
- Yoghurt and Rolls

By those result it could be that the grocery store ran a promotion on these items together or displayed them within the same line of sight to improve sales.

### **Validation**

In [18]:
# Creating a new function that returns which items are frequently bought together
def frequently_bought_together(item):

    # Basket of an item
    basket_item = basket_encode.loc[basket[item]==1]

    # Applying apriori algorithm on item df
    frequent_itemsets = apriori(basket_item, min_support=0.001, use_colnames=True)

    # Storing association rules
    rules = association_rules(frequent_itemsets, metric="lift")

    # Sorting on lift and support
    rules.sort_values(['lift', 'support'], ascending=False).reset_index(drop=True)

    print('Items frequently bought together with {0}'.format(item))

    # Returning top 5 items with highest lift and support
    return rules['consequents'].unique()[:5]

In [23]:
# Example 1
frequently_bought_together('beef')

Items frequently bought together with beef


array([frozenset({'Instant food products'}), frozenset({'beef'}),
       frozenset({'abrasive cleaner'}), frozenset({'UHT-milk'}),
       frozenset({'bottled beer'})], dtype=object)

In [24]:
# Example 1
frequently_bought_together('butter')

Items frequently bought together with butter


array([frozenset({'Instant food products'}), frozenset({'bottled beer'}),
       frozenset({'butter'}), frozenset({'cake bar'}),
       frozenset({'whipped/sour cream'})], dtype=object)

All work. **WELL DONE**❗❗