## Case Study
 
In this exercise, we use the Online Retail dataset to learn association rules. Along the way, we find the most frequently ordered products (the frequent itemsets), from which the association rules are generated.

First, we will import all the relevant packages.


In [5]:
# !pip install mlxtend

In [4]:
import numpy as np
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

Next, let us load the Online Retail dataset into a dataframe. This dataset has 8 columns (the CSV file used here also carries an extra `index` column, which is why the shape reported later shows 9 columns):

- InvoiceNo: Unique ID assigned to the transaction
- StockCode: Unique ID assigned to a product
- Description: Description of the product
- Quantity: Count of the product purchased in a transaction
- InvoiceDate: Date of the transaction
- UnitPrice: Price of an individual product
- CustomerID: Unique ID assigned to a customer
- Country: Country of the customer

Using this dataset, the goal is to find association rules that recommend items frequently bought together to a customer.

In [7]:
retail_dataset = pd.read_csv('online_retail.csv', encoding = 'unicode_escape')
retail_dataset.head()

Unnamed: 0,index,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


In [8]:
# Remove any row that has a null value in any column.
retail_dataset = retail_dataset.dropna()
retail_dataset.shape

(406829, 9)

In [9]:
# Build the transaction data by collecting the unique stock codes for each invoice/transaction.
transactions_df = retail_dataset.groupby('InvoiceNo').apply(lambda x: x['StockCode'].unique())
transactions_df

InvoiceNo
536365     [85123A, 71053, 84406B, 84029G, 84029E, 22752,...
536366                                        [22633, 22632]
536367     [84879, 22745, 22748, 22749, 22310, 84969, 226...
536368                          [22960, 22913, 22912, 22914]
536369                                               [21756]
                                 ...                        
C581484                                              [23843]
C581490                                       [22178, 23144]
C581499                                                  [M]
C581568                                              [21258]
C581569                                       [84978, 20979]
Length: 22190, dtype: object

The transaction data needs to be converted into a one-hot encoded format before the Apriori algorithm can process it: one row per transaction and one boolean column per product. We use the TransactionEncoder for this conversion.
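To make the target format concrete, here is a minimal pure-Python sketch of one-hot encoding, applied to three of the invoices shown in the output above. It only illustrates the idea (sorted item columns, one boolean row per transaction) and is not the TransactionEncoder implementation itself; the variable names are ours.

```python
# Toy transactions copied from three invoices in the output above.
transactions = [
    ["22633", "22632"],
    ["22960", "22913", "22912", "22914"],
    ["21756"],
]

# One column per distinct item, in sorted order.
columns = sorted({item for basket in transactions for item in basket})

# One row per transaction: True where the item appears in the basket.
one_hot = [[item in basket for item in columns] for basket in transactions]

print(columns)
for row in one_hot:
    print(row)
```

Each row has as many True values as the basket has distinct items, and every other cell is False.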

In [10]:
## Instantiate Transaction Encoder
te = TransactionEncoder()

## Fit and transform the Transaction Encoder 
transactions_one_hot = te.fit(transactions_df).transform(transactions_df)

## Save the transformed (dense) data to a dataframe
transactions_one_hot_df = pd.DataFrame(transactions_one_hot, columns=te.columns_)
transactions_one_hot_df

Unnamed: 0,10002,10080,10120,10123C,10124A,10124G,10125,10133,10135,11001,...,90214Y,90214Z,BANK CHARGES,C2,CRUK,D,DOT,M,PADS,POST
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22185,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
22186,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
22187,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
22188,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


Most e-commerce sites carry a large number of products, and each customer buys only a small subset of them, so almost every cell of the one-hot matrix is False. Storing the matrix in a sparse one-hot format therefore saves storage space while representing the same dataset.
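A rough back-of-the-envelope check shows how sparse this matrix is. The sketch below uses only figures reported in this notebook's outputs (406,829 rows after `dropna`, 22,190 transactions, 3,684 product columns) and assumes each dataset row contributes at most one True cell, which gives an upper bound on the density rather than the exact value.

```python
# Figures taken from the outputs in this notebook.
n_rows_after_dropna = 406829   # rows left after dropna()
n_transactions = 22190         # invoices (rows of the one-hot matrix)
n_products = 3684              # stock codes (columns of the one-hot matrix)

# Number of cells a dense boolean matrix would have to store.
dense_cells = n_transactions * n_products

# Each dataset row can set at most one cell to True, so this is an upper bound.
max_true_cells = n_rows_after_dropna
max_density = max_true_cells / dense_cells

print(f"dense cells: {dense_cells:,}")
print(f"density is at most {max_density:.2%}")
```

Under these assumptions, fewer than 1 in 200 cells can be True, which is why a sparse representation pays off.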

In [11]:
## Fit and transform the Transaction Encoder with sparse as true
transactions_one_hot_sparse = te.fit(transactions_df).transform(transactions_df, sparse=True)

## Save the transformed sparse data to a dataframe
transactions_one_hot_sparse_df = pd.DataFrame.sparse.from_spmatrix(transactions_one_hot_sparse, columns=te.columns_)
transactions_one_hot_sparse_df.dtypes

10002     Sparse[bool, 0]
10080     Sparse[bool, 0]
10120     Sparse[bool, 0]
10123C    Sparse[bool, 0]
10124A    Sparse[bool, 0]
               ...       
D         Sparse[bool, 0]
DOT       Sparse[bool, 0]
M         Sparse[bool, 0]
PADS      Sparse[bool, 0]
POST      Sparse[bool, 0]
Length: 3684, dtype: object

In [12]:
# Now, let us use the Apriori algorithm to find the frequent itemsets with the minimum support of 2%.
frequent_itemsets = apriori(transactions_one_hot_sparse_df, min_support=0.02, use_colnames=True)
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.026273,(20685)
1,0.022578,(20712)
2,0.024290,(20719)
3,0.022713,(20723)
4,0.034024,(20724)
...,...,...
177,0.024155,"(22726, 22727)"
178,0.020009,"(23209, 23203)"
179,0.021316,"(23203, 85099B)"
180,0.021000,"(23300, 23301)"
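The support of an itemset is the fraction of transactions that contain every item in it. The sketch below computes support by brute-force enumeration on a hypothetical toy basket list (the item names and the 0.6 threshold are made up for illustration). It is not the Apriori algorithm, which reaches the same result while pruning candidates, but it shows what `min_support` filters on.

```python
from itertools import combinations

# Hypothetical toy baskets (not taken from the Online Retail data).
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread"},
    {"bread", "butter", "milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

min_support = 0.6
items = sorted(set().union(*baskets))

# Enumerate all itemsets of size 1 and 2 and keep the frequent ones.
frequent = {}
for size in (1, 2):
    for combo in combinations(items, size):
        s = support(set(combo), baskets)
        if s >= min_support:
            frequent[combo] = s

for combo, s in frequent.items():
    print(combo, s)
```

Here the pair `("bread", "butter")` appears in only 2 of 5 baskets, so it falls below the threshold and is dropped.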


In [13]:
# Also, calculate the number of products in each frequent itemset generated in the last step.
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
frequent_itemsets.sort_values("length")

Unnamed: 0,support,itemsets,length
0,0.026273,(20685),1
105,0.021226,(22952),1
106,0.043488,(22960),1
107,0.039658,(22961),1
108,0.025056,(22966),1
...,...,...,...
164,0.021181,"(22382, 20725)",2
163,0.021000,"(20728, 20725)",2
180,0.021000,"(23300, 23301)",2
170,0.020820,"(85123A, 21733)",2


In [14]:
# Check the product description of an item in one of the most frequent itemsets.
retail_dataset.query("StockCode in ['15056N']")[['StockCode', 'Description']].drop_duplicates()

Unnamed: 0,StockCode,Description
133,15056N,EDWARDIAN PARASOL NATURAL


In [15]:
# Finally, generate the association rules with a minimum confidence of 60%.
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.60)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(21733),(85123A),0.031185,0.091032,0.02082,0.66763,7.334015,0.017981,2.734808,0.891449
1,(22910),(22086),0.031951,0.044615,0.02046,0.640339,14.352638,0.019034,2.656346,0.961033
2,(22386),(85099B),0.039838,0.074042,0.024966,0.626697,8.464031,0.022017,2.480444,0.918442
3,(22698),(22697),0.026634,0.033033,0.021226,0.796954,24.126079,0.020346,4.762313,0.984779
4,(22697),(22698),0.033033,0.026634,0.021226,0.642565,24.126079,0.020346,2.723197,0.991296
5,(22697),(22699),0.033033,0.037675,0.025101,0.759891,20.16983,0.023857,4.007866,0.982889
6,(22699),(22697),0.037675,0.033033,0.025101,0.666268,20.16983,0.023857,2.897435,0.98763
7,(22698),(22699),0.026634,0.037675,0.020324,0.763113,20.255366,0.019321,4.062388,0.976642
8,(22726),(22727),0.036458,0.040874,0.024155,0.662546,16.209376,0.022665,2.842244,0.97381
9,(23300),(23301),0.028932,0.034565,0.021,0.725857,20.999687,0.02,3.521643,0.980755
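The metric columns can be reproduced by hand from the three support values. The sketch below recomputes confidence, lift, leverage, and conviction for rule 3, (22698) -> (22697), using the standard definitions and the rounded values printed in the table, so the results agree with the table only up to rounding.

```python
# Values copied from rule 3 in the table above: (22698) -> (22697).
antecedent_support = 0.026634   # share of transactions containing 22698
consequent_support = 0.033033   # share of transactions containing 22697
support = 0.021226              # share of transactions containing both

# confidence: how often the consequent appears when the antecedent does
confidence = support / antecedent_support

# lift: confidence relative to the consequent's baseline frequency
lift = confidence / consequent_support

# leverage: observed joint support minus the support expected under independence
leverage = support - antecedent_support * consequent_support

# conviction: how much more often the rule would fail under independence
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence={confidence:.4f} lift={lift:.2f} "
      f"leverage={leverage:.4f} conviction={conviction:.2f}")
```

A lift well above 1, as here, means the two products are bought together far more often than they would be if purchases were independent.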


In [16]:
# Sort the rules from the previous step by lift in descending order and fetch the top 3 association rules.
rules.sort_values("lift", ascending=False).head(3)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
3,(22698),(22697),0.026634,0.033033,0.021226,0.796954,24.126079,0.020346,4.762313,0.984779
4,(22697),(22698),0.033033,0.026634,0.021226,0.642565,24.126079,0.020346,2.723197,0.991296
9,(23300),(23301),0.028932,0.034565,0.021,0.725857,20.999687,0.02,3.521643,0.980755


The top 3 rules extracted in the last step can be used to drive product recommendations: when a customer's basket contains a rule's antecedent, we recommend its consequent.
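As a minimal sketch of how such rules could drive recommendations, the snippet below hard-codes the three antecedent/consequent pairs from the table above and suggests the consequents of every rule whose antecedent is fully contained in a basket. The `recommend` helper and the example basket are hypothetical and only meant as an illustration.

```python
# Top 3 rules by lift, copied from the table above as (antecedent, consequent).
top_rules = [
    (frozenset({"22698"}), frozenset({"22697"})),
    (frozenset({"22697"}), frozenset({"22698"})),
    (frozenset({"23300"}), frozenset({"23301"})),
]

def recommend(basket, rules):
    """Return items suggested by rules whose antecedent is fully in the basket."""
    basket = set(basket)
    suggestions = set()
    for antecedent, consequent in rules:
        if antecedent <= basket:
            # Do not recommend items the customer already has.
            suggestions |= consequent - basket
    return sorted(suggestions)

# Hypothetical basket containing one antecedent item and one unrelated item.
print(recommend(["22698", "21756"], top_rules))
```

In practice, one would likely rank the suggestions by confidence or lift rather than return them alphabetically.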

In this exercise, we learned how to find frequently bought itemsets using the Apriori algorithm and how to derive the strongest association rules from them. You can adjust the support, confidence, and lift thresholds to extract the rules that best fit the business requirements.