# Market Basket Analysis

This notebook performs a market basket analysis on the 梦洁家纺 (Mengjie Home Textiles) sales dataset, based on the **Apriori algorithm**.  

## Apriori Algorithm Basics and Association Rules
  
The Apriori algorithm is a data mining algorithm used for mining **frequent itemsets** and **relevant association rules**. It is designed to operate on a database of transactions, such as the items bought by a customer in a store. 

An itemset is considered ***frequent*** if it meets a user-specified support threshold. For example, if the support threshold is set to 0.5 (50%), a frequent itemset is a set of items that are purchased together in at least 50% of all transactions. 

***Association rules*** are a set of rules derived from a database that help determine relationships among variables in a large transactional database. 

For example, let I = {i(1), i(2), ..., i(m)} be a set of m attributes called items, and T = {t(1), t(2), ..., t(n)} be the set of transactions. Every transaction t(i) in T has a unique transaction ID and contains a subset of the items in I.

Association rules are usually written as **i(j) -> i(k)**. This means that there is a strong relationship between the purchase of item i(j) and item i(k): transactions that contain i(j) tend to also contain i(k). 
  
In the above example, i(j) is the **antecedent** and i(k) is the **consequent**. 

Please note that both antecedents and consequents can have multiple items. For example, {Diaper,Gum} -> {Beer, Chips} is also valid. 

Since multiple rules are possible even from a very small database, we select the most relevant ones by placing constraints on various measures of interest. The most important measures are discussed below. They are:

**1. Support:** The support of an itemset X, *supp(X)*, is the proportion of transactions in the database in which the itemset X appears. It signifies the popularity of an itemset.

supp(X) = (Number of transactions in which X appears)/(Total number of transactions)
  
Itemsets whose support meets a chosen threshold are identified as **significant itemsets**.  
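
A minimal sketch of the support computation on a handful of made-up transactions. The item names and baskets below are illustrative assumptions, not taken from the sales data:

```python
# Hypothetical toy transactions, one set of items per basket.
transactions = [
    {"Diaper", "Beer"},
    {"Diaper", "Gum"},
    {"Beer", "Chips"},
    {"Diaper", "Beer", "Chips"},
]

def support(itemset):
    # Fraction of transactions that contain every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

print(support({"Diaper"}))          # 0.75
print(support({"Diaper", "Beer"}))  # 0.5
```

With a support threshold of 0.5, both {Diaper} and {Diaper, Beer} would count as significant in this toy example.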

**2. Confidence:** The confidence of a rule signifies the likelihood of itemset Y being purchased when itemset X is purchased. 

Thus, **conf(X -> Y) = supp(X *U* Y) / supp( X )** 

If conf(X -> Y) is 75%, then 75% of the transactions that contain X also contain Y. Confidence can be read as the conditional probability P(Y|X): the probability of finding itemset Y in a transaction, given that the transaction already contains itemset X.
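
As a hedged illustration, the following sketch computes confidence on ten made-up transactions chosen so that conf({Diaper} -> {Beer}) comes out at the 75% used in the text. The baskets are assumptions for demonstration only:

```python
# Hypothetical toy transactions: 4 contain Diaper, 3 of those also contain Beer.
transactions = [
    {"Diaper", "Beer"}, {"Diaper", "Beer", "Chips"}, {"Diaper", "Beer", "Gum"},
    {"Diaper", "Gum"}, {"Beer", "Chips"}, {"Beer"},
    {"Chips"}, {"Gum"}, {"Chips", "Gum"}, {"Chips"},
]

def support(itemset):
    # Fraction of transactions that contain every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # conf(X -> Y) = supp(X U Y) / supp(X)
    return support(antecedent | consequent) / support(antecedent)

print(round(confidence({"Diaper"}, {"Beer"}), 2))  # 0.75
```

Note that confidence is not symmetric: conf({Beer} -> {Diaper}) divides by supp({Beer}) instead and gives a different value here.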
  
  
**3. Lift:** Lift measures the likelihood of itemset Y being purchased when itemset X is purchased, while taking into account the popularity of Y. 
  
Thus, **lift (X -> Y) = supp (X *U* Y)/( supp(X) * supp (Y) )**

If the lift is greater than 1, itemset Y is more likely to be bought when itemset X is bought, while a lift less than 1 means Y is less likely to be bought when X is bought. A lift of exactly 1 means X and Y appear together as often as would be expected if they were independent. 
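
The three cases can be checked with a small sketch. The support values passed in are hypothetical numbers picked for illustration:

```python
def lift(supp_xy, supp_x, supp_y):
    # lift(X -> Y) = supp(X U Y) / (supp(X) * supp(Y))
    return supp_xy / (supp_x * supp_y)

# Assumed supports: supp(X) = 0.4 and supp(Y) = 0.5 in all three cases.
print(round(lift(0.3, 0.4, 0.5), 2))  # 1.5 -> X and Y co-occur more often than chance
print(round(lift(0.2, 0.4, 0.5), 2))  # 1.0 -> consistent with independence
print(round(lift(0.1, 0.4, 0.5), 2))  # 0.5 -> X and Y co-occur less often than chance
```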

**4. Conviction:** The conviction of a rule is defined as:

*conv (X->Y) = (1-supp(Y))/(1-conf(X->Y))*

If the conviction is 1.4, it means that the rule X -> Y would be incorrect 40% more often if the association between X and Y were purely due to chance.
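
A short sketch of the formula, using assumed values supp(Y) = 0.3 and conf(X -> Y) = 0.5 that happen to give a conviction of 1.4. The handling of a confidence of exactly 1 (returning infinity) is a convention chosen here to avoid division by zero:

```python
import math

def conviction(supp_y, conf_xy):
    # conv(X -> Y) = (1 - supp(Y)) / (1 - conf(X -> Y))
    if conf_xy == 1:
        # The rule is never wrong, so the ratio is unbounded.
        return math.inf
    return (1 - supp_y) / (1 - conf_xy)

print(round(conviction(0.3, 0.5), 2))  # 1.4
```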

### Steps in Apriori Algorithm

The steps in implementing Apriori Algorithm are:
  
1. Create a frequency table of all items that occur in all transactions.
  
2. Select only the significant items, i.e. those whose support meets the threshold (for example, 50%).
  
3. Create all possible pairs of the significant items (remember that AB is the same as BA).
  
4. Select only the pairs that are significant (support meets the threshold).

5. Create triplets using another rule, called self-join: from the item pairs AB, AC, BC, BD, look for pairs with an identical first letter. So from AB and AC we get ABC, and from BC and BD we get BCD.
  
6. Find the frequency of the new triplets, and keep only those whose support (for ABC or BCD) meets the threshold.  
  
7. If two or more significant triplets remain, combine them into groups of four, apply the threshold again, and continue.
  
8. Continue until no newly formed group meets the support threshold. 
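
The steps above can be sketched in plain Python. This is a simplified illustration, not the mlxtend implementation used later: the function name `apriori_sketch` and the toy transactions are assumptions, and the join merges any two frequent itemsets that differ by one item and then prunes candidates with an infrequent subset, rather than matching on the first letter.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Steps 1-2: count single items and keep the significant ones.
    current = {}
    for item in sorted({i for t in transactions for i in t}):
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = {}
    k = 1
    while current:  # Step 8: stop when no itemset of the current size survives.
        frequent.update(current)
        k += 1
        # Steps 3, 5, 7: join frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        # Prune candidates that contain an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current for sub in combinations(c, k - 1))}
        # Steps 4, 6: keep candidates whose support meets the threshold.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
    return frequent

toy = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"}, {"B", "D"}]
result = apriori_sketch(toy, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

In this toy run, D is dropped at the first level (support 0.25), so no pair or triplet containing D is ever counted. That pruning is what keeps the search manageable.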

### Pros of the Apriori algorithm

1. Easy to understand and implement
2. Can be used on large itemsets

### Cons of the Apriori algorithm

1. Can get computationally expensive if the number of candidate itemsets is large
2. Calculating support is also expensive, since it requires a pass through the whole database
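
To give a rough sense of the first point, this sketch counts how many itemsets could exist for a given number of distinct items, and how many pair candidates arise from a given number of frequent single items. The helper names are hypothetical:

```python
from math import comb

def possible_itemsets(n_items):
    # Every non-empty subset of the items is a potential itemset.
    return 2 ** n_items - 1

def candidate_pairs(n_frequent_items):
    # Number of 2-item candidates built from the frequent single items.
    return comb(n_frequent_items, 2)

for n in (10, 20, 30):
    print(n, possible_itemsets(n), candidate_pairs(n))
```

Each surviving candidate then needs its support counted against the transactions, which is why a low support threshold on a wide product catalogue can become slow.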

## Code
  
As a quick note, this analysis requires all items of a transaction to be in a single row, with the items one-hot encoded. Since scikit-learn has no direct way to do this, we use the **MLxtend** library here. 

In [1]:
## Code to install any package via python

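def install_and_import(package):
    import importlib
    import subprocess
    import sys
    try:
        importlib.import_module(package)
    except ImportError:
        # pip.main() was removed in pip 10, so call pip through the current interpreter
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', package])
    finally:
        # Expose the module under its own name in the notebook's global namespace
        globals()[package] = importlib.import_module(package)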

install_and_import('mlxtend')

Importing the required libraries:

In [2]:
import pandas as pd
import numpy as np
import datetime
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

Importing the data:

In [3]:
df = pd.read_excel('data/MongjieTransactions.xlsx',dtype={'orderid':str,'storeid':str,'productid':str}) #http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx
df.head()

Unnamed: 0,orderid,billdate,storeid,productid,productname,amount,price,mny,customerid
0,{00007C84-6128-4A9B-930D-A43EA2D0F8BF},2011-08-06 16:35:00.000,1071001,22270,绣花提花八件套:我们结婚了 180*210,1,3627.0,3627.0,1361***6488
1,{0000BEBD-F957-4121-B349-59AC388FA770},2013-03-16 14:39:32.000,6976103,1010181630,纯棉印花四件套:泊美 180*210,1,388.0,388.0,1860***5566
2,{0000C880-5719-46CB-A1CB-17845204CD6C},2012-03-14 14:31:36.000,53403,51181,新思力侧睡枕 1000g 50*70,21,246.0,5166.0,1314***6888
3,{0000D7C6-4E18-412B-9E67-54E8413E4BC1},2013-03-15 20:57:21.000,7320002,1040171267,纯棉印花四件套 150*200,1,599.0,599.0,1369***0803
4,{00010842-573F-4F90-AB6A-22E7A42DFDEA},2012-06-01 12:16:54.000,26601,50005,羽丝绒枕 50*70,2,295.64,591.28,1387***3994


First, clean the data. This includes:
  
1. Stripping leading and trailing spaces in the `productname` column
2. Dropping rows that don't contain an order ID (`orderid`)

In [7]:
df['productname'] = df['productname'].str.strip()
df.dropna(axis = 0, subset=['orderid'], inplace = True)
df['orderid'] = df['orderid'].astype('str')

Before proceeding, let us look at how the data is distributed by store:

In [8]:
df.groupby('storeid').count().reset_index().sort_values('orderid', ascending = False).head()

Unnamed: 0,storeid,orderid,billdate,productid,productname,amount,price,mny,customerid
325,54201,17416,17416,17416,17416,17416,17416,17416,17416
241,26601,8899,8899,8899,8899,8899,8899,8899,8899
248,50101,7815,7815,7815,7815,7815,7815,7815,7815
305,52701,7641,7641,7641,7641,7641,7641,7641,7641
386,6330003,7460,7460,7460,7460,7460,7460,7460,7460


Thus, we see that store 054201 (shown as 54201 in the table above) has the most transaction records, with 17,416 rows, roughly twice as many as the next store (26601, with 8,899 rows). 
  
For the sake of this analysis, we will look at the transactions of store 054201. The same steps can later be repeated for other stores to see whether purchase behaviour differs across stores. 

**1-hot encoding:** This is the process of consolidating items into one transaction per row.  

This can be done manually, as below, or via mlxtend.
  
The one-hot encoder from *mlxtend* (`TransactionEncoder`) takes transaction data in the form of a Python list of lists and encodes it as a NumPy boolean array.  

The columns represent the unique items present in the input, and the rows represent the individual transactions. 
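
To make the target layout concrete, here is a library-free sketch of what a one-hot encoding of transactions looks like. The three baskets and item names are made up for illustration, and the notebook itself builds the matrix with pandas rather than this way:

```python
# Hypothetical baskets, one list of item names per transaction.
transactions = [
    ["pillow", "quilt"],
    ["pillow"],
    ["quilt", "sheet set", "pillow"],
]

# One column per unique item, in a fixed (sorted) order.
columns = sorted({item for t in transactions for item in t})

# One row per transaction: 1 if the item is in the basket, otherwise 0.
one_hot = [[int(item in t) for item in columns] for t in transactions]

print(columns)  # ['pillow', 'quilt', 'sheet set']
for row in one_hot:
    print(row)
```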

Before proceeding with the 1-hot encoding, let us look at the transactions of store 054201. 

In [9]:
dd=df[df['storeid']=="054201"]
dd

Unnamed: 0,orderid,billdate,storeid,productid,productname,amount,price,mny,customerid
15,{00039771-7EE7-4FE4-B149-FB9DB20FA916},2011-07-19 09:46:22.000,054201,54023,纯棉四孔二合一被 2400g 200*230,1,779.68,779.68,1378***5871
225,{001B2CE8-0528-4DAA-BD21-3644D8508654},2013-01-08 20:17:05.000,054201,14125,纯棉绣花四件套:温莎夫人(红) 150*200,1,1133.00,1133.00,1560***4852
292,{002423AC-0FF2-43E9-877D-19F894C7DF2C},2012-01-17 13:30:18.000,054201,81074,仿羊羔绒床笠式床垫 180*200,1,429.00,429.00,1397***5780
293,{00242E7B-771A-4FC7-980F-30A3BC6497C6},2012-11-12 13:10:05.000,054201,2003,纯棉四件套 150*200,1,299.00,299.00,1862***9959
327,{00254B0C-C248-4E0B-8445-684E696E5917},2012-01-19 18:00:27.000,054201,52235,软木恬梦枕 50*70,2,348.00,696.00,1335***0889
...,...,...,...,...,...,...,...,...,...
707756,{FFBC9D79-B422-4CF1-8312-FC88F1564FEA},2012-08-25 11:30:51.000,054201,50127,四孔靠枕 65*65,2,0.02,0.04,1350***9227
707757,{FFBC9D79-B422-4CF1-8312-FC88F1564FEA},2012-08-25 11:30:51.000,054201,81076,时尚优眠对枕 50*70,2,0.01,0.02,1350***9227
707788,{FFBE6521-015B-47B5-8F3C-6758D3D4E719},2012-06-20 20:30:38.000,054201,58101,拱形方顶蚊帐 150*195*155,1,225.00,225.00,1397***2618
707989,{FFD2F5C8-4BD2-4FFF-9DF0-180023CCFA8C},2013-09-08 20:03:02.000,054201,80001,荞麦枕 50*70,2,158.00,316.00,1390***9808


In [10]:
Basket = (df[df['storeid']=="054201"]
          .groupby(['orderid', 'productname'])['amount']
          .sum().unstack().reset_index().fillna(0)
          .set_index('orderid'))

Basket.head()

productname,1.5米床宾馆提花四件套,100%丝享纯蚕丝二合一被 200*230 2400g,100%丝享纯蚕丝二合一被 248*248 3200g,100%双层连芯羊毛被 200*230,100%双层连芯羊毛被 248*248,100%温润羊毛厚被 200*230 2000g,100%温润羊毛厚被 248*248 2700g,100%羊毛厚被 2300g 200*230,100%羊毛厚被 3080g 248*248,50%双层连芯羊毛被 248*248,...,馨香蒲绒枕 1000g 50*70,馨香蒲绒枕 800g 50*70,骨头枕 250g 16*42,高密纯棉素色缎条提花被套,魅丽格调方巾 34*35,魅丽格调浴巾 76*148,魅丽格调面巾 34*78,魅影艺术毯 180*220,鹅绒被 1100g 200*230,鹅绒被 1500g 248*248
orderid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
{00039771-7EE7-4FE4-B149-FB9DB20FA916},0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
{001B2CE8-0528-4DAA-BD21-3644D8508654},0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
{002423AC-0FF2-43E9-877D-19F894C7DF2C},0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
{00242E7B-771A-4FC7-980F-30A3BC6497C6},0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
{00254B0C-C248-4E0B-8445-684E696E5917},0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


To complete the one-hot encoding, we replace every positive quantity with 1 and every other value with 0. 

In [11]:
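def sum_to_boolean(x):
    # Any positive summed quantity means the item is in the basket (1); zero or negative totals mean it is not (0)
    if x<=0:
        return 0
    else:
        return 1

# Note: pandas 2.1+ deprecates DataFrame.applymap in favour of DataFrame.map
Basket_Final = Basket.applymap(sum_to_boolean)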


The final one-hot encoded matrix: 

In [12]:
Basket_Final.head()

productname,1.5米床宾馆提花四件套,100%丝享纯蚕丝二合一被 200*230 2400g,100%丝享纯蚕丝二合一被 248*248 3200g,100%双层连芯羊毛被 200*230,100%双层连芯羊毛被 248*248,100%温润羊毛厚被 200*230 2000g,100%温润羊毛厚被 248*248 2700g,100%羊毛厚被 2300g 200*230,100%羊毛厚被 3080g 248*248,50%双层连芯羊毛被 248*248,...,馨香蒲绒枕 1000g 50*70,馨香蒲绒枕 800g 50*70,骨头枕 250g 16*42,高密纯棉素色缎条提花被套,魅丽格调方巾 34*35,魅丽格调浴巾 76*148,魅丽格调面巾 34*78,魅影艺术毯 180*220,鹅绒被 1100g 200*230,鹅绒被 1500g 248*248
orderid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
{00039771-7EE7-4FE4-B149-FB9DB20FA916},0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
{001B2CE8-0528-4DAA-BD21-3644D8508654},0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
{002423AC-0FF2-43E9-877D-19F894C7DF2C},0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
{00242E7B-771A-4FC7-980F-30A3BC6497C6},0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
{00254B0C-C248-4E0B-8445-684E696E5917},0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


**Apriori:**

To start with, and to have enough itemsets to work with, let us look at frequent itemsets that have a support of at least 2%.

In [30]:
## Apriori to select the most important itemsets
Frequent_itemsets = apriori(Basket_Final, min_support = 0.02, use_colnames = True)

Frequent_itemsets.sort_values('support', ascending = False)

Unnamed: 0,support,itemsets
6,0.138425,(时尚优眠对枕 50*70)
3,0.097208,(四孔靠枕 65*65)
15,0.095878,(纯棉四件套 150*200)
7,0.091003,(温馨枕 50*70)
0,0.070912,(加厚温馨被)
14,0.060718,(纯棉冬鸟被 3080g 248*248)
22,0.054661,"(时尚优眠对枕 50*70, 四孔靠枕 65*65)"
13,0.047718,(纯棉冬鸟被 2300g 200*230)
19,0.045206,"(加厚温馨被, 四孔靠枕 65*65)"
20,0.037524,"(时尚优眠对枕 50*70, 加厚温馨被)"


**Association Rules:**

Now that we have identified the key itemsets, let us derive association rules to learn about purchase behaviour. Here we keep only rules with a lift of at least 3.

In [31]:
Asso_Rules = association_rules(Frequent_itemsets, metric = "lift", min_threshold =3)
Asso_Rules.sort_values('lift',ascending = False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
19,(四孔靠枕 65*65),"(时尚优眠对枕 50*70, 加厚温馨被)",0.097208,0.037524,0.026001,0.267477,7.128162,0.022353,1.31392
14,"(时尚优眠对枕 50*70, 加厚温馨被)",(四孔靠枕 65*65),0.037524,0.097208,0.026001,0.692913,7.128162,0.022353,2.939862
18,(加厚温馨被),"(时尚优眠对枕 50*70, 四孔靠枕 65*65)",0.070912,0.054661,0.026001,0.366667,6.708018,0.022125,1.492641
15,"(时尚优眠对枕 50*70, 四孔靠枕 65*65)",(加厚温馨被),0.054661,0.070912,0.026001,0.475676,6.708018,0.022125,1.771973
0,(加厚温馨被),(四孔靠枕 65*65),0.070912,0.097208,0.045206,0.6375,6.558112,0.038313,2.490461
1,(四孔靠枕 65*65),(加厚温馨被),0.097208,0.070912,0.045206,0.465046,6.558112,0.038313,1.736762
5,(加厚温馨被),(温馨枕 50*70),0.070912,0.091003,0.029694,0.41875,4.601491,0.023241,1.563866
4,(温馨枕 50*70),(加厚温馨被),0.091003,0.070912,0.029694,0.326299,4.601491,0.023241,1.379081
11,(四孔靠枕 65*65),(纯棉冬鸟被 3080g 248*248),0.097208,0.060718,0.026296,0.270517,4.455298,0.020394,1.287599
10,(纯棉冬鸟被 3080g 248*248),(四孔靠枕 65*65),0.060718,0.097208,0.026296,0.43309,4.455298,0.020394,1.592479
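
As a sanity check, the metrics of one rule can be recomputed by hand from its three support columns. The sketch below uses row 0 of the table above, (加厚温馨被) -> (四孔靠枕 65*65). Leverage, which is not defined in the text above, is supp(X U Y) - supp(X) * supp(Y). Because the inputs are the rounded values shown in the table, the results agree with the table only up to rounding:

```python
# Values copied from row 0 of the rule table: (加厚温馨被) -> (四孔靠枕 65*65)
antecedent_support = 0.070912
consequent_support = 0.097208
support = 0.045206

confidence = support / antecedent_support
lift = confidence / consequent_support
leverage = support - antecedent_support * consequent_support
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 4))  # 0.6375
print(round(lift, 2))        # 6.56
print(round(leverage, 4))    # 0.0383
print(round(conviction, 2))  # 2.49
```

Read in plain terms: about 64% of the store's orders that contain 加厚温馨被 also contain 四孔靠枕 65*65, which is roughly 6.6 times the rate expected if the two products were bought independently.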


In [32]:
# The `encoding` argument of to_excel was deprecated in pandas 1.5 and removed in 2.0, so it is omitted
Asso_Rules.to_excel(r'data/梦洁交叉销售的关联规则.xlsx',index=False)