# Association Analysis Application - Market Basket Analysis

One specific application of association analysis is often called market basket analysis. The most commonly cited example of market basket analysis is the so-called *beer and diapers* case. The basic story is that a large retailer mined its transaction data and found an unexpected purchase pattern: individuals who were buying beer and baby diapers at the same time. The story is an illustrative (and entertaining) example of the types of insights that can be gained by mining transactional data. While these types of associations are normally used for looking at sales transactions, the basic analysis can be applied to other situations such as *click stream tracking*, *spare parts ordering* and *online recommendation engines*, just to name a few.

**mlxtend** (machine learning extensions) is a Python library of useful tools for day-to-day data science tasks. It provides implementations of the frequent pattern mining algorithms Apriori and FP-Growth. The rest of this notebook walks through an example of using this library to analyze a relatively large [online retail](http://archive.ics.uci.edu/ml/datasets/Online+Retail) data set and find interesting purchase combinations. By the end of this notebook, you should be familiar enough with the basic approach to apply it to your own data sets.
- Install mlxtend using ``pip install mlxtend``
- **Dataset:** ``Online Retail.xlsx`` is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer.

Import ``pandas`` and ``mlxtend``, then read the data:

In [None]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
df = pd.read_excel('Online Retail.xlsx')

In [None]:
df.head()

## Preprocessing

There is a little cleanup we need to do. First, some of the descriptions have leading or trailing spaces that need to be removed. We’ll also drop the rows that don't have invoice numbers and remove the credit transactions (those with invoice numbers containing C).

In [None]:
# Remove leading and trailing whitespace from descriptions
df['Description'] = df['Description'].str.strip()

# Drop rows that don't have invoice numbers
df.dropna(axis=0, subset=['InvoiceNo'], inplace=True)

In [None]:
# Remove credit transactions (those with invoice numbers containing 'C')
df['InvoiceNo'] = df['InvoiceNo'].astype('str')
df = df[~df['InvoiceNo'].str.contains('C')]

The analysis requires that all the data for a transaction be included in one row and that the items be *one-hot encoded*. Therefore, we need to consolidate the items into one transaction per row, with each product one-hot encoded.
- To keep the data set small, we look at sales for France only.

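To make the target layout concrete, here is a minimal sketch on a tiny made-up table. The invoice numbers, product names and quantities are hypothetical; only the column names mirror the real data set. It uses the same ``groupby`` and ``unstack`` steps as the cells below, then marks any positive quantity as ``True``.

```python
import pandas as pd

# Toy long-format data (hypothetical values): one row per invoice line
toy = pd.DataFrame({
    "InvoiceNo": ["1001", "1001", "1002", "1002", "1003"],
    "Description": ["CUPS", "PLATES", "CUPS", "NAPKINS", "PLATES"],
    "Quantity": [6, 12, 3, 24, 1],
})

# Sum quantities per (invoice, product), then pivot products into columns
toy_basket = (toy.groupby(["InvoiceNo", "Description"])["Quantity"]
                 .sum().unstack().fillna(0))

# Any positive quantity becomes True: one row per invoice, one column per product
toy_sets = toy_basket > 0
print(toy_sets)
```
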
In [None]:
# Group by InvoiceNo and Description, summing the quantity for each pair
df[df['Country']=='France'].groupby(['InvoiceNo','Description'])['Quantity'].sum()

In [None]:
basket = (df[df['Country'] == 'France']
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))

In [None]:
basket.head(10)

In [None]:
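# Convert quantities to one-hot encoded values
def encode_units(x):
    # A quantity of at least 1 counts as purchased; zero or negative does not.
    # Returning the comparison directly avoids falling through and returning None.
    return x >= 1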

In [None]:
basket_sets = basket.applymap(encode_units)

In [None]:
# Drop the POSTAGE column: it only indicates whether the customer paid for postage,
# which is not a purchase pattern we wish to explore
basket_sets.drop('POSTAGE', inplace=True, axis=1)
basket_sets.head(10)

To transform the transactions into one-hot encoded format, you can also use ``TransactionEncoder()`` from ``mlxtend.preprocessing`` directly. Refer to the [User Guide](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/) for more details.

Now that the data is structured properly, we can generate frequent item sets that have a support of at least 7% (this number was chosen in order to get enough useful examples):

In [None]:
# Build up the frequent itemsets
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)

In [None]:
frequent_itemsets.head(10)

In [None]:
frequent_itemsets.values[50]

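To see what the support threshold means, the following pure-Python sketch counts support by brute force on a handful of hypothetical baskets. It is not how ``mlxtend`` implements Apriori: Apriori avoids enumerating every candidate by using the fact that every subset of a frequent itemset must itself be frequent. The function names and data here are illustrative assumptions.

```python
from itertools import combinations

# Hypothetical baskets, one set of items per transaction
baskets = [
    {"CUPS", "PLATES", "NAPKINS"},
    {"CUPS", "PLATES"},
    {"CUPS", "NAPKINS"},
    {"PLATES"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def frequent_itemsets_bruteforce(baskets, min_support, max_len=2):
    """Enumerate all itemsets up to max_len and keep those meeting min_support."""
    items = sorted(set().union(*baskets))
    result = {}
    for size in range(1, max_len + 1):
        for candidate in combinations(items, size):
            s = support(candidate, baskets)
            if s >= min_support:
                result[frozenset(candidate)] = s
    return result

toy_frequent = frequent_itemsets_bruteforce(baskets, min_support=0.5)
for itemset, s in toy_frequent.items():
    print(sorted(itemset), s)
```
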
Now, we can generate the association rules with their corresponding support, confidence and lift.
- In each rule in the form of ``{item1}->{item2}``, the ``{item1}`` is the **antecedent** and ``{item2}`` is the **consequent**. Both the antecedent and consequent can have multiple items.

**Evaluation metrics**

For each rule, five metrics are given: ``support``, ``confidence``, ``lift``, ``leverage`` and ``conviction``.
- **Leverage** is the difference between the observed frequency of $X$ and $Y$ appearing together in the data set and the frequency that would be expected if $X$ and $Y$ were statistically independent. $$leverage(X\rightarrow Y)= support(X\rightarrow Y)-support(X)support(Y)$$
    - Range is (-1,1) (0 indicates independence).
    - The rationale in a sales setting is to find out how many more units (items $X$ and $Y$ together) are sold than would be expected from independent sales.
- **Conviction** compares the probability that $X$ appears without $Y$ if they were independent with the actual frequency of the appearance of $X$ without $Y$. $$conviction(X\rightarrow Y)= \frac{sup(X)sup(\overline{Y})}{sup(X\cup \overline{Y})}= \frac{p(X)(1-p(Y))}{p(X)-p(X\cup Y)}=\frac{1-p(Y)}{1-p(Y|X)}$$
    - Range is (0,inf).
    - Conviction can be interpreted as the ratio of the expected frequency with which the rule makes an incorrect prediction (if $X$ and $Y$ were independent) to the observed frequency of incorrect predictions.
    - A high conviction value means that the consequent ($Y$) is highly dependent on the antecedent ($X$). In the case of a perfect confidence score, the denominator becomes 0 (due to 1 - 1), for which the conviction score is defined as 'inf'.
    - If antecedents and consequents are independent, the conviction is 1.

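The formulas above can be checked by hand with a small sketch. The helper ``rule_metrics`` and the support values passed to it are hypothetical, chosen only to give round numbers. Confidence is taken as $support(X\rightarrow Y)/support(X)$ and lift as confidence divided by $support(Y)$.

```python
import math

def rule_metrics(support_x, support_y, support_xy):
    """Compute metrics for the rule X -> Y from three support values in [0, 1]."""
    confidence = support_xy / support_x
    lift = confidence / support_y
    leverage = support_xy - support_x * support_y
    # Perfect confidence makes the denominator zero, so report infinity
    if confidence == 1:
        conviction = math.inf
    else:
        conviction = (1 - support_y) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }

# X and Y appear together more often than independence would predict
associated = rule_metrics(support_x=0.5, support_y=0.5, support_xy=0.375)

# X and Y are exactly independent: support_xy == support_x * support_y
independent = rule_metrics(support_x=0.5, support_y=0.5, support_xy=0.25)

print(associated)
print(independent)
```
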
In [None]:
# Create the rules, using lift as the metric that decides whether a rule is of interest
# (min_threshold=1 keeps rules with a lift of at least 1)
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1)
rules.head()

Now, the tricky part is figuring out what this tells us. For instance, we can see that there are quite a few rules with a high lift value, which means that the items occur together more frequently than would be expected given how often each is purchased on its own. We can also see several where the confidence is high as well. This part of the analysis is where domain knowledge will come in handy.

Next, we will just look at a couple of illustrative examples.
For example, we can filter for a large lift (at least 6) and a high confidence (at least 0.8):

In [None]:
rules[ (rules['lift'] >= 6) &
       (rules['confidence'] >= 0.8) ]

In looking at the rules, it seems that the green and red alarm clocks are purchased together and the red paper cups, napkins and plates are purchased together.

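The same boolean-mask approach can narrow the rules to those that involve a particular product. The sketch below uses a small hand-made table, not real output: ``toy_rules`` and the helper ``rules_with_antecedent`` are hypothetical. It assumes, as in the ``rules`` frame, that the ``antecedents`` and ``consequents`` columns hold ``frozenset`` objects.

```python
import pandas as pd

# Hypothetical rules table mimicking the column layout of the rules frame
toy_rules = pd.DataFrame({
    "antecedents": [frozenset({"CUPS"}),
                    frozenset({"PLATES"}),
                    frozenset({"CUPS", "PLATES"})],
    "consequents": [frozenset({"PLATES"}),
                    frozenset({"CUPS"}),
                    frozenset({"NAPKINS"})],
    "confidence": [0.9, 0.7, 0.85],
    "lift": [7.0, 7.0, 6.5],
})

def rules_with_antecedent(rules, item):
    """Keep only the rules whose antecedent contains the given item."""
    mask = rules["antecedents"].apply(lambda items: item in items)
    return rules[mask]

# Rules triggered by CUPS, strongest lift first
cups_rules = (rules_with_antecedent(toy_rules, "CUPS")
              .sort_values("lift", ascending=False))
print(cups_rules)
```
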
At this point, you may want to look at how much opportunity there is to use the popularity of one product to drive sales of another. For instance, we can see that we sell 340 green alarm clocks but only 316 red alarm clocks, so maybe we can drive more red alarm clock sales through recommendations?

In [None]:
basket['ALARM CLOCK BAKELIKE GREEN'].sum()

In [None]:
basket['ALARM CLOCK BAKELIKE RED'].sum()

What is also interesting is to see how the combinations vary by country of purchase. Let’s check out what some popular combinations might be in Germany.

In [None]:
basket2 = (df[df['Country'] == 'Germany']
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))

basket_sets2 = basket2.applymap(encode_units)

basket_sets2.drop('POSTAGE', inplace=True, axis=1)

frequent_itemsets2 = apriori(basket_sets2, min_support=0.05, use_colnames=True)

rules2 = association_rules(frequent_itemsets2, metric="lift", min_threshold=1)

rules2[ (rules2['lift'] >= 4) &
        (rules2['confidence'] >= 0.5)]

It seems that German customers love Plasters in Tin Spaceboy and Woodland Animals.
In all seriousness, an analyst who is familiar with the data would probably have a dozen different questions that this type of analysis could drive.

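Because the basket-building steps are identical for France and Germany, they could be wrapped in a small helper. The function ``build_basket_sets`` below is a hypothetical convenience, not part of ``mlxtend`` or ``pandas``, and the toy data is made up. It replaces ``encode_units`` with a direct ``> 0`` comparison, which gives the same result for whole-number quantities.

```python
import pandas as pd

def build_basket_sets(df, country, drop_items=("POSTAGE",)):
    """Return a boolean invoice-by-product table for one country."""
    basket = (df[df["Country"] == country]
              .groupby(["InvoiceNo", "Description"])["Quantity"]
              .sum().unstack().fillna(0))
    # errors="ignore" skips labels that do not occur for this country
    basket = basket.drop(columns=list(drop_items), errors="ignore")
    return basket > 0

# Toy data (hypothetical values) with the same column names as the real set
toy = pd.DataFrame({
    "InvoiceNo": ["1", "1", "2", "3"],
    "Description": ["CUPS", "POSTAGE", "CUPS", "PLATES"],
    "Quantity": [2, 1, 4, 1],
    "Country": ["Germany", "Germany", "Germany", "France"],
})

germany_sets = build_basket_sets(toy, "Germany")
print(germany_sets)
```
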
For more examples of frequent pattern mining using ``mlxtend``, please refer to the [API](http://rasbt.github.io/mlxtend/api_subpackages/mlxtend.frequent_patterns/) and [User Guide](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/) of ``mlxtend.frequent_patterns``.