## Market Basket Analysis Introduction

The Python implementation in `MLxtend` should feel very familiar to anyone who has worked with scikit-learn and pandas.

This analysis requires that **all the data** for a transaction be included in **one row**, and the items must be **one-hot encoded**.

For example:

<img src="figures/mlxtend-ass-rule-data.png" width="80%">

In [2]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

The specific data for this article comes from the UCI Machine Learning Repository and   
represents **transactional data from a UK retailer from 2010-2011**. 

The data mostly represents sales to wholesalers, so it differs slightly from consumer purchase patterns, but it is still a useful case study.

In [3]:
#df = pd.read_excel('http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx')
df = pd.read_excel('data/Online-Retail.xlsx')

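Note: if this cell raises `ImportError: Install xlrd >= 0.9.0 for Excel support`, pandas has no Excel reader available. Install one (for example `xlrd`, or `openpyxl` on newer pandas versions) and rerun the cell.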

In [None]:
df.head()

In [None]:
df.info()

In [None]:
# Clean up spaces in description and remove any rows that don't have a valid invoice
df['Description'] = df['Description'].str.strip()
df.dropna(axis=0, subset=['InvoiceNo'], inplace=True)
df.head()

In [None]:
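# Treat invoice numbers as strings, then drop invoices whose number contains 'C'
# (in this dataset a 'C' marks a cancelled invoice)
df['InvoiceNo'] = df['InvoiceNo'].astype('str')
df = df[~df['InvoiceNo'].str.contains('C')]
df.info()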

After the cleanup, we need to consolidate the items into **one transaction per row** with each **product one-hot encoded**.

For the sake of keeping the data set small, I’m only looking at sales for **France**. 

In [None]:
basket = (df[df['Country'] == "France"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))
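To see what the `groupby` / `unstack` chain does, here is a minimal sketch on a made-up three-invoice frame. The invoice numbers, item names, and quantities are invented for illustration and are not taken from the retail data.

```python
import pandas as pd

# Hypothetical toy data: one row per invoice line
toy = pd.DataFrame({
    "InvoiceNo": ["1", "1", "2", "3", "3"],
    "Description": ["CUPS", "PLATES", "CUPS", "CUPS", "CUPS"],
    "Quantity": [6, 12, 6, 2, 4],
})

# Sum quantities per (invoice, item), then pivot the items into columns
toy_basket = (toy.groupby(["InvoiceNo", "Description"])["Quantity"]
                 .sum().unstack().reset_index().fillna(0)
                 .set_index("InvoiceNo"))
print(toy_basket)
```

Invoice 3 lists CUPS on two lines, and the two quantities are summed into a single cell. Items that an invoice does not contain come out of `unstack` as missing values, which `fillna(0)` turns into zeros.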

In [None]:
basket.head()

In [None]:
# Show the first eight columns
basket.iloc[:, :8].head()

There are a lot of zeros in the data, but we also need to make sure that

* any positive values are converted to 1, and
* anything less than 0 is set to 0.

This step will complete the one hot encoding of the data and remove the postage column (since that charge is not one we wish to explore):

In [None]:
# Convert the units to one-hot encoded values
def encode_units(x):
    # 1 if at least one unit was bought, 0 for zero or negative quantities
    return 1 if x >= 1 else 0

In [None]:
basket_sets = basket.applymap(encode_units)
basket_sets.head()
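On a large basket, calling `encode_units` cell by cell through `applymap` can be slow. A vectorized comparison is one possible alternative. The sketch below uses invented data and assumes whole-number quantities, in which case `> 0` marks the same cells as `encode_units`, only as booleans instead of 0/1. I believe newer `mlxtend` releases also prefer boolean input, but check the version you have installed.

```python
import pandas as pd

# Hypothetical quantities, including a negative (returned) item
toy_basket = pd.DataFrame(
    {"CUPS": [6.0, 0.0, -2.0], "PLATES": [12.0, 0.0, 3.0]},
    index=pd.Index(["1", "2", "3"], name="InvoiceNo"),
)

# True where at least one unit was bought, False for zero or negative totals
toy_sets = toy_basket > 0
print(toy_sets)
```

If integer 0/1 values are needed, `toy_sets.astype(int)` converts the booleans back.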

In [None]:
# No need to track postage
basket_sets.drop('POSTAGE', inplace=True, axis=1)

In [None]:
basket_sets.head()

Now that the data is structured properly, we can generate **frequent item sets** that have a support of at least 7% (this number was chosen so that we could get enough useful examples):

In [None]:
# Build up the frequent items
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)

In [None]:
frequent_itemsets.head()
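Support is the share of transactions that contain every item in a set. The sketch below computes it by hand on an invented boolean table to make the threshold concrete. The `support` helper is a hypothetical name of mine, and this is not how `apriori` works internally.

```python
import pandas as pd

# Hypothetical one-hot table: 4 transactions, 3 items
toy_sets = pd.DataFrame({
    "CUPS":    [True, True, True, False],
    "PLATES":  [True, False, True, False],
    "NAPKINS": [False, False, True, True],
})

def support(sets, items):
    """Fraction of rows in which all the given items are present."""
    return sets[list(items)].all(axis=1).mean()

print(support(toy_sets, ["CUPS"]))            # CUPS is in 3 of the 4 rows
print(support(toy_sets, ["CUPS", "PLATES"]))  # both are in 2 of the 4 rows
```

With `min_support=0.07`, an item set is kept only if it appears in at least 7% of the invoices.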

The final step is to **generate the rules** with their corresponding **support**, **confidence** and **lift**:

In [None]:
# Create the rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head()
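To make the three metrics less abstract, here is a small hand calculation using the standard definitions: confidence is the support of both sides together divided by the support of the antecedent, and lift is the confidence divided by the support of the consequent. The transactions and the `support` helper are invented for illustration.

```python
# Hypothetical transactions, each a set of items
transactions = [
    {"CUPS", "PLATES"},
    {"CUPS"},
    {"CUPS", "PLATES", "NAPKINS"},
    {"NAPKINS"},
]

def support(items):
    """Share of transactions containing every item in `items`."""
    return sum(items <= t for t in transactions) / len(transactions)

antecedent, consequent = {"PLATES"}, {"CUPS"}

rule_support = support(antecedent | consequent)  # both sides together
confidence = rule_support / support(antecedent)  # P(consequent | antecedent)
lift = confidence / support(consequent)          # above 1: bought together more than chance

print(rule_support, confidence, lift)
```

A lift of exactly 1 would mean the two sides are independent, which is why `min_threshold=1` on lift keeps only the positively associated rules.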

## That’s all!
* Build the **frequent item sets** using `apriori`, then
* build the **rules** with `association_rules`.

## Figuring out what this tells us
* We can see that there are quite a few **rules with a high lift** value, which means that the items occur together more frequently than would be expected given the number of transactions and product combinations.
* We can also see several rules where the **confidence** is **high** as well.

We can **filter** the dataframe using standard pandas code.  
In this case, **look for a large lift (at least 6) and high confidence (at least 0.8)**:

In [None]:
rules[ (rules['lift'] >= 6) &
       (rules['confidence'] >= 0.8) ]

Looking at the rules, it seems that

1. the green and red alarm clocks are purchased together, and
2. the red paper cups, napkins and plates are purchased together more often than their overall frequency would suggest.

## Recommendations
At this point, you may want to look at how much opportunity there is to **use the popularity of one product** to **drive sales of another**. 

For instance, we can see that we sell 340 green alarm clocks but only 316 red alarm clocks, so maybe we can drive more red alarm clock sales through **recommendations**?

In [None]:
basket['ALARM CLOCK BAKELIKE GREEN'].sum()

In [None]:
basket['ALARM CLOCK BAKELIKE RED'].sum()

## Change the country of purchase
It is also interesting to see how the combinations vary by country of purchase.

Let’s check out what some popular combinations might be in Germany:

In [None]:
basket2 = (df[df['Country'] == "Germany"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))

In [None]:
basket_sets2 = basket2.applymap(encode_units)

In [None]:
basket_sets2.drop('POSTAGE', inplace=True, axis=1)

In [None]:
frequent_itemsets2 = apriori(basket_sets2, min_support=0.05, use_colnames=True)

In [None]:
rules2 = association_rules(frequent_itemsets2, metric="lift", min_threshold=1)
rules2

In [None]:
rules2[ (rules2['lift'] >= 4) &
        (rules2['confidence'] >= 0.5) ]

It seems that **Germans** love `Plasters in Tin Spaceboy` and `Woodland Animals`.
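The France and Germany steps are identical apart from the country name, so the data preparation could be wrapped in a small helper. This is only a sketch: `build_basket_sets` is a hypothetical function name, the toy rows are invented, and it returns booleans rather than the 0/1 values produced by `encode_units`.

```python
import pandas as pd

def build_basket_sets(df, country):
    """Return a boolean invoice-by-item table for one country (hypothetical helper)."""
    basket = (df[df["Country"] == country]
              .groupby(["InvoiceNo", "Description"])["Quantity"]
              .sum().unstack().fillna(0))
    # Drop the postage charge if present; ignore it if the country has none
    basket = basket.drop(columns="POSTAGE", errors="ignore")
    return basket > 0

# Invented rows, only to show the call
toy = pd.DataFrame({
    "InvoiceNo": ["1", "1", "2", "3"],
    "Description": ["CUPS", "POSTAGE", "CUPS", "PLATES"],
    "Quantity": [6, 1, 3, 2],
    "Country": ["Spain", "Spain", "Spain", "France"],
})
spain_sets = build_basket_sets(toy, "Spain")
print(spain_sets)
```

The result can be passed to `apriori` in the same way as `basket_sets` and `basket_sets2`.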

<div class="alert alert-success">
    
## Practice 
1. Find the popular combinations for Italy using the `Online-Retail.xlsx` dataset.
2. Explore which association rules are popular in the `order_products__prior-100000.csv` dataset.
</div>

In [None]:
# write your code here:
