# Build Association Rules

In [1]:
import pandas as pd

In [17]:
df = pd.read_excel('Online Retail.xlsx')

In [18]:
df.head()

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


Our goal is to recommend products by examining how frequently different items are purchased together, so we only need the columns that identify individual orders and individual products.  We will also keep the Description, because it is more convenient for display.  The remaining columns are not needed for this project.

We will keep two DataFrames for this.

One for building the recommendation system, with the following features:
- `InvoiceNo`
- `StockCode`

And one for matching the description to the `StockCode`:
- `StockCode`
- `Description`



In [19]:
# DataFrame for building the recommendation system
orders = df[['InvoiceNo', 'StockCode']]
orders.head()

,InvoiceNo,StockCode
0,536365,85123A
1,536365,71053
2,536365,84406B
3,536365,84029G
4,536365,84029E


In [23]:
# DataFrame for retrieving product descriptions
products = df[['StockCode', 'Description']]
products = products.drop_duplicates()
products.head()

,StockCode,Description
0,85123A,WHITE HANGING HEART T-LIGHT HOLDER
1,71053,WHITE METAL LANTERN
2,84406B,CREAM CUPID HEARTS COAT HANGER
3,84029G,KNITTED UNION FLAG HOT WATER BOTTLE
4,84029E,RED WOOLLY HOTTIE WHITE HEART.


In [25]:
# Number of unique (StockCode, Description) pairs; a code with several descriptions is counted more than once
len(products)

5752

# Number of orders

In [None]:
# total number of orders
orders['InvoiceNo'].nunique()

In [None]:
# orders with more than one item
num_items_in_order = orders.groupby('InvoiceNo').count()
num_items_in_order.columns = ['Count']
len(num_items_in_order[num_items_in_order['Count'] > 1])

About 20k orders contain more than one product, which is roughly 80% of all orders.  Customers of this store often buy items together, so we are going to help new customers by showing them which products are commonly purchased together.
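The share of multi-item orders can be computed directly from the per-invoice counts. The following is a minimal sketch on made-up toy data. The names `toy_orders`, `items_per_order`, and `share_multi_item` are hypothetical and not part of the original notebook, and the toy numbers say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the real `orders` frame
# (one row per line item, as in the original DataFrame).
toy_orders = pd.DataFrame({
    'InvoiceNo': [1, 1, 2, 3, 3, 3, 4, 5, 5],
    'StockCode': ['A', 'B', 'A', 'B', 'C', 'D', 'C', 'A', 'C'],
})

# Number of line items recorded for each invoice
items_per_order = toy_orders.groupby('InvoiceNo')['StockCode'].count()

total_orders = items_per_order.size
multi_item_orders = int((items_per_order > 1).sum())
share_multi_item = multi_item_orders / total_orders

print(f"{multi_item_orders} of {total_orders} orders "
      f"({share_multi_item:.0%}) have more than one item")
```

Running the same three steps on the real `orders` frame should give the counts and the percentage quoted above.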

# Restructure the data
We want one row per invoice number, each holding the list of stock codes in that order.

In [32]:
orders = orders.groupby('InvoiceNo')['StockCode'].apply(list).reset_index()
orders.head()

,InvoiceNo,StockCode
0,536365,"[85123A, 71053, 84406B, 84029G, 84029E, 22752,..."
1,536366,"[22633, 22632]"
2,536367,"[84879, 22745, 22748, 22749, 22310, 84969, 226..."
3,536368,"[22960, 22913, 22912, 22914]"
4,536369,[21756]


# Confidence

Confidence measures how likely an order is to contain Item B, given that it contains Item A.  It is defined by:

$$ \frac{\text{Number of transactions with both Item A AND Item B}}{\text{Number of transactions with Item A}}$$

Dividing by the number of transactions with Item A accounts for items that are purchased very often.  For example, if everyone buys batteries with their order, batteries appear alongside every other product, so raw co-purchase counts tell us little about what to recommend when someone buys batteries.
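To make the formula concrete, here is a small worked sketch on made-up baskets. The data and the helper name `confidence` are hypothetical and only illustrate the definition; they are not the notebook's implementation. Note that confidence depends on direction: the rule "torch, then batteries" and the rule "batteries, then torch" generally get different values.

```python
# Hypothetical toy transactions; each set holds the items in one order.
transactions = [
    {'batteries', 'torch'},
    {'batteries', 'torch', 'tent'},
    {'batteries', 'tent'},
    {'batteries'},
    {'torch', 'tent'},
]

def confidence(item_a, item_b, baskets):
    """Confidence of the rule item_a -> item_b over a list of item sets."""
    with_a = [b for b in baskets if item_a in b]
    if not with_a:
        return float('nan')  # undefined when item_a never appears
    with_both = [b for b in with_a if item_b in b]
    return len(with_both) / len(with_a)

# 3 baskets contain a torch, 2 of them also contain batteries
conf_torch_to_batteries = confidence('torch', 'batteries', transactions)

# 4 baskets contain batteries, 2 of them also contain a torch
conf_batteries_to_torch = confidence('batteries', 'torch', transactions)

print(conf_torch_to_batteries, conf_batteries_to_torch)
```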

In [None]:
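def calculate_confidence(itema, itemb, df):
    # Sketch: assumes df['StockCode'] holds a list of codes per invoice (like `orders`)
    with_a = df[df['StockCode'].apply(lambda codes: itema in codes)]
    return with_a['StockCode'].apply(lambda codes: itemb in codes).sum() / len(with_a)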