# Market Basket Analysis with SQL, Association Rules, and the Apriori Algorithm

# Project Overview

Whether you shop from meticulously planned grocery lists or let whimsy guide your grazing, our unique food rituals define who we are. Instacart, a grocery ordering and delivery app, aims to make it easy to fill your refrigerator and pantry with your personal favorites and staples when you need them.

In this project, I use anonymized Instacart data on customer orders over time to practice SQL queries against a PostgreSQL database, then look for interesting purchase combinations using association rules and the Apriori algorithm.

## What I Have Learned From This Project

* Utilized SQLAlchemy, which gives the full power and flexibility of SQL
* Practiced SQL queries: CREATE, SELECT, FROM, JOIN, DROP, INNER JOIN, etc.
* Worked with a PostgreSQL database
* Association analysis: a classic business intelligence data mining problem
* Implemented the Apriori algorithm with the Python library MLxtend

# Data Description 

The dataset for this competition is a relational set of files describing customers' orders over time. The goal of the competition is to predict which products will be in a user's next order. The dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, we provide between 4 and 100 of their orders, with the sequence of products purchased in each order. We also provide the week and hour of day the order was placed, and a relative measure of time between orders. Each entity (customer, product, order, aisle, etc.) has an associated unique id. Most of the files and variable names should be self-explanatory.

More information about this dataset can be found [here](https://www.kaggle.com/c/instacart-market-basket-analysis/data).

# Association Rules

## Concepts

Reference: https://www.kdnuggets.com/2016/04/association-rules-apriori-algorithm-tutorial.html

Association rule analysis is a technique for uncovering how items are associated with each other. There are three common ways to measure association:

* **Support**: How popular an itemset is, measured as the proportion of transactions in which the itemset appears.
* **Confidence**: How likely item Y is to be purchased when item X is purchased, written as {X -> Y}. It is measured as the proportion of transactions containing item X in which item Y also appears.
* **Lift**: How likely item Y is to be purchased when item X is purchased, while controlling for how popular item Y is. A lift greater than 1 means that Y is more likely to be bought when X is bought than Y's overall popularity would suggest, a lift less than 1 means it is less likely, and a lift of 1 means there is no association between the two.
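As a small illustration of the three measures, here is a sketch in plain Python. The transactions and the helper names (`support`, `confidence`, `lift`) are made up for this example and are not taken from the Instacart data or from any library.

```python
# Toy transactions (hypothetical, not taken from the Instacart data)
transactions = [
    {"paper towels", "aluminum foil"},
    {"paper towels", "bath tissue"},
    {"paper towels", "aluminum foil", "trash bags"},
    {"bath tissue"},
    {"aluminum foil"},
]


def support(itemset, transactions):
    """Proportion of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(x, y, transactions):
    """Proportion of the transactions containing X that also contain Y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


def lift(x, y, transactions):
    """Confidence of {X -> Y} divided by the support of Y."""
    return confidence(x, y, transactions) / support(y, transactions)


x, y = {"paper towels"}, {"aluminum foil"}
print(support(x | y, transactions))    # X and Y together: 2 of 5 transactions
print(confidence(x, y, transactions))  # 2 of the 3 transactions that contain X
print(lift(x, y, transactions))        # slightly above 1 for this toy data
```

Here the lift is only a little above 1, so in this toy data buying paper towels barely changes the chance of buying aluminum foil.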

## MLxtend Library and Apriori Algorithm

If you have some basic understanding of the Python data science world, your first inclination would be to look at Scikit-Learn for a ready-made algorithm. However, Scikit-Learn does not implement the Apriori algorithm. Fortunately, the very useful MLxtend library by Sebastian Raschka has an implementation of Apriori for extracting frequent itemsets for further analysis.
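To give a feel for what an Apriori-style search does, here is a minimal sketch in plain Python. It is not MLxtend's implementation; the function name `apriori_sketch` and the toy baskets are assumptions made for this example. The idea it illustrates is that an itemset can only be frequent if all of its subsets are frequent, so candidates are built level by level from the frequent itemsets of the previous size.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support_of(candidate):
        return sum(candidate <= t for t in transactions) / n

    # Level 1: frequent single items
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support_of(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 2
    while current:
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items
        candidates = {a | b for a, b in combinations(list(current), 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count support only for the surviving candidates
        current = {}
        for c in candidates:
            s = support_of(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical baskets, for illustration only
toy_baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
toy_frequent = apriori_sketch(toy_baskets, min_support=0.6)
for itemset, s in sorted(toy_frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With these baskets every single item and every pair reaches the 60% threshold, while the triple appears in only 2 of 5 baskets and is dropped.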

# Application

## Load the Preprocessed Data

In [1]:
# import necessary tools
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [2]:
# load the data
df = pd.read_csv('data.csv')
df.head(5)

| | order_id | product_name |
|---|---|---|
| 0 | 6 | Cleanse |
| 1 | 6 | Clean Day Lavender Scent Room Freshener Spray |
| 2 | 6 | Dryer Sheets Geranium Scent |
| 3 | 8 | Original Hawaiian Sweet Rolls |
| 4 | 14 | Naturals Chicken Nuggets |


## Get Data for Household Products

For the sake of keeping the dataset small, we are only looking at sales for household products in this project. First, we get the list of household products:

In [3]:
# load products.csv
products = pd.read_csv('products.csv')

# drop irrelevant columns
products.drop(['product_id', 'aisle_id'], axis=1, inplace=True)

# keep only household products, remove the rest
household = products[products['department_id'] == 17]
household.head(5)

| | product_name | department_id |
|---|---|---|
| 13 | Fresh Scent Dishwasher Cleaner | 17 |
| 47 | School Glue, Washable, No Run | 17 |
| 56 | Flat Toothpicks | 17 |
| 70 | Ultra 7 Inch Polypropylene Traditional Plates | 17 |
| 104 | Easy Grab 9\"x13\" Oblong Glass Bakeware | 17 |


In [4]:
# collect the household product names into a plain Python list
household_list = household['product_name'].tolist()

In [5]:
# keep only the rows whose product is a household product
df_household = df[df['product_name'].isin(household_list)]
df_household.head(5)

| | order_id | product_name |
|---|---|---|
| 1 | 6 | Clean Day Lavender Scent Room Freshener Spray |
| 2 | 6 | Dryer Sheets Geranium Scent |
| 24 | 46 | Ultra Soft Bathroom Tissue, Double Rolls |
| 29 | 46 | Casual Napkins |
| 31 | 46 | Plates, 10-1/16 Inch |


In [6]:
df_household.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1614 entries, 1 to 72892
Data columns (total 2 columns):
order_id        1614 non-null int64
product_name    1614 non-null object
dtypes: int64(1), object(1)
memory usage: 37.8+ KB


In [7]:
# create a basket dataframe from df_household, using custom aggregations
basket = df_household.groupby('order_id', as_index=False).agg({'product_name': lambda x: list(x)})
basket = basket.set_index('order_id')
basket.head()

| order_id | product_name |
|---|---|
| 6 | [Clean Day Lavender Scent Room Freshener Spray... |
| 46 | [Ultra Soft Bathroom Tissue, Double Rolls, Cas... |
| 64 | [Ultra Soft & Strong® Toilet Paper Double Rolls] |
| 85 | [L'elegance Toothpicks, Ultra Soft Facial Tiss... |
| 169 | [100% Recycled Paper Towels, 100% Recycled Bat... |


Association analysis requires that all the data for a transaction be included in one row and that the items be one-hot encoded. To one-hot encode the column `product_name`, we will use Scikit-Learn's preprocessing tool ```MultiLabelBinarizer```:

In [8]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
basket = basket.join(pd.DataFrame(mlb.fit_transform(basket.pop('product_name')),
                                  columns=mlb.classes_,
                                  index=basket.index))
basket.head()

| order_id | 1 Ply Paper Towels | 10 Inch Wheat Straw Plates | 100 % Recycled Paper Towels | 100% Natural Sponge Cloths | 100% Recycled 1 Ply White Napkins | 100% Recycled 2 Ply Jumbo Paper Towel Roll | 100% Recycled 2 Ply Paper Towels | 100% Recycled Bath Tissue Rolls | 100% Recycled Bathroom Tissue | 100% Recycled Facial Tissue | ... | White Rosemary Scent Natural Surface Cleaner | White Select A Size Paper Towels | White Select-A-Size Paper Towels | Windex Electronic Wipes | With Febreze Bounce Fabric Softener Dryer Sheet Spring & Renewal 105CT Fabric Enhancers | Wood for Good Almond Scented Polish | Wrinkle Releaser Light Fresh Scent | Wrinkle Releaser Plus Light Fresh Scent | XL Pick-A-Size Paper Towel Rolls | Zipper Sandwich Bags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 46 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 64 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 85 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 169 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
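To make the encoding step concrete, here is a sketch of the same idea in plain Python on a few made-up baskets. It only illustrates the shape of the result (one row per order, one 0/1 column per product); the basket contents and variable names are hypothetical.

```python
# Hypothetical baskets keyed by order id (not from the Instacart data)
baskets = {
    1: ["Paper Towels", "Aluminum Foil"],
    2: ["Dryer Sheets"],
    3: ["Aluminum Foil", "Dryer Sheets", "Paper Towels"],
}

# One column per distinct product, in sorted order
columns = sorted({item for items in baskets.values() for item in items})

# One row per order: 1 if the product is in the basket, 0 otherwise
encoded = {
    order_id: [int(col in items) for col in columns]
    for order_id, items in baskets.items()
}

print(columns)
for order_id, row in encoded.items():
    print(order_id, row)
```

Each order becomes a fixed-length row, which is the layout the frequent itemset search expects.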


Now that the data is structured properly, we can generate frequent item sets that have a support of at least 0.1% (this number was chosen so that I could get enough useful examples):

In [10]:
frequent_itemsets = apriori(basket, min_support=0.001, use_colnames=True)
frequent_itemsets.head()

| | support | itemsets |
|---|---|---|
| 0 | 0.005929 | (1 Ply Paper Towels) |
| 1 | 0.008893 | (100% Recycled 2 Ply Jumbo Paper Towel Roll) |
| 2 | 0.001976 | (100% Recycled 2 Ply Paper Towels) |
| 3 | 0.005929 | (100% Recycled Bath Tissue Rolls) |
| 4 | 0.018775 | (100% Recycled Bathroom Tissue) |


The final step is to generate the rules with their corresponding support, confidence and lift:

In [13]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head(10)

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (100% Recycled Bathroom Tissue) | (100% Recycled Paper Towels) | 0.018775 | 0.0583 | 0.001976 | 0.105263 | 1.805531 | 0.000882 | 1.052488 |
| 1 | (100% Recycled Paper Towels) | (100% Recycled Bathroom Tissue) | 0.0583 | 0.018775 | 0.001976 | 0.033898 | 1.805531 | 0.000882 | 1.015654 |
| 2 | (100% Recycled Bathroom Tissue) | (2-Ply Right Size 100% Recycled Paper Towels) | 0.018775 | 0.005929 | 0.001976 | 0.105263 | 17.754386 | 0.001865 | 1.111021 |
| 3 | (2-Ply Right Size 100% Recycled Paper Towels) | (100% Recycled Bathroom Tissue) | 0.005929 | 0.018775 | 0.001976 | 0.333333 | 17.754386 | 0.001865 | 1.471838 |
| 4 | (Sustainably Soft Bath Tissue) | (100% Recycled Paper Towels) | 0.040514 | 0.0583 | 0.004941 | 0.121951 | 2.091773 | 0.002579 | 1.072491 |
| 5 | (100% Recycled Paper Towels) | (Sustainably Soft Bath Tissue) | 0.0583 | 0.040514 | 0.004941 | 0.084746 | 2.091773 | 0.002579 | 1.048327 |
| 6 | (Tall Kitchen Drawstring Bags) | (100% Recycled Paper Towels) | 0.006917 | 0.0583 | 0.001976 | 0.285714 | 4.900726 | 0.001573 | 1.318379 |
| 7 | (100% Recycled Paper Towels) | (Tall Kitchen Drawstring Bags) | 0.0583 | 0.006917 | 0.001976 | 0.033898 | 4.900726 | 0.001573 | 1.027928 |
| 8 | (Paper Towels) | (Aluminum Foil) | 0.008893 | 0.035573 | 0.001976 | 0.222222 | 6.246914 | 0.00166 | 1.239977 |
| 9 | (Aluminum Foil) | (Paper Towels) | 0.035573 | 0.008893 | 0.001976 | 0.055556 | 6.246914 | 0.00166 | 1.049407 |
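As a sanity check on how the columns relate to each other, the sketch below recomputes confidence, lift, leverage, and conviction for rule 3 from its three support values. The inputs are copied from the table and are rounded to six decimals, so the results only match the table approximately.

```python
# Values copied from rule 3 in the table above (rounded to six decimals)
antecedent_support = 0.005929  # (2-Ply Right Size 100% Recycled Paper Towels)
consequent_support = 0.018775  # (100% Recycled Bathroom Tissue)
rule_support = 0.001976        # both itemsets in the same order

# confidence: share of antecedent orders that also contain the consequent
confidence = rule_support / antecedent_support

# lift: confidence relative to the consequent's overall popularity
lift = confidence / consequent_support

# leverage: observed joint support minus the support expected under independence
leverage = rule_support - antecedent_support * consequent_support

# conviction: (1 - consequent support) / (1 - confidence)
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 3), round(lift, 1), round(leverage, 6), round(conviction, 2))
```

The small differences from the table (for example in the later decimals of the lift) come from using rounded inputs.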


That’s all there is to it! Build the frequent itemsets using ```apriori```, then build the rules with ```association_rules```.

# What does this tell us?

Now, the tricky part is figuring out what this tells us. For instance, quite a few rules have a high lift, which means the two itemsets occur together more often than would be expected if they were bought independently, given how often each appears on its own. Several rules also have comparatively high confidence. This part of the analysis is where domain knowledge comes in handy.
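One simple way to narrow the rules down is to keep only those that clear both a lift and a confidence threshold. The sketch below does this in plain Python on a few rows copied from the rules table; the thresholds `MIN_LIFT` and `MIN_CONFIDENCE` are arbitrary values chosen for illustration, not recommendations.

```python
# A few rows copied from the rules table above:
# (antecedent, consequent, confidence, lift)
rules_sample = [
    ("100% Recycled Bathroom Tissue", "100% Recycled Paper Towels", 0.105263, 1.805531),
    ("2-Ply Right Size 100% Recycled Paper Towels", "100% Recycled Bathroom Tissue", 0.333333, 17.754386),
    ("Tall Kitchen Drawstring Bags", "100% Recycled Paper Towels", 0.285714, 4.900726),
    ("Paper Towels", "Aluminum Foil", 0.222222, 6.246914),
    ("Aluminum Foil", "Paper Towels", 0.055556, 6.246914),
]

# Hypothetical thresholds, chosen only for illustration
MIN_LIFT = 6
MIN_CONFIDENCE = 0.2

strong_rules = [
    rule for rule in rules_sample
    if rule[3] >= MIN_LIFT and rule[2] >= MIN_CONFIDENCE
]

# Show the strongest associations first
for antecedent, consequent, conf, lift_value in sorted(
    strong_rules, key=lambda r: r[3], reverse=True
):
    print(f"{antecedent} -> {consequent}: confidence={conf:.3f}, lift={lift_value:.2f}")
```

On the full `rules` DataFrame, the same filter could be expressed as a boolean mask on the `lift` and `confidence` columns.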

Though I intend to build a project focused on interpreting such rules sometime in the future, for now I just want to practice skills such as writing SQL queries, using Python libraries, and applying association rules to a market basket dataset. Domain knowledge is one of the most important assets for a data scientist, and hopefully I will have opportunities to build it in the future.