Association rule mining

Revisit the notes on association rule mining and the R example on music playlists: playlists.R and playlists.csv. Then use the data on grocery purchases in groceries.txt and find some interesting association rules for these shopping baskets. The data file is a list of shopping baskets: one person's basket for each row, with multiple items per row separated by commas. Pick your own thresholds for lift and confidence; just be clear what these thresholds are and say why you picked them. Do your discovered item sets make sense? Present your discoveries in an interesting and visually appealing way.

Notes:

This is an exercise in visual and numerical story-telling. Do be clear in your description of what you've done, but keep the focus on the data, the figures, and the insights your analysis has drawn from the data, rather than technical details. The data file is a list of baskets: one row per basket, with multiple items per row separated by commas. You'll have to cobble together your own code for processing this into the format expected by the "arules" package. This is not intrinsically all that hard, but it is the kind of data-wrangling wrinkle you'll encounter frequently on real problems, where your software package expects data in one format and the data comes in a different format. Figuring out how to bridge that gap is part of the assignment, and so we won't be giving tips on this front.

In [1]:
# efficient_apriori library used for association rule mining
from efficient_apriori import apriori
import pandas as pd
import matplotlib.pyplot as plt

In [4]:
# Step 1: Read in groceries.txt and perform data preprocessing
# Each line is one basket; items are separated by commas. Blank lines are skipped.
with open("groceries.txt", "r") as f:
    groceries = [line.strip().split(",") for line in f if line.strip()]

# Display the list of transactions (Checks to see if the data was read in correctly)
#for cart in groceries:
#    print(cart)

# Step 2: Perform association rule mining
# Find frequent itemsets and rules using the Apriori algorithm
# Thresholds: min_support = 0.03 and min_confidence = 0.4; the writeup below is based on these values
# These were initial values chosen to check that the code worked; other thresholds may be worth exploring
itemsets, rules = apriori(groceries, min_support = 0.03, min_confidence = 0.4)

# Step 3: Display results
# Key output: see the Analysis section below for what conf, supp, and lift mean
print("\nAssociation Rules:")
for rule in rules:
    print(rule)


Association Rules:
{root vegetables} -> {other vegetables} (conf: 0.435, supp: 0.047, lift: 2.247, conv: 1.427)
{root vegetables} -> {whole milk} (conf: 0.449, supp: 0.049, lift: 1.756, conv: 1.350)
{tropical fruit} -> {whole milk} (conf: 0.403, supp: 0.042, lift: 1.578, conv: 1.247)
{whipped/sour cream} -> {whole milk} (conf: 0.450, supp: 0.032, lift: 1.760, conv: 1.353)
{yogurt} -> {whole milk} (conf: 0.402, supp: 0.056, lift: 1.572, conv: 1.244)


Analysis

The association rules output shows several relationships of note. For example, root vegetables have a confidence of 0.435 with other vegetables and 0.449 with whole milk. This means that, among carts containing root vegetables, about 43.5% also contain other vegetables. Similarly, among carts containing root vegetables, 44.9% also contain whole milk. The support is the share of all carts in the dataset that contain both the antecedent and the consequent. In this instance, root vegetables and other vegetables appear together in 4.7% of carts, and root vegetables and whole milk appear together in 4.9% of carts. The lift is the ratio of the observed support to the support we would expect if the antecedent and consequent were independent. A lift greater than 1 suggests a positive association between antecedent and consequent. For example, the lift of 2.247 for the rule "{root vegetables} -> {other vegetables}" indicates that carts containing root vegetables are about 2.2 times as likely to contain other vegetables as a cart chosen at random.
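
To make these three definitions concrete, here is a minimal sketch that computes support, confidence, and lift by hand. The eight baskets are invented for illustration and are not taken from groceries.txt, and the helper names `support` and `rule_metrics` are hypothetical, not part of efficient_apriori.

```python
# Toy baskets, invented for illustration only (NOT from groceries.txt).
baskets = [
    {"root vegetables", "other vegetables", "whole milk"},
    {"root vegetables", "other vegetables"},
    {"root vegetables", "whole milk"},
    {"other vegetables"},
    {"whole milk", "yogurt"},
    {"yogurt"},
    {"whole milk"},
    {"root vegetables"},
]


def support(items, baskets):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def rule_metrics(antecedent, consequent, baskets):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    supp_both = support(set(antecedent) | set(consequent), baskets)
    supp_antecedent = support(antecedent, baskets)
    supp_consequent = support(consequent, baskets)
    # Confidence: share of antecedent baskets that also hold the consequent.
    confidence = supp_both / supp_antecedent
    # Lift: confidence relative to the consequent's overall share of baskets.
    lift = confidence / supp_consequent
    return {"support": supp_both, "confidence": confidence, "lift": lift}


metrics = rule_metrics({"root vegetables"}, {"other vegetables"}, baskets)
print(metrics)
```

In this toy data, root vegetables appear in 4 of 8 baskets and other vegetables in 3 of 8, with 2 baskets containing both, so the rule has support 0.25, confidence 0.5, and lift of about 1.33.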

Overall, there are noteworthy associations between root vegetables and both whole milk and other vegetables. Shoppers who buy tropical fruit, whipped/sour cream, or yogurt also tend to buy whole milk. This suggests that whole milk is a very common item in shopping carts, notably in the carts of health-conscious shoppers who buy items such as root vegetables, tropical fruit, whipped/sour cream, and yogurt.
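
One way to check how common whole milk is overall uses only the printed rules. Since lift = confidence / support(consequent), dividing each rule's confidence by its lift recovers the share of all carts that contain whole milk. The sketch below does this with the rounded values copied from the output above, so the results are approximate; the variable names are illustrative.

```python
# (antecedent, confidence, lift) for the four whole-milk rules,
# copied from the printed output above (values are rounded to 3 decimals).
whole_milk_rules = [
    ("root vegetables", 0.449, 1.756),
    ("tropical fruit", 0.403, 1.578),
    ("whipped/sour cream", 0.450, 1.760),
    ("yogurt", 0.402, 1.572),
]

# lift = confidence / support(consequent)
# => support(consequent) = confidence / lift
implied_share = {
    antecedent: confidence / lift
    for antecedent, confidence, lift in whole_milk_rules
}

for antecedent, share in implied_share.items():
    print(f"{antecedent:>20}: implied whole milk share = {share:.3f}")
```

All four rules imply that roughly a quarter of carts contain whole milk, so confidences of 0.40 to 0.45 are a real increase over that baseline rather than a by-product of whole milk's popularity alone.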