<a href="https://colab.research.google.com/github/coryroyce/code_assignments/blob/main/211201_Market_Basket_Analysis_Apriori/211201_Market_Basket_Item_Apriori_Algorithm_Cory_Randolph.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Market Basket Item Apriori Algorithm

CMPE 256

Cory Randolph

12/01/2021



# Prompt

Learning Objective: Apply Apriori algorithm to generate association rules and predict the next basket item.

Dataset: Excel Dataset contains Order ID, User ID, Product Item name.

Consider Order ID as Transaction ID and group items by order id. 

Generate Association rules MIN_SUP: 0.0045

Train Dataset: TRAIN-ARULES.csv

Test Dataset: testarules.csv


# Summary of Analysis

...


# Imports

Install needed packages

In [130]:
!pip install apyori

# Clear output for this cell
from IPython.display import clear_output
clear_output()

Import other needed packages

In [131]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from apyori import apriori

# Data

## Load Data

Upload the TRAIN-ARULES.csv and testarules.csv files from the local machine into Colab.

In [132]:
from google.colab import files
files.upload()

{}

Note: The files can also be dragged and dropped into the Files tab in Colab's left-hand sidebar.

Load the data into Dataframes

In [133]:
df = pd.read_csv('TRAIN-ARULES.csv')
df_test = pd.read_csv('testarules.csv')

df.head()

Unnamed: 0,order_id,user_id,product_name
0,1483,90,Organic Pink Lemonade Bunny Fruit Snacks
1,1483,90,Dark Chocolate Minis
2,1483,90,"Sparkling Water, Natural Mango Essenced"
3,1483,90,Peach-Pear Sparkling Water
4,1483,90,Organic Heritage Flakes Cereal


## Describe Data

Quick data overview. Note that we convert the dataframe to strings because the Order ID and User ID are categorical, not numerical, so the mean and other numeric statistics don't apply to our data.

In [134]:
df.astype(str).describe(include = 'all')

Unnamed: 0,order_id,user_id,product_name
count,12963,12963,12963
unique,1418,100,3541
top,68288,27,Bag of Organic Bananas
freq,46,768,188


## Clean Data

Group the data by order ID so that we can start to get the data in a format that works well for the apriori library/package. Note: Uncomment the print statements to see the logical progression of the data transformation.

In [135]:
def transform_data(df):
  # Make a copy of the input data frame
  df_temp = df.copy()
  # print(df_temp.head())

  # Group the data by the order id (make a list of product item sets for each order)
  df_grouped = df_temp.groupby(by = ['order_id'])['product_name'].apply(list).reset_index(name='product_item_set')
  # print(df_grouped.head())

  # Unpack the list of product items into their own columns
  df_grouped = df_grouped['product_item_set'].apply(pd.Series)
  # print(df_grouped.head())

  # Replace the NaN values with 0's
  df_grouped.fillna(0,inplace=True)

  # Convert the grouped dataframe into a list of lists to work with the apriori package
  data = df_grouped.astype(str).values.tolist()

  # Remove the '0' placeholders from each "row"
  data = [[ele for ele in sub if ele != '0'] for sub in data]

  return data

In [136]:
data = transform_data(df)

# Display the first few rows of data for reference
print(data[0:2])
print(f'Number of item transactions: {len(data)}')


[['Organic Pink Lemonade Bunny Fruit Snacks', 'Dark Chocolate Minis', 'Sparkling Water, Natural Mango Essenced', 'Peach-Pear Sparkling Water', 'Organic Heritage Flakes Cereal', 'Popped Salted Caramel Granola Bars', 'Healthy Grains Granola Bar, Vanilla Blueberry', 'Flax Plus Organic Pumpkin Flax Granola', 'Sweet & Salty Nut Almond Granola Bars', 'Cool Mint Chocolate Energy Bar', 'Chocolate Chip Energy Bars', 'Trail Mix Fruit & Nut Chewy Granola Bars'], ['Creme De Menthe Thins', 'Milk Chocolate English Toffee Miniatures Candy Bars', "Baker's Pure Cane Ultrafine Sugar", 'Plain Bagels', 'Cinnamon Bread']]
Number of item transactions: 1418
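
The same list of lists can probably be built without the intermediate wide dataframe and the `0` placeholders. The following is a minimal sketch, assuming the input has the `order_id` and `product_name` columns shown above. `transform_data_grouped` is a hypothetical name, and the small dataframe is made-up sample data, not the assignment data.

```python
import pandas as pd

def transform_data_grouped(df):
    """Return one list of product names per order (hypothetical helper)."""
    # groupby sorts by order_id; each group's products become a plain list
    grouped = df.groupby('order_id')['product_name'].apply(list)
    return grouped.tolist()

# Made-up sample rows for illustration only
df_sample = pd.DataFrame({
    'order_id': [2, 1, 1, 2],
    'product_name': ['Plain Bagels', 'Dark Chocolate Minis',
                     'Peach-Pear Sparkling Water', 'Cinnamon Bread'],
})

baskets = transform_data_grouped(df_sample)
print(baskets)
```

Because no placeholder value is introduced, nothing has to be filtered out afterwards, and a product that happened to be named `0` would not be dropped.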


# Apriori Algorithm

Apply the apriori library to the data in order to generate the association rules.

From the assignment prompt we need to pass in the additional parameter min_support = 0.0045.

In [137]:
%%time

# Note: min_length is left commented out here, so single-item sets are included in the results
association_rules = apriori(transactions = data, min_support=0.0045)#, min_length=3) 
association_results = list(association_rules)
df_results = pd.DataFrame(association_results)

CPU times: user 26.7 s, sys: 87.8 ms, total: 26.8 s
Wall time: 26.9 s
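
To put the 0.0045 threshold in concrete terms, the short calculation below converts it into a minimum number of orders. It only uses the 1418 transactions reported above and the threshold from the prompt; the variable names are illustrative.

```python
import math

n_transactions = 1418   # number of orders reported above
min_support = 0.0045    # threshold from the assignment prompt

# Smallest whole number of orders whose share of all orders meets the threshold
min_count = math.ceil(min_support * n_transactions)

print(min_count)                    # 7
print(min_count / n_transactions)   # about 0.004937
```

An item set therefore has to appear in at least 7 of the 1418 orders to be kept, which is consistent with the support of 0.004937 listed for "100% Juice, Variety Pack" in the results.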


See how many item sets (relation records) were found in total

In [138]:
print(len(association_results))

1492


Review the first result

In [139]:
print(association_results[0])

RelationRecord(items=frozenset({'0% Greek Strained Yogurt'}), support=0.009873060648801129, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'0% Greek Strained Yogurt'}), confidence=0.009873060648801129, lift=1.0)])


View first few results

In [140]:
df_results.head()

Unnamed: 0,items,support,ordered_statistics
0,(0% Greek Strained Yogurt),0.009873,"[((), (0% Greek Strained Yogurt), 0.0098730606..."
1,"(100% Juice, Variety Pack)",0.004937,"[((), (100% Juice, Variety Pack), 0.0049365303..."
2,(100% Premium Select Not From Concentrate Pure...,0.009168,"[((), (100% Premium Select Not From Concentrat..."
3,(100% Recycled Paper Towels),0.005642,"[((), (100% Recycled Paper Towels), 0.00564174..."
4,(1500 Pale Ale),0.01481,"[((), (1500 Pale Ale), 0.014809590973201692, 1..."


What are the sizes of the item sets?

In [141]:
unique_item_set_lengths = df_results['items'].apply(len).unique()
unique_item_set_lengths

array([1, 2, 3, 4, 5, 6])
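
Each record nests its rules inside `ordered_statistics`, and a record whose `items_base` is empty (like the first result above) only restates the item set's own support, with confidence equal to support and lift equal to 1.0. The sketch below shows one possible way to flatten records into one row per rule while skipping those empty-base entries. `flatten_rules` is a hypothetical helper, and the two namedtuples are stand-ins that copy the field names printed above so the example runs without apyori; the second sample record reuses values from the Additional Example output further down.

```python
from collections import namedtuple

# Stand-ins mirroring the printed field names; with apyori installed,
# the real records from list(apriori(...)) would be passed instead.
RelationRecord = namedtuple(
    'RelationRecord', ['items', 'support', 'ordered_statistics'])
OrderedStatistic = namedtuple(
    'OrderedStatistic', ['items_base', 'items_add', 'confidence', 'lift'])

def flatten_rules(records):
    """Return one dict per rule, skipping rules with an empty antecedent."""
    rows = []
    for record in records:
        for stat in record.ordered_statistics:
            if not stat.items_base:
                continue  # empty base: just the item set's own support
            rows.append({
                'antecedent': sorted(stat.items_base),
                'consequent': sorted(stat.items_add),
                'support': record.support,
                'confidence': stat.confidence,
                'lift': stat.lift,
            })
    return rows

sample_records = [
    RelationRecord(
        items=frozenset({'0% Greek Strained Yogurt'}),
        support=0.009873060648801129,
        ordered_statistics=[OrderedStatistic(
            frozenset(), frozenset({'0% Greek Strained Yogurt'}),
            0.009873060648801129, 1.0)],
    ),
    RelationRecord(
        items=frozenset({'Milk', 'Clothes'}),
        support=3 / 7,
        ordered_statistics=[OrderedStatistic(
            frozenset({'Clothes'}), frozenset({'Milk'}), 1.0, 1.75)],
    ),
]

rules = flatten_rules(sample_records)
print(rules)
```

The resulting list of dicts can be passed to `pd.DataFrame` to sort or filter the rules by confidence or lift.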

# Additional Example

Apply the Apriori algorithm to the simple dataset that was calculated by hand.

In [142]:
data_simple = [
               ['Noodles', 'Pickles', 'Milk'],
               ['Noodles', 'Cheese'],
               ['Cheese', 'Shoes'],
               ['Noodles', 'Pickles', 'Cheese'],
               ['Noodles', 'Pickles', 'Clothes', 'Cheese', 'Milk'],
               ['Pickles', 'Clothes', 'Milk'],
               ['Pickles', 'Clothes', 'Milk'],
]

Apply the algorithm to the simple data set.

In [143]:
%%time

# Note: min_length=3 is passed here, but the results below still contain 2-item sets,
# so apyori appears to ignore it; single items are excluded by min_confidence instead
association_rules_simple = apriori(transactions = data_simple, min_support=0.30, min_confidence = 0.80, min_length=3) 
association_results_simple = list(association_rules_simple)
df_results_simple = pd.DataFrame(association_results_simple)

CPU times: user 1.93 ms, sys: 0 ns, total: 1.93 ms
Wall time: 2.06 ms


View all rules that meet the criteria

In [144]:
df_results_simple

Unnamed: 0,items,support,ordered_statistics
0,"(Milk, Clothes)",0.428571,"[((Clothes), (Milk), 1.0, 1.75)]"
1,"(Clothes, Pickles)",0.428571,"[((Clothes), (Pickles), 1.0, 1.4)]"
2,"(Milk, Pickles)",0.571429,"[((Milk), (Pickles), 1.0, 1.4)]"
3,"(Milk, Clothes, Pickles)",0.428571,"[((Clothes), (Milk, Pickles), 1.0, 1.75), ((Mi..."


View the details of the triple item set.

In [145]:
df_results_simple.iloc[3,2]

[OrderedStatistic(items_base=frozenset({'Clothes'}), items_add=frozenset({'Milk', 'Pickles'}), confidence=1.0, lift=1.75),
 OrderedStatistic(items_base=frozenset({'Milk', 'Clothes'}), items_add=frozenset({'Pickles'}), confidence=1.0, lift=1.4),
 OrderedStatistic(items_base=frozenset({'Clothes', 'Pickles'}), items_add=frozenset({'Milk'}), confidence=1.0, lift=1.75)]
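
The reported numbers can be checked against the usual definitions: support is the fraction of transactions containing an item set, confidence is the support of the full set divided by the support of the base, and lift is the confidence divided by the support of the added items. The sketch below recomputes a few of them directly from `data_simple`; the helper names `support`, `confidence`, and `lift` are illustrative and not part of apyori.

```python
data_simple = [
    ['Noodles', 'Pickles', 'Milk'],
    ['Noodles', 'Cheese'],
    ['Cheese', 'Shoes'],
    ['Noodles', 'Pickles', 'Cheese'],
    ['Noodles', 'Pickles', 'Clothes', 'Cheese', 'Milk'],
    ['Pickles', 'Clothes', 'Milk'],
    ['Pickles', 'Clothes', 'Milk'],
]

def support(items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in data_simple) / len(data_simple)

def confidence(base, add):
    """support(base and add together) / support(base)."""
    return support(set(base) | set(add)) / support(base)

def lift(base, add):
    """confidence(base -> add) / support(add)."""
    return confidence(base, add) / support(add)

print(round(support(['Clothes', 'Milk']), 6))             # 0.428571
print(confidence(['Clothes'], ['Milk']))                  # 1.0
print(round(lift(['Clothes'], ['Milk']), 2))              # 1.75
print(round(lift(['Milk'], ['Pickles']), 2))              # 1.4
print(round(lift(['Clothes'], ['Milk', 'Pickles']), 2))   # 1.75
```

These values agree with the support, confidence, and lift columns in the tables above.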

# Reference

Example of Apriori with code snippets: [Stack Abuse article](https://stackabuse.com/association-rule-mining-via-apriori-algorithm-in-python/)

Example Apriori article: [Towards Data Science article](https://towardsdatascience.com/market-basket-analysis-using-associative-data-mining-and-apriori-algorithm-bddd07c6a71a)

Apriori library: [apyori](https://github.com/ymoch/apyori)