# Association Rules Mining Using Python Generators to Handle Large Datasets

### Motivation
I was looking to run association analysis in Python using the apriori algorithm to derive rules of the form {A} -> {B}.  However, I quickly discovered that it's not part of the standard Python machine learning libraries.  Although there are some implementations that exist, I could not find one capable of handling large datasets.  "Large" in my case was an orders dataset with 32 million records, containing 3.2 million unique orders and about 50K unique items (file size just over 1 GB).  So, I decided to write my own implementation, leveraging the apriori algorithm to generate simple {A} -> {B} association rules. Since I only care about understanding relationships between any given pair of items, using apriori to get to item sets of size 2 is sufficient.  I went through various iterations, splitting the data into multiple subsets just so I could get functions like crosstab and combinations to run on my machine with 8 GB of memory.  :)  But even with this approach, I could only process about 1800 items before my kernel would crash...  And that's when I learned about the wonderful world of Python generators.



### Python Generators

In a nutshell, a generator is a special type of function that returns an iterator over a sequence of items.  However, unlike a regular function, which returns all of its values at once (eg: all the elements of a list), a generator <i>yields</i> one value at a time.  To get the next value, we must ask for it, either explicitly by calling the built-in <i>next()</i> function on the generator, or implicitly via a for loop.  This is a great property of generators because it means that we don't have to store all of the values in memory at once.  We can load and process one value at a time, discard it when finished and move on to the next value.  This feature makes generators perfect for creating item pairs and counting their frequency of co-occurrence.  Here's a concrete example of what we're trying to accomplish:  

1. Get all possible item pairs for a given order 
       eg:  order 1:  apple, egg, milk   -->  item pairs: {apple, egg}, {apple, milk}, {egg, milk}
            order 2:  egg, milk          -->  item pairs: {egg, milk}
            
2. Count the number of times each item pair appears
       eg: {apple, egg}: 1
           {apple, milk}: 1
           {egg, milk}: 2

Here's the generator that implements the above tasks:

In [1]:
import numpy as np
from itertools import combinations, groupby
from collections import Counter

# Sample data
orders = np.array([[1,'apple'], [1,'egg'], [1,'milk'], [2,'egg'], [2,'milk']], dtype=object)

# Generator that yields item pairs, one at a time
def get_item_pairs(order_item):
    
    # For each order, generate a list of items in that order
    for order_id, order_object in groupby(order_item, lambda x: x[0]):
        item_list = [item[1] for item in order_object]      
    
        # For each item list, generate item pairs, one at a time
        for item_pair in combinations(item_list, 2):
            yield item_pair                                      


# Counter iterates through the item pairs returned by our generator and keeps a tally of their occurrence
Counter(get_item_pairs(orders))


Counter({('apple', 'egg'): 1, ('apple', 'milk'): 1, ('egg', 'milk'): 2})

<i>get_item_pairs()</i> generates a list of items for each order and produces item pairs for that order, one pair at a time.  The first item pair is passed to Counter which keeps track of the number of times an item pair occurs.  The next item pair is taken, and again, passed to Counter.  This process continues until there are no more item pairs left.  With this approach, we end up not using much memory as item pairs are discarded after the count is updated.
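
To make the one-at-a-time behaviour visible, here is a small sketch that pulls pairs with the built-in `next()` and then compares the memory footprint of a generator object with that of the equivalent list.  The helper name `pairs_for` is hypothetical and is not part of the code above.

```python
import sys
from itertools import combinations

def pairs_for(items):
    """Yield every 2-item combination of `items`, one pair at a time."""
    for pair in combinations(items, 2):
        yield pair

gen = pairs_for(['apple', 'egg', 'milk'])
first = next(gen)     # ('apple', 'egg')
second = next(gen)    # ('apple', 'milk')
rest = list(gen)      # whatever is left: [('egg', 'milk')]

# The generator object stays small no matter how many pairs it will produce,
# whereas the list has to hold a reference to every pair at once.
lazy_pairs = pairs_for(range(1000))
eager_pairs = list(pairs_for(range(1000)))   # 499,500 pairs
print(sys.getsizeof(lazy_pairs), sys.getsizeof(eager_pairs))
```

The exact byte counts depend on the Python build, so none are quoted here; the point is only that the first number does not grow with the input.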

### Apriori Algorithm 
Apriori is an algorithm used to identify frequent item sets (in our case, item pairs).  It does so using a "bottom up" approach, first identifying individual items that satisfy a minimum occurrence threshold. It then extends the item set, adding one item at a time and checking if the resulting item set still satisfies the specified threshold.  The algorithm stops when there are no more items to add that meet the minimum occurrence requirement.  Here's an example of apriori in action, assuming a minimum occurrence threshold of 3:


    order 1: apple, egg, milk  
    order 2: carrot, milk  
    order 3: apple, egg, carrot
    order 4: apple, egg
    order 5: apple, carrot

    
    Iteration 1:  Count the number of times each item occurs   
    item set      occurrence count    
    {apple}              4   
    {egg}                3   
    {milk}               2   
    {carrot}             3   

    {milk} is eliminated because it does not meet the minimum occurrence threshold.


    Iteration 2: Build item sets of size 2 using the remaining items from Iteration 1 
                 (ie: apple, egg, carrot)  
    item set           occurrence count  
    {apple, egg}             3  
    {apple, carrot}          2  
    {egg, carrot}            1  

    {apple, carrot} and {egg, carrot} are eliminated.  Only {apple, egg} remains and the algorithm stops, 
    since no item set of size 3 can be built from a single frequent pair.
   
   
If we had more orders and items, we could continue to iterate, building item sets consisting of more than 2 elements.  For the problem we are trying to solve (ie: finding relationships between pairs of items), it suffices to implement apriori to get to item sets of size 2.
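
The two iterations can be sketched in a few lines of plain Python.  This is a minimal illustration on the five example orders, not the implementation used later in this notebook; the names `frequent_pairs` and `MIN_COUNT` are made up for the example.

```python
from collections import Counter
from itertools import combinations

orders = [
    ['apple', 'egg', 'milk'],
    ['carrot', 'milk'],
    ['apple', 'egg', 'carrot'],
    ['apple', 'egg'],
    ['apple', 'carrot'],
]
MIN_COUNT = 3   # minimum occurrence threshold from the example

def frequent_pairs(orders, min_count):
    # Iteration 1: count single items and keep those meeting the threshold
    item_counts = Counter(item for order in orders for item in set(order))
    frequent_items = {item for item, n in item_counts.items() if n >= min_count}

    # Iteration 2: count pairs built only from the surviving items
    pair_counts = Counter()
    for order in orders:
        kept = sorted(set(order) & frequent_items)   # sort so each pair has one canonical key
        pair_counts.update(combinations(kept, 2))

    return {pair: n for pair, n in pair_counts.items() if n >= min_count}

print(frequent_pairs(orders, MIN_COUNT))   # {('apple', 'egg'): 3}
```

Dropping infrequent items before building pairs is what keeps the number of candidate pairs manageable on a large dataset.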

### Association Rules Mining
Once the item sets have been generated using apriori, we can start mining association rules.  Given that we are only looking at item sets of size 2, the association rules we will generate will be of the form {A} -> {B}.  One common application of these rules is in the domain of recommender systems, where customers who purchased item A are recommended item B.

Here are 3 key metrics to consider when evaluating association rules:

1. <b>support</b>  
    This is the percentage of orders that contain the item set. In the example above, there are 5 orders in total 
    and {apple,egg} occurs in 3 of them, so: 
       
                    support{apple,egg} = 3/5 or 60%
        
    The minimum support threshold required by apriori can be set based on knowledge of your domain.  In this 
    grocery dataset for example, since there could be thousands of distinct items and an order can contain 
    only a small fraction of these items, setting the support threshold to 0.01% may be reasonable.<br><br><br>
    
2. <b>confidence</b>  
    Given two items, A and B, confidence measures the percentage of times that item B is purchased, given that 
    item A was purchased. This is expressed as:
       
                    confidence{A->B} = support{A,B} / support{A}   
                    
    Confidence values range from 0 to 1, where 0 indicates that B is never purchased when A is purchased, and 1 
    indicates that B is always purchased whenever A is purchased.  Note that the confidence measure is directional.  
    This means that we can also compute the percentage of times that item A is purchased, given that item B was 
    purchased:
       
                    confidence{B->A} = support{A,B} / support{B}    
                    
    In our example, the percentage of times that egg is purchased, given that apple was purchased is:  
       
                    confidence{apple->egg} = support{apple,egg} / support{apple}
                                           = (3/5) / (4/5)
                                           = 0.75 or 75%

    A confidence value of 0.75 implies that out of all orders that contain apple, 75% of them also contain egg.  Now, 
    we look at the confidence measure in the opposite direction (ie: egg->apple): 
       
                    confidence{egg->apple} = support{apple,egg} / support{egg}
                                           = (3/5) / (3/5)
                                           = 1 or 100%  
                                           
    Here we see that all of the orders that contain egg also contain apple.  But, does this mean that there is a 
    relationship between these two items, or are they occurring together in the same orders simply by chance?  To 
    answer this question, we look at another measure which takes into account the popularity of <i>both</i> items.<br><br><br>  
    
3. <b>lift</b>  
    Given two items, A and B, lift indicates whether there is a relationship between A and B, or whether the two items 
    are occurring together in the same orders simply by chance (ie: at random).  Unlike the confidence metric whose 
    value may vary depending on direction (eg: confidence{A->B} may be different from confidence{B->A}), 
    lift has no direction. This means that the lift{A,B} is always equal to the lift{B,A}: 
       
                    lift{A,B} = lift{B,A} = support{A,B} / (support{A} * support{B})   
    
    In our example, we compute lift as follows:
    
         lift{apple,egg} = lift{egg,apple} = support{apple,egg} / (support{apple} * support{egg})
                         = (3/5) / (4/5 * 3/5) 
                         = 1.25    
               
    One way to understand lift is to think of the denominator as the likelihood that A and B will appear in the same 
    order if there was <i>no</i> relationship between them. In the example above, if apple occurred in 80% of the
    orders and egg occurred in 60% of the orders, then if there was no relationship between them, we would 
    <i>expect</i> both of them to show up together in the same order 48% of the time (ie: 80% * 60%).  The numerator, 
    on the other hand, represents how often apple and egg <i>actually</i> appear together in the same order.  In 
    this example, that is 60% of the time.  Taking the numerator and dividing it by the denominator, we get how 
    many times more often apple and egg actually appear in the same order than they would if there was no 
    relationship between them (ie: if they were occurring together simply at random).  
    
    In summary, lift can take on the following values:
    
        * lift = 1 implies no relationship between A and B. 
          (ie: A and B occur together only by chance)
      
        * lift > 1 implies that there is a positive relationship between A and B.
          (ie:  A and B occur together more often than random)
    
        * lift < 1 implies that there is a negative relationship between A and B.
          (ie:  A and B occur together less often than random)
        
    In our example, apple and egg occur together 1.25 times as often as we would expect by chance, so we conclude 
    that there exists a positive relationship between them.
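
As a quick check of the arithmetic, the three metrics can be computed directly from raw counts.  The function name `pair_metrics` and its parameter names are hypothetical; the numbers are the apple/egg counts from the example.

```python
def pair_metrics(n_orders, count_a, count_b, count_ab):
    """Support, confidence (both directions) and lift for an item pair {A, B}."""
    support_a = count_a / n_orders      # fraction of orders containing A
    support_b = count_b / n_orders      # fraction of orders containing B
    support_ab = count_ab / n_orders    # fraction of orders containing both
    return {
        'support': support_ab,
        'confidence_a_to_b': support_ab / support_a,
        'confidence_b_to_a': support_ab / support_b,
        'lift': support_ab / (support_a * support_b),
    }

# apple (A) is in 4 of 5 orders, egg (B) in 3, and both together in 3
metrics = pair_metrics(n_orders=5, count_a=4, count_b=3, count_ab=3)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Note that the supports here are fractions between 0 and 1.  Confidence is unaffected if they are expressed as percentages instead, because the factor of 100 cancels, but the lift ratio then comes out 100 times smaller.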
   
Armed with knowledge of apriori and association rules mining, let's dive into the data and code to see what relationships we unravel!

In [1]:
import pandas as pd
import numpy as np
import sys
from itertools import combinations, groupby
from collections import Counter
from IPython.display import display

In [2]:
# Function that returns the size of an object in MB
def size(obj):
    return "{0:.2f} MB".format(sys.getsizeof(obj) / (1000 * 1000))

### Part 1:  Data Preparation

#### A. Load order data

In [3]:
# import data
products = pd.read_csv('products.csv')
aisles = pd.read_csv('aisles.csv')
departments = pd.read_csv('departments.csv')
clusters = pd.read_csv('customer_group_Kmeans.csv')
order_details = pd.read_csv('orders.csv')
order_products_prior = pd.read_csv('order_products__prior.csv')

In [4]:
# combine datasets
user_merged = pd.merge(order_details, clusters, on='user_id')

product_aisle = pd.merge(products, aisles, on='aisle_id')
product_department = pd.merge(product_aisle, departments, on='department_id')

order_products = pd.merge(product_department, order_products_prior, on='product_id')

orders_all= pd.merge(user_merged, order_products, on='order_id')
orders_all.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,customer type,product_id,product_name,aisle_id,department_id,aisle,department,add_to_cart_order,reordered
0,2539329,1,prior,1,2,8,,0,12427,Original Beef Jerky,23,19,popcorn jerky,snacks,3,0
1,2539329,1,prior,1,2,8,,0,26088,Aged White Cheddar Popcorn,23,19,popcorn jerky,snacks,4,0
2,2539329,1,prior,1,2,8,,0,196,Soda,77,7,soft drinks,beverages,1,0
3,2539329,1,prior,1,2,8,,0,14084,Organic Unsweetened Vanilla Almond Milk,91,16,soy lactosefree,dairy eggs,2,0
4,2539329,1,prior,1,2,8,,0,26405,XL Pick-A-Size Paper Towel Rolls,54,17,paper goods,household,5,0


In [5]:
# setup datasets
orders = orders_all[["order_id", "user_id", "customer type", "product_id", "aisle_id", "department_id"]]
orders

Unnamed: 0,order_id,user_id,customer type,product_id,aisle_id,department_id
0,2539329,1,0,12427,23,19
1,2539329,1,0,26088,23,19
2,2539329,1,0,196,77,7
3,2539329,1,0,14084,91,16
4,2539329,1,0,26405,54,17
...,...,...,...,...,...,...
32434484,2977660,206209,0,9405,91,16
32434485,2977660,206209,0,16168,86,16
32434486,2977660,206209,0,14197,9,9
32434487,2977660,206209,0,39216,121,14


In [6]:
# orders.to_csv('merged_order_cluster.csv', encoding='utf-8')
orders.to_csv('merged_order_cluster_all.csv', encoding='utf-8')

In [4]:
# Read files
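# Read back the file written in the previous cell; index_col=0 restores the saved row index
orders = pd.read_csv('merged_order_cluster_all.csv', index_col=0)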

In [7]:
# Select one customer cluster to analyse (valid values: 0, 1, 2, 3, 4)
orders = orders[orders["customer type"] == 0] 
orders.sort_values("order_id", ascending = False) 

Unnamed: 0,order_id,user_id,customer type,product_id,aisle_id,department_id
3972198,3421083,25247,0,5020,3,19
3972202,3421083,25247,0,39678,74,17
3972204,3421083,25247,0,21162,92,18
3972205,3421083,25247,0,35211,92,18
3972206,3421083,25247,0,45309,92,18
...,...,...,...,...,...,...
31824592,2,202279,0,17794,83,4
31824591,2,202279,0,43668,123,4
31824590,2,202279,0,33120,86,16
31824588,2,202279,0,45918,19,13


In [8]:
# orders = pd.read_csv('order_products__prior.csv')
print('orders -- dimensions: {0};   size: {1}'.format(orders.shape, size(orders)))
display(orders.head())

orders -- dimensions: (29084126, 6);   size: 1628.71 MB


Unnamed: 0,order_id,user_id,customer type,product_id,aisle_id,department_id
0,2539329,1,0,12427,23,19
1,2539329,1,0,26088,23,19
2,2539329,1,0,196,77,7
3,2539329,1,0,14084,91,16
4,2539329,1,0,26405,54,17


#### B. Convert order data into format expected by the association rules function

In [9]:
# Convert from DataFrame to a Series, with order_id as index and item_id as value
orders = orders.set_index('order_id')['product_id'].rename('item_id')
display(orders.head(10))
type(orders)

order_id
2539329    12427
2539329    26088
2539329      196
2539329    14084
2539329    26405
2398795    12427
2398795    26088
2398795    10258
2398795      196
2398795    13032
Name: item_id, dtype: int64

pandas.core.series.Series

#### C. Display summary statistics for order data

In [10]:
print('dimensions: {0};   size: {1};   unique_orders: {2};   unique_items: {3}'
      .format(orders.shape, size(orders), len(orders.index.unique()), len(orders.value_counts())))

dimensions: (29084126,);   size: 465.35 MB;   unique_orders: 2938109;   unique_items: 49652


### Part 2: Association Rules Function

#### A. Helper functions to the main association rules function

In [11]:
# Returns frequency counts for items and item pairs
def freq(iterable):
    if isinstance(iterable, pd.Series):
        return iterable.value_counts().rename("freq")
    else: 
        return pd.Series(Counter(iterable)).rename("freq")

    
# Returns number of unique orders
def order_count(order_item):
    return len(set(order_item.index))


# Returns generator that yields item pairs, one at a time
def get_item_pairs(order_item):
    order_item = order_item.reset_index().values
    for order_id, order_object in groupby(order_item, lambda x: x[0]):
        item_list = [item[1] for item in order_object]
              
        for item_pair in combinations(item_list, 2):
            yield item_pair
            

# Returns frequency and support associated with item
def merge_item_stats(item_pairs, item_stats):
    return (item_pairs
                .merge(item_stats.rename(columns={'freq': 'freqA', 'support': 'supportA'}), left_on='item_A', right_index=True)
                .merge(item_stats.rename(columns={'freq': 'freqB', 'support': 'supportB'}), left_on='item_B', right_index=True))


# Returns name associated with item
def merge_item_name(rules, item_name):
    columns = ['itemA','itemB','freqAB','supportAB','freqA','supportA','freqB','supportB', 
               'confidenceAtoB','confidenceBtoA','lift']
    rules = (rules
                .merge(item_name.rename(columns={'item_name': 'itemA'}), left_on='item_A', right_on='item_id')
                .merge(item_name.rename(columns={'item_name': 'itemB'}), left_on='item_B', right_on='item_id'))
    return rules[columns]               
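
One caveat worth knowing about `get_item_pairs`: `combinations` emits pairs in the order the items appear within each order, so the same two items can be tallied under two different keys if the rows are not sorted by item within each order.  A possible variant, sketched below under the hypothetical name `get_sorted_item_pairs`, sorts (and de-duplicates) each order's items before pairing.  It works on plain `(order_id, item_id)` tuples and, like the original, assumes that the rows of an order are contiguous, as `groupby` requires.

```python
from collections import Counter
from itertools import combinations, groupby
from operator import itemgetter

def get_unsorted_item_pairs(rows):
    """Pairs in row order, as in get_item_pairs above."""
    for _, group in groupby(rows, key=itemgetter(0)):
        items = [item for _, item in group]
        yield from combinations(items, 2)

def get_sorted_item_pairs(rows):
    """Pairs with a canonical (sorted) key, so {A, B} is counted only once."""
    for _, group in groupby(rows, key=itemgetter(0)):
        items = sorted({item for _, item in group})
        yield from combinations(items, 2)

# Order 1 lists milk before egg; order 2 lists egg before milk
rows = [(1, 'milk'), (1, 'egg'), (2, 'egg'), (2, 'milk')]

unsorted_counts = Counter(get_unsorted_item_pairs(rows))   # two keys, one count each
sorted_counts = Counter(get_sorted_item_pairs(rows))       # one key, count of 2
print(unsorted_counts)
print(sorted_counts)
```

Sorting each order adds some work per order, but orders are short compared with the dataset, so the generator still only holds one order's items at a time.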

#### B. Association rules function

In [12]:
def association_rules(order_item, min_support):

    print("Starting order_item: {:22d}".format(len(order_item)))


    # Calculate item frequency and support
    item_stats             = freq(order_item).to_frame("freq")
    item_stats['support']  = item_stats['freq'] / order_count(order_item) * 100
    


    # Filter from order_item items below min support 
    qualifying_items       = item_stats[item_stats['support'] >= min_support].index
    order_item             = order_item[order_item.isin(qualifying_items)]

    print("(qualifying_items)Items with support >= {}: {:15d}".format(min_support, len(qualifying_items)))
    print("Items above min support, Remaining order_item: {:21d}".format(len(order_item)))


    # Filter from order_item orders with less than 2 items
    order_size             = freq(order_item.index)
    qualifying_orders      = order_size[order_size >= 2].index
    order_item             = order_item[order_item.index.isin(qualifying_orders)]

    print("(qualifying_orders)Remaining orders with 2+ items: {:11d}".format(len(qualifying_orders)))
    print("orders more than 2 items, Remaining order_item: {:21d}".format(len(order_item)))


    # Recalculate item frequency and support
    item_stats             = freq(order_item).to_frame("freq")
    item_stats['support']  = item_stats['freq'] / order_count(order_item) * 100


    # Get item pairs generator
    item_pair_gen          = get_item_pairs(order_item)


    # Calculate item pair frequency and support
    item_pairs              = freq(item_pair_gen).to_frame("freqAB")
    item_pairs['supportAB'] = item_pairs['freqAB'] / len(qualifying_orders) * 100

    print("use paires generator, Item pairs: {:31d}".format(len(item_pairs)))


    # Filter from item_pairs those below min support
    item_pairs              = item_pairs[item_pairs['supportAB'] >= min_support]

    print("Item pairs with support >= {}: {:10d}\n".format(min_support, len(item_pairs)))


    # Create table of association rules and compute relevant metrics
    item_pairs = item_pairs.reset_index().rename(columns={'level_0': 'item_A', 'level_1': 'item_B'})
    item_pairs = merge_item_stats(item_pairs, item_stats)
    
    item_pairs['confidenceAtoB'] = item_pairs['supportAB'] / item_pairs['supportA']
    item_pairs['confidenceBtoA'] = item_pairs['supportAB'] / item_pairs['supportB']
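    # NOTE: the supports above are percentages, so this ratio equals the true lift divided by 100
    # (a pair with no association scores 0.01 here, not 1)
    item_pairs['lift']           = item_pairs['supportAB'] / (item_pairs['supportA'] * item_pairs['supportB'])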
    
    
    # Return association rules sorted by lift in descending order
    return item_pairs.sort_values('lift', ascending=False)

### Part 3:  Association Rules Mining

In [13]:
%%time
rules = association_rules(orders, 0.01)  

Starting order_item:               29084126
(qualifying_items)Items with support >= 0.01:           10887
Items above min support, Remaining order_item:              26728412
(qualifying_orders)Remaining orders with 2+ items:     2747746
orders more than 2 items, Remaining order_item:              26557711
use paires generator, Item pairs:                        19529742
Item pairs with support >= 0.01:      58186

Wall time: 3min 27s


#### Results Without clustering 
Starting order_item:               32434489

Items with support >= 0.01:           10906

Items above min support, Remaining order_item:              29843570

Remaining orders with 2+ items:     3013325

orders more than 2 items, Remaining order_item:              29662716

use paires generator, Item pairs:                        30622410

Item pairs with support >= 0.01:      48751


Wall time: 4min 20s

#### Results With clustering = 0

Starting order_item:               29084126

**(qualifying_items)Items with support >= 0.01:           10887**

Items above min support, Remaining order_item:              26728412

(qualifying_orders)Remaining orders with 2+ items:     2747746

orders more than 2 items, Remaining order_item:              26557711

use paires generator, Item pairs:                        28786889

**Item pairs with support >= 0.01:      46023**



#### Results With clustering = 1

Starting order_item:                 560066

**(qualifying_items)Items with support >= 0.01:            9508**

Items above min support, Remaining order_item:                548294

(qualifying_orders)Remaining orders with 2+ items:       33776

orders more than 2 items, Remaining order_item:                547989

use paires generator, Item pairs:                         2330965

**Item pairs with support >= 0.01:     340757**



#### Results With clustering = 2

Starting order_item:                2172599

(qualifying_items)Items with support >= 0.01:           10377

Items above min support, Remaining order_item:               2086915

(qualifying_orders)Remaining orders with 2+ items:      153312

orders more than 2 items, Remaining order_item:               2083974

use paires generator, Item pairs:                         6029162

**Item pairs with support >= 0.01:     143392**



#### Results With clustering = 3

Starting order_item:                 286014

(qualifying_items)Items with support >= 0.01:            8281

Items above min support, Remaining order_item:                274323

(qualifying_orders)Remaining orders with 2+ items:       31546

orders more than 2 items, Remaining order_item:                272554

use paires generator, Item pairs:                         1035147

**Item pairs with support >= 0.01:      76287**



#### Results With clustering = 4

Starting order_item:                 331684

(qualifying_items)Items with support >= 0.01:            1786

Items above min support, Remaining order_item:                327019

(qualifying_orders)Remaining orders with 2+ items:       48592

orders more than 2 items, Remaining order_item:                322861

use paires generator, Item pairs:                          228222

**Item pairs with support >= 0.01:      53491**




In [14]:
# Replace item ID with item name and display association rules
item_name   = pd.read_csv('products.csv')
item_name   = item_name.rename(columns={'product_id':'item_id', 'product_name':'item_name'})
rules_final = merge_item_name(rules, item_name).sort_values('lift', ascending=False)
display(rules_final)

Unnamed: 0,itemA,itemB,freqAB,supportAB,freqA,supportA,freqB,supportB,confidenceAtoB,confidenceBtoA,lift
0,Dairy Free Greek Yogurt Strawberry,Dairy Free Greek Yogurt Blueberry,293,0.010663,578,0.021035,531,0.019325,0.506920,0.551789,26.231423
1,Grassfed Whole Milk Strawberry Yogurt,Organic Strawberry Grassfed Whole Milk Yogurt,276,0.010045,561,0.020417,592,0.021545,0.491979,0.466216,22.835004
2,Organic Cashew Nondairy Blueberry Yogurt,Organic Nondairy Strawberry Cashew Yogurt,363,0.013211,576,0.020963,851,0.030971,0.630208,0.426557,20.348442
4,Organic Strawberry Chia Lowfat 2% Cottage Cheese,Organic Cottage Cheese Blueberry Acai Chia,489,0.017796,1089,0.039632,766,0.027877,0.449036,0.638381,16.107524
5,Hair Shampoos,Moroccan Argan Oil + Argan Stem Cell Triple Mo...,302,0.010991,634,0.023073,843,0.030680,0.476341,0.358244,15.526254
...,...,...,...,...,...,...,...,...,...,...,...
8314,Small Hass Avocado,Organic Hass Avocado,376,0.013684,44260,1.610775,186186,6.775954,0.008495,0.002019,0.001254
26549,Organic Cucumber,Cucumber Kirby,290,0.010554,72064,2.622659,90338,3.287713,0.004024,0.003210,0.001224
5314,Strawberries,Organic Strawberries,1093,0.039778,121251,4.412744,228619,8.320238,0.009014,0.004781,0.001083
11273,Organic Hass Avocado,Organic Avocado,813,0.029588,186186,6.775954,160047,5.824665,0.004367,0.005080,0.000750


In [15]:
rules_final.to_csv('Association_Rules_Mining_cluster_0.csv', encoding='utf-8')

In [16]:
rules_final.sort_values("lift", axis = 0, ascending = False) 

Unnamed: 0,itemA,itemB,freqAB,supportAB,freqA,supportA,freqB,supportB,confidenceAtoB,confidenceBtoA,lift
0,Dairy Free Greek Yogurt Strawberry,Dairy Free Greek Yogurt Blueberry,293,0.010663,578,0.021035,531,0.019325,0.506920,0.551789,26.231423
1,Grassfed Whole Milk Strawberry Yogurt,Organic Strawberry Grassfed Whole Milk Yogurt,276,0.010045,561,0.020417,592,0.021545,0.491979,0.466216,22.835004
2,Organic Cashew Nondairy Blueberry Yogurt,Organic Nondairy Strawberry Cashew Yogurt,363,0.013211,576,0.020963,851,0.030971,0.630208,0.426557,20.348442
4,Organic Strawberry Chia Lowfat 2% Cottage Cheese,Organic Cottage Cheese Blueberry Acai Chia,489,0.017796,1089,0.039632,766,0.027877,0.449036,0.638381,16.107524
5,Hair Shampoos,Moroccan Argan Oil + Argan Stem Cell Triple Mo...,302,0.010991,634,0.023073,843,0.030680,0.476341,0.358244,15.526254
...,...,...,...,...,...,...,...,...,...,...,...
8314,Small Hass Avocado,Organic Hass Avocado,376,0.013684,44260,1.610775,186186,6.775954,0.008495,0.002019,0.001254
26549,Organic Cucumber,Cucumber Kirby,290,0.010554,72064,2.622659,90338,3.287713,0.004024,0.003210,0.001224
5314,Strawberries,Organic Strawberries,1093,0.039778,121251,4.412744,228619,8.320238,0.009014,0.004781,0.001083
11273,Organic Hass Avocado,Organic Avocado,813,0.029588,186186,6.775954,160047,5.824665,0.004367,0.005080,0.000750


In [17]:
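# association_rules() computes lift from supports expressed in percent, which scales it by 1/100,
# so a true lift above 1 corresponds to a 'lift' column value above 0.01
lift_over_1 = rules_final[rules_final['lift'] > 0.01]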

In [None]:
# lift_over_1.to_csv('lift_over_1_cluster_0.csv', encoding='utf-8')

### Personalized Recommender

In [22]:
# example User 1 Shopping history
user_1 = orders_all[orders_all['user_id'] == 1]
user_1["product_name"].value_counts()

Soda                                       10
Original Beef Jerky                        10
Pistachios                                  9
Organic String Cheese                       8
Zero Calorie Cola                           3
Cinnamon Toast Crunch                       3
Organic Half & Half                         2
XL Pick-A-Size Paper Towel Rolls            2
Aged White Cheddar Popcorn                  2
Bag of Organic Bananas                      2
Organic Unsweetened Almond Milk             1
Organic Unsweetened Vanilla Almond Milk     1
Creamy Almond Butter                        1
Milk Chocolate Almonds                      1
Bartlett Pears                              1
Organic Fuji Apples                         1
0% Greek Strained Yogurt                    1
Honeycrisp Apples                           1
Name: product_name, dtype: int64

In [30]:
# Our shopping recommendation
results = rules_final[rules_final['itemA'] == "Original Beef Jerky"]
results.sort_values('supportAB', ascending=False)

Unnamed: 0,itemA,itemB,freqAB,supportAB,freqA,supportA,freqB,supportB,confidenceAtoB,confidenceBtoA,lift
27733,Original Beef Jerky,Soda,759,0.027623,5170,0.188154,29206,1.062908,0.146809,0.025988,0.13812
3084,Original Beef Jerky,Bag of Organic Bananas,753,0.027404,5170,0.188154,326609,11.886433,0.145648,0.002306,0.012253
27615,Original Beef Jerky,Trail Mix,631,0.022964,5170,0.188154,9195,0.334638,0.12205,0.068624,0.364723
27660,Original Beef Jerky,Clementines,601,0.021872,5170,0.188154,19749,0.718735,0.116248,0.030432,0.161739
27633,Original Beef Jerky,0% Greek Strained Yogurt,532,0.019361,5170,0.188154,9612,0.349814,0.102901,0.055347,0.29416
29787,Original Beef Jerky,Extra Fancy Unsalted Mixed Nuts,474,0.017251,5170,0.188154,7291,0.265345,0.091683,0.065012,0.345523
27596,Original Beef Jerky,Milk Chocolate Almonds,317,0.011537,5170,0.188154,3781,0.137604,0.061315,0.08384,0.445593
34539,Original Beef Jerky,Sparkling Mineral Water,313,0.011391,5170,0.188154,14413,0.524539,0.060542,0.021717,0.115419
375,Original Beef Jerky,Banana,290,0.010554,5170,0.188154,428131,15.581171,0.056093,0.000677,0.0036
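
The lookup above can be wrapped into a small helper that takes a customer's purchase history and returns items they have not bought yet.  This is only a sketch: the function name `recommend` and the choice to rank by `confidenceAtoB` are assumptions, and the rules are a hand-copied subset of the table above stored as plain dictionaries rather than the `rules_final` DataFrame.

```python
def recommend(rules, purchased, top_n=3):
    """Suggest items the customer has not bought, ranked by confidence A -> B."""
    candidates = {}
    for rule in rules:
        if rule['itemA'] in purchased and rule['itemB'] not in purchased:
            # Keep the strongest rule if several purchased items point to the same item
            best = candidates.get(rule['itemB'], float('-inf'))
            candidates[rule['itemB']] = max(best, rule['confidenceAtoB'])
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    return ranked[:top_n]

# Subset of the "Original Beef Jerky" rules shown above
rules = [
    {'itemA': 'Original Beef Jerky', 'itemB': 'Soda', 'confidenceAtoB': 0.146809},
    {'itemA': 'Original Beef Jerky', 'itemB': 'Bag of Organic Bananas', 'confidenceAtoB': 0.145648},
    {'itemA': 'Original Beef Jerky', 'itemB': 'Trail Mix', 'confidenceAtoB': 0.12205},
    {'itemA': 'Original Beef Jerky', 'itemB': 'Clementines', 'confidenceAtoB': 0.116248},
    {'itemA': 'Original Beef Jerky', 'itemB': 'Extra Fancy Unsalted Mixed Nuts', 'confidenceAtoB': 0.091683},
    {'itemA': 'Original Beef Jerky', 'itemB': 'Milk Chocolate Almonds', 'confidenceAtoB': 0.061315},
]

# Abbreviated purchase history for user 1
purchased = {'Original Beef Jerky', 'Soda', 'Bag of Organic Bananas', 'Milk Chocolate Almonds'}

suggestions = recommend(rules, purchased, top_n=2)
print(suggestions)   # ['Trail Mix', 'Clementines']
```

Items the customer already buys are filtered out so that the suggestions are new to them.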


### Association Rules Mining with Aisles and Departments

In [None]:
## Test With Aisle and Departments

# Convert from DataFrame to a Series, with order_id as index and item_id as value
# (orders was converted to a Series in Part 1, so start again from the orders_all DataFrame,
#  which covers all customer clusters)
orders_aisle = orders_all.set_index('order_id')['aisle_id'].rename('item_id')


orders_department = orders_all.set_index('order_id')['department_id'].rename('item_id')

print('dimensions: {0};   size: {1};   unique_orders: {2};   unique_items: {3}'
      .format(orders_aisle.shape, size(orders_aisle), len(orders_aisle.index.unique()), len(orders_aisle.value_counts())))

print('dimensions: {0};   size: {1};   unique_orders: {2};   unique_items: {3}'
      .format(orders_department.shape, size(orders_department), len(orders_department.index.unique()), len(orders_department.value_counts())))

# %time is the line-magic form; %%time is only valid as the first line of a cell
%time rules_aisle = association_rules(orders_aisle, 0.01)

%time rules_department = association_rules(orders_department, 0.01)

### Part 4:  Conclusion

From the output above, we see that the top associations are not surprising, with one flavor of an item being purchased with another flavor from the same item family (eg: Strawberry Chia Cottage Cheese with Blueberry Acai Cottage Cheese, Chicken Cat Food with Turkey Cat Food, etc).  As mentioned, one common application of association rules mining is in the domain of recommender systems.  Once item pairs have been identified as having a positive relationship, recommendations can be made to customers in order to increase sales.  And hopefully, along the way, they will also introduce customers to items they never would have tried before or even imagined existed!