# Instacart Recommender
In this notebook we will explore techniques for understanding and predicting customer purchasing habits.  Our primary goal will be to create a recommender.  Given a customer id, the recommender should provide a list of new products that the customer may be interested in purchasing.  Going one step further, given a product id, the recommender should provide a list of similar products that a customer may be interested in purchasing. 

We will explore three recommendation approaches, each offering different insight into the Instacart dataset:
1. Global Recommendations
2. Association Rules
3. Collaborative Filtering

Before we dive into the data, we need to import our datasets into Pandas dataframes. 

In [76]:
# import libraries required for this notebook
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import mlxtend as mlx

import seaborn as sns
sns.set(style='whitegrid')

import random
import datetime

from mlxtend.preprocessing import OnehotTransactions
from mlxtend.frequent_patterns import association_rules
from mlxtend.frequent_patterns import apriori

import implicit
from implicit.als import AlternatingLeastSquares
from implicit.nearest_neighbours import bm25_weight
import scipy.sparse as sparse

## Data Import
In order to reduce memory requirements, we've defined the column datatypes for our data.

In [77]:
# setup datatypes to save memory 
order_dtypes = {
    'order_id':np.int32,
    'user_id':np.int64,
    'eval_set':'category',
    'order_number':np.int16,
    'order_dow':np.int8,
    'order_hour_of_day':np.int8,
    'days_since_prior_order':np.float32
}

product_dtypes={
    'product_id':np.uint16,
    'aisle_id':np.int16,
    'department_id':np.int16
}

order_details_dtypes={
    'order_id':np.int32,
    'product_id':np.uint16,
    'add_to_cart_order':np.int32,
    'reordered':np.int8  
}
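
As a rough illustration of why the narrower datatypes matter, the sketch below compares the memory footprint of the same values stored as `int64` and as `int8`.  The column is made up (it only mimics the 0-6 range of `order_dow`) and the row count is arbitrary.

```python
import numpy as np
import pandas as pd

# Toy column standing in for order_dow (values 0-6); the data is made up.
n_rows = 1_000_000
dow = pd.Series(np.arange(n_rows) % 7)

# Store the same values with two different integer widths.
dow_int64 = dow.astype(np.int64)
dow_int8 = dow.astype(np.int8)

# memory_usage(index=False) reports the bytes used by the values only.
bytes_int64 = dow_int64.memory_usage(index=False)
bytes_int8 = dow_int8.memory_usage(index=False)

print(f'int64: {bytes_int64:,} bytes')
print(f'int8:  {bytes_int8:,} bytes')
```

The values are identical in both series; only the storage width changes, which is why the same idea can be applied at load time through the `dtype` argument of `pd.read_csv`.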

In [78]:
# import data
orders = pd.read_csv('../data/raw/orders.csv', dtype=order_dtypes)
products = pd.read_csv('../data/raw/products.csv', dtype=product_dtypes)
order_details_prior = pd.read_csv('../data/raw/order_products__prior.csv', dtype=order_details_dtypes)
#aisles = pd.read_csv('../data/raw/aisles.csv')
#order_details_train = pd.read_csv('../data/raw/order_products__train.csv')

Next, we can join our orders, order details and product information into a single dataframe.

In [79]:
# combine our dataset
order_details_all = pd.merge(orders, order_details_prior, on='order_id')
order_details_all = pd.merge(order_details_all, products, on='product_id')

In [80]:
order_details_all.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id
0,2539329,1,prior,1,2,8,,196,1,0,Soda,77,7
1,2398795,1,prior,2,3,7,15.0,196,1,1,Soda,77,7
2,473747,1,prior,3,3,12,21.0,196,1,1,Soda,77,7
3,2254736,1,prior,4,4,7,29.0,196,1,1,Soda,77,7
4,431534,1,prior,5,4,15,28.0,196,1,1,Soda,77,7


## 1. Global Recommendations
Global recommendations will be very generic.  We look to answer the question: **What top 2 products were purchased with item A?** (Where item A is any product available in the product catalog).

These recommendations will be derived from a product contingency matrix.  A product contingency matrix identifies how many times a product pairing was purchased.  For example, how many times were bananas purchased with avocados? 

The problem with this method is that these recommendations are not customer specific (hence the *global* identifier).  Recommendations are basically a popularity contest whereby the most popular product pairing wins.  The method also does not determine whether purchasing item A results in a higher likelihood that item B is also purchased.  Nevertheless, let's dive in.

### Product Contingency Matrix
A product contingency matrix requires us to have a list of all product pairings for every order. This product pairing will then be summarized using the pandas crosstab function to create the product contingency matrix. 
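
To make the pairing step concrete, here is a minimal sketch on three made-up orders.  The product names and the `toy_` variables are illustrative assumptions; only `itertools.product` and `pd.crosstab` come from the approach described above.

```python
from itertools import product

import pandas as pd

# Three made-up orders; product names are illustrative only.
toy_orders = {
    1: ['Banana', 'Limes', 'Avocado'],
    2: ['Banana', 'Limes'],
    3: ['Banana'],
}

# Every ordered pair (a, b) of products that share an order, including (a, a).
pairs = [
    (a, b)
    for items in toy_orders.values()
    for a, b in product(items, repeat=2)
]
pairs_df = pd.DataFrame(pairs, columns=['a', 'b'])

# Count how often each pairing occurs.
toy_matrix = pd.crosstab(pairs_df.a, pairs_df.b)
print(toy_matrix)
```

Because every pair is generated in both directions, the resulting matrix is symmetric, and the diagonal counts the orders that contain each product.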

We have over 3,000,000 orders within this dataset.  This is a great wealth of data, but it is too large for us to process with a crosstab.  We will reduce the number of users included in the dataset, which will decrease the number of transactions overall.

In [81]:
# create mask for users
user_subset = orders.user_id.unique()[:1500]

In [82]:
# create orders subset
orders_subset = order_details_all[order_details_all.user_id.isin(user_subset)]

In [83]:
len(orders_subset)

219257

We have reduced the dataset down to 1,500 customers, accounting for 219,257 order line items (one row per product purchased).  We will build our contingency matrix from this data next.

In [84]:
from itertools import product

def get_contingency_matrix(order_details):
    '''
        Takes an Instacart order dataset and returns a product contingency matrix    
        
        Parameters
        ----------
            order_details: The Instacart order history
        
        Returns
        -------
            product contingency matrix, ordered by purchase frequency in descending order
    '''

    # collect every product pairing across all orders
    pairs = []

    # loop through order history by order ID
    for order_id, order_products in order_details.groupby('order_id'):

        # find a unique list of products
        p_names = list(order_products.product_name.unique())

        # add every ordered pair (a, b) from this order, including (a, a)
        pairs.extend(product(p_names, p_names))

    # build the master item list once, rather than concatenating inside the loop
    item_list = pd.DataFrame(pairs, columns=['a', 'b'])
    
    # create our crosstab
    matrix = pd.crosstab(item_list.a, item_list.b)
    
    # sort our column values by purchase quantity (in descending order)
    sorted_names = matrix.sum().sort_values(ascending=False).index.tolist()
    
    # return a sorted contingency matrix
    return matrix.loc[sorted_names, sorted_names]


In [85]:
# create the contingency matrix
cont_matrix = get_contingency_matrix(orders_subset)

# preview first 15 columns
cont_matrix.iloc[:15, :15]

a \ b,Banana,Bag of Organic Bananas,Organic Strawberries,Organic Baby Spinach,Organic Hass Avocado,Organic Avocado,Limes,Organic Raspberries,Large Lemon,Organic Garlic,Strawberries,Organic Zucchini,Organic Yellow Onion,Organic Whole Milk,Cucumber Kirby
Banana,2910,7,310,293,149,340,192,125,245,99,219,116,83,177,206
Bag of Organic Bananas,7,2661,480,327,466,173,176,344,146,148,127,198,177,239,112
Organic Strawberries,310,480,1870,228,297,152,148,299,84,145,7,109,104,151,77
Organic Baby Spinach,293,327,228,1564,221,179,137,148,186,144,96,148,118,129,158
Organic Hass Avocado,149,466,297,221,1458,7,151,191,78,165,46,118,143,89,48
Organic Avocado,340,173,152,179,7,1181,179,78,161,91,74,81,78,123,88
Limes,192,176,148,137,151,179,1040,84,177,109,57,92,94,64,84
Organic Raspberries,125,344,299,148,191,78,84,989,71,44,65,96,55,112,56
Large Lemon,245,146,84,186,78,161,177,71,1039,85,90,84,55,49,172
Organic Garlic,99,148,145,144,165,91,109,44,85,786,27,69,156,63,56


Our contingency matrix has product names along both the rows and the columns.  Each inner cell holds the number of orders in which the product pairing appeared, and each diagonal cell holds the number of orders containing that product.

For example, Banana and Limes were purchased together in 192 orders within our dataset.  We can now build recommendations using this matrix. 

In [86]:
def get_global_recommendations(product, matrix):
    '''
        Given a product contingency matrix (matrix), find the top 2 products to be recommended
        for purchase given a selected item (product)
        
        parameters
        ----------
            product: The product name for which recommendations should be generated
            
            matrix: product contingency matrix
            
        returns
        -------
            pandas Series with the top 2 recommended products (index) and their pairing counts (values)
        
    '''

    # find the pairing counts for the selected product
    product_series = matrix[product]

    # create a mask so that we remove the selected product from this list
    product_series = product_series[product_series.index != product]

    # get top 2 products most frequently purchased
    recommendations = product_series.sort_values(ascending=False)[:2]
    
    # return recommendations
    return recommendations

We will feed in the top 5 products to our recommender and print the top 2 recommendations for each product.

In [87]:
print('Recommendations based on global products contingency matrix')
print()

# print recommendations for top 5 products
for p in cont_matrix.columns[:5]:
    
    print('Top 2 recommended items to buy with {} are:'.format(p))

    # get recommendations
    recommendations = get_global_recommendations(p, cont_matrix)
       
    # print results
    print(recommendations.index.tolist())
    print()

Recommendations based on global products contingency matrix

Top 2 recommended items to buy with Banana are:
['Organic Avocado', 'Organic Strawberries']

Top 2 recommended items to buy with Bag of Organic Bananas are:
['Organic Strawberries', 'Organic Hass Avocado']

Top 2 recommended items to buy with Organic Strawberries are:
['Bag of Organic Bananas', 'Banana']

Top 2 recommended items to buy with Organic Baby Spinach are:
['Bag of Organic Bananas', 'Banana']

Top 2 recommended items to buy with Organic Hass Avocado are:
['Bag of Organic Bananas', 'Organic Strawberries']



In this section we demonstrated how to build a simple recommender using a product contingency matrix.  Recommendations are provided based on purchase frequency, so in this case we are recommending the most popular items paired with each product.  

In the next sections we will build on this to include further intelligence. 

## 2. Association Rules
An association rule provides us with a simple equation, where the right hand side of the equation (the consequents) represents products a customer is likely to purchase as a result of having the items on the left hand side (the antecedents, spelled `antecedants` in the mlxtend output below) already in their basket. 

Grocery stores utilize association rules to know which products they should place together in order to foster additional purchases.  We can also use association rules to recommend products based on what products have currently been added to a shopping cart.  Our next recommender will return a set of products based on association rules that we create. 

Association rules are created from itemsets (which are simply products that are frequently purchased together).  We will create itemsets using the apriori algorithm.  The frequent itemsets are then fed into the association_rules algorithm to generate our final association rules to be used by our recommender.
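
To show what a frequent itemset is, the sketch below counts itemsets by brute force on five made-up baskets.  This is not the Apriori algorithm itself (Apriori arrives at the same frequent itemsets while pruning candidates whose subsets are already infrequent); the function name `frequent_itemsets` and the 0.4 threshold are assumptions chosen for the illustration.

```python
from collections import Counter
from itertools import combinations

# Made-up baskets for illustration.
transactions = [
    {'Corn', 'Eggs', 'Ice Cream'},
    {'Apple', 'Ice Cream'},
    {'Corn'},
    {'Corn', 'Dill'},
    {'Corn', 'Eggs'},
]


def frequent_itemsets(transactions, min_support, max_len=2):
    """Count every itemset of up to max_len items and keep the frequent ones."""
    counts = Counter()
    for basket in transactions:
        for size in range(1, max_len + 1):
            for itemset in combinations(sorted(basket), size):
                counts[frozenset(itemset)] += 1

    # support = share of transactions that contain the itemset
    n = len(transactions)
    return {
        itemset: count / n
        for itemset, count in counts.items()
        if count / n >= min_support
    }


toy_itemsets = frequent_itemsets(transactions, min_support=0.4)
for itemset, support in sorted(toy_itemsets.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), support)
```

Brute force is only workable on tiny examples like this one, because the number of candidate itemsets grows very quickly with basket size and `max_len`.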

### Apriori Algorithm
The apriori algorithm expects data in a one hot encoded pandas dataframe, similar to: 

| OrderID | Apple | Corn | Dill | Eggs | Ice Cream |
|---|---|---|---|---|---|
|0|0|1|0|1|1|
|1|1|0|0|0|1|
|2|0|1|0|0|0|
|3|0|1|1|0|0|
|4|0|1|0|0|0|

Each row represents a single order.  The columns then identify whether a particular product was found in the given order.  A value of 1 indicates that the product was present, and a value of 0 indicates that the product was not present.  

Our current order details dataset is not represented by a single line per transaction.  We will need to perform data manipulation to achieve this format.  
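
The reshaping can be sketched with pandas alone on the five example orders from the table above.  This is an illustrative alternative to the mlxtend encoder, not the method used in this notebook, and the `toy_` names are assumptions.

```python
import pandas as pd

# The five example orders from the table above, as lists of product names.
toy_baskets = {
    0: ['Corn', 'Eggs', 'Ice Cream'],
    1: ['Apple', 'Ice Cream'],
    2: ['Corn'],
    3: ['Corn', 'Dill'],
    4: ['Corn'],
}

# Long format: one row per (order, product), like the order details data.
long_df = pd.DataFrame(
    [(order_id, name) for order_id, names in toy_baskets.items() for name in names],
    columns=['order_id', 'product_name'],
)

# Wide format: one row per order, one 0/1 column per product.
toy_oht = (pd.crosstab(long_df.order_id, long_df.product_name) > 0).astype(int)
print(toy_oht)
```

The `> 0` comparison guards against a product appearing more than once in the same order, so every cell stays 0 or 1.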

In [88]:
def create_oht_dataframe(order_details):
    '''
        Takes an instacart transaction dataframe, flattens the dataframe and encodes it as
        one hot transaction
        
        Parameters
        ----------
        order_details: Instacart transactional dataframe
        
        Returns
        -------
        One hot transaction encoded dataframe
        
    '''
    order_details_flattened = []
    order_ids = []

    # loop through each order & flatten order to a single line
    for group, data in order_details.groupby('order_id'):

        # find product names
        products_on_order = list(data.product_name.values)

        # append order Id and product names to new array
        order_ids.append(group)
        order_details_flattened.append(products_on_order)

    # create one hot transaction
    oht = OnehotTransactions()

    # convert our flattened order data structure
    oht_ary = oht.fit(order_details_flattened).transform(order_details_flattened)

    # convert results to a dataframe and return
    return pd.DataFrame(oht_ary, columns=oht.columns_, index=order_ids)    

In [89]:
# create our dataset
oht_df = create_oht_dataframe(orders_subset)

We now have a one hot encoded transaction dataframe.  Most cells in this frame will be zero, because the product catalog is far larger than any single order.  We want to decrease the size of this dataframe by removing any products that have not been frequently purchased. 

We can look at how many times each product has been purchased by summing each column and viewing the statistics:

In [90]:
oht_df.sum().describe()

count    18327.000000
mean        11.963606
std         52.063654
min          1.000000
25%          1.000000
50%          3.000000
75%          8.000000
max       2910.000000
dtype: float64

More than half of the products have been purchased fewer than 4 times (the median is 3).  We are going to limit the number of products by selecting those that have been purchased at least 4 times.  This will allow us to process the data faster (since we have less of it) and should produce stronger association rules (as we are only looking at products that are purchased more frequently).

In [91]:
# before shape
oht_df.shape

(22531, 18327)

In [92]:
# Set our minimum purchase frequency
min_freq = 4

# create a filter
min_freq_filter = (oht_df.sum() >= min_freq).values

# apply filter
oht_df_reduced = oht_df.loc[:,min_freq_filter]

In [93]:
# after shape
oht_df_reduced.shape

(22531, 8161)

With the dataset reduced, we can create our itemsets using the apriori algorithm.

*Note that this takes a large amount of time to process (about an hour)*

In [94]:
# Since we have such a massive number of transactions, we need to keep min_support low (0.002)
min_support = 0.002

# maximum length of the itemsets generated.  We set this to 3, but it can change depending on one's
# needs.  If set to None, all possible itemset lengths are evaluated
max_len = 3

# generate our itemsets
itemsets = apriori(oht_df_reduced, min_support=min_support, use_colnames=True, max_len=max_len) 

In [121]:
# have a peek at our results
itemsets.sort_values('support', ascending=False).head(10)

Unnamed: 0,support,itemsets
51,0.129155,[Banana]
48,0.118104,[Bag of Organic Bananas]
562,0.082997,[Organic Strawberries]
356,0.069415,[Organic Baby Spinach]
454,0.064711,[Organic Hass Avocado]
348,0.052417,[Organic Avocado]
278,0.046159,[Limes]
266,0.046114,[Large Lemon]
513,0.043895,[Organic Raspberries]
742,0.042164,[Strawberries]


Looking at the itemsets, we see that the top 5 itemsets correspond to our top 5 products as identified in section 1.  Let's move on to creating association rules using these itemsets.

### Building Association Rules

Before diving into association rules, let's review several metrics used to rank them.  For the purpose of these explanations, let's assume an association rule is represented by *A->B*

**Support**
Defines the popularity of an itemset.  It is calculated by dividing the number of transactions in which an itemset appears by the total number of transactions.

**Confidence**
Used to identify how likely item B is purchased when item A is purchased.  It is calculated by dividing the number of transactions that contain item A & B, by the total number of transactions that contain item A.

**Lift**
Indicates how much more (or less) likely item B is to be purchased when item A is in the basket, compared with how often item B is purchased overall.  It is calculated by dividing the confidence of *A->B* by the support of item B.
* If lift > 1, item B is more likely to be purchased when item A is present
* If lift == 1, items A & B are independent
* If lift < 1, item B is less likely to be purchased when item A is present
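
The three metrics can be worked through by hand using counts already shown in this notebook: in the contingency matrix from section 1, Banana appears in 2,910 orders, Organic Avocado in 1,181, and the two together in 340, while the one hot encoded frame has 22,531 rows (orders).  The helper `rule_metrics` below is a hypothetical function written for this illustration.

```python
def rule_metrics(n_total, n_a, n_b, n_ab):
    """Return support, confidence and lift for the rule A -> B from order counts."""
    support_b = n_b / n_total      # how often B is purchased overall
    confidence = n_ab / n_a        # how often B is purchased when A is present
    return {
        'support_a': n_a / n_total,
        'support_b': support_b,
        'support_ab': n_ab / n_total,
        'confidence': confidence,
        'lift': confidence / support_b,
    }


# Rule: Banana -> Organic Avocado, using the counts quoted above.
metrics = rule_metrics(n_total=22531, n_a=2910, n_b=1181, n_ab=340)

for name, value in metrics.items():
    print(f'{name}: {value:.6f}')
```

Note that the support of the antecedent (`support_a`) and the support of the pairing (`support_ab`) are different quantities, which matters when reading a rules table.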

With an overview of metrics complete, we create the association rules.

In [96]:
# Let's generate the association rules for our itemsets
rules = association_rules(itemsets.reset_index(drop=True), metric='confidence', min_threshold=0.1)
rules.drop_duplicates(inplace=True)

#### Association Rules By Support
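Let's see the top 10 association rules ranked by support.  Note that the support column in this output holds the support of the antecedent (for example, 0.129155 for Banana matches the support of the [Banana] itemset above), so this ranking surfaces the rules whose antecedents are seen most often in our transactions.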

In [122]:
rules.sort_values('support', ascending=False).head(10)

Unnamed: 0,antecedants,consequents,support,confidence,lift
148,(Banana),(Organic Avocado),0.129155,0.116838,2.229033
184,(Banana),(Organic Strawberries),0.129155,0.106529,1.283535
153,(Banana),(Organic Baby Spinach),0.129155,0.100687,1.450502
46,(Bag of Organic Bananas),(Organic Baby Spinach),0.118104,0.122886,1.770299
82,(Bag of Organic Bananas),(Organic Raspberries),0.118104,0.129275,2.945084
94,(Bag of Organic Bananas),(Organic Strawberries),0.118104,0.180383,2.173378
67,(Bag of Organic Bananas),(Organic Hass Avocado),0.118104,0.175122,2.706226
432,(Organic Strawberries),(Organic Baby Spinach),0.082997,0.121925,1.756455
585,(Organic Strawberries),(Organic Raspberries),0.082997,0.159893,3.642619
93,(Organic Strawberries),(Bag of Organic Bananas),0.082997,0.256684,2.173378


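We see that bananas play a huge role: buying bananas is a precursor for buying items like organic strawberries, organic avocados and baby spinach.  Bananas appear in about 13% of transactions.  The pairings themselves are less common; for example, Banana with Organic Avocado appears in about 1.5% of transactions (0.129 × 0.117). 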

#### Association Rules By Confidence
Let's see the top 10 association rules that have the strongest confidence (meaning there is a higher probability of seeing the items on the right if a customer has selected the items on the left).

In [123]:
rules.sort_values('confidence', ascending=False).head(10)

Unnamed: 0,antecedants,consequents,support,confidence,lift
636,(Raspberry Vinaigrette Salad Snax),(Thousand Island Salad Snax),0.003018,1.0,268.22619
635,(Thousand Island Salad Snax),(Raspberry Vinaigrette Salad Snax),0.003728,0.809524,268.22619
715,"(Organic Reduced Fat Omega-3 Milk, Whole Organ...",(Bag of Organic Bananas),0.002619,0.79661,6.744992
716,"(Organic Reduced Fat Omega-3 Milk, Bag of Orga...",(Whole Organic Omega 3 Milk),0.002619,0.79661,138.064798
717,"(Whole Organic Omega 3 Milk, Bag of Organic Ba...",(Organic Reduced Fat Omega-3 Milk),0.002619,0.79661,139.135068
680,"(Organic Reduced Fat Omega-3 Milk, Organic Has...",(Bag of Organic Bananas),0.002574,0.793103,6.7153
681,"(Organic Reduced Fat Omega-3 Milk, Bag of Orga...",(Organic Hass Avocado),0.002619,0.779661,12.048383
748,"(Sparkling Lemon Water, Sparkling Water Grapef...",(Lime Sparkling Water),0.003196,0.722222,51.658377
755,"(Organic Cucumber, Michigan Organic Kale)",(Organic Small Bunch Celery),0.003018,0.705882,31.556022
702,"(Organic Large Extra Fancy Fuji Apple, Organic...",(Bag of Organic Bananas),0.003107,0.685714,5.806024


In our sample, every order that contained raspberry vinaigrette salad snax also contained thousand island salad snax (a confidence of 100%).  In the other direction, there is an 81% chance of buying raspberry vinaigrette salad snax if you've already placed thousand island salad snax in your cart.

We also see that there is about an 80% chance of buying organic reduced fat omega-3 milk if you have already purchased whole organic omega 3 milk and a bag of organic bananas.

These relationships give grocers an idea of what items to place next to one another to increase chances of purchasing more products when in a store. 

#### Association Rules by Lift
Let's see the top 10 association rules that have the strongest lift (meaning the items on the right are purchased far more often when the items on the left are present than they are overall).

In [124]:
rules.sort_values('lift', ascending=False).head(10)

Unnamed: 0,antecedants,consequents,support,confidence,lift
636,(Raspberry Vinaigrette Salad Snax),(Thousand Island Salad Snax),0.003018,1.0,268.22619
635,(Thousand Island Salad Snax),(Raspberry Vinaigrette Salad Snax),0.003728,0.809524,268.22619
247,(Enlightened Organic Raw Kombucha),(Spicy Avocado Hummus),0.003861,0.528736,165.457535
248,(Spicy Avocado Hummus),(Enlightened Organic Raw Kombucha),0.003196,0.638889,165.457535
717,"(Whole Organic Omega 3 Milk, Bag of Organic Ba...",(Organic Reduced Fat Omega-3 Milk),0.002619,0.79661,139.135068
718,(Organic Reduced Fat Omega-3 Milk),"(Whole Organic Omega 3 Milk, Bag of Organic Ba...",0.005725,0.364341,139.135068
716,"(Organic Reduced Fat Omega-3 Milk, Bag of Orga...",(Whole Organic Omega 3 Milk),0.002619,0.79661,138.064798
719,(Whole Organic Omega 3 Milk),"(Organic Reduced Fat Omega-3 Milk, Bag of Orga...",0.00577,0.361538,138.064798
651,(Total 2% Lowfat Greek Strained Yogurt with Pe...,(Total 2% Lowfat Greek Strained Yogurt With Bl...,0.005282,0.411765,88.356863
650,(Total 2% Lowfat Greek Strained Yogurt With Bl...,(Total 2% Lowfat Greek Strained Yogurt with Pe...,0.00466,0.466667,88.356863


Lift allows us to understand how strongly an antecedent is associated with a customer purchasing a consequent.  The higher the lift, the stronger the positive association.  

In our case, we have extremely high lift on items like salad snax, organic raw kombucha, spicy avocado hummus, and organic reduced fat omega-3 milk.  With high lift values, customers who purchase the antecedents are far more likely than average to purchase the consequents, so the consequents should be recommended when a customer purchases an antecedent. 

### Building a Recommender
We can query association rules to find product recommendations by filtering antecedants for a particular product.

We must note that because we didn't build the rules using our full dataset, we will have products missing.  This is expected. 

For our case, we will select 3 products known to be within our rule subset.  These are:

| product_id | name |
|---|---|
| 5077 | 100% Whole Wheat Bread |
| 35221 | Lime Sparkling Water |
| 47766 | Organic Avocado |

We first need to create the recommendation function.

In [133]:
def get_ar_recommendation(product_ids, rules):
    '''
        Given one or more purchased products (product_ids) and a list of association 
        rules (rules), find the rules that can be used to recommend items to purchase. 
        
        Parameters
        ==========
        product_ids: The product id(s) that have been purchased.  Could be a single id
                     or a list of ids
        
        rules: A list of association rules
        
        Returns
        =======
        The rules whose antecedant is one of the purchased products
    '''
    
    # filter the rules to only look at antecedants of size 1
    antecedants1_mask = rules.antecedants.apply(lambda x: len(x) == 1)
    rules = rules[antecedants1_mask]

    # check to see if we have a list of product_ids or just a single id
    if isinstance(product_ids, list):
        
        p_names = []
        
        # we have a list, so look up the name of each product
        for p in product_ids:
            name = products[products.product_id==p].product_name.iloc[0]
            p_names.append(name)
        
        target_item = frozenset(p_names)
    else:
        
        # we only have a single id
        name = products[products.product_id==product_ids].product_name.iloc[0]
        target_item = frozenset({name})

    # create the mask: keep rules whose (single item) antecedant is among the purchased products
    mask = [r.issubset(target_item) for r in rules.antecedants]
    
    # find our product recommendations
    #prod_names = {p_name for conseq in rules[mask].consequents for p_name in conseq}
    #recommendations = []

    # find the product ids for each recommendation name
    #for n in prod_names:
    #    p_id = products[products.product_name==n].product_id.iloc[0]
    #    recommendations.append((p_id, n))
    
    return rules[mask]

With our function created, we can check for recommendations

In [135]:
# get recommendations for 100% whole wheat bread
print('Recommendations for 100% Whole Wheat Bread:')
get_ar_recommendation(5077, rules).sort_values('lift', ascending=False)

Recommendations for 100% Whole Wheat Bread:


Unnamed: 0,antecedants,consequents,support,confidence,lift
4,(100% Whole Wheat Bread),(Limes),0.015667,0.135977,2.945871
5,(100% Whole Wheat Bread),(Organic Hass Avocado),0.015667,0.150142,2.320193
6,(100% Whole Wheat Bread),(Organic Strawberries),0.015667,0.169972,2.047931
3,(100% Whole Wheat Bread),(Banana),0.015667,0.263456,2.039838
2,(100% Whole Wheat Bread),(Bag of Organic Bananas),0.015667,0.161473,1.367212


After purchasing 100% whole wheat bread, a customer should be recommended items such as limes, avocados, strawberries, and bananas.  The lift values for each of these items are above 1, indicating that a customer who purchases whole wheat bread is more likely than average to purchase any of the suggested items.  

Examining the confidence values, we do need to be aware that there is a low confidence that our suggested items will actually be purchased.  Bananas have the highest confidence, with a probability of 26% of being purchased with whole wheat bread. 

In [136]:
# get recommendations for Lime Sparkling Water
print('Recommendations for Lime Sparkling Water:')
get_ar_recommendation(35221, rules).sort_values('lift', ascending=False)

Recommendations for Lime Sparkling Water:


Unnamed: 0,antecedants,consequents,support,confidence,lift
751,(Lime Sparkling Water),"(Sparkling Lemon Water, Sparkling Water Grapef...",0.013981,0.165079,51.658377
323,(Lime Sparkling Water),(Sparkling Lemon Water),0.013981,0.285714,31.868458
325,(Lime Sparkling Water),(Sparkling Water Grapefruit),0.013981,0.377778,18.265474
280,(Lime Sparkling Water),(Half & Half),0.013981,0.184127,8.198745
320,(Lime Sparkling Water),(Organic Hass Avocado),0.013981,0.146032,2.256681
321,(Lime Sparkling Water),(Organic Strawberries),0.013981,0.180952,2.180234
142,(Lime Sparkling Water),(Banana),0.013981,0.203175,1.573102


After purchasing lime sparkling water, a customer should be recommended items such as sparkling lemon water, sparkling grapefruit water, half & half, avocados, strawberries, and bananas.  

We see that the sparkling lemon and grapefruit waters have the highest confidence and lift values: Sparkling Water Grapefruit has the highest confidence (38%), while Sparkling Lemon Water has the highest lift of any single product (31.9).  This makes them the most likely to be purchased if recommended. 

In [137]:
# get recommendations for organic avocados
print('Recommendations for Organic Avocados:')
get_ar_recommendation(47766, rules).sort_values('lift', ascending=False)

Recommendations for Organic Avocados:


Unnamed: 0,antecedants,consequents,support,confidence,lift
326,(Organic Avocado),(Limes),0.052417,0.151566,3.2836
307,(Organic Avocado),(Large Lemon),0.052417,0.136325,2.956248
381,(Organic Avocado),(Organic Whole Milk),0.052417,0.104149,2.56457
147,(Organic Avocado),(Banana),0.052417,0.287892,2.229033
365,(Organic Avocado),(Organic Baby Spinach),0.052417,0.151566,2.183468
380,(Organic Avocado),(Organic Strawberries),0.052417,0.128704,1.550717
41,(Organic Avocado),(Bag of Organic Bananas),0.052417,0.146486,1.240314


After purchasing avocados, a customer should be recommended limes, lemons, whole milk, bananas, baby spinach, and strawberries. 

Limes have the highest lift value, but a customer is more likely to purchase bananas, as indicated by a confidence level of 29%.  

### Association Rules Conclusion
We've now seen how association rules can help a business understand which products customers typically purchase together, how likely those products are to be purchased together, and how strongly one purchase influences another.

With this information we were able to create product recommendations.  The recommendations can be tuned to represent a particular support, confidence or lift threshold.  However, we still have the problem that the recommendations are not personalized for each individual customer. 

In order to get this personalization, we can look at collaborative filtering. 
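
Before moving on, here is a small sketch of the threshold tuning mentioned in this conclusion.  The rows are copied from the Organic Avocado output above; the helper `filter_rules` and the threshold values (0.15 confidence, 2.0 lift) are assumptions chosen for illustration, not values used elsewhere in this notebook.

```python
import pandas as pd

# A few rules copied from the Organic Avocado output above.
toy_rules = pd.DataFrame({
    'antecedants': [frozenset({'Organic Avocado'})] * 4,
    'consequents': [
        frozenset({'Limes'}),
        frozenset({'Large Lemon'}),
        frozenset({'Banana'}),
        frozenset({'Bag of Organic Bananas'}),
    ],
    'confidence': [0.151566, 0.136325, 0.287892, 0.146486],
    'lift': [3.2836, 2.956248, 2.229033, 1.240314],
})


def filter_rules(rules, min_confidence=0.0, min_lift=1.0):
    """Keep rules that meet both thresholds, strongest lift first."""
    mask = (rules.confidence >= min_confidence) & (rules.lift >= min_lift)
    return rules[mask].sort_values('lift', ascending=False)


strong = filter_rules(toy_rules, min_confidence=0.15, min_lift=2.0)
print(strong[['consequents', 'confidence', 'lift']])
```

Raising the thresholds trades coverage for precision: fewer products receive recommendations, but the recommendations that remain are backed by stronger rules.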

## 3. Collaborative Filtering

*The following section on collaborative filtering could not have been done without reading many blogs and papers on the subject.  I've linked several sites below which were extremely helpful in understanding the implementation of implicit collaborative filtering.*

Yifan Hu, Yehuda Koren, Chris Volinsky
* [Collaborative Filtering for Implicit Feedback Datasets](http://yifanhu.net/PUB/cf.pdf)

Ben Frederickson
* [Distance Metrics for Fun and Profit](http://www.benfrederickson.com/distance-metrics/)
* [Faster Implicit Matrix Factorization](http://www.benfrederickson.com/fast-implicit-matrix-factorization/)
* [Finding Similar Music using Matrix Factorization](http://www.benfrederickson.com/matrix-factorization/)
* [Ben's implicit python implementation](https://github.com/benfred/implicit)

Victor Kohler
* [ALS implicit Collaborative Filtering](https://medium.com/@victorkohler/als-implicit-collaborative-filtering-5ed653ba39fe)


Collaborative Filtering is used to build personalized recommendation engines, where the recommendations are generated using the preferences of other users that are like you.  Data used for Collaborative Filtering can be classified as explicit or implicit:

**Explicit data** describes situations where we have item ratings
* I rate the movie 'Guardians of the Galaxy' 5 out of 5
* I rate this hard drive purchased on Amazon 1 out of 5

**Implicit data** describes situations where no ratings are available, and rather, we gather data based on user behaviour.  There are loads of ways to gather this information:
* Time spent watching a tv show
* Time spent reading an article
* Number of times a tv episode is watched
* Number of times a song is played

Instacart's data provides us with a snapshot of purchases made - this is implicit data.  We don't know how well a user likes a product, but we know how many times that user has purchased a product.  Based on these numbers, we can draw conclusions to say **what products should I recommend to a customer based on their prior purchase history?**.  We will also be able to answer **What other products are similar to this product?**


### Data Preparation
Our raw data allows us to see each product purchased for any given order.  We need to summarize this data so that we can see how many times a customer purchased a product.  We will first perform this data preparation before moving onto building our recommender.   

Note: The Instacart data is massive, representing over 200,000 customers.  To save on processing time, I've restricted the number of customers to 150,000.

In [104]:
# create mask for users - we will only process 150K
user_subset = orders.user_id.unique()[:150000]

# create orders subset
orders_subset = orders[orders.user_id.isin(user_subset)]

# combine our datasets in order to find product details
order_details_temp = pd.merge(orders_subset, order_details_prior, on='order_id')
order_details = pd.merge(order_details_temp, products, on='product_id')

The next step is to summarize our product purchases by user_id.

In [105]:
# Create order details summary - we are only interested in user_id, product_id and product_name
# group the dataframe by user_id & product_id, count the rows in each group as the
# purchase quantity, and then reset the index
order_details_summary = (
    order_details
    .groupby(['user_id', 'product_id', 'product_name'])
    .size()
    .reset_index(name='quantity')
)

# we now have a summary of each product purchased by a customer
order_details_summary.head()

Unnamed: 0,user_id,product_id,product_name,quantity
0,1,196,Soda,10
1,1,10258,Pistachios,9
2,1,10326,Organic Fuji Apples,1
3,1,12427,Original Beef Jerky,10
4,1,13032,Cinnamon Toast Crunch,3


Our data is now prepared - on to the next step.

## Sparse Matrix Creation
We will be using [Ben Frederickson's Implicit library](https://github.com/benfred/implicit) to create our recommendations.  Behind the scenes, the library uses Alternating Least Squares (ALS) & matrix factorization to create such recommendations.  

The input for our ALS algorithm will be a confidence sparse matrix.  A sparse matrix allows us to process massive amounts of data, as it stores only nonzero values.  The average customer will only ever purchase a small fraction of the unique products available, leaving large voids in a matrix.  As a result, implicit datasets are notoriously sparse. 

We will start by building a sparse matrix representing purchase frequency - **How many times did user A buy product X**?
* We will have product_ids representing the rows of the matrix and user_ids representing the columns.  The internal cells identify how many times a user purchased a product.  

With a purchase frequency matrix complete, we then calculate a confidence matrix - **How confident are we that user A will buy product X**?
* We assume that a user likes a product because they have purchased it at least once.  That assumption is weak for a single purchase: a user could buy an item once and never purchase it again because they hate it.  The confidence matrix ensures products which have been purchased more frequently receive a higher confidence rating.
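
To make the item-user layout concrete, here is a minimal sketch that builds a tiny purchase frequency matrix from made-up (product, user, count) triples.  The `toy_*` names and values are assumptions for illustration only.

```python
import numpy as np
import scipy.sparse as sparse

# Toy triples (made-up): product row index, user column index, purchase count
toy_items = np.array([0, 0, 1, 2])
toy_users = np.array([0, 2, 1, 2])
toy_counts = np.array([3.0, 1.0, 5.0, 2.0])

# Items are the rows and users are the columns, matching the layout described above
toy_matrix = sparse.coo_matrix((toy_counts, (toy_items, toy_users)), shape=(3, 3))

# Only the nonzero cells are stored
density = toy_matrix.nnz / (toy_matrix.shape[0] * toy_matrix.shape[1])

print(toy_matrix.toarray())
print('stored values: {}, density: {:.2f}'.format(toy_matrix.nnz, density))
```

Even in this 3 x 3 example more than half of the cells are empty; with tens of thousands of products the share of empty cells becomes far larger.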

Let's begin with creating a purchase frequency matrix -

### Purchase Frequency Matrix

The ALS algorithm expects our user_ids & product_ids to be sequential (e.g. 0, 1, 2, 3), because they are used as row and column positions in the sparse matrix.  In a real life setting, this is rarely going to occur within a dataset.  Products are often removed from a catalog and users often sign up but never actually create an order.  

To overcome this, we will convert user_id & product_ids into categorical columns and use the categorical codes associated to each value to feed into our ALS system.  The codes themselves are sequential.   

We are also going to convert quantity to a float value - which our ALS algorithm also expects. 

In [106]:
# convert user_id to category
order_details_summary['user_id'] = pd.Categorical(order_details_summary.user_id)

# convert product_id to category
order_details_summary['product_id'] = pd.Categorical(order_details_summary.product_id)

# change quantity to a float
order_details_summary['quantity'] = order_details_summary.quantity.astype(np.float32)

The issue with changing user_id & product_id into categories is that the category codes no longer match the original ids.  

The categorical value of '196' maps to the category code of 114.  Since we are feeding codes into the ALS rather than the categories, we need a way to go between categories and codes.  We will use the below helper functions to do this. 

In [107]:
def find_cat_product_id(real_id):
    '''
        Returns the categorical index for the specified product_id
    '''   
    return order_details_summary.product_id.cat.categories.get_loc(real_id)


def find_real_product_id(cat_id):
    '''
        Returns the product_id for the specified categorical index
    '''
    return order_details_summary.product_id.cat.categories[cat_id]


def find_cat_user_id(real_id):
    '''
        Returns the categorical index for the specified user_id
    '''
    return order_details_summary.user_id.cat.categories.get_loc(real_id)


def find_real_user_id(cat_id):
    '''
        Returns the user_id for the specified categorical index
    '''
    return order_details_summary.user_id.cat.categories[cat_id]
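
The round trip between original ids and category codes can be sketched on a few made-up product ids.  The `toy_*` names are assumptions for this example; `Index.get_loc` is used here as one way to look up the position of an id within the categories.

```python
import pandas as pd

# Toy product ids with gaps between them (made-up subset)
toy_ids = pd.Series([196, 10258, 196, 13032])
toy_cat = pd.Categorical(toy_ids)

# Codes are contiguous positions into the sorted categories
codes = toy_cat.codes            # one code per row
categories = toy_cat.categories  # sorted unique original ids

# Round trip: original id -> code -> original id
code_for_196 = categories.get_loc(196)
id_for_code = categories[code_for_196]

print(list(codes), code_for_196, id_for_code)
```

The codes are what get fed into the sparse matrix, while the categories let us translate a recommendation back into a real product_id.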

We have our data summarized to a point that we can begin to create our purchase frequency matrix.  If we feed in our entire dataset, how do we know that we are recommending valid products?  We need to think about how we will evaluate our recommender.  

One way to approach evaluation is by masking purchases from users in a training dataset.  We then build a model using our training dataset and predict recommendations for users with masked purchases.  If our recommender is good, we should be able to recommend products that were masked.

To accomplish this evaluation we need to create a training dataset and a model evaluator to score our recommendations. 
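
The scoring idea behind this kind of evaluation can be sketched in a few lines of plain Python.  The function `hit_rate_at_k` and the toy recommendations are hypothetical and only illustrate the metric: the fraction of masked purchases that show up in a user's top k recommendations.

```python
def hit_rate_at_k(recommendations, masked, k=10):
    '''
        Fraction of masked (user, product) pairs whose product appears in
        that user's top-k recommendations.
    '''
    hits = 0
    for user, product in masked:
        if product in recommendations.get(user, [])[:k]:
            hits += 1
    return hits / len(masked)


# Hypothetical recommendations per user (made-up ids, best first)
toy_recs = {1: [196, 13176, 21903], 2: [24852, 47766, 26209]}

# Purchases that were hidden from the training data
toy_masked = [(1, 13176), (2, 5077)]

toy_score = hit_rate_at_k(toy_recs, toy_masked, k=10)
print(toy_score)
```

Lowering `k` makes the test stricter, which is the same idea as tracking top 2 matches alongside top 10 matches.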

#### Training Dataset & Purchase Frequency Matrix Creation

In [116]:
np.random.seed(42)

# find the index values of user-product pairs purchased more than 5 times
index_values = order_details_summary[order_details_summary.quantity > 5].index.values

# randomly select 20 of those records to remove
index_values_remove = np.random.choice(index_values, 20, replace=False)

# keep track of the records we dropped
removed_records = order_details_summary[order_details_summary.index.isin(index_values_remove)].copy()

# create training set by removing our selected index values from the dataset
df_train = order_details_summary.drop(index_values_remove, axis=0)

# create test dataset - which is the full dataset
df_test = order_details_summary.copy()

# get users
users = df_train.user_id.cat.codes

# get items
items = df_train.product_id.cat.codes

# create a sparse matrix.  The format is an items-users matrix, meaning items are the row values
sparse_product_user = sparse.coo_matrix((df_train.quantity, (items, users)))

#### Model Evaluation Creation 

In [109]:
def score_recommender(model, cui, masked_users, show_rec=False):
    '''
        Evaluate how accurately a recommender has performed.  It simply looks at how many masked products
        were correctly predicted. 
        
        Parameters
        ----------
        model: the recommender
        
        cui: the confidence matrix in user-item form
        
        masked_users: a dataframe of purchases that were removed from the dataset
        
        show_rec: toggle to indicate whether to print recommendations
        
        Returns
        -------
        masked_products: the number of products masked
        
        matched_products: the number of masked products found in the top 10 recommendations
        
        top2: the number of masked products found in the top 2 recommendations
        
        score: fraction of masked_products correctly predicted
    '''
    
    masked_products = len(masked_users)
    matched_products = 0
    top2 = 0
    
    for index, row in masked_users.iterrows():
        
        recommended_products = []        
        target_product = row.product_id
        
        # only print recommendations if requested
        if show_rec:

            # print user ID
            print()
            print('------------------------------------------')
            print('Recommendations for User {}'.format(row.user_id))
            print()

            # print product removed
            print('|Product Removed|')
            print('{} {} {}'.format(row.product_id, row.product_name, row.quantity))
            print()
        
            # print recommendations
            print('|Product Recommendations|')

        # loop through each recommendation
        for index_rec, score_rec in model.recommend(find_cat_user_id(row.user_id), cui, N=10):  

            # get real product id & name
            real_prod_id = find_real_product_id(index_rec)
            real_prod_name = products[products.product_id == real_prod_id].product_name.iloc[0]
            
            recommended_products.append(real_prod_id)

            # print if requested
            if show_rec:
                print('{} {} {}'.format(real_prod_id, real_prod_name, score_rec))
                    
        # check to see if we matched a product
        if target_product in recommended_products:
            matched_products+=1
            
        # check to see if we matched a product in the top 2
        if target_product in recommended_products[:2]:
            top2+=1

    # calculate score
    score = matched_products / masked_products
    
    
    return masked_products, matched_products, top2, score

### Confidence Matrix

As previously mentioned, our ratings are implied - meaning we assume that a user likes a product because they have purchased it once (or perhaps many times).  The confidence matrix ensures products which have been purchased more frequently receive a higher confidence rating.

There are multiple ways to calculate this confidence matrix.  I will test two different approaches - 

1. Confidence weighting as described in [Collaborative Filtering for Implicit Feedback Datasets](http://yifanhu.net/PUB/cf.pdf).  Basically, we multiply the current implied rating by an alpha value.  This ensures products which are purchased more frequently have a larger weighting than those that are purchased less frequently.  

2. Confidence weighting using Ben Frederickson's BM25 weighting algorithm ([as described in this post](http://www.benfrederickson.com/distance-metrics/)).  It more or less allows us to identify similarities between products.

We will validate models using both approaches and select a winner.
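
For the first approach, the paper writes the confidence for a user-item pair as c = 1 + alpha * r, where r is the observed count.  Here is a minimal sketch on a made-up matrix showing both the plain alpha scaling and the paper's form applied to the observed cells.  The `toy_counts` values and `alpha = 15` are illustrative assumptions.

```python
import numpy as np
import scipy.sparse as sparse

# Toy item-user purchase counts (made-up values)
toy_counts = sparse.csr_matrix(np.array([[1.0, 0.0, 10.0],
                                         [0.0, 3.0, 0.0]]))
alpha = 15  # illustrative value

# Scale the stored (nonzero) counts by alpha
scaled = (toy_counts * alpha).astype(np.double)

# Confidence as written in the paper, applied to the observed cells only
paper_conf = scaled.copy()
paper_conf.data = 1.0 + paper_conf.data

print(scaled.toarray())
print(paper_conf.toarray())
```

In both versions a product bought 10 times ends up with roughly ten times the weight of a product bought once, which is the behaviour we want from a confidence matrix.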



#### Confidence Weighting Using Alpha
We will test various values of alpha to see how they impact recommendations.  We will also tune the regularization and latent factors used by the ALS algorithm in our hunt for a better recommender.  

For each model we create, a score is determined and results stored in a dataframe.  The dataframe is sorted by score and printed to allow us to inspect the results. 
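
The bookkeeping for such a search can be sketched independently of the model.  In this hypothetical example `placeholder_score` stands in for fitting and scoring a real model, and the parameter values are a reduced, made-up grid.

```python
from itertools import product

alphas = [15, 20]
factor_counts = [100, 300]
regularizations = [1, 0.1]


def placeholder_score(alpha, factors, regularization):
    # Hypothetical stand-in: a real run would fit a model and score it here
    return factors / 1000 - regularization / 10 - alpha / 1000


results = []
for alpha, factors, reg in product(alphas, factor_counts, regularizations):
    results.append({
        'alpha': alpha,
        'factor': factors,
        'regularization': reg,
        'score': placeholder_score(alpha, factors, reg),
    })

# Sort so the best-scoring combination comes first
results.sort(key=lambda row: row['score'], reverse=True)
best = results[0]
print(best)
```

A list of dictionaries like `results` can be passed straight to `pd.DataFrame` if a tabular view is preferred.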

In [119]:
regularization_val = [1, 0.1, 0.01]
sparse_val = [15, 20, 25, 30, 35, 40, 45]
factor_val = [100, 150, 200, 225, 250, 300]

reg_ary=[]
alpha_ary=[]
factors_ary=[]
masked_products_ary=[]
matched_products_ary=[]
top2_ary=[]
score_ary=[]

# loop through alpha values
for s in sparse_val:

    # create confidence matrix
    ciu = (sparse_product_user * s).astype(np.double)
    ciu = ciu.tocsr()

    # create transpose of confidence matrix -used for predictions
    cui = ciu.T.tocsr()

    # loop through latent factor values
    for f in factor_val:
        
        # loop through regularization values
        for r in regularization_val:

            # reset our seed
            np.random.seed(42)

            # Create our ALS model
            model = AlternatingLeastSquares(factors=f, regularization=r)

            # fit the model
            model.fit(ciu)

            # score the model
            masked_products, matched_products, top2, score = score_recommender(model, cui, removed_records)

            # create an entry to store the score
            reg_ary.append(r)
            alpha_ary.append(s)
            factors_ary.append(f)
            masked_products_ary.append(masked_products)
            matched_products_ary.append(matched_products)
            top2_ary.append(top2)
            score_ary.append(score)

alpha_scores = pd.DataFrame({'regularization':reg_ary,
                             'alpha':alpha_ary,
                             'factor':factors_ary,
                             'masked_products':masked_products_ary,
                             'matched_products':matched_products_ary,
                             'top2':top2_ary,
                             'score':score_ary})


In [120]:
alpha_scores.sort_values('score', ascending=False).head(8)

Unnamed: 0,alpha,factor,masked_products,matched_products,regularization,score,top2
34,20,300,20,6,0.1,0.3,0
52,25,300,20,6,0.1,0.3,0
16,15,300,20,6,0.1,0.3,0
15,15,300,20,6,1.0,0.3,1
33,20,300,20,5,1.0,0.25,0
69,30,300,20,5,1.0,0.25,0
51,25,300,20,5,1.0,0.25,0
26,20,200,20,4,0.01,0.2,0


Looking at the results - our best score provides us with a 30% match (6 of 20 masked products).  Not great.  Four models tie at that score; the best of them appears to be where:
* alpha = 15
* factors = 300
* regularization = 1

It is the only one of the top-scoring models that placed a masked product in the top 2 recommendations.  Let's see how our weighting using bm25 performs. 

#### Confidence Weighting Using BM25
When it comes to using bm25_weight, parameter values of K1=100 and B=0.5 were found to work best when creating a confidence matrix.  We are not going to spend time on tuning these parameters.  We will still investigate tuning the regularization and latent factors parameters for our ALS algorithm. 

As we saw above, for each model we create, a score will be determined and results stored in a dataframe.  The dataframe will be sorted by score and printed to allow us to inspect the results.

In [117]:
regularization_val = [1, 0.1, 0.01]
factor_val = [100, 150, 175, 200, 225, 250, 300, 325, 350]

reg_ary=[]
alpha_ary=[]
factors_ary=[]
masked_products_ary=[]
matched_products_ary=[]
top2_ary=[]
score_ary=[]

# represent confidence item-user matrix and convert to csr matrix
ciu = bm25_weight(sparse_product_user, K1=100, B=0.5)
ciu = ciu.tocsr()

# represent confidence user-item matrix and convert to csr matrix.  We need a User-Product matrix for predicting.
cui = ciu.T.tocsr()

# loop through latent factor values
for f in factor_val:

    # loop through regularization values
    for r in regularization_val:

        # reset our seed
        np.random.seed(42)

        # Create our ALS model
        model = AlternatingLeastSquares(factors=f, regularization=r)

        # fit the model
        model.fit(ciu)

        # score the model
        masked_products, matched_products, top2, score = score_recommender(model, cui, removed_records)

        # create an entry to store the score
        reg_ary.append(r)
        factors_ary.append(f)
        masked_products_ary.append(masked_products)
        matched_products_ary.append(matched_products)
        top2_ary.append(top2)
        score_ary.append(score)

        
bm25_scores = pd.DataFrame({'reg':reg_ary,
                            'factor':factors_ary,
                            'masked_products':masked_products_ary,
                            'matched_products':matched_products_ary,
                            'top2':top2_ary,
                            'score':score_ary})

In [138]:
bm25_scores.sort_values('score', ascending=False).head(10)

Unnamed: 0,factor,masked_products,matched_products,reg,score,top2
0,100,20,3,1.0,0.15,2
14,225,20,3,0.01,0.15,3
25,350,20,3,0.1,0.15,3
24,350,20,3,1.0,0.15,3
23,325,20,3,0.01,0.15,3
22,325,20,3,0.1,0.15,3
21,325,20,3,1.0,0.15,3
20,300,20,3,0.01,0.15,3
19,300,20,3,0.1,0.15,2
18,300,20,3,1.0,0.15,2


Our bm25 weighting performed worse than our alpha weighting, matching only 3 of 20 masked products (15%) compared with 6 of 20.  However, when the bm25 weighting predicted the correct product, it usually placed it within the top 2 recommendations - that's very notable!

### Final Model
We found that alpha weighting produced the best results for our recommender.  We will proceed with building a final model to investigate our recommendations further by printing the actual recommendations for masked products. 

In [143]:
# create confidence matrix
ciu = (sparse_product_user * 15).astype(np.double)
ciu = ciu.tocsr()

# create transpose of confidence matrix -used for predictions
cui = ciu.T.tocsr()

In [144]:
# set the random seed
np.random.seed(42)

# Create our ALS model
als_model = AlternatingLeastSquares(factors=300, regularization=1)
als_model.approximate_recommend = False

# fit the model
als_model.fit(ciu)

In [146]:
# score the model
masked_products, matched_products, top2, score = score_recommender(als_model, cui, removed_records, show_rec=True)

print()
print('--------------------------------------------------------')
print('masked products: {} matched products: {} top2: {} score: {}'.format(masked_products, matched_products, top2, score))


------------------------------------------
Recommendations for User 11934

|Product Removed|
13176 Bag of Organic Bananas 9.0

|Product Recommendations|
13176 Bag of Organic Bananas 1.1618086199926427
12078 Shredded Mexican Blend Cheese 1.1496609292086672
27796 Real Mayonnaise 1.136446417603566
21903 Organic Baby Spinach 1.1298528963378787
30720 Sugar Snap Peas 1.1245117424279698
11759 Organic Simply Naked Pita Chips 1.085237612655447
23423 Original Hawaiian Sweet Rolls 1.0637301036659434
130 Vanilla Milk Chocolate Almond Ice Cream Bars Multi-Pack 1.06191272613793
20955 Cereal 1.0469151692372556
40939 Drinking Water 1.0338306865528504

------------------------------------------
Recommendations for User 14751

|Product Removed|
28985 Michigan Organic Kale 28.0

|Product Recommendations|
35887 Organic Mixed Vegetables 1.4989662055072495
22395 Tomato Sauce 1.374217882452223
32717 Baked Beans 1.3700633315900874
30489 Original Hummus 1.3522883177403788
29898 Organic Spaghetti Pasta 1.30739

26209 Limes 0.943624104840588
40377 Unrefined Virgin Coconut Oil 0.9408359513723606
34358 Garlic 0.9219440728708268
5876 Organic Lemon 0.9217793139754562
5134 Organic Thompson Seedless Raisins 0.9125711739397938

--------------------------------------------------------
masked products: 20 matched products: 6 top2: 1 score: 0.3


Building a recommendation model using collaborative filtering has allowed us to explore personalized recommendations.  We have thousands of products to choose from, yet we were able to create a recommender that places the masked product in its top 10 recommendations 30% of the time (6 of 20 masked products).  Pretty cool.  

Compared with a recommendation based purely on popularity, our final recommender suggests products that a customer has not purchased before but is more likely to buy.  

### Recommending Similar Items
We've seen how our model has performed recommending products for specific users.  We can also use our model to produce recommendations for similar items.  

For example, if I purchase Coca-Cola soda, I would expect similar products to also be carbonated beverages.  Think of it as **If you liked this product, we think you will also like this similar product**.  
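
One common way to compare items in a factorization model is cosine similarity between their latent factor vectors.  The sketch below uses made-up factors and a hypothetical helper, `toy_similar_items`; how the library computes similarity internally may differ, so treat this as an illustration of the idea only.

```python
import numpy as np

# Made-up latent factors for 4 items (rows) with 3 factors each
toy_item_factors = np.array([
    [0.9, 0.1, 0.0],   # item 0
    [0.8, 0.2, 0.1],   # item 1, close to item 0
    [0.0, 0.1, 0.9],   # item 2
    [0.1, 0.9, 0.1],   # item 3
])


def toy_similar_items(item_index, factors, n=2):
    '''Rank the other items by cosine similarity to the given item.'''
    norms = np.linalg.norm(factors, axis=1)
    scores = factors @ factors[item_index] / (norms * norms[item_index])
    ranked = np.argsort(-scores)
    # Drop the query item itself, which is always its own best match
    return [(int(i), float(scores[i])) for i in ranked if i != item_index][:n]


neighbours = toy_similar_items(0, toy_item_factors)
print(neighbours)
```

Items whose factor vectors point in nearly the same direction rank first, which is why the query item itself has to be skipped.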

We will use our association rules to investigate the types of products that are purchased together and check whether our recommender produces them.  We will use the same 3 products used in section 2 of this notebook:

| product_id | name |
|---|---|
| 5077 | 100% Whole Wheat Bread |
| 35221 | Lime Sparkling Water |
| 47766 | Organic Avocado |

First, we will create a function that prints out our recommendations from our model.

In [163]:
def cf_similar_items(item_id, model):
    '''
        Returns the top 10 similar products 
        
        Parameters
        ----------
        item_id: The item id to use for finding similar products
        
        model: The implicit CF recommender
    '''
    
    # loop through product recommendations - the first product is the same id, so we will skip
    for p_id, score in model.similar_items(find_cat_product_id(item_id), N=11)[1:]:

        # find product id & description
        real_p_id = find_real_product_id(p_id)
        real_p_desc = products[products.product_id == real_p_id].product_name.iloc[0]
        
        # print results
        print('{} {} {}'.format(real_p_id, real_p_desc, score))    

#### 100% Whole Wheat Bread
Let's have a look at product recommendations for 100% whole wheat bread.  If we first examine the association rules, we can get an idea of products that are frequently purchased with bread.

In [169]:
# look up association rules
get_ar_recommendation(5077, rules).sort_values('lift', ascending=False)

Unnamed: 0,antecedants,consequents,support,confidence,lift
4,(100% Whole Wheat Bread),(Limes),0.015667,0.135977,2.945871
5,(100% Whole Wheat Bread),(Organic Hass Avocado),0.015667,0.150142,2.320193
6,(100% Whole Wheat Bread),(Organic Strawberries),0.015667,0.169972,2.047931
3,(100% Whole Wheat Bread),(Banana),0.015667,0.263456,2.039838
2,(100% Whole Wheat Bread),(Bag of Organic Bananas),0.015667,0.161473,1.367212


In [170]:
# find recommendations for 100% Whole Wheat Bread
print('Recommendations for 100% Whole Wheat Bread:')
print()
cf_similar_items(5077, als_model)

Recommendations for 100% Whole Wheat Bread:

3896 Organic Honey Sweet Whole Wheat Bread 0.3014301929240482
12756 Multigrain Oat Bread 0.2996927791215542
17706 Organic Whole Grain Wheat English Muffins 0.26599229609627456
46993 Organic Bakery Hamburger Buns Wheat - 8 CT 0.24667924187835344
28699 Organic 7 Grain with Flax Bread 0.22651880099481583
10804 Organic Cinnamon Raisin Bread 0.218794337744165
3957 100% Raw Coconut Water 0.198955571299454
36591 100% Whole Wheat Medium Soft Taco Flour Tortillas 0.1926325700808251
36717 Double Fiber Bread 0.19231412423626398
29299 Dutch Country Smooth Texture 100% Whole Wheat Bread 0.1870830532297801


As we can see, we receive items that are most similar to whole wheat bread.  The majority of recommendations are bakery products, rather than items that are most frequently purchased with bread as indicated in our association rules. 
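
When reading the association rules, recall that lift compares how often the consequent appears alongside the antecedent with how often it appears overall: lift(A -> B) = confidence(A -> B) / support(B).  The numbers in this small sketch are made up for illustration.

```python
def lift(confidence, consequent_support):
    '''lift(A -> B) = confidence(A -> B) / support(B)'''
    return confidence / consequent_support


# Made-up numbers: B appears in 5% of all baskets,
# but in 15% of the baskets that contain A
toy_lift = lift(confidence=0.15, consequent_support=0.05)
print(round(toy_lift, 2))
```

A lift above 1 means the two products are bought together more often than we would expect if they were independent, which describes complementary purchases rather than similar products.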

#### Lime Sparkling Water
Let's have a look at product recommendations for lime sparkling water.  If we first examine the association rules, we can get an idea of products that are frequently purchased with sparkling water.

In [167]:
# look up association rules where lime sparkling water is the antecedant, sorted by lift
get_ar_recommendation(35221, rules).sort_values('lift', ascending=False)

Unnamed: 0,antecedants,consequents,support,confidence,lift
751,(Lime Sparkling Water),"(Sparkling Lemon Water, Sparkling Water Grapef...",0.013981,0.165079,51.658377
323,(Lime Sparkling Water),(Sparkling Lemon Water),0.013981,0.285714,31.868458
325,(Lime Sparkling Water),(Sparkling Water Grapefruit),0.013981,0.377778,18.265474
280,(Lime Sparkling Water),(Half & Half),0.013981,0.184127,8.198745
320,(Lime Sparkling Water),(Organic Hass Avocado),0.013981,0.146032,2.256681
321,(Lime Sparkling Water),(Organic Strawberries),0.013981,0.180952,2.180234
142,(Lime Sparkling Water),(Banana),0.013981,0.203175,1.573102


In [173]:
# find recommendations for Lime Sparkling Water
print('Recommendations for Lime Sparkling Water:')
print()
cf_similar_items(35221, als_model)

Recommendations for Lime Sparkling Water:

21709 Sparkling Lemon Water 0.6707477206205216
44632 Sparkling Water Grapefruit 0.5851092385673125
49520 Orange Sparkling Water 0.4716499851544316
20119 Sparkling Water Berry 0.398525697448478
26620 Peach Pear Flavored Sparkling Water 0.38771802703762615
49191 Cran Raspberry Sparkling Water 0.340638462164059
30353 Curate Cherry Lime Sparkling Water 0.3366729272135947
39947 Blackberry Cucumber Sparkling Water 0.30303332142889755
12576 Kiwi Sandia Sparkling Water 0.29619769591919787
14947 Pure Sparkling Water 0.2919768930758159


We find that all of our similar products tend to be flavoured carbonated water - which one would expect from a proper recommender!

#### Organic Avocados
Let's have a look at product recommendations for organic avocados.  If we first examine the association rules, we can get an idea of products that are frequently purchased with avocados.

In [174]:
get_ar_recommendation(47766, rules).sort_values('lift', ascending=False)

Unnamed: 0,antecedants,consequents,support,confidence,lift
326,(Organic Avocado),(Limes),0.052417,0.151566,3.2836
307,(Organic Avocado),(Large Lemon),0.052417,0.136325,2.956248
381,(Organic Avocado),(Organic Whole Milk),0.052417,0.104149,2.56457
147,(Organic Avocado),(Banana),0.052417,0.287892,2.229033
365,(Organic Avocado),(Organic Baby Spinach),0.052417,0.151566,2.183468
380,(Organic Avocado),(Organic Strawberries),0.052417,0.128704,1.550717
41,(Organic Avocado),(Bag of Organic Bananas),0.052417,0.146486,1.240314


In [175]:
# find recommendations for organic avocados
print('Recommendations for Organic Avocados:')
print()
cf_similar_items(47766, als_model)

Recommendations for Organic Avocados:

47626 Large Lemon 0.5940473660665583
47209 Organic Hass Avocado 0.5699193845702325
26209 Limes 0.5543068053152882
21903 Organic Baby Spinach 0.5056744170037557
23341 Large Grade AA Eggs 0.49424532161720414
24852 Banana 0.48760331854138234
24964 Organic Garlic 0.4388062583849976
22935 Organic Yellow Onion 0.431076150172965
10151 Bolthouse Baby Carrots 0.38456225238825026
29921 Cheese Finely Shredded Mexican Four Cheese Blend 0.37914190139776754


We successfully recommend an alternative avocado (Organic Hass Avocado), although I would have expected it to be the number one recommendation rather than the second.  Nevertheless, we see other organic and citrus produce in our recommendations. 

## 4. Conclusion

In this document we explored 3 different approaches for building product recommendations.

Global recommendations give us insight into the most popular product pairings.  They are not useful for providing similar product recommendations or personalized product recommendations.  

Association rules provide us with insight into how often product itemsets are purchased together, as well as the influence that an itemset has on purchasing related itemsets.  Association rules are extremely expensive to create and still suffer from being unable to produce personalized product recommendations.   

Finally, collaborative filtering offers us the best of both worlds - it allows us to produce similar product recommendations and personalized product recommendations.  It also allows us to utilize a large dataset to create a model in a fairly short amount of time.  It is the best option for recommendations that we've explored in this notebook.   

Why are recommenders such as the one we created using collaborative filtering important to companies like Instacart?  It allows users to discover more products that they could truly be interested in, but would never find on their own.  This increases sales and improves the customer experience.  As customers begin to trust the recommendations, they are more likely to discover additional items and increase basket size.  On 