## Loblaw Digital 

## Data Challenge : Product Recommendation

### Data 

1. transactions.txt (each line is an e-commerce grocery order in JSON format. Together, they constitute 1.4 million transactions randomly sampled across a time period.)

2. products.txt (tab-separated product ID, MCH code (see below), and product name)

3. mch_categories.tsv (Merchandise Hierarchy Category (MCH) Codes, which, as the name implies, constitute a hierarchy of product categories.)

### Acknowledgements

1. LightFM library https://github.com/lyst/lightfm

2. Recommendation System in python with LightFm https://towardsdatascience.com/recommendation-system-in-python-lightfm-61c85010ce17

3. If you can't measure it, you can't improve it https://towardsdatascience.com/if-you-cant-measure-it-you-can-t-improve-it-5c059014faad

### Contents :

#### 1. Importing libraries and loading the data
#### 2. Building the baseline recommendation model based on co-purchase frequency
#### 3. A function that takes a product ID and returns, using your model, the ID and name of the top 5 recommended products
#### 4. Proposed metric for this baseline model
#### 5. Another model using LightFM based on frequency of purchase of items

In [1]:
# Importing libraries


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from scipy.sparse import coo_matrix # for constructing sparse matrix

# lightfm 
from lightfm import LightFM # model
from lightfm.evaluation import auc_score

# timing
import time



In [2]:
# Loading products data

product_header_list = ["item", "MCH_Code", "Product_Name"]
products = pd.read_csv('products.txt', sep= '\\t', engine = 'python', names = product_header_list)
products.head()

Unnamed: 0,item,MCH_Code,Product_Name
0,20000002_EA,M10210701,Tuna Chunks in Broth
1,20000005_EA,M02270201,Fresh-Pressed Sweet Apple Cider
2,20000053_EA,M10210901,French Dijon Mustard
3,20000056001_KG,M02270304,Anaheim Peppers
4,20000068_KG,M05350101,Swiss Cheese


In [3]:
# Loading category data

mch_categories = pd.read_csv('mch_categories.tsv', sep= '\\t', engine = 'python')
mch_categories.head()

Unnamed: 0,code,name
0,M02,Produce
1,M0227,Produce
2,M022701,Fruit
3,M02270101,Apples
4,M02270102,Bananas


Since transactions.txt holds about 1.4 million transactions, loading all of it into a single pandas dataframe was not practical on my machine.
Hence I chose to work with a sample of the transaction data (the first 8,000 transactions in the file), just to show what my approach would be for a problem of this type.

In [4]:
%%time
# loading transaction data

import json
all_transactions = []
with open('transactions.txt') as f:
    limit = 8000
    for line in f:  # iterate lazily instead of reading every line into memory first
        if limit == 0:
            break
        else:
            limit -=1
        json_data = json.loads(line)
        json_items = json_data['itemList']
        for each in json_items:
            each['customer'] = json_data['customer']
            each['date'] = json_data['date']
            each['store'] = json_data['store']
        all_transactions.extend(json_items)

df = pd.DataFrame(all_transactions)
df.head(10)

Wall time: 10.6 s


Unnamed: 0,item,price,quantity,customer,date,store
0,20126907_EA,1.88,1.0,10,2017.04.06 12:09:32,825f9cd5f0390bc77c1fed3c94885c87
1,20185742_EA,0.99,1.0,10,2017.04.06 12:09:32,825f9cd5f0390bc77c1fed3c94885c87
2,20138681_EA,1.79,1.0,10,2017.04.06 12:09:32,825f9cd5f0390bc77c1fed3c94885c87
3,20049778001_EA,2.47,1.0,10,2017.04.06 12:09:32,825f9cd5f0390bc77c1fed3c94885c87
4,20419715007_EA,3.33,1.0,10,2017.04.06 12:09:32,825f9cd5f0390bc77c1fed3c94885c87
5,20321434_EA,2.47,1.0,10,2017.04.06 12:09:32,825f9cd5f0390bc77c1fed3c94885c87
6,20068076_KG,28.24,10.086,10,2017.04.06 12:09:32,825f9cd5f0390bc77c1fed3c94885c87
7,20022893002_EA,1.77,1.0,10,2017.04.06 12:09:32,825f9cd5f0390bc77c1fed3c94885c87
8,20299328003_EA,1.25,1.0,10,2017.04.06 12:09:32,825f9cd5f0390bc77c1fed3c94885c87
9,20132638_KG,3.33,0.28,100,2017.01.10 12:59:09,a666587afda6e89aec274a3657558a27


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68767 entries, 0 to 68766
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   item      68767 non-null  object 
 1   price     68767 non-null  float64
 2   quantity  68767 non-null  float64
 3   customer  68767 non-null  int64  
 4   date      68767 non-null  object 
 5   store     68767 non-null  object 
dtypes: float64(2), int64(1), object(3)
memory usage: 3.1+ MB


In [6]:
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70771 entries, 0 to 70770
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   item          70771 non-null  object
 1   MCH_Code      70771 non-null  object
 2   Product_Name  70771 non-null  object
dtypes: object(3)
memory usage: 1.6+ MB


In [7]:
mch_categories.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 916 entries, 0 to 915
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   code    916 non-null    object
 1   name    877 non-null    object
dtypes: object(2)
memory usage: 14.4+ KB


### Please implement a baseline recommendation model based purely on co-purchase frequency

To build this baseline model, we first need to transform our dataframe into an appropriate format. Given an input item, the model outputs the items that were bought most frequently with it. Essentially, it is a form of item-item filtering. We first create the item-transaction matrix, in which the rows are items and the columns are transactions. The value at item i and transaction j is 1 if that item was bought in the transaction, and 0 otherwise.

Such a matrix (dataframe) can be created using pandas pivot_table. 

Also note that we assume the timestamp to be a unique identifier of a transaction.
For example, in df.head(10) above, rows 0 to 8 occurred in the same transaction, under the same timestamp.
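The timestamp assumption breaks if two customers check out in the same second. A minimal sketch of a stricter transaction key, on toy rows that are made up for illustration (the `transaction_id` column name is my own, not part of the dataset):

```python
import pandas as pd

# Toy rows: two different customers happen to check out at the same second.
toy = pd.DataFrame({
    "item":     ["A", "B", "C", "A"],
    "customer": [10, 10, 11, 11],
    "date":     ["2017.04.06 12:09:32"] * 4,
    "store":    ["s1", "s1", "s2", "s2"],
})

# The timestamp alone sees one transaction.
n_by_date = toy["date"].nunique()

# A composite key (customer, date, store) sees two.
toy["transaction_id"] = toy.groupby(["customer", "date", "store"]).ngroup()
n_by_key = toy["transaction_id"].nunique()

print(n_by_date, n_by_key)
```

If collisions turn out to matter on the real data, `transaction_id` could be used in place of `date` as the column key of the pivot table.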

In [9]:
# item-transaction matrix

df_trans_item = df[['item', 'date', 'customer']]
trans_item = df_trans_item.pivot_table('customer', index = 'item', columns = 'date', aggfunc='count')

trans_item.fillna(0,inplace=True)

trans_item.columns.name = None

trans_item.head()

item,2017.01.01 11:06:25,2017.01.01 16:30:02,2017.01.01 16:43:08,2017.01.01 20:11:38,2017.01.02 07:33:44,2017.01.02 09:28:27,2017.01.02 09:47:38,2017.01.02 11:04:49,2017.01.02 11:12:37,2017.01.02 11:33:22,...,2017.06.30 18:15:34,2017.06.30 18:35:28,2017.06.30 19:35:37,2017.06.30 19:46:45,2017.06.30 19:50:44,2017.06.30 19:53:09,2017.06.30 20:18:30,2017.06.30 21:32:44,2017.06.30 22:17:14,2017.06.30 22:55:59
20000005_EA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20000093_EA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20000172002_EA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20000177_EA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20000207001_EA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


From this we will create a new dataframe to store the item-item co-purchase frequency. 

Linear Algebra Alert! This can be done by multiplying trans_item by its transpose: entry (i, j) of the product counts the transactions that contain both item i and item j.
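To see why the product gives co-purchase counts, here is a tiny worked example with three made-up items and three made-up transactions (toy data, not taken from transactions.txt):

```python
import numpy as np

# Rows are items (A, B, C); columns are transactions (t1, t2, t3).
# A and B are bought together in t1 and t3; C is bought alone in t2.
B = np.array([
    [1, 0, 1],   # A
    [1, 0, 1],   # B
    [0, 1, 0],   # C
])

# Entry (i, j) sums, over transactions, the product of the two indicator rows,
# so it counts the transactions that contain both item i and item j.
co = B @ B.T
print(co)
```

The diagonal holds each item's own purchase count, which is why an item always shows up among its own top co-purchases and has to be dropped afterwards.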

In [10]:
#transpose dataframe
trans_item_transpose =  trans_item.T

#product of two dataframes
df_co_purchase_freq = trans_item.dot(trans_item_transpose)

df_co_purchase_freq.head()

item,20000005_EA,20000093_EA,20000172002_EA,20000177_EA,20000207001_EA,20000207002_EA,20000330_EA,20000356_EA,20000368_EA,20000433_EA,...,21033570_KG,21033742_EA,21033751_EA,21035875_KG,21035960001_EA,21036130_EA,21036172001_KG,21037948001_EA,21039072_EA,21040856_EA
20000005_EA,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
20000093_EA,0.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20000172002_EA,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20000177_EA,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20000207001_EA,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Given a product, consider the column representing that product. The top 5 recommended products can be obtained from the index of the 6 largest values in that column (6 because the maximum will usually be the product itself, which we ignore).

The following function takes a product ID and returns, using our model, the ID and name of the top 5 recommended
products.

In [11]:
#We want a function that returns the product and the product name as well

def recommend_prod(product, k):
    #create empty dataframe
    new_df = pd.DataFrame(columns = ['top_product_id', 'top_product_name'])
    
    #product ids of products with highest co-purchase frequency values
    top_prod = df_co_purchase_freq.sort_values([product], ascending = False).head(k)[product]
    top_prod_id = top_prod.index.tolist()
    
    #get the names of the products from the products dataframe
    top_prod_name = []
    for p_id in top_prod_id:
        name = products[products['item'] == p_id]['Product_Name'].values[0]
        top_prod_name.append(name)
    
    #fill in the values with the product id and product names and return them 
    new_df['top_product_id'] = top_prod_id
    new_df['top_product_name'] = top_prod_name
    return new_df[new_df['top_product_id'] != product]
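recommend_prod relies on being called with k = 6 and then filtering out the query product. A possible variant, sketched here on a made-up 3x3 co-purchase table, drops the product first so that exactly k items come back even when ties push the product itself out of the top rows. The helper name `top_k_copurchased` is hypothetical:

```python
import pandas as pd

def top_k_copurchased(co_freq, product, k=5):
    """Return the k items most often co-purchased with `product`.

    `co_freq` is a square item-by-item DataFrame of co-purchase counts.
    """
    column = co_freq[product].drop(labels=product)  # remove the item itself
    return column.nlargest(k)

# Toy co-purchase counts (symmetric, diagonal = own purchase count).
toy_co = pd.DataFrame(
    [[3, 2, 0],
     [2, 4, 1],
     [0, 1, 2]],
    index=["A", "B", "C"], columns=["A", "B", "C"],
)

top = top_k_copurchased(toy_co, "B", k=2)
print(top)
```

The product names could then be attached with a single merge against the products dataframe instead of a per-item lookup.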

In [12]:
display(recommend_prod('20000005_EA', 6))

Unnamed: 0,top_product_id,top_product_name
1,20175355001_KG,"Bananas, Bunch"
2,20640512001_EA,"Cracker Chips, BBQ"
3,20618577_EA,Social Tea Biscuits
4,20870907002_EA,Roasted Garlic & Herb Seasoning
5,21025435_EA,Gourmet Multigrain Bun


### What are the item ID and name of the top 5 co-purchased items for item with ID 20592676_EA ?

In [13]:
print('For Product ID 20592676_EA :', products[products['item'] == '20592676_EA']['Product_Name'].values[0],
      ' the top 5 recommended items are ')
recommend_prod('20592676_EA', 6)

For Product ID 20592676_EA : Celebration Cupcakes, Chocolate  the top 5 recommended items are 


Unnamed: 0,top_product_id,top_product_name
1,20312788001_EA,"Sour Cream Dip, French Onion"
2,20545632_EA,Hungry Man Beer Battered Chicken
3,20798572_EA,Party Mix Puffs Barnyard Bonanza Cat Treats
4,20313741003_C06,Diet Coke
5,20216046003_EA,Party Mix Original Crunch Cat Treats


This partly makes sense: celebration cupcakes seem like a party item, and Diet Coke and sour cream dip are plausible party must-haves as well. Note, however, that the two Party Mix items are cat treats, so their appearance more likely reflects the small sample than a party theme.

### What are the item ID and name of the top 5 co-purchased items for item with ID 20801754003_C15?

In [14]:
%%time
print('For Product ID 20801754003_C15 :', products[products['item'] == '20801754003_C15']['Product_Name'].values[0],
      ' the top 5 recommended items are ')
recommend_prod('20801754003_C15', 6)

For Product ID 20801754003_C15 : 7 Up  the top 5 recommended items are 
Wall time: 8.49 s


Unnamed: 0,top_product_id,top_product_name
1,20628488001_EA,"Salad Dressing, Caesar"
2,20308043001_EA,"Jam, Pure Strawberry"
3,20143381001_KG,Roma Tomatoes
4,20864961001_EA,Mandarin Oranges
5,20297818004_EA,"Margarine, Salt Free"


That does not look very convincing, since our model seems to think that 7 Up goes with salad dressing, jam, tomatoes, oranges, and margarine. These look more like items one would keep in the fridge on an everyday basis. As a customer, I would have wanted more soft drinks recommended when the input is 7 Up. But this is just a baseline model, and it has been trained on only 8,000 transactions rather than the full dataset, so this is not too bad.

We will soon go ahead and build a better recommendation model.

### How would you deploy and serve the model?

The model we just built is a baby model. To work with the full transaction data, we would need a tool that scales beyond a single in-memory dataframe, such as PySpark. I would choose to build a simple app with Streamlit to deploy the model. Given more time, I would instead serve the model from a Flask Python script with some HTML templates, developed in a local virtual environment.

To serve the model, I would use Amazon Web Services or Google Cloud.
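For the co-purchase counting step specifically, a dense pivot table is not strictly required. As a rough sketch (an alternative I have not run on the full file), the pairs can be counted while streaming the file one order at a time. The in-memory `fake_file` below stands in for `open('transactions.txt')`, and only the `itemList` and `item` fields from the real format are assumed:

```python
import io
import json
from collections import Counter
from itertools import combinations

def count_copurchases(lines):
    """Count unordered item pairs that appear in the same order."""
    pair_counts = Counter()
    for line in lines:
        order = json.loads(line)
        # De-duplicate and sort so each pair has one canonical orientation.
        items = sorted({entry["item"] for entry in order["itemList"]})
        pair_counts.update(combinations(items, 2))
    return pair_counts

# Stand-in for the real file: three tiny orders.
fake_file = io.StringIO(
    '{"itemList": [{"item": "A"}, {"item": "B"}]}\n'
    '{"itemList": [{"item": "A"}, {"item": "B"}, {"item": "C"}]}\n'
    '{"itemList": [{"item": "C"}]}\n'
)

pairs = count_copurchases(fake_file)
print(pairs.most_common(1))
```

Memory then grows with the number of distinct co-purchased pairs rather than with items times transactions. Whether that fits on one machine for 1.4 million orders is something I would have to measure.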

### Propose metrics to measure performance of the model. Justify your choices using example applications of this model

There are various methods to evaluate a recommender system.

One of those methods would be to run an A/B test with this model.

Firstly, this A/B test would involve presenting recommendations from this baseline model to (say) half of the users (GROUP A) and recommendations from the current model on the website to the other half of the users (GROUP B).

Then, run the test for (say) 50,000 transactions in each group.

Then measure the MAP (Mean average precision) in each group and run a hypothesis test to see if the change in the mean average precision is significant.

If the change is significant and MAP is higher in Group A, that would imply that our model performed well.
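The hypothesis test could take several forms. One simple option, sketched below under the assumption that we have one average-precision score per transaction in each group, is a one-sided permutation test on the difference of means. The scores are invented purely for illustration:

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """One-sided p-value for the hypothesis mean(group_a) > mean(group_b)."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign scores to the two groups
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if diff >= observed:
            hits += 1
    return hits / n_permutations

# Hypothetical per-transaction average-precision scores.
ap_group_a = [0.62, 0.55, 0.70, 0.66, 0.58, 0.64, 0.69, 0.61]
ap_group_b = [0.41, 0.48, 0.39, 0.52, 0.45, 0.43, 0.50, 0.47]

p_value = permutation_p_value(ap_group_a, ap_group_b)
print(p_value)
```

A small p-value says the gap between the groups is unlikely to come from random assignment alone. A two-sample t-test would be a reasonable alternative with 50,000 transactions per group.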

#### Mean average precision

Recommender system precision $= \dfrac{\text{number of relevant recommendations}}{\text{total number of recommended items}}$

Precision at cutoff k, P(k), is simply the precision calculated by considering only the subset of your recommendations from rank 1 through k.

If we are asked to recommend N items and the number of relevant items in the full space of items is m, then:

Average precision $AP@N = \dfrac{1}{m} \sum_{k=1}^{N} P(k) \cdot rel(k)$

where rel(k) is just an indicator that says whether the kth item was relevant (rel(k)=1) or not (rel(k)=0).

Averaging this AP@N over all transactions gives the mean average precision.
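The definition above translates almost line by line into code. This is a minimal sketch on made-up recommendation lists; it divides by m (the number of relevant items) exactly as in the formula, whereas some libraries divide by min(m, N) instead:

```python
def average_precision(recommended, relevant):
    """AP for one ranked list: (1/m) * sum over k of P(k) * rel(k)."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for k, item in enumerate(recommended, start=1):
        if item in relevant:          # rel(k) = 1
            hits += 1
            total += hits / k         # P(k) = hits so far / k
    return total / len(relevant)

def mean_average_precision(all_recommended, all_relevant):
    """Average the per-transaction AP scores."""
    scores = [average_precision(r, t)
              for r, t in zip(all_recommended, all_relevant)]
    return sum(scores) / len(scores)

# Two toy transactions: ranked recommendations and what was actually bought.
recs = [["a", "b", "c", "d", "e"], ["x", "y"]]
bought = [{"a", "c"}, {"y"}]

map_score = mean_average_precision(recs, bought)
print(round(map_score, 4))
```

For the first list the hits are at ranks 1 and 3, so AP = (1/1 + 2/3) / 2; for the second the single hit is at rank 2, so AP = 1/2.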

Depending on the time at our disposal, we might want to run the A/B test for a longer or shorter time frame. This will depend on various business factors.

#### AUC Score

Another metric to evaluate the model would be using the AUC Score of the model. 

AUC computes the area under the ROC (Receiver Operating Characteristic) curve for classification problems; the larger the number, the better the recommender.

The ROC curve plots the true positive rate against the false positive rate. We want a classifier that correctly identifies as many positive instances as are available, with a very low percentage of negative instances incorrectly classified as positive. The ROC curve of a perfect classifier goes straight up the y-axis (true positive rate) to 1 and then runs horizontally across the top of the plot as the false positive rate increases. A classifier with no power will sit on the x=y diagonal.
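AUC also has a ranking interpretation that is handy for recommenders: it equals the probability that a randomly chosen positive item is scored higher than a randomly chosen negative one. A small sketch that computes it directly from that definition, on invented scores:

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy model scores for four items; label 1 = the customer bought the item.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]

auc = pairwise_auc(scores, labels)
print(auc)
```

Three of the four positive-negative pairs are ordered correctly here, so the AUC is 0.75. The double loop is quadratic and only meant to show the definition, not to be used at scale.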

### Please build a model that is better than the pure co-purchase frequency model you just built.

We plan on building a recommendation system based on how often each customer has purchased each item.

We will use a pure collaborative filtering method with the LightFM library.

While reading about recommender systems, one thing I found missing in many articles was a clear metric to evaluate the performance of the model. I chose LightFM because it provides clear metrics, such as the AUC score and precision at cutoff k, that help evaluate the performance of the trained model. This is very useful when working towards better models and higher accuracy. We are going to use the AUC score in this case, since it measures the quality of the overall ranking and can be interpreted as the probability that a randomly chosen positive item is ranked higher than a randomly chosen negative item.


### LightFM library

“LightFM is a Python implementation of a number of popular recommendation algorithms for both implicit and explicit feedback, including efficient implementation of BPR and WARP ranking losses. It’s easy to use, fast (via multithreaded model estimation), and produces high quality results.” (LightFM documentation, https://github.com/lyst/lightfm)


To use the LightFM library we need a sparse matrix called the user-item interaction matrix.

The user-item interaction matrix describes the interaction between each user (customer) and each item (product); the classic example is movie ratings given by customers. In the grocery case, however, the historical data contains no explicit ratings. Instead, I implicitly take the “number of times an item was bought” as the rating: if customer A bought product B 10 times, then we say customer A rated product B with rating 10. One could also use binary ratings, where 1 means customer A has bought product B and 0 means they never have. The user-item interaction matrix represents the collaborative filtering contribution to the model.

The first step is constructing the user-item interaction matrix. LightFM expects a sparse matrix, which can be constructed using coo_matrix from scipy.sparse, so users and items must first be converted into integer indices. Therefore, I build the user-item interaction matrix with each customer ID converted into a row index and each product name converted into a column index.
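Before the real helper functions, here is the same idea on three made-up purchase records (customer, product, count), to show how IDs become row and column indices:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy (customer, product, purchase count) records; values are invented.
purchases = [(10, "Bananas", 3), (10, "Milk", 1), (13, "Milk", 2)]

# Map raw IDs to consecutive integer indices.
user_index = {u: i for i, u in enumerate(sorted({u for u, _, _ in purchases}))}
item_index = {p: j for j, p in enumerate(sorted({p for _, p, _ in purchases}))}

rows = np.array([user_index[u] for u, _, _ in purchases])
cols = np.array([item_index[p] for _, p, _ in purchases])
vals = np.array([c for _, _, c in purchases], dtype=np.float32)

# One row per customer, one column per product, counts as values.
interactions = coo_matrix(
    (vals, (rows, cols)), shape=(len(user_index), len(item_index))
)
dense = interactions.toarray()
print(dense)
```

Only the three non-zero entries are stored, which is what makes this representation workable when most customers have bought only a tiny fraction of the catalogue.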

In [15]:
# Some helper functions

def get_user_list(df, user_column):
    """
    
    creating a list of user from dataframe df, user_column is a column 
    consisting of users in the dataframe df
    
    """
    
    return np.sort(df[user_column].unique())

def get_item_list(df, item_name_column):
    
    """
    
    creating a list of items from dataframe df, item_name_column is a column 
    consisting of items in the dataframe df
    
    returns the array of unique items
    
    """
    
    item_list = df[item_name_column].unique()
    
    
    return item_list

# creating user_id, item_id

def id_mappings(user_list, item_list):
    """
    
    Create id mappings to convert user_id and item_id to integer indices and back
    
    """
    user_to_index_mapping = {}
    index_to_user_mapping = {}
    for user_index, user_id in enumerate(user_list):
        user_to_index_mapping[user_id] = user_index
        index_to_user_mapping[user_index] = user_id
        
    item_to_index_mapping = {}
    index_to_item_mapping = {}
    for item_index, item_id in enumerate(item_list):
        item_to_index_mapping[item_id] = item_index
        index_to_item_mapping[item_index] = item_id
          
        
    return user_to_index_mapping, index_to_user_mapping, \
           item_to_index_mapping, index_to_item_mapping

To collect test data, we gather 4,000 more transactions from transactions.txt (12,000 in total).

In [16]:
import json
all_transactions = []
with open('transactions.txt') as f:
    limit = 12000
    for line in f:  # iterate lazily instead of reading every line into memory first
        if limit == 0:
            break
        else:
            limit -=1
        json_data = json.loads(line)
        json_items = json_data['itemList']
        for each in json_items:
            each['customer'] = json_data['customer']
            each['date'] = json_data['date']
            each['store'] = json_data['store']
        all_transactions.extend(json_items)

df_1 = pd.DataFrame(all_transactions)
df_1.head(10)

Unnamed: 0,item,price,quantity,customer,date,store
0,20126907_EA,1.88,1.0,10,2017.04.06 12:09:32,825f9cd5f0390bc77c1fed3c94885c87
1,20185742_EA,0.99,1.0,10,2017.04.06 12:09:32,825f9cd5f0390bc77c1fed3c94885c87
2,20138681_EA,1.79,1.0,10,2017.04.06 12:09:32,825f9cd5f0390bc77c1fed3c94885c87
3,20049778001_EA,2.47,1.0,10,2017.04.06 12:09:32,825f9cd5f0390bc77c1fed3c94885c87
4,20419715007_EA,3.33,1.0,10,2017.04.06 12:09:32,825f9cd5f0390bc77c1fed3c94885c87
5,20321434_EA,2.47,1.0,10,2017.04.06 12:09:32,825f9cd5f0390bc77c1fed3c94885c87
6,20068076_KG,28.24,10.086,10,2017.04.06 12:09:32,825f9cd5f0390bc77c1fed3c94885c87
7,20022893002_EA,1.77,1.0,10,2017.04.06 12:09:32,825f9cd5f0390bc77c1fed3c94885c87
8,20299328003_EA,1.25,1.0,10,2017.04.06 12:09:32,825f9cd5f0390bc77c1fed3c94885c87
9,20132638_KG,3.33,0.28,100,2017.01.10 12:59:09,a666587afda6e89aec274a3657558a27


In [17]:
df.shape

(68767, 6)

In [18]:
df_1.shape

(101252, 6)

In [32]:
# test data: the last 101252 - 68767 = 32485 rows of df_1, i.e. the 4,000 additional transactions
order_products_test_df = df_1.tail(32485)

In [33]:
order_products_test_df = order_products_test_df.merge(products, on = 'item')

In [34]:
df = df.merge(products, on = 'item')

In [35]:
df_1 = df_1.merge(products, on = 'item')

In [37]:
# create the user, item, feature lists
users = get_user_list(df_1, 'customer')
items = get_item_list(df_1, 'Product_Name')

In [38]:
users

array([    10,     13,     15, ..., 162747, 162772, 162780], dtype=int64)

In [39]:
items

array(['Sweetened Condensed Milk', 'Yellow Split Peas',
       'Whipping Cream, 32%', ..., "Jack's Special Salsa, Medium",
       'York Peppermint Patty', 'Gummi Squirmies'], dtype=object)

LightFM can’t read unindexed objects, so we need to map users and items to their corresponding integer indices.

In [40]:
# generate mapping, LightFM library can't read other than (integer) index

user_to_index_mapping, index_to_user_mapping, \
           item_to_index_mapping, index_to_item_mapping = id_mappings(users, items)

Before generating the interaction matrices, I prepare the train and test data.

In [41]:
# create a dataframe with two columns, customer and Product_Name (product bought by the user), for the train data
user_to_product_train_df = df[["customer", "date", 'item']].merge(products[["item", "Product_Name"]])\
[["customer", "Product_Name"]].copy()
    

# giving rating as the number of product purchase count
user_to_product_train_df["product_count"] = 1
user_to_product_rating_train = user_to_product_train_df.groupby(["customer", "Product_Name"], as_index = False)["product_count"].sum()
    

# create a dataframe with two columns, customer and Product_Name (product bought by the user), for the test data
user_to_product_test_df = order_products_test_df[["customer", "date", 'item']].merge(products[["item", "Product_Name"]])\
[["customer", "Product_Name"]].copy()
    

# giving rating as the number of product purchase count (including the previous purchase in the training data)
user_to_product_test_df["product_count"] = 1
user_to_product_rating_test = user_to_product_test_df.groupby(["customer", "Product_Name"], as_index = False)["product_count"].sum()
    

# merging with the previous training user_to_product_rating_training
    
user_to_product_rating_test = user_to_product_rating_test.\
merge(user_to_product_rating_train.rename(columns = {"product_count" : "previous_product_count"}), how = "left").fillna(0)

user_to_product_rating_test["product_count"] = (user_to_product_rating_test["previous_product_count"]
                                                + user_to_product_rating_test["product_count"])
    
user_to_product_rating_test.drop(columns = ["previous_product_count"], inplace = True)
    
# return user_to_product_rating_train, user_to_product_rating_test



user_to_product_rating_test.head()

Unnamed: 0,customer,Product_Name,product_count
0,35,Steamed Lobster,1.0
1,37,"2-Bar Soap, Advanced",1.0
2,37,"Beauty Bar, Revitalize",1.0
3,37,Lever 2000 Bar Aloe & Cucumber,1.0
4,37,Natural Spring Water,1.0


Generating the interaction matrices

In [42]:
def get_interaction_matrix(df, df_column_as_row, df_column_as_col, df_column_as_value, row_indexing_map, 
                          col_indexing_map):
    
    row = df[df_column_as_row].apply(lambda x: row_indexing_map[x]).values
    col = df[df_column_as_col].apply(lambda x: col_indexing_map[x]).values
    value = df[df_column_as_value].values
    
    return coo_matrix((value, (row, col)), shape = (len(row_indexing_map), len(col_indexing_map)))


In [43]:
# Convert every table into interaction matrices


# generate user_item_interaction_matrix for train data
user_to_product_interaction_train = get_interaction_matrix(user_to_product_rating_train, "customer", 
                                                    "Product_Name", "product_count", user_to_index_mapping, item_to_index_mapping)



# generate user_item_interaction_matrix for test data
user_to_product_interaction_test = get_interaction_matrix(user_to_product_rating_test, "customer", 
                                                    "Product_Name", "product_count", user_to_index_mapping, item_to_index_mapping)

#### Building the model

Here I evaluate the pure collaborative filtering method with the LightFM library, using a single train/test split.

The “WARP” loss function often provides the best performance in LightFM, hence that is what I use. By fitting on the training dataset and evaluating on the test dataset, we can estimate the AUC score on the test dataset.

In [44]:
# initialising model with warp loss function
model_without_features = LightFM(loss = "warp")

# fitting into user to product interaction matrix only / pure collaborative filtering factor
start = time.time()


model_without_features.fit(user_to_product_interaction_train,
          user_features=None, 
          item_features=None, 
          sample_weight=None, 
          epochs=1, 
          num_threads=4,
          verbose=False)


end = time.time()
print("time taken = {0:.{1}f} seconds".format(end - start, 2))

# auc metric score (ranging from 0 to 1)

start = time.time()


auc_without_features = auc_score(model = model_without_features, 
                        test_interactions = user_to_product_interaction_test,
                        num_threads = 4, check_intersections = False)

end = time.time()

print("time taken = {0:.{1}f} seconds".format(end - start, 2))
print("average AUC without adding item-feature interaction = {0:.{1}f}".format(auc_without_features.mean(), 2))

time taken = 0.64 seconds
time taken = 13.69 seconds
average AUC without adding item-feature interaction = 0.75


Training takes around 0.7 seconds and validation around 15 seconds on my laptop with 8 GB of RAM. The WARP loss function can be slow, but the performance is very good. An AUC of 0.75 is encouraging, given that we used only a small sample of the transaction data and trained for a single epoch. It suggests that this dataset is a warm-start problem, rich in transaction data, and that with the whole transaction data we would be off to a great start.

#### Note :
item_features is currently None in this model. One obvious add-on would be to build a hybrid collaborative and content-based model by adding item-feature interactions. Given the mch_categories data, one could use the categories that products fall into as product features. We would then have to create a feature-item mapping and an item-to-feature interaction matrix, which can be passed to the item_features parameter of the LightFM model.

The hybrid model is not guaranteed to improve the AUC.

At the end, we do explore this idea, adding the features to see if we can build a better model. As it turns out, pure collaborative filtering gives better results.

#### Items recommendation

In [45]:
# Combine the train and the test dataset into one by combining through function below

def combined_train_test(train, test):
    """
    
    test set is the more recent rating/number_of_order of users.
    train set is the previous rating/number_of_order of users.
    non-zero value in the test set will replace the elements in 
    the train set matrices
    """
    # initialising train dict
    train_dict = {}
    for train_row, train_col, train_data in zip(train.row, train.col, train.data):
        train_dict[(train_row, train_col)] = train_data
        
    # replacing with the test set
    
    for test_row, test_col, test_data in zip(test.row, test.col, test.data):
        train_dict[(test_row, test_col)] = max(test_data, train_dict.get((test_row, test_col), 0))
        
    
    # converting to the row
    row_element = []
    col_element = []
    data_element = []
    for row, col in train_dict:
        row_element.append(row)
        col_element.append(col)
        data_element.append(train_dict[(row, col)])
        
    # converting to np array
    
    row_element = np.array(row_element)
    col_element = np.array(col_element)
    data_element = np.array(data_element)
    
    return coo_matrix((data_element, (row_element, col_element)), shape = (train.shape[0], train.shape[1]))

# create a user to product interaction matrix

user_to_product_interaction = combined_train_test(user_to_product_interaction_train, 
                                                 user_to_product_interaction_test)
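As a side note, when all counts are non-negative and neither matrix repeats a (row, column) coordinate, the loop in combined_train_test amounts to an element-wise maximum. A sketch of that shortcut with SciPy's sparse `maximum`, on two made-up 2x3 matrices:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy train/test interaction matrices (values are invented counts).
train = coo_matrix(
    (np.array([2.0, 1.0]), (np.array([0, 1]), np.array([0, 1]))), shape=(2, 3)
)
test = coo_matrix(
    (np.array([3.0, 4.0]), (np.array([0, 1]), np.array([0, 2]))), shape=(2, 3)
)

# Element-wise maximum keeps train-only entries, test-only entries,
# and the larger value where both matrices have an entry.
combined = train.tocsr().maximum(test.tocsr()).tocoo()
dense = combined.toarray()
print(dense)
```

I have not checked this against the notebook's matrices, so treat it as an equivalent only under the two stated assumptions.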

In [46]:
# retraining the final model with combined dataset

final_model = LightFM(loss = "warp")

# fitting to combined dataset with pure collaborative filtering result

start = time.time()
#===================

final_model.fit(user_to_product_interaction,
          user_features=None, 
          item_features=None, 
          sample_weight=None, 
          epochs=1, 
          num_threads=4,
          verbose=False)

#===================
end = time.time()
print("time taken = {0:.{1}f} seconds".format(end - start, 2))

time taken = 0.90 seconds


We create a class that produces recommendations for a given user ID.

In [56]:
# class object to make the recommendation 

class recommendation_sampling:
    
    def __init__(self, model, items = items, user_to_product_interaction_matrix = user_to_product_interaction, 
                user2index_map = user_to_index_mapping):
        
        self.user_to_product_interaction_matrix = user_to_product_interaction_matrix
        self.model = model
        self.items = items
        self.user2index_map = user2index_map
    
    def recommendation_for_user(self, user):
        
        # getting the userindex
        
        userindex = self.user2index_map.get(user, None)
        
        if userindex is None:
            return None
        
        users = [userindex]
        
        # products already bought
        
        known_positives = self.items[self.user_to_product_interaction_matrix.tocsr()[userindex].indices]
        
        # scores from model prediction
        scores = self.model.predict(user_ids = users, item_ids = np.arange(self.user_to_product_interaction_matrix.shape[1]))
        
        # top items
        
        top_items = self.items[np.argsort(-scores)]
        
        # printing out the result
        print("User %s" % user)
        print("     Known positives:")
        
        for x in known_positives[:3]:
            print("                  %s" % x)
            
            
        print("     Recommended:")
        
        for x in top_items[:3]:
            print("                  %s" % x)

In [59]:
# Sample recommendation using the final model
 
recom = recommendation_sampling(model = final_model)

recom.recommendation_for_user(15)    

User 15
     Known positives:
                  Strawberries
                  Sweet Baby Peppers, Club Pack
                  Bananas, Bunch
     Recommended:
                  Bananas, Bunch
                  Plastic Bags
                  PENNY ROUNDING - DO NOT TOUCH


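These are not great recommendations: the list repeats a product the user has already bought (Bananas, Bunch) and includes entries that are not real products (Plastic Bags, PENNY ROUNDING - DO NOT TOUCH). But this is a very simple model. Let us add the MCH_Codes as item features to the LightFM model and see whether they produce better results.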
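
One possible way to tidy up such a list is to mask out already-purchased items and known non-product entries before ranking. The following is only a sketch on made-up scores: the function name top_n_unseen, the blocked-item list, and the extra product names (Milk, Bread, Eggs) are assumptions for illustration and are not part of the notebook's pipeline. Whether purchased items should be hidden at all is a judgment call for groceries, where repeat purchases are common.

```python
import numpy as np

def top_n_unseen(scores, known_positive_idx, n=3, blocked_idx=()):
    """Return indices of the n highest-scoring items, skipping known and blocked items."""
    scores = np.asarray(scores, dtype=float).copy()
    exclude = np.concatenate([
        np.asarray(known_positive_idx, dtype=int),
        np.asarray(blocked_idx, dtype=int),
    ])
    # excluded items get the lowest possible score so they sort last
    scores[exclude] = -np.inf
    ranked = np.argsort(-scores)
    # drop the excluded items entirely, in case fewer than n candidates remain
    ranked = ranked[np.isfinite(scores[ranked])]
    return ranked[:n]

# toy catalogue; Milk, Bread and Eggs are hypothetical entries
item_names = np.array(["Bananas, Bunch", "Plastic Bags", "Strawberries",
                       "Milk", "Bread", "Eggs"])
toy_scores = [2.0, 1.8, 0.5, 1.2, 0.9, 0.7]   # made-up model scores
already_bought = [0, 2]                        # Bananas, Strawberries
not_real_products = [1]                        # Plastic Bags

recommended = top_n_unseen(toy_scores, already_bought, n=3,
                           blocked_idx=not_real_products)
print(item_names[recommended])
```

In the notebook's class, the same masking could be applied to the scores array before the np.argsort call.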

#### Improvement?

In [52]:
# function to get the list of features, MCH_Code in this case
def get_feature_list(df, feature_name_column):
    
    feature_list = df[feature_name_column].unique()
    
    return feature_list


#getting the id mapping, now also including the features
def id_mappings(user_list, item_list, feature_list):
    """
    
    Create id mappings to convert user_id, item_id, and feature_id to integer indices and back
    
    """
    user_to_index_mapping = {}
    index_to_user_mapping = {}
    for user_index, user_id in enumerate(user_list):
        user_to_index_mapping[user_id] = user_index
        index_to_user_mapping[user_index] = user_id
        
    item_to_index_mapping = {}
    index_to_item_mapping = {}
    for item_index, item_id in enumerate(item_list):
        item_to_index_mapping[item_id] = item_index
        index_to_item_mapping[item_index] = item_id
        
    feature_to_index_mapping = {}
    index_to_feature_mapping = {}
    for feature_index, feature_id in enumerate(feature_list):
        feature_to_index_mapping[feature_id] = feature_index
        index_to_feature_mapping[feature_index] = feature_id
        
        
    return user_to_index_mapping, index_to_user_mapping, \
           item_to_index_mapping, index_to_item_mapping, \
           feature_to_index_mapping, index_to_feature_mapping

Now we create an item-feature matrix with items along the rows and MCH_Codes along the columns. The entry at position (i, j) is nonzero if item i belongs to MCH category j, and zero otherwise.
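
To make the layout concrete, here is a small sketch that builds such a matrix for four products. The first three product/MCH pairs are taken from products.txt as shown at the top of the notebook; the fourth product is hypothetical and is added only so that two items share a category. It uses coo_matrix directly rather than the notebook's get_interaction_matrix helper, so treat it as an illustration of the idea, not of that helper.

```python
import numpy as np
from scipy.sparse import coo_matrix

# (product name, MCH code) pairs; the last product is hypothetical
pairs = [
    ("Tuna Chunks in Broth", "M10210701"),
    ("Anaheim Peppers", "M02270304"),
    ("Swiss Cheese", "M05350101"),
    ("Cheddar Cheese (hypothetical)", "M05350101"),
]

# unique items and features, in order of first appearance
toy_items = list(dict.fromkeys(name for name, _ in pairs))
toy_features = list(dict.fromkeys(code for _, code in pairs))

# integer index for each item and each feature
toy_item_to_index = {name: i for i, name in enumerate(toy_items)}
toy_feature_to_index = {code: j for j, code in enumerate(toy_features)}

rows = [toy_item_to_index[name] for name, _ in pairs]
cols = [toy_feature_to_index[code] for _, code in pairs]
data = np.ones(len(pairs), dtype=np.float32)

toy_item_feature = coo_matrix((data, (rows, cols)),
                              shape=(len(toy_items), len(toy_features)))
print(toy_item_feature.toarray())
```

Each row has a single nonzero entry, and the two cheese rows share the same column.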

In [54]:
# creating the item-feature matrix

# take a copy of the two columns so that adding a column below
# does not raise pandas' SettingWithCopyWarning
product_feature_df = df[["Product_Name", "MCH_Code"]].copy()
product_feature_df["feature_count"] = 1 # weight of 1 for each (product, MCH_Code) row

# grouping to collapse duplicate (product, MCH_Code) rows, summing over feature_count
product_feature_df = product_feature_df.groupby(["Product_Name", "MCH_Code"], as_index = False)["feature_count"].sum()

# getting features
features = get_feature_list(df, 'MCH_Code')

# generate mappings, since the LightFM library only accepts (integer) indices
user_to_index_mapping, index_to_user_mapping, \
           item_to_index_mapping, index_to_item_mapping, \
           feature_to_index_mapping, index_to_feature_mapping = id_mappings(users, items, features)

# generate item_to_feature interaction
product_to_feature_interaction = get_interaction_matrix(product_feature_df, "Product_Name", "MCH_Code",  "feature_count", 
                                                        item_to_index_mapping, feature_to_index_mapping)


#### Model with features

Now we modify our previous model by passing the item-feature matrix (product_to_feature_interaction) to it as the item_features parameter.

In [55]:
# initialising model with warp loss function
model_with_features = LightFM(loss = "warp")

# fitting the model with hybrid collaborative filtering + content based (product + features)
start = time.time()
#===================


model_with_features.fit(user_to_product_interaction_train,
          user_features=None, 
          item_features=product_to_feature_interaction, 
          sample_weight=None, 
          epochs=1, 
          num_threads=4,
          verbose=False)

#===================
end = time.time()
print("time taken = {0:.{1}f} seconds".format(end - start, 2))

start = time.time()
#===================
auc_with_features = auc_score(model = model_with_features, 
                        test_interactions = user_to_product_interaction_test,
                        train_interactions = user_to_product_interaction_train, 
                        item_features = product_to_feature_interaction,
                        num_threads = 4, check_intersections=False)
#===================
end = time.time()
print("time taken = {0:.{1}f} seconds".format(end - start, 2))

print("average AUC with item-feature interaction = {0:.{1}f}".format(auc_with_features.mean(), 2))

time taken = 0.48 seconds
time taken = 12.92 seconds
average AUC with item-feature interaction = 0.68
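
For reference, the AUC for a single user can be read as the probability that a randomly chosen item the user did buy is scored above a randomly chosen item they did not buy. The sketch below computes that quantity directly for one user with made-up scores. It is an illustration of the definition only, not LightFM's implementation, and the function name auc_for_user is an assumption.

```python
import numpy as np

def auc_for_user(scores, positive_mask):
    """Fraction of (positive, negative) item pairs in which the positive item scores higher.

    Ties count as half a win.
    """
    pos = scores[positive_mask]
    neg = scores[~positive_mask]
    # compare every positive score against every negative score
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

toy_scores = np.array([0.9, 0.2, 0.6, 0.4, 0.1])          # made-up scores for 5 items
bought = np.array([True, False, False, True, False])      # items 0 and 3 are positives

auc = auc_for_user(toy_scores, bought)
print("AUC for the toy user = {0:.2f}".format(auc))
```

A value of 0.5 corresponds to random ranking and 1.0 to ranking every purchased item above every other item.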


As we see, the model that takes the mch_categories into account performs worse than the model without any features. There are at least two ways of interpreting this result.

1. We did not use the complete transactions data. If we trained our LightFM model on the whole transactions data, adding features might improve the AUC score of the model.

2. If the AUC score does not improve even when the whole transactions data is used, it would suggest that pure collaborative filtering performs better on this dataset with this approach.
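
A further point that may be worth checking, independent of the amount of data: to the best of my understanding of LightFM, when item_features is supplied each item is represented only through the supplied features, so the per-item latent vectors that the feature-less model learns are not available unless an identity block is stacked next to the feature matrix. With only MCH_Code columns, all products in the same category would then share one representation. The sketch below shows the stacking on a toy matrix; the toy values and variable names are assumptions, and whether this changes the AUC here has not been tested.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack, identity

# toy item-feature matrix: 4 items x 2 categories (made-up values)
toy_item_features = csr_matrix(np.array([
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 1],
], dtype=np.float32))

n_items = toy_item_features.shape[0]

# one identity column per item, followed by the category columns
item_features_with_identity = hstack(
    [identity(n_items, dtype=np.float32, format="csr"), toy_item_features],
    format="csr",
)

print(item_features_with_identity.toarray())
```

The resulting matrix still has one row per item, so it could be passed as item_features in place of the original feature matrix, in both fit and auc_score.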

Of course, there are various other recommendation algorithms one could use. Here we used an item-based collaborative filtering technique; one could try user-based collaborative filtering as well.
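
As a rough illustration of the user-based idea, the sketch below scores items for one user by weighting other users' purchases with their cosine similarity to that user. The interaction counts are made up, the function name user_based_scores is an assumption, and this dense NumPy version is meant only to show the mechanics, not to scale to the full transactions data.

```python
import numpy as np

# toy user-item purchase counts: 4 users x 5 items (made-up values)
toy_interactions = np.array([
    [2, 0, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [0, 3, 0, 0, 1],
    [0, 1, 0, 0, 2],
], dtype=float)

def user_based_scores(interactions, user_index):
    """Score every item for one user from the purchases of similar users."""
    norms = np.linalg.norm(interactions, axis=1, keepdims=True)
    # avoid division by zero for users with no purchases
    normalized = interactions / np.where(norms == 0, 1.0, norms)
    # cosine similarity between the target user and every user
    similarity = normalized @ normalized[user_index]
    similarity[user_index] = 0.0  # a user should not vote for themselves
    # similarity-weighted sum of the other users' purchases
    return similarity @ interactions

scores_user0 = user_based_scores(toy_interactions, 0)

# pick the best-scoring item that user 0 has not bought yet
unseen = toy_interactions[0] == 0
best_unseen = int(np.argmax(np.where(unseen, scores_user0, -np.inf)))
print("best unseen item index for user 0:", best_unseen)
```

Here user 1 is the only user who shares purchases with user 0, so the item that user 1 bought and user 0 did not is ranked first.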
