In [1]:
ls

Ibotta_Question_1.ipynb             customers.csv*
Ibotta_Question_2.ipynb             item_sample.csv
Test Collaborative Filtering.ipynb  item_sample.zip*
Test_PCA.ipynb                      lat_long_data.csv
brands.csv*                         retailers.csv*
categories.csv*                     top_related_sprite_customers.csv


In [2]:
import pandas as pd
import scipy.sparse as sparse
import numpy as np
from scipy.sparse.linalg import spsolve

# Ibotta Test

A large number of product redemption offers are available to consumers through our mobile app. However, consumers do not see the majority of these offers. We need to help consumers discover relevant offers.

Create a machine learning system that recommends, for each user, the top 10 product IDs that they have not previously purchased, based on that consumer's prior purchase behavior.

● List any assumptions that you make, and provide clarity on the methodology you employ.
● Submit any details/documentation that will support your analysis.
● Identify additional data you would want before implementing the system into production.
● Provide guidance for implementation of the recommendation system.
● Provide any underlying analysis where possible, and present back your findings and a summary of what you did and learned.

In [3]:
# Pull in item list and drop nulls
cleaned_retail = pd.read_csv("item_sample.csv").dropna()

In [4]:
# Aggregate quantity bought by each customer
grouped_purchased = cleaned_retail.groupby(['customer_id', 'product_id'])['purchase_timestamp'].count().reset_index()
grouped_purchased.columns = ["customer_id", "product_id", "quantity"]
grouped_purchased.head()

   customer_id  product_id  quantity
0          156        8119         2
1          156        8120         1
2          156        8121         1
3          156      150209         2
4          156      174738         1


Instead of representing an explicit rating, the purchase quantity can represent a “confidence” in terms of how strong the interaction was. Items with a larger number of purchases by a customer can carry more weight in our ratings matrix of purchases.

Our last step is to create the sparse ratings matrix of users and items utilizing the code below:

In [5]:
customers = list(np.sort(grouped_purchased.customer_id.unique())) # Get our unique customers
products = list(np.sort(grouped_purchased.product_id.unique())) # Get our unique products that were purchased
# All of our purchase counts, kept in row order so each count stays aligned with its (customer, product) pair
quantity = list(grouped_purchased.quantity)

# Get the associated row indices (position of each customer_id in the sorted customer list)
rows = grouped_purchased.customer_id.astype(pd.CategoricalDtype(categories=customers)).cat.codes

# Get the associated column indices (position of each product_id in the sorted product list)
cols = grouped_purchased.product_id.astype(pd.CategoricalDtype(categories=products)).cat.codes

# Build the customer-by-product sparse matrix of purchase counts
purchases_sparse = sparse.csr_matrix((quantity, (rows, cols)), shape=(len(customers), len(products)))
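
To make the index mapping concrete, here is a minimal sketch on a hypothetical four-row table shaped like `grouped_purchased`. The toy IDs and the `toy_*` names are illustrative assumptions, not values taken from the data set. It uses `pd.Categorical(...).codes` to turn each ID into its position in the sorted ID list.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sparse

# Hypothetical toy data in the same shape as grouped_purchased
toy = pd.DataFrame({
    "customer_id": [156, 156, 200, 300],
    "product_id": [8119, 8120, 8119, 9000],
    "quantity": [2, 1, 3, 1],
})

toy_customers = np.sort(toy.customer_id.unique())
toy_products = np.sort(toy.product_id.unique())

# Category codes give each ID its position in the sorted ID list
toy_rows = pd.Categorical(toy.customer_id, categories=toy_customers).codes
toy_cols = pd.Categorical(toy.product_id, categories=toy_products).codes

# Quantities stay in their original row order, aligned with toy_rows / toy_cols
toy_sparse = sparse.csr_matrix(
    (toy.quantity.to_numpy(), (toy_rows, toy_cols)),
    shape=(len(toy_customers), len(toy_products)),
)
toy_dense = toy_sparse.toarray()
print(toy_dense)
```

Each row of the dense view is one customer and each column one product, so customer 156 ends up with counts in the first two columns only.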



In [6]:
# Check out sparse matrix object
purchases_sparse

<27055x114871 sparse matrix of type '<class 'numpy.int64'>'
	with 2239532 stored elements in Compressed Sparse Row format>

In [7]:
len(purchases_sparse.getnnz(1))

27055

In [8]:
len(purchases_sparse.getnnz(0))

114871

In [9]:
# Check the sparsity of the matrix
matrix_size = purchases_sparse.shape[0]*purchases_sparse.shape[1] # Number of possible interactions in the matrix
num_purchases = len(purchases_sparse.nonzero()[0]) # Number of items interacted with
sparsity = 100*(1 - (num_purchases/matrix_size))
sparsity

99.9279391580165

About 99.93% of the entries in the interaction matrix are empty. For collaborative filtering to work, the maximum sparsity you could get away with would probably be about 99.5% or so. We are above this, so we need to reduce the sparsity.

Luckily, we can easily remove rows and columns that have very low total counts, which essentially removes the customers and products with too few transactions.

In [10]:
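# Keep only customers (rows) and products (columns) with more than 30 nonzero entries;
# both counts are computed on the matrix as it was before filtering
purchases_sparse = purchases_sparse[purchases_sparse.getnnz(1)>30][:,purchases_sparse.getnnz(0)>30]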

matrix_size = purchases_sparse.shape[0]*purchases_sparse.shape[1] # Number of possible interactions in the matrix
num_purchases = len(purchases_sparse.nonzero()[0]) # Number of items interacted with
sparsity = 100*(1 - (num_purchases/matrix_size))
sparsity

99.41163625991679

This reduces the sparsity to about 99.4%, which leaves a small buffer below the 99.5% threshold.
This new sparsity should work fine.
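
One caveat with a one-pass filter like the one above: the row and column counts are both taken before anything is dropped, so once columns are removed a kept row can fall back to the threshold or below (and vice versa). If a strict minimum matters, the filter can be repeated until nothing changes. The sketch below is one possible way to do that; `filter_min_counts` is a hypothetical helper, not part of the original analysis, and the toy matrix uses a threshold of 1 purely for illustration.

```python
import numpy as np
import scipy.sparse as sparse

def filter_min_counts(matrix, min_count=30):
    """Repeatedly drop rows and columns with min_count or fewer stored entries."""
    matrix = matrix.tocsr()
    while True:
        keep_rows = matrix.getnnz(axis=1) > min_count
        keep_cols = matrix.getnnz(axis=0) > min_count
        if keep_rows.all() and keep_cols.all():
            return matrix  # every remaining row and column passes the threshold
        matrix = matrix[keep_rows][:, keep_cols]

# Hypothetical toy matrix: 3 customers x 4 products
toy = sparse.csr_matrix(np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]))

# Single pass, as in the cell above, but with a threshold of 1
one_pass = toy[toy.getnnz(1) > 1][:, toy.getnnz(0) > 1]

# Repeated passes until the result is stable
filtered = filter_min_counts(toy, min_count=1)

print(one_pass.shape, filtered.shape)
```

On this toy input the single pass keeps a customer who is left with only one purchase, while the repeated filter removes that customer as well.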

# Creating a Training and Validation Set

Typically in machine learning applications, we need to test whether the model we just trained performs well on new data that it did not see during the training phase.

With collaborative filtering, a conventional split is not going to work, because you need all of the user/item interactions to find the proper matrix factorization. A better method is to hide a randomly chosen percentage of the user/item interactions from the model during the training phase. Then, during the test phase, check how many of the recommended items the user actually ended up purchasing.
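
The masking idea can be sketched as follows. This is a minimal illustration under stated assumptions, not necessarily the split used later in this notebook: `mask_interactions`, `pct_test`, and `seed` are hypothetical names, and the toy matrix stands in for `purchases_sparse`.

```python
import numpy as np
import scipy.sparse as sparse

def mask_interactions(ratings, pct_test=0.2, seed=0):
    """Split stored interactions into a training matrix and a held-out matrix.

    Both outputs keep the full shape of `ratings`; a randomly chosen share
    of the stored entries is moved out of the training matrix.
    """
    coo = ratings.tocoo()
    rng = np.random.default_rng(seed)
    n_hide = int(round(pct_test * coo.nnz))

    # Boolean flag per stored entry: True means hidden from training
    hide = np.zeros(coo.nnz, dtype=bool)
    hide[rng.choice(coo.nnz, size=n_hide, replace=False)] = True

    def build(mask):
        return sparse.csr_matrix(
            (coo.data[mask], (coo.row[mask], coo.col[mask])),
            shape=ratings.shape,
        )

    return build(~hide), build(hide)

# Hypothetical toy matrix with 20 stored purchase counts
grid = np.arange(60).reshape(6, 10)
toy = sparse.csr_matrix(np.where(grid % 3 == 0, 1 + grid % 4, 0))

train, held_out = mask_interactions(toy, pct_test=0.2, seed=42)
print(train.nnz, held_out.nnz)
```

Because both matrices keep the original shape, a model fit on `train` still has a factor for every user and item, and `held_out` lists exactly the purchases it should be able to recover.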

Ideally, you would ultimately test your recommendations with some kind of A/B test, or with time-series data where everything before a cutoff date is used for training and everything after it is used for testing.
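
A time-based split of that kind could look like the sketch below. The cutoff date and the toy purchase log are hypothetical, and it assumes the `purchase_timestamp` column can be parsed as a datetime, which has not been checked here.

```python
import pandas as pd

# Hypothetical purchase log with the same column names as item_sample.csv
toy_log = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "product_id": [10, 11, 10, 12, 11],
    "purchase_timestamp": pd.to_datetime([
        "2017-01-05", "2017-03-01", "2017-01-20", "2017-02-15", "2017-03-10",
    ]),
})

# Assumed cutoff: purchases before it train the model, later ones test it
cutoff = pd.Timestamp("2017-02-01")
train_log = toy_log[toy_log.purchase_timestamp < cutoff]
test_log = toy_log[toy_log.purchase_timestamp >= cutoff]

print(len(train_log), len(test_log))
```

Unlike random masking, this keeps the evaluation honest about time: the model never sees a purchase that happened after the ones it is asked to predict.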

We will split 

In [11]:
len(purchases_sparse.getnnz(1))

15317

In [12]:
len(purchases_sparse.getnnz(0))

16077