## Collaborative Filtering with Neighborhood-Based Method and Matrix Factorization using Turicreate

This notebook implements neighborhood-based models with different similarity metrics to generate recommendations.
The item-item collaborative filtering is built with the [TuriCreate](https://github.com/apple/turicreate) library.

In [1]:
import pandas as pd
import numpy as np
import os
import turicreate as tc

PROJ_ROOT = os.path.join(os.pardir)

# Table of Contents

* [1. Load Datasets](#1.-Load-Datasets)
* [2. Data Preprocessing](#2.-Data-Preprocessing)
* [3. Evaluation Metric: Recall@K](#3.-Evaluation-Metric:-Recall@K)
* [4. Benchmark Model: Non-personalized Popularity Model](#4.-Benchmark-Model:-Non-personalized-Popularity-Model)
* [5. Neighborhood-Based Models](#5.-Neighborhood-Based-Models)
    * [5.1 Neighborhood-Based Model using Cosine Similarity Metric](#5.1-Neighborhood-Based-Model-using-Cosine-Similarity-Metric)
    * [5.2 Neighborhood-Based Model using Jaccard Similarity Metric](#5.2-Neighborhood-Based-Model-using-Jaccard-Similarity-Metric)

## 1. Load Datasets

In [2]:
orders = pd.read_csv('../data/raw/orders.csv')
order_products = pd.read_csv('../data/raw/order_products__prior.csv')
products = pd.read_csv('../data/raw/products.csv')

In [3]:
# fill NAs
orders.fillna(0, inplace=True)

In [4]:
# merge orders and order_products on order_id
orders_df = order_products[['order_id', 'product_id']].merge(orders[['order_id', 'user_id', 'order_number']])

In [5]:
orders_df.head()

Unnamed: 0,order_id,product_id,user_id,order_number
0,2,33120,202279,3
1,2,28985,202279,3
2,2,9327,202279,3
3,2,45918,202279,3
4,2,30035,202279,3


## 2. Data Preprocessing

In [6]:
def prior_latest(df):
    '''Label each order with prior or latest'''
    max_row = df['order_number'].max()
    labels = np.where(df['order_number'] == max_row,
                     'latest', 
                     'prior')
    return pd.DataFrame(labels, index=df.index)

def split_train_test_set(df):
    '''
    df is merged order-products dataset containing order_id, product_id, user_id and order_number,
    Split df into training and test data where prior orders are training data and
    most recent orders are test data
    '''
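    # an order is 'latest' when its order_number equals the user's maximum;
    # transform keeps the original row index, so the labels align with df
    latest_order = df.groupby('user_id')['order_number'].transform('max')
    df['set'] = np.where(df['order_number'] == latest_order, 'latest', 'prior')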
    training_set = df[df.set == 'prior'].drop('set', axis=1)
    test_set = df[df.set == 'latest'].drop('set', axis=1)
    
    # sanity check
    assert len(training_set) + len(test_set) == len(df)
    assert training_set.order_id.nunique() + test_set.order_id.nunique() == df.order_id.nunique()
    
    return training_set, test_set

def make_test_data(test_set):
    '''
    convert test_set to a dataframe in the form of only two columns, first column is user_id, 
    second column is a list of products purchased by the user in their most recent order
    '''
    test_data = test_set.groupby('user_id').product_id.apply(list).reset_index().rename(
                columns={'product_id': 'products'})
    return test_data

def get_user_product_quantity_df(training_set):
    '''
    generate a dataframe showing how many times each user has purchased certain products
    according to their prior order history
    '''
    user_product_quantity_df = training_set.drop('order_number', axis=1).groupby(['user_id', 'product_id']).count(
                                ).reset_index().rename(columns={'order_id':'quantity'})
    return user_product_quantity_df

def get_prod_names(product_ids, df=products):
    '''generate product names from a list of product ids'''
    return df[df.product_id.isin(product_ids)][['product_id', 'product_name']]

In [7]:
# split the merged orders_df into training and test set
training_set, test_set = split_train_test_set(orders_df)

# Make test_data in the form we want
test_data = make_test_data(test_set)

# Prepare training_data
user_product_quantity_df = get_user_product_quantity_df(training_set)

# Get training_data ready for Turicreate
training_data = tc.SFrame(user_product_quantity_df)

In [8]:
training_data

user_id,product_id,quantity
1,196,9
1,10258,8
1,10326,1
1,12427,9
1,13032,2
1,13176,2
1,14084,1
1,17122,1
1,25133,7
1,26088,2


In [9]:
test_data.head()

Unnamed: 0,user_id,products
0,1,"[196, 46149, 39657, 38928, 25133, 10258, 35951..."
1,2,"[24852, 16589, 1559, 19156, 18523, 22825, 2741..."
2,3,"[39190, 18599, 23650, 21903, 47766, 24810]"
3,4,"[26576, 25623, 21573]"
4,5,"[27344, 24535, 43693, 40706, 16168, 21413, 139..."


- **training_data** is the input to our model: each user's purchase history from prior orders, listing the products bought and how many times each one was purchased.
- **test_data** is the hold-out data used to evaluate the model: for each user, the list of products purchased in their most recent Instacart order.
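
As a small illustration of how these two frames come about, the sketch below rebuilds them on a hypothetical six-row toy frame. The data and the `toy_*` names are made up for this example, and the split uses `groupby(...).transform('max')`, which is an assumption of this sketch rather than a copy of the helper functions above.

```python
import pandas as pd

# Hypothetical toy data: user 1 placed three orders, user 2 placed two.
toy = pd.DataFrame({
    'order_id':     [10, 10, 11, 12, 20, 21],
    'product_id':   [196, 500, 196, 196, 300, 300],
    'user_id':      [1, 1, 1, 1, 2, 2],
    'order_number': [1, 1, 2, 3, 1, 2],
})

# An order is "latest" when its order_number equals the user's maximum.
is_latest = toy['order_number'] == toy.groupby('user_id')['order_number'].transform('max')
toy_train = toy[~is_latest]
toy_test = toy[is_latest]

# Training input: how many times each user bought each product in prior orders.
toy_quantity = (toy_train.groupby(['user_id', 'product_id'])
                         .size()
                         .reset_index(name='quantity'))

# Hold-out data: one list of products per user, taken from the latest order.
toy_test_data = (toy_test.groupby('user_id')['product_id']
                         .apply(list)
                         .reset_index()
                         .rename(columns={'product_id': 'products'}))

print(toy_quantity)
print(toy_test_data)
```

User 1 bought product 196 in two prior orders, so its `quantity` is 2, while the third order (the latest) ends up only in the hold-out lists.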

The model output has to be transformed into the same form as test_data: one list of recommended products per user. We can then compare each user's recommended list with the list of products they actually bought to evaluate the model.
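
Comparing the two frames row by row only works if both list the same users in the same order. The sketch below shows one way to make that pairing explicit by joining on `user_id`; the frames `rec_toy` and `test_toy` and the helper `overlap` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical recommendations and hold-out lists. The row order differs,
# and user 3 has no recommendations at all.
rec_toy = pd.DataFrame({'user_id': [2, 1], 'products': [[7, 8], [1, 2, 3]]})
test_toy = pd.DataFrame({'user_id': [1, 2, 3], 'products': [[1, 9], [8], [5]]})

# Join on user_id so each row pairs one user's recommendations with that
# user's actual purchases, regardless of row order in either frame.
paired = test_toy.merge(rec_toy, on='user_id', how='left',
                        suffixes=('_actual', '_rec'))

def overlap(rec, act):
    """Return the recommended products that the user actually bought."""
    rec = rec if isinstance(rec, list) else []  # missing recommendations become NaN
    return sorted(set(rec) & set(act))

paired['hits'] = [overlap(rec, act)
                  for rec, act in zip(paired['products_rec'], paired['products_actual'])]
print(paired[['user_id', 'hits']])
```

With a left join from the hold-out frame, users without any recommendation are kept and simply get an empty hit list instead of shifting the alignment of every later row.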

In [10]:
def output_transformer(rec):
    '''Transform the output of model to the form of test_data'''
    rec_df = rec.to_dataframe()
    return rec_df.groupby('user_id').product_id.apply(list).reset_index().rename(
                columns={'product_id': 'products'})

## 3. Evaluation Metric: Recall@K

The evaluation metric I chose for this recommender is recall@k. The feedback in our case is implicit, with no explicit rating scores, so classification accuracy metrics are more appropriate. Precision is the proportion of recommended items that users actually bought. It is less suitable here because we do not know how users reacted to the recommended items. Moreover, precision gives the same result for two users who each bought 2 items from a 10-item recommendation list, even if one of them bought 5 items in total while the other bought only those 2, although the recommender clearly did not perform equally well for both. Recall, the proportion of actually purchased items that appear in the recommended list, is therefore the more appropriate metric.
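
The two-user example can be checked with a few lines of Python. This is a minimal sketch with made-up product ids; `toy_precision` and `toy_recall` are hypothetical helper names chosen so they do not clash with the functions defined in this notebook.

```python
def toy_precision(rec, act):
    """Share of recommended items that were actually bought."""
    return len(set(rec) & set(act)) / len(set(rec))

def toy_recall(rec, act):
    """Share of actually bought items that were recommended."""
    return len(set(rec) & set(act)) / len(set(act))

recommended = list(range(1, 11))   # 10 hypothetical recommended product ids
user_a = [1, 2, 101, 102, 103]     # bought 5 items, 2 of them recommended
user_b = [1, 2]                    # bought only the 2 recommended items

for name, bought in [('user_a', user_a), ('user_b', user_b)]:
    print(name,
          'precision:', toy_precision(recommended, bought),
          'recall:', toy_recall(recommended, bought))
```

Both users get a precision of 0.2, but recall separates them: 0.4 for the first user and 1.0 for the second.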

In [11]:
def recall_at_k(rec, act):
    '''
    calculate recall from a list of predicted products and a list of actual purchased products
    '''
    return len(set(rec).intersection(set(act))) / len(set(act))

def mean_recall_at_k_rec(rec_k, test_data):
    '''
    Calculate mean recall score for our recommender given the transformed output and the test data
    '''
    score = []
    for i in range(len(test_data)):
        score.append(recall_at_k(rec_k.products[i], test_data.products[i]))
    return np.mean(score)

From the EDA, we know that the reorder ratio is quite high. A recommender, however, is not only about suggesting items that users have already purchased; it should also surface items they may be interested in but are not yet aware of. I therefore build a second evaluation metric, still based on recall, that removes reordered items from the test data and measures the proportion of first-time purchases that appear in the recommended list.
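
For a single user, the idea can be sketched as follows. The helper `new_item_recall` and the product lists are illustrative assumptions. Unlike the notebook's mean-recall functions, which score a user with no first-time purchases as 0, this sketch returns `None` in that case so the caller can decide how to treat such users.

```python
def new_item_recall(rec, latest, prior):
    """Recall over products in `latest` that never appear in `prior`.

    Returns None when the latest order contains no first-time purchases.
    """
    new_items = set(latest) - set(prior)
    if not new_items:
        return None
    return len(set(rec) & new_items) / len(new_items)

prior = [196, 10258, 25133]            # products bought in earlier orders
latest = [196, 25133, 38928, 39657]    # most recent order: 2 reorders, 2 new
rec = [38928, 6184, 37710]             # recommended products

print(new_item_recall(rec, latest, prior))   # one of the two new items was recommended
print(new_item_recall(rec, [196], prior))    # latest order holds only a reorder
```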

In [12]:
def mean_recall_at_k_rec_new(rec_k, test_data):
    '''
    Calculate mean recall score for our recommender given the transformed output and the test data 
    with reordered products eliminated
    '''
    purchased = training_set.groupby('user_id').product_id.apply(list).reset_index().rename(
                columns={'product_id': 'purchased'})
    score = []
    for i in range(len(test_data)):
        # get rid of reordered products
        new = list(set(test_data.products[i]) - set(purchased.purchased[i]))
        # handle the case where every product in the user's most recent order is a reorder
        if not new:
            score.append(0)
        else:
            score.append(recall_at_k(rec_k.products[i], new))
    return np.mean(score)

## 4. Benchmark Model: Non-personalized Popularity Model

In [13]:
def most_popular_k(k, df=training_set):
    '''
    get the most popular k items to be our benchmark recommender
    '''
    return df.product_id.value_counts().head(k).keys().tolist()

def mean_recall_at_k_pop(top_k, test_data):
    '''
    Calculate mean recall score for the benchmark recommender given the top_k list and the test data
    '''
    score = []
    for i in range(len(test_data)):
        score.append(recall_at_k(top_k, test_data.products[i]))
    return np.mean(score)

def mean_recall_at_k_pop_new(top_k, test_data):
    '''
    Calculate mean recall score for the benchmark recommender given the top_k list and the test data with 
    reordered items eliminated
    '''
    purchased = training_set.groupby('user_id').product_id.apply(list).reset_index().rename(
                columns={'product_id': 'purchased'})
    score = []
    for i in range(len(test_data)):
        # get rid of reordered products
        new = list(set(test_data.products[i]) - set(purchased.purchased[i]))
        # handle the case where every product in the user's most recent order is a reorder
        if not new:
            score.append(0)
        else:
            score.append(recall_at_k(top_k, new))
    return np.mean(score)

In [14]:
k = [10, 20, 50]

for i in k:
    top_k = most_popular_k(i, df=training_set)
    print('recall@{0} for popularity model is: {1}'.format(i, mean_recall_at_k_pop(top_k, test_data)))
    print('recall@{0} for popularity model without reordered products is: {1}'.format(i, 
                                                            mean_recall_at_k_pop_new(top_k, test_data)))

recall@10 for popularity model is: 0.0699413179973539
recall@10 for popularity model without reordered products is: 0.027280349700253143
recall@20 for popularity model is: 0.09588023492718917
recall@20 for popularity model without reordered products is: 0.043705006414388715
recall@50 for popularity model is: 0.15431186904512134
recall@50 for popularity model without reordered products is: 0.08048940655391035


- The recall score increases as the number of recommended products increases. This makes sense: the more we recommend, the more actually purchased items fall into the recommended list.
- The popularity model does not perform well at recommending products people have never bought before. As we know from the EDA, the most popular products largely coincide with the most reordered ones, so once reordered items are removed from the test data, the intersection of the two lists shrinks.

#### Example: user_id == 100

In [15]:
print('User_id 100 actually bought: ')
get_prod_names(test_data.products[99])

User_id 100 actually bought: 


Unnamed: 0,product_id,product_name
21136,21137,Organic Strawberries
21615,21616,Organic Baby Arugula
24851,24852,Banana
26368,26369,Organic Roma Tomato
27343,27344,Uncured Genoa Salami
38546,38547,Bubblegum Flavor Natural Chewing Gum
38688,38689,Organic Reduced Fat Milk
48627,48628,Organic Whole Wheat Bread


In [16]:
print('User_id 100 got recommended: ')
get_prod_names(most_popular_k(10, df=training_set))

User_id 100 got recommended: 


Unnamed: 0,product_id,product_name
13175,13176,Bag of Organic Bananas
16796,16797,Strawberries
21136,21137,Organic Strawberries
21902,21903,Organic Baby Spinach
24851,24852,Banana
26208,26209,Limes
27844,27845,Organic Whole Milk
47208,47209,Organic Hass Avocado
47625,47626,Large Lemon
47765,47766,Organic Avocado


For the customer with user_id = 100, two of the purchased products appear in the recommended list: Organic Strawberries and Banana.

## 5. Neighborhood-Based Models

### 5.1 Neighborhood-Based Model using Cosine Similarity Metric

In [17]:
# create the recommender
model_cos = tc.item_similarity_recommender.create(training_data, user_id='user_id', item_id='product_id', 
                                                  target='quantity', similarity_type='cosine')

In [18]:
# Get all the unique user_ids for recommendation
users_to_recommend = list(orders['user_id'].unique())

In [19]:
# check the recommendation result
rec_cos = model_cos.recommend(users=users_to_recommend, verbose=False)
rec_cos.print_rows(15)

+---------+------------+---------------------+------+
| user_id | product_id |        score        | rank |
+---------+------------+---------------------+------+
|    1    |   37710    |  0.806175700823466  |  1   |
|    1    |    6184    |  0.6475631872812907 |  2   |
|    1    |   38928    |  0.5847909847895304 |  3   |
|    1    |   11759    |  0.5210654894510905 |  4   |
|    1    |   39657    |  0.5002970337867737 |  5   |
|    1    |   18023    | 0.49242939949035647 |  6   |
|    1    |   31651    | 0.48198878367741904 |  7   |
|    1    |   41400    |  0.478752330938975  |  8   |
|    1    |   46562    |  0.4697097897529602 |  9   |
|    1    |   13575    | 0.43342450857162473 |  10  |
|    2    |   21137    |  0.3162227161228657 |  1   |
|    2    |   21903    | 0.27282796365519363 |  2   |
|    2    |   26209    | 0.23250874069829783 |  3   |
|    2    |   47626    | 0.23084485717117786 |  4   |
|    2    |   24964    | 0.20456942170858383 |  5   |
+---------+------------+---------------------+------+

In [20]:
# evaluate
N = [10, 20, 50]

for n in N:
    # get k recommended items
    rec_k = model_cos.recommend(users=users_to_recommend, k=n, verbose=False)
    # transform the SFrame output to a dataframe
    rec_df = output_transformer(rec_k)
    # calculate recall scores
    print('recall@{0} is: {1}'.format(n, mean_recall_at_k_rec(rec_df, test_data)))
    print('recall@{0} without reordered products is: {1}'.format(
                                                        n,mean_recall_at_k_rec_new(rec_df, test_data)))

recall@10 is: 0.018600407125835553
recall@10 without reordered products is: 0.03901478816963808
recall@20 is: 0.029320426868367437
recall@20 without reordered products is: 0.06160048200319186
recall@50 is: 0.050095479252988476
recall@50 without reordered products is: 0.10480483440728991


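Compared with the benchmark popularity model, recall@k on the full test data is lower at every k, while recall@k after removing reordered products is higher at every k. In other words, the neighborhood-based model does a better job of recommending new products to consumers. A likely reason for the low scores on the full test data is that TuriCreate's `recommend` excludes items a user already interacted with in the training data by default (`exclude_known=True`), so reordered products can never count as hits.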
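
To make the effect of excluding already-purchased items concrete, here is a minimal NumPy sketch of a generic item-item cosine recommender on a hypothetical 3x4 purchase matrix. It is an illustration under stated assumptions, not TuriCreate's internal implementation: the matrix `R` and the quantity-weighted scoring rule are made up for this example.

```python
import numpy as np

# Hypothetical user x item purchase counts (3 users, 4 items).
R = np.array([
    [2, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 3],
], dtype=float)

# Cosine similarity between item columns.
norms = np.linalg.norm(R, axis=0)
S = (R.T @ R) / np.outer(norms, norms)
np.fill_diagonal(S, 0.0)  # an item should not vote for itself

# Score every item for user 0 as the quantity-weighted sum of its
# similarities to the items that user already bought (assumed scoring rule).
user = 0
scores = S @ R[user]

# Mask items the user already bought so only new products can be recommended.
scores_new = np.where(R[user] > 0, -np.inf, scores)

top_any = int(np.argmax(scores))      # best item when reorders are allowed
top_new = int(np.argmax(scores_new))  # best item among never-bought products
print(top_any, top_new)
```

In this toy case the highest-scoring item overall is one the user already bought, and masking known items moves the recommendation to a product the user has not tried yet.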

### 5.2 Neighborhood-Based Model using Jaccard Similarity Metric

In [21]:
# default similarity metric is jaccard, set verbose=False to save space
model_jac = tc.item_similarity_recommender.create(training_data, user_id='user_id', item_id='product_id', 
                                                  target='quantity', verbose=False)

In [22]:
# check the recommendation result
rec_jac = model_jac.recommend(users=users_to_recommend, verbose=False)
rec_jac.print_rows(15)

+---------+------------+----------------------+------+
| user_id | product_id |        score         | rank |
+---------+------------+----------------------+------+
|    1    |   37710    | 0.04831514755884806  |  1   |
|    1    |   38928    |  0.0440241018931071  |  2   |
|    1    |    6184    | 0.04236073096593221  |  3   |
|    1    |   41400    | 0.04022345145543416  |  4   |
|    1    |   39657    |  0.0396228035291036  |  5   |
|    1    |   21137    | 0.038310543696085615 |  6   |
|    1    |   13575    | 0.03744825124740601  |  7   |
|    1    |   21903    | 0.03731383085250854  |  8   |
|    1    |   31759    | 0.037294272581736246 |  9   |
|    1    |   11759    | 0.03690311113993327  |  10  |
|    2    |   21137    | 0.037031336997946106 |  1   |
|    2    |    8277    | 0.03226686455309391  |  2   |
|    2    |   40706    | 0.03164547371367613  |  3   |
|    2    |   26209    | 0.03134287086625894  |  4   |
|    2    |   24964    | 0.031124326090017956 |  5   |
+---------+------------+----------------------+------+

In [23]:
# evaluate
N = [10, 20, 50]

for n in N:
    # get k recommended items
    rec_k = model_jac.recommend(users=users_to_recommend, k=n, verbose=False)
    # transform the SFrame output to a dataframe
    rec_df = output_transformer(rec_k)
    # calculate recall scores
    print('recall@{0} is: {1}'.format(n, mean_recall_at_k_rec(rec_df, test_data)))
    print('recall@{0} without reordered products is: {1}'.format(
                                                        n,mean_recall_at_k_rec_new(rec_df, test_data)))

recall@10 is: 0.01750721214063602
recall@10 without reordered products is: 0.03639181336089281
recall@20 is: 0.02887196356977312
recall@20 without reordered products is: 0.06012400006240632
recall@50 is: 0.053131122258287596
recall@50 without reordered products is: 0.11039707262651227


- Like the cosine model, this model outperforms the popularity model once reordered products are excluded.
- Comparing the cosine and Jaccard models, the cosine model is slightly better when recommending 10 or 20 products, while the Jaccard model is slightly better when recommending 50.
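
The two metrics differ in what they look at: cosine similarity uses the purchase quantities, while Jaccard similarity only considers whether a user bought an item at all. The sketch below computes both for two hypothetical items; the vectors and the helper names `cosine_sim` and `jaccard_sim` are assumptions for illustration and follow the textbook definitions, not TuriCreate's implementation.

```python
import numpy as np

# Hypothetical purchase counts for two items across five users.
item_a = np.array([9, 0, 1, 0, 2], dtype=float)
item_b = np.array([1, 0, 8, 3, 0], dtype=float)

def cosine_sim(x, y):
    """Cosine of the angle between two quantity vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def jaccard_sim(x, y):
    """Shared buyers divided by all buyers of either item."""
    bought_x, bought_y = x > 0, y > 0
    return float((bought_x & bought_y).sum() / (bought_x | bought_y).sum())

print('cosine :', round(cosine_sim(item_a, item_b), 4))
print('jaccard:', jaccard_sim(item_a, item_b))
```

Two of the four buyers are shared, so the Jaccard similarity is 0.5, while the cosine similarity is much lower because the shared buyers purchased the two items in very different quantities.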

#### Example: user_id == 100

In [31]:
get_prod_names(model_cos.recommend(users=['100'], verbose=False).to_dataframe().product_id)

Unnamed: 0,product_id,product_name
13175,13176,Bag of Organic Bananas
21136,21137,Organic Strawberries
21902,21903,Organic Baby Spinach
24963,24964,Organic Garlic
26208,26209,Limes
45006,45007,Organic Zucchini
47208,47209,Organic Hass Avocado
47625,47626,Large Lemon
47765,47766,Organic Avocado
49682,49683,Cucumber Kirby


In [32]:
print('User_id 100 actually bought: ')
get_prod_names(test_data.products[99])

User_id 100 actually bought: 


Unnamed: 0,product_id,product_name
21136,21137,Organic Strawberries
21615,21616,Organic Baby Arugula
24851,24852,Banana
26368,26369,Organic Roma Tomato
27343,27344,Uncured Genoa Salami
38546,38547,Bubblegum Flavor Natural Chewing Gum
38688,38689,Organic Reduced Fat Milk
48627,48628,Organic Whole Wheat Bread
