In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
color = sns.color_palette()

from sklearn.metrics.pairwise import cosine_similarity

import pickle
import warnings
warnings.filterwarnings("ignore")

np.random.seed(12345)

In [2]:
X = pd.read_csv("X.csv")
y = pd.read_csv("y.csv")

In [3]:
X = X.set_index('user_id')
y = y.set_index('user_id')

In [4]:
X.head(3)

user_id,air fresheners candles,asian foods,baby accessories,baby bath body care,baby food formula,bakery desserts,baking ingredients,baking supplies decor,beauty,beers coolers,...,spreads,tea,tofu meat alternatives,tortillas flat bread,trail mix snack mix,trash bags liners,vitamins supplements,water seltzer sparkling water,white wines,yogurt
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1
2,0.0,0.214286,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,...,0.142857,0.071429,0.071429,0.0,0.0,0.0,0.0,0.142857,0.0,0.642857
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.363636,0.090909,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0


In [5]:
y.head(3)

user_id,air fresheners candles,asian foods,baby accessories,baby bath body care,baby food formula,bakery desserts,baking ingredients,baking supplies decor,beauty,beers coolers,...,spreads,tea,tofu meat alternatives,tortillas flat bread,trail mix snack mix,trash bags liners,vitamins supplements,water seltzer sparkling water,white wines,yogurt
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


# Recommender System

## Collaborative Based Recommendations


Collaborative filtering relies on similarities between users and items (here, product categories). We have a normalized data frame in which each row represents a user and each column a product category, so we can treat each value between zero and one (how often, on average, the category was purchased) as a measure of how much the user likes, or depends on, that category. The normalization accounts for different purchase history sizes, so a similar score for a category carries a similar meaning for both users. Therefore, if two users score the same categories similarly, we consider those users similar.

For this task we already have our user-item utility matrix, where each entry corresponds to a "rating" for a particular category given by a user.

In [6]:
X.head(5)

user_id,air fresheners candles,asian foods,baby accessories,baby bath body care,baby food formula,bakery desserts,baking ingredients,baking supplies decor,beauty,beers coolers,...,spreads,tea,tofu meat alternatives,tortillas flat bread,trail mix snack mix,trash bags liners,vitamins supplements,water seltzer sparkling water,white wines,yogurt
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1
2,0.0,0.214286,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,...,0.142857,0.071429,0.071429,0.0,0.0,0.0,0.0,0.142857,0.0,0.642857
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.363636,0.090909,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.25,0.0,0.0
5,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.75



Zero values in the matrix represent categories that a given user never purchased; we assume that such a category is either disliked or has never been tried.



User-item filtering, which we implement here, assumes that users with similar preferences for a given set of categories also have similar preferences for other categories. Therefore, to make a recommendation for a given user, we first need to find similar customers (those with similar preferences) and then estimate the scores for categories that the user in question has never purchased. For instance, user 2 occasionally buys "asian foods" and much more often purchases "yogurt"; user 5 also likes "asian foods" and "yogurt" but has higher scores for both. The model would estimate how much user 5 would like "spreads" based on the similarity between the two users.
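As a rough illustration of the similarity idea, the sketch below computes the cosine similarity between users 2 and 5, restricted to the two shared categories visible in the table preview above (the full rows contain more columns, so the real value may differ). The helper name `cosine` is made up for this example; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# scores for ("asian foods", "yogurt"), copied from the X.head(5) preview
user_2 = [0.214286, 0.642857]
user_5 = [0.5, 0.75]

similarity = cosine(user_2, user_5)
print(round(similarity, 3))
```

Because cosine similarity only looks at the direction of the two vectors, user 5's generally higher scores do not by themselves make the users dissimilar.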

Following the assumptions above, the function below restricts the data to users who have purchased (at least) the same set of categories as the customer we are making the recommendation for. It then calculates the cosine similarity between the customer in question and each user in that set. Since we assume that users with similar scores for a given set of categories will also have similar scores for another set of categories, the function extracts all categories purchased by the similar customers and keeps only those that the user in question has not bought before. As a final step, it estimates the score for each of these categories and recommends the categories with the highest estimated scores.
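The steps described above can be traced on a tiny example. The sketch below uses a hypothetical four-user frame named `toy` (not the real data): it filters to users who bought every category the target user bought, computes their cosine similarities, and forms the similarity-weighted average for one unseen category.

```python
import numpy as np
import pandas as pd

# hypothetical toy utility matrix: rows are users, columns are categories
toy = pd.DataFrame(
    {"tea": [0.6, 0.4, 0.0, 0.3],
     "yogurt": [0.2, 0.2, 0.9, 0.6],
     "spreads": [0.0, 0.4, 0.1, 0.1]},
    index=pd.Index(["a", "b", "c", "d"], name="user_id"),
)
target = "a"

# categories the target user has purchased
bought = toy.loc[target].ne(0.0)
# keep only users who purchased every one of those categories
peers = toy.loc[:, bought]
peers = peers[peers.ne(0.0).all(axis=1)]

# cosine similarity between the target and each remaining peer
others = peers.drop(index=target)
t = peers.loc[target].to_numpy()
m = others.to_numpy()
sims = m @ t / (np.linalg.norm(m, axis=1) * np.linalg.norm(t))

# similarity-weighted average of the peers' scores for an unseen category
scores = toy.loc[others.index, "spreads"].to_numpy()
estimate = float(np.dot(scores, sims) / sims.sum())
print(others.index.tolist(), round(estimate, 3))
```

User `c` is dropped because they never bought "tea", and user `b`, whose purchase pattern is closer to the target's, pulls the estimate towards their own "spreads" score.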

In [7]:
def sort_dict(dictionary):
    return {key: value for key, value in sorted(dictionary.items(), key = lambda item: item[1], reverse = True)}

In [8]:
def get_recommendation(user_id, number_of_items_to_recommend=3):
    
    '''
    Recommends categories for a given user_id.
    
    - restricts the data to users who purchased (at least) every category the user has purchased
      and ranks them by cosine similarity to the user
    
    - estimates a score for each category those users bought that the user has not,
      and returns the number_of_items_to_recommend categories with the highest scores
    '''
    
    # get categories purchased by the user (True/False):
    user_categories = X.loc[user_id, :].ne(0.0)
    
    
    # select names of categories that are True 
    # (for future use of establishing which categories we need to recommend for)
    user_categories_names = user_categories.index[user_categories == True]
    
    # filter the data frame for only categories in review:
    users_within_categories = X.loc[:, user_categories]
    
    # keep only users who bought the same set of categories at least once:
    complete_cases = (users_within_categories != 0.0).all(axis=1)
    
    complete_cases_df = users_within_categories[complete_cases]
    
    # calculate cosine for the set of users and make a data frame:
    cosine_values = cosine_similarity(complete_cases_df.loc[user_id,:].values.reshape(1,-1), complete_cases_df)[0]
    cosine_df = pd.DataFrame({'user_id': complete_cases_df.index, 'similarity': cosine_values}).sort_values(by = 'similarity', ascending = False)
    
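    # get all categories purchased by similar users:
        # get their IDs first (exclude the user in question explicitly, since ties
        # in similarity mean their own row is not guaranteed to be sorted first):
    similar_users_id = cosine_df.loc[cosine_df['user_id'] != user_id, 'user_id']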
    
    #instantiate an empty list for categories
    categories = []
    
    # for each similar user:
    for user in similar_users_id:
        # get a binary representation of all categories
        binary_categories = X.loc[user, :].ne(0.0)
        
        # keep only true categories (purchased at least once)
        purchased_categories = binary_categories.index[binary_categories == True]
        
        # add the categories to the list
        categories = categories + list(purchased_categories)
    
    # now we have a list of all purchased categories for all similar users. Keep only unique
    categories = list(set(categories))
    
    # get all categories the user in question has not purchased:
        # instantiate an empty list: 
    candidates = []
    
    # take all categories extracted from similar users and remove categories 
    # that the user in question has purchased before
    for category in categories:
        if category not in user_categories_names:
            candidates.append(category)
    
    
    # estimate the score for each candidate
        # instantiate an empty dictionary:
    estimated_scores = {}
    
    for category in candidates:
        
        # get all similar users who purchased a given category
        df = X.loc[similar_users_id, category].to_frame()
        
        # attach cosine similarities for those users
        df = df.merge(cosine_df, left_index = True, right_on = 'user_id').set_index('user_id')
        
        # get an array of the scores
        scores = df[category].values
        
        # get an array of similarities
        similarity = df['similarity'].values
        
        # estimate the score (value) of the category for the user in question
        predicted_score = np.dot(scores, similarity)/np.sum(similarity)
        
        # save the results to a dictionary
        estimated_scores[category] = predicted_score
    
    # sort the dictionary in descending order
    estimated_scores = sort_dict(estimated_scores)
    
    # output the first 'number_of_items_to_recommend' categories
    return list(estimated_scores.keys())[:number_of_items_to_recommend]


In [13]:
get_recommendation(6852, 5)

['chips pretzels',
 'soy lactosefree',
 'eggs',
 'water seltzer sparkling water',
 'lunch meat']

Now we have a recommendation function that uses the collaborative filtering approach. The disadvantage of this method is that we cannot evaluate it with traditional accuracy measures such as Root Mean Square Error (RMSE). Therefore, we will use more advanced models from the `surprise` library: `BaselineOnly()`, which predicts baseline estimates for a user-category pair, and a matrix factorization-based algorithm, `SVD()`, which we import under the name `FunkSVD()`.

# Recommender System using `Surprise` package

This library is commonly used for recommender systems; however, it requires the data frame in a specific format. Specifically, the frame must have exactly 3 columns: user, item (category) and score. To get there, we take our `X` data frame and transform it accordingly.

In [19]:
# in this cell we loop through the entire normalized data frame and store each element as a tuple of the format:
# (user_id, category, score)
X_data = []

for user in X.index:
    for category in X.columns:
        X_data.append((user, category, X.loc[user,category]))
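The nested loop above performs one `.loc` lookup per cell, which is slow for a frame with roughly 27.6 million entries. A vectorized alternative, shown here only as a sketch on a hypothetical two-user frame named `toy`, is to reshape from wide to long with `melt`. The sort step is there because `melt` emits rows column by column, whereas the loop goes user by user; the resulting row order matches the loop only if the columns are already in alphabetical order, as they appear to be in the preview.

```python
import pandas as pd

# hypothetical stand-in for X: users as the index, categories as columns
toy = pd.DataFrame(
    {"tea": [0.5, 0.0], "yogurt": [0.25, 0.75]},
    index=pd.Index([1, 2], name="user_id"),
)

# wide -> long: one row per (user_id, category) pair
long_df = (
    toy.reset_index()
       .melt(id_vars="user_id", var_name="category", value_name="score")
       .sort_values(["user_id", "category"], kind="stable")
       .reset_index(drop=True)
)
print(long_df)
```

The result already has the three columns in the order the `surprise` reader expects, so no separate renaming step is needed.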

In [20]:
# here we unpack the list of tuples in order to create a proper format of the dataframe.
X_df = pd.DataFrame(X_data)

In [21]:
X_df.columns = ['user_id', 'category', 'score']
X_df

,user_id,category,score
0,1,air fresheners candles,0.000000
1,1,asian foods,0.000000
2,1,baby accessories,0.000000
3,1,baby bath body care,0.000000
4,1,baby food formula,0.000000
...,...,...,...
27632001,206209,trash bags liners,0.076923
27632002,206209,vitamins supplements,0.000000
27632003,206209,water seltzer sparkling water,0.000000
27632004,206209,white wines,0.000000


In [45]:
#X_df.to_csv('Recommender/X_df.csv', index = False)

In [22]:
from surprise import Dataset
from surprise.reader import Reader
from surprise.prediction_algorithms.matrix_factorization import SVD as FunkSVD
from surprise.prediction_algorithms.baseline_only import BaselineOnly
from surprise.model_selection import cross_validate

from surprise import accuracy
from surprise.model_selection import train_test_split

In [23]:
# load the data from X_df into the Dataset object required by the surprise library
# and pass the rating scale (scores range from 0 to 1) to the Reader
dataset = Dataset.load_from_df(X_df, Reader(rating_scale = (0,1))) 

train_set, test_set = train_test_split(dataset, test_size=0.2)

In [32]:
dataset

<surprise.dataset.DatasetAutoFolds at 0x7ff43fc06890>

In [25]:
benchmark = []
# Iterate over two potential algorithms
for algorithm in [FunkSVD(), BaselineOnly()]:
    # Perform cross validation of the data for a given algorithm using RMSE as accuracy measure
    results = cross_validate(algorithm, dataset, measures=['RMSE'], cv=3, verbose=False, n_jobs = -1)
    
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    # pd.concat replaces Series.append, which newer pandas versions no longer provide
    tmp = pd.concat([tmp, pd.Series([type(algorithm).__name__], index=['Algorithm'])])
    benchmark.append(tmp)
    
pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')

# (Li, 2019)

Algorithm,test_rmse,fit_time,test_time
BaselineOnly,0.129237,13.367005,211.294091
SVD,0.129436,1175.113537,255.555259


Based on the table above, of the 2 candidate models `BaselineOnly()` has a marginally smaller test RMSE (the smaller the better) and a fit time that is about 88 times shorter than that of `FunkSVD()` (about 13.4 versus 1,175.1 seconds). It is the obvious choice, so we proceed with the `BaselineOnly()` model.

A very popular approach for recommender systems is to use Alternating Least Squares (ALS) in collaborative filtering. ALS determines latent factors that explain the user-item ratings and finds the weights that minimize the squared error between predicted and actual values (Elena.cuoco, 2017). In `BaselineOnly()`, the `'als'` method is used to estimate the baseline biases, as the log line "Estimating biases using als..." in the output below indicates.
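For a baseline model, the alternating updates act on per-user and per-item biases rather than on latent factor vectors. The sketch below is an illustration written from scratch, not the `surprise` implementation: the function name `als_baselines`, the toy data and the clipping to the 0 to 1 scale are assumptions made for this example, while the default regularization values and the number of epochs are borrowed from the `bsl_parameters` used in this notebook.

```python
import numpy as np

def als_baselines(ratings, n_users, n_items, n_epochs=5, reg_u=12, reg_i=5):
    """Estimate user/item biases by alternating regularized averages.

    ratings: list of (user_index, item_index, score) tuples.
    Illustrative only; not the surprise implementation.
    """
    mu = np.mean([r for _, _, r in ratings])  # global mean score
    b_u = np.zeros(n_users)
    b_i = np.zeros(n_items)
    for _ in range(n_epochs):
        # hold user biases fixed, update item biases
        dev, cnt = np.zeros(n_items), np.zeros(n_items)
        for u, i, r in ratings:
            dev[i] += r - mu - b_u[u]
            cnt[i] += 1
        b_i = dev / (reg_i + cnt)
        # hold item biases fixed, update user biases
        dev, cnt = np.zeros(n_users), np.zeros(n_users)
        for u, i, r in ratings:
            dev[u] += r - mu - b_i[i]
            cnt[u] += 1
        b_u = dev / (reg_u + cnt)
    return mu, b_u, b_i

# hypothetical toy data: 2 users x 2 categories
toy = [(0, 0, 0.0), (0, 1, 0.5), (1, 0, 0.25), (1, 1, 0.75)]
mu, b_u, b_i = als_baselines(toy, n_users=2, n_items=2)

# baseline estimate for user 1 and item 1, clipped to the 0-1 rating scale
estimate = float(np.clip(mu + b_u[1] + b_i[1], 0, 1))
print(round(estimate, 3))
```

Larger regularization values shrink the corresponding biases towards zero, so the estimate stays closer to the global mean for users or categories with few observations.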

In [27]:
# BaselineOnly takes its baseline options only as a dictionary:
# 'reg_u' = regularization for users
# 'reg_i' = regularization for items (categories)
bsl_parameters = {'method': 'als', 'n_epochs': 5, 'reg_u': 12, 'reg_i': 5}


final_model = BaselineOnly(bsl_options= bsl_parameters)

predictions = final_model.fit(train_set).test(test_set)

Estimating biases using als...


In [29]:
df_final = pd.DataFrame(predictions)
df_final.head()

,uid,iid,r_ui,est,details
0,25845,dish detergents,0.0,0.045451,{'was_impossible': False}
1,137714,baking ingredients,0.0,0.057933,{'was_impossible': False}
2,141009,other,0.0,0.041065,{'was_impossible': False}
3,137739,facial care,0.0,0.0,{'was_impossible': False}
4,179000,fresh pasta,0.0,0.033377,{'was_impossible': False}


In [40]:
# get the absolute error: the absolute difference between the estimated and the true user score (for each user-item pair)
df_final['err'] = abs(df_final.est - df_final.r_ui)

# split the data into 10 best and worst predictions:
best_predictions = df_final.sort_values(by='err')[:10]
worst_predictions = df_final.sort_values(by='err')[-10:]

In [48]:
best_predictions.columns = ['user_id', 'category', 'true_user_score', 'estimated_score', 'details', 'absolute_error']
best_predictions

,user_id,category,true_user_score,estimated_score,details,absolute_error
3265662,71813,muscles joints pain relief,0.0,0.0,{'was_impossible': False},0.0
4190261,167109,bulk grains rice dried goods,0.0,0.0,{'was_impossible': False},0.0
5013878,102911,granola,0.0,0.0,{'was_impossible': False},0.0
1289484,38245,ice cream toppings,0.0,0.0,{'was_impossible': False},0.0
3795223,158664,kosher foods,0.0,0.0,{'was_impossible': False},0.0
738649,188108,skin care,0.0,0.0,{'was_impossible': False},0.0
4632965,123879,body lotions soap,0.0,0.0,{'was_impossible': False},0.0
5013875,169832,cleaning products,0.0,0.0,{'was_impossible': False},0.0
4190257,111234,indian foods,0.0,0.0,{'was_impossible': False},0.0
3795229,123812,trail mix snack mix,0.0,0.0,{'was_impossible': False},0.0


In [49]:
worst_predictions.columns = ['user_id', 'category', 'true_user_score', 'estimated_score', 'details', 'absolute_error']
worst_predictions

,user_id,category,true_user_score,estimated_score,details,absolute_error
4452234,62850,red wines,1.0,0.0,{'was_impossible': False},1.0
3266271,156144,more household,1.0,0.0,{'was_impossible': False},1.0
1122026,65609,packaged seafood,1.0,0.0,{'was_impossible': False},1.0
5306057,167661,diapers wipes,1.0,0.0,{'was_impossible': False},1.0
3198997,145544,cat food care,1.0,0.0,{'was_impossible': False},1.0
2412404,158813,first aid,1.0,0.0,{'was_impossible': False},1.0
781711,181814,beers coolers,1.0,0.0,{'was_impossible': False},1.0
2445694,51165,spirits,1.0,0.0,{'was_impossible': False},1.0
3049514,61995,frozen meat seafood,1.0,0.0,{'was_impossible': False},1.0
2219035,175906,beers coolers,1.0,0.0,{'was_impossible': False},1.0


The tables above show the 10 best and the 10 worst predictions for user-category pairs. The second table lists the pairs for which the model was furthest off: it estimated a score of 0 (the user would not like the category) when the true score was 1.

In [36]:
FCP = accuracy.fcp(predictions, verbose=False)
print(FCP)

0.7808038042236706


Fraction of Concordant Pairs (FCP) is the fraction of pairs whose relative ranking order is predicted correctly. This metric is commonly used to evaluate recommender systems; the maximum value of 1 would indicate a perfect ranking. Here we achieved an FCP of about 0.78, meaning that roughly 78% of the pairs are ranked in the correct order, which is a fairly good result given the data set and normalization method. As future work, we will include more models as candidates in the `cross_validate` check to see whether any of them outperforms `BaselineOnly()`. Unfortunately, running cross-validation for multiple models on such a large data set takes a significant amount of time.
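To make the definition concrete, here is a simplified sketch of the metric on hypothetical data. It pools the pair counts over all users, skips pairs with equal true scores, and counts tied estimates as discordant; `accuracy.fcp` in `surprise` aggregates per-user counts and may treat ties differently, so the two are not guaranteed to agree. The function name `fraction_concordant` is made up for this example.

```python
from collections import defaultdict
from itertools import combinations

def fraction_concordant(rows):
    """rows: iterable of (user, true_score, estimated_score) tuples."""
    by_user = defaultdict(list)
    for user, true, est in rows:
        by_user[user].append((true, est))

    concordant = discordant = 0
    for pairs in by_user.values():
        # compare every pair of items rated by the same user
        for (t1, e1), (t2, e2) in combinations(pairs, 2):
            if t1 == t2:
                continue  # no true preference between the two items
            # a positive product means the estimates order the pair like the true scores
            if (t1 - t2) * (e1 - e2) > 0:
                concordant += 1
            else:
                discordant += 1
    return concordant / (concordant + discordant)

# hypothetical predictions: (user, true score, estimated score)
toy = [
    ("u1", 0.0, 0.1), ("u1", 0.5, 0.3), ("u1", 1.0, 0.2),
    ("u2", 0.2, 0.4), ("u2", 0.8, 0.6),
]
fcp = fraction_concordant(toy)
print(fcp)
```

Note that the metric only checks ordering: the estimates for `u2` are far from the true scores in absolute terms, yet the pair still counts as concordant.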

In [44]:
# note: this saves the test-set predictions (not the fitted model) to disk
pkl_filename = 'Recommender/BaselineOnly.pkl'
with open(pkl_filename, 'wb') as file:
    pickle.dump(predictions, file)

References:

- Elena.cuoco. (2017, May 17). Alternating Least Squares (ALS) Spark ML. Retrieved September 19, 2020, from https://www.elenacuoco.com/2016/12/22/alternating-least-squares-als-spark-ml/?cn-reloaded=1

- Learner, jezrael, Vivek Kalyanarangan, & Mysterious. (n.d.). Get non zero values for each column in pandas. Retrieved September 18, 2020, from https://stackoverflow.com/questions/52054945/get-non-zero-values-for-each-column-in-pandas

- Li, S. (2019, September 26). Building and Testing Recommender Systems With Surprise, Step-By-Step. Retrieved September 19, 2020, from https://towardsdatascience.com/building-and-testing-recommender-systems-with-surprise-step-by-step-d4ba702ef80b

In [59]:
%load_ext watermark

%watermark -v -m -p numpy,pandas,sklearn -g

The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
CPython 3.7.6
IPython 7.12.0

numpy 1.18.1
pandas 1.0.1
sklearn 0.22.1

compiler   : Clang 4.0.1 (tags/RELEASE_401/final)
system     : Darwin
release    : 19.5.0
machine    : x86_64
processor  : i386
CPU cores  : 8
interpreter: 64bit
Git hash   :
