# Using Recommender Systems to Identify Top Beauty Products

Student name: Jonathan Lee

Student pace: Full Time

Scheduled project review date/time: June 22, 2pm

Instructor name: James Irving

Blog post URL: 

## Overview

This project uses the Surprise library (scikit-surprise) with Amazon review data for Luxury Beauty products to build a recommendation system. We examine the performance of memory-based collaborative filtering in the form of K-Nearest Neighbors (KNN), as well as model-based collaborative filtering in the form of Singular Value Decomposition (SVD). Of the KNN, SVD, and Alternating Least Squares (ALS) methods we tested, SVD was the best-performing model on our selected data. We also identify the best hyperparameters found for this dataset.

## Business Problem

Our client is a beauty product retailer that wants to know which products are most popular on Amazon, as well as which other products customers would be likely to rate highly, under the assumption that they would also rate these popular products highly. We want to optimize a recommender system, based on Amazon reviews, that predicts as accurately as possible which other products customers would be likely to enjoy. Using this optimized recommender system, we will use our client's customer preferences to extract insights into which other brands and products would be successful if our client were to add them to their product offering.
***
Questions to address:
* What are the optimal model and hyperparameters for a recommender system that uses the Amazon ratings dataset to provide recommendations for our own customers?
* What are Amazon's most popular products in terms of number of ratings?
* Assuming that our client's customers currently give high ratings to the popular products on Amazon, what other products can we recommend adding to inventory?
***

## Data Understanding and Preparation

In this analysis, we use [Amazon review data](https://nijianmo.github.io/amazon/index.html) and [product metadata](http://deepyeti.ucsd.edu/jianmo/amazon/index.html) featured in the following paper:

**Justifying recommendations using distantly-labeled reviews and fine-grained aspects**

Jianmo Ni, Jiacheng Li, Julian McAuley


*Empirical Methods in Natural Language Processing (EMNLP), 2019*

Due to the large size of the complete dataset and hardware limitations, we will complete the analysis with only reviews and metadata from the luxury beauty product category.

Let's begin by loading in our data and doing some Exploratory Data Analysis.

In [None]:
# Import standard packages
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
import requests

%matplotlib inline

# Set random seed
np.random.seed(27)

In [None]:
# Set theme and style for plots
sns.set_theme('talk')
sns.set_style('darkgrid')

### Loading in the Data

We have two tables to work with in this analysis:
1. Review data: contains product ASIN code, user code, and the rating that user provided.
2. Product metadata: contains the price, product name, and product images for each product, paired with its ASIN code.

In [None]:
# Load review dataset and metadata
review_df = pd.read_csv('data/Luxury_Beauty.csv', names=['asin', 'user',
                                                         'rating', 'timestamp'])
meta_df = pd.read_json('data/meta_Luxury_Beauty.json.gz', lines=True)
display(review_df, meta_df)

### Dropping Duplicates and Null Values

We are dealing with quite a large dataset, with over 570,000 ratings. Therefore, it is important to reduce memory usage as much as possible by removing unnecessary features. Since the timestamp data is unnecessary for our analysis, we drop that column from our ratings dataset. We also go through an initial round of removing duplicates and null values.

We will also write a function that displays the size of a dataframe, so that we can confirm that the transformations performed on the dataset are resulting in a reduced memory footprint.

In [None]:
def get_df_size(df):
    """
    Gets size of dataframe and prints value in MB.
    Function inspired by James Irving.

    Args:
        df (DataFrame) : DataFrame to print size of.
    Returns:
        None. The size is printed rather than returned.
    """
    size = round((sys.getsizeof(df) * 1e-6), 2)
    
    print(f"Dataframe memory usage: {size} MB.")
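
As a rough cross-check on the helper above, pandas can also report memory per column through `DataFrame.memory_usage`. The sketch below uses a small made-up frame (`toy_df` is hypothetical, not the review data) to show why `deep=True` matters for object-dtype string columns such as ASIN and user codes.

```python
import pandas as pd

# Hypothetical toy frame standing in for the review table
toy_df = pd.DataFrame({
    'asin': pd.Series(['B00004U9V2', 'B00005A77F', 'B00005NDTD'] * 1000,
                      dtype=object),
    'rating': [5, 4, 1] * 1000,
})

# Shallow usage counts only the pointers held by an object-dtype column
shallow_bytes = toy_df.memory_usage(deep=False).sum()

# Deep usage also counts the Python string objects themselves
deep_bytes = toy_df.memory_usage(deep=True).sum()

print(f"Shallow: {shallow_bytes * 1e-6:.2f} MB, deep: {deep_bytes * 1e-6:.2f} MB")
```

The gap between the two numbers is the memory held by the strings themselves, which is what mapping the codes to integers later removes.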

In [None]:
# Drop duplicates and timestamp column from review table
review_df.drop_duplicates(inplace=True)
review_df.drop('timestamp', axis=1, inplace=True)
review_df

In [None]:
# Print size of original ratings df
get_df_size(review_df)

Similarly, with our metadata, we slice out the ASIN codes, product names, and high-resolution image URLs, since those are the pieces of data that will be used in our analysis. Then, we drop duplicates from this table as well.

In [None]:
# Slice asin, title, and image URL columns from metadata table
# (.copy() avoids a SettingWithCopyWarning on the in-place drop below)
meta_df = meta_df[['asin','title', 'imageURLHighRes']].copy()

In [None]:
# Drop duplicates from metadata table
meta_df.drop_duplicates(['asin', 'title'], inplace=True)
meta_df

### Merging Data Tables

Now, we will create a catalog_df that combines all of our ratings with their product titles. This dataframe contains all of the information we will need for our analysis. We also note the size of this initial catalog_df, measured after dropping duplicated and null values but before the transformations that reduce its memory footprint.

In [None]:
# Combine review data and metadata to create catalog table
catalog_df = review_df.merge(meta_df, how='left', on='asin')
catalog_df

In [None]:
# Drop duplicates from merged catalog table
catalog_df.drop_duplicates(['asin', 'user', 'rating', 'title'],inplace=True)
catalog_df

In [None]:
# Check for null values
catalog_df.isna().sum()

Since the number of null values in this catalog dataframe is quite small, we can go ahead and remove the observations where we do not have a product name paired with its ASIN code.

In [None]:
# Drop null values
catalog_df.dropna(inplace=True)
catalog_df
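
One way to see where missing titles come from after a left merge is the `indicator` option of `DataFrame.merge`, which labels each row by the side it was found on. A minimal sketch on made-up rows (the ASINs and titles below are invented for illustration):

```python
import pandas as pd

# Invented rows for illustration only
toy_reviews = pd.DataFrame({'asin': ['A1', 'A2', 'A3'],
                            'user': ['u1', 'u2', 'u3'],
                            'rating': [5, 4, 3]})
toy_meta = pd.DataFrame({'asin': ['A1', 'A2'],
                         'title': ['Face Cream', 'Lip Balm']})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left join
merged = toy_reviews.merge(toy_meta, how='left', on='asin', indicator=True)

# Reviews whose ASIN has no metadata row end up with a null title
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['asin', 'title']])
```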

In [None]:
# Print size of initial catalog_df
get_df_size(catalog_df)

### Visualizing the Data

In this section, we visualize the distribution of our ratings as well as how many ratings each user gave.

In [None]:
# Check distribution of ratings
catalog_df['rating'].value_counts().sort_index(ascending=False)

In [None]:
# Check distribution of ratings in percent
catalog_df['rating'].value_counts(normalize=True).sort_index(ascending=False)

In [None]:
# Create bar plot of rating distribution
fig, ax = plt.subplots(figsize=(10,7))

g = sns.histplot(data=catalog_df, x='rating', hue='rating', palette='cool_r',\
                 discrete=True, legend=True)

ax.set_title('Distribution of Ratings')
ax.set_xlabel('Rating')
ax.set_ylabel('Number of Reviews')
ax.set_xticks([1,2,3,4,5])
ax.legend(['66.3%','12.3%','7.3%','5.1%','8.9%']);
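
The legend labels above are typed in by hand, so they would silently go stale if the data changed. A possible alternative, sketched here on a made-up ratings column rather than catalog_df, is to derive the percentages from `value_counts(normalize=True)`:

```python
import pandas as pd

# Made-up ratings standing in for catalog_df['rating']
toy_ratings = pd.Series([5, 5, 5, 4, 3, 1, 5, 2, 5, 4])

# Share of each rating value, ordered from 5 stars down to 1
shares = toy_ratings.value_counts(normalize=True).sort_index(ascending=False)

# Format each share as a percentage label
labels = [f"{share:.1%}" for share in shares]
print(dict(zip(shares.index, labels)))
```

The resulting list could then be passed to `ax.legend` in place of the literal strings, provided its order matches the order of the plotted bars.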

In [None]:
# Get number of ratings per user
freq_df = catalog_df.groupby('user').agg('count').reset_index()
freq_df

In [None]:
# Inspect measures of central tendency
freq_df.describe()

In [None]:
# Create table with number of users vs number of ratings per user
plot_df = freq_df.groupby('asin').agg('count')[:10]
plot_df

In [None]:
# Create bar plot of users per ratings given
fig, ax = plt.subplots(figsize=(10,7))

g = sns.barplot(data=plot_df, x=plot_df.index, y=plot_df['user'], \
                palette='cool')

ax.set_title('Number of Users per Ratings Given')
ax.set_xlabel('Ratings Given')
ax.set_ylabel('Number of Users')

for p in ax.patches:
    ax.annotate("%.0f" % p.get_height(),
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', fontsize=13,
                color='black', xytext=(0, 5),
                textcoords='offset points');

In [None]:
# Check measures of central tendency
catalog_df.describe()

### Data Mapping

As mentioned before, due to the large size of this dataset, it is important to minimize the amount of memory being used. Hence, we map our ASIN and user codes to integer values, and convert each column to the smallest integer type that holds its values without losing information, in order to reduce memory use during the modeling process.

In [None]:
# Create list of unique asin codes
asin_list = catalog_df['asin'].unique()

In [None]:
# Create an array of integers to map asin codes to
np.arange(len(asin_list))

In [None]:
# Construct dictionary using asin and corresponding product code
asin_map = dict(zip(asin_list, np.arange(len(asin_list))))

In [None]:
# Map asin to product code integer and check
catalog_df['asin'] = catalog_df['asin'].map(asin_map)
catalog_df

In [None]:
# Rename 'asin' column to 'product_code'
catalog_df = catalog_df.rename(columns={'asin': 'product_code'})

In [None]:
# Create list of unique users
user_list = catalog_df['user'].unique()

In [None]:
# Create an array of integers to map user codes to
np.arange(len(user_list))

In [None]:
# Construct dictionary using user code and corresponding integer
user_map = dict(zip(user_list, np.arange(len(user_list))))

In [None]:
# Map user code to integer and check
catalog_df['user'] = catalog_df['user'].map(user_map)
catalog_df

In [None]:
# Convert to more efficient integer types
catalog_df['rating']=catalog_df['rating'].astype(np.int8)
catalog_df['product_code']=catalog_df['product_code'].astype(np.int32)
catalog_df['user']=catalog_df['user'].astype(np.int32)

In [None]:
# Check data types
catalog_df.dtypes
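
For reference, pandas offers a shortcut for this kind of integer encoding. The sketch below, which uses a small made-up frame rather than catalog_df, builds the codes with `pd.factorize` and picks a small integer type with `pd.to_numeric(..., downcast='integer')`. It is an alternative to the dictionary approach above, not what this notebook runs.

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({'asin': ['B0A', 'B0B', 'B0A', 'B0C'],
                    'user': ['U9', 'U9', 'U7', 'U8'],
                    'rating': [5, 4, 5, 2]})

# factorize returns integer codes (in order of first appearance) and the uniques
toy['product_code'], asin_index = pd.factorize(toy['asin'])
toy['user_code'], user_index = pd.factorize(toy['user'])

# Downcast each numeric column to the smallest integer type that fits
for col in ['product_code', 'user_code', 'rating']:
    toy[col] = pd.to_numeric(toy[col], downcast='integer')

print(toy.dtypes)

# The uniques let us map a code back to the original ASIN
print(asin_index[toy.loc[3, 'product_code']])
```

Keeping `asin_index` around serves the same purpose as `asin_map`: it lets us translate a product code back to its ASIN.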

Now that we have reduced the data size by converting each feature to the smallest suitable integer type, let's take a look at the memory usage of our optimized catalog_df.

In [None]:
# Print size of transformed and optimized catalog_df
get_df_size(catalog_df)

Great! We have successfully reduced the memory usage of this catalog_df from 190.61 MB to 117.31 MB without losing any essential information.

### Slicing Data for Modeling

We're almost ready to enter the modeling process, so let's go ahead and slice out just the columns we need to do so.

In [None]:
# Create dataframe with user item rating
df = catalog_df[['user', 'product_code', 'rating']]

In [None]:
# Print size of optimized ratings data only
get_df_size(df)

Again, when we compare our initial ratings df size to our optimized ratings size, we can see that we have gone from 82.72 MB down to 9.11 MB. Much more efficient.

In [None]:
# Save csv file to use in Databricks ALS model
# catalog_df.to_csv(r'data/Luxury_Beauty_reduced.csv', index=False)

## Data Modeling

In this section, we use the Surprise library (scikit-surprise) to test which algorithm is best suited for building a recommender system from our Amazon review data.

The models we will look at are several K-Nearest Neighbor models and a series of gridsearched Singular Value Decomposition models. We also modeled Alternating Least Squares (ALS) in PySpark, but we leave that model out of our main analysis because it performed poorly on this specific dataset and requires PySpark to run.

In [None]:
# If using Colab, install Surprise
# %pip install scikit-surprise

In [None]:
# Import necessary packages for building recommender system
from surprise import Dataset, Reader
from surprise import accuracy
from surprise.prediction_algorithms import knns
from surprise.similarities import cosine, msd, pearson
from surprise.model_selection import cross_validate, train_test_split
from surprise.prediction_algorithms import SVD
from surprise.model_selection import GridSearchCV

In [None]:
# Create reader object and format review data for processing
reader = Reader(line_format = 'user item rating', sep = ',')
data = Dataset.load_from_df(df, reader=reader)

In [None]:
# Create train test split
trainset, testset = train_test_split(data, test_size=0.25, random_state=27)

### Memory-Based Item-Item Collaborative Filtering

As we see below, the number of unique items is far smaller than the number of unique users. Hence, for the following K-Nearest Neighbor models, item-based filtering should be more effective than user-based filtering, both in computational efficiency (the item-item similarity matrix is much smaller) and in performance, since the average rating of an item is less likely to change quickly than the ratings a given user assigns to different items.

For the KNN Basic and KNN with Means algorithms, we will examine performance based on cosine similarity and Pearson correlation coefficient. However, for the KNN with Z-score and KNN Baseline algorithms, we will only examine the Pearson baseline metric, since the Surprise documentation recommends this in order to achieve the best performance.
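
To make the two similarity measures concrete, here is a small NumPy sketch on invented ratings for two items. It assumes both items were rated by the same four users; Surprise restricts its similarity computations to such common users, while the rest of its implementation is not reproduced here.

```python
import numpy as np

# Ratings that the same four users gave to two hypothetical items
item_a = np.array([5.0, 4.0, 5.0, 3.0])
item_b = np.array([4.0, 3.0, 4.0, 2.0])

# Cosine similarity: angle between the raw rating vectors
cosine_sim = item_a @ item_b / (np.linalg.norm(item_a) * np.linalg.norm(item_b))

# Pearson correlation: cosine similarity after subtracting each item's mean
a_centered = item_a - item_a.mean()
b_centered = item_b - item_b.mean()
pearson_sim = a_centered @ b_centered / (
    np.linalg.norm(a_centered) * np.linalg.norm(b_centered))

print(f"cosine: {cosine_sim:.4f}, pearson: {pearson_sim:.4f}")
```

Because item_b is item_a shifted down by one star, the Pearson correlation treats the two items as perfectly aligned, while cosine similarity is pulled slightly below 1 by the difference in rating level.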

As we iterate through each model, we will save the resulting mean scores in a cumulative dataframe to be able to easily compare performances and runtimes.

In [None]:
# Write function to calculate average test metrics
def get_avg_metrics(score_dict):
    """
    Calculates average of each list in the specified dictionary.
    
    Inspired by solution by Jiby on StackOverflow:
    https://stackoverflow.com/questions/30687244/python-3-4-how-to-get-the-average-of-dictionary-values

    Args:
        score_dict (dict) : Dictionary with model test scores.
        
    Returns:
        avgDict (dict) : Dictionary with calculated mean average values.
    """
    
    avgDict = {}
    for k,v in score_dict.items():
        avgDict[k] = sum(v)/ float(len(v))
    return avgDict
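
As a quick illustration of what this helper does, the sketch below averages a made-up dictionary shaped like the output of `cross_validate` (one entry per fold for each key). The numbers are invented, and the dict comprehension is simply an equivalent way of writing the loop above.

```python
import numpy as np

# Made-up scores shaped like the dictionary that cross_validate returns
toy_scores = {
    'test_rmse': np.array([1.20, 1.22, 1.18]),
    'test_mae': np.array([0.90, 0.92, 0.88]),
    'fit_time': (2.0, 2.5, 1.5),
    'test_time': (0.4, 0.6, 0.5),
}

# Same averaging as get_avg_metrics, written as a dict comprehension
toy_avgs = {name: float(np.mean(values)) for name, values in toy_scores.items()}
print(toy_avgs)
```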

In [None]:
# Initialize cumulative results dataframe
cumulative_results = pd.DataFrame()

In [None]:
# Check how many unique values for asin
catalog_df['product_code'].nunique()

In [None]:
# Check how many unique values for user
catalog_df['user'].nunique()
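
These two counts also tell us how sparse the utility matrix is. A small sketch of the density calculation on invented ratings (`toy` is hypothetical data, not our catalog):

```python
import pandas as pd

# Invented ratings: five users, two products
toy = pd.DataFrame({'user': [0, 0, 1, 2, 3, 4],
                    'product_code': [0, 1, 0, 1, 0, 1],
                    'rating': [5, 4, 5, 3, 1, 5]})

n_users = toy['user'].nunique()
n_items = toy['product_code'].nunique()

# Density = observed ratings / all possible user-item pairs
density = len(toy) / (n_users * n_items)

# Item-item similarity needs an n_items x n_items matrix, which is much
# smaller than n_users x n_users when items are the scarcer side
print(n_users, n_items, f"{density:.0%}")
```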

#### KNN Basic

We begin with the most basic form of the K-Nearest Neighbors algorithm.

In [None]:
# KNN Basic with cosine similarity
KNN_basic_cos = knns.KNNBasic(sim_options={'name': 'cosine', 
                                          'user_based': False}).fit(trainset)

In [None]:
# Get predictions on test data and print RMSE and MAE
predictions = KNN_basic_cos.test(testset)
accuracy.rmse(predictions)
accuracy.mae(predictions)

In [None]:
# Save dictionary with cross validated average scores
KNN_basic_cos_dict = cross_validate(KNN_basic_cos, data, verbose= True, \
                                    n_jobs=-1)
KNN_basic_cos_dict = get_avg_metrics(KNN_basic_cos_dict)

In [None]:
# Create df from row of mean results to append to cumulative df
row_to_df = pd.DataFrame(KNN_basic_cos_dict, index=["KNN_basic_cos"])
cumulative_results = pd.concat([cumulative_results, row_to_df])
cumulative_results.style.background_gradient(cmap="Blues_r")

Now that we have a starting point, let's compare how using the Pearson correlation coefficient as our similarity measure alters the RMSE.

In [None]:
# KNN Basic with pearson correlation similarity
KNN_basic_pearson = knns.KNNBasic(sim_options={'name': 'pearson', 
                                              'user_based': False})\
                        .fit(trainset)

In [None]:
# Get predictions on test data and print RMSE and MAE
predictions = KNN_basic_pearson.test(testset)
accuracy.rmse(predictions)
accuracy.mae(predictions)

In [None]:
# Save dictionary with cross validated average scores
KNN_basic_pearson_dict = cross_validate(KNN_basic_pearson, \
                                        data, verbose= True, n_jobs=-1)
KNN_basic_pearson_dict = get_avg_metrics(KNN_basic_pearson_dict)

In [None]:
# Create df from row of mean results to append to cumulative df
row_to_df = pd.DataFrame(KNN_basic_pearson_dict, index=["KNN_basic_pearson"])
cumulative_results = pd.concat([cumulative_results, row_to_df])
cumulative_results.style.background_gradient(cmap="Blues_r")

We see that we have a slightly lower RMSE when we use the Pearson correlation coefficient on the KNN basic algorithm. Although the fit time is quite a bit longer than when we used the cosine similarity, this difference is not large enough for us to sacrifice a lower RMSE.

#### KNN With Means

Next, we move onto a KNN algorithm which takes into account the mean ratings of each item.
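
To make the idea concrete, here is a hand-rolled sketch of a mean-centered, item-based estimate: start from the target item's mean rating and add the similarity-weighted average of how far the user's ratings of neighboring items sit from those items' means. The function name and the numbers are illustrative only; Surprise's implementation also handles neighbor selection and edge cases that are omitted here.

```python
import numpy as np

def mean_centered_estimate(item_mean, sims, ratings_by_user, neighbor_means):
    """Item mean plus the weighted average of mean-centered neighbor ratings."""
    sims = np.asarray(sims, dtype=float)
    deviations = (np.asarray(ratings_by_user, dtype=float)
                  - np.asarray(neighbor_means, dtype=float))
    return item_mean + (sims * deviations).sum() / sims.sum()

# Target item averages 4.0 stars; the user rated two similar items
estimate = mean_centered_estimate(
    item_mean=4.0,
    sims=[0.8, 0.2],               # similarity of each neighbor to the target
    ratings_by_user=[5.0, 3.0],    # what this user gave each neighbor
    neighbor_means=[4.5, 3.5],     # each neighbor's average rating
)
print(round(estimate, 2))
```

The user rated the closer neighbor half a star above its average, so the estimate lands above the target item's own mean.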

In [None]:
# KNN with Means with cosine similarity
KNN_mean_cos = knns.KNNWithMeans(sim_options={'name': 'cosine', \
                                              'user_based': False})\
                   .fit(trainset)

In [None]:
# Get predictions on test data and print RMSE and MAE
predictions = KNN_mean_cos.test(testset)
accuracy.rmse(predictions)
accuracy.mae(predictions)

In [None]:
# Save dictionary with cross validated average scores
KNN_mean_cos_dict = cross_validate(KNN_mean_cos, data, verbose= True, \
                                   n_jobs=-1)
KNN_mean_cos_dict = get_avg_metrics(KNN_mean_cos_dict)

In [None]:
# Create df from row of mean results to append to cumulative df
row_to_df = pd.DataFrame(KNN_mean_cos_dict, index=["KNN_mean_cos"])
cumulative_results = pd.concat([cumulative_results, row_to_df])
cumulative_results.style.background_gradient(cmap="Blues_r")

Here, we see that our KNN with means using cosine similarity is not able to achieve a better score than our KNN basic with Pearson's correlation coefficient. Let's see what happens when we use the Pearson correlation coefficient on KNN with means.

In [None]:
# KNN with Means with pearson correlation similarity
KNN_mean_pearson = knns.KNNWithMeans(sim_options={'name': 'pearson', \
                                                  'user_based': False})\
                       .fit(trainset)

In [None]:
# Get predictions on test data and print RMSE and MAE
predictions = KNN_mean_pearson.test(testset)
accuracy.rmse(predictions)
accuracy.mae(predictions)

In [None]:
# Save dictionary with cross validated average scores
KNN_mean_pearson_dict = cross_validate(KNN_mean_pearson, data, verbose= True,\
                                       n_jobs=-1)
KNN_mean_pearson_dict = get_avg_metrics(KNN_mean_pearson_dict)

In [None]:
# Create df from row of mean results to append to cumulative df
row_to_df = pd.DataFrame(KNN_mean_pearson_dict, index=["KNN_mean_pearson"])
cumulative_results = pd.concat([cumulative_results, row_to_df])
cumulative_results.style.background_gradient(cmap="Blues_r")

Interestingly, we still do not have a better RMSE than our KNN basic with Pearson's correlation coefficient.

#### KNN With Z-Score

This algorithm takes into account the Z-score normalization of each item's ratings.

In [None]:
# KNN with Z-score with pearson baseline correlation similarity
KNN_z_pearson = knns.KNNWithZScore(sim_options={'name': 'pearson_baseline', \
                                                'user_based': False})\
                    .fit(trainset)

In [None]:
# Get predictions on test data and print RMSE and MAE
predictions = KNN_z_pearson.test(testset)
accuracy.rmse(predictions)
accuracy.mae(predictions)

In [None]:
# Save dictionary with cross validated average scores
KNN_z_pearson_dict = cross_validate(KNN_z_pearson, data, verbose= True, \
                                    n_jobs=-1)
KNN_z_pearson_dict = get_avg_metrics(KNN_z_pearson_dict)

In [None]:
# Create df from row of mean results to append to cumulative df
row_to_df = pd.DataFrame(KNN_z_pearson_dict, index=["KNN_z_pearson"])
cumulative_results = pd.concat([cumulative_results, row_to_df])
cumulative_results.style.background_gradient(cmap="Blues_r")

KNN with Z-score using the Pearson baseline similarity yields a slightly better RMSE than most models so far, with a score and fit time very similar to those of our KNN basic with Pearson's correlation coefficient. However, KNN basic with Pearson's correlation coefficient is still our best algorithm to this point.

#### KNN Baseline

This final algorithm is a K-Nearest Neighbors algorithm that takes into account a baseline rating for each item.

In [None]:
# KNN Baseline with pearson baseline similarity
KNN_base_pearson= knns.KNNBaseline(sim_options={'name': 'pearson_baseline', \
                                                'user_based': False})\
                      .fit(trainset)

In [None]:
# Get predictions on test data and print RMSE and MAE
predictions = KNN_base_pearson.test(testset)
accuracy.rmse(predictions)
accuracy.mae(predictions)

In [None]:
# Save dictionary with cross validated average scores
KNN_base_pearson_dict = cross_validate(KNN_base_pearson, data, \
                                       verbose= True, n_jobs=-1)
KNN_base_pearson_dict = get_avg_metrics(KNN_base_pearson_dict)

In [None]:
# Create df from row of mean results to append to cumulative df
row_to_df = pd.DataFrame(KNN_base_pearson_dict, index=["KNN_base_pearson"])
cumulative_results = pd.concat([cumulative_results, row_to_df])
cumulative_results.style.background_gradient(cmap="Blues_r")

All of the earlier KNN algorithms have similar RMSE scores, while KNN Baseline with the Pearson baseline similarity stands apart. Hence, we have a clear winner: KNN Baseline has the best RMSE and MAE of all the KNN algorithms that we have examined to this point.

### Model-Based Collaborative Filtering via Matrix Factorization

#### Singular Value Decomposition

Now, let's move on to the SVD model, where we will begin with a basic model and try to improve our score through a series of gridsearches. This model-based approach takes a sparse users x items utility matrix and decomposes it into item characteristics and the user preferences that correspond to those characteristics. By utilizing a gridsearch, we can determine the optimal number of factors (characteristics/preferences), as well as tune the learning rate and the regularization term.
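
As a rough illustration of what the factors, learning rate, and regularization term do, the sketch below runs a single stochastic gradient descent step of a biased matrix factorization on one made-up rating. All values are invented and the variable names are ours; it is a simplified picture of this family of models, not Surprise's actual code.

```python
import numpy as np

rng = np.random.default_rng(27)
n_factors, lr, reg = 3, 0.025, 0.1

mu = 4.2                                # global mean rating (made up)
b_u, b_i = 0.0, 0.0                     # user and item biases
p_u = rng.normal(0, 0.1, n_factors)     # user preference vector
q_i = rng.normal(0, 0.1, n_factors)     # item characteristic vector
r_ui = 5.0                              # one observed rating

def predict(mu, b_u, b_i, p_u, q_i):
    # Estimate = global mean + biases + dot product of the factor vectors
    return mu + b_u + b_i + q_i @ p_u

err_before = r_ui - predict(mu, b_u, b_i, p_u, q_i)

# One gradient step on the regularized squared error for this rating
b_u += lr * (err_before - reg * b_u)
b_i += lr * (err_before - reg * b_i)
p_u, q_i = (p_u + lr * (err_before * q_i - reg * p_u),
            q_i + lr * (err_before * p_u - reg * q_i))

err_after = r_ui - predict(mu, b_u, b_i, p_u, q_i)
print(f"error before: {err_before:.4f}, after: {err_after:.4f}")
```

A larger learning rate takes bigger steps toward the observed rating, while a larger regularization term pulls the biases and factors back toward zero to limit overfitting.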

In [None]:
# Train basic SVD model
svd1 = SVD(random_state=27)
svd1.fit(trainset)

In [None]:
# Get predictions on test data and print RMSE and MAE
predictions = svd1.test(testset)
accuracy.rmse(predictions)
accuracy.mae(predictions)

In [None]:
# Save dictionary with average scores
svd1_dict = cross_validate(svd1, data, verbose= True, n_jobs=-1)
svd1_dict = get_avg_metrics(svd1_dict)

In [None]:
# Create df from row of mean results to append to cumulative df
row_to_df = pd.DataFrame(svd1_dict, index=["svd1"])
cumulative_results = pd.concat([cumulative_results, row_to_df])
cumulative_results.style.background_gradient(cmap="Blues_r")

Not a bad start for a basic SVD model. We have a slightly higher RMSE than our best KNN model. However, we should also note that our fit time is quite a bit longer than any of our memory-based models. Let's go about trying to optimize our SVD model for a better RMSE by using a series of grid searches.

In [None]:
# Gridsearch #1
param_grid = {'n_factors':[110, 130],'n_epochs': [25, 30], \
              'lr_all': [0.025, 0.05], 'reg_all': [0.1, 0.2]}
svd_grid1 = GridSearchCV(SVD,param_grid=param_grid,joblib_verbose=5, \
                         n_jobs=-1)
svd_grid1.fit(data)

In [None]:
# Print results from gridsearch #1
svd_grid1.best_params
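
Grid search cost grows multiplicatively with each added value, which is why the grids here are kept small. Below is a quick count for the first grid, assuming 5 cross-validation folds (our understanding of the default in Surprise's `GridSearchCV`):

```python
from itertools import product

param_grid = {'n_factors': [110, 130], 'n_epochs': [25, 30],
              'lr_all': [0.025, 0.05], 'reg_all': [0.1, 0.2]}

# Every combination of one value per hyperparameter
combos = [dict(zip(param_grid, values))
          for values in product(*param_grid.values())]

n_folds = 5  # assumption: default number of folds
print(f"{len(combos)} combinations x {n_folds} folds = "
      f"{len(combos) * n_folds} fits")
print(combos[0])
```

In Surprise, `best_params` is a dictionary keyed by accuracy measure, so `best_params['rmse']` holds the combination with the lowest RMSE.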

In [None]:
# Use best params to get RMSE and MAE on test data
svd2 = SVD(n_factors=130, n_epochs=30, lr_all=0.025, reg_all=0.1, \
           random_state=27)
svd2.fit(trainset)
predictions = svd2.test(testset)
accuracy.rmse(predictions)
accuracy.mae(predictions)

In [None]:
# Save dictionary with average scores
svd2_dict = cross_validate(svd2, data, verbose= True, n_jobs=-1)
svd2_dict = get_avg_metrics(svd2_dict)

In [None]:
# Create df from row of mean results to append to cumulative df
row_to_df = pd.DataFrame(svd2_dict, index=["svd2"])
cumulative_results = pd.concat([cumulative_results, row_to_df])
cumulative_results.style.background_gradient(cmap="Blues_r")

Although we see that our fit times are becoming relatively long, after just one grid search, we already have our best RMSE out of both memory-based and model-based algorithms.

In [None]:
# Gridsearch #2
param_grid = {'n_factors':[130, 150],'n_epochs': [30, 40], \
              'lr_all': [0.01, 0.025], 'reg_all': [0.05, 0.1]}
svd_grid2 = GridSearchCV(SVD,param_grid=param_grid,joblib_verbose=5, \
                         n_jobs=-1)
svd_grid2.fit(data)

In [None]:
# Print results from gridsearch #2
svd_grid2.best_params

In [None]:
# Use best params to get RMSE and MAE on test data
svd3 = SVD(n_factors=150, n_epochs=40, lr_all=0.025, reg_all=0.1, \
           random_state=27)
svd3.fit(trainset)
predictions = svd3.test(testset)
accuracy.rmse(predictions)
accuracy.mae(predictions)

In [None]:
# Save dictionary with average scores
svd3_dict = cross_validate(svd3, data, verbose= True, n_jobs=-1)
svd3_dict = get_avg_metrics(svd3_dict)

In [None]:
# Create df from row of mean results to append to cumulative df
row_to_df = pd.DataFrame(svd3_dict, index=["svd3"])
cumulative_results = pd.concat([cumulative_results, row_to_df])
cumulative_results.style.background_gradient(cmap="Blues_r")

Our RMSE and MAE scores continue to get better with each grid search, but it looks like our RMSE is only improving marginally in comparison to the amount of additional time it is taking to fit our models. We will proceed to do one final grid search to see if we can improve our RMSE by just a bit more.

In [None]:
# Gridsearch #3
param_grid = {'n_factors':[150, 200],'n_epochs': [40, 50], 'lr_all': [0.025],
              'reg_all': [0.1]}
svd_grid_final = GridSearchCV(SVD,param_grid=param_grid,joblib_verbose=5, \
                              n_jobs=-1)
svd_grid_final.fit(data)

In [None]:
# Print results from final gridsearch
svd_grid_final.best_params

In [None]:
# Use best params to get RMSE and MAE on test data
svd_final = SVD(lr_all=0.025, n_epochs=50, n_factors=150, reg_all=0.1, \
                random_state=27)
svd_final.fit(trainset)
predictions = svd_final.test(testset)
accuracy.rmse(predictions)
accuracy.mae(predictions)

In [None]:
# Save dictionary with average scores
svd_final_dict = cross_validate(svd_final, data, verbose= True, n_jobs=-1)
svd_final_dict = get_avg_metrics(svd_final_dict)

In [None]:
# Create df from row of mean results to append to cumulative df
row_to_df = pd.DataFrame(svd_final_dict, index=["svd_final"])
cumulative_results = pd.concat([cumulative_results, row_to_df])
cumulative_results.style.background_gradient(cmap="Blues_r")

Our final SVD model has the best RMSE and MAE to this point. Although it has a significantly longer fit time than some of the KNN models, we also see that the time it takes to get predictions is the shortest. Because in a practical use case we can fit the final SVD model ahead of time, before any predictions are requested, the longer fit time will not be a problem. Hence, we will move forward with our SVD model with the best RMSE and MAE scores and the following hyperparameters:
1. lr_all=0.025
2. n_epochs=50
3. n_factors=150
4. reg_all=0.1

Let's also fit the model on our whole dataset and pickle it so that we can easily get predictions from it later.

In [None]:
# Train SVD model using best hyperparameters on full dataset
svd_final = SVD(lr_all=0.025, n_epochs=50, n_factors=150, reg_all=0.1,
                random_state=27)
svd_final.fit(data.build_full_trainset())

In [None]:
# Pickle svd_final
# import pickle

# with open('svdfinal.pickle', 'wb') as f:
#     pickle.dump(svd_final, f)
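
For readers unfamiliar with pickling, the round trip looks like the sketch below. A plain dictionary stands in for the fitted model and the file goes to a temporary directory; both are placeholders rather than the notebook's actual objects or paths.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; any picklable Python object works the same way
toy_model = {'n_factors': 150, 'n_epochs': 50, 'lr_all': 0.025, 'reg_all': 0.1}

path = os.path.join(tempfile.mkdtemp(), 'toy_model.pickle')

# Write in binary mode ...
with open(path, 'wb') as f:
    pickle.dump(toy_model, f)

# ... and read back in binary mode
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == toy_model)
```

Surprise also ships a `surprise.dump` module with its own `dump` and `load` helpers, which may be worth considering as an alternative to calling `pickle` directly.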

## Evaluation

In this section, we will begin by evaluating our test scores and then move on to build some functions to assist the client in looking up product codes. Finally, we will build a recommender system that takes a list of preferred products and returns a list of items that the user would likely give a high rating to.

Let's compare our test scores from all of the models that we've fit to this point:

In [None]:
# Display all mean scores
cumulative_results.style.background_gradient(cmap="Blues_r")

Great! We can see that by using our gridsearches, we were able to improve the RMSE score between iterations. Our final SVD model also has a lower RMSE than even our best-performing KNN Baseline model, so we will build our recommender system using the SVD model with the best parameters found in our final gridsearch. Its MAE score is 0.9230, meaning that on average the model's predicted rating differs from the actual rating by about 0.92 stars.
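
To see how the two metrics differ, here is a small worked example on invented ratings. The numbers are made up and unrelated to our results.

```python
import numpy as np

actual = np.array([5.0, 4.0, 1.0, 5.0])      # made-up true ratings
predicted = np.array([4.5, 4.0, 2.0, 4.5])   # made-up model estimates

errors = actual - predicted

mae = np.abs(errors).mean()            # average size of the miss, in stars
rmse = np.sqrt((errors ** 2).mean())   # squares errors first, so big misses weigh more

print(f"MAE: {mae:.4f}, RMSE: {rmse:.4f}")
```

The one-star miss on the third rating is what pushes the RMSE above the MAE here, since squaring gives larger errors more weight.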

### Searching Product Codes

Here, we create a reduced catalog of product names with their corresponding product codes. We then build a function to search the name of a product to assist our user in looking up product codes to input into the recommender system.

In [None]:
# Set pandas options to increase max column width and row number
pd.options.display.max_colwidth = 100
pd.options.display.max_rows = 500
catalog_df

In [None]:
catalog_df['imageURLHighRes'][5000]

In [None]:
# Create lookup df to look up product codes and/or names
lookup_df = catalog_df.drop_duplicates('product_code')
lookup_df = lookup_df[['product_code', 'title', 'imageURLHighRes']]
lookup_df

In [None]:
# Create function to look up product codes
def product_search():
    """
    Prompts user to look up product name and returns product code.

    Args:
        
    Returns:
        search_results (DataFrame) : DataFrame including results of searched 
        product name
    """
    
    # Prompt user for item name
    query_product = input('Search a brand or product: ')
    
    # Prompt user for number of results desired
    num_results = int(input('Up to how many results would you like to see? '))
    
    # Slice lookup_df to return DataFrame with results containing query
    # (regex=False treats the query as plain text, not a regular expression)
    search_results = lookup_df[lookup_df['title'].str\
                            .contains(query_product, case=False, na=False,
                                      regex=False)]\
                            .head(num_results)
    
    return search_results

In [None]:
# Look up sample product codes
product_search()
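
Because `product_search` relies on `input()`, it is awkward to reuse in scripts or tests. A possible non-interactive variant is sketched below; `search_products` and `toy_lookup` are hypothetical names, and the catalog rows are invented.

```python
import pandas as pd

def search_products(lookup, query, num_results=5):
    """Return up to num_results rows whose title contains query (any case)."""
    mask = lookup['title'].str.contains(query, case=False, na=False,
                                        regex=False)
    return lookup[mask].head(num_results)

# Invented catalog rows for illustration
toy_lookup = pd.DataFrame({
    'product_code': [0, 1, 2, 3],
    'title': ['Rose Face Cream', 'Mint Lip Balm', None, 'Night FACE Serum'],
})

results = search_products(toy_lookup, 'face')
print(results)
```

Passing the query and result count as arguments keeps the search logic separate from the prompts, so the interactive function could simply collect the inputs and call this helper.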

### Building the Recommender System

In this section, we will take the hyperparameters from our best performing SVD model to build a usable recommender system. Upon running the function, the user will be prompted to enter a list of product codes of products that they gave high ratings to, and they will be given a list of products that our algorithm would recommend.

Let's load in our pickled final model and begin by creating a function that displays Amazon's existing customers' ratings as well as our recommendations for them.

In [None]:
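# Load in pickled final model (uncomment if svdfinal.pickle was saved above)
# import pickle
# with open('svdfinal.pickle', 'rb') as file:
#     model = pickle.load(file)

# Otherwise, reuse the final model already fit on the full dataset
model = svd_final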

In [None]:
# Create function to get recommendations for an existing user
def existing_user_ratings(model, user_no, num_res=5):
    """
    Displays the products an existing customer has already rated, followed
    by the top recommendations predicted for that customer by a
    pre-trained model.

    Args:
        model : Pre-trained model to pull predictions from.
        user_no (int) : Specific user to provide recommendations for.
        num_res (int) : Number of recommendations to display. Default value is
        5 recommendations.
        
    Returns:
        None. Results are shown with display().
    """

    # Create total list of predictions for this user
    list_of_predictions = []
    for item in df['product_code'].unique():
        list_of_predictions.append((item, model.predict(user_no, item).est))
    
    # Sort predictions from high to low
    ranked_predictions = sorted(list_of_predictions, key=lambda x:x[1], \
                                reverse=True)
    
    # Create dataframe from ranked predictions
    ranked_df = pd.DataFrame(ranked_predictions, columns=['product_code', \
                                                          'rating'])
    
    # Merge predictions with lookup df to get product names
    merged_df = ranked_df.merge(lookup_df, how='inner', on='product_code')
    
    # Get user's ratings
    user_rated = catalog_df[catalog_df['user']==user_no]
    display('Customer has rated the following products: ', user_rated)
    
    # Get list of user's products
    prod_list = user_rated['product_code'].tolist()
    
    # Remove products that user has already rated, then keep the requested
    # number of results (filtering first avoids returning fewer than asked)
    rec_list = merged_df[~merged_df['product_code'].isin(prod_list)]\
                        .head(num_res)
    
    display('Recommendations for customer: ', rec_list)

In [None]:
# Get recommendations for user 27000
existing_user_ratings(model, 27000)

In [None]:
# Get recommendations for user 42424
existing_user_ratings(model, 42424)
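
The ranking step inside `existing_user_ratings` can be isolated as a small, library-free sketch. Here `top_n_unrated` and `fake_estimates` are hypothetical names introduced for illustration, and the product codes and scores are made up rather than taken from our model. The sketch drops already-rated products before truncating to the top N, so the caller receives the requested number of results whenever enough candidates exist.

```python
def top_n_unrated(candidates, already_rated, predict_rating, n=5):
    """Return up to n (product_code, estimate) pairs, best first."""
    rated = set(already_rated)
    # Score only the products the customer has not rated yet
    scored = [(code, predict_rating(code)) for code in candidates
              if code not in rated]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Made-up estimates standing in for a trained model's predictions
fake_estimates = {101: 4.8, 102: 4.6, 103: 4.1, 104: 3.9, 105: 3.2}

recs = top_n_unrated(
    candidates=fake_estimates.keys(),
    already_rated=[101, 103],
    predict_rating=fake_estimates.get,
    n=2,
)
print(recs)  # [(102, 4.6), (104, 3.9)]
```

Filtering before truncating matters most for active customers, whose highest predicted ratings often belong to products they have already reviewed.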

Finally, let's create a recommender function that lets new users enter their own product ratings and receive product recommendations in return.

In [None]:
# Check last user number
df['user'].sort_values().tail()

In [None]:
# Create function to train model on full dataset and return recommendations
def user_ratings(lr_all=0.025, n_epochs=50, n_factors=150, reg_all=0.1,
                 random_state=27):
    """
    Prompts user to enter product codes and ratings for a new customer,
    trains SVD on the full dataset plus those ratings using the ideal
    hyperparameters, and displays however many recommendations the user
    requests.

    Args:
        lr_all : The learning rate for all parameters. Default is ``0.025``.
        n_epochs : The number of iterations of the SGD procedure. Default is 
            ``50``.
        n_factors : The number of factors. Default is ``150``.
        reg_all : The regularization term for all parameters. Default is 
            ``0.1``.
        random_state (int) : Determines the RNG that will be used for 
            initialization. If int, ``random_state`` will be used as a seed 
            for a new RNG. This is useful to get the same initialization over 
            multiple calls to ``fit()``.  If RandomState instance, this same 
            instance is used as RNG. If ``None``, the current RNG from numpy 
            is used.  Default is ``27``.
        
    Returns:
        None. The customer's ratings and recommendations are displayed.
    """
    
    # Prompt user for number of products they want to review
    num_ratings = int(input("How many products would you like to rate? "))
    product_ratings = []
    
    # Prompt user for product code and its rating
    for rating in range(0, num_ratings):
        ind_prod_rating = [int(x) for x in \
                       input('Enter product code followed by its rating out of 5 (separate by spaces): ')\
                       .split()]
        product_ratings.append({ind_prod_rating[0]:ind_prod_rating[1]})
    
    # Prompt user for desired number of product recommendations
    num_res = int(input('How many recommendations would you like to see? '))
    
    # Create list of ratings to add to dataset
    user_rating_list = []
    for d in product_ratings:
        for code, score in d.items():
            user_rating_list.append({'user': 600000, 'product_code': code,
                                     'rating': score})
    
    # Add new ratings to full dataset (DataFrame.append was removed in
    # pandas 2.0, so use pd.concat instead)
    new_ratings_df = pd.concat([df, pd.DataFrame(user_rating_list)],
                               ignore_index=True)
    
    # Format dataset for modeling
    reader = Reader(line_format='user item rating')
    new_data = Dataset.load_from_df(new_ratings_df, reader)
    
    # Train model on full dataset using preset hyperparameters
    svd_ = SVD(lr_all=lr_all, n_epochs=n_epochs, n_factors=n_factors, \
               reg_all=reg_all, random_state=random_state)
    svd_.fit(new_data.build_full_trainset())
    
    # Create total list of predictions for new user
    list_of_predictions = []
    for item in df['product_code'].unique():
        # Prediction.est holds the estimated rating
        list_of_predictions.append((item, svd_.predict(600000, item).est))
    
    # Sort predictions from high to low
    ranked_predictions = sorted(list_of_predictions, key=lambda x:x[1], \
                                reverse=True)
    
    # Create dataframe from ranked predictions
    ranked_df = pd.DataFrame(ranked_predictions, columns=['product_code', \
                                                          'rating'])
    
    # Merge predictions with lookup df to get product names
    merged_df = ranked_df.merge(lookup_df, how='inner', on='product_code')
    
    # Get user's ratings and display
    user_rated = new_ratings_df[new_ratings_df['user']==600000]
    user_rated_lookup = user_rated.merge(lookup_df, how='inner', on='product_code')
    display('Customer has rated the following products: ', user_rated_lookup)
    
    # Get list of user's products
    prod_list = user_rated['product_code'].tolist()
    
    # Remove products that user has already rated, then keep the requested
    # number of results (filtering first avoids returning fewer than asked)
    rec_list = merged_df[~merged_df['product_code'].isin(prod_list)]\
                        .head(num_res)
        
    # Display recommendations
    display('Recommendations for customer: ', rec_list)

In [None]:
# Test function
user_ratings()
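
Because `user_ratings` converts typed input straight to `int`, a stray character or a missing rating raises an error partway through the prompts. One way to guard against that is to validate each line before using it. The helper below, `parse_rating_line`, is a hypothetical name introduced for illustration; it assumes integer product codes and whole-star ratings from 1 to 5, matching the prompt above.

```python
def parse_rating_line(line, min_rating=1, max_rating=5):
    """Parse 'product_code rating' into a tuple of ints, or raise ValueError."""
    parts = line.split()
    if len(parts) != 2:
        raise ValueError("Expected a product code and a rating separated by a space")
    try:
        code, rating = int(parts[0]), int(parts[1])
    except ValueError:
        raise ValueError("Product code and rating must both be whole numbers") from None
    if not min_rating <= rating <= max_rating:
        raise ValueError(f"Rating must be between {min_rating} and {max_rating}")
    return code, rating


print(parse_rating_line("1234 5"))  # (1234, 5)

# Malformed lines are rejected with a readable message instead of a traceback
for bad_line in ["1234", "1234 9", "abc 4"]:
    try:
        parse_rating_line(bad_line)
    except ValueError as err:
        print(f"Rejected {bad_line!r}: {err}")
```

In the notebook, a helper like this could wrap each `input()` call in a loop that re-prompts until the line parses cleanly.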

And there we have our product recommendations! Now, let's look at the most popular products by selecting the 10 products with the highest number of ratings.

In [None]:
# View top 10 products with most reviews
top_series = catalog_df['product_code'].value_counts().head(10)
top_df = pd.DataFrame(top_series)
top_df

In [None]:
# Create list of top 10 products with most reviews
top_list = catalog_df['product_code'].value_counts().index[:10].tolist()
top_list

In [None]:
# Merge top_df with lookup_df
new_df = top_df.merge(lookup_df, how='left', left_index=True, \
                      right_on='product_code')
new_df = new_df.groupby('title').agg({'product_code_x':'sum'})\
                                .sort_values(by='product_code_x', \
                                             ascending=False)
new_df

In [None]:
# Limit title length to 45 characters
new_df.index = new_df.index.str[:45]
new_df = new_df.reset_index()
new_df

In [None]:
# Create bar plot of most popular products
fig, ax = plt.subplots(figsize=(10,7))

g = sns.barplot(data=new_df, x='title', y='product_code_x', palette='cool', \
                ci=None)

ax.set_title('Top Products by Number of Ratings')
ax.set_xlabel('Product')
ax.set_ylabel('Number of Ratings')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')

# Annotate each bar with its number of ratings
for p in ax.patches:
    ax.annotate("%.0f" % p.get_height(),
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', fontsize=13, color='black',
                xytext=(0, 5), textcoords='offset points')

Assuming that our client already carries these products, which are popular on Amazon, let's enter them with high ratings and see what other products the recommender suggests.

In [None]:
# Get final recommendations
user_ratings()

## Conclusions

And there we have our final product recommendations! Singular Value Decomposition had the best performance with respect to RMSE, and a series of grid searches let us determine the optimal hyperparameters to reduce the RMSE score further.
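
For intuition about what the tuned hyperparameters control, here is a toy, pure-Python version of the biased matrix factorization that SVD-style recommenders fit with stochastic gradient descent. It is a simplified sketch on made-up ratings, not Surprise's implementation: the function name `fit_toy_svd`, the tiny dataset, and the initialization are all assumptions. `n_factors` sets the length of each latent vector, `n_epochs` the number of passes over the ratings, `lr_all` the step size, and `reg_all` the penalty that shrinks parameters toward zero.

```python
import math
import random


def fit_toy_svd(ratings, n_factors=150, n_epochs=50, lr_all=0.025,
                reg_all=0.1, random_state=27):
    """Fit biases and latent factors with SGD; return (predict, global_mean)."""
    rng = random.Random(random_state)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating

    users = {u for u, _, _ in ratings}
    items = {i for _, i, _ in ratings}
    b_u = {u: 0.0 for u in users}
    b_i = {i: 0.0 for i in items}
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        # Unknown users/items fall back to the global mean plus known biases
        est = mu + b_u.get(u, 0.0) + b_i.get(i, 0.0)
        if u in p and i in q:
            est += sum(pf * qf for pf, qf in zip(p[u], q[i]))
        return est

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Step each parameter along the error, shrunk by regularization
            b_u[u] += lr_all * (err - reg_all * b_u[u])
            b_i[i] += lr_all * (err - reg_all * b_i[i])
            for f in range(n_factors):
                p_uf, q_if = p[u][f], q[i][f]
                p[u][f] += lr_all * (err * q_if - reg_all * p_uf)
                q[i][f] += lr_all * (err * p_uf - reg_all * q_if)

    return predict, mu


def rmse(ratings, predict):
    """Root mean squared error of predict over (user, item, rating) triples."""
    return math.sqrt(sum((r - predict(u, i)) ** 2 for u, i, r in ratings)
                     / len(ratings))


# Made-up (user, product_code, rating) triples, for illustration only
toy_ratings = [(1, 'A', 5), (1, 'B', 4), (2, 'A', 5), (2, 'C', 2),
               (3, 'B', 1), (3, 'C', 1), (3, 'A', 3)]

predict, mu = fit_toy_svd(toy_ratings)
baseline_rmse = rmse(toy_ratings, lambda u, i: mu)
trained_rmse = rmse(toy_ratings, predict)
print(f"Baseline RMSE (global mean): {baseline_rmse:.3f}")
print(f"Training RMSE after SGD:     {trained_rmse:.3f}")
```

Note that this compares training error only; the grid searches in this project scored each hyperparameter combination on held-out data, which is the fairer test.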

To interpret our error, we looked at the MAE score, which was 0.9237 on our final best model. This means that our model's predicted ratings are off by an average of 0.9237 stars from the actual ratings.
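
As a worked illustration of how MAE reads in stars, the sketch below computes MAE and RMSE by hand on four made-up ratings (these are not values from our dataset). RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE, which is why the two scores differ for the same predictions.

```python
import math

# Made-up actual and predicted ratings, for illustration only
actual = [5, 4, 3, 5]
predicted = [4.0, 4.5, 4.0, 3.0]

errors = [p - a for a, p in zip(actual, predicted)]

# MAE: average absolute error, in the same units as the ratings (stars)
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: square root of the average squared error
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

print(f"MAE:  {mae:.3f}")   # MAE:  1.125
print(f"RMSE: {rmse:.3f}")  # RMSE: 1.250
```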

Finally, we built out functions to help us look up product codes to put into a recommender system which would then provide us with however many product recommendations the user desires.

The value of this project lies in using Amazon's large volume of ratings data to identify products that a smaller retailer might want to add to its inventory. The only additional data that we would need from the retailer is which of its current products its customers would rate highly. We can then place those preferences in the context of Amazon's ratings to determine what other products those customers would be likely to rate highly.

A limitation of this analysis is that the dataset only contains beauty products under the "Luxury Beauty" category, which is a collection of approved brands. Amazon also has a category labeled "All Beauty", which we omitted from this analysis because the combined datasets were too large for our hardware to handle.

To summarize, here are the final recommendations for our client:

1. In order to build a similar recommender system, SVD would be the best algorithm to use, with the following hyperparameters: lr_all=0.025, n_epochs=50, n_factors=150, reg_all=0.1
2. Client should carry the following products based on popularity on Amazon:
* TOPPIK Hair Building Fibers
* HOT TOOLS Professional 24k Gold Extra-Long Barrel Curling Iron/Wand
* Mario Badescu Facial Spray with Aloe, Herbs and Rosewater
* OPI Nail Lacquer, Cajun Shrimp
* OPI Nail Lacquer, Not So Bora-Bora-ing Pink
* BaBylissPRO Ceramix Xtreme Dryer
* OPI Nail Envy Nail Strengthener
* Proraso Shaving Soap in a Bowl, Refreshing and Toning

3. Assuming that our client's current customers would give high ratings to those products, our client should also consider carrying the following products:
* Crabtree & Evelyn - Gardener's Ultra-Moisturising Hand Therapy Pump
* Crabtree & Evelyn Hand Soap, Gardeners
* Soy Milk Hand Crème
* Paul Mitchell Shampoo One
* Glytone Rejuvenating Mask
* PCA SKIN Protecting Hydrator Broad Spectrum SPF 30
* jane iredale Amazing Base Loose Mineral Powder
* jane iredale So-Bronze, Bronzing Powder
* YU-Be: Japan’s secret for dry skin relief. Deep hydrating moisturizing cream for face, hand and body
* Calvin Klein ETERNITY Eau de Parfum



ALS has proven to be an effective algorithm in recommender systems, so it was surprising to see it perform so poorly on the data used in this analysis. Moving forward, it might be worthwhile to investigate how the model performs if we combine data from the "All Beauty" category with the data used in this analysis.