## **Mining Massive Datasets Final Project**

**Group Member 1:** Mateo Castro (mc7212) **Group Member 2:** Badrinath Gujjula (bg2310) **Group Member 3:** Yuqi Liu (yl8406)

In [1]:
import os
import json
import gzip
import pandas as pd
import numpy as np
from urllib.request import urlopen

## Dataset Description
The dataset used in this project is the Magazine Subscriptions subset of the Amazon review data, downloaded from http://deepyeti.ucsd.edu/jianmo/amazon/. It consists of a ratings set and a metadata set. The ratings set has 89689 rows and three columns: Product_ID, Customer_ID, and Rating; each row corresponds to one rating record. The metadata set has 19 columns and 2320 unique rows. Each row describes one item, and each column corresponds to one feature. The metadata features we mainly use are description, category, and brand.

## Download Ratings Dataset

In [2]:
!wget http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Magazine_Subscriptions.csv

--2022-05-09 17:21:30--  http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Magazine_Subscriptions.csv
Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50
Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3743304 (3.6M) [application/octet-stream]
Saving to: ‘Magazine_Subscriptions.csv’


2022-05-09 17:21:32 (2.95 MB/s) - ‘Magazine_Subscriptions.csv’ saved [3743304/3743304]



In [3]:
### Load the csv
df = pd.read_csv('Magazine_Subscriptions.csv', usecols=[i for i in range(3)], names=['Product_ID', 'Customer_ID', 'Rating'])

# Display first 5 lines of the DataFrame
df.head()

Unnamed: 0,Product_ID,Customer_ID,Rating
0,B00005N7P0,AH2IFH762VY5U,5.0
1,B00005N7P0,AOSFI0JEYU4XM,5.0
2,B00005N7OJ,A3JPFWKS83R49V,3.0
3,B00005N7OJ,A19FKU6JZQ2ECJ,5.0
4,B00005N7P0,A25MDGOMZ2GALN,5.0


## Download Meta data for products

In [4]:
!wget http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles2/meta_Magazine_Subscriptions.json.gz

--2022-05-09 17:21:32--  http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles2/meta_Magazine_Subscriptions.json.gz
Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50
Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1276947 (1.2M) [application/octet-stream]
Saving to: ‘meta_Magazine_Subscriptions.json.gz’


2022-05-09 17:21:34 (1.33 MB/s) - ‘meta_Magazine_Subscriptions.json.gz’ saved [1276947/1276947]



In [5]:
### load the meta data

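data = []
with gzip.open('meta_Magazine_Subscriptions.json.gz') as f:
    try:
        # the file holds one JSON object per line
        for l in f:
            data.append(json.loads(l.strip()))
    except (json.JSONDecodeError, EOFError, OSError) as e:
        # stop at the first unreadable line and keep what was parsed so far
        print("Done with error:", e)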
    
# total length of list, this number equals total number of products
print(len(data))

# first row of the list
print(data[0])

3385
{'category': ['Magazine Subscriptions', 'Professional & Educational Journals', 'Professional & Trade', 'Humanities & Social Sciences', 'Economics & Economic Theory'], 'tech1': '', 'description': ['REASON is edited for people interested in economic, social, and international issues. Viewpoint stresses individual liberty, private responsibility, and limited government. Some emphasis on Pacific Rim, local/state issues with national impact, science/technology. Regular departments include news/trends, book reviews (mostly history, politics, and economics), and cultural commentary.'], 'fit': '', 'title': '<span class="a-size-medium a-color-secondary"', 'also_buy': ['B002PXVYLE', 'B01MCU84LB', 'B000UHI2LW', 'B01AKS14AQ', 'B002PXVYPU', 'B001THPA26', '1476752842', 'B002CT515Q', 'B00KQ0HP2K', 'B00XZF1JUM', 'B0058EONOM', 'B01FV51RKA', 'B0032KHQTS', 'B002GCU2S0', 'B079JCLNZ4', 'B002PXW18E', 'B00005NIOH', 'B00005N7SD', 'B002LDA9VY', '0345816021', 'B00006KX3K', 'B002GCU2SA', 'B0047VIALE', 'B000

In [6]:
# convert list into pandas dataframe

df_meta = pd.DataFrame.from_dict(data)

print(len(df_meta))

3385


In [7]:
df_meta.head()

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,details,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes
0,"[Magazine Subscriptions, Professional & Educat...",,[REASON is edited for people interested in eco...,,"<span class=""a-size-medium a-color-secondary""","[B002PXVYLE, B01MCU84LB, B000UHI2LW, B01AKS14A...",,Reason Magazine,[],[],"[B002PXVYLE, B000UHI2LW, B01MCU84LB, B002PXW18...","{'Format:': 'Print Magazine', 'Shipping: ': 'C...",Magazine Subscriptions,,,,B00005N7NQ,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
1,"[Magazine Subscriptions, Arts, Music &amp; Pho...",,[Written by and for musicians. Covers a variet...,,"<span class=""a-size-medium a-color-secondary""","[B002PXVYGE, B0054LRNC8, B000BVEELE, B00006KC3...",,String Letter Publishers,[],742 in Magazine Subscriptions (,"[B002PXVYGE, B0054LRNC8, B00006L16A, 171906487...","{'Format:': 'Print Magazine', 'Shipping: ': 'C...",Magazine Subscriptions,,,,B00005N7OC,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
2,"[Magazine Subscriptions, Fashion &amp; Style, ...",,[Allure is the beauty expert. Every issue is f...,,"<span class=""a-size-medium a-color-secondary""","[B001THPA4O, B002PXVZWW, B001THPA1M, B001THPA1...",,Conde Nast Publications,[],[],"[B002PXVZWW, B001THPA4O, B001THPA1M, B01N819UD...","{'Format:': 'Print Magazine', 'Shipping: ': 'C...",Magazine Subscriptions,,,,B00005N7OD,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
3,"[Magazine Subscriptions, Sports, Recreation & ...",,[FLIGHT JOURNAL includes articles on aviation ...,,"<span class=""a-size-medium a-color-secondary""","[B07JVF7QW4, B00ATQ6FPY, B002G551F6, B00008CGW...",,AirAge Publishing,[],[],"[B002G551F6, B00ATQ6FPY, B00005N7PT, B001THPA2...","{'Format:': 'Print Magazine', 'Shipping: ': 'C...",Magazine Subscriptions,,,,B00005N7O9,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
4,"[Magazine Subscriptions, Professional & Educat...",,[RIDER is published for the road and street ri...,,"<span class=""a-size-medium a-color-secondary""","[B002PXVYD2, B01BM7TOU6, B000060MKJ, B000BNNIG...",,EPG Media & Specialty Information,[],[],"[B01BM7TOU6, B000060MKJ, B002PXVYD2, B000BNNIG...","{'Format:': 'Print Magazine', 'Shipping: ': 'C...",Magazine Subscriptions,,,,B00005N7O6,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...


In [8]:
# drop duplicated items and reset the index so that row labels match row positions
df_meta = df_meta.drop_duplicates(subset='asin').reset_index(drop=True)

In [9]:
df_meta.shape

(2320, 19)

`asin` is the product ID; it corresponds to `Product_ID` in the ratings set.

## Collaborative Filtering

We use collaborative filtering for old (existing) users who have just purchased an item, recommending items similar to the purchase. Using the Surprise library, we built an item-item KNN Baseline model and use it to retrieve the nearest neighbors of the purchased item.
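To make the idea behind a KNN-with-baseline estimate concrete, here is a minimal, dependency-free sketch. It is a simplification and not Surprise's implementation: the function name `knn_baseline_estimate` and the toy numbers are assumptions for illustration. The estimate starts from a baseline rating and adds the similarity-weighted average of how far the user's ratings of neighboring items deviate from their own baselines.

```python
def knn_baseline_estimate(baseline_ui, neighbors):
    """Simplified item-item KNN-with-baseline estimate (illustrative only).

    baseline_ui: baseline rating for the target (user, item) pair.
    neighbors:   list of (similarity, rating, baseline) tuples, one per
                 neighboring item that the user has already rated.
    """
    weighted_dev = sum(sim * (r_uj - b_uj) for sim, r_uj, b_uj in neighbors)
    total_sim = sum(sim for sim, _, _ in neighbors)
    if total_sim == 0:
        # No usable neighbors: fall back to the baseline alone
        return baseline_ui
    return baseline_ui + weighted_dev / total_sim


# Hypothetical values: two neighboring items rated by the same user
toy_neighbors = [(0.8, 5.0, 4.5), (0.2, 3.0, 3.5)]
estimate = knn_baseline_estimate(4.0, toy_neighbors)
print(round(estimate, 2))  # 4.3
```

The more similar neighbor (similarity 0.8) was rated above its baseline, so it pulls the estimate above the baseline of 4.0.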

Installing Surprise library

In [10]:
# install surprise to build recommender in python
!pip install scikit-surprise

Collecting scikit-surprise
  Downloading scikit-surprise-1.1.1.tar.gz (11.8 MB)
     |████████████████████████████████| 11.8 MB 9.9 MB/s 
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp37-cp37m-linux_x86_64.whl size=1630197 sha256=faea5f24a6f4845df2e1ece78d90bb7ae0d3b767c18e4f6ffbb109156d4577e4
  Stored in directory: /root/.cache/pip/wheels/76/44/74/b498c42be47b2406bd27994e16c5188e337c657025ab400c1c
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.1


In [11]:
from surprise import KNNBaseline
from surprise import Reader
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV

Converting the Pandas dataframe to a Dataset object for use with Surprise

In [12]:
reader = Reader(rating_scale=(1, 5))

data = Dataset.load_from_df(df[['Customer_ID', 'Product_ID', 'Rating']], reader)

Finding the optimal parameters with GridSearchCV and 5-fold cross-validation. The number of neighbors `k` is searched over 10, 50, 100, 150, and 200.

Baseline Options:

*   Method: ALS (Alternating Least Squares)
*   n_epochs: 5, 10
*   reg_u: 10
*   reg_i: 5

Sim Options:

*   Name: Cosine, Pearson_Baseline
*   min_support: 1, 2, 3
*   user_based: False
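As a rough check on the size of this search, the sketch below enumerates the parameter combinations with `itertools.product`. The helper `expand` is a hypothetical name, and this is only an illustration of how such a grid multiplies out, not Surprise's internal expansion code.

```python
from itertools import product

# The same option lists as in the grid search described above
bsl_grid = {'method': ['als'], 'n_epochs': [5, 10], 'reg_u': [10], 'reg_i': [5]}
sim_grid = {'name': ['cosine', 'pearson_baseline'],
            'min_support': [1, 2, 3],
            'user_based': [False]}
ks = [10, 50, 100, 150, 200]


def expand(grid):
    """Return every dict obtained by picking one value per key."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[key] for key in keys))]


combinations = [
    {'bsl_options': bsl, 'k': k, 'sim_options': sim}
    for bsl, k, sim in product(expand(bsl_grid), ks, expand(sim_grid))
]

print(len(combinations))      # 2 baseline settings * 5 values of k * 6 similarity settings = 60
print(len(combinations) * 5)  # 300 model fits with 5-fold cross-validation
```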

In [13]:
ks = [10, 50, 100, 150, 200]

param_grid = {'bsl_options': {'method': ['als'], 'n_epochs': [5, 10], 
                              'reg_u': [10], 'reg_i': [5]},
              'k': ks,
              'sim_options': {'name': ['cosine', 'pearson_baseline'],
                              'min_support': [1, 2, 3],
                              'user_based': [False]}
              }
gs = GridSearchCV(KNNBaseline, param_grid, measures=['rmse'], cv=5)

gs.fit(data)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.

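Best results for the grid search, using RMSE as the score

RMSE: 1.3118879603008418 (the value differs slightly between runs because the cross-validation folds are shuffled randomly; the cell output below shows the run captured here)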

In [14]:
# Best RMSE score
print("Best RMSE:", gs.best_score['rmse'])

# Combination of parameters that gave the best RMSE score
print("Best Parameters:", gs.best_params['rmse'])

Best RMSE: 1.3102081329536612
Best Parameters: {'bsl_options': {'method': 'als', 'n_epochs': 10, 'reg_u': 10, 'reg_i': 5}, 'k': 50, 'sim_options': {'name': 'pearson_baseline', 'min_support': 3, 'user_based': False}}


All cross-validation results: each parameter combination tested, with its RMSE on every split

In [15]:
results_df = pd.DataFrame.from_dict(gs.cv_results)
results_df

Unnamed: 0,split0_test_rmse,split1_test_rmse,split2_test_rmse,split3_test_rmse,split4_test_rmse,mean_test_rmse,std_test_rmse,rank_test_rmse,mean_fit_time,std_fit_time,mean_test_time,std_test_time,params,param_bsl_options,param_k,param_sim_options
0,1.31577,1.318169,1.298835,1.314931,1.305783,1.310697,0.007271,32,0.484513,0.113998,0.302482,0.063057,"{'bsl_options': {'method': 'als', 'n_epochs': ...","{'method': 'als', 'n_epochs': 5, 'reg_u': 10, ...",10,"{'name': 'cosine', 'min_support': 1, 'user_bas..."
1,1.315171,1.318763,1.301441,1.313346,1.307193,1.311183,0.006146,52,0.55546,0.08324,0.400185,0.123853,"{'bsl_options': {'method': 'als', 'n_epochs': ...","{'method': 'als', 'n_epochs': 5, 'reg_u': 10, ...",10,"{'name': 'cosine', 'min_support': 2, 'user_bas..."
2,1.314785,1.318892,1.301443,1.311885,1.307339,1.310869,0.006034,42,0.588091,0.12771,0.338071,0.078124,"{'bsl_options': {'method': 'als', 'n_epochs': ...","{'method': 'als', 'n_epochs': 5, 'reg_u': 10, ...",10,"{'name': 'cosine', 'min_support': 3, 'user_bas..."
3,1.31431,1.318066,1.300734,1.31242,1.306709,1.310448,0.006086,29,0.421109,0.190261,0.305316,0.226812,"{'bsl_options': {'method': 'als', 'n_epochs': ...","{'method': 'als', 'n_epochs': 5, 'reg_u': 10, ...",10,"{'name': 'pearson_baseline', 'min_support': 1,..."
4,1.31431,1.318066,1.300734,1.31242,1.306709,1.310448,0.006086,30,0.28048,0.003982,0.189845,0.076301,"{'bsl_options': {'method': 'als', 'n_epochs': ...","{'method': 'als', 'n_epochs': 5, 'reg_u': 10, ...",10,"{'name': 'pearson_baseline', 'min_support': 2,..."
5,1.315068,1.317228,1.300147,1.312189,1.306411,1.310208,0.006204,10,0.276633,0.007051,0.149002,0.002626,"{'bsl_options': {'method': 'als', 'n_epochs': ...","{'method': 'als', 'n_epochs': 5, 'reg_u': 10, ...",10,"{'name': 'pearson_baseline', 'min_support': 3,..."
6,1.315795,1.318178,1.298854,1.314898,1.305817,1.310709,0.007262,38,0.273446,0.012167,0.188425,0.076734,"{'bsl_options': {'method': 'als', 'n_epochs': ...","{'method': 'als', 'n_epochs': 5, 'reg_u': 10, ...",50,"{'name': 'cosine', 'min_support': 1, 'user_bas..."
7,1.31521,1.318767,1.301456,1.313337,1.307204,1.311195,0.006145,59,0.271189,0.0047,0.187848,0.075788,"{'bsl_options': {'method': 'als', 'n_epochs': ...","{'method': 'als', 'n_epochs': 5, 'reg_u': 10, ...",50,"{'name': 'cosine', 'min_support': 2, 'user_bas..."
8,1.31481,1.318893,1.301444,1.311878,1.307336,1.310872,0.006037,49,0.270293,0.003593,0.150404,0.00276,"{'bsl_options': {'method': 'als', 'n_epochs': ...","{'method': 'als', 'n_epochs': 5, 'reg_u': 10, ...",50,"{'name': 'cosine', 'min_support': 3, 'user_bas..."
9,1.31431,1.318066,1.300734,1.31242,1.306709,1.310448,0.006086,27,0.273826,0.006956,0.188074,0.076532,"{'bsl_options': {'method': 'als', 'n_epochs': ...","{'method': 'als', 'n_epochs': 5, 'reg_u': 10, ...",50,"{'name': 'pearson_baseline', 'min_support': 1,..."


Creating and training an item-item KNN Baseline model with the best parameters

In [16]:
# Using Ideal Params to build the model
model = gs.best_estimator['rmse']

# Full ratings used to build trainset
trainset = data.build_full_trainset()

model.fit(trainset)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBaseline at 0x7f883796a710>

Get N nearest neighbors for an item

This will give similar items to those purchased using the ID of the item and the number of similar items to return

In [17]:
def collaborative_recommendation(item_purchased, n=5):
  # Convert raw id (asin) to inner id
  inner_id = model.trainset.to_inner_iid(item_purchased)

  # Obtain inner ids of the n nearest neighbors of item_purchased
  neighbors = model.get_neighbors(inner_id, k=n)

  # Convert the inner ids of the nearest neighbors back to raw ids (asin)
  return [model.trainset.to_raw_iid(neighbor_id) for neighbor_id in neighbors]
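The neighbor lookup used above can be pictured as sorting one row of an item-item similarity matrix. The sketch below is a simplified stand-in, not Surprise's code: `nearest_neighbors` and the toy matrix are assumptions made for illustration.

```python
def nearest_neighbors(sim_matrix, item_index, k):
    """Return the indices of the k items most similar to item_index."""
    # Pair every other item with its similarity to the query item
    others = [(j, sim) for j, sim in enumerate(sim_matrix[item_index])
              if j != item_index]
    # Highest similarity first
    others.sort(key=lambda pair: pair[1], reverse=True)
    return [j for j, _ in others[:k]]


# Hypothetical symmetric similarity matrix for four items
toy_sim = [
    [1.0, 0.9, 0.1, 0.4],
    [0.9, 1.0, 0.3, 0.2],
    [0.1, 0.3, 1.0, 0.7],
    [0.4, 0.2, 0.7, 1.0],
]

print(nearest_neighbors(toy_sim, 0, 2))  # [1, 3]
```

The query item itself is excluded, so the result contains only other items, which is what a "similar items" list needs.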

Example for item B00005N7O6 with top 5 similar items

In [18]:
item_id = "B00005N7O6"
n = 5
top_n = collaborative_recommendation(item_id, n)

print("Top", n, "Similar Items to", item_id)
for item in top_n:
  print(item)

Top 5 Similar Items to B00005N7O6
B00007AWME
B000UEI4JU
B00005N7P0
B00005N7OJ
B00005N7Q1


## Content-based

We modified the usual content-based recommendation system by overlapping the results with "also buy" and "also view", so that the recommendations are both similar to the product the user is searching for, and liked by other users.
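The overlap step can be sketched without any libraries. The function name `keep_co_purchased` and the item IDs below are hypothetical; the sketch keeps only those content-based recommendations that also appear in the item's "also buy" or "also view" lists, preserving the similarity ranking.

```python
def keep_co_purchased(recommended, also_buy, also_view):
    """Filter a ranked recommendation list down to items other users bought or viewed."""
    liked = set(also_buy) | set(also_view)
    # Iterate over the ranked list so the original order is preserved
    return [asin for asin in recommended if asin in liked]


# Hypothetical ranked recommendations and co-occurrence lists
ranked = ['A1', 'A2', 'A3', 'A4']
filtered = keep_co_purchased(ranked, also_buy=['A3', 'A9'], also_view=['A1'])
print(filtered)  # ['A1', 'A3']
```

One design consequence worth noting: the filtered list can be shorter than the requested number of recommendations, or even empty, when few similar items appear in those lists.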

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.metrics import jaccard_score

In [20]:
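# Use the first description entry and the first (top-level) category of each item
description = df_meta['description'].str[0]
category = df_meta['category'].str[0]
# Concatenate description, category, and brand into one text field per item
# (items without a description become NaN here and are replaced by '' before vectorizing)
words = description + category + df_meta['brand'].astype(str)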

Compute TF-IDF score

In [21]:
tfidf = TfidfVectorizer(stop_words='english')
words = words.fillna('')
tfidf_matrix = tfidf.fit_transform(words)

Compute cosine similarity score

In [22]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
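`linear_kernel` computes plain dot products. That equals cosine similarity here because `TfidfVectorizer` L2-normalizes each row by default. The dependency-free sketch below checks this equivalence on two toy vectors; the helper names are assumptions for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm else list(vec)


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 0.0, 4.0]
b = [1.0, 2.0, 2.0]

cosine_ab = cosine(a, b)
dot_of_normalized = dot(l2_normalize(a), l2_normalize(b))

# Both values equal 11/15 up to floating-point rounding
print(cosine_ab, dot_of_normalized)
```

Skipping the explicit normalization inside the similarity computation saves work when the rows are already unit length.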

In [23]:
# Map each asin to its row position in df_meta, which is also its row in cosine_sim
indices = pd.Series(np.arange(len(df_meta)), index=df_meta['asin'])

In [24]:
def content_based_recomendation(asin, n):
    # Row of the queried item in the similarity matrix
    idx = indices[asin]
    # Rank all items by similarity to the queried item; skip the first hit (the item itself)
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:n+1]
    item_indices = [i[0] for i in sim_scores]

    rec = df_meta['asin'].iloc[item_indices]
    # Optional overlap with "also buy" / "also view" (disabled here):
    # also_buy = np.intersect1d(df_meta[df_meta['asin'] == asin]['also_buy'].iloc[0], list(rec))
    # also_view = np.intersect1d(df_meta[df_meta['asin'] == asin]['also_view'].iloc[0], list(rec))
    # return np.union1d(also_buy, also_view)
    return rec.tolist()

In [25]:
content_based_recomendation('B00005N7O6', 20)

['B00007AWME',
 'B00YQH9874',
 'B015PRLQSW',
 'B000ILZ6YQ',
 'B000H4W7VO',
 'B000SKWFL4',
 'B00006AMTF',
 'B0098QNZ22',
 'B00005N7O4',
 'B00007AXX3',
 'B000UEI4JU',
 'B000H4W7YG',
 'B00006KBW2',
 'B000089G4T',
 'B00007AVYH',
 'B00007B1AI',
 'B00FW9D1ZU',
 'B000HEVUJE',
 'B00TGFD3M2',
 'B00VVGRBI6']

## Latent Factor

In [26]:
raw = pd.read_csv('Magazine_Subscriptions.csv', usecols=[i for i in range(3)], names=['product_id', 'customer_id', 'Rating'])
raw.drop_duplicates(inplace=True)
# Surprise expects the columns in (user, item, rating) order
raw = raw[['customer_id', 'product_id', 'Rating']]


In [27]:
import surprise  # Surprise: a Python scikit for building recommender systems
from surprise.model_selection import train_test_split

In [28]:
# when importing from a DF, you only need to specify the scale of the ratings.
reader = surprise.Reader(rating_scale=(1,5)) 
#into surprise:
data = surprise.Dataset.load_from_df(raw,reader)

In [29]:
trainset, testset = train_test_split(data)

We inherit from Surprise's `AlgoBase` and override `fit` and `estimate` to implement a latent factor model. We also define a method `get_recommendation`, which takes a user name and returns the top N recommendations according to the latent factor model. It ranks all items, including those the user has never rated.

In [30]:
class MatrixFactorization(surprise.AlgoBase):
    # Randomly initializes two matrices and uses stochastic gradient descent (SGD) to optimize the factorization of the ratings.
    def __init__(self, learning_rate, num_epochs, num_factors):
        # Initialize the Surprise base class
        surprise.AlgoBase.__init__(self)
        # Learning rate for stochastic gradient descent
        self.alpha = learning_rate
        self.num_epochs = num_epochs
        self.num_factors = num_factors  # Number of latent factors

    # Takes a raw user id, converts it to an inner uid, scores every item in the trainset,
    # and returns the raw ids (asin) of the n items with the highest estimated rating
    def get_recommendation(self, u, n=5):
        u_t = self.trainset.to_inner_uid(u)
        scored = [(self.trainset.to_raw_iid(i), self.estimate(u_t, i))
                  for i in self.trainset.all_items()]
        scored.sort(key=lambda x: x[1], reverse=True)
        return [raw_iid for raw_iid, _ in scored[:n]]

    def fit(self, train):
        # Randomly initialize user and item factors from a Gaussian
        P = np.random.normal(0, .1, (train.n_users, self.num_factors))
        Q = np.random.normal(0, .1, (train.n_items, self.num_factors))

        for epoch in range(self.num_epochs):
            for u, i, r_ui in train.all_ratings():
                residual = r_ui - np.dot(P[u], Q[i])
                # Update both factors from the same pre-update values, so keep
                # a copy of P[u] (a plain slice is a view that changes with P)
                temp = P[u, :].copy()
                P[u, :] += self.alpha * residual * Q[i]
                Q[i, :] += self.alpha * residual * temp
        self.P = P
        self.Q = Q
        self.trainset = train  # Needed by estimate and get_recommendation
        return self

    def estimate(self, u, i):
        # Returns the estimated rating for inner user id u and inner item id i.
        # Falls back to the global mean if u or i is not in the trainset,
        # or if the factors diverged to NaN during training.
        if self.trainset.knows_user(u) and self.trainset.knows_item(i):
            prediction = np.dot(self.P[u], self.Q[i])
            if not np.isnan(prediction):
                return prediction
        # We do not expect unknown users or items here, because that
        # scenario is handled by the content-based recommender.
        return self.trainset.global_mean
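The SGD update in `fit` can be tried on a tiny example without NumPy or Surprise. This is a minimal sketch under stated assumptions: the function names, the toy ratings, and the fixed seed are all hypothetical, and like the class above it uses no regularization. For each observed rating, both factor vectors move along the residual, and both updates use the pre-update user factors.

```python
import math
import random


def predict(P, Q, u, i):
    return sum(p * q for p, q in zip(P[u], Q[i]))


def rmse(ratings, P, Q):
    squared = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(squared / len(ratings))


def sgd_factorize(ratings, n_users, n_items, num_factors, alpha, num_epochs, seed=0):
    rng = random.Random(seed)
    # Small Gaussian initialization, as in the class above
    P = [[rng.gauss(0, 0.1) for _ in range(num_factors)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(num_factors)] for _ in range(n_items)]
    for _ in range(num_epochs):
        for u, i, r_ui in ratings:
            residual = r_ui - predict(P, Q, u, i)
            old_p = list(P[u])  # copy, so the item update sees the old user factors
            P[u] = [p + alpha * residual * q for p, q in zip(old_p, Q[i])]
            Q[i] = [q + alpha * residual * p for q, p in zip(Q[i], old_p)]
    return P, Q


# Hypothetical (user, item, rating) triples for 3 users and 3 items
toy_ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
               (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]

P0, Q0 = sgd_factorize(toy_ratings, 3, 3, num_factors=2, alpha=0.05, num_epochs=0)
P, Q = sgd_factorize(toy_ratings, 3, 3, num_factors=2, alpha=0.05, num_epochs=100)

rmse_before = rmse(toy_ratings, P0, Q0)
rmse_after = rmse(toy_ratings, P, Q)
print("RMSE before training:", round(rmse_before, 3))
print("RMSE after training: ", round(rmse_after, 3))
```

This only measures training error on the toy ratings, so it says nothing about generalization; that is what the held-out test RMSE in the following cells is for.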
    


The lines below compute the train and test RMSE for different combinations of hyperparameters. The full run took more than 4 hours, so we stored the results in the GitHub repo (https://github.com/badri449/MMDS/blob/main/LatentFactor/Train_Parameters.txt).

In [31]:
# Uncomment the lines below to train with different hyperparameters
# Instead of the for loop, GridSearchCV can also be used to find the best hyperparameters
"""
COLUMN_NAMES = ['LearningRate','Epochs',"NumFactors","Train RMSE","Test RMSE"]
result = pd.DataFrame(columns=COLUMN_NAMES)
for lr in [0.005,0.05,0.1,0.5]:
    for epochs in [10,15,20,25,30,40,45,50,60]:
        for num_factors in [2,3,4,5,6,7,8,9,10,15,20,25,30,35,40,50]:
            Alg1 = MatrixFactorization(learning_rate=lr,num_epochs=epochs,num_factors=num_factors)
            print("lr: ",lr," epoch: ",epochs,"num: ",num_factors)
            Alg1.fit(trainset)
            prediction = Alg1.test(trainset.build_testset())
            train_error = surprise.accuracy.rmse(prediction,verbose=True)
            prediction = Alg1.test(testset)
            test_error = surprise.accuracy.rmse(prediction,verbose=True)
            cur = [lr,epochs,num_factors,train_error,test_error]
            result_length = len(result)
            result.loc[result_length] = cur
"""
"""
gs = surprise.model_selection.GridSearchCV(MatrixFactorization, param_grid={'learning_rate':[0.005,0.05,0.1,0.5],
                                                                            'num_epochs':[10,15,20,25,30,40,45,50,60],
                                                                            'num_factors':[2,3,4,5,6,7,8,9,10,15,20,25,30,35,40,50]},measures=['rmse', 'mae'], cv=2)
gs.fit(data) # Training on entire dataset
print('rmse: ',gs.best_score['rmse'],'mae: ',gs.best_score['mae'])
best_params = gs.best_params['rmse']
print('rmse: ',gs.best_params['rmse'],'mae: ',gs.best_params['mae'])
"""
# Uncomment either of the above to train with different combinations of hyperparameters

Load the results from the GitHub repo

In [32]:
url = 'https://github.com/badri449/MMDS/raw/main/LatentFactor/Train_Parameters.txt'
result = pd.read_csv(url,index_col=False)
result = result.drop(['index'], axis = 1)
result.columns = ['LearningRate', 'Epochs', 'LatentFactors', 'TrainRMSE','TestRMSE']

In [33]:
# Sort the results based on Train RMSE 
result.sort_values(by=['TrainRMSE'])

Unnamed: 0,LearningRate,Epochs,LatentFactors,TrainRMSE,TestRMSE
143,0.005,60.0,50.0,0.234026,1.982881
142,0.005,60.0,40.0,0.270494,1.972494
141,0.005,60.0,35.0,0.292634,1.967087
140,0.005,60.0,30.0,0.318007,1.953089
127,0.005,50.0,50.0,0.343207,1.981028
...,...,...,...,...,...
4,0.005,10.0,6.0,2.590315,2.033179
3,0.005,10.0,5.0,2.649409,2.054577
2,0.005,10.0,4.0,2.661590,2.007664
1,0.005,10.0,3.0,2.716431,2.044642


In [34]:
# Sort the results based on Test RMSE 
result.sort_values(by=['TestRMSE'])

Unnamed: 0,LearningRate,Epochs,LatentFactors,TrainRMSE,TestRMSE
208,0.050,30.0,2.0,1.392116,1.412927
400,0.100,50.0,2.0,1.399411,1.412991
160,0.050,15.0,2.0,1.398663,1.413018
575,0.500,60.0,50.0,1.416786,1.413019
479,0.500,20.0,50.0,1.416786,1.413019
...,...,...,...,...,...
8,0.005,10.0,10.0,2.496395,2.036969
11,0.005,10.0,25.0,2.319462,2.043628
1,0.005,10.0,3.0,2.716431,2.044642
3,0.005,10.0,5.0,2.649409,2.054577


The two sorts above show that a low train RMSE does not imply a low test RMSE: the configuration with the lowest train RMSE (0.23) has a test RMSE of 1.98. This is overfitting, which is expected with 50 latent factors.<br>
We therefore choose the hyperparameters with the lowest test RMSE:
*   Learning Rate: 0.050
*   Epochs: 30
*   NumFactors: 2
<br>
With these settings the test RMSE is 1.412927 and the train RMSE is 1.392116.
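The selection rule can be reproduced on a few rows copied from the two sorted tables above. The sketch uses plain tuples instead of the `result` DataFrame so that it runs on its own; the variable names are assumptions. It picks the row with the lowest test RMSE and prints the gap between test and train RMSE as a simple overfitting indicator.

```python
# (learning_rate, epochs, latent_factors, train_rmse, test_rmse), copied from the tables above
rows = [
    (0.005, 60, 50, 0.234026, 1.982881),
    (0.005, 60, 40, 0.270494, 1.972494),
    (0.050, 30, 2, 1.392116, 1.412927),
    (0.100, 50, 2, 1.399411, 1.412991),
    (0.050, 15, 2, 1.398663, 1.413018),
]

# Choose by test RMSE, not train RMSE
best = min(rows, key=lambda row: row[4])
lowest_train = min(rows, key=lambda row: row[3])

for lr, epochs, factors, train_rmse, test_rmse in rows:
    gap = test_rmse - train_rmse
    print(f"lr={lr:<5} epochs={epochs:<2} factors={factors:<2} gap={gap:.3f}")

print("Best by test RMSE: ", best[:3])
print("Best by train RMSE:", lowest_train[:3])
```

The row with the lowest train RMSE has by far the largest gap, while the selected row has train and test errors that are close together.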

In [35]:
bestVersion = MatrixFactorization(learning_rate=0.050,num_epochs=30,num_factors=2)

We used 10-fold cross-validation to evaluate the best model.

In [36]:
kSplit = surprise.model_selection.KFold(n_splits=10,shuffle=True)
for train,test in kSplit.split(data):
    bestVersion.fit(train)
    prediction = bestVersion.test(test)
    surprise.accuracy.rmse(prediction,verbose=True)



RMSE: 1.4245
RMSE: 1.4228
RMSE: 1.4187
RMSE: 1.3840
RMSE: 1.4163
RMSE: 1.4158
RMSE: 1.4360
RMSE: 1.4216
RMSE: 1.4318
RMSE: 1.4208


The above results are consistent with the test RMSE we got during the hyperparameter search.
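To put a number on that consistency, the ten fold scores printed above can be summarized with the standard library. This is a small sketch; the variable names are assumptions, and the reference value 1.412927 is the test RMSE reported for the chosen hyperparameters.

```python
from statistics import mean, stdev

# RMSE of each of the 10 folds, copied from the output above
fold_rmse = [1.4245, 1.4228, 1.4187, 1.3840, 1.4163,
             1.4158, 1.4360, 1.4216, 1.4318, 1.4208]

search_test_rmse = 1.412927  # test RMSE of the chosen hyperparameters

mean_rmse = mean(fold_rmse)
print("Mean fold RMSE:", round(mean_rmse, 4))  # 1.4192
print("Std of fold RMSE:", round(stdev(fold_rmse), 4))
print("Difference from search test RMSE:", round(mean_rmse - search_test_rmse, 4))
```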

Now train the best version (optimal hyperparameters) on the entire dataset

In [37]:
bestVersion = MatrixFactorization(learning_rate=0.050,num_epochs=30,num_factors=2)
data1 = data.build_full_trainset()
bestVersion.fit(data1) # Train on full data set



In [38]:
def latent_recommendation(user_name,N=5):
  return bestVersion.get_recommendation(user_name,N)

Test latent recommendation function

In [39]:
latent_recommendation("AQJDVLLWELFJ1",9)

['B00005N7P0',
 'B00005N7OJ',
 'B00005N7Q1',
 'B00005N7OD',
 'B00005N7PS',
 'B00005N7OU',
 'B00005N7QC',
 'B00005N7PA',
 'B00005N7OV']

In [40]:
# create an array of unique users - Yuqi
unique_users = df['Customer_ID'].unique()

This function is the final result: it routes each request to one of the three recommenders.

In [41]:
# Return a list of the top N recommendations
def get_recomendation(user_name, N=5, item_id=None):
  if user_name in unique_users:
    if item_id is not None:
      # Old user purchased an item
      return collaborative_recommendation(item_id, N)
    # Old user just browsing
    return latent_recommendation(user_name, N)
  # New user browsing an item
  return content_based_recomendation(item_id, N)
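The routing logic can be checked in isolation with a stand-in that returns the name of the chosen recommender instead of calling it. `choose_recommender` is a hypothetical helper, not part of the notebook. It also makes one case explicit that the function above leaves open: a new user with no `item_id` has nothing to base a content-based recommendation on, so this sketch raises an error there.

```python
def choose_recommender(user_name, known_users, item_id=None):
    """Return which of the three recommenders would handle the request."""
    if user_name in known_users:
        # Old user: item-item neighbors after a purchase, latent factors otherwise
        return 'collaborative' if item_id is not None else 'latent'
    if item_id is None:
        # Assumption of this sketch: reject requests that carry no usable signal
        raise ValueError('a new user needs an item_id for content-based recommendations')
    return 'content'


# A set gives constant-time membership checks
known = {'AQJDVLLWELFJ1'}

print(choose_recommender('xx', known, 'B00005N7O6'))             # content
print(choose_recommender('AQJDVLLWELFJ1', known))                # latent
print(choose_recommender('AQJDVLLWELFJ1', known, 'B00005N7O6'))  # collaborative
```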
  

Test for different scenarios

In [42]:
# New User browsing
get_recomendation('xx',5,'B00005N7O6')

['B00007AWME', 'B00YQH9874', 'B015PRLQSW', 'B000ILZ6YQ', 'B000H4W7VO']

In [43]:
# Old user Browsing
get_recomendation("AQJDVLLWELFJ1",5)

['B00005N7P0', 'B00005N7OJ', 'B00005N7Q1', 'B00005N7OD', 'B00005N7PS']

In [44]:
# Old user Purchased an item
get_recomendation("AQJDVLLWELFJ1",3,"B00005N7O6")

['B00007AWME', 'B000UEI4JU', 'B00005N7P0']