### Recommendation Engine using Amazon Beauty Products Rating ###
#### This dataset holds over 2 million customer ratings of beauty products sold on Amazon. Each record contains: ####

1. the unique UserId (Customer Identification),
2. the product ASIN (Amazon's unique product identification code for each product),
3. Ratings (ranging from 1-5 based on customer satisfaction) and
4. the Timestamp of the rating (in UNIX time)
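
UNIX time counts seconds since 1970-01-01 UTC. As a small illustration (the helper name `unix_to_date` is made up for this sketch and is not part of the notebook), such a timestamp can be turned into a calendar date with the standard library:

```python
from datetime import datetime, timezone

def unix_to_date(timestamp):
    """Convert a UNIX timestamp (seconds since 1970-01-01 UTC) to an ISO date string."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).date().isoformat()

# Timestamp taken from the first row of the dataset preview
print(unix_to_date(1369699200))  # 2013-05-28
```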

#### Source : http://jmcauley.ucsd.edu/data/amazon/ #### 

### Objective: ###
#### The objective of this analysis is to see whether a good recommendation engine can be built from this minimal dataset using different recommenders from the Turicreate package, such as popularity based and collaborative filtering based models ####

### Importing all the necessary libraries ###

In [1]:
import numpy as np
import pandas as pd
import turicreate as tc
from sklearn.model_selection import StratifiedShuffleSplit

### Preprocessing and Data Cleaning ###

In [2]:
df = pd.read_csv("AmazonRatings_Beauty.csv")

In [3]:
df.shape

(2023070, 4)

> One can see that the dataset is huge with over 2 Million Product Ratings

In [4]:
df.head()

Unnamed: 0,UserId,ProductId,Rating,Timestamp
0,A39HTATAQ9V7YF,205616461,5.0,1369699200
1,A3JM6GV9MNOF9X,558925278,3.0,1355443200
2,A1Z513UWSAAO0F,558925278,5.0,1404691200
3,A1WMRR494NWEWV,733001998,4.0,1382572800
4,A3IAAVS479H7M7,737104473,1.0,1274227200


In [5]:
df['UserId'].nunique()

1210271

In [6]:
df['ProductId'].nunique()

249274

In [7]:
df['Rating'].value_counts(normalize=True)

5.0    0.617241
4.0    0.152115
1.0    0.090844
3.0    0.083927
2.0    0.055873
Name: Rating, dtype: float64

#### For ease of the initial analysis, a smaller subset of 500000 randomly selected ratings is used
#### A function is defined below to draw this random subset from the dataset

In [8]:
# Function to randomly select 500000 product ratings from the entire DF
def randomSplitDF(dataframe):
    obs = dataframe.shape[0]  # number of rows in the DF
    rng = np.random.default_rng(seed=42)  # seeded random number generator, so the subset is reproducible
    # 500000 random row positions in [0, obs); drawn with replacement, so a row can be picked more than once
    rints = rng.integers(low=0, high=obs, size=500000)
    dataf = dataframe.iloc[rints]  # create a DF from the sampled row positions
    print('The distribution of ratings after compressing the DF : \n{}'.format(dataf['Rating'].value_counts(normalize=True)))
    return dataf

In [9]:
df_x = randomSplitDF(df)

The distribution of ratings after compressing the DF : 
5.0    0.618344
4.0    0.152040
1.0    0.090434
3.0    0.083886
2.0    0.055296
Name: Rating, dtype: float64


#### The rating distribution of the subset is almost identical to that of the original dataset
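
Note that `rng.integers` draws each index independently, so the subset is sampled with replacement and may contain the same row more than once. The sketch below uses toy sizes and a hypothetical helper `sample_indices` to contrast this with sampling without replacement. On the real DataFrame, NumPy's `Generator.choice(obs, size=500000, replace=False)` would be one way to avoid repeats; it was not used in this notebook.

```python
import random

def sample_indices(n_rows, size, replace, seed=42):
    """Draw `size` row positions from range(n_rows), with or without replacement."""
    rng = random.Random(seed)
    if replace:
        # Independent draws: the same position can come up several times
        return [rng.randrange(n_rows) for _ in range(size)]
    # random.sample never returns the same position twice
    return rng.sample(range(n_rows), size)

with_repl = sample_indices(1000, 500, replace=True)
without_repl = sample_indices(1000, 500, replace=False)

# Number of distinct positions in each sample
print(len(set(with_repl)), len(set(without_repl)))
```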

In [10]:
df_x.shape

(500000, 4)

In [11]:
df_x = df_x.drop(columns="Timestamp")
df_x.head()

Unnamed: 0,UserId,ProductId,Rating
180560,AGBO140NR1GUJ,B000BGRC5E,5.0
1565767,A30QS6TJJDEFY0,B007AAUXI2,5.0
1324244,A2C7JDUMF5E450,B004XL4IN2,5.0
887881,A13E5WBNXFFD6X,B002M3NV2C,5.0
876020,AXTN7YDAXMPAE,B002KANWU8,5.0


> The 'Timestamp' column is dropped as it is not important to the recommendation analysis

#### For the analysis with Turicreate recommender packages, the 'UserId' and 'ProductId' columns should be either 'str' or 'int' type and the 'Rating' should be 'float' or 'int' type

In [12]:
## changing the columns to necessary data types for Turicreate
df_x['UserId'] = df_x['UserId'].astype(str)
df_x['ProductId'] = df_x['ProductId'].astype(str)
df_x['ProductId'].dtype

dtype('O')

In [13]:
featureCols = [x for x in df_x.columns if x!='Rating']
featureCols

['UserId', 'ProductId']

### Splitting the data ###
#### Stratified Shuffle Split with a train:test ratio of 75:25 will be used ####

In [14]:
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_index, test_index = next(sss.split(df_x[featureCols].values, df_x['Rating'].values))
train_index

array([182051, 375438, 309569, ..., 364790, 272890, 211956])

In [15]:
test_index

array([228836, 166764, 378702, ..., 368656, 446051, 272490])
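
To make the idea of stratification concrete, here is a minimal pure-Python sketch of a stratified split. It is not scikit-learn's implementation; `stratified_split` is a hypothetical helper that only illustrates the principle: shuffle the indices within each rating value and send the same fraction of each group to the test set.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.25, seed=42):
    """Split positions so each label keeps roughly the same share in train and test."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)                      # shuffle within the stratum
        n_test = round(len(indices) * test_size)  # same test fraction per stratum
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels: 60% five-star, 20% four-star, 20% one-star
labels = [5.0] * 60 + [4.0] * 20 + [1.0] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Because every stratum contributes the same fraction, the rating distribution in the test set mirrors the one in the training set.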

#### For recommendation with Turicreate, the train and test data must be stored as SFrames ####

In [16]:
trainData = tc.SFrame(df_x.iloc[train_index])
testData = tc.SFrame(df_x.iloc[test_index])
print('Shape of train data is : {}'.format(trainData.shape))
print('Shape of test data is : {}'.format(testData.shape))

Shape of train data is : (375000, 3)
Shape of test data is : (125000, 3)


In [17]:
trainData.head()

UserId,ProductId,Rating
ASCQOFWO81XYN,B001A3Z72W,5.0
A3KXTRGKMHT22V,B003BPJP6G,5.0
A13LSEX4IFY0PV,B005XG2P0E,5.0
A2I5JMVDTAYLCA,B007K1OB0M,5.0
ACKK4BGIUUXO9,B000052YOL,4.0
AF3KFPN0XF44G,B0083GKAEE,5.0
ALTF30OWFS9SU,B00H28JKO0,5.0
A10VJAQED3UDX6,B001ARSVT4,5.0
A3BO85ZORXIIS1,B000T26QXY,5.0
A2Y9SBIU0OBPXF,B001TMXAR8,5.0


In [18]:
testData.head()

UserId,ProductId,Rating
A2J3C8J062VLVH,B00A51LI1O,2.0
AEB7C76SO19DE,B00A35XSAG,3.0
A3NKNWK1JWWJQ5,B000GI0RWW,5.0
A2R82IY983X7D5,B0009FHJRS,1.0
A2HI1ILEKLI1YO,B003Z7XHMI,2.0
ARBMNF9NTVO31,B002EDXXD2,2.0
A3T8R8QJJV8VW7,B00DQ2I0G0,5.0
A2CA9VUU7O5NDP,B00B2SV21A,5.0
A2EUY4UCJIRFDT,B006Z9UPQO,5.0
AOF5GANWARSQF,B008GY3X74,5.0


### Recommendation Engine ###

#### Four recommendation methods from the Turicreate package are used here:
1. __Popularity recommender__ : It ranks items by popularity and recommends the same list to every user
2. __Item-Item__ similarity based Collaborative Filtering : It recommends items that are similar to the items a user has already rated, where the similarity between two items is computed from the ratings they received
    >a. __Cosine similarity__ : Item similarity is the cosine of the angle between the two items' rating vectors.
    
    >b. __Pearson similarity__ : Item similarity is the Pearson correlation between the two items' ratings.
3. __Factorization Model__ : It learns latent factors for each user and item and uses them to make rating predictions. (This includes both standard matrix factorization and factorization machine models)
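
The two item-item similarity measures can be sketched in plain Python as follows. These are textbook-style definitions written for illustration, assuming each item is represented as a `{user: rating}` dictionary. Turicreate's internal implementation may differ in details such as which users enter the norms and means.

```python
import math

def cosine_similarity(ratings_i, ratings_j):
    """Cosine of the angle between two items' rating vectors ({user: rating} dicts)."""
    shared = ratings_i.keys() & ratings_j.keys()
    dot = sum(ratings_i[u] * ratings_j[u] for u in shared)
    norm_i = math.sqrt(sum(r * r for r in ratings_i.values()))
    norm_j = math.sqrt(sum(r * r for r in ratings_j.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)

def pearson_similarity(ratings_i, ratings_j):
    """Pearson correlation over shared users, centered on each item's mean rating."""
    shared = ratings_i.keys() & ratings_j.keys()
    if len(shared) < 2:
        return 0.0  # not enough co-ratings to measure a correlation
    mean_i = sum(ratings_i.values()) / len(ratings_i)
    mean_j = sum(ratings_j.values()) / len(ratings_j)
    num = sum((ratings_i[u] - mean_i) * (ratings_j[u] - mean_j) for u in shared)
    den_i = math.sqrt(sum((ratings_i[u] - mean_i) ** 2 for u in shared))
    den_j = math.sqrt(sum((ratings_j[u] - mean_j) ** 2 for u in shared))
    if den_i == 0 or den_j == 0:
        return 0.0
    return num / (den_i * den_j)

item_a = {"u1": 5.0, "u2": 3.0, "u3": 4.0}
item_b = {"u1": 4.0, "u2": 2.0, "u3": 3.0}  # same pattern, one star lower
item_c = {"u4": 5.0}                        # no users in common with item_a

print(cosine_similarity(item_a, item_b), pearson_similarity(item_a, item_b))
print(cosine_similarity(item_a, item_c), pearson_similarity(item_a, item_c))
```

In this sketch, two items with no users in common get a similarity of zero under both measures, which is the typical situation in a very sparse rating matrix.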

#### Defining some variables for the analysis

In [19]:
# define constants
user_id = 'UserId'
item_id = 'ProductId'
target = 'Rating'
n_items = 10 # number of items to recommend
usersToRecommend = list(df_x[user_id])

#### A function is defined that lets the user choose which recommendation method to use

In [20]:
def recommendUser(train_data,name,user_id,item_id,target,usersToRecommend,n_items):
    # Train the recommender selected by `name`; pass target=None to train without ratings
    if name=='popularity':
        model = tc.popularity_recommender.create(train_data, user_id=user_id, item_id=item_id, target=target)
    elif name=='cosine':
        model = tc.item_similarity_recommender.create(train_data, user_id=user_id, item_id=item_id, target=target,
                                                 similarity_type='cosine')
    elif name=='pearson':
        model = tc.item_similarity_recommender.create(train_data, user_id=user_id, item_id=item_id, target=target,
                                                similarity_type='pearson')
    elif name=='factor_recommender':
        model = tc.factorization_recommender.create(train_data, user_id=user_id, item_id=item_id, target=target)
    else:
        # Fail early instead of hitting an undefined `model` below
        raise ValueError('Unknown recommender name: {}'.format(name))
    recomModel = model.recommend(users = usersToRecommend, k=n_items)  # top n_items items per user
    recomModel.print_rows(30)  # show the first 30 recommendation rows
    return model

#### Performance metric :
1. Initially the four models are trained __using a Target column__ ('Rating'), and hence the metric used is RMSE (see the Turicreate API docs)
2. In the second part, the models are trained __without a Target column__ to see whether this improves the results, and hence the metric used is precision and recall *(the factorization model is not used here because it needs a target column)*
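
For reference, both metrics can be written down in a few lines of plain Python. This is a generic sketch of the standard definitions, with hypothetical helper names, and not Turicreate's evaluation code.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and observed ratings."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations for one user.

    recommended: ranked list of item ids; relevant: set of items the user
    actually interacted with in the test data.
    """
    hits = sum(1 for item in recommended[:k] if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

print(rmse([5.0, 3.0], [4.0, 3.0]))
print(precision_recall_at_k(["a", "b", "c", "d"], {"b", "z"}, k=2))
```

RMSE measures how far predicted ratings are from the observed ones, while precision and recall at a cutoff measure how many of the recommended items appear among the items the user actually rated in the test set.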

In [21]:
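# Testing the models: compare_models prints the RMSE (when a target is present)
# and the precision/recall summary for each model
def calculateRMSE(test_data,models,names):
    eval_counts = tc.recommender.util.compare_models(test_data,models,model_names = names)
    return eval_counts  # keep the evaluation results available to the caller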

### Training the models

#### 1.Popularity Model

In [22]:
name = 'popularity'
popularModel = recommendUser(trainData,name,user_id,item_id,target,usersToRecommend,n_items)

+----------------+------------+-------+------+
|     UserId     | ProductId  | score | rank |
+----------------+------------+-------+------+
| AGBO140NR1GUJ  | B001B2GGCC |  5.0  |  1   |
| AGBO140NR1GUJ  | B007WXQOJO |  5.0  |  2   |
| AGBO140NR1GUJ  | B001D06WO4 |  5.0  |  3   |
| AGBO140NR1GUJ  | B001B1OPKS |  5.0  |  4   |
| AGBO140NR1GUJ  | B0020MMDNS |  5.0  |  5   |
| AGBO140NR1GUJ  | B000ESEQ7G |  5.0  |  6   |
| AGBO140NR1GUJ  | B000T26QXY |  5.0  |  7   |
| AGBO140NR1GUJ  | B001ARSVT4 |  5.0  |  8   |
| AGBO140NR1GUJ  | B0083GKAEE |  5.0  |  9   |
| AGBO140NR1GUJ  | B003BPJP6G |  5.0  |  10  |
| A30QS6TJJDEFY0 | B001B2GGCC |  5.0  |  1   |
| A30QS6TJJDEFY0 | B007WXQOJO |  5.0  |  2   |
| A30QS6TJJDEFY0 | B001D06WO4 |  5.0  |  3   |
| A30QS6TJJDEFY0 | B001B1OPKS |  5.0  |  4   |
| A30QS6TJJDEFY0 | B0020MMDNS |  5.0  |  5   |
| A30QS6TJJDEFY0 | B000ESEQ7G |  5.0  |  6   |
| A30QS6TJJDEFY0 | B000T26QXY |  5.0  |  7   |
| A30QS6TJJDEFY0 | B001ARSVT4 |  5.0  |  8   |
| A30QS6TJJDE

> As expected, the popularity model recommends the same top-scoring items (here all with a score of 5.0) to every user, with no personalization. On its own it is therefore not a good recommendation engine.
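
The scores of 5.0 above suggest that, with a target column, items are ranked by their mean rating, whereas the run without a target later in this notebook shows scores that look like rating counts. Under that assumption, the toy sketch below (hypothetical helper names, made-up data) shows why the two rankings can differ: an item with a single 5-star rating beats a frequently rated item on mean rating, but not on count.

```python
from collections import defaultdict

def rank_by_mean_rating(ratings, k):
    """Top-k items by mean rating; ratings is a list of (user, item, rating)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for _, item, rating in ratings:
        totals[item] += rating
        counts[item] += 1
    means = {item: totals[item] / counts[item] for item in totals}
    return sorted(means, key=lambda item: (-means[item], item))[:k]

def rank_by_count(ratings, k):
    """Top-k items by number of ratings received."""
    counts = defaultdict(int)
    for _, item, _ in ratings:
        counts[item] += 1
    return sorted(counts, key=lambda item: (-counts[item], item))[:k]

toy_ratings = [
    ("u1", "A", 5.0),                    # item A: one rating, mean 5.0
    ("u2", "B", 5.0), ("u3", "B", 4.0),  # item B: four ratings, mean 4.5
    ("u4", "B", 5.0), ("u5", "B", 4.0),
]
print(rank_by_mean_rating(toy_ratings, k=2))  # ['A', 'B']
print(rank_by_count(toy_ratings, k=2))        # ['B', 'A']
```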

#### 2. Item-Item based with Cosine Similarity

In [23]:
name = 'cosine'
cosineModel = recommendUser(trainData,name,user_id,item_id,target,usersToRecommend,n_items)

+----------------+------------+-----------------------+------+
|     UserId     | ProductId  |         score         | rank |
+----------------+------------+-----------------------+------+
| AGBO140NR1GUJ  | B006MAPN4K |  0.004279214143753052 |  1   |
| AGBO140NR1GUJ  | B00CSSYJOK | 0.0036653661727905273 |  2   |
| AGBO140NR1GUJ  | B00DITZGE0 | 0.0030473506450653075 |  3   |
| AGBO140NR1GUJ  | B003DWG2FE |  0.002824716567993164 |  4   |
| AGBO140NR1GUJ  | B002IY7IC4 |  0.002706197500228882 |  5   |
| AGBO140NR1GUJ  | B0032GZM1Q |  0.002502964735031128 |  6   |
| AGBO140NR1GUJ  | B003WZ2FUI |  0.002502964735031128 |  7   |
| AGBO140NR1GUJ  | B00B594KDS | 0.0024298667907714845 |  8   |
| AGBO140NR1GUJ  | B0069FDR96 |  0.002332979440689087 |  9   |
| AGBO140NR1GUJ  | B00DHBR2IM | 0.0022696173191070557 |  10  |
| A30QS6TJJDEFY0 | B000FBF5AY |   0.9965935349464417  |  1   |
| A30QS6TJJDEFY0 | B0008394GA |   0.9479495882987976  |  2   |
| A30QS6TJJDEFY0 | B0031W8Y3E |   0.8579859137535095  |

> The item-item model with cosine similarity gives more personalized recommendations for each user, and it was faster in recommending than the popularity based model

#### 3. Item-Item based with Pearson Similarity

In [24]:
name = 'pearson'
pearsonModel = recommendUser(trainData,name,user_id,item_id,target,usersToRecommend,n_items)

+----------------+------------+-------+------+
|     UserId     | ProductId  | score | rank |
+----------------+------------+-------+------+
| AGBO140NR1GUJ  | B001B2GGCC |  5.0  |  1   |
| AGBO140NR1GUJ  | B007WXQOJO |  5.0  |  2   |
| AGBO140NR1GUJ  | B001D06WO4 |  5.0  |  3   |
| AGBO140NR1GUJ  | B001B1OPKS |  5.0  |  4   |
| AGBO140NR1GUJ  | B0020MMDNS |  5.0  |  5   |
| AGBO140NR1GUJ  | B000ESEQ7G |  5.0  |  6   |
| AGBO140NR1GUJ  | B000T26QXY |  5.0  |  7   |
| AGBO140NR1GUJ  | B001ARSVT4 |  5.0  |  8   |
| AGBO140NR1GUJ  | B0083GKAEE |  5.0  |  9   |
| AGBO140NR1GUJ  | B003BPJP6G |  5.0  |  10  |
| A30QS6TJJDEFY0 | B001B2GGCC |  5.0  |  1   |
| A30QS6TJJDEFY0 | B007WXQOJO |  5.0  |  2   |
| A30QS6TJJDEFY0 | B001D06WO4 |  5.0  |  3   |
| A30QS6TJJDEFY0 | B001B1OPKS |  5.0  |  4   |
| A30QS6TJJDEFY0 | B0020MMDNS |  5.0  |  5   |
| A30QS6TJJDEFY0 | B000ESEQ7G |  5.0  |  6   |
| A30QS6TJJDEFY0 | B000T26QXY |  5.0  |  7   |
| A30QS6TJJDEFY0 | B001ARSVT4 |  5.0  |  8   |
| A30QS6TJJDE

> The item-item model with Pearson similarity gives exactly the same recommendations as the popularity model, so it does not produce a personalized list for each user. It is, however, faster in recommending than the cosine model

#### 4. Factorization Recommender model

In [25]:
name = 'factor_recommender'
factorRecommender = recommendUser(trainData,name,user_id,item_id,target,usersToRecommend,n_items)

+----------------+------------+--------------------+------+
|     UserId     | ProductId  |       score        | rank |
+----------------+------------+--------------------+------+
| AGBO140NR1GUJ  | B001V9LUYE |  8.39701511336581  |  1   |
| AGBO140NR1GUJ  | B006QAESSI |  8.08099104834811  |  2   |
| AGBO140NR1GUJ  | B00315915G | 8.034387130272549  |  3   |
| AGBO140NR1GUJ  | B003765DLK | 8.020184773934048  |  4   |
| AGBO140NR1GUJ  | B005FCFNY6 | 8.006399173271816  |  5   |
| AGBO140NR1GUJ  | B007QYVP7U | 7.899783629906338  |  6   |
| AGBO140NR1GUJ  | B00BXTY6B6 | 7.855474490654629  |  7   |
| AGBO140NR1GUJ  | B001SZ1UN2 |  7.82000472022311  |  8   |
| AGBO140NR1GUJ  | B008ERA1D2 | 7.818207282555264  |  9   |
| AGBO140NR1GUJ  | B003658S6O | 7.815052289498013  |  10  |
| A30QS6TJJDEFY0 | B001V9LUYE | 8.617019224894207  |  1   |
| A30QS6TJJDEFY0 | B006QAESSI | 8.300851393473309  |  2   |
| A30QS6TJJDEFY0 | B00315915G | 8.253853131067913  |  3   |
| A30QS6TJJDEFY0 | B003765DLK | 8.237837

> The factorization model gives a different list of recommendations than the Pearson and popularity models, but again recommends the same items to each user rather than a personalized list (in the rows shown above, only the scores differ between users). It is faster in recommending than the cosine and popularity models

### Testing the models (Part 1 with RMSE)

In [26]:
models = [popularModel,cosineModel,pearsonModel,factorRecommender]
names = ['Popularity Model based on Rating','cosine similarity based on Rating','pearson similarity based on Rating',
        'factorization Recommender based on Rating']

In [27]:
#Calculate for all the Models the RMSE values
calculateRMSE(testData,models,names)

PROGRESS: Evaluate model Popularity Model based on Rating



Precision and recall summary statistics by cutoff
+--------+------------------------+-----------------------+
| cutoff |     mean_precision     |      mean_recall      |
+--------+------------------------+-----------------------+
|   1    |          0.0           |          0.0          |
|   2    |          0.0           |          0.0          |
|   3    |          0.0           |          0.0          |
|   4    |          0.0           |          0.0          |
|   5    |          0.0           |          0.0          |
|   6    |          0.0           |          0.0          |
|   7    | 1.2822304655394129e-06 | 8.975613258775907e-06 |
|   8    | 1.1219516573469884e-06 | 8.975613258775907e-06 |
|   9    | 9.972903620862205e-07  | 8.975613258775907e-06 |
|   10   | 8.975613258775927e-07  | 8.975613258775907e-06 |
+--------+------------------------+-----------------------+
[10 rows x 3 columns]


Overall RMSE: 1.3186194777425513

Per User RMSE (best)
+---------------+------+------


Precision and recall summary statistics by cutoff
+--------+------------------------+------------------------+
| cutoff |     mean_precision     |      mean_recall       |
+--------+------------------------+------------------------+
|   1    | 0.00031414646405715556 | 0.00025236432612591613 |
|   2    | 0.00026029278450450034 | 0.0004154213003270124  |
|   3    | 0.00022738220255565615 | 0.0005384085724800003  |
|   4    | 0.0002266342347840917  | 0.0006763552000642818  |
|   5    | 0.00021900496351413148 | 0.0008169731411184375  |
|   6    | 0.00020195129832245798 | 0.0008958089442413537  |
|   7    | 0.00019489903076199102 | 0.0010184756587779643  |
|   8    | 0.00018624397511959992 | 0.0011065221507449963  |
|   9    | 0.00022439033146939678 | 0.0015687662335719546  |
|   10   | 0.00021092691158123428 |  0.001617112897663641  |
+--------+------------------------+------------------------+
[10 rows x 3 columns]


Overall RMSE: 4.328898412827936

Per User RMSE (best)
+----------------


Precision and recall summary statistics by cutoff
+--------+------------------------+-----------------------+
| cutoff |     mean_precision     |      mean_recall      |
+--------+------------------------+-----------------------+
|   1    |          0.0           |          0.0          |
|   2    |          0.0           |          0.0          |
|   3    |          0.0           |          0.0          |
|   4    |          0.0           |          0.0          |
|   5    |          0.0           |          0.0          |
|   6    |          0.0           |          0.0          |
|   7    | 1.2822304655394165e-06 | 8.975613258775907e-06 |
|   8    | 1.1219516573469884e-06 | 8.975613258775907e-06 |
|   9    | 9.972903620862033e-07  | 8.975613258775907e-06 |
|   10   |  8.97561325877596e-07  | 8.975613258775907e-06 |
+--------+------------------------+-----------------------+
[10 rows x 3 columns]


Overall RMSE: 2.0449740406306693

Per User RMSE (best)
+---------------+------+------


Precision and recall summary statistics by cutoff
+--------+------------------------+-----------------------+
| cutoff |     mean_precision     |      mean_recall      |
+--------+------------------------+-----------------------+
|   1    |          0.0           |          0.0          |
|   2    | 4.487806629387954e-06  | 8.975613258775907e-06 |
|   3    | 2.991871086258641e-06  | 8.975613258775907e-06 |
|   4    | 2.243903314693972e-06  | 8.975613258775889e-06 |
|   5    | 1.7951226517551912e-06 | 8.975613258775846e-06 |
|   6    | 1.4959355431293225e-06 | 8.975613258775892e-06 |
|   7    | 1.282230465539415e-06  |  8.9756132587759e-06  |
|   8    | 1.121951657346979e-06  | 8.975613258775833e-06 |
|   9    | 9.972903620862078e-07  | 8.975613258775867e-06 |
|   10   | 1.7951226517551859e-06 | 1.795122651755175e-05 |
+--------+------------------------+-----------------------+
[10 rows x 3 columns]


Overall RMSE: 1.2415211135295434

Per User RMSE (best)
+----------------+------------

### Best Model:
1. The Popularity Model has an overall RMSE of 1.32, the second lowest of the four models.
2. The Cosine Model has a much higher RMSE of 4.33.
3. The Pearson Model has an RMSE of 2.04, which is closer to that of the popularity model.
4. The Factorization Model has the lowest RMSE of all the models, 1.24.

> The precision and recall values for all the models are almost equal to zero because the training data has a __Target__ column specified. Hence this metric is not relevant for this part of the analysis.

*Based on the RMSE results above, the best model for the recommendation engine with a target column would be the __Factorization Model__, although it does not give very personalized results for each user.*
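
As a rough point of reference for these RMSE values, one can compute what a trivial predictor that always outputs the mean rating would score. The sketch below uses the rating shares printed earlier for the full dataset. This is an approximation, since the test split is a stratified sample of a 500000-row subset, and this baseline was not part of the original evaluation.

```python
import math

# Share of each rating value in the full dataset (from value_counts above)
rating_share = {5.0: 0.617241, 4.0: 0.152115, 1.0: 0.090844, 3.0: 0.083927, 2.0: 0.055873}

# Mean rating under this distribution
mean_rating = sum(rating * share for rating, share in rating_share.items())

# RMSE of always predicting the mean equals the standard deviation of the ratings
baseline_rmse = math.sqrt(
    sum(share * (rating - mean_rating) ** 2 for rating, share in rating_share.items())
)

print(round(mean_rating, 2), round(baseline_rmse, 2))  # 4.15 1.31
```

Under these assumptions the constant-mean baseline lands at roughly 1.31, so the RMSE values of the popularity and factorization models are close to what a non-personalized constant prediction would achieve.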

### Training the models (without Target column)

In [28]:
# Training and test data without the target ratings
trainData_wr = tc.SFrame(df_x[featureCols].iloc[train_index])
testData_wr = tc.SFrame(df_x[featureCols].iloc[test_index])
print('Shape of train data is : {}'.format(trainData_wr.shape))
print('Shape of test data is : {}'.format(testData_wr.shape))

Shape of train data is : (375000, 2)
Shape of test data is : (125000, 2)


In [29]:
trainData_wr

UserId,ProductId
ASCQOFWO81XYN,B001A3Z72W
A3KXTRGKMHT22V,B003BPJP6G
A13LSEX4IFY0PV,B005XG2P0E
A2I5JMVDTAYLCA,B007K1OB0M
ACKK4BGIUUXO9,B000052YOL
AF3KFPN0XF44G,B0083GKAEE
ALTF30OWFS9SU,B00H28JKO0
A10VJAQED3UDX6,B001ARSVT4
A3BO85ZORXIIS1,B000T26QXY
A2Y9SBIU0OBPXF,B001TMXAR8


In [30]:
testData_wr

UserId,ProductId
A2J3C8J062VLVH,B00A51LI1O
AEB7C76SO19DE,B00A35XSAG
A3NKNWK1JWWJQ5,B000GI0RWW
A2R82IY983X7D5,B0009FHJRS
A2HI1ILEKLI1YO,B003Z7XHMI
ARBMNF9NTVO31,B002EDXXD2
A3T8R8QJJV8VW7,B00DQ2I0G0
A2CA9VUU7O5NDP,B00B2SV21A
A2EUY4UCJIRFDT,B006Z9UPQO
AOF5GANWARSQF,B008GY3X74


#### 5. Popularity Model without Rating

In [31]:
name = 'popularity'
popularModel_wr = recommendUser(trainData_wr,name,user_id,item_id,None,usersToRecommend,n_items)

+----------------+------------+--------+------+
|     UserId     | ProductId  | score  | rank |
+----------------+------------+--------+------+
| AGBO140NR1GUJ  | B001MA0QY2 | 1375.0 |  1   |
| AGBO140NR1GUJ  | B0009V1YR8 | 514.0  |  2   |
| AGBO140NR1GUJ  | B0043OYFKU | 432.0  |  3   |
| AGBO140NR1GUJ  | B004OHQR1Q | 416.0  |  4   |
| AGBO140NR1GUJ  | B0000YUXI0 | 415.0  |  5   |
| AGBO140NR1GUJ  | B003V265QW | 405.0  |  6   |
| AGBO140NR1GUJ  | B000ZMBSPE | 397.0  |  7   |
| AGBO140NR1GUJ  | B003BQ6QXK | 354.0  |  8   |
| AGBO140NR1GUJ  | B00121UVU0 | 347.0  |  9   |
| AGBO140NR1GUJ  | B000142FVW | 316.0  |  10  |
| A30QS6TJJDEFY0 | B001MA0QY2 | 1375.0 |  1   |
| A30QS6TJJDEFY0 | B0009V1YR8 | 514.0  |  2   |
| A30QS6TJJDEFY0 | B0043OYFKU | 432.0  |  3   |
| A30QS6TJJDEFY0 | B004OHQR1Q | 416.0  |  4   |
| A30QS6TJJDEFY0 | B0000YUXI0 | 415.0  |  5   |
| A30QS6TJJDEFY0 | B003V265QW | 405.0  |  6   |
| A30QS6TJJDEFY0 | B000ZMBSPE | 397.0  |  7   |
| A30QS6TJJDEFY0 | B003BQ6QXK | 354.0  |

#### 6. Cosine Similarity without Rating

In [32]:
name = 'cosine'
cosineModel_wr = recommendUser(trainData_wr,name,user_id,item_id,None,usersToRecommend,n_items)

+----------------+------------+-----------------------+------+
|     UserId     | ProductId  |         score         | rank |
+----------------+------------+-----------------------+------+
| AGBO140NR1GUJ  | B006MAPN4K | 0.0008231699466705322 |  1   |
| AGBO140NR1GUJ  | B00DITZGE0 | 0.0007875442504882812 |  2   |
| AGBO140NR1GUJ  | B00CSSYJOK | 0.0006654369831085205 |  3   |
| AGBO140NR1GUJ  | B000052Y2X | 0.0005618774890899658 |  4   |
| AGBO140NR1GUJ  | B0047XFPTM | 0.0005618774890899658 |  5   |
| AGBO140NR1GUJ  | B000EVGQAI | 0.0005618774890899658 |  6   |
| AGBO140NR1GUJ  | B003VFUI58 | 0.0005618774890899658 |  7   |
| AGBO140NR1GUJ  | B003D159OK | 0.0005618774890899658 |  8   |
| AGBO140NR1GUJ  | B00FPZF5WS | 0.0005618774890899658 |  9   |
| AGBO140NR1GUJ  | B0074EGK40 | 0.0005618774890899658 |  10  |
| A30QS6TJJDEFY0 | B0031W8Y3E |  0.19245010614395142  |  1   |
| A30QS6TJJDEFY0 | B0008394GA |   0.1666666865348816  |  2   |
| A30QS6TJJDEFY0 | B000FBF5AY |  0.14907121658325195  |

#### 7. Pearson Similarity without Rating

In [33]:
name = 'pearson'
pearsonModel_wr = recommendUser(trainData_wr,name,user_id,item_id,None,usersToRecommend,n_items)

+----------------+------------+---------------------+------+
|     UserId     | ProductId  |        score        | rank |
+----------------+------------+---------------------+------+
| AGBO140NR1GUJ  | B001MA0QY2 |         1.0         |  1   |
| AGBO140NR1GUJ  | B0009V1YR8 | 0.37336244541484714 |  2   |
| AGBO140NR1GUJ  | B0043OYFKU | 0.31368267831149926 |  3   |
| AGBO140NR1GUJ  | B004OHQR1Q |  0.302037845705968  |  4   |
| AGBO140NR1GUJ  | B0000YUXI0 | 0.30131004366812225 |  5   |
| AGBO140NR1GUJ  | B003V265QW |  0.2940320232896652 |  6   |
| AGBO140NR1GUJ  | B000ZMBSPE | 0.28820960698689957 |  7   |
| AGBO140NR1GUJ  | B003BQ6QXK | 0.25691411935953423 |  8   |
| AGBO140NR1GUJ  | B00121UVU0 | 0.25181950509461426 |  9   |
| AGBO140NR1GUJ  | B000142FVW |  0.2292576419213974 |  10  |
| A30QS6TJJDEFY0 | B001MA0QY2 |         1.0         |  1   |
| A30QS6TJJDEFY0 | B0009V1YR8 | 0.37336244541484714 |  2   |
| A30QS6TJJDEFY0 | B0043OYFKU | 0.31368267831149926 |  3   |
| A30QS6TJJDEFY0 | B004O

### Testing the models (Part 2 with precision recall)

In [34]:
models2 = [popularModel_wr,cosineModel_wr,pearsonModel_wr]
names2 = ['Popularity Model without Rating','cosine similarity without Rating','pearson similarity without Rating']
calculateRMSE(testData_wr,models2,names2)

PROGRESS: Evaluate model Popularity Model without Rating



Precision and recall summary statistics by cutoff




+--------+-----------------------+-----------------------+
| cutoff |     mean_precision    |      mean_recall      |
+--------+-----------------------+-----------------------+
|   1    | 0.0031504402538303443 | 0.0029873832796292487 |
|   2    | 0.0021765862152531355 |  0.004130278034580029 |
|   3    | 0.0017741795541513835 |  0.005033374321967211 |
|   4    | 0.0015370737705653764 |  0.005762642899242764 |
|   5    | 0.0014127615269313226 |  0.006615326158826467 |
|   6    | 0.0013313826333850925 |  0.007464269579552362 |
|   7    | 0.0013065928443846489 |  0.00854281773564084  |
|   8    |  0.001227415113137605 |  0.009177094405927675 |
|   9    | 0.0011728134658134032 |  0.009860736949137762 |
|   10   | 0.0011318248319316239 |  0.010534955098426089 |
+--------+-----------------------+-----------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model cosine similarity without Rating



Precision and recall summary statistics by cutoff




+--------+------------------------+------------------------+
| cutoff |     mean_precision     |      mean_recall       |
+--------+------------------------+------------------------+
|   1    | 0.0003051708507983802  | 0.00024137988513779377 |
|   2    | 0.00024682936461633675 |  0.000379540217799667  |
|   3    | 0.0002303740736419148  | 0.0005366989318592812  |
|   4    | 0.00021317081489592802 | 0.0006357298648144408  |
|   5    |  0.000213619595558866  |  0.000783613778506655  |
|   6    | 0.00019447162060681135 | 0.0008412072969171345  |
|   7    |  0.000183358956572136  | 0.0009421829460783632  |
|   8    | 0.0001795122651755172  |  0.001046898434097413  |
|   9    | 0.00017352852300300055 | 0.0011575976642889893  |
|   10   | 0.00016425372263559926 |  0.001209207440526952  |
+--------+------------------------+------------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model pearson similarity without Rating



Precision and recall summary statistics by cutoff




+--------+-----------------------+-----------------------+
| cutoff |     mean_precision    |      mean_recall      |
+--------+-----------------------+-----------------------+
|   1    | 0.0031504402538303357 |  0.002987383279629254 |
|   2    |  0.002176586215253171 | 0.0041302780345800335 |
|   3    |  0.001774179554151381 |  0.005033374321967208 |
|   4    | 0.0015370737705653731 |  0.005762642899242765 |
|   5    | 0.0014127615269313235 |  0.006615326158826473 |
|   6    | 0.0013313826333851008 |  0.007464269579552359 |
|   7    | 0.0013065928443846506 |  0.008542817735640857 |
|   8    | 0.0012274151131376051 |  0.00917709440592767  |
|   9    |  0.001172813465813407 |  0.009860736949137715 |
|   10   | 0.0011318248319316221 |  0.010534955098426143 |
+--------+-----------------------+-----------------------+
[10 rows x 3 columns]



### Best Model, without Target column:
1. The Popularity and Pearson models have precision and recall values about 10 times higher than those of the cosine model. Still, all the models have very low precision and recall, so these models trained without a target column would not be good recommendation engines.
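
One plausible contributor to such low precision and recall, not verified in this notebook, is how sparse the data is. The quick calculation below uses the counts reported earlier for the full dataset; the 500000-row subset will differ somewhat.

```python
n_ratings = 2023070   # rows in the full dataset
n_users = 1210271     # unique UserId values
n_products = 249274   # unique ProductId values

# Average number of ratings each user has given
ratings_per_user = n_ratings / n_users

# Fraction of all possible user-product pairs that actually have a rating
density = n_ratings / (n_users * n_products)

print(f"{ratings_per_user:.2f} ratings per user on average")
print(f"{density:.2e} of all user-product pairs are rated")
```

With fewer than two ratings per user on average, many test users have little or no history in the training data, which makes it hard for any model to place the held-out product in a top-10 list.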

*Based on all the results above, the best model for the recommendation engine would be the one trained with a target column, which is the __Factorization Model__, although it does not give very personalized results for each user.*

### Scope for future work:
1. For this dataset, the models trained with a target column would be preferred due to their better RMSE values. As further work, the hyperparameters of the best model could be tuned.
2. Product metadata could additionally be used to provide context about the products.
3. Deep learning could be used to see whether it improves the results.