![example](images/director_shot.jpeg)

# Project Title

**Authors:** Student 1, Student 2, Student 3
***

## Overview

A one-paragraph overview of the project, including the business problem, data, methods, results and recommendations.

## Business Problem

Summary of the business problem you are trying to solve, and the data questions that you plan to answer to solve it.

***
Questions to consider:
* What are the business's pain points related to this project?
* How did you pick the data analysis question(s) that you did?
* Why are these questions important from a business perspective?
***

## Data Understanding

Describe the data being used for this project.
***
Questions to consider:
* Where did the data come from, and how do they relate to the data analysis questions?
* What do the data represent? Who is in the sample and what variables are included?
* What is the target variable?
* What are the properties of the variables you intend to use?
***

In [1]:
# Import standard packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
df = pd.read_csv('data/Luxury_Beauty.csv', names=['asin', 'user', 'rating', 'timestamp'])
df

Unnamed: 0,asin,user,rating,timestamp
0,B00004U9V2,A1Q6MUU0B2ZDQG,2.0,1276560000
1,B00004U9V2,A3HO2SQDCZIE9S,5.0,1262822400
2,B00004U9V2,A2EM03F99X3RJZ,5.0,1524009600
3,B00004U9V2,A3Z74TDRGD0HU,5.0,1524009600
4,B00004U9V2,A2UXFNW9RTL4VM,5.0,1523923200
...,...,...,...,...
574623,B01HIQEOLO,AHYJ78MVF4UQO,5.0,1489968000
574624,B01HIQEOLO,A1L2RT7KBNK02K,5.0,1477440000
574625,B01HIQEOLO,A36MLXQX9WPPW9,5.0,1475193600
574626,B01HJ2UY0W,A23DRCOMC2RIXF,1.0,1480896000


In [3]:
asin_list = df['asin'].unique()

In [4]:
np.arange(len(asin_list))

array([    0,     1,     2, ..., 12117, 12118, 12119])

In [5]:
asin_lookup = dict(zip(np.arange(len(asin_list)), asin_list))

In [6]:
asin_map = dict(zip(asin_list, np.arange(len(asin_list))))

In [7]:
asin_map

{'B00004U9V2': 0,
 'B00005A77F': 1,
 'B00005NDTD': 2,
 'B00005V50C': 3,
 'B00005V50B': 4,
 'B000066SYB': 5,
 'B000068DWY': 6,
 'B00008WFSM': 7,
 'B0000Y3NO6': 8,
 'B0000ZREXG': 9,
 'B0000ZREXQ': 10,
 'B00011JU6I': 11,
 'B00011QUKW': 12,
 'B00012C5RS': 13,
 'B000142FVW': 14,
 'B000141PIG': 15,
 'B00014351Q': 16,
 'B0001433OU': 17,
 'B000141PYK': 18,
 'B00014340I': 19,
 'B0001435D4': 20,
 'B00014353E': 21,
 'B0001432PK': 22,
 'B00014GT8W': 23,
 'B0001EKVCW': 24,
 'B0001EKVGS': 25,
 'B0001EKTTC': 26,
 'B0001EL5KO': 27,
 'B0001EL5JA': 28,
 'B0001EL9BO': 29,
 'B0001EL0WC': 30,
 'B0001EL5OU': 31,
 'B0001EL59K': 32,
 'B0001EL5R2': 33,
 'B0001EL4DC': 34,
 'B0001EL39C': 35,
 'B0001EL5Q8': 36,
 'B0001EL4M8': 37,
 'B0001EKYEM': 38,
 'B0001F3QV4': 39,
 'B0001NAYOI': 40,
 'B0001QNLNG': 41,
 'B0001UWRCI': 42,
 'B0001XDU2Q': 43,
 'B0001XDTYA': 44,
 'B0001XDUBC': 45,
 'B0001XDTWM': 46,
 'B0001Y74TA': 47,
 'B0001Y74TU': 48,
 'B0001Y74XG': 49,
 'B0001Y74H2': 50,
 'B0001Y74KO': 51,
 'B0001Y74SG': 52,
 'B

In [8]:
df['asin'] = df['asin'].map(asin_map)

In [9]:
df

Unnamed: 0,asin,user,rating,timestamp
0,0,A1Q6MUU0B2ZDQG,2.0,1276560000
1,0,A3HO2SQDCZIE9S,5.0,1262822400
2,0,A2EM03F99X3RJZ,5.0,1524009600
3,0,A3Z74TDRGD0HU,5.0,1524009600
4,0,A2UXFNW9RTL4VM,5.0,1523923200
...,...,...,...,...
574623,6012,AHYJ78MVF4UQO,5.0,1489968000
574624,6012,A1L2RT7KBNK02K,5.0,1477440000
574625,6012,A36MLXQX9WPPW9,5.0,1475193600
574626,12118,A23DRCOMC2RIXF,1.0,1480896000
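The integer encoding of `asin` above can also be produced in a single step. The following is a minimal sketch on a made-up toy frame (the `toy` data and the `_toy` names are illustrative, not part of the project), using `pd.factorize`, which assigns codes in order of first appearance, the same ordering that `unique()` combined with `np.arange` yields:

```python
import pandas as pd

# Toy stand-in for the ratings frame (values are illustrative only)
toy = pd.DataFrame({'asin': ['B1', 'B1', 'B2', 'B3', 'B2']})

# codes: one integer ID per row; uniques: original labels, indexed by code
codes, uniques = pd.factorize(toy['asin'])
toy['asin_id'] = codes

# Forward and reverse lookups, analogous to asin_map and asin_lookup
asin_map_toy = {label: i for i, label in enumerate(uniques)}
asin_lookup_toy = dict(enumerate(uniques))
```

Keeping the reverse lookup around matters later, because the metadata file identifies products by their original string ASIN.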


In [10]:
# df['asin'].map(asin_lookup)

In [11]:
user_list = df['user'].unique()

In [12]:
np.arange(len(user_list))

array([     0,      1,      2, ..., 416171, 416172, 416173])

In [13]:
user_lookup = dict(zip(np.arange(len(user_list)), user_list))

In [14]:
user_map = dict(zip(user_list, np.arange(len(user_list))))

In [15]:
user_map

{'A1Q6MUU0B2ZDQG': 0,
 'A3HO2SQDCZIE9S': 1,
 'A2EM03F99X3RJZ': 2,
 'A3Z74TDRGD0HU': 3,
 'A2UXFNW9RTL4VM': 4,
 'AXX5G4LFF12R6': 5,
 'A7GUKMOJT2NR6': 6,
 'A3FU4L59BHA9FY': 7,
 'A1AMNMIPQMXH9M': 8,
 'A3DMBDTA8VGWSX': 9,
 'A160DTI3H7VHLQ': 10,
 'A1H41DKPDPVA0R': 11,
 'A2BDI7THUMJ8V': 12,
 'AM7EBP5TRX7AC': 13,
 'A31FOVCS3WTWPT': 14,
 'AXUU8F9EM6U3E': 15,
 'A24B46V78ATNRP': 16,
 'ABUBKML2EONCG': 17,
 'A2UA6E1RVG3C1I': 18,
 'A1TRMJHEDGX0HF': 19,
 'A2TTJS62322SXW': 20,
 'AX2K33SNI3WHN': 21,
 'ALX99DYO827ZK': 22,
 'A3PVVQ9MHYFTV9': 23,
 'A22NEUQTKWQM98': 24,
 'A1TQQZ6NVDTPNL': 25,
 'A32E3RVLI6D4TM': 26,
 'A3KUYXBMJ8AVIX': 27,
 'A3TMPSQ7X4M9LO': 28,
 'AUEUNR2AQQ0SY': 29,
 'A2P5MRZ68JX8EE': 30,
 'A8VD1E2O6N2KO': 31,
 'A1Q3N7GU27KGMA': 32,
 'A3QEV9GSI4HPA5': 33,
 'A2FMDHT0HNA3WY': 34,
 'A2QPHVVXS9FUBS': 35,
 'AL63CNA6X6IX8': 36,
 'A2N6AACMA6WOMN': 37,
 'A35I4FD5EARKTS': 38,
 'A2N8V79LWVR8F2': 39,
 'A2R9R1DJ9RHXOX': 40,
 'A60EV0X26JNB3': 41,
 'A3CG9DJUY5F2UY': 42,
 'AEDOSTGV48XO9': 43,
 'A21BQWP17Y

In [16]:
df['user'] = df['user'].map(user_map)

In [17]:
df

Unnamed: 0,asin,user,rating,timestamp
0,0,0,2.0,1276560000
1,0,1,5.0,1262822400
2,0,2,5.0,1524009600
3,0,3,5.0,1524009600
4,0,4,5.0,1523923200
...,...,...,...,...
574623,6012,194479,5.0,1489968000
574624,6012,175357,5.0,1477440000
574625,6012,416172,5.0,1475193600
574626,12118,416173,1.0,1480896000


In [18]:
df.dtypes

asin           int64
user           int64
rating       float64
timestamp      int64
dtype: object

In [19]:
df['asin'].nunique()

12120

In [20]:
df['user'].nunique()

416174

In [21]:
df['rating']=df['rating'].astype(np.int8)

In [22]:
df['asin']=df['asin'].astype(np.int32)

In [23]:
df['user']=df['user'].astype(np.int32)

In [24]:
df.dtypes

asin         int32
user         int32
rating        int8
timestamp    int64
dtype: object
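Downcasting is only safe when every value fits the narrower type, and the saving can be measured directly. A small sketch on synthetic data (the 1,000-row `toy` frame is an assumption for illustration) that checks the ranges first and then compares memory use:

```python
import numpy as np
import pandas as pd

# Synthetic frame with the same column types as the ratings data
toy = pd.DataFrame({
    'asin': np.arange(1000, dtype=np.int64),
    'user': np.arange(1000, dtype=np.int64),
    'rating': np.full(1000, 5.0),
})

before = toy.memory_usage(index=False).sum()

# Guard against silent overflow before casting to narrower integers
fits_int32 = toy[['asin', 'user']].max().max() <= np.iinfo(np.int32).max
fits_int8 = toy['rating'].between(np.iinfo(np.int8).min, np.iinfo(np.int8).max).all()

small = toy.astype({'asin': np.int32, 'user': np.int32, 'rating': np.int8})
after = small.memory_usage(index=False).sum()
```

Here the three columns shrink from 8 bytes per value each to 4, 4, and 1 bytes respectively.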

In [25]:
df.drop('timestamp', axis=1, inplace=True)
df

Unnamed: 0,asin,user,rating
0,0,0,2
1,0,1,5
2,0,2,5
3,0,3,5
4,0,4,5
...,...,...,...
574623,6012,194479,5
574624,6012,175357,5
574625,6012,416172,5
574626,12118,416173,1


In [26]:
df[df.duplicated(keep=False)].head(20)

Unnamed: 0,asin,user,rating
25,0,25,5
26,0,25,5
49,0,48,5
50,0,48,5
118,0,116,5
147,0,145,5
148,0,145,5
157,0,154,5
158,0,154,5
161,0,157,5


In [27]:
df.drop_duplicates(inplace=True)
df

Unnamed: 0,asin,user,rating
0,0,0,2
1,0,1,5
2,0,2,5
3,0,3,5
4,0,4,5
...,...,...,...
574623,6012,194479,5
574624,6012,175357,5
574625,6012,416172,5
574626,12118,416173,1
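`drop_duplicates()` removes only rows that match on every column. For a recommender it may also be worth checking whether the same user rated the same item more than once with different ratings. A sketch on a made-up frame (the `toy` values are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    'asin':   [0, 0, 0, 1, 1],
    'user':   [25, 25, 48, 7, 7],
    'rating': [5, 5, 5, 4, 2],
})

# keep=False flags every member of a fully identical group
exact_dupes = toy[toy.duplicated(keep=False)]

# Default keep='first' retains one row per identical group
deduped = toy.drop_duplicates()

# Same (asin, user) pair with conflicting ratings survives the step above
conflicts = deduped[deduped.duplicated(subset=['asin', 'user'], keep=False)]
```

How to resolve such conflicts (keep the latest rating, average them, and so on) is a modeling decision; the sketch only shows how to find them.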


In [28]:
df.dtypes

asin      int32
user      int32
rating     int8
dtype: object

In [29]:
df['asin'].nunique()

12120

In [30]:
df['user'].nunique()

416174

In [31]:
# df.to_csv(r'data/Luxury_Beauty_reduced.csv', index=False)

In [32]:
from surprise import Dataset, Reader
from surprise import accuracy
from surprise.prediction_algorithms import knns
from surprise.similarities import cosine, msd, pearson
from surprise.model_selection import cross_validate, train_test_split
from surprise.prediction_algorithms import SVD
from surprise.model_selection import GridSearchCV

In [33]:
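# load_from_df expects the columns in this order: user, item, rating
data = df[['user', 'asin', 'rating']]
# line_format and sep apply to file-based loading; for a DataFrame only rating_scale is used
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(data, reader=reader)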

In [34]:
trainset, testset= train_test_split(data, test_size=0.25, random_state=42)
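Conceptually, the split above shuffles the ratings and holds out a fraction for testing. The following pure-Python sketch illustrates that idea only; `split_ratings` is a hypothetical helper and does not reproduce Surprise's internal shuffling, so it will not select the same rows:

```python
import random

def split_ratings(ratings, test_size=0.25, seed=42):
    """Shuffle (user, item, rating) tuples and hold out a fraction as test."""
    rng = random.Random(seed)          # fixed seed makes the split repeatable
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

# 20 toy ratings: 4 users x 5 items
ratings = [(u, i, 5.0) for u in range(4) for i in range(5)]
train, test = split_ratings(ratings)
```

Because the split is random over individual ratings, a user or item can end up only in the test set, in which case the model has nothing to learn from for that ID.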

In [35]:
testset

[(188795, 2024, 1.0),
 (311026, 4453, 5.0),
 (16548, 200, 5.0),
 (256627, 3212, 5.0),
 (263708, 3354, 5.0),
 (9734, 44, 5.0),
 (410889, 5538, 4.0),
 (39708, 271, 5.0),
 (124912, 3208, 5.0),
 (262784, 3327, 5.0),
 (200905, 2194, 5.0),
 (210733, 2376, 5.0),
 (203708, 2234, 5.0),
 (255719, 3208, 1.0),
 (13935, 2254, 1.0),
 (297383, 4088, 1.0),
 (531, 5552, 4.0),
 (207075, 2300, 5.0),
 (330737, 5080, 5.0),
 (340, 3868, 5.0),
 (42474, 272, 5.0),
 (267228, 3436, 1.0),
 (346456, 5603, 1.0),
 (290318, 3898, 5.0),
 (371574, 1604, 2.0),
 (21658, 129, 5.0),
 (111074, 1039, 5.0),
 (166514, 1769, 5.0),
 (356537, 5972, 5.0),
 (42651, 272, 3.0),
 (60984, 443, 5.0),
 (45118, 299, 5.0),
 (120528, 1233, 5.0),
 (273370, 3576, 3.0),
 (415031, 5911, 5.0),
 (280341, 3728, 5.0),
 (409937, 11159, 5.0),
 (408430, 5317, 5.0),
 (345435, 5544, 1.0),
 (175310, 3494, 5.0),
 (237293, 2839, 5.0),
 (154660, 4342, 5.0),
 (118028, 1194, 4.0),
 (53847, 379, 5.0),
 (234990, 2805, 5.0),
 (164117, 1752, 5.0),
 (63655, 478, 

## SVD
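
The SVD model in Surprise estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors. A minimal sketch of that estimate with made-up numbers (the function name, the toy values, and the clipping to a 1 to 5 scale are assumptions for illustration):

```python
import numpy as np

def svd_estimate(mu, b_u, b_i, p_u, q_i, low=1.0, high=5.0):
    """Estimate r_ui = mu + b_u + b_i + q_i . p_u, clipped to the rating scale."""
    raw = mu + b_u + b_i + float(np.dot(q_i, p_u))
    return min(high, max(low, raw))

est = svd_estimate(
    mu=4.2,                      # global mean rating
    b_u=-0.3,                    # this user rates below average
    b_i=0.1,                     # this item is rated above average
    p_u=np.array([0.5, -0.2]),   # user factors
    q_i=np.array([0.4, 0.1]),    # item factors
)
```

For a user or item that never appears in the training set, the corresponding bias and factors are unknown, so the estimate falls back toward the global mean.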

In [36]:
svd = SVD()

In [37]:
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f9874b27fd0>

In [38]:
predictions= svd.test(testset)
accuracy.rmse(predictions)

RMSE: 1.2395


1.2395232610327893

In [39]:
accuracy.mae(predictions)

MAE:  0.9542


0.9542442257060224
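
RMSE and MAE summarize the same prediction errors differently: RMSE squares each error before averaging, so large misses weigh more. A hand-computed sketch on four made-up (true, estimated) pairs:

```python
import math

def rmse(pairs):
    """Root mean squared error over (true, estimated) pairs."""
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true, estimated) pairs."""
    return sum(abs(t - e) for t, e in pairs) / len(pairs)

# Illustrative values only, not taken from the model above
pairs = [(5.0, 4.0), (1.0, 3.0), (4.0, 4.0), (5.0, 5.0)]
toy_rmse = rmse(pairs)
toy_mae = mae(pairs)
```

RMSE is never smaller than MAE on the same errors, which matches the ordering of the two scores reported above.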

In [40]:
param_grid = {'n_factors': [20, 100], 'n_epochs': [5, 10],
              'lr_all': [0.005, 0.02], 'reg_all': [0.4, 0.6]}
gs_model = GridSearchCV(SVD, param_grid=param_grid, joblib_verbose=5, n_jobs=-1)
gs_model.fit(data)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    8.6s
[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done  80 out of  80 | elapsed:  3.3min finished


In [41]:
gs_model.best_params

{'rmse': {'n_factors': 100, 'n_epochs': 10, 'lr_all': 0.02, 'reg_all': 0.4},
 'mae': {'n_factors': 20, 'n_epochs': 10, 'lr_all': 0.02, 'reg_all': 0.4}}
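
The 80 tasks in the grid search log follow from the size of the grid. A short sketch that enumerates the combinations, assuming the default of 5 cross-validation folds (no `cv` argument was passed):

```python
from itertools import product

param_grid = {'n_factors': [20, 100], 'n_epochs': [5, 10],
              'lr_all': [0.005, 0.02], 'reg_all': [0.4, 0.6]}

# Every combination of one value per hyperparameter
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

n_folds = 5  # assumption: default fold count when cv is not specified
total_fits = len(combos) * n_folds
```

Note that the best values for `n_epochs`, `lr_all`, and `reg_all` all sit at an edge of the grid, which suggests that extending the grid beyond those edges could be worth trying.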

In [42]:
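# NOTE: SVD() uses the default hyperparameters, so this repeats the baseline fit.
# To apply the tuned values, use SVD(**gs_model.best_params['rmse']) instead.
svd = SVD()
svd.fit(trainset)
predictions = svd.test(testset)
print(accuracy.rmse(predictions))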

RMSE: 1.2396
1.2395808478756805


In [None]:
df['asin'].nunique()

In [None]:
df['user'].nunique()

In [None]:
df=df.sample(frac=1)

In [None]:
df['rating'].value_counts()

In [None]:
df.dtypes

In [None]:
df.isna().sum()

In [None]:
df[df.duplicated(keep=False)].head(20)

In [None]:
df[(df['user']=='AF3EVH5OFWIQN') & (df['asin']=='1300450991')]

In [None]:
df[~df.duplicated(keep=False)].head(20)

In [None]:
df.drop_duplicates(inplace=True)
df

In [None]:
df['rating'].value_counts(normalize=True).sort_index(ascending=False)

In [None]:
meta_df = pd.read_json('data/meta_Luxury_Beauty.json.gz', lines=True)
meta_df

In [None]:
meta_df.head(20)

In [None]:
meta_df = meta_df[['title', 'asin']]

In [None]:
merged_df = df.merge(meta_df, how='inner', on='asin')
merged_df
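
If `df['asin']` has already been replaced by integer codes at this point, an inner merge against the string ASINs in the metadata would find no matching keys. A sketch of restoring the original IDs first, assuming a reverse lookup like `asin_lookup` is available (the toy frames and the titles below are hypothetical):

```python
import pandas as pd

# Ratings with integer-encoded items, as produced earlier in the notebook
ratings = pd.DataFrame({'asin': [0, 0, 1], 'user': [0, 1, 2], 'rating': [5, 4, 1]})

# Reverse lookup from integer code back to the original ASIN string
asin_lookup_toy = {0: 'B00004U9V2', 1: 'B00005A77F'}

# Metadata keyed by string ASIN (titles are placeholders)
meta = pd.DataFrame({'asin': ['B00004U9V2', 'B00005A77F'],
                     'title': ['Item A', 'Item B']})

# Restore string ASINs, then merge on the shared key
restored = ratings.assign(asin=ratings['asin'].map(asin_lookup_toy))
merged = restored.merge(meta, how='inner', on='asin')
```

Comparing the row count before and after the merge is a quick way to spot products that have ratings but no metadata, or metadata entries that are duplicated.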

In [None]:
merged_df.tail(20)

In [None]:
merged_df['user'].nunique()

In [None]:
merged_df['title'].nunique()

In [None]:
merged_df.isna().sum()

In [None]:
merged_df[merged_df.duplicated(keep=False)].head(20)

## Data Preparation

Describe and justify the process for preparing the data for analysis.

***
Questions to consider:
* Were there variables you dropped or created?
* How did you address missing values or outliers?
* Why are these choices appropriate given the data and the business problem?
***

In [None]:
# Here you run your code to clean the data

## Data Modeling
Describe and justify the process for analyzing or modeling the data.

***
Questions to consider:
* How did you analyze or model the data?
* How did you iterate on your initial approach to make it better?
* Why are these choices appropriate given the data and the business problem?
***

In [None]:
# Here you run your code to model the data


## Evaluation
Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***

## KNN Basic
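
The cells below differ only in the similarity measure used to find neighboring items. The following sketch shows textbook versions of three of them, computed over the users who rated both items. The helper names and toy ratings are assumptions, and Surprise's own implementations may differ in details such as minimum support:

```python
import numpy as np

def common(a, b):
    """Keep only positions where both items have a rating."""
    mask = ~np.isnan(a) & ~np.isnan(b)
    return a[mask], b[mask]

def cosine_sim(a, b):
    x, y = common(a, b)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def msd_sim(a, b):
    # Mean squared difference turned into a similarity: 1 / (msd + 1)
    x, y = common(a, b)
    return float(1.0 / (np.mean((x - y) ** 2) + 1.0))

def pearson_sim(a, b):
    x, y = common(a, b)
    return float(np.corrcoef(x, y)[0, 1])

# Ratings of two items by four users; NaN means "not rated"
item_a = np.array([5.0, 4.0, np.nan, 1.0])
item_b = np.array([4.0, 5.0, 2.0, np.nan])

sims = {
    'cosine': cosine_sim(item_a, item_b),
    'msd': msd_sim(item_a, item_b),
    'pearson': pearson_sim(item_a, item_b),
}
```

On this toy pair the measures disagree sharply: cosine is close to 1 because both items received high ratings, while Pearson is negative because the two common users ranked the items in opposite order. With very few common raters, as is likely in sparse data like this, such estimates are unstable.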

In [None]:
KNN_model= knns.KNNBasic(sim_options={'name': 'cosine', 'user_based': False}).fit(trainset)

In [None]:
cross_validate(KNN_model, data, verbose= True, n_jobs=-1)

In [None]:
KNN_model2= knns.KNNBasic(sim_options={'name': 'msd', 'user_based': False}).fit(trainset)

In [None]:
cross_validate(KNN_model2, data, verbose= True, n_jobs=-1)

In [None]:
KNN_model3= knns.KNNBasic(sim_options={'name': 'pearson', 'user_based': False}).fit(trainset)

In [None]:
cross_validate(KNN_model3, data, verbose= True, n_jobs=-1)

In [None]:
KNN_model4= knns.KNNBasic(sim_options={'name': 'pearson_baseline', 'user_based': False}).fit(trainset)

In [None]:
cross_validate(KNN_model4, data, verbose= True, n_jobs=-1)

## KNN With Means
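
`KNNWithMeans` adjusts the basic neighbor average by each item's mean rating. For the item-based case, the estimate is the target item's mean plus a similarity-weighted average of how far the user's ratings of neighboring items deviate from those items' means. A minimal sketch with made-up numbers (the function name and inputs are hypothetical):

```python
def knn_with_means(mu_i, neighbors):
    """Item-based mean-centered KNN estimate.

    mu_i: mean rating of the target item.
    neighbors: list of (similarity, user's rating of neighbor, neighbor's mean).
    """
    num = sum(s * (r - m) for s, r, m in neighbors)
    den = sum(s for s, _, _ in neighbors)
    # With no usable neighbors, fall back to the item's own mean
    return mu_i if den == 0 else mu_i + num / den

est = knn_with_means(4.0, [(0.9, 5.0, 4.5), (0.3, 2.0, 3.0)])
```

Centering on item means helps when some items are rated systematically higher than others, which is plausible here given how strongly the ratings skew toward 5.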

In [None]:
KNN_model= knns.KNNWithMeans(sim_options={'name': 'cosine', 'user_based': False}).fit(trainset)

In [None]:
cross_validate(KNN_model, data, verbose= True, n_jobs=-1)

In [None]:
KNN_model2= knns.KNNWithMeans(sim_options={'name': 'msd', 'user_based': False}).fit(trainset)

In [None]:
cross_validate(KNN_model2, data, verbose= True, n_jobs=-1)

In [None]:
KNN_model3= knns.KNNWithMeans(sim_options={'name': 'pearson', 'user_based': False}).fit(trainset)

In [None]:
cross_validate(KNN_model3, data, verbose= True, n_jobs=-1)

In [None]:
KNN_model4= knns.KNNWithMeans(sim_options={'name': 'pearson_baseline', 'user_based': False}).fit(trainset)

In [None]:
cross_validate(KNN_model4, data, verbose= True, n_jobs=-1)