![example](images/director_shot.jpeg)

# Project Title

**Authors:** Student 1, Student 2, Student 3
***

## Overview

A one-paragraph overview of the project, including the business problem, data, methods, results and recommendations.

## Business Problem

Summary of the business problem you are trying to solve, and the data questions you plan to answer in order to solve it.

***
Questions to consider:
* What are the business's pain points related to this project?
* How did you pick the data analysis question(s) that you did?
* Why are these questions important from a business perspective?
***

## Data Understanding

Describe the data being used for this project.
***
Questions to consider:
* Where did the data come from, and how do they relate to the data analysis questions?
* What do the data represent? Who is in the sample and what variables are included?
* What is the target variable?
* What are the properties of the variables you intend to use?
***

In [1]:
# Import standard packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
df = pd.read_csv('data/Pet_Supplies.csv', names=['asin', 'user', 'rating', 'timestamp'])
df

Unnamed: 0,asin,user,rating,timestamp
0,0972585419,A13K4OZKAAHOXS,3.0,1190851200
1,0972585419,A1DWYEX4P7GB7Z,4.0,1188000000
2,0972585419,A3NVN97YJSKEPC,4.0,1171929600
3,0972585419,A1PDMES1LYA0DP,1.0,1483056000
4,0972585419,AT6BH0TQLZS5X,1.0,1482451200
...,...,...,...,...
6542478,B01HJDIJQ2,A26N76ZOU621RL,5.0,1498003200
6542479,B01HJDIJQ2,A1GRVX5Y1L702P,5.0,1494201600
6542480,B01HJDIJQ2,A2OQBT92X1CZ6D,4.0,1485820800
6542481,B01HJDIJQ2,A2O6GB287JBIGP,5.0,1483315200


In [3]:
asin_list = df['asin'].unique()

In [4]:
np.arange(len(asin_list))

array([     0,      1,      2, ..., 198399, 198400, 198401])

In [5]:
asin_lookup = dict(zip(np.arange(len(asin_list)), asin_list))

In [6]:
asin_map = dict(zip(asin_list, np.arange(len(asin_list))))

In [7]:
asin_map

{'0972585419': 0,
 '0978619404': 1,
 '1223000893': 2,
 '1300450991': 3,
 '130045136X': 4,
 '1300451467': 5,
 '1417084871': 6,
 '1440572828': 7,
 '1300451238': 8,
 '1563834340': 9,
 '1590150007': 10,
 '1612231977': 11,
 '1882330919': 12,
 '1890948217': 13,
 '3293277470': 14,
 '4121689569': 15,
 '4208413697': 16,
 '4208413379': 17,
 '420841328X': 18,
 '4847611102': 19,
 '6025002517': 20,
 '6041027472': 21,
 '6041026514': 22,
 '6041026492': 23,
 '6041027634': 24,
 '6041027626': 25,
 '6041026425': 26,
 '6041026433': 27,
 '6041026565': 28,
 '604102645X': 29,
 '6041026573': 30,
 '605400123X': 31,
 '6162622851': 32,
 '7310172001': 33,
 '8029311600': 34,
 '9579882983': 35,
 '9575877594': 36,
 '979243724X': 37,
 '9822497938': 38,
 '9828377306': 39,
 '9828377403': 40,
 '9828377357': 41,
 '9822497466': 42,
 '9822497490': 43,
 '9980452196': 44,
 'B00000IRNW': 45,
 'B00001P503': 46,
 'B00002N7PJ': 47,
 'B00004T2WR': 48,
 'B00004TVUM': 49,
 'B00004YYEL': 50,
 'B00004YYEY': 51,
 'B00004ZB4I': 52,
 'B

In [8]:
df['asin'] = df['asin'].map(asin_map)
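
The `unique` / `dict(zip(...))` / `map` steps above encode each string ID as an integer in order of first appearance. A minimal sketch of the same idea using `pd.factorize`, which returns the integer codes and the reverse lookup in one call, is shown below. The toy frame and its IDs are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the ratings table; these IDs are hypothetical.
toy = pd.DataFrame({
    'asin': ['B001', 'B002', 'B001', 'B003'],
    'user': ['U9', 'U9', 'U7', 'U8'],
})

# factorize assigns integer codes in order of first appearance and
# returns the unique values, which act as the reverse lookup.
asin_codes, asin_uniques = pd.factorize(toy['asin'])
toy['asin_id'] = asin_codes

# Undo the encoding by indexing the uniques with the codes.
recovered = asin_uniques[toy['asin_id'].to_numpy()]

print(toy)
print(list(recovered))
```

Because both approaches number IDs by first appearance, the codes should line up with those in `asin_map`; the dictionary version remains handy when a plain Python lookup is needed later.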

In [9]:
df

Unnamed: 0,asin,user,rating,timestamp
0,0,A13K4OZKAAHOXS,3.0,1190851200
1,0,A1DWYEX4P7GB7Z,4.0,1188000000
2,0,A3NVN97YJSKEPC,4.0,1171929600
3,0,A1PDMES1LYA0DP,1.0,1483056000
4,0,AT6BH0TQLZS5X,1.0,1482451200
...,...,...,...,...
6542478,49503,A26N76ZOU621RL,5.0,1498003200
6542479,49503,A1GRVX5Y1L702P,5.0,1494201600
6542480,49503,A2OQBT92X1CZ6D,4.0,1485820800
6542481,49503,A2O6GB287JBIGP,5.0,1483315200


In [10]:
# df['asin'].map(asin_lookup)

In [11]:
user_list = df['user'].unique()

In [12]:
np.arange(len(user_list))

array([      0,       1,       2, ..., 3085588, 3085589, 3085590])

In [13]:
user_lookup = dict(zip(np.arange(len(user_list)), user_list))

In [14]:
user_map = dict(zip(user_list, np.arange(len(user_list))))

In [15]:
user_map

{'A13K4OZKAAHOXS': 0,
 'A1DWYEX4P7GB7Z': 1,
 'A3NVN97YJSKEPC': 2,
 'A1PDMES1LYA0DP': 3,
 'AT6BH0TQLZS5X': 4,
 'A2SQLP4B8T8V0V': 5,
 'A2KN4FJVI2TZSF': 6,
 'A3RMA1DD66JDRV': 7,
 'AAANYRIEOIT3R': 8,
 'A3W44VX0LXAOHU': 9,
 'A1NVF62DA1ABQ6': 10,
 'A16ZDBZGKYDRSU': 11,
 'A20M3TKXKB1M1T': 12,
 'A3UCTUHXHE36IP': 13,
 'AKA4DSUU1ZYKQ': 14,
 'A1OD7257M9EOG4': 15,
 'A32B3TG1HZ08HM': 16,
 'A2Z3Q1UIVRSIL5': 17,
 'A3EZ18NUU4AJ0B': 18,
 'A2O305WS6Z96X': 19,
 'A2C26KQVC1SMHZ': 20,
 'A1FC1W6PY7DOKN': 21,
 'A1GLI6UI8A6886': 22,
 'AIDP1FBL2ZHAB': 23,
 'A1JQ3UJNVZDS68': 24,
 'A2SK9QS3N1PLWN': 25,
 'A3FHNC8N9KXOIF': 26,
 'AWWQJMNA8MYX1': 27,
 'A17VKRN7PMK2Q1': 28,
 'A1SR90QMEF7G5D': 29,
 'A39RBE0IAAUCYN': 30,
 'ALF0MRD3LRTPN': 31,
 'A3DQ9Q7AE4JVO8': 32,
 'A113OJ6JP6LG24': 33,
 'A2S46S9V4B8AGG': 34,
 'A2AH7BGGAZ5ZAZ': 35,
 'A3UFN1BMVVGACY': 36,
 'A1R0AH3FGQ2UX2': 37,
 'A3HD145P7JBF4E': 38,
 'A3FJ677K8F3JUP': 39,
 'A1Q8K65RU18AIX': 40,
 'AN4BBV6PR8L1Q': 41,
 'A3TQUJQFJ6PN4R': 42,
 'A1V6GOX1LSVVDH': 43,
 'AD4Z

In [16]:
df['user'] = df['user'].map(user_map)

In [17]:
df

Unnamed: 0,asin,user,rating,timestamp
0,0,0,3.0,1190851200
1,0,1,4.0,1188000000
2,0,2,4.0,1171929600
3,0,3,1.0,1483056000
4,0,4,1.0,1482451200
...,...,...,...,...
6542478,49503,2401392,5.0,1498003200
6542479,49503,1406069,5.0,1494201600
6542480,49503,3085589,4.0,1485820800
6542481,49503,3085590,5.0,1483315200


In [18]:
df.dtypes

asin           int64
user           int64
rating       float64
timestamp      int64
dtype: object

In [19]:
df['asin'].nunique()

198402

In [20]:
df['user'].nunique()

3085591

In [21]:
df['rating']=df['rating'].astype(np.int8)

In [22]:
df['asin']=df['asin'].astype(np.int32)

In [23]:
df['user']=df['user'].astype(np.int32)

In [24]:
df.dtypes

asin         int32
user         int32
rating        int8
timestamp    int64
dtype: object
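
The downcasts above are only safe if every value fits in the smaller type. The sketch below checks that with `np.iinfo`, using the unique counts reported earlier in this notebook; the 1 to 5 rating range is an assumption based on the values visible in the preview.

```python
import numpy as np

n_items = 198402    # df['asin'].nunique() reported above
n_users = 3085591   # df['user'].nunique() reported above
max_rating = 5      # assumption: ratings run from 1 to 5

# IDs start at 0, so the largest ID is count - 1.
items_fit_int32 = n_items - 1 <= np.iinfo(np.int32).max
users_fit_int32 = n_users - 1 <= np.iinfo(np.int32).max
items_fit_int16 = n_items - 1 <= np.iinfo(np.int16).max
ratings_fit_int8 = max_rating <= np.iinfo(np.int8).max

# Bytes per row for the asin, user and rating columns.
bytes_before = 8 + 8 + 8   # int64, int64, float64
bytes_after = 4 + 4 + 1    # int32, int32, int8

print(items_fit_int32, users_fit_int32, items_fit_int16, ratings_fit_int8)
print(bytes_before, bytes_after)
```

The item IDs would not fit in `int16`, which is why `int32` is the smallest practical choice for both ID columns.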

In [25]:
df.drop('timestamp', axis=1, inplace=True)
df

Unnamed: 0,asin,user,rating
0,0,0,3
1,0,1,4
2,0,2,4
3,0,3,1
4,0,4,1
...,...,...,...
6542478,49503,2401392,5
6542479,49503,1406069,5
6542480,49503,3085589,4
6542481,49503,3085590,5


In [26]:
df[df.duplicated(keep=False)].head(20)

Unnamed: 0,asin,user,rating
99,3,99,5
670,11,665,5
689,11,684,5
721,11,665,5
731,11,684,5
1196,11,1188,5
1197,11,1188,5
1903,26,1888,4
1904,26,1888,4
2015,30,1993,5


In [27]:
df.drop_duplicates(inplace=True)
df

Unnamed: 0,asin,user,rating
0,0,0,3
1,0,1,4
2,0,2,4
3,0,3,1
4,0,4,1
...,...,...,...
6542478,49503,2401392,5
6542479,49503,1406069,5
6542480,49503,3085589,4
6542481,49503,3085590,5
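
`drop_duplicates()` only removes rows that match on every column, so a user who rated the same item twice with different ratings would still appear twice. The sketch below shows one way to surface such pairs on a toy frame. The data is hypothetical, and keeping the last rating is just one possible resolution, not necessarily the one used in this project.

```python
import pandas as pd

# Hypothetical ratings: user 665 rated item 11 three times,
# twice with a 5 and once with a 2.
toy = pd.DataFrame({
    'asin':   [0, 0, 11, 11, 11],
    'user':   [0, 1, 665, 665, 665],
    'rating': [3, 4, 5, 5, 2],
})

# Exact duplicates are removed, as in the cell above.
deduped = toy.drop_duplicates()

# Rows that still share a (user, asin) pair carry conflicting ratings.
conflicts = deduped[deduped.duplicated(subset=['user', 'asin'], keep=False)]

# One possible resolution: keep the last rating for each pair.
resolved = deduped.drop_duplicates(subset=['user', 'asin'], keep='last')

print(conflicts)
print(resolved)
```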


In [28]:
df.dtypes

asin      int32
user      int32
rating     int8
dtype: object

In [29]:
df['asin'].nunique()

198402

In [30]:
df['user'].nunique()

3085591

In [34]:
df.to_csv(r'data/Pet_Supplies_reduced.csv', index=False)

In [31]:
from surprise import Dataset, Reader
from surprise import accuracy
from surprise.prediction_algorithms import knns
from surprise.similarities import cosine, msd, pearson
from surprise.model_selection import cross_validate, train_test_split
from surprise.prediction_algorithms import SVD
from surprise.model_selection import GridSearchCV

In [None]:
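# Surprise expects the columns in the order user, item, rating
ratings = df[['user', 'asin', 'rating']]
# Only rating_scale is used when loading from a DataFrame; a 1-5 scale is assumed here
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings, reader=reader)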

In [None]:
trainset, testset= train_test_split(data, test_size=0.25, random_state=42)

In [None]:
testset

## SVD

In [None]:
svd = SVD()

In [None]:
svd.fit(trainset)

In [None]:
predictions= svd.test(testset)
accuracy.rmse(predictions)

In [None]:
accuracy.mae(predictions)
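
For reference, the two metrics reported above can be computed by hand. The sketch below does so with NumPy on a few made-up true and predicted ratings; it illustrates the formulas only and does not reproduce Surprise's internals or this notebook's results.

```python
import numpy as np

# Hypothetical true and predicted ratings for four test pairs.
true_r = np.array([5, 3, 4, 1], dtype=float)
pred_r = np.array([4, 3, 5, 3], dtype=float)

errors = pred_r - true_r

# RMSE: square the errors, average, then take the square root.
rmse = np.sqrt(np.mean(errors ** 2))

# MAE: average of the absolute errors.
mae = np.mean(np.abs(errors))

print(f"RMSE: {rmse:.4f}")
print(f"MAE:  {mae:.4f}")
```

RMSE penalizes large misses more heavily than MAE, so a gap between the two suggests that a few predictions are far off.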

In [None]:
param_grid = {'n_factors': [20, 100], 'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
gs_model = GridSearchCV(SVD, param_grid=param_grid, n_jobs=-1, joblib_verbose=5)
# Fit on the Surprise Dataset built above; `jokes` was never defined in this notebook
gs_model.fit(data)
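
Before launching a grid search on a dataset of this size, it helps to count how many model fits it implies. The sketch below enumerates the grid with `itertools.product`. The fold count of 5 is an assumption about the default cross-validation setting and should be checked against the installed Surprise version.

```python
from itertools import product

param_grid = {
    'n_factors': [20, 100],
    'n_epochs': [5, 10],
    'lr_all': [0.002, 0.005],
    'reg_all': [0.4, 0.6],
}

# Every combination of one value per parameter.
names = list(param_grid)
combos = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 5  # assumption: default number of cross-validation folds
total_fits = len(combos) * n_folds

print(len(combos), "parameter combinations")
print(total_fits, "model fits in total")
```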

In [None]:
df['asin'].nunique()

In [None]:
df['user'].nunique()

In [None]:
df=df.sample(frac=1)

In [None]:
df['rating'].value_counts()

In [None]:
df.dtypes

In [None]:
df.isna().sum()

In [None]:
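df[df.duplicated(keep=False)].head(20)

In [None]:
# 'user' and 'asin' hold integer IDs after the mapping above, so translate the raw IDs first
df[(df['user'] == user_map.get('AF3EVH5OFWIQN')) & (df['asin'] == asin_map.get('1300450991'))]

In [None]:
df[~df.duplicated(keep=False)].head(20)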

In [None]:
df.drop_duplicates(inplace=True)
df

In [None]:
df['rating'].value_counts(normalize=True).sort_index(ascending=False)

In [None]:
meta_df = pd.read_json('data/meta_Video_Games.json.gz', lines=True)
meta_df

In [None]:
meta_df = meta_df[['title', 'asin']]

In [None]:
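# df['asin'] holds integer IDs at this point, so map it back to the ASIN strings before joining
merged_df = df.assign(asin=df['asin'].map(asin_lookup)).merge(meta_df, how='inner', on='asin')
merged_df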

In [None]:
merged_df.tail(20)

In [None]:
merged_df['user'].nunique()

In [None]:
merged_df['title'].nunique()

In [None]:
merged_df.isna().sum()

In [None]:
merged_df[merged_df.duplicated(keep=False)].head(20)

## Data Preparation

Describe and justify the process for preparing the data for analysis.

***
Questions to consider:
* Were there variables you dropped or created?
* How did you address missing values or outliers?
* Why are these choices appropriate given the data and the business problem?
***

In [None]:
# Here you run your code to clean the data

## Data Modeling
Describe and justify the process for analyzing or modeling the data.

***
Questions to consider:
* How did you analyze or model the data?
* How did you iterate on your initial approach to make it better?
* Why are these choices appropriate given the data and the business problem?
***

In [None]:
# Here you run your code to model the data


## Evaluation
Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***

## KNN Basic

In [None]:
KNN_model= knns.KNNBasic(sim_options={'name': 'cosine', 'user_based': False}).fit(trainset)

In [None]:
cross_validate(KNN_model, data, verbose= True, n_jobs=-1)
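
With `user_based: False`, the similarity is computed between items rather than between users. As a rough illustration of what a cosine similarity between two items looks like, the sketch below compares two item columns over the users who rated both. The matrix is made up, and the helper `item_cosine` is a hypothetical name rather than part of Surprise.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items, 0 means "not rated".
R = np.array([
    [5, 4, 0],
    [3, 0, 2],
    [4, 5, 1],
], dtype=float)


def item_cosine(R, i, j):
    """Cosine similarity between items i and j over users who rated both."""
    both = (R[:, i] > 0) & (R[:, j] > 0)
    if not both.any():
        return 0.0
    a, b = R[both, i], R[both, j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


print(item_cosine(R, 0, 1))
print(item_cosine(R, 0, 2))
```

Note that an item-by-item similarity matrix over 198,402 items has roughly 3.9e10 entries. If it is stored densely as 64-bit floats, that is on the order of 315 GB, so filtering to frequently rated items before fitting a KNN model is likely necessary.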

In [None]:
KNN_model2= knns.KNNBasic(sim_options={'name': 'msd', 'user_based': False}).fit(trainset)

In [None]:
cross_validate(KNN_model2, data, verbose= True, n_jobs=-1)

In [None]:
KNN_model3= knns.KNNBasic(sim_options={'name': 'pearson', 'user_based': False}).fit(trainset)

In [None]:
cross_validate(KNN_model3, data, verbose= True, n_jobs=-1)

In [None]:
KNN_model4= knns.KNNBasic(sim_options={'name': 'pearson_baseline', 'user_based': False}).fit(trainset)

In [None]:
cross_validate(KNN_model4, data, verbose= True, n_jobs=-1)

## KNN With Means

In [None]:
KNN_model= knns.KNNWithMeans(sim_options={'name': 'cosine', 'user_based': False}).fit(trainset)

In [None]:
cross_validate(KNN_model, data, verbose= True, n_jobs=-1)

In [None]:
KNN_model2= knns.KNNWithMeans(sim_options={'name': 'msd', 'user_based': False}).fit(trainset)

In [None]:
cross_validate(KNN_model2, data, verbose= True, n_jobs=-1)

In [None]:
KNN_model3= knns.KNNWithMeans(sim_options={'name': 'pearson', 'user_based': False}).fit(trainset)

In [None]:
cross_validate(KNN_model3, data, verbose= True, n_jobs=-1)

In [None]:
KNN_model4= knns.KNNWithMeans(sim_options={'name': 'pearson_baseline', 'user_based': False}).fit(trainset)

In [None]:
cross_validate(KNN_model4, data, verbose= True, n_jobs=-1)