# Product Recommender System
- **Module 1**: Simple Recommender System (Chai Wei Qi)
- **Module 2**: Content-Based Filtering Recommender System (Oh Boon Suen)
- **Module 3**: Collaborative Filtering Recommender System (Tan Cherng Ming)

This project uses Amazon electronics product datasets.<br>
Source: https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/ 

## Importing Libraries

In [1]:
# Import the libraries used in the project
import pandas as pd
import numpy as np
import html
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import linear_kernel
import seaborn as sns
import re
import warnings
warnings.filterwarnings('ignore')

## Importing Dataset

1. electronic_products.json 
2. user_ratings.csv

In [2]:
# Read the electronic products file
products_dataset_path = 'dataset/electronic_products.json'
global_products = pd.read_json(products_dataset_path, lines=True)

# Read the ratings file
ratings_dataset_path = 'dataset/user_ratings.csv'
global_ratings = pd.read_csv(ratings_dataset_path, names=['user_id', 'product_id','rating','timestamp'], index_col=False)

# Simple Recommender System
Done by Chai Wei Qi

## 2. File Reading and Features Engineering: products

In [3]:
# Read the electronic products file
products = global_products

# Output the first 10 rows
products.head(10)

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes,details
0,"[Electronics, Camera &amp; Photo, Video Survei...",,[The following camera brands and models have b...,,Genuine Geovision 1 Channel 3rd Party NVR IP S...,[],,GeoVision,"[Genuine Geovision 1 Channel NVR IP Software, ...","[>#3,092 in Tools &amp; Home Improvement &gt; ...",[],Camera &amp; Photo,,"January 28, 2014",$65.00,11300000,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
1,"[Electronics, Camera &amp; Photo]",,[This second edition of the Handbook of Astron...,,"Books ""Handbook of Astronomical Image Processi...",[0999470906],,33 Books Co.,[Detailed chapters cover these fundamental top...,"[>#55,933 in Camera &amp; Photo (See Top 100 i...","[0943396670, 1138055360, 0999470906]",Camera &amp; Photo,,"June 17, 2003",,43396828,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
2,"[Electronics, eBook Readers &amp; Accessories,...",,[A zesty tale. (Publishers Weekly)<br /><br />...,,One Hot Summer,"[0425167798, 039914157X]",,Visit Amazon's Carolina Garcia Aguilera Page,[],"3,105,177 in Books (",[],Books,,,$11.49,60009810,[],[],
3,"[Electronics, eBook Readers & Accessories, eBo...",,[],,Hurray for Hattie Rabbit: Story and pictures (...,"[0060219521, 0060219580, 0060219394]",,Visit Amazon's Dick Gackenbach Page,[],"2,024,298 in Books (","[0060219521, 0060219475, 0060219394]",Books,,,.a-section.a-spacing-mini{margin-bottom:6px!im...,60219602,[],[],
4,"[Electronics, eBook Readers & Accessories, eBo...",,[&#8220;sex.lies.murder.fame. is brillllli&#82...,,sex.lies.murder.fame.: A Novel,[],,Visit Amazon's Lolita Files Page,[],"3,778,828 in Books (",[],Books,,,$13.95,60786817,[],[],
5,"[Electronics, eBook Readers &amp; Accessories,...",,"[, ]",,College Physics,"[0073049557, 0134454170, 1118142063, 007733968...",,Visit Amazon's Alan Giambattista Page,[],"3,330,771 in Books (","[0073512141, 0077339681, 0073049557, 007304956...",Books,,,,70524076,[],[],
6,"[Electronics, eBook Readers & Accessories, eBo...",,[GIRL WITH A ONE-TRACK MIND: CONFESSIONS OF TH...,,Girl with a One-track Mind: Confessions of the...,[0330509691],,ABBY LEE,[],"3,304,037 in Books (",[B0719LDQR1],Books,,,$4.76,91912407,[],[],
7,"[Electronics, Portable Audio & Video, MP3 & MP...",,[Support system: Windows XP/Vsita/7 * SNR: 85d...,,abcGoodefg&reg; 4GB USB 2.0 Mp3 Music Player w...,"[B01NAJ3KQB, B00WYSPT0C, B00AF40U5G, B00OFVNM4...",,Crazy Cart,[Package Content: 1 x Display MP3 Player 1 x E...,"[>#177,454 in Electronics (See Top 100 in Elec...","[B01NAJ3KQB, B00OFVNM4G, B00L41WY8K, B07F34PNP...",All Electronics,"class=""a-bordered a-horizontal-stripes a-spa...","December 28, 2012",,101635370,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
8,"[Electronics, Headphones, Earbud Headphones]",,"[, <b>True High Definition Sound:</b><br>With ...",,Wireless Bluetooth Headphones Earbuds with Mic...,[],,Enter The Arena,[Superb Sound Quality: Plays crystal clear aud...,[>#950 in Cell Phones & Accessories (See Top 1...,[],Home Audio & Theater,,"October 23, 2017",$7.99,132492776,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
9,"[Electronics, Computers &amp; Accessories, Com...",,[],,Kelby Training DVD: Mastering Blend Modes in A...,[],,Kelby Training,[],"[>#932,732 in Computers &amp; Accessories &gt;...",[],Computers,,"December 9, 2011",,132793040,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,


In [4]:
print(products.shape)
# output: (rows, columns)

(104802, 19)


In [5]:
# retrieving column name
products.columns

Index(['category', 'tech1', 'description', 'fit', 'title', 'also_buy', 'tech2',
       'brand', 'feature', 'rank', 'also_view', 'main_cat', 'similar_item',
       'date', 'price', 'asin', 'imageURL', 'imageURLHighRes', 'details'],
      dtype='object')

### 2.1 General Cleansing

#### 2.1.1 Drop Unnecessary columns

In [6]:
# keep: asin, title, brand, main_cat, and price (some columns are for filtering usage)
products = products.drop(columns=['category', 'tech1', 'description', 'fit', 'also_buy', 'tech2',
       'feature', 'rank', 'also_view', 'similar_item', 'date', 'imageURL', 'imageURLHighRes', 'details'], errors='ignore')

products.columns

Index(['title', 'brand', 'main_cat', 'price', 'asin'], dtype='object')

In [7]:
products = products[['asin', 'title', 'brand', 'price', 'main_cat']]
products.columns = ['product_id', 'product_name', 'brand_or_author', 'price', 'main_category']

In [8]:
products.head(10)

Unnamed: 0,product_id,product_name,brand_or_author,price,main_category
0,11300000,Genuine Geovision 1 Channel 3rd Party NVR IP S...,GeoVision,$65.00,Camera &amp; Photo
1,43396828,"Books ""Handbook of Astronomical Image Processi...",33 Books Co.,,Camera &amp; Photo
2,60009810,One Hot Summer,Visit Amazon's Carolina Garcia Aguilera Page,$11.49,Books
3,60219602,Hurray for Hattie Rabbit: Story and pictures (...,Visit Amazon's Dick Gackenbach Page,.a-section.a-spacing-mini{margin-bottom:6px!im...,Books
4,60786817,sex.lies.murder.fame.: A Novel,Visit Amazon's Lolita Files Page,$13.95,Books
5,70524076,College Physics,Visit Amazon's Alan Giambattista Page,,Books
6,91912407,Girl with a One-track Mind: Confessions of the...,ABBY LEE,$4.76,Books
7,101635370,abcGoodefg&reg; 4GB USB 2.0 Mp3 Music Player w...,Crazy Cart,,All Electronics
8,132492776,Wireless Bluetooth Headphones Earbuds with Mic...,Enter The Arena,$7.99,Home Audio & Theater
9,132793040,Kelby Training DVD: Mastering Blend Modes in A...,Kelby Training,,Computers


#### 2.1.2 Remove Duplicates

In [9]:
products.shape

(104802, 5)

In [10]:
products.duplicated().sum()

30368

In [11]:
products = products.drop_duplicates()

products.shape

(74434, 5)

#### 2.1.3 Format Strings

In [12]:
# Defining text cleaning function

def text_cleaning(text):
    # 1. convert any HTML entities in the text to their corresponding characters
    # e.g. &amp; to &, &quot; to ", &reg; to ®
    text = html.unescape(text)
    
    # 2. convert to lower case
    return text.lower()

text = "&amp; &quot; &reg;"
cleaned_text = text_cleaning(text)
print(cleaned_text)

& " ®
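
`text_cleaning` assumes every value is a string. If a column held missing values (`NaN` or `None`), `html.unescape` would raise a `TypeError`. The sketch below is a hypothetical variant, `safe_text_cleaning` (a name that is not part of the original notebook), which passes non-string values through unchanged.

```python
import html

def safe_text_cleaning(value):
    """Unescape HTML entities and lower-case strings; leave other values untouched."""
    if not isinstance(value, str):
        return value  # e.g. NaN or None is returned as-is
    return html.unescape(value).lower()

print(safe_text_cleaning("Camera &amp; Photo"))
print(safe_text_cleaning(None))
```

Whether this guard is needed depends on the data; in this notebook the cleaned columns appear to contain strings only.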


In [13]:
# List of columns to be cleaned
cols_to_clean = ['product_name', 'brand_or_author', 'price', 'main_category']

# Apply the text cleaning function to each column
for col in cols_to_clean:
    products[col] = products[col].apply(text_cleaning)

In [14]:
products

Unnamed: 0,product_id,product_name,brand_or_author,price,main_category
0,0011300000,genuine geovision 1 channel 3rd party nvr ip s...,geovision,$65.00,camera & photo
1,0043396828,"books ""handbook of astronomical image processi...",33 books co.,,camera & photo
2,0060009810,one hot summer,visit amazon's carolina garcia aguilera page,$11.49,books
3,0060219602,hurray for hattie rabbit: story and pictures (...,visit amazon's dick gackenbach page,.a-section.a-spacing-mini{margin-bottom:6px!im...,books
4,0060786817,sex.lies.murder.fame.: a novel,visit amazon's lolita files page,$13.95,books
...,...,...,...,...,...
104797,B000Q6NSAM,precision design ew-78bii lens hood for canon ...,precision design,,camera & photo
104798,B000Q6KSTG,netgear rangemax next 802.11n (draft) wireless...,netgear,,all electronics
104799,B000Q6MYB6,"polaroid 7"" digital photo frame",polaroid,$1.54,camera & photo
104800,B000Q6REMK,monoprice 102081 hdmi female to dvi-d single l...,monoprice,$4.18,computers


### 2.2 Cleaning 'main_category' column

In [15]:
main_category_df = products.groupby('main_category').size().reset_index(name='count')
main_category_df

Unnamed: 0,main_category,count
0,"<img src=""https://images-na.ssl-images-amazon....",38
1,"<img src=""https://images-na.ssl-images-amazon....",129
2,"<img src=""https://m.media-amazon.com/images/g/...",1
3,"<img src=""https://m.media-amazon.com/images/g/...",1
4,all beauty,30
5,all electronics,23269
6,amazon devices,37
7,amazon home,428
8,appliances,2
9,"arts, crafts & sewing",110


In [16]:
# The four image HTML elements actually belong to 'amazon fashion'
for i in (main_category_df.loc[0:3, 'main_category']):
    print(i)

# There are 169 products in total for amazon fashion
total_af = {'main_category': 'Total', 'count': len(products.loc[products['main_category'].str.contains('amazon fashion')])}
# DataFrame.append was removed in pandas 2.0, so use pd.concat to add the total row
pd.concat([main_category_df.loc[main_category_df['main_category'].str.contains('amazon fashion')],
           pd.DataFrame([total_af])], ignore_index=True)

<img src="https://images-na.ssl-images-amazon.com/images/g/01/nav2/images/gui/amazon-fashion-store-new._cb520838675_.png" class="nav-categ-image" alt="amazon fashion" />
<img src="https://images-na.ssl-images-amazon.com/images/g/01/nav2/images/gui/amazon-fashion-store-new._cb520838675_.png" class="nav-categ-image" alt="amazon fashion"/>
<img src="https://m.media-amazon.com/images/g/01/nav2/images/gui/amazon-fashion-store-new._cb520838675_.png" class="nav-categ-image" alt="amazon fashion" />
<img src="https://m.media-amazon.com/images/g/01/nav2/images/gui/amazon-fashion-store-new._cb520838675_.png" class="nav-categ-image" alt="amazon fashion"/>


Unnamed: 0,main_category,count
0,"<img src=""https://images-na.ssl-images-amazon....",38
1,"<img src=""https://images-na.ssl-images-amazon....",129
2,"<img src=""https://m.media-amazon.com/images/g/...",1
3,"<img src=""https://m.media-amazon.com/images/g/...",1
4,Total,169


In [17]:
# assign the four image HTML elements to 'amazon fashion' in main_category_df data frame
main_category_df.loc[main_category_df['main_category'].str.contains('amazon fashion'), 'main_category'] = 'amazon fashion'

main_category_df.groupby('main_category').sum()

main_category,count
all beauty,30
all electronics,23269
amazon devices,37
amazon fashion,169
amazon home,428
appliances,2
"arts, crafts & sewing",110
automotive,417
baby,19
books,335


In [18]:
# assign the four image HTML elements to 'amazon fashion' in products data frame
products.loc[products['main_category'].str.contains('amazon fashion'), 'main_category'] = 'amazon fashion'

products.groupby('main_category').size().reset_index(name='count')

Unnamed: 0,main_category,count
0,all beauty,30
1,all electronics,23269
2,amazon devices,37
3,amazon fashion,169
4,amazon home,428
5,appliances,2
6,"arts, crafts & sewing",110
7,automotive,417
8,baby,19
9,books,335
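
The `str.contains('amazon fashion')` match works because the category name appears in the `alt` attribute of each `<img>` element. A more general sketch, assuming other categories could be wrapped in the same kind of tag, pulls the `alt` text out with a regular expression. The helper name `category_from_img` and the sample URL are hypothetical.

```python
import re

# Capture the value of the alt attribute inside an <img ...> tag
ALT_PATTERN = re.compile(r'<img\b[^>]*\balt="([^"]*)"', re.IGNORECASE)

def category_from_img(value):
    """Return the alt text if the value is an <img> tag, else the value itself."""
    match = ALT_PATTERN.search(value)
    return match.group(1) if match else value

sample = '<img src="https://example.com/x.png" class="nav-categ-image" alt="amazon fashion" />'
print(category_from_img(sample))
print(category_from_img("all electronics"))
```

It could be applied with `products['main_category'].apply(category_from_img)`, leaving plain category names unchanged.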


### 2.3 Cleaning 'brand_or_author' column

In [19]:
# Values matching the "visit amazon's ... page" pattern are actually author names.
pattern = "^visit amazon's (.*) page$"
mask = products['brand_or_author'].str.contains(pattern)
selected_columns = ['brand_or_author', 'main_category']
visit_amazon_pattern = products.loc[mask, selected_columns]

visit_amazon_pattern

Unnamed: 0,brand_or_author,main_category
2,visit amazon's carolina garcia aguilera page,books
3,visit amazon's dick gackenbach page,books
4,visit amazon's lolita files page,books
5,visit amazon's alan giambattista page,books
10,visit amazon's claire messud page,books
...,...,...
1403,visit amazon's dan wells page,books
1405,visit amazon's ismael cala page,books
1439,visit amazon's maría nuñez quesada page,books
68972,visit amazon's karin slaughter page,books


In [20]:
visit_amazon_pattern.groupby('main_category').size().reset_index(name='count')

Unnamed: 0,main_category,count
0,books,181


In [21]:
# Clean the brand or author column
# Strip the "visit amazon's ... page" wrapper only when the whole string matches it,
# so brands that merely contain " page" are left untouched
def clean_author_string(author_string):
    return re.sub(r"^visit amazon's (.*) page$", r"\1", author_string)

products['brand_or_author'] = products['brand_or_author'].apply(clean_author_string)

In [22]:
# result
products.head(10)

Unnamed: 0,product_id,product_name,brand_or_author,price,main_category
0,11300000,genuine geovision 1 channel 3rd party nvr ip s...,geovision,$65.00,camera & photo
1,43396828,"books ""handbook of astronomical image processi...",33 books co.,,camera & photo
2,60009810,one hot summer,carolina garcia aguilera,$11.49,books
3,60219602,hurray for hattie rabbit: story and pictures (...,dick gackenbach,.a-section.a-spacing-mini{margin-bottom:6px!im...,books
4,60786817,sex.lies.murder.fame.: a novel,lolita files,$13.95,books
5,70524076,college physics,alan giambattista,,books
6,91912407,girl with a one-track mind: confessions of the...,abby lee,$4.76,books
7,101635370,abcgoodefg® 4gb usb 2.0 mp3 music player with ...,crazy cart,,all electronics
8,132492776,wireless bluetooth headphones earbuds with mic...,enter the arena,$7.99,home audio & theater
9,132793040,kelby training dvd: mastering blend modes in a...,kelby training,,computers


### 2.4 Cleaning 'price' column

In [23]:
# Use a raw string so that \$ is passed to the regex engine unchanged
dirty_price_df = products[~products['price'].str.contains(r'^\$')]

dirty_price_type_df = dirty_price_df.groupby('price').size().reset_index(name='count')
dirty_price_type_df

Unnamed: 0,price,count
0,,50162
1,"\n\t\t\t\t\t\t\t\t\t\t\t\t<span class=""vertica...",10
2,\n\t\t ...,7
3,\n\t\t ...,20
4,\n\n\n<script,48
5,.a-box-inner{background-color:#fff}#alohabuybo...,2090
6,.a-section.a-spacing-mini{margin-bottom:6px!im...,9


In [24]:
for i in (dirty_price_type_df.loc[1:6, 'price']):
    print(i)


<span class="verticalalign a-size-large"

<div class="a-section a-spacing-none"

<div class="a-section a-spacing-none"

<script
.a-box-inner{background-color:#fff}#alohabuyboxwidget .selected{background-color:#fffbf3;border-color:#e77600;box-shadow:0 0 3px rgba(228,121,17,.5)}#alohabuyboxwidget .contract-not-available{color:gray}#aloha-cart-popover .aloha-cart{height:auto;overflow:hidden}#aloha-cart-popover #aloha-cartinfo{float:left}#aloha-cart-popover #aloha-cart-details{float:right;margin-top:1em}#aloha-cart-popover .devicecontainer{width:160px;float:left;padding-right:10px;border-right:1px solid #ddd}#aloha-cart-popover li:last-child{border-right:0}#aloha-cart-popover .aloha-device-title{height:3em;overflow:hidden}#aloha-

In [25]:
# None of the dirty price values contain price information (they are all CSS or HTML code),
# so we can set them all to 0
def clean_price_string(price_string):
    if price_string.startswith('$'):
        price_string = price_string.replace('$', '')
    else:
        price_string = '0'
    try:
        price_float = float(price_string)
    except ValueError:
        price_float = 0.0
    return price_float

products['price'] = products['price'].apply(clean_price_string)

In [26]:
# result
products.head(10)

Unnamed: 0,product_id,product_name,brand_or_author,price,main_category
0,11300000,genuine geovision 1 channel 3rd party nvr ip s...,geovision,65.0,camera & photo
1,43396828,"books ""handbook of astronomical image processi...",33 books co.,0.0,camera & photo
2,60009810,one hot summer,carolina garcia aguilera,11.49,books
3,60219602,hurray for hattie rabbit: story and pictures (...,dick gackenbach,0.0,books
4,60786817,sex.lies.murder.fame.: a novel,lolita files,13.95,books
5,70524076,college physics,alan giambattista,0.0,books
6,91912407,girl with a one-track mind: confessions of the...,abby lee,4.76,books
7,101635370,abcgoodefg® 4gb usb 2.0 mp3 music player with ...,crazy cart,0.0,all electronics
8,132492776,wireless bluetooth headphones earbuds with mic...,enter the arena,7.99,home audio & theater
9,132793040,kelby training dvd: mastering blend modes in a...,kelby training,0.0,computers
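
Setting unknown prices to `0` is simple, but zeros are indistinguishable from genuinely free items and pull down statistics such as the mean price. An alternative sketch, which is not the approach used in this notebook, keeps unknown prices as `NaN` via `pd.to_numeric(..., errors='coerce')`. It also strips thousands separators, which `float()` would otherwise reject. The helper name `parse_price` and the sample values are assumptions for illustration.

```python
import pandas as pd

def parse_price(prices: pd.Series) -> pd.Series:
    """Convert strings such as '$65.00' to floats; anything else becomes NaN."""
    is_price = prices.str.startswith("$", na=False)
    # Blank out non-price strings, then remove '$' and ',' from the rest
    cleaned = prices.where(is_price).str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

sample = pd.Series(["$65.00", "", ".a-box-inner{background-color:#fff}", "$1,299.99"])
parsed = parse_price(sample)
print(parsed)
print(parsed.mean())  # NaN values are skipped by default
```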


## 3. File Reading and Features Engineering: ratings

In [27]:
# Read the ratings file
ratings = global_ratings

In [28]:
# Output the first 10 rows
ratings.head(10)

Unnamed: 0,user_id,product_id,rating,timestamp
0,AKM1MP6P0OYPR,132793040,5.0,1365811200
1,A2CX7LUOHB2NDG,321732944,5.0,1341100800
2,A2NWSAGRHCP8N5,439886341,1.0,1367193600
3,A2WNBOD3WNDNKT,439886341,3.0,1374451200
4,A1GI0U4ZRJA8WN,439886341,1.0,1334707200
5,A1QGNMC6O1VW39,511189877,5.0,1397433600
6,A3J3BRHTDRFJ2G,511189877,2.0,1397433600
7,A2TY0BTJOTENPG,511189877,5.0,1395878400
8,A34ATBPOK6HCHY,511189877,5.0,1395532800
9,A89DO69P0XZ27,511189877,5.0,1395446400


In [29]:
ratings.columns

Index(['user_id', 'product_id', 'rating', 'timestamp'], dtype='object')

### 3.1 Features Engineering

#### 3.1.1 Check Duplicates (no duplicated rows)

In [30]:
ratings.shape

(7824482, 4)

In [31]:
# check for duplicates before dropping any columns,
# because all four columns, especially the timestamp, determine whether a row is a duplicate

ratings.duplicated().sum()

# the result is 0, so there is no need to call ratings.drop_duplicates()

0

#### 3.1.2 Drop Unnecessary Columns

In [32]:
# keep: product_id and rating (only the rating values are needed)
# drop: user_id and timestamp
ratings = ratings.drop(columns=['user_id', 'timestamp'], errors='ignore')

ratings.columns

Index(['product_id', 'rating'], dtype='object')

In [33]:
ratings.head(10)

Unnamed: 0,product_id,rating
0,132793040,5.0
1,321732944,5.0
2,439886341,1.0
3,439886341,3.0
4,439886341,1.0
5,511189877,5.0
6,511189877,2.0
7,511189877,5.0
8,511189877,5.0
9,511189877,5.0


#### 3.1.3 Create 'rating_average' and 'rating_count' from 'rating'

In [34]:
# Calculate the average rating and count of ratings for each product_id
ratings = ratings.groupby('product_id').agg(rating_average=('rating', 'mean'), rating_count=('rating', 'count'))

# use a default integer index (0, 1, 2, ...) for the rows instead of product_id
ratings = ratings.reset_index()

In [35]:
ratings.head(10)

Unnamed: 0,product_id,rating_average,rating_count
0,0132793040,5.0,1
1,0321732944,5.0,1
2,0439886341,1.666667,3
3,0511189877,4.5,6
4,0528881469,2.851852,27
5,0558835155,3.0,1
6,059400232X,5.0,3
7,0594012015,2.0,8
8,0594017343,1.0,1
9,0594017580,3.0,1
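
To see what the named aggregation does, here is a minimal sketch on a made-up ratings frame. The product IDs `P1` and `P2` are illustrative and do not come from the dataset.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "product_id": ["P1", "P1", "P1", "P2"],
    "rating": [1.0, 3.0, 1.0, 5.0],
})

# One output row per product: mean rating and number of ratings
summary = (
    toy_ratings.groupby("product_id")
    .agg(rating_average=("rating", "mean"), rating_count=("rating", "count"))
    .reset_index()
)
print(summary)
```

Each keyword passed to `agg` names an output column and pairs it with a source column and an aggregation function.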


In [36]:
ratings.shape

(476002, 3)

In [37]:
# Top 10 products with the highest rating_count first then highest rating_average
ratings.sort_values(by=['rating_count','rating_average'], ascending=[False, False]).head(10)

Unnamed: 0,product_id,rating_average,rating_count
308398,B0074BW614,4.491504,18244
429572,B00DR0PDNE,3.93102,16454
327308,B007WTAJTO,4.424005,14172
102804,B0019EHU8G,4.754497,12285
296625,B006GWO5WK,4.314657,12226
178601,B003ELYQGG,4.392528,11617
178813,B003ES5ZUU,4.704749,10276
323013,B007R5YDYA,4.690926,9907
289775,B00622AG6S,4.420136,9823
30276,B0002L5R78,4.448614,9487


## 4. Merge 'products' and 'ratings' into 'products_merge'

In [38]:
# Merge the products and ratings dataframes (keep all the products records)
products_merge = pd.merge(products, ratings, on='product_id', how='left')

# Output the first 10 rows
products_merge.head(10)

Unnamed: 0,product_id,product_name,brand_or_author,price,main_category,rating_average,rating_count
0,11300000,genuine geovision 1 channel 3rd party nvr ip s...,geovision,65.0,camera & photo,,
1,43396828,"books ""handbook of astronomical image processi...",33 books co.,0.0,camera & photo,,
2,60009810,one hot summer,carolina garcia aguilera,11.49,books,,
3,60219602,hurray for hattie rabbit: story and pictures (...,dick gackenbach,0.0,books,,
4,60786817,sex.lies.murder.fame.: a novel,lolita files,13.95,books,,
5,70524076,college physics,alan giambattista,0.0,books,,
6,91912407,girl with a one-track mind: confessions of the...,abby lee,4.76,books,,
7,101635370,abcgoodefg® 4gb usb 2.0 mp3 music player with ...,crazy cart,0.0,all electronics,,
8,132492776,wireless bluetooth headphones earbuds with mic...,enter the arena,7.99,home audio & theater,,
9,132793040,kelby training dvd: mastering blend modes in a...,kelby training,0.0,computers,5.0,1.0


In [39]:
products_merge.shape

(74434, 7)
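
A left merge keeps every row of `products` and fills the rating columns with `NaN` where no rating exists. The sketch below, on made-up data, uses the `indicator=True` option of `pd.merge` to count how many products found no match. The frame names and values are assumptions for illustration.

```python
import pandas as pd

toy_products = pd.DataFrame({
    "product_id": ["P1", "P2", "P3"],
    "product_name": ["a", "b", "c"],
})
toy_summary = pd.DataFrame({
    "product_id": ["P1"],
    "rating_average": [4.5],
    "rating_count": [6],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left merge
merged = pd.merge(toy_products, toy_summary, on="product_id", how="left", indicator=True)
unmatched = int((merged["_merge"] == "left_only").sum())
print(merged)
print(unmatched)
```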

### 4.1 Replace NaN

In [40]:
products_merge.isnull().sum()

product_id             0
product_name           0
brand_or_author        0
price                  0
main_category          0
rating_average     14656
rating_count       14656
dtype: int64

In [41]:
products_merge[products_merge['main_category'] == 'all electronics'].count()

product_id         23269
product_name       23269
brand_or_author    23269
price              23269
main_category      23269
rating_average     19834
rating_count       19834
dtype: int64

In [42]:
products_merge[products_merge['main_category'] == 'all electronics'].isnull().sum()

product_id            0
product_name          0
brand_or_author       0
price                 0
main_category         0
rating_average     3435
rating_count       3435
dtype: int64

In [43]:
products_merge.fillna({'rating_average': 0.0, 'rating_count': 0}, inplace=True)

In [44]:
products_merge.isnull().sum()

product_id         0
product_name       0
brand_or_author    0
price              0
main_category      0
rating_average     0
rating_count       0
dtype: int64

## 5. Simple Recommender System

### 5.1 Simple Rating Sort

▪ Sorting 'products_merge' by multiple columns.

▪ Issue: A product with a very high **rating_count** can still have a low **rating_average**.

In [45]:
# Top 20 products with the highest rating_count first, then the highest rating_average
products_merge.sort_values(by=['rating_count','rating_average'], ascending=[False, False]).head(20)

Unnamed: 0,product_id,product_name,brand_or_author,price,main_category,rating_average,rating_count
66353,B000LRMS66,garmin portable friction mount - frustration f...,garmin,18.5,cell phones & accessories,4.756627,8715.0
25220,B0001FTVEK,sennheiser rs120 on-ear wireless rf headphones...,sennheiser,3.76,home audio & theater,4.007109,5345.0
60552,B000I68BD4,jlab jbuds hi-fi noise-reducing ear buds,jlab,0.0,home audio & theater,3.50153,4903.0
46294,B000BQ7GW8,sandisk 2gb class 4 sd flash memory card- sdsd...,sandisk,4.99,all electronics,4.553216,4275.0
15080,B00007E7JU,canon ef 50mm f/1.8 ii camera lens - fixed (di...,canon,0.0,camera & photo,4.565995,3523.0
63061,B000JMJWV2,transcend 4 gb class 6 sdhc flash memory card ...,transcend,0.0,all electronics,4.248114,3446.0
45652,B000BKJZ9Q,garmin nuvi 350 3.5-inch portable gps navigato...,garmin,0.0,home audio & theater,4.440509,3219.0
42919,B000A6PPOK,microsoft natural ergonomic keyboard 4000,microsoft,0.0,all electronics,3.950495,2828.0
15649,B00007M1TZ,panasonic kx-tca60 hands-free headset with com...,panasonic,17.49,all electronics,3.97661,2608.0
48733,B000CSWCQA,garmin forerunner 305 gps receiver with heart ...,garmin,17.47,home audio & theater,4.43261,2441.0


### 5.2 Weighted Rating

▪ A *weighted rating* that takes into account the **rating_average** and the **rating_count** it has accumulated.

▪ We can calculate the Weighted Rating Score into a new 'score' column.

▪ The formula of weighted rating is as follows:

<img src="weighted_rating.png" width="600">

\>>> **v** is the number of ratings for the product (represented by **rating_count**)

\>>> **m** is the **minimum rating count** required to be listed in the chart (to be calculated)

\>>> **R** is the average rating of the product (represented by **rating_average**)

\>>> **C** is the **mean of rating average** across the whole dataframe (to be calculated)
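
Written as an equation, and consistent with the `weighted_rating` function defined later in this notebook, the score is:

$$
\text{score} = \frac{v}{v + m}\,R + \frac{m}{v + m}\,C
$$

As $v$ grows, the first weight approaches 1 and the score approaches the product's own average $R$; when $v$ is small relative to $m$, the score is pulled toward the global mean $C$. For example, with $v = 27$, $R = 2.851852$, $m = 24$ and $C \approx 3.081758$ (values taken from the outputs in this notebook):

$$
\frac{27}{51} \times 2.851852 + \frac{24}{51} \times 3.081758 \approx 1.509804 + 1.450239 = 2.960043
$$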

In [46]:
# describe() gives an overall statistical summary of the dataframe
# C can be read from its output.

products_merge.describe()

# From the output:
# get C: mean of **rating_average** of product

Unnamed: 0,price,rating_average,rating_count
count,74434.0,74434.0,74434.0
mean,8.630102,3.081758,14.14754
std,38.781412,1.796436,79.988388
min,0.0,0.0,0.0
25%,0.0,2.0,1.0
50%,0.0,3.75,2.0
75%,3.76,4.5,7.0
max,999.99,5.0,8715.0


In [47]:
# C, the mean of rating average across the whole dataframe
C = products_merge['rating_average'].mean()
C

3.081757800732317

In [48]:
# m, minimum rating count required to be listed in the chart

# consider the 90th percentile.

# for a product to be recommended, its rating count must be at or above the 90th percentile of all products.

m = products_merge['rating_count'].quantile(0.90) 
m

24.0
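
The choice of percentile controls how strict the filter is. A minimal sketch on made-up rating counts (not the dataset) shows how the threshold and the number of qualifying products change with the quantile:

```python
import pandas as pd

# Illustrative rating counts for ten hypothetical products
toy_counts = pd.Series([0, 1, 1, 2, 2, 3, 7, 10, 24, 100], name="rating_count")

for q in (0.50, 0.75, 0.90):
    threshold = toy_counts.quantile(q)
    kept = int((toy_counts >= threshold).sum())
    print(f"quantile={q:.2f} threshold={threshold} products kept={kept}")
```

A higher quantile keeps fewer products but makes each retained average more reliable.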

In [49]:
# Filter first, then copy only the qualifying rows
q_products = products_merge.loc[products_merge['rating_count'] >= m].copy()

q_products

Unnamed: 0,product_id,product_name,brand_or_author,price,main_category,rating_average,rating_count
40,0528881469,rand mcnally 528881469 7-inch intelliroute tnd...,rand mcnally,0.00,all electronics,2.851852,27.0
201,0972683275,"videosecu 24"" long arm tv wall mount low profi...",videosecu,34.99,all electronics,4.470980,1051.0
460,1400501741,"nook hd+ 9"" 16gb wi-fi color tablet",barnes & noble,0.00,computers,3.846154,26.0
461,1400599997,barnes & noble nook ebook reader (wifi + 3g)[b&w],barnes & noble,0.00,all electronics,3.490991,222.0
463,140053271X,barnes & noble nook simple touch ebook reader ...,barnes & noble,0.00,home audio & theater,3.900232,431.0
...,...,...,...,...,...,...,...
74405,B000Q6CWA4,pioneer multi code region free dvd player - wo...,pioneer,0.00,all electronics,4.080808,99.0
74406,B000Q6EH1Q,precision design ew-60c lens hood for canon le...,precision design,0.00,camera & photo,4.396226,53.0
74408,B000Q6EKZY,camera lens hood ew-73b,hongdak,0.00,camera & photo,4.000000,27.0
74429,B000Q6NSAM,precision design ew-78bii lens hood for canon ...,precision design,0.00,camera & photo,3.672131,61.0


In [50]:
# Function that computes the weighted rating of each product
def weighted_rating(x, m = m, C = C):
    
    v = x['rating_count']
    R = x['rating_average']
    
    # Calculation based on the IMDB formula
    return (v / (v + m) * R) + (m / (m + v) * C)

In [51]:
q_products['score'] = q_products.apply(weighted_rating, axis=1)

q_products.head(20)

Unnamed: 0,product_id,product_name,brand_or_author,price,main_category,rating_average,rating_count,score
40,0528881469,rand mcnally 528881469 7-inch intelliroute tnd...,rand mcnally,0.0,all electronics,2.851852,27.0,2.960043
201,0972683275,"videosecu 24"" long arm tv wall mount low profi...",videosecu,34.99,all electronics,4.47098,1051.0,4.439965
460,1400501741,"nook hd+ 9"" 16gb wi-fi color tablet",barnes & noble,0.0,computers,3.846154,26.0,3.479244
461,1400599997,barnes & noble nook ebook reader (wifi + 3g)[b&w],barnes & noble,0.0,all electronics,3.490991,222.0,3.451066
463,140053271X,barnes & noble nook simple touch ebook reader ...,barnes & noble,0.0,home audio & theater,3.900232,431.0,3.85706
464,1400501466,"barnes & noble nook tablet 16gb (color, bntv250)",barnes & noble,0.0,computers,3.56,250.0,3.51811
470,1400699169,barnes & noble nook hd+ tablet 32gb slate (bnt...,barnes & noble,89.55,computers,4.319149,47.0,3.900876
471,1400532620,barnes and noble nook ebook reader (wifi only)...,barnes & noble,0.0,all electronics,3.684211,171.0,3.610062
612,1615527613,barnes & noble bn-adp-h01 power kit,barnes & noble,58.88,portable audio & accessories,3.875,32.0,3.535039
621,161552763X,barnes & noble 5010490303 lautner e-reader cover,barnes & noble,0.0,portable audio & accessories,4.807692,26.0,3.979244
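
`DataFrame.apply(..., axis=1)` calls the Python function once per row. Because the formula uses only element-wise arithmetic, the same score can be computed on whole columns at once. A minimal sketch on two rows copied from the output above, with `m` and `C` hard-coded to rounded values from earlier cells purely for illustration:

```python
import pandas as pd

m = 24.0       # assumed threshold, copied from the notebook output
C = 3.081758   # assumed global mean, rounded from the notebook output

toy = pd.DataFrame({
    "rating_average": [2.851852, 4.470980],
    "rating_count": [27.0, 1051.0],
})

v = toy["rating_count"]
R = toy["rating_average"]
# Same weighted-rating formula, evaluated column-wise
toy["score"] = (v / (v + m)) * R + (m / (v + m)) * C
print(toy)
```

On a frame of this notebook's size the difference is small, but the column-wise form tends to scale better.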


### 5.3 Result using Weighted Rating

#### 5.3.1 Recommending Top 20 Products

▪ Sort q_products in descending order based on the score feature column.

▪ Output the product_name, main_category, brand_or_author, price, rating_average, rating_count, and weighted rating (score) of the top 20 products.

In [52]:
# Sort products based on 'score' and recommend the top 20 products
top_20_products = q_products.sort_values('score', ascending = False).head(20).reset_index()

columns = ['product_id', 'product_name', 'main_category','brand_or_author', 'price',  'rating_average', 'rating_count', 'score']
top_20_products = top_20_products[columns]

top_20_products.index = top_20_products.index + 1

top_20_products

Unnamed: 0,product_id,product_name,main_category,brand_or_author,price,rating_average,rating_count,score
1,B000LRMS66,garmin portable friction mount - frustration f...,cell phones & accessories,garmin,18.5,4.756627,8715.0,4.752027
2,B000053HH5,canon ef 70-200mm f/4l usm telephoto zoom lens...,camera & photo,canon,45.12,4.841499,347.0,4.727661
3,B00007GQLU,canon ef 85mm f/1.8 usm medium telephoto lens ...,camera & photo,canon,35.86,4.787934,547.0,4.716221
4,B000I1X3W8,canon ef 70-200mm f/4 l is usm lens for canon ...,camera & photo,canon,59.82,4.869565,253.0,4.714665
5,B000053HC5,canon ef 135mm f/2l usm lens for canon slr cam...,camera & photo,canon,66.78,4.945783,166.0,4.710327
6,B00009UT9B,pelican 1450 case with foam (silver),camera & photo,pelican,124.95,4.786307,482.0,4.705459
7,B00006I53X,canon ef 70-200mm f/2.8l is usm telephoto zoom...,camera & photo,canon,958.0,4.83908,261.0,4.691095
8,B000CKVOOY,arkon folding tablet stand for ipad air ipad m...,computers,arkon,12.95,4.707955,1873.0,4.687381
9,B000092TT0,polk audio psw505 12-inch powered subwoofer (s...,home audio & theater,polk audio,7.69,4.720169,947.0,4.679673
10,B00020M1U0,sanus vmpl50b vision mount tilting mount for 3...,home audio & theater,twowings,0.0,4.767442,387.0,4.669008


#### 5.3.2 Recommending Top 20 Products According to product_name

In [53]:
# E.g. speaker
product_name = input("Enter the product name : ")

Enter the product name : speaker


In [54]:
# regex=False treats the user input as literal text rather than a regular expression
top_20_product_name = q_products[q_products['product_name'].str.contains(product_name.lower(), regex=False)]

top_20_product_name = top_20_product_name.sort_values('score', ascending = False).reset_index()[columns]
top_20_product_name.index = top_20_product_name.index + 1

top_20_product_name.head(20)

Unnamed: 0,product_id,product_name,main_category,brand_or_author,price,rating_average,rating_count,score
1,B00005T3C8,"polk audio rc65i 2-way premium in-wall 6.5"" sp...",home audio & theater,polk audio,7.67,4.694656,262.0,4.559308
2,B0002WPSBC,logitech z-5500 thx-certified 5.1 digital surr...,computers,logitech,0.0,4.590796,804.0,4.547056
3,B00005T3BD,"polk audio rc60i 2-way premium in-ceiling 6.5""...",home audio & theater,polk audio,3.78,4.627848,395.0,4.539289
4,B0002SQ2P2,logitech z-2300 thx-certified 2.1 speaker syst...,all electronics,logitech,0.0,4.551916,1435.0,4.527733
5,B000OG6I6A,"sony ss-b3000 bookshelf speakers (pair, black)",home audio & theater,sony,0.0,4.616188,383.0,4.525706
6,B0001VGFKW,yamaha ns-aw150bl 2-way indoor/outdoor speaker...,home audio & theater,yamaha audio,4.37,4.576923,546.0,4.513969
7,B00006JPDI,"bic america dv62si bookshelf speakers (pair, b...",home audio & theater,bic america,8.18,4.634146,246.0,4.496156
8,B0002SQ0A4,logitech x-230 2.1 2-piece dual drive speakers...,all electronics,logitech,0.0,4.552743,474.0,4.481852
9,B000JNA4LS,harman kardon go + play portable speakers syst...,home audio & theater,harman kardon,0.0,4.693333,150.0,4.471047
10,B000OG4E1G,loopilops bluetooth soundbar audio tv speaker...,home audio & theater,cowin,2.0,4.558642,324.0,4.456788


#### 5.3.3 Recommending Top 20 Products According to main_category

In [55]:
# e.g. camera
main_category = input("Enter the main category : ")

Enter the main category : camera


In [56]:
# Match the input as a literal substring (not a regex)
top_20_main_category = q_products[q_products['main_category'].str.contains(main_category.lower(), regex=False, na=False)]

top_20_main_category = top_20_main_category[columns].sort_values('score', ascending = False).reset_index()
top_20_main_category.index = top_20_main_category.index + 1

top_20_main_category.head(20)

Unnamed: 0,index,product_id,product_name,main_category,brand_or_author,price,rating_average,rating_count,score
1,7115,B000053HH5,canon ef 70-200mm f/4l usm telephoto zoom lens...,camera & photo,canon,45.12,4.841499,347.0,4.727661
2,15388,B00007GQLU,canon ef 85mm f/1.8 usm medium telephoto lens ...,camera & photo,canon,35.86,4.787934,547.0,4.716221
3,60041,B000I1X3W8,canon ef 70-200mm f/4 l is usm lens for canon ...,camera & photo,canon,59.82,4.869565,253.0,4.714665
4,7117,B000053HC5,canon ef 135mm f/2l usm lens for canon slr cam...,camera & photo,canon,66.78,4.945783,166.0,4.710327
5,19194,B00009UT9B,pelican 1450 case with foam (silver),camera & photo,pelican,124.95,4.786307,482.0,4.705459
6,13386,B00006I53X,canon ef 70-200mm f/2.8l is usm telephoto zoom...,camera & photo,canon,958.0,4.83908,261.0,4.691095
7,5267,B00004XOM3,canon ef 100mm f/2.8 macro usm fixed lens for ...,camera & photo,canon,49.23,4.794613,297.0,4.666549
8,70340,B000NP3DJW,canon speedlite 580ex ii flash for canon eos d...,camera & photo,canon,189.98,4.749441,447.0,4.664463
9,13422,B00006I53W,canon ef 70-200mm f/2.8l usm telephoto zoom le...,camera & photo,canon,58.6,4.859551,178.0,4.648328
10,8155,B00005LEN4,nikon af fx nikkor 50mm f/1.8d lens for nikon ...,camera & photo,nikon,11.99,4.665763,1107.0,4.63215


#### 5.3.4 Recommending Top 20 Products According to brand_or_author

In [57]:
# e.g. microsoft
brand_or_author = input("Enter the brand or author : ")

Enter the brand or author : microsoft


In [58]:
# Match the input as a literal substring (not a regex)
top_20_brand_or_author = q_products[q_products['brand_or_author'].str.contains(brand_or_author.lower(), regex=False, na=False)]

top_20_brand_or_author = top_20_brand_or_author[columns].sort_values('score', ascending = False).reset_index()
top_20_brand_or_author.index = top_20_brand_or_author.index + 1

top_20_brand_or_author.head(20)

Unnamed: 0,index,product_id,product_name,main_category,brand_or_author,price,rating_average,rating_count,score
1,9390,B00005TQ08,microsoft intellimouse optical mouse,all electronics,microsoft,1.98,4.514894,235.0,4.382093
2,11884,B00006B7HB,microsoft wheel mouse optical,all electronics,microsoft,0.0,4.403315,181.0,4.248596
3,7244,B00005853Z,microsoft trackball explorer,all electronics,microsoft,8.82,4.417808,146.0,4.229189
4,4100,B00004S9AK,microsoft d58-00002 intellimouse optical,all electronics,microsoft,0.0,4.436508,126.0,4.219748
5,28642,B0002CPBUK,microsoft digital media pro keyboard,all electronics,microsoft,0.0,4.374269,171.0,4.215191
6,72455,B000OY71LS,"microsoft 15.6"" neoprene laptop sleeve",all electronics,microsoft,0.0,4.511364,88.0,4.20502
7,10213,B0000642RX,microsoft natural keyboard elite,all electronics,microsoft,4.27,4.240084,479.0,4.184815
8,21235,B0000AOWVN,microsoft natural multimedia keyboard,all electronics,microsoft,2.72,4.33945,109.0,4.112498
9,27648,B00025O7FC,microsoft wheel optical mouse,all electronics,microsoft,9.75,4.246377,138.0,4.073841
10,67800,B000MKKTKE,"microsoft impact messenger bag for 17.3"" lapto...",computers,samsill/microsoft,0.0,4.347222,72.0,4.030856


#### 5.3.5 Recommending Top 20 Products According to price range

In [59]:
# e.g. 10, 100
# Prices are decimal values, so accept inputs such as 9.99 as well as whole numbers
min_price = float(input("Enter the minimum price : "))
max_price = float(input("Enter the maximum price : "))

Enter the minimum price : 10
Enter the maximum price : 100


In [60]:
top_20_within_price_range = q_products[(q_products['price'] >= min_price) & (q_products['price'] <= max_price)]

top_20_within_price_range = top_20_within_price_range[columns].sort_values('score', ascending = False).reset_index()
top_20_within_price_range.index = top_20_within_price_range.index + 1

top_20_within_price_range.head(20)

Unnamed: 0,index,product_id,product_name,main_category,brand_or_author,price,rating_average,rating_count,score
1,66353,B000LRMS66,garmin portable friction mount - frustration f...,cell phones & accessories,garmin,18.5,4.756627,8715.0,4.752027
2,7115,B000053HH5,canon ef 70-200mm f/4l usm telephoto zoom lens...,camera & photo,canon,45.12,4.841499,347.0,4.727661
3,15388,B00007GQLU,canon ef 85mm f/1.8 usm medium telephoto lens ...,camera & photo,canon,35.86,4.787934,547.0,4.716221
4,60041,B000I1X3W8,canon ef 70-200mm f/4 l is usm lens for canon ...,camera & photo,canon,59.82,4.869565,253.0,4.714665
5,7117,B000053HC5,canon ef 135mm f/2l usm lens for canon slr cam...,camera & photo,canon,66.78,4.945783,166.0,4.710327
6,48068,B000CKVOOY,arkon folding tablet stand for ipad air ipad m...,computers,arkon,12.95,4.707955,1873.0,4.687381
7,5267,B00004XOM3,canon ef 100mm f/2.8 macro usm fixed lens for ...,camera & photo,canon,49.23,4.794613,297.0,4.666549
8,30238,B0002JY712,panasonic cordless telephone battery (hhr-p104a),office products,panasonic,13.09,4.711515,825.0,4.665444
9,13422,B00006I53W,canon ef 70-200mm f/2.8l usm telephoto zoom le...,camera & photo,canon,58.6,4.859551,178.0,4.648328
10,8155,B00005LEN4,nikon af fx nikkor 50mm f/1.8d lens for nikon ...,camera & photo,nikon,11.99,4.665763,1107.0,4.63215
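
Sections 5.3.2 to 5.3.5 repeat one pattern: filter `q_products`, sort by `score`, and renumber the rows from 1. The block below is a minimal sketch of a reusable helper for that pattern, not part of the original notebook. The function name `top_n`, its parameters, and the toy data are hypothetical; it only assumes a DataFrame with lowercase text columns plus `price` and `score` columns, as in `q_products`.

```python
import pandas as pd

def top_n(df, n=20, text_filters=None, price_range=None):
    """Filter df, sort by 'score' descending, and renumber rows from 1.

    text_filters: dict of column name -> substring, matched literally after lowercasing
    price_range:  (min_price, max_price) tuple, both ends inclusive
    """
    mask = pd.Series(True, index=df.index)
    for column, needle in (text_filters or {}).items():
        # regex=False treats the input as plain text; na=False drops missing values
        mask &= df[column].str.contains(needle.lower(), regex=False, na=False)
    if price_range is not None:
        low, high = price_range
        mask &= df['price'].between(low, high)
    result = df[mask].sort_values('score', ascending=False).reset_index(drop=True)
    result.index = result.index + 1
    return result.head(n)

# Hypothetical toy data standing in for q_products
toy = pd.DataFrame({
    'product_name': ['usb speaker', 'bookshelf speaker', 'zoom lens', 'wireless mouse'],
    'brand_or_author': ['acme', 'acme', 'lensco', 'acme'],
    'price': [15.0, 120.0, 45.0, 9.5],
    'score': [4.2, 4.6, 4.7, 4.1],
})

speakers = top_n(toy, text_filters={'product_name': 'Speaker'})
affordable = top_n(toy, price_range=(10, 100))
```

Using `reset_index(drop=True)` here avoids the extra `index` column that appears in the outputs of 5.3.3 to 5.3.5; whether to keep the original row label is a design choice.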


# Content-Based Filtering Recommender System
Done by Oh Boon Suen

# Collaborative Filtering Recommender System
Done by Tan Cherng Ming

In [None]:
# Read the product metadata (JSON Lines) into a DataFrame
product_title = pd.read_json(r'datasets\subset_meta_Electronics.json', lines=True)
product_title

In [None]:
# List the column names
product_title.columns

In [None]:
# Drop the unnecessary columns, keeping only 'title', 'main_cat' and 'asin'
product_title = product_title.drop(['category', 'tech1', 'description', 'fit', 'also_buy', 'tech2', 'brand', 'feature', 'rank', 
              'also_view', 'similar_item', 'date', 'price', 'imageURL','imageURLHighRes', 'details'], axis=1)
product_title

In [None]:
#Check for missing values
print('Number of missing values across columns:')
print(product_title.isnull().sum())

In [None]:
# Rename 'asin' to 'productId' so it matches the key used in the ratings file
product_title.rename(columns = {'asin':'productId'}, inplace = True)
product_title

In [None]:
# Read the ratings CSV, assign column names, and show the first rows
user_product_ratings = pd.read_csv(r'C:\Users\Wilson Tan\Downloads\AI assignment\ratings_Electronics.csv', names=['userId', 'productId','Rating','timestamp'])
user_product_ratings.head()

In [None]:
#Shape of the data
user_product_ratings.shape

In [None]:
# Keep the first 100,000 of the 7,824,482 rows
user_product_ratings = user_product_ratings.iloc[:100000]

In [None]:
#Drop the unnecessary column
user_product_ratings = user_product_ratings.drop('timestamp', axis=1)

In [None]:
#Shape of the data
user_product_ratings.shape

In [None]:
user_product_ratings.info()

In [None]:
#Check the datatypes
user_product_ratings.dtypes

In [None]:
#Check for missing values
print('Number of missing values across columns:')
print(user_product_ratings.isnull().sum())

In [None]:
# Check with the ratings distribution
with sns.axes_style('white'):
    g = sns.catplot(x ="Rating", data = user_product_ratings, kind ='count')
    g.set_ylabels("Total number of ratings")

In [None]:
# Print the number of ratings, users and products
print("Total number of ratings  :", user_product_ratings.shape[0])
print("Total number of users    :", user_product_ratings['userId'].nunique())
print("Total number of products :", user_product_ratings['productId'].nunique())

In [None]:
# Merge the ratings with the product titles on productId, then drop duplicate rows
product_ratings = pd.merge(user_product_ratings, product_title, on='productId').drop_duplicates()
product_ratings

In [None]:
#Check for missing values
print('Number of missing values across columns:')
print(product_ratings.isnull().sum())

In [None]:
# Show the 10 products with the most ratings
total_rate_of_a_product = product_ratings.groupby(by='title')['Rating'].count().sort_values(ascending=False)
total_rate_of_a_product.head(10)

In [None]:
# Build a user x title rating matrix with a pivot table; unrated pairs are filled with 0
user_product_matrix = pd.pivot_table(product_ratings, index='userId', columns='title', values ='Rating').fillna(0)
user_product_matrix
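
Filling unrated pairs with 0 means the later `corrwith` calls treat "not rated" as a rating of 0, which shifts the Pearson correlation. The block below is a small sketch of that effect on hypothetical toy ratings (the user IDs, titles and values are made up); it compares the correlation on the sparse matrix, where only co-rated users count, with the zero-filled one.

```python
import pandas as pd

# Hypothetical toy ratings: user u4 rated product A but not product B
ratings = pd.DataFrame({
    'userId': ['u1', 'u1', 'u2', 'u2', 'u3', 'u3', 'u4'],
    'title':  ['A',  'B',  'A',  'B',  'A',  'B',  'A'],
    'Rating': [5,    4,    3,    2,    1,    1,    4],
})

# Sparse matrix: missing ratings stay NaN and are skipped pairwise by corr()
sparse = pd.pivot_table(ratings, index='userId', columns='title', values='Rating')

# Zero-filled matrix, as in the cell above: u4's missing rating of B becomes 0
filled = sparse.fillna(0)

corr_sparse = sparse['A'].corr(sparse['B'])
corr_filled = filled['A'].corr(filled['B'])

print('Correlation on co-rated users only :', round(corr_sparse, 3))
print('Correlation after fillna(0)        :', round(corr_filled, 3))
```

Keeping the zeros is simpler and always yields a defined correlation; dropping the `fillna(0)` restricts the comparison to users who rated both items, at the cost of more undefined (NaN) results on a sparse dataset.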

## Item-based Filtering

In [None]:
# Select one product as the reference item and take its rating column
users_ratings = user_product_matrix['Koss Porta Pro On Ear Headphones with Case, Black / Silver']
users_ratings.head(10)

In [None]:
# Correlate every product's rating column with the selected product's ratings
similar_product = user_product_matrix.corrwith(users_ratings)
similar_product

In [None]:
# Create a dataframe
similar_product = pd.DataFrame(similar_product, columns = ['Correlation'])
similar_product.head(10)

In [None]:
# Sort the products by correlation, highest first
similar_product.sort_values(by = 'Correlation', ascending = False).head(10)

In [None]:
# Count the number of ratings per title
df_rating = pd.DataFrame(product_ratings.groupby('title')['Rating'].count())

In [None]:
recommend_product = similar_product.join(df_rating['Rating']).sort_values(by = 'Correlation', ascending = False)
recommend_product

In [None]:
# Keep only products with more than 50 ratings and show the top 20
recommend_product = recommend_product[recommend_product['Rating'] > 50].sort_values(by = 'Correlation', ascending = False)
recommend_product.head(20)

In [None]:
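# Skip the first row (expected to be the selected product itself) and list the next 20 titles.
# A separate name is used so the `products` DataFrame from Module 1 is not overwritten.
recommend_product = recommend_product.iloc[1:21]
recommended_products = recommend_product.index.values.tolist()
recommended_products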
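
The item-based steps above can be collected into one function. This is a minimal sketch under assumptions, not the notebook's own code: the function name `similar_items`, its parameters, and the toy ratings are hypothetical. It differs from the cells above in one respect, which is labelled in the comments: the reference product is removed by label rather than with `iloc[1:21]`, so a real recommendation is not dropped when the reference product itself has too few ratings to pass the filter.

```python
import pandas as pd

def similar_items(ratings, target_title, min_ratings=50, top_n=20):
    """Rank titles by Pearson correlation with target_title (hypothetical helper)."""
    # User x title matrix, zero-filled as in the notebook
    matrix = pd.pivot_table(ratings, index='userId', columns='title',
                            values='Rating').fillna(0)
    correlation = matrix.corrwith(matrix[target_title]).rename('Correlation')
    counts = ratings.groupby('title')['Rating'].count().rename('rating_count')
    table = pd.concat([correlation, counts], axis=1)
    # Keep only titles with more than min_ratings ratings
    table = table[table['rating_count'] > min_ratings]
    # Assumption: drop the reference product by label instead of by position
    table = table.drop(index=target_title, errors='ignore')
    return table.sort_values('Correlation', ascending=False).head(top_n)

# Hypothetical toy ratings: B tracks A closely, C moves the opposite way
toy_ratings = pd.DataFrame(
    [('u1', 'A', 5), ('u1', 'B', 5), ('u1', 'C', 1),
     ('u2', 'A', 4), ('u2', 'B', 4), ('u2', 'C', 2),
     ('u3', 'A', 1), ('u3', 'B', 2), ('u3', 'C', 5),
     ('u4', 'A', 2), ('u4', 'B', 1), ('u4', 'C', 4)],
    columns=['userId', 'title', 'Rating'],
)

result = similar_items(toy_ratings, 'A', min_ratings=0)
print(result)
```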

## User-based Filtering

In [None]:
#Transpose the pivot table
product_user_matrix = user_product_matrix.transpose()
product_user_matrix.head()

In [None]:
# One user is selected: A231WM2Z2JL0U3
user_title_ratings = product_user_matrix['A231WM2Z2JL0U3']
user_title_ratings.head(5)

In [None]:
# Correlate every user's rating column with the selected user's ratings
similar_users = product_user_matrix.corrwith(user_title_ratings)

# Create a dataframe
similar_users = pd.DataFrame(similar_users, columns = ['Correlation'])
similar_users.head(10)

In [None]:
# Sort the users by correlation, skip the first row (the selected user) and keep the next 20
most_similar_users = similar_users.sort_values(by = 'Correlation', ascending = False).iloc[1:21]
most_similar_users

In [None]:
# Take the most similar user from the list
user_list = most_similar_users.index.values.tolist()
user_list[0]

In [None]:
# Products rated by the most similar user
recommendation = product_ratings[product_ratings['userId'] == user_list[0]]
recommendation

In [None]:
# Keep only the products that user rated above 3, with their title and rating
recommendation = product_ratings.loc[(product_ratings['userId'] == user_list[0]) & 
                                   (product_ratings['Rating'] > 3), 
                                   ['title', 'Rating']]
recommendation

In [None]:
recommendation = recommendation.set_index('title')
recommendation_list = recommendation.index.values.tolist()
print('List to recommend')
recommendation_list
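
The user-based steps above can likewise be sketched as one function. The name `recommend_from_nearest_user`, its parameters, and the toy ratings are hypothetical. The sketch adds one step that the cells above do not perform and which is marked as an assumption in the code: titles the selected user has already rated are removed from the recommendation list.

```python
import pandas as pd

def recommend_from_nearest_user(ratings, target_user, min_rating=3):
    """Recommend titles rated above min_rating by the most similar user (hypothetical helper)."""
    # Title x user matrix, zero-filled; equivalent to transposing the user x title matrix
    matrix = pd.pivot_table(ratings, index='title', columns='userId',
                            values='Rating').fillna(0)
    # Correlate every user with the target user, then remove the target by label
    correlation = matrix.corrwith(matrix[target_user]).drop(index=target_user)
    nearest_user = correlation.idxmax()
    liked = ratings[(ratings['userId'] == nearest_user) &
                    (ratings['Rating'] > min_rating)]
    # Assumption (not in the notebook): skip titles the target user already rated
    already_rated = set(ratings.loc[ratings['userId'] == target_user, 'title'])
    titles = [title for title in liked['title'] if title not in already_rated]
    return nearest_user, titles

# Hypothetical toy ratings: u2 rates like u1, u3 rates the opposite way
toy_ratings = pd.DataFrame(
    [('u1', 'A', 5), ('u1', 'B', 4), ('u1', 'C', 1),
     ('u2', 'A', 5), ('u2', 'B', 5), ('u2', 'C', 1), ('u2', 'D', 4),
     ('u3', 'A', 1), ('u3', 'B', 2), ('u3', 'C', 5), ('u3', 'D', 5)],
    columns=['userId', 'title', 'Rating'],
)

nearest, recommendations = recommend_from_nearest_user(toy_ratings, 'u1')
print('Most similar user :', nearest)
print('List to recommend :', recommendations)
```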