### Back to [Introduction](Introduction.ipynb)

# Amazon Customer Reviews Dataset
Amazon Customer Reviews (a.k.a. Product Reviews) is one of Amazon’s iconic products. In the more than two decades since the first review in 1995, millions of Amazon customers have contributed over a hundred million reviews expressing opinions and describing their experiences with products on the Amazon.com website. More than 130 million customer reviews are available to researchers as part of this dataset.

### Documentation: 
https://s3.amazonaws.com/amazon-reviews-pds/readme.html

### Index of dataset files:
https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt

### Dataset file:
https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_multilingual_UK_v1_00.tsv.gz

In [1]:
import numpy as np
import pandas as pd
import datetime
import pickle

In [2]:
PATH_DATA = "V:/Programmazione/Amazon/"

Some records are corrupted: their review date cannot be parsed. For those records we set the date to the placeholder 1900-01-01, and we delete them later.

In [3]:
def parserDate(x):
    try:
        return datetime.datetime.strptime(x, '%Y-%m-%d')
    except ValueError:
        return datetime.datetime.strptime("1900-01-01", '%Y-%m-%d')
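
The placeholder approach can be illustrated on a few strings. This is only a sketch: the sample inputs and the names `SENTINEL` and `parse_or_sentinel` are made up for the example and are not part of the notebook.

```python
import datetime

# Same placeholder date that parserDate returns for unparsable input.
SENTINEL = datetime.datetime(1900, 1, 1)

def parse_or_sentinel(text):
    """Parse a YYYY-MM-DD string, falling back to SENTINEL on failure."""
    try:
        return datetime.datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return SENTINEL

# Hypothetical inputs: one valid date and two corrupted values.
samples = ["2014-12-29", "2013-13-45", "not a date"]
parsed = [parse_or_sentinel(s) for s in samples]

# Rows carrying the placeholder are the ones to delete afterwards.
corrupted = [s for s, d in zip(samples, parsed) if d == SENTINEL]
print(corrupted)
```

A genuine review dated 1900-01-01 would be indistinguishable from a corrupted one, which is harmless here because the first review dates from 1995.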

In [4]:
df = pd.read_csv(
    "%sdata.tsv" % PATH_DATA, 
    encoding="utf-8", 
    header=0,
    sep='\t',
    usecols=['customer_id', 'product_id', 'product_title','product_category','star_rating','review_headline','review_body','review_date'],
    dtype = {
        #'marketplace':str,
        'customer_id':str,
        #'review_id':str,
        'product_id':str,
        #'product_parent':str,
        'product_title':str,
        'product_category':str,
        'star_rating': np.uint8,
        #'helpful_votes': np.uint16,
        #'total_votes': np.uint16,
        #'vine':str,
        #'verified_purchase':bool,
        'review_headline':str,
        'review_body':str,
        },
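    parse_dates = ['review_date'],
    true_values = ["Y"],
    false_values  = ["N"],
    skipinitialspace = True,
    date_parser=parserDate,        # deprecated since pandas 2.0
    infer_datetime_format = True,  # deprecated since pandas 2.0
    error_bad_lines = False,       # skip malformed rows; removed in pandas 2.0 (replaced by on_bad_lines)
    warn_bad_lines = True,         # report each skipped row; removed in pandas 2.0
    engine='c',
    )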
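
On recent pandas versions the same cleaning can probably be expressed without the deprecated arguments. The sketch below assumes pandas 1.3 or later (for `on_bad_lines`) and uses a small made-up TSV held in memory. It is not the call the notebook actually ran, and it marks bad dates as `NaT` instead of using the 1900-01-01 placeholder.

```python
import io
import pandas as pd

# Made-up TSV: one good row, one row with an extra field, one bad date.
raw = (
    "customer_id\tproduct_id\tstar_rating\treview_date\n"
    "10349\tB00MWK7BWG\t5\t2014-12-29\n"
    "10629\tB006CHML4I\t4\t2013-10-24\textra\n"
    "12136\tB00IIFCJX0\t5\tnot-a-date\n"
)

sample = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    dtype={"customer_id": str, "product_id": str, "star_rating": "uint8"},
    on_bad_lines="skip",  # drop rows with too many fields
)

# Unparsable dates become NaT, so they can be removed with dropna.
sample["review_date"] = pd.to_datetime(
    sample["review_date"], format="%Y-%m-%d", errors="coerce"
)
sample = sample.dropna(subset=["review_date"])
print(sample)
```

Passing `on_bad_lines="warn"` instead of `"skip"` would report each dropped row, which is closer to the `warn_bad_lines = True` setting used above.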

In [5]:
df.index.names = ['review_id']

In [6]:
df.head()

| review_id | customer_id | product_id | product_title | product_category | star_rating | review_headline | review_body | review_date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 10349 | B00MWK7BWG | My Favourite Faded Fantasy | Music | 5 | Five Stars | The best album ever! | 2014-12-29 |
| 1 | 10629 | B006CHML4I | Seiko 5 Men's Automatic Watch with Black Dial ... | Watches | 4 | Great watch from casio. | What a great watch. Both watches and strap is ... | 2013-10-24 |
| 2 | 12136 | B00IIFCJX0 | Dexter Season 8 | Digital_Video_Download | 5 | fantastic | love watching all the episodes of Dexter, when... | 2014-05-09 |
| 3 | 12268 | B000W7JWUA | The Settlers of Catan Board Game - discontinue... | Toys | 5 | Five Stars | Excellent game!!! | 2014-09-19 |
| 4 | 12677 | B005JTAP4S | Peter: A Darkened Fairytale (Vol 1) | Digital_Ebook_Purchase | 5 | A twist on Tales | This cute, quick read is very different to say... | 2013-09-18 |


In [7]:
df['review_date'] = pd.to_datetime(df['review_date'])

In [8]:
df.dtypes

customer_id                 object
product_id                  object
product_title               object
product_category            object
star_rating                  uint8
review_headline             object
review_body                 object
review_date         datetime64[ns]
dtype: object

In [9]:
print("There are %d corrupted rows" % df[df.review_date=="1900-01-01"].count().iloc[0])

There are 55 corrupted rows


In [10]:
df = df.drop(df[df.review_date=="1900-01-01"].index)

In [11]:
print("Now there are %d corrupted rows" % df[df.review_date=="1900-01-01"].count().iloc[0])

Now there are 0 corrupted rows


Check whether there are NaN values

In [12]:
print("There are %d NaN values" % df.isnull().values.sum())

There are 21 NaN values


Remove the rows that contain NaN values

In [13]:
df = df.dropna()

In [14]:
print("Now there are %d NaN values" % df.isnull().values.sum())

Now there are 0 NaN values
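
Note that `df.isnull().values.sum()` counts missing cells, not rows, so the number of rows removed by `dropna()` can be smaller than the number printed. A minimal sketch on a made-up frame (the `toy` data is hypothetical) shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the second row is missing both text fields.
toy = pd.DataFrame({
    "review_headline": ["Five Stars", np.nan, "fantastic"],
    "review_body": ["The best album ever!", np.nan, np.nan],
})

nan_cells = int(toy.isnull().values.sum())      # missing cells overall
nan_rows = int(toy.isnull().any(axis=1).sum())  # rows with at least one NaN
per_column = toy.isnull().sum()                 # missing cells per column

print(nan_cells, nan_rows)
print(per_column)
```

The per-column counts also show which fields are responsible for the missing values.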


# Filtering out categories with fewer than 1000 reviews

In [15]:
categories = sorted(df.product_category.unique().tolist())

In [16]:
tempCategories = []  # categories with at least N reviews
delCategories = []   # categories with fewer than N reviews
N = 1000  # minimum number of reviews a category needs to be kept
for c in categories:
    if df[df.product_category==c].count().iloc[0] >= N:
        tempCategories.append(c)
    else:
        delCategories.append(c)
tot = df.count().iloc[0]
NKeep = df[df['product_category'].isin(tempCategories)].count().iloc[0]
NDisc = tot-NKeep
print("CATEGORIES KEPT (%d): %s" % (len(tempCategories),", ".join(tempCategories)))
print("WE ARE KEEPING %d/%d (%.2f%%) REVIEWS" % (NKeep,tot,NKeep/float(tot)*100))
print("\nCATEGORIES REMOVED (%d): %s" % (len(delCategories),", ".join(delCategories)))
print("WE ARE DISCARDING %d/%d (%.2f%%) REVIEWS" % (NDisc,tot,NDisc/float(tot)*100))

CATEGORIES KEPT (20): Baby, Books, Camera, Digital_Ebook_Purchase, Digital_Music_Purchase, Digital_Video_Download, Electronics, Home, Mobile_Apps, Music, Musical Instruments, PC, Shoes, Sports, Toys, Video, Video DVD, Video Games, Watches, Wireless
WE ARE KEEPING 1702443/1705765 (99.81%) REVIEWS

CATEGORIES REMOVED (13): Apparel, Automotive, Beauty, Health & Personal Care, Home Entertainment, Home Improvement, Kitchen, Lawn and Garden, Luggage, Office Products, Personal_Care_Appliances, Pet Products, Software
WE ARE DISCARDING 3322/1705765 (0.19%) REVIEWS


In [17]:
categories = tempCategories
df = df[df['product_category'].isin(categories)]
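
The same threshold filter can be written more compactly with `value_counts`, which counts every category in one pass instead of scanning the frame once per category. This is a sketch on a made-up frame with a threshold of 2; `toy` and `min_reviews` are hypothetical names.

```python
import pandas as pd

# Hypothetical data: 3 Music, 2 Toys, 1 Luggage review.
toy = pd.DataFrame(
    {"product_category": ["Music"] * 3 + ["Toys"] * 2 + ["Luggage"]}
)
min_reviews = 2  # plays the role of N in the cell above

counts = toy["product_category"].value_counts()
kept = sorted(counts[counts >= min_reviews].index)
removed = sorted(counts[counts < min_reviews].index)

filtered = toy[toy["product_category"].isin(kept)]
print(kept, removed, len(filtered))
```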

In [None]:
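from prettytable import PrettyTable as PT  # assumed import: PT is not defined anywhere in this notebook

def percent(part, total):  # assumed helper, not defined in this notebook
    return 100.0 * part / total

tot = float(df.count().iloc[0])  # total recomputed after the category filter
t = PT(['Stars', 'Number of reviews', '% of total'])
for i in range(1,6):
    count = df[df.star_rating==i].count().iloc[0]
    t.add_row(['*'*i,count,"%.2f%%" % percent(count,tot)])
print("Number of total reviews = %.0f" % tot)
print(t.get_string(title="Number of reviews divided by rating value"))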

### Number of reviews

In [None]:
# PT (PrettyTable) and percent (a percentage helper) must be defined before this cell runs.
t = PT(['Category', 'Number of reviews', '% of total'])
tot = float(df.count().iloc[0])
for c in categories:
    count = df[df.product_category==c].count().iloc[0]
    t.add_row([c,count,"%.2f%%" % percent(count,tot)])
t.align["Category"] = 'l'
print("Number of total reviews = %.0f" % tot)
print(t.get_string(title="Number of reviews divided by category"))
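
If only the counts and percentages are needed, a plain pandas summary may be enough and avoids the extra table dependency. The sketch below uses a made-up column of ratings; `toy` and `summary` are hypothetical names.

```python
import pandas as pd

# Hypothetical ratings for five reviews.
toy = pd.DataFrame({"star_rating": [5, 5, 4, 1, 5]})

# Count reviews per rating, ordered by rating value.
summary = (
    toy["star_rating"]
    .value_counts()
    .sort_index()
    .rename("reviews")
    .to_frame()
)
summary["percent"] = 100.0 * summary["reviews"] / len(toy)
print(summary)
```

Replacing `star_rating` with `product_category` would give the per-category breakdown in the same way.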

# Structures 

In [18]:
# Map each customer_id to the list of (product_id, review_id) pairs for that customer.
customersDict = dict()
for index,row in df.iterrows():
    customersDict.setdefault(row.customer_id, []).append((row.product_id, index))

In [19]:
with open('%scustomersDict.pickle'%PATH_DATA, 'wb') as handle:
    pickle.dump(customersDict, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [20]:
# Map each product_id to the list of (customer_id, review_id) pairs for that product.
productsDict = dict()
for index,row in df.iterrows():
    productsDict.setdefault(row.product_id, []).append((row.customer_id, index))

In [21]:
with open('%sproductsDict.pickle'%PATH_DATA, 'wb') as handle:
    pickle.dump(productsDict, handle, protocol=pickle.HIGHEST_PROTOCOL)
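
Both lookup structures can also be built in a single pass and checked with an in-memory pickle round trip. The following is a minimal sketch on three made-up reviews; the `reviews` list and the variable names are hypothetical, and the shape of the values mirrors the two dictionaries above.

```python
import pickle
from collections import defaultdict

# Hypothetical (review_id, customer_id, product_id) triples.
reviews = [
    (0, "10349", "B00MWK7BWG"),
    (1, "10629", "B006CHML4I"),
    (2, "10349", "B006CHML4I"),
]

customers = defaultdict(list)  # customer_id -> [(product_id, review_id), ...]
products = defaultdict(list)   # product_id  -> [(customer_id, review_id), ...]
for review_id, customer_id, product_id in reviews:
    customers[customer_id].append((product_id, review_id))
    products[product_id].append((customer_id, review_id))

# Convert to a plain dict before pickling, then restore it from bytes.
blob = pickle.dumps(dict(customers), protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(blob)
print(restored["10349"])
```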

In [22]:
df.to_pickle("%sAmazonDataProject.pkl" % PATH_DATA)

In [23]:
df_no_text = (df.drop(labels=["review_headline","review_body"],axis=1))

In [24]:
df_no_text.head()

| review_id | customer_id | product_id | product_title | product_category | star_rating | review_date |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 10349 | B00MWK7BWG | My Favourite Faded Fantasy | Music | 5 | 2014-12-29 |
| 1 | 10629 | B006CHML4I | Seiko 5 Men's Automatic Watch with Black Dial ... | Watches | 4 | 2013-10-24 |
| 2 | 12136 | B00IIFCJX0 | Dexter Season 8 | Digital_Video_Download | 5 | 2014-05-09 |
| 3 | 12268 | B000W7JWUA | The Settlers of Catan Board Game - discontinue... | Toys | 5 | 2014-09-19 |
| 4 | 12677 | B005JTAP4S | Peter: A Darkened Fairytale (Vol 1) | Digital_Ebook_Purchase | 5 | 2013-09-18 |


In [25]:
df_no_text.dtypes

customer_id                 object
product_id                  object
product_title               object
product_category            object
star_rating                  uint8
review_date         datetime64[ns]
dtype: object

In [26]:
df_no_text.to_pickle("%sAmazonDataProjectNoText.pkl" % PATH_DATA)
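
A later notebook can reload these files with `pd.read_pickle`, which restores the column dtypes along with the data. The sketch below checks this on a one-row made-up frame written to a temporary directory; the file name `toy.pkl` is hypothetical.

```python
import os
import tempfile
import pandas as pd

# Hypothetical one-row frame with the same kinds of columns as df_no_text.
toy = pd.DataFrame({
    "customer_id": ["10349"],
    "star_rating": pd.Series([5], dtype="uint8"),
    "review_date": pd.to_datetime(["2014-12-29"]),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pkl")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

print(restored.dtypes)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.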