# Amazon Customer Reviews Dataset
Amazon Customer Reviews (a.k.a. Product Reviews) is one of Amazon’s iconic products. In the two decades since the first review in 1995, millions of Amazon customers have contributed over a hundred million reviews to express opinions and describe their experiences with products on the Amazon.com website. More than 130 million customer reviews are available to researchers as part of this dataset.

### Documentation: 
https://s3.amazonaws.com/amazon-reviews-pds/readme.html

### Index of dataset files:
https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt

### Dataset file link:
https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_multilingual_UK_v1_00.tsv.gz
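The cells below read a plain `data.tsv` file, while the link above points to a gzip archive. A minimal sketch of the decompression step follows, assuming the archive has already been downloaded by hand; the helper name `decompress_gzip` and the commented-out paths are illustrative and not part of the original notebook.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gzip(src, dst):
    """Stream-decompress a .gz file to dst without loading it fully into memory."""
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return Path(dst)


# Hypothetical usage (both paths are assumptions):
# decompress_gzip("amazon_reviews_multilingual_UK_v1_00.tsv.gz", "data.tsv")
```

Alternatively, `pd.read_csv` can usually read a `.gz` file directly, since compression is inferred from the file extension by default.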

In [1]:
import numpy as np
import pandas as pd
import datetime
import pickle

In [2]:
PATH_DATA = "V:/Programmazione/Amazon/"

Some records contain "corrupted" dates that cannot be parsed. For those records we set the date to the sentinel value 1900-01-01, and we delete them later.

In [3]:
def parserDate(x):
    # Parse a 'YYYY-MM-DD' string; unparseable values map to the sentinel 1900-01-01.
    try:
        return datetime.datetime.strptime(x, '%Y-%m-%d')
    except ValueError:
        return datetime.datetime(1900, 1, 1)
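
The sentinel-date approach above works, but the same goal can also be reached without a custom parser. The following is a sketch of an alternative, not the method used in this notebook: `pd.to_datetime` with `errors="coerce"` turns unparseable values into `NaT`, which can then be dropped. The sample strings are made up for illustration.

```python
import pandas as pd

# Made-up sample: two valid dates, one garbage string, one missing value.
raw = pd.Series(["2014-12-29", "not a date", None, "2013-10-24"])

# Unparseable or missing entries become NaT instead of raising.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Keep only the rows whose date was parsed successfully.
valid = parsed.dropna()
```

With this variant, the corrupted rows are identified by `parsed.isna()` rather than by comparing against 1900-01-01.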

In [4]:
df =  pd.read_csv(
    "%sdata.tsv" % PATH_DATA, 
    encoding="utf-8", 
    header=0,
    sep='\t',
    usecols=['review_id','customer_id', 'product_id', 'product_title','product_category','star_rating','helpful_votes','total_votes','verified_purchase','review_headline','review_body','review_date'],
    index_col = 'review_id',
    dtype = {
        #'marketplace':str,
        'customer_id':str,
        'review_id':str,
        'product_id':str,
        #'product_parent':str,
        'product_title':str,
        'product_category':str,
        'star_rating': np.uint8,
        'helpful_votes': np.uint16,
        'total_votes': np.uint16,
        #'vine':str,
        'verified_purchase':bool,
        'review_headline':str,
        'review_body':str,
        },
    parse_dates = ['review_date'],
    true_values = ["Y"],
    false_values  = ["N"],
    skipinitialspace = True,
    date_parser=parserDate,
    infer_datetime_format = True,
    # pandas >= 1.3: replaces error_bad_lines=False and warn_bad_lines=True (removed in pandas 2.0)
    on_bad_lines = 'warn',
    engine='c',
    )
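
To see what the main options of this call do, it can help to run them on a tiny in-memory sample first. The sketch below uses made-up rows and only a subset of the columns; it illustrates how `true_values`/`false_values` map `Y`/`N` to booleans and how `dtype` and `parse_dates` shape the result.

```python
import io

import numpy as np
import pandas as pd

# Made-up, tab-separated sample with a reduced set of columns.
sample = (
    "review_id\tstar_rating\tverified_purchase\treview_date\n"
    "R1\t5\tY\t2014-12-29\n"
    "R2\t4\tN\t2013-10-24\n"
)

small = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=0,
    index_col="review_id",
    dtype={"review_id": str, "star_rating": np.uint8, "verified_purchase": bool},
    true_values=["Y"],    # "Y" is read as True
    false_values=["N"],   # "N" is read as False
    parse_dates=["review_date"],
)

print(small.dtypes)
```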

In [5]:
df.head()

| review_id | customer_id | product_id | product_title | product_category | star_rating | helpful_votes | total_votes | verified_purchase | review_headline | review_body | review_date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| R2YVNBBMXD8KVJ | 10349 | B00MWK7BWG | My Favourite Faded Fantasy | Music | 5 | 0 | 0 | True | Five Stars | The best album ever! | 2014-12-29 |
| R2K4BOL8MN1TTY | 10629 | B006CHML4I | Seiko 5 Men's Automatic Watch with Black Dial ... | Watches | 4 | 0 | 0 | True | Great watch from casio. | What a great watch. Both watches and strap is ... | 2013-10-24 |
| R3P40IEALROVCH | 12136 | B00IIFCJX0 | Dexter Season 8 | Digital_Video_Download | 5 | 0 | 0 | True | fantastic | love watching all the episodes of Dexter, when... | 2014-05-09 |
| R25XL1WWYRDLA9 | 12268 | B000W7JWUA | The Settlers of Catan Board Game - discontinue... | Toys | 5 | 0 | 0 | True | Five Stars | Excellent game!!! | 2014-09-19 |
| RVTVB9YDXSFYH | 12677 | B005JTAP4S | Peter: A Darkened Fairytale (Vol 1) | Digital_Ebook_Purchase | 5 | 12 | 12 | False | A twist on Tales | This cute, quick read is very different to say... | 2013-09-18 |


In [6]:
df['review_date'] = pd.to_datetime(df['review_date'])

In [7]:
df.dtypes

customer_id                  object
product_id                   object
product_title                object
product_category             object
star_rating                   uint8
helpful_votes                uint16
total_votes                  uint16
verified_purchase              bool
review_headline              object
review_body                  object
review_date          datetime64[ns]
dtype: object

In [8]:
df[df.review_date=="1900-01-01"].count()

customer_id          55
product_id           55
product_title        55
product_category     55
star_rating          55
helpful_votes        55
total_votes          55
verified_purchase    55
review_headline      55
review_body          55
review_date          55
dtype: int64

In [9]:
# Keep only the rows whose date is not the 1900-01-01 sentinel
df = df[df.review_date != "1900-01-01"]

In [10]:
df[df.review_date=="1900-01-01"].count()

customer_id          0
product_id           0
product_title        0
product_category     0
star_rating          0
helpful_votes        0
total_votes          0
verified_purchase    0
review_headline      0
review_body          0
review_date          0
dtype: int64

In [11]:
# Map each customer_id to the list of product_ids that customer reviewed
customersDict = dict()
for customer_id, product_id in zip(df.customer_id, df.product_id):
    customersDict.setdefault(customer_id, []).append(product_id)

In [12]:
with open('%scustomersDict.pickle'%PATH_DATA, 'wb') as handle:
    pickle.dump(customersDict, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [14]:
# Map each product_id to the list of customer_ids that reviewed it
productsDict = dict()
for customer_id, product_id in zip(df.customer_id, df.product_id):
    productsDict.setdefault(product_id, []).append(customer_id)

In [15]:
with open('%sproductsDict.pickle'%PATH_DATA, 'wb') as handle:
    pickle.dump(productsDict, handle, protocol=pickle.HIGHEST_PROTOCOL)
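
The two dictionaries are inverse views of the same customer/product pairs. The sketch below shows this on a toy table and checks that a pickled dictionary round-trips unchanged. The IDs `c1`, `c2`, `p1`, `p2` and the helper `build_index` are made up for illustration and do not come from the dataset or the notebook.

```python
import pickle

import pandas as pd

# Toy data: customer c1 reviewed p1 and p2, customer c2 reviewed p2.
toy = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2"],
    "product_id": ["p1", "p2", "p2"],
})


def build_index(keys, values):
    """Group values by key, preserving row order within each key."""
    index = {}
    for key, value in zip(keys, values):
        index.setdefault(key, []).append(value)
    return index


customers = build_index(toy.customer_id, toy.product_id)
products = build_index(toy.product_id, toy.customer_id)

# Serialize to bytes and load back, as a stand-in for writing to and reading from a file.
restored = pickle.loads(pickle.dumps(customers, protocol=pickle.HIGHEST_PROTOCOL))
```

The pickle files written above can be read back the same way with `pickle.load(handle)` on a file opened in `'rb'` mode.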

In [16]:
df.to_pickle("%sAmazonDataProject.pkl" % PATH_DATA)

In [17]:
df_no_text = (df.drop(labels=["review_headline","review_body"],axis=1))

In [18]:
df_no_text.to_pickle("%sAmazonDataProjectNoText.pkl" % PATH_DATA)

In [19]:
df_no_text.head()

| review_id | customer_id | product_id | product_title | product_category | star_rating | helpful_votes | total_votes | verified_purchase | review_date |
|---|---|---|---|---|---|---|---|---|---|
| R2YVNBBMXD8KVJ | 10349 | B00MWK7BWG | My Favourite Faded Fantasy | Music | 5 | 0 | 0 | True | 2014-12-29 |
| R2K4BOL8MN1TTY | 10629 | B006CHML4I | Seiko 5 Men's Automatic Watch with Black Dial ... | Watches | 4 | 0 | 0 | True | 2013-10-24 |
| R3P40IEALROVCH | 12136 | B00IIFCJX0 | Dexter Season 8 | Digital_Video_Download | 5 | 0 | 0 | True | 2014-05-09 |
| R25XL1WWYRDLA9 | 12268 | B000W7JWUA | The Settlers of Catan Board Game - discontinue... | Toys | 5 | 0 | 0 | True | 2014-09-19 |
| RVTVB9YDXSFYH | 12677 | B005JTAP4S | Peter: A Darkened Fairytale (Vol 1) | Digital_Ebook_Purchase | 5 | 12 | 12 | False | 2013-09-18 |


In [20]:
df_no_text.dtypes

customer_id                  object
product_id                   object
product_title                object
product_category             object
star_rating                   uint8
helpful_votes                uint16
total_votes                  uint16
verified_purchase              bool
review_date          datetime64[ns]
dtype: object