# Datasets
http://jmcauley.ucsd.edu/data/amazon/

# Data Cleaning Steps:

The review data is split into several JSON files, one per Amazon product category (books, videos, …). For now, we focus on a single category that is small enough to run on our computers before extending our analysis to the others; the code below loads the `Cell_Phones_and_Accessories` file. Note that we use the 5-core dataset, which means that the original dataset has been reduced so that each of the remaining users and items has at least 5 reviews.

We start by downloading the file. Then we load it into a pandas DataFrame, which gives us a clear overview of the data.

## Import data

In [1]:
from __future__ import absolute_import, division, print_function
from datetime import datetime

from nltk.sentiment.vader import SentimentIntensityAnalyzer
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import gzip

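def parse(path):
    # Each line of the file holds one review record written as a dict literal.
    # NOTE: eval() executes arbitrary code, so only use it on trusted files.
    with gzip.open(path, 'rb') as g:
        for l in g:
            yield eval(l)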

def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')

ds_folder = "../dataspace/amazon-reviews/"
df = getDF(ds_folder + 'reviews_Cell_Phones_and_Accessories_5.json.gz')
df.head()



Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A30TL5EWN6DFXT,120401325X,christina,"[0, 0]",They look good and stick good! I just don't li...,4.0,Looks Good,1400630400,"05 21, 2014"
1,ASY55RVNIL0UD,120401325X,emily l.,"[0, 0]",These stickers work like the review says they ...,5.0,Really great product.,1389657600,"01 14, 2014"
2,A2TMXE2AFO7ONB,120401325X,Erica,"[0, 0]",These are awesome and make my phone look so st...,5.0,LOVE LOVE LOVE,1403740800,"06 26, 2014"
3,AWJ0WZQYMYFQ4,120401325X,JM,"[4, 4]",Item arrived in great time and was in perfect ...,4.0,Cute!,1382313600,"10 21, 2013"
4,ATX7CZYFXI1KW,120401325X,patrice m rogoza,"[2, 3]","awesome! stays on, and looks great. can be use...",5.0,leopard home button sticker for iphone 4s,1359849600,"02 3, 2013"
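
As an alternative to `eval`, the loading step can be sketched with the standard `json` module. This is a minimal sketch, assuming each line of the file is strict JSON (which may not hold for every file in the collection); `load_reviews` and the tiny demo file are hypothetical and only there so the example runs without the real download.

```python
import gzip
import json
import os
import tempfile

import pandas as pd


def load_reviews(path):
    """Hypothetical helper: read a gzipped file with one JSON object per line."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        records = [json.loads(line) for line in handle if line.strip()]
    return pd.DataFrame(records)


# Made-up rows written to a temporary file (not the real dataset).
demo_rows = [
    {"reviewerID": "R1", "asin": "P1", "helpful": [0, 0], "overall": 4.0},
    {"reviewerID": "R2", "asin": "P1", "helpful": [2, 3], "overall": 5.0},
]
demo_path = os.path.join(tempfile.mkdtemp(), "demo_reviews.json.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    for row in demo_rows:
        handle.write(json.dumps(row) + "\n")

demo_df = load_reviews(demo_path)
print(demo_df.head())
```

Unlike `eval`, `json.loads` never executes the content of the file, at the cost of failing on lines that are Python literals rather than JSON.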


Both the reviewer and the product are identified by codes, `reviewerID` and `asin` respectively. We also have the reviewer's display name.
The review text and the corresponding summary are stored as strings.
Each review is characterized by:
- the name and the ID of the reviewer
- the identifier of the product
- a text and a summary
- an overall rating score
- a list of two items giving the number of people who found the review helpful and the total number of people who rated its helpfulness
- the review time, given both in Unix format and as a string

###### Our first step is to change the type of review time to datetime format.

In [2]:
df.reviewTime=pd.to_datetime(df.reviewTime)

In [3]:
df.head(5)

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A30TL5EWN6DFXT,120401325X,christina,"[0, 0]",They look good and stick good! I just don't li...,4.0,Looks Good,1400630400,2014-05-21
1,ASY55RVNIL0UD,120401325X,emily l.,"[0, 0]",These stickers work like the review says they ...,5.0,Really great product.,1389657600,2014-01-14
2,A2TMXE2AFO7ONB,120401325X,Erica,"[0, 0]",These are awesome and make my phone look so st...,5.0,LOVE LOVE LOVE,1403740800,2014-06-26
3,AWJ0WZQYMYFQ4,120401325X,JM,"[4, 4]",Item arrived in great time and was in perfect ...,4.0,Cute!,1382313600,2013-10-21
4,ATX7CZYFXI1KW,120401325X,patrice m rogoza,"[2, 3]","awesome! stays on, and looks great. can be use...",5.0,leopard home button sticker for iphone 4s,1359849600,2013-02-03
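
The conversion can also be written with an explicit format string and cross-checked against `unixReviewTime`. The sketch below uses two rows copied from the output above; the format `"%m %d, %Y"` is an assumption based on how the strings look in this sample.

```python
import pandas as pd

# Two rows copied from the table above.
sample = pd.DataFrame({
    "reviewTime": ["05 21, 2014", "02 3, 2013"],
    "unixReviewTime": [1400630400, 1359849600],
})

# Parse the string column with an explicit (assumed) format.
parsed = pd.to_datetime(sample["reviewTime"], format="%m %d, %Y")

# Interpret the Unix column as seconds since the epoch.
from_unix = pd.to_datetime(sample["unixReviewTime"], unit="s")

# Both columns should describe the same calendar day.
print((parsed == from_unix).all())
```

An explicit format avoids leaving pandas to guess the layout of the date strings.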


### Enrich with new variables

We will now enrich the dataframe with new variables that will give us more meaningful insight into the reviews.

In [4]:
df['year']=list(map(lambda x:x.year,df.reviewTime))
df['length_review']=list(map(lambda x:len(x.split(' ')),df.reviewText))

In [5]:
df.head(5)

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,year,length_review
0,A30TL5EWN6DFXT,120401325X,christina,"[0, 0]",They look good and stick good! I just don't li...,4.0,Looks Good,1400630400,2014-05-21,2014,37
1,ASY55RVNIL0UD,120401325X,emily l.,"[0, 0]",These stickers work like the review says they ...,5.0,Really great product.,1389657600,2014-01-14,2014,32
2,A2TMXE2AFO7ONB,120401325X,Erica,"[0, 0]",These are awesome and make my phone look so st...,5.0,LOVE LOVE LOVE,1403740800,2014-06-26,2014,34
3,AWJ0WZQYMYFQ4,120401325X,JM,"[4, 4]",Item arrived in great time and was in perfect ...,4.0,Cute!,1382313600,2013-10-21,2013,51
4,ATX7CZYFXI1KW,120401325X,patrice m rogoza,"[2, 3]","awesome! stays on, and looks great. can be use...",5.0,leopard home button sticker for iphone 4s,1359849600,2013-02-03,2013,24
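
The same two variables can be computed with pandas' vectorized accessors. This is a sketch on made-up text, not the notebook's data; it also shows that splitting on a single space counts an empty token wherever two spaces occur in a row, whereas splitting on any whitespace does not.

```python
import pandas as pd

# Made-up sample; the first text contains a double space on purpose.
sample = pd.DataFrame({
    "reviewTime": pd.to_datetime(["2014-05-21", "2013-02-03"]),
    "reviewText": ["Looks good,  sticks well", "awesome! stays on"],
})

# Year extracted through the datetime accessor.
sample["year"] = sample["reviewTime"].dt.year

# Same definition as in the cell above: split on a single space.
sample["length_space"] = sample["reviewText"].str.split(" ").str.len()

# Alternative definition: split on any run of whitespace.
sample["length_whitespace"] = sample["reviewText"].str.split().str.len()

print(sample[["year", "length_space", "length_whitespace"]])
```

Which definition of review length is preferable depends on whether repeated spaces should count as words; the notebook uses the single-space split.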


Instead of keeping a list $[a,b]$ giving the number of people who found the review helpful $(a)$ and the total number of people who rated the review as helpful or not $(b)$, we compute a helpfulness score as $\frac{a}{b}$.
If nobody rated the helpfulness of the review (i.e. when $b=0$), we set the score to 0.

In [6]:
l=[]
for i in df.index:
    if df.helpful[i][1]==0:
        l.append(0)
    else:
        l.append(df.helpful[i][0]/df.helpful[i][1])
df['helpfulness']=l

In [7]:
df.head(5)

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,year,length_review,helpfulness
0,A30TL5EWN6DFXT,120401325X,christina,"[0, 0]",They look good and stick good! I just don't li...,4.0,Looks Good,1400630400,2014-05-21,2014,37,0.0
1,ASY55RVNIL0UD,120401325X,emily l.,"[0, 0]",These stickers work like the review says they ...,5.0,Really great product.,1389657600,2014-01-14,2014,32,0.0
2,A2TMXE2AFO7ONB,120401325X,Erica,"[0, 0]",These are awesome and make my phone look so st...,5.0,LOVE LOVE LOVE,1403740800,2014-06-26,2014,34,0.0
3,AWJ0WZQYMYFQ4,120401325X,JM,"[4, 4]",Item arrived in great time and was in perfect ...,4.0,Cute!,1382313600,2013-10-21,2013,51,1.0
4,ATX7CZYFXI1KW,120401325X,patrice m rogoza,"[2, 3]","awesome! stays on, and looks great. can be use...",5.0,leopard home button sticker for iphone 4s,1359849600,2013-02-03,2013,24,0.666667
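
The helpfulness score $\frac{a}{b}$ can also be computed without an explicit loop. The following is a sketch on three `helpful` values taken from the table above; the column names `up` and `total` are hypothetical. It keeps an intermediate `ratio` in which unrated reviews are `NaN`, which could be used instead of 0 if unrated reviews need to be told apart from unhelpful ones.

```python
import pandas as pd

# Three `helpful` values copied from the table above.
sample = pd.DataFrame({"helpful": [[0, 0], [4, 4], [2, 3]]})

# Split the [a, b] pairs into two numeric columns (names are hypothetical).
votes = pd.DataFrame(sample["helpful"].tolist(),
                     columns=["up", "total"], index=sample.index)

# Mask b == 0 so the division yields NaN instead of a division by zero.
ratio = votes["up"] / votes["total"].where(votes["total"] > 0)

# Follow the notebook's convention: unrated reviews get a score of 0.
sample["helpfulness"] = ratio.fillna(0.0)

print(sample)
```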


## Clean missing data
https://machinelearningmastery.com/handle-missing-data-python/

Having missing values in a dataset can cause errors with some machine learning algorithms.

#### Check missing values

In [8]:
df.isnull().sum()

reviewerID           0
asin                 0
reviewerName      3519
helpful              0
reviewText           0
overall              0
summary              0
unixReviewTime       0
reviewTime           0
year                 0
length_review        0
helpfulness          0
dtype: int64

We notice that some of the reviewer names are missing, but since the reviewer IDs are always available, this is not an issue.

#### Method 1: Remove Rows With Missing Values

The simplest strategy for handling missing data is to remove records that contain a missing value.

We can do this by creating a new Pandas DataFrame with the rows containing missing values removed.

Pandas provides the dropna() function, which can be used to drop either rows or columns with missing data. Called on a single column, as below, it returns that column with the missing entries removed; it does not modify df itself:

In [9]:
df['reviewerName'].dropna()

0                                 christina
1                                  emily l.
2                                     Erica
3                                        JM
4                          patrice m rogoza
5                                       RLH
6                               Tyler Evans
7                          Abdullah Albyati
8                                      Adam
9                           Agata Majchrzak
10                            Alex Maslakov
11                                Baja Alan
12                            Olivia ysiak
13                             Sasha Malkin
14                                    tim g
15                                Viktoriya
16          Zonaldo Reefey "Zonaldo Reefey"
17        Alexander Graham Bell Very-Junior
18                               amazonfan1
19                                   Barbie
20            Bernadette Mitchell "Lady Di"
21                                      Bob
...
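
To remove whole rows rather than entries of one column, `dropna()` can be called on the DataFrame with the `subset` argument. This is a sketch on made-up data; the names `names_only` and `complete_rows` are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing reviewer name.
sample = pd.DataFrame({
    "reviewerID": ["R1", "R2", "R3"],
    "reviewerName": ["christina", np.nan, "Erica"],
})

# Series.dropna(): only the column is filtered, the DataFrame is untouched.
names_only = sample["reviewerName"].dropna()

# DataFrame.dropna(subset=...): drops the rows where reviewerName is missing.
complete_rows = sample.dropna(subset=["reviewerName"])

print(len(sample), len(names_only), len(complete_rows))
```

Both calls return new objects; the original `sample` keeps all three rows unless the result is assigned back.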

#### Method 2: Impute Missing Values

Imputing refers to using a model to replace missing values.

There are many options we could consider when replacing a missing value, for example:

- A constant value that has meaning within the domain, such as 0, distinct from all other values.
- A value from another randomly selected record.
- A mean, median or mode value for the column.
- A value estimated by another predictive model.

Any imputing performed on the training dataset will have to be performed on new data in the future when predictions are needed from the finalized model. This needs to be taken into consideration when choosing how to impute the missing values.
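
The point about new data can be illustrated as follows: the fill value is computed once on the training data and then reused unchanged on later data. This is a sketch with made-up numbers (in this notebook `length_review` has no missing values); the median is just one of the options listed above.

```python
import numpy as np
import pandas as pd

# Made-up training data and later "new" data with gaps.
train = pd.DataFrame({"length_review": [37.0, 32.0, np.nan, 51.0]})
new_data = pd.DataFrame({"length_review": [np.nan, 24.0]})

# Learn the fill value from the training data only.
fill_value = train["length_review"].median()

# Apply the same value to both datasets.
train_filled = train["length_review"].fillna(fill_value)
new_filled = new_data["length_review"].fillna(fill_value)

print(fill_value, new_filled.tolist())
```

Recomputing the median on the new data instead would make the preprocessing differ between training and prediction time.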

Pandas provides the fillna() function for replacing missing values with a specific value.

For example, we can use fillna() to replace the missing values in every column with the string "Unknown", as follows. Here only reviewerName contains missing values, so it is the only column affected:

In [10]:
df.isnull().sum()

reviewerID           0
asin                 0
reviewerName      3519
helpful              0
reviewText           0
overall              0
summary              0
unixReviewTime       0
reviewTime           0
year                 0
length_review        0
helpfulness          0
dtype: int64

In [11]:
df.fillna('Unknown',inplace=True)

In [12]:
df.isnull().sum()

reviewerID        0
asin              0
reviewerName      0
helpful           0
reviewText        0
overall           0
summary           0
unixReviewTime    0
reviewTime        0
year              0
length_review     0
helpfulness       0
dtype: int64
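
Calling `fillna` on the whole DataFrame would also write the string into numeric columns if they had gaps. A more targeted variant, sketched here on made-up data, fills only the `reviewerName` column.

```python
import numpy as np
import pandas as pd

# Made-up sample: a missing name and a missing rating.
sample = pd.DataFrame({
    "reviewerName": ["christina", np.nan],
    "overall": [4.0, np.nan],
})

# Fill only the text column; the numeric column is left untouched.
sample["reviewerName"] = sample["reviewerName"].fillna("Unknown")

print(sample.isnull().sum())
```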

## Clean malicious data

## Clean erroneous data

#### Check that overall is between 0 and 5

In [13]:
df.loc[(df['overall']<0) | (df['overall']>5)]

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,year,length_review,helpfulness


#### Check that reviewTime is between May 1996 and July 2014

In [14]:
df.loc[(df['reviewTime']<'1996-05-01') | (df['reviewTime']>'2014-07-31')]

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,year,length_review,helpfulness
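
A range check of this kind selects the rows that fall below the lower bound or above the upper bound. The sketch below expresses both checks with `Series.between` on made-up rows, two of which are deliberately out of range; the bounds are the ones stated in the headings above.

```python
import pandas as pd

# Made-up rows: the second has an invalid rating, the third an invalid date.
sample = pd.DataFrame({
    "overall": [4.0, 7.0, 5.0],
    "reviewTime": pd.to_datetime(["2014-05-21", "2013-10-21", "2015-01-01"]),
})

# between() is inclusive on both ends by default; ~ negates the mask.
bad_rating = ~sample["overall"].between(0, 5)
bad_date = ~sample["reviewTime"].between(pd.Timestamp("1996-05-01"),
                                         pd.Timestamp("2014-07-31"))

# A row is erroneous if either check fails.
flagged = sample[bad_rating | bad_date]
print(flagged)
```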


In [15]:
df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,year,length_review,helpfulness
0,A30TL5EWN6DFXT,120401325X,christina,"[0, 0]",They look good and stick good! I just don't li...,4.0,Looks Good,1400630400,2014-05-21,2014,37,0.0
1,ASY55RVNIL0UD,120401325X,emily l.,"[0, 0]",These stickers work like the review says they ...,5.0,Really great product.,1389657600,2014-01-14,2014,32,0.0
2,A2TMXE2AFO7ONB,120401325X,Erica,"[0, 0]",These are awesome and make my phone look so st...,5.0,LOVE LOVE LOVE,1403740800,2014-06-26,2014,34,0.0
3,AWJ0WZQYMYFQ4,120401325X,JM,"[4, 4]",Item arrived in great time and was in perfect ...,4.0,Cute!,1382313600,2013-10-21,2013,51,1.0
4,ATX7CZYFXI1KW,120401325X,patrice m rogoza,"[2, 3]","awesome! stays on, and looks great. can be use...",5.0,leopard home button sticker for iphone 4s,1359849600,2013-02-03,2013,24,0.666667


## Clean irrelevant data

unixReviewTime carries the same information as reviewTime, so we remove this feature.

In [16]:
del df['unixReviewTime']

In [17]:
df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,reviewTime,year,length_review,helpfulness
0,A30TL5EWN6DFXT,120401325X,christina,"[0, 0]",They look good and stick good! I just don't li...,4.0,Looks Good,2014-05-21,2014,37,0.0
1,ASY55RVNIL0UD,120401325X,emily l.,"[0, 0]",These stickers work like the review says they ...,5.0,Really great product.,2014-01-14,2014,32,0.0
2,A2TMXE2AFO7ONB,120401325X,Erica,"[0, 0]",These are awesome and make my phone look so st...,5.0,LOVE LOVE LOVE,2014-06-26,2014,34,0.0
3,AWJ0WZQYMYFQ4,120401325X,JM,"[4, 4]",Item arrived in great time and was in perfect ...,4.0,Cute!,2013-10-21,2013,51,1.0
4,ATX7CZYFXI1KW,120401325X,patrice m rogoza,"[2, 3]","awesome! stays on, and looks great. can be use...",5.0,leopard home button sticker for iphone 4s,2013-02-03,2013,24,0.666667


## Clean inconsistent data

#### Sentiment analysis by review 

In [22]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# import nltk
# nltk.download('vader_lexicon')

sentimentAnalyzer = SentimentIntensityAnalyzer()
sentence = "This camera worked qute well, I am really happy with its image quality and ease-of-use."
sentimentAnalyzer.polarity_scores(sentence)


{'compound': 0.7264, 'neg': 0.0, 'neu': 0.663, 'pos': 0.337}
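
The `compound` value is a single score between -1 and 1. A common convention, suggested in the VADER documentation, treats scores of at least 0.05 as positive and at most -0.05 as negative. The helper below is a hypothetical sketch of that convention; `label_compound` and its default threshold are not part of this notebook.

```python
def label_compound(compound, threshold=0.05):
    """Hypothetical helper: map a VADER compound score to a coarse label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound score of the example sentence above.
print(label_compound(0.7264))
```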

In [29]:
df['score_review']=list(map(lambda x:sentimentAnalyzer.polarity_scores(x)['compound'],df.reviewText))
df['score_summary']=list(map(lambda x:sentimentAnalyzer.polarity_scores(x)['compound'],df.summary))

In [32]:
df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,reviewTime,year,length_review,helpfulness,score_review,score_summary
0,A30TL5EWN6DFXT,120401325X,christina,"[0, 0]",They look good and stick good! I just don't li...,4.0,Looks Good,2014-05-21,2014,37,0.0,-0.1808,0.4404
1,ASY55RVNIL0UD,120401325X,emily l.,"[0, 0]",These stickers work like the review says they ...,5.0,Really great product.,2014-01-14,2014,32,0.0,0.9403,0.659
2,A2TMXE2AFO7ONB,120401325X,Erica,"[0, 0]",These are awesome and make my phone look so st...,5.0,LOVE LOVE LOVE,2014-06-26,2014,34,0.0,0.8852,0.9274
3,AWJ0WZQYMYFQ4,120401325X,JM,"[4, 4]",Item arrived in great time and was in perfect ...,4.0,Cute!,2013-10-21,2013,51,1.0,0.9625,0.5093
4,ATX7CZYFXI1KW,120401325X,patrice m rogoza,"[2, 3]","awesome! stays on, and looks great. can be use...",5.0,leopard home button sticker for iphone 4s,2013-02-03,2013,24,0.666667,0.902,0.0


In [36]:
# Bad summary but best rating
df.loc[(df['overall']>4) & (df['score_summary']<0)]

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,reviewTime,year,length_review,helpfulness,score_review,score_summary
223,AH0DGDY46OY13,9983798883,"Dobx ""Dobx""","[0, 0]",This charger does the trick. The charger in m...,5.0,Why pay more?,2011-09-18,2011,73,0.000000,0.0000,-0.1027
379,A2PZ6Z9D2CT27P,B00002X29G,Philip E Gregg,"[0, 0]","Great tools, I still use them and I like the q...",5.0,Not bad quality,2013-03-07,2013,19,0.000000,0.8934,-0.5423
478,A1Q3B4Q6W4FX46,B0000AKAJL,"Richard A Drennan ""R Drennan""","[0, 0]",I bought this to try and repair a POS dash cam...,5.0,Lame attempt to repair a camera,2013-04-19,2013,31,0.000000,0.9169,-0.4215
701,A1ATWEANLT57YG,B0002SYC5O,Van Lam,"[0, 0]",My LG HBS-730 that I use for bike rides with n...,5.0,No interference at all,2014-06-26,2014,347,0.000000,0.9836,-0.2960
707,A2SXHZT52Y6LJ5,B0002WRGH6,H. Winterfield,"[0, 0]",I was very happy with my HS820 for the first 3...,5.0,pleased but disappointed,2006-06-25,2006,43,0.000000,0.8122,-0.4939
724,A10TF6TUAV0XZP,B0002WRGHG,"Dave CR ""Dave""","[2, 2]",This headset works fine. It fits perfectly ove...,5.0,Not sure what all the negative reviews are ab...,2007-01-09,2007,38,1.000000,0.8221,-0.3400
780,A2Q8J8IL6V9M5T,B000652QNS,"K. Lechliter ""Walker Boh""","[0, 0]",I've been a loyal BoxWave customer for years. ...,5.0,"Insane Price, BoxWave Quality",2013-05-30,2013,50,0.000000,0.9047,-0.4019
840,A33FS5H3CPDR6D,B0006FLA80,"Miguel Ali ""Film Director & Political Pundit""","[20, 24]","I am in love with this phone!First off, the Pa...",5.0,Don't listen to the complaints - this phone ro...,2005-01-11,2005,174,0.833333,0.9769,-0.4574
842,A2KTK8C503CYEG,B0006GFARG,C. Rosa,"[2, 4]","Solid bluetooth headset, particularly consider...",5.0,hard to beat for the price,2005-05-10,2005,75,0.500000,0.9035,-0.1027
897,A3T3KPW2QE866W,B0006I2E1O,leelee,"[19, 24]",I have been a Nextel user for years and their ...,5.0,Cutting edge phone!!!!!!!,2005-04-16,2005,247,0.791667,0.9857,-0.3956


In [39]:
df.sort_values(by=['score_summary'],ascending=False)

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,reviewTime,year,length_review,helpfulness,score_review,score_summary
48936,A1NI0G1SKRF8S8,B0059DLLJC,jrcolon,"[0, 1]",very good very good very good very good very g...,5.0,very good very good very good very good very g...,2013-01-10,2013,82,0.000000,0.9983,0.9813
35977,A1NI0G1SKRF8S8,B004I58ZVY,jrcolon,"[0, 1]",very good very good very good very good very g...,5.0,very good very good very good very good very g...,2013-01-10,2013,40,0.000000,0.9931,0.9765
10308,A1NI0G1SKRF8S8,B002BSO4ZG,jrcolon,"[0, 0]",very good very good very good very good very g...,5.0,very good very good very good very good very g...,2013-01-10,2013,60,0.000000,0.9971,0.9765
63559,A1NI0G1SKRF8S8,B005XGXFQ2,jrcolon,"[0, 1]",very good very good very good very good very g...,5.0,very good very good very good very good very g...,2013-01-10,2013,80,0.000000,0.9982,0.9765
81289,AC3VHAO5ZVJTF,B007J7IKVI,Rachel Munoz,"[0, 0]",great quality and i love it i will be buyi...,5.0,good quality good good good love love love,2013-01-18,2013,46,0.000000,0.9451,0.9756
171075,A1GU6PSZG155ND,B00DVRIUN8,"Ram Wats ""Ram Wat""","[33, 42]",I have to say that I own a Lumia 928 and loved...,5.0,Best phone ever!!!! LOVE LOVE LOVE it!,2013-08-10,2013,177,0.785714,0.9914,0.9725
72243,A2NUWASJTPN7RO,B006QYI6LO,eire1274,"[2, 2]","Having changed phones, the case for my old Gal...",5.0,LOVE LOVE LOVE LOVE this case!,2012-02-21,2012,156,1.000000,0.8455,0.9720
74450,A3LS4QXRTFJWUI,B0070QML5E,BebaDesire2Shop,"[0, 1]",I feel madly in love with this case from the s...,5.0,LOVE LOVE LOVE LOVE LOVE,2013-03-10,2013,99,0.000000,0.9802,0.9719
162882,A1GU6PSZG155ND,B00COYOAYW,"Ram Wats ""Ram Wat""","[8, 13]",Cannot believe what I got.This is like a dream...,5.0,"WOW, WOW, WOW, Seriuosly, WOW!!!",2013-05-18,2013,41,0.615385,0.9716,0.9683
56344,A3VS7IZZVC6337,B005LKB0IU,Ted Pavlic,"[0, 0]","So far, I've really enjoyed these relatively l...",5.0,"So far so good. Comfortable, good battery life...",2013-05-22,2013,629,0.000000,0.9713,0.9677


In [41]:
# Good summary but low rating
df.loc[(df['overall']<2) & (df['score_summary']>0.8)]

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,reviewTime,year,length_review,helpfulness,score_review,score_summary
20159,A3PLGU40324Y8V,B003TZI39I,NewEra2012,"[0, 0]",Ya $1.90 I got raped. If you want to get rape...,1.0,peace of crapp no good trust me,2012-08-04,2012,83,0.0,-0.8768,0.8176
30830,A3HC55XYJEMQP6,B004877KOK,Robo21,"[9, 11]","I tried 2 different brand new Roadsters, both ...",1.0,Great Device IF You Do Not Care About FM Perfo...,2011-03-12,2011,251,0.818182,0.9605,0.8074
31709,A1SWHCGTIGUIYN,B004A83PE6,KDupuy,"[0, 0]",The price for the 6-Pack Mirror Screen Protect...,1.0,Great Price Not So Great Product,2012-10-09,2012,23,0.0,0.624,0.8481
32549,A33AFYGGVBNTJI,B004BY6AFU,Keysha,"[0, 0]",Omgsh...this was a total waste of money!! I wi...,1.0,"Lol it's so funny, that's its not even funny!!",2011-12-13,2011,146,0.0,0.8306,0.868
35704,A16GBXDS51J019,B004I3F7YO,A. Neilll,"[0, 1]",At first this battery was great. I got it bef...,1.0,"Worked great at first, but now erratic at best",2011-05-04,2011,436,0.0,0.6869,0.8537
35751,ABHQKTNNQDQ4P,B004I49JG0,VampireLiebchen,"[0, 0]",this si a sad pathetic disappointment. I hate ...,1.0,Not even worth 1 star...DOES NOT FIT THE DESIR...,2013-04-01,2013,109,0.0,-0.3855,0.8208
39120,A1ROJ461BOF9BP,B004O7S7Z0,CONSUMER REPORTER,"[1, 1]",The car mount doesn't stay mounted. Keeps fall...,1.0,Great price. Not so great design and quality.,2013-01-14,2013,24,1.0,-0.1531,0.8845
45291,A1K36YU6AFCBN4,B0052NNXDG,Larry Smith,"[0, 0]",The protectors are a bit smaller than the actu...,1.0,Not quite the best fit,2013-02-11,2013,22,0.0,0.0,0.8043
57327,A3TI7VYNVZ8Q2E,B005NC86BU,"Alessandra J. ""AJ""","[0, 0]","I bought this battery to have as a spare, and ...",1.0,"OK, but NOT great after a few months",2013-01-17,2013,57,0.0,0.4939,0.8232
79984,A3EHJ5519LG7TR,B007FHX9OK,"Irene ""Live in his way""","[4, 5]","First, I like it. It stick to the window reall...",1.0,Don't BUY!!! It stick like super glue. I can'...,2013-12-22,2013,75,0.8,0.7473,0.8061
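
The two filters above can be combined into a single flag column that marks reviews whose rating and summary sentiment disagree. This is a sketch on illustrative values (the last row is made up), reusing the thresholds from the two cells above; the column name `inconsistent` is hypothetical.

```python
import pandas as pd

# Illustrative ratings and summary sentiment scores.
sample = pd.DataFrame({
    "overall": [5.0, 1.0, 5.0, 1.0],
    "score_summary": [-0.5423, 0.8481, 0.9274, -0.6],
})

# Same conditions as the two filters above.
high_rating_negative_summary = (sample["overall"] > 4) & (sample["score_summary"] < 0)
low_rating_positive_summary = (sample["overall"] < 2) & (sample["score_summary"] > 0.8)

# A review is flagged if either mismatch applies.
sample["inconsistent"] = high_rating_negative_summary | low_rating_positive_summary

print(sample)
```

As the outputs above suggest, many flagged summaries are mixed or negated phrases ("Not bad quality", "Great Price Not So Great Product") rather than real inconsistencies, so the flag is better read as a list of candidates for manual review.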


## Clean outliers

## Formatting issues

# References

[1] https://github.com/sourabhlal/ada2017/blob/master/Project/Main_Notebook_Amazon.ipynb