# Datasets
http://jmcauley.ucsd.edu/data/amazon/

# Data Cleaning Steps:

The review data is divided into several JSON files, one per Amazon product category (books, videos, …). For now, we focus on a single category, Cell_Phones_and_Accessories, which is small enough to run on our computers, before extending our analysis to the others. We use the 5-core dataset, which means that the original data has been reduced so that each remaining user and item has at least 5 reviews.

We start by downloading the file. Then we store it into a pandas dataframe which will enable us to have a clear overview of the data.

## Import data

In [112]:
from __future__ import absolute_import, division, print_function
from datetime import datetime

from nltk.sentiment.vader import SentimentIntensityAnalyzer
import warnings
warnings.filterwarnings('ignore')

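import ast
import gzip
import pandas as pd

def parse(path):
    # Each line of the file holds one review written as a dict literal.
    # ast.literal_eval only evaluates literals, so unlike eval it cannot run arbitrary code.
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for l in g:
            yield ast.literal_eval(l)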

def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')

ds_folder = "../dataspace/amazon-reviews/"
df = getDF(ds_folder + 'reviews_Cell_Phones_and_Accessories_5.json.gz')
df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A30TL5EWN6DFXT,120401325X,christina,"[0, 0]",They look good and stick good! I just don't li...,4.0,Looks Good,1400630400,"05 21, 2014"
1,ASY55RVNIL0UD,120401325X,emily l.,"[0, 0]",These stickers work like the review says they ...,5.0,Really great product.,1389657600,"01 14, 2014"
2,A2TMXE2AFO7ONB,120401325X,Erica,"[0, 0]",These are awesome and make my phone look so st...,5.0,LOVE LOVE LOVE,1403740800,"06 26, 2014"
3,AWJ0WZQYMYFQ4,120401325X,JM,"[4, 4]",Item arrived in great time and was in perfect ...,4.0,Cute!,1382313600,"10 21, 2013"
4,ATX7CZYFXI1KW,120401325X,patrice m rogoza,"[2, 3]","awesome! stays on, and looks great. can be use...",5.0,leopard home button sticker for iphone 4s,1359849600,"02 3, 2013"


Both the reviewer and the product are identified by codes, `reviewerID` and `asin` respectively. We also have the actual name of the reviewer.
The text of each review and its summary are stored as strings.
Each review is characterized by:
- the name and the ID of the reviewer
- the identifier of the product
- a text and a summary
- an overall rating score
- a list of two items giving the number of people who found the review helpful and the total number of people who rated its helpfulness
- the review time, given both in Unix format and as a string

###### Our first step is to change the type of review time to datetime format.

In [113]:
df.reviewTime=pd.to_datetime(df.reviewTime)
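
The call above lets pandas infer the layout of strings such as "05 21, 2014". As a minimal sketch, assuming the month-day-year layout seen in the preview holds for every row, the format can be stated explicitly and cross-checked against `unixReviewTime`. The two-row `sample` frame is illustrative and only reuses values shown in the table above.

```python
import pandas as pd

# Illustrative sample reusing two rows from the preview above.
sample = pd.DataFrame({
    "unixReviewTime": [1400630400, 1359849600],
    "reviewTime": ["05 21, 2014", "02 3, 2013"],
})

# Parse the string column with an explicit month-day-year format.
parsed = pd.to_datetime(sample["reviewTime"], format="%m %d, %Y")

# Convert the Unix timestamps (seconds since the epoch) for comparison.
from_unix = pd.to_datetime(sample["unixReviewTime"], unit="s")

# True when both representations describe the same calendar day.
dates_agree = bool((parsed == from_unix).all())
print(dates_agree)
```

If the two columns agree on the full dataset as well, one of them is redundant and can be dropped later.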

In [114]:
df.head(5)

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A30TL5EWN6DFXT,120401325X,christina,"[0, 0]",They look good and stick good! I just don't li...,4.0,Looks Good,1400630400,2014-05-21
1,ASY55RVNIL0UD,120401325X,emily l.,"[0, 0]",These stickers work like the review says they ...,5.0,Really great product.,1389657600,2014-01-14
2,A2TMXE2AFO7ONB,120401325X,Erica,"[0, 0]",These are awesome and make my phone look so st...,5.0,LOVE LOVE LOVE,1403740800,2014-06-26
3,AWJ0WZQYMYFQ4,120401325X,JM,"[4, 4]",Item arrived in great time and was in perfect ...,4.0,Cute!,1382313600,2013-10-21
4,ATX7CZYFXI1KW,120401325X,patrice m rogoza,"[2, 3]","awesome! stays on, and looks great. can be use...",5.0,leopard home button sticker for iphone 4s,1359849600,2013-02-03


### Enrich with new variables

We will now enrich the dataframe with new variables that will give us more meaningful insight into the reviews.

In [115]:
# Enrich with new columns "year" and "length_review" (number of space-separated tokens)
df['year']=df.reviewTime.dt.year
df['length_review']=df.reviewText.str.split(' ').str.len()
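
Note that `split(' ')` splits on every single space, so consecutive spaces produce empty strings that are counted as words. A small sketch of the difference, using made-up review strings; whether the whitespace-agnostic count is preferable here is a judgment call.

```python
import pandas as pd

# Made-up review texts; the second one contains a double space.
texts = pd.Series(["works as expected", "great  case fits well"])

# Splitting on a single space counts the empty string between two spaces.
count_single_space = texts.str.split(" ").str.len()

# Splitting on any run of whitespace ignores repeated spaces.
count_whitespace = texts.str.split().str.len()

print(count_single_space.tolist())
print(count_whitespace.tolist())
```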

In [116]:
df.head(5)

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,year,length_review
0,A30TL5EWN6DFXT,120401325X,christina,"[0, 0]",They look good and stick good! I just don't li...,4.0,Looks Good,1400630400,2014-05-21,2014,37
1,ASY55RVNIL0UD,120401325X,emily l.,"[0, 0]",These stickers work like the review says they ...,5.0,Really great product.,1389657600,2014-01-14,2014,32
2,A2TMXE2AFO7ONB,120401325X,Erica,"[0, 0]",These are awesome and make my phone look so st...,5.0,LOVE LOVE LOVE,1403740800,2014-06-26,2014,34
3,AWJ0WZQYMYFQ4,120401325X,JM,"[4, 4]",Item arrived in great time and was in perfect ...,4.0,Cute!,1382313600,2013-10-21,2013,51
4,ATX7CZYFXI1KW,120401325X,patrice m rogoza,"[2, 3]","awesome! stays on, and looks great. can be use...",5.0,leopard home button sticker for iphone 4s,1359849600,2013-02-03,2013,24


Instead of keeping a list $[a,b]$ giving the number of people who found the review helpful $(a)$ and the total number of people who rated this review as helpful or not $(b)$, we compute a helpfulness score as $\frac{a}{b}$.
If nobody rated the helpfulness of the review (i.e., when $b=0$), we set the score to 0.

In [117]:
# helpful is a pair [a, b]; the score is a/b, or 0 when nobody voted (b == 0)
df['helpfulness']=[a/b if b else 0 for a,b in df.helpful]
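
Assigning 0 to reviews nobody voted on makes them indistinguishable from reviews that every voter found unhelpful. As a hedged alternative, the sketch below keeps unrated reviews as missing values (`NaN`). The `helpfulness_ratio` helper is a hypothetical name, and the sample votes reuse the pairs seen in the preview plus one made-up `[0, 3]` pair.

```python
import numpy as np
import pandas as pd

def helpfulness_ratio(votes):
    """Return helpful/total for a [helpful, total] pair, NaN when total is 0."""
    helpful, total = votes
    return helpful / total if total else np.nan

sample = pd.DataFrame({"helpful": [[0, 0], [4, 4], [2, 3], [0, 3]]})
sample["helpfulness"] = sample["helpful"].apply(helpfulness_ratio)
print(sample)
```

With this variant, the first row stays missing while the last row gets a genuine score of 0.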

In [118]:
df.head(5)

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,year,length_review,helpfulness
0,A30TL5EWN6DFXT,120401325X,christina,"[0, 0]",They look good and stick good! I just don't li...,4.0,Looks Good,1400630400,2014-05-21,2014,37,0.0
1,ASY55RVNIL0UD,120401325X,emily l.,"[0, 0]",These stickers work like the review says they ...,5.0,Really great product.,1389657600,2014-01-14,2014,32,0.0
2,A2TMXE2AFO7ONB,120401325X,Erica,"[0, 0]",These are awesome and make my phone look so st...,5.0,LOVE LOVE LOVE,1403740800,2014-06-26,2014,34,0.0
3,AWJ0WZQYMYFQ4,120401325X,JM,"[4, 4]",Item arrived in great time and was in perfect ...,4.0,Cute!,1382313600,2013-10-21,2013,51,1.0
4,ATX7CZYFXI1KW,120401325X,patrice m rogoza,"[2, 3]","awesome! stays on, and looks great. can be use...",5.0,leopard home button sticker for iphone 4s,1359849600,2013-02-03,2013,24,0.666667


## Clean missing data
https://machinelearningmastery.com/handle-missing-data-python/

Having missing values in a dataset can cause errors with some machine learning algorithms.

#### Check missing values

In [119]:
df.isnull().sum()

reviewerID           0
asin                 0
reviewerName      3519
helpful              0
reviewText           0
overall              0
summary              0
unixReviewTime       0
reviewTime           0
year                 0
length_review        0
helpfulness          0
dtype: int64

We notice that some of the reviewer names are missing, but since the reviewer IDs are always available, this is not an issue.
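
To judge how much the gaps matter, the counts can be turned into shares per column. A minimal sketch on a made-up frame; the real `df` would be used the same way.

```python
import pandas as pd

# Made-up frame with two missing reviewer names out of four rows.
toy = pd.DataFrame({
    "reviewerID": ["A1", "A2", "A3", "A4"],
    "reviewerName": ["christina", None, "Erica", None],
})

# isnull() yields booleans, so the column mean is the fraction of missing values.
missing_share = toy.isnull().mean()
print(missing_share)
```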

#### Method 1: Remove Rows With Missing Values

The simplest strategy for handling missing data is to remove records that contain a missing value.

We can do this by creating a new Pandas DataFrame with the rows containing missing values removed.

Pandas provides the dropna() function, which can drop either rows or columns with missing data. Below we call it on the reviewerName column only, which returns that column without its missing entries:

In [120]:
df['reviewerName'].dropna()

0                                 christina
1                                  emily l.
2                                     Erica
3                                        JM
4                          patrice m rogoza
5                                       RLH
6                               Tyler Evans
7                          Abdullah Albyati
8                                      Adam
9                           Agata Majchrzak
10                            Alex Maslakov
11                                Baja Alan
12                            Olivia ysiak
13                             Sasha Malkin
14                                    tim g
15                                Viktoriya
16          Zonaldo Reefey "Zonaldo Reefey"
17        Alexander Graham Bell Very-Junior
18                               amazonfan1
19                                   Barbie
20            Bernadette Mitchell "Lady Di"
21                                      Bob
22                              
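
Note that calling `dropna()` on a single column does not modify `df`. To drop whole rows based on selected columns, `DataFrame.dropna` accepts a `subset` argument. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Made-up frame with one missing reviewer name.
toy = pd.DataFrame({
    "reviewerName": ["christina", None, "Erica"],
    "overall": [4.0, 5.0, 5.0],
})

# Keep only rows where reviewerName is present; other columns are not checked.
complete = toy.dropna(subset=["reviewerName"])

print(len(toy), len(complete))
```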

#### Method 2: Impute Missing Values

Imputing refers to using a model to replace missing values.

There are many options we could consider when replacing a missing value, for example:

- A constant value that has meaning within the domain, such as 0, distinct from all other values.
- A value from another randomly selected record.
- A mean, median or mode value for the column.
- A value estimated by another predictive model.

Any imputing performed on the training dataset will have to be performed on new data in the future when predictions are needed from the finalized model. This needs to be taken into consideration when choosing how to impute the missing values.
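
This constraint can be sketched as follows: the fill value is computed once on the training data and then reused unchanged on new data. The `train` and `new` frames and the choice of the median are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Made-up training data and new data, each with one missing value.
train = pd.DataFrame({"length_review": [37.0, 32.0, np.nan, 51.0]})
new = pd.DataFrame({"length_review": [np.nan, 24.0]})

# Learn the fill value from the training data only.
fill_value = train["length_review"].median()

# Apply the same value to both sets so that they are treated consistently.
train_filled = train.fillna({"length_review": fill_value})
new_filled = new.fillna({"length_review": fill_value})

print(fill_value)
print(new_filled)
```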

Pandas provides the fillna() function for replacing missing values with a specific value.

For example, we can use fillna() to replace the missing values in every column with the string "Unknown", as follows:

In [121]:
df.isnull().sum()

reviewerID           0
asin                 0
reviewerName      3519
helpful              0
reviewText           0
overall              0
summary              0
unixReviewTime       0
reviewTime           0
year                 0
length_review        0
helpfulness          0
dtype: int64

In [122]:
df.fillna('Unknown',inplace=True)

In [123]:
df.isnull().sum()

reviewerID        0
asin              0
reviewerName      0
helpful           0
reviewText        0
overall           0
summary           0
unixReviewTime    0
reviewTime        0
year              0
length_review     0
helpfulness       0
dtype: int64
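
Because `fillna('Unknown')` is applied to the whole dataframe, a missing value in a numeric or date column would also become the string "Unknown". Here only `reviewerName` has gaps, so the result is the same, but a per-column mapping makes the intent explicit. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

# Made-up frame with a missing name and a missing rating.
toy = pd.DataFrame({
    "reviewerName": ["christina", None],
    "overall": [4.0, np.nan],
})

# A dict restricts the replacement to the listed columns.
filled = toy.fillna({"reviewerName": "Unknown"})

print(filled)
```

The missing rating is left untouched, so the `overall` column keeps its numeric type.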

## Clean malicious data

## Clean erroneous data

#### Check that overall is between 0 and 5

In [124]:
# A rating is out of range when it is below 0 OR above 5
df.loc[(df['overall']<0) | (df['overall']>5)]

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,year,length_review,helpfulness


#### Check that reviewTime is between May 1996 and July 2014

In [125]:
# A date is out of range when it is before May 1996 OR after July 2014
df.loc[(df['reviewTime']<'1996-05-01') | (df['reviewTime']>'2014-07-31')]

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,year,length_review,helpfulness
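
A row is out of range when it falls below the lower bound or above the upper bound, so such a check needs "or" (`|`) between the two conditions. An equivalent formulation negates `Series.between`, which is inclusive on both ends by default. A minimal sketch with made-up values:

```python
import pandas as pd

# Made-up rows: the second and third ratings and the third date are out of range.
toy = pd.DataFrame({
    "overall": [4.0, 7.0, -1.0],
    "reviewTime": pd.to_datetime(["2014-05-21", "2013-10-21", "2015-01-01"]),
})

# Rows whose rating lies outside the closed interval [0, 5].
bad_rating = toy.loc[~toy["overall"].between(0, 5)]

# Rows whose date lies outside the collection period.
start, end = pd.Timestamp("1996-05-01"), pd.Timestamp("2014-07-31")
bad_date = toy.loc[~toy["reviewTime"].between(start, end)]

print(bad_rating.index.tolist(), bad_date.index.tolist())
```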


In [126]:
df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,year,length_review,helpfulness
0,A30TL5EWN6DFXT,120401325X,christina,"[0, 0]",They look good and stick good! I just don't li...,4.0,Looks Good,1400630400,2014-05-21,2014,37,0.0
1,ASY55RVNIL0UD,120401325X,emily l.,"[0, 0]",These stickers work like the review says they ...,5.0,Really great product.,1389657600,2014-01-14,2014,32,0.0
2,A2TMXE2AFO7ONB,120401325X,Erica,"[0, 0]",These are awesome and make my phone look so st...,5.0,LOVE LOVE LOVE,1403740800,2014-06-26,2014,34,0.0
3,AWJ0WZQYMYFQ4,120401325X,JM,"[4, 4]",Item arrived in great time and was in perfect ...,4.0,Cute!,1382313600,2013-10-21,2013,51,1.0
4,ATX7CZYFXI1KW,120401325X,patrice m rogoza,"[2, 3]","awesome! stays on, and looks great. can be use...",5.0,leopard home button sticker for iphone 4s,1359849600,2013-02-03,2013,24,0.666667


## Clean irrelevant data

In [129]:
# unixReviewTime duplicates reviewTime; errors='ignore' keeps the cell safe to re-run
df=df.drop(columns=['unixReviewTime'],errors='ignore')

In [130]:
df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,reviewTime,year,length_review,helpfulness
0,A30TL5EWN6DFXT,120401325X,christina,"[0, 0]",They look good and stick good! I just don't li...,4.0,Looks Good,2014-05-21,2014,37,0.0
1,ASY55RVNIL0UD,120401325X,emily l.,"[0, 0]",These stickers work like the review says they ...,5.0,Really great product.,2014-01-14,2014,32,0.0
2,A2TMXE2AFO7ONB,120401325X,Erica,"[0, 0]",These are awesome and make my phone look so st...,5.0,LOVE LOVE LOVE,2014-06-26,2014,34,0.0
3,AWJ0WZQYMYFQ4,120401325X,JM,"[4, 4]",Item arrived in great time and was in perfect ...,4.0,Cute!,2013-10-21,2013,51,1.0
4,ATX7CZYFXI1KW,120401325X,patrice m rogoza,"[2, 3]","awesome! stays on, and looks great. can be use...",5.0,leopard home button sticker for iphone 4s,2013-02-03,2013,24,0.666667


## Clean inconsistent data

#### Sentiment analysis by review 

In [148]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
nltk.download('vader_lexicon')

sentimentAnalyzer = SentimentIntensityAnalyzer()
sentence = "This camera worked qute well, I am really happy with its image quality and ease-of-use."
sentimentAnalyzer.polarity_scores(sentence)['compound']


[nltk_data] Downloading package vader_lexicon to C:\Users\VAN
[nltk_data]     VAN\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


0.7264

In [149]:
# Compound sentiment score of each review text
l_sentiment=[]
for i in df.index:
    l_sentiment.append(sentimentAnalyzer.polarity_scores(df.reviewText[i])['compound'])
df['score_sentiment_com']=l_sentiment

In [147]:
df['score_sentiment']=list(map(lambda x:sentimentAnalyzer.polarity_scores(x),df.reviewText))

KeyboardInterrupt: 
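
Once a compound score is available for every review, it can be compared with the star rating to flag reviews whose text and rating disagree. The sketch below assumes a precomputed `score_sentiment_com` column (made-up values here) and uses the ±0.05 cut-offs conventionally applied to VADER compound scores. The rating cut-offs (4 and above, 2 and below) are illustrative assumptions.

```python
import pandas as pd

# Made-up ratings and compound scores; the third row pairs 5 stars with negative text.
toy = pd.DataFrame({
    "overall": [5.0, 1.0, 5.0, 3.0],
    "score_sentiment_com": [0.73, -0.60, -0.45, 0.10],
})

# High rating but negative text, or low rating but positive text.
high_but_negative = (toy["overall"] >= 4) & (toy["score_sentiment_com"] <= -0.05)
low_but_positive = (toy["overall"] <= 2) & (toy["score_sentiment_com"] >= 0.05)

toy["inconsistent"] = high_but_negative | low_but_positive
print(toy)
```

Flagged rows are candidates for manual inspection rather than automatic removal, since sarcasm or mixed opinions can also produce such mismatches.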


## Clean outliers

## Formatting issues

# References

[1] https://github.com/sourabhlal/ada2017/blob/master/Project/Main_Notebook_Amazon.ipynb