# Amazon Fine Food Reviews: EDA

## Introduction

<pre>
<b>Dataset statistics</b>
Number of reviews 	568,454
Number of users 	256,059
Number of products 	74,258
Users with > 50 reviews 	260
Median no. of words per review 	56
Timespan 	Oct 1999 - Oct 2012
</pre>
<br />
<b>Data Fields Explanation</b><br />

The Amazon Fine Food Reviews dataset consists of 568,454 food reviews, stored in a single CSV file, Reviews.csv. The columns in the table are:
<pre>
    Id - unique row number
    ProductId - unique identifier for the product
    UserId - unique identifier for the user
    ProfileName
    HelpfulnessNumerator - number of users who found the review helpful
    HelpfulnessDenominator - number of users who indicated whether they found the review helpful
    Score - rating between 1 and 5
    Time - timestamp for the review
    Summary - brief summary of the review
    Text - text of the review
</pre>

<img src="AmazonReview.png">

In [0]:
#from IPython.display import Image
#Image(filename='AmazonReview.png')

## Objective

Analyse the data and plot the graphs needed to check whether the following conclusions hold:
<pre>
a. Positive reviews are very common.
b. Positive reviews are shorter.
c. Longer reviews are more helpful.
d. Despite being more common and shorter, positive reviews are found more helpful.
e. Frequent reviewers are more discerning in their ratings, write longer reviews, and write more helpful reviews
</pre>
<b>Note:</b> This notebook is heavily inspired by <a href="http://blog.nycdatascience.com/student-works/amazon-fine-foods-visualization/">Exploratory visualization of Amazon fine food reviews</a> by Rob Castellano.

## Loading the Data

In [0]:
#Let's import pandas to read the csv file.
import pandas as pd
df = pd.read_csv("../input/Reviews.csv")

In [0]:
#Printing the first 5 rows of our data frame
df.head()

In [0]:
#Observing the labels of each column
print(df.keys())

In [0]:
#Observing the shape of our data frame.
df.shape
# Note: We have 10 features and 568454 data points.

In [0]:
#Let's check for missing values
df.info()
#Observe that there are some missing values in the 'ProfileName' and 'Summary' columns.
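
`df.info()` reports non-null counts, so missing values have to be inferred by subtraction. A more direct view is `isna().sum()`. The sketch below runs on a made-up frame (`toy` is a hypothetical stand-in for `df`), so its counts are illustrative only.

```python
import pandas as pd

# Made-up rows; None marks a missing value, as NaN does after read_csv.
toy = pd.DataFrame({
    "ProfileName": ["ann", None, "cy"],
    "Summary": ["Great", "Bad", None],
    "Text": ["Loved it", "Stale", "Fine"],
})

# Number of missing values per column, keeping only columns that have any.
missing = toy.isna().sum()
missing = missing[missing > 0]
print(missing)
```

On the real data the same two lines would be applied to `df` instead of `toy`.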

In [0]:
df.describe()
#The 25% quantile of 'Score' is 4, so at least 75% of our data belongs to the positive
#class (Score 4 or 5), i.e. we have an imbalanced dataset.

In [0]:
#Let's count the values in 'Score'.
df.Score.value_counts()

## Exploratory Data Analysis

So far we have seen that 5-star reviews make up a large proportion (64%) of all reviews. The next most prevalent rating is 4 stars (14%), followed by 1 star (9%), 3 stars (8%), and finally 2 stars (5%).<br />
Note that we have 10 features and 568,454 data points. There are some missing values in the 'ProfileName' and 'Summary' columns. More than 75% of the data belongs to the positive class (Score = 4 or 5), i.e. the dataset is imbalanced.
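
Percentages like these can be derived from the raw counts with `value_counts(normalize=True)`. The sketch below uses a small made-up `scores` series as a stand-in for `df["Score"]`, so its numbers are illustrative and are not the dataset's.

```python
import pandas as pd

# Toy stand-in for df["Score"]; the real column comes from Reviews.csv.
scores = pd.Series([5, 5, 5, 4, 1, 3, 5, 2, 4, 5], name="Score")

# Share of each rating in percent, ordered from 1 to 5 stars.
share = scores.value_counts(normalize=True).sort_index() * 100

# Share of the positive class (Score 4 or 5) in percent.
positive_share = (scores > 3).mean() * 100

print(share.round(1))
print(round(positive_share, 1))
```

Sorting by index keeps the ratings in star order rather than in order of frequency, which makes the output easier to compare with the count plot.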

In [0]:
#Importing Seaborn and Matplotlib for graphical effects.
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure()
sns.countplot(x='Score', data=df, palette='RdBu')
plt.xlabel('Score (Rating)')
plt.show()

## Creating a new dataframe

In [0]:
#Copying the columns we need from the original dataframe into 'temp_df'.
temp_df = df[['UserId','HelpfulnessNumerator','HelpfulnessDenominator', 'Summary', 'Text','Score']].copy()

#Adding new features to dataframe.
#Sentiment: scores above 3 are positive, below 3 negative, exactly 3 "not defined".
temp_df["Sentiment"] = temp_df["Score"].apply(lambda score: "positive" if score > 3 else \
                                              ("negative" if score < 3 else "not defined"))
#Usefulness: share of helpful votes, bucketed. A 0/0 ratio is NaN and falls through to "useless".
temp_df["Usefulness"] = (temp_df["HelpfulnessNumerator"]/temp_df["HelpfulnessDenominator"]).apply\
(lambda n: ">75%" if n > 0.75 else ("<25%" if n < 0.25 else ("25-75%" if n >= 0.25 and\
                                                                        n <= 0.75 else "useless")))

#Reviews with no votes at all are labelled "useless" (meaning: not voted on).
temp_df.loc[temp_df.HelpfulnessDenominator == 0, 'Usefulness'] = "useless"
# Removing all rows where 'Score' is equal to 3
#temp_df = temp_df[temp_df.Score != 3]
#Let's now observe the shape of our new dataframe.
temp_df.shape
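
The bucketing rule for 'Usefulness' is easier to check when written as a plain function. This is a sketch of the same rule, not the notebook's code: `usefulness_bucket` is a hypothetical helper name, and it handles a zero denominator explicitly instead of relying on the NaN produced by 0/0.

```python
def usefulness_bucket(numerator, denominator):
    """Bucket a review by its share of helpful votes."""
    if denominator == 0:
        return "useless"  # no votes at all
    ratio = numerator / denominator
    if ratio > 0.75:
        return ">75%"
    if ratio < 0.25:
        return "<25%"
    return "25-75%"  # both boundaries, 0.25 and 0.75, land here

# A few hand-picked (numerator, denominator) pairs, including the boundaries.
for votes in [(9, 10), (1, 10), (3, 4), (1, 4), (0, 0)]:
    print(votes, usefulness_bucket(*votes))
```

Such a function could be applied row by row with `temp_df.apply(..., axis=1)`, which is slower than the column-wise version in the cell but easier to unit-test.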

In [0]:
temp_df.describe()

In [0]:
temp_df.info()

In [0]:
#Let's view the first rows where Score=5
temp_df[temp_df.Score == 5].head(10)

## Positive reviews are very common

In [0]:
sns.countplot(x='Sentiment', order=["positive", "negative"], data=temp_df, palette='RdBu')
plt.xlabel('Sentiment')
plt.show()

In [0]:
temp_df.Sentiment.value_counts()

We can therefore conclude that positive reviews far outnumber negative reviews.

## Popular words in Review

A look at the most popular words in positive (4-5 stars) and negative (1-2 stars) reviews shows that both share many popular words, such as "coffee", "taste", "flavor", "price", "good", and "product." The words "good", "great", "love", "favorite", and "find" are indicative of positive reviews, while negative reviews contain words such as "didn't" and "disappointed". These distinguishing negative words, however, appear less frequently than the distinguishing words in positive reviews.

In [0]:
pos = temp_df.loc[temp_df['Sentiment'] == 'positive']
pos = pos[0:25000]

neg = temp_df.loc[temp_df['Sentiment'] == 'negative']
neg = neg[0:25000]

In [0]:
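import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# word_tokenize and the stop-word list need NLTK resources, downloaded once per environment:
# nltk.download('punkt'); nltk.download('stopwords')
# (newer NLTK versions may ask for 'punkt_tab' instead of 'punkt')

# Build the stop-word set once; calling stopwords.words('english') for every token is very slow.
stop_words = set(stopwords.words('english'))

def create_Word_Corpus(temp):
    """Join all non-stop-word tokens of the 'Summary' column into one string."""
    kept = []
    for val in temp["Summary"]:
        text = str(val).lower()
        tokens = nltk.word_tokenize(text)
        kept.extend(word for word in tokens if word not in stop_words)
    return ' '.join(kept)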
        
# Generate a word cloud image
pos_wordcloud = WordCloud(width=900, height=500).generate(create_Word_Corpus(pos))
neg_wordcloud = WordCloud(width=900, height=500).generate(create_Word_Corpus(neg))

In [0]:
# Plot cloud
def plot_Cloud(wordCloud):
    plt.figure( figsize=(20,10), facecolor='w')
    plt.imshow(wordCloud)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()

In [0]:
#Visualizing popular positive words
plot_Cloud(pos_wordcloud)

In [0]:
#Visualizing popular negative words
plot_Cloud(neg_wordcloud)
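
A word cloud shows relative prominence but not counts. As a complement, word frequencies can be tabulated directly. The sketch below uses only the standard library; the `summaries` list, the `top_words` helper, and the tiny `STOP_WORDS` set are made up for illustration (the notebook itself uses NLTK's tokenizer and English stop-word list).

```python
from collections import Counter
import re

# Deliberately tiny, hypothetical stop-word list for the example.
STOP_WORDS = {"the", "a", "is", "and", "not", "it", "this", "for", "my"}

def top_words(texts, n=3):
    """Return the n most frequent non-stop-words across texts."""
    counts = Counter()
    for text in texts:
        # Lowercase, then keep runs of letters and apostrophes as tokens.
        tokens = re.findall(r"[a-z']+", str(text).lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)

summaries = ["Great coffee", "Great taste and great price", "Coffee is not good"]
print(top_words(summaries))
```

Running the same function on the positive and negative subsets separately would give a numeric counterpart to the two clouds.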

## Helpfulness

### How many reviews are helpful?

Among all reviews, almost half (about 50%) are not voted on at all; these carry the label "useless" in the 'Usefulness' column.<br />
Among reviews that are voted on, helpful reviews (more than 75% helpful votes) are the most common.
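
Both shares can be computed directly from the two helpfulness columns. The sketch below runs on a made-up frame (`toy` stands in for `temp_df`), so the printed values are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "HelpfulnessNumerator":   [0, 9, 1, 3, 0, 4],
    "HelpfulnessDenominator": [0, 10, 10, 4, 0, 4],
})

# Share of reviews that received no votes at all.
unvoted_share = (toy["HelpfulnessDenominator"] == 0).mean()

# Among voted reviews, the share with more than 75% helpful votes.
voted = toy[toy["HelpfulnessDenominator"] > 0]
ratio = voted["HelpfulnessNumerator"] / voted["HelpfulnessDenominator"]
helpful_share = (ratio > 0.75).mean()

print(unvoted_share, helpful_share)
```

Filtering to voted reviews before dividing avoids the 0/0 case entirely.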

In [0]:
#Checking the value count for 'Usefulness'
temp_df.Usefulness.value_counts()

In [0]:
sns.countplot(x='Usefulness', order=['useless', '>75%', '25-75%', '<25%'], data=temp_df, palette='RdBu')
plt.xlabel('Usefulness')
plt.show()

### Positive reviews are found more helpful

As the rating becomes more positive, reviews are more often voted helpful (and less often voted unhelpful).

In [0]:
temp_df[temp_df.Score==5].Usefulness.value_counts()

In [0]:
temp_df[temp_df.Score==2].Usefulness.value_counts()

In [0]:
sns.countplot(x='Sentiment', hue='Usefulness', order=["positive", "negative"], \
              hue_order=['>75%', '25-75%', '<25%'], data=temp_df, palette='RdBu')
plt.xlabel('Sentiment')
plt.show()

Therefore, positive reviews are found more helpful.
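
Because positive reviews are far more numerous, raw counts alone can make every positive bar taller. Comparing shares within each sentiment separates the two effects. A minimal sketch with `pd.crosstab(..., normalize="index")`, run on made-up rows (`toy` stands in for the voted-on part of `temp_df`):

```python
import pandas as pd

toy = pd.DataFrame({
    "Sentiment":  ["positive"] * 4 + ["negative"] * 2,
    "Usefulness": [">75%", ">75%", ">75%", "<25%", ">75%", "<25%"],
})

# Each row sums to 1: the distribution of Usefulness within one sentiment.
shares = pd.crosstab(toy["Sentiment"], toy["Usefulness"], normalize="index")
print(shares)
```

In this toy data the positive group has four rows and the negative group two, yet the table compares them on equal footing.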

## Word Count

In [0]:
temp_df["text_word_count"] = temp_df["Text"].apply(lambda text: len(text.split()))

In [0]:
temp_df.head()

In [0]:
temp_df[temp_df.Score==5].text_word_count.median()

In [0]:
temp_df[temp_df.Score==4].text_word_count.median()

In [0]:
temp_df[temp_df.Score==3].text_word_count.median()

In [0]:
temp_df[temp_df.Score==2].text_word_count.median()

In [0]:
temp_df[temp_df.Score==1].text_word_count.median()

In [0]:
sns.boxplot(x='Score',y='text_word_count', data=temp_df, palette='RdBu', showfliers=False)
plt.show()

Observations: 5-star reviews have the lowest median word count (52 words), while 3-star reviews have the highest (70 words).
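
The five per-score medians can also be obtained in one call with `groupby`. The sketch below uses made-up texts (`toy` stands in for `temp_df`), so the medians it prints are not the dataset's.

```python
import pandas as pd

toy = pd.DataFrame({
    "Score": [5, 5, 3, 3, 1],
    "Text": ["Love it", "So good", "It is fine but pricey", "Okay I guess", "Bad"],
})

# Whitespace-separated word count, computed column-wise.
toy["text_word_count"] = toy["Text"].str.split().str.len()

# Median word count per rating, indexed by Score.
medians = toy.groupby("Score")["text_word_count"].median()
print(medians)
```

The result is a Series indexed by score, so a single value can be read with, for example, `medians.loc[5]`.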

### How does word count relate to helpfulness?

The word counts of helpful and unhelpful reviews have similar distributions, with the greatest concentration of reviews at approximately 25 words. However, unhelpful reviews are more concentrated at low word counts, while helpful reviews include more long reviews. Helpful reviews have a higher median word count (67 words) than unhelpful reviews (54 words).

In [0]:
sns.violinplot(x='Usefulness', y='text_word_count', order=[">75%", "<25%"], \
               data=temp_df, palette='RdBu')
plt.ylim(-50, 400)
plt.show()

## Frequency of reviewers

Using user IDs, one can recognize repeat reviewers. Reviewers who have reviewed more than 50 products account for over 5% of all reviews in the dataset. We will call such reviewers frequent reviewers. (The choice of 50 as the cutoff, as opposed to another value, did not seem to have a large impact on the results.) The question is: does the behavior of frequent reviewers differ from that of infrequent reviewers?
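
The share of reviews written by frequent reviewers, and its sensitivity to the cutoff, can be checked with a small helper. This is a sketch under assumptions: `frequent_share` is a hypothetical name, and the `user_ids` series is made up, with toy cutoffs of 1 and 2 in place of 50.

```python
import pandas as pd

user_ids = pd.Series(["u1", "u1", "u1", "u2", "u3", "u3"], name="UserId")

def frequent_share(ids, cutoff):
    """Share of reviews written by users with more than `cutoff` reviews."""
    # Attach to every review the total number of reviews by its author.
    reviews_per_user = ids.map(ids.value_counts())
    return (reviews_per_user > cutoff).mean()

# Toy cutoffs; the notebook uses 50 on the full data.
for cutoff in (1, 2):
    print(cutoff, frequent_share(user_ids, cutoff))
```

Looping over several cutoffs on the real `temp_df["UserId"]` column would show how strongly the share depends on the choice of 50.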

In [0]:
#Number of reviews per user, as a dictionary keyed by UserId.
x = temp_df.UserId.value_counts().to_dict()
print("converted Series to dictionary")

In [0]:
#Label each review by how many reviews its author wrote in total.
temp_df["reviewer_freq"] = temp_df["UserId"].apply(lambda user_id: "Frequent (>50 reviews)" \
                                                                 if x[user_id]>50 else "Not Frequent (1-50)")

In [0]:
temp_df.head()

### Are frequent reviewers more discerning?

The distribution of ratings among frequent reviewers is similar to that of all reviews. However, frequent reviewers give fewer 5-star reviews and fewer 1-star reviews. Frequent reviewers appear to be more discerning in the sense that they give fewer extreme ratings than infrequent reviewers.
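
One way to put a number on "extreme" is the share of 1-star and 5-star ratings within each reviewer group. The sketch below is an illustration on made-up rows (`toy` stands in for `temp_df`, with shortened group labels), not a result from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "reviewer_freq": ["Frequent"] * 4 + ["Not Frequent"] * 4,
    "Score": [5, 4, 4, 3, 5, 5, 1, 4],
})

# Flag extreme ratings (1 or 5 stars), then average the flag within each group.
toy["extreme"] = toy["Score"].isin([1, 5])
extreme_share = toy.groupby("reviewer_freq")["extreme"].mean()
print(extreme_share)
```

A lower share for the frequent group would support the "more discerning" reading of the plots.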

In [0]:
ax = sns.countplot(x='Score', hue='reviewer_freq', data=temp_df, palette='RdBu')
ax.set_xlabel('Score (Rating)')
plt.show()

In [0]:
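#Score counts per reviewer group, sorted by score so both bar charts share the same x-axis order.
y = temp_df[temp_df.reviewer_freq=="Frequent (>50 reviews)"].Score.value_counts().sort_index()
z = temp_df[temp_df.reviewer_freq=="Not Frequent (1-50)"].Score.value_counts().sort_index()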

tot_y = y.sum()

y = (y/tot_y)*100

tot_z = z.sum()

z = (z/tot_z)*100

ax1 = plt.subplot(121)
y.plot(kind="bar",ax=ax1)
plt.xlabel("Score")
plt.ylabel("Percentage")
plt.title("Frequent (>50 reviews) Distribution")

ax2 = plt.subplot(122)
z.plot(kind="bar",ax=ax2)
plt.xlabel("Score")
plt.ylabel("Percentage")
plt.title("Not Frequent (1-50) Distribution")
plt.show()

### Are frequent reviewers more helpful?

The distribution of helpfulness for frequent reviewers is similar to that of all reviews. However, frequent reviewers are more likely to have their reviews voted on and, when voted on, more likely to be voted helpful and less likely to be voted unhelpful.

In [0]:
sns.countplot(x='Usefulness', order=['useless', '>75%', '25-75%', '<25%'], \
              hue='reviewer_freq', data=temp_df, palette='RdBu')
plt.xlabel('Helpfulness')
plt.show()

### Are frequent reviewers more verbose?

The distributions of word counts for frequent and infrequent reviewers show that infrequent reviewers write a large number of reviews with low word counts. On the other hand, the largest concentration of word counts sits higher for frequent reviewers than for infrequent reviewers. Moreover, the median word count for frequent reviewers is higher than the median for infrequent reviewers.

In [0]:
sns.violinplot(x='reviewer_freq', y='text_word_count',  \
               data=temp_df, palette='RdBu')
plt.xlabel('Frequency of Reviewer')
plt.ylim(-50, 400)
plt.show()

## Conclusion

<pre>
a. Positive reviews are very common.
b. Positive reviews are shorter.
c. Longer reviews are more helpful.
d. Despite being more common and shorter, positive reviews are found more helpful.
e. Frequent reviewers are more discerning in their ratings, write longer reviews, and write more helpful reviews
</pre>