
## Amazon Food Reviews

Download link: https://snap.stanford.edu/data/web-FineFoods.html

This dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plaintext review. We also have reviews from all other Amazon categories.

**Dataset Statistics**
- Number of reviews: 568,454
- Number of users: 256,059
- Number of products: 74,258
- Users with > 50 reviews: 260
- Median no. of words per review: 56
- Timespan: Oct 1999-Oct 2012

In [1]:
import re
import os
import string
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from datetime import datetime
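# ENGLISH_STOP_WORDS is exposed by sklearn.feature_extraction.text; newer scikit-learn
# releases no longer ship the public stop_words module, so alias the text module instead.
from sklearn.feature_extraction import text as stop_words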

pd.set_option('display.max_colwidth', -1)
sns.set(style='whitegrid', color_codes=True)

### Part I: Dataframe and Dictionary Setup

In [2]:
food_reviews_filepath = os.path.expanduser('~/Downloads/Stears/finefoods.txt') 

with open(food_reviews_filepath,encoding="latin-1") as f:
    food_reviews_raw = f.read()

In [3]:
fields = ['product/productId', 'review/userId', 'review/profileName', 'review/helpfulness',
           'review/score', 'review/time', 'review/summary', 'review/text']

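# Parse each blank-line-separated record into a {field: value} dict.
# Splitting on the first ': ' only keeps values that themselves contain ': ' intact.
df = pd.DataFrame([{line.split(': ', 1)[0]: line.split(': ', 1)[1]
                    for line in review.split('\n') if line.split(': ', 1)[0] in fields}
                    for review in filter(None, food_reviews_raw.split('\n\n'))]) # Skip the empty string created by the split.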

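# Rename by explicit mapping so the result does not depend on the column order pandas infers.
df = df.rename(columns={'product/productId': 'product_id', 'review/userId': 'user_id',
                        'review/profileName': 'profile_name', 'review/helpfulness': 'helpfulness',
                        'review/score': 'score', 'review/time': 'time',
                        'review/summary': 'summary', 'review/text': 'text'})
reordered_columns = ['product_id', 'user_id', 'profile_name', 'helpfulness', 'score', 'time', 'summary', 'text']
df = df[reordered_columns]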
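
The record layout parsed above can be illustrated without pandas. The following is a minimal sketch, assuming a made-up two-record sample in the same `field: value` layout; `SAMPLE`, `RENAME` and `parse_records` are hypothetical names introduced only for this illustration.

```python
# Two records in the same "field: value" layout as finefoods.txt (made-up sample text).
SAMPLE = (
    "product/productId: B001\n"
    "review/userId: U1\n"
    "review/score: 5.0\n"
    "review/text: Great coffee. Note: arrives fast.\n"
    "\n"
    "product/productId: B002\n"
    "review/userId: U2\n"
    "review/score: 2.0\n"
    "review/text: Too bitter.\n"
)

# Hypothetical subset of the field-to-column mapping used in the notebook.
RENAME = {
    "product/productId": "product_id",
    "review/userId": "user_id",
    "review/score": "score",
    "review/text": "text",
}


def parse_records(raw):
    """Split blank-line-separated records into dicts keyed by the renamed fields."""
    records = []
    for block in filter(None, raw.split("\n\n")):
        record = {}
        for line in block.split("\n"):
            # partition splits on the first ': ' only, so values containing ': ' stay whole.
            key, sep, value = line.partition(": ")
            if sep and key in RENAME:
                record[RENAME[key]] = value
        records.append(record)
    return records


records = parse_records(SAMPLE)
print(records[0]["text"])
```

A list of dicts like `records` can be passed straight to `pd.DataFrame`, and because the keys are already the final column names, no positional renaming is needed.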

In [4]:
# Match up to the closing '>' of the tag only; a greedy \S+ can also swallow one-word link text.
df['norm_text'] = df.text.map(lambda x: re.sub(r'<a href=[^>]+>', '', x))
df['norm_text'] = df.norm_text.map(lambda x: x.replace('</a>', '')) 
df['norm_text'] = df.norm_text.map(lambda x: x.replace('<br />', ' ')) 
df['norm_text'] = df.norm_text.map(lambda x: x.replace('&quot;', '')) 
df['norm_text'] = df.norm_text.map(lambda x: x.replace('&amp;', 'and'))
df['norm_text'] = df.norm_text.map(lambda x: x.lower()) 
df['norm_text'] = df.norm_text.map(lambda x: x.translate(str.maketrans('', '', string.punctuation))) 
df['norm_text'] = df.norm_text.map(lambda x: x.replace('  ', ' '))
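
The same normalization steps can be collected into one function, which makes them easier to test on a single string. This is a sketch rather than the notebook's exact pipeline: `normalize_review` is a hypothetical helper, and it deviates in two labeled ways (it collapses any run of whitespace, and it strips leading and trailing spaces).

```python
import re
import string

# Opening anchor tags, matched up to their closing '>'.
_LINK_OPEN = re.compile(r'<a href=[^>]+>')
# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)
# Assumption: collapse any whitespace run, not only double spaces.
_WHITESPACE = re.compile(r'\s+')


def normalize_review(text):
    """Strip link/HTML residue, lowercase, drop punctuation, tidy whitespace."""
    text = _LINK_OPEN.sub('', text)
    text = text.replace('</a>', '').replace('<br />', ' ')
    text = text.replace('&quot;', '').replace('&amp;', 'and')
    text = text.lower().translate(_PUNCT_TABLE)
    return _WHITESPACE.sub(' ', text).strip()


print(normalize_review('I LOVE it!<br />Tea &amp; biscuits,   &quot;daily&quot;.'))
# prints: i love it tea and biscuits daily
```

With a helper like this, the column could be built in one pass, for example `df['norm_text'] = df.text.map(normalize_review)`.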

In [5]:
df = df.drop_duplicates(subset=['text'], keep=False)
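
With `keep=False`, pandas drops every copy of a duplicated text, not just the repeats, and the surviving rows keep their original index labels. The plain-Python analogue below, on a made-up list, contrasts that with keeping the first occurrence; the variable names are illustrative only.

```python
from collections import Counter

texts = ["good tea", "bad coffee", "good tea", "fine chocolate"]
counts = Counter(texts)

# Analogue of keep=False: drop every row whose text occurs more than once.
drop_all = [(i, t) for i, t in enumerate(texts) if counts[t] == 1]

# Analogue of keep='first': keep only the first occurrence of each text.
seen = set()
keep_first = []
for i, t in enumerate(texts):
    if t not in seen:
        seen.add(t)
        keep_first.append((i, t))

print(drop_all)    # [(1, 'bad coffee'), (3, 'fine chocolate')]
print(keep_first)  # [(0, 'good tea'), (1, 'bad coffee'), (3, 'fine chocolate')]
```

The retained positions (1 and 3 here) play the role of the preserved index labels, which is what allows rows to be selected later with `df.index.isin(...)`.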

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 334656 entries, 0 to 568453
Data columns (total 9 columns):
product_id      334656 non-null object
user_id         334656 non-null object
profile_name    334656 non-null object
helpfulness     334656 non-null object
score           334656 non-null object
time            334656 non-null object
summary         334656 non-null object
text            334656 non-null object
norm_text       334656 non-null object
dtypes: object(9)
memory usage: 25.5+ MB


**Reviews from Target Product Categories**

Load 2,000 reviews each of coffee, tea, chocolate, and pet food.

In [7]:
product_indexes_df = pd.read_csv('product indexes.csv', index_col=0)

In [8]:
coffee_df = df[df.index.isin(product_indexes_df.coffee)]
tea_df = df[df.index.isin(product_indexes_df.tea)]
chocolate_df = df[df.index.isin(product_indexes_df.chocolate)]
petfood_df = df[df.index.isin(product_indexes_df.petfood)]

In [9]:
print(len(coffee_df), len(tea_df), len(chocolate_df), len(petfood_df))

2000 2000 2000 2000


In [None]:
print('HOORAY! I AM ALL SET UP AND READY TO COMPLETE THIS ASSESSMENT BY WEDNESDAY AT 11.59PM GMT!')

#### Step 2

In [10]:
# Draw a random sample of 2,000 reviews
random_df = df.sample(n=2000)


In [11]:
reviews = np.array(df['norm_text'])

In [12]:
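# Write all reviews to a text file, one review per line, so that the last word
# of one review is not fused with the first word of the next.
with open('fasttext-embedding-train.txt', 'w', encoding='utf-8') as target:
    for review in reviews:
        target.write(review.strip() + '\n')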

In [14]:
# Train a skip-gram fastText model on the text file created above.
# Recent releases install the Python module as `fasttext`; older source builds used `fastText`.
import fasttext as ft

model = ft.train_unsupervised('fasttext-embedding-train.txt', model='skipgram')

In [15]:
# Getting the wordvector for 'the' from the model built

model.get_word_vector("the")

array([-0.13178036,  0.15103085, -0.09445549,  0.08752053, -0.10314239,
       -0.03947441, -0.17363147,  0.03591475,  0.12136652, -0.11854628,
        0.0938407 ,  0.0889246 , -0.05785259,  0.03846101, -0.31799814,
        0.0712427 ,  0.07776776, -0.12263791, -0.00141949, -0.20832673,
        0.10824981,  0.28727096, -0.2985838 , -0.23388836, -0.04605399,
       -0.11447003,  0.24057436, -0.4701844 ,  0.0008751 , -0.0604997 ,
       -0.11415392,  0.13534136,  0.21330446, -0.07566531, -0.15537713,
       -0.24764854,  0.08898085, -0.07422888, -0.02914689,  0.4069497 ,
       -0.05612462,  0.09801798, -0.09631327,  0.08725455, -0.07992944,
        0.17693417,  0.06491788, -0.02639729,  0.06743106, -0.05741929,
       -0.17996565,  0.01780565,  0.0575686 ,  0.2427695 ,  0.02689069,
       -0.01879577, -0.1529363 , -0.01951067,  0.08162269,  0.08273382,
        0.07461046,  0.2154608 ,  0.16531011,  0.21018226, -0.00971015,
       -0.1895928 ,  0.05623697,  0.1672068 , -0.26963466, -0.04

In [16]:
# Converting the dataframe for each of the product reviews to list


full_reviews = df['norm_text'].tolist()
coffee_reviews = coffee_df['norm_text'].tolist()
tea_reviews = tea_df['norm_text'].tolist()
chocolate_reviews = chocolate_df['norm_text'].tolist()
petfood_reviews = petfood_df['norm_text'].tolist()
random_reviews = random_df['norm_text'].tolist()


In [22]:
# Build the stop-word set and define a function that splits a review into tokens and drops stop words

stop_list = set(stop_words.ENGLISH_STOP_WORDS)

def remove_stop_word(text):
    return [word for word in text.split() if word not in stop_list]
    

In [23]:
# Remove stop words from each list of reviews

full_reviews = [remove_stop_word(x) for x in full_reviews]
coffee_reviews = [remove_stop_word(x) for x in coffee_reviews]
tea_reviews = [remove_stop_word(x) for x in tea_reviews]
chocolate_reviews = [remove_stop_word(x) for x in chocolate_reviews]
petfood_reviews = [remove_stop_word(x) for x in petfood_reviews]
random_reviews = [remove_stop_word(x) for x in random_reviews]

#### Step 3

In [34]:
# Importing CountVectorizer from sklearn and defining a function to create a dictionary for list passed to it 

from sklearn.feature_extraction.text import CountVectorizer


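def dictionary_creator(review):
    # Flatten the tokenized reviews and fit the vectorizer on the individual tokens.
    # vocabulary_ maps each term to its column index, not to its frequency.
    vectorizer = CountVectorizer()
    vectorizer.fit([x for y in review for x in y])
    return vectorizer.vocabulary_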

In [70]:
# Creating a list of all reviews list and passing them through a loop to create dictionaries for each product review

reviews_list = [full_reviews,coffee_reviews,tea_reviews,chocolate_reviews,petfood_reviews,random_reviews]

full_review_dict = dict()
coffee_review_dict = dict()
tea_review_dict = dict()
chocolate_review_dict = dict()
petfood_review_dict = dict()
random_review_dict = dict()

dict_list = [full_review_dict,coffee_review_dict,tea_review_dict,chocolate_review_dict,petfood_review_dict,random_review_dict]


for review,dic in zip(reviews_list,dict_list):
    
    dictionary = dictionary_creator(review)
    dic.update(dictionary)

In [120]:
# The dictionary values are column indices, so the maximum is the vocabulary size minus one.
max(petfood_review_dict.values())

10775
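
A plain-Python sketch of the difference between an index-style vocabulary and a frequency count, using a made-up token list. It only mimics the shape of `CountVectorizer.vocabulary_` (terms mapped to positions 0 to n-1) and is not the library's implementation.

```python
from collections import Counter

tokens = ["coffee", "strong", "coffee", "aroma", "strong", "coffee"]

# Index-style mapping: each distinct term gets one position from 0 to n_terms - 1.
term_index = {term: i for i, term in enumerate(sorted(set(tokens)))}

# Frequency-style mapping: how often each term occurs.
term_counts = Counter(tokens)

print(max(term_index.values()) + 1)  # number of distinct terms: 3
print(term_counts.most_common(1))    # [('coffee', 3)]
```

Read this way, the 10775 above would be the highest column index, which would put the pet food vocabulary at 10,776 terms.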

### Part II

In [77]:
# Define a function that computes the Jaccard similarity of two collections (here, the keys of two dictionaries)

def jaccard_similarity(query, document):
    intersection = set(query).intersection(set(document))
    union = set(query).union(set(document))
    return len(intersection)/len(union)
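
As a worked example of the measure, the sketch below computes the Jaccard similarity of two tiny, made-up vocabularies. The `jaccard` helper is a hypothetical variant that also guards against two empty inputs, which would otherwise divide by zero.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    # Assumption: two empty collections are given similarity 0.0.
    return len(a & b) / len(union) if union else 0.0


coffee_vocab = {"roast", "bitter", "cup", "aroma"}
tea_vocab = {"cup", "aroma", "steep", "leaf"}

# 2 shared terms (cup, aroma) out of 6 distinct terms overall.
print(jaccard(coffee_vocab, tea_vocab))
```

Passing dictionaries works as well, because iterating a dict yields its keys.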

In [88]:
print('Jaccard similarity of coffee to tea {}'.format(jaccard_similarity(coffee_review_dict,tea_review_dict)))
print('Jaccard similarity of coffee to chocolate {}'.format(jaccard_similarity(coffee_review_dict,chocolate_review_dict)))
print('Jaccard similarity of coffee to petfood {}'.format(jaccard_similarity(coffee_review_dict,petfood_review_dict)))
print('Jaccard similarity of coffee to random {}'.format(jaccard_similarity(coffee_review_dict,random_review_dict)))

Jaccard similarity of coffee to tea 0.3152220784821044
Jaccard similarity of coffee to chocolate 0.3079188757047884
Jaccard similarity of coffee to petfood 0.2528184892897407
Jaccard similarity of coffee to random 0.3081225872528165


In [89]:
print('Jaccard similarity of coffee to coffee {}'.format(jaccard_similarity(coffee_review_dict,coffee_review_dict)))

Jaccard similarity of coffee to coffee 1.0


### Rankings
Similarity to coffee, in descending order (rounded to four decimal places):


|     Review    | Jaccard Similarity |
|:-------------:|:------------------:|
| Coffee (self) |         1.0        |
|      Tea      |       0.3152       |
|     Random    |       0.3081       |
|   Chocolate   |       0.3079       |
|    Petfood    |       0.2528       |

 - The product most similar to coffee is tea.
 - The product least similar to coffee is pet food.

### fastText Similarity

#### Step 1

In [102]:
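# Define a function that builds one fastText vector for all reviews in a dataframe


def get_vector(review_df,name):
    
    # Collect the normalized reviews as an array
    reviews = np.array(review_df['norm_text'])
    
    # Write the reviews to a file named after the category, separated by spaces
    # so that the last word of one review is not fused with the first word of the next
    with open('{}_fasttext-embedding-train.txt'.format(name), 'w', encoding='utf-8') as target:
        for review in reviews:
            target.write(review.strip() + ' ')
    
    # Read the text back with the same encoding it was written in
    with open('{}_fasttext-embedding-train.txt'.format(name), encoding='utf-8') as f:
        text = f.read()
        
    # get_word_vector treats the whole string as a single token, so the result is
    # built from the string's character n-grams rather than from per-word vectors
    return model.get_word_vector(text)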

In [105]:
# Get the word vector for all reviews 

coffee_vector = get_vector(coffee_df,'coffee')
chocolate_vector = get_vector(chocolate_df,'chocolate')
tea_vector = get_vector(tea_df,'tea')
petfood_vector = get_vector(petfood_df,'petfood')

In [108]:
random_vector= get_vector(random_df,'random')

In [106]:
# Calculate the cosine similarity after importing scipy 

from scipy import spatial
def fasttext_similarity(vector1,vector2):
    
    # spatial.distance.cosine returns the cosine distance, not the similarity,
    # so subtract it from 1 to get the similarity
    
    return 1 - spatial.distance.cosine(vector1, vector2)
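
One common alternative for comparing documents is to average the vectors of their individual words and take the cosine of the averages. The sketch below shows that idea with a made-up three-dimensional embedding table; `TOY_VECTORS`, `mean_vector` and `cosine_similarity` are hypothetical names, and this is not the method used in the notebook. With a trained model, the per-word vectors could come from `model.get_word_vector(word)`, and the `fasttext` module also offers `get_sentence_vector` for single-line text.

```python
import math

# Made-up 3-dimensional vectors standing in for trained word embeddings.
TOY_VECTORS = {
    "coffee": [0.9, 0.1, 0.0],
    "tea": [0.8, 0.2, 0.1],
    "cup": [0.7, 0.3, 0.0],
    "dog": [0.0, 0.1, 0.9],
    "bowl": [0.1, 0.3, 0.7],
}


def mean_vector(tokens, vectors):
    """Average the vectors of the tokens that have an embedding."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token has a vector")
    return [sum(component) / len(known) for component in zip(*known)]


def cosine_similarity(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


coffee_doc = mean_vector("coffee cup".split(), TOY_VECTORS)
tea_doc = mean_vector("tea cup".split(), TOY_VECTORS)
pet_doc = mean_vector("dog bowl".split(), TOY_VECTORS)

# The coffee document should sit closer to the tea document than to the pet one.
print(cosine_similarity(coffee_doc, tea_doc) > cosine_similarity(coffee_doc, pet_doc))
```

Averaging per-word vectors keeps each word's learned meaning in play, whereas a single vector for one very long string depends on its character n-grams.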

In [110]:
print('Cosine similarity of coffee to coffee {}'.format(fasttext_similarity(coffee_vector,coffee_vector)))

Cosine similarity of coffee to coffee 1.0


In [109]:
print('Cosine similarity of coffee to tea {}'.format(fasttext_similarity(coffee_vector,tea_vector)))
print('Cosine similarity of coffee to chocolate {}'.format(fasttext_similarity(coffee_vector,chocolate_vector)))
print('Cosine similarity of coffee to petfood {}'.format(fasttext_similarity(coffee_vector,petfood_vector)))
print('Cosine similarity of coffee to random {}'.format(fasttext_similarity(coffee_vector,random_vector)))

Cosine similarity of coffee to tea 0.9665085077285767
Cosine similarity of coffee to chocolate 0.9321482181549072
Cosine similarity of coffee to petfood 0.8971323370933533
Cosine similarity of coffee to random 0.9554436802864075


The Jaccard scores sit in a much lower range (about 0.25 to 0.32) than the cosine scores (about 0.90 to 0.97). As the table below shows, both measures rank tea closest to coffee and pet food furthest, but cosine similarity separates tea from chocolate more clearly (0.9665 vs. 0.9321) than Jaccard similarity does (0.3152 vs. 0.3079).

| Review    | Cosine Similarity (with coffee) | Jaccard Similarity (with coffee) |
|-----------|---------------------------------|----------------------------------|
| Coffee    | 1.0                             | 1.0                              |
| Tea       | 0.9665                          | 0.3152                           |
| Random    | 0.9554                          | 0.3081                           |
| Chocolate | 0.9321                          | 0.3079                           |
| Petfood   | 0.8971                          | 0.2528                           |

The random sample scores highly on both measures, probably because it contains coffee, chocolate and tea reviews.

#### Question 3

This assumption depends on the content of the 400-word review. If it is more detailed than the 100-word review, then it could carry more information.

Some extra normalisation could be applied to the reviews, such as expanding abbreviations and canonicalizing text.
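
A rough sketch of what abbreviation expansion might look like, assuming a deliberately tiny, hypothetical mapping (`EXPANSIONS`) and a hypothetical helper (`expand_abbreviations`). A step like this would need to run before punctuation is stripped, since stripping turns "don't" into "dont".

```python
# Hypothetical, deliberately tiny mapping; a real one would be much larger.
EXPANSIONS = {
    "don't": "do not",
    "can't": "cannot",
    "it's": "it is",
    "w/": "with",
}


def expand_abbreviations(text, expansions=EXPANSIONS):
    """Lowercase the text and replace whole tokens found in the mapping."""
    tokens = text.lower().split()
    return " ".join(expansions.get(token, token) for token in tokens)


print(expand_abbreviations("I don't like it w/ milk"))
# prints: i do not like it with milk
```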

Judging by the frequency of words in the dictionaries, they are all equally indicative.

### Bonus

In [128]:
df['time'] = df['time'].astype(int)

In [127]:
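# Keep reviews timestamped between 1999-10-08 and 2012-10-26 UTC (Unix seconds; both bounds inclusive)
bonus_df = df[df['time'].between(939340800,1351209600)]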
len(bonus_df)

334656
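
The two bounds used in the filter are Unix timestamps in seconds. A small sketch of converting them to calendar dates with the standard library follows; `to_utc_date` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def to_utc_date(unix_seconds):
    """Convert a Unix timestamp in seconds (int or numeric string) to a UTC date."""
    return datetime.fromtimestamp(int(unix_seconds), tz=timezone.utc).date()


print(to_utc_date(939340800))   # 1999-10-08
print(to_utc_date(1351209600))  # 2012-10-26
```

These dates line up with the Oct 1999-Oct 2012 timespan listed in the dataset statistics, which is consistent with the filter keeping all 334,656 rows.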