## Amazon Fine Food Reviews Analysis
Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews 

EDA: https://nycdatascience.com/blog/student-works/amazon-fine-foods-visualization/

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.

Number of reviews: 568,454
Number of users: 256,059
Number of products: 74,258
Timespan: Oct 1999 - Oct 2012
Number of Attributes/Columns in data: 10

Attribute Information:

1. Id
2. ProductId - unique identifier for the product
3. UserId - unique identifier for the user
4. ProfileName
5. HelpfulnessNumerator - number of users who found the review helpful
6. HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
7. Score - rating between 1 and 5
8. Time - timestamp for the review
9. Summary - brief summary of the review
10. Text - text of the review

Objective:
Given a review, determine whether it is positive (rating of 4 or 5) or negative (rating of 1 or 2).


[Q] How to determine if a review is positive or negative?

[Ans] We could use the Score/Rating. A rating of 4 or 5 could be considered a positive review. A rating of 1 or 2 could be considered negative. A rating of 3 is neutral and is ignored. This is an approximate, proxy way of determining the polarity (positivity/negativity) of a review.

## Loading the data
The dataset is available in two forms:

1. .csv file
2. SQLite database

To load the data, we use the SQLite database, as it makes it easier to query and visualise the data efficiently.


Since we only want the overall sentiment of each review (positive or negative), we deliberately ignore all reviews with a Score equal to 3. If the Score is above 3, the review is labelled "positive". Otherwise, it is labelled "negative".
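The filtering and labelling rule can also be expressed directly in SQL. The following is a minimal sketch, assuming a hypothetical in-memory table that has only the `Id` and `Score` columns and a few made-up rows; it is not the query used in this notebook.

```python
import sqlite3

# Hypothetical in-memory table with only the two columns needed here.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Reviews (Id INTEGER, Score INTEGER)")
con.executemany(
    "INSERT INTO Reviews (Id, Score) VALUES (?, ?)",
    [(1, 5), (2, 1), (3, 3), (4, 4), (5, 2)],  # made-up rows
)

# Drop neutral reviews (Score = 3) and derive a 0/1 label in a single query.
labelled = con.execute(
    """
    SELECT Id, CASE WHEN Score > 3 THEN 1 ELSE 0 END AS Label
    FROM Reviews
    WHERE Score != 3
    ORDER BY Id
    """
).fetchall()
con.close()

print(labelled)
```

Doing the labelling in the query keeps the neutral rows from ever reaching pandas; the notebook instead maps the scores after loading, which is equally valid.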

In [4]:
#block for all Imports
import matplotlib.pyplot as plt
import sqlite3
import pandas as pd
import numpy as np
import seaborn as sns

In [5]:
#Making connection with Database
con = sqlite3.connect('./database.sqlite')

##fetching Data

filtered_data = pd.read_sql_query("SELECT * from reviews where Score != 3",con)

##Taking only 10000 documents
filtered_data=filtered_data.iloc[0:10000,:]

In [6]:
temp = filtered_data['Score']

In [7]:
##Binarise the score: scores below 3 become negative (0) and scores above 3 become positive (1)

# One way to do that (vectorised comparison):
#filtered_data['Score'] = (filtered_data['Score'] > 3).astype(int)

# Another way


temp = temp.map(lambda x:1 if x>3 else 0)
filtered_data['Score'] = temp
filtered_data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,1,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,0,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,1,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,0,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,1,1350777600,Great taffy,Great taffy at a great price. There was a wid...


# Data De-duplication(pandas drop_duplicates)

In [8]:
#Let's check whether there are any duplicates
###display = pd.read_sql_query(" Select * from reviews r1  where Score != 3 and r1.productid = productid",con)
#The query above does not reveal any duplicates
#Duplicates do exist: a user cannot genuinely post several reviews at the same timestamp with the same text
display = pd.read_sql_query("""
Select userid,score,time,count(*) from reviews r1  
group by userid,score,time having count(*) > 1 """,con)
display

Unnamed: 0,UserId,Score,Time,count(*)
0,#oc-R115TNMSPFT9I7,2,1331510400,2
1,#oc-R11D9D7SHXIJB9,5,1342396800,3
2,#oc-R11DNU2NBKQ23Z,1,1348531200,2
3,#oc-R11O5J5ZVQE25C,5,1346889600,3
4,#oc-R12KPBODL2B5ZD,1,1348617600,2
...,...,...,...,...
70863,AZZJDUEFXYXBM,4,1284163200,4
70864,AZZNK89PXD006,5,1269648000,2
70865,AZZTH6DJ0KSIP,5,1304208000,2
70866,AZZU1VEO8KUXH,5,1317513600,3


In [9]:
display = pd.read_sql_query("""
Select * from reviews r1  
where userid='AZZJDUEFXYXBM' """,con)
display

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,113682,B001O2HBIM,AZZJDUEFXYXBM,J. Lewis,0,0,4,1284163200,Recommend,My 6.5 month son enjoyed this flavor and it he...
1,386128,B000ER3EAM,AZZJDUEFXYXBM,J. Lewis,0,0,4,1284163200,Recommend,My 6.5 month son enjoyed this flavor and it he...
2,446101,B001BM6NIY,AZZJDUEFXYXBM,J. Lewis,0,0,4,1284163200,Recommend,My 6.5 month son enjoyed this flavor and it he...
3,506451,B000ER5D9W,AZZJDUEFXYXBM,J. Lewis,0,0,4,1284163200,Recommend,My 6.5 month son enjoyed this flavor and it he...


In [10]:
#sorted_data=filtered_data.sort_values('ProductId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

In [11]:
print(filtered_data.columns)
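##Pass the columns as a list via subset=; a set has no defined order and is not accepted by recent pandas versions
final = filtered_data.drop_duplicates(subset=['UserId','ProfileName','Time','Text'], keep='first', inplace=False)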

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')


In [12]:
##Size of remaining data
final.shape[0]/filtered_data.shape[0]

final.shape

(9564, 10)
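To make the de-duplication rule concrete, here is a pure-Python sketch of what keeping the first row per (`UserId`, `ProfileName`, `Time`, `Text`) combination means. The helper `drop_duplicate_reviews` and the three rows are hypothetical and only illustrate the idea; the notebook itself relies on pandas `drop_duplicates`.

```python
def drop_duplicate_reviews(rows, keys=("UserId", "ProfileName", "Time", "Text")):
    """Keep the first row for each distinct combination of `keys`."""
    seen = set()
    kept = []
    for row in rows:
        signature = tuple(row[k] for k in keys)
        if signature not in seen:
            seen.add(signature)
            kept.append(row)
    return kept

# Made-up rows: the same user posts identical text for two products at the same time.
reviews = [
    {"Id": 1, "ProductId": "P1", "UserId": "U1", "ProfileName": "A", "Time": 100, "Text": "same text"},
    {"Id": 2, "ProductId": "P2", "UserId": "U1", "ProfileName": "A", "Time": 100, "Text": "same text"},
    {"Id": 3, "ProductId": "P1", "UserId": "U2", "ProfileName": "B", "Time": 100, "Text": "same text"},
]

unique_reviews = drop_duplicate_reviews(reviews)
print([r["Id"] for r in unique_reviews])
```

`ProductId` is deliberately left out of the key, so the same review attached to several product variants collapses to one row, while a different user posting the same text is kept.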

## Checking for Data inconsistencies

In [13]:
display = pd.read_sql_query("""
Select * from reviews r1  
where HelpfulnessNumerator> HelpfulnessDenominator """,con)

##HelpfulnessNumerator should never exceed HelpfulnessDenominator, so these rows are inconsistent
display

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,44737,B001EQ55RW,A2V0I904FH7ABY,Ram,3,2,4,1212883200,Pure cocoa taste with crunchy almonds inside,It was almost a 'love at first bite' - the per...
1,64422,B000MIDROQ,A161DK06JJMCYF,"J. E. Stephens ""Jeanne""",3,1,5,1224892800,Bought This for My Son at College,My son loves spaghetti so I didn't hesitate or...


In [14]:
##There are several ways to deal with this; one of them is to keep only the consistent rows

final = final[final['HelpfulnessNumerator']<=final['HelpfulnessDenominator']]

In [15]:
##Size of remaining data
final.shape[0]/filtered_data.shape[0]

0.9564

In [16]:
##checking the priors
final['Score'].value_counts()

1    7976
0    1588
Name: Score, dtype: int64
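The class counts above show a strong imbalance. As a rough check, the share of the majority class can be computed from those counts; this small sketch only redoes the arithmetic and assumes nothing beyond the two numbers printed above.

```python
# Class counts copied from the value_counts() output above.
counts = {1: 7976, 0: 1588}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
majority_label = max(counts, key=counts.get)

print(total, majority_label, round(shares[majority_label], 3))
```

Roughly 83% of the sampled reviews are positive, so a model that always predicts "positive" would already reach about that accuracy; metrics other than plain accuracy are worth considering.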

# [3].  Text Preprocessing.

Now that we have finished de-duplication, our data requires some preprocessing before we go on with the analysis and build the prediction model.

Hence, in the preprocessing phase we do the following, in the order below:

1. Begin by removing the HTML tags
2. Remove punctuation and a limited set of special characters such as , or . or #
3. Check that the word is made up of English letters and is not alphanumeric
4. Check that the length of the word is greater than 2 (reportedly, there are no 2-letter adjectives)
5. Convert the word to lowercase
6. Remove stopwords
7. Finally, apply Snowball stemming to the word (it was observed to be better than Porter stemming)

After which we collect the words used to describe positive and negative reviews.
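Steps 1 to 6 can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's implementation: a regex stands in for BeautifulSoup, `TOY_STOPWORDS` is a hypothetical, deliberately tiny stopword set, punctuation is replaced by spaces rather than deleted, and stemming (step 7) is left out because it needs NLTK.

```python
import re

# Hypothetical, deliberately tiny stopword set; the notebook uses NLTK's English list.
TOY_STOPWORDS = {"the", "this", "and", "was"}

def clean_review(text):
    """Apply steps 1-6 to one review and return the surviving tokens."""
    text = re.sub(r"<[^>]+>", " ", text)          # 1. strip HTML tags (regex stand-in)
    text = re.sub(r"[^A-Za-z0-9\s]+", " ", text)  # 2. replace punctuation with spaces
    kept = []
    for word in text.split():
        if not word.isalpha():                    # 3. skip alphanumeric tokens such as "2lbs"
            continue
        if len(word) <= 2:                        # 4. skip words of 1 or 2 letters
            continue
        word = word.lower()                       # 5. lowercase
        if word in TOY_STOPWORDS:                 # 6. skip stopwords
            continue
        kept.append(word)
    return kept

tokens = clean_review("This taffy was <br />GREAT, and the 2lbs bag is OK!")
print(tokens)
```

Only `taffy`, `great` and `bag` survive: the tag, the punctuation, the alphanumeric token, the short words and the stopwords are all filtered out before any stemming would take place.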

In [17]:
##Regular expression reference: https://pymotw.com/2/re
from bs4 import BeautifulSoup
import re
##Remove all http links
def remhttp(text): 
    text = re.sub(r'http\S+','',text) ##\S matches any non-whitespace character; + means one or more
    return text

##Remove all html tags
def remhtml(text):
    soup = BeautifulSoup(text, 'lxml')
    text = soup.get_text()
    return text

##Remove all punctuation and special characters
def remchar(text):
    text = re.sub(r'[^A-Za-z0-9\s]+','',text)
    return text

##Remove short words. Note: \w{1,3} drops words of up to 3 characters;
##use \w{1,2} to keep 3-letter words, as the "greater than 2" rule implies
def remles2letter(text):
    text = re.sub(r'\W*\b\w{1,3}\b','',text)
    return text
##convert to lower
def lower(text):
    text = text.lower()
    return text


def allconvert(text):
    return lower(remles2letter(remchar(remhtml(remhttp(text)))))
final['Score'][1]

0
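The pattern used in `remles2letter` is worth checking on a short sentence, because the quantifier decides which words disappear. The sketch below compares the notebook's `\w{1,3}` with a `\w{1,2}` variant; the sample sentence is made up.

```python
import re

sentence = "I do not like the taste"

# Pattern from remles2letter: removes words of 1 to 3 characters.
up_to_three = re.sub(r'\W*\b\w{1,3}\b', '', sentence)

# Variant that removes only words of 1 or 2 characters.
up_to_two = re.sub(r'\W*\b\w{1,2}\b', '', sentence)

print(repr(up_to_three))
print(repr(up_to_two))
```

With `{1,3}` the negation `not` is removed along with `the`, which turns a negative sentence into what looks like a positive one; with `{1,2}` both words stay and would have to be handled by the stopword list instead.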

In [18]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

#import nltk
#nltk.download('stopwords')                                     download stopwords

#stop = set(stopwords.words('english'))

##We can also define the stopword set manually (note: this set is overwritten by the NLTK list further down)
stop= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"])
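sno = SnowballStemmer('english')
##This replaces the manual set above with NLTK's English list, which also contains negations such as 'not' and 'nor'
stop = set(stopwords.words('english'))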
print(stop)

{'have', 'why', 'such', "shouldn't", 'doesn', 'shan', 'then', 'will', 'shouldn', 'i', 'because', 'with', 'so', 'ma', 'll', 'weren', 'me', 'd', "weren't", 'and', 'for', 'yourself', 'after', 'between', 'just', "that'll", 'isn', "won't", 'being', 'can', 'to', "you've", 'than', 'been', 'himself', 'very', 'they', 'those', 'my', 'up', 'of', 'themselves', 'won', 'same', 'their', 'has', 'itself', "she's", 'your', 'ourselves', 'any', "you're", 'over', 'not', "wasn't", 'own', 'against', 'during', 'do', 'where', 'or', 'he', "hadn't", 'we', 'wouldn', 'but', 'by', 'y', "hasn't", 'needn', 'nor', 'she', 'again', "you'd", 'be', 'if', 'about', 'o', 'too', 'few', "you'll", 'how', 'until', 'through', 'them', 'above', 'yourselves', 'each', "should've", 'haven', 'are', 'mustn', 'off', 'whom', 'don', "it's", 've', 'these', 'at', 'on', "mustn't", 'is', 'which', 'hadn', 'that', 'from', 'his', 'under', 'were', "mightn't", 'all', 'ours', 'who', 't', 'the', 'most', 'our', "doesn't", 'in', 'having', 'her', 'might

In [51]:
###CODE to do actual Pre-Processing for ['Text']:

all_positive_words=[]
all_negative_words=[]
final_text=[]
##Pair each review with its label instead of tracking a manual index
for text, score in zip(final['Text'], final['Score']):
    filtered_words=[]
    for w in allconvert(text).split():
        if w not in stop:
            s = sno.stem(w).encode('utf8')   ##bytes, so the join below must use b' ' as well
            filtered_words.append(s)
            if score == 1:
                all_positive_words.append(s)
            elif score == 0:
                all_negative_words.append(s)
    ##Avoid naming this variable `str`: that would shadow the built-in used later for conn.text_factory
    cleaned = b' '.join(filtered_words)
    final_text.append(cleaned)

##Note: b' '.join(...) needs bytes items; joining un-encoded str values raises
##TypeError: sequence item 0: expected a bytes-like object, str found

```python
###CODE to do actual Pre-Processing for ['Summary']:
##Note: this reuses (and overwrites) the same lists as the ['Text'] loop
all_positive_words=[]
all_negative_words=[]
final_text=[]
##Pair each summary with its label instead of tracking a manual index
for text, score in zip(final['Summary'], final['Score']):
    filtered_words=[]
    for w in allconvert(text).split():
        if w not in stop:
            s = sno.stem(w).encode('utf8')   ##bytes, so the join below must use b' ' as well
            filtered_words.append(s)
            if score == 1:
                all_positive_words.append(s)
            elif score == 0:
                all_negative_words.append(s)
    ##Avoid naming this variable `str`: that would shadow the built-in
    cleaned = b' '.join(filtered_words)
    final_text.append(cleaned)
```

In [21]:
final['Cleanedtext'] = final_text
#final['Cleanedsummary'] = final_text

conn = sqlite3.connect('final2.sqlite')
c = conn.cursor()
conn.text_factory = str


##text_factory controls which Python type is returned for the TEXT data type.
##In Python 3 it defaults to str, so the sqlite3 module returns Unicode strings for TEXT.
##If you want bytestrings instead, set it to bytes.


final.to_sql('Reviews',conn,schema=None,if_exists='replace')

# Bag of Words 

In [22]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(final['Cleanedtext'])

In [23]:
type(final_counts)

scipy.sparse.csr.csr_matrix

In [24]:
final_counts.shape

(9564, 19416)
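The shape above is (number of reviews, vocabulary size). What the count matrix contains can be sketched in pure Python on two made-up documents. The helper `bag_of_words` is hypothetical and much simpler than `CountVectorizer`, which by default also lowercases the text, ignores single-character tokens and returns a sparse matrix.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (sorted vocabulary, dense count matrix as a list of rows)."""
    tokenised = [doc.split() for doc in docs]
    vocabulary = sorted({tok for toks in tokenised for tok in toks})
    matrix = []
    for toks in tokenised:
        term_counts = Counter(toks)
        # One column per vocabulary term, in vocabulary order.
        matrix.append([term_counts[term] for term in vocabulary])
    return vocabulary, matrix

vocab, bow = bag_of_words(["tast good", "tast bad bad"])
print(vocab)
print(bow)
```

Each row is one document and each column one vocabulary term, so almost all entries are zero on a real corpus; that is why the notebook's result is stored as a sparse CSR matrix.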

# Unigram, bigram, n-gram

Motivation:
Now that we have the positive and negative words separated, we check whether the order of the words carries useful information.

Words such as 'like' and 'love' occur in both negative and positive reviews, so single words alone do not separate the two classes well; hence we use n-grams.

In [36]:
import nltk

In [37]:
freq_dist_positive = nltk.FreqDist(all_positive_words)
freq_dist_negative = nltk.FreqDist(all_negative_words)
print('Most Common Positive Word:',freq_dist_positive.most_common(20))
print('*'*50)
print('Most Common Negative Word:',freq_dist_negative.most_common(20))

Most Common Positive Word: [(b'like', 3487), (b'tast', 3136), (b'flavor', 2905), (b'good', 2875), (b'love', 2665), (b'great', 2640), (b'coffe', 2335), (b'product', 2104), (b'make', 1860), (b'food', 1657), (b'would', 1361), (b'realli', 1340), (b'tri', 1311), (b'time', 1302), (b'best', 1276), (b'use', 1210), (b'price', 1209), (b'find', 1195), (b'much', 1184), (b'order', 1174)]
**************************************************
Most Common Negative Word: [(b'tast', 906), (b'like', 863), (b'product', 748), (b'flavor', 555), (b'would', 506), (b'good', 406), (b'coffe', 402), (b'food', 371), (b'order', 346), (b'tri', 338), (b'even', 325), (b'dont', 319), (b'time', 315), (b'make', 296), (b'much', 286), (b'drink', 261), (b'realli', 260), (b'review', 258), (b'water', 250), (b'use', 242)]


In [38]:
#count_vect_1 = CountVectorizer(ngram_range=(1,3)) ## gives unigrams, bigrams and trigrams together

##Stopwords like 'not' should not be removed before building n-grams, since they carry the negation

# count_vect = CountVectorizer(ngram_range=(1,2))
# see the CountVectorizer documentation: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
# options such as min_df=10 or max_features=5000 can be set to values of your choice
count_vect_1 = CountVectorizer(ngram_range=(1,2))
final_counts_1 = count_vect_1.fit_transform(final['Cleanedtext'])

In [39]:
final_counts_1.shape

(9564, 213094)
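The jump in the number of columns comes from adding every adjacent word pair to the vocabulary. The following pure-Python sketch, with a hypothetical helper `ngrams` and a made-up review, shows which features `ngram_range=(1,2)` corresponds to and why dropping 'not' beforehand loses information.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams of the token list for n_min <= n <= n_max."""
    features = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features

tokens = "not good tast".split()

with_negation = ngrams(tokens)
without_negation = ngrams([t for t in tokens if t != "not"])

print(with_negation)
print(without_negation)
```

Only the first list contains the bigram `not good`; once 'not' is removed as a stopword, the review is indistinguishable from one that says `good tast`.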

# TF-IDF


In [None]:
#from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf_vect = TfidfVectorizer(ngram_range=(1,2))
final_tf_idf = tf_idf_vect.fit_transform(final['Cleanedtext'].values)

In [None]:
final_tf_idf.shape

In [None]:
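##get_feature_names() was removed in scikit-learn 1.2; on recent versions use get_feature_names_out() instead
features = tf_idf_vect.get_feature_names()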
features[10000:10010]

In [None]:
#source: https://buhrmann.github.io/tfidf-analysis.html
##Function to fetch the top TF-IDF features

def top_tfidf_feats(row, features, top_n=25):
    ''' Get top n tfidf values in row and return them with their corresponding feature names.'''
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats)
    df.columns = ['feature', 'tfidf']
    return df

##toarray() converts the sparse row into a dense array so that it can be sorted and indexed
top_tfidf_feats(final_tf_idf[2,:].toarray()[0],features,25)
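For intuition about the weights being ranked, here is a pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each row. The helper `tfidf` and the three documents are hypothetical, and tokenisation details and n-grams are ignored.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    n_docs = len(docs)
    tokenised = [doc.split() for doc in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenised for term in set(toks))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for toks in tokenised:
        tf = Counter(toks)
        raw = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0  # guard for empty documents
        rows.append({term: v / norm for term, v in raw.items()})
    return rows

rows = tfidf(["tast good", "tast bad", "tast great"])
print(rows[0])
```

In the first document 'good' receives a higher weight than 'tast', because 'tast' appears in every document and therefore says little about any one of them.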


# Word 2 Vec

In [71]:
list_of_sentance=[]
for sentance in final['Text']:
    list_of_sentance.append(sentance.split())

##If you stemmed the text and encoded it with encode('utf-8'), decode it back before passing it to Word2Vec;
##stemming is also not required as preprocessing for Word2Vec

In [72]:
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle
import os

google_word2Vec = False
Train_Model = True

if google_word2Vec:
    if os.path.isfile('GoogleNews-vectors-negative300.bin'):
        ##load_word2vec_format returns KeyedVectors, which are indexed and queried directly
        w2v_model=KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
        print(w2v_model['computer'])
        print(w2v_model.most_similar('worst'))
    else:
        print('File Does not exist')
elif Train_Model:
    ##gensim >= 4.0 renames the `size` argument to `vector_size`
    w2v_model = Word2Vec(list_of_sentance,min_count=5,size=50, workers=4)
    print(w2v_model.wv.most_similar('great'))
    print('='*50)
    print(w2v_model.wv.most_similar('worst'))
        


[('good', 0.8708452582359314), ('wonderful', 0.8256304264068604), ('delicious', 0.792957067489624), ('perfect', 0.7846847772598267), ('nice', 0.7772462368011475), ('Great', 0.7187432646751404), ('healthy', 0.7084371447563171), ('fantastic', 0.6984565258026123), ('you.', 0.6946903467178345), ('quick', 0.6784365177154541)]
[('best.', 0.926323413848877), ('seen.', 0.9088826179504395), ('K-cup.', 0.8918473124504089), ('popcorn', 0.8828972578048706), ('gum', 0.8807246685028076), ('K-cup', 0.862686812877655), ('had!', 0.8614288568496704), ('cheapest', 0.8607341647148132), ('tried!', 0.8582534790039062), ('blending', 0.8580650091171265)]
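Neighbours such as 'best.', 'had!' and 'K-cup.' appear because the raw text was split on whitespace only, so punctuation stays attached to the words and 'Great' and 'great' count as different tokens. A possible tokeniser for Word2Vec input is sketched below; `tokenize_for_w2v` is a hypothetical helper, and the choice to keep inner hyphens and apostrophes is an assumption, not something the notebook prescribes.

```python
import re

def tokenize_for_w2v(review):
    """Lowercase, drop HTML tags, and keep words with optional inner - or '."""
    review = re.sub(r"<[^>]+>", " ", review)
    return re.findall(r"[a-z]+(?:[-'][a-z]+)*", review.lower())

tokens = tokenize_for_w2v("Best K-cup I've had!<br />Great.")
print(tokens)
```

Feeding tokens like these to Word2Vec merges 'best.' with 'best' and 'Great' with 'great', which gives each word more training contexts than the whitespace split does.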


# BOW