# Sentiment prediction from Amazon reviews

## About the Dataset

This dataset consists of reviews of fine foods from Amazon. The data span more than 10 years, covering all ~500,000 reviews up to October 2012. Each review includes product and user information, a rating, and a plain-text review. It also includes reviews from all other Amazon categories.

Contents

database.sqlite: Contains the table 'Reviews'

Data includes:

- Reviews from Oct 1999 - Oct 2012
- 568,454 reviews
- 256,059 users
- 74,258 products
- 260 users with > 50 reviews

In [60]:
#Importing Libraries

import sqlite3
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
import re
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from wordcloud import WordCloud
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle
import gensim
from tqdm import tqdm


In [31]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
raw_data = pd.read_csv(r"C:\Anand\Projects_GWU\Sentiment_Analysis_amazon_product_reviews\data\Reviews.csv")

In [32]:
raw_data.head(10)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...
5,6,B006K2ZZ7K,ADT0SRK1MGOEU,Twoapennything,0,0,4,1342051200,Nice Taffy,I got a wild hair for taffy and ordered this f...
6,7,B006K2ZZ7K,A1SP2KVKFXXRU1,David C. Sullivan,0,0,5,1340150400,Great! Just as good as the expensive brands!,This saltwater taffy had great flavors and was...
7,8,B006K2ZZ7K,A3JRGQVEQN31IQ,Pamela G. Williams,0,0,5,1336003200,"Wonderful, tasty taffy",This taffy is so good. It is very soft and ch...
8,9,B000E7L2R4,A1MZYO9TZK0BBI,R. James,1,1,5,1322006400,Yay Barley,Right now I'm mostly just sprouting this so my...
9,10,B00171APVA,A21BT40VZCCYT4,Carol A. Reed,0,0,5,1351209600,Healthy Dog Food,This is a very healthy dog food. Good for thei...


In [33]:
print(raw_data["Text"].head(10))

0    I have bought several of the Vitality canned d...
1    Product arrived labeled as Jumbo Salted Peanut...
2    This is a confection that has been around a fe...
3    If you are looking for the secret ingredient i...
4    Great taffy at a great price.  There was a wid...
5    I got a wild hair for taffy and ordered this f...
6    This saltwater taffy had great flavors and was...
7    This taffy is so good.  It is very soft and ch...
8    Right now I'm mostly just sprouting this so my...
9    This is a very healthy dog food. Good for thei...
Name: Text, dtype: object


In [34]:
raw_data.shape

(568454, 10)

In [35]:
# Drop neutral reviews (Score == 3). The .copy() avoids a SettingWithCopyWarning
# when new columns are added to the filtered frame later.
value_to_drop = 3
raw_data = raw_data[raw_data['Score'] != value_to_drop].copy()

In [36]:
# Shape after dropping rows with a score of 3
print(raw_data.shape)

# The remaining unique values in the Score column should be 1, 2, 4, and 5.
print(raw_data.Score.unique())

(525814, 10)
[5 1 4 2]


In [37]:
# Label scores 4 and 5 as 'Positive' and scores 1 and 2 as 'Negative' (score 3 was dropped above)
def assign_values(value):
    if value < 3:
        return 'Negative'
    else:
        return 'Positive'

raw_data['Review'] = raw_data['Score'].apply(assign_values)

In [38]:
raw_data.head(5)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,Review
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...,Positive
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,Negative
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...,Positive
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...,Negative
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...,Positive


In [39]:
# Check whether any review text appears more than once
has_duplicates = raw_data['Text'].duplicated().any()
print(has_duplicates)

True


In [40]:
# Drop duplicated Reviews
raw_data = raw_data.drop_duplicates(subset='Text', keep='first')

# Check the shape
print(raw_data.shape)

(363836, 11)


In [41]:
# Keep only rows where HelpfulnessNumerator <= HelpfulnessDenominator; rows where the numerator is greater are invalid and are dropped
raw_data=raw_data[raw_data.HelpfulnessNumerator<=raw_data.HelpfulnessDenominator]
print(raw_data.shape)

(363834, 11)


The dataset shrank from 568,454 to 363,834 rows. Three filters account for the drop: 42,640 neutral reviews (score 3) were removed, 161,978 rows with duplicate review text were dropped, and 2 rows were dropped because the number of people who found a review helpful cannot be greater than the number of people who voted on it.
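
As a quick sanity check, the row counts printed by the cells above can be reconciled step by step. This is only a sketch: the numbers are copied from the printed shapes, and the variable names are illustrative rather than part of the notebook.

```python
# Row counts copied from the shapes printed above.
rows_loaded = 568454           # after reading the CSV
rows_without_neutral = 525814  # after dropping Score == 3
rows_deduplicated = 363836     # after drop_duplicates on 'Text'
rows_final = 363834            # after the helpfulness consistency filter

dropped_neutral = rows_loaded - rows_without_neutral
dropped_duplicates = rows_without_neutral - rows_deduplicated
dropped_inconsistent = rows_deduplicated - rows_final

print(dropped_neutral, dropped_duplicates, dropped_inconsistent)

# The three steps together account for the whole reduction.
total_dropped = dropped_neutral + dropped_duplicates + dropped_inconsistent
print(total_dropped == rows_loaded - rows_final)
```

Duplicate review text is by far the largest source of removed rows.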

# Check class counts of the output label:
raw_data['Review'].value_counts()

In [42]:
# Function to clean text by removing HTML tags, URLs, and common punctuation.
def clean_text(text):
    # Order matters: tags and URLs are stripped before punctuation, because
    # removing ':' and '.' first would stop the URL pattern from matching.
    unwanted_chars_patterns = [
        r'<[^>]+>',  # Remove HTML tags
        r'https?://\S+',  # Remove URLs
        r'[!?,;:—".]',  # Remove punctuation
    ]

    for pattern in unwanted_chars_patterns:
        text = re.sub(pattern, '', text)

    return text
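
Pattern order matters in a chain of `re.sub` calls like the one above. The sketch below uses a made-up review string and a placeholder URL (`example.com`), neither of which comes from the dataset, to compare stripping punctuation first with stripping tags and URLs first. The helper name `strip_patterns` is hypothetical.

```python
import re

sample = 'Loved it! See http://example.com/taffy.html for details<br />Great.'

PUNCT = r'[!?,;:—".]'
TAGS = r'<[^>]+>'
URLS = r'https?://\S+'

def strip_patterns(text, patterns):
    """Apply each regex in order, deleting every match."""
    for pattern in patterns:
        text = re.sub(pattern, '', text)
    return text

punct_first = strip_patterns(sample, [PUNCT, TAGS, URLS])
urls_first = strip_patterns(sample, [TAGS, URLS, PUNCT])

print(punct_first)
print(urls_first)
```

With punctuation removed first, `http://` loses its colon and the URL pattern no longer matches, so the URL remnants stay in the text. Replacing matches with an empty string also glues neighbouring words together (`detailsGreat` here), which is the same effect that produces tokens like `peanutsth` later in this notebook.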

In [43]:
# Apply clean_text to the 'Text' column and store the result in 'Clean_Text'.
raw_data['Clean_Text'] = raw_data['Text'].apply(clean_text)

In [44]:
print(raw_data["Text"].head(10))

0    I have bought several of the Vitality canned d...
1    Product arrived labeled as Jumbo Salted Peanut...
2    This is a confection that has been around a fe...
3    If you are looking for the secret ingredient i...
4    Great taffy at a great price.  There was a wid...
5    I got a wild hair for taffy and ordered this f...
6    This saltwater taffy had great flavors and was...
7    This taffy is so good.  It is very soft and ch...
8    Right now I'm mostly just sprouting this so my...
9    This is a very healthy dog food. Good for thei...
Name: Text, dtype: object


In [45]:
# Preprocess the text: tokenize, remove stopwords and short or non-alphabetic tokens, then stem.
# Save the clean text into a new column keeping the original text as we need it for Bi-Grams/Tri-Grams.
nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')

def preprocess_text(text):
    # Tokenize the text
    tokens = nltk.word_tokenize(text)
    # Keep alphabetic tokens of at least 3 characters that are not stopwords.
    # The comparison is case-sensitive, so capitalized stopwords such as "The" pass through.
    tokens = [word for word in tokens if word not in stop_words and word.isalpha() and len(word) >= 3]
    # Apply Snowball stemming (the stemmer also lowercases each token)
    stemmed_tokens = [stemmer.stem(word) for word in tokens]
    return ' '.join(stemmed_tokens)

# Apply text preprocessing to the 'Clean_Text' column
raw_data['Clean_Text'] = raw_data['Clean_Text'].apply(preprocess_text)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\anand\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\anand\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [46]:
# Comparing the Original Text and the processed text
print(raw_data["Text"][1])

print("\n After Processing of Text \n")

print(raw_data["Clean_Text"][1])

Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".

 After Processing of Text 

product arriv label jumbo salt peanutsth peanut actual small size unsalt not sure error vendor intend repres product jumbo
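
The processed text above still contains "not", even though "not" is in NLTK's English stopword list. The reason is that the stopword list is all lowercase and the membership test is case-sensitive, so the capitalized "Not" passes the filter and is only lowercased later by the stemmer. A minimal sketch of the effect, using a tiny hand-written stopword set as a stand-in for the NLTK list:

```python
# Tiny illustrative stopword set (an assumption for this example);
# NLTK's English list is likewise all lowercase.
stop_words = {'not', 'the', 'this', 'if', 'was', 'an', 'or'}

tokens = ['Not', 'sure', 'if', 'this', 'was', 'an', 'error']

# Case-sensitive check, as in preprocess_text: 'Not' is not equal to 'not'.
case_sensitive = [t for t in tokens if t not in stop_words]

# Lowercasing before the check removes capitalized stopwords as well.
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Whether negations such as "not" should be removed at all for sentiment prediction is a separate modelling choice. The sketch only shows why sentence-initial stopwords survive the current filter.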


In [47]:
# Build one list of tokens per review; these token lists are the input for Word2Vec.
list_of_words_in_sentance = []

for sent in raw_data['Clean_Text'].values:
    list_of_words_in_filtered_sentence = []
    sent = clean_text(sent)

    # Split the sentence into words
    words = sent.split()

    # Keep only alphanumeric words (a double check; the text is already cleaned)
    for word in words:
        if word.isalnum():
            list_of_words_in_filtered_sentence.append(word.lower())

    list_of_words_in_sentance.append(list_of_words_in_filtered_sentence)


In [48]:
raw_data["Clean_Text"].iloc[0]

'bought sever vital can dog food product found good qualiti the product look like stew process meat smell better labrador finicki appreci product better'

In [49]:
print(raw_data['Clean_Text'])

0         bought sever vital can dog food product found ...
1         product arriv label jumbo salt peanutsth peanu...
2         this confect around centuri light pillowi citr...
3         look secret ingredi robitussin believ found go...
4         great taffi great price there wide assort yumm...
                                ...                        
568449    great sesam chickenthi good better restur eate...
568450    disappoint flavor the chocol note especi weak ...
568451    these star small give one train session tri tr...
568452    these best treat train reward dog good groom l...
568453    satisfi product advertis use cereal raw vinega...
Name: Clean_Text, Length: 363834, dtype: object


In [50]:
print(f"Sentence cleaned: {raw_data['Clean_Text'].values[0]}")
print(f"Words in cleaned sentence{list_of_words_in_sentance[0]}")

Sentence cleaned: bought sever vital can dog food product found good qualiti the product look like stew process meat smell better labrador finicki appreci product better
Words in cleaned sentence['bought', 'sever', 'vital', 'can', 'dog', 'food', 'product', 'found', 'good', 'qualiti', 'the', 'product', 'look', 'like', 'stew', 'process', 'meat', 'smell', 'better', 'labrador', 'finicki', 'appreci', 'product', 'better']


## Bag of Words 

In [58]:
Count_vectorizer = CountVectorizer()
bow_data = Count_vectorizer.fit_transform(raw_data["Clean_Text"].values)
print(f"Shape of dataset after converting into BOW is {bow_data.get_shape()}")

Shape of dataset after converting into BOW is (363834, 191293)
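
To make the shape above concrete: each row of the bag-of-words matrix is a review and each column is a vocabulary word, with the cell holding the word's count in that review. The sketch below reproduces the idea in plain Python on three made-up, already-stemmed reviews. It is a simplification: `CountVectorizer` additionally lowercases, tokenizes with a regular expression that keeps tokens of two or more characters, and returns a sparse matrix.

```python
from collections import Counter

# Three made-up, already-stemmed reviews (not taken from the dataset).
docs = ['great taffi great price', 'good dog food', 'dog like taffi']

# Vocabulary: every distinct token, sorted so the column order is deterministic.
vocabulary = sorted({token for doc in docs for token in doc.split()})

# One row per document, one column per vocabulary word, cells hold raw counts.
bow_rows = []
for doc in docs:
    counts = Counter(doc.split())
    bow_rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in bow_rows:
    print(row)
```

Most cells are zero even in this toy case, which is why a sparse matrix is the practical representation for 363,834 reviews and 191,293 columns.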


## Uni, Bi and Tri Grams

In [59]:
Count_vectorizer_uni_bi_tri_grams = CountVectorizer(ngram_range=(1,3) ) 
final_uni_bi_tri_gram_counts = Count_vectorizer_uni_bi_tri_grams.fit_transform(raw_data["Clean_Text"].values)
print("Shape of dataset after converting into uni, bi and tri-grams is ",final_uni_bi_tri_gram_counts.get_shape())

Shape of dataset after converting into uni, bi and tri-grams is  (363834, 13535921)
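
The column count jumps from 191,293 to 13,535,921 because `ngram_range=(1,3)` adds every contiguous pair and triple of words as its own feature. A minimal sketch of n-gram generation on one made-up review (the `ngrams` helper is hypothetical, written for illustration only):

```python
def ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous n-grams of length n_min..n_max as strings."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams

tokens = 'great taffi great price'.split()
features = ngrams(tokens)

print(features)
print(len(set(tokens)), 'distinct unigrams vs', len(set(features)), 'distinct 1-3 grams')
```

Even a four-word review produces more than twice as many distinct features once bi-grams and tri-grams are included, and most longer n-grams occur in very few reviews.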


## Tf-Idf Vectorization

In [62]:
tf_idf_vectorizer = TfidfVectorizer(ngram_range=(1,2))
# Keep the fitted vectorizer and the resulting matrix in separate variables
tf_idf_data = tf_idf_vectorizer.fit_transform(raw_data['Clean_Text'].values)
print("Shape of dataset after converting into tf-idf is ",tf_idf_data.get_shape())

Shape of dataset after converting into tf-idf is  (363834, 3388056)
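
With its default settings, `TfidfVectorizer` uses a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplies it by the raw term count, and then L2-normalizes each row. The sketch below computes only the idf part in plain Python on three made-up reviews, so the numbers are illustrative and not taken from the dataset.

```python
import math

# Three made-up, already-stemmed reviews (not taken from the dataset).
docs = [doc.split() for doc in ['great taffi great price', 'good dog food', 'dog like taffi']]
n_docs = len(docs)

def smoothed_idf(term):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, where df(t) is the number of documents containing t."""
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

for term in ['taffi', 'price']:
    print(term, round(smoothed_idf(term), 4))
```

A word that appears in fewer documents ("price") gets a larger idf than one that appears in more ("taffi"), so rare words contribute more to a review's vector.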


## word2vec Model
We train a word2vec model on our own dataset; the same model is reused in the sections that follow.

In [51]:
# Training word2vec model on our own data.
w2v_model=gensim.models.Word2Vec(list_of_words_in_sentance,min_count=5, workers=4) 

In [52]:
# Saving the vocabulary of words in our trained word2vec model
w2v_vocab = list(w2v_model.wv.key_to_index)

In [53]:
# Get the 10 words most similar to "good"
w2v_model.wv.most_similar('good')

[('great', 0.7645207047462463),
 ('decent', 0.7281476259231567),
 ('excel', 0.6507630944252014),
 ('fantast', 0.6283754110336304),
 ('awesom', 0.6276445388793945),
 ('nice', 0.609722375869751),
 ('bad', 0.5874612331390381),
 ('tasti', 0.5669059157371521),
 ('terrif', 0.5574554204940796),
 ('like', 0.5510575175285339)]
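
The scores returned by `most_similar` are cosine similarities between word vectors. Because word2vec learns from the contexts a word appears in, an antonym such as "bad" can still rank high: it tends to occur in the same kinds of sentences as "good". A minimal sketch of the similarity measure itself, using hypothetical 3-dimensional vectors rather than vectors from the trained model:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors, not taken from the trained model.
good = [0.9, 0.1, 0.3]
great = [0.8, 0.2, 0.35]
stew = [-0.2, 0.9, 0.1]

print(round(cosine_similarity(good, great), 3))
print(round(cosine_similarity(good, stew), 3))
```

Vectors pointing in nearly the same direction score close to 1, while unrelated directions score near 0.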

In [54]:
raw_data.shape[0]

363834

## Average word2vec

In [69]:
sent_vectors_avg_word2vec = []  # The avg-w2v for each sentence/review is stored in this list
vector_size = len(w2v_model.wv['good'])  # Dimensionality of the word vectors

for sent in tqdm(list_of_words_in_sentance): # Iterating over each review/sentence
    sent_vec = np.zeros(vector_size)  # Running sum of the word vectors
    cnt_words = 0  # Number of words in the review that have a vector
    for word in sent: # Iterating over each word in a review/sentence
        if word in w2v_vocab:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_vectors_avg_word2vec.append(sent_vec)
print(len(sent_vectors_avg_word2vec))

100%|██████████| 363834/363834 [08:00<00:00, 757.60it/s] 

363834
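
The loop above averages the vectors of the words that the model knows and leaves a zero vector when none are known. One likely reason it takes about eight minutes is that `w2v_vocab` is a list, so every `word in w2v_vocab` test scans it. Testing membership against a set, or against the `key_to_index` dictionary used earlier to build the vocabulary, is a hash lookup instead. The sketch below shows the same averaging logic in plain Python, with a small dictionary of hypothetical 2-dimensional vectors standing in for `w2v_model.wv`:

```python
# Hypothetical 2-dimensional word vectors standing in for w2v_model.wv.
word_vectors = {
    'great': [1.0, 0.0],
    'taffi': [0.0, 1.0],
    'price': [1.0, 1.0],
}

def average_vector(tokens, vectors, size):
    """Mean of the vectors of known tokens; a zero vector if none are known."""
    total = [0.0] * size
    known = 0
    for token in tokens:
        if token in vectors:  # dict membership is a hash lookup, not a list scan
            total = [t + v for t, v in zip(total, vectors[token])]
            known += 1
    return [t / known for t in total] if known else total

print(average_vector(['great', 'taffi', 'unknownword'], word_vectors, 2))
print(average_vector(['unknownword'], word_vectors, 2))
```

Unknown words are skipped rather than counted, so they do not pull the average towards zero.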





## Tf-Idf Word2vec

In [70]:
tfidf_model = TfidfVectorizer()
tf_idf_matrix = tfidf_model.fit_transform(raw_data['Clean_Text'].values)
# get_feature_names_out() requires scikit-learn >= 1.0 (get_feature_names() was removed in 1.2)
tfidf_feat = tfidf_model.get_feature_names_out() # tfidf words/col-names
# Build a dictionary with each word as key and its idf as value
dictionary = dict(zip(tfidf_feat, tfidf_model.idf_))

# TF-IDF weighted Word2Vec
tfidf_sent_vectors = []  # the tfidf-w2v for each sentence/review is stored in this list
for sent in tqdm(list_of_words_in_sentance): # for each review/sentence
    sent_vec = np.zeros(vector_size)  # zero vector with the word-vector dimensionality
    weight_sum = 0  # sum of the tf-idf weights of the words with a valid vector
    for word in sent: # for each word in a review/sentence
        if word in w2v_vocab:
            vec = w2v_model.wv[word]
            # To reduce computation, tf-idf is computed directly instead of
            # being looked up in tf_idf_matrix:
            # dictionary[word] = idf value of the word in the whole corpus
            # sent.count(word)/len(sent) = tf value of the word in this review
            tf_idf = dictionary[word]*(sent.count(word)/len(sent))
            sent_vec += (vec * tf_idf)
            weight_sum += tf_idf
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_vectors.append(sent_vec)

100%|██████████| 363834/363834 [08:58<00:00, 675.85it/s] 
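
The idea behind the cell above is a weighted average: each word vector is scaled by the word's tf-idf weight, and the sum is divided by the total weight. The sketch below shows that computation in plain Python with hypothetical vectors and idf values. It is not a copy of the notebook's loop: it visits each distinct word once, whereas the loop above visits a repeated word once per occurrence, which gives repeated words extra weight.

```python
def tfidf_weighted_vector(tokens, vectors, idf, size):
    """Weighted mean of word vectors, with weight = idf * term frequency."""
    total = [0.0] * size
    weight_sum = 0.0
    for token in set(tokens):  # visit each distinct word once
        if token in vectors and token in idf:
            weight = idf[token] * tokens.count(token) / len(tokens)
            total = [t + weight * v for t, v in zip(total, vectors[token])]
            weight_sum += weight
    return [t / weight_sum for t in total] if weight_sum else total

# Hypothetical vectors and idf values, not taken from the trained models.
word_vectors = {'great': [1.0, 0.0], 'taffi': [0.0, 1.0]}
idf = {'great': 1.0, 'taffi': 3.0}
tokens = ['great', 'great', 'taffi', 'unknownword']

result = tfidf_weighted_vector(tokens, word_vectors, idf, 2)
print(result)
```

Here "great" gets weight 1.0 × 2/4 = 0.5 and "taffi" gets 3.0 × 1/4 = 0.75, so the rarer word dominates the review vector even though it occurs only once.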
