## Amazon Fine Food Reviews

Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews
* All data is in one SQLite database: 568,454 food reviews that Amazon users left up to October 2012.
* Total columns: 10
* Column list: Id, ProductId, UserId, ProfileName, HelpfulnessNumerator, HelpfulnessDenominator, Score (rating), Time, Summary, Text.
* Neutral reviews (Score equal to 3) are excluded, and the Score column is converted into a positive/negative response label.

In [1]:
#Importing all necessary Libraries.
import pandas as pd
import seaborn as sns
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import sqlite3
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve,auc
from nltk.stem.porter import PorterStemmer  #Natural Language Toolkit (NLTK).

#Reading data from SQLite Table
con = sqlite3.connect("database.sqlite")

#Keep only positive and negative reviews; reviews with a score equal to 3 are not considered.
filtered_data = pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3
""", con)

#Assign 'negative' to reviews with a score < 3 and 'positive' otherwise (score 3 was already filtered out).
def partition(x):
    if x<3:
        return 'negative'
    else:
        return 'positive'

#Change Reviews to Positive and Negative
actual_score = filtered_data['Score']
positiveNegative = actual_score.map(partition)
filtered_data['Score'] = positiveNegative
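
The labelling rule above can be checked on a handful of scores. This is a dependency-free sketch: `label_score` is a hypothetical helper name that does not appear in the notebook, and it returns `None` for 3-star reviews so that the excluded case is explicit instead of relying on the SQL filter.

```python
def label_score(score):
    """Map a 1-5 star rating to a sentiment label; 3 is treated as neutral."""
    if score < 3:
        return "negative"
    if score > 3:
        return "positive"
    return None  # neutral reviews are excluded from the data set


scores = [1, 2, 3, 4, 5]
# Keep only the scores that receive a label, mirroring "WHERE Score != 3".
labels = [label_score(s) for s in scores if label_score(s) is not None]
print(labels)
```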


In [2]:
#Print the shape of the data set and preview a few rows with all features (variables).
print(filtered_data.shape)
filtered_data.head(5)

(525814, 10)


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,positive,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,negative,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,positive,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,negative,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,positive,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [3]:
#Sorting the filtered data by Time in ascending order (oldest reviews first)
sorted_data = filtered_data.sort_values('Time', axis=0, ascending=True)
sorted_data.head(3)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
138706,150524,0006641040,ACITT7DI6IDDL,shari zychinski,0,0,positive,939340800,EVERY book is educational,this witty little book makes my son laugh at l...
138683,150501,0006641040,AJ46FKXOVC7NR,Nicholas A Mesiano,2,2,positive,940809600,This whole series is great way to spend time w...,I can remember seeing the show when it aired o...
417839,451856,B00004CXX9,AIUWLEQ1ADEG5,Elizabeth Medina,0,0,positive,944092800,Entertainingl Funny!,Beetlejuice is a well written movie ..... ever...


### Data Cleaning: Deduplication 

In [4]:
#Keep the first (earliest) row for each (UserId, ProfileName, Time, Text) combination.
final = sorted_data.drop_duplicates(subset=["UserId","ProfileName","Time","Text"],keep='first',inplace=False)
print(final.shape)

(364173, 10)


In [5]:
#Fraction of the data remaining after deduplication.
print((final['Id'].size)/(filtered_data['Id'].size))
#About 69.26% of the data remains.

0.6925890143662968
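
The deduplication rule (keep the first row for each `UserId`, `ProfileName`, `Time`, `Text` combination) can be sketched without pandas. The rows below are made-up illustration data and `drop_duplicate_reviews` is a hypothetical helper, not part of the notebook.

```python
def drop_duplicate_reviews(rows, key_fields=("UserId", "ProfileName", "Time", "Text")):
    """Return rows with only the first occurrence of each key, order preserved."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up rows: Id 2 repeats the same user, time and text under another product.
rows = [
    {"Id": 1, "ProductId": "P1", "UserId": "U1", "ProfileName": "ann", "Time": 100, "Text": "tasty"},
    {"Id": 2, "ProductId": "P2", "UserId": "U1", "ProfileName": "ann", "Time": 100, "Text": "tasty"},
    {"Id": 3, "ProductId": "P1", "UserId": "U2", "ProfileName": "bob", "Time": 200, "Text": "tasty"},
]
deduped = drop_duplicate_reviews(rows)
kept_fraction = len(deduped) / len(rows)
print([row["Id"] for row in deduped], kept_fraction)
```

Because the input is sorted before deduplication, "first" here means the earliest row in that sort order.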


In [6]:
#Keep only rows where HelpfulnessNumerator <= HelpfulnessDenominator (the numerator can never validly exceed the denominator)
final = final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]
#final["HelpfulnessNumerator"]<=final["HelpfulnessDenominator"] produces the same boolean mask

In [7]:
print(final.shape) #Rows and columns remaining after removing the invalid helpfulness rows

(364171, 10)


In [8]:
final['Score'].value_counts()

positive    307061
negative     57110
Name: Score, dtype: int64

In [9]:
sample_final = final.sample(10000) #Sampling Randomly 10k points from 364k
sample_final.head(5)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
442836,478823,B001QZYFOU,AEPG7I28BZKZB,Silcat3,0,0,positive,1346803200,"Yum, but mash it","Our three love the meat, and lick the plate if..."
226042,245088,B000GFYRHG,A2K3J2X8KDY47N,"Jewelry Lover ""me""",0,0,positive,1306972800,My all time favorite tea....,I am now receiving a case of this tea every si...
232542,252248,B0046GSTUM,A2RUR1SMPNGKXJ,Katie,2,2,positive,1345507200,Excellent very delicious,I am on a strict diet and this is a very delic...
516810,558761,B000BVY02M,AOEDWQLH2WKKW,"E. J Tastad ""ejt""",21,21,positive,1169424000,"Hot sauce concentrate, use at your own risk",This is like hot sauce concentrate. You MUST ...
141203,153228,B0038YJ4MU,A1AZ21Z4JQEQZU,JRTN,0,0,positive,1340150400,Something that works!,"As a chronic insomniac, I have tried most prod..."


In [10]:
sample_final = sample_final.sort_values('Time', axis=0, ascending=True) #Sorting Based upon Time stamp.
sample_final.head(5)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
138683,150501,0006641040,AJ46FKXOVC7NR,Nicholas A Mesiano,2,2,positive,940809600,This whole series is great way to spend time w...,I can remember seeing the show when it aired o...
179643,194858,B0000E65WB,A2VZ11U5DXM8J5,"C. Ebeling ""ctlpareader""",1,1,positive,1068336000,Stock Up On This Item,I usually purchase this item in smaller links ...
390522,422248,B0000D9N9A,A3LFT71N1YOQXN,Bell Mays,14,16,positive,1068422400,Hot Sizzling bubbly Raclette! ! Bubbly Bubbly...,Put in under a Reclette grill or just put it i...
472979,511508,B0000D94P1,A2801SG8XA9LNX,PACW,14,15,positive,1069113600,Tastes great for what it is,I have relied on these cake mixes for a few ye...
370384,400533,B0000V8HTU,A6M8KOVEPQ0BO,"Cyn ""cynnergy""",2,3,positive,1073865600,This is the best coffee!,Hawaii Roasters is definitely the best coffee ...


In [11]:
#Count of positive and negative labels in the Score column.
sample_final['Score'].value_counts()

positive    8478
negative    1522
Name: Score, dtype: int64
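
`final.sample(10000)` is called without a seed, so it draws a different subset on every run and the class counts above will shift between runs. In pandas, passing `random_state` to `DataFrame.sample` makes the draw repeatable. The sketch below shows the same idea with the standard library only; `sample_indices` is a hypothetical helper and the seed value is arbitrary.

```python
import random


def sample_indices(n_rows, n_samples, seed):
    """Draw n_samples distinct row positions; the same seed gives the same draw."""
    rng = random.Random(seed)
    # Sorted only so the result is easy to inspect; the draw itself is random.
    return sorted(rng.sample(range(n_rows), n_samples))


first_draw = sample_indices(364171, 10000, seed=42)
second_draw = sample_indices(364171, 10000, seed=42)
print(first_draw == second_draw)
```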

### Text Preprocessing

Text preprocessing steps: stop-word removal, upper-case to lower-case conversion, stemming, and lemmatization.

In [12]:
final_text = sample_final['Text']
final_text = pd.DataFrame(final_text)
final_text.head()

Unnamed: 0,Text
138683,I can remember seeing the show when it aired o...
179643,I usually purchase this item in smaller links ...
390522,Put in under a Reclette grill or just put it i...
472979,I have relied on these cake mixes for a few ye...
370384,Hawaii Roasters is definitely the best coffee ...


In [13]:
import re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer

snowball_stemmer = SnowballStemmer("english")
wordnet_lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))


def nlp_preprocessing(total_text, index, column):
    if isinstance(total_text, str):
        #Replace special chars with space
        total_text = re.sub(r'[^a-zA-Z0-9\n]',' ',total_text)
        #Replace multiple whitespace characters with a single space.
        total_text = re.sub(r'\s+',' ',total_text)
        #Convert all upper case words to lower case.
        total_text = total_text.lower()
        #Stop word removal: retain a word only if it is not a stop word.
        words = [word for word in total_text.split() if word not in stop_words]
        #Stemming using Snowball Stemmer.
        words = [snowball_stemmer.stem(word) for word in words]
        #Lemmatizer
        words = [wordnet_lemmatizer.lemmatize(word) for word in words]
        #Write back with .loc to avoid chained-assignment problems.
        final_text.loc[index, column] = " ".join(words)

In [14]:
#Text Preprocessing stage.
import time
start_time = time.perf_counter() #time.clock() was removed in Python 3.8
for index,row in final_text.iterrows():
    nlp_preprocessing(row['Text'], index, 'Text')
print('Time took for preprocessing the text :', time.perf_counter() - start_time, "seconds")

Time took for preprocessing the text : 17.9822186546936 seconds
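
The first three preprocessing steps can be sketched as a pure function that returns the cleaned string instead of writing into a DataFrame. This is a simplified illustration: `basic_clean` and the tiny `STOP_WORDS` set are hypothetical stand-ins (the notebook uses NLTK's English stop-word list), and stemming and lemmatization are left out so the block runs without NLTK data.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"i", "the", "a", "is", "this", "it", "and", "of"}


def basic_clean(text):
    """Strip special characters, lower-case, and drop stop words."""
    text = re.sub(r"[^a-zA-Z0-9\n]", " ", text)  # special characters -> space
    text = re.sub(r"\s+", " ", text).lower()     # collapse whitespace, lower-case
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


cleaned = basic_clean("This is the BEST coffee, and I love it!")
print(cleaned)
```

A function of this shape can be applied to a whole column at once (for example with `Series.apply`), which avoids the row-by-row `iterrows` loop.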


#### Dividing Data into Train and Test:

In [16]:
from sklearn.model_selection import train_test_split
y_true = sample_final['Score'].values
#Split the data as train and test 
X_train, X_test, y_train, y_test = train_test_split(final_text['Text'], y_true, test_size=0.3, shuffle=False)
#Split the X_1 and y_1 into train and Cross validate
#X_train, X_cv, y_train, y_cv = train_test_split(X_1, y_1, test_size=0.2, shuffle=False)
print(X_train.head())

138683    rememb see show air televis year ago child sis...
179643    usual purchas item smaller link 9 pound stash ...
390522    put reclett grill put oven 5 10 minut serv boi...
472979    reli cake mix year find handi tast fine sugar ...
370384    hawaii roaster definit best coffe glad get ama...
Name: Text, dtype: object
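
Because the sample was sorted by `Time` and `shuffle=False` is used, the split is temporal: the oldest 70% of reviews train the model and the newest 30% test it. A minimal sketch of that behaviour follows; `time_ordered_split` is a hypothetical helper, and rounding the test count up is an assumption of this sketch. The timestamps are taken from the previews shown earlier.

```python
import math


def time_ordered_split(items, test_size=0.3):
    """Split an already time-sorted list into (older, newer) parts."""
    n_test = math.ceil(len(items) * test_size)
    split_at = len(items) - n_test
    return items[:split_at], items[split_at:]


timestamps = [939340800, 940809600, 944092800, 1068336000, 1068422400,
              1069113600, 1073865600, 1169424000, 1306972800, 1346803200]
train_times, test_times = time_ordered_split(timestamps)
print(len(train_times), len(test_times))
```

Every training timestamp is then no later than any test timestamp, so the model is never evaluated on reviews older than the ones it learned from.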


### Bag of Words (BoW)

In [22]:
#Converting text (reviews) to vectors.
#Why vectors?
#Once text is represented as vectors, linear algebra operations can be applied to it.
count_vect = CountVectorizer() #from scikit Learn
train_bow = count_vect.fit_transform(X_train.values)

In [23]:
test_bow = count_vect.transform(X_test)

In [24]:
type(train_bow)

scipy.sparse.csr.csr_matrix

In [25]:
print(train_bow.get_shape())
print(test_bow.get_shape())

(7000, 12057)
(3000, 12057)
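
Both matrices have 12057 columns because the vocabulary is learned from the training text only and then reused for the test text. The sketch below illustrates that fit/transform pattern with plain Python. It is not scikit-learn's implementation: `fit_vocabulary` and `to_count_vector` are hypothetical helpers, tokens are split on whitespace only, and the documents are made up.

```python
from collections import Counter


def fit_vocabulary(docs):
    """Assign a column index to every distinct token, in sorted order."""
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: col for col, tok in enumerate(tokens)}


def to_count_vector(doc, vocabulary):
    """Count known tokens; tokens missing from the vocabulary are ignored."""
    counts = Counter(tok for tok in doc.split() if tok in vocabulary)
    # Dict iteration follows insertion order, which matches the column order.
    return [counts.get(tok, 0) for tok in vocabulary]


train_docs = ["best coffe best price", "tast fine"]
test_docs = ["best tast ever"]

vocabulary = fit_vocabulary(train_docs)  # learned from train only
train_bow_toy = [to_count_vector(doc, vocabulary) for doc in train_docs]
test_bow_toy = [to_count_vector(doc, vocabulary) for doc in test_docs]
print(train_bow_toy, test_bow_toy)
```

The word "ever" occurs only in the test document, so it has no column and is dropped.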


In [26]:
import pickle
with open('train_bow.pickle', 'wb') as handle:
    pickle.dump(train_bow, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('test_bow.pickle', 'wb') as handle:
    pickle.dump(test_bow, handle, protocol=pickle.HIGHEST_PROTOCOL)


### TF-IDF

In [29]:
import warnings
warnings.filterwarnings("ignore")
tfidf_vect = TfidfVectorizer()
train_tfidf = tfidf_vect.fit_transform(X_train.values)

In [30]:
test_tfidf = tfidf_vect.transform(X_test)

In [31]:
train_tfidf.get_shape()

(7000, 12057)

In [32]:
features = tfidf_vect.get_feature_names()
len(features)

12057

In [33]:
type(train_tfidf)

scipy.sparse.csr.csr_matrix

In [34]:
features[109:120] #Printing 11 features (indices 109 to 119).

['162',
 '1670',
 '1696',
 '16oz',
 '16th',
 '17',
 '170',
 '1708',
 '170mg',
 '175',
 '17lbs']

In [35]:
#Convert a row in Sparse Matrix to numpy array
print(train_tfidf[3,:].toarray()[0])

[0. 0. 0. ... 0. 0. 0.]
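
The row above is mostly zeros because each review uses only a few of the 12057 terms. To show where the non-zero weights come from, here is a small sketch of the weighting that, to my understanding, matches scikit-learn's documented defaults: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The helper names and documents are made up for illustration.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Smoothed inverse document frequency for every token in docs."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc.split()))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}


def tfidf_vector(doc, idf):
    """Sparse TF-IDF row as a dict, L2-normalised; unknown tokens are ignored."""
    term_counts = Counter(tok for tok in doc.split() if tok in idf)
    weights = {tok: count * idf[tok] for tok, count in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {tok: w / norm for tok, w in weights.items()} if norm else {}


docs = ["best coffe", "best tea", "best price"]
idf = fit_idf(docs)
vec = tfidf_vector("best coffe", idf)
print(vec)
```

"best" appears in every document, so it receives the minimum IDF and a lower weight than the rarer "coffe".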


In [36]:
#Creating a Pickle file.
with open('train_tfidf.pickle', 'wb') as handle:
    pickle.dump(train_tfidf, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('test_tfidf.pickle', 'wb') as handle:
    pickle.dump(test_tfidf, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [37]:
with open('y_train.pickle', 'wb') as handle:
    pickle.dump(y_train, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('y_test.pickle', 'wb') as handle:
    pickle.dump(y_test, handle, protocol=pickle.HIGHEST_PROTOCOL)

### Word2Vec 

#### Train Test split for Word2Vec

In [42]:
from sklearn.model_selection import train_test_split
y_true = sample_final['Score'].values
#Split the data as train and test 
X_train_w, X_test_w, y_train_w, y_test_w = train_test_split(sample_final['Text'], y_true, test_size=0.3, shuffle=False)
#Split the X_1 and y_1 into train and Cross validate
#X_train_w, X_cv_w, y_train_w, y_cv_w = train_test_split(X_1, y_1, test_size=0.2, shuffle=False)
print(X_train_w.head())

138683    I can remember seeing the show when it aired o...
179643    I usually purchase this item in smaller links ...
390522    Put in under a Reclette grill or just put it i...
472979    I have relied on these cake mixes for a few ye...
370384    Hawaii Roasters is definitely the best coffee ...
Name: Text, dtype: object


In [43]:
import re
def cleanhtml(sentence): #function to clean the word of any html tags.
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr,' ',sentence)
    return cleantext
def cleanpunc(sentence): #Function to clean words of punctuation.
    cleaned = re.sub(r'[?|!|\'|"|#]',r'',sentence)
    cleaned = re.sub(r'[.|,|(|)|\|/]',r'',cleaned)
    return cleaned
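
To see what the two cleaning helpers do, the sketch below repeats them and applies them to one made-up sentence using the same word-by-word loop that the Word2Vec cells use. `tokenize` is a hypothetical wrapper name introduced only for this illustration.

```python
import re


def cleanhtml(sentence):
    """Replace HTML tags with a space."""
    return re.sub(re.compile('<.*?>'), ' ', sentence)


def cleanpunc(sentence):
    """Remove common punctuation characters."""
    cleaned = re.sub(r'[?|!|\'|"|#]', r'', sentence)
    return re.sub(r'[.|,|(|)|\|/]', r'', cleaned)


def tokenize(sentence):
    """Lower-cased alphabetic tokens after HTML and punctuation removal."""
    tokens = []
    for word in cleanhtml(sentence).split():
        for cleaned_word in cleanpunc(word).split():
            if cleaned_word.isalpha():
                tokens.append(cleaned_word.lower())
    return tokens


tokens = tokenize('Great taffy!<br />Soft, chewy (and "fresh").')
print(tokens)
```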

### Avg Word2Vec

In [44]:
#from gensim.models import Word2Vec
#Train a Word2Vec model on our own text corpus.

import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
import gensim

def avg_word2vec(final_text):
    list_of_sent=[]
    for sent in final_text.values:
        filtered_sentence=[]
        sent=cleanhtml(sent)
        for w in sent.split():
            for cleaned_words in cleanpunc(w).split():
                if (cleaned_words.isalpha()):
                    filtered_sentence.append(cleaned_words.lower())
                else:
                    continue
        list_of_sent.append(filtered_sentence)

    #Word2Vec Model
    w2v_model = gensim.models.Word2Vec(list_of_sent,min_count=5,size=300,workers=4)

    #Avg word2vec is done only for 10k points
    #Average Word2Vec
    #Compute average w2v for each review
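    sent_vectors = []
    for sent in list_of_sent:
        sent_vec = np.zeros(300)
        cnt_words = 0
        for word in sent:
            try:
                vec = w2v_model.wv[word]
                sent_vec += vec
                cnt_words += 1
            except KeyError:
                #Word is not in the Word2Vec vocabulary (it occurs fewer than min_count times).
                pass
        #Guard against reviews with no in-vocabulary words to avoid dividing by zero.
        if cnt_words != 0:
            sent_vec /= cnt_words
        sent_vectors.append(sent_vec)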
    print(len(sent_vectors))
    print(len(sent_vectors[0]))
    return sent_vectors

In [45]:
train_avg_word2vec = avg_word2vec(X_train_w)

7000
300
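
Each review is represented by the mean of the vectors of its in-vocabulary words. The sketch below shows that averaging step on its own, with a made-up two-dimensional embedding table standing in for a trained Word2Vec model; `average_vector` is a hypothetical helper.

```python
def average_vector(tokens, embeddings, dim):
    """Mean of the embedding vectors of known tokens; zeros if none are known."""
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is None:
            continue  # out-of-vocabulary word, skipped
        total = [t + v for t, v in zip(total, vec)]
        count += 1
    if count == 0:
        return total  # avoids dividing by zero for empty or all-unknown reviews
    return [t / count for t in total]


# Made-up embedding table, for illustration only.
toy_embeddings = {"best": [1.0, 0.0], "coffee": [0.0, 1.0]}
avg = average_vector(["best", "coffee", "unknownword"], toy_embeddings, dim=2)
empty = average_vector(["unknownword"], toy_embeddings, dim=2)
print(avg, empty)
```

Returning a zero vector when no word is known is one possible convention; dropping such reviews is another.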


In [46]:
train_avg_word2vec[2]

array([-9.13575224e-02, -4.47940966e-02,  6.17986763e-02, -4.98300192e-02,
       -2.34878440e-01,  2.13162693e-02, -1.87575054e-01,  2.20469415e-02,
        2.20825537e-02, -1.61932165e-01, -1.27512975e-04,  9.88196107e-02,
        1.12091546e-01, -4.29454954e-02, -5.12704245e-02, -1.18187675e-01,
        1.23256331e-02, -1.70489881e-01,  7.29508118e-02, -8.55551501e-02,
       -3.40148298e-02,  1.30341168e-01, -4.36049535e-02,  6.85732573e-02,
       -1.40729885e-01, -1.33843139e-01,  1.69613931e-02, -1.89379374e-01,
        1.21075724e-01,  3.30583367e-02, -5.75963351e-02,  1.37715322e-02,
        3.02028746e-03, -8.04775749e-02, -6.70864611e-02, -1.06477294e-01,
       -2.11554136e-02, -2.08362241e-01, -4.02353493e-03, -3.81884192e-02,
        2.15567998e-01,  5.15320052e-02, -6.99333848e-03, -1.07191285e-01,
        1.67498763e-01, -8.68687229e-02, -1.33854655e-01,  1.33357496e-01,
        6.26992199e-02, -1.73796219e-02,  1.81256981e-02, -3.76669619e-03,
       -1.34030752e-01,  

In [47]:
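#NOTE: this call trains a second, independent Word2Vec model on the test reviews, so its vector space is not aligned with the one used for the training vectors.
test_avg_word2vec = avg_word2vec(X_test_w)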

3000
300


In [48]:
#Creating a Pickle file.
with open('train_avg_word2vec.pickle', 'wb') as handle:
    pickle.dump(train_avg_word2vec, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('test_avg_word2vec.pickle', 'wb') as handle:
    pickle.dump(test_avg_word2vec, handle, protocol=pickle.HIGHEST_PROTOCOL)

### TFIDF Weighted Word2Vec

In [49]:
def tfidf_word2vec(final_text):
    list_of_sent=[]
    for sent in final_text.values:
        filtered_sentence=[]
        sent=cleanhtml(sent)
        for w in sent.split():
            for cleaned_words in cleanpunc(w).split():
                if (cleaned_words.isalpha()):
                    filtered_sentence.append(cleaned_words.lower())
                else:
                    continue
        list_of_sent.append(filtered_sentence)

    #Word2Vec Model
    w2v_model = gensim.models.Word2Vec(list_of_sent,min_count=5,size=300,workers=4)

    #Tfidf weighted word2vec
    #final_tf_idf is not defined anywhere else in this notebook. As an assumption, it is built here
    #from the same cleaned tokens so that its rows and vocabulary line up with list_of_sent.
    tfidf_model = TfidfVectorizer(analyzer=lambda tokens: tokens)
    final_tf_idf = tfidf_model.fit_transform(list_of_sent)
    tfidf_col = tfidf_model.vocabulary_ #word -> column index
    tfidf_sent_vectors = []
    row = 0
    for sent in list_of_sent:
        sent_vec = np.zeros(300)
        weight_sum = 0
        for word in sent:
            try:
                vec = w2v_model.wv[word]
                tf_idf = final_tf_idf[row,tfidf_col[word]]
                sent_vec += (vec * tf_idf)
                weight_sum += tf_idf
            except KeyError:
                #Word is missing from the Word2Vec vocabulary.
                pass
        #Guard against reviews with no weighted words to avoid dividing by zero.
        if weight_sum != 0:
            sent_vec /= weight_sum
        tfidf_sent_vectors.append(sent_vec)
        row += 1

    print(tfidf_sent_vectors[2])
    return tfidf_sent_vectors

In [50]:
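#NOTE: this calls avg_word2vec, not tfidf_word2vec, so the vectors stored here are plain averages rather than TF-IDF weighted ones.
train_tfidf_word2vec = avg_word2vec(X_train_w)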

7000
300
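
In a TF-IDF weighted average, each word vector is multiplied by the word's TF-IDF weight and the sum is divided by the total weight, so rare, informative words count for more than common ones. The sketch below shows only that arithmetic; the embedding table, the weights, and the `weighted_average_vector` helper are made up for illustration.

```python
def weighted_average_vector(tokens, embeddings, weights, dim):
    """Weighted mean of word vectors; zeros if no token has both a vector and a weight."""
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in embeddings or tok not in weights:
            continue  # skip words without a vector or without a weight
        w = weights[tok]
        total = [t + w * v for t, v in zip(total, embeddings[tok])]
        weight_sum += w
    if weight_sum == 0:
        return total  # avoids dividing by zero
    return [t / weight_sum for t in total]


# Made-up vectors and weights, for illustration only.
toy_embeddings = {"best": [1.0, 0.0], "coffee": [0.0, 1.0]}
toy_weights = {"best": 1.0, "coffee": 3.0}
weighted = weighted_average_vector(["best", "coffee"], toy_embeddings, toy_weights, dim=2)
print(weighted)
```

With equal weights this reduces to the plain average.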


In [51]:
train_tfidf_word2vec[3]

array([-0.13468803,  0.04505837,  0.05099325,  0.03937044, -0.42393803,
       -0.02780443, -0.26548989, -0.00140673, -0.11699958, -0.19363701,
       -0.01939635,  0.15379193,  0.05386574,  0.1139471 , -0.05895299,
       -0.10969904,  0.15326163, -0.07685471,  0.0461257 ,  0.00404729,
       -0.00212115,  0.23534476,  0.13289246,  0.02117112, -0.09087324,
       -0.12764124,  0.03217189, -0.26916636,  0.04226346, -0.0551585 ,
       -0.06161767,  0.04566315, -0.04511133,  0.02185931, -0.10457619,
       -0.11435286, -0.04377859, -0.1411576 ,  0.1483881 ,  0.16589516,
        0.08295923, -0.0249235 ,  0.02431173, -0.0440818 ,  0.16415046,
        0.00066396, -0.02563621,  0.11759276,  0.1663628 , -0.01327317,
       -0.09856221,  0.08839298, -0.24811286,  0.25382858, -0.04463097,
        0.16833731, -0.14997524,  0.05852601,  0.15804391, -0.03989062,
        0.13692023,  0.06251453,  0.08666201,  0.1835815 ,  0.09227506,
       -0.07693344, -0.12836327, -0.08873512,  0.20354578,  0.27

In [52]:
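#NOTE: as with the training call, this uses avg_word2vec, so these are plain averages from a model trained on the test reviews.
test_tfidf_word2vec = avg_word2vec(X_test_w)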

3000
300


In [53]:
#Creating a Pickle file.
with open('train_tfidf_word2vec.pickle', 'wb') as handle:
    pickle.dump(train_tfidf_word2vec, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('test_tfidf_word2vec.pickle', 'wb') as handle:
    pickle.dump(test_tfidf_word2vec, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [54]:
with open('y_train_w.pickle', 'wb') as handle:
    pickle.dump(y_train_w, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('y_test_w.pickle', 'wb') as handle:
    pickle.dump(y_test_w, handle, protocol=pickle.HIGHEST_PROTOCOL)
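
The pickled features can be read back with `pickle.load`, opening the file in binary read mode. The sketch below round-trips a small made-up list through a temporary directory so it does not touch the notebook's files; the file name is arbitrary.

```python
import os
import pickle
import tempfile

# Made-up stand-in for a feature matrix.
features = [[0.5, 0.5], [0.25, 0.75]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_features.pickle")
    with open(path, "wb") as handle:
        pickle.dump(features, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == features)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.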