In [1]:
import pandas as pd
import numpy as np
import datetime as dt

# set options
pd.set_option('display.max_colwidth', None)  # show full column text (older pandas used -1 for this)
pd.options.mode.chained_assignment = None  # default='warn'

## Loading data

In [2]:
food_reviews = pd.read_csv('food_reviews.csv', encoding="ISO-8859-1")

In [25]:
food_reviews.tail()

Unnamed: 0,product/productId,review/userId,review/profileName,review/helpfulness,review/score,review/time,review/summary,review/text
568457,B003S1WTCU,A3I8AFVPEE8KI5,R. Sawyer,0/0,2.0,1331251000.0,disappointed,"I'm disappointed with the flavor. The chocolate notes are especially weak. Milk thickens it but the flavor still disappoints. This was worth a try but I'll never buy again. I will use what's left, which will be gone in no time thanks to the small cans."
568458,B004I613EE,A121AA1GQV751Z,"pksd ""pk_007""",2/2,5.0,1329782000.0,Perfect for our maltipoo,"These stars are small, so you can give 10-15 of those in one training session. I tried to train our dog with ""Ceaser dog treats"", it just made our puppy hyper. If you compare the ingredients, you will know why. Little stars has just basic food ingredients without any preservatives and food coloring. Sweet potato flavor also did not make my hand smell like dog food."
568459,B004I613EE,A3IBEVCTXKNOH,"Kathy A. Welch ""katwel""",1/1,5.0,1331597000.0,Favorite Training and reward treat,These are the BEST treats for training and rewarding your dog for being good while grooming. Lower in calories and loved by all the doggies. Sweet potatoes seem to be their favorite Wet Noses treat!
568460,B001LR2CU2,A3LGQPJCZVL9UC,srfell17,0/0,5.0,1338422000.0,Great Honey,"I am very satisfied ,product is as advertised, I use it on cereal, with raw vinegar, and as a general sweetner."
568461,,,,,,,,


#### The last row is entirely NaN, so remove it

In [3]:
food_reviews = food_reviews.iloc[:-1]

## Pre-processing data

* renaming columns
* checking datatype
* dealing with null values: impute the summary, drop the remaining rows
* reindexing of data
* conversion of timestamp to date
* features: creation of helpfulness_score and the extended review text
* removal of neutral reviews (rating 3)
* classification of sentiment into negative (0) and positive (1)

#### Renaming columns

In [4]:
# rename columns
#food_reviews.columns = [each.split("/")[1] for each in food_reviews.columns]
food_reviews.columns = ['productid', 'userid', 'username', 'helpfulness','rating','time','summary','text']
food_reviews.head()

Unnamed: 0,productid,userid,username,helpfulness,rating,time,summary,text
0,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1/1,5.0,1303862000.0,Good Quality Dog Food,I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.
1,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0/0,1.0,1346976000.0,Not as Advertised,"Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as ""Jumbo""."
2,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1/1,4.0,1219018000.0,"""Delight"" says it all","This is a confection that has been around a few centuries. It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar. And it is a tiny mouthful of heaven. Not too chewy, and very flavorful. I highly recommend this yummy treat. If you are familiar with the story of C.S. Lewis' ""The Lion, The Witch, and The Wardrobe"" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch."
3,B000UA0QIQ,A395BORC6FGVXV,Karl,3/3,2.0,1307923000.0,Cough Medicine,If you are looking for the secret ingredient in Robitussin I believe I have found it. I got this in addition to the Root Beer Extract I ordered (which was good) and made some cherry soda. The flavor is very medicinal.
4,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0/0,5.0,1350778000.0,Great taffy,"Great taffy at a great price. There was a wide assortment of yummy taffy. Delivery was very quick. If your a taffy lover, this is a deal."


#### Check datatypes

In [28]:
food_reviews.dtypes

productid      object 
userid         object 
username       object 
helpfulness    object 
rating         float64
time           float64
summary        object 
text           object 
dtype: object

#### Dealing with null values - imputation / dropping

In [29]:
# show every row that has a null in at least one column (each row listed once)
food_reviews[food_reviews.isnull().any(axis=1)]

Unnamed: 0,productid,userid,username,helpfulness,rating,time,summary,text
25509,B000LKZB4Y,A36BVYD0NT7Z0F,,0/0,5.0,1.314576e+09,These are the best mints and no aspartame or BHT,I was so shocked to find out that almost all gums have BHT. I went to the health food store and got gum with Xylitol but I didn't like the taste. B Fresh was the only one with Xylitol that didn't taste like aspartame. I saw Newmans Own Organic Mints and was happy to see no aspartame or BHT and they are really good. I hope the company starts making gum. The only problem is the mints are made in Mexico. What is the matter Americans can't make mints?
33958,B00412W76S,A3TJPSWY2HE4BS,"S. Layton ""homeschool blogger""",1/24,2.0,1.173312e+09,,"I only used two maybe three tea bags and got pregnant - can not drink during pregnancy. Not a bad taste, but I'm not a big tea fan either."
38874,B000AYDGZ2,A36BVYD0NT7Z0F,,2/3,1.0,1.278374e+09,doesn't anyone care that they are putting BHT into their bodies?,I called Kellogg's to see why Special K red berries has the natural preservative but Special K blueberry has the killer BHT. The women who answered wasn't helpful and didn't know why. She also seemed annoyed with my question. I won't even give my dog food with BHT why would I feed my family a cereal with BHT. Why don't these company use the more natural and safer preservative like tocoperol etc?
40548,B00020HHRW,A3TJPSWY2HE4BS,"S. Layton ""homeschool blogger""",1/24,2.0,1.173312e+09,,"I only used two maybe three tea bags and got pregnant - can not drink during pregnancy. Not a bad taste, but I'm not a big tea fan either."
49800,B000CRHQN0,A2LYFY32LXQDON,,0/0,2.0,1.282608e+09,They were melted and the chocolate had turned white,"We love these bars but i won't order them shipped from anywhere anymore. They came melted, white, and didn't taste as good as they do when they are fresh."
67077,B0006348H2,A2P0P67Y55SNOX,,1/1,5.0,1.314662e+09,Wheatgrass,"Kitty seems to like this sprinkled on her food...Glad I bought 2, because I forgot to water one for a few days... Brown grass doesn't work..."
94197,B002RIZUQ2,AS2DLXUWDK0GP,"MABEL ""Tell us about yourself!",,,,,


In [5]:
# Summary and text both contain missing values. Drop rows where both are missing.
food_reviews = food_reviews[food_reviews['summary'].notnull() | food_reviews['text'].notnull()]
# Where only the summary is missing, fill it with the string 'None'.
food_reviews['summary'] = food_reviews['summary'].fillna('None')

In [34]:
# check remaining null values
food_reviews.isna().sum()

productid      7 
userid         7 
username       23
helpfulness    0 
rating         0 
time           0 
summary        0 
text           0 
dtype: int64

In [35]:
# drop remaining null values
food_reviews.dropna(inplace = True)
food_reviews.shape

(568431, 8)

#### Reindexing of data

In [6]:
# since all null records are gone, reindex
food_reviews.index = range(food_reviews.shape[0])

#### Converting the timestamp and extracting year and month

In [7]:
food_reviews['time'] = pd.to_datetime(food_reviews['time'], unit='s')
food_reviews['year'] = food_reviews['time'].dt.year
food_reviews['month'] = food_reviews['time'].dt.month
food_reviews.head()

Unnamed: 0,productid,userid,username,helpfulness,rating,time,summary,text,year,month
0,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1/1,5.0,2011-04-27,Good Quality Dog Food,I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.,2011,4
1,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0/0,1.0,2012-09-07,Not as Advertised,"Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as ""Jumbo"".",2012,9
2,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1/1,4.0,2008-08-18,"""Delight"" says it all","This is a confection that has been around a few centuries. It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar. And it is a tiny mouthful of heaven. Not too chewy, and very flavorful. I highly recommend this yummy treat. If you are familiar with the story of C.S. Lewis' ""The Lion, The Witch, and The Wardrobe"" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.",2008,8
3,B000UA0QIQ,A395BORC6FGVXV,Karl,3/3,2.0,2011-06-13,Cough Medicine,If you are looking for the secret ingredient in Robitussin I believe I have found it. I got this in addition to the Root Beer Extract I ordered (which was good) and made some cherry soda. The flavor is very medicinal.,2011,6
4,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0/0,5.0,2012-10-21,Great taffy,"Great taffy at a great price. There was a wide assortment of yummy taffy. Delivery was very quick. If your a taffy lover, this is a deal.",2012,10
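
The conversion above can be checked on a couple of raw values. This is a minimal sketch on a toy frame (the two timestamps are illustrative, chosen to land on the dates shown in the first two rows); `unit='s'` tells pandas the numbers are seconds since 1970-01-01 UTC.

```python
import pandas as pd

# Toy frame with Unix timestamps in seconds (illustrative values).
toy = pd.DataFrame({'time': [1303862400.0, 1346976000.0]})

# Interpret the floats as seconds since the Unix epoch.
toy['time'] = pd.to_datetime(toy['time'], unit='s')

# The .dt accessor exposes the calendar parts of each timestamp.
toy['year'] = toy['time'].dt.year
toy['month'] = toy['time'].dt.month

print(toy)
```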


#### Extracting helpfulness score

In [8]:
food_reviews.loc[:,'helpful_upvotes'] = food_reviews.helpfulness.apply(lambda x: float(x.split("/")[0]))
food_reviews.loc[:,'helpful_responses'] = food_reviews.helpfulness.apply(lambda x: float(x.split("/")[1]))

In [9]:
# drop the 2 rows where upvotes > responses since it doesn't make sense
food_reviews = food_reviews[~(food_reviews.helpful_upvotes > food_reviews.helpful_responses)]

In [10]:
food_reviews['helpfulness_score']  = food_reviews.helpful_upvotes / food_reviews.helpful_responses 

#### Replacing NaNs caused by 0/0 division with -1

In [11]:
# use .loc so the assignment hits the frame itself rather than a temporary copy
food_reviews.loc[food_reviews['helpful_responses'] == 0, 'helpfulness_score'] = -1
food_reviews.head()

Unnamed: 0,productid,userid,username,helpfulness,rating,time,summary,text,year,month,helpful_upvotes,helpful_responses,helpfulness_score
0,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1/1,5.0,2011-04-27,Good Quality Dog Food,I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.,2011,4,1.0,1.0,1.0
1,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0/0,1.0,2012-09-07,Not as Advertised,"Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as ""Jumbo"".",2012,9,0.0,0.0,-1.0
2,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1/1,4.0,2008-08-18,"""Delight"" says it all","This is a confection that has been around a few centuries. It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar. And it is a tiny mouthful of heaven. Not too chewy, and very flavorful. I highly recommend this yummy treat. If you are familiar with the story of C.S. Lewis' ""The Lion, The Witch, and The Wardrobe"" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.",2008,8,1.0,1.0,1.0
3,B000UA0QIQ,A395BORC6FGVXV,Karl,3/3,2.0,2011-06-13,Cough Medicine,If you are looking for the secret ingredient in Robitussin I believe I have found it. I got this in addition to the Root Beer Extract I ordered (which was good) and made some cherry soda. The flavor is very medicinal.,2011,6,3.0,3.0,1.0
4,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0/0,5.0,2012-10-21,Great taffy,"Great taffy at a great price. There was a wide assortment of yummy taffy. Delivery was very quick. If your a taffy lover, this is a deal.",2012,10,0.0,0.0,-1.0
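
The helpfulness steps above (split, divide, patch the 0/0 cases) can also be written as one vectorized pass. The following is only a sketch on a toy frame, not the notebook's own code; the `toy`, `votes`, and `has_votes` names are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'helpfulness': ['1/1', '0/0', '2/3', '1/24']})

# Split "upvotes/responses" into two numeric columns in one pass.
votes = toy['helpfulness'].str.split('/', expand=True).astype(float)
toy['helpful_upvotes'] = votes[0]
toy['helpful_responses'] = votes[1]

# Divide only where there is at least one response; rows with no
# responses get the sentinel -1 instead of the NaN that 0/0 would give.
has_votes = toy['helpful_responses'] > 0
safe_denominator = toy['helpful_responses'].where(has_votes, 1)
toy['helpfulness_score'] = np.where(
    has_votes, toy['helpful_upvotes'] / safe_denominator, -1.0
)

print(toy)
```

The sentinel keeps "nobody voted" distinct from a genuine score of 0, at the cost of putting a non-ratio value into a numeric column.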


#### Create extended review text

In [12]:
# create extended text column
food_reviews['text_ext'] = food_reviews['summary'] + ' ' + food_reviews['text']

In [13]:
food_reviews_eda = food_reviews.copy()  # keep a copy that still contains the 3-star reviews for the EDA

#### Removal of neutral reviews

In [43]:
food_reviews = food_reviews[food_reviews['rating'] != 3.0]
food_reviews['rating'].value_counts()

5.0    363106
4.0    80653 
1.0    52264 
2.0    29767 
Name: rating, dtype: int64

#### Classification of positive and negative reviews

Ratings of 4 and above are labelled 1 (positive); ratings of 2 and below are labelled 0 (negative).

In [14]:
food_reviews['sentiment'] = food_reviews['rating'].apply(lambda x: 0 if x == 1 or x == 2 else 1)
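
Because the neutral 3-star reviews are already gone, the same labelling can be expressed as a single comparison. A small sketch on a toy Series (the values are illustrative):

```python
import pandas as pd

ratings = pd.Series([5.0, 4.0, 2.0, 1.0])  # neutral 3.0 reviews already removed

# The comparison yields booleans; astype(int) maps True/False to 1/0.
sentiment = (ratings >= 4).astype(int)

print(sentiment.tolist())
```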

In [15]:
food_reviews.columns

Index(['productid', 'userid', 'username', 'helpfulness', 'rating', 'time',
       'summary', 'text', 'year', 'month', 'helpful_upvotes',
       'helpful_responses', 'helpfulness_score', 'text_ext', 'sentiment'],
      dtype='object')

In [16]:
# reorder columns; reindex keeps only the columns listed,
# so the raw 'helpfulness' string column is dropped here as well
columnsTitles = ['productid', 'userid', 'username', 'time' ,'year','month','helpful_upvotes',
                 'helpful_responses','helpfulness_score','rating','sentiment','text_ext','summary','text']
food_reviews = food_reviews.reindex(columns=columnsTitles)

#### Final dataframe

In [47]:
food_reviews.head(5)

Unnamed: 0,productid,userid,username,time,year,month,helpful_upvotes,helpful_responses,helpfulness_score,rating,sentiment,text_ext,summary,text
0,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,2011-04-27,2011,4,1.0,1.0,1.0,5.0,1,Good Quality Dog Food I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.,Good Quality Dog Food,I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.
1,B00813GRG4,A1D87F6ZCVE5NK,dll pa,2012-09-07,2012,9,0.0,0.0,-1.0,1.0,0,"Not as Advertised Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as ""Jumbo"".",Not as Advertised,"Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as ""Jumbo""."
2,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",2008-08-18,2008,8,1.0,1.0,1.0,4.0,1,"""Delight"" says it all This is a confection that has been around a few centuries. It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar. And it is a tiny mouthful of heaven. Not too chewy, and very flavorful. I highly recommend this yummy treat. If you are familiar with the story of C.S. Lewis' ""The Lion, The Witch, and The Wardrobe"" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.","""Delight"" says it all","This is a confection that has been around a few centuries. It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar. And it is a tiny mouthful of heaven. Not too chewy, and very flavorful. I highly recommend this yummy treat. If you are familiar with the story of C.S. Lewis' ""The Lion, The Witch, and The Wardrobe"" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch."
3,B000UA0QIQ,A395BORC6FGVXV,Karl,2011-06-13,2011,6,3.0,3.0,1.0,2.0,0,Cough Medicine If you are looking for the secret ingredient in Robitussin I believe I have found it. I got this in addition to the Root Beer Extract I ordered (which was good) and made some cherry soda. The flavor is very medicinal.,Cough Medicine,If you are looking for the secret ingredient in Robitussin I believe I have found it. I got this in addition to the Root Beer Extract I ordered (which was good) and made some cherry soda. The flavor is very medicinal.
4,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",2012-10-21,2012,10,0.0,0.0,-1.0,5.0,1,"Great taffy Great taffy at a great price. There was a wide assortment of yummy taffy. Delivery was very quick. If your a taffy lover, this is a deal.",Great taffy,"Great taffy at a great price. There was a wide assortment of yummy taffy. Delivery was very quick. If your a taffy lover, this is a deal."


In [257]:
food_reviews.columns

Index(['productid', 'userid', 'username', 'time', 'year', 'month',
       'helpful_upvotes', 'helpful_responses', 'helpfulness_score', 'rating',
       'sentiment', 'text_ext', 'summary', 'text', 'review_tokens'],
      dtype='object')

In [17]:
# write everything except the token column (if it has already been created)
food_reviews.drop(columns=['review_tokens'], errors='ignore').to_csv('EDA.csv', sep=',')

## Text processing

#### Tokenize the words after converting the text to lower case

In [48]:
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
import nltk
#nltk.download('punkt')



In [None]:
food_reviews['review_tokens'] = food_reviews.text.str.lower().apply(lambda x: word_tokenize(x))
#food_reviews['review_tokens'] = food_reviews.text_ext.str.lower().apply(lambda x: word_tokenize(x)) # use this...

#### Stemming, then removing stop words and punctuation

In [None]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
# nltk.download()

In [None]:
stemmer = PorterStemmer()
food_reviews['review_tokens'] = food_reviews['review_tokens'].apply(lambda x: [stemmer.stem(words) for words in x])

In [None]:
# build the stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def remove_punctuations_and_stopwords(tokens):
    # keep alphanumeric tokens that are not English stop words
    return [tok for tok in tokens if tok.isalnum() and tok not in STOP_WORDS]

In [None]:
food_reviews['review_tokens'] = food_reviews['review_tokens'].apply(remove_punctuations_and_stopwords)
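
In the cells above the tokens are stemmed before the stop-word filter runs. One consequence is that stemmed forms such as `thi` and `wa` no longer match the stop list, which is why they survive into the token lists and the LDA topics further down. The sketch below illustrates the effect; it assumes a tiny hard-coded stop list in place of NLTK's `stopwords.words('english')` so that it runs without downloading any corpus.

```python
from nltk.stem import PorterStemmer

# Tiny illustrative stop list (an assumption, not NLTK's full list).
stop_words = {'this', 'was', 'the'}
stemmer = PorterStemmer()
tokens = ['this', 'product', 'was', 'good']

# Order used in the notebook: stem first, then filter.
stemmed = [stemmer.stem(w) for w in tokens]
stem_then_filter = [t for t in stemmed if t not in stop_words]

# Alternative order: filter first, then stem.
filter_then_stem = [stemmer.stem(w) for w in tokens if w not in stop_words]

print(stem_then_filter)
print(filter_then_stem)
```

Filtering before stemming would remove those tokens, but it would also change the vocabulary and therefore the accuracies reported below.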

### Training and saving the word2vec model so it can be loaded later

In [None]:
model = Word2Vec(food_reviews.review_tokens, size=100, window=5, min_count=1, workers=4)
model.save("word2vec_binary.model")

In [None]:
food_reviews['review_tokens']

In [None]:
pd.DataFrame(model.wv[food_reviews.review_tokens[0]])  # word vectors live on model.wv

## Model Building

In [None]:
def get_mean(x):
    # Gets the mean of the sentence so that the sentence is represented in the 100-d space
    x = pd.DataFrame(x)
    return x.mean()

def get_sum(x):
    # Gets the sum of the sentence so that the sentence is represented in the 100-d space
    x = pd.DataFrame(x)
    return x.sum()
    

In [None]:
food_reviews.index = range(food_reviews.shape[0]) 

In [None]:
# one mean vector per review; word vectors are looked up on model.wv
df_model_avg = pd.DataFrame(
    [get_mean(model.wv[tokens]) for tokens in food_reviews.review_tokens]
)

In [None]:
# one summed vector per review; word vectors are looked up on model.wv
df_model_sum = pd.DataFrame(
    [get_sum(model.wv[tokens]) for tokens in food_reviews.review_tokens]
)

### Performing logistic regression and cross-validation to get the sentiment

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
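# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection provides cross_val_predict
from sklearn import model_selection as cross_validation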

In [None]:
predicted = cross_validation.cross_val_predict(LogisticRegression(), df_model_avg, food_reviews['sentiment'], cv=10)
print(metrics.accuracy_score(food_reviews['sentiment'], predicted))

The mean-vector model reaches an accuracy of 90.6%.
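
The accuracy here is computed from out-of-fold predictions: each review is predicted by a model that never saw it during training. A minimal sketch of that procedure on synthetic data follows; the data and names are made up, and the import path is the one current scikit-learn releases use (`sklearn.model_selection`).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

# Synthetic, linearly separable toy problem standing in for the review vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# With cv=10 every sample is predicted exactly once, by the model
# trained on the other nine folds.
predicted = cross_val_predict(LogisticRegression(), X, y, cv=10)
acc = accuracy_score(y, predicted)

print(round(acc, 3))
```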

In [None]:
predicted = cross_validation.cross_val_predict(LogisticRegression(), df_model_sum, food_reviews['sentiment'], cv=10)
print(metrics.accuracy_score(food_reviews['sentiment'], predicted))

The sum-vector model reaches 90.18%, slightly lower, so the mean model is kept.
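
One difference between the two representations is that the sum grows with review length while the mean does not. The sketch below shows this with hypothetical 3-d vectors standing in for the 100-d word2vec vectors; the `vectors` dict and the `embed` helper are illustrative only.

```python
import numpy as np

# Hypothetical 3-d word vectors (the notebook's real vectors are 100-d).
vectors = {
    'great': np.array([1.0, 0.0, 2.0]),
    'taffi': np.array([0.0, 1.0, 1.0]),
}

def embed(tokens, how='mean'):
    # Stack one row per token, then collapse the rows into one vector.
    stacked = np.vstack([vectors[t] for t in tokens])
    return stacked.mean(axis=0) if how == 'mean' else stacked.sum(axis=0)

short = ['great', 'taffi']
repeated = short * 3  # same words, three times as long

print(embed(short, 'mean'), embed(repeated, 'mean'))
print(embed(short, 'sum'), embed(repeated, 'sum'))
```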

### Implementing the same on extended review

In [50]:
food_reviews['review_tokens'] = food_reviews['text_ext'].str.lower().apply(lambda x: word_tokenize(x))

#### Stemming, then removing stop words and punctuation

In [51]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
# nltk.download()

In [52]:
stemmer = PorterStemmer()
food_reviews['review_tokens'] = food_reviews['review_tokens'].apply(lambda x: [stemmer.stem(words) for words in x])

In [53]:
# build the stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def remove_punctuations_and_stopwords(tokens):
    # keep alphanumeric tokens that are not English stop words
    return [tok for tok in tokens if tok.isalnum() and tok not in STOP_WORDS]

In [54]:
food_reviews['review_tokens'] = food_reviews['review_tokens'].apply(remove_punctuations_and_stopwords)

### Training and saving the word2vec model so it can be loaded later

In [55]:
model = Word2Vec(food_reviews.review_tokens, size=100, window=5, min_count=1, workers=4)
model.save("word2vec_binary_ext.model")

In [56]:
def get_mean(x):
    # Gets the mean of the sentence so that the sentence is represented in the 100-d space
    x = pd.DataFrame(x)
    return x.mean()

def get_sum(x):
    # Gets the sum of the sentence so that the sentence is represented in the 100-d space
    x = pd.DataFrame(x)
    return x.sum()
    

In [57]:
food_reviews.index = range(food_reviews.shape[0]) 

In [58]:
# one mean vector per review; word vectors are looked up on model.wv
df_model_avg = pd.DataFrame(
    [get_mean(model.wv[tokens]) for tokens in food_reviews.review_tokens]
)


In [59]:
# one summed vector per review; word vectors are looked up on model.wv
df_model_sum = pd.DataFrame(
    [get_sum(model.wv[tokens]) for tokens in food_reviews.review_tokens]
)


### Performing logistic regression and cross-validation to get the sentiment

In [60]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
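# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection provides cross_val_predict
from sklearn import model_selection as cross_validation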



In [61]:
predicted = cross_validation.cross_val_predict(LogisticRegression(), df_model_avg, food_reviews['sentiment'], cv=10)
print(metrics.accuracy_score(food_reviews['sentiment'], predicted))

0.9214153939785846


The mean-vector model on the extended review reaches an accuracy of 92.14%.

In [62]:
predicted = cross_validation.cross_val_predict(LogisticRegression(), df_model_sum, food_reviews['sentiment'], cv=10)
print(metrics.accuracy_score(food_reviews['sentiment'], predicted))

0.919192072880808


The sum-vector model reaches 91.92%, again slightly lower, so the mean model is kept.

#### Feature engineering helped: combining summary and text into one column raised accuracy from 90.6% to 92.14%. This is the best model so far.

#### Adding helpfulness score as well

In [63]:
predicted = cross_validation.cross_val_predict(LogisticRegression(), pd.concat([df_model_avg.reset_index(drop=True), food_reviews['helpfulness_score']], axis=1), food_reviews['sentiment'], cv=10)
print(metrics.accuracy_score(food_reviews['sentiment'], predicted))

0.9215998782784002


In [64]:
word_2_vec_df = pd.concat([df_model_avg.reset_index(drop=True), food_reviews['helpfulness_score'], food_reviews['text_ext'].str.split().str.len()], axis=1)

In [65]:
predicted = cross_validation.cross_val_predict(LogisticRegression(), word_2_vec_df, food_reviews['sentiment'], cv=10)
print(metrics.accuracy_score(food_reviews['sentiment'], predicted))

0.9213621407786379


### Implementing bigrams

In [158]:
from gensim.models.phrases import Phrases, Phraser

In [193]:
bigram = Phrases(list(food_reviews['review_tokens']), min_count=1, threshold=2, delimiter=b' ')
bigram_phraser = Phraser(bigram)
food_reviews['review_tokens'] = bigram_phraser[list(food_reviews['review_tokens'])]
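
To give a feel for what `min_count=1` and `threshold=2` do, here is a pure-Python sketch of the Mikolov-style score that the gensim `Phrases` documentation describes as its default: `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))`. This is an illustration under that assumption, not gensim's implementation; the toy `docs` and the `score` and `merge` helpers are made up.

```python
from collections import Counter

docs = [
    ['peanut', 'butter', 'tast', 'good'],
    ['love', 'peanut', 'butter', 'cooki'],
    ['good', 'dog', 'food'],
    ['cat', 'like', 'food'],
]

unigrams = Counter(w for doc in docs for w in doc)
bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
vocab_size = len(unigrams)

def score(a, b, min_count=1):
    # Pairs that co-occur often relative to their parts score high.
    return (bigrams[(a, b)] - min_count) * vocab_size / (unigrams[a] * unigrams[b])

def merge(doc, threshold=2):
    # Greedily join adjacent tokens whose score exceeds the threshold.
    out, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and score(doc[i], doc[i + 1]) > threshold:
            out.append(doc[i] + ' ' + doc[i + 1])
            i += 2
        else:
            out.append(doc[i])
            i += 1
    return out

print(score('peanut', 'butter'))
print([merge(doc) for doc in docs])
```

With a threshold as low as 2, any pair seen more than once among rare words tends to be merged, so the resulting vocabulary grows quickly.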

### Training and saving the word2vec model so it can be loaded later

In [194]:
model = Word2Vec(food_reviews.review_tokens, size=100, window=5, min_count=1, workers=4)
model.save("word2vec_binary_bigram.model")

In [195]:
def get_mean(x):
    # Gets the mean of the sentence so that the sentence is represented in the 100-d space
    x = pd.DataFrame(x)
    return x.mean()

def get_sum(x):
    # Gets the sum of the sentence so that the sentence is represented in the 100-d space
    x = pd.DataFrame(x)
    return x.sum()
    

In [196]:
food_reviews.index = range(food_reviews.shape[0]) 

In [197]:
# one mean vector per review; word vectors are looked up on model.wv
df_model_avg = pd.DataFrame(
    [get_mean(model.wv[tokens]) for tokens in food_reviews.review_tokens]
)


In [198]:
# one summed vector per review; word vectors are looked up on model.wv
df_model_sum = pd.DataFrame(
    [get_sum(model.wv[tokens]) for tokens in food_reviews.review_tokens]
)


### Performing logistic regression and cross-validation to get the sentiment

In [199]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
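# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection provides cross_val_predict
from sklearn import model_selection as cross_validation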

In [200]:
predicted = cross_validation.cross_val_predict(LogisticRegression(), df_model_avg, food_reviews['sentiment'], cv=10)
print(metrics.accuracy_score(food_reviews['sentiment'], predicted))

0.9227315087772685


Accuracy is 92.27%, an improvement of 0.13 percentage points over the unigram mean model (92.14%).

In [201]:
predicted = cross_validation.cross_val_predict(LogisticRegression(), df_model_sum, food_reviews['sentiment'], cv=10)
print(metrics.accuracy_score(food_reviews['sentiment'], predicted))

0.9200517316799482


The sum model reaches 92.01%, a slight improvement over the previous sum model (91.92%).

### Implementing the same using TF-IDF

In [None]:
from nltk import FreqDist

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
stop_words = set(stopwords.words('english'))
# lower-case the extended review and drop stop words
food_reviews['extended_review_processed'] = food_reviews['text_ext'].str.lower().apply(
    lambda text: " ".join(w for w in text.split() if w not in stop_words))


In [None]:
stemmer = PorterStemmer()
food_reviews['extended_review_processed'] = food_reviews['extended_review_processed'].apply(lambda x: [stemmer.stem(words) for words in x.split(' ')])

In [None]:
food_reviews['extended_review_processed'] = food_reviews['extended_review_processed'].apply(lambda x : ' '.join(x))

In [None]:
vectorizer = TfidfVectorizer(max_df = 0.8, max_features=3000)
tfidf = vectorizer.fit_transform(food_reviews['extended_review_processed'])


In [None]:
model = None
del model

In [None]:
feature_names = vectorizer.get_feature_names()
corpus_index = food_reviews.index
tfidf_df = pd.DataFrame(tfidf.todense(), columns=feature_names)
tfidf_df.head()

### Performing logistic regression and 10-fold cross-validation to get the sentiment

In [None]:
tfidf.shape


In [None]:
predicted = cross_validation.cross_val_predict(LogisticRegression(), tfidf, food_reviews['sentiment'], cv=10)
print(metrics.accuracy_score(food_reviews['sentiment'], predicted))

The TF-IDF method gives an accuracy of 94%
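
For reference, the weighting behind these features can be reproduced by hand. The sketch below assumes `TfidfVectorizer`'s documented defaults (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row) and leaves out `max_df` and `max_features`; the toy documents are made up.

```python
import numpy as np

docs = [['good', 'dog', 'food'], ['good', 'cat', 'food'], ['bad', 'chip']]
vocab = sorted({w for d in docs for w in d})

# Raw term counts: one row per document, one column per vocabulary word.
counts = np.array([[d.count(w) for w in vocab] for d in docs], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # documents containing each word
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed inverse document frequency

tfidf = counts * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalise each row

print(vocab)
print(tfidf.round(3))
```

Words that appear in fewer documents ('dog') end up with a larger weight than words shared across documents ('good'), which is the property the classifier benefits from.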

### Implementing LDA

In [66]:
food_reviews.review_tokens2 = food_reviews.review_tokens[0:500]
food_reviews.review_tokens2


0      [good, qualiti, dog, food, bought, sever, vital, dog, food, product, found, good, qualiti, product, look, like, stew, process, meat, smell, better, labrador, finicki, appreci, thi, product, better]
1      [advertis, product, arriv, label, jumbo, salt, peanut, peanut, actual, small, size, unsalt, sure, thi, wa, error, vendor, intend, repres, product, jumbo]

In [67]:
import gensim
from gensim.utils import simple_preprocess

In [68]:
# Build a gensim Dictionary from the first 500 tokenized reviews;
# it maps every distinct token to an integer id.
dictionary = gensim.corpora.Dictionary(food_reviews.review_tokens2)

In [70]:
# Convert each review to bag-of-words form: a list of (token_id, count) pairs.
# Save the result to 'bow_corpus'.
bow_corpus = [dictionary.doc2bow(doc) for doc in food_reviews.review_tokens2]

In [98]:
ldamodel = gensim.models.ldamodel.LdaModel(bow_corpus, num_topics = 5, id2word=dictionary, passes=15, minimum_probability= 0)
ldamodel.save('model5.gensim')

(0, '0.032*"thi" + 0.026*"br" + 0.017*"food" + 0.013*"great" + 0.012*"good" + 0.010*"wa" + 0.009*"dog"')
(1, '0.023*"thi" + 0.020*"br" + 0.013*"one" + 0.013*"food" + 0.010*"cat" + 0.010*"like" + 0.009*"tast"')
(2, '0.032*"thi" + 0.017*"sugar" + 0.012*"product" + 0.011*"flavor" + 0.011*"use" + 0.010*"wa" + 0.010*"good"')
(3, '0.017*"tast" + 0.015*"thi" + 0.014*"like" + 0.013*"br" + 0.012*"wa" + 0.010*"bag" + 0.009*"chip"')
(4, '0.043*"chip" + 0.027*"br" + 0.016*"flavor" + 0.013*"thi" + 0.013*"kettl" + 0.012*"bag" + 0.012*"love"')


In [155]:
# one row per document, one column per topic id, built in a single pass
debae = pd.DataFrame(
    [dict(ldamodel.get_document_topics(bow)) for bow in bow_corpus]
)

## Taking a 1% sample of the data and building models on it

In [205]:
food_review_sample = food_reviews.sample(frac=0.01)
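
As written, `sample(frac=0.01)` draws a different 1% on every run, so the counts and accuracies that follow will vary between runs. If reproducibility matters, passing `random_state` pins the draw. A small sketch on a toy frame (the seed 42 is an arbitrary choice, not something the notebook used):

```python
import pandas as pd

toy = pd.DataFrame({'rating': range(1000)})

# The same seed gives the same rows on every call.
sample_a = toy.sample(frac=0.01, random_state=42)
sample_b = toy.sample(frac=0.01, random_state=42)

print(len(sample_a), sample_a.index.equals(sample_b.index))
```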

In [206]:
len(food_review_sample.productid.unique())

3997

### Carrying out LDA on it

In [85]:
from sklearn.decomposition import LatentDirichletAllocation

In [207]:
dictionary = gensim.corpora.Dictionary(food_review_sample.review_tokens)
bow_corpus = [dictionary.doc2bow(doc) for doc in food_review_sample.review_tokens]

In [208]:
ldamodel = gensim.models.ldamodel.LdaModel(bow_corpus, num_topics = 50, id2word=dictionary, passes=15, minimum_probability= 0)

In [212]:
# one row per document, one column per topic id, built in a single pass
lda_df = pd.DataFrame(
    [dict(ldamodel.get_document_topics(bow)) for bow in bow_corpus]
)

In [215]:
food_review_sample.shape

(5258, 15)

In [216]:
#Performing logistic regression with 10-fold CV
predicted = cross_validation.cross_val_predict(LogisticRegression(), lda_df, food_review_sample['sentiment'], cv=10)
print(metrics.accuracy_score(food_review_sample['sentiment'], predicted))

0.8488018257892734


#### TF-IDF on the sample

In [217]:
from nltk import FreqDist

In [218]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [224]:
vectorizer = TfidfVectorizer(max_df = 0.8, max_features=3000)
tfidf = vectorizer.fit_transform(food_review_sample['review_tokens'].apply(lambda x : ' '.join(x)))


In [225]:
feature_names = vectorizer.get_feature_names()
corpus_index = food_review_sample.index  # index of the sampled reviews, not the full dataset
tfidf_df = pd.DataFrame(tfidf.todense(), columns=feature_names)
tfidf_df.shape

(5258, 3000)

In [228]:
#Performing logistic regression with 10-fold CV
predicted = cross_validation.cross_val_predict(LogisticRegression(), tfidf_df, food_review_sample['sentiment'], cv=10)
print(metrics.accuracy_score(food_review_sample['sentiment'], predicted))

0.8883605933815139


Accuracy with TF-IDF features: 88.84%.

In [240]:
lda_df.index = range(0, len(lda_df))

#### Combining TF-IDF and LDA

In [254]:
#Performing logistic regression with 10-fold CV
predicted = cross_validation.cross_val_predict(LogisticRegression(), pd.concat([tfidf_df.reset_index(drop=True), lda_df], axis=1), food_review_sample['sentiment'], cv=10)
print(metrics.accuracy_score(food_review_sample['sentiment'], predicted))

0.8881704069988589


Accuracy: 88.82%, a slight decrease from TF-IDF alone.

#### Word2Vec on the sample

In [244]:
bigram = Phrases(list(food_review_sample['review_tokens']), min_count=1, threshold=2, delimiter=b' ')
bigram_phraser = Phraser(bigram)
food_review_sample['review_tokens'] = bigram_phraser[list(food_review_sample['review_tokens'])]

In [246]:
model = Word2Vec(food_review_sample.review_tokens, size=100, window=5, min_count=1, workers=4)

In [247]:
def get_mean(x):
    # Average the word vectors of a review (one row per token) so the whole
    # review is represented by a single vector in the 100-d embedding space
    x = pd.DataFrame(x)
    return x.mean()

def get_sum(x):
    # Sum the word vectors of a review (one row per token); unlike the mean,
    # the result grows with the length of the review
    x = pd.DataFrame(x)
    return x.sum()
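
To make the pooling step concrete, here is a small sketch with made-up 3-d vectors in place of the 100-d Word2Vec vectors. The function `pool_vectors` and the toy vocabulary are hypothetical and only illustrate what the mean and sum produce for one review.

```python
import numpy as np

def pool_vectors(token_vectors, how="mean"):
    """Collapse a (num_tokens, dim) stack of word vectors into one vector."""
    vectors = np.asarray(token_vectors, dtype=float)
    if how == "mean":
        return vectors.mean(axis=0)
    if how == "sum":
        return vectors.sum(axis=0)
    raise ValueError("how must be 'mean' or 'sum'")

# Made-up 3-d embeddings; real Word2Vec vectors here would be 100-d
toy_vectors = {"great": [1.0, 0.0, 2.0], "chip": [3.0, 4.0, 0.0]}
review = ["great", "chip"]
stacked = [toy_vectors[token] for token in review]

mean_vec = pool_vectors(stacked, "mean")  # array([2., 2., 1.])
sum_vec = pool_vectors(stacked, "sum")    # array([4., 4., 2.])
```

The sum equals the mean multiplied by the number of tokens, so summed vectors also encode review length, while averaged vectors do not.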
    

In [248]:
food_review_sample.index = range(food_review_sample.shape[0]) 

In [249]:
df_model_avg = []
for i in range(0, len(food_review_sample)):
    df_model_avg.append(get_mean(model.wv[food_review_sample.review_tokens[i]]))
    
df_model_avg = pd.DataFrame(df_model_avg)    


In [250]:
df_model_sum = []
for i in range(0, len(food_review_sample)):
    df_model_sum.append(get_sum(model.wv[food_review_sample.review_tokens[i]]))
    
df_model_sum = pd.DataFrame(df_model_sum)    


In [251]:
#Performing logistic regression with 10-fold CV
predicted = cross_validation.cross_val_predict(LogisticRegression(), df_model_avg, food_review_sample['sentiment'], cv=10)
print(metrics.accuracy_score(food_review_sample['sentiment'], predicted))

0.8497527577025485


Accuracy with averaged Word2Vec vectors: 84.98%.

In [252]:
#Performing logistic regression with 10-fold CV
predicted = cross_validation.cross_val_predict(LogisticRegression(), df_model_sum, food_review_sample['sentiment'], cv=10)
print(metrics.accuracy_score(food_review_sample['sentiment'], predicted))

0.8488018257892734


Accuracy with summed Word2Vec vectors: 84.88%.
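
The score above is identical to the LDA-only score (0.8488018257892734). That can happen when a classifier predicts the same class for nearly every review, so it may be worth comparing these numbers against a majority-class baseline. The sketch below shows one way to compute that baseline; the function name and the toy labels are hypothetical, and the actual values stored in the `sentiment` column are not assumed.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Toy labels for illustration only; in the notebook this would be
# called on food_review_sample['sentiment']
toy_labels = ["positive"] * 8 + ["negative"] * 2
baseline = majority_baseline_accuracy(toy_labels)
print(baseline)  # 0.8
```

If the baseline on the sample comes out close to 0.8488, the LDA and summed Word2Vec features add little beyond the class prior for this classifier.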

#### Combining TF-IDF, Word2Vec and LDA

In [253]:
#Performing logistic regression with 10-fold CV
predicted = cross_validation.cross_val_predict(LogisticRegression(), pd.concat([tfidf_df.reset_index(drop=True), lda_df, df_model_avg], axis=1), food_review_sample['sentiment'], cv=10)
print(metrics.accuracy_score(food_review_sample['sentiment'], predicted))

0.8879802206162039


Accuracy with all three feature sets: 88.80%.

#### Combining TF-IDF and Word2Vec

In [255]:
#Performing logistic regression with 10-fold CV
predicted = cross_validation.cross_val_predict(LogisticRegression(), pd.concat([tfidf_df.reset_index(drop=True), df_model_avg], axis=1), food_review_sample['sentiment'], cv=10)
print(metrics.accuracy_score(food_review_sample['sentiment'], predicted))

0.8887409661468239


Accuracy: 88.87%, the highest of the combinations tried.

#### Combining LDA and Word2Vec

In [256]:
#Performing logistic regression with 10-fold CV
predicted = cross_validation.cross_val_predict(LogisticRegression(), pd.concat([lda_df.reset_index(drop=True), df_model_avg], axis=1), food_review_sample['sentiment'], cv=10)
print(metrics.accuracy_score(food_review_sample['sentiment'], predicted))

0.8489920121719285


Accuracy with LDA and Word2Vec: 84.90%.

In [None]:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 
pyLDAvis.gensim.prepare(ldamodel, bow_corpus, dictionary)

'''
What do we see here?

The left panel, labeled Intertopic Distance Map, shows each topic as a circle; the distance between circles reflects
how different the topics are. Similar topics appear closer together and dissimilar topics farther apart. The relative
size of a topic's circle corresponds to the relative frequency of the topic in the corpus. An individual topic may be
selected for closer scrutiny by clicking on its circle, or by entering its number in the "selected topic" box in the upper-left.

The right panel shows a bar chart of the top 30 terms. When no topic is selected in the plot on the left,
the bar chart shows the 30 most "salient" terms in the corpus. A term's saliency is a measure of both how
frequent the term is in the corpus and how "distinctive" it is in distinguishing between different topics.
Selecting a topic on the left changes the bar chart to show the most "relevant" terms for the selected topic.
Relevance is defined as in footer 2 and can be tuned with the parameter λ: a smaller λ gives higher weight to a term's
distinctiveness, while a larger λ gives higher weight to the probability of the term occurring within the topic.

Therefore, to get a better sense of the terms that distinguish each topic, use λ = 0.

'''
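
The effect of λ can be illustrated with a small sketch. It assumes the relevance definition from the LDAvis paper by Sievert and Shirley, relevance(w | t) = λ·log p(w | t) + (1 − λ)·log(p(w | t) / p(w)), which is what pyLDAvis is understood to implement. The function name and all probabilities below are made up for illustration and are not taken from the fitted model.

```python
import numpy as np

def relevance(topic_term_prob, term_prob, lam):
    """Relevance of each term to a topic for a given lambda in [0, 1]."""
    topic_term_prob = np.asarray(topic_term_prob, dtype=float)
    term_prob = np.asarray(term_prob, dtype=float)
    lift = topic_term_prob / term_prob
    return lam * np.log(topic_term_prob) + (1 - lam) * np.log(lift)

# Made-up probabilities for three stemmed terms in one topic
terms = ["thi", "chip", "kettl"]
p_term_given_topic = [0.30, 0.20, 0.05]   # p(w | t)
p_term = [0.30, 0.05, 0.005]              # p(w) over the whole corpus

# lambda = 1 ranks purely by within-topic probability
order_lam1 = np.argsort(-relevance(p_term_given_topic, p_term, 1.0))
# lambda = 0 ranks purely by lift, i.e. how distinctive the term is
order_lam0 = np.argsort(-relevance(p_term_given_topic, p_term, 0.0))

print([terms[i] for i in order_lam1])  # ['thi', 'chip', 'kettl']
print([terms[i] for i in order_lam0])  # ['kettl', 'chip', 'thi']
```

With λ = 1 a generic term such as "thi" tops the list because it is frequent everywhere; with λ = 0 the rarer but topic-specific "kettl" moves to the top.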