# Objective

The objective is to reduce the number of dimensions of the Amazon food reviews dataset so that we can visualize it using a scatter plot. The plot should ideally distinguish between positive and negative reviews.

I'm not building a classification model here. The goal is to apply t-SNE to various vector representations of the text data. I will use these models to vectorize the text:


* Bag of Words (Unigram and Bigram)
* Tfidf
* Doc2Vec
* Average W2V
* Tfidf weighted W2V



As always, let's start by loading the files that we need. I'm working on Google Colab for this exercise, since my laptop isn't powerful enough to handle the workload.

To avoid having to upload data from local disk every time the environment is disconnected, I'll be using Google Drive to store all the data and pickled models.

In [None]:
from google.colab import drive
drive.mount('/gdrive')

List files on the drive.

In [None]:
# # List all files present on google drive
# import os
# os.listdir('/gdrive/My Drive')

Import needed libraries

In [None]:
%matplotlib inline

import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sbn
import nltk
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer, CountVectorizer
from sklearn.metrics import confusion_matrix, roc_curve, auc
from nltk.stem.porter import PorterStemmer

The raw data is present in the form of a sqlite file. Let's retrieve it from google drive.

In [None]:
# load sqlite database
con = sqlite3.connect(r'/gdrive/My Drive/amazon/database.sqlite')

Since the aim here is just to visualize the positive and negative reviews, I select only the reviews that aren't neutral. It's a fair assumption that reviews with a score of 3 are neutral, so we'll work with reviews that have a score of 1, 2, 4 or 5.

In [None]:
review_vector_3k_tsne = tsne.fit_transform(X_scaled)

Shape of the numpy array after reducing the dimensions to two. 

In [None]:
review_vector_3k_tsne.shape

Pickle important variables again.

In [None]:
# # Pickle it first!
# import pickle
# pickle_file = open('/gdrive/My Drive/amazon/pickled_3k_reviews_tsne_bigram.pkl', 'wb')
# pickle.dump(review_vector_3k_tsne, pickle_file)
# pickle_file.close()

Build dataframe

In [None]:
bigram_tsne_3k_array = np.vstack((review_vector_3k_tsne.T, df_np['Score'])).T
df_bigram_tsne_3k = pd.DataFrame(bigram_tsne_3k_array, columns=['First Dimension', 'Second Dimension', 'Label'])

In [None]:
df_bigram_tsne_3k.head()

Plot the result.

In [None]:
# seaborn >= 0.9 renamed the FacetGrid `size` argument to `height`
sbn.FacetGrid(df_bigram_tsne_3k, hue="Label", height=6).map(plt.scatter, 'First Dimension', 'Second Dimension').add_legend()
plt.show()

Unfortunately, using bigrams also does not yield anything useful. Let's look at TFIDF next.

### Tfidf

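Tfidf stands for Term Frequency-Inverse Document Frequency. It weights each word by how often it occurs in a review, scaled down by how many reviews contain it.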
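To make the idea concrete, here is a minimal sketch on a made-up three-document corpus (the documents and the `toy_` names are assumptions for illustration, not part of the reviews dataset). It uses scikit-learn's `TfidfVectorizer` with default settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration only
toy_reviews = [
    "great taste great price",
    "bad taste",
    "great coffee",
]

toy_vec = TfidfVectorizer()
toy_tfidf = toy_vec.fit_transform(toy_reviews)

# One row per document, one column per distinct word
print(toy_tfidf.shape)

# Words that occur in fewer documents get a larger idf weight
idf_great = toy_vec.idf_[toy_vec.vocabulary_['great']]
idf_price = toy_vec.idf_[toy_vec.vocabulary_['price']]
print('idf(great) =', idf_great)
print('idf(price) =', idf_price)
```

Here 'great' appears in two documents and 'price' in only one, so 'price' receives the larger idf value.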

Let's define the model and fit it to the data.

There are about half a million reviews in the original data file that are not neutral.

In [None]:
df.info()

For this exercise, we want score to be a categorical feature.
* Mark reviews with rating > 3 as positive
* Mark reviews with rating < 3 as negative

In [None]:
m1 =  df['Score'] > 3 
m2 =  df['Score'] < 3 

df['Score'] = np.select([m1,m2], ['positive','negative'])

This is an imbalanced dataset. The number of positive reviews is almost 6 times the number of negative reviews!

In [None]:
df.Score.value_counts()

Let's change the datatype of Score in our Pandas dataframe to 'category'. Using categories instead of the default 'object' datatype reduces memory usage and can speed up operations on the column.

In [None]:
df['Score']=df['Score'].astype('category')
df.info()
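The effect of the category datatype can be checked with a rough sketch like the one below. The `scores` series is a made-up stand-in for `df['Score']`, and the exact byte counts will vary with the pandas version.

```python
import pandas as pd

# Toy column standing in for df['Score'] (hypothetical data)
scores = pd.Series(['positive', 'negative'] * 50_000)

# deep=True also counts the memory held by the Python string objects
as_object = scores.astype('object').memory_usage(deep=True)
as_category = scores.astype('category').memory_usage(deep=True)

print('object dtype bytes:  ', as_object)
print('category dtype bytes:', as_category)
```

A categorical column stores each distinct label once plus a small integer code per row, which is why it tends to be much smaller when there are only two labels.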

In [None]:
df.head()

Are there any duplicate rows in the dataset?
Inspecting the "Text" column, we clearly see there are duplicates.

In [None]:
df.duplicated('Text').value_counts()

There are also a few anomalies in the data where HelpfulnessNumerator is greater than HelpfulnessDenominator.

* HelpfulnessNumerator = Number of users who found the review helpful
* HelpfulnessDenominator = Number of users who voted on the review (helpful + not helpful)

Therefore, HelpfulnessDenominator can't be less than HelpfulnessNumerator. We need to get rid of such erroneous records.
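The check described above can be sketched with a boolean mask. The `toy` dataframe below is made up for illustration; only the two column names come from the dataset.

```python
import pandas as pd

# Toy rows: the second one is invalid (numerator > denominator)
toy = pd.DataFrame({
    'HelpfulnessNumerator':   [1, 3, 0],
    'HelpfulnessDenominator': [2, 1, 0],
})

# True for rows that violate numerator <= denominator
invalid = toy['HelpfulnessNumerator'] > toy['HelpfulnessDenominator']

# Keep only the rows that pass the check
toy_valid = toy[~invalid]
print(toy_valid)
```

Filtering with `~invalid` returns a new dataframe and leaves the original untouched, which is one alternative to dropping rows in place.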

In [None]:
bigrams.shape

In [None]:
bigrams

Convert the sparse matrix to a dense numpy matrix.

In [None]:
review_vector_3k = bigrams.toarray()

In [None]:
review_vector_3k.shape

Standardize the data

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_scaled = scaler.fit_transform(review_vector_3k)

At this point, memory is about to run out so let's store the variable on Google drive.

In [None]:
# import pickle
# pickle_file = open('/gdrive/My Drive/amazon/X_scaled_standardized_3k_bigram_bow_nparray.pkl', 'wb')
# pickle.dump(X_scaled, pickle_file)
# pickle_file.close()

Load the pickled X_scaled file.

In [None]:
# import pickle
# file_path = open('/gdrive/My Drive/amazon/X_scaled_standardized_3k_bigram_bow_nparray.pkl', 'rb')
# X_scaled = pickle.load(file_path)

Create a t-SNE model

Considering how long the data-cleaning code above takes to run on a modest computer, it's a good idea to store the result on disk for future use. The entire dataframe, along with the newly created column "cleaned_text", is stored.

In [None]:
# #Save final cleaned dataframe to the drive.
# conn = sqlite3.connect('/gdrive/My Drive/amazon/reviews_cleaned_final.sqlite')
# df.to_sql('Reviews', conn)
# conn.close()

If the environment was disconnected, load the cleaned dataframe.

In [None]:
# conn = sqlite3.connect('/gdrive/My Drive/amazon/reviews_cleaned_final.sqlite')
# df = pd.read_sql('select * from Reviews;', conn, index_col='index')
# conn.close()
# df.head()

### Bag of words

Let's vectorize the data using the simplest method first: BoW

There is class imbalance in our original dataset: positive reviews far outnumber negative ones. Let's take 1500 positive and 1500 negative samples.

In [None]:
n = df['Score'] == 'negative'
p = df['Score'] == 'positive'
#df_n = df[df['Score']]
df_n = df[n][['cleaned_text','Score']][:1500]
df_p = df[p][['cleaned_text', 'Score']][:1500]

df_np = pd.concat([df_n, df_p])
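The cell above takes the first 1500 rows of each class. If a random balanced sample is preferred instead, one possible sketch (on a made-up `toy_df`, with a hypothetical `n_per_class`) uses `groupby(...).sample`, which needs pandas 1.1 or later.

```python
import pandas as pd

# Toy imbalanced data standing in for the reviews dataframe
toy_df = pd.DataFrame({
    'cleaned_text': ['review %d' % i for i in range(20)],
    'Score': ['positive'] * 15 + ['negative'] * 5,
})

n_per_class = 3  # hypothetical; the notebook uses 1500

# Draw the same number of rows from each class, reproducibly
balanced = toy_df.groupby('Score').sample(n=n_per_class, random_state=7)
print(balanced['Score'].value_counts())
```

Random sampling avoids any ordering effects in the source file, at the cost of a different subset than the one used in the rest of this notebook.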

Initialize the CountVectorizer class which creates a Bag of Words representation. 

In [None]:
count_vec = CountVectorizer()
final_counts = count_vec.fit_transform(df_np['cleaned_text'].values)

Type of the returned object.

In [None]:
print('The type of final_counts is {}'.format(type(final_counts)))
print('The shape of the matrix is {}'.format(final_counts.shape))

Convert the sparse matrix of Bag of Words model to a numpy array.

In [None]:
review_vector_3k = final_counts.toarray()

Before applying t-SNE, it's necessary that we standardize our data. 

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(review_vector_3k)
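As a quick illustration of what standardization does, the sketch below scales a tiny made-up count matrix and checks that every column ends up with zero mean and unit variance. The `toy_` names are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy count matrix: 3 documents, 2 features on very different scales
toy_counts = np.array([[1.0, 0.0],
                       [3.0, 10.0],
                       [5.0, 20.0]])

toy_scaled = StandardScaler().fit_transform(toy_counts)

# Each column now has mean ~0 and standard deviation ~1
print(toy_scaled.mean(axis=0))
print(toy_scaled.std(axis=0))
```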

Let's see the shape of the standardized data of the BoW representation. It contains 3000 rows for 3000 reviews and 7207 dimensions.

These are some dimensions in the tfidf representation.

In [None]:
features[100:110]

Convert the tfidf matrix to a dense numpy array.

In [None]:
review_vector_3k_TFIDF=tfidf.toarray()
review_vector_3k_TFIDF.shape

Standardize the tfidf data.

In [None]:
X_scaled.shape

Apply t-SNE

In [None]:
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=7, perplexity=45, early_exaggeration = 17, learning_rate = 300, method='exact')
review_vector_3k_tsne = tsne.fit_transform(X_scaled)

As expected, the newly created data array has 2 dimensions and 3000 reviews.

Pickle the tsne object to Google Drive.

Now apply t-SNE

In [None]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=7, perplexity=40, early_exaggeration = 18, learning_rate = 250, method='exact')

review_vector_3k_tsne = tsne.fit_transform(X_scaled)

Let's see the shape of the array after reducing the dimensions with t-SNE. It should be 3000 x 2.

In [None]:
print(review_vector_3k_tsne.shape)
print(review_vector_3k_tsne)
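To check the expected output shape without waiting an hour, a small sketch on random data can help. The data, the sample count, and the perplexity below are assumptions chosen only so that the example runs quickly. Note that perplexity must be smaller than the number of samples.

```python
import numpy as np
from sklearn.manifold import TSNE

# Random toy data: 60 samples with 10 features each
rng = np.random.RandomState(7)
toy_X = rng.normal(size=(60, 10))

# Small perplexity because there are only 60 samples
toy_tsne_model = TSNE(n_components=2, perplexity=10, random_state=7)
toy_embedded = toy_tsne_model.fit_transform(toy_X)

# One 2-dimensional point per input sample
print(toy_embedded.shape)
```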

This model took an hour to run. It's a good idea to store it on disk for later use.

In [None]:
import pickle
pickle_file = open('/gdrive/My Drive/amazon/pickled_3k_reviews_tsne_bow.pkl', 'wb')
pickle.dump(review_vector_3k_tsne, pickle_file)
pickle_file.close()
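The same save-and-load round trip can also be written with `with` blocks, which close the file even if an error occurs. This is a sketch on a toy array written to a temporary directory. The path and the `toy_embedding` name are assumptions, not the Google Drive paths used above.

```python
import os
import pickle
import tempfile

import numpy as np

# Toy array standing in for the t-SNE output
toy_embedding = np.arange(6, dtype=float).reshape(3, 2)

# Temporary location, used here instead of Google Drive
path = os.path.join(tempfile.mkdtemp(), 'toy_tsne.pkl')

# Save: the file is closed automatically when the block ends
with open(path, 'wb') as f:
    pickle.dump(toy_embedding, f)

# Load it back
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(np.array_equal(restored, toy_embedding))
```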

Construct a dataframe to help visualize the t-SNE result.

In [None]:
bow_tsne_3k_array = np.vstack((review_vector_3k_tsne.T, df_np['Score'])).T
df_bow_tsne_3k = pd.DataFrame(bow_tsne_3k_array, columns=['First Dimension', 'Second Dimension', 'Label'])
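Stacking float coordinates and string labels into a single numpy array forces a common dtype, so the coordinate columns will likely no longer be numeric in the resulting dataframe. An alternative sketch, using made-up toy values, builds the dataframe from the coordinates first and then attaches the labels as a separate column.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the t-SNE output and the review labels
toy_tsne = np.array([[0.5, -1.2], [2.0, 0.3], [-0.7, 1.1]])
toy_labels = pd.Series(['negative', 'positive', 'positive'])

# Coordinates stay float64; labels are added afterwards
toy_plot_df = pd.DataFrame(toy_tsne, columns=['First Dimension', 'Second Dimension'])
toy_plot_df['Label'] = toy_labels.values  # .values avoids index alignment issues

print(toy_plot_df.dtypes)
```

Using `.values` matters when the label series carries the original dataframe index rather than 0..n-1, as `df_np['Score']` would.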

The following dataframe contains the new dimensions that were created by t-SNE. The original dimensions are lost.

In [None]:
df_bow_tsne_3k.head()

Display a scatter plot.

In [None]:
df = pd.read_sql_query("select * from reviews where score <> 3;", con)

Let's define a couple of handy functions to clean the data and the stemmer.

In [None]:
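import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# The NLTK data may need a one-time download:
# nltk.download('stopwords'); nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
porter = PorterStemmer()
lemma = WordNetLemmatizer()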

def clean_html(sentence, compiled_regex):
    cleaned_sentence = re.sub(compiled_regex, ' ', sentence)
    return cleaned_sentence

def clean_punctuation(sentence):
    cleaned_sentence = re.sub(r'[?|!|\'|"|#]', r'', sentence)
    cleaned_sentence = re.sub(r'[.|,|)|(|\|/]', r' ', cleaned_sentence)
    return cleaned_sentence
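The HTML pattern used for cleaning is non-greedy (`<.*?>`). The difference from a greedy pattern is easy to see on a made-up sample string (the `sample` text is an assumption for illustration):

```python
import re

sample = 'Great <b>taste</b>, bad price!<br />'

# Greedy: matches from the first '<' to the last '>' and swallows the text in between
greedy = re.sub(r'<.*>', ' ', sample)

# Non-greedy: matches each tag separately and keeps the text between tags
non_greedy = re.sub(r'<.*?>', ' ', sample)

print(repr(greedy))
print(repr(non_greedy))
```

The greedy version loses the word "taste", so the non-greedy form is the safer choice when the goal is to remove tags but keep the review text.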

In [None]:
print(stop_words, end='\n\n-------------------------\n\n')
print('Stemmed form of "Goodness" is: {}'.format(porter.stem('Goodness')))
print('Lemmatized form of "Goodness" is: {}'.format(lemma.lemmatize('Goodness')))

Let's clean the reviews using the functions defined above.

In [None]:
#Code for implementing step-by-step the checks mentioned in the pre-processing phase
# this code takes a while to run as it needs to run on 500k sentences.
i=0
str1=' '
final_string=[]

all_positive_words=[] # store words from +ve reviews here
all_negative_words=[] # store words from -ve reviews here.

s=''

regex_html=re.compile('<.*?>')

for review in df['Text'].values:
    filtered_sentence=[]
    #print(sent);
    review=clean_html(review, regex_html) # remove HTMl tags
    for w in review.split():
        for cleaned_words in clean_punctuation(w).split():
            if cleaned_words.isalpha() and len(cleaned_words) > 2:
                if(cleaned_words.lower() not in stop_words):
                    s = (porter.stem(cleaned_words.lower())).encode('utf8')
                    filtered_sentence.append(s)
                    if (df['Score'].values)[i] == 'positive': 
                        all_positive_words.append(s) #list of all words used to describe positive reviews
                    if(df['Score'].values)[i] == 'negative':
                        all_negative_words.append(s) #list of all words used to describe negative reviews
                else:
                    continue
            else:
                continue 
    #print(filtered_sentence)
    str1 = b" ".join(filtered_sentence) #final string of cleaned words
    #print("***********************************************************************")
    
    final_string.append(str1)
    i+=1

df['cleaned_text']=final_string # add a 'cleaned_text' column holding each review after pre-processing

The column "cleaned_text" contains the cleaned reviews.

In [None]:
tsne = TSNE(n_components=2, random_state=7, perplexity=45, early_exaggeration = 13, learning_rate = 300, method='exact')

review_vector_3k_tsne = tsne.fit_transform(review_vector_3k)

As expected, the resulting matrix has 2 dimensions which are created using t-SNE, from the original vector generated using Doc2Vec.

Generate a dataframe from it.

In [None]:
d2v_tsne_3k_array = np.vstack((review_vector_3k_tsne.T, df_np['Score'])).T
df_d2v_tsne_3k = pd.DataFrame(d2v_tsne_3k_array, columns=['First Dimension', 'Second Dimension', 'Label'])

In [None]:
df_d2v_tsne_3k.head()

In [None]:
# seaborn >= 0.9 renamed the FacetGrid `size` argument to `height`
sbn.FacetGrid(df_d2v_tsne_3k, hue="Label", height=6).map(plt.scatter, 'First Dimension', 'Second Dimension').add_legend()
plt.show()

Not much here, either.

### Word2Vec

In [None]:
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

Here, I'm training the model on our own corpus of food reviews. I'm not using Google's model pre-trained on Google News because several words used in the food reviews aren't present in its vocabulary, and looking up vectors for those words throws exceptions.

In [None]:
# Training a Word2Vec model using our own corpus

list_of_reviews = []

for review in df['cleaned_text'].values:
    list_of_reviews.append(review.decode('utf-8').split())

In [None]:
print(df['cleaned_text'][0])
print('------------------')
print(list_of_reviews[0])


In [None]:
# Note: `size` is the gensim 3.x argument name; gensim >= 4.0 renamed it to `vector_size`
w2v_model = Word2Vec(list_of_reviews, min_count=4, size=50, workers=4)

In [None]:
words = list(w2v_model.wv.vocab)  # gensim >= 4.0: use w2v_model.wv.key_to_index instead
print(len(words))

The following result is fascinating: it lists the words that the model considers most similar to 'smell'.

In [None]:
tfidf_vec = TfidfVectorizer(ngram_range=(1,1))
tfidf = tfidf_vec.fit_transform(df_np['cleaned_text'].values)

In [None]:
tfidf.shape

There are 3000 rows as expected and 7207 dimensions.

In [None]:
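# scikit-learn >= 1.0 provides get_feature_names_out(); wrap it in list() so features.index() keeps working
features = tfidf_vec.get_feature_names()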
len(features)

In [None]:
# seaborn >= 0.9 renamed the FacetGrid `size` argument to `height`
sbn.FacetGrid(df_bow_tsne_3k, hue="Label", height=6).map(plt.scatter, 'First Dimension', 'Second Dimension').add_legend()
plt.show()

The above scatter plot shows no useful structure. The positive and negative reviews overlap almost perfectly.

This suggests that the Bag of Words model isn't really good at distinguishing between the two types of reviews.

### BoW Bigrams 

This is an extension to the previous model. We consider bi-grams here instead of uni-grams.

First, let's create lists of positive and negative words.

In [None]:
all_positive_words = []

for w in df_p['cleaned_text']:
    all_positive_words.extend(w.split())
#all_positive_words

all_negative_words = []

for w in df_n['cleaned_text']:
    all_negative_words.extend(w.split())
#all_negative_words

Most used words in positive and negative reviews. 

In [None]:
freq_dist_positive = nltk.FreqDist(all_positive_words)
print('Most common positive words: {}'.format(freq_dist_positive.most_common(10)))

freq_dist_negative = nltk.FreqDist(all_negative_words)
print('Most common negative words: {}'.format(freq_dist_negative.most_common(10)))

In [None]:
#  Bi-grams
count_vec = CountVectorizer(ngram_range=(1,2))
bigrams = count_vec.fit_transform(df_np['cleaned_text'].values)

With ngram_range=(1,2) the vocabulary contains both unigrams and bigrams, so it is far larger than the unigram-only vocabulary.
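The growth in vocabulary size can be seen even on a tiny made-up corpus. The two toy sentences below are assumptions for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_reviews = ["great taste great price", "bad taste bad price"]

# Unigrams only
unigram_vocab = CountVectorizer(ngram_range=(1, 1)).fit(toy_reviews).vocabulary_

# Unigrams and bigrams together
uni_bi_vocab = CountVectorizer(ngram_range=(1, 2)).fit(toy_reviews).vocabulary_

print('unigram features:          ', len(unigram_vocab))
print('unigram + bigram features: ', len(uni_bi_vocab))
```

On real reviews the gap is much wider, which is why the dense bigram matrix is so demanding on memory.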

In [None]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=7, perplexity=40, early_exaggeration = 18, learning_rate = 250, method='exact')

In [None]:
df[df['HelpfulnessNumerator'] > df['HelpfulnessDenominator']]

Let's drop these data points using the drop() method given by pandas.

In [None]:
df.drop(df[df['HelpfulnessNumerator'] > df['HelpfulnessDenominator']].index.tolist(), axis=0, inplace=True)

Verify whether the rows have been dropped.

In [None]:
df [df['HelpfulnessNumerator'] > df['HelpfulnessDenominator']]

Now, let's drop the reviews which have the same data for the attributes:
* UserId
* ProfileName
* Time
* Text

These are reviews of the same product, duplicated in the dataset because Amazon treats slight variations of a product as different products. For example, a food item in red packaging would be listed separately from the same item in green packaging. We need to drop these.

In [None]:
import pickle
pickle_file = open('/gdrive/My Drive/amazon/avg_w2v_nparray_3kcorpus_50dim.pkl', 'wb')
pickle.dump(corpus_vec, pickle_file)
pickle_file.close()

In [None]:
corpus_vec.shape

Apply t-SNE

In [None]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=7, perplexity=45, early_exaggeration = 13, learning_rate = 250)

review_vector_3k_tsne = tsne.fit_transform(corpus_vec)

Build a dataframe for visualization

In [None]:
avg_w2v_tsne_3k_array = np.vstack((review_vector_3k_tsne.T, df_np['Score'])).T
df_avg_w2v_tsne_3k = pd.DataFrame(avg_w2v_tsne_3k_array, columns=['First Dimension', 'Second Dimension', 'Label'])

In [None]:
# seaborn >= 0.9 renamed the FacetGrid `size` argument to `height`
sbn.FacetGrid(df_avg_w2v_tsne_3k, hue="Label", height=6).map(plt.scatter, 'First Dimension', 'Second Dimension').add_legend()
plt.show()

The above result is far from ideal, but it's slightly better than the techniques seen so far. The density of positive points is higher on the right side of the map than on the left side.

### Tfidf weighted W2V

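In this representation, each review is the tfidf-weighted average of the W2V vectors of its words: every word vector is multiplied by the word's tfidf value, and the sum is divided by the total tfidf weight of the review.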

In [None]:
corpus_tfidf_weighted_w2v = np.zeros(shape=(50))
review_number = 0

for review in df_np['cleaned_text'].values:
    review_vector_tfidf_weighted = np.zeros(shape=(50))
    tfidf_sum = 0
    for word in review.decode('utf-8').split():
        try:
            tfidf_value = review_vector_3k_TFIDF[review_number, features.index(word)]
            review_vector_tfidf_weighted += w2v_model.wv[word] * tfidf_value
            tfidf_sum += tfidf_value
            
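        except (KeyError, ValueError):
            # KeyError: word is not in the W2V vocabulary
            # ValueError: word is not in the tfidf feature list
            continue
    
    review_number += 1
    
    # Normalize by the total tfidf weight; skip reviews with no matched words to avoid dividing by zero
    if tfidf_sum != 0:
        review_vector_tfidf_weighted /= tfidf_sum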
    corpus_tfidf_weighted_w2v=np.vstack((corpus_tfidf_weighted_w2v,review_vector_tfidf_weighted))
    
corpus_tfidf_weighted_w2v = np.delete(corpus_tfidf_weighted_w2v, 0, axis=0)
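The weighting scheme can be illustrated in isolation with a small sketch. The 2-dimensional vectors, the tfidf weights, and the `weighted_average` helper below are all hypothetical stand-ins for `w2v_model.wv` and the tfidf matrix, not the notebook's actual values.

```python
import numpy as np

# Hypothetical word vectors and tfidf weights for one review
toy_vectors = {'good': np.array([1.0, 0.0]), 'tea': np.array([0.0, 1.0])}
toy_weights = {'good': 0.25, 'tea': 0.75}

def weighted_average(tokens, vectors, weights, dim):
    """Tfidf-weighted average of the vectors of the known tokens."""
    total = np.zeros(dim)
    weight_sum = 0.0
    for token in tokens:
        # Skip tokens missing from either lookup
        if token in vectors and token in weights:
            total += weights[token] * vectors[token]
            weight_sum += weights[token]
    # Return the zero vector when nothing matched
    return total / weight_sum if weight_sum else total

weighted = weighted_average(['good', 'tea', 'unknownword'], toy_vectors, toy_weights, dim=2)
print(weighted)
```

Words with a larger tfidf weight pull the review vector more strongly towards their own word vector.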

Shape of the data matrix

In [None]:
import pickle
pickle_file = open('/gdrive/My Drive/amazon/pickled_3k_reviews_tsne_tfidf_1gram.pkl', 'wb')
pickle.dump(review_vector_3k_tsne, pickle_file)
pickle_file.close()

Construct dataframe

In [None]:
tfidf_tsne_3k_array = np.vstack((review_vector_3k_tsne.T, df_np['Score'])).T
df_tfidf_tsne_3k = pd.DataFrame(tfidf_tsne_3k_array, columns=['First Dimension', 'Second Dimension', 'Label'])

In [None]:
df_tfidf_tsne_3k.head()

Plot the result of applying t-SNE on tfidf.

In [None]:
# seaborn >= 0.9 renamed the FacetGrid `size` argument to `height`
sbn.FacetGrid(df_tfidf_tsne_3k, hue="Label", height=6).map(plt.scatter, 'First Dimension', 'Second Dimension').add_legend()
plt.show()

The TFIDF model too didn't prove to be of much use here. Let's look at a different technique.

### Doc2Vec

In [None]:
# Install gensim if not installed

!pip install gensim

Import gensim

In [None]:
import gensim
from gensim.models import word2vec
from gensim.models import KeyedVectors

Define a function that reads each review and converts it into TaggedDocument format needed for training a model using Doc2Vec technique.

In [None]:
corpus_tfidf_weighted_w2v.shape

Apply t-SNE 

In [None]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=13, perplexity=50, early_exaggeration = 14, learning_rate = 225)

review_vector_3k_tsne = tsne.fit_transform(corpus_tfidf_weighted_w2v)

The dimensions of the array reduced to 2 as expected.

Build a dataframe.

In [None]:
w2v_model.wv.most_similar('smell')

Similarly, words similar to 'bad'

In [None]:
w2v_model.wv.most_similar('bad')

Calculating the avg W2V representation for each review.

In [None]:
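corpus_vec = np.zeros(shape=(50))

for review in df_np['cleaned_text'].values:
    review_vector = np.zeros(shape=(50))
    word_count = 0  # number of words found in the W2V vocabulary
    for word in review.decode('utf-8').split():
        try:
            review_vector += w2v_model.wv[word]
            word_count += 1
        except KeyError:
            # word was dropped by min_count, so it has no vector
            continue
    
    # Divide by the number of words (not the vector size) to get the average
    if word_count != 0:
        review_vector /= word_count
    corpus_vec=np.vstack((corpus_vec,review_vector))    
# Remove the placeholder row of zeros used to start the stack
corpus_vec=np.delete(corpus_vec, 0, axis=0)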
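An average W2V vector is the sum of the word vectors divided by the number of words that were actually found in the vocabulary. A compact sketch of that idea, using a hypothetical dictionary of 3-dimensional vectors in place of `w2v_model.wv` and a made-up `average_vector` helper:

```python
import numpy as np

# Hypothetical word vectors standing in for w2v_model.wv
toy_vectors = {
    'good': np.array([1.0, 0.0, 0.0]),
    'tea': np.array([0.0, 1.0, 0.0]),
}

def average_vector(tokens, vectors, dim):
    """Average the vectors of the tokens that are in the vocabulary."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # No known words: fall back to the zero vector
        return np.zeros(dim)
    return np.mean(known, axis=0)

avg = average_vector(['good', 'tea', 'unknownword'], toy_vectors, dim=3)
print(avg)
```

Checking membership up front avoids the try/except, and collecting per-review vectors in a list before a single `np.vstack` is usually faster than stacking inside the loop.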

Again, store the files on Google Drive for future use.

In [None]:
df.drop_duplicates(subset=["UserId","ProfileName","Time","Text"], keep = 'first', inplace=True)
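How `drop_duplicates` behaves with a column subset can be seen on a small made-up dataframe. The rows below are assumptions for illustration, and `ProductId` is included only to show a column that differs between the duplicated rows.

```python
import pandas as pd

# Toy rows: the first two are the same review attached to two product variants
toy = pd.DataFrame({
    'UserId': ['u1', 'u1', 'u2'],
    'ProfileName': ['Ann', 'Ann', 'Bob'],
    'Time': [100, 100, 200],
    'Text': ['Nice tea', 'Nice tea', 'Too sweet'],
    'ProductId': ['red', 'green', 'red'],
})

# Rows count as duplicates when all four listed columns match
deduped = toy.drop_duplicates(subset=['UserId', 'ProfileName', 'Time', 'Text'], keep='first')
print(deduped)
```

With `keep='first'`, the first occurrence of each duplicated review survives and later copies are dropped.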

Dropping the duplicates significantly reduces the size of our dataframe.

In [None]:
df.shape

### Text Cleaning

Data collected from the web often contains unrendered HTML tags. Let's see if our reviews have them. We are only interested in the 'Text' column, because this is the column that will help us decide if a review is positive or negative.

In [None]:
#train_corpus = gensim.utils.simple_preprocess(df_np['cleaned_text'][1])

def read_corpus(df_np, tokens_only=False):
        for i, review in enumerate(df_np['cleaned_text']):
            if tokens_only:
                yield gensim.utils.simple_preprocess(review)
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(review), [i])

train_corpus = list(read_corpus(df_np))

In [None]:
train_corpus[:3]

Define D2V model

In [None]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=45)

Build vocabulary

In [None]:
model.build_vocab(train_corpus)

In [None]:
model

Train D2V model

In [None]:
%time model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

Create a corpus of vector representation of all reviews under consideration, using D2V technique.

In [None]:
review_vector_3k = np.zeros((1,50))


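for review in df_np['cleaned_text']:
    # cleaned_text may hold bytes; decode so that str() does not produce "b'...'" tokens
    text = review.decode('utf-8') if isinstance(review, bytes) else str(review)
    arr = np.reshape(model.infer_vector(text.split()), (1,-1))
    review_vector_3k=np.vstack([review_vector_3k, arr])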

#model.infer_vector(str(df_np['cleaned_text'][1636]).split())
#df_np['cleaned_text'][1637]
#df_np['cleaned_text']
#np.reshape(model.infer_vector(str(df_np['cleaned_text'][2]).split()), (1,-1)).shape

review_vector_3k = np.delete(review_vector_3k, (0), axis=0)

The resulting matrix has 3000 rows, one per review, and 50 columns, as set by the vector_size parameter when we defined the model above.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

review_vector_3k = scaler.fit_transform(review_vector_3k)

Define t-SNE and transform data

In [None]:
import re

i=0

for review in df['Text'].values:
    if (len(re.findall('<.*>', review))):
            i+=1
            r=review

How many reviews contain HTML tags?

In [None]:
print('Number of reviews which contain HTML tags: {}'.format(i), end='\n\n------------------\n\n')
print('Sample review containing HTML tags: {}'.format(r))

Let's import some text processing libraries we need. 

In [None]:
avg_tfidf_weighted_w2v_tsne_3k_array = np.vstack((review_vector_3k_tsne.T, df_np['Score'])).T
df_avg_tfidf_weighted_w2v_tsne_3k = pd.DataFrame(avg_tfidf_weighted_w2v_tsne_3k_array, columns=['First Dimension', 'Second Dimension', 'Label'])

The tfidf weighted W2V plot is not very different from the plot obtained using the avg. W2V representation. 

TFIDF values of words in a review