In [1]:
## pip install gensim
## pip install python-Levenshtein

## Reading and Exploring the Dataset

The dataset used here is a subset of Amazon reviews from the Cell Phones & Accessories category. The data is stored as JSON with one review per line, so it can be read with pandas using `lines=True`.

Link to the Dataset: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz
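
The file at this link is gzip-compressed, while the cell below reads an uncompressed `.json` file. As a minimal sketch, assuming the file holds one JSON object per line (which is what `lines=True` implies), the compressed form can be parsed with the standard library alone. The sample file and its two records are made up so that the snippet runs without the real download, and `read_json_lines_gz` is a hypothetical helper name.

```python
import gzip
import json
import os
import tempfile

def read_json_lines_gz(path):
    """Return a list of dicts from a gzip-compressed JSON Lines file."""
    records = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records

# Hypothetical two-record sample standing in for the real download.
sample = [
    {"asin": "120401325X", "overall": 4, "reviewText": "Looks good"},
    {"asin": "120401325X", "overall": 5, "reviewText": "Really great product."},
]
sample_path = os.path.join(tempfile.mkdtemp(), "sample_reviews.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    for record in sample:
        handle.write(json.dumps(record) + "\n")

records = read_json_lines_gz(sample_path)
print(len(records), records[0]["overall"])
```

A list of dicts like this can be passed to `pd.DataFrame(records)`. pandas should also be able to read the `.json.gz` file directly, since `read_json` infers compression from the file extension by default.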

In [2]:
import pandas as pd
import numpy as np
import gensim

In [3]:
df = pd.read_json('reviews_Cell_Phones_and_Accessories_5.json', lines=True)

In [4]:
len(df)

194439

In [5]:
df.head()

|   | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A30TL5EWN6DFXT | 120401325X | christina | [0, 0] | They look good and stick good! I just don't li... | 4 | Looks Good | 1400630400 | 05 21, 2014 |
| 1 | ASY55RVNIL0UD | 120401325X | emily l. | [0, 0] | These stickers work like the review says they ... | 5 | Really great product. | 1389657600 | 01 14, 2014 |
| 2 | A2TMXE2AFO7ONB | 120401325X | Erica | [0, 0] | These are awesome and make my phone look so st... | 5 | LOVE LOVE LOVE | 1403740800 | 06 26, 2014 |
| 3 | AWJ0WZQYMYFQ4 | 120401325X | JM | [4, 4] | Item arrived in great time and was in perfect ... | 4 | Cute! | 1382313600 | 10 21, 2013 |
| 4 | ATX7CZYFXI1KW | 120401325X | patrice m rogoza | [2, 3] | awesome! stays on, and looks great. can be use... | 5 | leopard home button sticker for iphone 4s | 1359849600 | 02 3, 2013 |


In [6]:
df.columns

Index(['reviewerID', 'asin', 'reviewerName', 'helpful', 'reviewText',
       'overall', 'summary', 'unixReviewTime', 'reviewTime'],
      dtype='object')

In [7]:
df.iloc[0].values.tolist()

['A30TL5EWN6DFXT',
 '120401325X',
 'christina',
 [0, 0],
 "They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again",
 4,
 'Looks Good',
 1400630400,
 '05 21, 2014']

In [8]:
df.shape

(194439, 9)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194439 entries, 0 to 194438
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   reviewerID      194439 non-null  object
 1   asin            194439 non-null  object
 2   reviewerName    190920 non-null  object
 3   helpful         194439 non-null  object
 4   reviewText      194439 non-null  object
 5   overall         194439 non-null  int64 
 6   summary         194439 non-null  object
 7   unixReviewTime  194439 non-null  int64 
 8   reviewTime      194439 non-null  object
dtypes: int64(2), object(7)
memory usage: 13.4+ MB


In [10]:
df.reviewText[1]

'These stickers work like the review says they do. They stick on great and they stay on the phone. They are super stylish and I can share them with my sister. :)'

## Simple Preprocessing & Tokenization

The first step in most NLP tasks is cleaning the text: converting words to lower case, trimming whitespace, and removing punctuation. We do that here with `gensim.utils.simple_preprocess`, which lowercases each review, strips punctuation, and returns a list of tokens.

We could also remove stop words such as 'and', 'or', 'is', 'the', 'a', and 'an', and reduce words to their root forms (for example, 'running' to 'run'). `simple_preprocess` does neither, so stop words remain in the tokens shown below.
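
To make the optional steps concrete, here is a minimal sketch of stop-word removal and root-form reduction using only the standard library. The stop-word set is just the six examples from the text, and `toy_stem` is a deliberately crude, hypothetical rule (it only handles a trailing "ing"); a real pipeline would use a proper stemmer or lemmatizer from an NLP library.

```python
import re

# Illustrative stop-word list taken from the examples in the text;
# real lists are much longer.
STOP_WORDS = {"and", "or", "is", "the", "a", "an"}

def toy_stem(word):
    """Toy rule: strip a trailing 'ing' and undo a doubled final consonant."""
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
            stem = stem[:-1]  # 'runn' -> 'run'
        return stem
    return word

def clean(text):
    """Lowercase, keep alphabetic tokens, drop stop words, apply the toy stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]

cleaned = clean("The phone is running hot and the case is bumping it")
print(cleaned)
```

The toy rule will mangle many words (it turns "falling" into "fal", for instance), which is one reason these steps are usually delegated to a library.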

In [11]:
review_text = df.reviewText.apply(gensim.utils.simple_preprocess)

In [12]:
review_text

0         [they, look, good, and, stick, good, just, don...
1         [these, stickers, work, like, the, review, say...
2         [these, are, awesome, and, make, my, phone, lo...
3         [item, arrived, in, great, time, and, was, in,...
4         [awesome, stays, on, and, looks, great, can, b...
                                ...                        
194434    [works, great, just, like, my, original, one, ...
194435    [great, product, great, packaging, high, quali...
194436    [this, is, great, cable, just, as, good, as, t...
194437    [really, like, it, becasue, it, works, well, w...
194438    [product, as, described, have, wasted, lot, of...
Name: reviewText, Length: 194439, dtype: object

In [13]:
review_text[0]

['they',
 'look',
 'good',
 'and',
 'stick',
 'good',
 'just',
 'don',
 'like',
 'the',
 'rounded',
 'shape',
 'because',
 'was',
 'always',
 'bumping',
 'it',
 'and',
 'siri',
 'kept',
 'popping',
 'up',
 'and',
 'it',
 'was',
 'irritating',
 'just',
 'won',
 'buy',
 'product',
 'like',
 'this',
 'again']
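
The output above shows "don't" reduced to `don` and the one-letter word "I" missing. A rough way to see why, assuming the default behavior of `simple_preprocess` (lowercasing, splitting on non-alphabetic characters, and keeping tokens between 2 and 15 characters long), is the following standard-library approximation. `approx_simple_preprocess` is a hypothetical stand-in, not gensim's actual code.

```python
import re

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in for gensim.utils.simple_preprocess (not its actual code)."""
    # Runs of letters only: digits, punctuation and apostrophes act as separators,
    # so "don't" splits into "don" and "t".
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # Very short and very long tokens are dropped, which removes "i" and "t".
    return [t for t in tokens if min_len <= len(t) <= max_len]

tokens = approx_simple_preprocess("I just don't like the rounded shape")
print(tokens)
```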

## Training the Word2Vec Model

Train a Word2Vec model on the tokenized reviews. Use a window of size 10, i.e. up to 10 words before and 10 words after the current word. Set `min_count=2` so that words appearing fewer than 2 times in the whole corpus are ignored; `min_count` filters rare words, not short sentences.

`workers` sets how many CPU threads are used for training.
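
The two parameters can be illustrated on a toy corpus. This is a simplified picture under stated assumptions, not gensim's implementation (actual training includes further steps such as subsampling of frequent words); `filter_rare_words` and `context_words` are hypothetical helpers written for this sketch.

```python
from collections import Counter

def filter_rare_words(sentences, min_count):
    """Drop every word whose total corpus frequency is below min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [[w for w in s if counts[w] >= min_count] for s in sentences]

def context_words(sentence, position, window):
    """Return up to `window` words on each side of the word at `position`."""
    start = max(0, position - window)
    return sentence[start:position] + sentence[position + 1:position + 1 + window]

toy_corpus = [
    ["great", "phone", "case"],
    ["great", "case", "fits", "well"],
    ["phone", "charger"],
]

# 'fits', 'well' and 'charger' occur only once, so min_count=2 removes them.
filtered = filter_rare_words(toy_corpus, min_count=2)
print(filtered)

# With window=1, the context of 'case' in the second sentence is its two neighbors.
neighbors = context_words(toy_corpus[1], position=1, window=1)
print(neighbors)
```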

### Initialize the model

In [14]:
model = gensim.models.Word2Vec(
    window=10,
    min_count=2,
    workers=4,
)

### Build Vocabulary

In [15]:
model.build_vocab(review_text , progress_per=1000)

In [16]:
model.epochs

5

### Train the Word2Vec Model

The `epochs=10` argument below overrides the model's default of 5 shown above.

In [17]:
model.train(review_text, total_examples=len(review_text), epochs=10)

(123009576, 167737950)
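
The pair returned by `train` is easier to read with a little arithmetic. As far as I understand gensim's behavior, the first number is the count of words actually used in training (after rare words are dropped and frequent words are subsampled) and the second is the raw word count summed over all epochs; treat that reading as an assumption. The values are copied from the output above.

```python
# Values copied from the train() output above.
trained_words, raw_words = 123009576, 167737950
epochs = 10

# Raw tokens seen in a single pass over the corpus.
raw_per_epoch = raw_words / epochs

# Share of raw tokens that contributed to training.
kept = trained_words / raw_words

print(f"{raw_per_epoch:,.0f} raw words per epoch")
print(f"{kept:.1%} of raw words used in training")
```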

## Save the Model
Save the model so that it can be reused in other applications.

In [18]:
model.save("./word2vec-amazon-cell-accessories-reviews-short.model")

## Finding Similar Words and Similarity Between Words
Reference: https://radimrehurek.com/gensim/models/word2vec.html

In [19]:
model.wv.most_similar("sad")

[('upset', 0.7495953440666199),
 ('mad', 0.7191388607025146),
 ('bummed', 0.6880806684494019),
 ('dissapointed', 0.6502204537391663),
 ('disapointed', 0.625237226486206),
 ('unhappy', 0.6064099073410034),
 ('disappointing', 0.604128897190094),
 ('disappointed', 0.5910249948501587),
 ('excited', 0.5799297094345093),
 ('scared', 0.5773676037788391)]

In [20]:
model.wv.similarity(w1="cheap", w2="inexpensive")

0.54094934

In [21]:
model.wv.similarity(w1="great", w2="good")


0.78667766

In [22]:
model.wv.similarity(w1="dirty",w2="smelly")

0.17979848

In [23]:
model.wv.similarity(w1="dirty",w2="dirty")

1.0
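
The scores above are cosine similarities between word vectors, which is why a word compared with itself scores 1.0. A minimal sketch of the computation and of a most-similar ranking follows; the three-dimensional vectors are invented for illustration (real Word2Vec vectors have many more dimensions), and both functions are hypothetical helpers rather than gensim's code.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented toy vectors, not taken from the trained model.
toy_vectors = {
    "sad": [1.0, 0.2, 0.0],
    "upset": [0.9, 0.3, 0.1],
    "charger": [0.0, 0.1, 1.0],
}

self_score = cosine_similarity(toy_vectors["sad"], toy_vectors["sad"])
ranking = most_similar("sad", toy_vectors)
print(self_score)
print(ranking)
```

Because the score depends only on how similar the contexts of two words are, words with opposite meanings can still rank highly, as 'excited' does for 'sad' in the output above.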

## Further Reading
You can read more about gensim's Word2Vec implementation at https://radimrehurek.com/gensim/models/word2vec.html

Explore other Datasets related to Amazon Reviews: http://jmcauley.ucsd.edu/data/amazon/