In [1]:
# !pip install gensim

In [1]:
import gensim
import pandas as pd

<h2 style="text-align: left;">Reading and Exploring the Dataset</h2>

The dataset used here is a subset of Amazon reviews from the Clothing, Shoes and Jewelry category. The data is stored in JSON Lines format (one JSON object per line), so it can be read with pandas by passing `lines=True`.

Explore all the Amazon review datasets at https://jmcauley.ucsd.edu/data/amazon/index_2014.html

In [9]:
df = pd.read_json("reviews_Clothing_Shoes_and_Jewelry_5.json", lines=True)
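
The `lines=True` argument tells pandas that the file holds one JSON object per line rather than a single JSON array. As a rough illustration of that layout, here is a sketch using only the standard library and two made-up records (the field names mirror the dataset's columns, but the values are not rows from the real file):

```python
import io
import json

# Two made-up records in JSON Lines form: one JSON object per line.
# The values are illustrative only, not taken from the real dataset.
sample = io.StringIO(
    '{"reviewerID": "R1", "overall": 5, "reviewText": "Great tutu"}\n'
    '{"reviewerID": "R2", "overall": 4, "reviewText": "Very cute"}\n'
)

# Parse each non-empty line independently, as line-delimited JSON requires.
records = [json.loads(line) for line in sample if line.strip()]

print(len(records))
print(records[0]["reviewText"])
```

Because every line is a complete JSON document, such files can also be processed one review at a time when they are too large to load at once.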

In [36]:
df.head(2)

| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A1KLRMWW2FWPL4 | 31887 | Amazon Customer "cameramom" | [0, 0] | This is a great tutu and at a really great pri... | 5 | Great tutu- not cheaply made | 1297468800 | 02 12, 2011 |
| 1 | A2G5TCU2WDFZ65 | 31887 | Amazon Customer | [0, 0] | I bought this for my 4 yr old daughter for dan... | 5 | Very Cute!! | 1358553600 | 01 19, 2013 |


In [11]:
df.shape

(278677, 9)

<h2 style="text-align: left;">Simple Preprocessing & Tokenization</h2>

The first step in most data science tasks is to clean the data.
For NLP, this typically means converting all words to lower case, trimming whitespace, and removing punctuation. gensim's `simple_preprocess` does this here and also splits each review into tokens.

Two further common steps are removing stop words such as 'and', 'or', 'is', 'the', 'a', 'an' and reducing words to their root forms, for example 'running' to 'run'. `simple_preprocess` does neither, so words like 'is' and 'and' remain in the tokens shown below.
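
Stop-word removal is a separate step from tokenization. A minimal sketch of it, assuming a tiny hand-written stop list (`STOP_WORDS` and `remove_stop_words` are illustrative names, not part of gensim):

```python
# A deliberately small, hand-picked stop list for illustration only;
# real stop lists are much longer.
STOP_WORDS = frozenset({"and", "or", "is", "the", "a", "an"})


def remove_stop_words(tokens):
    """Return the tokens that are not in STOP_WORDS, preserving order."""
    return [token for token in tokens if token not in STOP_WORDS]


tokens = ["this", "is", "a", "great", "tutu", "and", "the", "price", "is", "low"]
filtered = remove_stop_words(tokens)
print(filtered)  # ['this', 'great', 'tutu', 'price', 'low']
```

Whether to drop stop words before training word vectors is a judgment call; this notebook keeps them.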

In [12]:
df.reviewText[0]

"This is a great tutu and at a really great price. It doesn't look cheap at all. I'm so glad I looked on Amazon and found such an affordable tutu that isn't made poorly. A++"

In [14]:
gensim.utils.simple_preprocess("This is a great tutu and at a really great price.")

['this', 'is', 'great', 'tutu', 'and', 'at', 'really', 'great', 'price']
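
The output above shows lowercasing, punctuation removal, and the loss of the one-letter token 'a'. A rough approximation of that behaviour for plain ASCII English text, assuming the default token-length limits of 2 and 15 characters (`approx_simple_preprocess` is a hypothetical helper, not gensim's implementation):

```python
import re


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Lowercase the text, keep alphabetic runs, drop very short or very long tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


print(approx_simple_preprocess("This is a great tutu and at a really great price."))
# ['this', 'is', 'great', 'tutu', 'and', 'at', 'really', 'great', 'price']
```

The same length filter explains why contractions such as "doesn't" end up as 'doesn': the apostrophe splits the word and the leftover 't' is too short to keep.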

In [15]:
review_text = df.reviewText.apply(gensim.utils.simple_preprocess)

In [18]:
review_text.loc[0]

['this',
 'is',
 'great',
 'tutu',
 'and',
 'at',
 'really',
 'great',
 'price',
 'it',
 'doesn',
 'look',
 'cheap',
 'at',
 'all',
 'so',
 'glad',
 'looked',
 'on',
 'amazon',
 'and',
 'found',
 'such',
 'an',
 'affordable',
 'tutu',
 'that',
 'isn',
 'made',
 'poorly']

In [19]:
df.reviewText.loc[0]

"This is a great tutu and at a really great price. It doesn't look cheap at all. I'm so glad I looked on Amazon and found such an affordable tutu that isn't made poorly. A++"

<h2 style="text-align: left;">Training the Word2Vec Model</h2>


Train the model on the reviews. Use a window of size 5, i.e. up to 5 words before the current word and 5 words after it. The `min_count` parameter sets the minimum number of times a word must appear in the corpus to be kept in the vocabulary; with `min_count=2`, words that occur only once are ignored.

`workers` defines how many CPU threads are used for training.
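
To make the `window` setting concrete, the sketch below lists the context words that fall within a given distance of each target word in one tokenized sentence. It is a plain-Python illustration (`context_pairs` is a hypothetical helper); gensim's internal training loop differs in detail, for example it may shrink the window randomly per word.

```python
def context_pairs(tokens, window):
    """Return (target, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


sentence = ["great", "tutu", "at", "really", "great", "price"]
pairs = context_pairs(sentence, window=2)

# Context words seen for the target "at" with a window of 2.
print([context for target, context in pairs if target == "at"])
# ['great', 'tutu', 'really', 'great']
```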

#### Initialize the model

In [20]:
model = gensim.models.Word2Vec(
    window=5,
    min_count=2,
    workers=8
)

#### Build Vocabulary

Building the vocabulary means collecting the unique words from our processed `review_text`.

In [21]:
model.build_vocab(review_text, progress_per=500)
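
Conceptually, the vocabulary is a frequency table of tokens over the whole corpus, with rare words dropped according to `min_count`. A minimal sketch of that idea on a toy corpus (`build_toy_vocab` is a hypothetical helper; gensim's `build_vocab` does further bookkeeping that is not shown here):

```python
from collections import Counter


def build_toy_vocab(sentences, min_count):
    """Count every token in the corpus and keep those seen at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


# A made-up corpus of three tokenized "reviews".
toy_corpus = [
    ["great", "tutu", "great", "price"],
    ["cute", "tutu"],
    ["great", "fit"],
]

vocab = build_toy_vocab(toy_corpus, min_count=2)
print(sorted(vocab.items()))  # [('great', 3), ('tutu', 2)]
```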

#### Train the Word2Vec Model

In [23]:
model.corpus_count  # total number of examples (reviews) counted while building the vocabulary

278677

In [25]:
model.train(review_text, total_examples=model.corpus_count, epochs=10)

(113567860, 154809840)

<h2 style="text-align: left;">Save the Model</h2>


Save the model so that it can be reused in other applications.

In [26]:
model.save("./word2vec-amazon-clothing-shoes-and-jewelry.model")

<h2 style="text-align: left;">Finding Similar Words and Similarity Between Words</h2>

https://radimrehurek.com/gensim/models/word2vec.html

In [27]:
model.wv.most_similar("bad")

[('terrible', 0.687960147857666),
 ('horrible', 0.6489638090133667),
 ('shabby', 0.5613303184509277),
 ('strange', 0.5533720850944519),
 ('funny', 0.5530536770820618),
 ('ridiculous', 0.5439699292182922),
 ('poor', 0.5360315442085266),
 ('weird', 0.5307701230049133),
 ('cheap', 0.5304936766624451),
 ('disappointing', 0.521852433681488)]

In [34]:
model.wv.similarity("cheap", "inexpensive")

0.5360757

In [35]:
model.wv.similarity("great", "good")

0.88100845
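
The scores above are cosine similarities between word vectors. A small sketch of the computation with made-up 3-dimensional vectors (gensim's Word2Vec vectors have 100 dimensions by default, and the numbers below are not taken from the trained model; `cosine_similarity` and `rank_by_similarity` are illustrative helpers):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(word, vectors):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Made-up toy vectors, for illustration only.
toy_vectors = {
    "great": [0.9, 0.1, 0.3],
    "good": [0.8, 0.2, 0.4],
    "poor": [-0.7, 0.5, 0.1],
}

ranked = rank_by_similarity("great", toy_vectors)
print(ranked)
```

Cosine similarity ranges from -1 to 1, so a score near 1, like the one for "great" and "good", means the two vectors point in almost the same direction.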

<h2 style="text-align: left;">Further Reading</h2>


Read more about gensim's Word2Vec at https://radimrehurek.com/gensim/models/word2vec.html