**Word2Vec Theory**

https://youtu.be/hQwFeIupNP0?feature=shared

In [3]:
#pip install gensim
#pip install python-Levenshtein

In [4]:
import gensim
import pandas as pd

**Reading and Exploring the Dataset**

The dataset we are using here is a subset of Amazon reviews from the Cell Phones & Accessories category. The data is stored as a JSON file and can be read using pandas.

Link to the Dataset: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz
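
The linked download is a gzip-compressed file with one JSON record per line. pandas can usually read such a file directly through the `compression` argument of `read_json`, but as a minimal sketch of what is happening underneath, the standard library is enough. The helper name `read_json_lines_gz` and the two sample rows are made up for illustration; they are not part of the dataset.

```python
import gzip
import json
import os
import tempfile


def read_json_lines_gz(path):
    """Return a list of dicts, one per non-empty line of a gzipped JSON Lines file."""
    records = []
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Tiny stand-in file (made-up rows) so the sketch runs without the real download.
sample_rows = [
    {"reviewText": "Great case", "overall": 5},
    {"reviewText": "Stopped working", "overall": 1},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_reviews.json.gz")
    with gzip.open(sample_path, mode="wt", encoding="utf-8") as handle:
        for row in sample_rows:
            handle.write(json.dumps(row) + "\n")
    records = read_json_lines_gz(sample_path)

print(len(records), records[0]["reviewText"])
```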

In [10]:
df = pd.read_json("Cell_Phones_and_Accessories_5.json", lines=True)
df.sample(5)

|  | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 137539 | AEYNJAOO0AJ50 | B00A6WWPAK | PLD | [0, 0] | Gret phone case. I bought it for a friend and ... | 5 | Phone case | 1386547200 | 12 9, 2013 |
| 78560 | A1ODEH9SPDZCMP | B007C8XRJY | Dave | [0, 0] | Great fit and performance. Manufacturer is ve... | 5 | Excellent | 1372809600 | 07 3, 2013 |
| 190390 | A1NR108NFAYBWH | B00IGISO9C | Daniel Manzo | [1, 3] | A great deal and the screen protector firsts p... | 5 | Great Fit | 1397865600 | 04 19, 2014 |
| 43808 | A2SWBDSALQTPYE | B004Z274UI | THE TRUTH | [0, 0] | This case sucks. Cheap and once you try to tak... | 1 | 0 stars | 1329350400 | 02 16, 2012 |
| 123383 | A2IXTJNMOSLZ81 | B009CRH0L4 | T. J. Brown | [0, 0] | Liked how it looked wished it as worked well. ... | 1 | Ashamed to say its Samsung | 1393200000 | 02 24, 2014 |


In [11]:
df.shape

(194439, 9)

**Simple Preprocessing & Tokenization**

The first step in most NLP tasks is to clean the text. Typical preprocessing includes converting all words to lowercase, trimming whitespace, and removing punctuation. We do the same here.

Additionally, we can remove stop words such as 'and', 'or', 'is', 'the', 'a', 'an' and convert words to their root forms, for example 'running' to 'run'. Note that gensim's simple_preprocess, used below, does neither of these.
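
As a rough illustration of those two optional steps, the sketch below filters a hand-written stop-word set and maps a few words to their root forms with a lookup table. Both `STOP_WORDS` and `ROOT_FORMS` are tiny, hypothetical stand-ins; a real pipeline would normally rely on an NLP library's stop-word list and stemmer or lemmatizer.

```python
# Hypothetical, deliberately tiny resources for illustration only.
STOP_WORDS = {"and", "or", "is", "the", "a", "an"}
ROOT_FORMS = {"running": "run", "looks": "look", "cases": "case"}


def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word set."""
    return [token for token in tokens if token not in STOP_WORDS]


def to_root_forms(tokens):
    """Replace a token with its root form when the lookup table knows it."""
    return [ROOT_FORMS.get(token, token) for token in tokens]


tokens = ["the", "case", "looks", "good", "and", "is", "running", "fine"]
cleaned = to_root_forms(remove_stop_words(tokens))
print(cleaned)  # ['case', 'look', 'good', 'run', 'fine']
```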

In [13]:
df.reviewText[0]

"They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again"

In [14]:
gensim.utils.simple_preprocess("They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again")

['they',
 'look',
 'good',
 'and',
 'stick',
 'good',
 'just',
 'don',
 'like',
 'the',
 'rounded',
 'shape',
 'because',
 'was',
 'always',
 'bumping',
 'it',
 'and',
 'siri',
 'kept',
 'popping',
 'up',
 'and',
 'it',
 'was',
 'irritating',
 'just',
 'won',
 'buy',
 'product',
 'like',
 'this',
 'again']
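
Notice that "I", "a", and the "t" of "don't" are missing from the output: simple_preprocess lowercases the text, splits it on non-letter characters, and by default discards tokens shorter than 2 or longer than 15 characters. The function below is only an approximation written for this explanation (it handles plain ASCII letters and is not gensim's actual implementation), but it reproduces the behavior seen above.

```python
import re


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Lowercase, keep runs of ASCII letters, and drop very short or very long tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if min_len <= len(token) <= max_len]


sample = "They look good and stick good! I just don't like the rounded shape"
approx_tokens = approx_simple_preprocess(sample)
print(approx_tokens)
```

The apostrophe splits "don't" into "don" and "t", and the one-letter "t" is then removed by the length filter, which is why only "don" survives.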

In [16]:
review_text = df.reviewText.apply(gensim.utils.simple_preprocess)
review_text

0         [they, look, good, and, stick, good, just, don...
1         [these, stickers, work, like, the, review, say...
2         [these, are, awesome, and, make, my, phone, lo...
3         [item, arrived, in, great, time, and, was, in,...
4         [awesome, stays, on, and, looks, great, can, b...
                                ...                        
194434    [works, great, just, like, my, original, one, ...
194435    [great, product, great, packaging, high, quali...
194436    [this, is, great, cable, just, as, good, as, t...
194437    [really, like, it, becasue, it, works, well, w...
194438    [product, as, described, have, wasted, lot, of...
Name: reviewText, Length: 194439, dtype: object

In [17]:
review_text.loc[0]

['they',
 'look',
 'good',
 'and',
 'stick',
 'good',
 'just',
 'don',
 'like',
 'the',
 'rounded',
 'shape',
 'because',
 'was',
 'always',
 'bumping',
 'it',
 'and',
 'siri',
 'kept',
 'popping',
 'up',
 'and',
 'it',
 'was',
 'irritating',
 'just',
 'won',
 'buy',
 'product',
 'like',
 'this',
 'again']

In [18]:
df.reviewText.loc[0]

"They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again"

**Training the Word2Vec Model**

Train the model on the reviews. Use a window of size 10, i.e. up to 10 words before the current word and 10 words after it. Only words that occur at least 2 times in the whole corpus are kept; configure this with the min_count parameter.

The workers parameter sets how many worker threads are used for training.

Initialize the model

In [19]:
model = gensim.models.Word2Vec(
    window=10,
    min_count=2,
    workers=4,
)

window=10: The model considers up to 10 words before and after the current word to understand the context.

min_count=2: Words that appear fewer than 2 times in the dataset will be ignored.

workers=4: The model will use 4 worker threads to train faster.
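
To make window and min_count concrete, here is a small pure-Python sketch on a made-up three-sentence corpus. The helper names `context_pairs` and `apply_min_count` are hypothetical and this is not how gensim is implemented internally; in particular, the sketch always uses the full window, whereas the text above says "up to" 10 words because the effective window can be smaller.

```python
from collections import Counter


def context_pairs(tokens, window):
    """Yield (center, context) pairs using up to `window` tokens on each side."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                yield center, tokens[j]


def apply_min_count(sentences, min_count):
    """Return the words whose total count across all sentences is at least `min_count`."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word for word, count in counts.items() if count >= min_count}


sentences = [
    ["great", "phone", "case"],
    ["great", "screen", "protector"],
    ["cheap", "phone", "case"],
]

vocab = apply_min_count(sentences, min_count=2)
pairs = list(context_pairs(sentences[0], window=1))

print(sorted(vocab))  # ['case', 'great', 'phone']
print(pairs)  # [('great', 'phone'), ('phone', 'great'), ('phone', 'case'), ('case', 'phone')]
```

The counts are taken over the whole corpus, not per sentence: "screen", "protector", and "cheap" each appear once in total, so min_count=2 removes them.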

Build Vocabulary

In [20]:
model.build_vocab(review_text, progress_per=1000)

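progress_per=1000: Controls how often gensim logs a progress update while scanning the corpus. This helps you track how fast the vocabulary is being built, especially for large datasets. The updates are emitted through Python's logging module, so they only appear if logging is enabled at INFO level.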
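
gensim reports vocabulary-building and training progress through Python's standard logging module, so nothing is printed unless logging is configured. A minimal setup, to be run before build_vocab, might look like the following; the format string is just one common choice, not a requirement.

```python
import logging

# Send log records to the console with a timestamp and level.
# Note: basicConfig does nothing if the root logger already has handlers.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)

# Make sure gensim's own loggers let INFO-level progress messages through.
gensim_logger = logging.getLogger("gensim")
gensim_logger.setLevel(logging.INFO)
```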

Train the Word2Vec Model

In [26]:
model.corpus_count

194439

In [28]:
model.epochs = 10

By default, epochs is 5.

In [29]:
model.train(review_text, total_examples=model.corpus_count, epochs=model.epochs)

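(123016367, 167737950)

train returns a tuple: the first value is the effective word count (the words actually used for training, for example after ignoring words outside the vocabulary), and the second is the total raw word count.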

Save the Model

Save the model so that it can be reused in other applications.

In [30]:
model.save("./word2vec-amazon-cell-accessories-reviews-short.model")

**Finding Similar Words and Similarity between Words**

https://radimrehurek.com/gensim/models/word2vec.html

In [31]:
model.wv.most_similar("bad")

[('shabby', 0.6859990358352661),
 ('terrible', 0.6751855611801147),
 ('good', 0.5950871109962463),
 ('horrible', 0.5876322388648987),
 ('legit', 0.5796961784362793),
 ('funny', 0.5471799373626709),
 ('pathetic', 0.5417964458465576),
 ('crappy', 0.5368625521659851),
 ('disappointing', 0.5320417284965515),
 ('upsetting', 0.525600254535675)]

In [32]:
model.wv.similarity(w1="cheap", w2="inexpensive")

0.56400585

In [33]:
model.wv.similarity(w1="great", w2="good")

0.80127376
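
The similarity score is the cosine similarity between the two word vectors: values near 1 mean the vectors point in almost the same direction, and values near 0 mean they are unrelated. A minimal pure-Python sketch of that formula, using short made-up vectors instead of real embeddings:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 2-dimensional vectors for illustration.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # close to 1.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # 0.0

print(same_direction, orthogonal)
```

Because only the direction matters, scaling a vector (as in the first example) does not change the score.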

**Further Reading**

You can read more about gensim's Word2Vec implementation at https://radimrehurek.com/gensim/models/word2vec.html