# Word2Vec 

> We use a subset of the Amazon reviews dataset, restricted to the Cell Phones and Accessories category. It is stored in a JSON file.

Link: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz



In [2]:
# !pip install gensim
# !pip install python-Levenshtein
import gensim # NLP Library
import pandas as pd

## Reading JSON file with Pandas

In [4]:
# lines=True: the file holds one JSON object (one review) per line
df = pd.read_json('/content/drive/MyDrive/Colab Notebooks/Cell_Phones_and_Accessories_5.json', lines=True)
df.head()

| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A30TL5EWN6DFXT | 120401325X | christina | [0, 0] | They look good and stick good! I just don't li... | 4 | Looks Good | 1400630400 | 05 21, 2014 |
| 1 | ASY55RVNIL0UD | 120401325X | emily l. | [0, 0] | These stickers work like the review says they ... | 5 | Really great product. | 1389657600 | 01 14, 2014 |
| 2 | A2TMXE2AFO7ONB | 120401325X | Erica | [0, 0] | These are awesome and make my phone look so st... | 5 | LOVE LOVE LOVE | 1403740800 | 06 26, 2014 |
| 3 | AWJ0WZQYMYFQ4 | 120401325X | JM | [4, 4] | Item arrived in great time and was in perfect ... | 4 | Cute! | 1382313600 | 10 21, 2013 |
| 4 | ATX7CZYFXI1KW | 120401325X | patrice m rogoza | [2, 3] | awesome! stays on, and looks great. can be use... | 5 | leopard home button sticker for iphone 4s | 1359849600 | 02 3, 2013 |


# Training a word2vec model

> We train a Word2Vec model using only the `reviewText` column of the JSON file.

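Steps in training a Word2Vec model:

* Preprocessing: converting to lowercase, splitting the text into tokens, stripping punctuation and surrounding whitespace, and dropping very short tokens such as 'a' and 'I'. Common stop words such as 'the' and 'and' are kept by `gensim.utils.simple_preprocess`, the function used below.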

In [5]:
df.shape

(194439, 9)

In [8]:
df.reviewText[0]

"They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again"

In [10]:
gensim.utils.simple_preprocess(df.reviewText[0]) # Lowercase and tokenize the first review

['they',
 'look',
 'good',
 'and',
 'stick',
 'good',
 'just',
 'don',
 'like',
 'the',
 'rounded',
 'shape',
 'because',
 'was',
 'always',
 'bumping',
 'it',
 'and',
 'siri',
 'kept',
 'popping',
 'up',
 'and',
 'it',
 'was',
 'irritating',
 'just',
 'won',
 'buy',
 'product',
 'like',
 'this',
 'again']
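The output above shows the effect of the preprocessing: punctuation is gone, everything is lowercase, and one-letter tokens such as 'I', 'a', and the 't' of "don't" are dropped. A rough pure-Python approximation of that behaviour is sketched below. The function name `simple_preprocess_sketch` and the regular expression are illustrative assumptions, not gensim's actual code; the length limits mirror gensim's documented defaults (`min_len=2`, `max_len=15`).

```python
import re

# Runs of letters (word characters that are not digits); an approximation
# of the tokenization that gensim performs, not its actual implementation.
TOKEN_PATTERN = re.compile(r"[^\W\d]+")

def simple_preprocess_sketch(text, min_len=2, max_len=15):
    """Lowercase the text, split it into alphabetic tokens, and drop
    tokens that are too short or too long."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [
        token for token in tokens
        if min_len <= len(token) <= max_len and not token.startswith("_")
    ]

review = "They look good and stick good! I just don't like the rounded shape"
tokens = simple_preprocess_sketch(review)
print(tokens)
```

Stop words such as 'the' and 'and' survive this step; only the length filter removes tokens.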

In [11]:
review_text_preprocessed = df.reviewText.apply(gensim.utils.simple_preprocess)
review_text_preprocessed

0         [they, look, good, and, stick, good, just, don...
1         [these, stickers, work, like, the, review, say...
2         [these, are, awesome, and, make, my, phone, lo...
3         [item, arrived, in, great, time, and, was, in,...
4         [awesome, stays, on, and, looks, great, can, b...
                                ...                        
194434    [works, great, just, like, my, original, one, ...
194435    [great, product, great, packaging, high, quali...
194436    [this, is, great, cable, just, as, good, as, t...
194437    [really, like, it, becasue, it, works, well, w...
194438    [product, as, described, have, wasted, lot, of...
Name: reviewText, Length: 194439, dtype: object

In [14]:
review_text_preprocessed[1]

['these',
 'stickers',
 'work',
 'like',
 'the',
 'review',
 'says',
 'they',
 'do',
 'they',
 'stick',
 'on',
 'great',
 'and',
 'they',
 'stay',
 'on',
 'the',
 'phone',
 'they',
 'are',
 'super',
 'stylish',
 'and',
 'can',
 'share',
 'them',
 'with',
 'my',
 'sister']

In [18]:
# window=10: the context spans up to 10 words before and up to 10 words after the target word.
# min_count=2: ignore words that occur fewer than 2 times in the whole corpus.
# workers=4: number of worker threads used to train the model.

model = gensim.models.Word2Vec(window=10, min_count=2, workers=4)
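To illustrate what the `window` parameter controls, the following minimal sketch lists the (target, context) pairs that a fixed window produces for a short token list. The helper `context_pairs` is hypothetical and written only for this illustration; a real implementation may also vary the effective window size per target word, which this sketch ignores.

```python
def context_pairs(tokens, window):
    """Return (target, context) pairs for every token that lies within
    `window` positions before or after the target."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["they", "look", "good", "and", "stick", "good"]
pairs = context_pairs(tokens, window=2)
for target, context in pairs[:5]:
    print(target, "->", context)
```

A larger window pulls in more distant words as context, which tends to capture topical rather than purely syntactic similarity.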

## Building the vocabulary

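> This step scans the corpus, counts every word, and builds the list of unique words for which the model will learn vectors (words that occur fewer than `min_count` times are left out).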

In [19]:
model.build_vocab(review_text_preprocessed,progress_per=100)
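Conceptually, vocabulary building amounts to counting words over the whole corpus and discarding the rare ones. The sketch below shows that idea with `collections.Counter`; the function `build_vocab_sketch` and the toy sentences are assumptions made for illustration and do not reproduce gensim's internal data structures.

```python
from collections import Counter

def build_vocab_sketch(sentences, min_count):
    """Count every token across all sentences and keep only the words
    whose total frequency is at least `min_count`."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

# Toy corpus (hypothetical), already tokenized
sentences = [
    ["great", "phone", "case"],
    ["great", "cable"],
    ["phone", "works", "great"],
]
vocab = build_vocab_sketch(sentences, min_count=2)
print(sorted(vocab))
```

Here 'case', 'cable', and 'works' each occur once, so with `min_count=2` they are excluded from the vocabulary.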

In [20]:
model.epochs

5

## Training the Word2Vec model

In [21]:
model.corpus_count

194439

In [22]:
model.train(review_text_preprocessed, total_examples=model.corpus_count, epochs=model.epochs)

(61503732, 83868975)

In [23]:
model.save('/content/drive/MyDrive/Colab Notebooks/amazon_accessories_review_word2vec.model')

## Trying out the trained Word2Vec model

In [24]:
model.wv.most_similar('bad')

[('terrible', 0.6885898113250732),
 ('shabby', 0.6561310291290283),
 ('horrible', 0.6159200668334961),
 ('good', 0.5934832096099854),
 ('legit', 0.5569067597389221),
 ('awful', 0.5565348863601685),
 ('okay', 0.5368521213531494),
 ('poor', 0.522596001625061),
 ('crappy', 0.5175753831863403),
 ('cheap', 0.5153719782829285)]

In [25]:
model.wv.most_similar('good')

[('decent', 0.8247765898704529),
 ('great', 0.7794734239578247),
 ('fantastic', 0.7090641260147095),
 ('nice', 0.6958004236221313),
 ('excellent', 0.6440078616142273),
 ('outstanding', 0.6423240900039673),
 ('superb', 0.6396431922912598),
 ('awesome', 0.6006938815116882),
 ('wonderful', 0.5982356071472168),
 ('bad', 0.5934832096099854)]

### Similarity Score between two words

In [26]:
model.wv.similarity(w1='cheap', w2='inexpensive')

0.5268609

In [27]:
model.wv.similarity(w1='good', w2='bad')

0.5934832

In [28]:
model.wv.similarity(w1='good', w2='excellent')

0.6440078

In [30]:
model.wv.similarity(w1='good', w2='product') # Not very similar

-0.02259911

In [31]:
model.wv.similarity(w1='good', w2='awesome')

0.6006939

In [32]:
model.wv.similarity(w1='good', w2='good')

1.0

In [33]:
model.wv.similarity(w1='good', w2='great')

0.77947336
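The scores above are cosine similarities between word vectors: 1.0 for identical directions, values near 0 for unrelated ones. As a minimal sketch of that computation, the code below uses made-up 3-dimensional vectors; the real model's vectors have many more dimensions and different values.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, chosen only for illustration
good = [0.9, 0.1, 0.3]
great = [0.8, 0.2, 0.4]
product = [-0.1, 0.9, -0.2]

print(cosine_similarity(good, great))    # close to 1: similar direction
print(cosine_similarity(good, product))  # near 0: unrelated direction
```

Because the measure depends only on direction, words that appear in similar contexts, such as 'good' and 'bad' in product reviews, can still score fairly high even though their meanings are opposite.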