<a href="https://colab.research.google.com/github/csoren66/Deep-Learning/blob/main/Implement_word2vec_in_gensim_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import gensim
import pandas as pd

## **Reading and Exploring the Dataset**

The dataset is a subset of Amazon reviews from the Cell Phones & Accessories category. The file stores one JSON object per line, so pandas can read it by passing lines=True to read_json.

Link to the Dataset: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz
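
To make the one-object-per-line layout concrete, here is a minimal sketch using only the standard library. The two records are made up for illustration; only the field names are borrowed from the dataset.

```python
import io
import json

# Two made-up records in JSON Lines form: one complete JSON object per line.
# The field names mirror the dataset; the values are purely illustrative.
raw = io.StringIO(
    '{"reviewerID": "A1", "asin": "B000", "reviewText": "Works great", "overall": 5}\n'
    '{"reviewerID": "A2", "asin": "B000", "reviewText": "Stopped charging", "overall": 2}\n'
)

# Parse each non-empty line on its own, which is what a JSON Lines reader does.
records = [json.loads(line) for line in raw if line.strip()]

print(len(records))              # number of reviews parsed
print(records[0]["reviewText"])  # text of the first review
```

Each parsed object becomes one row when the same layout is loaded into a DataFrame.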

In [None]:
df = pd.read_json('/content/Cell_Phones_and_Accessories_5.json', lines=True)
df.head()

       reviewerID        asin      reviewerName helpful  \
0  A30TL5EWN6DFXT  120401325X         christina  [0, 0]   
1   ASY55RVNIL0UD  120401325X          emily l.  [0, 0]   
2  A2TMXE2AFO7ONB  120401325X             Erica  [0, 0]   
3   AWJ0WZQYMYFQ4  120401325X                JM  [4, 4]   
4   ATX7CZYFXI1KW  120401325X  patrice m rogoza  [2, 3]   

                                          reviewText  overall  \
0  They look good and stick good! I just don't li...        4   
1  These stickers work like the review says 

In [None]:
df.shape

(194439, 9)

In [None]:
df.reviewText[0]

"They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again"

# Simple Preprocessing & Tokenization
Before training, the text needs cleaning. Common NLP preprocessing steps are lowercasing, trimming whitespace, and removing punctuation. Here gensim.utils.simple_preprocess handles these in one call: it lowercases each review, splits it into tokens, strips punctuation, and by default drops tokens shorter than 2 or longer than 15 characters.

Optionally, stop words such as 'and', 'or', 'is', 'the', 'a', 'an' can be removed, and words can be reduced to their root forms (for example, 'running' to 'run'). This notebook does not apply those two steps.
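
As a rough sketch, stop-word removal amounts to a filter over a token list. The stop-word set and the function name remove_stop_words below are illustrative assumptions, not part of gensim or of this notebook's pipeline.

```python
# A small illustrative stop-word set (the examples named above), not a complete list.
STOP_WORDS = {"and", "or", "is", "the", "a", "an"}

def remove_stop_words(tokens):
    """Return the tokens that are not in STOP_WORDS, preserving their order."""
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = ["they", "look", "good", "and", "stick", "good"]
print(remove_stop_words(tokens))  # → ['they', 'look', 'good', 'stick', 'good']
```

Stemming or lemmatization would be a further step on top of this and is not sketched here.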

In [None]:
review_text = df.reviewText.apply(gensim.utils.simple_preprocess)
review_text

0         [they, look, good, and, stick, good, just, don...
1         [these, stickers, work, like, the, review, say...
2         [these, are, awesome, and, make, my, phone, lo...
3         [item, arrived, in, great, time, and, was, in,...
4         [awesome, stays, on, and, looks, great, can, b...
                                ...                        
194434    [works, great, just, like, my, original, one, ...
194435    [great, product, great, packaging, high, quali...
194436    [this, is, great, cable, just, as, good, as, t...
194437    [really, like, it, becasue, it, works, well, w...
194438    [product, as, described, have, wasted, lot, of...
Name: reviewText, Length: 194439, dtype: object

In [None]:
review_text.loc[0]

['they',
 'look',
 'good',
 'and',
 'stick',
 'good',
 'just',
 'don',
 'like',
 'the',
 'rounded',
 'shape',
 'because',
 'was',
 'always',
 'bumping',
 'it',
 'and',
 'siri',
 'kept',
 'popping',
 'up',
 'and',
 'it',
 'was',
 'irritating',
 'just',
 'won',
 'buy',
 'product',
 'like',
 'this',
 'again']
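
The token list above has no 'i' or 'a', and "don't" has become 'don'. That is consistent with simple_preprocess lowercasing the text, splitting on non-letters, and dropping tokens shorter than 2 characters. The sketch below approximates that behaviour with a regular expression. It is not gensim's actual implementation, and the name simple_tokens is hypothetical.

```python
import re

# Runs of letters or underscores with no digits: an approximation of an
# alphabetic-token pattern, not gensim's own code.
TOKEN_PATTERN = re.compile(r"[^\W\d]+")

def simple_tokens(text, min_len=2, max_len=15):
    """Lowercase the text, split it into alphabetic tokens, and drop tokens
    whose length falls outside [min_len, max_len]."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]

print(simple_tokens("I just don't like the rounded shape"))
# → ['just', 'don', 'like', 'the', 'rounded', 'shape']
```

The apostrophe splits "don't" into 'don' and 't', and the one-letter 't' is then discarded by the length filter.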

In [None]:
df.reviewText.loc[0]

"They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again"

# **Training the Word2Vec Model**
Train the model on the tokenized reviews. window=10 sets the context to at most 10 words before and 10 words after the current word. min_count=2 drops every word that occurs fewer than 2 times in the whole corpus; it filters words, not sentences.

workers sets the number of CPU threads used for training.
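
To illustrate what a context window covers, the sketch below collects the words within a fixed distance of a target position. The function name context_words is hypothetical, and a window of 2 is used instead of 10 to keep the output short. It shows only the maximum span; the training code may use a smaller effective window for a given word.

```python
def context_words(tokens, position, window):
    """Return the words at most `window` positions left or right of tokens[position]."""
    start = max(0, position - window)
    left = tokens[start:position]
    right = tokens[position + 1:position + 1 + window]
    return left + right

tokens = ["they", "look", "good", "and", "stick", "good", "just", "don", "like"]

# Context of "stick" (index 4) with a window of 2.
print(context_words(tokens, 4, 2))  # → ['good', 'and', 'good', 'just']
```

Near the start or end of a review the window is simply cut off at the boundary.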

**Initialize the model**

In [None]:
model = gensim.models.Word2Vec(
    window=10,
    min_count=2,
    workers=4,
)

**Build Vocabulary**

In [None]:
model.build_vocab(review_text, progress_per=1000)
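
Building the vocabulary means counting every token in the corpus and discarding those that fall below min_count. The sketch below shows that idea with a Counter. The function build_vocabulary and the two toy sentences are illustrative assumptions, not gensim internals.

```python
from collections import Counter

def build_vocabulary(sentences, min_count):
    """Count every token across all sentences and keep those seen at least min_count times."""
    counts = Counter(tok for sentence in sentences for tok in sentence)
    return {tok: n for tok, n in counts.items() if n >= min_count}

sentences = [
    ["great", "case", "great", "price"],
    ["case", "broke", "quickly"],
]
vocab = build_vocabulary(sentences, min_count=2)
print(sorted(vocab))  # → ['case', 'great']
```

Words that occur only once ('price', 'broke', 'quickly') get no vector, because a single occurrence gives too little context to learn from.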

**Train the Word2Vec Model**

In [None]:
model.train(review_text, total_examples=model.corpus_count, epochs=model.epochs)

(61505622, 83868975)
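
According to the gensim documentation, train returns a pair: the effective word count (after ignoring out-of-vocabulary words and downsampling frequent ones) and the raw word count, both summed over all epochs. The arithmetic below is a sketch that assumes the default of 5 epochs, since the model was created without an epochs argument.

```python
# Values copied from the training output above.
effective_words, raw_words = 61505622, 83868975

# Assumption: no epochs argument was passed, so gensim's default of 5 applies.
epochs = 5

tokens_per_epoch = raw_words // epochs
kept_fraction = effective_words / raw_words

print(tokens_per_epoch)         # → 16773795
print(round(kept_fraction, 3))  # → 0.733
```

Under that assumption the corpus holds roughly 16.8 million tokens per pass, and about 73% of them contributed to training.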

## **Save the Model**
Save the model so that it can be reused in other applications. It can be loaded again later with gensim.models.Word2Vec.load.

In [None]:
model.save("./word2vec-amazon-cell-accessories-reviews-short.model")

# Finding Similar Words and Similarity Between Words

In [None]:
model.wv.most_similar("bad")

[('terrible', 0.6754758358001709),
 ('shabby', 0.65128093957901),
 ('horrible', 0.6260858178138733),
 ('good', 0.5825314521789551),
 ('disappointing', 0.5514594316482544),
 ('funny', 0.5490015745162964),
 ('crummy', 0.5345684885978699),
 ('okay', 0.5314438939094543),
 ('awful', 0.5272002220153809),
 ('crappy', 0.5256125330924988)]

In [None]:
model.wv.similarity(w1="cheap", w2="inexpensive")

0.5355522

In [None]:
model.wv.similarity(w1="great", w2="good")

0.77763975
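
The scores above are cosine similarities between word vectors, and most_similar ranks the vocabulary by that score. The sketch below reproduces the idea with made-up 3-dimensional vectors. The vectors and the helper names are illustrative assumptions; real vectors from this model have 100 dimensions by default.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors, chosen so that "great" and "good" point in similar directions.
vectors = {
    "great": [0.9, 0.1, 0.3],
    "good": [0.8, 0.2, 0.4],
    "cable": [0.1, 0.9, 0.2],
}

def most_similar(word, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar("great"))  # "good" ranks ahead of "cable"
```

Because the score reflects shared contexts rather than meaning, antonyms that appear in similar sentences can score high. That is one plausible reason 'good' shows up among the neighbours of 'bad' above.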