Install the required packages.

In [None]:
# !pip install gensim
# !pip install python-Levenshtein
# !pip install pandas

In [1]:
# Import the libraries 

import gensim
import pandas as pd

### Reading and Exploring the Dataset
The dataset used here is a subset of the Sport and outdoor activities review dataset. The data is stored as a JSON file and can be read with pandas.

Download the dataset from Kaggle, where it is listed under the name "Sport and outdoor review dataset".

In [5]:
df = pd.read_json("./Data/Sports_and_Outdoors_data.json", lines=True)
df.head(5)

| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | AIXZKN4ACSKI | 1881509818 | David Briner | [0, 0] | This came in on time and I am veru happy with ... | 5 | Woks very good | 1390694400 | 01 26, 2014 |
| 1 | A1L5P841VIO02V | 1881509818 | Jason A. Kramer | [1, 1] | I had a factory Glock tool that I was using fo... | 5 | Works as well as the factory tool | 1328140800 | 02 2, 2012 |
| 2 | AB2W04NI4OEAD | 1881509818 | J. Fernald | [2, 2] | If you don't have a 3/32 punch or would like t... | 4 | It's a punch, that's all. | 1330387200 | 02 28, 2012 |
| 3 | A148SVSWKTJKU6 | 1881509818 | Jusitn A. Watts "Maverick9614" | [0, 0] | This works no better than any 3/32 punch you w... | 4 | It's a punch with a Glock logo. | 1328400000 | 02 5, 2012 |
| 4 | AAAWJ6LW9WMOO | 1881509818 | Material Man | [0, 0] | I purchased this thinking maybe I need a speci... | 4 | Ok,tool does what a regular punch does. | 1366675200 | 04 23, 2013 |


In [7]:
df.shape

(296337, 9)
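
The lines=True argument passed to pd.read_json above tells pandas that the file holds one JSON object per line rather than a single JSON document. As a rough illustration of that layout, the sketch below parses two made-up records with only the standard library; the record values are invented for the example and are not taken from the dataset.

```python
import io
import json

# Two made-up records in line-delimited form: one complete JSON object per line.
sample = io.StringIO(
    '{"reviewerID": "R1", "overall": 5, "reviewText": "Works well"}\n'
    '{"reviewerID": "R2", "overall": 4, "reviewText": "Does the job"}\n'
)

# Parse each non-empty line independently, as a line-delimited reader would.
records = [json.loads(line) for line in sample if line.strip()]

print(len(records))
print(records[0]["reviewText"])
```

Each parsed record corresponds to one row of the DataFrame, and its keys become the columns.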

### Simple Preprocessing & Tokenization
The first step in most data science tasks is cleaning the data.
For NLP, this typically means converting all words to lower case, trimming whitespace, and removing punctuation.
We do the same here with gensim.utils.simple_preprocess, which lowercases the text, splits it into tokens, and by default discards tokens shorter than 2 or longer than 15 characters.

Optionally, we could also remove stop words such as 'and', 'or', 'is', 'the', 'a', 'an' and reduce words to their root forms, e.g. 'running' to 'run'. Neither step is applied in this notebook.
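
To make these cleaning steps concrete, here is a minimal standard-library sketch of lowercasing, tokenizing, and stop-word removal. It is not gensim's implementation: the helper names simple_tokens and drop_stop_words and the tiny stop-word list are assumptions made for illustration, and stemming is left out.

```python
import re

# Hypothetical, hand-picked stop-word list for illustration only.
STOP_WORDS = {"and", "or", "is", "the", "a", "an"}

def simple_tokens(text, min_len=2, max_len=15):
    """Lowercase the text and keep alphabetic tokens within the length bounds."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

def drop_stop_words(tokens):
    """Remove tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = simple_tokens("The bottle is GREAT, and it keeps water cold!")
print(tokens)
print(drop_stop_words(tokens))
```

Punctuation, digits, and single-letter words such as "I" never make it into the token list, which is why they are also missing from the preprocessed reviews in the cells that follow.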

In [9]:
review_text = df.reviewText.apply(gensim.utils.simple_preprocess)

In [10]:
review_text[0]

['this',
 'came',
 'in',
 'on',
 'time',
 'and',
 'am',
 'veru',
 'happy',
 'with',
 'it',
 'haved',
 'used',
 'it',
 'already',
 'and',
 'it',
 'makes',
 'taking',
 'out',
 'the',
 'pins',
 'in',
 'my',
 'glock',
 'very',
 'easy']

In [11]:
review_text

0         [this, came, in, on, time, and, am, veru, happ...
1         [had, factory, glock, tool, that, was, using, ...
2         [if, you, don, have, punch, or, would, like, t...
3         [this, works, no, better, than, any, punch, yo...
4         [purchased, this, thinking, maybe, need, speci...
                                ...                        
296332    [this, is, water, bottle, done, right, it, is,...
296333    [if, you, re, looking, for, an, insulated, wat...
296334    [this, hydracentials, sporty, oz, double, insu...
296335    [as, usual, received, this, item, free, in, ex...
296336    [hydracentials, insulated, oz, water, bottle, ...
Name: reviewText, Length: 296337, dtype: object

In [14]:
df.reviewText.loc[1]

"I had a factory Glock tool that I was using for my Glock 26, 27, and 17.  I've since lost it and had needed another.  Since I've used Ghost products prior, and know that they are reliable, I had decided to order this one.  Sure enough, this is just as good as a factory tool."

### Training the Word2Vec Model

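Train the model on the reviews. Use a window of size 10, i.e. up to 10 words before the current word and 10 words after it. The min_count parameter sets the minimum number of times a word must occur in the corpus to be kept in the vocabulary; words below that threshold are ignored. The cell below uses a min_count of 1, which keeps every word, whereas a value of 2 would drop words that occur only once.

The workers parameter sets how many CPU threads are used for training.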
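
The two parameters can be illustrated without gensim. The sketch below is only a conceptual model, not how the library is implemented: context_words and filter_vocab are hypothetical helpers, and the two toy sentences are invented. The first helper shows which neighbours fall inside a window around a target word. The second shows min_count as a threshold on a word's total frequency across the corpus.

```python
from collections import Counter

def context_words(tokens, index, window):
    """Return the tokens within `window` positions on either side of `index`."""
    start = max(0, index - window)
    return tokens[start:index] + tokens[index + 1:index + 1 + window]

def filter_vocab(sentences, min_count):
    """Keep words whose total count across all sentences is at least `min_count`."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}

sentences = [
    ["this", "bottle", "keeps", "water", "cold"],
    ["this", "flask", "keeps", "coffee", "hot"],
]

# Neighbours of "keeps" (position 2) in the first sentence with a window of 1.
print(context_words(sentences[0], 2, window=1))

# Words that occur at least twice in the toy corpus.
print(sorted(filter_vocab(sentences, min_count=2)))
```

With a min_count of 1 every word in the toy corpus would be kept, which mirrors the setting used in the model below.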

#### Initialize the model

In [15]:
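model = gensim.models.Word2Vec(
    window=10,    # context words considered on each side of the target word
    min_count=1,  # keep words that occur at least once in the corpus
    workers=4,    # number of CPU threads used for training
)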

#### Build Vocabulary 

In [21]:
model.build_vocab(review_text, progress_per=1000)

#### Train the Word2Vec model

In [None]:

model.epochs

5

In [19]:
# Run before build_vocab (In [21]), hence 0; afterwards it equals the number of reviews.
model.corpus_count

0

In [22]:
model.train(review_text, total_examples=model.corpus_count, epochs=model.epochs)

(91638671, 121496535)

### Save the Model

Save the model so that it can be reused in other applications.

In [23]:
model.save("./word2vec-Sport_and_OutDoor_Review.model")

### Finding Similar Words and Similarity Between Words
Reference: https://radimrehurek.com/gensim/models/word2vec.html

In [24]:
model.wv.most_similar("bottle")

[('flask', 0.7885138988494873),
 ('bottles', 0.7672917246818542),
 ('waterbottle', 0.7658523917198181),
 ('thermos', 0.7442213892936707),
 ('nalgene', 0.7217821478843689),
 ('mug', 0.7011425495147705),
 ('vessel', 0.6937223672866821),
 ('kanteen', 0.6901846528053284),
 ('bladder', 0.6803690195083618),
 ('reservoir', 0.6790879368782043)]

In [28]:
model.wv.similarity("bottle", "mug")

np.float32(0.7011426)
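
The similarity score reported above is the cosine similarity between the two word vectors, which is why it matches the value listed for 'mug' in the most_similar output. The following sketch computes the same measure by hand. The three-dimensional vectors are made up for illustration and are far smaller than the vectors a trained model holds.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors: "bottle" and "mug" point in similar directions, "tent" does not.
bottle = [0.9, 0.1, 0.3]
mug = [0.8, 0.2, 0.4]
tent = [-0.2, 0.9, 0.1]

print(cosine_similarity(bottle, mug))
print(cosine_similarity(bottle, tent))
```

Values close to 1 indicate vectors pointing in nearly the same direction, while values near 0 or below indicate unrelated or opposing directions.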