In [1]:
# !pip install gensim
# !pip install python-Levenshtein

In [2]:
import gensim
import pandas as pd

Reading and Exploring the Dataset
The dataset used here is a subset of the IMDB Reviews dataset from Kaggle.

In [5]:
df = pd.read_excel("IMDB.xlsx")
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [7]:
df.shape

(249, 2)

**Simple Preprocessing & Tokenization**


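For NLP, we usually normalize the text first: convert all words to lower case, trim whitespace, and remove punctuation. We do the same here with gensim's simple_preprocess, which lowercases each review, splits it into tokens, and drops tokens that are too short or too long.

Additionally, we can remove stop words such as 'and', 'or', 'is', 'the', 'a', 'an' and reduce words to their root forms, e.g. 'running' to 'run'. Neither step is applied in this notebook, so the tokens below still contain stop words, along with 'br' tokens left over from the `<br />` tags in the raw reviews.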
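The optional steps described above can be sketched with the standard library alone. This is a rough illustration, not what simple_preprocess does internally: `STOP_WORDS`, `crude_stem`, and `clean_tokens` are hypothetical names, the stop-word list is deliberately tiny, and the suffix rule is far cruder than a real stemmer such as Porter's.

```python
import re

# Hypothetical, deliberately tiny stop-word list taken from the examples in the text.
STOP_WORDS = {"and", "or", "is", "the", "a", "an"}


def crude_stem(token):
    """Very rough suffix stripping for illustration only (assumption, not a real stemmer)."""
    if token.endswith("ing") and len(token) > 5:
        stem = token[:-3]
        # Undo consonant doubling: "running" -> "runn" -> "run".
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]
        return stem
    return token


def clean_tokens(text):
    """Strip HTML tags, lowercase, tokenize, drop stop words, then stem."""
    no_tags = re.sub(r"<[^>]+>", " ", text)        # removes e.g. "<br />"
    tokens = re.findall(r"[a-z]+", no_tags.lower())  # alphabetic tokens only
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(clean_tokens("The dog is running <br /> and barking"))
# -> ['dog', 'run', 'bark']
```

In practice a library stemmer or lemmatizer would be a better choice than `crude_stem`, which mangles many words (for example, it would shorten "falling" to "fal").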

In [8]:
df.columns

Index(['review', 'sentiment'], dtype='object')

In [9]:
review = df.review.apply(gensim.utils.simple_preprocess)
review

0      [one, of, the, other, reviewers, has, mentione...
1      [wonderful, little, production, br, br, the, f...
2      [thought, this, was, wonderful, way, to, spend...
3      [basically, there, family, where, little, boy,...
4      [petter, mattei, love, in, the, time, of, mone...
                             ...                        
244    [based, on, edgar, rice, burroughs, novel, at,...
245    [have, seen, this, film, at, least, times, and...
246    [this, show, makes, absolutely, no, sense, eve...
247    [this, movie, was, so, bad, it, was, funny, fo...
248    [meryl, streep, is, such, genius, well, at, le...
Name: review, Length: 249, dtype: object

In [11]:
review.loc[0]

['one',
 'of',
 'the',
 'other',
 'reviewers',
 'has',
 'mentioned',
 'that',
 'after',
 'watching',
 'just',
 'oz',
 'episode',
 'you',
 'll',
 'be',
 'hooked',
 'they',
 'are',
 'right',
 'as',
 'this',
 'is',
 'exactly',
 'what',
 'happened',
 'with',
 'me',
 'br',
 'br',
 'the',
 'first',
 'thing',
 'that',
 'struck',
 'me',
 'about',
 'oz',
 'was',
 'its',
 'brutality',
 'and',
 'unflinching',
 'scenes',
 'of',
 'violence',
 'which',
 'set',
 'in',
 'right',
 'from',
 'the',
 'word',
 'go',
 'trust',
 'me',
 'this',
 'is',
 'not',
 'show',
 'for',
 'the',
 'faint',
 'hearted',
 'or',
 'timid',
 'this',
 'show',
 'pulls',
 'no',
 'punches',
 'with',
 'regards',
 'to',
 'drugs',
 'sex',
 'or',
 'violence',
 'its',
 'is',
 'hardcore',
 'in',
 'the',
 'classic',
 'use',
 'of',
 'the',
 'word',
 'br',
 'br',
 'it',
 'is',
 'called',
 'oz',
 'as',
 'that',
 'is',
 'the',
 'nickname',
 'given',
 'to',
 'the',
 'oswald',
 'maximum',
 'security',
 'state',
 'penitentary',
 'it',
 'focuses'

In [14]:
review.loc[244]

['based',
 'on',
 'edgar',
 'rice',
 'burroughs',
 'novel',
 'at',
 'the',
 'earth',
 'core',
 'provides',
 'little',
 'more',
 'than',
 'means',
 'to',
 'escape',
 'and',
 'give',
 'your',
 'brain',
 'rest',
 'victorian',
 'scientist',
 'dr',
 'abner',
 'perry',
 'peter',
 'cushing',
 'invents',
 'giant',
 'burrowing',
 'machine',
 'which',
 'he',
 'and',
 'his',
 'american',
 'partner',
 'doug',
 'mcclure',
 'use',
 'to',
 'corkscrew',
 'their',
 'way',
 'deep',
 'into',
 'the',
 'earth',
 'to',
 'explore',
 'what',
 'mysteries',
 'it',
 'may',
 'hold',
 'they',
 'soon',
 'discover',
 'lost',
 'world',
 'of',
 'subhuman',
 'creatures',
 'having',
 'conflict',
 'with',
 'prehistoric',
 'monsters',
 'br',
 'br',
 'cushing',
 'comes',
 'across',
 'as',
 'an',
 'absent',
 'minded',
 'professor',
 'to',
 'the',
 'point',
 'of',
 'being',
 'annoying',
 'instead',
 'of',
 'being',
 'bold',
 'adventurer',
 'he',
 'comes',
 'across',
 'effeminate',
 'on',
 'the',
 'other',
 'hand',
 'mcclure'

In [15]:
df.review.loc[244]

"Based on a Edgar Rice Burroughs novel, AT THE EARTH'S CORE provides little more than means to escape and give your brain a rest. A Victorian scientist Dr. Abner Perry(Peter Cushing)invents a giant burrowing machine, which he and his American partner(Doug McClure)use to corkscrew their way deep into the earth to explore what mysteries it may hold. They soon discover a lost world of subhuman creatures having conflict with prehistoric monsters.<br /><br />Cushing comes across as an absent minded professor to the point of being annoying. Instead of being a bold adventurer, he comes across effeminate. On the other hand McClure overacted enough to make himself also laughable. Caroline Munro plays the pretty Princess Dia that refuses to leave her world near the center of the earth. Also in the cast are: Godfrey James, Cy Grant and Michael Crane."

Training the Word2Vec Model
Train the model on the reviews. We use a window of size 6 (the value is your choice), i.e. up to 6 words before the current word and 6 words after it count as context. Words that occur fewer than 2 times in the whole corpus are ignored; this is configured with the min_count parameter.

The workers parameter sets how many CPU threads are used for training.
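To make the two parameters concrete, here is a small standard-library sketch of what a context window and a minimum-count filter mean. It is only an illustration under simplifying assumptions: `context_words` and `filter_rare` are hypothetical helpers, and gensim's own implementation differs in detail (for instance, it treats window as a maximum distance).

```python
from collections import Counter


def context_words(tokens, index, window=6):
    """Return up to `window` tokens on each side of tokens[index]."""
    start = max(0, index - window)
    left = tokens[start:index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def filter_rare(sentences, min_count=2):
    """Drop tokens whose total count across all sentences is below min_count."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return [[tok for tok in sent if counts[tok] >= min_count] for sent in sentences]


tokens = ["this", "show", "pulls", "no", "punches"]
print(context_words(tokens, 2, window=1))  # neighbours of "pulls"
# -> ['show', 'no']

print(filter_rare([["good", "movie"], ["good", "plot"]], min_count=2))
# -> [['good'], ['good']]
```

Note that the filter works on word frequency over the whole corpus, not on sentence length.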

**INITIALIZE THE MODEL**

In [16]:
model = gensim.models.Word2Vec(
    window=6,
    min_count=2,
    workers=4,
)

**BUILD VOCABULARY**

In [17]:
model.build_vocab(review, progress_per=200)

**TRAIN THE WORD2VEC MODEL**

In [18]:
model.epochs=5

In [19]:
total_examples=model.corpus_count
total_examples

249

In [20]:
model.train(review, total_examples=model.corpus_count, epochs=model.epochs)


(183833, 272600)

**Finding Similar Words and Similarity Between Words**

In [23]:
model.wv.most_similar("hooked")

[('points', 0.9858283996582031),
 ('making', 0.985491931438446),
 ('memorable', 0.985313892364502),
 ('above', 0.9853003621101379),
 ('far', 0.9852741360664368),
 ('idea', 0.9852015972137451),
 ('think', 0.9851778745651245),
 ('down', 0.9851510524749756),
 ('comes', 0.985120952129364),
 ('ups', 0.9850931167602539)]

In [25]:
model.wv.similarity(w1="good", w2="glad")


0.994768

In [26]:
model.wv.similarity(w1="memorable", w2="notable")

0.95729357

In [27]:
model.wv.similarity(w1="idea", w2="thought")

0.9986893
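
The similarity scores above are cosine similarities between word vectors. As a minimal sketch of that measure, using plain Python lists instead of the model's vectors (`cosine_similarity` is a hypothetical helper written for this illustration):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Vectors pointing the same way score close to 1; orthogonal vectors score 0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

With only 249 reviews and 5 epochs, every pair scored above sits close to 1, which may mean the vectors have not separated much yet, so these values are probably best read with caution.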