<a href="https://colab.research.google.com/github/anshupandey/Natural_language_Processing/blob/master/word2vec_using_gensim_%26_sentiment_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import numpy as np
import pandas as pd
import gensim
import re

# Dataset Preparation

In [2]:
# download the datasets: unlabeled and labeled movie reviews from IMDB
!wget -q https://www.dropbox.com/s/0ygoimffauvl7x5/unlabeledTrainData.tsv
!wget -q https://www.dropbox.com/s/4f1s02mh6bfjcr5/labeledTrainData.tsv

In [3]:
# load the unlabeled reviews; quoting=3 (csv.QUOTE_NONE) leaves quote characters in the text untouched
df = pd.read_csv("unlabeledTrainData.tsv",delimiter="\t",quoting=3)
df.head()

Unnamed: 0,id,review
0,"""9999_0""","""Watching Time Chasers, it obvious that it was..."
1,"""45057_0""","""I saw this film about 20 years ago and rememb..."
2,"""15561_0""","""Minor Spoilers<br /><br />In New York, Joan B..."
3,"""7161_0""","""I went to see this film with a great deal of ..."
4,"""43971_0""","""Yes, I agree with everyone on this site this ..."


In [4]:
df.shape

(50000, 2)

In [9]:
df['review'][2]

'"Minor Spoilers<br /><br />In New York, Joan Barnard (Elvire Audrey) is informed that her husband, the archeologist Arthur Barnard (John Saxon), was mysteriously murdered in Italy while searching an Etruscan tomb. Joan decides to travel to Italy, in the company of her colleague, who offers his support. Once in Italy, she starts having visions relative to an ancient people and maggots, many maggots. After shootings and weird events, Joan realizes that her father is an international drug dealer, there are drugs hidden in the tomb and her colleague is a detective of the narcotic department. The story ends back in New York, when Joan and her colleague decide to get married with each other, in a very romantic end. Yesterday I had the displeasure of wasting my time watching this crap. The story is so absurd, mixing thriller, crime, supernatural and horror (and even a romantic end) in a non-sense way. The acting is the worst possible, highlighting the horrible performance of the beautiful El

In [8]:
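def clean_data(doc):
  # replace HTML line breaks such as <br /> with a space
  doc = re.sub(r"<br\s*/?>"," ",doc)
  # keep letters only; digits and punctuation become spaces
  doc = re.sub("[^A-Za-z]"," ",doc)
  # lowercase and collapse runs of whitespace into single spaces
  doc = " ".join(doc.lower().split())
  return doc
clean_data(df['review'][2])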

'minor spoilers in new york joan barnard elvire audrey is informed that her husband the archeologist arthur barnard john saxon was mysteriously murdered in italy while searching an etruscan tomb joan decides to travel to italy in the company of her colleague who offers his support once in italy she starts having visions relative to an ancient people and maggots many maggots after shootings and weird events joan realizes that her father is an international drug dealer there are drugs hidden in the tomb and her colleague is a detective of the narcotic department the story ends back in new york when joan and her colleague decide to get married with each other in a very romantic end yesterday i had the displeasure of wasting my time watching this crap the story is so absurd mixing thriller crime supernatural and horror and even a romantic end in a non sense way the acting is the worst possible highlighting the horrible performance of the beautiful elvire audrey john saxon just gives his na
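To see what the cleaning does at each stage, here is a minimal sketch on a made-up review string. The helper name `clean_data_steps` and the sample text are illustrative assumptions, and the regular expressions are a close imitation of `clean_data` rather than an exact copy.

```python
import re

def clean_data_steps(doc):
    """Return the intermediate result after each cleaning step (illustrative helper)."""
    no_breaks = re.sub(r"<br\s*/?>", " ", doc)             # drop HTML line breaks
    letters_only = re.sub("[^A-Za-z]", " ", no_breaks)     # non-letters become spaces
    normalized = " ".join(letters_only.lower().split())    # lowercase, collapse whitespace
    return no_breaks, letters_only, normalized

sample = '"Great film!<br /><br />It\'s a 10/10."'          # made-up example text
for step in clean_data_steps(sample):
    print(repr(step))
```

Apostrophes and digits are replaced by spaces, which is why stray tokens such as `s` show up in the tokenized reviews.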

In [10]:
df['review'] = df['review'].apply(clean_data)

# Building the word2vec model

In [11]:
doc_words = []
for doc in df['review']:
  doc_words.append(doc.split(' '))

In [12]:
print(doc_words[0])

['watching', 'time', 'chasers', 'it', 'obvious', 'that', 'it', 'was', 'made', 'by', 'a', 'bunch', 'of', 'friends', 'maybe', 'they', 'were', 'sitting', 'around', 'one', 'day', 'in', 'film', 'school', 'and', 'said', 'hey', 'let', 's', 'pool', 'our', 'money', 'together', 'and', 'make', 'a', 'really', 'bad', 'movie', 'or', 'something', 'like', 'that', 'what', 'ever', 'they', 'said', 'they', 'still', 'ended', 'up', 'making', 'a', 'really', 'bad', 'movie', 'dull', 'story', 'bad', 'script', 'lame', 'acting', 'poor', 'cinematography', 'bottom', 'of', 'the', 'barrel', 'stock', 'music', 'etc', 'all', 'corners', 'were', 'cut', 'except', 'the', 'one', 'that', 'would', 'have', 'prevented', 'this', 'film', 's', 'release', 'life', 's', 'like', 'that']


In [13]:
len(doc_words)

50000

In [14]:
model = gensim.models.Word2Vec(doc_words,vector_size=60,window=5,min_count=10,workers=8)
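The `min_count=10` argument tells gensim to ignore words whose total frequency in the corpus is below 10. The sketch below imitates that pruning step in plain Python on a made-up corpus; `prune_vocab` and `toy_docs` are hypothetical names, and this is not how gensim implements it internally.

```python
from collections import Counter

def prune_vocab(tokenized_docs, min_count):
    """Keep only words that occur at least `min_count` times across all documents."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    return {word for word, n in counts.items() if n >= min_count}

toy_docs = [["good", "movie"], ["bad", "movie"], ["good", "plot"]]  # made-up corpus
print(sorted(prune_vocab(toy_docs, min_count=2)))
```

Rare words have too few contexts to learn a useful vector from, so pruning them shrinks the vocabulary without losing much.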

In [15]:
model.wv['time'].shape

(60,)

In [16]:
model.wv['amazing']

array([-3.11643147e+00,  4.31356877e-01, -3.36984187e-01,  1.51607931e+00,
       -7.71168053e-01, -3.79898101e-01, -2.10766215e-02,  1.32474685e+00,
        1.13126978e-01, -4.32162024e-02,  2.22071081e-01,  4.83371288e-01,
       -1.37416437e-01,  2.83054113e+00, -1.54120696e+00, -4.25378352e-01,
       -9.45339084e-01, -1.61651778e+00,  9.13741827e-01, -6.43577278e-01,
       -1.18862879e+00,  2.69646120e+00,  3.77729273e+00, -1.26228523e+00,
        3.73741776e-01,  1.09905235e-01,  3.59305322e-01, -5.26112854e-01,
       -2.16964984e+00, -6.13644719e-01,  1.62749350e+00,  1.72602952e+00,
       -8.36749852e-01,  7.40496218e-01, -2.59726029e-03, -1.10743392e+00,
        2.02151299e+00,  1.33105263e-01, -4.70758557e-01, -1.90964475e-01,
       -3.55837250e+00, -2.43913084e-02,  2.46300578e+00, -5.62194347e-01,
        3.42550850e+00, -2.73630828e-01,  1.33987427e+00, -2.04852390e+00,
       -2.76954484e+00,  9.41114604e-01,  1.25954771e+00, -1.21836472e+00,
       -6.53430998e-01, -

In [17]:
model.wv['anshu']

KeyError: ignored
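The `KeyError` is raised because `anshu` is not in the trained vocabulary, most likely because it occurs fewer than `min_count` times in the reviews. One way to avoid the exception is to check `key_to_index` before indexing. The sketch below uses a hypothetical `get_vector` helper and a tiny stand-in class, `ToyVectors`, in place of the real `model.wv` object.

```python
def get_vector(wv, word, default=None):
    """Return the vector for `word`, or `default` when it is out of vocabulary."""
    if word in wv.key_to_index:
        return wv[word]
    return default

class ToyVectors:
    """Minimal stand-in for a gensim KeyedVectors object (hypothetical, for illustration)."""
    def __init__(self, vectors):
        self._vectors = vectors
        self.key_to_index = {w: i for i, w in enumerate(vectors)}

    def __getitem__(self, word):
        return self._vectors[word]

toy_wv = ToyVectors({"time": [0.1, 0.2], "amazing": [0.3, 0.4]})
print(get_vector(toy_wv, "amazing"))   # known word
print(get_vector(toy_wv, "anshu"))     # out-of-vocabulary word
```

In the notebook, the same helper could be called as `get_vector(model.wv, "anshu")`.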

In [18]:
model.wv.vectors.shape

(28322, 60)

In [19]:
model.wv.most_similar("amazing")

[('incredible', 0.8964046239852905),
 ('awesome', 0.8524603247642517),
 ('outstanding', 0.7961137890815735),
 ('fantastic', 0.7695695161819458),
 ('exceptional', 0.7659703493118286),
 ('excellent', 0.7607165575027466),
 ('brilliant', 0.7235077619552612),
 ('wonderful', 0.7214052677154541),
 ('astonishing', 0.7192554473876953),
 ('stunning', 0.7088392376899719)]

In [20]:
model.wv.most_similar("actress")

[('actor', 0.7749857902526855),
 ('performer', 0.7162069082260132),
 ('performance', 0.7007765769958496),
 ('role', 0.6995749473571777),
 ('singer', 0.605505645275116),
 ('garbo', 0.5965226292610168),
 ('dancer', 0.5913607478141785),
 ('accolade', 0.5849970579147339),
 ('actresses', 0.5810202956199646),
 ('woman', 0.573936402797699)]
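The scores returned by `most_similar` are cosine similarities between word vectors. As a minimal sketch of that measure, here it is in plain Python on made-up two-dimensional vectors (the function name `cosine_similarity` is an assumption for this example):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # same direction, different length
```

Because only the direction matters, two words can score close to 1 even when their vectors differ in magnitude.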

In [21]:
model.save("imdb-vector.vec")

# Sentiment Analysis

In [22]:
df = pd.read_csv("labeledTrainData.tsv",delimiter="\t",quoting=3)
df.shape

(25000, 3)

In [23]:
df.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."


In [24]:
df['review'] = df['review'].apply(clean_data)
df.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,with all this stuff going down at the moment w...
1,"""2381_9""",1,the classic war of the worlds by timothy hines...
2,"""7759_3""",0,the film starts with a manager nicholas bell g...
3,"""3630_4""",0,it must be assumed that those who praised this...
4,"""9495_8""",1,superbly trashy and wondrously unpretentious s...


In [25]:
df['review'][2]

'the film starts with a manager nicholas bell giving welcome investors robert carradine to primal park a secret project mutating a primal animal using fossilized dna like jurassik park and some scientists resurrect one of nature s most fearsome predators the sabretooth tiger or smilodon scientific ambition turns deadly however and when the high voltage fence is opened the creature escape and begins savagely stalking its prey the human visitors tourists and scientific meanwhile some youngsters enter in the restricted area of the security center and are attacked by a pack of large pre historical animals which are deadlier and bigger in addition a security agent stacy haiduk and her mate brian wimmer fight hardly against the carnivorous smilodons the sabretooths themselves of course are the real star stars and they are astounding terrifyingly though not convincing the giant animals savagely are stalking its prey and the group run afoul and fight against one nature s most fearsome predator

## Split the dataset into train and test

In [26]:
from sklearn.model_selection import train_test_split
xtrain,xtest,ytrain,ytest = train_test_split(df['review'],df['sentiment'],test_size=0.2,random_state=5)

## Text preprocessing with TensorFlow

In [27]:
# build a word index from the training reviews only
from tensorflow.keras.preprocessing import text
# create the tokenizer; num_words limits the vocabulary to the most frequent words
tok = text.Tokenizer(num_words=15000)
tok.fit_on_texts(xtrain.tolist())


In [28]:
# convert each train and test review into a sequence of integer word indices
xtrain = tok.texts_to_sequences(xtrain.tolist())
xtest = tok.texts_to_sequences(xtest.tolist())
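The tokenizer ranks words by frequency (the most frequent word gets index 1) and, when converting text, skips both unseen words and words whose index is not below `num_words`. The following is a simplified imitation of that behaviour in plain Python, not the Keras implementation; `fit_word_index`, `to_sequences`, and the toy texts are hypothetical.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by descending frequency; the most frequent word gets index 1."""
    counts = Counter(word for t in texts for word in t.split())
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words):
    """Map words to indices, skipping unknown words and indices >= num_words."""
    return [[word_index[w] for w in t.split()
             if w in word_index and word_index[w] < num_words]
            for t in texts]

toy_texts = ["good movie good plot", "bad movie"]   # made-up training texts
toy_index = fit_word_index(toy_texts)
print(toy_index)
print(to_sequences(["good bad unseen plot"], toy_index, num_words=3))
```

Words that are skipped simply disappear from the sequence, so a converted review can be shorter than the original text.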


In [29]:
for doc in xtrain[:5]:
  print(len(doc))

201
90
116
418
205


In [30]:
np.mean([len(doc) for doc in xtrain])

229.2208

In [31]:
doc_length=300
from tensorflow.keras.preprocessing import sequence
# pad or truncate every review to doc_length tokens; zeros are appended at the end of short reviews
xtrain = sequence.pad_sequences(xtrain,maxlen=doc_length,padding='post')
xtest = sequence.pad_sequences(xtest,maxlen=doc_length,padding='post')
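With `padding='post'`, zeros are appended after the real tokens. The `truncating` argument is left at its Keras default of `'pre'`, so reviews longer than `maxlen` lose tokens from the beginning. A minimal plain-Python sketch of that combination (the helper name `pad_post` is an assumption):

```python
def pad_post(seq, maxlen, value=0):
    """Truncate from the front if too long, then pad with `value` at the end."""
    seq = list(seq)[-maxlen:]                      # keep the last `maxlen` items
    return seq + [value] * (maxlen - len(seq))     # append padding if too short

print(pad_post([5, 8, 2], maxlen=5))               # short sequence: padded at the end
print(pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5))   # long sequence: front is cut off
```

Since the mean review length is about 229 tokens, a `doc_length` of 300 keeps most reviews intact while still cutting the longest ones.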

## Modelling part

In [32]:
# load the model from the same relative path it was saved to
word2vec = gensim.models.Word2Vec.load("imdb-vector.vec")

# embedding length for each word
vector_length=word2vec.wv.vector_size
vector_length

60

In [None]:
tok.word_index

In [34]:
# create a weight matrix for the words in the tokenizer's vocabulary;
# each row is copied from the pretrained word2vec model, and words it
# does not know keep an all-zero row

weight_matrix = np.zeros((15001,60))

for word,i in sorted(tok.word_index.items(),key=lambda x:x[1]):
  if i > 15000:
    break
  if word in word2vec.wv.key_to_index:
    weight_matrix[i] = word2vec.wv[word]
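Rows of `weight_matrix` for words that word2vec does not know stay all zeros, and because the embedding layer is frozen they carry no information. It can be useful to check how many tokenizer words are actually covered. A minimal sketch with a hypothetical helper, `embedding_coverage`, and toy data:

```python
def embedding_coverage(word_index, known_words, max_index):
    """Split the words with index <= max_index into found and missing lists."""
    found, missing = [], []
    for word, i in sorted(word_index.items(), key=lambda kv: kv[1]):
        if i > max_index:
            break
        (found if word in known_words else missing).append(word)
    return found, missing

toy_word_index = {"the": 1, "movie": 2, "zzyzx": 3, "plot": 4}   # made-up tokenizer index
toy_known = {"the", "movie", "plot"}                              # made-up word2vec vocabulary
found, missing = embedding_coverage(toy_word_index, toy_known, max_index=3)
print(len(found), missing)
```

In the notebook, the call would presumably be `embedding_coverage(tok.word_index, word2vec.wv.key_to_index, 15000)`.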

In [35]:
from tensorflow.keras import models,layers

In [36]:
# feed-forward classifier on top of the frozen word2vec embeddings
model = models.Sequential()
model.add(layers.Embedding(15001,60,input_length=300,weights=[weight_matrix],
                           trainable=False))

model.add(layers.Flatten())
model.add(layers.Dense(500,activation='relu'))
model.add(layers.Dense(250,activation='relu'))
model.add(layers.Dense(50,activation='relu'))
model.add(layers.Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
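As a rough check on the size of this network, the parameter counts can be worked out by hand from the layer sizes above. The sketch below is plain arithmetic and assumes standard dense layers with one bias per output unit; `dense_params` is a hypothetical helper.

```python
def dense_params(n_in, n_out):
    """Weights plus one bias per output unit of a fully connected layer."""
    return n_in * n_out + n_out

vocab_rows, vector_length, doc_length = 15001, 60, 300

embedding_params = vocab_rows * vector_length   # frozen, so not trainable
flat_size = doc_length * vector_length          # size of the Flatten output

layer_sizes = [flat_size, 500, 250, 50, 1]
trainable = sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))

print(embedding_params, flat_size, trainable)
```

Nearly all trainable parameters sit in the first dense layer, because flattening a 300 x 60 input produces 18,000 features.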

In [37]:
model.fit(xtrain,ytrain,epochs=10,batch_size=100,validation_data=(xtest,ytest))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7f4445a11a20>

In [46]:
model.predict(xtest[5].reshape(1,300))



array([[0.00839654]], dtype=float32)
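The sigmoid output is the predicted probability of the positive class (`sentiment` = 1). Turning it into a label requires a cut-off. The sketch below assumes the common threshold of 0.5, which the notebook itself does not specify; `to_label` is a hypothetical helper.

```python
def to_label(probability, threshold=0.5):
    """Map a predicted probability to a 0/1 sentiment label."""
    return 1 if probability >= threshold else 0

print(to_label(0.00839654))   # the probability predicted above
```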

In [None]:
xtest[0]

In [None]:
tok.index_word

In [48]:
# decode test review 5 back into words, skipping the padding index 0
test0 = [tok.index_word[k] for k in xtest[5] if k!=0]
print(" ".join(test0))

let s face it nancy drew was never great literature it is in the same category as babysitter club magic tree house abc mysteries in fact it was one of the original formula stories nancy is perfect pretty thoughtful nice has no internal conflicts ever and never changes ned is pretty much the same the movie was true to that style and i have to say i liked it it will never be a great movie but it had a that same nostalgic flavor that the books held it had just the right amount of suspense for my children and there was almost no offensive language i liked the push for more conservative dress corky was a bit of an annoyance he was a little out of place on a high school campus i never quite got why he was there in the first place
