# Session 4 Comprehension: WORD2VEC

### Import libraries

To start off, you will import the libraries required for loading and tokenizing the text data.


In [1]:
# Import libraries
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to C:\Users\DELL
[nltk_data]     5590\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to C:\Users\DELL
[nltk_data]     5590\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

### Step 1: Loading the data

##### Loading the CSV file (Google Colab or Jupyter)
- If you're using Google Colab for this demonstration, make sure the `IMDB_reviews_Word2Vec.csv` file has been uploaded and that the path in the cell below points to it.
- If you're using Jupyter, place the file in the notebook's working directory and run the cell below as is.

In [2]:
import pandas as pd
raw_df = pd.read_csv('IMDB_reviews_Word2Vec.csv').sample(6000, random_state=123)
raw_df.head()

Unnamed: 0,review,sentiment
11872,"This movie was beyond awful, it was a pimple o...",negative
40828,As of this writing John Carpenter's 'Halloween...,positive
36400,I must admit a slight disappointment with this...,positive
5166,Oh dear! The BBC is not about to be knocked of...,negative
30273,its a totally average film with a few semi-alr...,negative


### Step 2: Data Preparation

Our objective is to tokenize all the reviews loaded above.

In [3]:
# Convert all the review text into a long string and print its length
raw_corpus = u''.join(raw_df['review']+" ")
print("Raw Corpus contains {0:,} characters".format(len(raw_corpus)))

Raw Corpus contains 7,868,711 characters


In [4]:
### Let's check what happened
raw_corpus[0:500]

'This movie was beyond awful, it was a pimple on the a*s of the movie industry. I know that every movie can\'t be a hit or for that matter even average, but the responsible parties that got together for this epic dud, should have been able to see that they had a ticking time bomb on their hands. I can\'t help but think that the cast would get together in between scenes and console each other for being in such a massive heap of dung. I can hear it now, "You getting\' paid?" "Nope, you?" I understand '

As you can see, the reviews have been joined into one long string, separated by spaces. In the next step, we will split this string into sentences and then clean and tokenize each sentence.

In [5]:
# import natural language toolkit
import nltk
# download the punkt tokenizer
nltk.download('punkt')
print("The punkt tokenizer is downloaded")

The punkt tokenizer is downloaded


[nltk_data] Downloading package punkt to C:\Users\DELL
[nltk_data]     5590\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [6]:
# Load the punkt tokenizer
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
print("The punkt tokenizer is loaded")

# Tokenize the raw string into raw sentences (this can take a while on the full corpus)
raw_sentences = tokenizer.tokenize(raw_corpus)
print("We have {0:,} raw sentences".format(len(raw_sentences)))

The punkt tokenizer is loaded
We have 63,395 raw sentences


In [7]:
## Recheck the raw review data once again
## Check the first review
raw_df['review'].head(1)

11872    This movie was beyond awful, it was a pimple o...
Name: review, dtype: object

The first review opens with the sentences printed in `raw_corpus[0:500]` above. Let's check that the first 5 entries in `raw_sentences` match them.

In [8]:
raw_sentences[0:5]

['This movie was beyond awful, it was a pimple on the a*s of the movie industry.',
 "I know that every movie can't be a hit or for that matter even average, but the responsible parties that got together for this epic dud, should have been able to see that they had a ticking time bomb on their hands.",
 "I can't help but think that the cast would get together in between scenes and console each other for being in such a massive heap of dung.",
 'I can hear it now, "You getting\' paid?"',
 '"Nope, you?"']

As you can see, the opening of the first review has been broken down into its individual sentences in the first 5 entries of the `raw_sentences` list. The rest of the reviews have been broken down in the same way and stored in the same list.

In the next step, we shall remove punctuation and other special characters from each of the sentences.

In [9]:
import re

# Clean and split sentence into words
def clean_and_split_str(string):
    strip_special_chars = re.compile("[^A-Za-z]+") ## matches any run of non-letter characters (punctuation, digits, whitespace)
    string = re.sub(strip_special_chars, " ", string)
    return string.strip().split()

In [10]:
### let's check the above function in action on a sample raw sentence
sample = "We've already used it to look for splinters and a few other uses."

In [11]:
clean_and_split_str(sample)

['We',
 've',
 'already',
 'used',
 'it',
 'to',
 'look',
 'for',
 'splinters',
 'and',
 'a',
 'few',
 'other',
 'uses']

All the words in the sample sentence have been tokenized. The regex in the function `clean_and_split_str` also splits "**We've**" into "**We**" and "**ve**", and the period at the end of the sentence has been removed.
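To see how the choice of pattern shapes the tokens, the sketch below compares the notebook's letters-only pattern with a hypothetical variant that also keeps apostrophes. The names `LETTERS_ONLY`, `LETTERS_AND_APOSTROPHES` and `split_with` are illustrative and not part of the notebook.

```python
import re

# Same pattern as clean_and_split_str: any run of non-letters becomes one space
LETTERS_ONLY = re.compile(r"[^A-Za-z]+")
# Hypothetical variant (not used in the notebook): apostrophes are kept as well
LETTERS_AND_APOSTROPHES = re.compile(r"[^A-Za-z']+")

def split_with(pattern, text):
    """Replace every match of `pattern` with a space, then split on whitespace."""
    return pattern.sub(" ", text).strip().split()

sample_text = "We've watched it 3 times, it's a 10/10!"
print(split_with(LETTERS_ONLY, sample_text))
print(split_with(LETTERS_AND_APOSTROPHES, sample_text))
```

With the letters-only pattern, digits disappear and contractions are split into fragments such as `ve` and `s`; these fragments then enter the Word2Vec vocabulary as tokens of their own.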

Now let's go ahead and clean all the sentences in the `raw_sentences` list

In [12]:
# Clean each raw sentence and build the list of tokenized sentences
sentences = []
for raw_sent in raw_sentences:
    if len(raw_sent) > 0:
        sentences.append(clean_and_split_str(raw_sent))
print("We have {0:,} clean sentences".format(len(sentences)))

We have 63,395 clean sentences


In [13]:
###Check one raw sentence and its equivalent clean sentence
### You can change the index and evaluate for multiple sentences
print(raw_sentences[30],'\n')
print(sentences[30])

I loved Richardson's mannerisms and curt delivery of lines, dismissing the presence and ideas of Raymond Massey's aloof, confident visitor. 

['I', 'loved', 'Richardson', 's', 'mannerisms', 'and', 'curt', 'delivery', 'of', 'lines', 'dismissing', 'the', 'presence', 'and', 'ideas', 'of', 'Raymond', 'Massey', 's', 'aloof', 'confident', 'visitor']


In [14]:
token_count = sum([len(sentence) for sentence in sentences])
print("The dataset corpus contains {0:,} tokens".format(token_count))

The dataset corpus contains 1,430,237 tokens


### Step 3: Model Building

In [15]:
import multiprocessing

#Dimensionality of the resulting word vectors
num_features = 300

#Minimum word count threshold
min_word_count = 3

#Number of threads to run in parallel
num_workers = multiprocessing.cpu_count()

#Context window length
context_size = 7

#Seed for the RNG, to make the result reproducible
seed = 1

In [16]:
import gensim

# create a word2vec model object
word2vec_model = gensim.models.word2vec.Word2Vec(
    sg=1,
    seed=seed,
    workers=num_workers, 
    vector_size=num_features, 
    min_count=min_word_count, 
    window=context_size)
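With `sg=1` the model uses the skip-gram architecture, where each word is paired with the words around it, up to `window` positions away on either side. The following is a minimal sketch of that pairing with a fixed window; the function name `skipgram_pairs` is an assumption for illustration, and gensim's actual training loop differs in its details (for example, it may vary the effective window size per position).

```python
def skipgram_pairs(tokens, window):
    """Return (centre, context) pairs for each token and its neighbours within `window`."""
    pairs = []
    for i, centre in enumerate(tokens):
        start = max(0, i - window)                 # do not run past the sentence start
        stop = min(len(tokens), i + window + 1)    # do not run past the sentence end
        for j in range(start, stop):
            if j != i:                             # a word is not its own context
                pairs.append((centre, tokens[j]))
    return pairs

toy_tokens = ["I", "loved", "the", "movie"]
for pair in skipgram_pairs(toy_tokens, window=1):
    print(pair)
```

A larger window, such as the `context_size = 7` used here, produces more pairs per sentence and tends to capture broader, more topical associations than a narrow window.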

### Question 1: Which of the following is the closest to the vocabulary length of the word2vec model?

In [26]:
word2vec_model.build_vocab(sentences)
print("The vocabulary is built")
# write your code here to calculate vocabulary length
vocab_length = len(word2vec_model.wv.index_to_key)
print("Vocabulary length:", vocab_length)

The vocabulary is built
Vocabulary length: 21869
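The vocabulary is smaller than the set of all distinct tokens because `min_count` drops words whose total frequency is below the threshold. A rough sketch of that filtering on a toy corpus is shown below; the helper `vocabulary` and the toy sentences are assumptions for illustration, not the library's internals.

```python
from collections import Counter

def vocabulary(tokenized_sentences, min_count):
    """Words whose total frequency across all sentences is at least `min_count`."""
    counts = Counter(word for sentence in tokenized_sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}

toy_sentences = [
    ["the", "movie", "was", "great"],
    ["the", "movie", "was", "awful"],
    ["the", "cast", "was", "great"],
]
print(sorted(vocabulary(toy_sentences, min_count=1)))  # every distinct token
print(sorted(vocabulary(toy_sentences, min_count=3)))  # only tokens seen 3+ times
```

Because `clean_and_split_str` does not lowercase its input, tokens such as `The` and `the` are counted separately, which also affects the vocabulary length.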


In [38]:
print(word2vec_model.wv.index_to_key)



In [39]:
#Start training the model
#Training for epochs = 1
word2vec_model.train(corpus_iterable = sentences, total_examples=word2vec_model.corpus_count, epochs=1)
print("Training finished")

Training finished


### Question 2: For the focal word “living”, which of the following falls into the category “top 5 similar words for an epoch value of 1”?

In [45]:
# find the top N most similar words to a given word in the vocabulary
# write your code here
given_word = 'living'
top_n = 5  # Number of similar words to retrieve

print(word2vec_model.wv.similar_by_word(given_word, topn=top_n))

# compute the similarity score between two words in the vocabulary
print('similarity score between w1 and w2',word2vec_model.wv.similarity('living','area'))

[('state', 0.7998521327972412), ('Iraq', 0.7942113876342773), ('tribe', 0.7835680842399597), ('dreams', 0.7802200317382812), ('dying', 0.7776700854301453)]
similarity score between w1 and w2 0.7372716
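The scores returned by `similar_by_word` and `similarity` are cosine similarities between word vectors. As a minimal sketch, the same quantity can be computed by hand on small toy vectors; the function `cosine_similarity` and the vectors below are made up for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a, twice the length
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # 0: no shared direction
```

On the trained model, the vectors themselves can be read with `word2vec_model.wv['living']`, so the score printed above could be reproduced by passing two such vectors to a function like this one.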


### Question 3: What happens to the similarity scores for the top five similar words of “living”, when you increase the number of epochs from one to five in the given example?

In [49]:
### Continue training the existing model and print the top five most similar words after each call
# Note: train() resumes from the current weights, so iteration i adds i more epochs
# to the training already done; the model is not retrained from scratch
given_word = 'living'
top_n = 5  # Number of similar words to retrieve

for i in range(1, top_n+1):
    word2vec_model.train(corpus_iterable=sentences, total_examples=word2vec_model.corpus_count, epochs=i)
    similar_words = word2vec_model.wv.similar_by_word(given_word, topn=top_n)
    print("Top", top_n, "similar words to", given_word, ":")
    for word, similarity in similar_words:
        print(word, ":", similarity)

Top 5 similar words to living :
tribe : 0.7611867785453796
suburban : 0.760480523109436
Africa : 0.7456793785095215
hostage : 0.7423768043518066
Iraq : 0.741137683391571
Top 5 similar words to living :
suburban : 0.6767222881317139
tribe : 0.669353723526001
interacting : 0.6462210416793823
Iraq : 0.6460049152374268
hostage : 0.6410962343215942
Top 5 similar words to living :
suburban : 0.5767548680305481
Istanbul : 0.5402824282646179
dutiful : 0.5342580676078796
unaware : 0.5339727401733398
tribe : 0.5324456691741943
Top 5 similar words to living :
dutiful : 0.4841460585594177
suburban : 0.47582370042800903
awaken : 0.4631274342536926
unaware : 0.46082624793052673
Istanbul : 0.45851805806159973
Top 5 similar words to living :
dutiful : 0.41444385051727295
awaken : 0.3977203071117401
suburban : 0.39531004428863525
Istanbul : 0.39133113622665405
Massachusetts : 0.38185206055641174
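When reading the five blocks above, keep in mind that each call to `train` continues from the weights left by the previous call, so the blocks do not correspond to 1, 2, 3, 4 and 5 epochs. The bookkeeping below is a sketch of the running total, assuming every cell ran exactly once in the order shown; the execution counters suggest some cells were re-run, in which case the real totals would be higher.

```python
from itertools import accumulate

initial_epochs = 1                 # the single epoch trained before the loop
loop_epochs = list(range(1, 6))    # epochs=i for i = 1..5 inside the loop

# Running total of epochs the model has seen after each loop iteration
totals = [initial_epochs + extra for extra in accumulate(loop_epochs)]
for iteration, total in enumerate(totals, start=1):
    print(f"after iteration {iteration}: {total} epochs in total")
```

To compare one epoch against five directly, one option is to create a fresh `Word2Vec` object with the same parameters, call `build_vocab`, and call `train` once with `epochs=5`.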
