Name: **Syed Farhan Naqvi**<br>
Div: **BE11-Q11**<br>
Roll no: **43344**<br>
Title: **Assignment 5: Implement the Continuous Bag of Words (CBOW) Model**<br>

In [10]:
# Import libraries
# Note: np_utils ships only with older Keras releases; newer ones expose keras.utils.to_categorical directly.
from keras.preprocessing import text
from keras.utils import np_utils
from keras.utils import pad_sequences
import numpy as np
import pandas as pd

In [15]:
# Sample corpus: an excerpt from a farewell address by the 45th President of the United States
data = """My fellow Americans: Four years ago, we launched a great national effort to rebuild our country, to renew its spirit, and to restore the allegiance of this government to its citizens.  In short, we embarked on a mission to make America great again— for all Americans.

As I conclude my term as the 45th President of the United States, I stand before you truly proud of what we have achieved together.  We did what we came here to do—and so much more.

This week, we inaugurate a new administration and pray for its success in keeping America safe and prosperous.  We extend our best wishes, and we also want them to have luck—a very important word.

I’d like to begin by thanking just a few of the amazing people who made our remarkable journey possible.

First, let me express my overwhelming gratitude for the love and support of our spectacular First Lady, Melania.  Let me also share my deepest appreciation to my daughter Ivanka, my son-in-law Jared, and to Barron, Don, Eric, Tiffany, and Lara.  You fill my world with light and with joy.

I also want to thank Vice President Mike Pence, his wonderful wife Karen, and the entire Pence family.

Thank you as well to my Chief of Staff, Mark Meadows; the dedicated members of the White House Staff and the Cabinet; and all the incredible people across our administration who poured out their heart and soul to fight for America.

I also want to take a moment to thank a truly exceptional group of people: the United States Secret Service.  My family and I will forever be in your debt.  My profound gratitude as well to everyone in the White House Military Office, the teams of Marine One and Air Force One, every member of the Armed Forces, and state and local law enforcement all across our country."""


In [16]:
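!pip install nltk
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")
stop = set(stopwords.words("english"))
# Stop-word-filtered tokens, kept for reference only; the cells below do not use them.
filtered_words = [word.lower() for word in data.split() if word.lower() not in stop]
# Whitespace split: each element of dl_data is one token, and every element is treated as its own document downstream.
dl_data = data.split()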



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\FARHAN\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
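
Because `data.split()` yields one whitespace token per element, almost every "document" passed to the later cells holds a single word, so the context window has few real neighbours to draw from. A minimal sketch of a sentence-level split is shown here, assuming a simple regex on sentence-ending punctuation is adequate for this text. The helper name `split_sentences` is hypothetical and not part of the original notebook.

```python
import re

def split_sentences(raw_text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', raw_text.strip())
    # Drop any empty fragments left over from the split
    return [p for p in parts if p]

sample = "We did what we came here to do. You fill my world with light and with joy."
sentences = split_sentences(sample)
print(sentences)
```

Passing such a list in place of `dl_data` would give each target word genuine context; the outputs recorded in this notebook were produced with the word-level split.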


In [17]:
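# Tokenization: build the word <-> ID mappings
tokenizer = text.Tokenizer()
tokenizer.fit_on_texts(dl_data)
word2id = tokenizer.word_index  # IDs start at 1, most frequent word first

word2id['PAD'] = 0  # reserve ID 0 for padding
id2word = {v:k for k, v in word2id.items()}
# Convert every document into a list of word IDs
wids = [[word2id[w] for w in text.text_to_word_sequence(doc)] for doc in dl_data]

vocab_size = len(word2id)  # includes the PAD entry
embed_size = 100
window_size = 2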

print('Vocabulary Size:', vocab_size)
print('Vocabulary Sample:', list(word2id.items())[:20])

Vocabulary Size: 179
Vocabulary Sample: [('to', 1), ('and', 2), ('the', 3), ('my', 4), ('of', 5), ('we', 6), ('a', 7), ('our', 8), ('in', 9), ('i', 10), ('for', 11), ('as', 12), ('also', 13), ('its', 14), ('america', 15), ('all', 16), ('you', 17), ('want', 18), ('people', 19), ('thank', 20)]
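
To make the ID assignment concrete, the following is a simplified pure-Python sketch of frequency-ranked vocabulary building with a reserved padding ID. It is an illustration only: the helper `build_vocab` is hypothetical, and it skips the punctuation filtering that the Keras tokenizer applies.

```python
from collections import Counter

def build_vocab(tokens):
    """Assign IDs 1..N by descending frequency and reserve ID 0 for 'PAD'."""
    counts = Counter(t.lower() for t in tokens)
    word2id = {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}
    word2id['PAD'] = 0
    # Reverse mapping, used to turn IDs back into words
    id2word = {i: w for w, i in word2id.items()}
    return word2id, id2word

word2id_demo, id2word_demo = build_vocab("to be or not to be".split())
print(word2id_demo)
```

Note that `'PAD'` is inserted last, so the insertion order of the dictionaries does not follow the ID order; looking words up by ID is safer than relying on position.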


In [18]:
# Generate (context words, target word) pairs
def generate_context_word_pairs(corpus, window_size, vocab_size):
    context_length = window_size*2
    for words in corpus:
        sentence_length = len(words)
        for index, word in enumerate(words):
            context_words = []
            label_word = []
            start = index - window_size
            end = index + window_size + 1

            # Collect up to window_size word IDs on each side of the target
            context_words.append([words[i]
                                 for i in range(start, end)
                                 if 0 <= i < sentence_length
                                 and i != index])
            label_word.append(word)

            # Left-pad short contexts with 0 (PAD) and one-hot encode the target
            x = pad_sequences(context_words, maxlen=context_length)
            y = np_utils.to_categorical(label_word, vocab_size)
            yield (x, y)
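
The windowing logic above can be traced on a toy sequence without Keras. This sketch assumes left padding with ID 0, which matches the default behaviour of `pad_sequences`; the function name `context_target_pairs` and the toy IDs are made up for illustration.

```python
def context_target_pairs(word_ids, window_size, pad_id=0):
    """Return (padded context, target) pairs for one sequence of word IDs."""
    context_length = 2 * window_size
    pairs = []
    for index, target in enumerate(word_ids):
        # Neighbours within window_size positions, excluding the target itself
        context = [word_ids[i]
                   for i in range(index - window_size, index + window_size + 1)
                   if 0 <= i < len(word_ids) and i != index]
        # Left-pad so that every context has the same length
        padded = [pad_id] * (context_length - len(context)) + context
        pairs.append((padded, target))
    return pairs

pairs = context_target_pairs([5, 8, 3, 9], window_size=2)
for context, target in pairs:
    print(context, '->', target)
```

A sequence with only one word produces a context made entirely of padding, which gives the model nothing to learn from.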
            


In [19]:
# Model building
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Embedding, Lambda

cbow = Sequential()
# Look up one embed_size-dimensional vector per context word
cbow.add(Embedding(input_dim=vocab_size, output_dim=embed_size, input_length=window_size*2))
# Average the context vectors into a single vector
cbow.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(embed_size,)))
# Predict the target word as a distribution over the vocabulary
cbow.add(Dense(vocab_size, activation='softmax'))
cbow.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# summary() prints the table itself and returns None, so it is not wrapped in print()
cbow.summary()

# from IPython.display import SVG
# from keras.utils.vis_utils import model_to_dot

# SVG(model_to_dot(cbow, show_shapes=True, show_layer_names=False, rankdir='TB').create(prog='dot', format='svg'))

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
 embedding_1 (Embedding)     (None, 4, 100)            17900

 lambda_1 (Lambda)           (None, 100)               0

 dense_1 (Dense)             (None, 179)               18079

Total params: 35,979
Trainable params: 35,979
Non-trainable params: 0
_________________________________________________________________
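
The parameter counts in the summary follow directly from the layer shapes: the embedding holds one vector per vocabulary entry, and the dense layer holds a weight matrix plus one bias per output class. A short arithmetic check, using a hypothetical helper `cbow_param_counts`:

```python
def cbow_param_counts(vocab_size, embed_size):
    """Return (embedding, dense, total) parameter counts for the CBOW model."""
    embedding = vocab_size * embed_size             # one vector per word ID
    dense = embed_size * vocab_size + vocab_size    # weights plus biases
    return embedding, dense, embedding + dense

print(cbow_param_counts(179, 100))  # (17900, 18079, 35979)
```

The averaging `Lambda` layer has no trainable parameters, which is why it contributes 0.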


In [20]:
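# Train for 49 epochs (range(1, 50)), feeding one (context, target) pair per batch
for epoch in range(1, 50):
    loss = 0.
    i = 0
    for x, y in generate_context_word_pairs(corpus=wids, window_size=window_size, vocab_size=vocab_size):
        i += 1
        loss += cbow.train_on_batch(x, y)
        if i % 100000 == 0:
            print('Processed {} (context, word) pairs'.format(i))

    # loss is the sum of the per-pair losses over the whole epoch, not a mean
    print('Epoch:', epoch, '\tLoss:', loss)
    print()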

Epoch: 1 	Loss: 1607.3180232048035

Epoch: 2 	Loss: 1543.9130997657776

Epoch: 3 	Loss: 1529.476705789566

Epoch: 4 	Loss: 1523.1182062625885

Epoch: 5 	Loss: 1519.2799623012543

Epoch: 6 	Loss: 1516.6812961101532

Epoch: 7 	Loss: 1514.2458634376526

Epoch: 8 	Loss: 1511.9314937591553

Epoch: 9 	Loss: 1509.589139699936

Epoch: 10 	Loss: 1507.1182687282562

Epoch: 11 	Loss: 1504.2775642871857

Epoch: 12 	Loss: 1501.0132565498352

Epoch: 13 	Loss: 1497.1641418933868

Epoch: 14 	Loss: 1492.619068145752

Epoch: 15 	Loss: 1487.4806954860687

Epoch: 16 	Loss: 1481.792938709259

Epoch: 17 	Loss: 1475.740296125412

Epoch: 18 	Loss: 1469.582598209381

Epoch: 19 	Loss: 1463.2615571022034

Epoch: 20 	Loss: 1456.9802780151367

Epoch: 21 	Loss: 1450.9542870521545

Epoch: 22 	Loss: 1445.262200832367

Epoch: 23 	Loss: 1439.842805504799

Epoch: 24 	Loss: 1434.4847514629364

Epoch: 25 	Loss: 1429.7269642353058

Epoch: 26 	Loss: 1425.7536908388138

Epoch: 27 	Loss: 1422.5108360052109

Epoch: 28 	Loss: 1
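
The reported loss is a sum over all training pairs, so its size depends on how many pairs the corpus yields. Dividing by the pair count gives a figure that can be compared with the cross-entropy of a uniform guess over the vocabulary, which is ln(vocab_size). The sketch below uses hypothetical helper names; in the training loop, the pair count would be the value of `i` at the end of an epoch.

```python
import math

def mean_loss_per_pair(epoch_loss, num_pairs):
    """Average categorical cross-entropy per (context, target) pair."""
    return epoch_loss / num_pairs

def uniform_baseline(vocab_size):
    """Cross-entropy of a model that assigns equal probability to every word."""
    return math.log(vocab_size)

print(round(uniform_baseline(179), 3))
```

A mean loss close to this baseline suggests the model is learning little beyond word frequencies.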

In [21]:
# The first weight matrix belongs to the embedding layer; drop row 0 (the PAD vector)
weights = cbow.get_weights()[0]
weights = weights[1:]
print(weights.shape)

# Row i of weights now holds the vector for word ID i + 1. 'PAD' was appended to word2id last,
# so list(id2word.values())[1:] would shift every label by one; look the IDs up explicitly instead.
pd.DataFrame(weights, index=[id2word[i] for i in range(1, vocab_size)]).head()

(178, 100)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
to,-0.011648,-0.04793,0.008394,-0.009676,-0.009229,-0.014056,0.022342,-0.009655,0.003042,0.041171,...,-0.014002,0.019548,0.033761,-0.000545,0.033759,0.023135,0.024722,0.023964,-0.005562,-0.046044
and,0.01724,-0.01641,0.013487,0.047093,6.7e-05,0.040729,0.027776,-0.022116,-0.044522,-0.038054,...,0.023983,0.041698,0.021707,0.028402,0.011837,-0.011091,-0.024838,0.002965,0.046619,0.025107
the,-0.020606,0.03706,0.042983,-0.043511,0.022386,-0.045031,-0.040638,-0.04791,0.048303,-0.025121,...,0.012744,0.017239,0.016109,0.009494,0.041389,0.028991,-0.037136,0.037012,0.037409,0.012226
my,0.03378,0.042973,-0.038912,-0.034784,0.0345,-0.03887,-0.04517,-0.012586,0.031027,-0.042777,...,0.039541,-0.013469,0.049048,-0.023739,0.018168,-0.016173,0.000497,0.035067,0.033173,0.00774
of,-0.019298,-0.000263,-0.040538,-0.032503,0.035841,0.025789,-0.027111,-0.03573,0.034403,-0.001675,...,-0.032,0.049171,-0.029834,-0.008373,0.03781,-0.002068,-0.037655,-0.021353,0.046557,-0.042988


In [22]:
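from sklearn.metrics.pairwise import euclidean_distances

# Pairwise Euclidean distances between all word vectors
distance_matrix = euclidean_distances(weights)
print(distance_matrix.shape)

# Row r of distance_matrix belongs to word ID r + 1, hence the -1 / +1 shifts.
# argsort()[1:6] skips the closest entry (normally the word itself) and keeps the next five.
similar_words = {search_term: [id2word[idx] for idx in distance_matrix[word2id[search_term]-1].argsort()[1:6]+1]
                   for search_term in ['america']}

similar_words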

(178, 178)


{'america': ['jared', 'joy', 'remarkable', 'amazing', 'before']}
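
Euclidean distance is sensitive to vector length, so word embeddings are often compared by cosine similarity instead. The notebook itself uses Euclidean distance; the following is only a pure-Python sketch of the alternative, with hypothetical helper names and toy two-dimensional vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(query, vectors, top_n=5):
    """Return the top_n words most similar to query, excluding query itself."""
    scores = {w: cosine_similarity(vectors[query], vec)
              for w, vec in vectors.items() if w != query}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

toy = {'a': [1.0, 0.0], 'b': [0.9, 0.1], 'c': [0.0, 1.0], 'd': [-1.0, 0.0]}
print(nearest('a', toy, top_n=2))
```

With the trained model, `vectors` could be built by pairing each word with its row of `weights`; with so small a corpus, neighbours under either measure should be read with caution.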