<center><h1>Assignment 5</h1></center>

## Problem Statement
Implement the Continuous Bag of Words (CBOW) Model. Stages can be:
1. Data preparation
2. Generate training data
3. Train model
4. Output

## Notebook Details
1. Author : Varad Girish Mashalkar
2. Branch : Information Technology
3. Division : BE 11
4. Batch : Q11
5. Roll Number : 43335
6. Course : Laboratory Practice 4 (Deep Learning)

## Implementation Details
1. Python version : 3.7.0
2. TensorFlow version : 2.7.0 (compatible with CUDA 11.5 and cuDNN 8.6.0)

## Imports
1. numpy
2. tensorflow
3. matplotlib
4. gensim

## Dataset link
https://github.com/bhoomikamadhukar/NLP/blob/master/corona.txt

# 1. Importing required libraries

In [1]:
import numpy as np
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Embedding, Lambda
from keras.utils import np_utils
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
import gensim

# 2. Loading the dataset

In [2]:
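# Loading the dataset from the text file (one string per line);
# the with-block closes the file once the lines are read
with open("./corona.txt", "r") as data_file:
    raw_data = data_file.readlines()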

In [3]:
# Keep only lines with at least two spaces, i.e. three or more words
corona_data = [text for text in raw_data if text.count(' ') >= 2]
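
The filter above keeps a line only if it contains at least two spaces, which in practice means three or more words. Here is a minimal sketch of that condition on made-up lines (the sample strings are hypothetical and are not taken from `corona.txt`):

```python
# Hypothetical sample lines; the real input comes from corona.txt.
sample_lines = [
    "Coronavirus spreads quickly\n",
    "Incubation period\n",
    "\n",
    "The virus sheds before symptoms appear\n",
]

# Same condition as the notebook cell: keep lines with at least two spaces.
kept = [line for line in sample_lines if line.count(" ") >= 2]

for line in kept:
    print(repr(line))
```

Short lines and blank lines are dropped because they cannot provide a useful context window for a target word.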

# 3. Vectorizing the data using Tokenizer

In [4]:
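vectorize = Tokenizer()
vectorize.fit_on_texts(corona_data)
# Convert each line into a list of integer word ids (ids start at 1)
corona_data = vectorize.texts_to_sequences(corona_data)
# Note: despite its name, total_vocab is the total number of tokens in the
# corpus, not the number of distinct words. It is used below as the size of
# the embedding and output layers, which works because it exceeds every word id.
total_vocab = sum(len(s) for s in corona_data)
# Number of distinct words plus 1, because id 0 is reserved for padding
word_count = len(vectorize.word_index) + 1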

In [5]:
print(total_vocab)
print(word_count)

198
103
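
The two numbers printed above are the total number of tokens in the corpus (198) and the number of distinct words plus one (103). The extra one is there because the Keras `Tokenizer` numbers words from 1 and leaves id 0 free for padding. The sketch below mimics that numbering with a simplified stand-in; `build_word_index` is a hypothetical helper that only lowercases and splits on whitespace, whereas the real `Tokenizer` also strips punctuation.

```python
from collections import Counter

def build_word_index(texts):
    """Simplified stand-in for Tokenizer.fit_on_texts: lowercase, split on
    whitespace, and number words from 1 in order of decreasing frequency."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

texts = ["the virus spreads", "the virus mutates fast"]  # hypothetical corpus
word_index = build_word_index(texts)
sequences = [[word_index[w] for w in t.lower().split()] for t in texts]

total_tokens = sum(len(s) for s in sequences)  # analogous to the 198 above
vocab_size = len(word_index) + 1               # analogous to the 103 above
print(word_index, total_tokens, vocab_size)
```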


# 4. Setting the window size

In [6]:
window_size = 2
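
A window size of 2 means each target word is predicted from up to two words on its left and two on its right. The sketch below shows which context each position receives; `context_windows` and the sample sentence are hypothetical and work on words rather than integer ids, purely for readability.

```python
def context_windows(tokens, window_size):
    """Yield (context, target) pairs using up to window_size tokens per side."""
    for idx, target in enumerate(tokens):
        begin = max(0, idx - window_size)
        end = min(len(tokens), idx + window_size + 1)
        context = [tokens[i] for i in range(begin, end) if i != idx]
        yield context, target

tokens = ["the", "virus", "spreads", "before", "symptoms"]  # hypothetical sentence
pairs = list(context_windows(tokens, window_size=2))
for context, target in pairs:
    print(target, "<-", context)
```

Words near the start or end of a line get fewer than four context words, which is why the contexts are padded to a fixed length later on.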

# 5. Defining the CBOW model architecture

In [7]:
# Defining a generator that yields (context, target) training pairs
def cbow_model(data, window_size, total_vocab):
    total_length = window_size*2
    for text in data:
        text_len = len(text)
        for idx, word in enumerate(text):
            context_word = []
            target = []
            begin = idx - window_size
            end = idx + window_size + 1
            # Collect up to window_size word ids on each side of the target
            context_word.append([
                text[i]
                for i in range(begin, end)
                if 0 <= i < text_len
                and i != idx
            ])
            target.append(word)
            # Left-pad short contexts with zeros to a fixed length
            contextual = sequence.pad_sequences(
                context_word,
                maxlen=total_length
            )
            # One-hot encode the target word id
            final_target = np_utils.to_categorical(
                target,
                total_vocab
            )
            yield (contextual, final_target)
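
The generator above relies on two Keras helpers: `pad_sequences`, which by default pads on the left with zeros, and `to_categorical`, which one-hot encodes the target id. The following pure-Python sketch illustrates what those two steps produce; `pad_left` and `one_hot` are hypothetical stand-ins, not the Keras implementations, and the ids are invented.

```python
def pad_left(ids, maxlen, value=0):
    """Left-pad (or left-truncate) a list of ids to exactly maxlen entries,
    mirroring the default 'pre' behaviour of Keras pad_sequences."""
    ids = ids[-maxlen:]
    return [value] * (maxlen - len(ids)) + ids

def one_hot(index, num_classes):
    """Return a list with 1.0 at position index and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

context = [7, 3]  # hypothetical ids for a word at the start of a line
padded = pad_left(context, maxlen=4)
encoded = one_hot(2, num_classes=5)
print(padded)   # [0, 0, 7, 3]
print(encoded)  # [0.0, 0.0, 1.0, 0.0, 0.0]
```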

In [8]:
# Defining the model architecture
model = Sequential()
model.add(
    Embedding(
        input_dim=total_vocab, 
        output_dim=100, 
        input_length=window_size*2
    )
)
model.add(
    Lambda(
        lambda x: K.mean(x, axis=1), 
        output_shape=(100,)
    )
)
model.add(
    Dense(
        total_vocab, 
        activation='softmax'
    )
)


In [9]:
# Checking model summary
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 4, 100)            19800     
                                                                 
 lambda (Lambda)             (None, 100)               0         
                                                                 
 dense (Dense)               (None, 198)               19998     
                                                                 
Total params: 39,798
Trainable params: 39,798
Non-trainable params: 0
_________________________________________________________________
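
The parameter counts in the summary can be checked by hand: the embedding layer stores one 100-dimensional vector per index, and the dense layer has a 100 x 198 weight matrix plus 198 biases. A short arithmetic check, using the sizes shown in the summary above:

```python
vocab = 198      # total_vocab, used as both input_dim and output size
embed_dim = 100  # output_dim of the Embedding layer

embedding_params = vocab * embed_dim      # one vector per index
dense_params = embed_dim * vocab + vocab  # weight matrix plus bias
total_params = embedding_params + dense_params

print(embedding_params, dense_params, total_params)  # 19800 19998 39798
```

The `Lambda` layer only averages the four context embeddings, so it contributes no trainable parameters.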


# 6. Compiling model

In [10]:
model.compile(
    loss='categorical_crossentropy', 
    optimizer='adam'
)
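
The `categorical_crossentropy` loss compares the softmax output with the one-hot target. For a one-hot target it reduces to the negative log of the probability assigned to the correct word. The sketch below illustrates that computation in plain Python; the logits are hypothetical and the functions are illustrative, not the Keras implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probs):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs) if t > 0)

probs = softmax([2.0, 1.0, 0.1])  # hypothetical logits for 3 classes
loss = categorical_crossentropy([1.0, 0.0, 0.0], probs)
print(probs, loss)
```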

# 7. Training the CBOW model

In [11]:
for i in range(10):
    # Accumulate the loss over every (context, target) pair in the epoch
    cost = 0
    for x, y in cbow_model(corona_data, window_size, total_vocab):
        cost += model.train_on_batch(x, y)
    print("Epoch ", i,"\t: ", cost)

Epoch  0 	:  1041.8004126548767
Epoch  1 	:  992.3414831161499
Epoch  2 	:  903.7286412715912
Epoch  3 	:  826.9192087650299
Epoch  4 	:  774.075676202774
Epoch  5 	:  723.7201819419861
Epoch  6 	:  670.4681974649429
Epoch  7 	:  615.2118247747421
Epoch  8 	:  559.9114753007889
Epoch  9 	:  506.38645058870316
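
Each value printed above is a sum of per-sample losses, since every batch holds a single (context, target) pair. Dividing by the number of pairs per epoch gives a mean loss that is easier to interpret. The sketch below assumes 198 pairs per epoch, one per token, matching the token count printed earlier, and compares the result with the loss of a uniform guess over the 198 output classes.

```python
import math

tokens_per_epoch = 198  # assumed: one training pair per token in the corpus
epoch_costs = [1041.8004126548767, 506.38645058870316]  # epochs 0 and 9 above

mean_losses = [cost / tokens_per_epoch for cost in epoch_costs]
uniform_loss = math.log(tokens_per_epoch)  # loss of a uniform guess over 198 classes

print(mean_losses, uniform_loss)
```

Under that assumption the first epoch's mean loss sits just below the uniform-guess baseline, and by the last epoch it has fallen to roughly half of it.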


# 8. Testing the model

### Writing vectors to file

In [20]:
dimensions = 100
vect_file = open('./vectors.txt', 'w')
# Header line of the word2vec text format: "<number of words> <dimensions>"
vect_file.write('{} {}\n'.format(len(vectorize.word_index), dimensions))

8

In [21]:
# The first weight matrix is the embedding table (one row per word id)
weights = model.get_weights()[0]
for text, i in vectorize.word_index.items():
    final_vec = ' '.join(map(str, list(weights[i, :])))
    vect_file.write('{} {}\n'.format(text, final_vec))
vect_file.close()
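
The file written above follows the word2vec text format: a header line with the word count and the vector dimension, then one line per word holding the word and its space-separated values. Here is a minimal round-trip sketch of that layout using an in-memory buffer; the two helper functions and the toy vectors are hypothetical and are not part of gensim.

```python
import io

def write_word2vec_text(handle, vectors):
    """Write {word: [floats]} in word2vec text format: a 'count dims' header,
    then one 'word v1 v2 ...' line per word."""
    dims = len(next(iter(vectors.values())))
    handle.write("{} {}\n".format(len(vectors), dims))
    for word, vec in vectors.items():
        handle.write("{} {}\n".format(word, " ".join(map(str, vec))))

def read_word2vec_text(handle):
    """Parse the same format back into a dict, checking it against the header."""
    count, dims = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        word, *values = line.split()
        vectors[word] = [float(v) for v in values]
        assert len(vectors[word]) == dims
    assert len(vectors) == count
    return vectors

toy = {"virus": [0.1, -0.2, 0.3], "covid": [0.0, 0.5, -0.5]}  # hypothetical vectors
buffer = io.StringIO()
write_word2vec_text(buffer, toy)
buffer.seek(0)
restored = read_word2vec_text(buffer)
print(restored)
```

The header count has to match the number of word lines, which is why the padding row at index 0 of the embedding table is not written out.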

In [22]:
cbow_output = gensim.models.KeyedVectors.load_word2vec_format(
    'vectors.txt', 
    binary=False
)

In [23]:
cbow_output.most_similar(positive=['virus'])

[('interval', 0.7260262370109558),
 ('both', 0.7044325470924377),
 ('19', 0.6949691772460938),
 ('than', 0.6716287732124329),
 ('higher', 0.6118828058242798),
 ('before', 0.5684723258018494),
 ('shed', 0.5653639435768127),
 ('covid', 0.5485020279884338),
 ('for', 0.5267449021339417),
 ('speed', 0.4975537359714508)]
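
The scores returned by `most_similar` are cosine similarities between the query word's vector and every other vector. The sketch below reproduces that ranking in plain Python on invented two-dimensional vectors; `cosine_similarity` and `rank_neighbours` are hypothetical helpers, not the gensim implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_neighbours(word, vectors, topn=3):
    """Rank all other words by cosine similarity to the given word."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 2-dimensional vectors, chosen only to make the ranking obvious.
vectors = {"virus": [1.0, 0.0], "covid": [0.9, 0.1], "speed": [0.0, 1.0]}
ranked = rank_neighbours("virus", vectors)
print(ranked)
```

With a corpus of only 198 tokens and 10 epochs of training, the neighbours listed in the notebook output should be read as a demonstration of the pipeline rather than as meaningful semantic relations.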

<center><h1>End of Notebook</h1></center>