## Name: Vaibhav Bichave

## Implement the Continuous Bag of Words (CBOW) Model for the given (textual document) using the below steps:
    a. Data preparation
    b. Generate training data
    c. Train model
    d. Output

In [42]:
data ="""The speed of transmission is an important point of difference between the two viruses.
Influenza has a shorter median incubation period (the time from infection to appearance of symptoms)
and a shorter serial interval (the time between successive cases) than COVID-19 virus.
The serial interval for COVID-19 virus is estimated to be 5-6 days, while for influenza virus, the serial interval is 3 days.
This means that influenza can spread faster than COVID-19.

Further, transmission in the first 3-5 days of illness, or potentially pre-symptomatic transmission –
transmission of the virus before the appearance of symptoms – is a major driver of transmission for influenza.
In contrast, while we are learning that there are people who can shed COVID-19 virus 24-48 hours prior to symptom onset,
at present, this does not appear to be a major driver of transmission.

The reproductive number – the number of secondary infections generated from one infected individual –
is understood to be between 2 and 2.5 for COVID-19 virus, higher than for influenza.
However, estimates for both COVID-19 and influenza viruses are very context and time-specific, making direct comparisons more difficult."""

data = data.split()

In [43]:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(data)

word2id = tokenizer.word_index
word2id['PAD'] = 0

id2word = {v:k for k,v in word2id.items()}
wids = tokenizer.texts_to_sequences(data)

emb_size = 100
window_size = 2
vocab_size = len(word2id)

In [44]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical


In [45]:
def cbow_model(corpus,vocab_size, window_size):
    context_length = window_size*2
    for words in corpus:
        sequences_size = len(words)
        for index,word in enumerate(words):
            context_word = []
            label_word = []
            start = index - window_size
            end = index + window_size + 1
            context_word.append([words[i]
                               for i in range(start,end)
                               if 0<=i <sequences_size
                               and i!=index])
            label_word.append(word)
            
            x = pad_sequences(context_word,context_length)
            y = to_categorical(label_word,vocab_size)
            yield(x,y)
            
 

In [46]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Embedding,Lambda
import keras.backend as K

In [47]:
cbow = Sequential([
    Embedding(vocab_size,emb_size,input_length = window_size*2),
    Lambda(lambda x:K.mean(x,axis=1)),
    Dense(vocab_size,activation = 'softmax')
])

cbow.compile(loss='categorical_crossentropy', optimizer='adam')
cbow.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 4, 100)            10200     
                                                                 
 lambda_3 (Lambda)           (None, 100)               0         
                                                                 
 dense_3 (Dense)             (None, 102)               10302     
                                                                 
Total params: 20502 (80.09 KB)
Trainable params: 20502 (80.09 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [48]:
for epochs in range(5):
    loss  = 0
    for x,y in cbow_model(corpus=wids,vocab_size = vocab_size,window_size=window_size):
        loss += cbow.train_on_batch(x,y)
    print("Epochs", epochs,"loss",loss)

Epochs 0 loss 921.4534687995911
Epochs 1 loss 895.6222193241119
Epochs 2 loss 877.0970907211304
Epochs 3 loss 865.9605226516724
Epochs 4 loss 858.2670519351959


In [49]:
import pandas as pd
weights = cbow.get_weights()[0][:]
# pd.DataFrame(weights,index=word2id.keys())

In [50]:
### from sklearn.metrics.pairwise import euclidean_distances
from sklearn.metrics.pairwise import euclidean_distances

distance_matrix = euclidean_distances(weights)
data = pd.DataFrame(distance_matrix,index=word2id.keys())
data.columns = word2id.keys()

data

Unnamed: 0,the,of,transmission,influenza,covid,19,virus,for,is,to,...,both,very,context,specific,making,direct,comparisons,more,difficult,PAD
the,0.000000,0.836190,0.815897,0.819417,0.855906,1.739186,1.913896,0.798697,0.782381,0.849092,...,0.856259,0.858682,0.809385,0.823802,1.018837,0.836049,0.831780,0.763594,0.801849,0.850343
of,0.836190,0.000000,0.425278,0.417070,0.429650,1.943304,2.048321,0.425651,0.415264,0.438144,...,0.453813,0.458324,0.446367,0.436162,0.753317,0.409675,0.414993,0.391752,0.425988,0.404708
transmission,0.815897,0.425278,0.000000,0.372726,0.432474,1.933517,2.078881,0.380973,0.414016,0.373835,...,0.394796,0.410210,0.403324,0.401245,0.714469,0.378531,0.376383,0.412540,0.386907,0.402417
influenza,0.819417,0.417070,0.372726,0.000000,0.411932,1.961233,2.053875,0.413034,0.417330,0.426481,...,0.405347,0.406183,0.362416,0.398759,0.715411,0.428862,0.355175,0.386643,0.409708,0.393297
covid,0.855906,0.429650,0.432474,0.411932,0.000000,1.966440,2.093399,0.419323,0.416520,0.426667,...,0.397206,0.410064,0.435209,0.412773,0.759146,0.391632,0.391423,0.431699,0.415460,0.390985
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
direct,0.836049,0.409675,0.378531,0.428862,0.391632,1.959030,2.039443,0.394785,0.353234,0.383320,...,0.450703,0.398894,0.404622,0.406233,0.763927,0.000000,0.372617,0.406112,0.375440,0.394559
comparisons,0.831780,0.414993,0.376383,0.355175,0.391423,1.962095,2.043729,0.385919,0.404116,0.417475,...,0.379439,0.406734,0.404755,0.413229,0.696204,0.372617,0.000000,0.376598,0.362357,0.404672
more,0.763594,0.391752,0.412540,0.386643,0.431699,1.876350,2.071642,0.382778,0.390106,0.404527,...,0.412858,0.411502,0.398859,0.446070,0.752353,0.406112,0.376598,0.000000,0.387478,0.396127
difficult,0.801849,0.425988,0.386907,0.409708,0.415460,1.939102,2.017124,0.420686,0.371082,0.347910,...,0.410086,0.395445,0.390825,0.369579,0.732837,0.375440,0.362357,0.387478,0.000000,0.384834


In [51]:
def SearchWord(WordList):
    similar_words ={}
    for search_term in WordList:
        if(search_term in word2id.keys()):
            similar_words[search_term]=[id2word[idx] for idx in 
                                        distance_matrix[word2id[search_term]-1].argsort()[0:5]+1] 
    return similar_words



In [52]:
SearchWord(['19'])

{'19': ['19', 'the', 'successive', 'more', 'estimates']}