**Problem statement:** 

Continuous bag of words (cbow) word2vec word embedding work is that it tends to predict the probability of a word given a context. A context may be a single word or a group of words. But for simplicity, I will take a single context word and try to predict a single target word. 

The purpose of this assignment is to be able to create a word embedding for the  given data set.  

**Data set :** w2v.txt 

In [1]:
#We will solve this using a shallow neural network

In [3]:
import numpy as np
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Embedding, Lambda
from keras.utils import np_utils
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
import gensim

In [5]:
data=open('corona.txt','r')

In [6]:
corona_data = [text for text in data if text.count(' ') >= 2]

In [7]:
corona_data

['The speed of transmission is an important point of difference between the two viruses. Influenza has a shorter median incubation period (the time from infection to appearance of symptoms) and a shorter serial interval (the time between successive cases) than COVID-19 virus. The serial interval for COVID-19 virus is estimated to be 5-6 days, while for influenza virus, the serial interval is 3 days. This means that influenza can spread faster than COVID-19. \n',
 'Further, transmission in the first 3-5 days of illness, or potentially pre-symptomatic transmission –transmission of the virus before the appearance of symptoms – is a major driver of transmission for influenza. In contrast, while we are learning that there are people who can shed COVID-19 virus 24-48 hours prior to symptom onset, at present, this does not appear to be a major driver of transmission. \n',
 'The reproductive number – the number of secondary infections generated from one infected individual – is understood to b

In [8]:
vectorize = Tokenizer()

In [9]:
vectorize.fit_on_texts(corona_data)

In [11]:
corona_data = vectorize.texts_to_sequences(corona_data)

In [13]:
total_vocab = sum(len(s) for s in corona_data)

In [15]:
corona_data

[[1,
  38,
  2,
  8,
  9,
  39,
  40,
  41,
  2,
  42,
  13,
  1,
  43,
  23,
  3,
  44,
  11,
  24,
  45,
  46,
  47,
  1,
  14,
  25,
  48,
  10,
  26,
  2,
  27,
  12,
  11,
  24,
  15,
  16,
  1,
  14,
  13,
  49,
  50,
  17,
  4,
  5,
  6,
  1,
  15,
  16,
  7,
  4,
  5,
  6,
  9,
  51,
  10,
  18,
  19,
  52,
  20,
  28,
  7,
  3,
  6,
  1,
  15,
  16,
  9,
  29,
  20,
  30,
  53,
  31,
  3,
  32,
  54,
  55,
  17,
  4,
  5],
 [56,
  8,
  33,
  1,
  57,
  29,
  19,
  20,
  2,
  58,
  59,
  60,
  61,
  62,
  8,
  63,
  2,
  1,
  6,
  64,
  1,
  26,
  2,
  27,
  21,
  9,
  11,
  34,
  35,
  2,
  8,
  7,
  3,
  33,
  65,
  28,
  66,
  22,
  67,
  31,
  68,
  22,
  69,
  70,
  32,
  71,
  4,
  5,
  6,
  72,
  73,
  74,
  75,
  10,
  76,
  77,
  78,
  79,
  30,
  80,
  81,
  82,
  10,
  18,
  11,
  34,
  35,
  2,
  8],
 [1,
  83,
  36,
  21,
  1,
  36,
  2,
  84,
  85,
  86,
  25,
  87,
  88,
  89,
  21,
  9,
  90,
  10,
  18,
  13,
  37,
  12,
  37,
  19,
  7,
  4,
  5,
  6,
  91,
  

In [16]:
word_count = len(vectorize.word_index) + 1

In [18]:
vectorize.word_index

{'the': 1,
 'of': 2,
 'influenza': 3,
 'covid': 4,
 '19': 5,
 'virus': 6,
 'for': 7,
 'transmission': 8,
 'is': 9,
 'to': 10,
 'a': 11,
 'and': 12,
 'between': 13,
 'time': 14,
 'serial': 15,
 'interval': 16,
 'than': 17,
 'be': 18,
 '5': 19,
 'days': 20,
 '–': 21,
 'are': 22,
 'viruses': 23,
 'shorter': 24,
 'from': 25,
 'appearance': 26,
 'symptoms': 27,
 'while': 28,
 '3': 29,
 'this': 30,
 'that': 31,
 'can': 32,
 'in': 33,
 'major': 34,
 'driver': 35,
 'number': 36,
 '2': 37,
 'speed': 38,
 'an': 39,
 'important': 40,
 'point': 41,
 'difference': 42,
 'two': 43,
 'has': 44,
 'median': 45,
 'incubation': 46,
 'period': 47,
 'infection': 48,
 'successive': 49,
 'cases': 50,
 'estimated': 51,
 '6': 52,
 'means': 53,
 'spread': 54,
 'faster': 55,
 'further': 56,
 'first': 57,
 'illness': 58,
 'or': 59,
 'potentially': 60,
 'pre': 61,
 'symptomatic': 62,
 '–transmission': 63,
 'before': 64,
 'contrast': 65,
 'we': 66,
 'learning': 67,
 'there': 68,
 'people': 69,
 'who': 70,
 'shed': 7

In [19]:
window_size = 2

In [20]:
def cbow_model(data, window_size, total_vocab):
    total_length = window_size*2
    for text in data:
        text_len = len(text)
        for idx, word in enumerate(text):
            context_word = []
            target   = []            
            begin = idx - window_size
            end = idx + window_size + 1
            context_word.append([text[i] for i in range(begin, end) if 0 <= i < text_len and i != idx])
            target.append(word)
            contextual = sequence.pad_sequences(context_word, total_length=total_length)
            final_target = np_utils.to_categorical(target, total_vocab)
            yield(contextual, final_target) 

In [21]:
model = Sequential()
model.add(Embedding(input_dim=total_vocab, output_dim=100, input_length=window_size*2))
model.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(100,)))
model.add(Dense(total_vocab, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
for i in range(10):
    cost = 0
    for x, y in cbow_model(data, window_size, total_vocab):
        cost += model.train_on_batch(contextual, final_target)
    print(i, cost)

0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0


2022-03-30 19:13:21.205570: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [24]:
dimensions=100
vect_file = open('vectors.txt' ,'w')
vect_file.write('{} {}\n'.format(total_vocab,dimensions))
#Next, we will access the weights of the trained model and write it to the above created file. 

weights = model.get_weights()[0]
for text, i in vectorize.word_index.items():
    final_vec = ' '.join(map(str, list(weights[i, :])))
    vect_file.write('{} {}\n'.format(text, final_vec))
vect_file.close()

In [26]:
cbow_output = gensim.models.KeyedVectors.load_word2vec_format('vectors.txt', binary=False)
#cbow_output.most_similar(positive=['virus'])

EOFError: unexpected end of input; is count incorrect or file otherwise damaged?

In [19]:
def read_txt(txt_file):
    file_lines = []
    file_words = []
    with open(txt_file, 'r', encoding="utf-8") as file:
        try:
            for line in file:
                try:
                    line = line.strip()
                    if line!="":
                        file_lines.append(line)
                        allWords = line.split(" ")
                        try:
                            for word in allWords:
                                file_words.append(word.strip())    
                        except Exception as E:
                            print ("got An exception", E)
                            pass  
                except Exception as E:
                    print ("got An exception", E)
                    pass         
        except Exception as E:
            print ("got An exception", E)
            pass  
    return {"lines":file_lines, "words":file_words}

In [22]:
file_details = read_txt("w2v.txt")

In [28]:
vocabulary = file_details['words']

In [30]:
len(vocabulary)

533

In [36]:
# Let us confirm if all the words are captured in the vocabulary

In [34]:
num_lines = np.array([len(line.split(" ")) for line in file_details['lines']])

In [35]:
num_lines.sum()

533

In [169]:
# Create a function for context-target with +1,+2,+3,+4 etc, where +1,+2,+3 ... etc are arguments dist.
def context_target(vocabulary, dist):
    Vocab_size = len(vocabulary)
    encoded_words = [one_hot(Vocab_word,Vocab_size)[0] for Vocab_word in vocabulary]
#    print(f'encoded words: {encoded_words}', len(encoded_words))
    contextTarget = {'context':[], 'target':[]}
    index = 0
    for wd in encoded_words:
        d = 0
        while d < dist:
            contextTarget['context'].append(encoded_words[index])
            if index+(d+1)<len(encoded_words):
                contextTarget['target'].append(encoded_words[index+(d+1)])
            else:
                contextTarget['target'].append(-1)
            d = d +1
        index = index+1
    return contextTarget

In [170]:
data = context_target(vocabulary, 4)

In [171]:
len(data['context'])

2132

In [172]:
 df = pd.DataFrame(data=data)

In [173]:
df.head()

Unnamed: 0,context,target
0,472,197
1,472,99
2,472,517
3,472,85
4,197,99


In [174]:
from sklearn.model_selection import train_test_split

In [175]:
X_train, X_test, y_train, y_test = train_test_split(df.context, df.target, test_size=0.33, random_state=42)

In [176]:
X_train.astype

<bound method NDFrame.astype of 1984    138
1302    517
780     140
481     472
1970     89
       ... 
1638    481
1095    239
1130    155
1294    261
860     103
Name: context, Length: 1428, dtype: int64>

In [None]:
INPUT_SHAPE = 

In [203]:
model = Sequential()
embedding_layer = Embedding(input_dim=len(vocabulary),output_dim=8,input_length=INPUT_SHAPE)
model.add(embedding_layer)
model.add(Flatten())
model.add(Dense(len(vocabulary),activation='softmax'))
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])
print(model.summary())

Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_8 (Embedding)      (None, 1, 8)              4264      
_________________________________________________________________
flatten_6 (Flatten)          (None, 8)                 0         
_________________________________________________________________
dense_6 (Dense)              (None, 533)               4797      
Total params: 9,061
Trainable params: 9,061
Non-trainable params: 0
_________________________________________________________________
None


In [197]:
## Initialising the CNN
classifier = Sequential()

In [None]:
classifier.add(Conv1D(filters=64, kernel_size=1, input_shape=INPUT_SHAPE, activation='relu'))
## MaxPooling
classifier.add(MaxPooling2D(pool_size = (2,2)))
classifier.add(Dropout(0.5))

## Add another layer
classifier.add(Conv2D(64,(3,3), activation = 'relu'))
classifier.add(MaxPooling2D(pool_size = (2,2)))
classifier.add(Dropout(0.5))

## Add another layer
classifier.add(Conv2D(64,(3,3), activation = 'relu'))
classifier.add(MaxPooling2D(pool_size = (2,2)))

In [204]:
callback_list = [tf.keras.callbacks.ModelCheckpoint(filepath="best.h5", monitor="accuracy", 
                                                            save_best_only=True)]

<IPython.core.display.Javascript object>

In [194]:
# Length of the randomly initialised weights of the vectors of the vocabulary words (We have a vocabulary length as 100)
len(embedding_layer.get_weights()[0])

533

In [195]:
# Randmly initialised weights of the vectors for the vocabulary words (We have a vocabulary length as 100)
embedding_layer.get_weights()

[array([[ 1.3433266e-02,  4.5047235e-02, -3.8966537e-06, ...,
         -3.1068588e-02, -3.7468623e-02, -4.7173679e-02],
        [-2.5510574e-02, -2.1416545e-02,  2.5614526e-02, ...,
         -8.6723566e-03,  2.3049999e-02, -3.4984946e-03],
        [-1.6835440e-02, -4.8060644e-02, -4.2632595e-03, ...,
          4.3702628e-02,  1.5077107e-03,  1.6896714e-02],
        ...,
        [-4.2838313e-02, -2.6425838e-02,  3.7297215e-02, ...,
          1.2443326e-02,  2.6815545e-02, -3.0329300e-02],
        [-2.5485767e-02,  7.3142760e-03,  4.3525901e-02, ...,
         -4.5685150e-02, -5.2221045e-03,  2.5858808e-02],
        [ 1.1596680e-02, -3.7186481e-02,  9.1599450e-03, ...,
          1.1475682e-03, -1.7056059e-02, -3.0640531e-02]], dtype=float32)]

In [205]:
history = model.fit(np.asarray(X_train).astype('float32'),np.asarray(y_train).astype('float32'),epochs=100,verbose=1, callbacks=callback_list)

Epoch 1/100


ValueError: in user code:

    /Applications/anaconda3/lib/python3.8/site-packages/keras/engine/training.py:853 train_function  *
        return step_function(self, iterator)
    /Applications/anaconda3/lib/python3.8/site-packages/keras/engine/training.py:842 step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    /Applications/anaconda3/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py:1286 run
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    /Applications/anaconda3/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py:2849 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /Applications/anaconda3/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py:3632 _call_for_each_replica
        return fn(*args, **kwargs)
    /Applications/anaconda3/lib/python3.8/site-packages/keras/engine/training.py:835 run_step  **
        outputs = model.train_step(data)
    /Applications/anaconda3/lib/python3.8/site-packages/keras/engine/training.py:788 train_step
        loss = self.compiled_loss(
    /Applications/anaconda3/lib/python3.8/site-packages/keras/engine/compile_utils.py:201 __call__
        loss_value = loss_obj(y_t, y_p, sample_weight=sw)
    /Applications/anaconda3/lib/python3.8/site-packages/keras/losses.py:141 __call__
        losses = call_fn(y_true, y_pred)
    /Applications/anaconda3/lib/python3.8/site-packages/keras/losses.py:245 call  **
        return ag_fn(y_true, y_pred, **self._fn_kwargs)
    /Applications/anaconda3/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:206 wrapper
        return target(*args, **kwargs)
    /Applications/anaconda3/lib/python3.8/site-packages/keras/losses.py:1665 categorical_crossentropy
        return backend.categorical_crossentropy(
    /Applications/anaconda3/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:206 wrapper
        return target(*args, **kwargs)
    /Applications/anaconda3/lib/python3.8/site-packages/keras/backend.py:4839 categorical_crossentropy
        target.shape.assert_is_compatible_with(output.shape)
    /Applications/anaconda3/lib/python3.8/site-packages/tensorflow/python/framework/tensor_shape.py:1161 assert_is_compatible_with
        raise ValueError("Shapes %s and %s are incompatible" % (self, other))

    ValueError: Shapes (None, 1) and (None, 533) are incompatible


In [111]:
def read_glove_data(glove_file):
    with open(glove_file, 'r', encoding="utf-8") as f:
        words = set()
        word_to_vec_map = {}
        curr_word=None
        i = 0
        try:
            for line in f:
                i+=1
                try:
                    line = line.strip().split()
                    curr_word = line[0]
                    word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)
                except Exception as E:
                    print ("got An exception, word=", curr_word, i)
                    pass         
        except Exception as E:
            print ("got An exception before for, word=", curr_word, i)
            pass         
            
    return word_to_vec_map

In [112]:
word_to_vec_map = read_glove_data('glove.6B.50d.txt')

In [5]:
len(word_to_vec_map)

161507

In [6]:
def cos_similarity(vec_u, vec_v):
    distance = 0.0   
    dot = np.inner(vec_u,vec_v)
    norm_vec_u = np.linalg.norm(vec_u)
    norm_vec_v = np.linalg.norm(vec_v)
    cos_similarity = dot/(norm_vec_u*norm_vec_v)
    
    return cos_similarity

In [7]:
fatherVec = word_to_vec_map["father"]
motherVec = word_to_vec_map["mother"]
ballVec = word_to_vec_map["ball"]
crocodileVec = word_to_vec_map["crocodile"]
franceVec = word_to_vec_map["france"]
italyVec = word_to_vec_map["italy"]
parisVec = word_to_vec_map["paris"]
romeVec = word_to_vec_map["rome"]
deliciousVec = word_to_vec_map["delicious"]
tastyVec = word_to_vec_map["tasty"]
orangeVec = word_to_vec_map["orange"]
appleVec = word_to_vec_map["apple"]
grapefruitVec = word_to_vec_map["grapefruit"]
showVector = word_to_vec_map["show"]
displayVector = word_to_vec_map["display"]
viewVector = word_to_vec_map["view"]

print("cos_similarity(father, mother) = ", cos_similarity(fatherVec, motherVec))
print("cos_similarity(ball, crocodile) = ",cos_similarity(ballVec, crocodileVec))
print("cos_similarity(france - paris, italy - rome) = ",cos_similarity(franceVec - parisVec, 
                                                                          italyVec - romeVec))
print("cos_similarity(france - paris, rome - italy) = ",cos_similarity(franceVec - parisVec, 
                                                                          romeVec - italyVec))
print ("cos_similarity(delicious, tasty) = ",cos_similarity(deliciousVec, tastyVec))
print ("cos_similarity(orange, apple) = ",cos_similarity(orangeVec, appleVec))
print ("cos_similarity(orange, grapefruit) = ",cos_similarity(orangeVec, grapefruitVec))
print ("cos_similarity(show, view) = ",cos_similarity(showVector, viewVector))
print ("cos_similarity(show, display) = ",cos_similarity(showVector, displayVector))

cos_similarity(father, mother) =  0.8909038442893618
cos_similarity(ball, crocodile) =  0.2743924626137943
cos_similarity(france - paris, italy - rome) =  0.6751479308174204
cos_similarity(france - paris, rome - italy) =  -0.6751479308174204
cos_similarity(delicious, tasty) =  0.9297150322667408
cos_similarity(orange, apple) =  0.5388040721946524
cos_similarity(orange, grapefruit) =  0.6101272122317138
cos_similarity(show, view) =  0.630372310903227
cos_similarity(show, display) =  0.6203289883495692


In [8]:
def findMissing(word_a, word_b, word_c, word_to_vec_map):
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()
    
    e_a, e_b, e_c = word_to_vec_map[word_a], word_to_vec_map[word_b], word_to_vec_map[word_c]#None
    words = word_to_vec_map.keys()
    max_cosine_sim = -1000              
    best_word = None                   
    
    for w in words:        
        if w in [word_a, word_b, word_c] :
            continue
        
        try:
            cosine_sim = cos_similarity(e_b-e_a, word_to_vec_map[w]-e_c)
        
            if (cosine_sim > max_cosine_sim):
                max_cosine_sim = cosine_sim
                best_word = w
        except ValueError as ve:
            print ("Got an exception", ve, w)
            pass
        except KeyError as ke:
            print ("this key", w, "not found")
            
    print ("Done")    
    return best_word, max_cosine_sim

word_to_vec_map["usa"]

In [9]:
word_a, word_b, word_c= 'india', 'delhi', 'france'
w = "paris"
e_a, e_b, e_c = word_to_vec_map[word_a], word_to_vec_map[word_b], word_to_vec_map[word_c]
cosine_sim = cos_similarity(e_b-e_a, word_to_vec_map[w]-e_c)
print (cosine_sim ," for paris")
w = "nanterre"
cosine_sim = cos_similarity(e_b-e_a, word_to_vec_map[w]-e_c)
print (cosine_sim ," for nanterre", type(e_a), e_a.shape)


0.6958505344885515  for paris
0.7249669057754776  for nanterre <class 'numpy.ndarray'> (50,)


In [10]:
puzzle_triads= [('italy', 'italian', 'spain'), 
                 ('father', 'mother', 'son'),
                 ('brother', 'sister', 'nephew'),
                 ('india', 'delhi', 'japan'), 
                 ('man', 'woman', 'boy'), 
                 ('small', 'smaller', 'large'),
                ('king', 'man', 'queen')
                 
                ]
for t in puzzle_triads:
    print (t, findMissing(t[0], t[1], t[2], word_to_vec_map))

Got an exception operands could not be broadcast together with shapes (37,) (50,)  motoko
Done
('italy', 'italian', 'spain') ('spanish', 0.8875303721276963)
Got an exception operands could not be broadcast together with shapes (37,) (50,)  motoko
Done
('father', 'mother', 'son') ('daughter', 0.8145878966131403)
Got an exception operands could not be broadcast together with shapes (37,) (50,)  motoko
Done
('brother', 'sister', 'nephew') ('niece', 0.7876098303940358)
Got an exception operands could not be broadcast together with shapes (37,) (50,)  motoko
Done
('india', 'delhi', 'japan') ('tokyo', 0.7444555691961849)
Got an exception operands could not be broadcast together with shapes (37,) (50,)  motoko
Done
('man', 'woman', 'boy') ('girl', 0.6695094729169301)
Got an exception operands could not be broadcast together with shapes (37,) (50,)  motoko
Done
('small', 'smaller', 'large') ('larger', 0.6831324455301026)
Got an exception operands could not be broadcast together with shapes (37

In [11]:
puzzle_triads= [('cricket', 'bat', 'football'), 
                 ('tennis', 'racquet', 'badminton')
                ]
for t in puzzle_triads:
    print (t, findMissing(t[0], t[1], t[2], word_to_vec_map))

Got an exception operands could not be broadcast together with shapes (37,) (50,)  motoko
Done
('cricket', 'bat', 'football') ('varitek', 0.8066231171522057)
Got an exception operands could not be broadcast together with shapes (37,) (50,)  motoko
Done
('tennis', 'racquet', 'badminton') ('pricking', 0.749053514756415)
