### **Student Details**
Name: Vishal Pattar  
Roll no: 43556  
Class: BE AIML  
Subject: Deep Learning for AI  
Assignment: 5

### **Problem Statement:**

Implement the Continuous Bag of Words (CBOW) Model for word embedding. The implementation should include the following stages:
- Data preparation
- Generate training data
- Train the model
- Output word embeddings

Use a suitable text dataset such as the Brown Corpus or any other sizable text corpus.

In [13]:
# Import necessary libraries
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import skipgrams
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, Lambda, Dot, Reshape, Dense
from tensorflow.keras.optimizers import Adam
import tensorflow.keras.backend as K
from tensorflow.keras.utils import to_categorical
import random

In [1]:
# Sample dataset: For demonstration, using a small text. Replace with a larger corpus as needed.
sample_text = """
In the age of information, data is the new oil. The ability to process and analyze data
has become a crucial skill in various industries. Machine learning and artificial intelligence
are driving innovations that were once thought impossible.
"""

In [4]:
# Initialize and fit the tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts([sample_text])
word2id = tokenizer.word_index
id2word = {v: k for k, v in word2id.items()}
vocab_size = len(word2id) + 1  # +1 because Tokenizer indices start at 1; index 0 is reserved (e.g. for padding)
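To make the indexing step concrete, here is a minimal pure-Python sketch of how a frequency-ordered word index can be built. It is a simplified approximation of what `Tokenizer` does, not its actual implementation: the helper `build_word_index`, the regular expression, and the toy sentence are assumptions introduced for illustration.

```python
import re
from collections import Counter

def build_word_index(text):
    """Map each word to an integer id, most frequent word first, ids starting at 1."""
    # Lowercase and keep only runs of letters, digits, and apostrophes (a simplification).
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(words)
    # most_common() orders by descending count; ties keep first-seen order.
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}

# Hypothetical toy sentence, not the notebook's sample_text
toy_index = build_word_index("data is the new oil and data is valuable")
toy_vocab_size = len(toy_index) + 1  # index 0 is left unused, as in the cell above
print(toy_index)
```

Because ids start at 1, an embedding matrix with `len(index) + 1` rows has a spare row 0 that never corresponds to a real word.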

In [5]:
# Define window size
window_size = 2

In [6]:
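# Generate skip-gram (target, context) pairs.
# skipgrams() also adds randomly sampled negative pairs: label 1 = true pair, label 0 = negative sample.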
sequences = tokenizer.texts_to_sequences([sample_text])[0]
pairs, labels = skipgrams(sequences, vocabulary_size=vocab_size, window_size=window_size, shuffle=True)
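The cell above produces skip-gram style (target, context) pairs. For comparison, CBOW training data groups all context words in a window and uses them to predict the centre word. The sketch below shows one way such pairs could be generated. The helper `generate_cbow_pairs` and the toy id sequence are assumptions for illustration and are not part of the notebook.

```python
def generate_cbow_pairs(token_ids, window_size=2):
    """Return (context_ids, target_id) pairs; windows near the edges are shorter."""
    pairs = []
    for i, target in enumerate(token_ids):
        start = max(0, i - window_size)
        end = min(len(token_ids), i + window_size + 1)
        # Every token in the window except the centre word is context.
        context = [token_ids[j] for j in range(start, end) if j != i]
        if context:
            pairs.append((context, target))
    return pairs

# Hypothetical toy sequence of word ids
toy_sequence = [1, 2, 3, 4, 5]
cbow_pairs = generate_cbow_pairs(toy_sequence, window_size=2)
for context, target in cbow_pairs:
    print(context, "->", target)
```

With the real data, `sequences` and `window_size` from the cells above could be passed in place of the toy values. Contexts of unequal length would then need padding or averaging before being fed to a model.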

In [7]:
# Separate target and context words
target_words, context_words = zip(*pairs)
target_words = np.array(target_words, dtype='int32')
context_words = np.array(context_words, dtype='int32')
labels = np.array(labels, dtype='int32')

In [8]:
# Stage c: Train the model
embedding_dim = 50

In [9]:
# Input for target words
target_input = Input(shape=(1,), name='target_input')
# Input for context words
context_input = Input(shape=(1,), name='context_input')

In [10]:
# Embedding layer
embedding = Embedding(input_dim=vocab_size, output_dim=embedding_dim, name='embedding_layer')

In [11]:
# Embed target and context words
target_embedding = embedding(target_input)
context_embedding = embedding(context_input)

In [12]:
# Compute dot product between target and context embeddings
dot_product = Dot(axes=-1)([target_embedding, context_embedding])
dot_product = Reshape((1,))(dot_product)

In [14]:
# Define the output
output = Dense(1, activation='sigmoid')(dot_product)

In [15]:
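# Define and compile the model.
# Note: despite its name, this model scores individual (target, context) pairs,
# which is the skip-gram with negative sampling setup rather than CBOW proper.
cbow_model = Model(inputs=[target_input, context_input], outputs=output)
# `lr` is deprecated in recent Keras releases; use `learning_rate` instead.
cbow_model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['accuracy'])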
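The model above takes one target word and one context word and classifies the pair. A CBOW forward pass works differently: it averages the embeddings of all context words and predicts the centre word with a softmax over the vocabulary. The NumPy sketch below illustrates that computation only. The weight matrices `W_in` and `W_out`, the toy sizes, and the function `cbow_forward` are assumptions for illustration, not the notebook's trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10, 4  # toy sizes, not the notebook's values

# Randomly initialised weights (untrained, for illustration only)
W_in = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))   # input embeddings
W_out = rng.normal(scale=0.1, size=(embedding_dim, vocab_size))  # output projection

def cbow_forward(context_ids):
    """Return a probability distribution over the vocabulary for the centre word."""
    hidden = W_in[context_ids].mean(axis=0)  # average the context embeddings
    logits = hidden @ W_out                  # one score per vocabulary word
    logits = logits - logits.max()           # shift for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()                   # softmax

probs = cbow_forward([1, 2, 4, 5])  # hypothetical context ids around centre word 3
loss = -np.log(probs[3])            # cross-entropy for the true centre word
print(probs.shape, float(loss))
```

Training would then adjust `W_in` and `W_out` to lower this loss. The rows of `W_in` play the role of the word embeddings.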



In [16]:
# Train the model
cbow_model.fit([target_words, context_words], labels, epochs=10, batch_size=64, verbose=2)

Epoch 1/10
5/5 - 1s - loss: 0.6939 - accuracy: 0.4863 - 700ms/epoch - 140ms/step
Epoch 2/10
5/5 - 0s - loss: 0.6929 - accuracy: 0.6199 - 9ms/epoch - 2ms/step
Epoch 3/10
5/5 - 0s - loss: 0.6919 - accuracy: 0.7158 - 12ms/epoch - 2ms/step
Epoch 4/10
5/5 - 0s - loss: 0.6910 - accuracy: 0.7877 - 11ms/epoch - 2ms/step
Epoch 5/10
5/5 - 0s - loss: 0.6900 - accuracy: 0.8082 - 17ms/epoch - 3ms/step
Epoch 6/10
5/5 - 0s - loss: 0.6889 - accuracy: 0.8322 - 16ms/epoch - 3ms/step
Epoch 7/10
5/5 - 0s - loss: 0.6877 - accuracy: 0.8356 - 16ms/epoch - 3ms/step
Epoch 8/10
5/5 - 0s - loss: 0.6863 - accuracy: 0.8390 - 17ms/epoch - 3ms/step
Epoch 9/10
5/5 - 0s - loss: 0.6847 - accuracy: 0.8459 - 14ms/epoch - 3ms/step
Epoch 10/10
5/5 - 0s - loss: 0.6829 - accuracy: 0.8425 - 12ms/epoch - 2ms/step
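
A note on reading this log: the loss stays close to 0.693, which is ln 2, the binary cross-entropy of a classifier that outputs 0.5 for every pair. One possible interpretation, offered here as an assumption rather than something the notebook verifies, is that the predicted probabilities sit only slightly on the correct side of 0.5, so accuracy rises while the loss barely moves. The short check below reproduces the chance-level value. The helper `binary_cross_entropy` is written for illustration.

```python
import math

def binary_cross_entropy(y_true, p):
    """Binary cross-entropy for a single example with predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

chance_loss = binary_cross_entropy(1, 0.5)
print(round(chance_loss, 4))  # 0.6931
```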



In [17]:
# Extract the embeddings from the trained model
word_embeddings = cbow_model.get_layer('embedding_layer').get_weights()[0]

In [18]:
# Function to get the embedding of a word
def get_embedding(word):
    if word in word2id:
        return word_embeddings[word2id[word]]
    else:
        return np.zeros(embedding_dim)
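
Raw embedding vectors are hard to interpret on their own. A common way to inspect them is to compare words by cosine similarity. The sketch below shows the idea on a toy embedding matrix. The functions `cosine_similarity` and `most_similar`, and the toy vocabulary and vectors, are assumptions for illustration and are not the notebook's trained values.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def most_similar(word, embeddings, word2id, top_n=3):
    """Return the top_n (word, similarity) pairs closest to `word`."""
    query = embeddings[word2id[word]]
    scores = [(other, cosine_similarity(query, embeddings[idx]))
              for other, idx in word2id.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

# Hypothetical toy vocabulary and 2-dimensional embeddings (row 0 is the unused index)
toy_word2id = {"data": 1, "oil": 2, "skill": 3}
toy_embeddings = np.array([[0.0, 0.0], [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])

neighbors = most_similar("data", toy_embeddings, toy_word2id, top_n=2)
print(neighbors)
```

With the notebook's variables, the same call could take `word_embeddings` and `word2id`. On a corpus as small as the sample text, the resulting neighbours should not be expected to be meaningful.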

In [19]:
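# Example: Get embeddings for some words.
# Note: 'innovation' is not in the vocabulary (the text contains 'innovations'),
# so get_embedding returns a zero vector for it.
words = ['data', 'machine', 'learning', 'information', 'innovation']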
for word in words:
    emb = get_embedding(word)
    print(f'Embedding for "{word}":\n{emb}\n')

Embedding for "data":
[-0.01710158  0.0116161  -0.02464188 -0.03927892  0.05300517 -0.00230624
 -0.03641287 -0.04884042 -0.0496954   0.04452392  0.02144654  0.00954155
  0.01109463 -0.055489    0.01203961  0.05245222  0.05445194  0.04193056
  0.00500811  0.03556725 -0.03247273 -0.01749837 -0.0309183  -0.00319361
  0.0517065   0.01266043 -0.01370715  0.04745135  0.02651867 -0.00543878
 -0.00648857  0.03327759 -0.00078968  0.01683643 -0.0657592   0.02087454
 -0.01425486 -0.03813679  0.03883107 -0.01430559  0.04702204  0.05680364
  0.02391579  0.00052433 -0.02342741 -0.00980793 -0.02570198 -0.01734126
  0.00143671  0.01423475]

Embedding for "machine":
[ 0.07005596  0.04625028  0.02980367 -0.02960505 -0.04380663 -0.01746259
 -0.03321838  0.02475217 -0.02357429  0.03886066  0.07400706 -0.07872978
  0.02373296 -0.01008565  0.07244141 -0.03176982 -0.03403271  0.01258605
  0.01888227 -0.00208324  0.01559594 -0.04067697  0.0543967  -0.01956667
 -0.02364531 -0.06183666  0.05409291  0.00216286 -