In [None]:
%matplotlib inline

### Source: [link](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#exercise-computing-word-embeddings-continuous-bag-of-words)

# Word Embeddings: Encoding Lexical Semantics

Word embeddings are dense vectors of real numbers, one per word in your
vocabulary. In NLP, it is almost always the case that your features are
words! But how should you represent a word in a computer? You could
store its ascii character representation, but that only tells you what
the word *is*, it doesn't say much about what it *means* (you might be
able to derive its part of speech from its affixes, or properties from
its capitalization, but not much). Even more, in what sense could you
combine these representations? We often want dense outputs from our
neural networks, where the inputs are $|V|$ dimensional, where
$V$ is our vocabulary, but often the outputs are only a few
dimensional (if we are only predicting a handful of labels, for
instance). How do we get from a massive dimensional space to a smaller
dimensional space?

How about instead of ascii representations, we use a one-hot encoding?
That is, we represent the word $w$ by

\begin{align}\overbrace{\left[ 0, 0, \dots, 1, \dots, 0, 0 \right]}^\text{|V| elements}\end{align}

where the 1 is in a location unique to $w$. Any other word will
have a 1 in some other location, and a 0 everywhere else.

There is an enormous drawback to this representation, besides just how
huge it is. It basically treats all words as independent entities with
no relation to each other. What we really want is some notion of
*similarity* between words. Why? Let's see an example.

Suppose we are building a language model. Suppose we have seen the
sentences

* The mathematician ran to the store.
* The physicist ran to the store.
* The mathematician solved the open problem.

in our training data. Now suppose we get a new sentence never before
seen in our training data:

* The physicist solved the open problem.

Our language model might do OK on this sentence, but wouldn't it be much
better if we could use the following two facts:

* We have seen  mathematician and physicist in the same role in a sentence. Somehow they
  have a semantic relation.
* We have seen mathematician in the same role  in this new unseen sentence
  as we are now seeing physicist.

and then infer that physicist is actually a good fit in the new unseen
sentence? This is what we mean by a notion of similarity: we mean
*semantic similarity*, not simply having similar orthographic
representations. It is a technique to combat the sparsity of linguistic
data, by connecting the dots between what we have seen and what we
haven't. This example of course relies on a fundamental linguistic
assumption: that words appearing in similar contexts are related to each
other semantically. This is called the `distributional
hypothesis <https://en.wikipedia.org/wiki/Distributional_semantics>`__.


# Getting Dense Word Embeddings

How can we solve this problem? That is, how could we actually encode
semantic similarity in words? Maybe we think up some semantic
attributes. For example, we see that both mathematicians and physicists
can run, so maybe we give these words a high score for the "is able to
run" semantic attribute. Think of some other attributes, and imagine
what you might score some common words on those attributes.

If each attribute is a dimension, then we might give each word a vector,
like this:

\begin{align}q_\text{mathematician} = \left[ \overbrace{2.3}^\text{can run},
   \overbrace{9.4}^\text{likes coffee}, \overbrace{-5.5}^\text{majored in Physics}, \dots \right]\end{align}

\begin{align}q_\text{physicist} = \left[ \overbrace{2.5}^\text{can run},
   \overbrace{9.1}^\text{likes coffee}, \overbrace{6.4}^\text{majored in Physics}, \dots \right]\end{align}

Then we can get a measure of similarity between these words by doing:

\begin{align}\text{Similarity}(\text{physicist}, \text{mathematician}) = q_\text{physicist} \cdot q_\text{mathematician}\end{align}

Although it is more common to normalize by the lengths:

\begin{align}\text{Similarity}(\text{physicist}, \text{mathematician}) = \frac{q_\text{physicist} \cdot q_\text{mathematician}}
   {\| q_\text{\physicist} \| \| q_\text{mathematician} \|} = \cos (\phi)\end{align}

Where $\phi$ is the angle between the two vectors. That way,
extremely similar words (words whose embeddings point in the same
direction) will have similarity 1. Extremely dissimilar words should
have similarity -1.


You can think of the sparse one-hot vectors from the beginning of this
section as a special case of these new vectors we have defined, where
each word basically has similarity 0, and we gave each word some unique
semantic attribute. These new vectors are *dense*, which is to say their
entries are (typically) non-zero.

But these new vectors are a big pain: you could think of thousands of
different semantic attributes that might be relevant to determining
similarity, and how on earth would you set the values of the different
attributes? Central to the idea of deep learning is that the neural
network learns representations of the features, rather than requiring
the programmer to design them herself. So why not just let the word
embeddings be parameters in our model, and then be updated during
training? This is exactly what we will do. We will have some *latent
semantic attributes* that the network can, in principle, learn. Note
that the word embeddings will probably not be interpretable. That is,
although with our hand-crafted vectors above we can see that
mathematicians and physicists are similar in that they both like coffee,
if we allow a neural network to learn the embeddings and see that both
mathematicians and physicists have a large value in the second
dimension, it is not clear what that means. They are similar in some
latent semantic dimension, but this probably has no interpretation to
us.


In summary, **word embeddings are a representation of the *semantics* of
a word, efficiently encoding semantic information that might be relevant
to the task at hand**. You can embed other things too: part of speech
tags, parse trees, anything! The idea of feature embeddings is central
to the field.


# Word Embeddings in Pytorch

Before we get to a worked example and an exercise, a few quick notes
about how to use embeddings in Pytorch and in deep learning programming
in general. Similar to how we defined a unique index for each word when
making one-hot vectors, we also need to define an index for each word
when using embeddings. These will be keys into a lookup table. That is,
embeddings are stored as a $|V| \times D$ matrix, where $D$
is the dimensionality of the embeddings, such that the word assigned
index $i$ has its embedding stored in the $i$'th row of the
matrix. In all of my code, the mapping from words to indices is a
dictionary named word\_to\_ix.

The module that allows you to use embeddings is torch.nn.Embedding,
which takes two arguments: the vocabulary size, and the dimensionality
of the embeddings.

To index into this table, you must use torch.LongTensor (since the
indices are integers, not floats).




In [6]:
# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
torch.manual_seed(1)

<torch._C.Generator at 0x79c97e742450>

In [None]:
word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)  # 2 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)
print(hello_embed)

tensor([[ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519]],
       grad_fn=<EmbeddingBackward0>)


# An Example: N-Gram Language Modeling

Recall that in an n-gram language model, given a sequence of words
$w$, we want to compute

\begin{align}P(w_i | w_{i-1}, w_{i-2}, \dots, w_{i-n+1} )\end{align}

Where $w_i$ is the ith word of the sequence.

In this example, we will compute the loss function on some training
examples and update the parameters with backpropagation.

In [None]:
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples.  Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]
# print the first 3, just so you can see what they look like
print(trigrams[:3])

vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}


class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


losses = []
loss_function = nn.NLLLoss() # Negative Log Likelihood Loss
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in trigrams:

        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in tensors)
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a tensor)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the gradient
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    print("Loss in Epoch {ep}: {l}".format(ep=epoch, l=np.round(total_loss, 2))) # The loss decreased every iteration over the training data!
    losses.append(total_loss)

[(['When', 'forty'], 'winters'), (['forty', 'winters'], 'shall'), (['winters', 'shall'], 'besiege')]
Loss in Epoch 0: 519.75
Loss in Epoch 1: 517.3
Loss in Epoch 2: 514.87
Loss in Epoch 3: 512.45
Loss in Epoch 4: 510.04
Loss in Epoch 5: 507.65
Loss in Epoch 6: 505.26
Loss in Epoch 7: 502.89
Loss in Epoch 8: 500.53
Loss in Epoch 9: 498.18


# Exercise: Computing Word Embeddings: Continuous Bag-of-Words

The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep
learning. It is a model that tries to predict words given the context of
a few words before and a few words after the target word. This is
distinct from language modeling, since CBOW is not sequential and does
not have to be probabilistic. Typcially, CBOW is used to quickly train
word embeddings, and these embeddings are used to initialize the
embeddings of some more complicated model. Usually, this is referred to
as *pretraining embeddings*. It almost always helps performance a couple
of percent.

The CBOW model is as follows. Given a target word $w_i$ and an
$N$ context window on each side, $w_{i-1}, \dots, w_{i-N}$
and $w_{i+1}, \dots, w_{i+N}$, referring to all context words
collectively as $C$, CBOW tries to minimize

\begin{align}-\log p(w_i | C) = -\log \text{Softmax}(A(\sum_{w \in C} q_w) + b)\end{align}

where $q_w$ is the embedding for word $w$.


## Exercise Layout
### 1. <u>Training CBOW Embeddings</u>
1.1) Implement a CBOW Model by completing ```class CBOW(nn.Module)``` and train it on ```raw_text```.    

1.2) Load Datasets ```tripadvisor_hotel_reviews_reduced.csv``` and ```scifi_reduced.txt```.     

1.3) Decide preprocessing steps by completing the function ```def custom_preprocess()```. Describe your decisions. Note that it's your choice to create different preprocessing functions for hotel reviews and scifi datasets or use the same preprocessing function.             

1.4) Train CBOW2 with a context width of 2 (in both directions) for the Hotel Reviews dataset.   

1.5) Train CBOW5 with a context width of 5 (in both directions) for the Hotel Reviews dataset. Are predictions made by the model sensitive towards the context size?
     
1.6) Train CBOW2 with a context width of 2 (in both directions) for the Sci-Fi story dataset.  


### 2. <u>Test your Embeddings</u>
Note - Do the following for CBOW2, and optionally for CBOW5

2.1) For the hotel reviews dataset, choose 3 nouns, 3 verbs, and 3 adjectives. Make sure that some nouns/verbs/adjectives occur frequently in the corpus and that others are rare. For each of the 9 chosen words, retrieve the 5 closest words according to your trained CBOW2 model. List them in your report and comment on the performance of your model: do the neighbours the model provides make sense? Discuss.   

2.2) Do the same for Sci-Fi dataset.   

2.3) How does the quality of the hotel review-based embeddings compare with the Sci-fi-based embeddings? Elaborate.   

2.4) Choose 2 words and retrieve their 5 closest neighbours according to hotel review-based embeddings and the Sci-fi-based embeddings. Do they have different neighbours? If yes, can you reason why?    

2.5) What are the differences between CBOW2 and CBOW5 ? Can you "describe" them?   


### Tips

1. Switch from CPU to a GPU instance after you have confirmed that your training procedure is working correctly.
2. You can always save your intermediate results (embeddings, preprocessed dataset, model, etc.) in your google drive via colab



### 1.1 Create a CBOW Model by completing ```class CBOW(nn.Module)``` and test it on ```raw_text```
Implement CBOW in Pytorch by filling in the class below. Some
tips:

* Think about which parameters you need to define.
* Make sure you know what shape each operation expects. Use .view() if you need to
  reshape.

In [None]:
CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}
data = []
for i in range(2, len(raw_text) - 2):
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))
print(data[:5])


class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, vocab_size)
        self.activation_fn = nn.LogSoftmax(dim=-1)

    def forward(self, inputs):
        # Get embeddings for context words
        embeds = self.embeddings(inputs)
        # Aggregate embeddings (mean or sum)
        embeds = torch.mean(embeds, dim=0).view(1, -1)
        # Pass through linear layer and activation
        out = self.activation_fn(self.linear1(embeds))
        return out

# create your model and train.  here are some functions to help you make
# the data ready for use by your module


def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)


make_context_vector(data[0][0], word_to_ix)  # example

[(['We', 'are', 'to', 'study'], 'about'), (['are', 'about', 'study', 'the'], 'to'), (['about', 'to', 'the', 'idea'], 'study'), (['to', 'study', 'idea', 'of'], 'the'), (['study', 'the', 'of', 'a'], 'idea')]


tensor([19, 38, 30, 43])

In [None]:
model =  CBOW(vocab_size, embedding_dim = 100)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)
device = 'cpu'
model.to(device)


for epoch in range(20):
    total_loss = 0
    for context, target in data:
        context_vector = make_context_vector(context, word_to_ix)
        context_vector = context_vector.to(device)

        # Zero the gradients
        optimizer.zero_grad()

        # Forward pass
        out = model(context_vector)

        # Compute loss
        loss = loss_function(out, torch.tensor([word_to_ix[target]], dtype=torch.long, device=device))
        total_loss += loss.item()

        # Backward pass and optimization
        loss.backward()
        optimizer.step()

    print(f"Loss in Epoch {epoch}: {total_loss:.2f}")

Loss in Epoch 0: 232.50
Loss in Epoch 1: 231.02
Loss in Epoch 2: 229.54
Loss in Epoch 3: 228.07
Loss in Epoch 4: 226.60
Loss in Epoch 5: 225.14
Loss in Epoch 6: 223.69
Loss in Epoch 7: 222.24
Loss in Epoch 8: 220.80
Loss in Epoch 9: 219.36
Loss in Epoch 10: 217.93
Loss in Epoch 11: 216.51
Loss in Epoch 12: 215.09
Loss in Epoch 13: 213.67
Loss in Epoch 14: 212.27
Loss in Epoch 15: 210.86
Loss in Epoch 16: 209.47
Loss in Epoch 17: 208.08
Loss in Epoch 18: 206.70
Loss in Epoch 19: 205.32


### 1.2 Load Datasets

In [7]:
### Load Datasets tripadvisor_hotel_reviews_reduced.csv and scifi_reduced.txt

!gdown 1foE1JuZJeu5E_4qVge9kExzhvF32teuF # For Hotel Reviews
!gdown 13IWXrTjGTrfCd9l7dScZVO8ZvMicPU75 # For Scifi-Text

Downloading...
From: https://drive.google.com/uc?id=1foE1JuZJeu5E_4qVge9kExzhvF32teuF
To: /content/tripadvisor_hotel_reviews_reduced.csv
100% 7.36M/7.36M [00:00<00:00, 213MB/s]
Downloading...
From: https://drive.google.com/uc?id=13IWXrTjGTrfCd9l7dScZVO8ZvMicPU75
To: /content/scifi_reduced.txt
100% 43.1M/43.1M [00:00<00:00, 113MB/s] 


### 1.3 Preprocess Datasets
### 🗒❓ Describe your decisions for preprocessing the datasets

In [2]:
import re
import spacy
import pandas as pd
from tqdm import tqdm
from collections import Counter
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

In [3]:
spacy.cli.download("en_core_web_md")
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])

[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_md')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [4]:
### Complete the preprocessing function and apply it to the datasets
def custom_preprocess(text, is_sci_fi=False):

    # Convert to lowercase and remove punctuation using regex
    text = re.sub(r'[^\w\s]', '', text.lower())
    doc = nlp(text)

    # Define stop words
    stop_words = nlp.Defaults.stop_words

    # Remove stop words and lemmatize
    tokens = [token.lemma_ for token in doc if token.lemma_ not in stop_words and not token.is_space]

    return tokens

In [5]:
# Read and process the hotel reviews dataset
print("Processing Hotel Reviews Dataset...")
df_reviews = pd.read_csv("tripadvisor_hotel_reviews_reduced.csv")
# Assume the reviews are in a column named 'Review'
reviews = df_reviews['Review'].tolist()

# Process each review individually
hotel_reviews_processed = []
for review in tqdm(reviews, desc="Processing TripAdvisor"):
    hotel_reviews_processed.extend(custom_preprocess(review))

print(f"Total tokens in Hotel Reviews: {len(hotel_reviews_processed)}")

Processing Hotel Reviews Dataset...


Processing TripAdvisor: 100%|██████████| 10000/10000 [01:31<00:00, 109.38it/s]

Total tokens in Hotel Reviews: 969935





In [6]:
from tqdm import tqdm

# Read and process the sci-fi story dataset
print("\nProcessing Sci-Fi Story Dataset...")
with open("scifi_reduced.txt", "r") as f:
    sci_fi_text = f.read()

# Function to split text by word count
def split_by_word_count(text, word_count=500):
    words = text.split()
    for i in range(0, len(words), word_count):
        yield " ".join(words[i:i + word_count])

print(sci_fi_text[:200])


Processing Sci-Fi Story Dataset...
 A chat with the editor  i #  science fiction magazine called IF. The title was selected after much thought because of its brevity and on the theory it is indicative of the field and will be easy to r


In [7]:
# Initialize an empty list to store processed tokens
sci_fi_story_processed = []

# Process each chunk with a progress bar
for chunk in tqdm(split_by_word_count(sci_fi_text), desc="Processing Sci-Fi Story"):
    tokens = custom_preprocess(chunk, is_sci_fi=True)
    sci_fi_story_processed.extend(tokens)

print(f"Total tokens in Sci-Fi Story: {len(sci_fi_story_processed)}")

Processing Sci-Fi Story: 15390it [08:10, 31.41it/s]

Total tokens in Sci-Fi Story: 3420668





In [8]:
# Vocabulary creation with frequency threshold
def create_vocab(tokens, min_freq):
    word_counts = Counter(tokens)
    vocab = [word for word, count in word_counts.items() if count >= min_freq]
    word_to_ix = {word: i for i, word in enumerate(vocab)}
    return vocab, word_to_ix

# Set minimum frequency thresholds
min_freq_reviews = 2
min_freq_sci_fi = 2

print("\nCreating Vocabulary for Hotel Reviews...")
vocab_reviews, word_to_ix_reviews = create_vocab(hotel_reviews_processed, min_freq_reviews)
print(f"Vocabulary size for Hotel Reviews: {len(vocab_reviews)}")

print("\nCreating Vocabulary for Sci-Fi Story...")
vocab_sci_fi, word_to_ix_sci_fi = create_vocab(sci_fi_story_processed, min_freq_sci_fi)
print(f"Vocabulary size for Sci-Fi Story: {len(vocab_sci_fi)}")


Creating Vocabulary for Hotel Reviews...
Vocabulary size for Hotel Reviews: 17475

Creating Vocabulary for Sci-Fi Story...
Vocabulary size for Sci-Fi Story: 50005


In [9]:
def create_context_target_pairs(tokens, context_size, word_to_ix):
    data = []
    for i in range(context_size, len(tokens) - context_size):
        context = [tokens[i + j] for j in range(-context_size, context_size + 1) if j != 0]
        target = tokens[i]
        # Skip if target or context words are not in the vocabulary
        if target in word_to_ix and all(word in word_to_ix for word in context):
            data.append((context, target))
    return data

# Create context-target pairs for Hotel Reviews with context sizes 2 and 5
print("\nCreating context-target pairs for Hotel Reviews...")
context_size_cbow2 = 2
data_reviews_cbow2 = create_context_target_pairs(hotel_reviews_processed, context_size_cbow2, word_to_ix_reviews)
print(f"Total pairs for CBOW2 (Hotel Reviews): {len(data_reviews_cbow2)}")

context_size_cbow5 = 5
data_reviews_cbow5 = create_context_target_pairs(hotel_reviews_processed, context_size_cbow5, word_to_ix_reviews)
print(f"Total 5-grams for CBOW5 (Hotel Reviews): {len(data_reviews_cbow5)}")

# Create context-target pairs for Sci-Fi Story with context size 2
print("\nCreating context-target pairs for Sci-Fi Story...")
context_size_sci_fi_cbow2 = 2
data_sci_fi_cbow2 = create_context_target_pairs(sci_fi_story_processed, context_size_sci_fi_cbow2, word_to_ix_sci_fi)
print(f"Total pairs for CBOW2 (Sci-Fi Story): {len(data_sci_fi_cbow2)}")


Creating context-target pairs for Hotel Reviews...
Total pairs for CBOW2 (Hotel Reviews): 834626
Total 5-grams for CBOW5 (Hotel Reviews): 703888

Creating context-target pairs for Sci-Fi Story...
Total pairs for CBOW2 (Sci-Fi Story): 3127097


In [41]:
# Define Dataset class
class CBOWDataset(Dataset):
    def __init__(self, data, word_to_ix):
        self.data = data
        self.word_to_ix = word_to_ix

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        context, target = self.data[idx]
        context_idxs = torch.tensor([self.word_to_ix[w] for w in context], dtype=torch.long)
        target_idx = torch.tensor(self.word_to_ix[target], dtype=torch.long)
        return context_idxs, target_idx

# Create DataLoaders
batch_size = 124

# Hotel Reviews CBOW2
dataset_reviews_cbow2 = CBOWDataset(data_reviews_cbow2, word_to_ix_reviews)
dataloader_reviews_cbow2 = DataLoader(dataset_reviews_cbow2, batch_size=batch_size, shuffle=True)

# Hotel Reviews CBOW5
dataset_reviews_cbow5 = CBOWDataset(data_reviews_cbow5, word_to_ix_reviews)
dataloader_reviews_cbow5 = DataLoader(dataset_reviews_cbow5, batch_size=batch_size, shuffle=True)

# Sci-Fi Story CBOW2
dataset_sci_fi_cbow2 = CBOWDataset(data_sci_fi_cbow2, word_to_ix_sci_fi)
dataloader_sci_fi_cbow2 = DataLoader(dataset_sci_fi_cbow2, batch_size=batch_size, shuffle=True)

### 1.4 Train CBOW2 with a context width of 2 (in both directions) for the Hotel Reviews dataset.

In [42]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from torch.utils.data import DataLoader, Dataset

# Define CBOW Model with Negative Sampling
class MinimalCBOWNegativeSampling(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(MinimalCBOWNegativeSampling, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)

        # Initialize embeddings
        self.embeddings.weight.data.uniform_(-0.5 / embedding_dim, 0.5 / embedding_dim)

    def forward(self, context_words, target_words, negative_words):
        # Get embeddings for context words and compute mean
        context_embeds = self.embeddings(context_words)      # [batch_size, context_size, embedding_dim]
        context_mean = torch.mean(context_embeds, dim=1)     # [batch_size, embedding_dim]

        # Positive and negative word embeddings
        pos_embeds = self.embeddings(target_words).squeeze(1)  # [batch_size, embedding_dim]
        neg_embeds = self.embeddings(negative_words)           # [batch_size, K, embedding_dim]

        # Calculate positive and negative scores
        pos_scores = torch.sum(context_mean * pos_embeds, dim=1)   # [batch_size]
        neg_scores = torch.bmm(neg_embeds, context_mean.unsqueeze(2)).squeeze(2)  # [batch_size, K]

        return pos_scores, neg_scores

# Negative Sampling Function
def sample_negatives_uniformly(vocab_size, batch_size, K):
    """Samples negative words uniformly from the vocabulary"""
    return torch.randint(0, vocab_size, (batch_size, K))

# Initialize the model, loss, and optimizer
embedding_dim = 19
vocab_size = len(word_to_ix_reviews)

model = MinimalCBOWNegativeSampling(vocab_size, embedding_dim)
optimizer = optim.SGD(model.parameters(), lr=0.01)
criterion = nn.BCEWithLogitsLoss()

# Training function
def train_minimal_cbow(model, dataloader, criterion, optimizer, vocab_size, num_epochs=50, K=5):
    model.train()
    for epoch in range(num_epochs):
        total_loss = 0
        for context_words, target_word in dataloader:
            # Move data to the appropriate device
            context_words = context_words.to(device)
            target_word = target_word.to(device).unsqueeze(1)  # [batch_size, 1]

            # Sample negative words
            negative_words = sample_negatives_uniformly(vocab_size, context_words.size(0), K).to(device)  # [batch_size, K]

            # Forward pass
            pos_scores, neg_scores = model(context_words, target_word, negative_words)

            # Define positive and negative labels
            pos_labels = torch.ones(pos_scores.size(0), device=device)      # [batch_size]
            neg_labels = torch.zeros(neg_scores.size(0) * neg_scores.size(1), device=device)  # [batch_size * K]

            # Flatten neg_scores for loss calculation
            neg_scores = neg_scores.view(-1)  # [batch_size * K]

            # Compute losses
            loss_pos = criterion(pos_scores, pos_labels)
            loss_neg = criterion(neg_scores, neg_labels)
            loss = loss_pos + loss_neg

            # Backward pass and optimization
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        avg_loss = total_loss / len(dataloader)
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}')

# Device configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

MinimalCBOWNegativeSampling(
  (embeddings): Embedding(17475, 19)
)

In [38]:
# Initialize model, criterion, and optimizer
embedding_dim = 19
vocab_size = len(word_to_ix_reviews)
num_epochs = 50

model_cbow2 = MinimalCBOWNegativeSampling(vocab_size, embedding_dim)
optimizer = optim.SGD(model_cbow2.parameters(), lr=0.01)
criterion = nn.BCEWithLogitsLoss()

# Device configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_cbow2.to(device)

# Train CBOW5 Model with context size of 5
train_minimal_cbow(model_cbow2, dataloader_reviews_cbow2, criterion, optimizer, vocab_size, num_epochs=num_epochs, K=5)

Epoch [1/50], Loss: 1.3863
Epoch [2/50], Loss: 1.3862
Epoch [3/50], Loss: 1.3862
Epoch [4/50], Loss: 1.3860
Epoch [5/50], Loss: 1.3856
Epoch [6/50], Loss: 1.3846
Epoch [7/50], Loss: 1.3821
Epoch [8/50], Loss: 1.3768
Epoch [9/50], Loss: 1.3665
Epoch [10/50], Loss: 1.3503
Epoch [11/50], Loss: 1.3295
Epoch [12/50], Loss: 1.3060
Epoch [13/50], Loss: 1.2813
Epoch [14/50], Loss: 1.2567
Epoch [15/50], Loss: 1.2327
Epoch [16/50], Loss: 1.2100
Epoch [17/50], Loss: 1.1887
Epoch [18/50], Loss: 1.1690
Epoch [19/50], Loss: 1.1510
Epoch [20/50], Loss: 1.1346
Epoch [21/50], Loss: 1.1192
Epoch [22/50], Loss: 1.1053
Epoch [23/50], Loss: 1.0923
Epoch [24/50], Loss: 1.0805
Epoch [25/50], Loss: 1.0697
Epoch [26/50], Loss: 1.0596
Epoch [27/50], Loss: 1.0503
Epoch [28/50], Loss: 1.0415
Epoch [29/50], Loss: 1.0333
Epoch [30/50], Loss: 1.0253
Epoch [31/50], Loss: 1.0181
Epoch [32/50], Loss: 1.0114
Epoch [33/50], Loss: 1.0047
Epoch [34/50], Loss: 0.9985
Epoch [35/50], Loss: 0.9923
Epoch [36/50], Loss: 0.9866
E

### 1.5 Train CBOW5 with a context width of 5 (in both directions) for the Hotel Reviews dataset.  

🗒❓ Are predictions made by the model sensitive towards the context size?

In [46]:
# Initialize model, criterion, and optimizer
embedding_dim = 19
vocab_size = len(word_to_ix_reviews)
num_epochs = 50

model_cbow5 = MinimalCBOWNegativeSampling(vocab_size, embedding_dim)
optimizer = optim.SGD(model_cbow5.parameters(), lr=0.01)
criterion = nn.BCEWithLogitsLoss()

# Device configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_cbow5.to(device)

# Train CBOW5 Model with context size of 5
train_minimal_cbow(model_cbow5, dataloader_reviews_cbow5, criterion, optimizer, vocab_size, num_epochs=num_epochs, K=5)

Epoch [1/50], Loss: 1.3863
Epoch [2/50], Loss: 1.3863
Epoch [3/50], Loss: 1.3863
Epoch [4/50], Loss: 1.3863
Epoch [5/50], Loss: 1.3862
Epoch [6/50], Loss: 1.3862
Epoch [7/50], Loss: 1.3862
Epoch [8/50], Loss: 1.3862
Epoch [9/50], Loss: 1.3861
Epoch [10/50], Loss: 1.3860
Epoch [11/50], Loss: 1.3859
Epoch [12/50], Loss: 1.3856
Epoch [13/50], Loss: 1.3853
Epoch [14/50], Loss: 1.3849
Epoch [15/50], Loss: 1.3842
Epoch [16/50], Loss: 1.3831
Epoch [17/50], Loss: 1.3816
Epoch [18/50], Loss: 1.3795
Epoch [19/50], Loss: 1.3765
Epoch [20/50], Loss: 1.3723
Epoch [21/50], Loss: 1.3668
Epoch [22/50], Loss: 1.3598
Epoch [23/50], Loss: 1.3514
Epoch [24/50], Loss: 1.3417
Epoch [25/50], Loss: 1.3310
Epoch [26/50], Loss: 1.3196
Epoch [27/50], Loss: 1.3077
Epoch [28/50], Loss: 1.2956
Epoch [29/50], Loss: 1.2833
Epoch [30/50], Loss: 1.2711
Epoch [31/50], Loss: 1.2588
Epoch [32/50], Loss: 1.2468
Epoch [33/50], Loss: 1.2348
Epoch [34/50], Loss: 1.2232
Epoch [35/50], Loss: 1.2117
Epoch [36/50], Loss: 1.2008
E

### 1.6 Train CBOW2 with a context width of 2 (in both directions) for the Sci-Fi story dataset

In [48]:
# Initialize model, criterion, and optimizer
embedding_dim = 19
vocab_size_sci_fi = len(word_to_ix_sci_fi)
num_epochs = 20

model_sci_fi_cbow2 = MinimalCBOWNegativeSampling(vocab_size_sci_fi, embedding_dim)
optimizer = optim.Adam(model_sci_fi_cbow2.parameters(), lr=0.01)
criterion = nn.BCEWithLogitsLoss()

# Device configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_sci_fi_cbow2.to(device)

# Train CBOW2 Model with context size of 2 for Sci-Fi Story
train_minimal_cbow(model_sci_fi_cbow2, dataloader_sci_fi_cbow2, criterion, optimizer, vocab_size_sci_fi, num_epochs=num_epochs, K=4)

Epoch [1/20], Loss: 0.7048
Epoch [2/20], Loss: 0.6807
Epoch [3/20], Loss: 0.6822
Epoch [4/20], Loss: 0.6813
Epoch [5/20], Loss: 0.6773
Epoch [6/20], Loss: 0.6736
Epoch [7/20], Loss: 0.6766
Epoch [8/20], Loss: 0.6793
Epoch [9/20], Loss: 0.6787
Epoch [10/20], Loss: 0.6753
Epoch [11/20], Loss: 0.6730
Epoch [12/20], Loss: 0.6772
Epoch [13/20], Loss: 0.6785
Epoch [14/20], Loss: 0.6783
Epoch [15/20], Loss: 0.6744
Epoch [16/20], Loss: 0.6733
Epoch [17/20], Loss: 0.6768
Epoch [18/20], Loss: 0.6789
Epoch [19/20], Loss: 0.6778
Epoch [20/20], Loss: 0.6741


### 2.1 For the hotel reviews dataset, choose 3 nouns, 3 verbs, and 3 adjectives. (CBOW2 and optionally for CBOW5)
Make sure that some nouns/verbs/adjectives occur frequently in the corpus and that others are rare. For each of the 9 chosen words, retrieve the 5 closest words according to your trained CBOW2 model.    

🗒❓ List them in your report (at the end of this notebook) and comment on the performance of your model: do the neighbours the model provides make sense? Discuss.   


In [40]:
import torch.nn.functional as F

# Function to get the closest words using cosine similarity
def get_closest_words(word, word_to_ix, embeddings, top_n=5):
    # Get embedding for the target word
    word_idx = word_to_ix[word]
    word_embedding = embeddings(torch.tensor([word_idx], device=device))

    # Compute cosine similarities for all words in a vectorized manner
    all_embeddings = embeddings.weight  # [vocab_size, embedding_dim]
    cos_similarities = F.cosine_similarity(word_embedding, all_embeddings).cpu().detach().numpy()

    # Get top_n closest words (excluding the target word itself)
    closest_word_indices = cos_similarities.argsort()[-top_n-1:][::-1][1:]  # Sort and exclude the word itself
    closest_words = [list(word_to_ix.keys())[i] for i in closest_word_indices]

    return closest_words

# Words to test
nouns = ['hotel', 'room', 'service']
verbs = ['stay', 'enjoy', 'recommend']
adjectives = ['clean', 'comfortable', 'friendly']

# Get embeddings from the model
embeddings = model_cbow2.embeddings  # Access the embedding layer of your MinimalCBOWNegativeSampling model

# Find and print closest words
for category, words in [('Nouns', nouns), ('Verbs', verbs), ('Adjectives', adjectives)]:
    print(f"\n{category}:")
    for word in words:
        closest = get_closest_words(word, word_to_ix_reviews, embeddings)
        print(f"{word}: {closest}")


Nouns:
hotel: ['great', 'good', 'stay', 'lot', 'time']
room: ['small', 'clean', 'bathroom', 'floor', 'problem']
service: ['like', 'desk', 'bit', 'think', 'people']

Verbs:
stay: ['night', 'hotel', 'time', 'place', 'lot']
enjoy: ['visit', 'time', 'love', 'recommend', 'open']
recommend: ['stay', 'time', 'place', 'resort', 'end']

Adjectives:
clean: ['comfortable', 'room', 'bed', 'bathroom', 'large']
comfortable: ['clean', 'bed', 'large', 'room', 'spacious']
friendly: ['staff', 'helpful', 'time', 'extremely', 'people']


In [47]:
# Get embeddings from the CBOW5 model
embeddings = model_cbow5.embeddings  # Access the embedding layer

# Find and print closest words
for category, words in [('Nouns', nouns), ('Verbs', verbs), ('Adjectives', adjectives)]:
    print(f"\n{category}:")
    for word in words:
        closest = get_closest_words(word, word_to_ix_reviews, embeddings)
        print(f"{word}: {closest}")


Nouns:
hotel: ['great', 'stay', 'location', 'service', 'good']
room: ['clean', 'night', 'nice', 'hotel', 'time']
service: ['hotel', 'good', 'location', 'stay', 'great']

Verbs:
stay: ['hotel', 'night', 'great', 'service', 'place']
enjoy: ['great', 'like', 'hotel', 'location', 'area']
recommend: ['excellent', 'book', 'good', 'location', 'stay']

Adjectives:
clean: ['room', 'nice', 'bed', 'view', 'staff']
comfortable: ['clean', 'nice', 'room', 'little', 'book']
friendly: ['staff', 'nice', 'view', 'wonderful', 'area']


### 2.2 Repeat 2.1 for SciFi Dataset

🗒❓ List your findings for SciFi Dataset as well, similarly to 2.1

In [49]:
# Words to test
nouns_sci_fi = ['robot', 'time', 'technology']
verbs_sci_fi = ['explore', 'discover', 'communicate']
adjectives_sci_fi = ['strange', 'advanced', 'intelligent']

# Get embeddings from the Sci-Fi model
embeddings_sci_fi = model_sci_fi_cbow2.embeddings  # Access the embedding layer of your Sci-Fi model

# Find and print closest words for the Sci-Fi dataset
for category, words in [('Nouns', nouns_sci_fi), ('Verbs', verbs_sci_fi), ('Adjectives', adjectives_sci_fi)]:
    print(f"\n{category}:")
    for word in words:
        if word in word_to_ix_sci_fi:  # Check if the word is in the vocabulary
            closest = get_closest_words(word, word_to_ix_sci_fi, embeddings_sci_fi)
            print(f"{word}: {closest}")
        else:
            print(f"{word} is not in the vocabulary.")


Nouns:
robot: ['crew', 'officer', 'build', 'bar', 'seat']
time: ['soon', 'minute', 'long', 'guess', 'expect']
technology: ['silkie', 'actual', 'theirs', 'highly', 'prediction']

Verbs:
explore: ['newly', 'original', 'library', 'future', 'quarter']
discover: ['remain', 'apparently', 'bear', 'build', 'describe']
communicate: ['remote', 'awareness', 'unfortunately', 'encounter', 'activity']

Adjectives:
strange: ['existence', 'race', 'alien', 'deep', 'tiny']
advanced: ['psychology', 'socalle', 'technique', 'teacher', 'efficiency']
intelligent: ['ancestor', 'highly', 'nature', 'generally', 'specie']


### 2.3 🗒❓ How does the quality of the hotel review-based embeddings compare with the Sci-fi-based embeddings? Elaborate.

### 2.4 Choose 2 words and retrieve their 5 closest neighbours according to hotel review-based embeddings and the Sci-fi-based embeddings.

🗒❓ Do they have different neighbours? If yes, can you reason why?

In [51]:

words_to_compare = ['room', 'star']

# Hotel review-based embeddings
for word in words_to_compare:
    closest_words = get_closest_words(word, word_to_ix_reviews, embeddings) #LATER CHANGE TO ***model1***
    print(f"Closest words to '{word}' in Hotel Reviews: {closest_words}")

# Sci-fi-based embeddings
for word in words_to_compare:
    closest = get_closest_words(word, word_to_ix_sci_fi, embeddings_sci_fi) # LATER CHANGE TO ***model3*** # Use word_to_ix2 for Sci-fi
    print(f"Closest words to '{word}' in Sci-Fi: {closest}")


Closest words to 'room' in Hotel Reviews: ['clean', 'night', 'nice', 'hotel', 'time']
Closest words to 'star' in Hotel Reviews: ['right', '10', 'night', 'hotel', 'good']
Closest words to 'room' in Sci-Fi: ['lock', 'door', 'table', 'walk', 'open']
Closest words to 'star' in Sci-Fi: ['sun', 'galaxy', 'space', 'middle', 'planet']


### 2.5 🗒❓ What are the differences between CBOW2 and CBOW5 ? Can you "describe" them?    

### Report
The lab report should contain a detailed description of the approaches you have used to solve this exercise. Please also include results.

### Report
The lab report should contain a detailed description of the approaches you have used to solve this exercise. Please also include results.

Answers for the questions marked 🗒❓ goes here as well



1.3 🗒❓ Describe your decisions for preprocessing the datasets

Ans: Our main goal in preprocessing was to reduce the data size as much as possible without affecting the overall context of the data.
For this, we implemented the following things;

  1. Removed common stopwords which occur in the English language. (ex: is, was, the, etc)

  2. Removed all the numbers.

  3. Removed all the punctuation marks.

  4. Lowercased the entire data.

  5. Lemmatization of the data (ex: “running” to “run”).

By following these steps, we ensured consistency across the tokened data and also managed to reduce the size of the data significantly.






1.5 🗒❓ Are predictions made by the model sensitive towards the context size?

Ans: Yes.


2.1 🗒❓ List them in your report (at the end of this notebook) and comment on the performance of your model: do the neighbours the model provides make sense? Discuss.

Ans:

Nouns:
hotel: ['great', 'good', 'stay', 'lot', 'time']

room: ['small', 'clean', 'bathroom', 'floor', 'problem']

service: ['like', 'desk', 'bit', 'think', 'people']

Verbs:
stay: ['night', 'hotel', 'time', 'place', 'lot']

enjoy: ['visit', 'time', 'love', 'recommend', 'open']

recommend: ['stay', 'time', 'place', 'resort', 'end']

Adjectives:
clean: ['comfortable', 'room', 'bed', 'bathroom', 'large']

comfortable: ['clean', 'bed', 'large', 'room', 'spacious']

friendly: ['staff', 'helpful', 'time', 'extremely', 'people']

In my opinion, the model did a pretty decent job, and the neighbours do make sense. The CBOW2 model captures the more localized context of the word and gives out specific word similarities, which may sometimes not be the first words that come into our mind, but in a local context, they make sense.


2.2 🗒❓ Repeat 2.1 for SciFi Dataset, List your findings for SciFi Dataset as well, similarly to 2.1

Ans:

Nouns:
robot: ['crew', 'officer', 'build', 'bar', 'seat']

time: ['soon', 'minute', 'long', 'guess', 'expect']

technology: ['silkie', 'actual', 'theirs', 'highly', 'prediction']

Verbs:
explore: ['newly', 'original', 'library', 'future', 'quarter']

discover: ['remain', 'apparently', 'bear', 'build', 'describe']

communicate: ['remote', 'awareness', 'unfortunately', 'encounter', 'activity']

Adjectives:
strange: ['existence', 'race', 'alien', 'deep', 'tiny']

advanced: ['psychology', 'socalle', 'technique', 'teacher', 'efficiency']

intelligent: ['ancestor', 'highly', 'nature', 'generally', 'specie']

The results seem a bit random, but if we think of a specific context, most of them make sense. There are some typos and errors, this might be an effect of lemmatization and unpunctuating the data and only noticeable in Sci_Fi because of the sheer size of the raw data.

2.3 🗒❓ How does the quality of the hotel review-based embeddings compare with the Sci-fi-based embeddings? Elaborate.

Ans:
 It is clear that the embeddings of hotel_reviews is better compared to Sci-fi-based embeddings In my opinion, this is because of the context. For hotel_reviews, all the data are related to reviews of hotels, so the model learns everything that connects to the hotels. But, in contrast, for Sci_fi, the data is much more diverse and unconnected to each other, which makes it really hard for the model to come up with patterns.



2.4 🗒❓ Do they have different neighbours? If yes, can you reason why?

Ans:

Closest words to 'room' in Hotel Reviews: ['clean', 'night', 'nice', 'hotel', 'time']

Closest words to 'star' in Hotel Reviews: ['right', '10', 'night', 'hotel', 'good']

Closest words to 'room' in Sci-Fi: ['lock', 'door', 'table', 'walk', 'open']

Closest words to 'star' in Sci-Fi: ['sun', 'galaxy', 'space', 'middle', 'planet']

The concern which I raised in the previous response can be seen here, for the word 'room' in Sci_fi, while 4 of the predictions make sense, it also says 'walk', it can be interpreted as an entirely unrelated context the model learnt in some random science fiction story. While, in hotel_reviews, the model only learn's about hotels.



2.5 🗒❓ What are the differences between CBOW2 and CBOW5 ? Can you "describe" them?

 Ans: The CBOW2 has a context window of 2 while CBOW5 has a context window of 5. This allows CBOW5 to capture a more generalised view of the word, while CBOW2 captures the view of things which are more closely associated with the word.

 This can be clearly observed in the results we got for the word 'hotel' in CBOW2 and CBOW5;


 CBOW2 hotel: ['great', 'good', 'stay', 'lot', 'time']

 CBOW5 hotel: ['great', 'stay', 'location', 'service', 'good']

 The CBOW5 captures words like 'location' and 'service', which are general for that word.


