<a href="https://colab.research.google.com/github/bosnbos/GA2025_01_SL/blob/main/exercises/ex2/Exercise_2_Word_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%matplotlib inline

In [2]:
# run this cell only if you're working with Google Colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Source: [link](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#exercise-computing-word-embeddings-continuous-bag-of-words)

# Word Embeddings: Encoding Lexical Semantics

Word embeddings are dense vectors of real numbers, one per word in your
vocabulary. In NLP, it is almost always the case that your features are
words! But how should you represent a word in a computer? You could
store its ASCII character representation, but that only tells you what
the word *is*; it doesn't say much about what it *means* (you might be
able to derive its part of speech from its affixes, or properties from
its capitalization, but not much). Even more, in what sense could you
combine these representations? We often want dense outputs from our
neural networks: the inputs are $|V|$-dimensional, where
$V$ is our vocabulary, but the outputs often have only a few
dimensions (if we are only predicting a handful of labels, for
instance). How do we get from a massive dimensional space to a smaller
dimensional space?

How about instead of ASCII representations, we use a one-hot encoding?
That is, we represent the word $w$ by

\begin{align}\overbrace{\left[ 0, 0, \dots, 1, \dots, 0, 0 \right]}^\text{|V| elements}\end{align}

where the 1 is in a location unique to $w$. Any other word will
have a 1 in some other location, and a 0 everywhere else.

There is an enormous drawback to this representation, besides just how
huge it is. It basically treats all words as independent entities with
no relation to each other. What we really want is some notion of
*similarity* between words. Why? Let's see an example.

Suppose we are building a language model. Suppose we have seen the
sentences

* The mathematician ran to the store.
* The physicist ran to the store.
* The mathematician solved the open problem.

in our training data. Now suppose we get a new sentence never before
seen in our training data:

* The physicist solved the open problem.

Our language model might do OK on this sentence, but wouldn't it be much
better if we could use the following two facts:

* We have seen mathematician and physicist in the same role in a sentence. Somehow they
  have a semantic relation.
* We have seen mathematician in the same role in this new unseen sentence
  as we are now seeing physicist.

and then infer that physicist is actually a good fit in the new unseen
sentence? This is what we mean by a notion of similarity: we mean
*semantic similarity*, not simply having similar orthographic
representations. It is a technique to combat the sparsity of linguistic
data, by connecting the dots between what we have seen and what we
haven't. This example of course relies on a fundamental linguistic
assumption: that words appearing in similar contexts are related to each
other semantically. This is called the
[distributional hypothesis](https://en.wikipedia.org/wiki/Distributional_semantics).


# Getting Dense Word Embeddings

How can we solve this problem? That is, how could we actually encode
semantic similarity in words? Maybe we think up some semantic
attributes. For example, we see that both mathematicians and physicists
can run, so maybe we give these words a high score for the "is able to
run" semantic attribute. Think of some other attributes, and imagine
what you might score some common words on those attributes.

If each attribute is a dimension, then we might give each word a vector,
like this:

\begin{align}q_\text{mathematician} = \left[ \overbrace{2.3}^\text{can run},
   \overbrace{9.4}^\text{likes coffee}, \overbrace{-5.5}^\text{majored in Physics}, \dots \right]\end{align}

\begin{align}q_\text{physicist} = \left[ \overbrace{2.5}^\text{can run},
   \overbrace{9.1}^\text{likes coffee}, \overbrace{6.4}^\text{majored in Physics}, \dots \right]\end{align}

Then we can get a measure of similarity between these words by doing:

\begin{align}\text{Similarity}(\text{physicist}, \text{mathematician}) = q_\text{physicist} \cdot q_\text{mathematician}\end{align}

Although it is more common to normalize by the lengths:

\begin{align}\text{Similarity}(\text{physicist}, \text{mathematician}) = \frac{q_\text{physicist} \cdot q_\text{mathematician}}
   {\| q_\text{physicist} \| \| q_\text{mathematician} \|} = \cos (\phi)\end{align}

Where $\phi$ is the angle between the two vectors. That way,
extremely similar words (words whose embeddings point in the same
direction) will have similarity 1. Extremely dissimilar words should
have similarity -1.


You can think of the sparse one-hot vectors from the beginning of this
section as a special case of these new vectors we have defined, where
each word basically has similarity 0, and we gave each word some unique
semantic attribute. These new vectors are *dense*, which is to say their
entries are (typically) non-zero.
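The two similarity cases above can be checked numerically. The sketch below is a minimal illustration in plain Python: it uses only the three attribute values shown for each word (the elided dimensions are ignored), and the 4-word one-hot vocabulary is a made-up assumption.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Only the three attribute values shown above; the elided dimensions are ignored.
q_mathematician = [2.3, 9.4, -5.5]
q_physicist = [2.5, 9.1, 6.4]

# One-hot vectors for a hypothetical 4-word vocabulary.
one_hot_mathematician = [1, 0, 0, 0]
one_hot_physicist = [0, 1, 0, 0]

sim_dense = cosine_similarity(q_physicist, q_mathematician)
sim_one_hot = cosine_similarity(one_hot_physicist, one_hot_mathematician)
print(f"dense:   {sim_dense:.3f}")
print(f"one-hot: {sim_one_hot:.3f}")
```

The dense vectors agree on "can run" and "likes coffee" but disagree on "majored in Physics", so their similarity is positive but well below 1. Two distinct one-hot vectors never share a non-zero entry, so their similarity is always 0.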

But these new vectors are a big pain: you could think of thousands of
different semantic attributes that might be relevant to determining
similarity, and how on earth would you set the values of the different
attributes? Central to the idea of deep learning is that the neural
network learns representations of the features, rather than requiring
the programmer to design them herself. So why not just let the word
embeddings be parameters in our model, and then be updated during
training? This is exactly what we will do. We will have some *latent
semantic attributes* that the network can, in principle, learn. Note
that the word embeddings will probably not be interpretable. That is,
although with our hand-crafted vectors above we can see that
mathematicians and physicists are similar in that they both like coffee,
if we allow a neural network to learn the embeddings and see that both
mathematicians and physicists have a large value in the second
dimension, it is not clear what that means. They are similar in some
latent semantic dimension, but this probably has no interpretation to
us.


In summary, **word embeddings are a representation of the *semantics* of
a word, efficiently encoding semantic information that might be relevant
to the task at hand**. You can embed other things too: part of speech
tags, parse trees, anything! The idea of feature embeddings is central
to the field.


# Word Embeddings in PyTorch

Before we get to a worked example and an exercise, a few quick notes
about how to use embeddings in PyTorch and in deep learning programming
in general. Similar to how we defined a unique index for each word when
making one-hot vectors, we also need to define an index for each word
when using embeddings. These will be keys into a lookup table. That is,
embeddings are stored as a $|V| \times D$ matrix, where $D$
is the dimensionality of the embeddings, such that the word assigned
index $i$ has its embedding stored in the $i$'th row of the
matrix. In all of my code, the mapping from words to indices is a
dictionary named word\_to\_ix.

The module that allows you to use embeddings is torch.nn.Embedding,
which takes two arguments: the vocabulary size, and the dimensionality
of the embeddings.

To index into this table, you must use torch.LongTensor (since the
indices are integers, not floats).
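To make the lookup-table picture concrete, the following sketch checks that an embedding lookup returns a row of the $|V| \times D$ weight matrix, and that this matches multiplying a one-hot vector by the same matrix. The vocabulary size, dimensionality, and index are toy values chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, embedding_dim = 4, 3   # toy sizes chosen for illustration
embeds = nn.Embedding(vocab_size, embedding_dim)

idx = torch.tensor([2], dtype=torch.long)
looked_up = embeds(idx)                      # shape (1, embedding_dim)

# The same vector is simply row 2 of the |V| x D weight matrix.
row = embeds.weight[2]

# Equivalently, a one-hot row vector times the weight matrix selects that row.
one_hot = F.one_hot(idx, num_classes=vocab_size).float()
via_matmul = one_hot @ embeds.weight         # shape (1, embedding_dim)

print(torch.equal(looked_up[0], row))
print(torch.allclose(looked_up, via_matmul))
```

The lookup avoids ever materialising the one-hot vector, which matters once the vocabulary has tens of thousands of entries.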




In [3]:
# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
torch.manual_seed(1)

<torch._C.Generator at 0x7964d00f8f50>

In [4]:
word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)  # 2 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)
print(hello_embed)

tensor([[ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519]],
       grad_fn=<EmbeddingBackward0>)


# An Example: N-Gram Language Modeling

Recall that in an n-gram language model, given a sequence of words
$w$, we want to compute

\begin{align}P(w_i | w_{i-1}, w_{i-2}, \dots, w_{i-n+1} )\end{align}

Where $w_i$ is the ith word of the sequence.

In this example, we will compute the loss function on some training
examples and update the parameters with backpropagation.
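The conditioning context in the formula above is simply the few words that precede the target. As a minimal sketch, the helper below (`make_ngrams` is a hypothetical name, not part of the original tutorial) builds such (context, target) pairs for any context size.

```python
def make_ngrams(tokens, context_size):
    """Return (context, target) pairs where the context is the
    `context_size` tokens immediately preceding the target."""
    return [
        (tokens[i - context_size:i], tokens[i])
        for i in range(context_size, len(tokens))
    ]

tokens = "When forty winters shall besiege thy brow".split()
for context, target in make_ngrams(tokens, context_size=2)[:3]:
    print(context, "->", target)
```

With `context_size=2` this produces trigram-style examples: two context words followed by the word to predict.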

In [5]:
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples.  Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]
# print the first 3, just so you can see what they look like
print(trigrams[:3])

vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}


class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


losses = []
loss_function = nn.NLLLoss() # Negative Log Likelihood Loss
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in trigrams:

        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in tensors)
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a tensor)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    print("Loss in Epoch {ep}: {l}".format(ep=epoch, l=np.round(total_loss, 2))) # The loss decreased every iteration over the training data!
    losses.append(total_loss)

[(['When', 'forty'], 'winters'), (['forty', 'winters'], 'shall'), (['winters', 'shall'], 'besiege')]
Loss in Epoch 0: 518.43
Loss in Epoch 1: 515.79
Loss in Epoch 2: 513.16
Loss in Epoch 3: 510.56
Loss in Epoch 4: 507.98
Loss in Epoch 5: 505.42
Loss in Epoch 6: 502.87
Loss in Epoch 7: 500.33
Loss in Epoch 8: 497.8
Loss in Epoch 9: 495.29


# Exercise: Computing Word Embeddings: Continuous Bag-of-Words

The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep
learning. It is a model that tries to predict words given the context of
a few words before and a few words after the target word. This is
distinct from language modeling, since CBOW is not sequential and does
not have to be probabilistic. Typically, CBOW is used to quickly train
word embeddings, and these embeddings are used to initialize the
embeddings of some more complicated model. Usually, this is referred to
as *pretraining embeddings*. It almost always improves performance by a
couple of percent.

The CBOW model is as follows. Given a target word $w_i$ and a
context window of $N$ words on each side, $w_{i-1}, \dots, w_{i-N}$
and $w_{i+1}, \dots, w_{i+N}$, referring to all context words
collectively as $C$, CBOW tries to minimize

\begin{align}-\log p(w_i | C) = -\log \text{Softmax}(A(\sum_{w \in C} q_w) + b)\end{align}

where $q_w$ is the embedding for word $w$.
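The objective can be traced by hand on a tiny example. The sketch below evaluates it in plain Python for a made-up 3-word vocabulary with 2-dimensional embeddings; the values of `q`, `A`, and `b` are arbitrary assumptions, and the function name `cbow_neg_log_prob` is hypothetical.

```python
import math

def cbow_neg_log_prob(context, target, q, A, b):
    """-log p(target | context) for a CBOW model with embeddings q,
    output weights A (one row per vocabulary word) and bias b."""
    dim = len(next(iter(q.values())))
    # Sum the context embeddings: sum_{w in C} q_w
    summed = [sum(q[w][d] for w in context) for d in range(dim)]
    # Affine map to one score per vocabulary word: A * summed + b
    scores = [sum(a_d * s_d for a_d, s_d in zip(row, summed)) + b_k
              for row, b_k in zip(A, b)]
    # Numerically stable log-softmax over the vocabulary
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    log_probs = [s - log_z for s in scores]
    return -log_probs[target], log_probs

# Toy 3-word vocabulary with 2-dimensional embeddings (values are made up).
vocab = ["we", "study", "processes"]
q = {"we": [0.1, 0.3], "study": [0.4, -0.2], "processes": [-0.5, 0.2]}
A = [[0.2, -0.1], [0.0, 0.5], [0.3, 0.3]]   # shape |V| x D
b = [0.0, 0.1, -0.1]

loss, log_probs = cbow_neg_log_prob(["we", "processes"], vocab.index("study"), q, A, b)
print(f"-log p(study | C) = {loss:.4f}")
```

Because the context embeddings are summed, the order of the context words does not affect the result, which is the "bag-of-words" part of the name.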


## Exercise Layout
### 1. <u>Training CBOW Embeddings</u>
1.1) Implement a CBOW Model by completing ```class CBOW(nn.Module)``` and train it on ```raw_text```.    

1.2) Load Datasets ```tripadvisor_hotel_reviews_reduced.csv``` and ```scifi_reduced.txt```.     

1.3) Decide preprocessing steps by completing the function ```def custom_preprocess()```. Describe your decisions. Note that it's your choice to create different preprocessing functions for hotel reviews and scifi datasets or use the same preprocessing function.             

1.4) Train CBOW2 with a context width of 2 (in both directions) for the Hotel Reviews dataset.   

1.5) Train CBOW5 with a context width of 5 (in both directions) for the Hotel Reviews dataset. Are predictions made by the model sensitive towards the context size?
     
1.6) Train CBOW2 with a context width of 2 (in both directions) for the Sci-Fi story dataset.  


### 2. <u>Test your Embeddings</u>
Note - Do the following for CBOW2, and optionally for CBOW5

2.1) For the hotel reviews dataset, choose 3 nouns, 3 verbs, and 3 adjectives. Make sure that some nouns/verbs/adjectives occur frequently in the corpus and that others are rare. For each of the 9 chosen words, retrieve the 5 closest words according to your trained CBOW2 model. List them in your report and comment on the performance of your model: do the neighbours the model provides make sense? Discuss.   

2.2) Do the same for Sci-Fi dataset.   

2.3) How does the quality of the hotel review-based embeddings compare with the Sci-fi-based embeddings? Elaborate.   

2.4) Choose 2 words and retrieve their 5 closest neighbours according to hotel review-based embeddings and the Sci-fi-based embeddings. Do they have different neighbours? If yes, can you reason why?    

2.5) What are the differences between CBOW2 and CBOW5 ? Can you "describe" them?   

2.6) Load the pretrained embedding model with the given code snippet and retrieve the 5 closest neighbours using the embeddings from this pretrained model for your selection of words. Compare the hereby retrieved neighbours with the ones you retrieved above. Are there any similarities? How do they differ? Can you give judgment about the embedding quality of this pretrained model?


### Tips

1. Switch from CPU to a GPU instance after you have confirmed that your training procedure is working correctly.
2. You can always save your intermediate results (embeddings, preprocessed dataset, model, etc.) to your Google Drive via Colab.



### 1.1 Create a CBOW Model by completing ```class CBOW(nn.Module)``` and test it on ```raw_text```
Implement CBOW in PyTorch by filling in the class below. Some
tips:

* Think about which parameters you need to define.
* Make sure you know what shape each operation expects. Use .view() if you need to
  reshape.

In [6]:
import torch
import torch.nn as nn
import random


def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)

CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

# Create vocab
word_to_ix = {word: i for i, word in enumerate(vocab)}
ix_to_word = {ix: word for word, ix in word_to_ix.items()}

data = []
for i in range(2, len(raw_text) - 2):
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))
print(data[:5])


class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, 128)
        self.activation_function1 = nn.ReLU()
        self.linear2 = nn.Linear(128, vocab_size)
        self.activation_function2 = nn.LogSoftmax(dim=-1)

    def forward(self, inputs):
        # inputs shape: (batch_size, num_context_words), e.g. 4 for a window of 2
        embeds = self.embeddings(inputs)  # (batch_size, num_context_words, embedding_dim)
        embeds = embeds.sum(dim=1)        # (batch_size, embedding_dim)
        out = self.linear1(embeds)
        out = self.activation_function1(out)
        out = self.linear2(out)
        out = self.activation_function2(out)
        return out

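    def get_word_embedding(self, word, word_to_ix=word_to_ix):
        # Pass the vocabulary mapping the model was trained with; the default is
        # the toy `raw_text` mapping. The index is created on the model's device.
        idx = torch.tensor([word_to_ix[word]], dtype=torch.long,
                           device=self.embeddings.weight.device)
        return self.embeddings(idx).view(1, -1)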



make_context_vector(data[0][0], word_to_ix)  # example

[(['We', 'are', 'to', 'study'], 'about'), (['are', 'about', 'study', 'the'], 'to'), (['about', 'to', 'the', 'idea'], 'study'), (['to', 'study', 'idea', 'of'], 'the'), (['study', 'the', 'of', 'a'], 'idea')]


tensor([48,  2, 26, 14])

In [7]:
### here are some functions to help you make the data ready for use by your model

# Generator that shuffles `data` in place and yields batches (default size 8)
def generate_batches(data, batch_size=8):
    random.shuffle(data)
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        contexts, targets = zip(*batch)
        yield list(contexts), list(targets)

# Function to get context vectors with batched data
def make_context_vector2(context_list, word_to_ix):
    idxs = [[word_to_ix[w] for w in context] for context in context_list]
    return torch.tensor(idxs, dtype=torch.long)

# Function to get label (target word) indices for batched data
def make_labels_idx(labels, word_to_ix):
    idxs = [word_to_ix[label] for label in labels]
    return torch.tensor(idxs, dtype=torch.long)


In [8]:
### create your model and train
def train(model, data, vocab, device, word_to_ix=word_to_ix, NUM_EPOCHS=15):
    model.to(device)
    loss_function = nn.NLLLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

    for epoch in range(NUM_EPOCHS):
        total_loss = 0
        for context_list, label_list in generate_batches(data, batch_size=8):
            context_tensor = make_context_vector2(context_list, word_to_ix).to(device)
            label_tensor = make_labels_idx(label_list, word_to_ix).to(device)

            log_probs = model(context_tensor)

            loss = loss_function(log_probs, label_tensor)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        print(f"Epoch {epoch+1}/{NUM_EPOCHS} - Loss: {total_loss:.4f}")



In [9]:
# train the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = CBOW(vocab_size=len(vocab), embedding_dim=100)
train(model, data, vocab=vocab, device=device)

Epoch 1/15 - Loss: 32.2150
Epoch 2/15 - Loss: 31.8902
Epoch 3/15 - Loss: 32.2386
Epoch 4/15 - Loss: 31.7501
Epoch 5/15 - Loss: 31.5077
Epoch 6/15 - Loss: 31.5806
Epoch 7/15 - Loss: 31.3683
Epoch 8/15 - Loss: 30.8148
Epoch 9/15 - Loss: 30.6883
Epoch 10/15 - Loss: 31.1063
Epoch 11/15 - Loss: 30.6920
Epoch 12/15 - Loss: 30.6715
Epoch 13/15 - Loss: 30.2580
Epoch 14/15 - Loss: 29.7497
Epoch 15/15 - Loss: 29.7032


In [10]:
test_context = ['People', 'the', 'to', 'direct']
context_tensor = make_context_vector2([test_context], word_to_ix).to(device)
output = model(context_tensor)
predicted_idx = torch.argmax(output, dim=1).item()
print("\nContext:", test_context)
print("Predicted target word:", ix_to_word[predicted_idx])


Context: ['People', 'the', 'to', 'direct']
Predicted target word: to


### 1.2 Load Datasets

In [9]:
import pandas as pd

In [10]:
# Download Datasets tripadvisor_hotel_reviews_reduced.csv and scifi_reduced.txt
!mkdir "content"
!gdown "https://drive.google.com/uc?id=1foE1JuZJeu5E_4qVge9kExzhvF32teuF" -O "content/tripadvisor_hotel_reviews_reduced.csv" # For Hotel Reviews
!gdown "https://drive.google.com/uc?id=13IWXrTjGTrfCd9l7dScZVO8ZvMicPU75" -O "content/scifi_reduced.txt"  # For Scifi-Text

Downloading...
From: https://drive.google.com/uc?id=1foE1JuZJeu5E_4qVge9kExzhvF32teuF
To: /content/content/tripadvisor_hotel_reviews_reduced.csv
100% 7.36M/7.36M [00:00<00:00, 37.4MB/s]
Downloading...
From: https://drive.google.com/uc?id=13IWXrTjGTrfCd9l7dScZVO8ZvMicPU75
To: /content/content/scifi_reduced.txt
100% 43.1M/43.1M [00:00<00:00, 94.1MB/s]


In [11]:
### TODO: Load the datasets
with open("/content/content/tripadvisor_hotel_reviews_reduced.csv") as f:
    raw_text_hotel = f.read().splitlines()
with open("/content/content/scifi_reduced.txt") as f:
    raw_text_scifi = f.read().splitlines()

In [14]:
raw_text_hotel[:2]

['Review,Rating',
 '"fantastic service large hotel caters business corporates, serve provided better wife experienced- nothing short world.the room upgraded superior room overlooking harbour marina large window 50 feet length, anniversary bottle champagne sent chocolates compliments management, expensive did not regret moment choice hotel, highly recommended exclusive hotel break pamper,  ",5']

In [16]:
# remove the ratings
hotel_reviews = []
for item in raw_text_hotel[1:]: # Skip the header row
    parts = item.rsplit(',', 1)
    if len(parts) == 2:
        review_text = parts[0].strip('"') # Remove potential quotes around the review
        hotel_reviews.append(review_text)

print(hotel_reviews[:5])

['fantastic service large hotel caters business corporates, serve provided better wife experienced- nothing short world.the room upgraded superior room overlooking harbour marina large window 50 feet length, anniversary bottle champagne sent chocolates compliments management, expensive did not regret moment choice hotel, highly recommended exclusive hotel break pamper,  ', 'great hotel modern hotel good location, located just 2 minutes metro sation stop airport bus.very clean equiped rooms, good soundproofing ask overlooking central courtyard hotel main road, bottled water available free room mini bar, breakfast superb want, 10 euros cold buffet 14 euros hot food,  ', '3 star plus glasgowjust got 30th november 4 day visit great city.and good value hotel pleasant expected 3 star spotlessly clean great service staff pleasant helpful great buffet breakfast suit food 15 min walk centre lots interesting shops restaurants route streets 20 euros airport taxi overall great stay recommend,  ', 
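Splitting each line on its last comma works for this file, but it relies on every review fitting on a single line. As an alternative sketch, Python's standard `csv` module handles the quoting and embedded commas itself. The two-row sample below is made up and only mimics the `Review,Rating` layout; `read_reviews` is a hypothetical helper name.

```python
import csv
import io

# A two-row sample in the same "Review,Rating" layout; the text is made up.
sample_csv = (
    'Review,Rating\n'
    '"great hotel, clean rooms, friendly staff",5\n'
    '"noisy room, would not return",2\n'
)

def read_reviews(file_obj):
    """Return the Review column, letting csv handle quoting and embedded commas."""
    reader = csv.DictReader(file_obj)
    return [row["Review"] for row in reader]

reviews = read_reviews(io.StringIO(sample_csv))
print(reviews)
```

For the real file, the same function could be called as `read_reviews(f)` inside `with open(path, newline="") as f:`. Since `pandas` is already imported above, `pd.read_csv(path)["Review"]` would be another option.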

### 1.3 Preprocess Datasets
### 🗒❓ Describe your decisions for preprocessing the datasets

In [12]:
### Import libraries for preprocessing
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
# PorterStemmer ships with NLTK itself and needs no separate download
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
import string
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [17]:
### Complete the preprocessing function and apply it to the datasets
def custom_preprocess(text_list):
    """
    Applies standard preprocessing steps to a list of text documents, including lemmatization.

    Args:
        text_list (list): A list of strings (documents).

    Returns:
        list: A list of lists, where each inner list contains the preprocessed tokens for a document.
    """
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))
    preprocessed_text = []

    # Build the punctuation-stripping table once rather than once per token
    punctuation_table = str.maketrans('', '', string.punctuation)

    for text in text_list:
        # Lowercasing
        text = text.lower()
        # Tokenization
        tokens = word_tokenize(text)
        # Removing punctuation and stop words, and lemmatization
        processed_tokens = []
        for token in tokens:
            # Remove punctuation
            token = token.translate(punctuation_table)
            if token and token not in stop_words:
                # Lemmatization
                lemmatized_token = lemmatizer.lemmatize(token)
                processed_tokens.append(lemmatized_token)
        preprocessed_text.append(processed_tokens)

    return preprocessed_text

In [18]:
hotel_reviews_preprocessed = custom_preprocess(hotel_reviews)
scifi_preprocessed = custom_preprocess(raw_text_scifi)

In [19]:
hotel_reviews_preprocessed[:1]

[['fantastic',
  'service',
  'large',
  'hotel',
  'caters',
  'business',
  'corporates',
  'serve',
  'provided',
  'better',
  'wife',
  'experienced',
  'nothing',
  'short',
  'worldthe',
  'room',
  'upgraded',
  'superior',
  'room',
  'overlooking',
  'harbour',
  'marina',
  'large',
  'window',
  '50',
  'foot',
  'length',
  'anniversary',
  'bottle',
  'champagne',
  'sent',
  'chocolate',
  'compliment',
  'management',
  'expensive',
  'regret',
  'moment',
  'choice',
  'hotel',
  'highly',
  'recommended',
  'exclusive',
  'hotel',
  'break',
  'pamper']]

In [24]:
scifi_preprocessed[:1]

[['chat',
  'editor',
  'science',
  'fiction',
  'magazine',
  'called',
  'title',
  'selected',
  'much',
  'thought',
  'brevity',
  'theory',
  'indicative',
  'field',
  'easy',
  'remember',
  'tentative',
  'title',
  'morning',
  'could',
  'nt',
  'remember',
  'cup',
  'coffee',
  'summarily',
  'discarded',
  'great',
  'deal',
  'thought',
  'effort',
  'lias',
  'gone',
  'formation',
  'magazine',
  'aid',
  'several',
  'talented',
  'generous',
  'people',
  'grateful',
  'much',
  'due',
  'warmhearted',
  'assistance',
  'bulk',
  'formative',
  'work',
  'done',
  'try',
  'maintain',
  'one',
  'finest',
  'book',
  'market',
  'great',
  'public',
  'demand',
  'magazine',
  'short',
  'buy',
  'honesty',
  'say',
  'publish',
  'time',
  'best',
  'science',
  'fiction',
  'field',
  'would',
  'true',
  'access',
  'best',
  'story',
  'get',
  'fair',
  'share',
  'work',
  'best',
  'writer',
  'definitely',
  'talk',
  'adult',
  'juvenile',
  'relative',
  '

In [13]:
### Function to get vocab
# Input will be a list of lists
# Output - vocab set
def get_vocab(raw_llist):
    vocab = set()
    for l in raw_llist:
        for w in l:
            vocab.add(w)
    return vocab

### Function to get word-to-ix dictionary
def get_word2ix(vocab):
    word_to_ix = {word: i for i, word in enumerate(vocab)}
    ix_to_word = {i: word for word, i in word_to_ix.items()}
    return word_to_ix, ix_to_word

In [14]:
### Function to generate tuples of context-target
# Input = list of lists
# Output = list of tuples (list of context_words, target)
def get_data(raw_llist, context_window=5):
    data = []
    for doc in raw_llist:
        for i in range(context_window, len(doc) - context_window):
            context = doc[i - context_window:i] + doc[i + 1:i + context_window + 1]
            target = doc[i]
            data.append((context, target))
    return data
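It helps to see what this windowing rule does to a short document before training on the full datasets. The sketch below applies the same rule to a single made-up token list; `context_target_pairs` is a hypothetical stand-in for the inner loop of `get_data`.

```python
def context_target_pairs(doc, context_window):
    """Same windowing rule as get_data, applied to a single token list."""
    pairs = []
    for i in range(context_window, len(doc) - context_window):
        context = doc[i - context_window:i] + doc[i + 1:i + context_window + 1]
        pairs.append((context, doc[i]))
    return pairs

doc = ["great", "hotel", "clean", "room", "friendly", "staff"]

pairs = context_target_pairs(doc, context_window=2)
for context, target in pairs:
    print(context, "->", target)

# A 6-token document is shorter than 2 * 5 + 1 tokens, so a window of 5 yields nothing.
print(len(context_target_pairs(doc, context_window=5)))
```

Two consequences follow from this rule: the first and last `context_window` tokens of each document never become targets, and documents shorter than `2 * context_window + 1` tokens contribute no training examples at all. Both effects are stronger for a window of 5 than for a window of 2, which is worth keeping in mind when comparing the two models.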

### 1.4 Train CBOW2 with a context width of 2 (in both directions) for the Hotel Reviews dataset.

In [27]:
# Get vocabulary, word_to_ix, and ix_to_word for hotel reviews
hotel_vocab = get_vocab(hotel_reviews_preprocessed)
hotel_word_to_ix, hotel_ix_to_word = get_word2ix(hotel_vocab)

# Generate data for CBOW2 with context window of 2
hotel_data_cbow2 = get_data(hotel_reviews_preprocessed, context_window=2)

# Define embedding dimension
EMBEDDING_DIM = 100 # You can adjust this value

# Create and train the CBOW model for hotel reviews with context window 2
model_hotel_cbow2 = CBOW(vocab_size=len(hotel_vocab), embedding_dim=EMBEDDING_DIM)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("Training CBOW2 model for Hotel Reviews dataset...")
train(model_hotel_cbow2, hotel_data_cbow2, vocab=hotel_vocab, device=device, word_to_ix=hotel_word_to_ix)

Training CBOW2 model for Hotel Reviews dataset...
Epoch 1/15 - Loss: 1078831.0679
Epoch 2/15 - Loss: 959396.0303
Epoch 3/15 - Loss: 932204.5639
Epoch 4/15 - Loss: 917881.2076
Epoch 5/15 - Loss: 908400.1707
Epoch 6/15 - Loss: 901334.5401
Epoch 7/15 - Loss: 895690.6413
Epoch 8/15 - Loss: 890951.9237
Epoch 9/15 - Loss: 886799.5770
Epoch 10/15 - Loss: 883119.4745
Epoch 11/15 - Loss: 879748.8030
Epoch 12/15 - Loss: 876633.6908
Epoch 13/15 - Loss: 873729.8752
Epoch 14/15 - Loss: 870992.2059
Epoch 15/15 - Loss: 868398.7461


In [28]:
# Define the path to save the model
model_save_path = '/content/drive/My Drive/cbow2_hotel_reviews_model.pth'

# Save the model's state dictionary
torch.save(model_hotel_cbow2.state_dict(), model_save_path)

print(f"Model saved to: {model_save_path}")

Model saved to: /content/drive/My Drive/cbow2_hotel_reviews_model.pth


### 1.5 Train CBOW5 with a context width of 5 (in both directions) for the Hotel Reviews dataset.  

🗒❓ Are predictions made by the model sensitive towards the context size?

In [21]:
# Get vocabulary, word_to_ix, and ix_to_word for hotel reviews
hotel_vocab = get_vocab(hotel_reviews_preprocessed)
hotel_word_to_ix, hotel_ix_to_word = get_word2ix(hotel_vocab)

# Generate data for CBOW5 with context window of 5
hotel_data_cbow5 = get_data(hotel_reviews_preprocessed, context_window=5)

# Define embedding dimension (using the same as CBOW2 for comparison)
EMBEDDING_DIM = 100

# Create and train the CBOW model for hotel reviews with context window 5
model_hotel_cbow5 = CBOW(vocab_size=len(hotel_vocab), embedding_dim=EMBEDDING_DIM)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("Training CBOW5 model for Hotel Reviews dataset...")
train(model_hotel_cbow5, hotel_data_cbow5, vocab=hotel_vocab, device=device, word_to_ix=hotel_word_to_ix)

Training CBOW5 model for Hotel Reviews dataset...
Epoch 1/15 - Loss: 990371.6391
Epoch 2/15 - Loss: 902010.5678
Epoch 3/15 - Loss: 881519.6471
Epoch 4/15 - Loss: 870400.9332
Epoch 5/15 - Loss: 862901.3902
Epoch 6/15 - Loss: 857248.8459
Epoch 7/15 - Loss: 852681.4981
Epoch 8/15 - Loss: 848826.0552
Epoch 9/15 - Loss: 845447.6541
Epoch 10/15 - Loss: 842403.2030
Epoch 11/15 - Loss: 839610.1717
Epoch 12/15 - Loss: 837033.3654
Epoch 13/15 - Loss: 834590.0192
Epoch 14/15 - Loss: 832290.1829
Epoch 15/15 - Loss: 830089.3052


In [22]:
# Define path to save model
model_save_path = '/content/drive/My Drive/cbow5_hotel_reviews_model.pth'

# Save the model's state dictionary
torch.save(model_hotel_cbow5.state_dict(), model_save_path)

### 1.6 Train CBOW2 with a context width of 2 (in both directions) for the Sci-Fi story dataset

In [None]:
# Get vocabulary, word_to_ix, and ix_to_word for scifi
scifi_vocab = get_vocab(scifi_preprocessed)
scifi_word_to_ix, scifi_ix_to_word = get_word2ix(scifi_vocab)

# Generate data for CBOW2 with context window of 2
scifi_data_cbow2 = get_data(scifi_preprocessed, context_window=2)

# Define embedding dimension
EMBEDDING_DIM = 100

# Create and train the CBOW model for scifi with context window 2
model_scifi_cbow2 = CBOW(vocab_size=len(scifi_vocab), embedding_dim=EMBEDDING_DIM)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("Training CBOW2 model for Sci-Fi dataset...")
train(model_scifi_cbow2, scifi_data_cbow2, vocab=scifi_vocab, device=device, word_to_ix=scifi_word_to_ix, num_epochs=3)


Training CBOW2 model for Sci-Fi dataset...
Epoch 1/15 - Loss: 4629274.4707
Epoch 2/15 - Loss: 4282379.0312
Epoch 3/15 - Loss: 4222921.1209
Epoch 4/15 - Loss: 4198167.6785


### 2.1 For the hotel reviews dataset, choose 3 nouns, 3 verbs, and 3 adjectives (for CBOW2, and optionally for CBOW5)
Make sure that some of the nouns/verbs/adjectives occur frequently in the corpus and that others are rare. For each of the 9 chosen words, retrieve the 5 closest words according to your trained CBOW2 model.

🗒❓ List them in your report (at the end of this notebook) and comment on the performance of your model: Do the neighbours the model provides make sense? Discuss.
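One common way to retrieve the closest words is to rank all embedding vectors by cosine similarity to the query word. The sketch below assumes the embeddings are available as a NumPy matrix with one row per word; `nearest_neighbours` and the toy data are hypothetical and only illustrate the idea.

```python
import numpy as np


def nearest_neighbours(word, embeddings, word_to_ix, ix_to_word, topn=5):
    """Return the `topn` words with the highest cosine similarity to `word`."""
    query_ix = word_to_ix[word]
    query = embeddings[query_ix]
    # Cosine similarity between the query vector and every row
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
    sims = embeddings @ query / np.maximum(norms, 1e-12)
    sims[query_ix] = -np.inf  # exclude the query word itself
    best = np.argsort(-sims)[:topn]
    return [(ix_to_word[int(i)], float(sims[i])) for i in best]


# Toy 2-dimensional embeddings (made up, not from a trained model)
toy_vocab = ["good", "great", "bad", "room"]
toy_word_to_ix = {w: i for i, w in enumerate(toy_vocab)}
toy_ix_to_word = {i: w for w, i in toy_word_to_ix.items()}
toy_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [0.0, 1.0]])

toy_result = nearest_neighbours(
    "good", toy_embeddings, toy_word_to_ix, toy_ix_to_word, topn=2
)
print(toy_result)
```

For a trained model, the matrix would typically come from the embedding layer, for example something like `model.embeddings.weight.detach().cpu().numpy()`; the attribute name `embeddings` is an assumption and depends on how your `CBOW` class is defined.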


### 2.2 Repeat 2.1 for the Sci-Fi dataset

🗒❓ List your findings for the Sci-Fi dataset as well, as in 2.1.

### 2.3 🗒❓ How does the quality of the hotel review-based embeddings compare with that of the Sci-Fi-based embeddings? Elaborate.

### 2.4 Choose 2 words and retrieve their 5 closest neighbours according to both the hotel review-based and the Sci-Fi-based embeddings.

🗒❓ Do they have different neighbours? If so, can you explain why?

### 2.5 🗒❓ What are the differences between CBOW2 and CBOW5? Can you "describe" them?

### 3. Use the following code snippet to load a pretrained GloVe embedding model
GloVe (Global Vectors for Word Representation; paper: https://aclanthology.org/D14-1162/) is a count-based model trained on very large text corpora (Wikipedia, Common Crawl, Twitter, etc.). Unlike CBOW, which learns to predict a target word from its local context, GloVe learns embeddings that capture the global co-occurrence statistics of words across the entire corpus.

Each word in GloVe has one static vector, i.e. its embedding does not change depending on context.
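To make "global co-occurrence statistics" more concrete, the sketch below counts how often two words appear within a fixed window of each other across a whole (toy) corpus. It only shows the counting step: `cooccurrence_counts` is a hypothetical helper, and actual GloVe training additionally down-weights distant pairs and fits vectors to the logarithm of these counts, which is omitted here.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count (word, neighbour) pairs that occur within `window` positions."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


# Made-up toy corpus, not taken from the exercise datasets
toy_corpus = [
    "the hotel was clean".split(),
    "the hotel was noisy".split(),
]

counts = cooccurrence_counts(toy_corpus, window=1)
print(counts[("hotel", "was")], counts[("was", "clean")])
```

The counts are aggregated over all sentences before any vector is learned, which is the main contrast with CBOW's pair-by-pair prediction.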

Note: Loading pretrained GloVe embeddings does not require a GPU; it runs efficiently on CPU. Switch your runtime back to CPU to save GPU compute units.

In [None]:
!pip install gensim
!pip install "numpy<=1.26.0"

In [None]:
import numpy as np
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip -O glove.6B.zip
!unzip -q glove.6B.zip -d glove/

In [None]:
# Choose model
glove_input_file = "glove/glove.6B.100d.txt"
word2vec_output_file = "glove/glove.6B.100d.word2vec.txt"

# Convert GloVe format → Word2Vec format
glove2word2vec(glove_input_file, word2vec_output_file)

# Load model (this may take ~1–2 min)
model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)
print(f"Loaded {len(model.key_to_index):,} word vectors.")

In [None]:
### Insert your word selection from above here
trip_nouns = [...]
trip_verbs = [...]
trip_adj = [...]


for words_to_check in [trip_nouns, trip_verbs, trip_adj]:
    for w in words_to_check:
        if w in model.key_to_index:
            print(f"\nNearest neighbours for '{w}' in pretrained GloVe:")
            for neighbor, sim in model.most_similar(w, topn=5):
                print(f"  {neighbor:>12s}   (cosine similarity = {sim:.3f})")
        else:
            print(f"\n'{w}' not in vocabulary.")

### 3.1 🗒❓ Compare the neighbours retrieved here with the ones you retrieved above. Are there any similarities? How do they differ?
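One possible way to support the comparison with a number is to measure how much two neighbour lists overlap. The sketch below uses the Jaccard index (shared words divided by all distinct words); `neighbour_overlap` is a hypothetical helper, and the two lists are made-up placeholders rather than actual model output.

```python
def neighbour_overlap(neighbours_a, neighbours_b):
    """Return the shared words and the Jaccard index of two neighbour lists."""
    set_a, set_b = set(neighbours_a), set(neighbours_b)
    shared = set_a & set_b
    union = set_a | set_b
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard


# Placeholder lists for illustration only (not real results)
cbow_neighbours = ["hostel", "motel", "resort", "inn", "room"]
glove_neighbours = ["hotels", "motel", "resort", "inn", "casino"]

shared, jaccard = neighbour_overlap(cbow_neighbours, glove_neighbours)
print(sorted(shared), round(jaccard, 3))
```

A low overlap does not by itself mean one model is worse; the models were trained on different corpora, so a qualitative look at the neighbours is still needed.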

### Report
The lab report should contain a detailed description of the approaches you have used to solve this exercise. Please also include results.

Answers to the questions marked 🗒❓ go here as well.