
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/eldanc/mlbootcamp2025/blob/main/lab_5_1_words.ipynb)

# UofT FASE ML Bootcamp
#### Friday June 13, 2025
####  Word Embeddings - Properties, Meaning and Training - Lab 1, Day 5
#### Teaching team: Eldan Cohen, Alex Olson, Nakul Upadhya
##### Lab author: Nakul Upadhya (Based on CARTE-DSI ML Bootcamp 2023 notebook by Prof. Jonathan Rose)

This lab engages you in the properties, meaning, viewing and training of word embeddings (also called word vectors). The specific learning objectives in this assignment are:

1.   To learn word embedding properties, and use them in simple ways.
2.   (optional) To translate vectors into understandable categories of meaning
3.   To understand how embeddings are created, using the Skip Gram method.

---





# Experimenting and Understanding Word Embedding/Vectors


Word embeddings (also known as word vectors) are a way to encode the meaning of words into a set of numbers.

These embeddings are created by training a neural network model using many examples of the use of language.  These examples could be the whole of Wikipedia or a large collection of news articles.

To start, we will explore a set of word embeddings that someone else took the time and computational power to create. Among the most commonly used pre-trained word embeddings are the **GloVe embeddings**.

## GloVe Embeddings

You can read about the GloVe embeddings here: https://nlp.stanford.edu/projects/glove/, and read the original paper describing how they work here: https://nlp.stanford.edu/pubs/glove.pdf.

There are several variations of GloVe embeddings. They differ in the text used to train them and in the *size* (number of dimensions) of each embedding.

Throughout this lab we'll use `staticvectors`, a package for loading pre-trained static word embeddings that are hosted on the Hugging Face Hub.

We'll begin by loading a set of GloVe embeddings. The first time you run the code below, it will download a large file (862MB) containing the embeddings.

In [None]:
# Import the required libraries
!pip install huggingface staticvectors
import torch
from staticvectors import StaticVectors
import pandas as pd
import numpy as np

In [None]:
glove = StaticVectors('neuml/glove-6B')

We can use the loaded glove embeddings to look up the embeddings of individual words.
For example, let's look at what the embedding of the word "apple" looks like:

In [None]:
print("Embedding Shape:", glove.embeddings('apple').shape)
print(glove.embeddings('apple'))

As we can see from the output above, the embedding of a given word is a numpy array with shape `(50,)`. We don't know what each individual number means, but the embeddings do have properties that can be observed. For example, *distances between embeddings* are meaningful.

## Measuring Distance

Let's consider one specific metric of distance between two embedding vectors called the **Euclidean distance**. The Euclidean distance between two vectors $x = [x_1, x_2, \ldots, x_n]$ and
$y = [y_1, y_2, \ldots, y_n]$ is just the 2-norm of their difference $x - y$:

$$ \|x - y\|_2 = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} $$
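
As a quick check of the formula, the sketch below computes the Euclidean distance for two small made-up vectors (not GloVe embeddings), once directly from the sum and once with `np.linalg.norm`. The values are chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Two toy 3-dimensional vectors (illustrative values, not real embeddings)
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

# Direct translation of the formula: square root of the summed squared differences
manual = np.sqrt(np.sum((x - y) ** 2))

# The same quantity computed as the 2-norm of the difference vector
with_norm = np.linalg.norm(x - y)

print(manual, with_norm)  # both are 5.0, since the difference is (-3, -4, 0)
```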


Here we define a function to calculate the Euclidean distance between the embeddings of two words:


In [None]:
def word_euclidean_distance(x, y):
  a = glove.embeddings(x) # get embeddings
  b = glove.embeddings(y)
  return np.linalg.norm(a-b) # use np.linalg to calculate the norm

Now let's use this function to get distances between words:

In [None]:
word_euclidean_distance('apple', 'banana')

In [None]:
word_euclidean_distance('good', 'bad')

In [None]:
word_euclidean_distance('good', 'water')

In [None]:
word_euclidean_distance('good', 'well')

In [None]:
word_euclidean_distance('good', 'perfect')

## Cosine Similarity

An alternative and more commonly used measure is the **Cosine Similarity**. The cosine similarity measures the *angle* between two vectors, and has the property that it only considers the *direction* of the vectors, not their magnitudes. It is computed as follows for two vectors A and B:


![picture](https://drive.google.com/uc?id=1hSaQRBjH828lx1xozJCA4F0ZhiX2S0Xt)

Let's create a new function to measure the cosine similarity between two words:


In [None]:
def word_cosine_distance(x, y):
  a = glove.embeddings(x) # get embeddings
  b = glove.embeddings(y)
  norm_a = np.linalg.norm(a)
  norm_b = np.linalg.norm(b)
  return np.dot(a, b) / (norm_a * norm_b)

The cosine similarity is actually a *similarity* measure rather than a *distance* measure, and gives a result between -1 and 1. Thus, the larger the similarity (the closer to 1), the "closer in meaning" the words are to each other. Note that the function above is named `word_cosine_distance`, but the value it returns is a similarity.
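
To see the "direction only" property concretely, here is a small sketch with made-up vectors. The helper name `cosine_similarity` is our own and is not part of `staticvectors`. Scaling a vector moves it away from the original in Euclidean terms, but leaves the cosine similarity unchanged.

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 2.0])
same_direction = 3 * a                    # longer vector, same direction
opposite = -a                             # same length, opposite direction
orthogonal = np.array([2.0, -1.0, 0.0])   # perpendicular to a (dot product is 0)

print(cosine_similarity(a, same_direction))  # approximately 1
print(cosine_similarity(a, opposite))        # approximately -1
print(cosine_similarity(a, orthogonal))      # 0
print(np.linalg.norm(a - same_direction))    # Euclidean distance is 6, not 0
```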

In [None]:
word_cosine_distance('cat', 'dog')

In [None]:
word_cosine_distance('good', 'bad')

In [None]:
word_cosine_distance('good', 'water')

In [None]:
word_cosine_distance('good', 'perfect')

In [None]:
word_cosine_distance('watermelon', 'airplane')

## Word Similarity

Now that we have notions of distance and similarity in our embedding space, we can talk about words that are "close" to each other in the embedding space. For now, let's use Euclidean distances to look at how close various words are to the word "cat".

In [None]:
word = 'cat'
other = ['pet', 'dog', 'bike', 'kitten', 'puppy', 'kite', 'computer', 'neuron']
for w in other:
    dist = word_euclidean_distance(word, w)
    print(w, "\t%5.2f" % float(dist))

Let's do the same thing with cosine similarity:

In [None]:
word = 'cat'
other = ['pet', 'dog', 'bike', 'kitten', 'puppy', 'kite', 'computer', 'neuron']
for w in other:
    dist = word_cosine_distance(word, w)
    print(w, "\t%5.2f" % float(dist))

We can look through the entire **vocabulary** for words that are closest to a point in the embedding space -- for example, we can look for words that are closest to another word such as "cat".

In [None]:
def print_closest_words(glove_vector, n=5):
  tokens = glove.tokens
  distances = [None] * len(tokens)
  for i, token in enumerate(tokens):
      distance = np.linalg.norm(glove.embeddings(token) - glove_vector)
      distances[i] = {
          'token': token,
          'distance': distance
      }
  sorted_distances = sorted(distances, key=lambda x: x['distance'])
  df = pd.DataFrame(sorted_distances[1:n+1]) # skip the nearest entry (the query word itself when the vector is a word's embedding)
  return df


In [None]:
print_closest_words(glove.embeddings("cat"), n=10)

In [None]:
print_closest_words(glove.embeddings("dog"), n=10)

In [None]:
print_closest_words(glove.embeddings("doctor"), n=10)
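
The loop in `print_closest_words` computes one distance at a time. As a sketch of the same idea in vectorized form, the example below stacks a tiny made-up vocabulary into a matrix and ranks all rows by distance in one call. The words, the 2-dimensional vectors and the helper name `closest_words` are illustrative assumptions; the sketch does not use the GloVe model.

```python
import numpy as np

# Toy vocabulary with hand-picked 2-D vectors (illustrative, not real embeddings)
tokens = ["cat", "kitten", "dog", "car", "truck"]
vectors = np.array([
    [1.0, 1.0],   # cat
    [1.1, 1.2],   # kitten
    [1.5, 0.8],   # dog
    [5.0, 4.0],   # car
    [5.5, 4.2],   # truck
])

def closest_words(query, n=2, skip_first=True):
    # Euclidean distance from the query to every row at once
    distances = np.linalg.norm(vectors - query, axis=1)
    order = np.argsort(distances)      # indices from nearest to farthest
    if skip_first:
        order = order[1:]              # drop the nearest hit (the query word itself)
    return [(tokens[i], float(distances[i])) for i in order[:n]]

print(closest_words(vectors[tokens.index("cat")]))  # kitten first, then dog
```

When the query is not itself a word's embedding (a midpoint of two words, say), the nearest hit is a genuine result rather than the query word, which is why `skip_first` is exposed as a flag in this sketch.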

---

**Your Turn**

Try searching for similar words for words of your choice here:


In [None]:
#You can also try printing closest words to any other words of your choice here:



---


We could also look at which words are closest to the midpoints of two words:

In [None]:
print_closest_words((glove.embeddings('happy') + glove.embeddings('sad')) / 2)

In [None]:
print_closest_words((glove.embeddings('doctor') + glove.embeddings('engineer')) / 2)


## Analogies

One surprising aspect of word embeddings is that the *directions* in the embedding space can be meaningful. For example, analogy-like relationships such as this one tend to hold:

$$ \text{king} - \text{man} + \text{woman} \approx \text{queen} $$

Analogies show us how relationships between pairs of words are captured in the learned vectors.

In [None]:
print_closest_words(glove.embeddings('king') - glove.embeddings('man') + glove.embeddings('woman'))

The top result is a reasonable answer like "queen", and the other results include words like "princess."

We can also do the reverse!

In [None]:
print_closest_words(glove.embeddings('queen') - glove.embeddings('woman') + glove.embeddings('man'))

In [None]:
print_closest_words(glove.embeddings('king') - glove.embeddings('prince') + glove.embeddings('princess'))
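
To make the vector arithmetic easier to picture, here is a sketch with idealized, made-up 2-dimensional vectors in which the first coordinate stands for "royalty" and the second for "gender". Real embeddings only approximate this kind of structure, so treat the exact match below as a property of the toy data, not of GloVe.

```python
import numpy as np

# Idealized 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender" (made-up values)
toy = {
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

# Start at "king" and move along the man -> woman direction
target = toy["king"] - toy["man"] + toy["woman"]

# Rank every word by Euclidean distance to the target point
ranked = sorted(toy, key=lambda w: np.linalg.norm(toy[w] - target))
print(target, ranked[0])  # the nearest word in this toy space is "queen"
```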

---

**Your Turn**

Consider now the word pair relationships given in Figure 1 below, which comes from Table 1 of the paper by Mikolov et al. [[link](https://arxiv.org/abs/1301.3781)]. Choose one of these relationships, but not one of the ones already shown above, and report which one you chose. Write and run code that will generate the second word given the first word. Generate 10 more examples of the same relationship from 10 other words, and comment on the quality of the results.

![picture](https://drive.google.com/uc?id=1O7Zizu63jj5aoZkGkK0sz93CZSEsBDuW)



In [None]:
# TODO
# Choose one of the relationships from the table above and generate 10 examples



---

# Training a Word Embedding Using the Skip-Gram Method on a Small Corpus

So far in this notebook we've used the pre-trained GloVe embeddings. The lecture this morning described the Skip-Gram method of training word embeddings. In this section you will review code that uses that method to train a very small embedding, for a very small vocabulary, on a very small corpus of text. The goal is to gain some insight into how embeddings are produced. The corpus you are going to use is in the file `SmallSimpleCorpus.txt`, and was also shown in the lecture.

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import spacy

In [None]:
# Load SpaCy's English language model
nlp = spacy.load("en_core_web_sm")

First, we read the file `SmallSimpleCorpus.txt` into our notebook and do some pre-processing to make the learning easier later. Specifically, we lemmatize the corpus,
which means converting each word to its root form. For example, the word “holds” becomes “hold”, whereas the word “hold” itself stays the same.
The `prepare_texts` function performs lemmatization using the [spaCy](https://spacy.io/models/en) library.

In [None]:
!wget https://raw.githubusercontent.com/eldanc/mlbootcamp2025/refs/heads/main/SmallSimpleCorpus.txt

with open('./SmallSimpleCorpus.txt', 'r') as file:
    corpus = file.read()
corpus

In [None]:
# Preprocess the text
def prepare_texts(corpus):
    doc = nlp(corpus)
    lemmas = [token.lemma_ for token in doc if token.is_alpha]
    return lemmas
# Lemmatize the corpus and create the vocabulary
lemmas = prepare_texts(corpus)
vocab = sorted(set(lemmas)) # sorted so that token indices are the same on every run
v2i = {v: i for i, v in enumerate(vocab)} # dictionary to lookup word to index
i2v = {i: v for v, i in v2i.items()} # dictionary to lookup index to word
vocab_size = len(vocab)
print("Vocabulary Size:", vocab_size)

The function `tokenize_and_preprocess_text` takes the lemmatized small corpus as input, along with `v2i` (which serves as a simple, lemma-based tokenizer) and a window size `window`. Its output is the Skip-Gram training dataset for this corpus: pairs of words in the corpus that “belong” together, in the Skip-Gram sense.
That is, for every word in the corpus, a set of training examples is generated with that word serving as the (target) input to the predictor,
and every word that falls within a window of size `window` centred on that word is treated as being in the “context” of the given word.
The words are expressed as tokens (numbers).

In [None]:
# Tokenize and preprocess the text
def tokenize_and_preprocess_text(lemmas, v2i, window=3):
    data = []
    for i in range(len(lemmas)):
        target = v2i[lemmas[i]]
        context = []
        for j in range(i - window // 2, i + window // 2 + 1):
            if j != i and j >= 0 and j < len(lemmas):
                context.append(v2i[lemmas[j]])
        for c in context:
            data.append((target, c))
    return data
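
To see which pairs a given window produces, the sketch below applies the same windowing logic to a four-word made-up sentence (not taken from `SmallSimpleCorpus.txt`). It keeps the words as strings instead of token indices so the pairs are readable; the function name `skipgram_pairs` is hypothetical.

```python
def skipgram_pairs(words, window=3):
    # Each word is paired with neighbours up to window // 2 positions away
    half = window // 2
    pairs = []
    for i, target in enumerate(words):
        for j in range(i - half, i + half + 1):
            # Skip the target itself and positions outside the text
            if j != i and 0 <= j < len(words):
                pairs.append((target, words[j]))
    return pairs

words = ["I", "hold", "a", "dog"]
print(skipgram_pairs(words, window=3))
# [('I', 'hold'), ('hold', 'I'), ('hold', 'a'), ('a', 'hold'), ('a', 'dog'), ('dog', 'a')]
print(len(skipgram_pairs(words, window=5)))  # 10
```

Because of the integer division, a window of 3 reaches one word to each side and a window of 5 reaches two words to each side. Words near the start or end of the text simply get fewer pairs.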

In [None]:
# Create the Skip gram dataset with window size of 5
window_size = 5
data = tokenize_and_preprocess_text(lemmas, v2i, window_size)
print(data[:5])

The result of this is `data`, a list that enumerates co-occurrences of tokens. The task is now to train a classification model that aims to "predict" the second token of each pair using the embedding of the first.


Review the code in `Word2VecModel`. Part of this model ultimately provides the trained embeddings/vectors, and you can see these are defined and initialized to random numbers in the line `self.embedding = torch.nn.Parameter(torch.rand(vocab_size, embedding_size))`.

In [None]:
# The Word2Vec model
class Word2VecModel(nn.Module):
    def __init__(self, vocab_size, embedding_size):
        super(Word2VecModel, self).__init__()
        self.embedding = torch.nn.Parameter(torch.rand(
            vocab_size, embedding_size))
        self.fc = nn.Linear(embedding_size, vocab_size)

    def forward(self, x):
        x = self.embedding[x]
        x = self.fc(x)
        return x
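
To make the forward pass and the loss concrete, here is a NumPy sketch of what is computed for a single (target, context) pair: an embedding lookup, a linear layer that produces one score per vocabulary token, and the cross-entropy of those scores against the true context token. It is not the PyTorch model itself; the sizes and the randomly drawn parameters are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_size = 5, 2

# Randomly initialized parameters, standing in for the model's weights
embedding = rng.random((vocab_size, embedding_size))     # one row per token
weight = rng.normal(size=(vocab_size, embedding_size))   # linear layer weights
bias = np.zeros(vocab_size)                               # linear layer bias

target, context = 3, 1          # one (target, context) training pair

x = embedding[target]           # look up the target token's embedding
logits = weight @ x + bias      # one score per vocabulary token

# Softmax turns the scores into probabilities (shifted by the max for stability)
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()

# Cross-entropy loss: negative log-probability assigned to the true context token
loss = -np.log(probs[context])
print(probs.sum(), loss)
```

Training lowers this loss by adjusting both the linear layer and the rows of the embedding matrix, which is how the embeddings come to reflect which words share contexts.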

In [None]:
# Set the embedding size
embedding_size = 2 # Size of the embedding vector
# Initialize the Word2Vec model
model = Word2VecModel(vocab_size, embedding_size)

Review the code for training the model. It uses the cross-entropy loss function described in the lecture, a batch size of 4, a window size of 5, and 50 epochs of training. It uses the Adam optimizer with a learning rate of 0.001.

In [None]:
# Training the model
def train_word2vec(model, data, epochs=50, batch_size=4, learning_rate=0.001):
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(epochs):
        np.random.shuffle(data)
        losses = []
        for i in range(0, len(data), batch_size):
            batch = data[i:i+batch_size]
            inputs, labels = zip(*batch)
            inputs = torch.tensor(inputs, dtype=torch.long)
            labels = torch.tensor(labels, dtype=torch.long)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            losses.append(loss.item())

        print(f'Epoch {epoch+1}, Loss: {np.mean(losses):.4f}')

train_word2vec(model, data)

Run the code below that displays each of the embeddings in a 2-dimensional plot using Matplotlib.


In [None]:
# Visualize the embeddings
import matplotlib.pyplot as plt

def plot_embeddings(model, i2v):
    embeddings = model.embedding.data.numpy()
    plt.figure(figsize=(10, 10))
    for i, word in i2v.items():
        x, y = embeddings[i]
        plt.scatter(x, y)
        plt.text(x + 0.02, y + 0.02, word, fontsize=12)
    plt.show()

plot_embeddings(model, i2v)

---

**Your Turn**

* Look at the original corpus above. Do the results from the embedding make sense?

* What could happen when the window size is too large?

* At what value would `window` become too large for this corpus?

---