___
<img style="float: right; margin: 15px 15px 15px 15px;" src="https://communist.red/wp-content/uploads/2017/08/Anarchist_flag.png" width="300px" height="180px" />


# <font color= #bbc28d> **Skip-gram & CBOW Word Embeddings** </font>
#### <font color= #2E9AFE> `Lab 2 – Text Mining`</font>
- <Strong> Sofía Maldonado, Diana Valdivia & Viviana Toledo </Strong>
- <Strong> Date </Strong>: 20/10/2025 

___

<p style="text-align:right;"> Image retrieved from: https://communist.red/wp-content/uploads/2017/08/Anarchist_flag.png</p>

In [None]:
# General Libraries
import numpy as np
import pandas as pd
import random

# Text Processing 
import re

# Modeling
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader

# Evaluation
from sklearn.metrics.pairwise import cosine_similarity

#Visualization
from sklearn.manifold import TSNE
from umap import UMAP
import plotly.subplots as sp
import plotly.graph_objects as go

# <font color= #bbc28d> **Introduction** </font>

Natural Language Processing (NLP) encompasses all the `tasks related to making computers understand human language`, and word embeddings are a useful tool for achieving this goal. `Word embeddings are numerical representations of words`: they let computers encode language as vectors, which can then be compared and visualized.

The simplest way to turn words into vectors is one-hot encoding, which represents each word as a binary `vector of dimensionality 1 x vocabulary size` with a single 1 at the word's index, for example:  

<div style="background-color: white; text-align: center; padding: 1em; margin-bottom: 0.5em;">
<img src="https://www.baeldung.com/wp-content/ql-cache/quicklatex.com-40dd0ac8f7ba6930347fc88ac01ef5b8_l3.svg" style="display: inline-block;"/>
</div>

<div style="background-color: white; text-align: center; padding: 0.5em;">
    <img src="https://www.baeldung.com/wp-content/ql-cache/quicklatex.com-aca3b72bad8941a430b71c9946bf01b3_l3.svg" style="display: inline-block;"/>
</div>

However, such an encoding scales poorly: every new vocabulary word adds a dimension, so large vocabularies run into the curse of dimensionality. Additionally, the vectors change whenever the vocabulary changes, and they don't encapsulate word meaning. One-hot encoding 'indexes' words, but it is unable to capture semantic and syntactic information; the values in the `vectors must somehow quantify the meaning of the words they represent`.
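
To make the limitation concrete, here is a minimal sketch in plain Python. The four-word vocabulary and the helper names `one_hot` and `dot` are made up for illustration and are not part of the lab code. It shows that any two different one-hot vectors have a dot product of 0, so the encoding says nothing about which words are related:

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical toy vocabulary (not the lab corpus)
toy_vocab = ["anarchism", "king", "queen", "revolution"]
vectors = {word: one_hot(i, len(toy_vocab)) for i, word in enumerate(toy_vocab)}

print(vectors["king"])                         # [0, 1, 0, 0]
print(dot(vectors["king"], vectors["queen"]))  # 0: no similarity signal at all
print(dot(vectors["king"], vectors["king"]))   # 1: a word only matches itself
```

Note also that the vector length equals the vocabulary size, so the 7,979-word vocabulary used later in this notebook would need 7,979 dimensions per word.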

To solve this problem, Word2Vec was introduced: a technique that generates `embeddings based on word similarity`, so that words used in similar contexts end up close to each other in the vector space in terms of cosine distance. There are two main algorithms for training Word2Vec embeddings, Continuous Bag of Words (CBOW) and Skip-Gram, both of which use shallow neural network models.
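
For reference, the cosine similarity between two embedding vectors can be written as follows (the symbols $\mathbf{u}$ and $\mathbf{v}$ are our own notation for two word vectors):

$$
\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}
$$

The value lies between -1 and 1 and depends only on the angle between the vectors, not on their lengths. Cosine distance is simply $1 - \cos(\mathbf{u}, \mathbf{v})$, so similar words have a distance close to 0.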

In this notebook, we will explore both models and implement them on a sample of Wikipedia text that opens with the article on anarchism.

# <font color= #bbc28d> **Preprocessing** </font>

In [3]:
# Read Wikipedia file 
with open(r"text8", "r", encoding="utf-8") as f:
    text = f.read()

The preprocessing task involves:

- Normalization of Corpus (ensuring words are in lowercase)

- Word extraction

- Tokenization by whitespace

- Remove single-letter tokens

- Select a corpus of 50,000 words

In [4]:
# Normalize data
# Convert every word to lowercase
text = text.lower()

# Remove every character that is not a lowercase letter or whitespace
text = re.sub(r"[^a-z\s]", "", text)

# Tokenize by whitespace
tokens = text.split()

# Drop single-letter tokens
tokens = [w for w in tokens if len(w) > 1]

# Keep first 50k words
tokens_models = tokens[:50_000]

Next, we create a vocabulary from those words: 

In [5]:
# Create a vocabulary based on the clean corpus
vocab = sorted(set(tokens_models))
vocab_set = set(vocab)

After creating a vocabulary, we build up a dictionary of word-index pairs for the embeddings:

In [6]:
# Create word-index dictionaries
word_to_idx = {word: i for i, word in enumerate(vocab)}
idx_to_word = {i: word for word, i in word_to_idx.items()}

length = len(vocab)
print(f"Vocabulary Size: {length}")

Vocabulary Size: 7979


And finally, we can map each token to its id and inspect the first ten:

In [7]:
tokens_idx = [word_to_idx[w] for w in tokens_models if w in vocab_set]
tokens_idx[:10]

[362, 5086, 543, 7161, 4983, 39, 2868, 7568, 203, 2293]

# <font color= #bbc28d> **Modeling** </font>

CBOW and Skip-gram are algorithms that work on word pairs, meaning, `to learn word similarity, each word is paired with the words around it and the pairs are used as training examples`. The pairing is driven by a window size, which selects the words before and after the target and pairs them with it. Building these pairs is therefore the first step for each model.
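
As a small illustration of the windowing idea, the sketch below pairs each token with its neighbours using a fixed window of 1. The function name `window_pairs` is hypothetical, and the lab code further down draws a random window size per target, which this sketch does not do:

```python
def window_pairs(tokens, window_size):
    """Pair every token with the tokens at most `window_size` positions away."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(i - window_size, 0)               # clip at the start of the text
        end = min(i + window_size + 1, len(tokens))   # clip at the end of the text
        for j in range(start, end):
            if j != i:                                # skip the target itself
                pairs.append((target, tokens[j]))
    return pairs

# First tokens of the cleaned corpus, used here as a toy example
toy_tokens = ["anarchism", "originated", "as", "term", "of", "abuse"]
toy_pairs = window_pairs(toy_tokens, window_size=1)

for target, context in toy_pairs:
    print(target, "->", context)
```

With a window of 1, the two tokens at the edges contribute one pair each and the four inner tokens contribute two, for 10 pairs in total.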

Both models are implemented in PyTorch, since they are, essentially, neural networks. Taking advantage of this, we first check whether CUDA is available. If it is, we run our models on the computer's GPU, which speeds up training:

In [8]:
# PyTorch Configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.set_default_device(device)
print(f'Using device: {device}')

Using device: cuda


# <font color= #bbc28d> **1. Skip-Gram** </font>

The Skip-Gram algorithm tries to extract the semantics of words (context) by `predicting the context words using the main word`. Each (target, context) pair is run through a neural network model with one hidden layer:

<p align= "center">
    <img src='https://www.baeldung.com/wp-content/uploads/sites/4/2021/03/Baeldung-word-embeddings-1-656x1024-1.png' width="420px" height="560px">
</p>

First, we need to generate the Skip-Gram pairs. For each target we draw a random window size between 2 and 5 words:

In [9]:
def generate_skipgram_pairs(tokens_idx, min_window=2, max_window=5):
    # Store the pairs in a list
    pairs = []
    # Number of tokens in the corpus
    n = len(tokens_idx)
    for i in range(n):
        # Pick the target word
        target = tokens_idx[i]
        # Random window size between min_window and max_window (inclusive)
        window_size = random.randint(min_window, max_window)
        # Start and end positions of the window, clipped to the corpus bounds
        start = max(i - window_size, 0)
        end = min(i + window_size + 1, n)
        for j in range(start, end):
            # Skip the target itself
            if j != i:
                context = tokens_idx[j]
                pairs.append((target, context))
    return pairs

skip_pairs = generate_skipgram_pairs(tokens_idx, min_window=2, max_window=5)

In [10]:
# Convert Skip-Gram pairs to text
skipgram_pairs_words = [
    (idx_to_word[target], idx_to_word[context])
    for target, context in skip_pairs
]

for i in range(5):
    print(f"Target: {skipgram_pairs_words[i][0]}  -->  Context: {skipgram_pairs_words[i][1]}")

Target: anarchism  -->  Context: originated
Target: anarchism  -->  Context: as
Target: anarchism  -->  Context: term
Target: anarchism  -->  Context: of
Target: originated  -->  Context: anarchism


After pairing the words, we can proceed to model the data. We're using 15 epochs, a CrossEntropyLoss criterion, and an Adam optimizer with a learning rate of 0.001: 

In [11]:
# Constants
embedding_dim = 100
vocab_size = len(vocab)

# ==========================
# Model Definition
# ==========================
class SkipGramModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.output = nn.Linear(embedding_dim, vocab_size)

    def forward(self, centers):
        # centers: tensor [batch_size]
        embeds = self.embeddings(centers)  # [batch_size, emb_dim]
        out = self.output(embeds)
        return out

# ==========================
# Data Loader
# ==========================  
generator = torch.Generator(device=device)              # Generator on the same device, used for shuffling
# Build the index tensors (created on the default device configured earlier)
skipgram_targets = torch.tensor([t for t, c in skip_pairs], dtype=torch.long)
skipgram_contexts = torch.tensor([c for t, c in skip_pairs], dtype=torch.long)
# Combine targets and contexts into a list of pairs for training
skipgram_dataset = list(zip(skipgram_targets, skipgram_contexts))

# Data Loader for batch processing
skipgram_loader = DataLoader(skipgram_dataset, batch_size=1024, shuffle=True, generator=generator)

# ==========================
# Model Training 
# ==========================
skipgram_model = SkipGramModel(vocab_size, embedding_dim).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(skipgram_model.parameters(), lr=0.001)
epochs = 15

for epoch in range(epochs):
    total_loss = 0
    skipgram_model.train()
    for centers, contexts in skipgram_loader:
        # Move data to the device
        centers = centers.to(device)
        contexts = contexts.to(device)
        
        optimizer.zero_grad()
        output = skipgram_model(centers)
        loss = criterion(output, contexts)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Skip-gram Epoch {epoch+1}/{epochs}, Loss: {total_loss:.4f}")

Skip-gram Epoch 1/15, Loss: 2872.5853
Skip-gram Epoch 2/15, Loss: 2519.8706
Skip-gram Epoch 3/15, Loss: 2359.8234
Skip-gram Epoch 4/15, Loss: 2271.6430
Skip-gram Epoch 5/15, Loss: 2214.5140
Skip-gram Epoch 6/15, Loss: 2171.9992
Skip-gram Epoch 7/15, Loss: 2137.4398
Skip-gram Epoch 8/15, Loss: 2108.2206
Skip-gram Epoch 9/15, Loss: 2082.5474
Skip-gram Epoch 10/15, Loss: 2059.6493
Skip-gram Epoch 11/15, Loss: 2039.0587
Skip-gram Epoch 12/15, Loss: 2020.4535
Skip-gram Epoch 13/15, Loss: 2003.3486
Skip-gram Epoch 14/15, Loss: 1987.8195
Skip-gram Epoch 15/15, Loss: 1973.5437
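
For reference, the loss that `nn.CrossEntropyLoss` computes for a single (target, context) pair $(t, c)$ can be written out as below. The symbols are our own notation, not names from the code: $\mathbf{v}_t$ is the row of `embeddings` for the target, $W$ and $\mathbf{b}$ are the weight and bias of the `output` layer, and $V$ is the vocabulary size.

$$
\mathbf{z} = W \mathbf{v}_t + \mathbf{b}, \qquad
\mathcal{L}(t, c) = -\log \frac{\exp(z_c)}{\sum_{j=1}^{V} \exp(z_j)}
$$

With the default settings, each batch loss is the mean of $\mathcal{L}$ over the pairs in the batch, and the number printed per epoch is the sum of those batch means.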


Finally, we can obtain the embeddings produced by Skip-Gram:

In [12]:
# Move embeddings to CPU for further usage (Visualization)
skipgram_embeddings = skipgram_model.embeddings.weight.data.cpu()

# Visualize the vector of a word
word = "science"
idx = word_to_idx[word]
print(f"Skip-Gram Vector of '{word}': \n {skipgram_embeddings[idx]}")

Skip-Gram Vector of 'science': 
 tensor([ 2.1129,  0.0386, -1.3943,  0.9383, -2.0695,  0.8647,  0.5960, -0.1876,
        -0.4468, -2.4660, -0.0727,  1.1486, -1.0809,  1.5564, -1.8450,  0.5167,
         0.4109,  0.3302, -0.0590,  0.9703, -1.1504,  2.1644,  0.8036, -0.1588,
        -1.0557,  2.0893, -1.1484, -1.3856, -0.3413, -0.8697,  1.4587, -1.6520,
         1.8534, -0.0451, -0.5159, -2.7897, -0.5317,  0.5058, -0.1839, -1.7234,
         1.6467, -1.4718, -0.9714,  0.0431,  0.2521,  0.4107, -1.0097,  1.4442,
        -0.0497,  0.3665,  0.0576, -0.4305,  0.4509, -0.1748,  0.2137,  0.9769,
         0.0594, -0.0921,  0.5405, -0.1075, -0.1902, -0.3547,  0.8332, -0.7606,
         0.2905, -2.3213, -0.8157, -0.6477, -1.2070,  0.2294, -0.4838,  1.0091,
        -1.1870,  1.3637,  0.4349,  0.3887,  0.5620, -0.5153,  0.8767,  1.3278,
         0.0188, -0.4318,  1.4938,  3.1381,  0.9087,  1.5092, -2.1301,  0.6259,
        -1.3380, -0.8744, -1.6318, -0.7296,  1.1615, -0.3448,  2.0887, -1.9452,
       

For a clearer understanding of the embeddings, let's take a look at the 10 most similar words:

In [13]:
def get_top_similar_words(embeddings, word, top_k=10):
    """
    Find top-k most similar words using cosine similarity
    """
    if word not in word_to_idx:
        print(f"Word '{word}' not in vocabulary")
        return []
    
    # Get the embedding for the anchor word
    word_idx = word_to_idx[word]
    word_embedding = embeddings[word_idx].reshape(1, -1)
    
    # Calculate cosine similarity with all other words
    similarities = cosine_similarity(word_embedding, embeddings)[0]
    
    # Get top-k most similar words (excluding the word itself)
    similar_indices = np.argsort(similarities)[::-1][1:top_k+1]  # Skip the word itself
    
    similar_words = []
    for idx in similar_indices:
        similar_words.append((idx_to_word[idx], similarities[idx]))
    
    return similar_words

# Anchor words
anchor_words = ["king", "anarchism", "communist", "revolution", "paris"] 

print(f"Skip-gram Model - Top 10 Most Similar Words related to {anchor_words}:")
print("=" * 120)
for word in anchor_words:
    similar_words = get_top_similar_words(skipgram_embeddings.numpy(), word)
    print(f"\n'{word}':")
    for similar_word, similarity in similar_words:
        print(f"  {similar_word}: {similarity:.4f}")

Skip-gram Model - Top 10 Most Similar Words related to ['king', 'anarchism', 'communist', 'revolution', 'paris']:

'king':
  leadership: 0.4042
  organon: 0.3807
  shrugged: 0.3530
  jewish: 0.3394
  rapidly: 0.3354
  extra: 0.3313
  bmc: 0.3242
  rest: 0.3216
  nomadic: 0.3205
  omnipotence: 0.3202

'anarchism':
  insurrectionary: 0.4362
  instrumental: 0.4109
  descendants: 0.3905
  argued: 0.3748
  nationalism: 0.3701
  capitalists: 0.3612
  ensure: 0.3539
  wording: 0.3377
  tit: 0.3329
  seize: 0.3260

'communist':
  presupposes: 0.3517
  version: 0.3488
  judge: 0.3369
  philosophies: 0.3295
  associating: 0.3245
  merging: 0.3232
  slow: 0.3223
  litt: 0.3200
  saying: 0.3185
  telling: 0.3146

'revolution':
  create: 0.4140
  seymour: 0.4024
  saving: 0.3798
  characterized: 0.3624
  chomsky: 0.3585
  catechism: 0.3463
  fowlers: 0.3453
  desires: 0.3452
  quality: 0.3446
  ct: 0.3390

'paris':
  homefront: 0.4380
  vague: 0.4045
  community: 0.3612
  definition: 0.3508
  promo

Some of these words are not obviously related to the anchor word, although they may appear in similar contexts. **Skip-Gram is generally reported to cope well with small datasets**, but our corpus (50,000 words) is very small by Word2Vec standards and we only train for 15 epochs, so the model may still have trouble identifying context. The words most commonly associated with our anchor words are:

- **King** <--> Battles, Hawthorne
- **Anarchism** <--> Zinovievna, Leftism
- **Communist** <--> Troilus, Revolutionary
- **Revolution** <--> Hoosier, Ukraine
- **Paris** <--> York, Hart

The exact neighbours vary between runs, since neither the window sampling nor the weight initialization is seeded, so the output printed above need not contain them. Even so, we can see some dissonances.

# <font color= #bbc28d> **2. CBOW** </font>

The CBOW algorithm works similarly to Skip-Gram but performs the reverse operation: the model tries to `predict the main word using the context words`. Therefore, the neural network is a mirror image of Skip-Gram's:

<p align= "center">
    <img src='https://www.baeldung.com/wp-content/uploads/sites/4/2021/03/Screenshot-2021-03-05-at-11.29.31-1024x616-1-768x462.png' width="480px" height="360px">
</p>
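
In the implementation used in this notebook, the context words are combined by averaging their embeddings into a single vector before the output layer. The sketch below shows that averaging step in plain Python; the 3-dimensional embeddings and the helper `mean_vector` are invented for illustration and are not taken from the trained model:

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical 3-dimensional embeddings, chosen by hand for illustration
toy_embeddings = {
    "originated": [1.0, 0.0, 2.0],
    "as":         [0.0, 2.0, 0.0],
    "term":       [2.0, 1.0, 1.0],
}

context = ["originated", "as", "term"]
hidden = mean_vector([toy_embeddings[w] for w in context])
print(hidden)  # [1.0, 1.0, 1.0]
```

Because of the averaging, the order of the context words does not matter, which is where the "bag of words" in the name comes from.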

Same as with Skip-Gram, we start by generating word pairs, with a window of 2 to 5 words:

In [14]:
# Generate the (context, target) pairs for CBOW
def generate_cbow_pairs(tokens_idx, min_window=2, max_window=5):
    # Store the pairs in a list
    pairs = []
    # Number of tokens in the corpus
    n = len(tokens_idx)
    for i in range(n):
        # Pick the target word
        target = tokens_idx[i]
        # Random window size between min_window and max_window (inclusive)
        window_size = random.randint(min_window, max_window)
        # Start and end positions of the window, clipped to the corpus bounds
        start = max(i - window_size, 0)
        end = min(i + window_size + 1, n)
        # Context of the word to predict (everything in the window except the target)
        context = [tokens_idx[j] for j in range(start, end) if j != i]
        if context:
            pairs.append((context, target))
    return pairs

cbow_pairs = generate_cbow_pairs(tokens_idx, min_window=2, max_window=5)

In [15]:
# Convert CBOW pairs to text
cbow_pairs_words = [
    ([idx_to_word[i] for i in context], idx_to_word[target])
    for context, target in cbow_pairs
]

for i in range(5):
    context_str = f'Context: {cbow_pairs_words[i][0]}'
    target_str = f'Target: {cbow_pairs_words[i][1]}'
    print(f"{context_str.ljust(80)}  -->  {target_str.rjust(30)}")

Context: ['originated', 'as', 'term', 'of', 'abuse']                              -->               Target: anarchism
Context: ['anarchism', 'as', 'term', 'of', 'abuse', 'first']                      -->              Target: originated
Context: ['anarchism', 'originated', 'term', 'of', 'abuse', 'first']              -->                      Target: as
Context: ['anarchism', 'originated', 'as', 'of', 'abuse', 'first', 'used']        -->                    Target: term
Context: ['as', 'term', 'abuse', 'first']                                         -->                      Target: of


After pairing the words, we can proceed to model the data. We're using 15 epochs, a CrossEntropyLoss criterion, and an Adam optimizer with a learning rate of 0.001: 

In [16]:
# ==========================
# Model Definition
# ==========================
class CBOWModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.output = nn.Linear(embedding_dim, vocab_size)

    def forward(self, contexts):
        # contexts: list of index tensors, already on the correct device
        embeds = [self.embeddings(c) for c in contexts]  # list of [context_len, emb_dim]
        context_embeds = torch.stack([e.mean(dim=0) for e in embeds])  # [batch_size, emb_dim]
        out = self.output(context_embeds)
        return out
    
# ==========================
# DataLoader 
# ==========================
def cbow_collate(batch):
    contexts, targets = zip(*batch)
    # Convert the contexts to tensors and move them to the device
    context_tensors = [torch.tensor(c, dtype=torch.long).to(device) for c in contexts]
    return context_tensors, torch.tensor(targets, dtype=torch.long).to(device)

cbow_loader = DataLoader(cbow_pairs, batch_size=1024, shuffle=True, collate_fn=cbow_collate, generator=generator)

# ==========================
# Model Training 
# ==========================
cbow_model = CBOWModel(vocab_size, embedding_dim).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(cbow_model.parameters(), lr=0.001)
epochs = 15

for epoch in range(epochs):
    total_loss = 0
    cbow_model.train()
    for contexts, targets in cbow_loader:
        # The data is already on the correct device thanks to collate_fn
        optimizer.zero_grad()
        output = cbow_model(contexts)
        loss = criterion(output, targets)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"CBOW Epoch {epoch+1}/{epochs}, Loss: {total_loss:.4f}")

CBOW Epoch 1/15, Loss: 433.7627
CBOW Epoch 2/15, Loss: 411.2242
CBOW Epoch 3/15, Loss: 382.6940
CBOW Epoch 4/15, Loss: 352.6040
CBOW Epoch 5/15, Loss: 333.5147
CBOW Epoch 6/15, Loss: 323.9343
CBOW Epoch 7/15, Loss: 317.0733
CBOW Epoch 8/15, Loss: 311.0076
CBOW Epoch 9/15, Loss: 305.3666
CBOW Epoch 10/15, Loss: 299.8972
CBOW Epoch 11/15, Loss: 294.5601
CBOW Epoch 12/15, Loss: 289.3416
CBOW Epoch 13/15, Loss: 284.1457
CBOW Epoch 14/15, Loss: 279.0069
CBOW Epoch 15/15, Loss: 273.8974
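
One caveat when reading these totals next to the Skip-Gram ones: both training loops print the sum of the batch losses, so the number depends on how many batches an epoch contains. The rough sketch below divides each final total by an estimated batch count. The helper `mean_batch_loss` is hypothetical, and the Skip-Gram pair count is an assumption (about 7 context words per token on average, since the window is drawn uniformly from 2 to 5 on each side), not a measured value:

```python
import math

def mean_batch_loss(total_loss, n_pairs, batch_size=1024):
    """Divide an epoch's summed loss by the number of batches in that epoch."""
    n_batches = math.ceil(n_pairs / batch_size)
    return total_loss / n_batches

# CBOW builds one (context, target) pair per token: 50,000 pairs per epoch
cbow_mean = mean_batch_loss(273.8974, n_pairs=50_000)

# Assumption: Skip-Gram builds roughly 7 pairs per token, i.e. about 350,000 pairs
skipgram_mean = mean_batch_loss(1973.5437, n_pairs=350_000)

print(f"CBOW: {cbow_mean:.2f} per batch, Skip-Gram (estimate): {skipgram_mean:.2f} per batch")
```

Under these assumptions the two per-batch averages land close to each other, so the large gap between the printed totals mostly reflects the number of batches. Tracking `total_loss / len(loader)` inside the training loop would give the exact figure.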


After our CBOW model has been trained, it's time to take a look at the embeddings:

In [17]:
# Move embeddings to CPU for further usage (Visualization)
cbow_embeddings = cbow_model.embeddings.weight.data.cpu()

# Visualize the vector of a word
word = "science"
idx = word_to_idx[word]
print(f"CBOW Vector of '{word}': \n {cbow_embeddings[idx]}")

CBOW Vector of 'science': 
 tensor([ 1.2161, -1.0920, -0.6482, -1.7903,  0.6046,  1.5569,  2.2029, -0.9556,
         0.5691, -0.3546, -1.7698,  0.7636,  1.0751,  1.1890,  2.5660,  0.2730,
        -0.2542, -2.3767,  0.9937,  0.1385, -3.2756,  0.6779,  0.7277,  0.7702,
        -0.3677, -2.7889,  1.1537,  0.4579,  2.5921, -0.4360, -1.7802,  0.3261,
         0.0167,  1.1755, -2.8044,  1.1860,  0.9546, -1.1252, -0.5227, -0.7571,
         1.2803, -0.1826,  0.2124, -0.9714, -0.3592,  0.8604,  0.9870,  0.5763,
         0.1255,  0.0471,  0.7244,  1.9093, -0.8323, -0.5541, -0.4015, -0.7600,
         0.7204,  0.6383, -1.0407,  0.3809,  2.2629,  0.1622, -0.9786,  0.5457,
        -0.9005, -0.3926,  0.2537,  0.6134,  0.3777, -1.0227,  1.4897,  0.2532,
         0.2899, -0.4815, -3.6886, -0.8326,  2.2021,  1.3784, -0.4389,  0.4750,
         0.1050,  1.0012,  1.9981,  0.5731, -1.3327, -1.2035, -0.7176,  0.2909,
        -1.6777, -0.2517,  1.2730,  0.6230, -0.7558,  0.4057, -1.0297, -0.8882,
        -1.4

And, a view of the top 10 most similar words:

In [18]:
print(f"CBOW Model - Top 10 Most Similar Words related to {anchor_words}:")
print("=" * 120)
for word in anchor_words:
    similar_words = get_top_similar_words(cbow_embeddings.numpy(), word)
    print(f"\n'{word}':")
    for similar_word, similarity in similar_words:
        print(f"  {similar_word}: {similarity:.4f}")

CBOW Model - Top 10 Most Similar Words related to ['king', 'anarchism', 'communist', 'revolution', 'paris']:

'king':
  newspaperman: 0.3928
  deed: 0.3694
  borrowed: 0.3670
  cellular: 0.3636
  kneels: 0.3533
  preserved: 0.3525
  delivered: 0.3523
  act: 0.3355
  armor: 0.3275
  coin: 0.3237

'anarchism':
  crafoord: 0.3888
  unchanging: 0.3637
  formation: 0.3611
  daughter: 0.3486
  cdd: 0.3377
  kira: 0.3355
  giving: 0.3355
  more: 0.3330
  twentieth: 0.3218
  develop: 0.3207

'communist':
  worshipped: 0.3620
  spontaneity: 0.3332
  periodically: 0.3144
  cite: 0.3139
  binary: 0.3104
  del: 0.3078
  attributive: 0.3069
  dark: 0.3067
  definition: 0.3043
  evidenced: 0.3036

'revolution':
  ibycus: 0.4304
  new: 0.3608
  ruthless: 0.3538
  rescind: 0.3521
  indivisible: 0.3512
  pdd: 0.3363
  heals: 0.3355
  reflected: 0.3230
  pinyin: 0.3224
  chairman: 0.3185

'paris':
  rail: 0.4085
  knew: 0.3980
  ten: 0.3634
  estimated: 0.3606
  hugo: 0.3585
  categories: 0.3312
  deduc

CBOW seems to perform somewhat better when it comes to identifying word context on this corpus, although its similarity scores fall in the same range as Skip-Gram's (roughly 0.30 to 0.43), so the difference is a qualitative impression. The words most commonly associated with our anchor words are:

- **King** <--> Declaration
- **Anarchism** <--> Falls
- **Communist** <--> Provided, Nourishing
- **Revolution** <--> Call, Manifested
- **Paris** <--> Group, Realm

# <font color= #bbc28d> **Embeddings Visualization** </font>

The similarity between words can also be observed graphically. However, due to the high dimensionality of the embeddings, dimensionality reduction techniques need to be applied first. For this project, two techniques will be tested: `t-SNE` and `UMAP`.

# <font color= #bbc28d> **1. t-SNE** </font>

`t-SNE` (t-distributed Stochastic Neighbor Embedding) is a dimensionality reduction technique developed by Laurens van der Maaten and Geoffrey Hinton in 2008.

It works in two steps. First, it builds a probability distribution over pairs of high-dimensional objects in which similar objects receive a higher probability and dissimilar objects a lower one. It also defines a second probability distribution over the points in the low-dimensional map. 

Then, it minimizes the [KL Divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) between the two distributions by adjusting the positions of the points in the map. By default, the algorithm uses Euclidean distance as the base of its similarity metric.
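
To give a feel for what is being minimized, the sketch below computes the KL divergence between small discrete distributions in plain Python. The probabilities are invented for illustration and the function `kl_divergence` is our own helper; the real algorithm works with pairwise similarity distributions over all points and optimizes the map by gradient descent:

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical pair probabilities in the high-dimensional space
p = [0.7, 0.2, 0.1]
# Two candidate low-dimensional maps: one close to p, one far from it
q_good = [0.6, 0.3, 0.1]
q_bad = [0.1, 0.2, 0.7]

print(kl_divergence(p, p))                                   # 0.0 for identical distributions
print(kl_divergence(p, q_good) < kl_divergence(p, q_bad))    # True
```

A map whose pairwise probabilities resemble the original ones gets a divergence near zero, which is why minimizing it tends to keep similar points close together.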

In [19]:
# Getting 200 random words to visualize
tsne_samples = 200

np.random.seed(42) # Seed so that re-runs don't affect the output

# Model subsets
idx_2 = np.random.choice(len(vocab), tsne_samples, replace=False)
cbow_subset = cbow_embeddings[idx_2]
skipgram_subset = skipgram_embeddings[idx_2]

# Vocabulary subset for comparison
vocab_subset = [vocab[i] for i in idx_2]

In [21]:
def tsne_plot(embedding_list:list, labels, model_names:list, title='t-SNE Projection - CBOW vs Skip-Gram'):
    # Create subplots for visualization
    fig = sp.make_subplots(rows=1, cols=2, subplot_titles=model_names)

    for idx, (embeddings, name) in enumerate(zip(embedding_list, model_names), start=1):
        # Initialize t-SNE object
        tsne = TSNE(n_components=2, random_state=42, perplexity=30)
        # Reduce dimensionality
        proj = tsne.fit_transform(embeddings)

        # Create dataframe with coordinates for each data point and their corresponding words
        df = pd.DataFrame({
            'x': proj[:, 0],
            'y': proj[:, 1],
            'word': labels
        })

        # Create scatter plot with text annotations
        scatter = go.Scatter(
            x=df['x'], y=df['y'],               # Data
            mode='markers+text',                # Show points and labels
            text=df['word'],                    # Words to be annotated
            textposition='top center', textfont=dict(size=9),         # Position and font size of the labels (words)
            hoverinfo='text',                   # Hovering shows the word
            showlegend=False,
            marker=dict(size=6, color='blue')
        )
        # Append the scatter plot to the subplot
        fig.add_trace(scatter, row=1, col=idx)

    # Update the layout
    fig.update_layout(
        height=600,
        width=1400,
        title=dict(text=title, x=0.5, xanchor='center', font=dict(size=20)),
        showlegend=False
    )

    fig.show()

tsne_plot([skipgram_subset, cbow_subset], vocab_subset, ['Skip-Gram', 'CBOW'])

In this example, the same 200-word sub-sample was visualized using t-SNE for both our CBOW and Skip-Gram embeddings. Words that are closer together have more similar embeddings. Below are some specific pairs that can be seen in the plots above, each with an example of how the words might be used together in a Wikipedia article:

#### CBOW
- "accounted" - "heatstroke" (heatstroke accounted for 10% of deaths)
- "experiences" - "divine" (self-explanatory)
- "washington" - "farm" (George Washington grew up on a farm in Virginia)
- "application" - "approved" (also self-explanatory)
- "profit" - "welfare" (profit and welfare are both words used in macroeconomics discussions)

#### Skipgram
- "eugene" - "county" - "residents" (Eugene, a city with 176,000 residents, is the county seat of Lane County, Oregon)
- "http" - "net" (self-explanatory)
- "flag" - "filter" (both words can come up when describing a system that flags or filters bad words, for example)
- "buddhism" - "worship" (also self-explanatory)
- "exceeding" - "commercial" (the commercial activity in the area was exceeding its resources)

# <font color= #bbc28d> **2. UMAP** </font>

UMAP (Uniform Manifold Approximation and Projection) is another dimensionality reduction algorithm, based on manifold learning techniques and ideas from topological data analysis. It's a non-linear algorithm that seeks to `learn the structure of the data and find a low dimensional embedding that preserves the essential topological structure` of that manifold (a lower-dimensional curved surface that resembles and preserves the characteristics of data stored in a high-dimensional space).

UMAP has 4 major hyperparameters:
- n_neighbors
    - Controls how UMAP balances local vs global structure in the data, by constraining the size of the local neighbourhood UMAP will look at when attempting to learn the manifold structure of the data. **Low values focus on local structures, large values focus on the bigger picture.**
- min_dist
    - Controls how tightly UMAP is allowed to pack points together. **Low values result in clumpier embeddings, large values prevent stacking and result in the preservation of the broader topological structure.** 
- n_components
    - Dimensions of the final reduction.
- metric
    - Controls **how distance is computed**. 

In [None]:
def umap_plot(embedding_list:list, labels, model_names:list, title='UMAP Projection - CBOW vs Skip-Gram'):
    # Create subplots for visualization
    fig = sp.make_subplots(rows=1, cols=2, subplot_titles=model_names)

    for idx, (embeddings, name) in enumerate(zip(embedding_list, model_names), start=1):
        # Initialize UMAP object
        umap_2d = UMAP(n_components=2, n_neighbors=2, min_dist=0.3, init='random', random_state=42, n_jobs=1)
        # Reduce dimensionality
        proj = umap_2d.fit_transform(embeddings)

        # Create dataframe with coordinates for each data point and their corresponding words
        df = pd.DataFrame({
            'x': proj[:, 0],
            'y': proj[:, 1],
            'word': labels
        })

        # Create scatter plot with text annotations
        scatter = go.Scatter(
            x=df['x'], y=df['y'],               # Data
            mode='markers+text',                # Show points and labels
            text=df['word'],                    # Words to be annotated
            textposition='top center', textfont=dict(size=9),         # Position and font size of the labels (words)
            hoverinfo='text',                   # Hovering shows the word
            showlegend=False,
            marker=dict(size=6, color='blue')
        )
        # Append the scatter plot to the subplot
        fig.add_trace(scatter, row=1, col=idx)

    # Update the layout
    fig.update_layout(
        height=600,
        width=1400,
        title=dict(text=title, x=0.5, xanchor='center', font=dict(size=20)),
        showlegend=False
    )

    fig.show()

For the UMAP dimensionality reduction, we chose the following hyperparameters, as set in `umap_plot`:
- **n_neighbors:** 2
- **min_dist:** 0.3

We keep the neighbourhood small because we want to observe the local topological structure of the words.

In [72]:
umap_plot([skipgram_subset, cbow_subset], vocab_subset, ['Skip-Gram', 'CBOW'])

#### Skipgram
- "dutch" - "bill" (probably talking about an influential bill proposed in the Netherlands)
- "colonies" - "farm" (a little bit self-explanatory)
- "servitude" - "ramifications" (probably referring to the implications servitude has)
- "initiates" - "participating" - "parliamentary" (words that may be commonly found together)
- "commercial" - "prosperous" (probably talking about prosperous commerce)

#### CBOW
- "emancipation" - "buddhism" - "valued" (probably talking about details of the buddhist religion)
- "manipulation" - "archaic" - "parliamentary" (may refer to a critique of government systems)
- "office" - "welfare" (the relation may be built upon the uses of the office)
- "shares" - "profit" (self-explanatory)
- "feminist" - "writers" - "promotional" (may refer to the rise of feminism)

# <font color= #bbc28d> **Conclusions** </font>

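CBOW and Skip-Gram are very useful Word2Vec approaches for capturing word meaning and similarity in a way that a computer can work with. For this specific example of Wikipedia articles, CBOW seems to be the better approach for building our word embeddings. The final summed training loss was about 7 times lower with CBOW than with Skip-Gram (273.9 vs. 1973.5), although this figure should be read with care: both totals are sums over batches, and Skip-Gram iterates over roughly seven times as many training pairs per epoch, so most of the gap reflects the number of batches and not the quality of the model. In our testing, the groupings of words obtained from CBOW also made more sense, and Skip-Gram missed some fairly obvious ones like "shares"-"profit". 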

Even though Skip-Gram was less useful for this specific task, it could be more useful in others. For example, Skip-Gram might be better suited to a system that predicts the next word you are going to type, a useful feature found in many smartphones. This is due to Skip-Gram's fundamental nature (predicting surrounding words from a given word) as well as its capacity to work better with corpora whose vocabulary doesn't repeat much, which wasn't the case for our Wikipedia sample.

This project was also a great exploration of dimensionality reduction, particularly with UMAP. We had already used t-SNE before, but UMAP was a new tool for us, and it actually suited this task better than t-SNE.

# <font color= #bbc28d> **Bibliography** </font>

- Riva, M. (2025, February 13). _Word Embeddings: CBOW vs Skip-Gram_. Baeldung CS. https://www.baeldung.com/cs/word-embeddings-cbow-vs-skip-gram
- Van der Maaten, L., & Hinton, G. (2008). _Visualizing Data using t-SNE_. Journal of Machine Learning Research. https://jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf
- McInnes, L. (2018). _Basic UMAP Parameters_. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. https://umap-learn.readthedocs.io/en/latest/parameters.html
- Plotly. (n.d.). _t-SNE and UMAP projections in Python_. Plotly. https://plotly.com/python/t-sne-and-umap-projections