# Word Embeddings

While BoW and TF-IDF produce a single vector representation for an entire text document, word embeddings assign vector representations to individual words (or tokens; we say "words" here for simplicity).

These word vectors can then be fed sequentially into models such as RNNs or processed as a sequence by Transformers, allowing the model to take word order into account.
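To make the contrast concrete, the sketch below compares a bag-of-words count vector with a sequence of per-word vectors for two sentences that contain the same words in a different order. The vocabulary `toy_vocab`, the 2-dimensional vectors in `toy_emb`, and both helper functions are made up for illustration.

```python
from collections import Counter

# Hypothetical toy vocabulary and hand-picked 2-d word vectors (not trained)
toy_vocab = ["bites", "dog", "man"]
toy_emb = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [-1.0, 0.0]}

def bow_vector(sentence):
    """One count vector for the whole sentence; word order is lost."""
    counts = Counter(sentence.lower().split())
    return [counts[w] for w in toy_vocab]

def embedding_sequence(sentence):
    """One vector per word, in sentence order."""
    return [toy_emb[w] for w in sentence.lower().split()]

a, b = "dog bites man", "man bites dog"

print(bow_vector(a) == bow_vector(b))                  # True: same counts
print(embedding_sequence(a) == embedding_sequence(b))  # False: order differs
```

The bag-of-words vectors are identical, while the embedding sequences differ, which is what lets a sequence model distinguish the two sentences.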

## Static Word Embeddings (example: Word2Vec)

Each word is assigned a single, fixed vector representation. Words with similar meanings tend to have similar vectors. A major limitation is that a word with several meanings (a homonym such as "bank", which can be a riverside or a financial institution) receives the same vector in every context.
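The homonym limitation can be seen in a few lines. This is a minimal sketch with an invented table `static_emb` and a hypothetical helper `lookup`; the point is only that a static lookup never consults the rest of the sentence.

```python
import numpy as np

# Hypothetical static embedding table (values are made up)
static_emb = {
    "river": np.array([0.1, 0.9]),
    "money": np.array([0.9, 0.1]),
    "bank": np.array([0.5, 0.5]),
}

def lookup(sentence, word):
    """Static lookup: the sentence is ignored, only the word itself matters."""
    assert word in sentence.lower().split()
    return static_emb[word]

v_river = lookup("river bank", "bank")
v_money = lookup("money bank", "bank")
print(np.array_equal(v_river, v_money))  # True: same vector in both contexts
```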

Example:

In [29]:
import numpy as np

# Fake embedding dictionary
emb = {
    "i": np.array([1.0, 0.0]),
    "love": np.array([0.8, 0.6]),
    "hate": np.array([-0.8, 0.6]),
    "nlp": np.array([0.9, 0.1]),
    "machine": np.array([0.7, 0.7]),
    "learning": np.array([0.6, 0.8]),
    "exams": np.array([-0.9, 0.2])
}


In [33]:
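def s2emb(sentence, embeddings):
    """Map a sentence to an array of word vectors, one row per word."""
    # Lowercase and split on whitespace; an unknown word raises KeyError
    vectors = [embeddings[word] for word in sentence.lower().split()]
    return np.array(vectors)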

sentence = "I love NLP"
s2emb(sentence, emb)

array([[1. , 0. ],
       [0.8, 0.6],
       [0.9, 0.1]])

These vectors can then be passed, one row per word, to a sequence model such as an RNN.
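As a rough illustration of what feeding the vectors to an RNN means, the sketch below runs a hand-written Elman-style recurrence over the three word vectors from the output above. The weights are random and untrained, and the hidden size of 4 is an arbitrary choice; in practice a library layer such as `torch.nn.RNN` would replace this loop.

```python
import numpy as np

rng = np.random.default_rng(0)

# Word vectors for "I love NLP" (copied from the output above)
seq = np.array([[1.0, 0.0], [0.8, 0.6], [0.9, 0.1]])

hidden_size = 4  # arbitrary choice for this sketch
W_x = rng.normal(size=(2, hidden_size))            # input-to-hidden weights
W_h = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden weights
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)  # initial hidden state
for x_t in seq:            # one step per word, in sentence order
    h = np.tanh(x_t @ W_x + h @ W_h + b)

print(h.shape)  # (4,): one vector summarizing the whole sequence
```

Because each step depends on the previous hidden state, changing the order of the rows in `seq` generally changes the final `h`.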

#### Example of constructing Word2Vec:

In [32]:
sentences = [
    "i love nlp",
    "i love machine learning",
    "nlp loves machine learning",
    "i hate exams"
]


In [34]:
import torch
import torch.nn as nn
import torch.optim as optim

# Tokenize and map to indices

tokenized = [s.split() for s in sentences]

vocab = sorted(set(word for sent in tokenized for word in sent))
word2idx = {w: i for i, w in enumerate(vocab)}
idx2word = {i: w for w, i in word2idx.items()}

print(word2idx)

vocab_size = len(vocab)


{'exams': 0, 'hate': 1, 'i': 2, 'learning': 3, 'love': 4, 'loves': 5, 'machine': 6, 'nlp': 7}


In [35]:
# Generate center-context pairs with window size 1

pairs = []

for sent in tokenized:
    for i, center in enumerate(sent):
        center_idx = word2idx[center]
        for j in [i - 1, i + 1]:
            if 0 <= j < len(sent):
                context_idx = word2idx[sent[j]]
                pairs.append((center_idx, context_idx))

pairs[:10]


[(2, 4),
 (4, 2),
 (4, 7),
 (7, 4),
 (2, 4),
 (4, 2),
 (4, 6),
 (6, 4),
 (6, 3),
 (3, 6)]
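The loop above hard-codes a window of one word on each side. A possible generalization to an arbitrary symmetric window is sketched below; `make_pairs` and its `window` parameter are names introduced here, and the toy corpus is rebuilt so the snippet runs on its own.

```python
def make_pairs(tokenized, word2idx, window=1):
    """Return (center_idx, context_idx) pairs for a symmetric context window."""
    pairs = []
    for sent in tokenized:
        for i, center in enumerate(sent):
            # Positions i-window .. i+window, skipping the center itself
            for j in range(i - window, i + window + 1):
                if j != i and 0 <= j < len(sent):
                    pairs.append((word2idx[center], word2idx[sent[j]]))
    return pairs

# Same toy corpus as above
sentences = ["i love nlp", "i love machine learning",
             "nlp loves machine learning", "i hate exams"]
tokenized = [s.split() for s in sentences]
vocab = sorted({w for sent in tokenized for w in sent})
word2idx = {w: i for i, w in enumerate(vocab)}

print(len(make_pairs(tokenized, word2idx, window=1)))  # 20
print(len(make_pairs(tokenized, word2idx, window=2)))  # 32
```

With `window=1` the function produces the same pairs, in the same order, as the cell above. A larger window yields more training pairs but treats more distant words as context.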

In [36]:
class Word2Vec(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.output = nn.Linear(embedding_dim, vocab_size)

    def forward(self, x):
        # Look up the word vectors; these are what we keep after training
        x = self.embeddings(x)
        # Logits (unnormalized scores) over the vocabulary; a softmax turns them
        # into the probabilities that each word appears next to the center word
        x = self.output(x)
        return x
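The model returns raw logits rather than probabilities; `nn.CrossEntropyLoss` applies the softmax internally during training. To read logits as probabilities, a softmax has to be applied explicitly, as in this small sketch (the logit values are invented and the `softmax` helper is written out by hand):

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, -1.0]  # made-up scores for a 3-word vocabulary
probs = softmax(logits)
print(probs)
print(sum(probs))
```

The ordering of the logits is preserved: the word with the largest logit gets the largest probability.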


In [46]:
embedding_dim = 5
model = Word2Vec(vocab_size, embedding_dim)

loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.005)

for epoch in range(1001):
    total_loss = 0
    for center, context in pairs:
        center = torch.tensor([center])
        context = torch.tensor([context])

        optimizer.zero_grad()
        logits = model(center)
        loss = loss_fn(logits, context)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    if epoch % 100 == 0:
        print(f"Epoch {epoch}, loss {total_loss:.4f}")


Epoch 0, loss 46.3413
Epoch 100, loss 15.1339
Epoch 200, loss 14.8288
Epoch 300, loss 14.7606
Epoch 400, loss 14.7320
Epoch 500, loss 14.7163
Epoch 600, loss 14.7062
Epoch 700, loss 14.6987
Epoch 800, loss 14.6927
Epoch 900, loss 14.6882
Epoch 1000, loss 14.6846
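The loss flattens out near 14.7 instead of approaching zero because the same center word appears with several different context words, so no model can predict every pair with certainty. A lower bound on the summed loss is reached when the model predicts exactly the empirical frequency of each context word given the center word. The sketch below computes that bound for the window-1 pairs of the toy corpus (rebuilt here so the snippet is self-contained; `loss_floor` is a name introduced for this example).

```python
import math
from collections import Counter, defaultdict

sentences = ["i love nlp", "i love machine learning",
             "nlp loves machine learning", "i hate exams"]
tokenized = [s.split() for s in sentences]

# Window-1 (center, context) pairs, using the words themselves instead of indices
pairs = [(sent[i], sent[j])
         for sent in tokenized
         for i in range(len(sent))
         for j in (i - 1, i + 1)
         if 0 <= j < len(sent)]

# Count how often each context word appears next to each center word
context_counts = defaultdict(Counter)
for center, context in pairs:
    context_counts[center][context] += 1

# Best case: the model outputs the empirical frequency p(context | center)
loss_floor = 0.0
for center, counts in context_counts.items():
    n = sum(counts.values())
    for context, c in counts.items():
        loss_floor += c * -math.log(c / n)

print(f"{loss_floor:.4f}")  # 14.3862
```

The training loss of about 14.68 sits slightly above this floor, which suggests that further epochs would bring only marginal improvement.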


In [47]:
embeddings = model.embeddings.weight.detach()
print(embeddings)

def similarity(w1, w2):
    v1 = embeddings[word2idx[w1]]
    v2 = embeddings[word2idx[w2]]
    return torch.cosine_similarity(v1, v2, dim=0).item()

print("love vs hate:", similarity("love", "hate"))
print("nlp vs machine:", similarity("nlp", "machine"))
print("machine vs learning:", similarity("machine", "learning"))


tensor([[ 4.0135,  5.5560,  5.5990,  1.2117, -4.2410],
        [-2.8657, -3.0508, -0.8215, -0.6204,  1.7253],
        [-1.6408,  3.1619,  0.9434,  1.6754, -0.0948],
        [ 2.5798,  3.2396,  2.3454, -7.0850,  2.2606],
        [-0.1747, -1.2600,  3.4400, -0.2828,  1.6219],
        [ 0.9783,  1.0156,  2.6860, -0.1128,  4.1431],
        [ 0.5452,  0.2335, -2.6784,  1.4007, -1.9947],
        [-3.3606,  0.9111, -1.7885, -1.1207, -1.9609]])
love vs hate: 0.24068935215473175
nlp vs machine: 0.3324061632156372
machine vs learning: -0.5725799202919006
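Cosine similarity is the dot product of two vectors divided by the product of their lengths. As a check, the NumPy sketch below recomputes the three values from the rows printed above. The rows are in `word2idx` order and are copied at the printed four-decimal precision, so the results agree with the output above only to about three or four decimals.

```python
import numpy as np

# Rows copied from the printed embedding matrix above
vecs = {
    "love": np.array([-0.1747, -1.2600, 3.4400, -0.2828, 1.6219]),    # row 4
    "hate": np.array([-2.8657, -3.0508, -0.8215, -0.6204, 1.7253]),   # row 1
    "nlp": np.array([-3.3606, 0.9111, -1.7885, -1.1207, -1.9609]),    # row 7
    "machine": np.array([0.5452, 0.2335, -2.6784, 1.4007, -1.9947]),  # row 6
    "learning": np.array([2.5798, 3.2396, 2.3454, -7.0850, 2.2606]),  # row 3
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

for a, b in [("love", "hate"), ("nlp", "machine"), ("machine", "learning")]:
    print(f"{a} vs {b}: {cosine(vecs[a], vecs[b]):.3f}")
```

Note that this kind of model tends to give similar vectors to words that appear in similar contexts, not to words that appear next to each other, so a low score for "machine" and "learning" is not necessarily a sign of failed training. With only four sentences, the exact values also depend heavily on the random initialization.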


## Contextual Word Embeddings (BERT, GPT)

Unlike static word embeddings, contextual word embeddings assign a vector to each word based on its **surrounding context**. This means the same word can have different vector representations depending on how it is used in a sentence.

These embeddings are produced by Transformer-based models such as BERT and GPT, which process entire sequences at once and use attention mechanisms to model relationships between words. As a result, the embeddings capture both word meaning and context, including syntax and semantics, so the different senses of a homonym receive different vectors.
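The core idea can be illustrated without a pretrained model. The sketch below applies a single, untrained self-attention step (no learned projections, one head) to made-up static vectors, so that the output for "bank" becomes a mixture of the vectors around it. Real models such as BERT and GPT use learned projections, many heads, and many layers; `static_emb` and `contextualize` are names introduced here purely for illustration.

```python
import numpy as np

# Hypothetical static vectors; "bank" starts out identical in both sentences
static_emb = {
    "river": np.array([0.0, 1.0]),
    "money": np.array([1.0, 0.0]),
    "bank": np.array([0.5, 0.5]),
}

def contextualize(words):
    """One untrained self-attention step: each output is a weighted mix of all inputs."""
    X = np.array([static_emb[w] for w in words])   # shape (seq_len, dim)
    scores = X @ X.T / np.sqrt(X.shape[1])         # pairwise dot-product scores
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ X

bank_river = contextualize(["river", "bank"])[1]
bank_money = contextualize(["money", "bank"])[1]
print(bank_river, bank_money)
print(np.allclose(bank_river, bank_money))  # False: the context changed the vector
```

The same input vector for "bank" ends up closer to "river" in one sentence and closer to "money" in the other, which is the behavior a static lookup table cannot provide.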